Web Programming, Linux System Administration, and Entrepreneurship in Athens, Georgia

Category: Linux System Administration (Page 1 of 11)

AWS CodeDeploy Troubleshooting

CodeDeploy with Auto Scaling Groups is a bit of a complex mess to get working correctly, especially with an app that has been working and now needs to be updated for more modern functionality.

Update the startup scripts with the latest versions from https://github.com/aws-samples/aws-codedeploy-samples/tree/master/load-balancing/elb-v2

I found that even the latest scripts there still did not work. My instances were starting up and then dying shortly afterward. CodeDeploy was failing with this error:


LifecycleEvent - ApplicationStart
Script - /deploy/scripts/4_application_start.sh
Script - /deploy/scripts/register_with_elb.sh
[stderr]Running AWS CLI with region:
[stderr][FATAL] Unable to get this instance's ID; cannot continue.

Upon troubleshooting, I found that common_functions.sh has a get_instance_id() function that runs this curl command to get the instance ID:


curl -s http://169.254.169.254/latest/meta-data/instance-id

Running that command by itself while an instance was still running returned nothing, which is why the script was failing.

It turns out that newer instances use IMDSv2 by default, and it is required (no longer optional). With that configuration, the plain curl command above fails. To fix this, I replaced the get_instance_id() function with this version:

# Usage: get_instance_id
#
#   Writes to STDOUT the EC2 instance ID for the local instance. Returns non-zero if the local
#   instance metadata URL is inaccessible.

get_instance_id() {
    TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600" -s -f)
    if [ $? -ne 0 ] || [ -z "$TOKEN" ]; then
        echo "[FATAL] Failed to obtain IMDSv2 token; cannot continue." >&2
        return 1
    fi

    INSTANCE_ID=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id -s -f)
    if [ $? -ne 0 ] || [ -z "$INSTANCE_ID" ]; then
        echo "[FATAL] Unable to get this instance's ID; cannot continue." >&2
        return 1
    fi

    echo "$INSTANCE_ID"
    return 0
}

This version uses the IMDSv2 API to get a token, then uses that token to get the instance-id.

With that code replaced, the application successfully registered with the Target Group and the Auto Scaling group works correctly.

Alternatively (and for troubleshooting), I was able to make IMDSv2 Optional using the AWS Console, and via CloudFormation with this part of the Launch Template:

Resources:
  MyLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: my-launch-template
      LaunchTemplateData:
        ImageId: ami-1234567890abcdef0
        InstanceType: t4g.micro
        MetadataOptions:
          HttpTokens: optional
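
For an instance that is already running, the same change can be made with the AWS CLI. This is a sketch with a placeholder instance ID; adjust the region and ID for your environment:

aws ec2 modify-instance-metadata-options \
    --instance-id i-1234567890abcdef0 \
    --http-tokens optional \
    --http-endpoint enabled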

SSH Key Best Practices for 2025 – Using ed25519, key rotation, and other best practices

Apparently Google thinks I’m an expert at SSH Keys, so here is an update to my post from two years ago, with a few refinements.

You can tell quite a bit about other IT professionals from their public SSH key! I often work with others and ask for their key when granting access to a machine I control. It’s a negative sign when they ask how to create one. If they provide one in the PuTTYgen format, I know they’ve been asked for their key exactly once. A 2048-bit or smaller RSA key means they haven’t generated one in a long time. If they send me an ed25519 key with a comment other than their machine name, I feel confident that they know what they are doing.

For reference, a 4096-bit RSA key will be in this format:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDowuIZFbN2EWbVwK9TG+O0S85yqr7EYc8Odv76H8+K6I7prrdS23c3rIYsKl2mU0PjOKGyyRET0g/BpnU8WZtDGH0lKuRaNUT5tpKvZ1iKgshdYlS5dy25RxpiVC3LrspjmKDY/NkkflKQba2WAF3a5M4AaHxmnOMydk+edBboZhklIUPqUginLglw7CRg/ck99M9kFWPn5PiITIrpSy2y2+dt9xh6eNKI6Ax8GQ4GPHTziGrxFrPWRkyLKtYlYZr6G259E0EsDPtccO5nXR431zLSR7se0svamjhskwWhfhCEAjqEjNUyIXpT76pBX/c7zsVTBc7aY4B1onrtFIfURdJ9jduYwn/qEJem9pETli+Vwu8xOiHv0ekXWiKO9FcON6U7aYPeiTUEkSDjNTQPUEHVxpa7ilwLZa+2hLiTIFYHkgALcrWv/clNszmgifdfJ06c7pOGeEN69S08RKZR+EkiLuV+dH4chU5LWbrAj/1eiRWzHc2HGv92hvS9s/c= someuser@brandonsLaptop

And for comparison, an ed25519 key looks like this:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBLEURucCueNvq4hPRklEMHdt5tj/bSbirlC0BkXrPDI someuser@ip-172-31-74-201

You’ll notice in both of these, the first characters contain the key type. The middle section with all of the random characters contain the base-64 encoded public key. And at the end is a comment that is intended to identify the user to whom it belongs.

The ed25519 key is much shorter than an RSA key, so if you’ve never seen one before, you might think it is less secure. But this key type is newer and uses a totally different algorithm. Although the 256-bit ed25519 key has fewer characters, it is, for all practical purposes, as secure as the 4096-bit RSA key above. Elliptic-curve keys provide much more security per bit than RSA keys, so they require far fewer bits for a similar level of security.

The ed25519 algorithm is based on a specific elliptic curve (an Edwards curve) instead of the large prime numbers that RSA relies on. It has been in wide use for roughly 10 years, is supported by all modern software, and as such is the current standard for most professional users. Creating a key is simple with the ssh-keygen command. But before jumping to the actual command, I wanted to explain a few other tips that I use and think others should adopt as well.

Keys should be created by individuals, not issued to groups

You should never share your private key with anybody. Ever. If a key is ever shared, you have to assume that the other party can impersonate you on any system in which it is used.

I’ve been a part of some teams which create a new server, create a new key to access that server, and share the new key with everybody who needs to access the machine. I think this practice stems from AWS and other providers who create an SSH key for you along with a new machine, and users just continuing the practice. I wish they’d change that.

That’s the backwards way of thinking about it. Individuals should own their own keys. They should be private. And you can add multiple public keys to resources where multiple people need access. Again, I wish AWS and others would allow this more easily instead of allowing only a single key. You then revoke access by removing the public key, instead of having to re-issue a new key whenever the group changes. (Or worse, not changing the key at all!)
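
As a sketch, a shared server’s ~/.ssh/authorized_keys file simply lists one public key per line, one per person (the keys and comments below are made-up placeholders). Removing a line revokes that person’s access:

# ~/.ssh/authorized_keys on the shared server
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIExampleExampleExampleExampleExampleExam BrandonChecketts-2025
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAnotherExampleExampleExampleExampleExam coworker+2024@example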

Rotating your SSH keys

You should rotate your SSH keys regularly. The thought process here is that if you have used the same key for a long time, and then your laptop with your private key gets lost, or your key compromised, every machine that you’ve been granted access to over that time is potentially at risk, because administrators are notoriously bad about revoking access. By changing out your key regularly, you limit the potential access in the case of a compromised key. Generating a new SSH key also ensures that you are using more modern algorithms and key sizes.

I like to create a new SSH key about every two years. To remind myself to do this, I embed the year I created the key within its name. My last key was created in March 2023, which I have named [email protected]. I’m creating a new key now, at the beginning of 2025, which I’ll name with the current year. Each time I use it, I’m reminded when I created the key, and if it gets to be around two years old and I have some free time, I’ll create a new key. Of course I keep all of my older keys in case I need access to something I haven’t accessed for a while. My ssh-agent usually has my two most recent keys loaded. If I do need to use an older one, it is enough of a hassle to find and use it that the first thing I’ll do is update my key as soon as I get into a system where the old key was needed.

Don’t use the default ssh-keygen comment

I also suggest that you make the SSH key comment something meaningful. If you don’t provide a comment, most ssh-keygen implementations default to your_username@your_machine_name, which just might be silly or meaningless. In a professional setting, it should clearly identify you. For example, BrandonChecketts as a comment is better than me00101@billys2017_macbook_air. It should be meaningful both to you and to whomever you are sharing it with.

I mentioned including the creation month above. I like to include it in the comment because when sharing the public key, it subtly demonstrates that I am security conscious, have rotated the key recently, and know what I’m doing. The comment at the end of the key can be changed without affecting its functionality, so I might change the comment depending on who I’m sharing it with. When I receive a public key from somebody else that contains a generic comment, I often change the comment to include their name or email address so I can later remember to whom it belongs.

Always use a passphrase

Your SSH key is just a tiny file on disk. If your machine is ever lost, stolen, or compromised in any way by an attacker, the file is pretty easy for them to copy. Without it being encrypted with a pass phrase, it is directly usable. And if someone has access to your SSH private key, they probably have access to your bash or terminal history and would know where to use it.

As such, it is important to protect your SSH private key with a decent pass phrase. To avoid typing your pass phrase over and over, use the SSH-Agent, which will remember it for your session.
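
For example, on most Linux or macOS systems this is a typical ssh-agent workflow (the key filename is a placeholder):

eval "$(ssh-agent -s)"              # start an agent for this session if one isn't already running
ssh-add ~/.ssh/your-key-filename    # prompts for the passphrase once, then caches the key
ssh-add -l                          # list the keys currently loaded in the agent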

Understand and use SSH-Agent Forwarding when applicable

SSH Agent Forwarding allows you to SSH into one machine and then transparently “forward” your SSH agent to that machine, so it can be used to authenticate to a machine beyond it. I most often use this when authenticating to GitHub from a remote machine. Using agent forwarding means that I don’t have to copy my SSH private key onto the remote machine in order to authenticate to GitHub from there.

You shouldn’t, however, just blindly use SSH Agent Forwarding everywhere. If you access a compromised machine where an attacker may have access to your account or to the root account, you should NOT use agent forwarding, since it is possible for them to use your forwarded agent to authenticate as you. I’ve never seen this exploited, but since it is possible, you should only use SSH Agent Forwarding to systems which you trust.
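
Agent forwarding can be enabled per connection with ssh -A, or per host in ~/.ssh/config so it is only used for hosts you trust (the hostname here is a placeholder):

# ~/.ssh/config
Host trusted-bastion.example.com
    ForwardAgent yes

# Everything else gets no forwarding
Host *
    ForwardAgent no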

The ssh-keygen Command

With all of the above context, this is the command you should use to create your ed25519 key:

ssh-keygen -t ed25519 -f ~/.ssh/your-key-filename -C "your-key-comment"

That will ask you for a pass phrase and then show you a randomart image that represents your public key when it is created. The randomart is just a visual representation of your key so that you can see it is different from others.

 $ ssh-keygen -t ed25519 -f ~/.ssh/[email protected] -C "[email protected]"
Generating public/private ed25519 key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in ~/.ssh/[email protected]
Your public key has been saved in ~/.ssh/[email protected]
The key fingerprint is:
SHA256:HiCF8gbV6DpBTC2rq2IMudwBc5+QuB9NqeGtc3pmqEY brandon+2025@roundsphere
The key's randomart image is:
+--[ED25519 256]--+
| o.o.+.          |
|  * +..          |
| o O...          |
|+ A *. .         |
|.B % .  S        |
|=E* =  . .       |
|=+o=    .        |
|+==.=            |
|B..B             |
+----[SHA256]-----+

Obsessive/Compulsive Tip

This may be taking it too far, but I like to have a memorable few digits at the end of the key so that I can confirm the key got copied correctly. One of my keys ends in 7srus, so I think of it as my “7’s ‘R’ Us” key. You can do that over and over again until you find a key that you like with this one-liner:

rm newkey; rm newkey.pub; ssh-keygen -t ed25519 -f ./newkey -C "[email protected]" -N ''; cat newkey.pub;

That creates a key without a passphrase, so you can do it over and over quickly until you find a public key that you “like”. Then protect it with a passphrase with the command

ssh-keygen -p -f newkey

And obviously, you then rename newkey and newkey.pub to a more meaningful name.

Replacing your public key when you use it

As you access machines, make sure to add your new key to, and remove old keys from, your ~/.ssh/authorized_keys file. At some point, remove your previous key from your ssh-agent; after that, any machine that still only has the old key will force you to dig it out one last time, use it to get in, and replace it with the new key.
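
One way to do this (a sketch; the hostname and filenames are placeholders) is to push the new public key with ssh-copy-id and then prune the old entries:

ssh-copy-id -i ~/.ssh/newkey.pub user@some-server.example.com   # appends the new public key
ssh user@some-server.example.com
vi ~/.ssh/authorized_keys                                       # delete the lines for old keys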

Is that complete? What other tips should others know about when creating an SSH Key in 2025 and beyond?

Adding ed25519 SSH Host Keys via cloud-init

SSH Host Keys are the public/private keys that identify a server when connecting to it via SSH. Most people don’t understand very well how these work, and just quickly click or type ‘yes’ to approve the key fingerprint when they connect via SSH to a server.

The first time you connect to a server, you will see something like this:

The authenticity of host '[myremoteserver.com]:22 ([12.34.56.78]:22)' can't be established.
ED25519 key fingerprint is SHA256:Vqfv339yJU/zRADJ4SlgF8DcZ0d7Cy1zWX69C33d3e4.
This key is not known by any other names.
Are you sure you want to continue connecting (yes/no/[fingerprint])?

This means that it is the first time your computer has connected to that SSH server. It is asking whether the key fingerprint is what you expected. Since we don’t tend to communicate key fingerprints in advance, we usually trust that it is correct and just type ‘yes’.

But this is an important part of the Authentication process. There are a number of possible ways that the remote server may NOT be the server you intend. You could have simply typed the hostname wrong. More nefarious examples might include DNS hijacking or rerouting of your traffic.

When you answer ‘yes’ to that question, the host key fingerprint is saved to a file on your machine in ~/.ssh/known_hosts. If you connect to the same host again, it won’t ask that question again, since you’ve already approved it.

Note that SSH Host Keys (sometimes called SSH Instance Keys) are in the same format, but have a different purpose than SSH User Keys with which most people are familiar. The Host Keys are intended to identify the MACHINE, while your user key is meant to identify YOU.

The SSH Host Key is usually created when an instance is turned on for the first time. When the SSH server starts, if it doesn’t find existing host keys, it creates them using a pseudo-random number generator. It kind of just magically happens without anyone having to think about it.

I happen to connect to a lot of servers that are turned on by AWS Auto Scaling Groups. Whenever a new server is launched, that instance creates new SSH Host Keys. If a server has been recreated since I last connected to it, I get this nasty error message:

user@my-machine ~ % ssh ubuntu@myremotemachine
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:Vqfv339yJU/zRADJ4SlgF8DcZ0d7Cy1zWX69C33d3e4.
Please contact your system administrator.
Add correct host key in /Users/myusername/.ssh/known_hosts to get rid of this message.
Offending ED25519 key in /Users/myusername/.ssh/known_hosts:16
Host key for [myremotemachine]:22 has changed and you have requested strict checking.
Host key verification failed.

This error message explains that the SSH Host Key of the machine to which I’ve attempted to connect doesn’t match what it used to be. This could be due to a man-in-the-middle attack, or it could be that the host key legitimately changed, as happens when my Auto Scaling group creates a new instance.

You can “fix” this error by editing your ~/.ssh/known_hosts file and removing the offending line that is mentioned. In this example, it is line 16.
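
Rather than editing the file by hand, the stale entry can also be removed with ssh-keygen (the hostname is a placeholder):

ssh-keygen -R myremotemachine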

I’ve recently gotten tired of fixing my known_hosts file and have started changing my Auto-Scaling groups so that they use the same Host Key each time that the instance starts. That means I don’t get the error message, and it saves me ~10 seconds (and doesn’t break my train-of-thought) when connecting to an instance that has been replaced.

This is an example of what I enter into the UserData section of my CloudFormation template, inside the LaunchTemplate section. It specifies two pre-generated SSH host keys so that each time an instance launches, it will have the same host key.

In order to generate these, I usually just launch an instance the first time without this configuration, then grab these four files from it:

  • /etc/ssh/ssh_host_ecdsa_key
  • /etc/ssh/ssh_host_ecdsa_key.pub
  • /etc/ssh/ssh_host_ed25519_key
  • /etc/ssh/ssh_host_ed25519_key.pub

You could also create these files in advance using ssh-keygen.
My example below uses the newer ecdsa and ed25519 keys, and avoids using the older rsa and dsa keys. This should work fine for most modern distributions and SSH Clients.
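
If you would rather generate them in advance instead of grabbing them from a booted instance, something like this should work (a sketch; run anywhere with empty passphrases and comments, then paste the results into the template):

ssh-keygen -t ecdsa -b 256 -f ./ssh_host_ecdsa_key -N '' -C ''
ssh-keygen -t ed25519 -f ./ssh_host_ed25519_key -N '' -C ''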

UserData: !Base64 |
  #cloud-config
  write_files:
    - path: /etc/motd
      owner: root:root
      permissions: '0644'
      content: |
        You are connected to my-hostname

  ssh_keys:
    ecdsa_private: |
      -----BEGIN OPENSSH PRIVATE KEY-----
      put-your-private
      key-contents
      here
      -----END OPENSSH PRIVATE KEY-----
    ecdsa_public: ecdsa-sha2-nistp256 AAAAyour-public-key-contents-here ecdsa-my-hostname

    ed25519_private: |
      -----BEGIN OPENSSH PRIVATE KEY-----
      put-your-private
      key-contents
      here
      -----END OPENSSH PRIVATE KEY-----
    ed25519_public: ssh-ed25519 AAAAyour-public-key-contents-here ed25519-my-hostname

One downside is that the host keys are now stored in my CloudFormation template, so I need to keep that template secure. Anybody who has access to these keys could impersonate the server on which they are used.

It’s 2023. You Should Be Using an Ed25519 SSH Key (And Other Current Best Practices)

UPDATE: I’ve got an updated post containing SSH Key Best Practices for 2025


Original content from Sept 2023:

I often have to ask other IT professionals for the public SSH key for access to a server or for other tasks. I really cringe when they ask me what that is or how to create one. I kind of cringe when they give me one from PuTTYgen in its native format. I feel a little better when they provide a 4096-bit RSA key without needing an explanation. When somebody provides an Ed25519 key, I feel like I’m working with somebody who knows what they are doing.

A 4096-bit RSA key looks like this:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDowuIZFbN2EWbVwK9TG+O0S85yqr7EYc8Odv76H8+K6I7prrdS23c3rIYsKl2mU0PjOKGyyRET0g/BpnU8WZtDGH0lKuRaNUT5tpKvZ1iKgshdYlS5dy25RxpiVC3LrspjmKDY/NkkflKQba2WAF3a5M4AaHxmnOMydk+edBboZhklIUPqUginLglw7CRg/ck99M9kFWPn5PiITIrpSy2y2+dt9xh6eNKI6Ax8GQ4GPHTziGrxFrPWRkyLKtYlYZr6G259E0EsDPtccO5nXR431zLSR7se0svamjhskwWhfhCEAjqEjNUyIXpT76pBX/c7zsVTBc7aY4B1onrtFIfURdJ9jduYwn/qEJem9pETli+Vwu8xOiHv0ekXWiKO9FcON6U7aYPeiTUEkSDjNTQPUEHVxpa7ilwLZa+2hLiTIFYHkgALcrWv/clNszmgifdfJ06c7pOGeEN69S08RKZR+EkiLuV+dH4chU5LWbrAj/1eiRWzHc2HGv92hvS9s/c= someuser@brandonsLaptop

And for comparison, an Ed25519 Key looks like this:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBLEURucCueNvq4hPRklEMHdt5tj/bSbirlC0BkXrPDI someuser@ip-172-31-74-201

The Ed25519 key is much shorter, so initially you might think it is less secure. But these keys use a totally different algorithm, so although the key has fewer characters, it is, for all practical purposes, as secure as the RSA key above. You can ask your favorite search engine or AI for more details about the differences.

The Ed25519 algorithm has been around for ~10 years now. It is widely supported by any modern software, and as such is the current standard for most professional users. Creating a key is simple with the ssh-keygen command. But before jumping to the actual command, I wanted to also explain a couple other tips that I use, and think others should pick up as well.

Keys should be issued to individuals, not groups

You should never, ever share your private key with anybody. Ever. If a key is ever shared, you have to assume that the other party can impersonate you on any system in which it is used.

I’ve seen some organizations create a new machine and a new SSH key for it, then share that key with all of the individuals who need to access the machine. Perhaps this practice comes from AWS or other hosting providers who create an SSH key for you along with a new machine, and the user not knowing any better.

Although it kindof works, that’s the backwards way of doing it. Individuals should own their own keys. They should be private. And you can add multiple public keys to resources where multiple people need access. You then revoke access by removing the public key, instead of having to re-issue a new key whenever the group changes. (Or worse, not changing the key at all!)

Rotating your keys

You should rotate your SSH keys on some kind of schedule. The main risk you are trying to avoid here is that if you have used the same key for 20 years, and then your laptop with your private key gets lost, or your key compromised, every machine that you’ve been granted access to over that time is potentially at risk, because administrators are notoriously bad about revoking access. By changing out your key regularly, you limit the potential access in the case of a compromised key. Generating a new SSH key also ensures that you are using more modern algorithms and key sizes.

I like to start a new key about every year. To remind myself to do this, I embed the year I created the key within its name. So I last created a key in March 2023, which I have named brandon+2022-03@roundsphere. When it gets to be 2024, I’ll be subtly reminded each time I use it that it’s time to create a new key. I keep all of my older keys in case I need them, but they aren’t in memory or in my SSH agent. If I do need to use one, it is enough of a process to find the old one that the first thing I’ll do is update my key as soon as I get into a system where an old key was needed.

Don’t use the default comment

Make the comment meaningful. If you don’t provide a comment, it defaults to your_username@your_machine_name, which just might be silly or meaningless. In a professional setting, it should clearly identify you. For example, BrandonChecketts as a comment is better than me00101@billys2017_macbook_air. It should be meaningful both to you and to whomever you are sharing it with.

I mentioned including the creation month above, which I like because when sharing it, it subtly demonstrates that I am at least somewhat security conscious and I know what I’m doing. The comment at the end of the key isn’t necessary for the key to work correctly, so you can change it when sharing it. I often change the comment to be more meaningful if someone provides me with a key that doesn’t clearly indicate its owner.

Always use a passphrase

Your SSH key is just a tiny file on disk. If your machine is ever lost, stolen, or compromised in any way by an attacker, the file is pretty easy for them to copy. Without it being encrypted with a pass phrase, it is directly usable. And if someone has access to your SSH private key, they probably have access to your history and would know where to use it.

As such, it is important to protect your SSH private key with a decent pass phrase. Note that you can use SSH-Agent so you don’t need to type the passphrase every time you need to use the key.

The Command

This is the command you should use to create your ED25519 Key:

ssh-keygen -t ed25519 -f ~/.ssh/your-key-filename -C "your-key-comment"

That will ask you for a pass phrase and then show you a cool randomart image that represents your public key when it is created

 $ ssh-keygen -t ed25519 -f ./deleteme -C "brandon+2023-09@roundsphere"
Generating public/private ed25519 key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in ./deleteme
Your public key has been saved in ./deleteme.pub
The key fingerprint is:
SHA256:HiCF8gbV6DpBTC2rq2IMudwAc5+QuB9NqeGtc3pmqEY brandon+2023-09@roundsphere
The key's randomart image is:
+--[ED25519 256]--+
| o.o.+.          |
|  * +..          |
| o O...          |
|+ B *. .         |
|.B % .  S        |
|=E* =  . .       |
|=+o=    .        |
|+==.=            |
|B..B             |
+----[SHA256]-----+

Obsessive/Compulsive Tip

I have maybe spent 10 minutes creating a key over and over until I found one that ended in a few characters that I liked. One of my keys ends in 7srus, so I think of it as my “7’s ‘R’ Us” key. You can do that over and over again until you find a key that you like with this one-liner:

rm newkey; rm newkey.pub; ssh-keygen -t ed25519 -f ./newkey -C "[email protected]" -N ''; cat newkey.pub;

That creates a key without a passphrase, so you can do it over and over quickly until you find a public key that you “like”. Then protect it with a passphrase with the command

ssh-keygen -p -f newkey

And obviously, then you rename it from newkey to something more meaningful.

What else? Any other tips for creating an SSH key and looking like a professional in 2023?

MySQL 8.0.34 Upgrade and tons of MY-013360 ‘mysql_native_password’ is deprecated warnings

After upgrading a busy server to MySQL 8.0.34, I noticed that my error log was filling up with tons of these warnings. Hundreds of them per second adds noticeable cost when they are shipped to CloudWatch Logs. It looks like the deprecation notice started in MySQL 8.0.34.

2023-08-18T22:01:12.183036Z 19100582 [Warning] [MY-013360] [Server] Plugin mysql_native_password reported: ''mysql_native_password' is deprecated and will be removed in a future release. Please use caching_sha2_password instead'

I could see that all of my active users were using the mysql_native_password plugin with this query:

mysql> select user, host, plugin from mysql.user;
+------------------+-------------+-----------------------+
| user             | host        | plugin                |
+------------------+-------------+-----------------------+
| user1            | %           | mysql_native_password |
| user2            | %           | mysql_native_password |
| user3            | %           | mysql_native_password |
| mysql.infoschema | localhost   | caching_sha2_password |
| mysql.session    | localhost   | caching_sha2_password |
| mysql.sys        | localhost   | caching_sha2_password |
| rdsadmin         | localhost   | mysql_native_password |
+------------------+-------------+-----------------------+
7 rows in set (0.01 sec)

Some googling pointed me to a somewhat related Stack Overflow article, where I figured out how to change the authentication plugin for each user with this command:

ALTER USER user2@'%' IDENTIFIED WITH caching_sha2_password BY 'the_password';

After updating each account, they look correct in the mysql user table:

mysql> select user, host, plugin from mysql.user;
+------------------+-------------+-----------------------+
| user             | host        | plugin                |
+------------------+-------------+-----------------------+
| user1            | %           | caching_sha2_password |
| user2            | %           | caching_sha2_password |
| user3            | %           | caching_sha2_password |
| mysql.infoschema | localhost   | caching_sha2_password |
| mysql.session    | localhost   | caching_sha2_password |
| mysql.sys        | localhost   | caching_sha2_password |
| rdsadmin         | localhost   | mysql_native_password |
+------------------+-------------+-----------------------+
7 rows in set (0.00 sec)

But the error continued at the same volume, so even though the Database user accounts seem to be configured correctly, the MySQL client library that I’m using must still be falling back to mysql_native_password. This application is using PHP 7.4.3, so it’s not too old, and some references indicate that support for caching_sha2_password was released in PHP 7.2, so that shouldn’t be the problem.

I see that the default_authentication_plugin variable is set to mysql_native_password, but this database instance is hosted on RDS, and that configuration value is not modifiable.

I see that the MySQL log_error_suppression_list is also available and could be configured to suppress only the MY-013360 error. Unfortunately, this value is not configurable using MySQL8 Parameter groups.

In the meantime, I’m spending several dollars per day in CloudWatch Logs for this, so to quiet it down, I disabled deprecation notices from being logged by setting the global log_error_verbosity value to 1 (instead of the default of 2).
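
On RDS, the supported way to change this is through the DB parameter group. This AWS CLI sketch assumes a custom parameter group named my-mysql8-params is already attached to the instance:

aws rds modify-db-parameter-group \
    --db-parameter-group-name my-mysql8-params \
    --parameters "ParameterName=log_error_verbosity,ParameterValue=1,ApplyMethod=immediate"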

This prevented the error from filling up the logs for now. Next I can try upgrading the application to PHP 8 and checking into specific connection parameters that may force it to use caching_sha2_password.

Do you have more or updated information? Or just questions? Please let everybody know in the comments below. FWIW, I’ve created an AWS Re:Post topic requesting the addition of log_error_suppression_list in a parameter group. Feel free to vote that up if you run into this issue.

Migrating 1.2 TB Database From Aurora to MySQL

We have one database server that is running on an old version of Aurora based on MySQL 5.6. AWS is deprecating that version soon and it needs to be upgraded, so I have been working on replacing it. Upgrading the existing 5.6 server to 5.7, then to 8.0 isn’t an option due to an impossibly huge InnoDB transaction history list that will never fix itself. Plus, I want to improve a couple of other things along the way.

I made several attempts at migrating from Aurora 5.6 to Aurora 8.0, but during that process I grew tired of Aurora quirks and costs. Here are some of my raw notes on what was an embarrassingly long migration of a database server from Aurora to MySQL. Going from MySQL to Aurora took just a couple of clicks. Converting from Aurora back to MySQL took months and a lot of headaches.

TLDR: Along the way, I tried using Amazon’s Database Migration Service, but eventually gave up in favor of a good old closely monitored mysqldump and custom scripts.

I had a few goals/requirements:

  • Get rid of our soon-to-be-deprecated Aurora instance based on MySQL 5.6
  • Stop Paying for Storage IOPS (often over $100/day)
  • Convert tables from utf8mb3 to utf8mb4
  • Minimal downtime or customer disruption. Some disruption during low-usage times is okay.

A new MySQL 8 instance with a GP3 storage volume and the recently announced RDS Optimized Writes means that MySQL should be able to handle the workload with no problem, and gets this server back into the MySQL realm, where all of our other servers are, and with which we are more comfortable.

Attempts at using AWS Database Migration Service (DMS)

This service looked promising, but has a learning curve. I eventually gave up using it because of repeated problems that would have taken too much effort to try and resolve.

First attempts:
On the surface, it seems like you configure a source, configure a destination, and then tell DMS to sync one to the other and keep them in sync. It does this in two phases: the Full Dump and the Change Data Capture (CDC). I learned the hard way that the Full Dump doesn’t include any indexes on the tables! This is done to make it as fast as possible. The second, CDC phase just executes statements from the binary log, so without indexes on a 400+GB table those statements take forever, and this will never work.

I also concluded that one of our 300+GB tables can actually be done in a separate process, after the rest of the data is loaded. It contains historic information that will make some things in the application look incomplete until it is loaded, but the application will work with it empty.

Second attempts:
I used DMS for the full dump, then configured it to stop after the full dump, before starting the CDC process. While it was stopped, I added the database indexes and foreign keys. I tried this several times with varying degrees of success, trying to minimize the amount of time it took to add the indexes. Some tables were done instantly, some took a couple of hours, and some took 12+ hours. At one point I figured it would take about 62 hours to add the indexes. I think I got that down to 39 hours by increasing the IOPS, running some ALTER TABLE statements in parallel, etc.

After the indexes were added, I started the second phase of DMS. The Change Data Capture is supposed to pick up at the point in time where the Full Dump was taken and then apply all of the changes from the binary logs to the new server. That process didn’t go smoothly. Again, the first attempts looked promising, but then the binary logs on the server were deleted, so it couldn’t continue. I increased the number of days that binary logs were kept and made more attempts, but they hit problems with foreign key and unique constraints on tables.

The biggest problem with these attempts was that it took about 24 hours for the data migration and about 48 hours to add indexes, so each attempt was several days of effort.

Third and last attempts at using DMS:
After getting pretty familiar with DMS, I ended up creating the schema via `mysqldump --no-data`, then manually editing the file to exclude indexes on some of the biggest tables that would cause the import to go slowly. I excluded the one large, historic table. My overall process looked like this:

  • mysqldump --defaults-group-suffix=dumpschema --no-data thedatabase | sed "s/utf8 /utf8mb4 /" | sed "s/utf8_/utf8mb4_/" > /tmp/schema-limited-indexes.sql
  • Edit /tmp/schema-limited-indexes.sql and remove foreign keys and indexes on large tables
  • cat /tmp/schema-limited-indexes.sql | mysql --defaults-group-suffix=newserver thedatabase
  • On the new server, run ALTER TABLE the_historic_table ENGINE=blackhole;
  • Start DMS process, make sure to have it stop between Full Load and CDC.
  • Wait ~24+ hours for Full load to complete
  • Add Indexes back that were removed from the schema. I had a list of ALTER TABLE statements to run, with an estimate time that each should take. That was estimated at 39 hours
  • Start second Phase (CDC) of the DMS Task
  • Wait for CDC to complete (time estimate unknown. The faster the above steps worked, the less it had to replay)

Unfortunately, a couple of attempts at this had the CDC phase still fail with foreign key constraint errors. I tried several times and don’t know why this happened. Finding the offending rows took many hours since the queries didn’t have indexes and had to do full table scans. In some cases, there were just a few to a few dozen rows that existed in one table without the corresponding row in the foreign table. It’s as if the binary log position taken when the snapshot was started was off by a few seconds, and the dumps of different tables were started at slightly different positions.

After several attempts (taking a couple weeks), I finally gave up on the DMS approach.

Using MySQL Dump

Using mysqldump to move data from one database server to another is a process I have done thousands of times and written many scripts around. It is pretty well understood and predictable. I did a few trial runs to put together this process:

Temporarily Stop all processes on the master server

  • Stop all background processes that write to the server
  • Change the password so that no processes can write to the master
  • Execute SHOW BINARY LOGS on the master and note the last binary log file and position. Do this a few times to make sure that it does not change. (Note that this would be easier if RDS allowed FLUSH TABLES WITH READ LOCK, but since it doesn’t, this process should work.)

Dump the schema to the new server

This has the sed commands in the middle to convert the old “utf8” collations to the desired “utf8mb4” versions. When dumping 1TB+ of data, I found it helped performance a bit to do the schema conversion with the sed commands first, so that the bulk of the data doesn’t have to go through those two commands.

  • mysqldump --defaults-group-suffix=dumpschema --no-data thedatabase |sed "s/utf8 /utf8mb4 /" | sed "s/utf8_/utf8mb4_/" | mysql thedatabase
  • .my.cnf contains this section with the relevant parameters for the dump
    [clientdumpschema]
    host=thehostname.cluster-czizrrfoedlm.us-east-1.rds.amazonaws.com
    port=3306
    user=dumper
    password=thepassword
    ssl-cipher=AES256-SHA:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA
    quick
    compression-algorithms=zlib
    set-gtid-purged=OFF
    max_allowed_packet=1024M
    single-transaction=TRUE
    column_statistics=0
    net_buffer_length=256k
    

Move the data

To move the data, I ran this command. Note that it starts with time so that I could see how long the dump takes, and it pipes through pv so I could watch the progress (more on that below).

time mysqldump --defaults-group-suffix=dumpdata --no-create-info thedatabase | pv |mysql thedatabase

My .my.cnf contains this section for the data dump:

[clientdumpdata]
host=thehostname.cluster-czizrrfoedlm.us-east-1.rds.amazonaws.com
port=3306
user=dumper
password=thepassword
ssl-cipher=AES256-SHA:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA
quick
ignore-table=thedatabase.the_big_table
compression-algorithms=zlib
set-gtid-purged=OFF
max_allowed_packet=1024M
single-transaction=TRUE
column_statistics=0
net_buffer_length=256k

Note that the above command includes the Linux pv command in between, which is a nice way to monitor progress. It displays a simple line on stderr that shows the total transfer size, elapsed time, and current speed.

266.5GiB 57:16:47 [ 100KiB/s] [             <=>         ]

I experimented with several values for the NET_BUFFER_LENGTH parameter by dumping the same multi-GB table over and over with different values. This value determines how many rows are included in each INSERT INTO statement generated by mysqldump. I was hoping that a larger value would improve performance, but I found that larger values actually slowed things down. The best value I found was 256k.

NET_BUFFER_LENGTH value    Elapsed Time
64k                        13m 44s
256k                        8m 27s
256k                        7m 20s
1M                         10m 23s
16M                        11m 32s

After Migration is Started

After the mysqldump has been started, I re-enabled traffic back to the master server by setting the password back to the original. I kept all background jobs disabled to minimize the amount of data that had to be copied over afterwards.

Final attempt to use DMS

After the mysqldump was finished, I attempted to use the DMS Change Data Capture process to copy over the data that had changed on the master. You can start a Database Migration Task that begins at a specific point in the master log position. Maybe. I tried it, but it failed pretty quickly with a duplicate key constraint. I gave up on DMS and figured I would just move over any data needed manually via custom scripts.

Other findings

In attempting to maximize the speed of the transfer, I increased the IOPS on the GP3 volume from its base level of 12,000 to 32,000. Initially that helped, but for some reason I still don’t understand, the throughput was then limited very strictly to 6,000 IOPS. As seen in the chart below, it burst above that for some short periods, but it was pretty strictly constrained most of the time. I think this has to do with how RDS uses multiple volumes to store the data. I suspect that each volume has 6,000 IOPS of capacity, and all of my data was going to a single volume.

[Chart: RDS IOPS maxed out at 6,000]

That concludes the notes that I wanted to take. Hopefully somebody else finds these learnings or settings useful. If this has been helpful, or if you have any comments on some of the problems that I experienced, please let me know in the comments below.

Ubuntu 20.04 Cloud-Init Example to Create a User That Can Use sudo

Use the steps below and the example config to create a cloud-init file that creates a user, sets their password, and enables SSH access. The Cloud Config documentation has some examples, but they don’t actually work for being able to SSH into a server and run commands via sudo.

First, create a password hash with mkpasswd command:

$ mkpasswd -m sha-512
Password:  
$6$nq4v1BtHB8bg$Oc2TouXN1KZu7F406ELRUATiwXwyhC4YhkeSRD2z/I.a8tTnOokDeXt3K4mY8tHgW6n0l/S8EU0O7wIzo.7iw1

Make note of the output string. You need to enter it exactly in the passwd line of your cloud-init config.

This is the minimal configuration to create a user using cloud-init:

users:
  - name: brandon
    groups: [ sudo ]
    shell: /bin/bash
    lock_passwd: false
    passwd: "$6$nq4v1BtHB8bg$Oc2TouXN1KZu7F406ELRUATiwXwyhC4YhkeSRD2z/I.a8tTnOokDeXt3K4mY8tHgW6n0l/S8EU0O7wIzo.7iw1"
    ssh-authorized-keys:
    - ssh-ed25519 AAAAC3NzaC1lZDI1zzzBBBGGGg3BZFFzTexMPpOdq34a6OlzycjkPhsh4Qg2tSWZyXZ my-key-name

A few things that are noteworthy:

  • The string in the passwd field is enclosed in quotes
  • lock_passwd: false is required to use sudo. Otherwise, the system user account created will have a disabled password and will be unable to use sudo. You’ll just continually be asked for a password, even if you enter it correctly.
  • I prefer the method of adding the user to the sudo group to grant access to sudo. There are other ways to make that work as well, but I feel like this is the cleanest.
  • Adding any users will prevent the default ubuntu user from being created.
Solving ECS Stuck in Pending and Frozen / Stalled ECS Hosts Problems

We’ve had a strange, hard-to-track-down problem for months now. It has felt like a bug with Amazon ECS, but everything seems to have been working correctly.

The main way that we’ve observed this problem is that ECS would say that it was launching tasks, but they would stay in a “PENDING” state forever. Conversely, when tasks needed to be killed, the desired state would change to Stopped, but the ECS Console would indicate that they were still running. We discovered quickly that some of our ECS host servers would become completely unresponsive, sometimes with 100% CPU usage, sometimes with near zero CPU usage. Terminating the instance and having the Auto Scaling group recreate it would generally solve the problem, but it’s never good to have things frozen without understanding why.

Often, the host servers would be completely unresponsive and we were unable to SSH in to investigate. When we were able to access them, we looked through the logs and found them full of failures about being unable to talk to external resources. After diving pretty deep, we figured out that the route table was missing a default gateway. It’s hard to talk to anything when you can only use the local network.

This is an example of a missing default gateway:

[ec2-user@ip-172-31-45-74 ~]$ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.31.32.0     0.0.0.0         255.255.240.0   U     0      0        0 eth0

On a functioning instance, it should look like this. Notice the 0.0.0.0 destination with the IP address of the default gateway:

[ec2-user@ip-172-31-39-228 ~]$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.31.32.1     0.0.0.0         UG    0      0        0 eth0
169.254.169.254 0.0.0.0         255.255.255.255 UH    0      0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.31.32.0     0.0.0.0         255.255.240.0   U     0      0        0 eth0

It was puzzling how the machine would work for a while, and then its default gateway would disappear.

I’m still not certain exactly how that is happening. However, the system log indicates that there is a period of extremely high load, and the machine gets frozen for minutes (maybe hours) at a time.

Some of these log entries are indicative of major delays:

Jan 20 13:26:44 ip-172-31-123-45.ec2.internal crond[21992]: (root) INFO (Job execution of per-minute job scheduled for 13:25 delayed into subsequent minute 13:26. Skipping job run.)

Jan 17 21:20:31 ip-172-31-45-166.ec2.internal chronyd[2696]: Forward time jump detected!

Notice how these logs are out of order too:

Jan 20 13:39:22 ip-172-31-123-45.ec2.internal kernel: R13: 00007faf9dc777a8 R14: 00000000000031f9 R15: 00007faf9dc7d510
Jan 20 13:28:30 ip-172-31-123-45.ec2.internal dockerd[4660]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.MakeErrorHandler.func1 (httputils.go:107)
Jan 20 13:36:03 ip-172-31-123-45.ec2.internal dhclient[3275]: XMT: Solicit on eth0, interval 129760ms.
Jan 20 13:28:30 ip-172-31-123-45.ec2.internal dockerd[4660]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.MakeErrorHandler.func1 (httputils.go:107)

Finally, this may be the thing that ultimately disables the networking. It looks like `oom-killer` killed the `dhclient-script` process, which may have left the network in a very bad state:

Jan 20 15:28:36 ip-172-31-45-74.ec2.internal kernel: dhclient-script invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null),  order=0, oom_score_adj=0
Jan 20 15:28:36 ip-172-31-45-74.ec2.internal kernel: dhclient-script cpuset=/ mems_allowed=0

You can simply run

sudo dhclient eth0

to have it grab the default gateway from DHCP again. But it’s best to put memory limits in place to keep the host from running out of resources to begin with.
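
For ECS, that means setting container-level memory limits in the task definition. This is a minimal CloudFormation sketch (the names and sizes are illustrative, not from our actual stack):

MyTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: my-app
    ContainerDefinitions:
      - Name: app
        Image: my-app-image:latest
        Memory: 512              # hard limit in MiB; the container is killed if it exceeds this
        MemoryReservation: 256   # soft limit in MiB used for placement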

Find MySQL indexes that can be removed to free up disk space and improve performance

I wrote this handy query to find indexes that can be deleted because they have not been in use. It queries the performance_schema database for usage on the indexes, and joins on INFORMATION_SCHEMA.TABLES to see the index size.

Indexes that have zero reads and writes are obvious candidates for removal. They take extra write overhead to keep them updated, and you can improve performance on a busy server by removing them. You can also free up some disk space without them. The size column below helps to understand where you have the most opportunity for saving on disk usage.

mysql>
SELECT  OBJECT_NAME,
        index_name,
        SUM(INDEX_LENGTH) AS size,
        SUM(count_star) AS count_star,
        SUM(count_read) AS count_read,
        SUM(count_write) AS count_write
FROM  performance_schema.table_io_waits_summary_by_index_usage
JOIN information_schema.TABLES
    ON table_io_waits_summary_by_index_usage.OBJECT_SCHEMA = TABLES.TABLE_SCHEMA
   AND table_io_waits_summary_by_index_usage.OBJECT_NAME = TABLES.TABLE_NAME
WHERE OBJECT_SCHEMA LIKE 'mydatabase%'
GROUP BY object_name, index_name
ORDER BY count_star ASC, size DESC
LIMIT 20;

+------------------------------+---------------------------------+-------------+------------+------------+-------------+
| OBJECT_NAME                  | index_name                      | size        | count_star | count_read | count_write |
+------------------------------+---------------------------------+-------------+------------+------------+-------------+
| transactions                 | order_id                        | 42406641664 |          0 |          0 |           0 |
| transactions                 | msku-timestamp                  | 42406641664 |          0 |          0 |           0 |
| transactions                 | fkTransactionsBaseEvent         | 42406641664 |          0 |          0 |           0 |
| baseEvent                    | PRIMARY                         | 33601945600 |          0 |          0 |           0 |
| baseEvent                    | eventTypeId                     | 33601945600 |          0 |          0 |           0 |
| orders                       | modified                        | 20579876864 |          0 |          0 |           0 |
| orders                       | buyerId-timestamp               | 20579876864 |          0 |          0 |           0 |
| productReports               | productAd-date-venue            |  8135458816 |          0 |          0 |           0 |
| shipmentEvent                | id                              |  7831928832 |          0 |          0 |           0 |
| shipmentEvent                | eventTypeId                     |  7831928832 |          0 |          0 |           0 |
| historyEvents                | timestamp_venue_entity          |  4567531520 |          0 |          0 |           0 |
| targetReports                | venueId-date-targetId           |  3069771776 |          0 |          0 |           0 |
| productAds                   | venue-productAd                 |  1530888192 |          0 |          0 |           0 |
| keywords                     | venue-keyword                   |   895598592 |          0 |          0 |           0 |
| targetingExpressions         | venue-target                    |   215269376 |          0 |          0 |           0 |
| targetingExpressions         | rType-rValue                    |   215269376 |          0 |          0 |           0 |
| serviceFeeEvent              | PRIMARY                         |    48234496 |          0 |          0 |           0 |
| serviceFeeEvent              | id                              |    48234496 |          0 |          0 |           0 |
| serviceFeeEvent              | eventTypeId                     |    48234496 |          0 |          0 |           0 |
| adGroups                     | venue-adGroup                   |    42336256 |          0 |          0 |           0 |
+------------------------------+---------------------------------+-------------+------------+------------+-------------+
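
Once you have confirmed that an index really is unused (check the counters over a long enough window, since they reset when the server restarts), dropping it is a single statement. A hypothetical example using the first row above:

ALTER TABLE transactions DROP INDEX order_id;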

PHP Sessions with Redis Cluster (using AWS Elasticache)

I’ve recently been moving some of our projects from a single Redis server (or a server with a replica) to the more modern Redis Cluster configuration. However, when trying to set up PHP sessions to use the cluster, I found there wasn’t a lot of documentation or examples. This serves as a walk-through for setting up PHP sessions to use a Redis Cluster, specifically with Elasticache on AWS.

First, create your Elasticache Redis instance as shown below. Note that “Cluster Mode Enabled” is what causes Redis to operate in cluster mode.

[Screenshot: AWS Elasticache Redis creation]

Once the servers are launched, make note of the Configuration Endpoint, which should look something like: my-redis-server.dltwen.clustercfg.usw1.cache.amazonaws.com:6379

Finally, use these settings in your php.ini file. The exact location of this file will depend on your OS, but on modern Ubuntu instances you can place it in /etc/php/7.0/apache2/conf.d/30-redis-sessions.ini

Note the special syntax for the save_path, where it has seed[]=. You only need to put the main cluster configuration endpoint here, not all of the individual instances as other examples online appear to use.


session.save_handler = rediscluster
session.save_path = "seed[]=my-redis-server.dltwen.clustercfg.usw1.cache.amazonaws.com:6379"
session.gc_maxlifetime = 1296000

That’s it. Restart your webserver and sessions should now get saved to your Redis cluster.
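
A quick, throwaway way to confirm the handler is working (just a sketch, not part of the original setup) is a small PHP page that increments a counter stored in the session:

<?php
// Uses the rediscluster session handler configured in php.ini
session_start();
$_SESSION['visits'] = ($_SESSION['visits'] ?? 0) + 1;
echo "Visits this session: " . $_SESSION['visits'] . "\n";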

In the event that something goes wrong, you might see something like this in your web server log files:


PHP Warning: Unknown: Failed to write session data (redis). Please verify that the current setting of session.save_path is correct (tcp://my-redis-server.dltwen.clustercfg.use1.cache.amazonaws.com:6379) in Unknown on line 0
