Upgrading Disks in a Software RAID System

I have a system at home with a pair of 640 GB drives that are getting full. The drives are configured in a Linux software RAID1 array. Replacing them with a pair of 2 TB drives is a pretty easy (although lengthy) process.

First, verify that the RAID system is running correctly. Take a look at /proc/mdstat to confirm that both drives are online:

root@host:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdb3[1] sda3[0]
      622339200 blocks [2/2] [UU]

md0 : active raid1 sdb1[1] sda1[0]
      192640 blocks [2/2] [UU]

After verifying that the arrays were healthy, I powered the machine off, removed the second hard drive, and replaced it with one of the new drives. Upon starting it back up, Ubuntu noticed that the RAID system was running in degraded mode, and I had to answer yes at the console to have it continue booting.

Once the machine was running, I logged in and created a similar partition structure on the new drive using the fdisk command. On my system, I have a small 200 MB partition for /boot as /dev/sdb1, a 1 GB swap partition, and then the rest of the drive as one big block for the root partition. I copied the partition layout that was on /dev/sda, but made the final partition take up the entire rest of the new drive. Make sure to set the first partition as bootable. The partitions on /dev/sdb now look like this:

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          24      192748+  fd  Linux raid autodetect
/dev/sdb2              25         146      979965   82  Linux swap / Solaris
/dev/sdb3             147      243201  1952339287+  fd  Linux raid autodetect
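
If the new drive were the same size, the whole layout could be copied over in one step with sfdisk instead of recreating it by hand; a minimal sketch, assuming /dev/sda is the surviving original drive and /dev/sdb is the blank new one:

root@host:~# sfdisk -d /dev/sda > /tmp/sda.parts     # dump the partition layout of the original drive
root@host:~# sfdisk /dev/sdb < /tmp/sda.parts        # write the same layout onto the new drive

Because the new drives are larger here, the last partition would still need to be enlarged afterward, which is why I recreated the layout in fdisk instead.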

After the partitions match up, I can add the partitions on /dev/sdb back into the RAID arrays:

root@host:~# mdadm --manage /dev/md0 --add /dev/sdb1
root@host:~# mdadm --manage /dev/md1 --add /dev/sdb3

And watch the status of the rebuild using

watch cat /proc/mdstat

After the rebuild was done, I installed grub onto the new /dev/sdb. From the grub shell, root selects the partition that holds /boot and setup writes to the master boot record of the second drive:

grub
root (hd1,0)
setup (hd1)
quit

Finally, reboot once and make sure that everything works as expected.

The system is now running with one old drive and one new drive. The next step is to repeat the process: remove the remaining old drive, replace it with the second new drive, and rebuild and re-add its partitions to the RAID arrays. The steps are the same as above, except they are performed on /dev/sda. I also had to change my BIOS to boot from the second drive.
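
If you want mdadm to drop the remaining old drive cleanly before pulling it, its partitions can be failed and removed from the arrays first; a sketch, and not strictly required since the machine is powered off for the swap anyway:

root@host:~# mdadm --manage /dev/md0 --fail /dev/sda1 --remove /dev/sda1
root@host:~# mdadm --manage /dev/md1 --fail /dev/sda3 --remove /dev/sda3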

Once both drives are installed and working in the RAID arrays, the final part of the process is to increase the size of the file system to the full size of the new drives. I first had to disable the ext3 file system journal. This was necessary so that the online resize doesn’t run out of memory.

Edit /etc/fstab, change the file system type for the root partition to ext2, and add “noload” to the mount options, which disables the file system journal. My /etc/fstab looks like this:

# /etc/fstab: static file system information.
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
proc            /proc           proc    defaults        0       0
# /dev/md1
UUID=b165f4be-a705-4479-b830-b0c6ee77b730 /               ext2    noload,relatime,errors=remount-ro 0       1
# /dev/md0
UUID=30430fe0-919d-4ea6-b3c2-5e3564344917 /boot           ext3    relatime        0       2
# /dev/sda5
UUID=94b03944-d215-4882-b0e0-dab3a8c50480 none            swap    sw              0       0
# /dev/sdb5
UUID=ebb381ae-e1bb-4918-94ec-c26e388bb539 none            swap    sw              0       0

You then have to run `mkinitramfs` to rebuild the initramfs so that it picks up the changed mount options:

root@host:~# mkinitramfs -o /boot/initrd.img-2.6.31-20-generic 2.6.31-20-generic

Reboot so that the system is running with journaling disabled on the root partition. Now you can finally grow the RAID device to the full size of the underlying partitions:

root@host:~# mdadm --grow /dev/md1 --size=max

And then resize the file system:

root@host:~# resize2fs /dev/md1
resize2fs 1.41.9 (22-Aug-2009)
Filesystem at /dev/md1 is mounted on /; on-line resizing required
old desc_blocks = 38, new_desc_blocks = 117
Performing an on-line resize of /dev/md1 to 488084800 (4k) blocks.

The system is now up and running with the larger drives :). Change /etc/fstab back, rebuild the initramfs, and reboot one more time to re-enable journaling, and the upgrade is officially done.
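
For reference, the revert is just the earlier journal-disabling steps in reverse; a quick sketch, assuming the same kernel version as above:

root@host:~# vi /etc/fstab      # change the root entry back to ext3 and drop the noload option
root@host:~# mkinitramfs -o /boot/initrd.img-2.6.31-20-generic 2.6.31-20-generic
root@host:~# reboot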

Troubleshooting OpenVZ Out of Memory and Could Not Allocate Memory Errors

I will sometimes run into an error inside an OpenVZ guest where the system complains about running out of memory. This is usually seen as a message containing “Cannot allocate memory”. It may even occur when trying to enter the container:

[root@host ~]# vzctl enter 908
entered into CT 908
-bash: /root/.bash_profile: Cannot allocate memory

This particular error can be difficult to troubleshoot because OpenVZ has so many different limits. “Cannot allocate memory” is a pretty generic error, so Googling it doesn’t necessarily yield any useful results.

I’ve found that the best way to track down the source of the problem is to examine the /proc/user_beancounters statistics on the host server. The far right column is labeled ‘failcnt’ and is a simple counter for the number of times that particular resource has been exhausted. Any non-zero numbers in that column may indicate a problem. In this example, you can see that the dcachesize parameter is too small and should be increased:

[root@host ~]# cat /proc/user_beancounters
Version: 2.5
       uid  resource                     held              maxheld              barrier                limit              failcnt
      908:  kmemsize                  5170890              5776435             43118100             44370492                    0
            lockedpages                     0                    0                  256                  256                    0
            privvmpages                 53561                55089              1048576              1048576                    0
            shmpages                        2                    2                21504                21504                    0
            dummy                           0                    0                    0                    0                    0
            numproc                        30                   38                 2000                 2000                    0
            physpages                    6136                 7392                    0  9223372036854775807                    0
            vmguarpages                     0                    0                65536                65536                    0
            oomguarpages                 6136                 7392                26112  9223372036854775807                    0
            numtcpsock                      8                    8                  360                  360                    0
            numflock                        5                    6                  380                  420                    0
            numpty                          0                    1                   16                   16                    0
            numsiginfo                      0                    2                  256                  256                    0
            tcpsndbuf                  140032                    0             10321920             16220160                    0
            tcprcvbuf                  131072                    0              1720320              2703360                    0
            othersockbuf               151320               156864              4504320             16777216                    0
            dgramrcvbuf                     0                 8472               262144               262144                    0
            numothersock                  107                  109                 5000                 5000                    0
            dcachesize                3408570              3413265              3409920              3624960                   81
            numfile                       724                  957                18624                18624                    0
            dummy                           0                    0                    0                    0                    0
            dummy                           0                    0                    0                    0                    0
            dummy                           0                    0                    0                    0                    0
            numiptent                      10                   10                  128                  128                    0
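
To pick out the failing counters without scanning the whole table, a quick filter on the last column works; a small sketch, assuming the column layout shown above:

[root@host ~]# awk 'NR > 2 && $NF > 0' /proc/user_beancounters
            dcachesize                3408570              3413265              3409920              3624960                   81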

You can do a little investigation to see what each particular limit does. I will usually not dig too deep; I just double the value and try again. You can set a new value using the vzctl command like this:

[root@host ~]# vzctl set 908 --dcachesize=6819840:7249920 --save

Southeast Linux Fest Presentation on MySQL Replication

I was fortunate to be selected to give a presentation at the 2010 Southeast Linux Fest held this year in Greenville, SC. The topic was MySQL replication, which I adapted from a similar presentation I gave about 1.5 years ago at my local LUG. I’ve configured plenty of replicated servers, and I think that I understand it well enough to explain it to others.

The 2-hour presentation is about half slides and half demo. Over the course of the presentation I set up a simple master-slave pair, then add a second slave. Taking it a step further, I set up the three servers to replicate in a chain, and finally I configure them to replicate in a full circle so that changes made on one server are propagated to all of the others. I intentionally do things that break replication at certain points to show some of the limitations and the configurable features that can help work around them.
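
The core of the master-slave setup in the demo boils down to a handful of statements; a rough sketch, with the hostnames, credentials, and binary log coordinates as placeholders:

# on the master: create a replication user and note the current binlog position
mysql -e "GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%' IDENTIFIED BY 'somepassword'"
mysql -e "SHOW MASTER STATUS"

# on the slave: point it at the master and start replicating
mysql -e "CHANGE MASTER TO MASTER_HOST='master.example.com', MASTER_USER='repl', MASTER_PASSWORD='somepassword', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=106"
mysql -e "START SLAVE"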

Slides for the presentation are available in OpenOffice format.

The presentation was recorded, so hopefully the SELF team will have those videos available shortly.

Skipping the DROP TABLE, CREATE TABLE statements in a large mysqldump file

I have a large table of test data that I’m copying into some development environments. I exported the table with mysqldump, which puts DROP TABLE and CREATE TABLE statements at the top:

DROP TABLE IF EXISTS `mytable`;
CREATE TABLE `mytable` (
  `somecol` varchar(10) NOT NULL default '',
   ... other columns ...
  PRIMARY KEY  (`somecol`),
  KEY `isbn10` (`somecol`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The problem is that the developer has altered the table, and re-importing the test data would undo those changes. Editing the text file is impractical because of its size (500 MB gzipped), so I came up with this workaround, which uses sed to slightly alter the SQL so that it doesn’t try to drop or recreate the real table. It comments out the DROP TABLE line and creates the new table in the test database instead of the real database.

zcat bigfile.sql.gz |sed "s/DROP/-- DROP/"|sed "s/CREATE TABLE /CREATE TABLE test./"|mysql databasename
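
If regenerating the dump is an option, mysqldump can leave those statements out in the first place; a sketch, using the same database and table names as above:

mysqldump --no-create-info --skip-add-drop-table databasename mytable | gzip > bigfile.sql.gz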

Sleeping for a random amount of time in a shell script

You can use the shell’s special $RANDOM variable to get a random number and divide it by the maximum number of seconds that you want to wait. Use the remainder as the number of seconds to sleep, since it will always fall between zero and one less than the maximum you specified. This example will sleep anywhere from zero seconds up to 10 minutes (600 seconds):

 /bin/sleep `/usr/bin/expr $RANDOM % 600`

Purists will note that the result isn’t perfectly uniform: the maximum value of $RANDOM is 32767, which is not evenly divisible by most values you are likely to use, but it is close enough.
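
This is handy for staggering cron jobs so that a group of machines doesn’t all hit a server at the same moment; a sketch, with the job path as a placeholder:

# pause up to 10 minutes before the nightly job (note: % must be escaped
# inside a crontab, and $RANDOM requires bash rather than plain sh)
SHELL=/bin/bash
5 2 * * *  /bin/sleep `/usr/bin/expr $RANDOM \% 600`; /usr/local/bin/nightly-job.sh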

Installing SVN and Trac on a CentOS 5 server

Make sure that you have the RPMForge repository enabled, then install Subversion, mod_dav_svn, mod_python, and Trac. This will pull in a few required dependencies (i.e. neon and some Python utilities):

# yum install subversion mod_dav_svn mod_python trac

Create a directory for your repositories, an initial repository for testing, and your htpasswd file. Then create a Trac environment and set it up:

# mkdir /home/svn/
# svnadmin create /home/svn/testrepo
# chown -R apache:apache /home/svn/*
# htpasswd -c  /home/svn/.htpasswd brandon

# mkdir /home/trac/
# trac-admin /home/trac/ initenv
    ... answer questions as appropriate ...
# chown -R apache:apache /home/trac/

Add this to your Apache configuration in the relevant place (I like to put it under an SSL VirtualHost):

    <Location /svn>
        DAV svn
        SVNParentPath /home/svn/
        #SVNListParentPath on
        # Authentication
        AuthType Basic
        AuthName "RoundSphere SVN Repository"
        AuthUserFile /home/svn/.htpasswd
        Order deny,allow
        Require valid-user
    </Location>
    <Location /trac>
        SetHandler mod_python
        PythonHandler trac.web.modpython_frontend
        PythonOption TracEnv /home/trac
        PythonOption TracUriRoot /trac
        # Authentication
        AuthType Basic
        AuthName "MyCompany Trac Environment"
        AuthUserFile /home/svn/.htpasswd
        Require valid-user
    </Location>
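
After the configuration is in place, check the syntax and restart Apache so the new Location blocks take effect:

# apachectl configtest
# service httpd restart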

Now test to make sure that you can view your test repository in a browser and that it prompts for a username and password as desired:

https://your-hostname/svn/testrepo/

You should see a plain-looking page that mentions the name of your repository and shows that it is at Revision 0.

You should also be able to access your trac installation at

https://your-hostname/trac/

Customize your logo, change the home page, start making some tickets, use the wiki, and get to work.

Installing the Pandora One client on 64-bit Ubuntu 9.10

I was surprised and happy to see that the Pandora One client should work on Linux. It uses the Adobe AIR framework, which means that Pandora doesn’t have to write a Linux-specific client.

However, installing it on a modern 64-bit Ubuntu 9.10 install took just a bit of manipulation. Pandora provides some basic instructions for Linux users here, even though Linux is officially unsupported. Those instructions, along with the Adobe AIR notes here, provided enough information for me to get it installed and working.

Here’s what I did:

  • Start out at the Pandora One site
  • Click on the "Download Pandora Desktop" link and save that file to /tmp
  • Follow the link to Install Adobe AIR and save that file to /tmp also
  • Open a shell, chmod the Adobe AIR installer to 755, and then run it.
  • Go through the Adobe AIR install until it completes
  • Once Adobe AIR is installed, you will need to put some 32-bit libraries in place to make it run correctly. Some of the steps on Adobe’s site work and some don’t, so this is what I did:
  • Download the two .deb files for libnss3 and libnspr4 to /tmp
  • From your shell, run:
     sudo file-roller ./libnss3-1d_3.12.0~beta3-0ubuntu1_i386.deb
    
  • Navigate to data.tar.gz => /usr => lib. Click on all of the files in that directory and click Extract. Type in /usr/lib32/ so that they extract there, then close all of the file-roller windows.
  • Do the same thing with the libnspr4 .deb file that you downloaded (a command-line alternative to the file-roller steps is sketched after this list)
  • Copy the adobe cert store into place with this command:
     sudo cp /usr/lib/libadobecertstore.so /usr/lib32
    
  • Now you can finally install the Pandora application by running:
    sudo Adobe\ AIR\ Application\ Installer /tmp/pandora_2_0_2.air 
    

    That should install the application correctly. It will add an icon to Applications / Accessories.

  • When the Pandora One client starts up, it currently complains about connecting to an untrusted server. I have to click to accept for the session each time.
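
If you would rather skip the file-roller clicking, the same 32-bit libraries can be pulled out of the .deb files on the command line with dpkg-deb; a sketch, assuming the files were saved to /tmp as above:

cd /tmp
dpkg-deb -x libnss3-1d_3.12.0~beta3-0ubuntu1_i386.deb extracted/
dpkg-deb -x libnspr4-0d_*_i386.deb extracted/     # adjust to match the libnspr4 .deb you downloaded
sudo cp extracted/usr/lib/* /usr/lib32/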

Now you should be able to play your Pandora music from your 64-bit Ubuntu 9.10 box.

Speed up a Linux Software RAID Rebuild

I’m setting up software RAID on a running server, and the initial sync of the 1 TB drives is taking forever. It has been running for about 6 hours and says that it will take about 5 days (7400 minutes) at this pace:

[root@host ~]# cat /proc/mdstat
md1 : active raid1 sdb3[1] sda3[2]
      974559040 blocks [2/1] [_U]
      [>....................]  recovery =  3.9% (38109184/974559040) finish=7399.1min speed=2108K/sec

I did some read and write tests directly against the drives using dd to make sure that they were working okay, and they can operate at about 100 MB/s:

[root@host ~]# dd if=/dev/zero of=/dev/sda2 bs=1024 count=1024000
    1048576000 bytes (1.0 GB) copied, 10.8882 seconds, 96.3 MB/s
[root@host ~]# dd if=/dev/zero of=/dev/sdb2 bs=1024 count=1024000
    1048576000 bytes (1.0 GB) copied, 11.1162 seconds, 94.3 MB/s
[root@host ~]# dd if=/dev/sda2 of=/dev/null bs=1024 count=1024000
    1048576000 bytes (1.0 GB) copied, 10.2829 seconds, 102 MB/s
[root@host ~]# dd if=/dev/sdb2 of=/dev/null bs=1024 count=1024000
    1048576000 bytes (1.0 GB) copied, 10.5109 seconds, 99.8 MB/s

What I failed to realize is that there is a configurable limit for the minimum and maximum speed of the rebuild. Those parameters are set in /proc/sys/dev/raid/speed_limit_min and /proc/sys/dev/raid/speed_limit_max. They default to a pretty slow 1 MB/s minimum, which was causing the rebuild to crawl.

Increasing the maximum limit didn’t automatically make it faster either. I had to increase the minimum limit to get it to jump up to a respectable speed.

[root@host ~]# echo 100000 > /proc/sys/dev/raid/speed_limit_min

[root@host ~]# watch cat /proc/mdstat
Every 2.0s: cat /proc/mdstat 
md1 : active raid1 sdb3[1] sda3[2]
      974559040 blocks [2/1] [_U]
      [=>...................]  recovery =  7.7% (75695808/974559040) finish=170.5min speed=87854K/sec

Now it is up around 87 MB/s and will take just a few hours to complete the rest of the drive.
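
These /proc settings reset on reboot; to make the limits persistent, the same knobs can be set through sysctl, using the same value as above:

[root@host ~]# sysctl -w dev.raid.speed_limit_min=100000
[root@host ~]# echo "dev.raid.speed_limit_min = 100000" >> /etc/sysctl.conf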

PHP Wrapper Class for a Read-only database

This is a pretty special case of a database wrapper class where I wanted to discard any updates to the database but let SELECT queries run against an alternative read-only database. In this instance, I have a planned outage of a primary database server but would like the public-facing websites and web services to remain as accessible as possible.

I wrote this quick database wrapper class that passes all SELECT queries on to a local replica of the database and silently discards any writes. On this site almost all of the functionality still works, but it obviously isn’t saving any new information while the primary database is unavailable.

Here is my class. It is intended as a wrapper around an ADOdb connection object, but it is generic enough that I think it would work with many other database abstraction layers as well.

class db_unavailable {
    var $readonly_db;

    function __construct($readonly_db)
    {
        // Keep a handle to the read-only replica connection
        $this->readonly_db = $readonly_db;
    }

    function query($sql)
    {
        $args = func_get_args();
        if (preg_match("#(INSERT INTO|REPLACE INTO|UPDATE|DELETE)#i", $args[0])) {
            // Silently discard anything that would modify data
            // echo "Unable to do insert/replace/update/delete query: $sql\n";
            return true;
        } else {
            // Pass reads through to the read-only replica
            return call_user_func_array(array($this->readonly_db, 'query'), $args);
        }
    }

    function __call($function, $args)
    {
        // Proxy any other method calls straight to the replica connection
        return call_user_func_array(array($this->readonly_db, $function), $args);
    }
}

I simply create a connection object that points to the read-only database, then create my main $db object as a new db_unavailable() wrapper around it. Any SELECT queries against $db behave as they normally do, and data-modifying queries are silently discarded.

Booting from a Software RAID Device on Ubuntu Karmic (9.10)

It’s a few days away from Karmic’s official release, and I’m putting together a new computer, so I thought I would give the new version a try. I set everything up with software RAID, and it still annoys me that doing so on Ubuntu requires the Alternate CD and a text-based install.

I configured the /boot partition to use a RAID1 array, as I normally do. After getting most of the way through the install, I got a big red screen with the following error:

====================================
Unable to install GRUB in /dev/md0
Executing 'grub-install /dev/md0' failed.
This is a fatal error.
====================================

I tried several variations on continuing, but nothing seemed to work. It turns out this is a known issue with Karmic. Actually, the known issue says that grub is only installed to the first drive, but that was not my experience.

Bug 420748 gives some detail on the issue, and it seems that the change to GRUB 2 might have something to do with it. From the red screen, I chose to proceed without installing a boot loader. I then rebooted to the Alternate CD and chose to repair a broken installation. Once at a shell, I ran these commands to install grub to the master boot records of both drives:

grub-install /dev/sda
grub-install /dev/sdb

I removed the CD and rebooted, but ended up at a grub shell. At that point I ran the following to boot into the system (note the difference in GRUB 2 commands):

set root=(hd0,1)
linux /vmlinuz-2.6.31-14-generic ro root=/dev/md1 (use tab-completion on the kernel image filename)
initrd /initrd.img-2.6.31-14-generic (again, use tab-completion)
boot

At that point, it started up normally. Once logged in, I started a terminal and ran this command as root to create a minimal grub.cfg file (note that it is no longer called menu.lst):

grub-mkconfig > /boot/grub/grub.cfg
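
On Ubuntu, the update-grub wrapper regenerates the same file in the standard location, so this should be equivalent:

update-grub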

I’m now able to reboot into a working system. So much for a flawless install experience.