Increasing the number of simultaneous SASL authentication servers with Postfix

I had a customer complaining lately that messages sent via Gmail to one of my mail servers was occasionally receiving SMTP Authentication failures and bounce backs from Gmail. Fortunately he noticed it happening mainly when he sent a messages to multiple recipients and was able to send me some of the bounces for me to track it down pretty specifically in the postfix logs.

The Error message via Gmail was:

Technical details of permanent failure:
Google tried to deliver your message, but it was rejected by the recipient domain. We recommend contacting the other email provider for further information about the cause of this error. The error that the other server returned was: 535 535 5.7.0 Error: authentication failed: authentication failure (SMTP AUTH failed with the remote server) (state 7).

This was a little odd, because the SMTP AUTH failure is what I would typically expect with a mistyped username and password. However, I could see that plenty of messages were being sent from the same client. By looking at the specific timestamp of the bounced message, I tracked down the relevant log segment shown below. It indicates 5 concurrent SMTPD sessions where the SASL authentication was successful on 4 of them and failed on the 5th.

Jul  5 12:43:39 mail postfix/smtpd[13602]: connect from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:39 mail postfix/smtpd[13602]: setting up TLS connection from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:39 mail postfix/smtpd[14113]: connect from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:39 mail postfix/smtpd[14113]: setting up TLS connection from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:39 mail postfix/smtpd[14115]: connect from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:39 mail postfix/smtpd[14115]: setting up TLS connection from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:39 mail postfix/smtpd[14116]: connect from mail-bk0-f49.google.com[209.85.214.49]
Jul  5 12:43:39 mail postfix/smtpd[14117]: connect from mail-bk0-f49.google.com[209.85.214.49]
Jul  5 12:43:39 mail postfix/smtpd[14116]: setting up TLS connection from mail-bk0-f49.google.com[209.85.214.49]
Jul  5 12:43:39 mail postfix/smtpd[14117]: setting up TLS connection from mail-bk0-f49.google.com[209.85.214.49]
Jul  5 12:43:39 mail postfix/smtpd[13602]: TLS connection established from mail-bk0-f50.google.com[209.85.214.50]: TLSv1 with cipher RC4-SHA (128/128 bits)
Jul  5 12:43:39 mail postfix/smtpd[14113]: TLS connection established from mail-bk0-f50.google.com[209.85.214.50]: TLSv1 with cipher RC4-SHA (128/128 bits)
Jul  5 12:43:39 mail postfix/smtpd[14115]: TLS connection established from mail-bk0-f50.google.com[209.85.214.50]: TLSv1 with cipher RC4-SHA (128/128 bits)
Jul  5 12:43:39 mail postfix/smtpd[14116]: TLS connection established from mail-bk0-f49.google.com[209.85.214.49]: TLSv1 with cipher RC4-SHA (128/128 bits)
Jul  5 12:43:39 mail postfix/smtpd[14117]: TLS connection established from mail-bk0-f49.google.com[209.85.214.49]: TLSv1 with cipher RC4-SHA (128/128 bits)
Jul  5 12:43:40 mail postfix/smtpd[13602]: 2846B11AC5E2: client=mail-bk0-f50.google.com[209.85.214.50], sasl_method=PLAIN, sasl_username=someuser@somedomain.com
Jul  5 12:43:40 mail postfix/smtpd[14113]: 3290811AC5E3: client=mail-bk0-f50.google.com[209.85.214.50], sasl_method=PLAIN, sasl_username=someuser@somedomain.com
Jul  5 12:43:40 mail postfix/smtpd[14115]: 3C4AD11AC5E4: client=mail-bk0-f50.google.com[209.85.214.50], sasl_method=PLAIN, sasl_username=someuser@somedomain.com
Jul  5 12:43:40 mail postfix/cleanup[13420]: 2846B11AC5E2: message-id=
Jul  5 12:43:40 mail postfix/cleanup[14092]: 3290811AC5E3: message-id=
Jul  5 12:43:40 mail postfix/smtpd[14116]: warning: SASL authentication failure: Password verification failed
Jul  5 12:43:40 mail postfix/smtpd[14116]: warning: mail-bk0-f49.google.com[209.85.214.49]: SASL PLAIN authentication failed: authentication failure
Jul  5 12:43:40 mail postfix/cleanup[14121]: 3C4AD11AC5E4: message-id=
Jul  5 12:43:40 mail postfix/qmgr[32242]: 2846B11AC5E2: from=, size=10564, nrcpt=1 (queue active)
Jul  5 12:43:40 mail postfix/qmgr[32242]: 3290811AC5E3: from=, size=10566, nrcpt=1 (queue active)
Jul  5 12:43:40 mail postfix/smtpd[14116]: disconnect from mail-bk0-f49.google.com[209.85.214.49]
Jul  5 12:43:40 mail postfix/qmgr[32242]: 3C4AD11AC5E4: from=, size=10568, nrcpt=1 (queue active)
Jul  5 12:43:40 mail postfix/smtpd[13602]: disconnect from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:40 mail postfix/smtpd[14113]: disconnect from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:40 mail postfix/smtpd[14115]: disconnect from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:40 mail postfix/smtpd[14117]: D4F2411AC5E5: client=mail-bk0-f49.google.com[209.85.214.49], sasl_method=PLAIN, sasl_username=someuser@somedomain.com
Jul  5 12:43:41 mail postfix/cleanup[13420]: D4F2411AC5E5: message-id=
Jul  5 12:43:41 mail postfix/qmgr[32242]: D4F2411AC5E5: from=, size=10565, nrcpt=1 (queue active)
Jul  5 12:43:41 mail postfix/smtpd[14117]: disconnect from mail-bk0-f49.google.com[209.85.214.49]

In looking into the SASL component a bit, I noticed that there were 5 simultaneous SASL servers running. The first one looks like a parent with 4 child processes.

[root@mail postfix]# ps -ef |grep sasl
root      9253     1  0 Mar15 ?        00:00:04 /usr/sbin/saslauthd -m /var/run/saslauthd -a rimap -r -O 127.0.0.1
root      9262  9253  0 Mar15 ?        00:00:04 /usr/sbin/saslauthd -m /var/run/saslauthd -a rimap -r -O 127.0.0.1
root      9263  9253  0 Mar15 ?        00:00:04 /usr/sbin/saslauthd -m /var/run/saslauthd -a rimap -r -O 127.0.0.1
root      9264  9253  0 Mar15 ?        00:00:04 /usr/sbin/saslauthd -m /var/run/saslauthd -a rimap -r -O 127.0.0.1
root      9265  9253  0 Mar15 ?        00:00:04 /usr/sbin/saslauthd -m /var/run/saslauthd -a rimap -r -O 127.0.0.1

So it seemed likely that the 4 child processes were in use and that Postfix couldn’t open a connection to a 5th simultaneous SASL authentication server, so it responded with a generic SMTP AUTH failure.

To fix, I simply added a couple of extra arguments to the saslauthd command that is run. I added a ‘-c’ parameter to enable caching, and ‘-n 10’ to increase the number of servers to 10. On my CentOS server, I accomplished that by modifying /etc/sysconfig/saslauthd to look like this:

# Directory in which to place saslauthd's listening socket, pid file, and so
# on.  This directory must already exist.
SOCKETDIR=/var/run/saslauthd

# Mechanism to use when checking passwords.  Run "saslauthd -v" to get a list
# of which mechanism your installation was compiled with the ablity to use.
MECH=rimap

# Additional flags to pass to saslauthd on the command line.  See saslauthd(8)
# for the list of accepted flags.
FLAGS="-r -O 127.0.0.1 -c -n 10"

After restarting saslauthd, and a quick test, it looks good so far.

Another Quick Script to Generate PHP Barcodes

You’ll need the PEAR Image_Barcode library for this, which can be installed with the command

#pear install Image_Barcode

Once that is installed, the PHP Code is quite simple. Just use the following:

<?php

$number = isset($_GET['number']) ? $_GET['number'] : '';

require_once 'Image/Barcode.php';
header('Content-type: image/png');
Image_Barcode::draw($number, 'code128', 'png');

?>

MySQL Error: Client requested master to start replication from impossible position

I’ve had this error occur when something went terribly wrong on a master server and it had to be rebooted. It appears that the master’s binary log file was not completely written to the server before it was rebooted. However part of the master log file was transmitted to the slave before that time. The error is coming from the slave server when it tried to reconnect to the master. It is trying to connect to the master server and start copying the binary log file from a position which the master does not have.

The proper way to fix this error is to completely resync the data from the master and begin replication over again. However, in many cases, that is impractical, or not worth the hassle, so I was able to fix the problem by setting the slave back to a valid position on the master and then skipping forward over the missing entries.

First, you need to find out what the last valid position is on the master. Run ‘SHOW SLAVE STATUS’ on the slave to figure out which master log file it was reading when the master crashed. You’ll need the value of the Relay_Master_Log_File parameter. Also take note of the Exec_Master_Log_Pos value.

Next, go to the master, and ensure that log file exists in the MySQL data directory. The file size will probably be less than the Exec_Master_Log_Pos value that the slave had stored. To find the last valid position in the log file use the mysqlbinlog utility. The command

mysqlbinlog [LOGFILE] |tail -n 100

should show the final entries in the binary log. Find the last line in the file that looks similar to this:

#120512 13:30:44 server id 3  end_log_pos 8796393       Query   thread_id=138884124     exec_time=0     error_code=0

The number after end_log_pos is the last valid entry that the master has available. The difference between the master log position that the slave had, and this highest one on the master will give you some idea how far ahead the slave got before the master crashed.

You can now go back to the slave and issue the commands:

STOP SLAVE;
CHANGE MASTER TO MASTER_LOG_POS=8796393;
START SLAVE;

This will tell the slave to try to start from that valid position, and continue from there.

There is a strong possibility that you’ll run into some duplicate key errors if the slave was very far ahead of where the master’s log ended. In that case, you can issue this command to bypass those one at a time (or more if you want to skip more than one)

STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;

!! Note that if you have to do this, then your data between the master and slave is almost certainly inconsistent !! Be sure to understand your data to make sure that this is something you can live with.

Upgrading Disks in a Software Raid System

I have a system at home with a pair of 640 GB drives, that are getting full. The drives are configured in a Linux Software RAID1 array. Replacing those drives with a pair of 2 TB drives is a pretty easy (although lengthy process).

First, we need to verify that our RAID system is running correctly. Take a look at /proc/mdstat to verify that both drives are online:

root@host:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdb3[1] sda3[0]
      622339200 blocks [2/2] [UU]

md0 : active raid1 sdb1[1] sda1[0]
      192640 blocks [2/2] [UU]

After verifying that the system is running, I powered the machine off and removed the second hard drive and replaced it with the new drive. Upon starting it back up, Ubuntu noticed that the RAID system was running in degraded mode and I had to hit yes at the console to have it continue booting.

Once the machine was running, I logged in and created a similar partition structure on the new drive using the fdisk command. On my system, I have a small 200 partition for /boot as /dev/sdb1, a 1 GB swap partition, and then the rest of the drive is one big block for the root partition. The I copied the partition table that was on /dev/sda, but for the final partition, made it take up the entire rest of the drive. Make sure to set the first partition as bootable. The partitions on /dev/sdb now look like this:

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          24      192748+  fd  Linux raid autodetect
/dev/sdb2              25         146      979965   82  Linux swap / Solaris
/dev/sdb3             147      243201  1952339287+  fd  Linux raid autodetect

After the partitions match up, I can now add the partitions on /dev/sdb into the RAID array

root@host:~# mdadm --manage /dev/md0 --add /dev/sdb1
root@host:~# mdadm --manage /dev/md1 --add /dev/sdb3

And watch the status of the rebuild using

watch cat /proc/mdstat

After that was done, I installed grub onto the new /dev/sdb

grub /dev/sdb
root (hd1)
setup (hd1,0)

Finally, reboot once and make sure that everything works as expected.

The system is now running with one old drive and one new drive. The next step is to perform the same steps with removing the other old drive and rebuilding and re-adding it to the raid system. The steps are the same as above, except performing them with /dev/sda. I also had to change my BIOS to boot from the second drive.

Once both drives are installed and working with the RAID array, the final part of the process is to increase the size of the file system to the full size of our new drives. I first had to disable the EXT3 file system journal. This was necessary so that the online resize doesn’t run out of memory.

Edit /etc/fstab and change the file system type to ext2, and the arguments to include “noload” which will disable the file system journal. My /etc/fstab looks like this:

# /etc/fstab: static file system information.
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
proc            /proc           proc    defaults        0       0
# /dev/md1
UUID=b165f4be-a705-4479-b830-b0c6ee77b730 /               ext2    noload,relatime,errors=remount-ro 0       1
# /dev/md0
UUID=30430fe0-919d-4ea6-b3c2-5e3564344917 /boot           ext3    relatime        0       2
# /dev/sda5
UUID=94b03944-d215-4882-b0e0-dab3a8c50480 none            swap    sw              0       0
# /dev/sdb5
UUID=ebb381ae-e1bb-4918-94ec-c26e388bb539 none            swap    sw              0       0

You then have to run `mkinitramfs` to rebuild the initramfs file to include the desired change to the mount options

root@host:~# mkinitramfs -o /boot/initrd.img-2.6.31-20-generic 2.6.31-20-generic

Reboot to have the system running with journaling disabled on the root partition. Finally, you can actually increase the RAID device to the full size of the device:

root@host:~# mdadm --grow /dev/md3 --size=max

And then resize the file system:

root@host:~# resize2fs /dev/md1
resize2fs 1.41.9 (22-Aug-2009)
Filesystem at /dev/md1 is mounted on /; on-line resizing required
old desc_blocks = 38, new_desc_blocks = 117
Performing an on-line resize of /dev/md1 to 488084800 (4k) blocks.

The system is now up and running with larger drives :). Change the /etc/fstab, rebuild the initramfs, and reboot one more time to re-enable journaling and be running officially.

APISigning Now Works with Amazon Simple Email Service (SES)

APISigning.com has been signing Amazon Product Advertising requests for a couple of years now. Amazon recently announced their Simple Email Service that makes it easy to send emails via an API. The SES API requires that requests be authenticated using some cryptographic functions that are not easily available on all platforms or programming languages. In those cases, developers can use the APISigning SES Service to calculate the correct signature and perform the request on their behalf.

APISigning has free accounts that effectively allows 10k signing requests each month. Users who require additional requests can subscribe to a paid account with higher limits.

How to shrink a partition with unmovable files in Windows 7

For some reason Windows can’t figure out how to move some files around on disk. When attempting to shrink a volume, it will only allow you to shrink it to where the last immovable file is located. These are some things I found necessary to temporarily get rid of those immovable files so that an NTFS volume could be shrunk

1- Disable System Restore (Right click on Computer => Properties => System Protection => System Restore)
2- Disable Virtual Memory (Right click on Comptuer => Properties => Advanced System Settings => Performance (Settings) => Advanced Tab => Virtual Memory (Change) => No Paging File => Set.
3- Run Disk Cleanup to get rid of Thumbnails, Temporary Internet Files, and a bunch of other files that it makes no sense why they are immovable.
4- Restart the computer to have #1 and #2 take effect

Try to shrink the volume again. If it still is unreasonably large, you will then have to look at Event Viewer to find which file is at the boundary.
1- Right click on Computer => Manage => Event Viewer => Windows Logs => Application.
2- Click on Filter Current Log, and put ‘259’ for the Event ID
3- Click on the latest event and look through the detail to find the problematic file. You can then attempt to delete that file (or set of files) manually. You may have to restart into safe mode to delete some files

It took me about 5 loops of doing the above before I was finally able to shrink my volume to the size that I wanted. After successful, you can then re-enable the features that you want (namely System Restore and Virtual Memory)

Troubleshooting OpenVZ Out of Memory and Could Not Allocate Memory Errors

I will sometimes run into an error inside an OpenVZ guest where the system complains about running out of memory. This is usually seen as a message containing “Cannot allocate memory”. It may even occur when trying to enter into the container

[root@host ~]# vzctl enter 908
entered into CT 908
-bash: /root/.bash_profile: Cannot allocate memory

This particular error can be difficult to troubleshoot because OpenVZ has so many different limits. The “Cannot allocate Memory” error is a pretty generic error, so Googling it doesn’t necessarily yield any useful results.

I’ve found that the best way to track down the source of the problem is to examine the /proc/user_beancounter statistics on the host server. The far right column is labeled ‘failcnt’ and is a simple counter for the number of times that particular resource has been exhausted. Any non-zero numbers in that column may indicate a problem. In this example, you can see that the dcachesize parameter is too small and should be increased

[root@host ~]# cat /proc/user_beancounters
Version: 2.5
       uid  resource                     held              maxheld              barrier                limit              failcnt
      908:  kmemsize                  5170890              5776435             43118100             44370492                    0
            lockedpages                     0                    0                  256                  256                    0
            privvmpages                 53561                55089              1048576              1048576                    0
            shmpages                        2                    2                21504                21504                    0
            dummy                           0                    0                    0                    0                    0
            numproc                        30                   38                 2000                 2000                    0
            physpages                    6136                 7392                    0  9223372036854775807                    0
            vmguarpages                     0                    0                65536                65536                    0
            oomguarpages                 6136                 7392                26112  9223372036854775807                    0
            numtcpsock                      8                    8                  360                  360                    0
            numflock                        5                    6                  380                  420                    0
            numpty                          0                    1                   16                   16                    0
            numsiginfo                      0                    2                  256                  256                    0
            tcpsndbuf                  140032                    0             10321920             16220160                    0
            tcprcvbuf                  131072                    0              1720320              2703360                    0
            othersockbuf               151320               156864              4504320             16777216                    0
            dgramrcvbuf                     0                 8472               262144               262144                    0
            numothersock                  107                  109                 5000                 5000                    0
            dcachesize                3408570              3413265              3409920              3624960                   81
            numfile                       724                  957                18624                18624                    0
            dummy                           0                    0                    0                    0                    0
            dummy                           0                    0                    0                    0                    0
            dummy                           0                    0                    0                    0                    0
            numiptent                      10                   10                  128                  128                    0

You can do a little investigation to see what each particular limit does. I will usually not dig too deep, but just double the value and try again. You can set a new value using the vzctl command like this:

[root@host ~] vzctl set 908 --dcachesize=6819840:7249920 --save

Website Performance: Tables Versus CSS

Most website designers have been using CSS for page layout for several years now, but I occasionally see some websites that continue to use HTML tables for layout. As I’ve been focusing on website performance lately, I’ve found some references that modern browsers render sites using tables for layout slower than they do sites that use CSS. I decided to investigate and confirmed that there are many possible situations where sites using large tables will appear to load much slower than those using CSS. I put together two pages to confirm:

This page uses <div> elements for layout
and
This pages uses a large table for layout

On both pages I’ve added a 5-second sleep near the end of the page to show what might happen if the server was slow, if there were network problems, or any other number of things may have happened.

Notice that the page created using a table changes a lot after the delay. I’ve tried it in Firefox 3 which extends the main (yellow) content section all of the way to the right until it receives the rest of the document, at which point it has to shrink that part to make room for the section on the right. Internet Explorer behaves even worse. It leaves a blank white page until after the delay, at which point it draws the whole table.

By contrast, the page created with CSS positioning shows all of the content above the delay and has it in the correct position. When the rest of the document is sent it just fills in the appropriate content, but doesn’t have to re-arrange anything on the page.

Southeast Linux Fest Presentation on MySQL Replication

I was fortunate to be selected to give a presentation at the 2010 Southeast Linux Fest held this year in Greenville, SC. The topic was MySQL replication which I picked from a similar presentation I gave about about 1.5 years ago at my local LUG. I’ve configured plenty of replicated servers and I think that I understand it well enough to explain it to others.

The 2-hour presentation is about half slides and half demo. Throughout the course of the presentation I set up a simple master-slave. Then I add a second slave. Taking it a step farther I set up the three servers to replicate in a chain, and finally I configure them to replicate in a full circle so that changes made on one are propagated to all of the others. I intentionally do things that break replication at certain points to show some of the limitations and configurable features that can help it to work.

Slides for the presentation are available OpenOffice format.

The presentation was recorded, so hopefully the SELF team will have those videos available shortly.