Monit CPU Usage problem

Posted on April 3rd, 2013 in General by Brandon

I recently fixed an issue where I wanted my Monit monitoring process to restart a daemon that was segfaulting and causing 100% CPU usage according to top and most other system tools. I had seen configuration examples where Monit could detect high CPU usage and restart the process, so I figured that adding a configuration like the one below would fix it easily enough:

check process foo with pidfile /var/run/foo.pid
  start program = "/etc/init.d/foo start" timeout 10 seconds
  stop program  = "/etc/init.d/foo stop"
  if cpu usage > 90% for 8 cycles then restart

After letting that run for a number of cycles, the process remained running, and Monit didn't acknowledge the problem at all, even in its log files. (FYI, a "cycle" is defined in the monitrc config file on the "set daemon" line and defaults to 120 seconds.)
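
With the default cycle length, a rule like that takes a while to fire. The arithmetic, using the defaults mentioned above:

```shell
# How long Monit waits before restarting: 8 cycles at 120 seconds each.
cycles=8
cycle_seconds=120
echo "$(( cycles * cycle_seconds / 60 )) minutes"    # prints "16 minutes"
```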

After some research, I finally came upon a post on the Monit mailing list where somebody explains that the CPU usage Monit bases its numbers on is a percentage of the CPU available across all processors. My machine had 4 processors, so what top reported as 100% CPU usage, Monit would only see as 25%.

I quickly changed my Monit config to check for CPU usage > 22%, as seen in the following. That now works perfectly, even logging each of the 8 times that the CPU was over the limit before restarting the process:

check process foo with pidfile /var/run/foo.pid
  start program = "/etc/init.d/foo start" timeout 10 seconds
  stop program  = "/etc/init.d/foo stop"
  if cpu usage > 22% for 8 cycles then restart
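
Since Monit divides usage across all processors, the right threshold depends on the machine's core count. A minimal sketch of deriving it (the 90%-of-one-core target is from the original rule above):

```shell
# A process pinning one core at 100% appears to Monit as (100 / cores) percent,
# so scale the intended single-core threshold down by the core count.
cores=$(nproc)                  # 4 on the machine in this post
threshold=$(( 90 / cores ))
echo "if cpu usage > ${threshold}% for 8 cycles then restart"
```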

… Now I need to solve the real problem and see why the latest Mongo PHP PECL module is segfaulting…

Great Guide on Transactional Emails

Posted on January 15th, 2013 in General by Brandon

I don’t usually post blind links to 3rd party resources, but this guide written by MailChimp is an excellent resource for creating transactional emails:

http://mailchimp.com/resources/guides/html/transactional-email/

Quick PHP Script to find values that add to a specified total

Posted on December 14th, 2012 in General by Brandon

I’ve had a need for this on several occasions and never came up with a decent solution until now. The problem comes up for me most frequently when dealing with a group of transactions, where I need to find a subset of them that adds up to a specific amount.

My data might look like this:

+----+--------+
| id | total  |
+----+--------+
|  1 |    1.6 |
|  2 |    0.8 |
|  3 |  16.25 |
.........
| 50 |      5 |
| 51 |    2.5 |
| 52 |     29 |
| 53 |    3.5 |
+----+--------+

And I need to find some combination of those that add up to a certain amount. I put together this quick recursive script that comes up with a solution:

$available = array(
  1 => 1.6,
  2 => 0.8,
  3 => 16.25,
  .....
  50 => 5,
  51 => 2.5,
  52 => 29,
  53 => 3.5,
);

function findCombination($count_needed, $amount_needed, $remaining)
{
    // No transactions left to pick, so this branch is a dead end.
    if ($count_needed < 1) return false;

    foreach ($remaining as $id => $this_amount) {
        // Compare with a small epsilon; the totals are floats, and a
        // straight == comparison can fail on rounding errors.
        if ($count_needed == 1 && abs($this_amount - $amount_needed) < 0.001) {
            echo "Found: {$id} for {$amount_needed}\n";
            return true;
        } else {
            // Exclude this transaction and recurse for the rest of the total.
            unset($remaining[$id]);
            if (findCombination($count_needed - 1, $amount_needed - $this_amount, $remaining)) {
                echo "Found: {$id} for {$this_amount}\n";
                return true;
            }
        }
    }
    return false;
}

$count_needed = 9;
$amount_needed = 418;
echo "Looking for {$count_needed} transactions totaling {$amount_needed}\n";
findCombination($count_needed, $amount_needed, $available);

This will output something like:

Looking for 9 transactions totaling 418
Found: 38 for 59
Found: 37 for 45.5
Found: 36 for 34.75
Found: 33 for 44.75
Found: 31 for 57.75
Found: 30 for 48
Found: 26 for 23.5
Found: 22 for 2.5
Found: 20 for 102.25

Increasing the number of simultaneous SASL authentication servers with Postfix

Posted on July 5th, 2012 in General,Linux System Administration,Mail by Brandon

I had a customer complaining lately that messages sent via Gmail to one of my mail servers were occasionally receiving SMTP authentication failures and bounce-backs from Gmail. Fortunately, he noticed it happening mainly when he sent a message to multiple recipients, and he was able to send me some of the bounces, which let me track it down pretty specifically in the postfix logs.

The Error message via Gmail was:

Technical details of permanent failure:
Google tried to deliver your message, but it was rejected by the recipient domain. We recommend contacting the other email provider for further information about the cause of this error. The error that the other server returned was: 535 535 5.7.0 Error: authentication failed: authentication failure (SMTP AUTH failed with the remote server) (state 7).

This was a little odd, because an SMTP AUTH failure is what I would typically expect with a mistyped username and password. However, I could see that plenty of messages were being sent successfully from the same client. By looking at the specific timestamp of the bounced message, I tracked down the relevant log segment shown below. It shows 5 concurrent smtpd sessions, where the SASL authentication succeeded on 4 of them and failed on the 5th:

Jul  5 12:43:39 mail postfix/smtpd[13602]: connect from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:39 mail postfix/smtpd[13602]: setting up TLS connection from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:39 mail postfix/smtpd[14113]: connect from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:39 mail postfix/smtpd[14113]: setting up TLS connection from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:39 mail postfix/smtpd[14115]: connect from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:39 mail postfix/smtpd[14115]: setting up TLS connection from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:39 mail postfix/smtpd[14116]: connect from mail-bk0-f49.google.com[209.85.214.49]
Jul  5 12:43:39 mail postfix/smtpd[14117]: connect from mail-bk0-f49.google.com[209.85.214.49]
Jul  5 12:43:39 mail postfix/smtpd[14116]: setting up TLS connection from mail-bk0-f49.google.com[209.85.214.49]
Jul  5 12:43:39 mail postfix/smtpd[14117]: setting up TLS connection from mail-bk0-f49.google.com[209.85.214.49]
Jul  5 12:43:39 mail postfix/smtpd[13602]: TLS connection established from mail-bk0-f50.google.com[209.85.214.50]: TLSv1 with cipher RC4-SHA (128/128 bits)
Jul  5 12:43:39 mail postfix/smtpd[14113]: TLS connection established from mail-bk0-f50.google.com[209.85.214.50]: TLSv1 with cipher RC4-SHA (128/128 bits)
Jul  5 12:43:39 mail postfix/smtpd[14115]: TLS connection established from mail-bk0-f50.google.com[209.85.214.50]: TLSv1 with cipher RC4-SHA (128/128 bits)
Jul  5 12:43:39 mail postfix/smtpd[14116]: TLS connection established from mail-bk0-f49.google.com[209.85.214.49]: TLSv1 with cipher RC4-SHA (128/128 bits)
Jul  5 12:43:39 mail postfix/smtpd[14117]: TLS connection established from mail-bk0-f49.google.com[209.85.214.49]: TLSv1 with cipher RC4-SHA (128/128 bits)
Jul  5 12:43:40 mail postfix/smtpd[13602]: 2846B11AC5E2: client=mail-bk0-f50.google.com[209.85.214.50], sasl_method=PLAIN, sasl_username=someuser@somedomain.com
Jul  5 12:43:40 mail postfix/smtpd[14113]: 3290811AC5E3: client=mail-bk0-f50.google.com[209.85.214.50], sasl_method=PLAIN, sasl_username=someuser@somedomain.com
Jul  5 12:43:40 mail postfix/smtpd[14115]: 3C4AD11AC5E4: client=mail-bk0-f50.google.com[209.85.214.50], sasl_method=PLAIN, sasl_username=someuser@somedomain.com
Jul  5 12:43:40 mail postfix/cleanup[13420]: 2846B11AC5E2: message-id=
Jul  5 12:43:40 mail postfix/cleanup[14092]: 3290811AC5E3: message-id=
Jul  5 12:43:40 mail postfix/smtpd[14116]: warning: SASL authentication failure: Password verification failed
Jul  5 12:43:40 mail postfix/smtpd[14116]: warning: mail-bk0-f49.google.com[209.85.214.49]: SASL PLAIN authentication failed: authentication failure
Jul  5 12:43:40 mail postfix/cleanup[14121]: 3C4AD11AC5E4: message-id=
Jul  5 12:43:40 mail postfix/qmgr[32242]: 2846B11AC5E2: from=, size=10564, nrcpt=1 (queue active)
Jul  5 12:43:40 mail postfix/qmgr[32242]: 3290811AC5E3: from=, size=10566, nrcpt=1 (queue active)
Jul  5 12:43:40 mail postfix/smtpd[14116]: disconnect from mail-bk0-f49.google.com[209.85.214.49]
Jul  5 12:43:40 mail postfix/qmgr[32242]: 3C4AD11AC5E4: from=, size=10568, nrcpt=1 (queue active)
Jul  5 12:43:40 mail postfix/smtpd[13602]: disconnect from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:40 mail postfix/smtpd[14113]: disconnect from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:40 mail postfix/smtpd[14115]: disconnect from mail-bk0-f50.google.com[209.85.214.50]
Jul  5 12:43:40 mail postfix/smtpd[14117]: D4F2411AC5E5: client=mail-bk0-f49.google.com[209.85.214.49], sasl_method=PLAIN, sasl_username=someuser@somedomain.com
Jul  5 12:43:41 mail postfix/cleanup[13420]: D4F2411AC5E5: message-id=
Jul  5 12:43:41 mail postfix/qmgr[32242]: D4F2411AC5E5: from=, size=10565, nrcpt=1 (queue active)
Jul  5 12:43:41 mail postfix/smtpd[14117]: disconnect from mail-bk0-f49.google.com[209.85.214.49]

In looking into the SASL component a bit, I noticed that there were 5 saslauthd processes running: the first looks like a parent with 4 child processes.

[root@mail postfix]# ps -ef |grep sasl
root      9253     1  0 Mar15 ?        00:00:04 /usr/sbin/saslauthd -m /var/run/saslauthd -a rimap -r -O 127.0.0.1
root      9262  9253  0 Mar15 ?        00:00:04 /usr/sbin/saslauthd -m /var/run/saslauthd -a rimap -r -O 127.0.0.1
root      9263  9253  0 Mar15 ?        00:00:04 /usr/sbin/saslauthd -m /var/run/saslauthd -a rimap -r -O 127.0.0.1
root      9264  9253  0 Mar15 ?        00:00:04 /usr/sbin/saslauthd -m /var/run/saslauthd -a rimap -r -O 127.0.0.1
root      9265  9253  0 Mar15 ?        00:00:04 /usr/sbin/saslauthd -m /var/run/saslauthd -a rimap -r -O 127.0.0.1

So it seemed likely that the 4 child processes were all in use and that Postfix couldn’t open a connection to a 5th simultaneous SASL authentication server, so it responded with a generic SMTP AUTH failure.

To fix it, I simply added a couple of extra arguments to the saslauthd command line: a ‘-c’ parameter to enable caching, and ‘-n 10’ to increase the number of servers to 10. On my CentOS server, I accomplished that by modifying /etc/sysconfig/saslauthd to look like this:

# Directory in which to place saslauthd's listening socket, pid file, and so
# on.  This directory must already exist.
SOCKETDIR=/var/run/saslauthd

# Mechanism to use when checking passwords.  Run "saslauthd -v" to get a list
# of which mechanism your installation was compiled with the ability to use.
MECH=rimap

# Additional flags to pass to saslauthd on the command line.  See saslauthd(8)
# for the list of accepted flags.
FLAGS="-r -O 127.0.0.1 -c -n 10"

After restarting saslauthd and a quick test, it looks good so far.
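
The change can also be sanity-checked from the process list. This is just a sketch: the sample line below stands in for real `ps -ef` output on the mail server.

```shell
# Inspect a ps-style line for the new saslauthd flags. On the real host,
# pipe `ps -ef` in place of the sample.
sample='root  9262  9253  0 Mar15 ?  00:00:04 /usr/sbin/saslauthd -m /var/run/saslauthd -a rimap -r -O 127.0.0.1 -c -n 10'
echo "$sample" | grep -c 'saslauthd'    # counts matching lines (processes)
echo "$sample" | grep -o -- '-n 10'     # confirms the new server count took effect
```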

Another Quick Script to Generate PHP Barcodes

Posted on June 15th, 2012 in General by Brandon

You’ll need the PEAR Image_Barcode library for this, which can be installed with the command:

# pear install Image_Barcode

Once that is installed, the PHP Code is quite simple. Just use the following:

<?php

// The barcode value comes in on the query string, e.g. ?number=12345
$number = isset($_GET['number']) ? $_GET['number'] : '';

require_once 'Image/Barcode.php';

// Emit the barcode directly to the browser as a Code 128 PNG.
header('Content-type: image/png');
Image_Barcode::draw($number, 'code128', 'png');

?>

MySQL Error: Client requested master to start replication from impossible position

Posted on May 12th, 2012 in General by Brandon

I’ve had this error occur when something went terribly wrong on a master server and it had to be rebooted. It appears that the master’s binary log was not completely written to disk before the reboot, although part of it had already been transmitted to the slave. The error comes from the slave server when it tries to reconnect to the master: it is asking the master to start sending the binary log from a position which the master does not have.

The proper way to fix this error is to completely resync the data from the master and begin replication over again. However, in many cases, that is impractical, or not worth the hassle, so I was able to fix the problem by setting the slave back to a valid position on the master and then skipping forward over the missing entries.

First, you need to find out what the last valid position is on the master. Run ‘SHOW SLAVE STATUS’ on the slave to figure out which master log file it was reading when the master crashed. You’ll need the value of the Relay_Master_Log_File parameter. Also take note of the Exec_Master_Log_Pos value.

Next, go to the master and ensure that the log file exists in the MySQL data directory. Its file size will probably be less than the Exec_Master_Log_Pos value that the slave had stored. To find the last valid position in the log file, use the mysqlbinlog utility. The command

mysqlbinlog [LOGFILE] |tail -n 100

should show the final entries in the binary log. Find the last line in the file that looks similar to this:

#120512 13:30:44 server id 3  end_log_pos 8796393       Query   thread_id=138884124     exec_time=0     error_code=0

The number after end_log_pos is the last valid entry available on the master. The difference between the master log position stored on the slave and this highest position on the master gives you some idea of how far ahead the slave got before the master crashed.
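
Scrolling for that line by hand can be scripted away. Here is a small sketch of a helper that pulls out the last end_log_pos; the printf lines stand in for real `mysqlbinlog [LOGFILE]` output:

```shell
# Print the final end_log_pos value seen on stdin.
last_log_pos() { grep -o 'end_log_pos [0-9]*' | awk '{pos = $2} END {print pos}'; }

# Sample input standing in for `mysqlbinlog [LOGFILE]`:
printf '#120512 13:30:43 server id 3  end_log_pos 8796300\n#120512 13:30:44 server id 3  end_log_pos 8796393\n' \
  | last_log_pos    # prints 8796393
```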

You can now go back to the slave and issue the commands:

STOP SLAVE;
CHANGE MASTER TO MASTER_LOG_POS=8796393;
START SLAVE;

This will tell the slave to try to start from that valid position, and continue from there.

There is a strong possibility that you’ll run into some duplicate key errors if the slave got very far ahead of where the master’s log ended. In that case, you can issue this command to bypass them one at a time (set the counter higher if you want to skip more than one):

STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;

!! Note that if you have to do this, then your data between the master and slave is almost certainly inconsistent !! Be sure to understand your data to make sure that this is something you can live with.

Fix For Amazon API Error: Your request is missing required parameters. Required parameters include AssociateTag.

Posted on November 3rd, 2011 in Amazon APIs,PHP by Brandon

Users of Amazon’s Product Advertising API will begin seeing the following error as of about November 1st:

Your request is missing required parameters. Required parameters include AssociateTag.

This is due to changed requirements in the Product Advertising API: the AssociateTag parameter is now a required and validated field. You can see in the list of changes that it is the first thing mentioned.

These changes are made to keep the Product Advertising API in line with its purpose of driving affiliate sales through the Amazon.com website. The AssociateTag is a tag generated from your Amazon Associates account.

So far, it looks like the tag is not entirely validated against your account. I’ve had luck using fictitious tags such as aztag-20.
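
For illustration, this is roughly where the parameter lands in a request query string. Everything here is a placeholder, and a real request must also carry signed Timestamp/Signature parameters:

```shell
# Assemble a Product Advertising API query string with the now-required
# AssociateTag parameter. All values are placeholders.
associate_tag="aztag-20"
url="https://webservices.amazon.com/onca/xml?Service=AWSECommerceService&Operation=ItemLookup&ItemId=B000EXAMPLE&AWSAccessKeyId=YOUR_KEY&AssociateTag=${associate_tag}"
echo "$url"
```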

Upgrading Disks in a Software Raid System

Posted on August 28th, 2011 in General,Linux System Administration by Brandon

I have a system at home with a pair of 640 GB drives that are getting full. The drives are configured in a Linux software RAID1 array. Replacing them with a pair of 2 TB drives is a pretty easy (although lengthy) process.

First, we need to verify that our RAID system is running correctly. Take a look at /proc/mdstat to verify that both drives are online:

root@host:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdb3[1] sda3[0]
      622339200 blocks [2/2] [UU]

md0 : active raid1 sdb1[1] sda1[0]
      192640 blocks [2/2] [UU]
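
That [UU] check can be scripted as a quick sketch (assuming the mdstat format shown above, where a missing mirror shows up as [U_] or [_U]):

```shell
# Report whether any array in an mdstat-style file is missing a mirror.
degraded() { grep -Eq '\[(U_|_U)\]' "$1" && echo degraded || echo healthy; }

# Example against a sample file standing in for /proc/mdstat:
printf 'md1 : active raid1 sdb3[1] sda3[0]\n      622339200 blocks [2/2] [UU]\n' > /tmp/mdstat.sample
degraded /tmp/mdstat.sample    # prints "healthy"
```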

After verifying that the system is running, I powered the machine off and removed the second hard drive and replaced it with the new drive. Upon starting it back up, Ubuntu noticed that the RAID system was running in degraded mode and I had to hit yes at the console to have it continue booting.

Once the machine was running, I logged in and created a similar partition structure on the new drive using the fdisk command. On my system, I have a small 200 MB partition for /boot as /dev/sdb1, a 1 GB swap partition, and then the rest of the drive as one big block for the root partition. I copied the partition table that was on /dev/sda, but made the final partition take up the entire rest of the drive. Make sure to set the first partition as bootable. The partitions on /dev/sdb now look like this:

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          24      192748+  fd  Linux raid autodetect
/dev/sdb2              25         146      979965   82  Linux swap / Solaris
/dev/sdb3             147      243201  1952339287+  fd  Linux raid autodetect

After the partitions match up, I can add the partitions on /dev/sdb into the RAID array:

root@host:~# mdadm --manage /dev/md0 --add /dev/sdb1
root@host:~# mdadm --manage /dev/md1 --add /dev/sdb3

And watch the status of the rebuild using

watch cat /proc/mdstat

After that was done, I installed GRUB onto the new drive. In the GRUB shell, root selects the partition that holds /boot, and setup installs the boot loader to the drive’s MBR:

grub
root (hd1,0)
setup (hd1)
quit

Finally, reboot once and make sure that everything works as expected.

The system is now running with one old drive and one new drive. The next step is to repeat the same process with the other old drive: remove it, replace it, and re-add the new drive to the RAID array. The steps are the same as above, except performed on /dev/sda. I also had to change my BIOS to boot from the second drive.

Once both drives are installed and working in the RAID array, the final part of the process is to increase the size of the file system to the full size of the new drives. I first had to disable the ext3 file system journal; this was necessary so that the online resize doesn’t run out of memory.

Edit /etc/fstab, change the file system type to ext2, and add “noload” to the mount options, which disables the file system journal. My /etc/fstab looks like this:

# /etc/fstab: static file system information.
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
proc            /proc           proc    defaults        0       0
# /dev/md1
UUID=b165f4be-a705-4479-b830-b0c6ee77b730 /               ext2    noload,relatime,errors=remount-ro 0       1
# /dev/md0
UUID=30430fe0-919d-4ea6-b3c2-5e3564344917 /boot           ext3    relatime        0       2
# /dev/sda5
UUID=94b03944-d215-4882-b0e0-dab3a8c50480 none            swap    sw              0       0
# /dev/sdb5
UUID=ebb381ae-e1bb-4918-94ec-c26e388bb539 none            swap    sw              0       0

You then have to run `mkinitramfs` to rebuild the initramfs file so that it includes the changed mount options:

root@host:~# mkinitramfs -o /boot/initrd.img-2.6.31-20-generic 2.6.31-20-generic

Reboot to bring the system up with journaling disabled on the root partition. Finally, you can actually grow the RAID device to the full size of the underlying partitions:

root@host:~# mdadm --grow /dev/md1 --size=max

And then resize the file system:

root@host:~# resize2fs /dev/md1
resize2fs 1.41.9 (22-Aug-2009)
Filesystem at /dev/md1 is mounted on /; on-line resizing required
old desc_blocks = 38, new_desc_blocks = 117
Performing an on-line resize of /dev/md1 to 488084800 (4k) blocks.
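
As a quick sanity check on that output, 488084800 blocks of 4 KiB should land right around the advertised 2 TB drive size:

```shell
# 488084800 blocks * 4096 bytes per block ≈ 2 TB, matching the new drives.
blocks=488084800
echo $(( blocks * 4096 ))    # prints 1999195340800, i.e. roughly 2.0 TB
```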

The system is now up and running with larger drives :). Change /etc/fstab back to ext3, rebuild the initramfs, and reboot one more time to re-enable journaling and make it official.

Quick Fix for Dovecot’s “save failed to INBOX: Timeout while waiting for lock”

Posted on July 13th, 2011 in Mail by Brandon

I had a mail server lock up and die on me tonight. The data center replaced some hardware to get it to power back up. Everything started up fine, but I was unable to retrieve my own email: my mail client was complaining about timeouts on a single account. In the Dovecot mail logs, I found lines similar to these:

Jul 13 04:19:09 secure deliver(user@domain.com): msgid=<20110713105301.E4C2C3293DC@myserver.com>: save failed to INBOX: Timeout while waiting for lock
Jul 13 04:19:09 secure deliver(user@domain.com): msgid=<20110713105702.817483293E5@myserver.com>: save failed to INBOX: Timeout while waiting for lock
Jul 13 04:19:09 secure deliver(user@domain.com): msgid=<20110713105802.3F68E3293E7@myserver.com>: save failed to INBOX: Timeout while waiting for lock
Jul 13 04:19:09 secure deliver(user@domain.com): msgid=<20110713105402.B5DBA3293DE@myserver.com>: save failed to INBOX: Timeout while waiting for lock
Jul 13 04:19:09 secure deliver(user@domain.com): msgid=<20110713102202.6045E3293A6@myserver.com>: save failed to INBOX: Timeout while waiting for lock

Googling the error message didn’t turn up any quick results, so I attempted a more brute-force method. Inside the user’s maildir, I just moved Dovecot’s index and lock files out of the way (make sure the destination directory exists first):

[root@host /vmail/domain/user/]# mv dovecot-uidlist.lock dovecot.index dovecot.index.cache dovecot.index.log /tmp/badfiles

After moving those files out of the way, I restarted dovecot with

/etc/init.d/dovecot restart

and my client was able to connect without problems.

Map of Amazon Fulfillment Centers

Posted on July 13th, 2011 in General by Brandon

I was doing some research into Amazon’s fulfillment centers today and had trouble finding a list of them. After finding their addresses, I put them all on a Google Map:

Amazon Fulfillment Centers


