Web Programming, Linux System Administration, and Entrepreneurship in Athens, Georgia

Author: Brandon

Setting Up Software RAID on a Running CentOS 5 Server

Running a system off of one hard drive is just asking for trouble. Hard drives are one of the most likely components to fail in a system. If your system is running off of a single drive and that drive fails, the results could be devastating. Even with a good, current backup, if a drive dies, your system will be down until you can figure out how to reload the OS and restore your data. Software RAID is a good solution for ensuring that your system remains available when a drive decides to fail. I’ve personally had more success dealing with failures with software RAID than I have with any hardware RAID products.

There are a number of situations where you might have to migrate a running system to a RAID array. Doing so is relatively risky, but certainly doable. In my specific situation, I have a client with a new server at ServerBeach. The server came with two identical drives, but for some reason ServerBeach won’t install the OS to a software RAID array (probably because they use some pre-built images). The server has the OS installed to the first drive, and the second drive is completely blank.

There are a few howtos for getting this working, from which I took bits and pieces to make it work for CentOS 5.

First, an overview of the process:
1- Boot off your running system, install mdadm, and copy the partition tables to the blank drive
2- Create your RAID devices using just the blank drive (the array will run in degraded mode, since the original drive is still in use by the running OS)
3- Copy your working filesystem to the mirrored drive
4- Configure Grub to boot from the mirrored drive
5- Reboot onto the mirrored drive and ensure everything works
6- Add the original drive into the RAID array to bring the mirror fully up
7- Configure Grub on both drives so that if one fails, the other will boot
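
To make the first two steps concrete, the commands look roughly like this. This is a sketch rather than an exact recipe: it assumes the original drive is /dev/sda, the blank drive is /dev/sdb, and a simple two-partition layout; adjust for your own drives and partitions.

    # Install mdadm from the CentOS repositories
    yum install mdadm

    # Copy the partition table from the running drive to the blank drive
    sfdisk -d /dev/sda | sfdisk /dev/sdb

    # (Set the new partitions' type to 'fd', Linux raid autodetect, with fdisk)

    # Create the arrays in degraded mode, using 'missing' as a placeholder for
    # the original drive, which is still running the live system
    mdadm --create /dev/md0 --level=1 --raid-devices=2 missing /dev/sdb1
    mdadm --create /dev/md1 --level=1 --raid-devices=2 missing /dev/sdb2

    # Record the arrays so they are assembled at boot
    mdadm --detail --scan >> /etc/mdadm.conf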

Falco’s guide does a very good job of walking through the whole process. I followed it, and would recommend it with just a few changes.

1- I don’t see any purpose in having a RAID1 swap partition. Make this RAID0, or just enable two independent swap partitions without RAID.

2- Don’t edit /etc/fstab and /etc/mtab on the live, working system. Edit them on the mirrored drive after the filesystem has been copied over. This leaves the working system functional if you need to fall back to it (and you probably will!)

3- The initrd image created by mkinitrd didn’t work for me, and at this point I’m not sure why. Falco’s guide says to run these commands:

mv /boot/initrd-`uname -r`.img /boot/initrd-`uname -r`.img_orig
mkinitrd /boot/initrd-`uname -r`.img `uname -r`

This makes a backup of the existing initrd image and then builds a new one. I tried quite a few variations of the command, pointing it at the fstab that uses the software RAID array, but to no avail. I had to manually extract, edit, and recreate the initrd image using the steps from #12 on this post.

I don’t have direct access to the console, but the data center relayed the console error that included this:

switchroot: mount failed: No such file or directory
Kernel panic - not syncing: Attempted to kill init

From what I can tell, the init script inside the initrd image tries to run the command ‘mount /sysroot’, and that mount was failing. Without /sysroot mounted, the initrd can’t pass control over to the real system. I was able to replace that line with ‘mount -o defaults --ro -t ext3 /dev/md1 /sysroot’, and then manually cpio/gzip the image back into place. From there I was able to boot off of the mirrored drive and continue as normal.
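
For reference, the extract/edit/repack cycle looks roughly like this. This is a sketch of the general technique rather than the exact commands I ran; it assumes the initrd is a gzip-compressed cpio archive, which is what CentOS 5 uses.

    # Unpack the initrd image into a scratch directory
    mkdir /tmp/initrd
    cd /tmp/initrd
    zcat /boot/initrd-`uname -r`.img | cpio -id

    # Edit the 'init' script, replacing the failing mount line with one that
    # mounts the RAID device explicitly, e.g.:
    #   mount -o defaults --ro -t ext3 /dev/md1 /sysroot
    vi init

    # Repack the directory into a new compressed initrd image
    find . | cpio -o -H newc | gzip -9 > /boot/initrd-`uname -r`.img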

I have another one or two systems to do like this still, so I’m hoping to refine the process a bit and maybe figure out what went wrong with the initrd. It was educational to dig into the initrd image and learn a bit more about how a modern Linux box boots.

Preparing WordPress for a Large Traffic Spike

The Hallmark Hall of Fame movie ‘Front of the Class’ premiered this past weekend to an expected 12-15 million viewers. We had been preparing the website (ClassPerformance.com) for the event: we expected a significant number of visitors in the 24-48 hours after the movie aired, so I did a number of things to ensure that the site could run without incident during this critical time.

  1. Move temporarily to a higher-powered server.

    The site is normally hosted on an inexpensive shared-hosting plan. I’ve run some shared-hosting servers before and don’t have much faith that they would handle any significant load. They also usually don’t allow you to configure some of the Apache settings that I was planning on using below.

  2. Serve images and other static content from an alternate location.

    I set up a domain alias of ‘static.classperformance.com’ pointed at the same DocumentRoot as the main site, then edited the template files to serve most of the background, header, and footer images from that location. For normal usage, serving them from the same server works fine, but this allows the flexibility to move the static content to a separate server if/when it is needed.

    I also copied the entire website to a second server and configured it so that at any time I could change DNS to point ‘static.classperformance.com’ at the second server in order to reduce the bandwidth load on the primary server.

  3. Generate static pages wherever possible.

    I used wget to download everything, then deleted the pages that need to be parsed through PHP (contact forms, etc.). Most of the pages don’t change from visitor to visitor, so this can be done for the home page, all of the blog posts, and most other pages. This significantly reduces the overhead of database queries, as well as the overhead of running PHP and including multiple files.
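
    The wget invocation was something along these lines (a sketch; the ‘rendered’ path matches the Apache configuration below):

        # Mirror the site into a directory of pre-rendered pages
        wget --mirror --page-requisites --no-host-directories \
             --directory-prefix=/home/classperformance.com/www/rendered \
             https://www.classperformance.com/

        # Afterwards, delete anything that must remain dynamic (contact forms, etc.)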

    I then added this to my Apache configuration to tell the web server to use the static content if it exists:

        ## Serve static content for files that exist
        RewriteCond /home/classperformance.com/www/rendered/%{REQUEST_URI} -f
        RewriteRule (.*) /rendered/$1 [L]

        ## For requests without an extension, wget has saved those files as 'index.html',
        ## so the rewrite rule needs to reflect that:
        RewriteCond /home/classperformance.com/www/rendered/%{REQUEST_URI} -d
        RewriteRule (.*) /rendered/$1/index.html [L]
    

    I did some performance tests with ApacheBench, and serving the static content had a dramatic effect on both the speed and the number of concurrent users the server could handle. There is probably a more elegant way to configure mod_cache to do a similar thing in a more automated fashion, but this was quick and easy, and I didn’t have to worry about checking the various HTTP headers. In my opinion, this was the single most effective thing to do. By serving static content, Apache also correctly handles many of the HTTP headers that enable effective caching (ETags, Expires, Last-Modified, etc.).

  4. Install a PHP accelerator.

    I’ve previously written about how easy and effective eAccelerator is to install. There are very few scenarios where it is not effective. Again, ApacheBench tests easily showed a huge increase in the number of concurrent requests once eAccelerator was enabled.

  5. Check Apache settings.

    On a vanilla CentOS install, Apache has ServerLimit set to 256. By serving primarily static content, you will likely reduce the amount of memory that each Apache child requires, leaving room for more children. I did some quick math and figured that I could run around 800 children before memory became a concern. I also enabled KeepAlive with a very short (1 second) KeepAliveTimeout so that sequential requests from the same user don’t have to recreate TCP sessions.
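
    The relevant directives would look something like this (the numbers are illustrative, not recommendations; do your own memory math):

        ## Prefork MPM limits, sized to the memory available per child
        <IfModule prefork.c>
            ServerLimit    800
            MaxClients     800
        </IfModule>

        ## Keep connections open just long enough for sequential requests
        KeepAlive          On
        KeepAliveTimeout   1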

    Also, I found that WordPress had been handling the 301 redirect from the non-www version of the site to the correct URL. Since requests for static files no longer reach WordPress, I moved that redirect into Apache with this directive:

        ## Rewrite to the desired domain name
        RewriteCond %{HTTP_HOST} !^www\.classperformance\.com [NC]
        RewriteCond %{HTTP_HOST} !^static\.classperformance\.com [NC]
        RewriteRule ^/(.*) https://www.classperformance.com/$1 [L,R=301]
    
  6. Enable server-side compression.

    The default Apache install doesn’t compress any content. I configured mod_deflate to compress the static content and reduce the bandwidth usage. Compression should easily cut the bandwidth for HTML and CSS files in half (sometimes to as little as a tenth). That not only reduces your bandwidth bill: since the 100 Mbps switch port is a potential bottleneck, it also allows more concurrent users if traffic approaches that limit (and it may have, had I not enabled compression).
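
    A minimal mod_deflate configuration looks something like this (a sketch; tune the content types to your own site):

        ## Compress the text-based content types
        AddOutputFilterByType DEFLATE text/html text/plain text/css application/x-javascript

        ## Work around older browsers with broken gzip support
        BrowserMatch ^Mozilla/4 gzip-only-text/html
        BrowserMatch ^Mozilla/4\.0[678] no-gzip
        BrowserMatch \bMSIE !no-gzip !gzip-only-text/html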

  7. Set up some monitoring.

    I installed MRTG with some basic graphs, configured Apache so that I could view its server-status page, and installed iftop to get a real-time view of the bandwidth usage.

With all of these changes, I’m happy to report that we had tens of thousands of visitors during and shortly after the show, and everything ran perfectly. I had the static content running on a separate server for the busiest period, and combined bandwidth usage peaked at around 90 Mbps shortly after the end of the show.

‘Maintenance’ Pages via Apache mod_rewrite

Occasionally, I’ve found it useful to put up a maintenance page while performing some work on a website. It’s also useful if you are debugging and want to ensure that regular visitors don’t see any application-generated error messages or blank pages.

This method uses mod_rewrite to redirect all requests to a maintenance page that you create. Since the rewrite conditions exclude your own IP address, you can continue to use the site normally while everyone else sees the maintenance page.

First, create maint.html with whatever message you want to display to your users. Then add this to your Apache configuration to redirect users to that page. Obviously, you’ll need to substitute your own IP address, and you can add multiple lines for multiple users if necessary. The configuration essentially says that requests not from your IP (notice the exclamation point) are redirected to /maint.html, and that this is the last rewrite rule that should be processed.

  ##### Maintenance section
  ## Uncomment and add your IP address for performing maintenance
  ## Add multiple addresses on multiple lines if necessary
  RewriteCond %{REMOTE_ADDR} !^11\.22\.33\.44$
  RewriteCond %{REMOTE_ADDR} !^1\.1\.1\.1$
  RewriteRule . /maint.html [L]
  ##### End Maintenance section

ddrescue Saves the Day

I had a very bad thing happen to my regular work PC last week. I use a Windows PC for my normal desktop machine, and when I turned it on one morning it refused to boot up. After several attempts, it became obvious that the hard drive was dying and wouldn’t last much longer. I have most of my irreplaceable files backed up to Amazon S3 via JungleDisk, but it is still a huge pain to reinstall an operating system and all of my applications and try to get back to a working system.

Fortunately, at a recent CALUG meeting, we had Barry Grundy give a presentation on data recovery. Barry is the author of LinuxLEO, a pretty comprehensive document about data forensics using open-source tools. In his presentation he covered a number of open-source tools that are super-useful for recovering raw data and then making sense of it. The main tool that I found useful was GNU ddrescue, a variant of dd specifically created to retrieve as much data as possible from a failing drive.

ddrescue works by reading data from the drive. When it encounters a bad sector, it skips forward a ways and tries to read again. If it is unsuccessful, it skips forward a larger amount and continues until it is able to read something successfully. That process repeats until it has gone through the whole drive, retrieving as much good data as possible. You can then run it again with different parameters to go back and retry the questionable areas.
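
The multi-pass approach looks roughly like this (a sketch; the device name and the image and log file paths are examples):

    # First pass: grab everything that reads easily, skipping bad areas quickly
    ddrescue -n /dev/sda /mnt/recovery/disk.img /mnt/recovery/rescue.log

    # Later passes: use the log file to go back and retry just the bad areas
    ddrescue -d -r 3 /dev/sda /mnt/recovery/disk.img /mnt/recovery/rescue.log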

The failing drive was a 160 GB SATA drive that was over 4 years old. The first round with ddrescue looked pretty bad: around 25% of the drive was bad or questionable. But after some experimenting to figure out the ideal parameters, and a few more passes through the entire drive, I ended up recovering everything except about 110 kB of bad sectors.

At that point I had almost all of my data, but I wasn’t able to boot off of the drive. There were some problems with the master boot record, and the NTFS volume was corrupt and wouldn’t mount cleanly. I ended up attaching the drive to a working machine so that I could run chkdsk on it, which solved the NTFS corruption problems. I had to work around quite a few problems, but eventually I was able to restore it to the point where it booted just fine.

Jungle Disk Error getTotalSizeByType DB error (11)

My hard drive recently died, and I was able to restore most of the files without incident. However, it seems that some of my cached JungleDisk data became corrupted. When trying to connect to a bucket, JungleDisk generated this error:

getTotalSizeByType DB error (11) database disk image is malformed

Some quick Google queries identified this as an SQLite error, but didn’t turn up any method for fixing it. Removing and re-adding the bucket in the JungleDisk configuration didn’t resolve it either.

To finally get it fixed, I just cleaned out the JungleDisk cache manually. The cache directory is configurable from the JungleDisk Settings menu under "Application Settings". On a Windows machine, it is likely in C:\Documents and Settings\<Your Username>\Application Data\JungleDisk\cache. That directory contains subdirectories for each bucket that you connect to. I closed JungleDisk, deleted the cache directory for the bucket that was having the problem, and started JungleDisk back up. I was then able to re-add the bucket; it recreated the cache and went on to work fine.

ClassPerformance.com Is Now Live

I’ve spent the last little while working on ClassPerformance.com. The site is for Brad Cohen, who has Tourette’s Syndrome and has gone on to become an award-winning teacher, motivational speaker, and author. He has appeared on shows such as Oprah, and his life story has been made into a Hallmark Hall of Fame movie that will air on CBS on December 7th. The website revamp is in anticipation of a spike in traffic around the time of the premiere.

The blog, and most of the content for the site, is served through WordPress. I wrote some additional custom functionality for the dynamic content and integrated the HTML template. Overall, I’m quite pleased with the result and am looking forward to the movie.

Funniest Mac Ad So Far

I just had to laugh for a while at this commercial. It is funny how these commercials have played out: Apple obviously did a great job with them, then Microsoft came out with their “I’m a PC” commercials, and this one is in response to that.

Trac error: “<username> is not a valid value for the owner field.”

I stumbled around with this simple problem for longer than I care to admit, and Googling for a solution only found similar questions without any answers. The problem occurs after adding a new user to the user database for Trac (in my case, a simple .htpasswd file): the new user is unable to accept or be assigned any tickets. Attempts to do so just generate the error message:

"<username> is not a valid value for the owner field."

The solution is to log in as that user, click on ‘Settings’ in the top right, and fill out the name and email address. Evidently, saving those fields writes the user’s information to the Trac database, which makes it possible to reference that user in other places.

Fixing a Corrupt MySQL Relay Log

As an extension of my post yesterday about skipping corrupt queries in the relay log, I found out that my problem is caused by network problems between the servers, which trigger a MySQL bug.

The connection and replication errors in my MySQL log look like this:

080930 12:26:52 [ERROR] Error reading packet from server: Lost connection to MySQL server 
  during query ( server_errno=2013)
080930 12:26:52 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, 
  log 'mysql-bin.000249' position 747239037
080930 12:26:53 [Note] Slave: connected to master 'replicate@mysqltunnel:13306',replication 
  resumed in log 'mysql-bin.000249' at position 747239037
080930 13:18:49 [ERROR] Error reading packet from server: Lost connection to MySQL server during
   query ( server_errno=2013)
080930 13:18:49 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 
  'mysql-bin.000249' position 783782486
080930 13:18:49 [ERROR] Slave: Error 'You have an error in your SQL syntax; check the manual 
  that corresponds to your MySQL server version for the right syntax to use near '!' at line 6' 
  on query. Default database: 'database'. Query: 'INSERT INTO `sometable`
            SET   somecol         = 3,
                    comeothercol  = 8,
                    othervalue      = NULL!', Error_code: 1064
080930 13:18:49 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and 
  restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000249' 
  position 783781942
080930 13:18:50 [Note] Slave: connected to master 'replicate@mysqltunnel:13306',
  replication resumed in log 'mysql-bin.000249' at position 783782486

When there were network problems between the servers, the master didn’t always properly detect the failure and notify the slave. This resulted in parts of queries going missing, being duplicated, or being replaced by random bits in the relay log on the slave. When the slave tries to execute a corrupt query, it will likely generate an error that begins with:

Error You have an error in your SQL syntax; check the manual that corresponds to 
  your MySQL server version for the right syntax to use near . . 

This bug has been fixed in MySQL releases since February 2008, but the fix still hasn’t made its way into the CentOS 5 repositories. Until it does, the bug report contains a workaround which forces the slave to re-request the binary log from the master. Run ‘SHOW SLAVE STATUS’ and make note of the Relay_Master_Log_File and Exec_Master_Log_Pos columns. Then run ‘STOP SLAVE’ to suspend replication, and run this SQL:

CHANGE MASTER TO master_log_file='<Value from Relay_Master_Log_File>',
  master_log_pos=<Value from Exec_master_log_pos>;

After that, simply run ‘START SLAVE’ to have replication pick up from there again. That has the slave re-request the rest of the master’s binary log, which it should (hopefully) receive without corruption this time, and replication will continue as normal.
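
Putting the whole workaround together, the session looks something like this (the log file name and position here are the values from the log excerpt above; substitute your own from SHOW SLAVE STATUS):

    -- Suspend replication
    STOP SLAVE;

    -- Re-point the slave at the spot where the SQL thread stopped
    CHANGE MASTER TO
      master_log_file='mysql-bin.000249',   -- Relay_Master_Log_File
      master_log_pos=783781942;             -- Exec_Master_Log_Pos

    -- Resume replication; the slave re-requests the binary log from here
    START SLAVE;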

The network connection between my servers has evidently been problematic lately; I’ve had to apply this fix several times in the past couple of days. If that keeps up, I may add it to my replication-checking script until I’m able to upgrade to a version of MySQL that contains the fix.
