Speed Up a Linux Software RAID Rebuild

I’m setting up software RAID on a running server, and the initial sync of the RAID drives on the 1 TB hard disks is taking forever. It has been running for about 6 hours and says that it will take about 5 days (7400 minutes) at this pace:

[root@host ~]# cat /proc/mdstat
md1 : active raid1 sdb3[1] sda3[2]
      974559040 blocks [2/1] [_U]
      [>....................]  recovery =  3.9% (38109184/974559040) finish=7399.1min speed=2108K/sec
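The finish estimate can be sanity-checked by hand: the block counts in /proc/mdstat are in KB and the speed is in KB/s, so the remaining time is simply (total - done) / speed. Here is a minimal sketch of that arithmetic; the function name mdstat_eta_minutes is my own, not part of any tool, and it assumes the line format shown above.

```shell
# Hypothetical helper: read /proc/mdstat-style text on stdin and print the
# estimated minutes remaining for each recovery/resync line.
mdstat_eta_minutes() {
    awk '
        /(recovery|resync) *=/ {
            # Pull "done/total" out of the parentheses.
            if (!match($0, /\([0-9]+\/[0-9]+\)/)) next
            split(substr($0, RSTART + 1, RLENGTH - 2), blocks, "/")
            # Pull the number out of "speed=<n>K/sec".
            if (!match($0, /speed=[0-9]+K\/sec/)) next
            speed = substr($0, RSTART + 6, RLENGTH - 11) + 0
            # Remaining KB divided by KB/s gives seconds; convert to minutes.
            if (speed > 0) printf "%.1f\n", (blocks[2] - blocks[1]) / speed / 60
        }
    '
}
```

Running `mdstat_eta_minutes < /proc/mdstat` during a rebuild should print a figure close to the kernel's own finish= value. The two will not match exactly, since the kernel averages the speed over its own window.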

I did some read and write tests directly against the drives using dd to make sure that they were working okay, and they can sustain about 100 MB/s:

[root@host ~]# dd if=/dev/zero of=/dev/sda2 bs=1024 count=1024000
    1048576000 bytes (1.0 GB) copied, 10.8882 seconds, 96.3 MB/s
[root@host ~]# dd if=/dev/zero of=/dev/sdb2 bs=1024 count=1024000
    1048576000 bytes (1.0 GB) copied, 11.1162 seconds, 94.3 MB/s
[root@host ~]# dd if=/dev/sda2 of=/dev/null bs=1024 count=1024000
    1048576000 bytes (1.0 GB) copied, 10.2829 seconds, 102 MB/s
[root@host ~]# dd if=/dev/sdb2 of=/dev/null bs=1024 count=1024000
    1048576000 bytes (1.0 GB) copied, 10.5109 seconds, 99.8 MB/s
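One caution about the tests above: the write tests overwrite whatever is on the target partition, so they only make sense on a partition you can afford to lose. If the disk holds data, a read-only check tells you nearly as much. A minimal sketch follows; the function name read_speed_test is my own, and it assumes GNU dd, which accepts bs=1M.

```shell
# Hypothetical helper: read <count> MiB from <source> and throw it away.
# Nothing is written to the source, so this is safe on a disk that holds data.
# dd prints its throughput summary on stderr when it finishes.
read_speed_test() {
    src=$1
    mib=${2:-1000}   # default to roughly 1 GB, like the tests above
    dd if="$src" of=/dev/null bs=1M count="$mib"
}
```

For example, `read_speed_test /dev/sda 1000` reads the first 1000 MiB of the disk. A repeated run may be served from the page cache and look unrealistically fast; with GNU dd, adding iflag=direct to the command avoids that.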

What I failed to realize is that there are configurable limits for the minimum and maximum speed of the rebuild. Those parameters live in /proc/sys/dev/raid/speed_limit_min and /proc/sys/dev/raid/speed_limit_max, and both are expressed in KB/s. The minimum defaults to a pretty slow 1 MB/s, which was causing the rebuild to take forever.

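Increasing the maximum limit didn’t automatically make it faster either, presumably because md throttles the rebuild back toward the minimum whenever there is other I/O on the array, and this server was in use. I had to increase the minimum limit to get it to jump up to a respectable speed.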

[root@host ~]# echo 100000 > /proc/sys/dev/raid/speed_limit_min
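If you change these limits often, the check-and-set steps can be wrapped in a couple of small functions. This is only a sketch: the function names and the RAID_LIMIT_DIR variable are my own invention, and the variable exists purely so the functions can be tried against a scratch directory instead of the live /proc files.

```shell
# Directory holding the md speed limits. Overridable (hypothetical variable)
# so the functions can be exercised without touching the real settings.
RAID_LIMIT_DIR="${RAID_LIMIT_DIR:-/proc/sys/dev/raid}"

# Print the current minimum and maximum rebuild speeds.
show_raid_limits() {
    for name in speed_limit_min speed_limit_max; do
        printf '%s = %s KB/s\n' "$name" "$(cat "$RAID_LIMIT_DIR/$name")"
    done
}

# Set the minimum rebuild speed in KB/s; rejects anything that is not a
# plain non-negative integer. Needs root when run against /proc.
set_raid_min_limit() {
    case "$1" in
        ''|*[!0-9]*) echo "usage: set_raid_min_limit <KB/s>" >&2; return 1 ;;
    esac
    echo "$1" > "$RAID_LIMIT_DIR/speed_limit_min"
}
```

The same setting is also reachable as `sysctl -w dev.raid.speed_limit_min=100000`. Either way the change lasts only until the next reboot, which is usually what you want for a one-off rebuild.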

[root@host ~]# watch cat /proc/mdstat
Every 2.0s: cat /proc/mdstat 
md1 : active raid1 sdb3[1] sda3[2]
      974559040 blocks [2/1] [_U]
      [=>...................]  recovery =  7.7% (75695808/974559040) finish=170.5min speed=87854K/sec

Now it is up around 87 MB/s and will take just a few hours to complete the rest of the drive.

ddrescue Saves the Day

I had a very bad thing happen to my regular work PC last week. I use a Windows PC for my normal desktop machine, and when I turned it on one morning it refused to boot up. After several attempts, it became obvious that the hard drive was dying and wouldn’t last much longer. I have most of my irreplaceable files backed up to Amazon S3 via JungleDisk, but it is still a huge pain to reinstall an operating system and all of my applications and get back to a working system.

Fortunately, at a recent CALUG meeting, we had Barry Grundy give a presentation on Data Recovery. Barry is the author of LinuxLEO, a pretty comprehensive document about Data Forensics using open-source tools. In his presentation he covered a number of open-source tools that are super-useful for recovering raw data and then making sense of it. The main tool that I found useful was GNU ddrescue, which is a variant of dd specifically created to retrieve as much data as possible from a failing drive.

ddrescue works by reading data from the drive. When it encounters a bad sector it skips forward a ways and tries to read again. If it is unsuccessful, it skips forward a larger amount and continues until it is able to read something successfully. That process repeats until it has gone through the whole drive, which retrieves as much good data from the drive as possible. You can then run it again with some different parameters to go back and retry the error areas and recover as much of the questionable data as it can.
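To make the two-stage idea concrete, here is a sketch of a common pattern from the GNU ddrescue manual. These are not the exact parameters I ended up using; the function names are mine, and the device and file names in the usage line are placeholders. The third argument is the mapfile, which is how ddrescue remembers between runs which areas are still bad.

```shell
# Pass 1: copy everything that reads cleanly and skip the slow scraping
# of bad areas (-n), so the easy data is safe as early as possible.
rescue_first_pass() {
    ddrescue -n "$1" "$2" "$3"
}

# Pass 2: go back over the areas the mapfile still marks as bad, using
# direct disc access (-d) and up to three retry passes (-r3).
rescue_retry_pass() {
    ddrescue -d -r3 "$1" "$2" "$3"
}
```

A run would look like `rescue_first_pass /dev/sdX disk.img disk.map` followed by `rescue_retry_pass /dev/sdX disk.img disk.map`, with disk.img and disk.map stored on a different, healthy disk. Because both passes share the same mapfile, the second pass only touches the areas the first one could not read.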

The drive that was failing was a 160 GB SATA drive that was over 4 years old. The first round with ddrescue looked pretty bad: around 25% of the drive was bad or questionable. After some experimenting to figure out the ideal parameters and a few passes through the entire drive, I ended up recovering my entire drive minus just 110 KB of bad sectors.

At that point I had almost all of my data, but I wasn’t able to boot off of the drive. There were some problems with the master boot record, and the NTFS volume was corrupt and wouldn’t mount cleanly. I ended up attaching the drive to a working machine so that I could run chkdsk on it, which solved the NTFS corruption problems. I had to work around quite a few problems, but eventually I was able to restore it to a point where it booted just fine.