Hazard's stuff

Replacing failed hard drive in Linux Software RAID

— Posted by hazard @ 2007-12-28 16:33
Initially I thought it would be a quick midnight maintenance, not taking more than 10 minutes... Oh boy, I was wrong.

Googling around showed absence of any real-life scenarios for Linux software RAID disk replacement. All articles were of the "and now we simulate a disk failure..." category, and on top of that, most of them were outdated. No article seemed to cover the scenario where disk has REALLY failed and system was rebooted after a failure.

Even more surprisingly, it seems that CentOS5/Red Hat Enterprise 5 rescue disks are NOT designed to handle software arrays with any kind of problem. They just refuse to detect problematic arrays and mdadm will not show anything.

To cut a long story short, here is a REAL-LIFE procedure on how to replace a failed disk in Linux software RAID array:
  • Insert the new hard drive (probably your server needs to be turned off when doing that).
  • Boot from a rescue CD.
  • Create a partition table on the new drive so that all partitions are in the same order and sizes as partitions on the working drive.
  • Set RAID partition type as Linux (83), not Linux raid auto (fd). THIS IS VERY IMPORTANT AND IS OPPOSITE OF INSTRUCTIONS YOU CAN FIND ELSEWHERE. Otherwise your Linux system won't boot.
  • Now boot system Linux from the working hard drive (I hope you had bootloader installed on it, otherwise install it).
  • Add the new hard drive into the array:
    mdadm [MD-device] -a [new-HDD-device]
    For example,
    mdadm /dev/md0 -a /dev/sdb1
  • Check that hard disk was successfully added using
    mdadm -Q --detail [MD-device]
    Among other things it should say something like "reconstructing 0%".
  • Now, run fdisk, and change RAID partition type to Linux raid auto (fd).
  • If everything went fine until here, consider yourself lucky. :)