Hazard's stuff

28 Dec, 2007

Replacing failed hard drive in Linux Software RAID

— Posted by hazard @ 2007-12-28 16:33
Initially I thought it would be a quick midnight maintenance, not taking more than 10 minutes... Oh boy, I was wrong.

Googling around turned up no real-life scenarios for replacing a disk in Linux software RAID. All the articles were of the "and now we simulate a disk failure..." category, and on top of that, most of them were outdated. No article seemed to cover the scenario where a disk has REALLY failed and the system was rebooted after the failure.

Even more surprisingly, it seems that the CentOS 5/Red Hat Enterprise Linux 5 rescue disks are NOT designed to handle software arrays with any kind of problem. They just refuse to detect problematic arrays, and mdadm will not show anything.

To cut a long story short, here is a REAL-LIFE procedure for replacing a failed disk in a Linux software RAID array:
  • Insert the new hard drive (your server probably needs to be powered off for this).
  • Boot from a rescue CD.
  • Create a partition table on the new drive so that its partitions have the same order and sizes as the partitions on the working drive.
  • Set the RAID partition type to Linux (83), not Linux raid auto (fd). THIS IS VERY IMPORTANT AND IS THE OPPOSITE OF THE INSTRUCTIONS YOU CAN FIND ELSEWHERE. Otherwise your Linux system won't boot.
  • Now boot the Linux system from the working hard drive (I hope you had a bootloader installed on it; otherwise, install it).
  • Add the new hard drive into the array:
    mdadm [MD-device] -a [new-HDD-device]
    For example,
    mdadm /dev/md0 -a /dev/sdb1
  • Check that the hard disk was successfully added using
    mdadm -Q --detail [MD-device]
    Among other things it should say something like "reconstructing 0%".
  • Now run fdisk and change the RAID partition type to Linux raid auto (fd).
  • If everything went fine until here, consider yourself lucky. :)
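
Besides mdadm -Q --detail, the kernel reports array state in /proc/mdstat, where a degraded two-disk mirror shows a member map such as [U_] instead of [UU]. The following is only a minimal sketch for spotting degraded arrays from that file; the function name degraded_arrays is made up for this illustration, and it assumes the usual /proc/mdstat layout (an "mdN : ..." line followed by a line carrying the member map).

```shell
#!/bin/sh
# Hypothetical helper (not part of mdadm): print the name of every md array
# whose member map, e.g. [U_], contains "_" (a missing or failed member).
# Reads mdstat-formatted text from the file given as $1; with no argument
# it reads /proc/mdstat.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/        { name = $1 }     # remember the current array name
        /\[[U_]*_[U_]*\]/    { print name }    # member map with a missing member
    ' "${1:-/proc/mdstat}"
}
```

On a live system, calling degraded_arrays with no argument reads /proc/mdstat directly. While the array is rebuilding, cat /proc/mdstat should also show a recovery progress line for it.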


Comments

  1. It sounds like you had some fun, eh... :)

    Posted by Leonid Mamchenkov — 28 Dec 2007, 18:28

  2. It wasn't Christmas kind of fun... :-)

    Posted by hazard — 29 Dec 2007, 07:35

  3. I've had my fair share with RAID and Linux. I can share your feelings on not getting anything useful from a quick Google search on the topic. I had some stuff documented somewhere for reconstructing with a new drive in the case of an OS-dependent RAID, which I wanted to add to my post on RAID+Linux on my blog. Maybe I should get that up there as a part two, and MAYBE we could actually provide the proper information at the end of the day for someone else's Google search.
    :)

    Posted by Mario A. Spinthiras — 24 Jan 2008, 16:57

  4. I have some Linux Software RAID1 experiences and two questions.

    I had an unresponsive system, so I rebooted. Upon reboot, the system refused to boot at all, with no sign of any boot device. I pulled out one of the disks and rebooted: same story. I put that one back in, pulled out the other drive and rebooted again. Same story.

    I would have thought that the system should have successfully rebooted off a good drive, albeit with the system recognizing that it was running with a degraded RAID.

    My questions are:
    1. Is there a way to successfully boot the software RAID1 with only one good drive? e.g. the bad drive was never formally removed from the RAID.

    2. Is the use of a rescue CD the only way to query disks in such a scenario, so that one can start the process to actually determine which specific disk has failed?

    and btw, when I eventually replaced the drive, I did succeed setting the partition as fd linux raid auto and rebuilding the RAID. However perhaps the difference was that the system supported hot-swap drives, so I didn't have to boot the system after installing the drive.

    Any thoughts and comments welcome because I think there are lots of tricks to Linux software RAID, and we can all learn from these real world examples.

    Vladimir, thanks for rekindling this issue.

    Posted by Frank Daley — 28 Jan 2008, 05:04

  5. Hi,

    When you say the system refused to boot, which stage do you mean? The bootloader (GRUB) didn't start, the kernel didn't load, the kernel didn't find root... ?

    Posted by hazard — 28 Jan 2008, 09:02

  6. The system could not find any boot partition on the hard disk, so it began a DHCP request to try to find a boot device via the network.

    Also thinking further about your point re changing the partition type, although in your case changing from Linux (83) to Linux raid auto (fd), is it possible to change the type from fd to 83 without corrupting any data, and given that there is already a degraded RAID1?

    Perhaps then the system would boot?? If so, then it would be a matter of determining which one of the two drives was actually the faulty drive.

    Posted by Frank Daley — 28 Jan 2008, 19:23

  7. So, as I understand it, your problem was that the BIOS could not find a bootloader. Probably you had only one instance of GRUB installed, and that one was on the failed disk.

    To be able to boot from any of the drives, GRUB should be installed in the MBR of both hard drives, and the partitions should be marked as bootable.

    Posted by hazard — 29 Jan 2008, 16:50

  8. Yes, that's what I thought.

    I had assumed that since I had configured MD0 as an ext3 partition for /boot that the necessary boot data was written to both devices.

    However since I could not boot from either device on its own, maybe I made a mistake. I will try some more experiments in the next few weeks.

    Thanks for the advice.

    Posted by Frank Daley — 29 Jan 2008, 23:48

  9. Hi there Vladimir,

    I would have written sooner, but it seems your site was down. I have part two of my RAID post, which covers maintaining the RAID in the case of failures and such. The link is:

    http://www.spinthiras.net/?p=42

    Note that yes I did change domain :)

    Regards,
    Mario

    Posted by Mario A. Spinthiras — 11 Feb 2008, 13:54

  11. Might also mention that you don't have to use a recovery disk - as long as you have fdisk installed on your server.

    After recovery you should also run grub and install it in the MBR on the new HDD. For example if sdb is the drive you have just replaced then use:

    grub> device (hd0) /dev/sdb
    grub> root (hd0,0)
    grub> setup (hd0)

    Posted by Bowen Denning — 18 Nov 2008, 19:42

