Replacing a Drive in the Software RAID

Upon trying to replace a drive in my Linux software RAID, I realized that I had never documented this process before.

The power failed a couple of days ago, and sitting idle for 10 hours without spinning was apparently too much for one of my Western Digital 500GB drives: it never came back online. The computer still booted up just fine with 2 of the 3 drives, and within 2 hours of the problem I had ordered a new 1TB drive to replace it.

When the new drive arrived, I shut down the computer, unplugged the failed SATA drive, and replaced it with the new one. Linux again booted up with only 2 drives active.

To ensure that the partitions matched, I copied the partition table from a surviving drive with sfdisk -d /dev/sda | sfdisk /dev/sdb. cfdisk was unhappy with the drive initially, but fdisk was happy to read the table and write it back cleanly. After that I could open the drive again in cfdisk and add one more primary partition to use the extra non-redundant 500GB on the new drive.
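In case it's useful, here's roughly what that partition-table copy looked like; the device names are from my machine, so adjust them to match your layout:

    # dump the partition table from the surviving drive and write it to the new one
    sfdisk -d /dev/sda | sfdisk /dev/sdb

    # verify the new drive's partition table
    sfdisk -l /dev/sdb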

With the partitions in place, it was time to insert the new partitions back into the RAID devices: mdadm /dev/md0 --add /dev/sdb1, mdadm /dev/md2 --add /dev/sdb5, etc., until every array had its partition back. Watching /proc/mdstat, I could see that the RAID had started rebuilding onto the new partitions, and I could go to bed.
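For reference, the add-and-watch sequence was roughly this; md0, md2, and the sdb partition numbers are specific to my setup:

    # add the new partitions back into their arrays
    mdadm /dev/md0 --add /dev/sdb1
    mdadm /dev/md2 --add /dev/sdb5

    # watch the rebuild progress
    watch cat /proc/mdstat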

I did manage to do something wrong at one point, so a quick mdadm /dev/md0 --fail /dev/sdb1 and mdadm /dev/md0 --remove /dev/sdb1 let me fail and remove a partition, fix it up, and add it back when I was done. All of this could be done with the system up and running, which is pretty convenient.
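If you need to back a partition out of an array like that, the fail/remove/re-add cycle looks roughly like this (again, the array and partition names are from my machine):

    # mark the partition as failed, then pull it from the array
    mdadm /dev/md0 --fail /dev/sdb1
    mdadm /dev/md0 --remove /dev/sdb1

    # fix whatever was wrong, then add it back and let it resync
    mdadm /dev/md0 --add /dev/sdb1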


Filed Under: Linux Computers