Recovering a Failed RAID Disk on Linux
I discovered a failed disk in a Linux software RAID array on a customer’s server. After the failed disk, /dev/sdb, was replaced, I rebuilt the RAID by duplicating the partition layout of the remaining disk, /dev/sda, onto the new disk and re-adding its partitions to the arrays.
First I removed the partitions of the failed disk from the RAID:
xxx:~# mdadm --manage /dev/md0 --remove /dev/sdb1
hot removed /dev/sdb1
xxx:~# mdadm --manage /dev/md1 --remove /dev/sdb2
hot removed /dev/sdb2
xxx:~# mdadm --manage /dev/md2 --remove /dev/sdb3
hot removed /dev/sdb3
/proc/mdstat shows the degraded RAID:
xxx:~# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4]
md2 : active raid1 sda3[1]
1458830400 blocks [2/1] [_U]
md1 : active raid1 sda2[1]
2104448 blocks [2/1] [_U]
md0 : active (auto-read-only) raid1 sda1[1]
4200896 blocks [2/1] [_U]
unused devices: <none>
The first MB of /dev/sda was copied into a file and put on /dev/sdb using dd:
xxx:~# dd if=/dev/sda of=sda.part bs=1024k count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0266995 s, 39.3 MB/s
xxx:~# dd if=sda.part of=/dev/sdb bs=1024k count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00635116 s, 165 MB/s
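The dd step can be rehearsed on scratch image files before touching real disks. The following is only a minimal sketch: copy_first_mib is a hypothetical helper name, and the file names in the usage note are placeholders. It copies the first MiB (which on an MBR-partitioned disk includes the partition table in the first 512 bytes) and then verifies the copy by checksum.

```shell
#!/bin/sh
# Sketch only: copy_first_mib is a hypothetical helper, not part of any tool.
# It copies the first MiB of SRC onto DST and verifies the result.
copy_first_mib() {
    src=$1
    dst=$2
    # conv=notrunc leaves the rest of a regular file intact;
    # it makes no difference when DST is a block device.
    dd if="$src" of="$dst" bs=1024k count=1 conv=notrunc 2>/dev/null || return 1
    # Compare checksums of the first 1048576 bytes of both sides.
    src_sum=$(head -c 1048576 "$src" | cksum)
    dst_sum=$(head -c 1048576 "$dst" | cksum)
    [ "$src_sum" = "$dst_sum" ]
}
```

For example, copy_first_mib a.img b.img returns success only if the first MiB of b.img matches a.img afterwards. Pointing it at real devices overwrites the target's partition table, so the argument order deserves a second look.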
Another way is to use sfdisk, which dumps the partition table of one disk and writes it to another. Note that the transcript below writes to /dev/sda; with the disks as described here (healthy /dev/sda, new /dev/sdb) source and target must be swapped, i.e. sfdisk -d /dev/sda | sfdisk --force /dev/sdb:
xxx:~# sfdisk -d /dev/sdb | sfdisk --force /dev/sda
Checking that no-one is using this disk right now ...
OK
Disk /dev/sda: 91201 cylinders, 255 heads, 63 sectors/track
Old situation:
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0
   Device Boot Start     End   #cyls    #blocks   Id  System
/dev/sda1          0+    522     523-   4200997   fd  Linux raid autodetect
/dev/sda2        523     784     262    2104515   fd  Linux raid autodetect
/dev/sda3        785   91200   90416  726266520   fd  Linux raid autodetect
/dev/sda4          0       -       0          0    0  Empty
New situation:
Units = sectors of 512 bytes, counting from 0
   Device Boot    Start        End    #sectors  Id  System
/dev/sda1             1    8401994     8401994  fd  Linux raid autodetect
/dev/sda2       8401995   12611024     4209030  fd  Linux raid autodetect
/dev/sda3      12611025 1465144064  1452533040  fd  Linux raid autodetect
/dev/sda4             0          -           0   0  Empty
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table
Re-reading the partition table ...
If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
Check this with fdisk:
xxx:~# fdisk /dev/sdb
The number of cylinders for this disk is set to 182401.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)
Command (m for help): p
Disk /dev/sdb: 1500.3 GB, 1500301910016 bytes
255 heads, 63 sectors/track, 182401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000b0097
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         523     4200997   fd  Linux raid autodetect
/dev/sdb2             524         785     2104515   fd  Linux raid autodetect
/dev/sdb3             786      182401  1458830520   fd  Linux raid autodetect
Command (m for help): q
Because /dev/md0 was in auto-read-only mode (see the state of md0 in /proc/mdstat above), I switched it to read-write with mdadm:
xxx:~# mdadm --readwrite /dev/md0
Now I could rebuild the arrays:
xxx:~# mdadm --manage /dev/md0 --add /dev/sdb1
xxx:~# mdadm --manage /dev/md1 --add /dev/sdb2
xxx:~# mdadm --manage /dev/md2 --add /dev/sdb3
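The arrays are now rebuilt in the background, one at a time. Here md0 is already in sync, md2 is recovering, and md1 is waiting its turn (resync=DELAYED):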
xxx:~# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4]
md2 : active raid1 sdb3[2] sda3[1]
1458830400 blocks [2/1] [_U]
[=======>.............] recovery = 38.6% (563849792/1458830400) finish=248.7min speed=59956K/sec
md1 : active raid1 sdb2[2] sda2[1]
2104448 blocks [2/1] [_U]
resync=DELAYED
md0 : active raid1 sdb1[0] sda1[1]
4200896 blocks [2/2] [UU]
unused devices: <none>
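To see at a glance which arrays are still degraded, the member status in /proc/mdstat can be filtered with awk. This is a rough sketch, not a replacement for mdadm --detail: list_degraded is a hypothetical helper name, and it assumes the status field looks like [UU] or [_U] as in the output above.

```shell
#!/bin/sh
# Sketch only: list_degraded is a hypothetical helper.
# It reads /proc/mdstat-style text (from the file given as $1, or from
# /proc/mdstat by default) and prints every array whose member status
# contains an underscore, i.e. has a missing or unsynced member.
list_degraded() {
    awk '
        # An array header line, e.g. "md2 : active raid1 sdb3[2] sda3[1]".
        /^md[0-9]+ :/ { name = $1; next }
        # The first following line carrying a status such as [_U] or [UU].
        name != "" && match($0, /\[[U_]+\]/) {
            status = substr($0, RSTART, RLENGTH)
            if (index(status, "_")) print name, status
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Run against the output shown above, list_degraded prints md2 [_U] and md1 [_U]; once all rebuilds have finished it prints nothing.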