Recovering a Failed RAID Disk on Linux

I discovered a failed disk in a Linux software RAID array on a customer’s server. After the failed disk /dev/sdb had been replaced, I rebuilt the RAID by copying the partition layout of the remaining disk /dev/sda to the new disk and adding the new partitions to the arrays.

First I removed the partitions of the failed disk /dev/sdb from the arrays:

xxx:~# mdadm --manage /dev/md0 --remove /dev/sdb1
hot removed /dev/sdb1
xxx:~# mdadm --manage /dev/md1 --remove /dev/sdb2
hot removed /dev/sdb2
xxx:~# mdadm --manage /dev/md2 --remove /dev/sdb3
hot removed /dev/sdb3
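
mdadm only removes members that are already marked faulty (or are spares), so a partition of the failing disk that is still active has to be failed first. The following is a minimal sketch that prints combined --fail/--remove commands for review instead of running them. The function name print_remove_cmds is hypothetical, and the pairing of mdK with partition K+1 is an assumption taken from the layout in this post.

```shell
#!/bin/sh
# Print (do not run) one mdadm command per array that marks the matching
# partition of disk $1 as failed and then removes it.
# Assumes array mdK uses partition K+1, as in this post (md0 <-> sdb1, ...).
# $1 = disk (e.g. /dev/sdb), $2 = number of arrays.
print_remove_cmds() {
    disk=$1
    count=$2
    k=0
    while [ "$k" -lt "$count" ]; do
        part="$disk$((k + 1))"
        echo "mdadm --manage /dev/md$k --fail $part --remove $part"
        k=$((k + 1))
    done
}
```

For example, print_remove_cmds /dev/sdb 3 prints three command lines that can be checked by eye and then pasted into a root shell.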

/proc/mdstat shows the degraded arrays; [2/1] and [_U] mean that only one of the two mirror members is active:

xxx:~# cat /proc/mdstat 
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] 
md2 : active raid1 sda3[1]
      1458830400 blocks [2/1] [_U]

md1 : active raid1 sda2[1]
      2104448 blocks [2/1] [_U]

md0 : active (auto-read-only) raid1 sda1[1]
      4200896 blocks [2/1] [_U]

unused devices: <none>
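
On a server with many arrays it can help to let a script pick out the degraded ones. This is a minimal sketch that assumes the mdstat layout shown above; the function name degraded_arrays is hypothetical.

```shell
#!/bin/sh
# Print the name of every md array whose member status (e.g. [_U]) contains
# an underscore, which marks a missing member.
# Reads mdstat-formatted text from the file $1, defaulting to /proc/mdstat.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                    # remember current array
        /blocks/ && /\[[U_]*_[U_]*\]/ { print name }   # status with a gap
    ' "${1:-/proc/mdstat}"
}
```

Called without arguments on the server above, it would list md2, md1 and md0, one per line.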

To duplicate the partition table, I copied the first MB of /dev/sda into a file and wrote it to /dev/sdb using dd:

xxx:~# dd if=/dev/sda of=sda.part bs=1024k count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0266995 s, 39.3 MB/s


xxx:~# dd if=sda.part of=/dev/sdb bs=1024k count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00635116 s, 165 MB/s
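
Copying the first MB this way carries over a DOS (MBR) partition table with primary partitions only, as on this server; other layouts such as GPT keep data elsewhere on the disk and would need a different approach. To confirm that the copy arrived intact, the start of both disks can be compared. A minimal sketch, assuming GNU cmp; the function name same_prefix is hypothetical.

```shell
#!/bin/sh
# Succeed if the first $3 bytes (default: 1 MiB) of files or devices $1 and
# $2 are identical. Relies on the -n (byte limit) option of GNU cmp.
same_prefix() {
    cmp -s -n "${3:-1048576}" "$1" "$2"
}
```

Usage would be something like: same_prefix /dev/sda /dev/sdb && echo "first MB matches". Since the kernel may still hold the old partition table of /dev/sdb in memory after a raw write, it may also be necessary to have it re-read, for example with blockdev --rereadpt /dev/sdb.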

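Another way is to use sfdisk, dumping the partition table of the healthy disk /dev/sda and writing it to the new disk /dev/sdb. (The sample output below apparently comes from a different, 750 GB disk, so its device name and sizes do not match the rest of this post.)

xxx:~# sfdisk -d /dev/sda | sfdisk --force /dev/sdb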
Checking that no-one is using this disk right now ...
OK

Disk /dev/sda: 91201 cylinders, 255 heads, 63 sectors/track
Old situation:
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

   Device Boot Start     End   #cyls    #blocks   Id  System
/dev/sda1          0+    522     523-   4200997   fd  Linux raid autodetect
/dev/sda2        523     784     262    2104515   fd  Linux raid autodetect
/dev/sda3        785   91200   90416  726266520   fd  Linux raid autodetect
/dev/sda4          0       -       0          0    0  Empty
New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sda1             1   8401994    8401994  fd  Linux raid autodetect
/dev/sda2       8401995  12611024    4209030  fd  Linux raid autodetect
/dev/sda3      12611025 1465144064 1452533040  fd  Linux raid autodetect
/dev/sda4             0         -          0   0  Empty
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)

I checked the new partition table on /dev/sdb with fdisk:

xxx:~# fdisk /dev/sdb

The number of cylinders for this disk is set to 182401.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/sdb: 1500.3 GB, 1500301910016 bytes
255 heads, 63 sectors/track, 182401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000b0097

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         523     4200997   fd  Linux raid autodetect
/dev/sdb2             524         785     2104515   fd  Linux raid autodetect
/dev/sdb3             786      182401  1458830520   fd  Linux raid autodetect

Command (m for help): q
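
Instead of comparing the fdisk listings by eye, the sfdisk dumps of both disks can be compared in a script. This is a minimal sketch, assuming that partition lines in the dump start with the device name; the function name layout_of is hypothetical.

```shell
#!/bin/sh
# Reduce an "sfdisk -d" dump read from stdin to its partition lines and
# replace the disk name $1 with the placeholder DISK, so that the dumps of
# two different disks become directly comparable.
layout_of() {
    grep "^$1" | sed "s|^$1|DISK|"
}
```

A comparison could then look like this: [ "$(sfdisk -d /dev/sda | layout_of /dev/sda)" = "$(sfdisk -d /dev/sdb | layout_of /dev/sdb)" ] && echo "layouts match".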

Because /dev/md0 was in auto-read-only mode (see the state of md0 in the /proc/mdstat output above), I switched it to read-write with mdadm:

xxx:~# mdadm --readwrite /dev/md0

Now I could rebuild the arrays by adding the partitions of the new disk /dev/sdb:

xxx:~# mdadm --manage /dev/md0 --add /dev/sdb1
xxx:~# mdadm --manage /dev/md1 --add /dev/sdb2
xxx:~# mdadm --manage /dev/md2 --add /dev/sdb3

The arrays are now rebuilt in the background, one after another:

xxx:~# cat /proc/mdstat 
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] 
md2 : active raid1 sdb3[2] sda3[1]
      1458830400 blocks [2/1] [_U]
      [=======>.............]  recovery = 38.6% (563849792/1458830400) finish=248.7min speed=59956K/sec

md1 : active raid1 sdb2[2] sda2[1]
      2104448 blocks [2/1] [_U]
        resync=DELAYED

md0 : active raid1 sdb1[0] sda1[1]
      4200896 blocks [2/2] [UU]

unused devices: <none>
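
To follow the rebuild from a script, the progress can be pulled out of the same file. This is a minimal sketch that assumes the progress line format shown above; the function name rebuild_progress is hypothetical.

```shell
#!/bin/sh
# Print "<array> <percent>" for every array that reports recovery or resync
# progress. Arrays whose rebuild is only queued (resync=DELAYED) carry no
# percentage and are skipped.
# Reads mdstat-formatted text from the file $1, defaulting to /proc/mdstat.
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember current array
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)        # find the field ending in %
                if ($i ~ /%$/) print name, $i
        }
    ' "${1:-/proc/mdstat}"
}
```

If the goal is just to block until an array is finished, mdadm --wait /dev/md2 should do that without any parsing.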

Ralf Bensmann