Rebuild RAID 1, what did I do wrong?

Richard G

I must be honest and say that I have never built or rebuilt a software RAID before. We had a defective disk /dev/sda and a good disk /dev/sdb running in RAID 1.
The datacenter has replaced the defective disk, and we had to rebuild the array ourselves, so I used this manual:

http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array

At least... starting at "adding the new hard disk", because the new disk was already present.
In the beginning I had to run sfdisk with --force because it complained otherwise, and since /dev/sda was the disk that had been defective, I adjusted the commands like this:
Code:
sfdisk -d /dev/sdb | sfdisk /dev/sda --force
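
As a side note, a quick sanity check after copying the partition table is simply to compare the layouts of both drives; this only uses the stock fdisk tool, nothing beyond what the howto already assumes:

Code:
# the partition layouts of /dev/sda and /dev/sdb should now be identical
fdisk -l /dev/sda
fdisk -l /dev/sdb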

Then I used these commands:
Code:
mdadm --manage /dev/md1 --add /dev/sda1
mdadm --manage /dev/md2 --add /dev/sda2
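
While the resync is running, its progress can be followed with the usual tools; nothing here is specific to this setup:

Code:
# refresh the rebuild status every few seconds
watch -n 5 cat /proc/mdstat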

And it started rebuilding. It looked as if things went fine, and it also took a couple of hours. But maybe I should have used /dev/sda2 and /dev/sda3 here instead of /dev/sda1.

Because after rebuilding "cat /proc/mdstat" gives the following output:

Code:
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty] 
md1 : active raid1 sdb1[1] sda1[0]
      101312 blocks [2/2] [UU]
      
md2 : active raid1 sdb2[1] sda2[0]
      480088000 blocks [2/2] [UU]
      
unused devices: <none>

So there is clearly a difference in blocks, which should not be there.
You can also see that in the output of "fdisk -l":

Code:
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *            1          13      101376   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sda2               13       59782   480088064   fd  Linux raid autodetect
/dev/sda3            59782       60801     8190976   82  Linux swap / Solaris

Disk /dev/sdb: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x0004219e

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *            1          13      101376   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdb2               13       59782   480088064   fd  Linux raid autodetect
/dev/sdb3            59782       60801     8190976   82  Linux swap / Solaris

Disk /dev/md2: 491.6 GB, 491610112000 bytes
2 heads, 4 sectors/track, 120022000 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/md1: 103 MB, 103743488 bytes
2 heads, 4 sectors/track, 25328 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
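
Side note: the sizes of the arrays themselves can also be read straight from mdadm, which makes it easy to compare md1 and md2 directly; the grep just picks the size line out of the standard --detail output:

Code:
mdadm --detail /dev/md1 | grep 'Array Size'
mdadm --detail /dev/md2 | grep 'Array Size'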

The question is... what do I have to do to get this fixed? Because I believe /dev/md1 should be 491.6 GB too, shouldn't it?
 
Or did I do it right and is everything OK? If I look at the partitions, they seem correct:
Code:
cat /proc/partitions
major minor  #blocks  name

   7        0    5000000 loop0
   8        0  488386584 sda
   8        1     101376 sda1
   8        2  480088064 sda2
   8        3    8190976 sda3
   8       16  488386584 sdb
   8       17     101376 sdb1
   8       18  480088064 sdb2
   8       19    8190976 sdb3
   9        2  480088000 md2
   9        1     101312 md1
 
Maybe I misunderstand you, but...

md1, made up of sda1 and sdb1, is 103 MB, and md2, made up of sda2 and sdb2, is 491.6 GB. So of course they should be made up of a different number of blocks.

What is md1 mounted as? What is md2 mounted as?

Code:
df -h
I have a feeling you'll find that everything is as it should be.

And here's a question for you: How did you know that sda was bad? You should always try rebuilding RAID before you presume a drive is bad, since often the RAID fails because of a timing error and the drive itself is good.

We've only about 10% of the time in cases of software RAID failure found that the drive itself is bad.
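
In practice that usually just means putting the kicked-out member back into the array and seeing whether the resync completes. Roughly like this, with the device names purely as an example and only standard mdadm options:

Code:
# drop the member that was marked faulty, add it back, and watch the resync
mdadm --manage /dev/md2 --remove /dev/sda2
mdadm --manage /dev/md2 --add /dev/sda2
cat /proc/mdstat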

Jeff
 
Oh I'm sorry, I forgot to reply in this thread that I found out I probably did nothing wrong and everything is indeed as it should be. It was confirmed to me on the Dutch WHT forum.
I was just a little nervous about it, because I had never rebuilt a RAID before.

Just for your information, this is the current output of df -h:
Code:
Filesystem            Size  Used Avail Use% Mounted on
rootfs                455G   18G  414G   5% /
/dev/root             455G   18G  414G   5% /
none                  2.0G  244K  2.0G   1% /dev
/dev/md1               96M   15M   77M  16% /boot
tmpfs                 2.0G  4.0K  2.0G   1% /dev/shm
/dev/loop0            4.7G   12M  4.5G   1% /tmp
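
As an extra cross-check, the array-to-mountpoint mapping can also be read from the mount table; these commands are generic and not specific to this server:

Code:
# show which md devices are mounted where
mount | grep /dev/md
# or, where util-linux provides lsblk:
lsblk -o NAME,SIZE,MOUNTPOINT /dev/md1 /dev/md2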

How did I know sda was bad? The /var/log/messages logfile was showing all kinds of errors on the drive, like this:
Code:
Apr  1 23:21:35 server12 kernel: sd 0:0:0:0: [sda]  Unhandled error code
Apr  1 23:21:35 server12 kernel: sd 0:0:0:0: [sda]  Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Apr  1 23:21:35 server12 kernel: sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 09 03 4f 58 00 04 00 00
Apr  1 23:21:35 server12 kernel: end_request: I/O error, dev sda, sector 151211864
Apr  1 23:21:35 server12 kernel: md/raid1:md2: Disk failure on sda2, disabling device.
Apr  1 23:21:35 server12 kernel: md/raid1:md2: Operation continuing on 1 devices.
Apr  1 23:21:35 server12 kernel: sd 0:0:0:0: [sda]  Unhandled error code
Apr  1 23:21:35 server12 kernel: sd 0:0:0:0: [sda]  Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Apr  1 23:21:35 server12 kernel: sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 09 03 5f 58 00 00 60 00
Apr  1 23:21:35 server12 kernel: end_request: I/O error, dev sda, sector 151215960

After that I ran a command against both drives to check the drives themselves:
Code:
smartctl -a -d ata /dev/sda
smartctl -a -d ata /dev/sdb

In the output for /dev/sda there was an indication that it was defective:
Code:
SMART Error Log Version: 1
ATA Error Count: 1
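
For a quicker first look, smartctl can also print just the overall health verdict and the error log; these are standard smartmontools options, nothing specific to this drive:

Code:
# overall SMART health self-assessment
smartctl -H /dev/sda
# ATA error log only
smartctl -l error /dev/sda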

So we notified the datacenter provider; they checked, confirmed, and replaced the drive.
 
Yes, checking the drives themselves, and not depending on just the RAID failing, is indeed the way to go. We simply try a rebuild and see if it solves the problem.

Jeff
 
Okay, thank you, I will remember that for the future if a drive looks as if it is failing but smartctl says it is not. I'll try a rebuild first and see if that fixes it.
 