John Flinchbaugh Blog: Server Trouble: Lost a Fast 18G SCSI Drive

Ugh! Things were running slowly yesterday, and last night I found the reason -- one of my 2 fast U160 SCSI drives started showing bad sectors. All of /usr was on there, which included the entire photo gallery. Only a few blocks were going bad at this point, so I had a short list of files which I knew would be corrupt.

I needed to replace the drive quickly with one of the slower 7200RPM Fujitsus I have on hand. Good old scu (SCSI Command Utility) is next to impossible to find on the web, closed-source, and just broken in the later versions of Debian and Gentoo that I'm running, so I was left stranded at 3am looking for a new tool capable of reformatting a SCSI drive with the correct blocksize of 512. I found setblocksize. Its only purpose is to fix this sort of issue. Unfortunately, it's no more reassuring or informative than a BIOS format, which means it tells you absolutely nothing and you sit there wondering if it's working. I have myself convinced that it sometimes stops working after a while and needs an interruption, a reboot, and a restart. It tells you not to do that, but I don't trust it to tell me when it's broken either.

I did eventually get everything migrated over, and proved that it's going to suck -- the old drive did 29MB/s transfer, and this replacement only does 8MB/s. I have to get shopping for a new disk or 2. At this point, I'm afraid to have larger disks, because it just means more to lose and longer backups. This'll put me back into the mood to run lots of smaller disks in smaller partitions.
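To put those two transfer rates in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a full sequential pass over an 18G drive at the sustained rates quoted above, with decimal units (1 GB = 1000 MB). The function and variable names are made up for illustration, and real backups would be slower once seeks and filesystem overhead come into play.

```python
def transfer_minutes(size_gb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to move size_gb gigabytes at rate_mb_per_s.

    Assumes decimal units (1 GB = 1000 MB) and a constant sustained rate.
    """
    seconds = (size_gb * 1000) / rate_mb_per_s
    return seconds / 60


DRIVE_GB = 18   # size of the drive from the post
OLD_RATE = 29   # MB/s, the failing U160 drive
NEW_RATE = 8    # MB/s, the 7200RPM replacement

old_minutes = transfer_minutes(DRIVE_GB, OLD_RATE)
new_minutes = transfer_minutes(DRIVE_GB, NEW_RATE)
slowdown = OLD_RATE / NEW_RATE  # how many times longer the same job takes

print(f"old drive: {old_minutes:.1f} min for a full pass")
print(f"new drive: {new_minutes:.1f} min for a full pass")
print(f"slowdown factor: {slowdown:.3f}")
```

Under those assumptions, a full pass goes from roughly 10 minutes on the old drive to 37.5 minutes on the replacement, which is why bigger disks on slow hardware make backups that much more painful.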

Update (17 January 2006): After I got everything over to the new drive, and it seemed to be working, I did a Debian update to ensure all the major files (from a pending glibc update, especially) were intact. Then I went about low-level formatting the bad drive just for kicks. It low-leveled fine, and badblocks proved the drive was looking OK again, so I tortured it a bit, and put it back into service. The slow drive was going to be too painful.

Upon rebooting, all hell broke loose, with /boot on the unaffected drive not fsck-ing and other mass confusion. After lots more debugging, I saw that my /dev/scsi tree as assembled by udev was all misaligned due to the latest udev update (0.80-1, I believe). After rolling back udev, things got better, but it wasted a couple of hours for me.

Once things calmed, I proceeded to run my new favorite tool, setblocksize, on the rest of my unused SCSI drives. Now I'll not have to remember how to do this again, until I buy another batch of drives pulled from RAID arrays. It turns out that setblocksize will complete the format on its own on the first try -- it just takes a long time (an hour or so with my 18G drive on an old BusLogic Flashpoint controller).