[Triumf-linux-managers] Recovering a failing disk drive

Wed Nov 11 20:08:02 PST 2009

On Wed, Nov 11, 2009 at 06:12:56PM -0800, Andrew Daviel wrote:
> 
> I recently had a new-ish (1yr old) 1Tb SATA drive go bad on me at home.
>

I have written a program called "diskscrub" (an RPM available from trshare)
to monitor such situations. The program reads the whole disk to verify
that there are no "hidden" disk errors, and any read errors are reported
by email to root. By default it runs every week (early Tuesday morning),
but this can be adjusted in the crontab entry.

The program also reports the average read speed, which
catches the case where many retries are needed to read the data.

The program also reports the time taken for each read operation, which
catches failing or problematic disks. For example, if it takes 30 seconds
to read 1 Mbyte of data, you know this disk needs extra attention.
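
To illustrate the idea (this is not the diskscrub source, just a minimal
sketch), the three checks above can be written in a few lines of Python.
The function name scan(), the 1 MiB chunk size and the 30 second threshold
are assumptions made for the example. Reading a raw device such as /dev/sda
normally requires root, and reads served from the page cache will look
faster than the disk really is.

```python
import os
import time

CHUNK = 1024 * 1024   # assumed read size: 1 MiB, as in the example above
SLOW_SECONDS = 30.0   # assumed threshold for a "slow" read


def scan(path, chunk=CHUNK, slow_seconds=SLOW_SECONDS):
    """Read `path` from start to end and return a summary dict.

    Hypothetical helper: records read errors, slow reads and
    the average read speed over the whole pass.
    """
    errors = []  # byte offsets at which a read raised OSError
    slow = []    # (offset, seconds) for reads slower than slow_seconds
    total = 0    # bytes successfully read
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a file or a block device.
        size = os.lseek(fd, 0, os.SEEK_END)
        started = time.monotonic()
        offset = 0
        while offset < size:
            want = min(chunk, size - offset)
            t0 = time.monotonic()
            try:
                data = os.pread(fd, want, offset)
                total += len(data)
            except OSError:
                # Note the bad region and keep going with the next chunk.
                errors.append(offset)
            seconds = time.monotonic() - t0
            if seconds > slow_seconds:
                slow.append((offset, seconds))
            offset += want
        elapsed = time.monotonic() - started
    finally:
        os.close(fd)
    return {
        "bytes_read": total,
        "errors": errors,
        "slow": slow,
        "avg_bytes_per_sec": total / elapsed if elapsed > 0 else 0.0,
    }
```

A weekly cron job could call scan() on each disk and mail the result to
root whenever "errors" or "slow" is non-empty, or when the average speed
falls well below what the disk normally delivers.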

The program was written at the time of 100GB/10Mbyte/sec IDE disks,
and while it certainly needs updating for the new world of 1TB/100Mbytes/sec disks
with their fairly different failure modes, the diskscrub program is still
generally useful and catches problematic disks well before they fail completely.

As for mirroring all disks, the policy of the DAQ group is
to mirror (software RAID-1 across 2 disks) the system partition, the user
home partition and the swap partition. Sometimes the data partition
is also mirrored (when the cost of failure in manpower and lost beam time
outweighs the cost of an extra hard disk - which at current disk prices
is pretty much always).

-- 
Konstantin Olchanski
Data Acquisition Systems: The Bytes Must Flow!
Email: olchansk-at-triumf-dot-ca
Snail mail: 4004 Wesbrook Mall, TRIUMF, Vancouver, B.C., V6T 2A3, Canada