[Triumf-linux-managers] amanda failure

Kelvin Raywood kray at triumf.ca
Thu Oct 28 15:51:46 PDT 2010


There has been a failure in the storage array of Amanda, the system used
to back up desktop Linux systems at TRIUMF.  Backups have not run since
the weekend, and backups made before the weekend are not presently
available for restores.

We are working on recovering the array and estimate that backups from 
before last weekend will be available by the middle of next week.  In 
the interim, the T2K group have kindly provided us temporary access to 
one of their storage servers so that we can run fresh backups starting 
this evening.  These will start from scratch with full backups of all 
clients, so the first cycle may take up to a week to complete.

If you have files on a non-RAID disk that you were relying on amanda to 
protect, we suggest that you consider copying them to another temporary 
location to protect yourself from catastrophic failure during the period 
while amanda catches up.  Possibilities include using rsync to copy 
files to a WestGrid account, or to /NOT_BACKED_UP/home/<username> on 
trcomp01 or trcomp02.  We request that you do not use trshare for this 
purpose as there is not much space on that system.
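
As a rough illustration of the rsync suggestion above, here is a minimal 
Python sketch that builds (and optionally runs) such a copy.  The helper 
names, the example source path, and the choice of the -a and --partial 
options are assumptions made for illustration; it also assumes your 
login name is the same on trcomp01 and that you can reach it by ssh.  By 
default it only prints the command so you can check it first.

```python
import shlex
import subprocess


def build_rsync_command(source_dir, username, host="trcomp01"):
    """Return the rsync argument list for a temporary copy (hypothetical helper)."""
    # No trailing slash on the source, so rsync recreates the directory
    # itself under the destination instead of spilling its contents there.
    source = source_dir.rstrip("/")
    destination = f"{username}@{host}:/NOT_BACKED_UP/home/{username}/"
    # -a preserves permissions and timestamps; --partial keeps partly
    # transferred files so an interrupted copy can be resumed.
    return ["rsync", "-a", "--partial", source, destination]


def copy_to_scratch(source_dir, username, host="trcomp01", dry_run=True):
    """Print the rsync command, or run it when dry_run is False."""
    command = build_rsync_command(source_dir, username, host)
    if dry_run:
        print(shlex.join(command))
        return 0
    # Return rsync's exit status; 0 means the copy completed.
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    # Example path and user name are placeholders, not real accounts.
    copy_to_scratch("/home/alice/thesis", "alice")
```

Calling copy_to_scratch(..., dry_run=False) performs the actual copy; 
rerunning it later transfers only files that have changed.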

Our core Linux servers, on which you may have files, are backed up using 
an alternate system and are not affected by the amanda failure.  These 
include trshare, ibm00, trcomp01, trcomp02, trmail, trserv, elog, 
indico, and various web servers, including content-management systems 
and the docushare server.

We will keep you informed of the status via this mailing list. The 
status of backups that is usually available at 
http://amanda.triumf.ca/~amanda will be unavailable or out of date until 
the full cycle completes.  That page is updated at the end of each 
backup cycle.

Apologies to all for the delay in this announcement while we evaluated 
the extent and impact of the failure and determined a plan for 
recovery.  We'd also like to thank Konstantin Olchanski of the DAQ 
group for taking the lead in the recovery effort.

--
Kel Raywood
Core Computing and Networking


