Identifying Kit

This was not as straightforward as we expected. We didn't have any plans showing where the servers are located in the Forum server room, nor were we sure which edge files to check. After confirming which switches to check, we identified 9 servers - see ServiceDisruption.

Backup Checks

Most of the servers we manage are for computational use by Institutes (rather than us providing any specific service). It was therefore not too surprising that a significant amount of data was lost. However, we checked that all the data we expected to be backed up was being backed up correctly. The 'user-owned' data is not routinely backed up by us; it is the responsibility of the user to have backup procedures in place if necessary.
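
As a rough illustration, a coverage check along the following lines could be used, assuming we keep a plain-text list of paths we expect to be backed up and can dump the backup system's include list to a second file. The filenames and the one-path-per-line format here are assumptions, not the procedure actually used during the exercise.

```python
# Hypothetical sketch: compare an expected-backup list against the
# backup system's include list. Filenames and format are assumptions.

def read_paths(filename):
    # One absolute path per line; blank lines and '#' comments are ignored.
    with open(filename) as f:
        return {line.strip() for line in f
                if line.strip() and not line.lstrip().startswith("#")}

expected = read_paths("expected_backups.txt")
included = read_paths("backup_includes.txt")

missing = sorted(expected - included)
if missing:
    print("Expected to be backed up, but not in the include list:")
    for path in missing:
        print("  " + path)
else:
    print("All expected paths appear in the include list.")
```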

Test Recovery

We initially identified castor and pollux as possible candidates for the test recovery. They have been set up so that some data on each is mirrored to the other machine. Unfortunately, they are situated in the same rack, so in a real disaster scenario all of that data could be lost. We also discovered that castor and pollux had been moved but the old entries in the edge files had not been removed, so they were not actually in any of the racks affected by the flood. This has been corrected.
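
For reference, a minimal sketch of what this style of cross-machine mirroring could look like, assuming rsync over ssh and a hypothetical /data/mirrored directory; the real paths, schedule and direction of copy were not confirmed as part of this exercise.

```python
# Hypothetical sketch only: mirror an assumed /data/mirrored directory
# from the local machine (e.g. castor) to its partner (e.g. pollux).
import subprocess
import sys

SOURCE = "/data/mirrored/"              # assumed local data directory
DESTINATION = "pollux:/data/mirrored/"  # assumed matching path on the partner

# --archive preserves ownership/permissions/timestamps; --delete keeps the
# mirror an exact copy rather than accumulating files removed at the source.
result = subprocess.run(
    ["rsync", "--archive", "--delete", SOURCE, DESTINATION],
    capture_output=True, text=True,
)
if result.returncode != 0:
    sys.stderr.write("Mirror run failed:\n" + result.stderr)
```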

We therefore decided to restore the devproj service instead, despite it not having been affected by the flood. This was done successfully in a couple of hours.

Actions

  • Move either castor or pollux to AT.
  • Spoke to Jim Bednar about the potential loss of his 3 jupiter machines (around 4 TB of data). He was aware that the data wasn't being backed up, but would like to purchase additional storage for mirroring it.
  • US Unit would like to consider having some minimal documentation in paper format.
  • Consider checking that everyone using Institute servers is aware that data on scratch disks is not backed up - perhaps as part of the motd? (See the sketch after this list.)
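
A minimal sketch of the motd idea, assuming that appending to a flat /etc/motd is appropriate on these servers (some distributions generate the motd from /etc/update-motd.d instead) and that the wording below is acceptable:

```python
# Hypothetical sketch: append a scratch-disk warning to /etc/motd once.
# Requires root; the path, wording and approach are assumptions.
NOTICE = "NOTE: data on scratch disks is NOT backed up - keep your own copies.\n"

with open("/etc/motd", "a+") as motd:
    motd.seek(0)
    contents = motd.read()
    # Only add the notice if it is not already present, so repeated runs
    # of the script stay idempotent.
    if NOTICE.strip() not in contents:
        if contents and not contents.endswith("\n"):
            motd.write("\n")
        motd.write(NOTICE)
```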

Comments

  • Noticed a couple of machines in the edge files that I think are no longer around, or are at least switched off (e.g. ipanema, leblon), so we should check currency (see the sketch after this list).
  • How do we agree priority amongst all the 'low-priority' Institute servers?
  • Units seemed to work pretty much in isolation. In a real disaster situation, we would need more communication with each other. Meetings with representatives from each Unit? Email/chat communication wouldn't be enough.
  • Difficult to simulate communicating with the users. We nominated a representative, but we are not aware of any other Unit doing the same - see NotifyUsersNotes. We would need to be more proactive about providing users with updates in a real situation.
  • Should the CSO be invited to Unit meetings?
  • Hopefully most data would be recoverable from mirrors rather than tape. However, if we did need to recover from tape, how would we prioritise those jobs, and should one person co-ordinate this?
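
On the edge-file currency point above, a minimal sketch of a check, assuming the edge files have one hostname as the first whitespace-separated field per line (the real format may differ). A host that no longer resolves in DNS is likely retired; a machine that is merely switched off would still need a ping or similar check.

```python
# Hypothetical sketch: report edge-file entries that no longer resolve in DNS.
# The edge-file format (hostname as first field) is an assumption.
import socket
import sys

edge_file = sys.argv[1] if len(sys.argv) > 1 else "edge_file.txt"

with open(edge_file) as f:
    hosts = [line.split()[0] for line in f
             if line.strip() and not line.lstrip().startswith("#")]

for host in hosts:
    try:
        socket.gethostbyname(host)
    except socket.gaierror:
        print("no longer resolves: " + host)
```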

-- AlisonDownie - 05 Mar 2012
