AT Gas Explosion

(Disaster Recovery exercise)

Affected services

| What | Backup/mirror | Plan |
| 4 AFS file servers: naga, cetus, minotaur, gorgon | tibs backups and mirror releases from 8pm Saturday | 1657 user volumes affected. Promote offsite ROs to RW, then create new RW space downtown for shuffling later. |
| stoater - AIAI web server | Mirrored to nix mirror17 and then tibs | Affected sites: aiai, binni, ksco, i-globe, i-x.aiai, vue.ed, openvce.net, atate.org, youtute, schooltute, equipment.inf. Rebuild and restore from mirror. If the .202 network were still unavailable, we'd have to update the DNS (where necessary) so that the affected sites point at the new non-.202 IP address. |
| cigar - Plone WCMS server | Mirrored to nix rmirror16 and then tibs | Lots of sites. Rebuild and restore from mirror. Similarly, if the .202 network were unavailable we'd need to update DNS entries. |
| afsdb running on Inf Unit's skoll | There is a mirror on nix rmirror16 | Wouldn't use the mirror: provided we have one working afsdb, just create a new (VM) one and it will get the data from the quorum. |

Affected due to .202. network being unavailable

| What | Backup/mirror | Plan |
| toaster - groups.inf and various other web sites, and also some related NFS group space | Mirrored to nix mirror16 | KVM on jubilee in the Forum, but on the .202 subnet. Plan would probably be to move the machine to .33 (or create a new VM on .33) and update DNS for the various sites. |

For the record, not affected

mail, www.inf, dice.inf, wiki.inf, printing, most group file space - see groups detail below

Other kit affected, but not providing a live service

atabeast1, satablade1, woblog, borges

More detail on the affected services

AFS

The four AFS servers use only local disk storage, not the SAN, so there's no option to remount their data. We'd have to rely on the nightly releases and backups, and make use of the DR data at KB.

Rather than listing all the affected user volumes here to be scraped, you can generate the list yourself with:

echo naga cetus gorgon minotaur | xargs -n1 /usr/sbin/vos listvldb -server | grep '^user\.'

As mentioned in the chat room, about 13 computing staff are affected by that. We'd probably have done "vos convertROtoRW" for those staff fairly quickly, to get them up and running again. We'd probably be more considered about the other users, as we may want to make sure we've got new AFS space online to create new offsite RO copies first.
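The promotion step for the priority users could be sketched roughly as below. This is only an illustration: the server name offsite1, the partition /vicepa and the usernames are hypothetical placeholders, and the volumes are assumed to be named user.<name>. The function only prints the "vos convertROtoRW" commands so they can be reviewed before anything is run.

```shell
# Print (do not run) a vos convertROtoRW command for each user volume.
# Assumptions: volumes are named user.<name>; the server and partition
# arguments are placeholders for wherever the offsite RO copies live.
promote_cmds() {
    server=$1
    partition=$2
    shift 2
    for user in "$@"; do
        printf 'vos convertROtoRW -server %s -partition %s -id user.%s\n' \
            "$server" "$partition" "$user"
    done
}

# Example: review the commands for two hypothetical staff accounts.
promote_cmds offsite1 /vicepa alice bob
```

Printing rather than executing keeps the sketch safe to try anywhere; the output could be piped to sh once it has been checked.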

From AFSPartitions (and just local knowledge) we can see that the raw storage required would be 12x455GB of data, but we can get an "actually used" figure from the size of the offsite RO volumes. This comes to roughly 1.8TB from a potential 5.5TB. We have 5TB free on ifevo3 and 16TB on ifevo4.
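The arithmetic behind those figures can be checked with a few lines of shell. This is just a rough sketch using the numbers quoted above, treating 1TB as 1000GB; the variable and function names are made up for illustration.

```shell
# Figures from the text above, in GB (1TB treated as 1000GB for a rough check).
partitions=12
partition_gb=455
used_gb=1800          # approximate size of the offsite RO volumes
ifevo3_free_gb=5000
ifevo4_free_gb=16000

raw_gb=$((partitions * partition_gb))
echo "raw capacity to replace: ${raw_gb}GB"

# Succeeds if the space needed ($1) fits in the space free ($2).
fits() {
    [ "$1" -le "$2" ]
}

fits "$used_gb" "$ifevo3_free_gb" && echo "used data fits on ifevo3"
fits "$raw_gb" "$ifevo3_free_gb" || echo "raw capacity does not fit on ifevo3 alone"
fits "$raw_gb" "$ifevo4_free_gb" && echo "raw capacity fits on ifevo4"
```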

stoater - aiai web server

Contact: Austin/AIAI

  • atate.org
  • openvce.net
  • oplan.aiai.ed.ac.uk
  • vue.ed.ac.uk
  • www.aiai.ed.ac.uk
  • www.aiai.inf.ed.ac.uk
  • www.i-globe.info
  • www.i-x.info
  • www.ksco.info
  • www.openvce.org

Contact: John Lee

  • youtute.inf.ed.ac.uk
  • schooltute.inf.ed.ac.uk

Contact: Francesco Figari

  • equipment-sharing.inf.ed.ac.uk

cigar - Plone WCMS web server

  • wcms.inf.ed.ac.uk including / cisa, icsa, ipab, lfcs, hcrc, ilcc, speechlabs, pepa, idar09, sspnet, jast, dice, sandbox
  • www.anc.ed.ac.uk
  • www.cisa.inf.ed.ac.uk
  • www.classic-project.org
  • www.emime.org
  • www.hcrc.ed.ac.uk
  • www.ilcc.inf.ed.ac.uk
  • www.mngu0.org
  • www.not-a-service.inf.ed.ac.uk
  • www.transfics.eu
  • www.ultrax-speech.org
  • migration.inf.ed.ac.uk
  • pbf2013.inf.ed.ac.uk

toaster - groups web sites

Run in the mirror of the web config, the following lists the server names that resolve to 129.215.202.26 or 129.215.202.60:

for i in `grep -i servername *.conf | awk '{print $NF}' | sort | uniq`; do echo -n $i; host -t A $i | awk '/has address/ {printf(" %s",$NF)} END {printf "\n"}'; done | grep -E '129\.215\.202\.(26|60)\b' 

  • aicat.inf.ed.ac.uk
  • aied.inf.ed.ac.uk
  • conferences.inf.ed.ac.uk
  • data.cstr.ed.ac.uk
  • dbibd-05.inf.ed.ac.uk
  • downloads.specknet.org
  • events.inf.ed.ac.uk
  • fordyce.inf.ed.ac.uk
  • groups.inf.ed.ac.uk
  • history.dcs.ed.ac.uk
  • hoppers.inf.ed.ac.uk
  • infcricket.inf.ed.ac.uk
  • inf.statmt.org
  • media.inf.ed.ac.uk
  • newbuildpics.inf.ed.ac.uk
  • openafs2012.inf.ed.ac.uk
  • proofgeneral.inf.ed.ac.uk
  • ref2014.inf.ed.ac.uk
  • touchscreens.inf.ed.ac.uk
  • uitp05.inf.ed.ac.uk
  • waim-05.inf.ed.ac.uk
  • workshops.inf.ed.ac.uk
  • www.arcs.im
  • www.bctcs.ac.uk
  • www.cav2005.inf.ed.ac.uk
  • www.computersciencepodcast.com
  • www.ehmn.bioinformatics.ed.ac.uk
  • www.entrepedia.org
  • www.etaps05.inf.ed.ac.uk
  • www.euphoria-project.eu
  • www.hscma2011.org
  • www.icdt2005.inf.ed.ac.uk
  • www.ilsi.inf.ed.ac.uk
  • www.inspace.ed.ac.uk
  • www.neurogems.org
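After a DNS update, a similar filter could confirm that none of the sites still points into the .202 subnet. The sketch below assumes the subnet is 129.215.202.0/24 (the prefix used in the grep above) and uses made-up hostnames and addresses as sample input; still_on_202 is a hypothetical helper name.

```shell
# Read "hostname address" pairs on stdin and print the hostnames whose
# address is still in 129.215.202.0/24.
still_on_202() {
    awk '$2 ~ /^129\.215\.202\.[0-9]+$/ {print $1}'
}

# Example with made-up data; in practice the pairs would come from `host -t A`.
printf '%s\n' \
    'old-site.example 129.215.202.26' \
    'moved-site.example 129.215.33.10' | still_on_202
```

An empty result would mean every listed site has been repointed.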

Some NFS group web areas are also served from toaster:

 rfe -g amdmap/group | awk '/^\[/ {sec=$1} /toaster1/ {printf("/group/%s%s\n",sec,$1)}' | tr -d '[' | tr ']' '/'
/group/bctcs
/group/project/aicat
/group/project/bioinformatics
/group/project/entrepedia
/group/project/hoppers
/group/project/ilsi
/group/project/inspace
/group/project/nxt
/group/project/perlis
/group/project/proofgeneral
/group/project/sicsa
/group/cisa/web
/group/conference/cav2005
/group/conference/compsac2005
/group/conference/emnlp08
/group/conference/emnlp09
/group/social/infcricket

Recovery Exercise

As a recovery test, it was agreed to bring up a replacement for cigar.inf, which involved creating a new VM to host the service and using the data mirrored at KB to restore the content of the various (Plone) websites. Useful information was available from the PloneTop5 and ServicesUnitPloneWCMSManagement wiki pages.

Procedure

Created a new profile for "cigar2" based on cigar.inf, having assigned an appropriate IP address. Then created VM with sufficient local disk space to contain the restored mirror data (size was calculated based on size of nix:/disk/rmirror16/wcms).
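The notes do not record how the disk size was derived from the mirror, so the following is only one possible approach: take the usage reported by du and add some headroom. The 50% default headroom and the function name disk_size_gb are assumptions made for this sketch.

```shell
# Suggest a disk size in whole GB: the mirror's usage plus a percentage
# of headroom. Input is usage in KB, as reported by `du -sk`.
# Assumption: 50% headroom by default; pass a second argument to change it.
disk_size_gb() {
    used_kb=$1
    headroom_pct=${2:-50}
    # Round up to a whole GB (1GB = 1048576KB here).
    echo $(( (used_kb * (100 + headroom_pct) / 100 + 1048575) / 1048576 ))
}

# On nix one might run:
#   disk_size_gb "$(du -sk /disk/rmirror16/wcms | cut -f1)"
disk_size_gb 20971520      # a hypothetical 20GB mirror
```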

Once host was up and running, created zope/plone instance and restored web data by importing saved-state (rather than just copying mirrored data):

  • initialised zope/plone
  • checked permissions and apache config
  • started zope by hand (to test setup)
  • copied over apache configuration directory from mirror
  • transferred over certs from mirror location
  • retested apache & browser
  • stopped zope and restored data
  • checked zope consistency (which revealed an omission in the backed-up data: the psycopg2 source file, psycopg2-2.0.14.tar.gz, wasn't being mirrored)
  • rebuilt zope DB
  • started Plone (in foreground to test)
  • checked OK, so restarted cleanly
  • checked websites (http://cigar2.inf.ed.ac.uk/zope/emime/, for example)
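
The final website check could be scripted along the following lines. This is a sketch rather than what was actually run: it assumes each site is reachable under /zope/<site>/ on cigar2, as in the emime example, and the function names are made up for illustration.

```shell
# Build the test URL for a Plone site on the replacement host.
# Assumption: sites live under /zope/<site>/, as in the emime example.
site_url() {
    printf 'http://cigar2.inf.ed.ac.uk/zope/%s/\n' "$1"
}

# Print the HTTP status code for each site; 000 means no connection was made.
check_sites() {
    for site in "$@"; do
        url=$(site_url "$site")
        code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url")
        printf '%s %s\n' "$code" "$url"
    done
}

# Usage (needs network access to cigar2):
#   check_sites emime
```
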
Topic revision: r5 - 29 Jan 2014 - 08:58:10 - RogerBurroughes
 