Priorities

  • High - restore within a couple of days
  • Medium - restore within a week
  • Low - restore within a couple of weeks

MPU kit in racks 0, 1 and 2

  • budapest
  • telford
  • district
  • lochranza
  • bioboy (used by bakerloo)

MPU services affected

  • IF KVM service (bakerloo)
  • lcfg.org services
  • Upstream RPMs AFS R/W volume
  • Wake on LAN service
  • Backup VMware server
  • OpenAFS build host

IF KVM service (bakerloo)

Description
The IF KVM service (bakerloo), which currently hosts a small number of KVM guests, would be unavailable.
Data
Guests' configs and virtual disks
State of data backup
No backups, so the guests would be lost.
Priority to restore
Medium (High if there is increased demand on VM server capacity).
Steps taken to restore
Guests could be reinstalled on other KVM servers - circle or northern if a guest doesn't care which wire it's on; metropolitan or central if a guest needs to be on a Forum-only wire. In a crisis the demand on VM server capacity would be likely to grow due to the shortage of physical hardware. Most of the MPU guests on northern are safely sacrificial, which would provide capacity. Bakerloo itself wasn't damaged, so it could have been returned to service quickly had sufficient SAN space been available.

lcfg.org services

Description
{rsync,wiki,www,bugs}.lcfg.org would be completely unavailable.
Data
web, twiki and bugzilla data. This is stored on the SAN so wasn't lost. Also autogenerated releases data.
State of data backup
twiki and bugzilla mirrored to lammasu; web data mirrored to unicorn. Backup valid.
Priority to restore
Medium for rsync, wiki and static web; Low for bugs and dynamic web. IS and other schools rely on the rsync service to fetch LCFG configuration updates.
Steps taken to restore (onto sl6_64)
We first attempted to bring back the rsync service, then the web services. Bugzilla was not restored as we expected its upgrade to SL6 to be time-consuming and complex. Given an SL5 machine, this would all have been far simpler.
Just rsync.lcfg.org
  • include dice/options/lcfgrsync.h, then run updaterpms.
  • Restore the lcfgdatadisk mirror to /disk/data on the new machine. (To restore, duplicate the original lcfgdatadisk rsync module on the new machine but make it temporarily read-write, then use rsync on the mirror server to push the data to the new machine's lcfgdatadisk rsync module - see the sketch after this list.)
  • Add rsync to the ipfilter.export resource.
  • rfe dns/lcfg_org to point rsync at the new machine.
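
A rough sketch of the data push and DNS switch, run from the mirror server once the new machine's lcfgdatadisk rsync module has been made temporarily read-write. The module name, /disk/data and rfe dns/lcfg_org come from the steps above; the hostname "newhost" and the mirror's local path are placeholders.

    # On the mirror server: push the mirrored data back into the new
    # machine's (temporarily read-write) lcfgdatadisk rsync module.
    # "newhost" and the local mirror path are placeholders.
    rsync -av /path/to/lcfgdatadisk-mirror/ newhost::lcfgdatadisk/

    # Once the data is back, repoint the DNS alias at the new machine
    # (and remember to make the rsync module read-only again).
    rfe dns/lcfg_org
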
rsync, www, wiki.lcfg.org
  • define DICE_OPTIONS_BUGZILLA_SERVER in the machine profile. This should exclude the bugs.lcfg.org configuration.
  • include dice/options/lcfg-webservices.h, then run updaterpms.
  • Restore the lcfgdatadisk mirror to /disk/data on the new machine. (To restore, duplicate the original lcfgdatadisk rsync module on the new machine but make it temporarily read-write, then use rsync on the mirror server to push the data to the new machine's lcfgdatadisk rsync module, as for rsync.lcfg.org above.)
  • Define lcfgweb.basedir to be /disk/data/lcfg.org
  • Add the following packages to satisfy dependencies (since this is a hasty conversion of an SL5 configuration): perl-DateTime-0.5300-1.el6, perl-Class-Singleton-1.4-6.el6/noarch, rpm-build-4.8.0-16.el6_1.1.
  • Add -bugs-lcfg-org-files-*-* to profile.packages to take care of remaining bugzilla-related file dependencies.
  • If the new machine does not have a spare 150GB or so on /disk/data then set the following resources to /dev/null to suppress the copying in of install CD images: lcfgweb.cdpath_sl5_i386, lcfgweb.cdpath_sl5_x86_64, lcfgweb.cdpath_sl6_i386, lcfgweb.cdpath_sl6_x86_64
  • Reboot, then check that the new components started successfully. (The overall sequence is sketched after this list.)
  • rfe dns/lcfg_org to point rsync, www and wiki at the new machine.
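
The same sub-list as a command-level sketch, assuming the replacement machine is called "newhost", its profile is edited with rfe lcfg/newhost and packages are pulled in with om updaterpms run (check both against local practice); the profile directives in the comments are just the ones listed above.

    # Edit the new machine's profile ("newhost" is a placeholder; map name
    # assumed to be lcfg/<hostname>) and add the directives from the list
    # above, roughly:
    #   #define DICE_OPTIONS_BUGZILLA_SERVER
    #   #include <dice/options/lcfg-webservices.h>
    #   lcfgweb.basedir set to /disk/data/lcfg.org, the extra perl and
    #   rpm-build packages, the -bugs-lcfg-org-files-*-* entry, and (if
    #   space is tight) the lcfgweb.cdpath_* resources pointed at /dev/null
    rfe lcfg/newhost

    # Pull in the new packages, reboot, then check the new components.
    om updaterpms run
    reboot

    # Finally repoint rsync, www and wiki at the new machine.
    rfe dns/lcfg_org
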
rsync, www, wiki, bugs.lcfg.org
  • Porting bugs.lcfg.org to SL6 seemed beyond the scope of this exercise and was not attempted.

Upstream RPMs AFS R/W volume

Description
Upstream RPMs AFS RW volume lost - RO copy (on unicorn) still available. Can't mirror from upstream SL and EPEL.
Data
RPMs on the AFS volume. The volume is physically intact on the atabeast, but it is easier to promote the RO copy to RW than to mount the volume on a new AFS server.
State of data backup
The RO copy is intact. SL6 (but not SL5 or EPEL) is stored on sauce (the DR server).
Priority to restore
Medium.
Steps taken to restore
Promote the RO copy (on unicorn) to RW, though we are unlikely to want to do this until there is sufficient disk space (250GB) for both an RW and an RO copy (see the sketch below).
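
A sketch of the promotion with standard OpenAFS vos commands. Only unicorn is taken from this page; the volume and partition names are placeholders, and the last two commands simply re-create and release an RO clone once the ~250GB needed for both copies is available.

    # Placeholders - substitute the real volume and partition names.
    vol=some.rpms.volume     # placeholder volume name
    part=/vicepa             # placeholder partition

    # Check the VLDB entry and the surviving RO site on unicorn.
    vos examine "$vol"

    # Promote the RO copy on unicorn to a RW volume.
    vos convertROtoRW -server unicorn -partition "$part" -id "$vol"

    # Later, once there is space for both copies, re-create an RO clone
    # alongside the new RW volume and release it.
    vos addsite -server unicorn -partition "$part" -id "$vol"
    vos release -id "$vol"
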

Wake on LAN service

Description
Wake on LAN web service unavailable
Data
None (all configuration is in LCFG)
State of data backup
n/a
Priority to restore
Low
Steps taken to restore
Simple reinstall of an SL6 KVM guest

Backup VMware server

Description
Backup VMware server unavailable
Data
None
State of data backup
n/a
Priority to restore
Low
Steps taken to restore
Acquire new kit and reinstall

OpenAFS build host

Description
Build host used by the OpenAFS team - only notionally an MPU responsibility.
Data
?
State of data backup
?
Priority to restore
?
Steps taken to restore
?

Deficiencies

  • We believe that profiles should document the powerbar, ether switch, FC ports and FC LUNs used by a host
  • Bakerloo and central were using off-rack power - now fixed.
  • Not all LCFG profiles had switch/fpdu details, and some were out of date - now fixed
  • How do we know which VM servers host which guests if the VM server is down? A manual list is maintained in the wiki, but it gets out of date quickly. (See the sketch after this list for one way of keeping such a list automatically.)
  • All virtual guests should record hosting server in LCFG profile
  • The SL5 PXE installer was broken (and possibly had been for some weeks). Now fixed, but this needs to be added to our testing procedures.
  • SL5 guests could only be created on VMware servers as our KVM service doesn't support SL5.
  • telford profile doesn't record which SAN device serves AFS data.
  • budapest profile doesn't record which SAN device serves AFS data
  • sauce should also store SL6 EPEL
  • Resources will be scarce, so we will need some form of brokering system
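
As a partial answer to the "which guests are on which VM server" deficiency above, a low-tech sketch: a cron job on each KVM server that dumps its guest list somewhere that outlives the server. It assumes libvirt's standard virsh client; the destination path is a placeholder.

    # Run periodically (e.g. from cron) on each KVM server.
    # The destination is a placeholder - anywhere still readable when the
    # VM server itself is down (AFS, the wiki host, etc.) will do.
    dest=/afs/inf.ed.ac.uk/group/mpu/vm-inventory

    # Record every defined guest (running or not) against this server's name.
    virsh --connect qemu:///system list --all --name \
        > "$dest/$(hostname -s).guests"
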

-- AlastairScobie - 27 Feb 2012
