Forum Power Failure Monday 11th November 2013

We suffered a power failure in the Informatics Forum on Monday 11th November 2013, starting at about 11:00 and ending at about 13:00.

In these circumstances it is always good to carry out a post-mortem, the results of which will enable us to improve our handling of the next such event. Each unit should jot down any problems, hiccups or thoughts they have on how we handled the power failure (either good or bad). Anything which you think is more generally applicable can go in the Miscellaneous section (or under any other heading you would like to add).

So that it will still be fresh, this will be discussed at the Operational Meeting on Wednesday 13th November so please fill in anything for your unit as soon as possible.

Miscellaneous

(Chris) Can we ensure that computing staff desks continue to have substantial battery-backed power (edit: and connectivity) when the building power goes? For instance:

  • Retaining the present UPS cover for Forum offices
  • Individual UPSes for computing staff desks
  • Routine presence of a laptop on all computing staff desks
This will be a critical issue if and when the Building UPS is withdrawn from service.

Did anyone actually throw a big switch to kill the power to the server room, or did the batteries just run flat?

  • The latter!

Does a loss of power like this constitute a "disaster"? I'm probably making an arbitrary distinction, but is it a disaster if people can't print or send emails for a couple of hours? OK, we don't/didn't know it was only going to be a couple of hours. Obviously if it is relatively easy to provide an alternative service/server we should, but some things are harder to switch back from, e.g. what if we'd decided to promote all affected AFS read-only copies to being the live RW copies? Syncing changes with the new RWs when the old RWs come back isn't simple. - neilb

A few misc points (ascobie):

  • How do we modify DNS once the Forum is out of action? Are there instructions for this?
  • The status pages weren't updated in good time (with the exception of AFS). We need to do much better on this.
  • The status "home" page says it was last reviewed on 27/11/2012, which could give the impression that the information is out of date.
  • Couldn't see the UPS status of the Forum core, so couldn't work out how long the network would last
  • Couldn't see the status of the Building UPS, so couldn't work out how long the offices would last (where a UPS is network-monitored, see the upsc sketch after this list)
  • Some computing staff were unclear as to what they could do without a home dir (and how)
  • No SMS monitoring of UPS status (as we have for over-temperature)
  • No information from E&B
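
For context, where a UPS is network-monitored (as the server-room UPSes are, via NUT), its state can in principle be queried from any machine with the NUT client tools installed. The sketch below is illustrative only - the UPS and host names are placeholders, not our real configuration:

    # Query charge, estimated runtime and status from a NUT server
    # ("forum-ups1@nutserver" is a made-up name - substitute the real one)
    upsc forum-ups1@nutserver.inf.ed.ac.uk battery.charge
    upsc forum-ups1@nutserver.inf.ed.ac.uk battery.runtime
    upsc forum-ups1@nutserver.inf.ed.ac.uk ups.status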

Infrastructure Unit

In addition, some general thoughts:

  1. I don't think we particularly 'dithered' in our response to shutting machines down, but:
    • it's true that we don't have a clear strategy;
    • we didn't receive timely information about the expected duration of the power cut; and,
    • when we did realise its seriousness, the very short time-limit (< 20 minutes) didn't help us.
    We need the UPS fixed.
  2. To emphasise the above: in order to make decisions, we need as much information about the event as possible. We didn't get that yesterday.
  3. Where is the 'emergency shutdown' script documented so that it can be used quickly? A sign/poster in the server room containing all such info might be helpful (and wouldn't preclude the same info being made available elsewhere.)
  4. Manually shutting down machines in the self-managed server room (in an effort to eke out the remaining power) is rather painful.
  5. If we are reasonably sure that the power will go, then turning off 'low criticality' etc. machines to shed the load is useless: we'd be better off making sure that 'high criticality' machines were carefully shut down instead. Of course there might then be chicken-and-egg problems affecting other machines which need to be subsequently shut down. (At what point do we decide to shut down machines? What is the minimum amount of time we need to cleanly shut down all machines? - ascobie)
  6. Working in conditions of emergency lighting makes any recabling or similar work more-or-less impossible. (I needed to cable the infrastructure monitor into a UPS-backed supply so that I could use 'abbado'.) Maybe we need a headtorch or two in the server room? Seriously.
  7. What is the 'XML/migrated KVM server' problem referred to in the chatroom? Do we need to worry about it, and/or do anything about it?
  8. What's the alternative COs chatroom? And where are the instructions about how to configure a chat client for it?
  9. Chris's points above about 'Individual UPSes for computing staff desks' etc. probably only make sense if COs' DICE desktop machines still function (well enough, anyway) when all the support they need from servers has gone. There are obvious implications regarding the deployment of things like LDAP (to pick one thing currently being considered). My own thoughts in times of complete breakage are to abandon DICE desktops, and to revert to laptops which are configured/used in a way that doesn't rely on any of our infrastructure.
  10. We now have suspicions that the Forum UPS NUT arrangements didn't work properly yesterday. That needs careful checking.

There are a couple of issues with the Forum UPS NUT arrangements:

  1. Machines using the forum-server-room.h header are set up to slave to two of the network servers, each of which polls one of the "server room" UPSes. This is normally the most robust arrangement, as it means there should still be protection if one of those network servers, or its link to the corresponding UPS, is down. Unfortunately in this instance the faulty UPS never signalled that it was on-battery, let alone low-battery, and as a result all the slaves thought there was still enough power for them not to need to shut down. (An illustrative upsmon.conf snippet follows this list.)
  2. The UPSes seem to be set up to signal "low-battery" at around 5% charge. With one faulty UPS not delivering any power at all, that might not have been enough time for a clean shutdown in any case. The management card doesn't offer the ability to change this by the documented mechanism, so presumably the UPS doesn't allow it to be changed remotely. It may be settable through the front panel...
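
For reference, the slave relationship described in point 1 is expressed via MONITOR lines in upsmon.conf. The snippet below is an illustration only - the UPS names, hostnames, credentials and the MINSUPPLIES value are assumptions, not a copy of our configuration - but it shows why a UPS that never raises the on-battery/low-battery flags can leave slaves running:

    # upsmon.conf on a forum-server-room.h slave (all names illustrative)
    # MONITOR <ups>@<host> <powervalue> <user> <password> <master|slave>
    MONITOR serverroom-ups1@netserver1.inf.ed.ac.uk 1 monuser secret slave
    MONITOR serverroom-ups2@netserver2.inf.ed.ac.uk 1 monuser secret slave
    # With MINSUPPLIES 1, the slave only powers off once the number of
    # "good" supplies drops below 1 - i.e. both UPSes must report critical.
    # A faulty UPS that never signals low-battery therefore blocks shutdown.
    MINSUPPLIES 1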

-- gdmr

Managed Platform Unit

Key Points

The critical-shutdown script worked well enough but there is room for improvement. In particular, it needs to handle machines which are very slow to respond or which are already running another (separately initiated) shutdown process. It could background each shutdown process, and it could also run a number of processes in parallel to speed up the whole run.
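
As a rough sketch of the backgrounding and parallelism suggested above - not the actual critical-shutdown script, and with the hostnames, timeout and ssh invocation all assumed - the idea is simply:

    #!/bin/bash
    # Sketch: shut down hosts in parallel, giving up on any host that does
    # not respond (or is already shutting down) within a fixed time.
    HOSTS="serverA serverB serverC"   # placeholder host list
    TIMEOUT=120                       # seconds allowed per host

    for h in $HOSTS; do
        (
            if timeout "$TIMEOUT" ssh -o BatchMode=yes "root@$h" 'shutdown -h now'; then
                echo "$h: shutdown requested"
            else
                echo "$h: no response, or already shutting down" >&2
            fi
        ) &     # background each host so slow ones don't block the rest
    done
    wait        # collect all background jobs before exiting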

The critical-shutdown script assumes the availability of LDAP but CO desktops don't have local LDAP any more. Need to sort that out.

There was no clear strategy regarding who should run this script, or when each group of machines at a given level of criticality should be shut down. This led to a certain amount of dithering, and eventually we did not have sufficient time to cleanly shut down all the high-priority machines (which are the ones where a clean shutdown is the most critical).

(Chris) In the excitement I couldn't remember what the script was called or how to run it, and quick attempts to find its documentation failed. We could perhaps do with a clear and simple EMERGENCY SHUTDOWN wiki page to be linked to from the top of e.g. the main ManagedPlatformUnit page.

The packages server (telford) took a long time to come back as it had a dependency on evo4 (the LCFG profile comments were out-of-date, which confused the issue somewhat). This delayed the boot times of all other machines. Apache on the packages slave server (brendel) was not able to start; it seems that it could not reach any of the 3 AFS DB servers. Was this just a timing issue?

We should be able to serve packages from RO copies elsewhere. It turns out that while we have RO copies of the local package buckets, we don't have any for the upstream mirror buckets. We'll arrange for RO copies at KB.
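
For the record, arranging the RO copies at KB would presumably just be the usual AFS replication steps, along these lines (the volume, server and partition names are placeholders):

    # Add a read-only site for a hypothetical mirror-bucket volume on a KB
    # fileserver, then push the current contents to it.
    vos addsite kbfileserver.inf.ed.ac.uk /vicepa pkgs.mirror.example
    vos release pkgs.mirror.example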

There was some ambiguity about who should be starting up VMs. We propose that the MPU will go through all the VMs on the KVM servers and start them all up again, rather than expecting each other unit to do that for its own VMs.

Other Points

From a unit point-of-view we need better communication. Given we were not all physically co-located, it was hard to know what each person was working on fixing when the power was restored. We should probably use a separate mpu chatroom for this purpose.

The lcfg.org server (budapest) took a long time to come back as it had a dependency on evo4. This is not a critical service so was no big deal.

The Forum packages cache server (hare) did not start automatically. This was probably due to a BIOS configuration error.

The self-managed KVM server-to-be (metropolitan) did not start automatically. This was probably due to a BIOS configuration error.

Many VMs did not start automatically because they did not shut down cleanly. In some cases it was not possible to restart them without using the managedsave-remove command. At least one had been running so long that the XML config file was out-of-date and referred to a storage location which no longer exists.

It took a while to work out exactly how to run virsh managedsave-remove. An example of how to run it has now been added to the Simple KVM guests FAQ.
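
For convenience, the shape of that command is simply (the guest name here is a placeholder for whatever 'virsh list --all' shows):

    # Discard the stale managed-save image for a guest, then start it afresh
    virsh managedsave-remove someguest
    virsh start someguest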

VM warwick wouldn't start until its host oyster had a symlink /dev/mp2 pointing to the real storage location /dev/op1 - even though the kvmtool XML for warwick did correctly give /dev/op1 as the storage location.

Ditto for VM buchanan on hammersmith - it needed the symlink /dev/jp1 created and pointed at /dev/hp1, even though its kvmtool XML said /dev/hp1.
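
In both cases the immediate workaround was just a compatibility symlink on the host, along these lines (device names as reported above; run as root on the relevant host):

    # on oyster: point the old path at the real device
    ln -s /dev/op1 /dev/mp2
    # on hammersmith: likewise
    ln -s /dev/hp1 /dev/jp1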

Research and Teaching Unit

  • our own KVM guests didn't all come back but wbx seemed to have been shut down OK
  • it was frustrating not having control over return of MPU KVM guests
  • short time scale was a problem, obviously, but I think we'll have to re-evaluate criticalities more 'selfishly'.
  • scripted shutdowns were uncoordinated - but I'm aware these are going to be fixed
  • nagios was actually really useful
    • we have few enough machines that Iain & I were able to chomp through on shutdown (confirmation?)
      • ~gdutton/bin/nagios-ack made it easy to handle (will move to utils; see the acknowledgement sketch after this list)
    • unit status pages gave us a to-do list on return
      • though we'd forgotten one because of DICE_NO_NAGIOS
  • GD: I swear at IPMI consoles but it was fantastic to have the power control. Far easier than power bars (though they're a really important backup)
  • GD: I didn't even realise the power had failed: our office lighting is messed around with so often, and office power was of course OK.
  • GD: I couldn't use my phone-forward to mobile, because apparently the phone is actively in charge of this process.
  • Chat room was excellent, as ever. This is absolutely crucial to any success we had.
  • Laptops were also pretty much crucial, at least for communication.
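
For anyone without access to ~gdutton/bin/nagios-ack, the underlying mechanism is Nagios's external command file. A minimal, hedged equivalent might look like the sketch below (the command-file path, and acknowledging hosts by name with a fixed comment, are assumptions - our setup may differ):

    #!/bin/bash
    # Sketch: acknowledge a host problem via the Nagios external command file.
    CMDFILE=/var/spool/nagios/cmd/nagios.cmd   # path varies between installs
    HOST="$1"
    NOW=$(date +%s)
    # Format: ACKNOWLEDGE_HOST_PROBLEM;<host>;<sticky>;<notify>;<persistent>;<author>;<comment>
    printf '[%s] ACKNOWLEDGE_HOST_PROBLEM;%s;1;1;1;%s;planned shutdown\n' \
        "$NOW" "$HOST" "$USER" > "$CMDFILE"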

Services Unit

  • Not yet checked, but it doesn't look like the servers in the Forum shut themselves down cleanly (in time ?).
    Question: Are you including dice/options/forum-server-room.h in the profiles of these machines?
    Answer: Feck! We seem to have forgotten that for the new servers. Ta.
  • RAID arrays: it would be kinder to shut them down prior to power loss (currently a manual step)
  • Move of www.inf to KB seemed to go OK, possibly too well, as staff continued to update Plone (their changes will need to be manually copied back to wafer).
    Should we advise "minimal writes" in such circumstances? In general, should we promote an "emergency service" approach for these situations, or try and carry on as normal? roger
  • should we have DR mail server(s)? (ascobie: Dave's mailshot suggesting people work from home wasn't delivered until after power returned)
  • similarly an AT print server
    • We do! We just didn't make the switch. We need a checklist to make sure that things like this aren't missed.
  • Didn't spot cetus had separate problems.
  • In the rush to inform users of progress, make sure that what you are saying is correct ("AFS home directories back"!!)
    • We need better monitoring of the cell to spot problems, even if it's just something like a script to do a 'vos listvol' on each server and look for non-zero "volumes offline" values (see the sketch after this list).
  • Do we want a "move all computing staff to site X" script?
    • We could use the promoteRO script for this.
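
As a starting point for the monitoring suggested above, something like the sketch below would flag any fileserver reporting offline volumes. The fileserver names are placeholders, and the parsing of the vos summary line should be double-checked against real output:

    #!/bin/bash
    # Sketch: warn if any AFS fileserver reports a non-zero offline-volume count
    for server in afsserver1 afsserver2 afsserver3; do
        offline=$(vos listvol "$server" 2>/dev/null \
            | awk '/Total volumes onLine/ {
                       for (i = 1; i < NF; i++) if ($i == "offLine") sum += $(i+1)
                   } END { print sum + 0 }')
        if [ "$offline" -gt 0 ]; then
            echo "WARNING: $server reports $offline offline volume(s)"
        fi
    done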

User Support Unit

  • Did we feed enough information to the users ?
  • Did we have enough information to feed to the users ?
  • Was anyone clear on what the lines of communication were supposed to be ? NotifyUserDisruption

-- StephenQuinney - 11 Nov 2013
