Services Unit report following the powercut exercise

Brief report on the power cut exercise to the Forum server room.

I was the only Services Unit member in that day, but it so happened that I chose to start it in the Forum, and I didn't see the notification about the power cut in the chat room at 10:13, or receive the personal chat notifications, until I came to my desk at 10:30. I did check my mail on my phone periodically while in the Forum, but I don't check chat on my phone.

Had this been a real power cut, I presume I'd have noticed it in person, and if I hadn't made my presence known, that someone would have tried phoning my mobile.

Action taken

On the discovery of the power cut, we (the Services Unit) would have:

Made sure colleagues were also aware of the power outage.

Given our OK/consent to shutting down all the low priority machines via the critical_shutdown script.

Provided a quick list of the services affected by the low priority machines, the main ones being homepages.inf and the file server riddles, which provides the AMI and CSTR NFS partitions.

Medium priority machines: none according to the criticality netgroup. Manually I'd have said phantom (and its associated SAN array) and lammasu. Both host research data, believed to be used by a few people. If riddles is classed as low priority, then these 2 should be as well, or riddles should be upgraded to "medium".

Left the high priority machines until we dared turn them off, or trusted the upsmon component to do it for us. The problem with leaving it to upsmon is that it doesn't shut down the SAN arrays. We really want the SAN-attached servers to go down cleanly before the SAN does. I think if the drain on the UPS looked steady, then I'd probably start shutting down the AFS servers with 10 minutes of runtime left, and notify support that we were doing so.

Our high priority machines: the main AFS servers, the physical afsdb server, www.inf and the 2 SAN arrays.
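
The ordering constraint above (SAN-attached servers go down cleanly before the SAN does, starting at about 10 minutes of UPS runtime) can be sketched as a small dependency ordering. This is only an illustration under assumptions, not how critical_shutdown or upsmon actually work: the host names and the dependency map are hypothetical placeholders, and it needs Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter

# Hypothetical map of host -> the storage it depends on.
# These names are placeholders, not real inventory entries.
DEPENDS_ON = {
    "afs-server-1": {"san-array-1"},
    "afs-server-2": {"san-array-1", "san-array-2"},
    "san-array-1": set(),
    "san-array-2": set(),
}


def shutdown_order(depends_on):
    """Return hosts ordered so each host comes before what it depends on."""
    # static_order() lists dependencies first (a safe boot order);
    # reversing it gives a safe shutdown order.
    boot_order = list(TopologicalSorter(depends_on).static_order())
    return list(reversed(boot_order))


def should_start_shutdown(runtime_left_min, threshold_min=10):
    """True once the estimated UPS runtime has fallen to the threshold."""
    return runtime_left_min <= threshold_min


if __name__ == "__main__":
    if should_start_shutdown(runtime_left_min=9):
        for host in shutdown_order(DEPENDS_ON):
            print("would shut down", host)
```

The point of the ordering is simply that every server appears before any array it uses, so nothing is still writing to a SAN that is being powered off.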

While all this was going on, we could be preparing for a switch of www.inf to the DR hardware at KB. This relies on DNS changes. https://ikiw.inf.ed.ac.uk/DICE/ServiceUnitWwwOffsiteDrPlan

Depending on how long the power was likely to be off, we would have to consider whether we wanted to promote offsite RO volumes to become new RWs. I think we'd resist this for as long as we could, to avoid the reconciliation that would be required once power was restored to the original RWs. Perhaps we could promote ROs only for individuals who had a pressing need for a RW home directory, and we could explain the issues of doing so.

Issues

This exercise has revealed a few issues:

  • inventory location information either wrong or missing
  • inconsistent criticality classification of research file servers
  • www.inf DR instructions don't mention web.inf
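
The second issue above (inconsistent classification of research file servers) is the sort of thing that could be checked mechanically. A minimal sketch, assuming we had a list of (host, role, criticality) records: the record format, the role labels and the function name are all hypothetical, and the "medium" values for phantom and lammasu are the manual suggestion from this report, not what the criticality netgroup currently says.

```python
from collections import defaultdict

# Illustrative records only: (host, role, criticality).
# riddles is "low" per this report; "medium" for phantom and lammasu is
# the manual suggestion, not netgroup data. Role labels are assumptions.
MACHINES = [
    ("riddles", "research-file-server", "low"),
    ("phantom", "research-file-server", "medium"),
    ("lammasu", "research-file-server", "medium"),
    ("homepages", "web", "low"),
]


def inconsistent_roles(machines):
    """Return roles whose machines do not all share one criticality class."""
    classes_by_role = defaultdict(set)
    for _host, role, criticality in machines:
        classes_by_role[role].add(criticality)
    return {
        role: sorted(classes)
        for role, classes in classes_by_role.items()
        if len(classes) > 1
    }


if __name__ == "__main__":
    for role, classes in inconsistent_roles(MACHINES).items():
        print(f"{role}: mixed criticality {classes}")
```

Any role reported here would need a decision one way or the other, e.g. downgrading phantom and lammasu or upgrading riddles.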

There was a certain amount of expectation that we could get to the LCFG profile master and rfe server to check the details of some machines to determine their impact/usage.

We will look at replicating the source DICE wiki content to the Infrastructure netmon machines, though we do already have the offsite mirror, ikiw.inf.ed.ac.uk.

What about the important VM services: mail.inf, afsdb servers, jabber, wiki.inf? When do MPU pull the plug on those?

Note: Computing staff with AFS home dirs in the Forum:

Adam, Alison, Alastair, Carol, Chris, George, Graham, Gordon, Iain, Ian, Jennifer, Lindsey, Ross, Roger, Richard, Stephen, Tim, Toby.

i.e. everyone except Craig and Neil.

-- NeilBrown - 24 Nov 2017

Topic revision: r1 - 24 Nov 2017 - 16:46:13 - NeilBrown
 