Forum power-down on Saturday 6th November 2010

Collected actions

| Who | What | Comments |
| MPU | Auto-shutdown DICE desktops |  |
| Alastair | sys-announce message |  |
| Alison | Arrange for managed (Windows) desktops to be shut down |  |
| George | Mail to selfmanaged-sr list | Done |
| Alison | Two more machines on dexion racking | All server-room machines to have local homedirs |
| Neil | Produce list of (un)available web sites | Nov2010WebSites |
| MPU | updaterpms workaround |  |
| ALL | Check for potential fsck delays |  |
| Neil | Dig out fsck-check script | Basic script at ~neilb/bin/share/check-fsck-at-boot; it just lists the current settings for /dev/ mounts |
| Ian | Arrange to turn off nagios |  |
| Alison | Display screen or flipchart |  |
| Craig | Check MFDs, and shut down if necessary |  |
| George | Query office/UPS with E&B |  |
| Neil | Create and link in a "(not-)affected" page | Done. Nov2010PowerDown |
| ALL | Populate Neil's page |  |
| Alison | ssh-server MOTD |  |
| George | Contact list |  |

Pre-meeting (2010-Oct-29) comments and questions

  • The power-down will affect the whole building, and is scheduled to start at 10:00.
    • ... but we would like everything off a few minutes before, to allow us to check the power-bar circuit labels
  • While the UPSes would keep things up for a short while, the server-room air-conditioning will be off, and so everything in there will be shut down, including:
    • all switches, and in particular the four core switches
    • the DNS master
    • the Kerberos master
    • the LDAP master
    • one of the AFSDB servers
  • We will turn off all the IT closets, to allow a clean reset and complete self-test afterwards
  • Appleton Tower and JCMB have their own power and network connectivity arrangements, and so should not be affected (other than for services located in the Forum)

  • What can we usefully do while the power is down?
    • specifically while the power is down, nothing at all in the Forum, as the building will be closed, but afterwards perhaps...?
    • for services unit we're thinking about shutting down the KB disk arrays (and therefore the KB file servers) so we can update firmware and possibly make SAN config changes. But what offsite provision will be relying on that KB SAN?
    • Can we learn anything about where our power is going during the power off? eg http://netmon.inf.ed.ac.uk/cgi-bin/ForumPower.cgi seems to show a base load of about 60 kVA for the building. If we're planning on powering off all desktop machines before the power goes off, we could note what that figure drops to. When powering back on, could we do it floor by floor to work out the draw per floor? Do we care?
      (Comment: a Dell GX-755 consumes 50W at idle. We probably have about 200 desktops per floor, so 1000 in the building. That gives a 50kW load at idle, so our desktop PCs probably more-or-less explain the base load figure we see. -- idurkacz)
    • Server BIOS config changes. I know that we have at least one machine, wafer, whose 2 extra NICs are not enabled in the BIOS and so can't be used at the moment. Do other units have machines whose BIOS settings they'd want to change before powering back up?
    • Machine moves, eg non-SAN machines from the SAN racks. I'm sure Services has one or two.
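
The desktop-load estimate in the comment above can be checked with a few lines of arithmetic. This is only a sketch: the 50 W idle figure and 200 desktops per floor come from the comment, the floor count of 5 is inferred from "1000 in the building", and the power factor of 1.0 used to compare kW against the roughly 60 kVA base load is an assumption, not a measured value.

```python
# Figures from the comment above; FLOORS is inferred from 1000 / 200.
IDLE_WATTS_PER_DESKTOP = 50
DESKTOPS_PER_FLOOR = 200
FLOORS = 5
BASE_LOAD_KVA = 60          # approximate reading from the ForumPower page
ASSUMED_POWER_FACTOR = 1.0  # assumption: lets us compare kW with kVA directly


def idle_load_kw(desktops, watts_each):
    """Total idle draw in kW for a number of identical desktops."""
    return desktops * watts_each / 1000


def share_of_base_load(load_kw, base_kva, power_factor):
    """Fraction of the apparent base load accounted for by load_kw."""
    return (load_kw / power_factor) / base_kva


per_floor_kw = idle_load_kw(DESKTOPS_PER_FLOOR, IDLE_WATTS_PER_DESKTOP)
building_kw = idle_load_kw(DESKTOPS_PER_FLOOR * FLOORS, IDLE_WATTS_PER_DESKTOP)
share = share_of_base_load(building_kw, BASE_LOAD_KVA, ASSUMED_POWER_FACTOR)

print(f"per floor: {per_floor_kw} kW, building: {building_kw} kW")
print(f"share of base load: {share:.0%}")
```

If the desktops really are powered back on floor by floor, the per-floor figure gives a rough expectation for each step in the measured load.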

  • Agreement on what we should be doing for users with machines in the self-managed server room. What arrangements should be made to power off equipment, and subsequently power it back on? This includes the Tardis rack.

  • Communicating the power down to end-users.

Notes from 2010-Oct-29 meeting

Present: George ("chair and minutes"), Craig, Alastair, Ross, Richard, Ian, Alison, Neil, Toby, Chris, Iain

We used the notes from last time (Dec2009PowerDownBSMeet) as a basis for our discussions.

Offices: MPU will arrange for DICE machines in offices to be shut down from about 03:00 onwards, possibly with a backstop job around 09:00 in case any have been restarted in the meantime. The sys-announce message (Alastair) will ask people to shut their machines down last thing on Friday anyway.

It is possible to arrange for the managed (Windows) desktops to shut down automatically, but it's probably just as easy for support to go round last thing to catch any which the users haven't already shut down. Alison to arrange.

Machines in the self-managed server room are the responsibility of their managers, who should make their own arrangements. George will mail out to the selfmanaged-sr list (done).

It would be useful to have a couple more machines in the server room for use during the shutdown and restart afterwards. These can go on the dexion racking. Alison will arrange. Ideally these (and the existing two) would have local home directories set up, so that they don't have any dependencies on the fileservers.

As it's a weekend, we'll just leave most services-unit services (e.g. mail) down for the duration. AT printing will be OK, as it's the intention to have a print server there anyway. Neil will produce a list of which web sites will be up and which won't be.

We don't want updaterpms to cause delays at reboot time. MPU will arrange something.

Likewise, we don't want fscks to delay things at reboot time. ALL to make sure that there are no important blockages. Neil has a script which can help identify potential problems.
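
As an illustration of the kind of check involved, the sketch below lists /dev/ entries in an fstab that have a non-zero fsck pass number (the sixth field), since those are the filesystems that may be checked at boot. It is not Neil's script, whose contents are not reproduced here; the function name and the sample fstab are hypothetical.

```python
def fsck_candidates(fstab_text):
    """Return (device, mount_point, passno) for /dev/ entries with passno > 0."""
    found = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 3 or not fields[0].startswith("/dev/"):
            continue  # only interested in local /dev/ devices
        # The dump and pass fields are optional and default to 0.
        passno = int(fields[5]) if len(fields) > 5 else 0
        if passno > 0:
            found.append((fields[0], fields[1], passno))
    return found


# Hypothetical sample; on a real machine read /etc/fstab instead.
SAMPLE_FSTAB = """\
# device     mountpoint     type  options   dump pass
/dev/sda1    /              ext3  defaults  1    1
/dev/sda2    /disk/scratch  ext3  defaults  1    2
/dev/sda3    swap           swap  defaults  0    0
afs          /afs           afs   defaults
"""

for device, mount_point, passno in fsck_candidates(SAMPLE_FSTAB):
    print(f"{device} on {mount_point}: fsck pass {passno}")
```

A non-zero pass number only means the filesystem is eligible for checking; whether a full check is actually triggered at the next boot depends on per-filesystem settings such as mount counts and check intervals, which this sketch does not inspect.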

A number of rfe masters are housed in the Forum and will be down. These should be priorities for rebooting afterwards.

It's likely that nagios will get upset again unless we do something. The decision is just to turn it off for the duration. Ian to arrange.

As it's a weekend, we don't anticipate a great rush of people coming back into the building looking for services to be up. We should provide general information, but don't need to go into detail. We might be able to use the display screen at the main entrance. Alison to arrange something.

Do the MFDs need to be shut down cleanly? Craig will check, and arrange to do so if necessary.

Can we use the event to check which offices do have UPS cover? It's not clear that we can. We'll ask E&B when we meet them.

We should produce a list of affected services (or, more likely, a list of not-affected services). Neil will create a web/wiki page, which will be linked into the banner presented by web servers on the day. ALL to populate it as necessary. This should be user-oriented!

Some ssh servers will stay up, and some users may try to use them. The MOTD should be set to say something useful. Alison to arrange.

We can start turning things off from 09:00, or possibly even before.

Power-on arrangements: we expect, as last time, to give E&B a couple of mobile numbers (George, a.n.other). Whoever E&B calls will then contact unit managers or their designated deputies, who will cascade the information.

What can we do at power-on time?

  • Check BIOS settings, particularly to enable network interfaces
  • Check bonding arrangements, and arrange to bond across controllers where possible
  • Firmware upgrades. Switches will do this automatically as necessary. Services-unit will check disc-arrays' firmware versions, and upgrade as required.

KB evo firmware

Services-unit would like to take advantage of the disruption to upgrade the firmware in the KB evo and discs.

  • Do we do it now, or wait until all the R/W volumes are in the Forum, when it'll be less disruptive?
  • It might be 6 months before that happens
  • It's not pressing, apart from when we try to report faults, when we'll probably be told to upgrade anyway!
  • The web management interface does lock up, but there are workarounds

After much discussion, the decision was to DO IT. Therefore, ALL AFS home directories will be unavailable for at least some of the power-down.

Users objecting should be directed to the Head of School in the first instance. Limited workarounds might be possible, but would take some effort to set up.

-- NeilBrown - 28 Oct 2010 -- GeorgeRoss - 29 Oct 2010

Topic revision: r12 - 01 Nov 2010 - 18:56:07 - AlastairScobie