November 2010 Power Shutdown - Infrastructure Unit notes
For reference, Dec2009PowerDownInfUnit contains our notes from last time around.
Organisational things we might want to do
Beforehand
- As all the switches will be powered down, we should make sure in advance that they have the most recent firmware loaded in flash, even though they might not actually be running it yet. That way, they'll all come back up running the new firmware when the power is restored.
- We should check marriner's BIOS settings as we shut it down, to ensure that it comes straight up when the power is restored.
- There might be a case for rebooting "important" machines in advance, to ensure that things which could delay the power-on (e.g. fsck, updaterpms) are done and out of the way.
- Can we do anything to nagios to keep it happy for the duration?
- Last time around we provided some TP leads for COs' laptops, though not many were used. This time we could pre-configure some ports on the sr99 switch on the dexion racking for that use (and possibly find a couple of C14-13A mains blocks too). Unfortunately the bars on these shelves are powered from DB-0.20, but the techs might be able to source a couple of 32A-commando extenders.
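On the nagios question above, one option might be to schedule fixed downtime for the affected hosts, so that notifications are suppressed for the duration. The sketch below only formats SCHEDULE_HOST_DOWNTIME external-command lines; it does not write them anywhere. The shutdown window, the author name "infunit" and the comment text are placeholders rather than anything agreed, and whether our nagios setup accepts external commands at all is an assumption.

```python
from datetime import datetime, timedelta


def downtime_command(host, start, end, author, comment, now=None):
    """Format one Nagios SCHEDULE_HOST_DOWNTIME external-command line.

    Field order: host;start;end;fixed;trigger_id;duration;author;comment.
    fixed=1 means the downtime runs exactly from start to end, and
    trigger_id=0 means it is not triggered by another downtime entry.
    """
    now = now or datetime.now()
    duration = int((end - start).total_seconds())
    return (
        f"[{int(now.timestamp())}] SCHEDULE_HOST_DOWNTIME;{host};"
        f"{int(start.timestamp())};{int(end.timestamp())};"
        f"1;0;{duration};{author};{comment}"
    )


# Placeholder window: the real shutdown date and times are not in these notes.
start = datetime(2010, 11, 20, 8, 0)
end = start + timedelta(hours=12)

for host in ("franklin", "barrett"):
    print(downtime_command(host, start, end, "infunit", "power shutdown"))
```

Each resulting line would then be appended to the nagios command file, wherever that lives on the nagios server.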
Forum server rooms
Power
- Generally, we want to take the opportunity of a complete shutdown (and restart) to trace all power connections, and to confirm that the labelling which correlates the circuit-breakers in distribution boards DB-0.14 (main server room) and DB-0.20 (self-managed server room) with their corresponding power bars (or other connections) is both correct and complete.
- Trace the 5 power bars in use on the Beowulf racks. Label the corresponding circuit breakers on distribution board DB-0.14. They should all be on L3.
- Trace the 2 power bars in use on the Physics DR racks. Label the corresponding circuit breakers on distribution board DB-0.20. They should both be on L2.
- Trace the under-floor 13A sockets used by the floor-tiles-with-fans (in both server rooms) and label the breakers accordingly.
- Trace the 13A wall sockets and label the breakers accordingly.
- Move s07.pdu to a DB-0.14-L2 socket (it's currently on L1) to match its partner s06.pdu. Label s07.pdu, and its corresponding circuit breaker on distribution board DB-0.14, accordingly.
- Last time around we kept the core switches running for the benefit of Appleton Tower. This time we don't need to, so we should just shut everything down.
- ... and so there's no particular order in which the comms-rack servers need to be shut down, other than doing marriner last of all.
- The network servers' BIOS settings should all be correct. Rather than check them all, we can just watch what happens when the power comes back.
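For the power-tracing items above, it may help to record what we find in a form that can be checked against what we expect. A minimal sketch follows: the expected boards and phases come from the notes above, but the Beowulf and Physics DR bar names are made up for illustration, and s07.pdu's current board is assumed to be DB-0.14.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feed:
    board: str  # distribution board, e.g. "DB-0.14"
    phase: str  # "L1", "L2" or "L3"


# Expected feeds. The s06/s07 entries and the phases are from the notes;
# the "beowulf-bar-N" and "physics-dr-bar-N" names are hypothetical.
EXPECTED = {
    "s06.pdu": Feed("DB-0.14", "L2"),
    "s07.pdu": Feed("DB-0.14", "L2"),
    **{f"beowulf-bar-{n}": Feed("DB-0.14", "L3") for n in range(1, 6)},
    **{f"physics-dr-bar-{n}": Feed("DB-0.20", "L2") for n in range(1, 3)},
}


def discrepancies(expected, traced):
    """Return {bar: (expected_feed, traced_feed)} for every bar whose traced
    feed differs from the expected one. Untraced bars appear with None."""
    result = {}
    for bar, want in expected.items():
        found = traced.get(bar)
        if found != want:
            result[bar] = (want, found)
    return result


# What we believe today: s07.pdu is still on L1, and nothing else has
# been traced yet.
traced = {
    "s06.pdu": Feed("DB-0.14", "L2"),
    "s07.pdu": Feed("DB-0.14", "L1"),
}

for bar, (want, found) in sorted(discrepancies(EXPECTED, traced).items()):
    print(f"{bar}: expected {want}, traced {found}")
```

An empty result after the shutdown would mean every bar has been traced and matches its intended board and phase.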
IT closets
We've had firmware upgrades pending for a while, and it would be handy to do a complete self-test, so we should power them all down beforehand and then bring them all back up one by one afterwards. (Probably excluding the basement closet, as it has the connections to the UPSes.) Don't forget the GPS-clock machine "farg" in the 5A closet.
Afterwards
- Do we want to turn everything on, or only the circuits where we think there should be something plugged in?
Infrastructure Unit machines
| Machine | Role | Comments |
| franklin | LDAP master | take dump of ldap db |
| barrett | KDC master | remove _kerberos._udp SRV record to speed up authentication; om kerberos push to slaves; take dump of kerberos db |
| osprey | cosign/ifriend KDC master | remove from weblogin alias the night before; om kerberos push to slave; take dump of ifriend kerberos db |
| mckinley | LDAP slave | move infdir alias; remove _ldap._tcp SRV record |
| panther | prometheus master | take paranoid dump of everything, can be shut down early |
| reeves | new prometheus machine | |
| fenrir | KDC slave/AFSDB | down late, up early; remove _kerberos._udp record |
| kingsmen | toby test machine | |
| harnoncourt | Forum extRt | |
| hogwood | Forum netInf | |
| hickox | Forum netServ | relocate power to netServ UPS |
| linnaeus | Forum extNS | |
| marriner | Forum consoles | Last down, first up. (Note that verte - in AT - is the console server for Forum network machines hogwood, harnoncourt and marriner.) |
| dunlin | Nagios secondary | |
| | | |
| core0 | Forum core switch | |
| core1 | Forum core switch | |
| core2 | Forum core switch | |
| core3 | Forum core switch | |
| srif | Forum SRIF PoP switch | |
| sr11 | Not in service yet | |
| Hot-spare switches | | can be turned off in advance |
| All server-room edge switches | | will be turned off as part of general power-down |
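The ordering hints in the Comments column can be turned into a rough shutdown and startup sequence. The sketch below encodes only the three constraints the table actually states (panther early, fenrir late, marriner last down and first up). Treating startup as the shutdown ranking reversed is an assumption, so panther coming up last is a side effect of the sketch and not something we've decided.

```python
# Shutdown rank: lower goes down earlier. Machines without a stated
# constraint share DEFAULT and keep their order from the table.
EARLY, DEFAULT, LATE, LAST = 0, 1, 2, 3

SHUTDOWN_RANK = {
    "panther": EARLY,   # "can be shut down early"
    "fenrir": LATE,     # "down late, up early"
    "marriner": LAST,   # "Last down, first up"
}

MACHINES = [
    "franklin", "barrett", "osprey", "mckinley", "panther", "reeves",
    "fenrir", "kingsmen", "harnoncourt", "hogwood", "hickox", "linnaeus",
    "marriner", "dunlin",
]


def shutdown_order(machines):
    """Sort by shutdown rank; sorted() is stable, so ties keep table order."""
    return sorted(machines, key=lambda m: SHUTDOWN_RANK.get(m, DEFAULT))


def startup_order(machines):
    """Assumed to be the shutdown ranking reversed, ties still in table order."""
    return sorted(machines, key=lambda m: -SHUTDOWN_RANK.get(m, DEFAULT))


print("down:", ", ".join(shutdown_order(MACHINES)))
print("up:  ", ", ".join(startup_order(MACHINES)))
```

Any further constraints we agree on (for example, keeping an LDAP or KDC machine up until late) could be added as extra entries in SHUTDOWN_RANK.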
-- TobyBlake - 29 Oct 2010
-- IanDurkacz - 28 Oct 2010
-- GeorgeRoss - 28 Oct 2010
Topic revision: r7 - 29 Oct 2010 - 10:56:13 - IanDurkacz