Gas explosion drill - MPU report
Conclusion
No lessons were learnt from this particular exercise. Only services which were already replicated were affected.
Machines lost
The following MPU kit was affected by the simulated AT gas explosion:
- kubelik (student.ssh)
- KVM servers circle and waterloo
- wildcat (RPM cache and backup PXE)
- With waterloo we have also lost vermeer aka lcfg1, one of the two main LCFG slave servers.
The loss of waterloo knocks out these virtual machines:
| bank | MPU | test |
| barking | RAT | Trac server |
| borges | Services | backup print server |
| capon | Infrastructure | secondary Nagios server |
| wobleg | Services | test |
| spadina | RAT | projects.inf |
| vermeer | MPU | lcfg1.inf |
These VMs were lost along with circle:
| argus | RAT | testing forumtracker |
| arlott | INF | testing sl6 KDC |
| armitage | Services | student labs monitoring test |
| cardus | INF | testing sl6 cosign |
| circlevm0 | MPU | for testing |
| circlevm2 | MPU | for testing |
| circlevm3 | MPU | for testing |
| circlevm4 | MPU | for testing |
| circlevm5 | MPU | for testing |
| circlevm6 | MPU | for testing |
| circlevm7 | MPU | for testing |
| circlevm8 | MPU | for testing |
| circlevm9 | MPU | for testing |
| circlevm10 | MPU | for testing |
| dilley | INF | testing sl6 KDC |
| ekcof | RAT | testing Coltex |
| engadine | RAT | gdutton test |
| idoru | Services | gordon's test vm |
| keele | RAT | test portal |
| littlebird | RAT | iainr test |
| monmouth | RAT | iainr test |
| monty | INF | testing sl6 prometheus |
| moody | ? | Moodle |
Services affected
- LCFG
- One of the two main LCFG slave servers has been lost. The service will carry on more or less unaffected using the other slave server. The MPU is considering bringing up another slave server elsewhere. In the meantime the DNS aliases lcfg1 and lcfg3 have been moved to the other slave server rembrandt, safely in the Forum.
- Package cache and updaterpms
- We have lost wildcat, one of the two RPM cache servers serving cache.pkgs.inf.ed.ac.uk. This is the address from which updaterpms gets its RPMs on most DICE machines. We have altered the DNS to remove wildcat's IP address from cache.pkgs. Running om dns update, or waiting an hour, should be enough to get updaterpms working on DICE machines outside the Tower.
- SSH
- student.ssh.inf.ed.ac.uk aka ssh.inf.ed.ac.uk has gone. The MPU has brought its hot spare shrew at KB into use as the new temporary student ssh server.
- KVM service
- We have recovered the backup of /etc/libvirt for the lost KVM server waterloo, in case it should come in handy, though we hope that the waterloo wiki page and the LCFG should give sufficient detail to enable people to restore their VMs elsewhere. We have sufficient capacity on other KVM servers, partly thanks to waterloo having been underused: it hosted only seven VMs, of which two were test VMs. The backup of /etc/libvirt from waterloo can be found in /etc/waterloo on oyster if anyone needs it.
- We do not expect to provide a short term replacement for circle as it is only a test server.
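Anyone restoring a waterloo VM from the recovered /etc/libvirt backup may first want a quick inventory of what the saved definitions contain. The Python sketch below is one possible way to get that, not an MPU tool. It assumes the backup keeps libvirt's usual layout, with one domain XML file per VM in a qemu subdirectory. The function name summarise_domains and the path in the usage note are illustrative only.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def summarise_domains(xml_dir):
    """Summarise each libvirt domain definition found in xml_dir.

    Returns a list of dicts with the VM name, its configured memory
    (value and unit as written in the XML) and its vCPU count.
    Files whose root element is not <domain> are skipped.
    """
    summaries = []
    for xml_file in sorted(Path(xml_dir).glob("*.xml")):
        root = ET.parse(xml_file).getroot()
        if root.tag != "domain":
            continue  # e.g. a network or storage pool definition

        memory = root.find("memory")
        vcpu_text = root.findtext("vcpu")

        summaries.append({
            # Fall back to the file name if <name> is missing.
            "name": root.findtext("name", default=xml_file.stem),
            "memory": int(memory.text) if memory is not None else None,
            # libvirt treats a missing unit attribute as KiB.
            "memory_unit": memory.get("unit", "KiB") if memory is not None else None,
            "vcpus": int(vcpu_text) if vcpu_text is not None else None,
        })
    return summaries
```

On oyster this might be called as summarise_domains("/etc/waterloo/qemu"), assuming the backup has that subdirectory. The result can then be compared with the waterloo wiki page before a VM is recreated on another KVM server.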
Spare servers on offer
One of the MPU's functions is to carry spare machines for use in emergencies such as this. We don't have spares of every model but we have these in the Forum:
| Dell PowerEdge R805 | central | Currently configured as a staff NX server but not yet in service. |
| Dell PowerEdge 850 | figgy | |
| HP DL 180 G6 | juice | |
| Dell PowerEdge R710 | metropolitan | Currently configured as a KVM server but not in service. |
-- ChrisCooke - 13 Jan 2014