Summary of pandemic planning meeting (19/08/09)
(alisond, ascobie, cms, gdmr, perdita, timc)
Scenarios
=========
Agreed: plan *NOW* for the almost certain scenario of reduced staff.
Scenarios, in order of decreasing probability and increasing severity :-
* Reduced staff => drop development ring-fence
* University decrees reduced face-to-face contact (teaching, research,
admin - e.g. tutorials, meetings)
* University decrees no teaching (continue research)
* University closes buildings (continue research)
At some point :-
* open up any unit access restrictions
* document any passwords (e.g. for FC switches) and store them
where all can access them
* create AFS admin principals for more computing staff
Later :-
* freeze deployment of patches and software upgrades
Later still :-
* freeze configuration changes
Remote support for users
========================
DICE users
----------
Users can use VNC. Need to provide documentation.
Need to check capacity of ssh login servers - perhaps
stop mounting home directories on all bar one of them (kept
for e.g. unison and scp users). Staff can ssh onto their
desktops; need to think how to spread students over
student lab machines.
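One possible way to spread students over lab machines, as a minimal sketch only: hash each username to pick a machine, so that a given student always lands on the same host and users are spread roughly evenly. The hostnames and the function name below are hypothetical and were not discussed at the meeting.

```python
import hashlib


def assign_lab_machine(username, machines):
    """Pick a lab machine for a user deterministically from their username."""
    if not machines:
        raise ValueError("no lab machines available")
    # Use a stable hash; Python's built-in hash() is randomised per process
    # for strings, so it would give different answers on different runs.
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(machines)
    return machines[index]


# Hypothetical hostnames, for illustration only.
LAB_MACHINES = ["lab01.example.org", "lab02.example.org", "lab03.example.org"]

if __name__ == "__main__":
    for user in ["s0900001", "s0900002", "s0900003"]:
        print(user, "->", assign_lab_machine(user, LAB_MACHINES))
```

This takes no account of actual machine load, and adding or removing a machine moves most users to a different host, so it is only a starting point for the discussion.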
Laptops (and self-managed machines)
------------------------------------
* filesystem (AFS) - need updated documentation
* web - possible issues with IP address controlled content
* editing web content
- editing web content via ssh ok
- need to check how non unix skilled staff edit www.inf
web content
* IS hosted usenet groups (for teaching courses) - need to check
whether IP address controlled.
* openvpn - solves the IP address control issues. Make it a service (with documentation?)
* school DB - secured VNC - need documentation
Service continuity
==================
Critical services (and people SPFs)
----------------------------------
Although we probably don't have any true single points of failure
(SPFs), there are a number of critical services where we need to
improve skill coverage.
These are :-
* network
* serial consoles / remote power management
* storage arrays / SAN
* LCFG release mechanism / package service
* virtualisation infrastructure
* AFS
* TiBS / retrospect / Sun networker
* School DB
* plone
The following were identified as critical, but we believe there
to be sufficient skill coverage.
* LDAP, kerberos, cosign
* traditional web based services
* wiki
* RT
* ssh servers
Actions
=======
* Employ FC multipath and ethernet bonding wherever possible (esp.
critical services).
- Also move service-related configuration from individual machine
profiles into header files, to make it easier to move services
from one machine to another.
* Add nagios monitoring wherever possible (esp. critical services)
* Check capacity of ssh login servers (and consider whether to
stop mounting home directories)
* Consider how to spread remote students over student lab machines
* Consider how non unix skilled staff edit www.inf content
* Check whether IS hosted usenet groups are IP address controlled
* Documentation
- AFS
- VNC
- openvpn?
* Identify any single points of failure in remote management of machines
* Improve skill coverage for those critical services identified to have
weak coverage.
* For each critical service, document how to deal with the top 5
things that routinely need doing or go wrong.
--
AlastairScobie - 25 Aug 2009