Services Server Availability

We aim to keep track of the uptime and downtime of our services, so we can get an idea of the quality of service we provide. Initially I think we'll just jot down incidents as they occur.

| Server | Down | Back up | Impact |
| hootsmon | 3/7/2006 17:20 | 4/7/2006 09:30 | Started to run out of memory in the evening and ldap died. homepages CGIs were affected from that point, as were NFS mounts; however, static pages and wiki.inf continued to work until 8am, when apache finally stopped. Rebooted at 9am, but the root disk had gone 245 days without an fsck, so it ran one then. |
| stockholm | 20/7/2006 12:50 | 20/7/2006 16:07 | Due to ATABeast and fibre problems, the CVS/SVN service was unavailable during this time. |
| diglett | 20/7/2006 12:50 | 20/7/2006 16:07 | Due to ATABeast and fibre problems, this research service (not unit responsibility) was unavailable during this time. |
| har | 20/7/2006 12:50 | 20/7/2006 17:36 | Due to ATABeast and fibre problems, the research exported filesystems and beowulf homedirs were unavailable during this time. Its reboot took longer than we'd like, due to an fsck of d1, d4 and d5 (about 1.5TB)! |
| sphinx, wyvern, har, ataboy1, atabeast1, stockholm | 5/8/2006 9:00 | 5/8/2006 14:30 | Planned weekend maintenance. |
| scunner | 7/8/2006 02:30 | 7/8/2006 10:30 | Ran out of memory for an unknown reason. This would have affected all the web services running on scunner, and the filesystems it exports. |
| admin.smb | 8/8/2006 12:30 | 8/8/2006 14:00 | Planned migration of admin.smb from topper to stumer. |
| tarn | 8/8/2006 17:00 | 9/8/2006 09:20 | Tarn panicked. Needed to be turned off and on again. ssh.inf was down during this time. |
| mandy | 22/8/2006 07:37 | 22/8/2006 10:03 | virtalrelay.inf and smtp.inf unavailable. Syslog seemed to report a kernel bug (see posting to COs), and this seems to have caused the load to increase; once it was greater than 12, sendmail stopped accepting connections. Took a while to reboot the machine, as it refused to go down cleanly. |
| scunner, roc, phoenix, ... | 15/9/2006 17:00 | 17/9/2006 | Planned power outage at AT, see ATPowerOff. Affected various home directories, the samba server and web servers. |
| hootsmon | 23/9/2006 18:45 | 25/9/2006 09:30 | homepages.inf and wiki.inf unavailable due to hootsmon running out of memory. |
| ATABoy1 | 7/10/2006 09:50 | 8/10/2006 13:00 | ATABoy locked up, affecting sphinx, wyvern and cvs.inf; some web servers eventually stopped serving pages. ATABoy and servers restarted by 1pm. Some web services not spotted until Monday morning. |
| laney | 3/3/2007 06:60 | 4/3/2007 09:00 | homepages and wiki unavailable due to power problems at the weekend. |
| FC filesystems on phoenix, roc & stumer | 04/04/2007 15:00 | 04/04/2007 17:00 | SATAboy1 wiped LUN masking tables. AFS, admin samba and some NFS file systems unavailable. |
| mail.inf web interface | 6/4/2007 03:27 | 6/4/2007 11:26 | apache stopped responding. |
| sleekit | 17/4/2007 20:10 | 18/4/2007 09:20 | Server room power failure; all web services hosted on sleekit unavailable. |
| hippocampus | 17/4/2007 20:10 | 18/4/2007 09:00 | Server room power failure. |
| pegasus FC filesystems | 17/4/2007 20:10 | 18/4/2007 09:00 | Server room power failure. Pegasus remained up, but the UPS powering the FC switch shut down. |
| sphinx rebooted | 18/4/2007 10:15 | 18/4/2007 11:20 | Incomplete profile caused the afs module to be removed, then inittab problems. |
| bpbeast hung | 19/4/2007 15:30 | 19/4/2007 16:30 | Beast hung for an unknown reason. Nexsan will be contacted. |
| nutty | 1/5/2007 06:43 | 1/5/2007 06:54 | Planned reboot to enable security fixes for mail.inf and lists.inf. |
| beezer, stoater | 1/5/2007 06:37 | 1/5/2007 10:50 | Planned reboot to enable security fixes for www.inf, groups.inf, etc. Unfortunately there were network/configuration problems, which meant they didn't come back in the expected 10 mins! |
| printing | 1/5/2007 09:00 | 1/5/2007 14:45 | All printing was down for an unknown reason; it appears LDAP was being heavily loaded in the early hours. Resolution was complicated by network configuration problems, though all servers were back between 13:30 and 14:45. |
| maelcum | 7/5/2007 13:20 | 7/5/2007 14:50 | Planned down time to reconfigure disks - affected roombooking |
| mandy | 8/5/2007 07:10 | 8/5/2007 07:20 | Planned reboot to enable security fixes - affected smtp.inf |
| arnie | 8/5/2007 09:35 | 8/5/2007 09:44 | Planned reboot to enable security fixes - affected |
| helpmaboab | 8/5/2007 09:45 | 8/5/2007 09:58 | Planned reboot to enable security fixes - affected |
| phoenix | 14/06/2007 06:45 | 14/06/2007 09:00 | Reboot to install new version of AFS, prolonged due to missing FC package. |
| nutty | 22/06/2007 22:00 approx | 23/06/2007 17:00 | Due to a mail storm, sendmail stopped accepting connections. No mail was lost, just delayed for about a day. |
| arnie | 24/06/2007 14:00 approx | 25/06/2007 13:30 | Hard disk failure; arnie down. hootsmon.inf configured to stand in for www.dai. It will take a while for the DNS change to propagate. |
| phoenix, roc, eejit, SATAboy1 | 18/06/2007 01:30 | 18/06/2007 09:30 | Power failure in AT machine room. |
| stumer | 16/8/2007 13:06 | 16/8/2007 14:10 | Kernel bug hit, which caused samba to fail to setuid. admin.smb was affected. |
| eejit | 21/8/2007 19:00 | 22/8/2007 10:30 | samba daemon core dumped and machine was eventually rebooted. Samba mounts of home directories and group space were not available during this time. |
| everything in Appleton Tower | 25/9/2007 09:35 | 25/9/2007 11:00 | Power failure in AT machine room; affected file servers, web, ssh, printing, networking. |
| ratz | 26/9/2007 14:30 | 26/9/2007 15:58 | Kernel bug /afs_lock.c:133 caused machine to hang, ssh.inf affected. |
| sphinx | 12/11/2008 11:30 | 12/11/2008 12:00 | AFS issue? |
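
Since the point of the log is to get an idea of service quality, the Down and Back up times can be turned into outage durations. The following is only a rough sketch, not something the unit actually runs: the names `INCIDENTS`, `outage_minutes` and `total_downtime` are made up for illustration, the three sample rows are copied from the log above, and it assumes both timestamps are complete and written as day/month/year hour:minute.

```python
from datetime import datetime

# Sample rows copied from the log above: (server, down, back up).
# Rows with a missing or approximate time would need handling by hand.
INCIDENTS = [
    ("stockholm", "20/7/2006 12:50", "20/7/2006 16:07"),
    ("har", "20/7/2006 12:50", "20/7/2006 17:36"),
    ("tarn", "8/8/2006 17:00", "9/8/2006 09:20"),
]

# Assumed timestamp layout: day/month/year hour:minute.
FORMAT = "%d/%m/%Y %H:%M"


def outage_minutes(down: str, back_up: str) -> int:
    """Return the length of one outage in whole minutes."""
    start = datetime.strptime(down, FORMAT)
    end = datetime.strptime(back_up, FORMAT)
    if end < start:
        # Catches mistyped dates, e.g. a wrong year in the Back up column.
        raise ValueError(f"back up time {back_up!r} is before down time {down!r}")
    return int((end - start).total_seconds() // 60)


def total_downtime(incidents) -> dict:
    """Sum outage minutes per server over a list of incident rows."""
    totals = {}
    for server, down, back_up in incidents:
        totals[server] = totals.get(server, 0) + outage_minutes(down, back_up)
    return totals


if __name__ == "__main__":
    for server, minutes in sorted(total_downtime(INCIDENTS).items()):
        print(f"{server}: {minutes // 60}h {minutes % 60:02d}m")
```

The check for a Back up time earlier than the Down time is there because hand-typed dates are easy to get wrong, and an invalid time such as 06:60 is rejected by `strptime` rather than silently accepted.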

-- NeilBrown - 04 Jul 2006

Topic revision: r24 - 12 Nov 2008 - 12:11:41 - CraigStrachan