We aim to keep track of the uptime and downtime of our services, so we can get an idea of the quality of service we provide. Initially I think we'll just jot down incidents as they occur.
Server |
Down |
Back up |
Impact |
hootsmon |
3/7/2006 17:20 |
4/7/2006 09:30 |
Started to run out of memory in the evening, and ldap died. homepages CGIs were affected from that point, as were NFS mounts; however, static pages and wiki.inf continued to work until 8am, when apache finally stopped. Rebooted at 9am, but the root disk had gone 245 days without an fsck, so it ran one then. |
stockholm |
20/7/2006 12:50 |
20/7/2006 16:07 |
Due to ATABeast and fibre problems the CVS/SVN service was unavailable during this time |
diglett |
20/7/2006 12:50 |
20/7/2006 16:07 |
Due to ATABeast and fibre problems this research service (not unit responsibility) was unavailable during this time |
har |
20/7/2006 12:50 |
20/7/2006 17:36 |
Due to ATABeast and fibre problems the research exported filesystems and beowulf homedirs were unavailable during this time. Its reboot took longer than we'd like, due to an fsck of d1, d4 and d5 (about 1.5TB)! |
sphinx, wyvern, har, ataboy1, atabeast1, stockholm |
5/8/2006 9:00 |
5/8/2006 14:30 |
Planned weekend maintenance |
scunner |
7/8/2006 02:30 |
7/8/2006 10:30 |
Ran out of memory for unknown reason. This would affect all the web services running on scunner, and the filesystems it exports |
admin.smb |
8/8/2006 12:30 |
8/8/2006 14:00 |
Planned migration of admin.smb from topper to stumer |
tarn |
8/8/2006 17:00 |
9/8/2006 09:20 |
Tarn panicked. Needed to be turned off and on again. ssh.inf was down during this time |
mandy |
22/8/2006 07:37 |
22/8/2006 10:03 |
virtalrelay.inf and smtp.inf unavailable. Syslog seemed to report a kernel bug (see posting to COs), and this seems to have caused the load to increase; once it was greater than 12, sendmail stopped accepting connections. Took a while to reboot the machine, as it refused to go down cleanly |
scunner, roc, phoenix, ... |
15/9/2006 17:00 |
17/9/2006 |
Planned power outage at AT, see ATPowerOff. Various home directories, samba server, web servers |
hootsmon |
23/9/2006 18:45 |
25/9/2006 09:30 |
homepages.inf and wiki.inf unavailable due to hootsmon running out of memory |
ATABoy1 |
7/10/2006 09:50 |
8/10/2006 13:00 |
ATABoy locked up, affecting sphinx, wyvern and cvs.inf, and some web servers eventually stopped serving pages. ATAboy and servers restarted by 1pm. Some web service failures were not spotted until Monday morning |
laney |
3/3/2007 06:60 |
4/3/2007 09:00 |
homepages and wiki unavailable due to power problems at the weekend |
FC filesystems on phoenix, roc & stumer |
04/04/2007 15:00 |
04/04/2007 17:00 |
Sataboy1 wiped LUN masking tables. AFS, admin samba and some NFS file systems unavailable |
mail.inf web interface |
6/4/2007 03:27 |
6/4/2007 11:26 |
apache stopped responding |
sleekit |
17/4/2007 20:10 |
18/4/2007 09:20 |
server room power failure, all web services hosted on sleekit unavailable |
hippocampus |
17/4/2007 20:10 |
18/4/2007 09:00 |
server room power failure |
pegasus FC filesystems |
17/4/2007 20:10 |
18/4/2007 09:00 |
server room power failure. Pegasus remained up but UPS powering FC switch shut down |
sphinx rebooted |
18/4/2007 10:15 |
18/4/2007 11:20 |
incomplete profile caused the afs module to be removed, then inittab problems |
bpbeast hung |
19/4/2007 15:30 |
19/4/2007 16:30 |
beast hung for unknown reason. nexsan will be contacted |
nutty |
1/5/2007 06:43 |
1/5/2007 06:54 |
planned reboot to enable security fixes for mail.inf and lists.inf |
beezer, stoater |
1/5/2007 06:37 |
1/5/2007 10:50 |
planned reboot to enable security fixes for www.inf, groups.inf, www.aiai.ed.ac.uk, etc. Unfortunately there were network/configuration problems, which meant they didn't come back in the expected 10 mins! |
printing |
1/5/2007 09:00 |
1/5/2007 14:45 |
All printing was down for an unknown reason, though it appears LDAP was being heavily loaded in the early hours. Resolution was complicated by network configuration problems, though all servers were back between 13:30 and 14:45 |
maelcum |
7/5/2007 13:20 |
7/5/2007 14:50 |
Planned down time to reconfigure disks - affected roombooking |
mandy |
8/5/2007 07:10 |
8/5/2007 07:20 |
Planned reboot to enable security fixes - affected smtp.inf |
arnie |
8/5/2007 09:35 |
8/5/2007 09:44 |
Planned reboot to enable security fixes - affected www.dai.ed.ac.uk |
helpmaboab |
8/5/2007 09:45 |
8/5/2007 09:58 |
Planned reboot to enable security fixes - affected www.dcs.ed.ac.uk |
phoenix |
14/06/2007 06:45 |
14/06/2007 09:00 |
Reboot to install new version of AFS prolonged due to missing FC package |
nutty |
22/06/2007 22:00 approx |
23/06/2007 17:00 |
due to a mail storm, sendmail stopped accepting connections. No mail was lost, just delayed for about a day |
arnie |
24/06/2007 14:00 approx |
25/06/2007 13:30 |
hard disk failure. www.dai.ed.ac.uk down. hootsmon.inf configured to stand in for www.dai. It will take a while for the DNS change to propagate. |
phoenix, roc, eejit, SATAboy1 |
18/06/2007 01:30 |
18/06/2007 09:30 |
Power failure in AT machine room |
stumer |
16/8/2007 13:06 |
16/8/2007 14:10 |
Kernel bug hit, which caused samba to fail to setuid. admin.smb was affected |
eejit |
21/8/2007 19:00 |
22/8/2007 10:30 |
samba daemon core dumped and machine was eventually rebooted. Samba mounts of home directories and group space were not available during this time. |
everything in Appleton Tower |
25/9/2007 09:35 |
25/9/2007 11:00 |
Power failure in AT machine room. Affected file servers, web, ssh, printing and networking |
ratz |
26/9/2007 14:30 |
26/9/2007 15:58 |
Kernel bug /afs_lock.c:133 caused machine to hang, ssh.inf affected |
sphinx |
12/11/2008 11:30 |
12/11/2008 12:00 |
AFS issue? |
nutty |
20/11/2008 17:20 |
20/11/2008 16:40 |
A quick reboot to pick up the new SL5.2 kernel led to problems with network cards and SCSI hangs, so it was down longer than anticipated. Limited effect, as most people are using staffmail now; only lists and forwarding were down
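
Since the point of this log is to get an idea of the quality of service, the Down and Back up columns can be turned into outage durations. The following is only a minimal sketch: the function names `outage_duration` and `total_downtime` are made up for illustration, and it assumes every timestamp is written as day/month/year hour:minute, which not every row above satisfies (some have "approx" or no time at all). The three sample rows are copied from the table.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout: day/month/year hour:minute, e.g. "3/7/2006 17:20".
FORMAT = "%d/%m/%Y %H:%M"


def outage_duration(down: str, back_up: str) -> timedelta:
    """Return how long a service was down, given the two logged timestamps."""
    start = datetime.strptime(down, FORMAT)
    end = datetime.strptime(back_up, FORMAT)
    if end < start:
        # A "Back up" time earlier than "Down" means the row was mistyped.
        raise ValueError(f"back up ({back_up}) is earlier than down ({down})")
    return end - start


def total_downtime(rows) -> timedelta:
    """Sum the outage durations over (server, down, back_up) tuples."""
    total = timedelta()
    for _server, down, back_up in rows:
        total += outage_duration(down, back_up)
    return total


# Three rows taken from the table above.
incidents = [
    ("hootsmon", "3/7/2006 17:20", "4/7/2006 09:30"),
    ("stockholm", "20/7/2006 12:50", "20/7/2006 16:07"),
    ("nutty", "1/5/2007 06:43", "1/5/2007 06:54"),
]

if __name__ == "__main__":
    for server, down, back_up in incidents:
        print(server, outage_duration(down, back_up))
    print("total", total_downtime(incidents))
```

Rejecting rows whose Back up time precedes the Down time is a deliberate choice here: it flags mistyped entries rather than silently producing a negative duration.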