MPU Meeting Thursday 26th February 2008

Buildtools Project

Stephen has been working on testing - writing lots of tests for the code, exploring Perl testing facilities, etc.

Power management project

Chris is working on the project report. It's going to take longer than he expected. He's now aiming for the April development meeting.


LCFG slave server upgrades
The LCFG slave servers (trondra and mousa) have been upgraded to SL5. A few things are worth noting:

    • A complete rebuild of all 2700+ machine sources now takes an hour and a half to two hours.
    • Towards the end of the rebuild the mkxprof process was observed to have a resident memory size of 160MB.
    • Stephen reckons that the spanning map code in the compiler is not very optimised.
    • Even if the server has no other significant load during the rebuild, the machine is barely usable at the command line: the command line delay builds up to perhaps 30 seconds or more.
    • If apache[conf] is left running during the rebuild, with machines continuing to request profiles periodically, the command line on the machine will be completely unusable, and the rebuild status can only be monitored through the server's logserver web pages.
    • DICE machines were switched away from the LCFG server being upgraded by changing the LCFGSERVER definitions in live/inf-site.h beforehand (prompting a manageably sized rebuild of 1100 hosts). This significantly reduced the load on the server being upgraded.
    • A few machines still running ancient versions of DICE were detected during this period: we still have a few RH9 and FC3 machines with client components still asking for profiles. Carol will search out such old machines and tidy them as much as possible.
    • It's important to switch the lcfg DNS alias to a server which is not being upgraded, as this alias is used to fetch the initial profile at install time.

Autoreboot and the kernel upgrades
The first real mass use of the new autoreboot seems to have been very successful. Alastair received no complaints and machines have rebooted successfully. Stephen checked some areas before the reboot was due and found that over half of the machines to be rebooted had already been rebooted by their users.

Firmware and RAID
dresden crashed, apparently because buggy firmware in its RAID controller spread corruption from a problematic disk to a healthy one. We need to look at more routine firmware upgrades and at improved monitoring of firmware versions, RAID hardware states, and hardware/firmware health in general.

FibreChannel and kernel upgrade problem
We need short term and longer term fixes for the FibreChannel software install problem which became evident when we tried to upgrade the kernel at reboot time.

Stephen will be giving a talk at the UKUUG spring conference; the deadline for the talk is the 14th of March.

Disk partitioning documented
Chris has finished this and will publicise it.

DICE on Dell Optiplex 755
Stephen has tried SL5 on the 755 and it works. There's no point in trying FC6 as the kernel is too old for the hardware.

Chris and Stephen have both sorted out some of their outstanding RT tickets but both have more to tackle.

Large memory on i386
Stephen sees a need for a cos@inf howto on large memory for DICE machines - how much can be done under 32 bit; the use and the limitations of the PAE kernel; at what point it would be wiser to switch to 64 bit.
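
As rough background for such a howto, the arithmetic behind the limits can be sketched as follows. The figures are generic x86 facts (32-bit virtual addresses, 36-bit physical addresses under classic PAE), not anything measured on DICE machines, and the function name is purely illustrative.

```python
# Back-of-envelope address-space limits.
# Assumption: generic x86 figures, not DICE-specific measurements.
GIB = 2 ** 30


def addressable_gib(address_bits: int) -> int:
    """Return how many GiB can be addressed with the given number of address bits."""
    return (2 ** address_bits) // GIB


# Virtual address space available to a single process on i386.
virtual_32bit = addressable_gib(32)
# Physical memory reachable by the kernel with classic 36-bit PAE.
physical_pae = addressable_gib(36)

print(f"32-bit virtual address space: {virtual_32bit} GiB")
print(f"PAE physical address space:   {physical_pae} GiB")
```

Even with PAE, a single process is still confined to a 32-bit virtual address space, so PAE lets a machine run many processes in more than 4 GiB of RAM but does not let any one process use it all.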

MPU backups audit
Actions resulting from this:
  1. DIY DICE mirrors need to be moved to a Sun.
  2. The LCFG export server data disk should be mirrored excluding the CD images.
  3. The LCFG machine source files are now being mirrored to a Sun.
  4. The LCFG machine source files now have a new, more distinctive name for their rsync module.
  5. The subversion component on tobermory had a "dev" version. Stephen will chase Craig for feedback on the expanded subversion component.
  6. Paul's saveprof data is now mirrored to a Sun and also duplicated across the LCFG servers.
  7. There's no need to back up the Orders database on tobermory but there is a need to back up /var/rfe/orders.
  8. Alastair will check that orders-related resources refer to /var rather than /etc.
  9. The subversion mirroring source should be switched to the svndump directory.

LCFG on Mac meeting
This has now happened. We agree with its aims. Toby will take care of the details with MPU guidance from Chris.

Release testing
We agreed to try using bugzilla to record problems with the weekly release testing. That way we might be able to get an idea of what the common problems are and how much effort is required to get things fixed.

Kernel updates
An updated kernel is available. The autoreboot component is now active and will manage all the desktop machines. We will have to reboot all the MPU servers.

This Week

Alastair will:
  • finish upgrading pezenas
  • put in a short term FibreChannel fix
  • move ordershost off tobermory.

Chris will:

  • finish tidying the backups
  • work on the tobermory upgrade
  • work on the power management report
  • be on RT duty.

Stephen will:

  • work on the SL5 upgrade of the LCFG export services
  • make and announce the new buckets
  • talk to Craig about the subversion component
  • work on his UKUUG talk.

