MPU Meeting Monday 5th November 2007
LCFG Website
Awaiting sign-off at the next Development meeting.
Solaris Improvement Project
Ouroboros has been patched and the Services Unit has accepted the patching mechanism. The project just needs a final report now before being ready for sign-off.
SL5 Project
Awaiting sign-off at the next Development meeting.
Stephen closed off the SL5 project diary and started a new, ongoing SL5 activity page in the wiki, to which Kenny has agreed to contribute. Any developments affecting SL5 - software updates, for instance - will be logged here.
Buildtools Project
Stephen and Paul are going to talk about this next week.
Alastair reported a demand for build farm facilities to be provided as soon as possible. We agreed that we could do with some stopgap script to check out, build and submit components for multiple platforms. Stephen is going to identify where packages differ from platform to platform.
rpmsubmit Project
Nothing this week.
Power management project
Chris has been reading up on future kernel tweaks which may bring big power savings, and asked whether this or overnight hibernation had been envisaged as the main focus of the project. Overnight hibernation had been expected to be the focus, since kernel tweaks will happen eventually anyway, but the kernel tweaks are certainly worth investigating too.
Tux On Ice supports Wake On LAN, which could be useful for getting overnight updates done on hibernating machines. It's common for graphics to fail to recover properly when a hibernating machine wakes up. Switching to an alternative console, either before hibernating or after waking, and then switching back is often enough to save the graphics.
Operational
- autorebooter
- Stephen's come up with a new component to do this. It runs in two modes: "nag only" does an informative wall at about 11am every day that the machine needs rebooting, and this should take the form of a popup message on Gnome; "nag then reboot" additionally schedules a reboot at 2am for 3am if several days' repeated nagging has had no effect.
- Screen res. issue
- Alastair has packaged the FC5 version of kudzu for FC6 as ddckudzu. It's available but not enabled in the LCFG layer; for us it's enabled in the DICE layer.
- updaterpms
- The new version of updaterpms will be pushed out to SL5 immediately and to FC6 after four weeks (i.e. in December).
- Server reboots
- bigfan and kingsbarns don't need rebooting as they're test machines. Chris will reboot the RPM slaves split and boreas and Stephen will reboot the LCFG master server tobermory and the external access server dresden.
- Apacheconf on dresden
- This should be done by the end of the week. It'll restore external access to the LCFG status pages, using Cosign.
- Personal development topics
- Stephen is going to LISA!
- Chris has a big Perl book to work through and an xinetd component to convert to Perl.
- Alastair also fancies brushing up his Perl knowledge.
- Bressay
- On Sunday morning Nagios noticed that apacheconf on bressay (the LCFG test server) was no longer working. The cause was a typo in a configuration file, which broke apacheconf when the logs were rotated on Sunday morning. The typo has now been fixed. Apacheconf suffers from a bug whereby it will not quit while its config is broken, and Stephen has mailed Simon a patch which fixes this. Bressay hung during an attempted reboot and we don't know why. (Perhaps we do though: I expect it was just apacheconf getting stuck and failing to quit while bressay was on the way down. - Chris)
- LCFG Tutorial Day
- There's demand for an internal rerun of the LCFG tutorial day. There was some discussion of which parts would be most suitable, and whether the tutorial should be less than a whole day long. In the end we decided that Chris should ask which bits people wanted most. Stephen volunteered to present Kenny's bits if necessary and said that Kenny would be happy with this. Chris will organise the event again.
- Kernel locking problem debugged
- Alastair and Simon debugged a kernel-level locking problem that seems to have been causing the persistent hangs we've seen in updaterpms, rpm and OpenLDAP. Such hangs can persist even across reboots. The problem is with the kernel-level locking primitive "futex", on which most locking primitives are based. If a process takes a futex lock and then dies, the lock stays held until reboot. This is bad. If, however, the lock lives in an area of memory mapped from disk, it won't be cleared by a reboot and will stay locked indefinitely. This is worse. We think this was what was happening on FC5. In FC6, libc and the kernel have new facilities which allow them to cooperate to register such locks and to clean them up when necessary. This helps where a process has died but the kernel carries on (the kernel will clean up the mess); however, it still doesn't help in situations where the kernel itself crashes and the futex's memory location is mapped from an active disk file. Hopefully such situations will be pretty rare, says Simon.
- MPuData
- Alastair has added more detail to the MPuData page and would like us to read it and comment.
- RPM Slaves
- They're going to need more disc space.
- The develop release
- Alastair counted over 120 machines using the develop release, and this seems like too many. Alastair will pursue the problem elsewhere, but meanwhile one easy target would be the Solaris machines: they're all on develop for historical reasons. This is still appropriate for test Suns but Chris will ask Craig to move the production Suns to stable.
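For reference, the two autorebooter modes described under Operational can be sketched as a small decision function. This is only an illustration and not Stephen's component: the mode names, the `RebootPolicy` class and the three-day threshold are all assumptions (the notes say only "several days").

```python
from dataclasses import dataclass


@dataclass
class RebootPolicy:
    # Hypothetical mode names standing in for "nag only" and "nag then reboot".
    mode: str = "nag-only"
    # Assumed threshold; the minutes say only "several days".
    nag_days_before_reboot: int = 3


def actions_for_today(policy, reboot_needed, days_nagged):
    """Return the list of actions the machine should take today.

    days_nagged counts the days on which the user has already been
    nagged without the machine being rebooted.
    """
    if not reboot_needed:
        return []
    # Both modes nag at about 11am on every day a reboot is pending.
    actions = ["nag at 11:00"]
    # Only the second mode escalates, and only after repeated nagging.
    if (policy.mode == "nag-then-reboot"
            and days_nagged >= policy.nag_days_before_reboot):
        actions.append("schedule reboot for 03:00")
    return actions
```

Keeping the decision separate from the actual wall and reboot calls would make the policy easy to check without touching a real machine.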
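The "worse" case in the kernel locking item (a lock word living in memory mapped from a disk file) can be illustrated with a rough sketch. This is not a futex and not race-free: it only shows that a lock word in a file-backed mapping outlives the process that set it, and that recording the owner allows a later clean-up, loosely analogous to the libc and kernel cooperation in FC6. All function names and the file layout here are made up for illustration, and `pid_is_alive` assumes a Unix system.

```python
import mmap
import os
import struct
import tempfile

LOCK_FMT = "i"  # one 32-bit word: 0 = unlocked, otherwise the owner's PID
LOCK_SIZE = struct.calcsize(LOCK_FMT)


def create_lock_file(path):
    """Create a file holding a single unlocked lock word."""
    with open(path, "wb") as f:
        f.write(struct.pack(LOCK_FMT, 0))


def write_owner(path, owner_pid):
    """Store owner_pid in the lock word through a file-backed mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), LOCK_SIZE) as m:
            m[:LOCK_SIZE] = struct.pack(LOCK_FMT, owner_pid)
            m.flush()  # push the change back to the file on disk


def read_owner(path):
    """Return the PID stored in the lock word (0 means unlocked)."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), LOCK_SIZE) as m:
            return struct.unpack(LOCK_FMT, m[:LOCK_SIZE])[0]


def pid_is_alive(pid):
    """Unix only: signal 0 checks for existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def recover_if_stale(path, is_alive=pid_is_alive):
    """Clear the lock if its recorded owner is gone; return True if cleared."""
    owner = read_owner(path)
    if owner != 0 and not is_alive(owner):
        write_owner(path, 0)
        return True
    return False


def demo():
    path = os.path.join(tempfile.mkdtemp(), "lockword")
    create_lock_file(path)
    write_owner(path, os.getpid())  # "take" the lock and never release it
    # The mapping is closed, but the lock word is still set in the file,
    # just as it would be after the owning process died or the host rebooted.
    return read_owner(path)
```

Without the owner recorded in the lock word there would be nothing to check, which is roughly the position the notes describe for FC5.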
This Week
Alastair will:
- Tackle rpmsubmit
- Sift upgrade wikis
Chris will:
- Investigate power management
- Sift upgrade wikis
- Reboot servers
- Organise the LCFG tutorial day
Stephen will:
- Handle RT tickets
- Write up Virtualisation
- Tart up the LCFG web site
- Reboot servers
- Put apacheconf on dresden
--
ChrisCooke - 05 Nov 2007