MPU Meeting Tuesday 25th August 2009
Power Management Project
The majority of the Dell 745s in the first test lab, AT-5.03, froze during Chris's holiday. The known HP 7900s by contrast seem to have no problem with sleep. The Dell problem has been going on for ages with no solution in sight. We therefore think it's best to disable lcfg-sleep on Dells and just concentrate on the HPs for now. Chris has moved the sleep tests to another small lab, AT-5.01, which contains some new HPs. If those machines behave with lcfg-sleep it'll be fairly easy to deploy it on all the other new HPs in the student labs.
Chris has been blogging about his progress.
rpmsubmit
Alastair has produced a new version of rpmsubmit for the AFS repository. He's called it pkgsubmit and has taken the opportunity to edit out some irrelevant code. The changeover to the AFS repository is expected to happen in two weeks' time.
AFS Component
Nothing happened.
TiBS
Chris had a meeting with Craig and Alison about lcfg-tibs. He's written it up here.
In summary, once we have the current TiBS tarball, Chris will adapt his work to it and then we'll deploy it on alexandria. The current version manages all of the configuration files which are currently edited by hand.
A future version will also manage the addition and removal of non-AFS backup details to the system. A version after that will get these non-AFS backup details from a spanning map fed by an LCFG backup component on clients. That component will specify candidate filesystems, each with an associated keyword signifying a backup type, which will be translated to a TiBS configuration on the server.
Stephen kindly volunteered a code review of the current version.
Chris has been blogging about his progress.
LCFG Server Refactoring
Stephen and Simon met to decide on a Perl coding style.
Stephen has patched the current server to make it sort the contents of spanning maps. This seems to have no ill effects and it'll make it far easier when performing tests of LCFG server output as part of the server refactoring project.
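As a rough illustration of why sorted spanning maps make output testing easier, here is a minimal Python sketch. The data layout (map name to a list of host/value pairs) and the function name are assumptions made for the example; they are not the LCFG server's actual structures.

```python
def render_spanning_map(entries):
    """Render a spanning map as text with a stable ordering.

    `entries` is a hypothetical in-memory form: map name -> list of
    (host, value) pairs, in whatever order they were collected.
    """
    lines = []
    for name in sorted(entries):
        for host, value in sorted(entries[name]):
            lines.append(f"{name} {host} {value}")
    return "\n".join(lines)


# The same contents collected in two different orders.
run_a = {"backup": [("hostb", "home"), ("hosta", "scratch")]}
run_b = {"backup": [("hosta", "scratch"), ("hostb", "home")]}

# With sorting, both runs render identically, so a plain text diff is empty.
same = render_spanning_map(run_a) == render_spanning_map(run_b)
```

Without the sort, two runs over identical inputs could differ only in ordering, and every such difference would show up as a false failure in an output comparison.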
Stephen has been learning git and gerrit - apparently quite an experience.
He's also produced a Perl module for comparing XML profiles for testing purposes. It is currently not sophisticated but does at least do the XML comparison perfectly well. The next step will be to take snapshots of the inputs and outputs and build tests more automatically.
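The idea of comparing two XML profiles while ignoring incidental differences can be sketched as follows. This is not Stephen's Perl module; it is a small Python illustration (needs Python 3.8 or later) using the standard library's canonical XML support, and the element and attribute names are made up rather than taken from the real LCFG profile format.

```python
from xml.etree.ElementTree import canonicalize


def profiles_equal(xml_a: str, xml_b: str) -> bool:
    """Compare two XML documents after canonicalisation.

    Canonical XML puts attributes in a fixed order, and strip_text=True
    drops whitespace used only for indentation, so only real content
    differences remain.
    """
    return canonicalize(xml_a, strip_text=True) == canonicalize(xml_b, strip_text=True)


# Hypothetical profile fragments: same content, different layout.
compact = '<profile><component name="cron" version="1"><entry>x</entry></component></profile>'
indented = """<profile>
  <component version="1" name="cron">
    <entry>x</entry>
  </component>
</profile>"""
changed = compact.replace(">x<", ">y<")

layout_only = profiles_equal(compact, indented)   # differs only in layout
real_change = profiles_equal(compact, changed)    # differs in content
```

Snapshot-based tests could then store the server's output for a known input and fail only when `profiles_equal` reports a genuine difference.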
Source modules have been rearranged into a more sensible layout.
Stephen mailed lcfg-devel about merging in some LCFG server dependencies.
Miscellaneous Development
- LCFG Status Checks
- To tide us over until we have Nagios-based solutions, Stephen has put together some LCFG client status check scripts. These will be run every Monday morning and the results mailed out to COs. They look for machines on the stable release with the following problems (suggestions for additions welcome):
- The LCFG client does not accept new profiles
- Some components are not started
- updaterpms has failed to run
- The LCFG profile fails to compile
- Updaterpms broken script problems
- The LCFG Deployers' Meeting flagged up a couple of problems related to updaterpms:
- How to remove a package with a broken script
- this should be simple to fix in updaterpms so Alastair will do it in September.
- How to upgrade a package with a broken script
- this isn't quite so simple; Alastair will add it to the small projects list.
- Cron enhancement for random time in range
- Stephen has implemented this and is testing it. It's looking pretty good so far and he hopes to deploy it soon.
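To show what picking a time within a range might look like, here is a minimal Python sketch. It is not the cron component's implementation: the function name is hypothetical, and deriving the offset from the hostname (so that each host keeps the same slot from run to run) is an assumption of this example rather than something the minutes state.

```python
import hashlib


def pick_time(hostname: str, start: str, end: str) -> str:
    """Return an HH:MM inside [start, end], stable for a given hostname."""
    def to_minutes(hhmm: str) -> int:
        hours, minutes = hhmm.split(":")
        return int(hours) * 60 + int(minutes)

    low, high = to_minutes(start), to_minutes(end)
    if high < low:
        high += 24 * 60  # the range crosses midnight
    span = high - low + 1

    # Assumption: hash the hostname so the choice is spread across hosts
    # but repeatable for each one.
    digest = hashlib.sha256(hostname.encode()).digest()
    offset = int.from_bytes(digest[:8], "big") % span

    total = (low + offset) % (24 * 60)
    return f"{total // 60:02d}:{total % 60:02d}"


# Hypothetical host names, each given a slot between 01:00 and 05:00.
slots = {host: pick_time(host, "01:00", "05:00") for host in ("vm1", "vm2", "vm3")}
```

A truly random choice on each configuration change would spread the load just as well; the hash is used here only so that a host's scheduled time does not move around between runs.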
Operational
- Quote for SAS drives and EVO expansion
- Alastair has got one, and it's a bit expensive. For just over 2TB at RAID10 - meaning 4.5TB worth of disks - we'd have to pay about £10,000. Since the speed is only occasionally a problem, and we have other ways of getting round that, we're probably not going to buy these disks just now.
- Boot run time on virtual servers
- One of the ways of spreading the virtual server load on the virtual server host machines is to spread out the time of the running of the boot component. Stephen has modified the cron component to pick a random time in a given range, which will help greatly with this. We can also probably identify some servers which could just as well have their boot component run during the day as at night. We'll all look at the servers and try to identify times at which the boot component could be run.
- Cosign upgrade for lcfg.org
- This has now been done.
- LCFG Users' Day
- Stephen has started organising this and hopes to split the organisation of the day's timetable with Kenny.
- LCFG Server Exam Slowdown
- An incident with a typo just before an online exam highlighted an existing problem with the LCFG servers and releases: when a host acquires a compilation error it drops out of its normal release and uses the default release instead. The release in use is a component of the path of every file used to make the machine's profile, so all of the host's dependency information has to be removed then recreated. This is all done with a Berkeley DB4 database, which isn't lightning-fast. When this happens with several hundred hosts, the resulting logjam can mean that the LCFG servers take an hour to process the typo, then another hour to process the removal of the typo.
We had a number of ideas on how to fix this:
- Chris suggested renaming the stable release "default". He'll look at this.
- Stephen thought we might get better results out of abandoning the DB4 database and holding dependency information in memory instead. This ought to make it far faster to [re]generate and should be trivial to implement.
- Stephen also wondered about changing the name of the default release from default to stable but we'd have to coordinate this with other LCFG sites.
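As a rough picture of the in-memory idea, the Python sketch below keeps host-to-file dependencies in ordinary dictionaries and re-keys one host when its release changes. The class, method names and file paths are all hypothetical; the real server's data model may differ, and the sketch ignores persistence across server restarts.

```python
from collections import defaultdict


class DependencyIndex:
    """In-memory map between hosts and the source files their profiles use."""

    def __init__(self):
        self.files_by_host = defaultdict(set)
        self.hosts_by_file = defaultdict(set)

    def clear_host(self, host):
        """Remove every dependency record for one host."""
        for path in self.files_by_host.pop(host, set()):
            hosts = self.hosts_by_file[path]
            hosts.discard(host)
            if not hosts:
                del self.hosts_by_file[path]

    def set_host_files(self, host, files):
        """Replace all dependency records for one host."""
        self.clear_host(host)
        for path in files:
            self.files_by_host[host].add(path)
            self.hosts_by_file[path].add(host)

    def switch_release(self, host, old, new):
        """Re-key a host's files when it moves from one release to another."""
        prefix = old + "/"
        moved = {
            new + "/" + path[len(prefix):] if path.startswith(prefix) else path
            for path in self.files_by_host.get(host, set())
        }
        self.set_host_files(host, moved)


index = DependencyIndex()
index.set_host_files("hosta", {"stable/os/sl5.h", "stable/site/lab.h"})
# A compilation error drops the host onto the default release.
index.switch_release("hosta", "stable", "default")
```

The remove-and-recreate step is the same work the minutes describe, but done on in-memory sets rather than through database reads and writes.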
- Future Exam Arrangements
- Before a recent exam it wasn't possible to get the environment and exam materials in place until the afternoon before the exam, as the exam was on a Friday and the stable release was as usual due on the Thursday afternoon. To help the setup and testing for future online exams Stephen suggests establishing a subversion branch for the exam machines a week before the exam to give them an extra week's frozen release. This should provide staff with plenty of time for a successful cycle of configuration and testing.
- Bug in grub component
- Stephen's health check scripts showed up a bug in the grub component, whereby grep occasionally gets confused and reports that a file contains binary data when it doesn't. This has been fixed. Stephen took the opportunity to remove support for "server mode" which grub proper hasn't supported for nearly a decade and which we haven't needed since we junked the GX240s.
- Bug in gbios component
- Stephen's health checks showed up this problem too. This component is only needed on machines with widescreen monitors and Intel graphics cards. The component fails (because "915resolution" fails) on machines with widescreen monitors but no Intel graphics card. The component needs to issue a warning rather than failing.
- New HPs with dual head
- Stephen's health checks showed up this problem too. The second monitor is not autodetected. The "r500" driver we use is a bit old. The β release of SL5.4 contains a more recent r500 which might solve the problem. We might also want a script along the lines of get-edid to identify all monitors.
In particular one user has a two-monitor setup which is proving time-consuming to support. It might be simpler to buy this user a single large widescreen monitor, or failing that a new small NVidia graphics card. Stephen will check for other users in a similar situation.
- Old RT ticket
- Stephen will put it to Tim, who may know the answer.
- RT Duty
- Stephen will take the first half of the week and Chris the second half.
- Julian Bradfield's weird key code problems
- These were related to his keyboard, Alastair discovered. Swapping keyboards made the problem travel with the keyboard. Alastair will get Julian a new keyboard.
- Pandemic Planning
- CEG has started PandemicPlanning. They identified areas where how-to knowledge about critical infrastructure needs to be more widely distributed. These are the MPU areas they thought most needed a list of how to put right the top five most likely things to go wrong:
- LCFG release mechanism. Chris will write something up.
- Package service. Alastair will write something.
- Server Virtualisation. Alastair will write something.
- Release script docs
- Chris will revise these to emphasise that all commands should be run from the user's own account and not from root.
This Week
Alastair will:
- Work on rpmsubmit project (produce timetable)
- Tidy LCFG/inf level
- Propose signoff of virtual server project
- Propose signoff of desktop virtualisation project
- Chase George re routing component problem
- Pandemic top 5 howto for server virtualisation
- Pandemic top 5 howto for package service
Chris will:
- TiBS component
- Monitor sleep in student lab
- Tweak LCFG release procedures docs
- Pandemic top 5 howto for release mechanism
- Look into renaming stable to default
- Identify suitable time ranges for boot run on MPU servers
Stephen will:
--
ChrisCooke - 26 Aug 2009