MPU Meeting Tuesday 25th August 2009

Power Management Project

The majority of the Dell 745s in the first test lab, AT-5.03, froze during Chris's holiday. The known HP 7900s by contrast seem to have no problem with sleep. The Dell problem has been going on for ages with no solution in sight. We therefore think it's best to disable lcfg-sleep on Dells and just concentrate on the HPs for now. Chris has moved the sleep tests to another small lab, AT-5.01, which contains some new HPs. If those machines behave with lcfg-sleep it'll be fairly easy to deploy it on all the other new HPs in the student labs.

Chris has been blogging about his progress.

rpmsubmit

Alastair has produced a new version of rpmsubmit for the AFS repository. He's called it pkgsubmit and has taken the opportunity to edit out some irrelevant code. The changeover to the AFS repository is expected to happen in two weeks' time.

AFS Component

Nothing happened.

TiBS

Chris had a meeting with Craig and Alison about lcfg-tibs. He's written it up here. In summary, once we have the current TiBS tarball, Chris will adapt his work for the current TiBS and then we'll deploy it on alexandria. The current version of lcfg-tibs manages all of the configuration files which are currently edited by hand.

A future version will also manage the addition and removal of non-AFS backup details on the system. A version after that will get these details from a spanning map fed by an LCFG backup component on clients. That component will specify candidate filesystems, each with an associated keyword signifying a backup type, which will be translated to a TiBS configuration on the server.

Stephen kindly volunteered a code review of the current version.

Chris has been blogging about his progress.

LCFG Server Refactoring

Stephen and Simon met to decide on a Perl coding style.

Stephen has patched the current server to make it sort the contents of spanning maps. This seems to have no ill effects and it'll make it far easier when performing tests of LCFG server output as part of the server refactoring project.

Stephen has been learning git and gerrit - apparently quite an experience.

He's also produced a Perl module for comparing XML profiles for testing purposes. It is currently not sophisticated but does at least do the XML comparison perfectly well. The next step will be to take snapshots of the inputs and outputs and build tests more automatically.
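
As an illustration only of the kind of comparison involved, here is a minimal Python sketch. The real module is written in Perl and its interface is not described in these minutes, so the function name profiles_equal and the sample XML in the usage note are hypothetical. The sketch treats two documents as equal when their element trees match, ignoring attribute order and surrounding whitespace.

```python
import xml.etree.ElementTree as ET


def _normalise(elem):
    """Return a comparable summary of an element and its descendants."""
    text = (elem.text or "").strip()
    tail = (elem.tail or "").strip()
    attrs = tuple(sorted(elem.attrib.items()))  # attribute order is ignored
    children = tuple(_normalise(child) for child in elem)  # child order matters
    return (elem.tag, attrs, text, tail, children)


def profiles_equal(xml_a, xml_b):
    """True if the two XML strings describe the same element tree."""
    return _normalise(ET.fromstring(xml_a)) == _normalise(ET.fromstring(xml_b))
```

For example, profiles_equal('<p><c a="1" b="2"/></p>', '<p>\n <c b="2" a="1"/>\n</p>') is true, while changing either attribute value makes it false. Child order is deliberately significant here, which is one reason sorted spanning map output makes such tests easier to write.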

Source modules have been rearranged into a more sensible layout.

Stephen mailed lcfg-devel about merging in some LCFG server dependencies.

Miscellaneous Development

LCFG Status Checks
To tide us over until we have Nagios-based solutions, Stephen has put together some LCFG client status check scripts. These will be run every Monday morning and the results mailed out to COs. They look for machines on the stable release with the following problems (suggestions for additions welcome):
  1. The LCFG client does not accept new profiles
  2. Some components are not started
  3. updaterpms has failed to run
  4. The LCFG profile fails to compile

Updaterpms broken script problems
The LCFG Deployers' Meeting flagged up a couple of problems related to updaterpms:
  • How to remove a package with a broken script: this should be simple to fix in updaterpms, so Alastair will do it in September.
  • How to upgrade a package with a broken script: this isn't quite so simple; Alastair will add it to the small projects list.

Cron enhancement for random time in range
Stephen has implemented this and is testing it. It's looking pretty good so far and he hopes to deploy it soon.

Operational

Quote for SAS drives and EVO expansion
Alastair has got one, and it's a bit expensive. For just over 2TB at RAID10 - meaning 4.5TB worth of disks - we'd have to pay about 10,000. Since the speed is only occasionally a problem, and we have other ways of getting round that, we're probably not going to buy these disks just now.
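
The capacity figures follow from RAID10 mirroring every disk, so usable space is half the raw space. A small Python sketch of that arithmetic (the function name is ours, for illustration):

```python
def raid10_usable_tb(raw_tb):
    """Usable capacity of a RAID10 array: every block is stored twice."""
    return raw_tb / 2


print(raid10_usable_tb(4.5))  # 2.25, i.e. "just over 2TB"
```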

Boot run time on virtual servers
One of the ways of spreading the virtual server load on the virtual server host machines is to spread out the time of the running of the boot component. Stephen has modified the cron component to pick a random time in a given range, which will help greatly with this. We can also probably identify some servers which could just as well have their boot component run during the day as at night. We'll all look at the servers and try to identify times at which the boot component could be run.
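
The idea of picking a random time in a range can be sketched in Python as follows. This is not the cron component's actual implementation, which these minutes do not describe; the function name, the (hour, minute) tuple format and the wrap-past-midnight behaviour are all assumptions made for illustration.

```python
import random


def random_time_in_range(start, end, rng=random):
    """Pick a random (hour, minute) between start and end inclusive.

    start and end are (hour, minute) tuples. A range whose end is earlier
    than its start is assumed to wrap past midnight.
    """
    start_min = start[0] * 60 + start[1]
    end_min = end[0] * 60 + end[1]
    if end_min < start_min:
        end_min += 24 * 60  # e.g. 22:00 to 02:00
    chosen = rng.randint(start_min, end_min) % (24 * 60)
    return divmod(chosen, 60)
```

A host might call random_time_in_range((1, 0), (5, 30)) once and write the result into its crontab entry, so that each virtual server ends up running its boot component at a different minute within the window.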

Cosign upgrade for lcfg.org
This has now been done.

LCFG Users' Day
Stephen has started organising this and hopes to split the organisation of the day's timetable with Kenny.

LCFG Server Exam Slowdown
An incident with a typo just before an online exam highlighted an existing problem with the LCFG servers and releases: when a host acquires a compilation error it drops out of its normal release and uses the default release instead. The release in use is a component of the path of every file used to make the machine's profile, so all of the host's dependency information has to be removed then recreated. This is all done with a Berkeley DB4 database, which isn't lightning-fast. When this happens with several hundred hosts, the resulting logjam can mean that the LCFG servers take an hour to process the typo, then another hour to process the removal of the typo.

We had a number of ideas on how to fix this:

  • Chris suggested renaming the stable release "default". He'll look at this.
  • Stephen thought we might get better results out of abandoning the DB4 database and holding dependency information in memory instead. This ought to make it far faster to [re]generate and should be trivial to implement.
  • Stephen also wondered about changing the name of the default release from default to stable but we'd have to coordinate this with other LCFG sites.
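
To illustrate why a change of release invalidates every dependency record, and roughly what holding the information in memory might look like, here is a hedged Python sketch. The class, its method names and the "release/filename" path layout are hypothetical; the real server is written in Perl and stores this data differently.

```python
from collections import defaultdict


class DependencyIndex:
    """In-memory map from source file path to the hosts that depend on it."""

    def __init__(self):
        self._hosts_by_path = defaultdict(set)
        self._paths_by_host = defaultdict(set)

    def set_host(self, host, release, files):
        """Record a host's dependencies, replacing any earlier records."""
        self.remove_host(host)
        for name in files:
            path = f"{release}/{name}"  # hypothetical path layout
            self._hosts_by_path[path].add(host)
            self._paths_by_host[host].add(path)

    def remove_host(self, host):
        """Forget every dependency recorded for a host."""
        for path in self._paths_by_host.pop(host, set()):
            hosts = self._hosts_by_path[path]
            hosts.discard(host)
            if not hosts:
                del self._hosts_by_path[path]

    def hosts_for(self, path):
        """Return the hosts whose profiles depend on the given path."""
        return set(self._hosts_by_path.get(path, ()))
```

Because the release name is part of every path, moving a host from stable to default with set_host replaces every one of its records. With plain dictionaries that is a handful of in-memory operations per file rather than a series of database writes.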

Future Exam Arrangements
Before a recent exam it wasn't possible to get the environment and exam materials in place until the afternoon before the exam, as the exam was on a Friday and the stable release was as usual due on the Thursday afternoon. To help the setup and testing for future online exams Stephen suggests establishing a subversion branch for the exam machines a week before the exam to give them an extra week's frozen release. This should provide staff with plenty of time for a successful cycle of configuration and testing.

Bug in grub component
Stephen's health check scripts showed up a bug in the grub component, whereby grep occasionally gets confused and reports that a file contains binary data when it doesn't. This has been fixed. Stephen took the opportunity to remove support for "server mode", which grub proper hasn't supported for nearly a decade and which we haven't needed since we junked the GX240s.

Bug in gbios component
Stephen's health checks showed up this problem too. This component is only needed on machines with widescreen monitors and Intel graphics cards. The component fails (because "915resolution" fails) on machines with widescreen monitors but no Intel graphics card. The component needs to issue a warning rather than failing.
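
The "warn rather than fail" behaviour could look something like the following Python sketch. It is not the gbios component's code; the helper name run_optional and the logger name are hypothetical, and the command to run is supplied by the caller.

```python
import logging
import subprocess

log = logging.getLogger("gbios-sketch")  # hypothetical logger name


def run_optional(command):
    """Run a command; log a warning instead of raising if it fails.

    Returns True on success, False otherwise.
    """
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except OSError as err:  # e.g. the program is not installed
        log.warning("could not run %s: %s", command[0], err)
        return False
    if result.returncode != 0:
        log.warning("%s exited with status %d: %s",
                    command[0], result.returncode, result.stderr.strip())
        return False
    return True
```

The caller can then carry on with the rest of its configuration whether or not the helper program succeeded, leaving a warning in the logs for the status checks to pick up.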

New HPs with dual head
Stephen's health checks showed up this problem too. The second monitor is not autodetected. The "r500" driver we use is a bit old. The β release of SL5.4 contains a more recent r500 which might solve the problem. We might also want a script along the lines of get-edid to identify all monitors.

In particular one user has a two-monitor setup which is proving time-consuming to support. It might be simpler to buy this user a single large widescreen monitor, or failing that a new small NVidia graphics card. Stephen will check for other users in a similar situation.

Old RT ticket
Stephen will put it to Tim, who may know the answer.

RT Duty
Stephen will take the first half of the week and Chris the second half.

Julian Bradfield's weird key code problems
These were related to his keyboard, Alastair discovered. Swapping keyboards made the problem travel with the keyboard. Alastair will get Julian a new keyboard.

Pandemic Planning
CEG has started PandemicPlanning. They identified areas where how-to knowledge about critical infrastructure needs to be more widely distributed. These are the MPU areas they thought most needed a list of how to put right the top five most likely things to go wrong:
  • LCFG release mechanism. Chris will write something up.
  • Package service. Alastair will write something.
  • Server Virtualisation. Alastair will write something.

Release script docs
Chris will revise these to emphasise that all commands should be run from the user's own account and not from root.

This Week

Alastair will:

  • Work on rpmsubmit project (produce timetable)
  • Tidy LCFG/inf level
  • Propose signoff of virtual server project DONE
  • Propose signoff of desktop virtualisation project DONE
  • Chase George re routing component problem
  • Pandemic top 5 howto for server virtualisation
  • Pandemic top 5 howto for package service

Chris will:

  • TiBS component
  • Monitor sleep in student lab DONE
  • Tweak LCFG release procedures docs DONE
  • Pandemic top 5 howto for release mechanism DONE
  • Look into renaming stable to default DONE
  • Identify suitable time ranges for boot run on MPU servers DONE

Stephen will:

  • LCFG server refactoring

-- ChrisCooke - 26 Aug 2009

Topic revision: r8 - 31 Aug 2009 - 09:23:31 - ChrisCooke
 