MPU Meeting Tuesday 17th January 2012

AFS automation project

Down for signoff at February's development meeting.

LCFG Server Refactoring

Stephen has finished the Informatics end of the testing. Prague has functionally the same profiles as mousa and trondra. A full rebuild is much faster, but a stable release build is 13% slower, which matches how much slower prague's processor is. We could speed up the new server code by turning off warnings and some checks, but we'd rather opt for safety and keep them. Stephen has helped Kenny to set up similar tests for the IS service. It's already clear that a few GeoS machine profiles will have to change slightly to accommodate the new server, but nothing else is expected to need changing. Once Kenny is happy we'll roll out to DIY DICE, then to one of mousa/trondra. Chris will make a new diydice.inf on SL6 using KVM on northern. The project will probably go for signoff in March.

Wake On LAN

Kenny has made new versions of the wolclient and wolconf components which incorporate Chris's patches. The new wolconf version is now in service; the new wolclient is in develop. The service is now using the new CGI. You can now enable WoL on a dc7900 by defining ENABLE_WAKE_ON_LAN. This works on develop (and self-managed) machines and will work on stable machines after the release of 25/1/12. The build mechanism for wakeweb still needs to be redone. We could also do with a command line script which does the right thing a la wake.inf, but this will be a separate "wee project". The 7900 issue will be dealt with as an operational issue, outside of the project. Alastair will test the new kernel with the old BIOS to see if that solves the problem or if the new BIOS really is necessary. Chris reported possible issues with some keyboards not being found at boot time (RT 56367) but said that his machine (with up to date BIOS and kernel) is not affected, since a standard Dell keyboard still works with it at boot time. Chris needs to create a Service Catalogue entry for it. The project should go for signoff in February.

Simple KVM Service

The new kernel fixed the stickiness problem. Alastair will make a wrapper to virsh rather than one just for the console. It'll be possible to integrate console access with conserver but this won't be done as part of this project.

For the transition from VMware to KVM we'll be playing musical servers:

  • bakerloo is free. It can't take more disks, so Alastair will connect a disk array to it directly (because of the continuing SL6 Fibre Channel problems) and RAID10 some of the space there for VMs.
  • piccadilly will be free when the Services Unit replaces its three emergency VMs there with KVM ones on northern. piccadilly then becomes both the hot spare R710 and the hot spare KVM server.
  • circle will be available for KVM use.
  • northern will be available for KVM use.
  • Once metropolitan is free of VMware VMs, it can have disks added to it and be converted to a KVM server too.

Server Upgrades


Miscellaneous Development

ngeneric more quiet
Stephen has enhanced ngeneric to make its -q ("quiet") option do more. When not attached to a terminal, components will now no longer send INFO messages. This work exposed a bug: WAIT messages were being sent in all circumstances rather than only when a terminal was attached. This was clearly not the intention of the code and Stephen has fixed the bug. These two changes, once in stable, should massively cut the amount of root mail.
stdout options for Nagios check scripts
Chris has added --stdout options to check_network and check_hwmon. He will add it to check_multipath too. Stephen suggested a common framework for Nagios check scripts as they seem to have so much in common. We don't think that this is an MPU job but we'll suggest it to Infrastructure.
Nagios passive check strategies
Stephen described how Nagios passive check results are queued up on the Nagios server and processed sequentially - so if the processing is slower than the rate of arrival of results, chaos can ensue. This can be tackled in a couple of ways:
  1. Passive check scripts can report to a local spool directory rather than to Nagios. A single process can then amalgamate these together into one combined report to Nagios once every so often.
  2. Passive check scripts report errors or warnings to Nagios immediately, but only report successes every so often. They keep track of when their last OK was reported and report another one just in time to prevent a check timeout error.
We already have random delays built in to our check scripts for OK results, so at least we shouldn't suffer from hundreds of checks landing on the Nagios server simultaneously.
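
As an illustration only, the second strategy could be sketched roughly as below. This is not taken from any of our check scripts: the class name, the send callback and the timeout values are all hypothetical, and the random delay mentioned above is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable, Optional

OK = 0  # standard Nagios plugin status: 0 = OK, 1 = WARNING, 2 = CRITICAL


@dataclass
class ThrottledReporter:
    """Send problems at once; send OK results only when one is due."""

    # Hypothetical callback which submits one passive result to Nagios.
    send: Callable[[int, str], None]
    # Assumed time (seconds) after which Nagios treats the check as stale.
    freshness_timeout: float
    # Report an OK this many seconds before the timeout would expire.
    safety_margin: float = 60.0
    # Time the last OK was sent; None if the last thing sent was a problem.
    last_ok: Optional[float] = None

    def report(self, status: int, message: str, now: float) -> bool:
        """Submit the result if needed. Returns True if it was sent."""
        if status != OK:
            # Errors and warnings always go to Nagios immediately.
            self.send(status, message)
            self.last_ok = None
            return True
        interval = self.freshness_timeout - self.safety_margin
        if self.last_ok is None or now - self.last_ok >= interval:
            # First result, a recovery, or the previous OK is about to expire.
            self.send(status, message)
            self.last_ok = now
            return True
        # A recent OK is still fresh, so stay silent.
        return False
```

A check script would call report() on every run, passing the current time; only some of those calls would actually reach the Nagios server. Resetting last_ok after a problem means that a recovery is reported straight away rather than waiting for the next scheduled OK.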
Disk UUIDs
Alastair has done some boot tests and discovered that the USB stick always gets priority! Luckily you can blacklist kernel modules from initrd using rdblacklist so we're doing that.
PXE Changes
Alison noticed that you could no longer set default choices for PXE for machines with a serial console. This mattered because for some servers, including 1950s, keyboards don't work with PXE and serial console, so the right choice had to be set using a default. Stephen has fixed the bug so PXE defaults can be set once again.
Installroot new kernel
Thanks to Alastair the CD installroot is using the new kernel. Stephen will update the PXE installroot to use the latest kernel too. This will stand us in good stead when we upgrade to SL6.2 soon.
Fibre Channel on SL6
Alastair discovered that the FC problem is still present with the latest 220 kernel. Craig is going to rejig the KB FC fabric to make it Nexsan-free to see if that helps.


Note that dresden now has no memory - so that prague can have as much memory as mousa and trondra.

This Week

  • Alastair
    • DC7900 - old BIOS + new kernel - does this fix the WoL problem? (OK with BIOS 1.16, tower dc7900 with ATI graphics card)
    • Deploy bakerloo using bioboy1 or atabeast as IF KVM service
    • Arrange figgy (with replacement RAID card) to go to KB
    • Ask Alison to sort out the spare memory - antistatic bags etc
    • Discuss with George - how to rack desktop servers in AT and IF

  • Chris
    • Service catalogue entry for WoL
    • Prepare WoL for signoff
    • Install new KVM on northern to act as diydice, and kill madurai
    • Make a start on local homes for servers
    • PD - local url shortener implementation (for not-a-service)

  • Gordon
    • actions from last meeting

  • Stephen
    • Help Kenny with LCFG testing
    • Consider PD - concrete task
    • RAT

-- AlastairScobie, ChrisCooke - 17 Jan 2012

