MPU Meeting Tuesday 2nd November 2010

LCFG Server Refactoring

Stalled.

Software Build Farm

The build farm software is now coming along fairly nicely. Most of the structure is in place but needs some fleshing out.

The database schema has now stabilised and seems to be working well. Stephen has switched to using the DBIx::Class Perl module to provide access to the database. This has massively simplified a lot of the existing code and made it quicker to write new code. The decision has been made to require PostgreSQL for the database. It might be that the SQL (in particular, the locking code) will work with other databases but it's not worth testing that right now. There is still no authorization handling in the database. The plan is that each process which needs access will have Kerberos credentials which map onto DB users, and the individual DB users will be limited as to which tables they can modify. We need to talk to Graham about how to set this up.

The basic architecture is based around a user submitting a job which consists of a set of source packages and some instructions on how and where they should be built. The submitted jobs are validated and then converted into a set of tasks, one for each required target platform. The build daemons pick off these tasks in the order in which they were submitted, attempt to build them and then store the built packages and log files. They also send the final status back into the DB.
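As a rough illustration of that flow (not the actual build farm code, which is Perl on top of PostgreSQL), the job-to-task expansion and first-come-first-served pick-up could be sketched as below. All class, function and field names are hypothetical, the queue is held in memory purely to keep the sketch short, and the platform names in the usage note are only examples.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count

# Hypothetical ID source; the real system would use database keys.
_task_ids = count(1)


@dataclass
class Job:
    submitter: str
    source_packages: list
    platforms: list  # target platforms the packages should be built for


@dataclass
class Task:
    id: int
    job: Job
    platform: str
    status: str = "queued"


def validate(job):
    """Reject jobs that cannot possibly be built."""
    if not job.source_packages:
        raise ValueError("job has no source packages")
    if not job.platforms:
        raise ValueError("job has no target platforms")


def expand(job):
    """Convert one validated job into one task per target platform."""
    validate(job)
    return [Task(next(_task_ids), job, p) for p in job.platforms]


class TaskQueue:
    """In-memory stand-in for the tasks table, kept in submission order."""

    def __init__(self):
        self._pending = deque()

    def submit(self, job):
        tasks = expand(job)
        self._pending.extend(tasks)
        return tasks

    def next_for(self, platform):
        """Hand the oldest queued task for this platform to a builder."""
        for task in self._pending:
            if task.platform == platform:
                self._pending.remove(task)
                task.status = "building"
                return task
        return None
```

A builder for, say, F13 would call `queue.next_for("f13")` in a loop, build whatever it is given, and then record the final status. In the real system the pick-up step would need the database locking mentioned above so that two builders cannot claim the same task.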

Stephen noted that the processing of the incoming tree might well stay as a simple cron job but the builder processes are much better suited as daemons. Although he has code which should do the daemonisation work he hasn't tested this properly yet.

There will have to be various janitorial processes run on a regular basis to keep everything tidy and consistent. These will do things like look for unfinished tasks which have taken too long (i.e. the builder has gone AWOL). The build process will have a time-out enforced (probably approximately 2 hours) but there is always the chance that the machine might crash.
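One such janitorial check might look roughly like the sketch below. It assumes each task record carries a status and a start time. The field names, the "building" status value and the 15 minute grace period are assumptions made for illustration; only the 2 hour figure comes from the discussion above.

```python
from datetime import datetime, timedelta

BUILD_TIMEOUT = timedelta(hours=2)  # "probably approximately 2 hours"
GRACE = timedelta(minutes=15)       # assumed slack before the janitor steps in


def find_stale_tasks(tasks, now, limit=BUILD_TIMEOUT + GRACE):
    """Return tasks still marked as building long after the time-out.

    A healthy builder enforces the time-out itself, so anything found
    here suggests the builder crashed or otherwise went missing.
    """
    return [
        t for t in tasks
        if t["status"] == "building" and now - t["started"] > limit
    ]


if __name__ == "__main__":
    now = datetime(2010, 11, 2, 12, 0)
    tasks = [
        {"id": 1, "status": "building", "started": now - timedelta(hours=3)},
        {"id": 2, "status": "building", "started": now - timedelta(minutes=30)},
        {"id": 3, "status": "done", "started": now - timedelta(hours=5)},
    ]
    for task in find_stale_tasks(tasks, now):
        print("stale task:", task["id"])
```

What happens to a stale task (requeue it, or mark it as failed) is a policy decision that has not been made yet.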

Most of the job status information can be queried through command line tools but ideally we will have a web interface as well to make things a little more pleasant to use. Nothing has been done on the web interface so far; hopefully it won't be too tricky.

F13

Stephen reported that Lindsay has a problem with not being able to log in via KDM. She could log in using GDM though. Stephen will investigate in case this might affect more staff. Chris noted that his desktop environment has somehow switched to KDE as the default. This isn't a general problem and might be related to him doing some of the KDM testing.

As agreed in the last Operational Meeting, the devel bucket has been pulled from testing and stable profiles. This has resulted in 250+ packages being unavailable, which breaks installs. Alastair will talk to unit managers about getting this done ASAP but we might have to let this change slip until next week.

This project is going to the November Development Meeting for closure; Alastair has written a final report.

Replacement for VMware Server

The proposal for a project to find a replacement for our VMware Server systems will go to the November Development Meeting. Given the high priority of the project it is expected that this will be accepted, so Alastair and Chris will go ahead and start producing the requirements document.

Miscellaneous Development

nagios configs
Alastair noted that every unit is configuring nagios for their servers in different ways. We should agree on a standard approach and then create some headers to make it all work sensibly. This will probably have to wait until the RHEL6 port.

Power Down

Things which MPU needs to do:

  • Stop updaterpms from running at boot time by removing start from the updaterpms.methods resource.
  • Set shutdown times for Forum desktops. We will probably add two cron jobs, one in the early hours and another at 9:30am for any machines which have been turned on again. We might also want to do the labs; Stephen will check with Alison.
  • List MPU servers and plan the sequence for shutdown and restart.

Operational

SRPM access
Alastair has added freshenrpms for the source buckets. He also did a proper fix for the SRPM tree apache configuration. Stephen still needs to check that the LCFG website is now getting the source packages.

Moving sauce to KB
Chris has had problems getting our HP server, sauce, to use the dhcp header in readiness for the move to KB. He has seen networking problems which seem to be related to either DNS or routing (or both). He has also had trouble with IPMI access, although he was following Ian's notes. Alastair will take a look.

nagios raid checks
Chris has reduced the frequency of the nagios raid checks.

ethernet bonding and IPv6
Whilst moving figgy Chris discovered a problem related to the recent blocking of the loading of the ipv6 kernel module. It turns out that the ethernet bonding module requires the ipv6 module also be loaded. A fix has been added to the dice/options/etherbond.h header. Stephen checked the modules.dep file for the SL5 and F13 kernels and this seems to be the only module likely to cause us problems.
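For anyone wanting to repeat that check on another kernel, a small script along the following lines would do it. It relies on the modules.dep layout written by depmod, where each line is a module path, a colon, then the paths of the modules it depends on. The function names and the sample text are made up for illustration and are not taken from a real kernel.

```python
import os


def module_name(path):
    """Strip the directory and the .ko (or .ko.gz etc.) suffix from a path."""
    return os.path.basename(path).split(".ko")[0]


def modules_requiring(dep_text, wanted="ipv6"):
    """List modules whose modules.dep entry includes the wanted module."""
    hits = []
    for line in dep_text.splitlines():
        target, sep, deps = line.partition(":")
        if not sep:
            continue  # blank or malformed line
        if wanted in (module_name(d) for d in deps.split()):
            hits.append(module_name(target))
    return sorted(hits)


# Illustrative sample only, not real modules.dep content.
SAMPLE = """\
kernel/drivers/net/bonding/bonding.ko: kernel/net/ipv6/ipv6.ko
kernel/drivers/net/bnx2.ko:
kernel/net/ipv6/ipv6.ko:
"""

if __name__ == "__main__":
    print(modules_requiring(SAMPLE))
```

To run it against a real machine, read the text from the modules.dep file under /lib/modules for the kernel version of interest and pass it to `modules_requiring`. The result still needs a human to judge which of the listed modules we actually load.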

Moving MPU VMs
Chris will make a list of the MPU VMs which are on bakerloo. We will split them up amongst the unit. Alastair will do the non-LCFG VMs; most of the others can just be new installs.

Moving MPU test servers
We might move the MPU test servers from the Forum to AT during the power down. If so we will need to do the dhcp dance and get them physically shifted on Friday.

bnx2 and msi
Alastair documented the fix for machines with network cards using the bnx2 driver occasionally losing bonded networking. The machine on which we saw most of the problems, metropolitan, is currently running without MSI disabled and it has not had a problem for quite a while. This suggests that the issue might have been fixed. We will just have to wait and see.

smartd
Chris has done some investigating of the smartd problems. Most of them appear to be in the labs and on HP 7900 machines. They don't appear to be real problems as smartctl does not report any errors. If we can't stop the warnings entirely then it would be good to see if we can work out which GNOME applet is reporting the spurious smartd errors and disable it. Stephen noted that smartd can be useful when investigating real problems so we don't want to disable the daemon, but this means we should not ignore the messages it sends to the rootmail account.

R610 & R710 docs
Alastair has added some documentation on the Dell R610 and R710 hardware.

Perl training
Alastair has contacted Dave Cross and got the details of how much it would cost to run a Perl training course for Informatics staff. He is offering either the one-day courses he has previously done for UKUUG or a new two-day course which has a lot more practical content. He charges £150 per person per day plus travel and accommodation expenses. There is a minimum requirement of 3 people. We reckon that if he is doing the two-day practical-based course we would want no more than 10 people on the course. Alastair needs to run it by Liz but hopefully it will be passed since this is cheaper than sending staff to London for the UKUUG courses. After COs have taken up spaces on the course we could offer it around internally and also look to the college for anyone interested.

This Week

Alastair will:

  • Review MPU project list
  • Update project status for dev meeting
  • Talk to unit managers about devel bucket
  • Look at sauce
  • VM server requirements
  • Move bakerloo VMs
  • Add note about nagios changes to RHEL6 project

Chris will:

  • Review MPU project list
  • Update project status for dev meeting
  • VM server requirements
  • List bakerloo VMs
  • Tweak IPMI on KB MPU servers
  • Look at smartd problems

Gordon will:

  • Review MPU project list
  • Update project status for dev meeting
  • Review LCFG bugs

Stephen will:

  • Review MPU project list
  • Update project status for dev meeting
  • Power down preparation
  • Software build farm
  • Lindsay's kdm problem

-- StephenQuinney - 02 Nov 2010

Topic revision: r2 - 02 Nov 2010 - 16:38:20 - AlastairScobie
 