MPU Meeting Tuesday 14th September 2010

LCFG Server Refactoring

Stalled.

Software Build Farm

The project has started.

Stephen had a prototype already running so the work consists of steadily finalising various bits of it.

He now has a working basic client tool which lets you submit build jobs.

The architecture: one machine runs a daemon or cron job (yet to be decided). This looks at the queue of submitted build jobs, validates them and moves the valid ones to an accepted queue, registering them as jobs which need to be run. Daemons on each build machine then look for jobs to run.
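
As an illustration only, the flow described above could be sketched as follows. This is not Stephen's code: the class names, the validation rule and the platform names are all assumptions made for the example, and the real queues will presumably live on disk or in a database rather than in memory.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BuildJob:
    package: str   # hypothetical field: name of the package to build
    platform: str  # hypothetical field: target platform, e.g. "f13"


@dataclass
class BuildQueues:
    submitted: deque = field(default_factory=deque)  # filled by the client tool
    accepted: list = field(default_factory=list)     # jobs registered as required
    rejected: list = field(default_factory=list)     # jobs which failed validation


# Assumed set of platforms; the real validation rules are not in these notes.
KNOWN_PLATFORMS = {"f13", "sl5"}


def validate(job):
    """Accept a job only if it names a package and a known platform."""
    return bool(job.package) and job.platform in KNOWN_PLATFORMS


def process_submissions(queues):
    """One pass of the central daemon or cron job."""
    while queues.submitted:
        job = queues.submitted.popleft()
        if validate(job):
            queues.accepted.append(job)
        else:
            queues.rejected.append(job)


def claim_job(queues, platform):
    """What a build machine's daemon does: take the oldest accepted job
    for its own platform, or return None if there is nothing to do."""
    for job in queues.accepted:
        if job.platform == platform:
            queues.accepted.remove(job)
            return job
    return None
```

Whether the central step runs as a daemon or from cron only changes how often process_submissions is called; the queue handling is the same either way.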

Installroot

Alastair has deployed the F13 installer built on F13. The release scripts have been modified. He needs to tidy the installroot and PXE documentation to reflect changes.

F13

We'll need f13_64 at some point. Stephen reckons that the core of this would be two days' work. The RAT packages would also need effort on top of this, but thanks to Iain's work most of them will just need to be built and submitted. Alastair will mark priorities against each thing still to be done so we can order them appropriately.

Miscellaneous Development

New storage array
There are two problems:
Management
Alastair couldn't get the management application to send emails or SNMP traps on critical events such as power failure. The management application runs on Linux. It fails on DICE SL5 and on native SL5. He's not too worried though as we can also monitor the array through scripting, initially a cron/email combination then proper Nagios checks.
Multipathing
Alastair tested the multipathing and found that you need a proprietary additional driver to do failover to an additional controller. This driver needs to be loaded in initrd. Alastair tried their initrd and couldn't get it to work. However, simple multipathing does work, and if a controller fails we can manually move to another controller. Things can be set up so that to do this you just need to enable extra switch ports.

Operational

DL180s
Alastair and Ian have checked them out.
  • Ian has IPMI Serial Over LAN working (although it needs tidying and documenting).
  • Alastair has bought an extra ether card for (HP) sauce. Since eth0 is being used for the (IPMI SOL) serial console, ethernet bonding is done over eth1 and eth2.
  • Alastair will try an install with SOL to make sure it works.
  • Alastair has rolled out a new lcfg-fstab which supports cciss controllers. He's changed the scsiroot header so you don't have to reference cciss directly. You instead use sda in your fstab resources as normal and the component translates these for cciss devices.
  • The HPs are better than Dells in some ways:
    • You can do firmware upgrades (e.g. RAID, BIOS) on a running machine without disturbing it, then just reboot to use the new firmware.
    • There's also a better and more Linux-friendly RAID controller application.
    • The HPs have health monitoring (small, tidy RPMs - not like OMSA) which would be easy to connect to Nagios.
  • We'll now deploy sauce at KB. It's to be the MPU's DR machine there.
  • We'll need to add Nagios support for the HP monitoring. (This is now in the Wee Projects list.)
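
To illustrate the sda-to-cciss translation mentioned above, here is a minimal sketch. It is not the lcfg-fstab code. It assumes the usual Linux cciss naming (cciss/c0d0 for the first logical drive on the first controller, with a "p" before the partition number) and assumes that sda maps to drive 0, sdb to drive 1 and so on; the function name is made up for the example.

```python
import re

# Matches plain names such as "sda" or "sdb3": one drive letter, optional partition.
_SD_NAME = re.compile(r"^sd([a-z])(\d*)$")


def to_cciss(device, controller=0):
    """Translate an sdX[N] name to its assumed cciss equivalent."""
    match = _SD_NAME.match(device)
    if match is None:
        raise ValueError(f"not a simple sdX device name: {device!r}")
    drive = ord(match.group(1)) - ord("a")  # sda -> 0, sdb -> 1, ...
    name = f"cciss/c{controller}d{drive}"
    if match.group(2):
        # cciss partitions carry a "p" prefix, e.g. c0d0p1
        name += f"p{match.group(2)}"
    return name
```

Under those assumptions to_cciss("sda1") gives "cciss/c0d0p1", so fstab resources can go on saying sda1 whatever the controller type.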

Nagios RAID monitoring time interval
Chris pointed out that the Nagios passive check of RAID status happens every minute. It could be every 15 minutes, lessening the load on the Nagios server. Chris will make the change.
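
For a rough idea of the saving, the arithmetic is below. The function name is made up and the host count is whatever number of machines report RAID status, which these notes don't give.

```python
MINUTES_PER_DAY = 24 * 60


def results_per_day(interval_minutes, hosts=1):
    """Passive check results the Nagios server receives per day."""
    return hosts * MINUTES_PER_DAY // interval_minutes


per_minute = results_per_day(1)        # 1440 results per host per day
per_quarter_hour = results_per_day(15)  # 96 results per host per day
```

So the change cuts the RAID check traffic by a factor of 15 for each host.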

Installroot DHCP problems
The new installroot works with our DHCP server but not with some others. Alastair will produce a suitable fix.

Thunderbird
Alastair and Iain have sorted something out and Iain's putting it in service this week.

Metropolitan VM storage
Chris tidied up the mess in metropolitan's blob.

Proper login screens
Alastair will tackle this. The most important thing is perhaps to see what RAT need for the exam environment.

Stable release
We had package conflicts when *-* accidentally escaped into a stable release. We need to tweak the release testing procedures to outlaw *-* in package lists. RAT noticed that it also occasionally appears in headers. This can be allowable in exceptional circumstances but is discouraged, except when removing software.
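
A release-test check for this could look something like the sketch below. It is only a sketch: it assumes one package specification per line, treats lines starting with "#" as preprocessor directives to be skipped, and simply reports any remaining line which contains *-*. The function name is made up, and the real test scripts may need to allow the exceptional cases mentioned above.

```python
def find_wildcard_versions(lines):
    """Return (line number, text) for every package line containing '*-*'."""
    hits = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        # Skip blank lines and (assumed) preprocessor directives.
        if not line or line.startswith("#"):
            continue
        if "*-*" in line:
            hits.append((number, line))
    return hits
```

A stable release would then be blocked whenever the returned list is not empty.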

RAT package list troubles
RAT reported to us that it has been having trouble with its package lists:
fixed vs floating
Some packages need to have fixed versions but others don't, yet the current package list arrangement forces all packages to be fixed. Stephen suggests splitting the RAT lists into separate "fixed" and "floating" lists and loading the "floating" ones after package updates have been applied. Perhaps the teaching packages would tend to be in the "fixed" category.
devel
Packages needed purely for building other packages - e.g. in BuildRequires in a spec file - can be put into dice_f13_devel. This will make them available on COs' and build machines but not elsewhere. This should cut down the conflicts and complication we sometimes experience with these packages.
perl test modules
dice_f13_devel is also good for perl test modules: they're useful to have available but not on every machine.

Gordon
Gordon will join the MPU for the rest of 2010.

Projects for the last third of 2010

The project list has been finalised:

On the subject of the VMware Server replacement, Alastair saw a presentation from Graeme Wood about IS's virtual server service. The service looks quite good and it may well be possible for us to use it. Points:

  • You need to use Windows for access to the console and for management. IS are willing to look at providing a Windows terminal service.
  • They'll carry our VLANs.
  • Live migration will be available.
  • They'll have live replication available within the year - disk and live memory - for instant switchover.
  • The initial price seems high but they're talking of reducing it. For what you get perhaps it's not so high.
  • The primary service will be at KB with replication infrastructure at AT. This may have network implications, but only for virtual servers which use the network an awful lot.

This Week

Alastair will:

  • Finish off the HP DL180 support
  • Finish off the IBM storage array support
  • Assign priorities to F13 tasks
  • Tidy the installroot and PXE documentation.

Chris will:

  • work on Hadoop for RAT

Stephen will:

  • work on the Software Build Farm
  • finish putting F13 onto the LCFG website
  • finish the Package list test scripts

-- ChrisCooke - 17 Sep 2010

Topic revision: r1 - 17 Sep 2010 - 10:15:50 - ChrisCooke
 