MPU Meeting Wednesday 7th December 2011

AFS automation project

Nothing happened.

LCFG Server Refactoring

The code has been copied into subversion and packaged using the LCFG build tools. This avoids having to add support for git to the build tools. Alastair has set up a test diydice server running SL6 which uses the new version of the LCFG server. A few bugs were found in the packaging (mostly related to compile-time macro translations) along with one minor bug in the code; these have all been fixed.

The next stage is to move on to using a test version of the lcfgtest service, which has 277 source profiles. Once the test suite has been run, and any resulting bugs fixed, we will move on to running a full server on prague. This is the same hardware as mousa and trondra, which should allow for some useful performance comparisons.

Wake-On-Lan

The first version of the service is now live. A newer version has better authorization checks and provides a better interface for anyone with the ability to wake up a lot of hosts. That will be rolled out after Christmas.

We should add a check for virtual machine profiles so those do not get added to the list.

The problems with the HP DC7900 where it wakes up once but then never sleeps again appear to be OS-specific. Someone using the latest Ubuntu on one of these machines has no problems using wake-on-lan. Hopefully this will be fixed in a later version of RHEL6 (maybe 6.2?).

Simple KVM Service

Chris has now tried out the KVM service and it all worked fine. Alastair is waiting for more users before adding any more documentation so we can get a feel for what still needs better coverage. There is now a beta service on northern. Stephen will move the SL5 build hosts onto the KVM service and Alastair will do the SL6 build hosts.

Server Upgrades

Waiting on the LCFG server testing before upgrading the diydice server.

Miscellaneous Development

lcfg-kdm
A patch from Kenny was applied to add support for xdmcp, which is needed for freenx usage. See bug#484 for details.

Local partition mount options
Kenny has provided a patch to add support for modifying the mount options on the local disk partition. See bug#502 for details.

chkrootkit
We are now running chkrootkit each night on the SSH servers. The report is sent to the MPU list. There are a few oddities but nothing serious. We will monitor it for a while to get a feel for what constitutes "normal" behaviour.

TCP "hardening"
Stephen has been working on an LCFG header - dice/options/tcp-harden.h - which has sysctl kernel settings to "harden" the TCP/IP stack on the SSH servers. George has taken a look and given some feedback.
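As a rough illustration of how such settings could be verified on a host, the sketch below compares the output of "sysctl -a" against a table of wanted values. The three keys are standard Linux sysctls chosen purely as examples; the actual contents of dice/options/tcp-harden.h are not recorded in these minutes, so the DESIRED table is an assumption.

```python
# Hypothetical target values: the real settings in tcp-harden.h are not
# listed in these minutes, so these three keys are illustrative only.
DESIRED = {
    "net.ipv4.tcp_syncookies": "1",
    "net.ipv4.conf.all.rp_filter": "1",
    "net.ipv4.conf.all.accept_source_route": "0",
}


def parse_sysctl(text):
    """Parse 'key = value' lines (as printed by `sysctl -a`) into a dict."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank lines and anything that is not a setting
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def find_mismatches(current, desired=DESIRED):
    """Return {key: (actual, wanted)} for each wanted key that differs or is absent."""
    return {
        key: (current.get(key), wanted)
        for key, wanted in desired.items()
        if current.get(key) != wanted
    }
```

On a live machine the text passed to parse_sysctl could come from running "sysctl -a"; an empty result from find_mismatches would mean every listed setting is in place.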

Desktop SSH header
The LCFG header for DICE desktops with SSH firewall holes is now ready for deployment. There are only 5 machines which need this but they are all likely to be a little bit "awkward". Stephen will start by testing it on Paul Anderson's machine.

Operational

Virtualbox
Virtualbox is now on version 4 everywhere by default.

SL5.5
All support for SL5.5 has been removed.

LCFG schemas for SL6
There are a lot of schema packages missing on SL6. This is mainly due to the versions in lcfg-defaults.rpms being out-of-date. Stephen has made a start on bringing them all up-to-date and removing ancient versions which are no longer required.

auditd boot issues
Stephen is continuing to investigate the problems which cause auditd to fail to start at boot time when called from the boot component. As part of this it was noticed that some scripts use chkconfig to see if a daemon is enabled. This is a problem as we do not use that approach to enable and disable daemons. The LCFG boot component has the necessary information, so theoretically we could add support for this behaviour. Stephen will file a bug.

lcfg.org server outage
The lcfg.org server, dresden, crashed on Friday 2nd December due to a RAID card failure. The FC card was removed and put into budapest (a spare PE1950) and most of the service was resumed within a couple of hours. There was a problem with a lack of backup for the mysql DB used for the LCFG bug tracker. There is a cron job but it has been attempting to run a non-existent script for a long time. Thankfully it was possible to revive dresden using a RAID card taken from figgy, a spare PE850, and the DB was recovered. We now need to sort out a replacement RAID card for figgy. We will leave the service on budapest until we do the SL6 upgrade, when it will move back to dresden. This raised the issue of access to spares for old hardware. We did have a collection of spare RAID cards in the machine room but we didn't know this initially, and it wasn't clear what we had or where they were stored. Alastair will take the issue to CEG. We should also come up with a way of trawling rootmail for error messages specifically from MPU servers.
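The backup cron job that silently pointed at a non-existent script is the kind of fault a small sanity check could flag. The following is a minimal sketch, assuming user-crontab syntax (five time fields or an "@" shortcut, then the command) and commands given as absolute paths; the function names are invented for illustration and this is not an existing LCFG or DICE tool.

```python
import os
import shlex


def cron_command(line):
    """Return the command part of a user-crontab line, or None if there is none."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or comment
    if line.startswith("@"):
        # Shortcut form such as "@daily /path/to/script"
        parts = line.split(None, 1)
        return parts[1] if len(parts) == 2 else None
    fields = line.split(None, 5)
    if len(fields) < 6:
        return None  # e.g. "MAILTO=root" or a malformed entry
    return fields[5]


def missing_scripts(crontab_text):
    """List absolute program paths named in the crontab that are not existing files."""
    missing = []
    for line in crontab_text.splitlines():
        command = cron_command(line)
        if command is None:
            continue
        try:
            program = shlex.split(command)[0]
        except (ValueError, IndexError):
            continue  # unparseable command, leave it to a human
        if os.path.isabs(program) and not os.path.isfile(program):
            missing.append(program)
    return missing
```

Run from a nightly job against the crontab text of each server, a non-empty result could be mailed to the MPU list. Commands that rely on PATH lookup are deliberately ignored here, since resolving them would need the cron environment.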

Fibre and SL6
The satabeast was shut down and the RSCN settings changed. When northern was brought up it immediately hit the same "lun 0" bug that had caused problems in the past. Before we do any more testing of the changes, a locally patched version of the kernel will have to be built. The changes caused error messages on other machines on the same fabric; we need to check if these are still occurring.

northern
We will now put northern into service, so we need another machine in KB for fibre testing. We will move figgy once it has a working RAID card.

Backups of SL6
We need to have a backup of the SL6 distro RPMs on sauce, the MPU DR server.

gnome smartd applet
This has now been removed from SL6 to get rid of the annoying (and often spurious) error messages seen by users in the labs.

SSH service compromise report
This report has been finished and is published on the DICE publications page.

DNS install issues
George has done some work on the LCFG dns component which should hopefully fix the issues at install time with small installs. The changes are only in develop for now.

nagios check scripts
Stephen noted that it would occasionally be useful to be able to run the various nagios check scripts by hand, using something like a --stdout option so that output goes to the terminal rather than the nagios server. This also raised the idea of aggregating the results of all passive check scripts on a machine and sending a single result to the server, to reduce load on the server.
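The aggregation idea could look something like the sketch below, which folds several per-check results into one status and one summary line. The numeric codes are the standard nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the ranking of UNKNOWN above WARNING, the summary format and the function name are assumptions made for illustration.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Assumed ranking from least to most severe; sites may prefer another order.
SEVERITY = [OK, WARNING, UNKNOWN, CRITICAL]


def aggregate(results):
    """Combine (check_name, status_code) pairs into one (status, summary) result."""
    if not results:
        return UNKNOWN, "UNKNOWN: no check results"
    # Treat any code outside the standard four as UNKNOWN.
    normalised = [
        (name, status if status in NAMES else UNKNOWN) for name, status in results
    ]
    worst = max((status for _, status in normalised), key=SEVERITY.index)
    problems = [
        "%s=%s" % (name, NAMES[status])
        for name, status in normalised
        if status != OK
    ]
    if problems:
        summary = "%s: %s" % (NAMES[worst], ", ".join(problems))
    else:
        summary = "OK: all %d checks passed" % len(results)
    return worst, summary
```

For a --stdout style of use, the same summary string could simply be printed to the terminal instead of being submitted as a passive result.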

This Week

  • Alastair
    • Move SL6 build hosts to northern.
    • Take to CEG - harvesting old machines for spares in an organised fashion
    • FC issue - hunt for relevant RH bug. Patch latest kernel re LUN zero problem.
    • Arrange figgy (with replacement RAID card) to go to KB
    • Pass BIOS settings to USU
    • Consider focus for perl learning
    • Investigate updaterpms timeout issues (wrt AFS hangs)
    • Finish work on installroot re multiple interfaces and timeouts (calling udhcpc correctly)

  • Chris
    • Enjoy the heat after the storm

  • Stephen
    • LCFG refactor
      • compare results from DIYDICE live and test services
      • New LCFG server on SL6 prague using full inf profiles
    • Move SL5 build hosts to northern.
    • File a bug wrt lcfg-boot disabling services using chkconfig
    • Configure mysql backup for bugs.lcfg
    • Deploy desktop ssh header
    • Contact users re .xlog file
    • RAT work
    • Move student.ssh to dunlin

-- AlastairScobie - 07 Dec 2011

Topic revision: r10 - 09 Jan 2012 - 14:46:22 - StephenQuinney
 