MPU Meeting Wednesday 7th December 2011
AFS automation project
Nothing happened.
LCFG Server Refactoring
The code has been copied into subversion and packaged using the LCFG
build tools. This avoids having to add support for git to the build
tools. Alastair has set up a test diydice server running SL6 which
uses the new version of the LCFG server. A few bugs were found in the
packaging (mostly related to compile-time macro translations) and one
minor bug was found in the code; these have all been fixed.
The next stage is to move on to using a test version of the lcfgtest
service, which has 277 source profiles. Once the test suite has been
run, and any resulting bugs fixed, we will move on to running a full
server on prague, which is the same hardware as mousa and trondra; this
should allow for some useful performance comparisons.
Wake-On-Lan
The first version of the service is now live. There is a newer version
with better authorization checks and which also provides a better
interface for anyone with the ability to wake up a lot of hosts. That
will be rolled out after Christmas.
We should add a check for virtual machine profiles so those do not get
added to the list.
The problems with the HP DC7900 where it wakes up once but then never
sleeps again appear to be OS-specific. Someone using the latest Ubuntu
on one of these machines has no problems using wake-on-lan. Hopefully
this will be fixed in a later version of RHEL6 (maybe 6.2?).
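For reference, the wake-up itself relies on the standard magic packet format: six 0xFF bytes followed by the target MAC address repeated sixteen times, usually sent as a UDP broadcast. The Python sketch below only illustrates that format; it is not the code used by the service, and the function names, the broadcast address and the choice of port 9 are illustrative assumptions.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the Wake-on-LAN magic packet for a MAC such as 00:11:22:33:44:55."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected 12 hex digits in MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Six 0xFF bytes, then the MAC address repeated sixteen times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (address and port are assumptions)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

A check for virtual machine profiles, as suggested above, would sit in front of a call like `send_magic_packet` so that such hosts never reach the wake list.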
Simple KVM Service
Chris has now tried out the KVM service and it all worked
fine. Alastair is waiting for more users before adding any more
documentation so we can get a feel for what still needs better
coverage. There is now a beta service on northern. Stephen will move
the SL5 build hosts onto the KVM service and Alastair will do the SL6
build hosts.
Server Upgrades
Waiting on the LCFG server testing before upgrading the diydice server.
Miscellaneous Development
- lcfg-kdm
- A patch from Kenny was applied to add support for xdmcp, which is needed for freenx usage. See bug#484 for details.
- Local partition mount options
- Kenny has provided a patch to add support for modifying the mount options on the local disk partition. See bug#502 for details.
- chkrootkit
- We are now running chkrootkit each night on the SSH servers. The report is sent to the MPU list. There are a few oddities but nothing serious; we will monitor it for a while to get a feel for what constitutes "normal" behaviour.
- TCP "hardening"
- Stephen has been working on an LCFG header, dice/options/tcp-harden.h, which has sysctl kernel settings to "harden" the TCP/IP stack on the SSH servers. George has taken a look and given some feedback.
- Desktop SSH header
- The LCFG header for DICE desktops with SSH firewall holes is now ready for deployment. There are only 5 machines which need this but they are all likely to be a little bit "awkward". Stephen will start by testing it on Paul Anderson's machine.
Operational
- Virtualbox
- Virtualbox is now on version 4 everywhere by default.
- SL5.5
- All support for SL5.5 has been removed.
- LCFG schemas for SL6
- There are a lot of schema packages missing on SL6. This is mainly due to the versions in lcfg-defaults.rpms being out-of-date. Stephen has made a start on bringing them all up-to-date and removing ancient versions which are no longer required.
- auditd boot issues
- Stephen is continuing to investigate the problems which cause auditd to fail to start at boot time when called from the boot component. As part of this it was noticed that some scripts use chkconfig to see if a daemon is enabled. This is a problem as we do not use that approach to enable and disable daemons. The LCFG boot component has the necessary information, so theoretically we could add support for this behaviour. Stephen will file a bug.
- lcfg.org server outage
- The lcfg.org server, dresden, crashed on Friday 2nd December due to a RAID card failure. The FC card was removed and put into budapest (a spare PE1950) and most of the service was resumed within a couple of hours. There was a problem with the lack of a backup for the mysql DB used for the LCFG bug tracker: there is a cron job, but it has been attempting to run a non-existent script for a long time. Thankfully it was possible to revive dresden using a RAID card taken from figgy, a spare PE850, and the DB was recovered. We now need to sort out a replacement RAID card for figgy. We will leave the service on budapest until we do the SL6 upgrade, when it will move back to dresden. This raised the issue of access to spares for old hardware. We did have a collection of spare RAID cards in the machine room, but we didn't know this initially and it wasn't clear what we had or where it was stored. Alastair will take the issue to CEG. We should also come up with a way of trawling rootmail for error messages specifically from MPU servers.
- Fibre and SL6
- The satabeast was shut down and the RSCN settings changed. When northern was brought up it immediately hit the same "lun 0" bug that had caused problems in the past. Before we do any more testing of the changes, a locally patched version of the kernel will have to be built. The changes caused error messages on other machines on the same fabric; we need to check if these are still occurring.
- northern
- We will now put northern into service, so we need another machine in KB for fibre testing. We will move figgy once it has a working RAID card.
- Backups of SL6
- We need to have a backup of the SL6 distro RPMs on sauce, the MPU DR server.
- gnome smartd applet
- This has now been removed from SL6 to get rid of the annoying (and often spurious) error messages seen by users in the labs.
- SSH service compromise report
- This report has been finished and is published on the DICE publications page.
- DNS install issues
- George has done some work on the LCFG dns component which should hopefully fix the issues at install time with small installs. The changes are only in develop for now.
- nagios check scripts
- Stephen noted that it would occasionally be useful to be able to run the various nagios check scripts by hand using something like a --stdout option so that output goes to the terminal rather than the nagios server. This also raised the idea of having the results of all passive check scripts on a machine aggregated and a single result sent to the server, to reduce load on the server.
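The two nagios ideas above could be sketched roughly as follows. This is an illustration only, not an existing script: `--stdout` is the option proposed in the discussion, while `check_example`, `aggregate` and `submit_passive_result` are hypothetical names, and the severity ordering used when combining results (CRITICAL over UNKNOWN over WARNING over OK) is an assumption.

```python
import argparse

# Standard nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}
# Assumed severity ranking used when combining results (higher is worse).
SEVERITY = {OK: 0, WARNING: 1, UNKNOWN: 2, CRITICAL: 3}


def check_example():
    """Hypothetical check; a real script would inspect the machine here."""
    return OK, "example service is running"


def aggregate(results):
    """Combine (status, message) pairs into one result with the worst status."""
    if not results:
        return UNKNOWN, "no checks were run"
    worst = max((status for status, _ in results), key=SEVERITY.__getitem__)
    summary = "; ".join(f"{STATUS_NAMES[s]}: {m}" for s, m in results)
    return worst, summary


def submit_passive_result(status, message):
    """Placeholder: how results reach the nagios server is site-specific."""
    raise NotImplementedError("passive result submission is not sketched here")


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run passive checks")
    parser.add_argument("--stdout", action="store_true",
                        help="print the result to the terminal instead of sending it")
    args = parser.parse_args(argv)
    status, message = aggregate([check_example()])
    if args.stdout:
        print(f"{STATUS_NAMES[status]} - {message}")
    else:
        submit_passive_result(status, message)
    return status
```

Calling `main(["--stdout"])` prints the combined result to the terminal and returns the status code, so the same entry point serves both manual runs and the single aggregated submission.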
This Week
- Alastair
- Move SL6 build hosts to northern.
- Take to CEG - harvesting old machines for spares in organised fashion
- FC issue - hunt for relevant RH bug. Patch latest kernel re LUN zero problem.
- Arrange figgy (with replacement RAID card) to go to KB
- Pass BIOS settings to USU
- Consider focus for perl learning
- Investigate updaterpms timeout issues (wrt AFS hangs)
- Finish work on installroot re multiple interfaces and timeouts (calling udhcpc correctly)
- Chris
- Enjoy the heat after the storm
- Stephen
- LCFG refactor
- Compare results from DIYDICE live and test services
- New LCFG server on SL6 prague using full inf profiles
- Move SL5 build hosts to northern.
- File a bug wrt lcfg-boot disabling services using chkconfig
- Configure mysql backup for bugs.lcfg
- Deploy desktop ssh header
- Contact users re .xlog file
- RAT work
- Move student.ssh to dunlin
-- AlastairScobie - 07 Dec 2011