MPU Meeting Friday 29th May 2015
LCFG Client Refactoring
Nothing happened.
Inventory
Alastair has finished the new parser for orders data. He has tested it
but it is not yet in service. The testing against the current data
revealed a lot of bad data which has had to be fixed. The parser needs
to be formally documented to describe what it will and will not
accept. Alastair has some tests but running them is currently a manual
process which still needs to be automated. Alastair has also tested integrating the
parser into rfe so that the data is validated on submission. Chris
will review the code once Alastair has documented the API.
Miscellaneous Development
- SL6.6
- The SL6.6 minor upgrade is now almost complete. A few small problems are still being found; for example, the version of libvirt specified in dice/options/kvm-server.h was missing for SL6. We need to test upgrades of some simple server configurations as well as the desktops.
- updaterpms
- A bug was found in the way updaterpms handles certain error situations when upgrading and installing packages. Stephen fixed the problem and released version 3.6.0, see bug#873 for full details.
- lcfg-auth
- Stephen changed some calls to Error into calls to Warn. Previously, even when the component started successfully, the boot component or systemd would think that the start had failed because of a non-zero exit status. See bug#839 for full details.
- Bad UID/GIDs
- The various accounts which had clashes of UID/GID have been given new IDs in SL7; the previous IDs have been kept for SL6. See this message for details.
- 7.1 mirror
- Services created a new SAN volume so that SL7.1 could be mirrored onto telford; the previous smaller volume has now been returned.
- SL7 upgrade scripts
- The SL6 upgrade scripts have been copied and modified to create some SL7 upgrade scripts. Chris will try them out with the 7.1 upgrade next week.
- acroread packages
- Stephen has finished working on the acroread package and its dependencies. It is still incredibly fragile on SL7, and there is nothing we can do about that as it is closed-source and has been abandoned upstream for a long time. Stephen had a lot of problems getting all the necessary 32bit packages, which showed just how difficult it is to build 32bit packages on SL7. Some quite important packages are not available for 32bit and often they do not build without modifications to the specfile.
- updaterpms slowness
- Alastair has been investigating the slowness of updaterpms on SL7 installs and upgrades. He has compared ext3 versus ext4 and SL6 versus SL7. He has not been able to find any big differences on VMs. In one test using the buildinstallroot script the ext4 partition was actually faster. We have only seen this problem on real hardware so it is beginning to look like this is a low-level issue, maybe related to SATA?
- om usage
- Stephen has reviewed the shell-based LCFG components in svn to check for calls to om which need to close file descriptors 11 and 12. He found 3: boot, bugzilla and x509. The boot component needs changes to the way the Run method calls other components. It looks like the om usage in bugzilla is not a problem. That just leaves x509, which definitely needs a fix as it has a similar problem to the cosign component when restarting apacheconf (see bug#875 for details).
- auditd logs
- Stephen has added a cron job to the dice/options/auditd.h header which deletes any auditd logs which have modification timestamps older than 120 days. Since the logs are only rotated based on size this does not completely remove all log entries that are older than 120 days. It should be good enough to get rid of most old data.
- theon book
- Chris is going to review the theon book for RAT.
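The auditd log clean-up noted above can be illustrated with a minimal Python sketch. This is not the cron job that was added to the header; the function name, its parameters and the choice of Python are assumptions made purely for illustration. It removes regular files in a directory whose modification time is more than 120 days old, which also shows why entries older than 120 days can survive: a size-rotated file that is still being written to keeps a recent modification time.

```python
import time
from pathlib import Path


def prune_old_logs(log_dir, max_age_days=120, now=None):
    """Delete regular files in log_dir whose mtime is older than max_age_days.

    Hypothetical helper for illustration only; returns the sorted names
    of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # age limit in seconds
    removed = []
    for path in Path(log_dir).iterdir():
        # Only the file's modification time is considered, not the age of
        # the individual log entries inside it.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```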
Operational
- comodo certificate for computing.help
- Alastair has installed the new certificate for the computing.help server.
- KVM status report
- The over-allocation of CPU resources seems to be working fine. The status report should be altered to reflect that and have x2 as a warning (orange) and x3 as a problem (red). Chris will invite COs to increase the number of virtual CPUs for their VMs.
- northern disks
- The two old disks from northern were tested in bakerloo and the dead disk identified. The good spare is in a box on the shelf in Stephen's office to keep it safe in case we need it for piccadilly at any point.
- central
- This has now been wiped and added to the junk pile.
- amarela
- The KVM server amarela has lost a PSU. This needs to be raised with Dell so we can get a new one. Chris will ask around to find out the correct procedure for contacting Dell about hardware repairs under warranty.
- New DR server
- Chris has installed the new DR server - salamanca - and copied all the data from /disk/dr on sauce. It looks like he has got everything but there is some difference in disk usage, so he'll check it's all there. Stephen noted that the rmirror component on sauce has been hacked to add the --hard-links option; without this the mirrors from telford are twice the size they should be.
- Packages server
- Once sauce is free it will be used to replace the packages server telford. Stephen wants to revisit our mirroring strategy for upstream sites to reduce the maintenance effort involved. Currently we have a list of sections we wish to include (e.g. sl6.5, sl6.6); this should be converted to a list of sections we wish to exclude (e.g. sl6.1, sl6.2). The primary benefit is that we get new sections as soon as they are available. It also means we can take advantage of hardlinks across sections and use less disk space.
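The proposed thresholds for the KVM status report can be sketched as below. This is only an illustration of the x2 (orange) and x3 (red) rule, not the actual report code. The function name, the "ok" label, and treating the thresholds as inclusive are assumptions, as is reading the ratio as allocated virtual CPUs per physical CPU.

```python
def allocation_status(virtual_cpus, physical_cpus,
                      warn_ratio=2.0, problem_ratio=3.0):
    """Classify CPU over-allocation on a KVM server.

    Hypothetical helper: returns "red" at or above problem_ratio,
    "orange" at or above warn_ratio, and "ok" otherwise.
    """
    if physical_cpus <= 0:
        raise ValueError("physical_cpus must be positive")
    ratio = virtual_cpus / physical_cpus
    if ratio >= problem_ratio:
        return "red"
    if ratio >= warn_ratio:
        return "orange"
    return "ok"
```

For example, a server with 8 physical CPUs would be reported as a warning once 16 virtual CPUs are allocated and as a problem once 24 are allocated.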
This Week
- Alastair
- EL7 project
- Continue investigating Updaterpms performance (try ext3) - auditd is the culprit
- Inventory project
- continue working through TartarusWorkFlow
- complete new order file processing code and deploy into existing system
- document new parser and API and integrate testing into RPM package
- RT 65774 - try two identical monitors on my machine. IainR has two identical monitors on his SL6 box and doesn't encounter the problem
- Need to remove default bridge from kvmtool create - not doing this until we confirm that we're not using br0. Need to announce to COs before doing this
- Check whether we removed the default pool - did we really agree to do this? Why?
- Change kvm reporting so that CPU over allocation is 3 guests per CPU, and warning is 2 guests per CPU (see oyster)
- Schedule firmware upgrade for DS3254
- Look at SL6 install hang - speak to Carol
- Revisit upstream package list sync
- Blog the final SL7 project report
- Clean up circle ASAP and find out why there was no updaterpms failure email last night - circle auto-rebooted to install a new kernel and suspended many of its guests so that it could reboot. I suspect the suspension images took up so much disk space that updaterpms couldn't run successfully; there was so little disk space that even sendmail couldn't send mail (the attempt can be seen in the log). The space came back as guests were resumed. No problem with the new updaterpms - it had 'failed' as it should have.
- Spec up new KVM server - circulated draft spec
- Chris
- EL7
- Add to EL7 release notes - acroread -> evince
- Start work on SL 7.1
- url shortener
- Talk at development meeting
- Inventory project
- Revisit upstream package list sync
- Start work on SL 7.1
- Invite KVM guest owners to upgrade from 512MB to 1GB and from 1 to 2 core (where useful) on all KVM servers
- Report amarela's broken PSU to Dell
- Continue work on new DR server (note hacked version of lcfg-rmirror on sauce)
- Stephen
- LCFG client refactor stage 1
- LCFG client refactor stage 2
- Think about PD - Interested in ZeroMQ
- Add extra memory to waterloo (and if those work, order up more memory for hammersmith)
- Chase Neil about lcfg-rmirror component (wrt hard links mod)
--
AlastairScobie - 29 May 2015