MPU Meeting Tuesday 19th January 2016

Inventory

Nothing happened.

LCFG Client Refactoring

Nothing happened.

SL7 Server Base

The H200 RAID controller is now supported by hwmon, thanks to RAT giving us access to one of their SL7 machines. Because of the RAID configuration of the disks it is not possible to eject a disk to test hwmon fully, but we think it should work fine. Chris has blogged about hardware monitoring and RAID on SL7.

We need to get the network naming scheme work finished so that we can fully support bonding. Without this it's hard to get many servers upgraded.

apacheconf

Stephen began looking at how the updated apacheconf component might work. He has started by modernising support for Apache modules. He has also ripped out the ancient support for Apache 1.3 and SL5, which makes all the headers a lot simpler and tidier.

Miscellaneous Development

Disk encryption
Chris has reviewed the script for enabling disk encryption on existing SL7 machines. It all looks good; now we just need to get it deployed.

previous release
Stephen has added support for a previous release which is made from the old stable release whenever a new stable release is created. He will announce this at the next Operational Meeting.

testing release
Stephen has added support for exporting the testing release in a similar way to the stable release (i.e. only lcfg and ed level headers and package list). This has been a longstanding request from MDP users and the hope is that this will increase both the number of people checking the new release and the variety of ways in which it is tested. One nice feature of this is that the latest diff can be viewed on the LCFG website via http://www.lcfg.org/diff?from=current&to=testing.

Operational

mate and LD_LIBRARY_PATH
After the stable release on 13th January Jennifer and Lindsey began having problems with the mate desktop. This was incredibly difficult to diagnose and took a lot of effort, but eventually we traced it to the LD_LIBRARY_PATH environment variable being set to a non-existent path, which caused the file manager (caja) to fail and then constantly attempt to restart. We still do not understand why there has been a sudden change in behaviour; it seems likely that this is related to an update for systemd but we cannot be certain. From this investigation we have come up with a number of ideas to make life easier next time:

    • It was necessary to move some machines back to the previous stable release to confirm that this was a new problem. This was rather awkward so we have now added official support for a previous release which is made from the old stable release whenever a new stable release is created.
    • As the SL7.2 release is imminent we have recently seen a flurry of backported security updates for SL7.1. For this particular stable release there were just over 300 updates for a desktop machine. All the standard weekly tests were passed and we have not previously experienced any big issues with large sets of updates, so we felt it was fine to go ahead. Clearly, however, this problem was not picked up by anyone during the testing phase. We propose that in future, when we have an especially large set of updates, we hold them back for longer so they can be tested more thoroughly on CO desktops. We will also try to come up with an extended set of tests (e.g. a range of desktop environments and a range of applications) for when we need to be particularly thorough about ensuring no problems have appeared.
    • In this case we identified the problematic application by enabling process accounting so that a full list of launched processes was available. If we had had access to this information from the beginning it is likely that we would have found the source of the problem a lot quicker. This is already enabled on all servers; we propose to enable it by default everywhere.
    • This problem seems to have manifested itself after a reboot. We feel that the delay for the autoreboot could be shorter for CO desktop machines to help find problems sooner. The standard for office machines is 5 days but it could be as short as 0 which we use in the student labs to enforce a reboot overnight as soon as updates are available. What does everyone else think about this?
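
The root cause above (an LD_LIBRARY_PATH entry pointing at a directory that does not exist) is the kind of thing a small check could flag early. The following is only a minimal sketch, not something discussed at the meeting: the function name missing_library_paths is hypothetical, and empty entries are simply skipped.

```python
import os


def missing_library_paths(value):
    """Return entries of an LD_LIBRARY_PATH-style string that are not existing directories."""
    # Entries are colon-separated; empty entries are skipped in this sketch.
    entries = [entry for entry in value.split(":") if entry]
    return [entry for entry in entries if not os.path.isdir(entry)]


if __name__ == "__main__":
    # Report any non-existent directories in the current environment.
    for path in missing_library_paths(os.environ.get("LD_LIBRARY_PATH", "")):
        print("LD_LIBRARY_PATH entry does not exist:", path)
```

Run from within an affected desktop session, this would list the offending entries; it does not explain why the variable was set that way.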

lcfg branch symlinks
Whilst working on adding support for the previous release Stephen noticed that there were lots of symlinks pointing to old exam branches. When he deleted most of them the LCFG slaves stopped tracking ~16000 files. He wonders if this might be the cause of the recent performance issues. We probably ought to clear the caches and restart the slaves. He will also talk to Graham about having these symlinks named systematically and cleaned out once no longer required. There is no need to delete the branches themselves; they will always be kept in case they need to be reused. We should add a cron job on the slaves which reports on unused symlinks so we can tidy them up.
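
A reporting script of the kind proposed above might look roughly like the sketch below. The minutes do not define what counts as an "unused" symlink, so the criterion is left as a caller-supplied function; the function name find_symlinks, the directory argument and the example criterion (target name contains "exam") are all assumptions for illustration.

```python
import os
import sys


def find_symlinks(root, is_unused):
    """Yield (link_path, target) for symlinks directly under root that is_unused() flags."""
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.is_symlink():
            target = os.readlink(entry.path)
            if is_unused(entry.path, target):
                yield entry.path, target


if __name__ == "__main__":
    # Hypothetical criterion: report links whose target name mentions "exam".
    for link, target in find_symlinks(sys.argv[1], lambda link, target: "exam" in target):
        print(f"{link} -> {target}")
```

Run from cron, the printed list could be mailed to whoever tidies the branches; the real criterion would need agreeing with Graham first.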

Alan Smaill and fetchmail
We have had a report (RT#75694) from Alan Smaill that running fetchmail can cause his SL7 machine to crash. There is no useful information in the logs so we need to investigate with a serial console attached. Alastair will do this using a Raspberry Pi and a serial-to-USB cable.

HP 800 G2
This is the new SelectPC. It requires the latest SL7 kernel which might be a problem given that this is incompatible with VirtualBox. Currently only IS have one of these machines. Stephen will get the installer kernel updated to start with and take a proper look once our machine arrives.

nm configure-and-quit
Do we still need NetworkManager to have the configure-and-quit option enabled on stable SL7 machines? It has not been like that on develop for quite a while. Alastair will investigate.

systemd changes
The version of systemd has changed substantially from 208 to 219. We ought to properly review the changelog for anything important that affects our LCFG support.

Innovative Learning Week
We need to get organised for the LCFG session. Stephen will look to see if he can find the source for his slides.

This Week

  • Alastair
    • Inventory project
      • continue working through InvProjectWorkFlow
      • consider what next can be integrated into existing system, if anything
      • Check for systemic errors from clientreport
        • Look now that servers don't check monitors
      • Document clientreport
      • Document order sync code
      • Continue work on hpreport processing script
    • Remove default pool if ops meeting agrees
    • Experiment with different window managers under VNC (making the assumption that performance under NX will be similar)
    • Think of a use for 'atom'
    • Deploy encrypted /tmp and swap conversion script
      • try the CO desktops first (on develop release)
    • SL7 base server
      • Localhome functionality - use mkhome_dir instead?
      • check metropolitan USB and CD
      • Continue work with FC and LVM
        • investigate interaction between multipath and UDEV
        • check nagios notices if FC cable removed
      • Fix the bonding nagios script to scream if fewer than 2 slaves active for each bonded group
      • Look at defining a macro to set real device names for eth0 and eth1 (parameterised)
        • use Stephen's new disable old-style naming scheme
        • currently, whilst we still use eth0 and eth1 as lcfg tags, if eth0 is actually em1 physically, ifcfg-em1 will be created and not ifcfg-eth0
        • double check bonding still working on metropolitan
        • try on sauce
        • more experimenting required (and documenting)
        • would be good if network component commented which tag was used to generate each ifcfg-{n} file
        • blog article required
      • Read Chris and Stephen's blog articles on bios devname and hwmon and apacheconf
    • Schedule MPU meeting to discuss systemd ordering
    • Continue building computing.help honeypot
    • Rotate drupal logfile on computing.help and devproj
    • Look through latest systemd changes
    • Make configure and quit option default on stable and testing (not just develop)

  • Chris
    • Inventory project
      • continue working through InvProjectWorkFlow
      • Look at clientreport modules for replacing firmwarereport
    • pkgsearch for SL7
      • reimplement as a yum web front end (a yum search for a keyword produces an HTML file of links to a CGI which does yum info)
      • Need to support multiple platforms
    • Liaise with George over iDRAC documentation (look through ops reports to remind)
    • SL7 -
      • Mark up which servers we can't check 'hwmon' on (as no spare kit)
      • diskfull
      • test out rsync / rmirror (both client and server ends) - liaise with Neil
    • Close RT tickets
    • Continue investigating SL6 sleep problem
    • Schedule MPU stargazing meeting

  • Stephen
    • LCFG client refactor stage 1
      • schedule debrief meeting
    • LCFG client refactor stage 2
      • document API
      • blog article (once documentation complete)
    • Think about PD - Interested in ZeroMQ
    • Investigate kernel component pipe moan by using shell commands instead of RPM module => waiting on 7.2 => activities list
    • continue thinking about apacheconf
    • SL7 server
    • rkhunter config needs fixing
    • Create OPS meeting report on LD_LIBRARY_PATH problem and request that C(S)Os tell us about problems early on and suggest shortening autoreboot time for C(S)Os
    • Create OPS meeting report article on kernel and virtualbox - we would have difficulty in shipping a critical kernel upgrade on Lab machines at the moment. (Latest versions of virtualbox assume 7.2)
    • LCFG server symlink to exam branches - produce reporting script and discuss with Graham
    • Circulate dmesg proposal

-- AlastairScobie - 19 Jan 2016

Topic revision: r15 - 26 Jan 2016 - 13:11:53 - AlastairScobie
 