MPU Meeting Tuesday 28th February 2017

Inventory

Alastair has the authentication for the conduits working, and the conduits are now run regularly from a cron job.

While working on the test suite for the REST API, he spotted a problem with all of the inventory test suites: diffs of the literal command output against the expected output were failing because some of the field names had recently changed. The tests now extract the data to CSV files, which are then compared. Stephen pointed out that psql can now produce data in JSON format; Alastair may look into that.
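As an illustration only, the CSV comparison could look something like the sketch below. This is not the actual inventory test code: the function names and the sample field names are invented, and it assumes the header row is the only part of the output affected by the renaming.

```python
import csv
import io


def read_rows(text, skip_header=True):
    """Parse CSV text into a sorted list of row tuples.

    The header row is dropped by default so that renamed fields
    do not affect the comparison (an assumption of this sketch).
    """
    rows = [tuple(row) for row in csv.reader(io.StringIO(text)) if row]
    if skip_header and rows:
        rows = rows[1:]
    return sorted(rows)


def csv_data_matches(actual_text, expected_text):
    """Return True if both CSV texts hold the same data rows, in any order."""
    return read_rows(actual_text) == read_rows(expected_text)


# Hypothetical sample data: same rows, different field names and row order.
expected = "hostname,serial\nalpha,123\nbeta,456\n"
actual = "host_name,serial_no\nbeta,456\nalpha,123\n"
print(csv_data_matches(actual, expected))
```

Comparing parsed rows rather than raw text means a change to field names or row order does not fail the test, while a change to the data itself still does.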

The test suite for the REST API exposed several bugs:

MPU SL7

Chris has a prototype SL7 version of http://bugs.lcfg.org up and running called http://testbugs.lcfg.org.

LCFG Client refactoring

No activity.

Additional disk encryption

No activity.

Miscellaneous development

nginx : Stephen noticed that several machines had configuration for nginx so he has put together a header which provides basic nginx configuration - LCFG:core/include/lcfg/options/nginx.h and LCFG:core/include/dice/options/nginx.h. There isn't an LCFG component for it - if you need one, get in touch.

dsu :

  • iDRAC update 2.41.40.40 has been causing problems. It seems fussy about whether it will apply successfully, and if it does, the IPMI serial console no longer works afterwards. In our experience the IPMI console can be recovered by a complete cold power-cycle of the machine, a complete reconfiguration of the serial console settings from scratch, or both. If you don't fancy all that hassle, don't apply this update. (Interestingly, one of the fixes in 2.41.40.40 is "Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or have an active SOL or SSH sessions while firmware upgrade is in progress.")
  • We still recommend applying firmware updates, but please bear in mind that it can never be a risk-free process. Do your machines one at a time, and be ready for them to be out of service for a while if an update goes wrong in some way.
  • There's a new version of dsu on the way through the release process. The firmware changes are listed in the DSU repository component ChangeLog. We can't say what changes have been made to the software itself because the RPM has no changelog entries, but Dell gives the headline developments as "Support for Microsoft Windows Operating System" and "Ubuntu Operating System as a pre-enablement".
  • It's now explicitly stated that dsu supports 11G, 12G and 13G servers. This means that support for 10G servers has been withdrawn. We have therefore removed dsu from the R200, R805 and R900 headers.

journald : it should now wait for local filesystems to mount before starting up. This is the new default behaviour on DICE.

Operational

kernel security : The suggested mitigation for CVE-2017-6074 was applied as soon as we heard of the problem.

nvidia drivers updated : Stephen has updated most of the nvidia driver versions that we use. (One or two versions didn't seem to be accessible on their website.)

AMD catalyst driver becomes AMD GPU driver : The Catalyst fglrx driver has been replaced by a completely redesigned amdgpu driver. Anyone whose machine uses video_amd_catalyst_pro.h should switch it to video_amd_gpu_pro.h instead.

SL7.3 and sleep : A couple of our more recent models, those with Intel Skylake architecture, had trouble waking cleanly from sleep when running SL7.2. We've now tested 7.3 on them.

  • HP G2 : It seems to sleep and wake cleanly when running 7.3.
  • Lenovo P310 with nvidia card : It seems to sleep and wake far more happily than before. We noticed a problem with virtual terminals disappearing after sleep, but that test wasn't carried out with the most recent version of the nvidia driver, which has fixes for sleep problems, so we'll do that test again.

KVM servers : We're part of the way through an ongoing programme of adjusting the partitioning on the KVM servers. This means mass migrations of VMs, and the occasional temporary mass shutdown of VMs.

  • gaivota has been done.
  • azul has been done.
  • girassol is next, and since it runs so many VMs, please:
    • delete VMs you no longer need;
    • put up with the absence of lower priority VMs for a day or two.
  • After that oyster's VMs will be migrated elsewhere and oyster itself will move to Appleton Tower, to provide mutual backup for waterloo.

DIY DICE docs : As part of our periodic review of computing.help pages we've unpublished the DIY DICE pages. DIY DICE is only used as a platform for Virtual DICE and we don't want to encourage anyone to use it otherwise.

sshfs, fuse problem : An ssh server was recently affected by a problem with sshfs or fuse. A user was using sshfs with the ssh server. When their AFS credentials expired, the software reacted by calling eject in a busy loop. This had two effects: the machine had very little time to do anything else, and it logged every call to eject - so the machine not only became virtually unusable but one of its partitions filled up. We've done several things to counter this:

  • Told the user not to do this.
  • Changed the journald logging to unified logs rather than keeping separate logs for each user.
  • Removed the eject binary.

initramfs : this week we will run a cron job which rebuilds initramfs to solve the problem with rebooting.

updaterpms hash bug : Bug:994. Fixing that might speed things up quite a bit.

hosts vs DNS in nsswitch.conf : If /etc/hosts only has the IPv4 address for a machine, and DNS has both IPv4 and IPv6 addresses, and /etc/nsswitch.conf says to consult files before dns, the machine will sometimes not find out about its IPv6 address. We're going to look into this some more.
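One possible way to check which address families a name resolves to on a given machine is sketched below. It assumes Python's socket.getaddrinfo, which calls the system resolver and so should follow the nsswitch.conf ordering on glibc systems. The helper names and the sample entry are invented for illustration, and this is not a diagnosis of the underlying problem.

```python
import socket


def address_families(addrinfo):
    """Summarise which IP versions appear in a getaddrinfo() result list."""
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    return {names[family] for family, *_ in addrinfo if family in names}


def resolved_families(hostname):
    """Resolve hostname through the system resolver and report the families found."""
    return address_families(socket.getaddrinfo(hostname, None))


def missing_ipv6(hostname):
    """Return True if the resolver gave no IPv6 address for hostname."""
    return "IPv6" not in resolved_families(hostname)


# Hypothetical result resembling a machine whose lookup only yields IPv4.
sample = [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.1", 0))]
print(address_families(sample))
```

Calling missing_ipv6() with a machine's own hostname, on the machine itself, would show whether the local lookup path returns an IPv6 address at all.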

This Week

  • Alastair
    • Inventory project
      • continue working through InvProjectWorkFlow
      • Document clientreport (eg how to add modules)
      • Document order sync code
      • Document hpreport processing script
      • Continue work on RESTful API - InvProjectRESTapi
      • Document REST API
      • Further encourage people to use API and ii commands
      • Write more of the ii commands and document as writing.
      • Speak to George about macaddr/space feed
      • Start work on final report!
      • Convert from mod-auth_kerb to mod-auth_gssapi (See Stephen for details)
      • How to represent VMs
    • Deploy encrypted /tmp and swap conversion script
      • Need to warn users that Gnome3 may pop up a window about /tmp being full (when script is run)
    • Schedule MPU meeting to discuss systemd ordering
    • submit polkit bug to redhat - with Stephen (check with 7.3)
    • Think how to regularly report on machines with no /var/log/journal
    • Decommission old 'hilfe' server
    • Check sysmans (et al) have 'nograce'.
    • Take a look at RT #78875
    • Look at RT and SL7RT
    • Try 7.3 on P310
      • try sleep with latest nvidia mod
        • Virtual terminals still broken after sleep
    • Stephen's systemd target question email
    • Order new LCFG slaves
      • awaiting quote
    • Take 'juice' replacement to CEG - want to bring forward
    • Test new testbugs.lcfg.org
    • Look at /etc/hosts - dns issue
    • Project blog about inventory

  • Chris
    • Inventory project
      • Continue work on clientreport modules for replacing firmwarereport
    • MPU SL7
      • Continue with bugzilla
      • Look at wake backend (running on Inf servers)
    • DICE encryption
      • Continue thinking and researching
    • Roll out fixed sleep code
    • Reschedule MPU futures meeting
    • Look at RT and SL7RT
    • Think about whether we can use NX service for staff.login/student.login
    • Blog about MPU SL7 project - just use what was produced for T3 report
    • Figures for rest of Feb
    • Produce PXE boot image so that we can update the BIOS of HP 800 G2s

  • Stephen
    • LCFG client refactor stage 1
      • schedule debrief meeting
    • LCFG client refactor stage 2
      • testing and documentation
      • blog article (once documentation complete)
    • LCFG server symlink to exam branches - produce reporting script and discuss with Graham
    • submit polkit bug to redhat - with Alastair (check under 7.3)
    • Investigate George's multiple network interfaces SL7 issue (eg consoles server)
      • waiting on George breaking metropolitan
    • Look at RT and SL7RT
    • Think about whether we can use NX service for staff.login/student.login
    • Draft a position note on shell components under SL8 and possible ways forward
    • Run initramfs rebuild on all servers (by cron)
    • Test new testbugs.lcfg.org
    • Look at how we can give Paul read access to DICE level LCFG profiles and headers
    • Figures for Jan/Feb
    • Minutes for 14th Feb

-- AlastairScobie - 28 Feb 2017

Topic revision: r10 - 07 Mar 2017 - 16:11:11 - StephenQuinney
 