MPU Meeting Wednesday 31st May 2017

Inventory

Apache authentication is now done using GSSAPI.

Alastair has added API support for aliases and for virtual machines. Aliases allow the system to cope with some machines having more than one serial number.

When Chris replaces firmwarereport with clientreport modules, it would be useful if one of the modules reported on GPUs - at least model numbers, perhaps also serial numbers if possible and meaningful.

LCFG client refactoring

The core C libraries are fully functional and documented and the APIs are complete.

The V4 client is working again on a test VM.

Diffs are now handled properly, and can be seen with the debug "changes" keyword.

Profile signatures (used in secure mode) were being calculated wrongly but this has now been fixed.

The new code generates identical configurations (rpmcfg, Berkeley DB, and so on) for all Informatics profiles.

A new qxprof, using the new Perl libraries, is under development, and a new sxprof will follow.

MPU SL7 Server Upgrades

The final report is almost finished. The main points to note are in the Discussion section near the bottom.

We need to discuss what journald logging is appropriate, so we'll raise this at an Operational meeting.

Additional disk encryption

Chris has been doing more reading. NBDE (Network Bound Disk Encryption, implemented by packages tang and clevis), optional in RHEL 7.4β, looks very interesting both for this project and for the full disk encryption project.

Miscellaneous development

Disabling kernel module loading hasn't been working well on SL7 because of systemd ordering issues. It's been triggering too early, so that some services which need to load modules, for example NFSv4, haven't been able to start properly. Stephen has now got it working reliably. He's introduced a new lcfg-final target which happens after everything has finished starting up; things which need to happen really late in the boot process can now be hung off that. Now that disabling kernel module loading is working reliably on SL7 we'll look into spreading its use more widely.

An SL7.3 upgrade problem was reported by Barry O'Rourke of Physics. If you changed a profile from 7.2 to 7.3 and then rebooted before running updaterpms, a web of software cross-dependencies involving autofs would cause systemd to hang. Stephen and Graham found that localhome needs to start before autofs but after sssd, and that this ordering had not been encoded into the systemd configuration. It now has been, which solves the problem.
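
To make the ordering constraint concrete, here is a minimal Python sketch (Python 3.9+, for graphlib) that computes a start order satisfying "localhome after sssd, before autofs". It is only an illustration: systemd does not use this code, and in a systemd unit file this kind of constraint would normally be written with After= and Before= directives. The dependency map is a simplification that contains only the three services named above.

```python
from graphlib import TopologicalSorter

# Each key is a service; its value is the set of services it must start after.
# Constraints from the minutes: localhome after sssd, autofs after localhome.
after = {
    "localhome": {"sssd"},
    "autofs": {"localhome"},
}

def start_order(constraints: dict[str, set[str]]) -> list[str]:
    """Return one start order that satisfies every 'start after' constraint."""
    # static_order() raises graphlib.CycleError if the constraints are circular.
    return list(TopologicalSorter(constraints).static_order())

print(start_order(after))  # ['sssd', 'localhome', 'autofs']
```

Because the three constraints form a simple chain, only one valid order exists; with a larger dependency graph, any order returned by static_order() respects every "after" edge.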

The package forge builders now use SL7.3. There's a small chance that this might result in them building packages which cannot be used on 7.2.

Operational

We've rebooted most of our servers. We will reboot the SSH and NX servers in the next few weeks, one per week.

Our standard mock configuration now includes support for Fedora 24 and 25, and we'll soon add f24 and f25 to pkgforge too. These platforms will not be part of pkgforge's default set of platforms, but packages will be built for them if a job is submitted with an explicit platform of "all". The idea is that we'll routinely build core LCFG packages on the latest Fedora platforms, so that any build failures give us early warning of problems which may later appear in SL.

Neil has fixed a problem with the apacheconf logrotate method, which was sending a hangup to the httpd daemons after rotating each log. On a machine with a dozen apacheconf logs this could be quite disruptive! The configuration now uses the logrotate "sharedscripts" setting, so the hangup is sent only once, after all the Apache logs have been rotated.
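
The difference "sharedscripts" makes can be modelled in a few lines. This Python sketch encodes logrotate's documented behaviour (a postrotate script runs once per rotated log by default, but once per stanza with sharedscripts); it is not the apacheconf component's actual code, and the log file names are hypothetical.

```python
def postrotate_runs(rotated_logs: list[str], sharedscripts: bool) -> int:
    """Count how often a logrotate postrotate script runs for one stanza.

    Without 'sharedscripts', logrotate runs the postrotate script once for
    each log it rotates; with 'sharedscripts', it runs once for the whole
    stanza, and not at all if nothing was rotated.
    """
    if not rotated_logs:
        return 0
    return 1 if sharedscripts else len(rotated_logs)

# Hypothetical: a dozen apacheconf log files rotated in one run.
logs = [f"/var/log/httpd/site{i}_access_log" for i in range(12)]
print(postrotate_runs(logs, sharedscripts=False))  # 12 hangups to httpd
print(postrotate_runs(logs, sharedscripts=True))   # 1 hangup
```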

We thought this would be virtually impossible, but all 2.18TB of VM space on KVM server girassol has been filled. As a result, future KVM server reboots are far more likely to involve shutting down many VMs rather than seamlessly migrating them elsewhere, with the added hassle of arranging and announcing downtime in advance. On closer inspection, a number of VMs seem to have unnecessarily large disks. Please estimate how large your partitions really need to be, and size them and the VM's disk accordingly. We'll be asking the owners of several VMs to recreate them at a more appropriate size. Here are a few tips:

  • A number of VMs have a large root partition which is mostly empty. Use small-server.h if you can. This reduces the size of the root partition from 80GB to 10GB. Failing this, shrink the root partition yourself to a reasonable size.
  • More than one VM has been found with most of its disk space not allocated to any partition. Space on the KVM servers is expensive; please never do this.
  • A number of VMs seem to have large data partitions which are entirely, or almost entirely, empty. Please size your data partitions for the expected data.
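
For scale: switching one VM from an 80GB to a 10GB root partition frees 70GB, roughly 3% of girassol's 2.18TB. The tips above could be turned into an automated check along the lines of this Python sketch. The Partition and VM classes, the example VM and both thresholds (20% minimum use, 10% maximum unpartitioned space) are hypothetical choices for illustration, not an existing MPU tool.

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    mount: str
    size_gb: float
    used_gb: float

@dataclass
class VM:
    name: str
    disk_gb: float
    partitions: list[Partition] = field(default_factory=list)

def sizing_warnings(vm: VM, min_used_fraction: float = 0.2,
                    max_unallocated_fraction: float = 0.1) -> list[str]:
    """Flag the over-allocation patterns from the tips (thresholds are guesses)."""
    warnings = []
    allocated = sum(p.size_gb for p in vm.partitions)
    unallocated = vm.disk_gb - allocated
    # Tip 2: disk space that is not in any partition.
    if unallocated > max_unallocated_fraction * vm.disk_gb:
        warnings.append(f"{vm.name}: {unallocated:g}GB of the disk is not partitioned")
    # Tips 1 and 3: root or data partitions that are mostly empty.
    for p in vm.partitions:
        if p.size_gb > 0 and p.used_gb < min_used_fraction * p.size_gb:
            warnings.append(f"{vm.name}: {p.mount} uses {p.used_gb:g} of {p.size_gb:g}GB")
    return warnings

# Hypothetical VM: 80GB root with 6GB used, and 70GB of the disk unpartitioned.
vm = VM("examplevm", 200, [Partition("/", 80, 6), Partition("/disk/data", 50, 45)])
for w in sizing_warnings(vm):
    print(w)
```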

Stephen has checked the BIOS settings of the latest desktops. He's worked out how things need to be set, but this time it's going to be easier and quicker to change the settings individually than to copy them using a USB key. The AMT management engine will be disabled.

New DICE root passwords have been rolled out. If anyone should know them but doesn't, please get in touch.

This Week

  • Alastair
    • Inventory project
      • continue working through TartarusWorkFlow
      • Document clientreport (eg how to add modules)
      • Document order sync code
      • Document hpreport processing script
      • Continue work on RESTful API - TartarusRESTAPI
      • Document REST API
      • Write more of the ii commands, documenting them while writing.
      • Start work on final report!
      • Convert from mod_auth_kerb to mod_auth_gssapi (see Stephen for details)
      • How to represent VMs
      • Continue with REST API testing framework
      • Consider what else needs done other than docs and tidying and backups
      • Blog something... take dev meeting talks
    • Deploy encrypted /tmp and swap conversion script
      • Need to warn users that, when the script is run, Gnome3 may pop up a window about /tmp being full
      • Do in early July
    • Schedule MPU meeting to discuss systemd ordering
    • Check sysmans (et al) have 'nograce'.
    • Take a look at RT #78875
    • Look at /etc/hosts - DNS issue (IPv6?)
      • work out what we need to fix current problem
    • Circulate info on RH7.3 systemd changes we may wish to consider
    • Produce a report on machines with wrong time (using clientreport)
      • module written - produce an email report (increase threshold to 30 s)
    • Contribute to MPU SL7 final project report
      • Package slave (export)
    • Put sticker on IBMDS3524 to mark controller A FC port 3 as dodgy
    • T1 report
    • RT actions (as agreed)
    • Try out disable-module header on computing.help (test), with a view to promoting its use further within COs. No discernible problems.
    • Reboot both NX servers (one on 9th June, one on 16th June)
    • Look at rhel6 console - suspect we will have to add a ttyS0 line to grub at every kernel upgrade
    • Read futures summary
    • Look at MPU activities list

  • Chris
    • Inventory project
      • Continue work on clientreport modules for replacing firmwarereport
    • DICE encryption
      • Continue thinking and researching
      • Look at 7.4 beta - tang/clevis - to see if this technology does what we want
    • Contribute to MPU SL7 final project report
      • Wake on LAN
    • RT actions (as per agreed list)
    • Remind people at Ops to think carefully about their disk usage on VMs
    • Disk wipe schiff prior to removal from AT
    • T1 report on LCFG MPU SL7 report by Monday 5th (pref)
    • T1 report on DICE encryption report by Monday 5th (pref)
    • Look at MPU activities list

  • Stephen
    • LCFG client refactor stage 2
      • testing and documentation
      • qxprof / sxprof re-implementation
    • LCFG server symlink to exam branches - produce reporting script and discuss with Graham
    • submit polkit bug to Red Hat - with Alastair (still exists under 7.3)
    • Investigate George's multiple network interfaces SL7 issue (eg consoles server)
      • waiting on George breaking metropolitan
    • Look at config required for merging staff.login with staff.nx
    • Draft a position note on shell components under SL8 and possible ways forward
    • Produce some text for systemd mount bug (to submit to RH)
    • RT actions (as per agreed list)
    • Take the issue of disabling per-user journald logs on certain servers to Ops
    • Reboot staff.ssh on 2nd June and student.ssh sometime the following week
    • Read futures summary
    • T1 report on LCFG client project by Monday 5th (pref)
    • Discuss disabling AMT in lab HP 800 G2s with Alison
    • Look at MPU activities list

-- AlastairScobie - 31 May 2017


Topic revision: r15 - 24 Sep 2019 - 13:50:24 - AlastairScobie