MPU Meeting Wednesday 13th February 2019

Inventory

The second half of the inventory discussion was held; Alastair has notes of it, which he'll circulate. These points in particular came out of it:
  • Self-managed machines using dynamic DNS shouldn't need DNS entries or fixed hostnames or placeholder LCFG files.
  • Servers should probably be given barcode stickers too (other machines already get these).
  • How to deal with machines with multiple serial numbers?

Alastair has developed a new orphan clone script. Until now, if a serial number was changed in the inventory, the old pre-change entry would be left "orphaned" and a separate new entry would get the new serial number. This could lead to the effective loss of the information in the orphaned entry. The clone script copies the information across from the orphan to its replacement. Serial number changes have turned out to be more common than expected - for instance, one is needed every time a network switch is replaced.
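
A sketch of the idea (not the actual script; the field names are hypothetical, not the real Tartarus schema):

# Copy any information recorded against the orphaned inventory entry
# into its replacement, without overwriting anything already set on
# the new entry. (Hypothetical field names.)
def clone_orphan(orphan: dict, replacement: dict) -> dict:
    for field, value in orphan.items():
        if field == "serial_number":
            continue  # the replacement keeps its new serial number
        if value and not replacement.get(field):
            replacement[field] = value
    return replacement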

LCFG profile security

The latest client is on all develop machines.

There's a problem with locking: calls to components aren't necessarily completing properly - they sometimes return without unlocking. Stephen is investigating.
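
As a general illustration of the failure mode (this is not the client's actual code): if a component call can return early or raise before the unlock is reached, the lock leaks, whereas a try/finally guard releases it on every exit path.

def run_component(lock, component):
    lock.acquire()
    try:
        return component.run()  # may raise, or return early
    finally:
        lock.release()  # runs on every exit path, so the lock can't leak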

Alternative DICE desktop

Nothing this week.

Miscellaneous Development

Stephen has added a clientreport module which checks for LCFG lock files that are more than an hour old. There isn't an accompanying report as yet, but the information can be queried via SQL, and this has already proved useful.
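
For example, a query for stale locks might look something like this (the DSN, table and column names are hypothetical; the real clientreport schema will differ):

import psycopg2

conn = psycopg2.connect("dbname=clientreport")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT hostname, lockfile, mtime
          FROM lcfg_locks
         WHERE mtime < now() - interval '1 hour'
         ORDER BY mtime
    """)
    for hostname, lockfile, mtime in cur:
        print(f"{hostname}: {lockfile} locked since {mtime}")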

Stephen has added a clientreport module to report disk and partition information, gathered from df and lsblk.
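
A minimal sketch of gathering that information (this is not the module's actual code or output format):

import json, subprocess

# df -P produces stable POSIX-format output that's easy to parse.
df = subprocess.run(["df", "-P", "-k"], capture_output=True, text=True, check=True)
for line in df.stdout.splitlines()[1:]:
    fs, blocks, used, avail, pct, mount = line.split(None, 5)
    print(f"{mount}: {pct} used of {int(blocks) // 1024} MiB on {fs}")

# lsblk can emit JSON directly, which avoids fragile text parsing.
lsblk = subprocess.run(["lsblk", "--json", "-o", "NAME,SIZE,TYPE,MOUNTPOINT"],
                       capture_output=True, text=True, check=True)
for dev in json.loads(lsblk.stdout)["blockdevices"]:
    print(dev["name"], dev["size"], dev["type"], dev.get("mountpoint"))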

He's also been thinking about machine health checks more generally. It would seem to make sense to have one combined health report incorporating such things as late package updates and old lock files - like the labcheck report. However, he's realised that this report - and the labcheck report - would be more useful as a web page than as mail. (And if someone then wanted to add a bit of JavaScript, the page could offer filtering controls.)
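
A sketch of the shape such a page might take (the check names and data are purely illustrative):

# Gather one row per (check, host) and emit a single HTML table,
# rather than sending separate mails per check. (Illustrative data.)
checks = {
    "late package updates": [("examplehost1", "updates 9 days old")],
    "old lock files": [("examplehost2", "lock held for 6 hours")],
}
rows = "".join(f"<tr><td>{check}</td><td>{host}</td><td>{detail}</td></tr>"
               for check, results in checks.items()
               for host, detail in results)
print("<table><tr><th>Check</th><th>Host</th><th>Detail</th></tr>"
      + rows + "</table>")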

Operational

The Infrastructure and RAT units were encountering an RPM conflict. Stephen found and fixed the problem in dice/options/devel.h: one of the Python devel packages had acquired a new dependency on git.

Stephen has tidied the exam releases from the LCFG slaves.

Chris made space on /var on hammersmith (xrdp.inf), which had filled up. One massive file in /var/account was copied to a secure place elsewhere and then zeroed, which freed enough room for accounting to restart.
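
The technique, roughly (the paths are illustrative): truncate the file in place rather than deleting it, since unlinking a file that a running process still holds open doesn't return the space.

import shutil

src = "/var/account/bigfile"        # illustrative path
dst = "/secure/place/bigfile-copy"  # illustrative path

shutil.copy2(src, dst)  # keep a copy elsewhere first
with open(src, "w"):    # opening for write truncates it to zero bytes
    pass                # the space is freed even if the file stays open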

Two machines have been ordered; together they will replace hammersmith as rdp.inf.

The two KB KVM servers amarela and vermelha will be five years old in June, when their warranties will expire. Maybe we should buy an extra year's warranty, putting off their replacement for a year? No decision has been made yet, but this seemed like a good idea.

We've updated most of our pandemic pages.

This Week

  • Alastair
    • Inventory project
      • continue working through InvProjectWorkFlow
      • Document clientreport (eg how to add modules)
      • Document order sync code
      • Document hpreport processing script
      • Start work on final report!
      • Give Ian D details of how the Tartarus tables are accessed, for inclusion in his privileged access discussion paper
      • Look at postgresql replication (do after shipping)
      • Add tartarus info to SwitchToSelfManaged
      • Add tests for the API /orders endpoint, plus new tests to check for correct authorisation
      • Make LCFG header generation live (need to check what will be deleted when we do this - there's a big discrepancy between the old inventory and the new)
      • Look at the user support form - how does it look up the hostname?
      • Look at whether there is an easy library way for Chris to grab the MAC address of a machine given its hostname
    • Schedule MPU meeting to discuss systemd ordering
    • Take a look at RT #78875
    • Look at /etc/hosts - DNS issue (IPv6?)
      • work out what we need to fix the current problem
    • Implement change to kvmtool to allow KVMs to be marked as disabled
      • looked at this - it looks like the metadata tag isn't passed through by libvirt (prior to 4.0.0), so it can't be read/written by kvmtool
      • put on the activities list to do once we upgrade to libvirt-4.0.0
    • Look at Stephen's 'Thoughts on shell components'
    • Start looking at https and computing.help (remove assumption that https means want cosign login)
      • wait on Neil's efforts with EdWeb
    • Investigate systemd reboot bug on gaivota and add some more debugging (store tree diff somewhere)
    • drupal username collection re GDPR
      • configure live server to run the user expiry script
      • Fix up email domains for existing accounts and check that the fix for setting the domain to inf.ed.ac.uk is in place on the live service
      • need to ship fixed cosign module on live service
    • Inventory stuff re GDPR
    • Check with Tim / George about the capability for login to student machines - where are we?
      • Tim says that we should create a capability that is given to the base cohort and set that capability to no-grace
    • Useful? - a script which checks how fast a machine's console log is growing (eg huge number of dbus problems on hammersmith)
      • suggest to Ian D
    • Blog on projects
    • KVM pcid
      • Investigate spectre / meltdown wrt VMs
      • Which CPU is needed for each group?
      • The following config worked on 'brent' (hosted on vermelha). We might need to consider whether we want match='exact' wrt migrations:
        <cpu mode='host-model' match='exact'>
          <model fallback='allow'>IvyBridge</model>
          <vendor>Intel</vendor>
          <feature policy='require' name='pcid' />
        </cpu>
      • Update: looked at this. We should be safe to set the CPU model to host-model on clusters where the CPU is identical across the cluster (KB and AT). However, we can't where the CPUs aren't identical (IF) - there we should be able to set a base minimum machine (SandyBridge?), though we'd need to check that migration still works. Recent versions of virsh let you supply the CPU descriptions of all the hosts in a cluster and ask for a CPU model which will work across the whole cluster (see the sketch after this list). Setting the base minimum to SandyBridge on 'oyster' fixed one of the Spectre flaws, but not all of them; it looks like we need a more up-to-date qemu-kvm to fix the remaining ones.
      • Wait until 7.6ish is settled re KVM software versions and try the above again
    • Move IBM disk array to B.03 and mark as junk
    • Produce some notes from OSS
    • Read George's mail of 8th November wrt DPIA
    • Try latest VDICE on Windows 10 machine at home (research guest login delays)
    • Review the three encryption computing.help pages
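
The virsh feature mentioned under "KVM pcid" above is cpu-baseline: fed the capabilities output from each host in a cluster, it computes a CPU model supported by all of them. A rough sketch (the hostnames are illustrative):

import subprocess

hosts = ["kvmhost1", "kvmhost2"]  # illustrative cluster members

# virsh cpu-baseline accepts a file of <capabilities> (or <cpu>)
# elements and computes a CPU description common to all the hosts.
with open("all-hosts.xml", "w") as f:
    for host in hosts:
        f.write(subprocess.run(
            ["virsh", "-c", f"qemu+ssh://{host}/system", "capabilities"],
            capture_output=True, text=True, check=True).stdout)

print(subprocess.run(["virsh", "cpu-baseline", "all-hosts.xml"],
                     capture_output=True, text=True, check=True).stdout)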

  • Stephen
    • submit polkit bug to Red Hat - with Alastair (still exists under 7.3)
    • Produce some text for systemd mount bug (to submit to RH)
    • Take the issue of disabling per-user journald logs on certain servers to OPS
    • Consider PD work for after LCFG client ...
      • looking at Ceph
    • Look at where we're using ALL in access.conf
    • Finish off NX replacement project (#389)
    • Continue with RT ticket clearout as discussed in October
    • Read George's mail of 8th November wrt DPIA
    • Firmware update - deneb and steen
    • Reboot staff.ssh (hare)
    • Complete the tartarus clientreport module-errors report
    • Update Pandemic pages - Security, LCFG
    • Add a 'df' module to clientreport
    • Move afsbuild server (juice) from Forum to AT
    • Produce a report based on the clientreport 'old locks' data
    • Talk to Paul about redacted LCFG SVN

-- AlastairScobie - 13 Feb 2019
