MPU Meeting Thursday 18th January 2018

Inventory

Putting the new inventory system into service has been delayed for several reasons. One was that disposal had not yet been implemented from the command line, although it was in the API. That has now been done. The new system is ready to enter service, and this will happen soon, more urgent work permitting. Alastair has tested that the service can be reconstructed from backups. Database replication will be looked into at some future point.

Stephen has added a clientreport module to record all CPU flags from /proc/cpuinfo.
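
For reference, the flags in question appear on the "flags" line of /proc/cpuinfo. A minimal standalone sketch of reading them (not the clientreport module itself, which lives in the LCFG codebase) might look like:

    def cpu_flags(path="/proc/cpuinfo"):
        """Return the set of CPU feature flags reported by the kernel."""
        with open(path) as fh:
            for line in fh:
                if line.startswith("flags"):
                    # All cores normally report the same flag list,
                    # so the first "flags" line is enough.
                    return set(line.split(":", 1)[1].split())
        return set()

    print(sorted(cpu_flags()))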

LCFG Client

Stephen is preparing to switch the office machines over to the new client. Machines on the develop release already have it. He's discovered and reported a bug in the "om" code for the V4 client - Bug:1035.

User Security Training

Chris has just started this project.

Virtual Desktop

Stephen starts this project next week.

Miscellaneous Development

Stephen's new dice-check software checks the status of lab machines. It's almost finished. Its output so far has already led to a cut in the number of problem hosts. One interesting problem it exposed was a number of unexpectedly slow network connections; the technicians have been fixing these, for instance by replacing damaged cables.

He has also moved most of the wakeweb code into a new DICE::Wake perl module so that it can also be used to "wake" lab machines every Monday morning. This not only wakes sleeping machines but (more usefully) also powers on those which have been mistakenly turned off during the previous week.
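
The minutes don't record the wake mechanism; assuming it is standard Wake-on-LAN, the "magic packet" involved can be built and broadcast as in this illustrative sketch (the real DICE::Wake module is Perl and will differ in detail):

    import socket

    def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
        """Send a Wake-on-LAN magic packet: 6 x 0xFF followed by the MAC repeated 16 times."""
        mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
        packet = b"\xff" * 6 + mac_bytes * 16
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
            sock.sendto(packet, (broadcast, port))

    send_magic_packet("00:11:22:33:44:55")  # placeholder MAC address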

Stephen has tidied up the lightdm config to deal with several bugs:

  • Bug:1028 - lcfg-lightdm: changing lightdm.defaultsession does not override cache
  • Bug:1029 - Update to lightdm 1.25
  • Bug:1033 - lightdm config needs update

It is hoped that this work will also put an end to the dead login screen problem (RT:86770), but we don't yet know whether it will.

mock has been updated (Bug:1030).

A new version of the grub component fixes Bug:1026, whereby after a kernel update, the default boot kernel was being set wrongly.

A new version of the kernel component fixes Bug:1034 in which an unnecessary extra reboot was triggered after a change in the kernel version.

The instructions on replacing a failed disk in a Dell server now also cover the entire process of contacting Dell to get a replacement disk.

Stephen has changed the systemd configuration to ensure that the updaterpms component doesn't start until named is running.
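
As a rough illustration of what that ordering looks like, assuming the component runs from a unit named updaterpms.service (the unit name and drop-in path here are illustrative, not the actual DICE configuration), a systemd drop-in along these lines expresses it:

    # /etc/systemd/system/updaterpms.service.d/order.conf  (illustrative path)
    [Unit]
    # Pull named in and order after it, rather than hard-requiring it,
    # so updaterpms can still run on hosts with no local nameserver.
    Wants=named.service
    After=named.service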

Alastair has been costing options for expanding the storage of the Forum-based KVM servers:

  • azul has 2 spare slots. 3.5" 15K 0.9TB = 293 + VAT (Would need 15K as only 2 drives in a set) - 400-APGL (?)
  • gaivota/girassol have 6 spare slots each (2.5"). 2.5" 10K 1.2TB (400-AJPD) = 194+VAT, 1.8TB (400-AJQP) = 272+VAT (Would need 4 drives if want 10K)

Operational

The Meltdown/Spectre debacle continues. At meeting time the latest development was that the Intel microcode fixes, which Red Hat had passed on in RPM form, were causing serious problems on some types of Xeon processor and were being withdrawn. Red Hat chose to do this by releasing a new version of the RPM containing the previous microcode. Red Hat also advised customers to deal directly with Intel or hardware manufacturers for future microcode updates. At some point that's just what we'll have to do, by the look of it.

We'll continue to need lots of reboots. In particular there are still ten machines running a 7.2 kernel!

The KVM servers need downtime for the 7.4 upgrade, security mitigation, and firmware updates. Chris has scheduled downtime for oyster and waterloo but we decided to delay this by a week to give us more time to look into adding extra processor features to the KVM clients (see PCID is now a critical performance/security feature on x86).
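
For the record, one way to expose such a feature through libvirt is a <cpu> element in the guest's domain XML along these lines (the CPU model shown is purely illustrative):

    <cpu mode='custom' match='exact'>
      <model>Haswell-noTSX</model>
      <!-- require pcid/invpcid so the guest kernel can use PCID with its page-table isolation -->
      <feature policy='require' name='pcid'/>
      <feature policy='require' name='invpcid'/>
    </cpu>

Alternatively <cpu mode='host-passthrough'/> hands the guest the full host CPU, at some cost to live-migration flexibility.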

SL 7.4 has now gone on to all desktops.

At some point we'll turn on IPv6 for servers. We want the MPU servers to be ahead of the game, so Stephen has been going through our services one by one, looking for potential problems, fixing them and then enabling IPv6. In particular he's been checking ACLs, for instance to ensure that they don't rely solely on IPv4 addresses such as 129.215. So far he's done the buildhosts and the package forge machines; next we should look at the package infrastructure machines and the Tartarus hosts, with the LCFG servers to follow later. Our new machines should be made to support IPv6 before being put into service.
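
As an illustration of the kind of ACL audit involved (the file format and what counts as "IPv6 present" are assumptions for the example, not how the real ACLs are checked), a crude scan for IPv4-only entries could look like:

    import re
    import sys

    IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b|\b129\.215\b")        # dotted quads, or a bare 129.215 prefix
    IPV6 = re.compile(r"\b[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4}){2,7}\b")  # rough colon-hex match

    # Flag lines that restrict access by IPv4 address alone, since such
    # rules silently stop matching once clients start arriving over IPv6.
    for path in sys.argv[1:]:
        with open(path) as fh:
            for lineno, line in enumerate(fh, 1):
                if IPV4.search(line) and not IPV6.search(line):
                    print(f"{path}:{lineno}: IPv4-only entry: {line.strip()}")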

Toby has added support to our sixkts server for adding additional subjectAltNames to our certificates. Stephen has tested this on one of our machines hosting multiple services and found it to work well.
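
A quick way to confirm that a multi-service host is presenting the expected names (the hostname below is a placeholder) is to read the subjectAltName entries straight off the served certificate:

    import socket
    import ssl

    def list_sans(host, port=443):
        """Return the DNS subjectAltNames presented by host:port."""
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port)) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]

    print(list_sans("www.example.org"))  # placeholder host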

Bug:1025 (fstab component can't cope with nvme devices) has appeared in the wild. We fear that it will be difficult to add support for this to the fstab component.
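
Part of the awkwardness is the naming convention: NVMe namespaces take a "p" separator before the partition number where traditional disks do not. A minimal sketch of that difference (not the fstab component's actual logic) is:

    def partition_device(disk, number):
        """Map a whole-disk device to one of its partition devices."""
        # /dev/sda + 1 -> /dev/sda1, but /dev/nvme0n1 + 1 -> /dev/nvme0n1p1:
        # devices whose base name ends in a digit need a "p" separator.
        sep = "p" if disk[-1].isdigit() else ""
        return f"{disk}{sep}{number}"

    assert partition_device("/dev/sda", 1) == "/dev/sda1"
    assert partition_device("/dev/nvme0n1", 1) == "/dev/nvme0n1p1"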

Encryption of swap and tmp partitions is now universal across DICE desktops! Well done Alastair.

Alastair has been getting quotes for increasing the storage space on the Forum-based KVM servers (see the costings under Miscellaneous Development above). We realised that when increasing the storage space we should also think through the implications for the VM suspend space, so Chris will look into that.

Alastair has tried the latest Virtual DICE on Windows and Mac. He encountered delays now and then but no show-stoppers - it was mostly quite usable. Chris will try it too and release it soon.

Following the spate of last minute SL7 server upgrades in December, Chris will hunt down abandoned SL6 VMs with a view to termination.

This Week

  • Alastair
    • Inventory project
      • continue working through TartarusWorkFlow
      • Document clientreport (eg how to add modules)
      • Document order sync code
      • Document hpreport processing script
      • Start work on final report!
      • Consider what else needs doing besides docs, tidying and backups
      • Blog something....take dev meeting talks
      • Give Ian D details on how Tartarus tables are accessed, for inclusion in his privileged access discussion paper
      • Look at postgresql replication (do after shipping)
      • Move Tartarus to IPv6 before going live
    • Schedule MPU meeting to discuss systemd ordering
    • Check sysmans (et al) have 'nograce'.
    • Take a look at RT #78875
    • Look at /etc/hosts - DNS issue (IPv6?)
      • work out what we need to fix current problem
    • Circulate info on RH7.3 systemd changes we may wish to consider
    • RT actions (as agreed)
    • Implement change to kvmtool to allow KVMs to be marked as disabled
    • Look at Stephen's 'Thoughts on shell components'
    • Look at MPUActivitiesList
    • Start looking at https and computing.help (remove assumption that https means want cosign login)
      • wait on Neil's efforts with EdWeb
    • Chase Alison about LCFG check monitoring (start doing again)
    • Investigate systemd reboot bug on gaivota and add some more debugging (store tree diff somewhere)
    • Try latest vdice.ova (sensa) and steno and record problems in detail
      • Try on a Mac as well as Windows
      • Try on Windows with no net connection
    • If in Forum server room, review MPU rack usage
    • Review 'ssh on a mac' - someone else did this (Chris?)
    • Start upgrading MPU servers to 7.4
      • upgrade computing.help servers (all bar 'lagun' done)
      • upgrade bandama (tartarus)
      • upgrade salamanca - remember to update firmware (Check whether this is needed)
    • Get costings for increasing storage space for Forum KVM servers (and get assertive in new year about tidying up old VMs)
      • azul - has 2 spare slots. 3.5" 15K 0.9TB = 293 + VAT (Would need 15K as only 2 drives in a set) - 400-APGL (?)
      • gaivota/girassol - have 6 spare slots each (2.5"). 2.5" 10K 1.2TB (400-AJPD) = 194+VAT, 1.8TB (400-AJQP) = 272+VAT (Would need 4 drives if want 10K)
      • Decide - focus on gaivota and girassol - (Update: RT ticket to get firm pricing)
    • Add tartarus info to SwitchToSelfManaged

  • Chris
    • Inventory project
      • Continue work on clientreport modules for replacing firmwarereport
    • Look at MPUActivitiesList
    • Look at RT
    • Continue work on SL7 coordination final project report (currently pending other units completing)
    • Ship latest Virtual DICE (once Alastair has double-checked it at home on Windows)
    • If in Forum server room, review MPU rack usage
    • Start upgrading MPU servers to 7.4
      • deneb, hare and wildcat
    • libvirt - test for memory leaks (wrt console servers); Ian will test it after the 17 January stable release
    • girassol still has some storage for superseded or deleted VMs. They may have been preserved deliberately - will now investigate.
    • Ship 7.4 version of Virtual DICE
    • User training materials project #403
      • start work on fleshing out the aims and possible deliverables of the project
    • Schedule KVM upgrades - oyster and waterloo for week beginning 29th Jan
    • Chris to consider what effect adding additional disks to Forum KVM servers would have on suspend disk space

  • Stephen
    • LCFG client refactor stage 2
    • RT actions (as agreed)
    • submit polkit bug to Red Hat - with Alastair (still exists under 7.3)
    • Produce some text for systemd mount bug (to submit to RH)
    • Take the issue of disabling per-user journald logs on certain servers to OPS
    • Schedule jubilee downtime to move to SOL
    • Consider PD work for after LCFG client ...
      • looking at Ceph
    • Look at MPUActivitiesList
    • On metropolitan, find the fastest baud rate at which we can drive the real physical consoles. (This is so we can decide whether to use physical consoles for KVM servers.)
    • Look at where we're using ALL in access.conf
    • If in Forum server room, review MPU rack usage
    • Agree with RAT on how software package requests are handled - waiting on Graham to document this
    • Start off NX replacement project (#389)
    • Upgrading MPU servers to 7.4
      • NX servers
    • Complete Spectre-proof microcode distribution

-- AlastairScobie - 18 Jan 2018
