MPU Meeting Wednesday 6th September 2017

Inventory

Alastair has been working on code quality improvements and on bringing man pages up to date. He intends to complete the documentation of commands and the user documentation before the system goes live, then finish the rest of the documentation afterwards.

LCFG Client refactoring

Stephen has been working on test scripts which will enable Kenny MacDonald to test the new client software against his (many thousands of) LCFG profiles.

Miscellaneous development

  • ProtectHome - Stephen has been looking at the problem of some components (e.g. file, localhome, autofs) being unable to delete the /home directory on SL7. This is being caused by a "ProtectHome" option in some systemd service files, notably that for lldpd. To solve the immediate lldpd problem he has disabled "ProtectHome" in lldpd.service, and he has also moved lldpd to later in the boot sequence - after lcfg-multi-user-stable.target, therefore well after the file component. Since this problem is likely to keep popping up as service files are updated, he has also introduced a more general install-time solution: the new dice/options/home-link.h header introduces common resources for the replacement of /home, and adds a new install method which will replace /home (as decreed by those resources) early in the install process, before the machine reboots and systemd gets in the way. However, note that this still doesn't protect against the problem happening after install, for example when the localhome component is added to a running machine. The question of exams is still being looked into, too.
  • Apache logs are now compressed by default. There is a one-rotation delay before compression kicks in, so that a .1 log won't be compressed but older logs will be. Lots of our web servers already had ad-hoc logrotate compression config so this change has introduced quite a bit more standardisation to our apache configs.
  • The mock, firefox, yum and autoreboot components now start later in the boot process, after lcfg-multi-user-stable.target.
  • We've mailed users to warn about the problems which the bumper security update caused to some running MATE and GNOME sessions.
  • dracut was patched to correct a problem in its handling of /dev/shm. We have known about this problem, and have been patching it, for a long time, and were surprised to discover that it hadn't been fixed in Red Hat's latest dracut.
  • Alastair has extended clientreport to report on video cards. While there he also fixed a minor bug in which memory was misreported in some newer models including the R730.
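
Since the ProtectHome problem noted above is expected to recur as service files are updated, it may help to list which installed units set the option. The following is a minimal Python sketch, not Stephen's tooling: the function names find_protect_home and restricts_home are hypothetical, and the parsing is deliberately simple (it ignores drop-in directories and line continuations).

```python
from pathlib import Path


def find_protect_home(unit_dir):
    """Return {unit file name: value} for each *.service file that sets
    ProtectHome in its [Service] section."""
    hits = {}
    for unit in sorted(Path(unit_dir).glob("*.service")):
        section = None
        for raw in unit.read_text(errors="replace").splitlines():
            line = raw.strip()
            if not line or line.startswith(("#", ";")):
                continue  # skip blank lines and comments
            if line.startswith("[") and line.endswith("]"):
                section = line[1:-1]
                continue
            key, sep, value = line.partition("=")
            if sep and section == "Service" and key.strip() == "ProtectHome":
                # A later assignment in the same file overrides an earlier one.
                hits[unit.name] = value.strip()
    return hits


def restricts_home(value):
    """True unless the value is one of systemd's boolean 'false' spellings."""
    return value.lower() not in ("no", "false", "off", "0")
```

Calling find_protect_home("/usr/lib/systemd/system") on an SL7 machine and filtering the result with restricts_home would give a list of candidate units to review after each package update.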

Operational

  • An entire subnet was under constant bombardment recently, though it's not clear whether from a determined DoS attacker or a misconfigured machine. We firewalled the subnet.
  • The Nvidia drivers have been brought up to date.
  • The problem with girassol turned out to be caused by an elderly fibre channel card, a 2400 series. The machine works properly with 7.3 and the latest kernel when fitted with a 2500 series card instead. girassol is now stable once more and being brought back into use. gaivota will be upgraded next, once we've found a 2500 series FC card for it.
  • The ssh server was rebooted.
  • The MovingWires instructions were tested and found still to be accurate.
  • All of the circlevm VMs came up in a recent broken consoles report. This turned out to be because their configuration had been broken by the removal of localaccounts.h from an apparently unrelated header! This in turn led to the discovery of numerous uses of local accounts without the inclusion of localaccounts.h. We're now checking for, and fixing, any remaining cases.
  • On a KVM server running SL 7.2, the service which suspends the VMs times out after 90 seconds by default. Since it can take some minutes to suspend all the VMs on a typical KVM server, this is unhelpful! The problem is fixed in 7.3, and can be fixed in 7.2 by adding TimeoutStopSec=0 to the "Service" section of /usr/lib/systemd/system/libvirt-guests.service. This eliminates the timeout.
  • We're going to look at using fail2ban with apache.
  • For future KVM server reboots or upgrades, we'll organise a wiki page in which the treatment of every affected VM will be explicitly stated and in which each unit can sign off its agreement.
  • The wandering time report will now come out weekly rather than daily.
  • We're mulling over replacing pkgsubmit with a python script, so that it can check that all submitted files have been fully and completely submitted, rather than for instance having been truncated because of full filesystems or quotas. At the same time we'll look into enforcing the use of acceptable (to rpmlib) names and versions for RPM files.
  • Ian encountered a problem increasing the number of vCPUs in a KVM guest; Chris will investigate the problem and check the documentation.
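
To illustrate the pkgsubmit idea above, here is a minimal Python sketch of a submission step that refuses badly named files and verifies that the copy is complete. It is not the planned replacement: the submit function, the ".partial" suffix and the RPM_NAME pattern are all assumptions made for illustration, and the pattern is stricter than rpmlib itself (it relies only on the fact that version and release may not contain a hyphen).

```python
import hashlib
import re
import shutil
from pathlib import Path

# Hypothetical pattern for name-version-release.arch.rpm file names.
RPM_NAME = re.compile(
    r"^(?P<name>[A-Za-z0-9][A-Za-z0-9._+-]*)"
    r"-(?P<version>[A-Za-z0-9._+~]+)"
    r"-(?P<release>[A-Za-z0-9._+~]+)"
    r"\.(?P<arch>[A-Za-z0-9_]+)\.rpm$"
)


def sha256sum(path):
    """Checksum a file in 1 MiB chunks so large RPMs are not read into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def submit(source, dest_dir):
    """Copy source into dest_dir, failing loudly on a bad name or a short copy."""
    source = Path(source)
    if RPM_NAME.match(source.name) is None:
        raise ValueError(f"unacceptable RPM file name: {source.name}")
    target = Path(dest_dir) / source.name
    partial = target.with_name(target.name + ".partial")
    try:
        shutil.copyfile(source, partial)  # raises OSError on a full disk or quota
        if sha256sum(partial) != sha256sum(source):
            raise OSError(f"copy of {source.name} is incomplete or corrupt")
        partial.replace(target)  # rename into place only after verification
    finally:
        if partial.exists():
            partial.unlink()  # never leave a truncated file behind
    return target
```

Copying to a temporary name and renaming only after the checksum matches means a truncated file never appears under its final name.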

This Week

  • Alastair
    • Inventory project
      • continue working through TartarusWorkFlow
      • Document clientreport (e.g. how to add modules)
      • Document order sync code
      • Document hpreport processing script
      • Continue work on RESTful API - TartarusRESTAPI
      • Document REST API
      • Write more of the ii commands, documenting them as they are written
      • Start work on final report!
      • How to represent VMs
      • Continue with REST API testing framework
      • Consider what else needs done other than docs and tidying and backups
      • Blog something... take dev meeting talks
    • Deploy encrypted /tmp and swap conversion script
      • Need to warn users that Gnome3 may pop up a window about /tmp being full (when script is run)
      • Now down to 5 desktops, 3 users, 2 COs
    • Schedule MPU meeting to discuss systemd ordering
    • Check sysmans (et al) have 'nograce'.
    • Take a look at RT #78875
    • Look at /etc/hosts - DNS issue (IPv6?)
      • work out what we need to fix current problem
    • Circulate info on RH7.3 systemd changes we may wish to consider
    • RT actions (as agreed)
    • Deploy disable-module header on all computing.help servers
      • Added to computing-help-server header, but after 20/9 will need to check config and reboot help servers
    • Is there a route via libvirt to mark a VM as being disabled?
      • Only option looks like adding something like DISABLED to the name of the first disk image
      • Is it possible to add additional fields into the XML file which a local script could interpret?
        • Yes, there's a field -
          <metadata> <kvmtool:instance xmlns:kvmtool="http://"><enabled/></kvmtool:instance></metadata>
    • Look at Stephen's 'Thoughts on shell components'
    • Upgrade gaivota to 7.3
      • remember to replace FC card with a 25xx series card (alexandria?)
      • remember to run the libvirt_guests script manually to shut down guests (don't use systemd, as it will time out due to the length of time the script needs)
      • remember to have a page so that units can sign off that they understand that the server is being upgraded - listing machines and whether they will be suspended, shut down or migrated
    • Buy more memory for azul (upgrade to 256GB)
      • Space for 16 DIMMs. Currently 8 x 16GB. Another 8 x 16GB (DDR4-2133 dual rank RDIMM 1.2V) max 3000 from Dell, 2000 from Crucial
    • Look at buying more disks for gaivota and girassol
      • Space for 6 drives (600GB 10K) giving 1.8TB usable space (each server). Cost max 2400 each server
    • Look at MPUActivitiesList
    • Wandering time -> run once a week
    • fix lcfg-systemd to be more tolerant of rsync failures

  • Chris
    • Inventory project
      • Continue work on clientreport modules for replacing firmwarereport
    • Produce script to monitor package volume usage
      • run on deneb, using contents of /etc/buckets.conf to find which AFS volumes to check
    • Look at MPUActivitiesList
    • Reinstate girassol KVM guests
    • Check validity of our instructions for adding extra CPUs to KVM guests

  • Stephen
    • LCFG client refactor stage 2
      • testing and documentation
    • LCFG server symlink to exam branches - produce reporting script and discuss with Graham
    • Submit polkit bug to Red Hat - with Alastair (still exists under 7.3)
    • Draft a position note on shell components under SL8 and possible ways forward
    • Produce some text for systemd mount bug (to submit to RH)
    • RT actions (as per agreed list) once 7.3 fully deployed
    • Take the issue of disabling per-user journald logs on certain servers to OPS
    • Schedule jubilee downtime to move to SOL
    • Consider PD work for after LCFG client
    • File bug against lcfg-systemd - spurious warnings about missing targets at first boot.
    • Upgrade waterloo and oyster to 7.3
    • Look at MPUActivitiesList
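
On the question under Alastair's items of marking a VM as disabled through a custom metadata field, a local script could read the domain XML (for instance the output of virsh dumpxml) and look for the flag. The following Python sketch shows one way this might work. The vm_is_enabled function is hypothetical, and because the namespace URI is elided in the notes above, the URI used in the usage example is a placeholder, not the real one.

```python
import xml.etree.ElementTree as ET


def vm_is_enabled(domain_xml, namespace, element="instance", flag="enabled"):
    """Return True/False if the local metadata element is present, else None.

    namespace is the URI bound to the kvmtool prefix; the flag element is
    looked up without a namespace, matching the fragment in the notes.
    """
    root = ET.fromstring(domain_xml)
    instance = root.find(f"./metadata/{{{namespace}}}{element}")
    if instance is None:
        return None  # the VM carries no local metadata at all
    return instance.find(flag) is not None


# Usage with a placeholder namespace URI (an assumption, not the real value).
NS = "http://example.invalid/kvmtool"
SAMPLE = f"""
<domain type="kvm">
  <name>demo</name>
  <metadata>
    <kvmtool:instance xmlns:kvmtool="{NS}"><enabled/></kvmtool:instance>
  </metadata>
</domain>
"""
print(vm_is_enabled(SAMPLE, NS))
```

Returning None when the metadata element is absent lets a caller distinguish "explicitly disabled" from "never annotated", which seems a safer default than treating unannotated VMs as disabled.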

-- AlastairScobie - 06 Sep 2017

Topic revision: r9 - 24 Sep 2019 - 13:50:24 - AlastairScobie
 