MPU Meeting Tuesday 24 September 2013

Inventory

Merging the proposed system and item tables is generally thought to be a good move.

The next question is what to index. Stephen suggested that indexing all foreign keys would give a significant speed-up.
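As a rough illustration of Stephen's suggestion, the sketch below finds every foreign-key column in a database and makes sure an index exists on it. It uses Python's built-in sqlite3 module purely to keep the example self-contained; the minutes do not say which database the inventory uses, and the table and column names (item, location, location_id) are hypothetical.

```python
import sqlite3


def index_foreign_keys(conn):
    """Ensure an index exists on every foreign-key column.

    Returns the names of the indexes it created or found already present.
    Composite foreign keys get one single-column index per column here,
    which is a simplification.
    """
    ensured = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    for table in tables:
        # Each row is (id, seq, referenced_table, from_column, to_column, ...)
        fks = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        for fk in fks:
            column = fk[3]
            index_name = f"idx_{table}_{column}"
            conn.execute(
                f'CREATE INDEX IF NOT EXISTS "{index_name}" '
                f'ON "{table}" ("{column}")')
            ensured.append(index_name)
    return ensured


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical two-table schema, for illustration only.
    conn.executescript("""
        CREATE TABLE location (id INTEGER PRIMARY KEY, room TEXT);
        CREATE TABLE item (
            id INTEGER PRIMARY KEY,
            location_id INTEGER REFERENCES location(id)
        );
    """)
    print(index_foreign_keys(conn))
```

Whether such indexes help depends on table sizes and query patterns, so it is worth timing representative queries before and after.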

Alastair also needs to have a thorough discussion with the CSOs about the proposed design. There's a potential problem: with the new inventory design, self-managed machines shouldn't need to be tracked using host names, and we shouldn't need to know their host names even if they are given any. However, this doesn't accord with the working methods of some CSOs, who find it convenient to remember machines by name so that problems and history can be associated with each machine, and so that it is easier to make a self-managed machine into a DICE machine. To help with this they create LCFG files for each self-managed machine, whether or not it needs one for other purposes.

This has led to a situation where the COs regard the DNS as the canonical list of which hostnames are in use or not; the CSOs regard LCFG as the canonical authority; and Alastair intends that the new inventory hold the master list. This needs sorting out.

Stephen suggested that the need to remember hostnames could be lessened if we had a system which drew together data from various sources - for instance BuzzSaw reports, RT, LCFG, inventory - making it easy to get a quick summary of the life and problem history of a particular machine.
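A minimal sketch of the kind of summary Stephen describes might look like the following. It assumes each source has already been converted into a list of simple event records; the Event class, its fields and the machine_history function are hypothetical, and no real BuzzSaw, RT, LCFG or inventory interface is shown.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    when: date
    source: str   # e.g. "RT", "LCFG", "inventory"
    host: str
    summary: str


def machine_history(host, *sources):
    """Return every event for `host` across all sources, oldest first."""
    events = [e for source in sources for e in source if e.host == host]
    return sorted(events, key=lambda e: e.when)


if __name__ == "__main__":
    # Invented records, for illustration only.
    rt = [Event(date(2013, 9, 2), "RT", "hostA", "disk fault reported")]
    inventory = [Event(date(2012, 1, 10), "inventory", "hostA", "purchased")]
    for event in machine_history("hostA", rt, inventory):
        print(event.when, event.source, event.summary)
```

The real work would be in the per-source adapters that produce these records; the merge itself is just a filter and a sort by date.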

Virtual DICE

The update-vdice script is now in place. It can run updaterpms, but does not do so by default.

There's been a lot more work on the documentation.

RAT reckons that all semester 1 software is there apart from isabelle and that many students may not need that anyway, so we should go ahead and make the first release.

Stephen suggested a small readme in the images download directory.

Chris and Alastair found some interesting yum behaviour. Chris repeatedly tried yum search on a freshly installed 32-bit VM and found that it crashed with alarming strings of Python errors, whereas Alastair, doing yum update on an existing 64-bit VM, had no such difficulties. Stephen recognised this as a known yum bug: it'll crash on a freshly installed machine until you run yum update or yum check-update. Chris has added this to the pre-ship procedure list on the Virtual DICE management page.

yum update on the VM wanted to replace, amongst other things, our openafs packages with SL ones. However Stephen reckons that that's fine as they're just as up to date as ours. update-vdice can always restore the local packages if needed. We'll advise users not to run yum update anyway - yum install should be safer.

LCFG Client Refactoring

No activity.

Miscellaneous Development

NX
This seems to be a success. It's popular. The host seems to be coping well so far, as it has plenty of processor cores and plenty of memory.
Critical shutdown scripts
These are almost finished; Stephen just needs to pull together the criticality data and the inventory data to give us the ability to shut down, for instance, all low criticality machines in a particular room.
perl-Sub-Name
We have used this for a couple of years in a local RPM (perl-Sub-Name-0.05-2.inf) made using cpanspec. It's used in Nagios, RT and pkgsearch. Since we made our RPM, a package has appeared in EPEL (perl-Sub-Name-0.05-6.el6). Use of that version results in a package conflict on any service which also attempts to use Nagios. Stephen will sort it out.
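For the critical shutdown scripts item above, the selection step could be sketched roughly as below. The two dictionaries stand in for the inventory and criticality data; their shape, the level names and the select_for_shutdown function are assumptions made for illustration, not a description of Stephen's script.

```python
def select_for_shutdown(inventory, criticality, room, level="low"):
    """Return the hosts in `room` whose criticality equals `level`.

    `inventory` maps host name to room; `criticality` maps host name to a
    level string. Hosts with no criticality record are skipped rather than
    guessed at.
    """
    return sorted(
        host for host, host_room in inventory.items()
        if host_room == room and criticality.get(host) == level
    )


if __name__ == "__main__":
    # Invented data, for illustration only.
    inventory = {"hostA": "room1", "hostB": "room1", "hostC": "room2"}
    criticality = {"hostA": "low", "hostB": "high", "hostC": "low"}
    print(select_for_shutdown(inventory, criticality, "room1"))
```

Skipping hosts that have no criticality record is a deliberately cautious choice: a machine is only shut down when both data sets agree it qualifies.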

Operational

KVM reboots
We propose a new policy for KVM server reboots, to be discussed at the 25 September Operational Meeting. Whilst the KVM virtualisation service has been a success, reducing the number of physical machines and energy load, it has increased the operational workload of the MP unit. Much of that load comes from trying to schedule KVM host reboots, including migrating guests from one server to another.
In order to reduce this load, MPU propose that KVM host reboots will be announced some period in advance (1-2 weeks?) and these reboots will only be cancelled for very good reason. The reboots will take place early morning (8am). Only very critical services will be migrated to another server - if possible. Other guests will be suspended or rebooted depending on what KVM allows.
The MP unit will try to minimise the number of KVM host reboots - we expect 2-3 reboots per host per year.
DIYDICE
It's still on SL6.3. Whoops. Chris will talk to the main (other) user and look into the possibility of an upgrade to 6.4.
refreshpkgs
It's having trouble with the inf bucket. It's not yet clear whether this is because of the vast number of RPMs there or because of the size of some of them, though indications point to the latter. Alastair is investigating it.
Purchasing
We're iterating towards a server buying plan for the next three years.
Security documentation
We have a lot of it, but where? Stephen will pull it all together in a Security section on the MPU wiki.

This Week

  • Alastair
    • Start Inventory project diary
    • Inventory project
      • Talk with CSOs - principally to ensure we have covered every possible state and transition, and that the design is not overly complicated to use. Possible issue with hostnames for dynamic-IP self-managed machines.
      • Add "fault" type to the history/changelog table - and rename the history/changelog table as "logbook"
      • Submit bug/enh to App::Cmd author wrt option to die on unspecified options
      • Pester George about location API
      • Create indexes on all foreign references (Stephen reports not done by default) - no measurable performance improvement.
    • Order a spare 600GB disk for waterloo.
    • Ask George - what does the TXretransmit value mean for switch connections?
    • Consider how to make metropolitan usable by users
      • ISOs
      • minimal docs (mostly manual)
      • they'll use virt-manager, but not create machines or change config
    • circulate table of LCFG bugs
    • Pandemic documentation
    • refreshpkgs - document why we timestamp when we do, and the repercussions; also look at createrepo debugging to find the cause of the large memory footprint. Reckon the problem was caused by a corrupt createrepo database - a full rebuild appears to have fixed it.
    • T2 report

  • Chris
    • Virtualised DICE image
      • Add yum configuration for our public buckets (so that people can install extra packages using yum, rather than updaterpms)
      • Try with some student guinea pigs first
    • Pandemic documentation
    • Finish off pkgsearch config and investigate db access problem
    • Flesh out spending plan
    • Speak to Paul about DIYDICE upgrade to SL 6.4

  • Stephen
    • discuss release testing variants with Richard and then document.
    • Tidy up NX config
    • Complete Server shutdown script
    • perl-Sub-Name - sort out versions
    • Pandemic documentation

  • Carol
-- AlastairScobie - 24 Sep 2013
Topic revision: r12 - 01 Oct 2013 - 07:27:40 - StephenQuinney
 