MPU Meeting Tuesday 20th August 2013

Inventory

Alastair mailed round his brain dump and asked for comments but hasn't had any responses so far. He's going to think about how best to check with the CSOs (the main users of the inventory system) that his ideas will cover every state and transition and that the system won't be too complicated to use. He'll also plan a technical talk.

DICE Energy Savings

No progress since last week.

Stephen will think about how to delete the old records from the BuzzSaw database while Chris will look at generating reports containing no potentially personal information.

Virtual DICE

Chris is writing a script to download the necessary user and group information and build local passwd and group DB files.
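As a rough illustration of the second half of that job, the sketch below turns already-downloaded user and group records into lines in the conventional passwd and group flat-file layout. The record field names and sample values are assumptions made for this example; the notes do not describe Chris's actual script or the DB file format it produces.

```python
def _check(field):
    """Reject values that would corrupt a colon-separated record."""
    text = str(field)
    if ":" in text or "\n" in text:
        raise ValueError("invalid character in field: %r" % text)
    return text


def passwd_line(user):
    """Format one user record (hypothetical dict layout) as a passwd line."""
    fields = [user["name"], "x", user["uid"], user["gid"],
              user["gecos"], user["home"], user["shell"]]
    return ":".join(_check(field) for field in fields)


def group_line(group):
    """Format one group record (hypothetical dict layout) as a group line."""
    members = ",".join(_check(member) for member in group["members"])
    fields = [_check(group["name"]), "x", _check(group["gid"]), members]
    return ":".join(fields)
```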

LCFG Client Refactoring

Stephen has added more testing of LCFG::Client::FileLocator. Remote fetches via http are now tested. The test uses HTTP::Server::Simple which simply serves some files from a given directory to localhost over http. HTTP::Server::Simple needed a couple of additions to make it suitable - support for HTTP 1.1 (and 304 Not Modified) and for simple usernames and passwords.
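The same testing idea (serve a directory to localhost, fetch a file, then repeat the fetch conditionally and expect 304 Not Modified) can be sketched with Python's standard library. This is an analogue for illustration only, not the Perl test itself; the helper names start_file_server and fetch are invented for this example, and it does not cover the username and password support mentioned above.

```python
import functools
import http.client
import http.server
import threading


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that does not log each request to stderr."""

    def log_message(self, format, *args):
        pass


def start_file_server(directory):
    """Serve `directory` on a free localhost port in a background thread."""
    handler = functools.partial(QuietHandler, directory=directory)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(server, path, if_modified_since=None):
    """GET `path`; return (status, Last-Modified header, body)."""
    headers = {}
    if if_modified_since is not None:
        headers["If-Modified-Since"] = if_modified_since
    port = server.server_address[1]
    connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        connection.request("GET", path, headers=headers)
        response = connection.getresponse()
        return response.status, response.getheader("Last-Modified"), response.read()
    finally:
        connection.close()
```

A test would fetch a file once, then pass the returned Last-Modified value back as if_modified_since and check that the second status is 304.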

Next he will write tests for the building of DB files, which will probably boil down to checking MD5 checksums, and tests to compare complex data structures. After that the tests will be able to cover rdxprof.
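A checksum-based comparison of a built file against a known-good reference might look like the following sketch. It is written in Python for illustration; the function names are invented here and the real tests may be structured quite differently.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(built_path, reference_path):
    """True when the built file is byte-identical to the reference (by MD5)."""
    return md5_of_file(built_path) == md5_of_file(reference_path)
```

This only works if the DB build is deterministic, so that identical input always yields identical bytes.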

Also while working on tests Stephen changed the fetch code parsing of client.url since it wouldn't let you specify a port number.
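For illustration, URL parsing that accepts an optional port can be sketched as below. This is Python rather than the client's own code, the function name and the example host are hypothetical, and it assumes a single http or https URL.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def parse_profile_url(url):
    """Split a URL into (scheme, host, port, path), defaulting the port."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
        raise ValueError("unsupported URL: %r" % url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]
    return parts.scheme, parts.hostname, port, parts.path or "/"
```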

Stephen has finished documenting the modules; once they have been installed use e.g. perldoc to see the documentation.

Miscellaneous Development

Criticality improvements
Stephen has been tackling our action from OperationalMeetingActions, 'Sundry "criticality" things from discussion'. He has now enhanced the SET_CRITICALITY macro:
  • sysinfo.criticality is now only set for machines using one or other of the server headers.
  • For servers it's set to low by default.
  • By default servers will be in a critlevel_low netgroup.
  • Servers of medium or high criticality will be removed from critlevel_low and added to critlevel_medium or critlevel_high.
  • Be sure to use lower case for low, medium or high when using SET_CRITICALITY.
  • Virtual machines are now in a new HOST_virtual netgroup.
  • It's all in this week's stable release.
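The netgroup rules in the list above can be modelled in a few lines. This is a sketch of the membership logic only, written in Python for illustration; it is not the SET_CRITICALITY macro itself, and the function and parameter names are invented for this example.

```python
VALID_LEVELS = ("low", "medium", "high")


def criticality_netgroups(is_server, level="low", is_virtual=False):
    """Return the netgroups a machine would belong to under the rules above."""
    groups = []
    if is_server:
        # Levels must be lower case, as the notes warn.
        if level not in VALID_LEVELS:
            raise ValueError("criticality must be one of %s" % (VALID_LEVELS,))
        # A server is in exactly one critlevel_* netgroup (low by default).
        groups.append("critlevel_" + level)
    if is_virtual:
        groups.append("HOST_virtual")
    return groups
```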
The next step will be to produce simple scripts which attempt to shut down all servers of a given criticality level at a given site. They'll ignore virtual machines. Initially they will probably use ssh but it would be good at some point to look at using remctl for such remote execution tasks. If we had a remctl process listening on every server we might be able to do things (like om or reboots) more robustly. For instance remctl shouldn't need LDAP to be functioning whereas ssh does.
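A first cut of the selection logic for such a script could look like the sketch below. It only builds the ssh command lines and does not run them. The host record layout, the site names and the exact shutdown command are all assumptions made for this example; in practice the host list would presumably come from the netgroups.

```python
def shutdown_commands(hosts, level, site):
    """Build ssh command lines for physical servers of one level at one site."""
    commands = []
    for host in hosts:
        if host["virtual"]:
            # Virtual machines are ignored, as described above.
            continue
        if host["criticality"] == level and host["site"] == site:
            commands.append(["ssh", host["name"], "shutdown", "-h", "now"])
    return commands
```

Returning the commands rather than executing them makes a dry run easy and keeps the selection logic testable on its own.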

Operational

mod_waklog
It doesn't work with OpenAFS 1.4.15. This means that brendel has an urgent problem. We're going to look for solutions to keep it working after we remove support for Single DES encryption from AFS on 2 September.
mod_waklog (2)
We use a patched version of it which enables "allow weak crypto" (i.e. Single DES encryption). We need to remove that patch, test unpatched versions, and have working packages available ready for all machines which use mod_waklog.
brendel's AFS cache
It wasn't using the right partition for its AFS cache. Stephen has put this right. When brendel is next reinstalled we'll need to revisit its disk partitioning.
district problems and KVM reports
Whoever's on operational duty will check the KVM reports, and in particular will highlight the absence of a report from any of our KVM servers.
iDRAC6 firmware update
We have a new firmware update for servers using the iDRAC6 (e.g. Dell PowerEdge R710) to take its firmware to 1.95. Alastair has tested it on metropolitan and it seems to work so Chris will add it to the goodfirmware map.
Autoreboot for MPU servers
Our thoughts on this seem OK so Chris will look into implementing them. He'll give careful thought to ordering the reboots on separate nights in more complex cases, and he'll compare notes with RAT to get the benefit of their experience in this area.

This Week

  • Alastair

    • refreshpkgs working with AFS 1.6.5 using AFS-command. refreshpkgs done - updatepkgsvolumes will be more complicated to do
    • Start Inventory project diary
    • Inventory project
      • Consider how to present design to CSOs - principally to ensure have covered every possible state and transition. Also to ensure not overly complicated to use
      • Consider technical talk
      • Submit bug/enh to App::Cmd author wrt option to die on unspecified options
      • Pester George about location API
    • Order a spare 600GB disk for waterloo.
    • Discuss NFS installroot problem with George - so why stopped working???
    • Ask George - what does the TXretransmit value mean for switch connections?
    • Look at why circle didn't have disk space to run updaterpms at last boot. Rebooted fine, with no recurrence of disk space message
    • Look at whether there are any simple tools to allow users to manage their own kvms on metropolitan
      • http://www.linux-kvm.org/page/Management_Tools
      • A few bare metal solutions (based on KVM etc) eg stackops, opennode.
      • Most OS based solutions are heavy weight - quite a bit of effort to get going on SL
      • Not keen on maintenance required for either of above => suggest sticking with virt-manager and accepting the risk of people breaking other people's machines.
    • circulate table of LCFG bugs
    • Tidy up circlevm[0-10]
      • circlevm3 - ocsinventory - want to keep for now
      • circlevm6 - computing.help slave
    • Move computing help slave from circlevm6 to another server
    • Discuss pkgsearch with Roger - wrt handover - Roger working on header.
    • Look at hung vgs processes on district. This looks like it is because a VG on the SAN that was accessible to district has now been made unavailable without the VG first being disabled. /sbin/vgs is attempting to read the /dev/mapper/XXXXXXXX file but it's hanging - and because vgs is hanging, all the cron jobs are hanging.

  • Chris
    • Virtualised DICE image
      • Finish off coding for auth fetches
      • Alastair and/or Stephen to try an image
    • Discuss autorebooting servers with RAT and implement

  • Stephen
    • Investigate mod_waklog behaviour when single des removed (using kerberos test cell)
    • Client refactoring project
      • complete writing more tests
    • Further investigation wrt AT / HP 8300 updaterpms issues
    • discuss release testing variants with Richard and then document.
    • Look at why district hasn't rebooted (should have auto-rebooted)
    • Start looking at freenx
    • Tidy up circlevm[0-10]

  • Carol

-- AlastairScobie - 20 Aug 2013

Topic revision: r9 - 13 Jan 2017 - 14:57:41 - StephenQuinney
 