MPU Meeting Tuesday 20th August 2013
Inventory
Alastair mailed round his brain dump and asked for comments but hasn't had any responses so far. He's going to think about how best to check with the CSOs (the main users of the inventory system) that his ideas will cover every state and transition and that the system won't be too complicated to use. He'll also plan a technical talk.
DICE Energy Savings
No progress since last week.
Stephen will think about how to delete the old records from the BuzzSaw database while Chris will look at generating reports containing no potentially personal information.
Virtual DICE
Chris is writing a script to download the necessary user and group information and build local passwd and group DB files.
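As a rough illustration of the second half of that job, the sketch below formats user and group records as standard passwd(5) and group(5) entries. The `User` and `Group` record types and the example values are hypothetical; how Chris's script obtains the data, and what DB format it finally writes, are not covered here.

```python
from typing import NamedTuple, Sequence


class User(NamedTuple):
    """Hypothetical user record; field names are illustrative only."""
    name: str
    uid: int
    gid: int
    gecos: str
    home: str
    shell: str


class Group(NamedTuple):
    """Hypothetical group record; field names are illustrative only."""
    name: str
    gid: int
    members: Sequence[str]


def passwd_line(user: User) -> str:
    # passwd(5): name:password:UID:GID:GECOS:directory:shell
    # "x" means the password is held elsewhere (e.g. shadow or Kerberos).
    return f"{user.name}:x:{user.uid}:{user.gid}:{user.gecos}:{user.home}:{user.shell}"


def group_line(group: Group) -> str:
    # group(5): name:password:GID:comma-separated member list
    return f"{group.name}:x:{group.gid}:{','.join(group.members)}"


if __name__ == "__main__":
    print(passwd_line(User("alice", 1000, 1000, "Alice Example", "/home/alice", "/bin/bash")))
    print(group_line(Group("staff", 50, ["alice", "bob"])))
```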
LCFG Client Refactoring
Stephen has added more testing of LCFG::Client::FileLocator. Remote fetches via HTTP are now tested. The test uses HTTP::Server::Simple, which simply serves some files from a given directory to localhost over HTTP. HTTP::Server::Simple needed a couple of additions to make it suitable: support for HTTP 1.1 (and 304 Not Modified) and for simple usernames and passwords.
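For readers unfamiliar with the 304 Not Modified behaviour mentioned above, here is a minimal sketch of the decision a test server has to make for a conditional GET. It is written in Python purely for illustration and is not the code added to HTTP::Server::Simple; the function name `response_status` is an assumption.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def response_status(last_modified: datetime, if_modified_since: Optional[str]) -> int:
    """Return 304 if the client's copy is still current, otherwise 200.

    last_modified must be a timezone-aware datetime (e.g. the file's mtime in UTC).
    """
    if if_modified_since is None:
        return 200  # unconditional request: always send the body
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: ignore it and send the body
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200


if __name__ == "__main__":
    mtime = datetime(2013, 8, 20, 12, 0, 0, tzinfo=timezone.utc)
    header = format_datetime(mtime, usegmt=True)
    print(header, response_status(mtime, header))
```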
Next he will write tests for the building of DB files, which will probably boil down to checking MD5 checksums, and tests that compare complex data structures. After that the tests will be able to cover rdxprof.
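The checksum approach can be sketched as follows. This is an illustrative Python version, not the planned Perl test code; the helper name `md5_of_file` is an assumption, and a real test would compare against a checksum recorded from a known-good DB file.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(built_path, expected_md5):
    """True if the freshly built file has the expected checksum."""
    return md5_of_file(built_path) == expected_md5.lower()
```

One caveat with this style of test: it only works if the DB file is built byte-for-byte deterministically, so any embedded timestamps or unordered keys would need to be controlled.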
Also, while working on the tests, Stephen changed how the fetch code parses client.url, since it wouldn't let you specify a port number.
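To illustrate the kind of parsing involved, the sketch below splits a URL into scheme, host, port and path, falling back to the scheme's default port when none is given. It uses Python's standard library rather than the LCFG client code; the function name `split_profile_url` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Well-known default ports, used when the URL does not name one.
DEFAULT_PORTS = {"http": 80, "https": 443}


def split_profile_url(url):
    """Return (scheme, hostname, port, path) for a URL with an optional port."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port, parts.path


if __name__ == "__main__":
    print(split_profile_url("http://localhost:8080/profiles"))
    print(split_profile_url("http://localhost/profiles"))
```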
Stephen has finished documenting the modules; once they have been installed use e.g. perldoc to see the documentation.
Miscellaneous Development
- Criticality improvements
- Stephen has been tackling our action from OperationalMeetingActions to `Sundry "criticality" things from discussion'. He has now enhanced the SET_CRITICALITY macro:
- sysinfo.criticality is now only set for machines using one or other of the server headers.
- For servers it's set to low by default.
- By default servers will be in a critlevel_low netgroup.
- Servers of medium or high criticality will be removed from critlevel_low and added to critlevel_medium or critlevel_high.
- Be sure to use lower case for low, medium or high when using SET_CRITICALITY.
- Virtual machines are now in a new HOST_virtual netgroup.
- It's all in this week's stable release.
The next step will be to produce simple scripts which attempt to shut down all servers of a given criticality level at a given site. They'll ignore virtual machines. Initially they will probably use ssh but it would be good at some point to look at using remctl for such remote execution tasks. If we had a remctl process listening on every server we might be able to do things (like om or reboots) more robustly. For instance remctl shouldn't need LDAP to be functioning whereas ssh does.
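The host selection such a script would need can be sketched as a few set operations. This is only an illustration under stated assumptions: netgroup membership is passed in as a plain dictionary of sets, the site's host list is supplied by the caller, and the function name `shutdown_targets` and the example host names are hypothetical. Only the critlevel_* and HOST_virtual netgroup names come from the notes above. How the shutdown itself is issued (ssh or remctl) is left out.

```python
VALID_LEVELS = ("low", "medium", "high")


def shutdown_targets(netgroups, level, site_hosts):
    """Return the sorted physical servers of one criticality level at one site.

    netgroups  -- dict mapping netgroup name to a set of host names (assumed input)
    level      -- "low", "medium" or "high" (lower case, as for SET_CRITICALITY)
    site_hosts -- set of host names located at the site in question (assumed input)
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"criticality level must be one of {VALID_LEVELS}")
    members = netgroups.get(f"critlevel_{level}", set())
    virtual = netgroups.get("HOST_virtual", set())
    # Keep hosts of the requested level at this site, then drop virtual machines.
    return sorted((members & set(site_hosts)) - virtual)


if __name__ == "__main__":
    example = {
        "critlevel_low": {"hostA", "hostB", "vm1"},
        "critlevel_high": {"hostC"},
        "HOST_virtual": {"vm1"},
    }
    print(shutdown_targets(example, "low", {"hostA", "vm1", "hostC"}))
```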
Operational
- mod_waklog
- It doesn't work with OpenAFS 1.4.15. This means that brendel has an urgent problem. We're going to look for solutions to keep it working after we remove support for Single DES encryption from AFS on 2 September.
- mod_waklog (2)
- We use a patched version of it which enables "allow weak crypto" (i.e. Single DES encryption). We need to remove that patch, test unpatched versions, and have working packages available ready for all machines which use mod_waklog.
- brendel's AFS cache
- It wasn't using the right partition for its AFS cache. Stephen has put this right. When brendel is next reinstalled we'll need to revisit its disk partitioning.
- district problems and KVM reports
- Whoever's on operational duty will check the KVM reports, and in particular will highlight the absence of a report from any of our KVM servers.
- iDRAC6 firmware update
- We have a new firmware update for servers using the iDRAC6 (e.g. Dell PowerEdge R710) to take its firmware to 1.95. Alastair has tested it on metropolitan and it seems to work so Chris will add it to the goodfirmware map.
- Autoreboot for MPU servers
- Our thoughts on this seem OK so Chris will look into implementing them. He'll give careful thought to ordering the reboots on separate nights in more complex cases, and he'll compare notes with RAT to get the benefit of their experience in this area.
This Week
- Alastair
- refreshpkgs working with AFS 1.6.5 using AFS-command: refreshpkgs done; updatepkgsvolumes will be more complicated to do
- Start Inventory project diary
- Inventory project
- Consider how to present design to CSOs - principally to ensure have covered every possible state and transition. Also to ensure not overly complicated to use
- Consider technical talk
- Submit bug/enh to App::Cmd author wrt option to die on unspecified options
- Pester George about location API
- Order a spare 600GB disk for waterloo.
- Discuss NFS installroot problem with George - so why stopped working???
- Ask George - what does the TXretransmit value mean for switch connections?
- Look at why circle didn't have disk space to run updaterpms at last boot: rebooted fine, with no recurrence of disk space message
- Look at whether there are any simple tools to allow users to manage their own kvms on metropolitan
- http://www.linux-kvm.org/page/Management_Tools
- A few bare metal solutions (based on KVM etc) eg stackops, opennode.
- Most OS based solutions are heavy weight - quite a bit of effort to get going on SL
- Not keen on maintenance required for either of above => suggest sticking with virt-manager and accepting the risk of people breaking other people's machines.
- circulate table of LCFG bugs
- Tidy up circlevm[0-10]
- circlevm3 - ocsinventory - want to keep for now
- circlevm6 - computing.help slave
- Move computing.help slave from circlevm6 to another server
- Discuss pkgsearch with Roger - wrt handover - Roger working on header.
- Look at hung vgs processes on district. This looks to be because a VG on the SAN that was accessible to district has been made unavailable without the VG first being disabled. /sbin/vgs is attempting to read the /dev/mapper/XXXXXXXX file but it's hanging, and because vgs is hanging, all the cron jobs are hanging.
- Chris
- Virtualised DICE image
- Finish off coding for auth fetches
- Alastair and/or Stephen to try an image
- Discuss autorebooting servers with RAT and implement
- Stephen
- Investigate mod_waklog behaviour when Single DES removed (using kerberos test cell)
- Client refactoring project
- complete writing more tests
- Further investigation wrt AT / HP 8300 updaterpms issues
- Discuss release testing variants with Richard and then document.
- Look at why district hasn't rebooted (should have auto-rebooted)
- Start looking at freenx
- Tidy up circlevm[0-10]
-- AlastairScobie - 20 Aug 2013