MPU Meeting Tuesday 14th May 2013

Inventory

Alastair is still working with George to get the switch location data into a standard format which can be used by any script which needs the information. The new system has found more machines for which we previously didn't have any location data, particularly in KB. Our ideal solution would be for the Inf Unit to maintain the switch/port location mapping information as it might be useful for more than just the inventory.

Alastair has been thinking about how to replace invedit with something more usable. His plan is to use rfe with a "virtual map". This would provide an editable interface to the database; when the edit is finished, any changes would be pushed back to the relevant tables. This functionality isn't currently supported by the rfe daemon for hierarchical maps (e.g. lcfg/foo) but it should be easy enough to implement.

Alastair will work on producing prototypes for invedit and invquery, then write up the new schema and details of the prototype system.

The simple approach to history tracking based on storing the change records into a table still needs to be implemented.
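The table-based approach can be sketched as follows. This is only an illustration, not the inventory schema: it uses an in-memory SQLite database, and the table names (machine, change_record), the columns and the set_location helper are all hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a data table and a history table.

    Both tables are hypothetical stand-ins, not the real inventory schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE machine (
            hostname TEXT PRIMARY KEY,
            location TEXT
        );
        CREATE TABLE change_record (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            changed_at TEXT DEFAULT CURRENT_TIMESTAMP,
            hostname   TEXT NOT NULL,
            field      TEXT NOT NULL,
            old_value  TEXT,
            new_value  TEXT
        );
        """
    )
    return conn


def set_location(conn: sqlite3.Connection, hostname: str, location: str) -> None:
    """Update a machine's location and store a change record alongside it."""
    with conn:  # one transaction: the update and its history row commit together
        row = conn.execute(
            "SELECT location FROM machine WHERE hostname = ?", (hostname,)
        ).fetchone()
        old = row[0] if row else None
        if row is not None and old == location:
            return  # nothing changed, so no history row is written
        conn.execute(
            "INSERT OR REPLACE INTO machine (hostname, location) VALUES (?, ?)",
            (hostname, location),
        )
        conn.execute(
            "INSERT INTO change_record (hostname, field, old_value, new_value) "
            "VALUES (?, 'location', ?, ?)",
            (hostname, old, location),
        )
```

Writing the change record in the same transaction as the update means the history cannot drift out of step with the data. A real implementation might prefer database triggers so that every client is covered.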

Login Logs Viewer

Stephen has remembered to turn off the debugging features which would print full stack traces to the browser whenever a problem occurs. With this enabled there is a risk that we might divulge sensitive information.

The aim is to get this rolled out to users very soon with the first monthly summary emails being sent out at the beginning of June.

Sleep Enhancements

Investigations by Chris and George revealed the likely cause of the missing wake messages in the central log stores. Sometimes syslog messages are emitted before the networking has been fully re-established. The simple fix is to wait for a couple of seconds before recording the successful wake-up of a machine.
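The simple fix can be sketched as below. This is an illustration under assumptions rather than the actual sleep code: the function name record_wake, the message text and the injectable emit and sleep callables are all hypothetical, and the two-second default simply mirrors the "couple of seconds" mentioned above.

```python
import time
from typing import Callable


def record_wake(
    hostname: str,
    emit: Callable[[str], None],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Pause briefly after resume, then emit the wake-up message.

    `emit` stands in for whatever sends the message to syslog;
    `sleep` is injectable so the delay can be skipped in tests.
    """
    # Give the network a moment to come back so the message is not
    # sent before the central log store is reachable.
    sleep(delay_seconds)
    message = f"{hostname}: wake-up successful"
    emit(message)
    return message
```

A fixed delay is the simplest option. Polling until the network is actually up would be more robust but needs more machinery.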

Chris also noticed in the logs that some machines are making many attempts to sleep before actually succeeding (e.g. lochmill). He will take a look to see if he can spot a cause.

Virtual DICE

Not much has happened. Toby has had some good results with dumping some or all of the LDAP data into a cache on the local disk.

LCFG Client Refactoring

Miscellaneous Development

New om API
As part of the LCFG client refactoring project Stephen has developed a new Perl module (LCFG::Om::Command) which provides a better way to call om from within Perl code.

Host name handling fixes for om
A couple of bugs in the way om validates hostnames have been fixed. This allows hostnames which contain hyphens and fully-qualified domain names. See bug#631 for details.
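For illustration, a check that accepts both hyphenated names and fully-qualified domain names might look like the sketch below. This is not the om code (om is written in Perl, and the actual fix is described in bug#631); the function name is hypothetical and the rules follow the usual DNS conventions of labels of 1 to 63 letters, digits or hyphens, with no hyphen at either end.

```python
import re

# One DNS label: letters, digits and hyphens, 1-63 characters,
# not starting or ending with a hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(name: str) -> bool:
    """Return True for a short host name or a fully-qualified domain name."""
    if name.endswith("."):
        name = name[:-1]  # tolerate a single trailing dot on an FQDN
    if not name or len(name) > 253:
        return False
    # Every dot-separated label must be valid; an empty label (as in "a..b") fails.
    return all(_LABEL.fullmatch(label) for label in name.split("."))
```

With this sketch, names such as "my-host" and "my-host.example.org" are accepted, while "-host", "host_name" and "a..b" are rejected.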

apacheconf nagios patch
We should look at applying the patch from Graham which improves the apacheconf nagios check.

Operational

New KVM server
The new KVM server for AT, named waterloo, is now in service. We need to document the procedure for replacing a failed disk and try it before going live. Chris will find out if any COs have experience with this hardware.

live/mpu-kvm-server.h
Chris has added multi-site support to the MPU KVM server header. We need to move all KVM servers over to using this header and clean up the various LCFG source profiles.

mail log permissions
Neil is fixing the access permissions for the mail component log files.

KB bonded network interfaces
Alastair has configured all the MPU servers in KB to use the mh0 switch by preference.

358 series kernel I/O
Alastair has checked the I/O performance for the latest kernel in the 358 series. He cannot find any problems so we should look into doing the upgrade soon.

This Week

  • Alastair
    • Educate individuals about inappropriate KVM guest sizes
    • Start Inventory project diary
    • Inventory project
    • RT tickets
    • Review outstanding LCFG bugs for prioritisation
    • Check new kernel performance on juice (more methodically this time)
    • T1 figures and report
    • Order a spare 600GB disk for waterloo
    • Consider activities list

  • Chris
    • RT tickets
    • Virtualised DICE image - looking at local identity caching
    • Look at RT:61762 - perhaps a later 279 kernel available
    • Review outstanding LCFG bugs for prioritisation
    • Deploy new sleep everywhere after 21st May
    • Look at documenting failed disk replacement for Dell servers
    • Consider activities list

  • Stephen
    • RT tickets
    • Deploy and publish user log viewer (and blog article)
    • LCFG code cleanup
    • Talk to Graham about limits component
    • Look at Graham's apacheconf nagios patch
    • Review outstanding LCFG bugs for prioritisation
    • Consider activities list

  • Carol
-- AlastairScobie - 14 May 2013
Topic revision: r8 - 21 May 2013 - 08:36:36 - AlastairScobie
 