MPU Meeting Tuesday 29th November 2011

AFS automation project

Nothing happened.

LCFG Server Refactoring

Stephen has moved the code into Subversion, mainly so that the LCFG build tools can be used, especially the macro expansion. Alastair has been trying out the new server on the DIY DICE server and has found a few bugs. One bug was in the server component's run method. There was also a dependency problem which especially affected machines using small-server.h.

Install Scripts

Waiting for sign-off.

Wake On LAN

The first version of http://wake.inf.ed.ac.uk is up and running and in use by some (happy and grateful!) test users. Testing has shown the need for some additions. In particular, Stephen suggested that, for convenience, wolclient.users could contain netgroups as well as usernames, and that the CGI could group its results using profile.group data. The grouping would be easier to do if the CGI used Template Toolkit rather than outputting its own HTML.

Simple KVM Service

Chris and Stephen should try out Alastair's instructions on using the service.

Server Upgrades

Alastair has started the upgrade of diydice but has stalled it until the problems with the new server have been ironed out.

Miscellaneous Development

Stephen suggested an area of possible development for the future: now that syslog for all SL6 hosts is going to a central logging host, we can afford to be a lot more proactive about looking for problems. We could have a simple system whereby people can drop a "check script" into a directory, and the script will be run regularly and the results mailed out as necessary. Here are some examples of problems we could look for using this system:

  • Machines on which the OOM killer has run: we could check this nightly and contact the machine's user.
  • Kernel panics. We don't necessarily hear about these on desktops.
  • smartd warnings of failing disks.
  • Authentication failures. This might help us to spot compromised DICE accounts or self-managed machines more quickly.
  • How many machines sleep, and for how long.
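
The "check script" directory idea could be sketched roughly as follows. This is only an illustration, not an agreed design: the function names, and the convention that a script reports a problem by printing output or exiting non-zero, are assumptions made for the example. Mailing the report is left out.

```python
import os
import subprocess
from pathlib import Path


def run_check_scripts(directory, timeout=60):
    """Run each executable file in `directory`; return {script name: report}.

    Assumed convention (not from the minutes): a check script signals a
    problem by printing something or by exiting with a non-zero status.
    Scripts that stay silent and exit 0 are left out of the result.
    """
    reports = {}
    for script in sorted(Path(directory).iterdir()):
        # Skip anything that is not an executable regular file.
        if not script.is_file() or not os.access(script, os.X_OK):
            continue
        try:
            result = subprocess.run(
                [str(script)], capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            reports[script.name] = f"timed out after {timeout}s"
            continue
        except OSError as exc:
            reports[script.name] = f"could not run: {exc}"
            continue
        output = (result.stdout + result.stderr).strip()
        if result.returncode != 0 or output:
            reports[script.name] = output or f"exit status {result.returncode}"
    return reports


def format_report(reports):
    """Build a plain-text message body, one section per reporting script."""
    sections = [f"== {name} ==\n{text}" for name, text in sorted(reports.items())]
    return "\n\n".join(sections)
```

A nightly cron job could call run_check_scripts on the shared directory and mail the output of format_report only when the result is non-empty, so that quiet days produce no mail.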

Operational

  • Stephen's putting the finishing touches to his security report.
  • Stephen and Simon have been looking at some error messages from aklog. It turns out that an old X preference file, .xlog, in some homedirs is now used by aklog to hold a list of other AFS cells. Going through the contents of an old .xlog can slow down aklog quite a lot. Stephen is contacting the people who have a .xlog to suggest that they delete it.
  • auditd isn't starting properly from the boot component and Stephen has been trying to find out why. It should be started by an init script, as it has to start very early, and then be reconfigured by the auditd component later in the boot process. However, it's exiting with a mysterious exit status. Alastair suggested the possibility of a conflict between init and upstart and Stephen will look into this.
  • Alastair has a date for the KB fibrechannel downtime - Monday afternoon. At that time the FC switch RSCN settings will be changed. After that Alastair will be able to change northern at leisure.
  • We discussed the long updaterpms hangs during the recent disk array outage. Normally updaterpms times out if the web server it talks to is unavailable. However, if that web server is itself talking to another web server, which in turn is talking to AFS, and AFS hangs rather than failing outright, then the whole chain hangs and the timeout never fires.
    • To tackle this we'll look at:
      • The squid configuration
      • The updaterpms connection timeout
      • The updaterpms traffic timeout
    • We could also try making updaterpms request the file's headers (an HTTP HEAD request) before getting the full file. libcurl lets you do this simply. Getting the headers should be quick, so we can set a short timeout on this, say ten seconds, and if we time out then we conclude that the file is not available. It would have the further advantage of telling us the size of the file to be downloaded. We would have to try out this idea with squid in the loop to check for problems.
    • Simon and Stephen have been looking into how to simulate an AFS timeout, for testing purposes.
  • We have an action from the Operational meeting of 23/11/11: "Identify which servers should have local home directories". Stephen will look at this.
  • Chris will remove the smartd gnome applet from DICE SL6.
  • Stephen will warn people about there being only two more releases this year (after this week's).
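
The idea of a quick HEAD request before the full download, discussed above for updaterpms, can be illustrated with a short sketch. It uses Python's standard urllib rather than libcurl, purely to show the logic; the function names and the timeout values other than the ten seconds mentioned above are assumptions, and it has not been tried with squid in the loop.

```python
import http.client
import urllib.request


def probe_size(url, timeout=10.0):
    """Send a HEAD request with a short timeout.

    Returns (available, size). `size` is the Content-Length as an int,
    or None if the server did not send a usable value. Any error or
    timeout is treated as "file not available".
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            length = response.headers.get("Content-Length")
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False, None
    if length is not None and length.strip().isdigit():
        return True, int(length)
    return True, None


def fetch_if_available(url, head_timeout=10.0, get_timeout=300.0):
    """Download `url` only if the HEAD probe succeeds; otherwise return None."""
    available, _size = probe_size(url, head_timeout)
    if not available:
        return None
    # Note: urllib's timeout applies to each blocking socket operation,
    # not to the transfer as a whole.
    with urllib.request.urlopen(url, timeout=get_timeout) as response:
        return response.read()
```

The short HEAD timeout is what would stop a hung chain from blocking indefinitely: if nothing answers within ten seconds, the file is treated as unavailable and the long download is never started.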

This Week

  • Alastair
    • Test northern against reconfigured satabeast
    • Pass BIOS settings to USU
    • Consider focus for perl learning
    • Stable release of 30th Nov
    • Finish KVM documentation
    • Investigate updaterpms timeout issues (wrt AFS hangs)
    • Finish work on installroot re multiple interfaces and timeouts (calling udhcpc correctly)
    • Continue DIYDICE SL6 (and other SL6 server work)

  • Chris
    • Try kvm service on circle
    • WoL project
    • Remove gnome smartd applet for SL6

  • Gordon

  • Stephen
    • Announce only two further stable releases of 2011
    • Finish security incident report
    • Fix LCFG server bugs Alastair found wrt diydice
    • Desktop ssh header
    • Contact users re .xlog file
    • Try kvm service on circle

-- AlastairScobie 29 Nov 2011, ChrisCooke 3 Dec 2011

Topic revision: r6 - 06 Dec 2011 - 18:42:11 - AlastairScobie