MPU Meeting Wednesday 7th December 2011

AFS automation project

Nothing happened.

LCFG Server Refactoring

The new version of the server code is up and running on prague. The next step is to compare profile generation between the old and new servers. Since prague was off over the holidays, some things may be out of sync. To test, Stephen will delete the caches on all servers (trondra, mousa and prague) and do a full rerun. The new server passes the LCFG and diy-dice tests. We should consult Kenny to see if he has been testing it.

Wake-On-Lan

Chris will add an exception for GX270s (lodestar is the last one remaining). HP7900s don't come back properly and lose the ability to sleep; a full power-cycle is required to restore it. Chris will investigate. Stephen has a new version of the CGI script to be installed.

Simple KVM Service

This has been moved to the new build hosts. There is a problem with something Alastair describes as 'stickiness'. Stephen thinks this may be solved and suggests looking at the Changelog for 220 (is that kernel 2.6.32-220 or a reference to the KVM software?). Northern is supposed to be the beta service server, but people (e.g. Graham) have been putting VM services on Circle too. There was some discussion of the relative merits of the shortened 'walkthrough' install versus reading the full documentation. A fix is needed for a simple wrapper to the console command. There are Fibre Channel issues pending.

Server Upgrades

Will begin after prague testing as above.

Miscellaneous Development

Disk UUIDs
A problem has emerged with the new kernel on develop: external USB devices are being found before the usual internal root disk and are being assigned /dev/sda. This will require moving to disk UUIDs in /etc/fstab. There is already a new template for the fstab component, but the component must be stopped before the new version is installed and then restarted. The grub component will also need to be changed. Alastair will investigate any security issues.
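To illustrate the kind of change involved, here is a minimal Python sketch that rewrites the device field of fstab entries to UUID= specifiers. It is not the fstab component's template. The function name, the mapping argument and the sample UUID are all made up for the example; in practice the mapping would have to come from the disks themselves (for instance from blkid output).

```python
def fstab_with_uuids(fstab_text, uuid_by_device):
    """Return fstab_text with mapped device names replaced by UUID=... specifiers.

    uuid_by_device is a hypothetical mapping such as {"/dev/sda1": "<uuid>"}.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        # Leave comments, blank lines and unmapped devices untouched.
        if fields and not is_comment and fields[0] in uuid_by_device:
            fields[0] = "UUID=" + uuid_by_device[fields[0]]
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "/dev/sda1 / ext3 defaults 1 1\n"
    # The UUID below is invented purely for illustration.
    print(fstab_with_uuids(sample, {"/dev/sda1": "0a1b2c3d-1111-2222-3333-444455556666"}), end="")
```

Entries referenced by UUID no longer depend on the order in which the kernel discovers disks, which is the failure described above.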

Update RPMS
Some bugs have been fixed so that updaterpms no longer hangs if a DNS lookup fails and responds faster. There may, however, be issues for first installs.
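As a general illustration of bounding a lookup so that a failing DNS server cannot stall the caller, here is a small Python sketch. It is not the updaterpms fix itself; the function name, the default timeout and the injectable resolver parameter are assumptions made for the example.

```python
import concurrent.futures
import socket


def resolve_with_timeout(hostname, timeout=5.0, resolver=socket.gethostbyname):
    """Return the resolved address, or None if the lookup fails or takes too long."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(resolver, hostname)
    try:
        return future.result(timeout=timeout)
    except (concurrent.futures.TimeoutError, OSError):
        # socket.gaierror is a subclass of OSError, so failed lookups land here too.
        return None
    finally:
        # Do not block on a lookup that is still running in the worker thread.
        executor.shutdown(wait=False)
```

The caller gets None after at most `timeout` seconds and can fall back or retry, rather than waiting on the resolver's own timeouts.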

PXE
The server PXE menu has been simplified so it works better over serial connections.

Installroot
The DHCP code has been fixed so that it now runs in the foreground with better control over timeouts. This means there should be no chance of getting two DHCP clients running. This is not fully deployed yet: the 'build code' must hit the stable release first, after which the fix can be rolled out.

Bugfixes
Alastair and Stephen did some end-of-year minor bug hunting.

Component daemon file descriptors
Kenny has reported a bug (https://bugs.lcfg.org/show_bug.cgi?id=519) showing that shell components which use the Daemon method keep inherited file descriptors open. We should work out which components are affected. Some seem to rely on this behaviour, e.g. client and rdxprof. Make this a wee project, since it's more than just a normal bug.
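One possible way to find affected components is to look at what each running daemon actually holds open. The Python sketch below reads the Linux /proc/PID/fd directory and reports descriptors other than stdin, stdout and stderr. The function names and the `proc_root` parameter are assumptions for the example, and deciding which of the reported descriptors are genuinely unwanted would still need a human.

```python
import os


def open_fds(pid, proc_root="/proc"):
    """Map each open descriptor number of a process to the path it points at."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    fds = {}
    for name in os.listdir(fd_dir):
        try:
            fds[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return fds


def unexpected_fds(pid, expected=(0, 1, 2), proc_root="/proc"):
    """Return descriptors outside the expected set: candidates for inherited leaks."""
    return {fd: target
            for fd, target in open_fds(pid, proc_root).items()
            if fd not in expected}
```

Running `unexpected_fds` against the PID of each component daemon would give a first list to check against bug 519.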

Operational

Fibre Channel
Alastair patched the kernel for the LUN 0 problem, but it didn't work. Will try again with the 220.2 kernel on figgy.

MySQL Backup
There is an issue with setting up cron jobs that have % signs in them, which need to be escaped. The cron job was not running but this should now be fixed.
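For background: in a crontab command field, an unescaped % is treated as a newline and everything after the first one is fed to the command as standard input, so a literal % has to be written as \%. A minimal Python sketch of the escaping follows; the function name and the example backup command are hypothetical and are not the actual MPU cron entry.

```python
import re


def escape_cron_percent(command):
    """Backslash-escape every % in a cron command that is not already escaped."""
    # The negative lookbehind skips any % that already has a backslash before it.
    return re.sub(r"(?<!\\)%", r"\\%", command)


if __name__ == "__main__":
    # Hypothetical backup command, for illustration only.
    print(escape_cron_percent("mysqldump exampledb > /backup/db-$(date +%F).sql"))
```

Because a % that is already escaped is left alone, applying the function twice gives the same result as applying it once.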

Checking MPU rootmail
Rootmail for MPU is now coming to the mpu list so we can see what's coming from MPU servers. It's difficult to spot real problems amongst the noise, e.g. when the database backups weren't working. The logwatch output seems to be mostly disk related and could be replaced with a Nagios warning if that were possible (this would require Nagios version 3). Tempted to turn off logwatch just now.

SSH firewall holes
There is a question of how to . Richard did a very good job of convincing users they didn't need firewall holes and we are now down to five desktop machines.

Dunlin
now student.ssh

Boot component
There are issues with the boot component and upstart/chkconfig. Stephen has filed a bug (https://bugs.lcfg.org/show_bug.cgi?id=514).

Activities

OOM Killer
Make OOM killer a mini-project. We need a thorough understanding of what it does and how.
Auto unpack ISOs
started
Fibre channel
make sure it's Nexsan - change
AFS Servers
local home dirs - Stephen to pass to Chris

Other changes made on the activities list.

This Week

  • Alastair
    • Delete old SL6 buildhosts
    • Check MPU server list re buildhosts
    • virsh console wrapper
    • Investigate disk UUIDs - can you override a partition's UUID? Yes and fixed.
    • Finish off deployment of updated installroot. Updated to new kernel (on sl5 and sl6), but need to test.
    • Arrange figgy (with replacement RAID card) to go to KB
    • Try patched 220.2 kernel on figgy at KB. Tried on bakerloo at IF.

  • Chris
    • Identify which servers should have local home directories
    • WOL - finish cgi improvements
    • Dc7900 - upgrade to latest BIOS and 220.2 kernel and see if WOL/Sleep problem fixed
    • Make nagios scripts usable from command line.
    • Consider PD - concrete task

  • Stephen
    • LCFG refactor - check prague profiles vs mousa/trondra
    • Finish auto unpack of install ISOs
    • WOL - finish cgi improvements
    • Consider PD - concrete task

-- AlastairScobie - 10 Jan 2012

Topic revision: r9 - 16 Jan 2012 - 16:49:58 - GordonReid
 