MPU Meeting Tuesday 10th May 2011

Software Build Farm

  • Stephen has written all of the remaining documentation.
  • Package lists have been produced which will enable the master server to run on SL6 at some point.
  • There is now a periodic reaper process which clears out duplicate logs from the build hosts after a week or two. Permanent copies remain in the result directories. The reaper also clears out multiple copies of SRPMS which have had to be unpacked and repacked for SL5.
  • The final project report has been written and circulated to the MPU for comment. It will shortly go to COs.
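
As an illustration of the kind of periodic reaper described above, here is a minimal sketch. It is not the build farm's actual reaper: the `*.log` file pattern, the 14-day default and the function names are all assumptions made for the example.

```python
import os
import time
from pathlib import Path


def find_stale_logs(log_dir, max_age_days=14, now=None):
    """Return log files under log_dir last modified more than max_age_days ago.

    The "*.log" pattern and the 14-day default are assumptions for this sketch.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for path in Path(log_dir).rglob("*.log"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            stale.append(path)
    return sorted(stale)


def reap(log_dir, max_age_days=14, dry_run=True):
    """Delete stale logs unless dry_run is set; return the list of stale files."""
    stale = find_stale_logs(log_dir, max_age_days)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Defaulting to `dry_run=True` means a first run only reports what would be removed, which is a sensible precaution while the permanent copies in the result directories are being checked.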

SL6

Stephen and Alastair met and decided that there wasn't much left to do. The project plan has details on what has and hasn't been done. The main outstanding item is SL6 Nagios support: Alastair will discuss this with the Infrastructure unit.

The build hosts have been moved from inf level to dice level. Other SL6 machines can now also move from inf to dice. SL6 DICE is now practical to use for anyone who doesn't need specialised academic software. Local DNS isn't there yet but non-local DNS works fine for the time being. LDAP now works, which means that om and rfe also work.

The openldap component now has extra resources. This temporarily broke the inf level until they were added in there too.

Desktop hardware testing is complete and much of the server testing has been done too. Alastair will test SL6 on fantoosh.

The 620 doesn't sleep properly. It crashes horribly, like the 755 and 7100. This may be because individual drivers are failing to come back when the machine resumes, which could very quickly cause a kernel crash. The 620 is long in the tooth so possibly not worth fixing, but we could investigate the 755 problem more as we have over a hundred of them in the student labs. The stack trace from the kernel crash is too long to fit on one screen, but Chris could try setting up a serial console on a 755 and connecting it to a nearby desktop with Stephen's handy serial cable; the other machine could then capture the entire stack trace. The first line of the stack trace should say what provoked the crash. More handy ideas: try a 755 with stock SL6 rather than DICE, to see whether that is any better. The problem could also be reported to the SL6 list and to the Dell desktop list (look on linux.dell.com).
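
Once the serial console output has been captured to a file on the other machine, picking out the line that opens the crash report could be done along these lines. This is only a sketch: the list of marker strings is an assumption based on common Linux kernel crash messages and may not match what the 755 actually prints, and the function name is made up for the example.

```python
import re

# Prefixes that commonly open a Linux kernel crash report.
# This list is an assumption for the sketch and is not exhaustive.
CRASH_MARKERS = re.compile(r"(BUG:|Oops:|Kernel panic|general protection fault)")


def first_crash_line(console_text):
    """Return the first line that looks like the start of a kernel crash, or None."""
    for line in console_text.splitlines():
        if CRASH_MARKERS.search(line):
            return line.strip()
    return None
```

It would be called on the captured text, for example `first_crash_line(open("capture.txt").read())`, where `capture.txt` is a hypothetical file name for the saved console output.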

AFS Automation

Chris has been trying to approach this project in a less headstrong way than previous projects, thinking it through thoroughly rather than repeatedly coding, throwing the code away and starting again. He asked the unit for their experience and advice on starting a project:

  • When trying to sort out a complicated situation in your head, try explaining it to somebody else. Write a blog post! It won't even necessarily matter if nobody reads it; the process of producing a clear explanation will often help you sort out the problem in your own mind too.
  • Drawing flowcharts and data structures can help.
  • Sometimes there's just no alternative to starting to write code, realising it's rubbish, then junking it and starting again. Besides, rewrites usually get progressively faster, as you can reuse bits which did successfully solve problems.
  • With a big project you just have to start writing code, as the details can't all fit in the mind at once. Tackle it chunk by chunk and build up the structure a piece at a time. Sometimes tackling one chunk will lead back to rewrites elsewhere, but that's life.

Miscellaneous Development

sleep
Next week's stable release will contain the last bit of support for allowing users to run "om sleep disable" and "om sleep enable" on their machines.

hwmon
Now has full support for optional checks for read-only disk mounts.

DIY DICE
Alastair has finished DIY DICE support for SL6.

Package options
Stephen and Kenny have worked out new, less messy ways of handling optional and additional packages. Stephen spent a lot of the last week getting the details right and has also documented it all. The end result is two tidier, easier mechanisms for adding and specifying packages.

Operational

metropolitan problems
The VMware server metropolitan continues to lose its network from time to time. This time the vlan24 bridge failed to come back. It turns out to be easy to fix: just find and restart the vmnet-bridge daemon for vlan24 on metropolitan.

SL5 openafs packages
The SL5 platform's userspace openafs packages weren't marked as boot only, although the kernel level ones were. When the packages were upgraded from 1.4.12 to 1.4.14 last week, this discrepancy meant that the userspace packages were upgraded but the kernel packages weren't, leaving mismatched userspace and kernel versions, which is definitely not recommended.
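
A simple consistency check could catch this kind of mismatch. The sketch below assumes the installed package versions have already been collected into a dictionary (for example from an RPM query); the package names used in the usage example are hypothetical and the function is not part of any existing tool.

```python
def openafs_versions_agree(installed):
    """Return True if all openafs-related packages report the same version.

    `installed` maps package name -> version string. Only the version strings
    supplied are compared; release numbers or kernel versions embedded in
    package names are ignored in this sketch.
    """
    versions = {ver for name, ver in installed.items() if "openafs" in name}
    return len(versions) <= 1
```

For instance, `openafs_versions_agree({"openafs-client": "1.4.14", "kmod-openafs": "1.4.12"})` would flag the situation described above by returning `False`.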

Component versions script
It's now linked from the top level LCFG host pages. We should also link to it from the main MPU page.

om and oom-killer
For some time now om has been able to set a protection level for the oom-killer. This works a bit like "nice": a numerical value gives a process more or less protection from the oom-killer. We could start using this, for instance to protect LDAP more. Ultimately, process groups might be useful here.
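
For reference, the Linux kernel exposes a per-process adjustment of this kind through `/proc/<pid>/oom_score_adj`, which takes values from -1000 (never kill) to 1000 (kill first). The sketch below writes to that file directly; whether om uses this interface or the older `oom_adj` file is not stated here, so treat this as an assumption rather than a description of how om works. The function names and the `proc_root` parameter are made up for the example.

```python
from pathlib import Path

# Range accepted by /proc/<pid>/oom_score_adj, per proc(5).
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def clamp_oom_score_adj(value):
    """Clamp value into the range the kernel accepts."""
    return max(OOM_SCORE_ADJ_MIN, min(OOM_SCORE_ADJ_MAX, int(value)))


def set_oom_protection(pid, value, proc_root="/proc"):
    """Write a clamped adjustment for pid and return the value written.

    Lower values mean more protection. Lowering the value for a process
    normally requires root privileges. proc_root is a parameter only so
    that the function can be tried against a scratch directory.
    """
    adj = clamp_oom_score_adj(value)
    path = Path(proc_root) / str(pid) / "oom_score_adj"
    path.write_text(f"{adj}\n")
    return adj
```

As with "nice", the number is only a hint relative to other processes; a strongly negative value for the LDAP server would make it one of the last candidates the oom-killer considers.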

UPS messages
The syslog files on our servers have been deluged with messages about the missing UPS, making them less useful. Alastair will see if the Infrastructure unit can do something about this.

Slow trondra
It's been slow for a week now. Could this be anything to do with the increased LDAP logging?

Reminders
Some of them are now very old. They should really only persist for a few weeks. Permanent messages can be stored in comments instead.

Capital expenditure
Bad news - we have 100k this year instead of 200k, which itself was less than we needed. Alastair will reassess our purchasing decisions. We may need to start cutting corners, for instance by running some services on desktops or doing without RAID here and there.

This Week

  • Alastair
    • read PkgForge report
    • speak to Infrastructure about nagios -> SL6 priority
    • try SL6 on fantoosh - but raised problems with network component
    • meet with RAT re package lists
    • document how to restart vmnet-bridge
    • discuss snmp UPS issue with Infrastructure
    • remaining SL6
    • revisit kit replacement for 2011/2012

  • Chris
    • AFS automation
    • upgrade desktop to SL6

  • Gordon
    • mpath component
    • build afs-utils for SL6

  • Stephen
    • meet with RAT re package lists
    • document package lists (as discussed at RAT and LCFG meetings)
    • sysinfo changes
    • LCFG server refactoring - back up to speed

-- AlastairScobie 10 May 2011, ChrisCooke 11 May 2011

Topic revision: r10 - 24 May 2011 - 11:52:12 - AlastairScobie
 