MPU Meeting Tuesday 26th October 2010
LCFG Server Refactoring
Stalled.
F13
This will be proposed for closure at the next development meeting.
Software Build Farm
Stephen has solved a couple of major design issues. He will prepare an architecture document for the next development meeting.
Replacement for VMware Server
It's time to start this project. We need a project plan. The first task would seem to be to write a requirements document (and wish list), and the second to evaluate the IS service against it. We will propose starting the project at the November development meeting.
Miscellaneous Development
- release testing scripts
- Stephen has got a solution to the problem of testing the package lists.
- New resource to control the oomkiller
- Stephen has added an om_oomadj resource. This lets you (numerically) raise or lower the attractiveness of a process to the oomkiller. It will affect an LCFG component and any processes which it starts directly.
- tcpwrappers change for Fail2ban
- Stephen has tweaked tcpwrappers so that it configures machines for fail2ban correctly. The template was previously embedded in the tcpwrappers code; it has now been separated out, and a second, separate template has been added for configuring machines for fail2ban.
- Nagios network monitoring
- We had a significant mistake here. Stephen had to change how it's done: nagios scripts should check that the executable they need is present before running. The correct idiom is "the executable does not exist, OR run the nagios test", as opposed to the more obvious "the executable does exist, AND run the test". With the obvious form, a missing executable makes the existence check fail, and that failure becomes the exit status of the whole line. With the correct form, a missing executable makes the first check succeed, so the test is skipped and the line does not exit with an error.
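The difference between the two forms described in the nagios item can be shown with a minimal sketch. It assumes the checks are written as POSIX shell one-liners, which the notes do not state explicitly. The path /nonexistent/check_example is hypothetical and chosen only because it is certain to be missing. Python is used here just to run each line and report its exit status.

```python
import subprocess

# Hypothetical path, chosen so that the executable is certain to be missing.
MISSING = "/nonexistent/check_example"


def exit_status(shell_line: str) -> int:
    """Run one line of POSIX shell and return its exit status."""
    return subprocess.run(["sh", "-c", shell_line]).returncode


# Obvious form: "exists AND run". When the file is missing, the existence
# test fails and its non-zero status becomes the status of the whole line.
obvious = exit_status(f"[ -x {MISSING} ] && {MISSING}")

# Form from the notes: "does not exist OR run". When the file is missing,
# the negated test succeeds, the check is skipped, and the line exits 0.
preferred = exit_status(f"[ ! -x {MISSING} ] || {MISSING}")

print(obvious, preferred)
```

On a system with a POSIX sh this prints "1 0". When the executable is present, both forms run it and return its own exit status, so the two only differ in the missing-executable case.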
Operational
- metropolitan
- Alastair's MSI solution seems to have fixed its network problems. He will document this and possibly try it out elsewhere.
- Machine moves
- figgy is now at AT. Both on the way down (to use the dhcp header) and on the way back up (after reinstall at AT) any attempt to use bonding resulted in a bonding failure. Chris will put it in this state again so that others can investigate.
- sauce
- Chris will move it to KB. This will involve both IP and IPMI changes.
- piccadilly and northern
- need to have IPMI configured (for power at least). Chris will do this. Note that R710s need a separate DRAC port - suggest bonding over eth0 and eth2 (to spread it across network chips) and using eth1 for DRAC. Three ethernet connections will be needed.
- smartd
- People in the student labs are seeing big popup warnings after they log in, warning of "SMART status errors". However, these seem to be bogus, as the machines later claim to have no SMART (i.e. imminent disk failure) problems. It's possible that this is related to sleep, and that smartd should be disabled somehow when preparing for sleep. Chris will have a look. There have been a number of RT tickets on this issue. We should collect them in a handy place: for instance, make a parent ticket, then make all reports of SMART problems into children of it.
- bakerloo exodus
- We should move MPU guest servers from bakerloo with a view to moving it to the dedicated (to VMware) SAN space.
This Week
Alastair will:
- UKUUG Perl trainer - query doing a course here
- Proper fix for SRPM web access
- Build lcfg-kdm for f13_64
- Fix freshenrpms for source buckets
- Finish the F13 report
- Document the MSI solution
- Flesh out the virtualisation project
Chris will:
- Make bonding fail on figgy
- Configure IPMI power on piccadilly and northern
- Deploy sauce at KB.
- Flesh out the virtualisation project
- Take a look at the SMART problems on lab machines
- Start the submission project.
Gordon will:
- be on holiday all week (then next week, LCFG bugs review).
Stephen will:
- Software build farm
- F13 devel bucket
- Talk to the services unit about AFS and telford
--
ChrisCooke - 26 Oct 2010