MPU Meeting Tuesday 6th April 2010

AFS Component

Nothing happened, waiting for sign-off.

LCFG Server Refactoring

Nothing much happened.

Server Hardware

Stalled.

Installroot

The new tools have been packaged up. Alastair has achieved several successful installs but it still needs testing on F12/x86_64. The new installer should also be tested again on SL5 to check that none of the recent changes have broken anything. Chris will try reinstalling tarragona to see if he encounters any problems.

There seems to be a problem with setting the time. After first boot, the clock is 1 hour out, which upsets fsck. We need to investigate further. Apparently setting the timezone now has to be done via /etc/adjtime rather than a file in /etc/sysconfig.
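
One thing that may be worth checking during the investigation: hwclock records on the third line of /etc/adjtime whether the hardware clock is kept in UTC or in local time, and an offset of exactly one hour during British Summer Time would be consistent with that setting being read the wrong way round. Whether that is the actual cause here is an assumption. A minimal Python sketch for reading the setting (the function name is illustrative only):

```python
def hwclock_mode(adjtime_text):
    """Return 'UTC' or 'LOCAL' as recorded in the text of an adjtime file.

    hwclock keeps the mode on the third line of the file. None is
    returned when that line is missing or holds anything else, so the
    caller can tell "not recorded" apart from an explicit setting.
    """
    lines = adjtime_text.splitlines()
    if len(lines) < 3:
        return None
    mode = lines[2].strip().upper()
    return mode if mode in ("UTC", "LOCAL") else None


# Example with inline text; on a real machine the text would come from
# open("/etc/adjtime").read()
sample = "0.000000 1270550000 0.000000\n1270550000\nLOCAL\n"
mode = hwclock_mode(sample)
```

Comparing the value on a freshly installed F12 machine with that on a working SL5 machine might narrow down where the hour is being lost.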

F12

  • The standard initscripts package has been patched to remove the /etc/event.d/rc scripts. The lcfg-upstarthooks package replaces them with scripts which call the LCFG boot component to start the various services.

  • Alastair added a FULLY_MANAGED macro in the inf-level headers to hack round F12 machines being in different states and not all using the LCFG boot component. We should now be ready to remove this macro and add the packages to a standard location.

  • Stephen is working on the LCFG PAM component. The first step is to reproduce the F12 standard configuration via the LCFG headers. After that he will work out the necessary resources to support the DICE kerberos and AFS infrastructure.

  • Stephen has done some work on cleaning up the LCFG hardware headers (lcfg/hw/). This will allow the switch from using modprobe.conf to modprobe.d/lcfg.conf on F12 which will avoid some annoying warning messages.

  • Chris will take a look at whether the LCFG gdm component works on F12.

  • Stephen has checked that the LCFG cyrussasl component works fine on F12.

Miscellaneous Development

  • Development meeting : We all need to update the project entries on the Development Meeting Activity Page.
  • Q1 reports : We all need to write the short project reports for the MPU Q1 report.
  • A bug was found (and fixed) in the parser for old-style LCFG templates in the LCFG::Template Perl module. It was introduced by the recent changes made to get the module working on F12. A caret (^) anchor was missing, which meant that the regular expression backtracked far more than it should. This went unnoticed on small files, but the LCFG mailng component has a 1700 line template which was taking about 5 minutes to process.
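
The effect of a missing anchor can be sketched in Python. The actual expression from LCFG::Template is not recorded in these minutes, so the two patterns below are hypothetical and only illustrate the general mechanism: without the caret, a backtracking engine retries the match from every offset in the line, and each retry rescans the rest of the line, so the work on a non-matching line typically grows with the square of its length.

```python
import re
import timeit

# Hypothetical patterns, NOT the ones used in LCFG::Template.
# Both look for a run of non-'%' characters followed by a "<%" opener.
UNANCHORED = re.compile(r"[^%]*<%")
ANCHORED = re.compile(r"^[^%]*<%")


def find_tag_prefix(pattern, line):
    """Return the text matched up to and including '<%', or None."""
    m = pattern.search(line)
    return m.group(0) if m else None


def compare_timings(length=3000, repeats=5):
    """Time both patterns on a line that contains no tag at all.

    Returns (unanchored_seconds, anchored_seconds). The numbers depend
    on the machine, so they are only meant for a rough comparison.
    """
    line = "x" * length
    slow = timeit.timeit(lambda: UNANCHORED.search(line), number=repeats)
    fast = timeit.timeit(lambda: ANCHORED.search(line), number=repeats)
    return slow, fast
```

Note that anchoring also changes what matches: for the line "50% <%" the unanchored pattern finds " <%" part-way through, whereas the anchored one finds nothing. Any such fix therefore needs checking against real templates, not just for speed.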

Disaster Recovery

Now that the King's Buildings machine room is available we need to reconsider the MPU disaster recovery strategy. The intention is to ensure we have off-site mirrors of all our data and a lukewarm spare for key services. Where our data storage is managed by other units (e.g. the MPU and LCFG AFS group space) we will work on the assumption that the backups are already safely in off-site locations.

The key services we need to be able to restore quickly in the event of a disaster at the Forum are: the LCFG master, an LCFG slave server and a package server. This could be done with a single machine located at KB.

LCFG master
We want the machine to have a live copy of all the data stored on the LCFG master. This includes all the config data (i.e. headers and package lists) and the LCFG profiles. It should also have all the configuration in place to be able to switch into the role of LCFG master without any dependency on LCFG itself.

LCFG slave
As well as mastering the LCFG data the machine should have all the configuration necessary to be able to compile and serve the LCFG profiles. The machine will probably run as an LCFG slave under normal conditions so this should not be a problem.

Packages master
All the packages are stored in AFS. In the event of a disaster we need to be able to build new servers without any dependency on AFS. To do this we will need a local mirror of all our package repositories. This will probably need about 1TB of disk to leave us a comfortable amount of working space. All the Apache configuration necessary for the machine to serve as the packages master must be in place so that it can take over this role without any dependency on LCFG.
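
A local mirror is only useful in a disaster if it is known to be complete. As a rough illustration, the Python sketch below lists files that are present under a source tree but missing from the mirror, or that differ in size. The function name and the directory layout are hypothetical, and comparing sizes is only a cheap stand-in for a proper checksum comparison.

```python
from pathlib import Path


def mirror_gaps(source_root, mirror_root):
    """Return relative paths missing from the mirror or differing in size.

    Only regular files under source_root are considered. Extra files
    that exist only in the mirror are ignored.
    """
    source_root = Path(source_root)
    mirror_root = Path(mirror_root)
    gaps = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_root)
        dst = mirror_root / rel
        if not dst.is_file() or dst.stat().st_size != src.stat().st_size:
            gaps.append(str(rel))
    return gaps
```

Run periodically against each repository, an empty result would give some confidence that the spare machine could take over as packages master without needing AFS.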

All the procedures related to disaster recovery should be simple, straightforward and well documented so that they can be carried out by any CO. It might be good to have the documentation printed out and stored off-site as well so we're not relying on the availability of services like the wiki.

Operational

Not much in the way of operational work done this week as we are all busy with F12.

  • Latest perl-AFS : This has been packaged and installed on F12 machines. If it is all ok it will also go out onto SL5 next week.

This Week

Alastair will:

  • Personal Development topics
  • Q1 project reports
  • F12

Chris will:

  • Personal Development topics
  • Q1 project reports
  • F12

Stephen will:

  • Personal Development topics
  • Q1 project reports
  • F12/x86_64
  • Change the IP address for split
  • Finalise lcfg-server update
  • Roll out lcfg-om changes

-- StephenQuinney - 09 Apr 2010

Topic revision: r1 - 09 Apr 2010 - 11:03:29 - StephenQuinney
 