Introduce a Production Condor service for Research

Description

Create a production-quality Condor service for use on staff and lab machines, allowing research staff and students to use spare CPU cycles for computationally intensive tasks.

Condor is a clustering system targeted at harvesting spare CPU cycles on otherwise unused desktops. Condor can be configured to use only desktop machines where the keyboard and mouse are idle. Should Condor detect that a machine is no longer available (for example, a key press is detected), in many circumstances it can transparently checkpoint a job and migrate it to a different machine which would otherwise be idle.

We have had a test cluster running on a number of School machines for some time, and we are now in a position to produce a production-quality service on lab and staff machines. This will allow research staff to submit jobs to Condor clusters, which will utilise idle CPUs on standard desktop machines to process those jobs.
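As a sketch of what submission might look like for users (the file and job names here are placeholders, not a finalised interface), a minimal Condor submit description file using the standard universe, which is what enables the transparent checkpointing and migration described above, could be:

```
# myjob.sub -- illustrative submit description file
universe   = standard       # standard universe enables checkpoint/migration
executable = myjob          # binary relinked with condor_compile
input      = myjob.in
output     = myjob.out
error      = myjob.err
log        = myjob.log
queue
```

The user would then run `condor_submit myjob.sub` and monitor progress with `condor_q`. Jobs that cannot be relinked with condor_compile would have to use the vanilla universe and forgo checkpointing.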

Customer

This project will support research across a number of Institutes and would be a resource available to all research staff and students. It would also be possible to open up access to the cluster for non-research use on nodes which have not been grant funded. The Distributed Computing working group would sign off on acceptability.

Case

The School purchases a large number of desktop computers which spend much, and in some cases most, of their time unused. Deploying Condor on lab and staff machines would make use of this currently wasted resource and at the same time free up time on dedicated clusters for jobs with more stringent resource requirements.

Deliverables

A production service running on lab and staff machines, with multiple master nodes sited at KB and in the city centre. Suitable documentation for the system, allowing day-to-day management by the support unit and providing a gentle introduction for new users. Beyond basic testing, usage testing will be carried out by the Distributed Computing working group.

Timescale

2nd Pilot July-August

Start phased roll-out to lab and staff machines from the middle of semester 1, 2006-2007.

Multiple masters in place by December 2006.

Condor running site wide by start Semester 3 2006-2007.

Proposal

Deploy Condor site-wide on lab/general-access machines, with an opt-out for individual users' machines.
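The opt-out could plausibly be implemented through Condor's startd policy rather than by uninstalling anything; for example, a machine-local configuration file on an opted-out desktop (an illustrative sketch, not a finalised policy) could simply refuse to start jobs:

```
# condor_config.local on an opted-out desktop (illustrative)
# Never accept jobs, regardless of keyboard or load state.
START = False
```

Keeping the opt-out in local configuration would let frontline support re-enable a machine later without a reinstall.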

Risks

If take-up is high, Condor usage could impact fileserver or LAN performance.

Master nodes are open to denial of service attacks from malicious or badly written submission scripts.

Running desktop machines at high load continuously for long periods may degrade the hardware. It may also result in large-scale hardware failures where there is a fundamental hardware fault: with the capacitor problem on the GX270s, for example, we might get a lab full of machines failing within days rather than over a period of weeks.

We currently have no way of running Condor with AFS.

Automatic backgrounding of jobs currently does not work on nodes with USB input devices under FC5.

Large numbers of nodes running at high load in labs may produce complaints from students about the levels of temperature and noise generated.

Dependencies

Condor currently cannot detect console usage with USB keyboards and mice; consequently it does not suspend or migrate processes when a user starts using the machine. A patched kernel would need to be deployed on Condor nodes.
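For context, Condor's desktop policy is normally driven by idle-time ClassAd attributes such as KeyboardIdle, which is why broken USB idle detection matters. An illustrative policy fragment (the thresholds here are examples only, not our proposed settings):

```
# Startd policy fragment in condor_config (illustrative thresholds)
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
START    = (KeyboardIdle > 15 * 60) && ($(NonCondorLoadAvg) < 0.3)
SUSPEND  = (KeyboardIdle < 60)
CONTINUE = (KeyboardIdle > 5 * 60)
```

Without the patched kernel, USB keyboard and mouse activity never resets KeyboardIdle, so expressions like SUSPEND above would never fire and jobs would keep running under the user's nose.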

Management

Suggest that this project falls under the Research and Teaching Unit and is consequently managed by Tim Colles.

Resources

Hardware for multiple master nodes would be required. This can be fairly low-spec in terms of processor/disk requirements but should be geared towards high availability (rack mount, console access, RAID-capable disks, some form of UPS).

The project would need expertise from the MDP unit to patch the kernel and there would be an ongoing support requirement to ensure that the patch was applied to subsequent kernels.

The project would need expertise from the Services unit on how best to integrate Condor with AFS.

There would be a need for staff time from frontline support to reconfigure the lab machines.

There would be a need for ongoing support in upgrading Condor RPMs.

There would be an ongoing need for Condor-specific documentation and user support.

Plan

  1. Patch the kernel and test with the current Condor pool
  2. Rework the component to work with third-party Condor RPMs
  3. Set up a second Condor pool to test the above
  4. Large-scale testing of step 2 in a student lab (probably a newly upgraded FC5 lab)
  5. Review the above and plan for larger-scale deployment in labs and on volunteer user desktops
  6. Deploy multiple masters
  7. Deploy in all labs and on volunteer user desktops
  8. Deploy school-wide.

Time

This is not fully quantifiable at this stage as a certain amount of time is needed simply to evaluate the technology and how best to implement certain aspects of the proposal.

Steps 1 and 2 would require about 5 man-days of work, with a contingency of 2 man-days.

Step 3 would require 3 man-days from RAT and 1 man-day from Frontline support, with an unknown user-support commitment (probably from both units).

The best current estimate for the remainder is 1-3 man-months of effort, but this is very difficult to quantify: it is not clear what level of configuration or help staff would need to get their machines ready to run as Condor nodes, and some of the risks listed above could have a major impact on the project. Figures will become clearer after stage 3 is complete.

For example, if the kernel patch does not work and Condor cannot background tasks, then a much larger commitment of resources would be required to produce a solution, or we would have to restrict Condor use to outside office hours.

Priority

There is a user request (RT:23966) for Condor to be deployed in labs by July.

There is considerable pressure from some institutes for more cluster computing resources.

The current pilot is well subscribed and well used.

-- IainRae - 20 Jun 2006

Topic revision: r5 - 24 Apr 2013 - 14:49:13 - GrahamDutton