Production Condor Project Report

Summary

Condor has now been rolled out in the labs and across selected staff desktops on an opt-in basis. We are running two pools, one for hosts in the Forum and one for hosts in the Tower. Currently the Tower pool has significantly more nodes (350) than the Forum pool (50); this will probably balance out over time.

Issues

We had problems with Condor on multi-core hosts. The Condor daemon wasn't releasing all cores when the user started using the machine again, so some Condor jobs assigned to the host continued to run. This has been resolved with some config changes, but it's something we need to watch for given the explosion in CPU cores and virtual machines.

We found that some jobs were taking down hosts by running the machine out of memory whilst updaterpms was running or whilst LDAP was replicating. This has been mitigated in the Condor configuration, but it cannot be ruled out entirely. Ideally it ought to be possible to tell Condor to vacate jobs when free system memory drops below a certain point. In practice, physical memory may fill before Condor has an opportunity to act, resulting in an invocation of the OOM killer and a subsequent crash.
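The "vacate below a memory threshold" idea can be illustrated with a small check outside Condor itself. This is only a sketch under stated assumptions, not the configuration we deployed: the function names and the threshold value are hypothetical, and it assumes a Linux host exposing /proc/meminfo with MemFree, Buffers and Cached fields.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def should_vacate(meminfo_text, min_free_kb):
    """Return True when reclaimable memory falls below min_free_kb.

    Assumption: free memory plus buffers and page cache is a reasonable
    estimate of what the kernel can hand out without swapping or OOM.
    """
    fields = parse_meminfo(meminfo_text)
    available_kb = (
        fields.get("MemFree", 0)
        + fields.get("Buffers", 0)
        + fields.get("Cached", 0)
    )
    return available_kb < min_free_kb


if __name__ == "__main__":
    # Hypothetical threshold of 256 MB; a real value would need tuning per host.
    with open("/proc/meminfo") as handle:
        print(should_vacate(handle.read(), 256 * 1024))
```

A periodic job could run a check like this and ask Condor to vacate the host when it returns True. As noted above, any polling approach may still lose the race against a job that allocates memory quickly, so this would reduce the risk rather than remove it.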

We don't currently have a good mechanism to allow staff and research students to opt their machines in and out of the pools.

Lessons

Utilising spare CPU cycles on desktops in this way can release a lot of processor resource for research. Getting people to access it in a controlled way is a problem, however.

Don't try to implement something that will be installed on, and can impact or bring down, large numbers of machines whilst in the middle of a move to a large building.

Being dependent on binaries built by third parties can be painful at times.

-- IainRae - 27 Jul 2009

Topic revision: r3 - 03 Aug 2009 - 13:48:18 - IainRae
 