Production Condor Project Report
Summary
Condor has now been rolled out in the labs and across selected staff desktops on an opt-in basis. We are running two pools, one for hosts in the Forum and one for hosts in the Tower. The Tower pool currently has significantly more nodes (350) than the Forum pool (50); this will probably balance out over time.
Issues
We had problems with Condor on multicore hosts: the Condor daemon wasn't releasing all cores when the user started using the machine again, so some Condor jobs assigned to the host continued to run. This has been resolved with some config changes, but it's something we need to watch for given the explosion in CPU cores and virtual machines.
We found that some jobs were taking down hosts by running the machine out of memory whilst updaterpms was running or whilst LDAP was replicating. This has been mitigated in the Condor configuration, but it's not something that can be ruled out. Ideally it ought to be possible to tell Condor to vacate jobs when free system memory drops below a certain point. In practice, physical memory may fill before Condor has an opportunity to do anything, resulting in the OOM killer being invoked and a subsequent crash.
We don't currently have a good mechanism to allow staff and research students to opt their machines in and out of the pools.
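The idea of vacating jobs once system memory falls below a certain point can be sketched roughly as follows. This is a minimal illustration, not the configuration we actually deployed: the function names, the sample figures and the 256 MB threshold are all assumptions made for the example, and it only decides whether to vacate rather than talking to Condor itself.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def reclaimable_kb(fields):
    """Rough estimate of memory obtainable without swapping.

    Assumption: free memory plus buffers and page cache is a good enough
    approximation for a watchdog.
    """
    return (fields.get("MemFree", 0)
            + fields.get("Buffers", 0)
            + fields.get("Cached", 0))


def should_vacate(fields, min_available_kb):
    """Return True when estimated available memory is below the threshold."""
    return reclaimable_kb(fields) < min_available_kb


# Hypothetical sample, in the format of /proc/meminfo on a Linux desktop.
SAMPLE = """MemTotal:  2059872 kB
MemFree:     41200 kB
Buffers:     10240 kB
Cached:      60000 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    # 256 MB threshold, chosen arbitrarily for illustration.
    print(should_vacate(info, 256 * 1024))
```

A watchdog built on this would poll /proc/meminfo and ask Condor to vacate the host's jobs when should_vacate returns True. As noted above, polling of this kind can still lose the race: memory may fill between two checks, so a sketch like this reduces the risk rather than removing it.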
Lessons
Utilising spare CPU cycles on desktops in this way can release a lot of processor resource for research; getting people to access it in a controlled way is a problem, however.
Don't try to implement something that will be installed on, or can impact or bring down, large numbers of machines whilst in the middle of a move to a large building.
Being dependent on binaries built by third parties can be painful at times.
--
IainRae - 27 Jul 2009
Topic revision: r3 - 03 Aug 2009 - 13:48:18 - IainRae