-- Main.simonk - 20 Jan 2006

Getting started with Condor

  • There are two Condor pools, AT and FORUM. The pools are set to flock, so any jobs submitted in AT will be scheduled on free nodes in FORUM if there are no free nodes in AT (and vice versa). This introduces some redundancy between the Condor masters and means that jobs will still be scheduled if there is a network partition.
  • Condor runs in the student labs on execute-only nodes; you will not be able to submit jobs from those hosts.
  • To see if your desktop DICE machine is part of a pool, try running 'condor_status' - if this runs, then your machine is already part of the pool and you can submit Condor jobs from it. If 'condor_status' fails, and you want to add your machine to the pool, simply submit a support request using the support form.

  • Condor starts jobs when it detects a machine is idle. If your machine seems slow when you return to it after an idle period, some keyboard activity will quickly make Condor remove any running jobs (mouse activity alone doesn't seem to do this).

  • Condor is very configurable. For example, if your machine is noisy when working hard, it can be configured so that Condor jobs only run outside your normal working hours.

Submitting Jobs

  • You need to create a job description file (a submit description file). It lets you specify the executable, arguments and stdin/stdout/stderr for your job. You have a number of options - see the Condor manual for instructions, and the example at the end of this section.

  • Usually, your job will read and write a few files. Because your job will run on some other machine with access to the DICE file system, you can access files in, for example, your home directory. Here's the caveat: DICE uses automounted directories (as of this writing), so you will need to do one or more of the following:
    • explicitly set the working directory to your home directory (or wherever your files live) in your job description file: Initialdir = /home/me
    • use full pathnames, not relative ones, for stdin/stdout statements etc.: Input = /home/me/PhD/stats/stats2.R

If you don't do that, you'll get error messages like this one: Error from starter on vm1@ratte.inf.ed.ac.uk: Failed to open '/amd/nfs/pegasus/disk/ptn053/dreitter/PhD/stats/stats2.R' as standard input: No such file or directory (errno 2)
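
Putting these pieces together, here is a minimal sketch of a submit description file; the executable, directories and file names are invented for illustration:

Universe   = vanilla
# Made-up program and arguments - substitute your own.
Executable = /home/me/bin/run_stats
Arguments  = --iterations 100

# Start the job in a fixed directory so the automounter mounts it on the execute node.
Initialdir = /home/me/PhD/stats

# Full pathnames for stdin/stdout/stderr, as recommended above.
Input      = /home/me/PhD/stats/stats2.R
Output     = /home/me/PhD/stats/stats2.out
Error      = /home/me/PhD/stats/stats2.err
Log        = /home/me/PhD/stats/stats2.log

Queue

Submit it with 'condor_submit myjob.submit' and watch its progress with 'condor_q'.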

-- DavidReitter - 24 Jan 2006

Be nice to other users

    • It is nice if you put "nice_user = True" in your submit description file. This lowers the priority of your jobs so that other users' jobs have a chance to run before yours. I find this useful as sometimes I need to submit 400+ jobs to the Condor pool and I do not want to dominate the computational resources for a long time (and get complaints from others). The nice_user flag makes all my jobs "bottom-feeder" jobs, which are evicted as soon as there are jobs with higher priority.
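
For example, the relevant lines of a submit description file might look like this (the executable path and job count are invented for illustration):

Executable = /home/me/bin/run_stats
nice_user  = True
Queue 400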

pwd and pawd

* If you need to get the current directory in your shell script or Perl script, be sure to use `pawd` instead of `pwd`. This is because partitions are automounted on DICE; pwd is not aware of this and returns a path which may not cause NFS to automount the partition.

Therefore, pwd (and functions built on it) will sometimes return /amd/nfs/wyvern/disk/ptn110/s0450736/script instead of /home/s0450736/script, which in turn will cause a failure in your condor/qsub program. The `condor_run` program shipped with Condor needs such a fix to work properly. -- Hieu Hoang 28 dec 2006
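
A minimal sketch of the substitution in a shell script, assuming `pawd` is on the PATH as it is on DICE machines:

#!/bin/sh
# pawd asks the automounter for the canonical path (e.g. /home/s0450736/script),
# whereas pwd may return the /amd/nfs/... mount point directly.
WORKDIR=`pawd`
cd "$WORKDIR"
echo "Running in $WORKDIR"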

Handling Shadow Exceptions (updated: a new version of the script fixes this)

* I believe the shadow exception is caused by a temporary network problem, and I found a solution to this: if you use Peter's condor wrapper script, add a callback like callback_shadow:
import sys

def callback_shadow(info):
    # Log the shadow error; Condor reschedules the job itself.
    sys.stderr.write('Got Shadow Exception: %s\n' % info['shadowError'])
    sys.stderr.flush()

condorAPI.registerShadow(callback_shadow)

This will print a message when the shadow exception occurs, and Condor will automatically re-submit the killed job to another machine rather than killing all jobs in the queue. A typical pair of entries in the job log looks like this:

007 (296.057.000) 03/10 15:51:31 Shadow exception!
    Can no longer talk to condor_starter <129.215.155.26:34001>
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
...
001 (296.057.000) 03/10 15:54:38 Job executing on host: <129.215.218.167:55941>

Prevent Condor from generating core files

So far only the following methods work:
  • wrap your program with a script that sets ulimit -c 0 before executing the real program (see the wrapper sketch at the end of this section)
  • set the limit in your C code:
#include <assert.h>
#include <stdio.h>

#include <sys/time.h>
#include <sys/resource.h>
#include <unistd.h>


int main(int argc, char **argv)
{
    int res;
    struct rlimit rlp;

    /* Set both the soft and hard core-file size limits to zero. */
    rlp.rlim_cur = 0;
    rlp.rlim_max = 0;
    res = setrlimit(RLIMIT_CORE, &rlp);
    assert(res == 0);

    /* Deliberately scribble over memory until the program crashes,
       to demonstrate that no core file is produced. */
    {
        char *s = (char *) argv;
        for (;;)
            *s++ = 1;
    }

    return 0;
}
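
For the first option above, a wrapper could look something like this sketch (the program path is a placeholder); point the Executable line of your submit description file at the wrapper rather than the real program:

#!/bin/sh
# Disable core dumps for this shell and everything it execs,
# then hand over to the real program with the original arguments.
ulimit -c 0
exec /home/me/bin/real_program "$@"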
