-- Main.simonk - 20 Jan 2006

Unanswered questions

Condor

  • condor_compile, g++, fstream and -O

A program that contains a line like "ofstream("test.dat");" segfaults at run time if it is compiled with condor_compile and optimisation (compiler flag -O). The problem seems specific to fstream; iostream works fine. The only workaround found so far is not to use fstream.
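As a sketch of the "do not use fstream" workaround, the same output can be written through C stdio, which also compiles as C++. The helper name write_test_file is hypothetical, and whether this actually avoids the segfault under condor_compile with -O has not been verified here.

```c
#include <stdio.h>

/* Hypothetical helper (not part of Condor): writes one line of text to
 * `path` using C stdio instead of std::ofstream.
 * Returns 0 on success, -1 on failure. */
int write_test_file(const char *path, const char *text)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    int ok = (fprintf(fp, "%s\n", text) >= 0);

    /* fclose flushes buffered data, so its result matters too. */
    if (fclose(fp) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

A call such as write_test_file("test.dat", "some output") would replace the ofstream line in the failing program.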

  • Ctrl-Z produces a core dump if linked against the Condor library!
If you have access to the source code, you may want to link your program against the Condor library and use the standard universe. However, this may cause a core dump under DICE. Consider a simple hello world program:
#include <stdio.h>

int main(int argc, char **argv)
{
    printf("Hello, world!\n");
    for(;;)
        ;

    return 0;
}

Let's link it against condor lib:

condor_compile gcc -o hello core_dump.c
LINKING FOR CONDOR : /usr/bin/ld -L/opt/condor-6.7.14/lib -Bstatic --eh-frame-hdr -m elf_i386 -dynamic-linker /lib/ld-linux.so.2 -o hello /opt/condor-6.7.14/lib/condor_rt0.o /usr/lib/gcc/i386-redhat-linux/3.4.4/../../../crti.o /usr/lib/gcc/i386-redhat-linux/3.4.4/crtbeginT.o -L/opt/condor-6.7.14/lib -L/home/s0450736/opt/lib -L/usr/local/lib -L/usr/X11R6/lib -L/usr/lib/gcc/i386-redhat-linux/3.4.4 -L/usr/lib/gcc/i386-redhat-linux/3.4.4 -L/usr/lib/gcc/i386-redhat-linux/3.4.4/../../.. /tmp/ccXnLejZ.o /opt/condor-6.7.14/lib/libcondorzsyscall.a /opt/condor-6.7.14/lib/libcondor_z.a /opt/condor-6.7.14/lib/libcomp_libstdc++.a /opt/condor-6.7.14/lib/libcomp_libgcc.a /opt/condor-6.7.14/lib/libcomp_libgcc_eh.a --as-needed --no-as-needed -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c /opt/condor-6.7.14/lib/libcomp_libgcc.a /opt/condor-6.7.14/lib/libcomp_libgcc_eh.a --as-needed --no-as-needed /usr/lib/gcc/i386-redhat-linux/3.4.4/crtend.o /usr/lib/gcc/i386-redhat-linux/3.4.4/../../../crtn.o
/opt/condor-6.7.14/lib/libcondorzsyscall.a(condor_file_agent.o)(.text+0x250): In function `CondorFileAgent::open(char const*, int, int)':
: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
Then run the program hello and press Ctrl-Z:
./hello 
Condor: Notice: Will checkpoint to ./hello.ckpt
Condor: Notice: Remote system calls disabled.
Hello, world!
Segmentation fault (core dumped)

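There seems to be a problem in the Condor library: the backtrace below shows the crash inside Checkpoint(), in the compression routines (deflate, adler32) called while writing the checkpoint image. Can anyone fix this?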

gdb hello core.9951 
GNU gdb Red Hat Linux (6.1post-1.20040607.43.0.1rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".

Reading symbols from shared object read from target memory...done.
Loaded system supplied DSO at 0xbac000
Core was generated by `./hello'.
Program terminated with signal 11, Segmentation fault.
#0  0x080b9b80 in adler32 ()
(gdb) up
#1  0x080b526e in fill_window ()
(gdb) 
#2  0x080b5059 in deflate_slow ()
(gdb) 
#3  0x080b407f in deflate ()
(gdb) 
#4  0x0804fde7 in SegMap::Write ()
(gdb) 
#5  0x0804f5c8 in Image::Write ()
(gdb) 
#6  0x0804f29f in Image::Write ()
(gdb) 
#7  0x0804f11e in Image::Write ()
(gdb) 
#8  0x080504ed in Checkpoint ()
(gdb) 
#9  
(gdb) 
#10 0x08048214 in main ()
(gdb) 
  • Suspend/unsuspend loop: normally my job takes 1 hour to finish. However, one of my jobs had not finished after 10 hours because Condor kept suspending and unsuspending it roughly every two minutes. Why?
010 (005.053.000) 02/16 08:19:42 Job was suspended. 
    Number of processes actually suspended: 1
... 
011 (005.053.000) 02/16 08:22:12 Job was unsuspended.
... 
010 (005.053.000) 02/16 08:24:24 Job was suspended.
    Number of processes actually suspended: 1
...
011 (005.053.000) 02/16 08:26:43 Job was unsuspended.
...
010 (005.053.000) 02/16 08:29:22 Job was suspended. 
    Number of processes actually suspended: 1
... 
011 (005.053.000) 02/16 08:31:51 Job was unsuspended.
... 
010 (005.053.000) 02/16 08:34:09 Job was suspended.
    Number of processes actually suspended: 1
...
011 (005.053.000) 02/16 08:37:03 Job was unsuspended.
...
010 (005.053.000) 02/16 08:39:20 Job was suspended. 
    Number of processes actually suspended: 1
... 
011 (005.053.000) 02/16 08:41:59 Job was unsuspended.
... 
My current workaround is to run "condor_vacate_job" followed by "condor_reschedule".
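To see how much run time such a loop costs, the suspended intervals in the job log can be added up. The following is a minimal sketch, assuming event header lines in the format shown in the excerpt above (event code 010 for a suspend, 011 for an unsuspend) and assuming each suspend/unsuspend pair falls on the same calendar day. The function name total_suspended_seconds is hypothetical.

```c
#include <stdio.h>

/* Event codes as they appear in the log excerpt above. */
#define EV_SUSPENDED   10
#define EV_UNSUSPENDED 11

/* Sums the seconds a job spent suspended, reading a job log from `log`.
 * Assumption: a suspend and its matching unsuspend happen on the same
 * day, so only the time of day is compared. */
long total_suspended_seconds(FILE *log)
{
    char line[512];
    long total = 0;
    long suspended_at = -1;   /* -1 means "not currently suspended" */

    while (fgets(line, sizeof line, log) != NULL) {
        int event, month, day, h, m, s;

        /* Event header: "010 (005.053.000) 02/16 08:19:42 Job was ..." */
        if (sscanf(line, "%d (%*[^)]) %d/%d %d:%d:%d",
                   &event, &month, &day, &h, &m, &s) != 6)
            continue;         /* detail line or "..." separator */

        long now = h * 3600L + m * 60L + s;
        if (event == EV_SUSPENDED) {
            suspended_at = now;
        } else if (event == EV_UNSUSPENDED && suspended_at >= 0) {
            total += now - suspended_at;
            suspended_at = -1;
        }
    }
    return total;
}
```

For the five suspend/unsuspend pairs listed above, the intervals are 150, 139, 149, 174 and 159 seconds, i.e. 771 seconds suspended out of the 1337 seconds between 08:19:42 and 08:41:59.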

  • Mystery evictions: sometimes sets of jobs are evicted, for no apparent reason, from the machines where they are running. In the following example, 14 jobs are evicted simultaneously from 13 different machines:
      .....
      004 (018.000.000,129.215.218.166,ratte)        01/17 14:29:46 Job was evicted.
      004 (019.000.000,129.215.218.166,ratte)        01/17 14:29:46 Job was evicted.
      004 (002.000.000,129.215.218.45,roy)           01/17 14:29:46 Job was evicted.
      004 (014.000.000,129.215.165.64,greenday)      01/17 14:29:46 Job was evicted.
      004 (020.000.000,129.215.165.66,charlatans)    01/17 14:29:46 Job was evicted.
      004 (009.000.000,129.215.165.48,dylan)         01/17 14:29:46 Job was evicted.
      004 (003.000.000,129.215.165.60,cash)          01/17 14:29:46 Job was evicted.
      004 (023.000.000,129.215.218.57,cygnus)        01/17 14:29:46 Job was evicted.
      004 (006.000.000,129.215.218.60,dayna)         01/17 14:29:46 Job was evicted.
      004 (008.000.000,129.215.165.63,bowie)         01/17 14:29:47 Job was evicted.
      004 (005.000.000,129.215.218.54,UNKNOWN)       01/17 14:29:47 Job was evicted.
      004 (010.000.000,129.215.218.38,hieros)        01/17 14:29:47 Job was evicted.
      004 (012.000.000,129.215.218.59,soolin)        01/17 14:29:47 Job was evicted.
      004 (007.000.000,129.215.218.55,jenna)         01/17 14:29:47 Job was evicted.
      ......

-- NarayananEdakunni - 01 May 2007

* Addendum to the fstream compile problem: I was able to run a condor_compiled C++ program that uses fstream for file writes, but it segfaulted in an fstream-based operator that the Boost C++ library uses to write its data structures. Once I overrode those operators, it worked fine.

GridEngine

Topic revision: r12 - 01 May 2007 - 11:23:47 - NarayananEdakunni