How can I test if any of my HTCondor jobs returned a non-zero exit code?

I have a script that runs condor_submit for a batch of 25 jobs, runs condor_wait until they have all completed, and then runs condor_submit again for another batch of 25 jobs.

I want to make sure that none of the first 25 jobs failed, that is, ended with a non-zero return value, for example "Normal termination (return value 127)".

How can I easily do this? If that's impossible, I'm also willing to wrap my job executable in a script that fails the job when the executable returns non-zero, but I'm not sure how to fail an HTCondor job!

There are 2 answers

cooke (best answer)

You can use condor_history: http://research.cs.wisc.edu/htcondor/manual/current/condor_history.html

If you run the following command:

condor_history USERNAME -af clusterId ExitStatus

It prints one line per job, with the two fields separated by a space:

JobId ExitStatus

condor_history also accepts selectors other than USERNAME, for example a cluster ID.
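
To act on that output from the submitting script, the lines can be filtered for any status other than 0. The following is a minimal sketch, assuming the two-column output described above; the function name find_failed_jobs is made up for illustration. A value that is not exactly 0, including a missing or undefined one, is treated as a failure.

```shell
#!/bin/sh
# Hypothetical helper: reads "JobId ExitStatus" lines on stdin,
# prints every job whose status is not exactly 0,
# and returns 1 if at least one such job was found (0 otherwise).
find_failed_jobs() {
    awk '
        NF == 0 { next }    # ignore blank lines
        $2 != "0" { print "job " $1 " failed with status " $2; bad = 1 }
        END { exit bad ? 1 : 0 }
    '
}
```

In the script from the question, this could sit between condor_wait and the second condor_submit, for example: condor_history USERNAME -af clusterId ExitStatus | find_failed_jobs || exit 1. Because the history for USERNAME may contain older jobs as well, passing the cluster ID of the first batch instead of USERNAME keeps the check limited to those 25 jobs.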

Greg

Another way to solve this problem is to use the condor_dagman tool. With DAGMan, you list the dependencies between your jobs, and DAGMan automatically submits a job once all the jobs it depends on have completed. No need to run condor_wait or look at exit codes.
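
As a rough sketch of what this could look like for the two batches in the question, the script below writes a small DAG description file. The file names batches.dag, batch1.sub and batch2.sub are assumptions: they stand for one submit file per batch of 25 jobs. The node names batch1 and batch2 are likewise made up.

```shell
#!/bin/sh
# Write a two-node DAG description to the file given as $1.
# batch1.sub and batch2.sub are hypothetical submit files, one per batch.
write_dag() {
    cat > "$1" <<'EOF'
JOB batch1 batch1.sub
JOB batch2 batch2.sub
PARENT batch1 CHILD batch2
EOF
}

write_dag batches.dag
# The workflow would then be submitted with: condor_submit_dag batches.dag
```

The PARENT ... CHILD line is what expresses the dependency: the second batch is only submitted after the first node has finished successfully, so a failing job in the first batch should stop the workflow before the second condor_submit happens.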