I am running jobs on Condor and have noticed that a subset of them start running but never complete. Is there a setting in the submit file that kills and then resubmits a job if it takes longer than a certain amount of time to complete? This is similar to the question "Condor Timeout for idle jobs", except that I want Condor not simply to kill the jobs, but to resubmit them as well.
Thanks!
You can use the KILL transition expression in the machine ClassAd configuration (see the Condor user manual). Something like:
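A minimal sketch of such a policy, written for the configuration of the execute machine rather than the submit file. The two-hour value is an assumption chosen only for illustration, and the sketch assumes the job's most recent start time is available as the job attribute JobCurrentStartDate:

```
# Hypothetical limit: two hours, expressed in seconds
MaxExecutionTime = 7200

# Kill a job once it has been running longer than the limit.
# JobCurrentStartDate is taken from the job's ClassAd (assumed available here).
KILL = (time() - JobCurrentStartDate) > $(MaxExecutionTime)
```

Depending on how the pool's policy is set up, KILL may only be evaluated once a job is already being preempted, so a matching PREEMPT expression may also be needed; this is worth checking against the manual for your Condor version.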
With this setting, the machine kills any job that runs longer than MaxExecutionTime. Condor will then retry the job.