SGE - QSUB fails to submit jobs in -sync mode

I have a perl script that prepares files for input to a binary program and submits the execution of the binary program to the SGE queueing system version 6.2u2.

The jobs are submitted with the -sync y option so that the parent Perl script can monitor the status of the submitted jobs with the waitpid function.

This is also very useful because sending a SIGTERM to the parent perl script propagates this signal to each of the children, who then forward this signal onto qsub, thus gracefully terminating all associated submitted jobs.

Thus, it is fairly crucial that I be able to submit jobs with this -sync y option.
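
For reference, the parent/child arrangement described above can be sketched roughly as follows. This is a shell approximation, not the actual Perl script: SUBMIT_CMD, forward_term, and run_sync_jobs are names made up for illustration, and the submit command is a variable only so the pattern can be tried without a cluster.

```shell
#!/bin/sh
# Sketch only. On a real cluster the submit command would be "qsub -sync y";
# it is a variable here (an assumption for illustration) so it can be swapped.
SUBMIT_CMD=${SUBMIT_CMD:-"qsub -sync y"}
pids=""

# Forward SIGTERM to every child started by run_sync_jobs.
forward_term() {
    for p in $pids; do
        kill -TERM "$p" 2>/dev/null || true
    done
}

# Start one blocking submission per job script, then wait for all of them.
# The return value is the number of submissions that exited non-zero.
# If a TERM arrives while waiting, wait returns early with a status above
# 128, so that job is counted as failed.
run_sync_jobs() {
    failed=0
    pids=""
    trap forward_term TERM
    for job in "$@"; do
        $SUBMIT_CMD "$job" &
        pids="$pids $!"
    done
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    trap - TERM
    return "$failed"
}
```

Calling run_sync_jobs job1.sh job2.sh (hypothetical script names) blocks until both submissions have exited, which is the behaviour the parent script relies on.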

Unfortunately, I keep getting the following error:

Unable to initialize environment because of error: range_list containes no elements

Notice the improper spelling of 'containes'. That is NOT a typo. It just shows you how poorly maintained this area of the code/error message must be.

The attempted submissions that produce this error fail to even generate the STDOUT and STDERR files *.e{JOBID} and *.o{JOBID}. The submission just completely fails.

Searching Google for this error message only turns up unresolved posts on obscure message boards.

This error does not even occur reliably. I can rerun my script and the same jobs will not necessarily even generate the error. It also seems not to matter from which node I attempt to submit jobs.

My hope is that someone here can figure this out.

Answers to any of these questions would thus solve my problem:

  1. Does this error persist in more recent versions of SGE?
  2. Can I alter my command line options for qsub to avoid this?
  3. What the hell is this error message talking about?

There are 2 answers

bdobbie On BEST ANSWER

Our site hit this issue in SGE 6.2u5. I've posted some questions on the mailing list, but there was no solution. Until now.

It turns out that the error message is bogus. I discovered this by reading through the change logs on the Univa GitHub "open-core" repo. I later saw the issue mentioned in the Son of Grid Engine v8.0.0c Release Notes.

Here are the related commits in the github repo:

What the error message should say is that you've hit the limit on the number of qsub -sync y jobs in the system. This parameter is known as MAX_DYN_EC. The default in our version was 99, and the changes above increase that default to 1000.

The definition of MAX_DYN_EC (from the sge_conf(5) man page) is:

Sets the max number of dynamic event clients (as used by qsub -sync y and by Grid Engine DRMAA API library sessions). The default is set to 99. The number of dynamic event clients should not be bigger than half of the number of file descriptors the system has. The number of file descriptors are shared among the connections to all exec hosts, all event clients, and file handles that the qmaster needs.

You can check how many dynamic event clients you are currently using with the following command:

$ qconf -secl | grep qsub | wc -l

We have added MAX_DYN_EC=1000 to qmaster_params via qconf -mconf. I've tested submitting hundreds of qsub -sync y jobs and we no longer hit the range_list error. Prior to the MAX_DYN_EC change, doing so would reliably trigger the error.
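
If you want to see how close you are to the limit before submitting more jobs, the count from the command above can be compared against whatever MAX_DYN_EC is set to. A minimal sketch, assuming the limit is passed in by hand; count_qsub_clients and check_event_clients are hypothetical helper names, and the 80% warning level is an arbitrary choice:

```shell
#!/bin/sh
# Count dynamic event clients belonging to qsub, same idea as
# "qconf -secl | grep qsub | wc -l". grep -c exits non-zero when the
# count is 0, so "|| true" keeps that from looking like a failure.
count_qsub_clients() {
    qconf -secl | grep -c qsub || true
}

# Compare the count against a limit supplied by the caller.
#   $1 = limit (for example the MAX_DYN_EC value in use)
#   $2 = warning level in percent (assumed default: 80)
# Returns 1 and prints a warning when the level is reached.
check_event_clients() {
    limit=$1
    warn_pct=${2:-80}
    count=$(count_qsub_clients)
    if [ $((count * 100)) -ge $((limit * warn_pct)) ]; then
        echo "WARNING: $count of $limit dynamic event clients in use"
        return 1
    fi
    echo "OK: $count of $limit dynamic event clients in use"
}
```

For example, check_event_clients 1000 would match the setting we applied. The sketch does not read the limit from the configuration itself; you have to supply it.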

EMiller On

I found a solution to this problem - or at the very least a workaround.

My goal was to get individual instances of qsub to remain in the foreground while the job each one submitted was still in the queue or running. This was achieved with the -sync option but resulted in the horribly unpredictable bug that I describe in my question.

The solution to this problem was to use the qrsh command with the -now n option. This causes the job to behave similarly to qsub -sync y, in that my script can implicitly monitor whether a submitted job is running by using waitpid on the qrsh instance.
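
In shell terms the idea looks roughly like this. It is a sketch, assuming the option meant is -now n (queue the request instead of failing when it cannot start immediately) and that qrsh passes back the exit status of the command it ran; RSH_CMD and run_and_wait are names invented for the example.

```shell
#!/bin/sh
# The command used to launch work; a variable only so it can be swapped out.
RSH_CMD=${RSH_CMD:-"qrsh -now n"}

# Start the command in the background through RSH_CMD, then block on that
# one child and return its exit status (the shell analogue of waitpid).
run_and_wait() {
    $RSH_CMD "$@" &
    child=$!
    wait "$child"
}
```

A call such as run_and_wait ./my_binary input.dat (hypothetical names) returns only once the qrsh process has exited.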

The only caveat to this solution is that the queue you are operating on must not make any distinction between interactive nodes (offered by qrsh) and non-interactive nodes (accessible by qsub). Should a distinction exist (likely there are fewer interactive nodes than non-interactive) then this workaround may not help.

However, as I have found nothing even close to a solution to the qsub -sync problem that is anywhere near as functional as this, let this post go out across the interwebs to any wayward soul caught in a similar situation.