I have a Perl script that prepares input files for a binary program and submits the execution of that program to the SGE queueing system, version 6.2u2.
The jobs are submitted with the `-sync y` option so that the parent Perl script can monitor the status of the submitted jobs with the `waitpid` function.
This is also very useful because sending a SIGTERM to the parent Perl script propagates the signal to each of the children, which then forward it on to `qsub`, thus gracefully terminating all associated submitted jobs.
Thus, it is fairly crucial that I be able to submit jobs with the `-sync y` option.
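For anyone who wants to reproduce the setup, here is a minimal shell sketch of the pattern described above (my actual script is Perl and uses `fork`/`waitpid`, so this is only an approximation). The names `SUBMIT_CMD`, `forward_term`, and `submit_all` are made up for illustration; `SUBMIT_CMD` exists only so the pattern can be tried on a machine without SGE.

```shell
#!/bin/sh
# Sketch: start one blocking submitter per job, forward SIGTERM to all of
# them, and wait for every one to finish.
# SUBMIT_CMD is a hypothetical knob, not an SGE setting.
SUBMIT_CMD=${SUBMIT_CMD:-"qsub -sync y"}

pids=""
failures=0

forward_term() {
    # Pass the signal on to every background submitter we started.
    for pid in $pids; do
        kill -TERM "$pid" 2>/dev/null || true
    done
}

submit_all() {
    # Each argument is one job script. With -sync y the submitter blocks
    # until its job ends and exits non-zero if the job or submission failed.
    pids=""
    failures=0
    trap forward_term TERM
    for job in "$@"; do
        # Unquoted on purpose so "qsub -sync y" splits into separate words.
        $SUBMIT_CMD "$job" &
        pids="$pids $!"
    done
    for pid in $pids; do
        wait "$pid" || failures=$((failures + 1))
    done
    trap - TERM
    # Succeed only if every submitter exited with status 0.
    [ "$failures" -eq 0 ]
}
```

Usage would be something like `submit_all job1.sh job2.sh job3.sh`, after which `$failures` holds the number of submitters that exited non-zero.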
Unfortunately, I keep getting the following error:
Unable to initialize environment because of error: range_list containes no elements
Notice the improper spelling of 'containes'. That is NOT a typo. It just shows you how poorly maintained this area of the code/error message must be.
The attempted submissions that produce this error fail to even generate the STDOUT and STDERR files `*.o{JOBID}` and `*.e{JOBID}`. The submission just completely fails.
Searching Google for this error message only turns up unresolved posts on obscure message boards.
This error does not even occur reliably. I can rerun my script and the same jobs will not necessarily even generate the error. It also seems not to matter from which node I attempt to submit jobs.
My hope is that someone here can figure this out.
Answers to any of these questions would thus solve my problem:
- Does this error persist in more recent versions of SGE?
- Can I alter my command line options for qsub to avoid this?
- What the hell is this error message talking about?
Our site hit this issue in SGE 6.2u5. I've posted some questions on the mailing list, but there was no solution. Until now.
It turns out that the error message is bogus. I discovered this by reading through the change logs on the Univa GitHub "open-core" repo. I later saw the issue mentioned in the Son of Grid Engine v8.0.0c release notes.
Here are the related commits in the GitHub repo:
What the error message should say is that you've hit the limit on the number of `qsub -sync y` jobs in the system. This parameter is known as `MAX_DYN_EC`. The default in our version was 99, and the changes above increase that default to 1000.
The definition of `MAX_DYN_EC` (from the sge_conf(5) man page) is:
You can check how many dynamic event clients you are using with the following command:
We have added `MAX_DYN_EC=1000` to `qmaster_params` via `qconf -mconf`. I've tested submitting hundreds of `qsub -sync y` jobs and we no longer hit the range_list error. Prior to the `MAX_DYN_EC` change, doing so would reliably trigger the error.
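If you want to confirm the change took effect, a rough sketch follows. It assumes `qconf -secl` (show event client list) prints one client per row with a numeric ID in the first column, and that `qconf -sconf` prints `qmaster_params` on a single line; check both against your own version's output. The helper names `count_event_clients` and `max_dyn_ec` are made up for this example.

```shell
#!/bin/sh
# Count event clients in `qconf -secl`-style output read from stdin.
# Assumption: data rows start with a numeric ID; header and separator
# rows do not. The list may include non-dynamic clients such as the
# scheduler, so treat the count as an upper bound.
count_event_clients() {
    awk '$1 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}

# Print the MAX_DYN_EC value found on the qmaster_params line of
# `qconf -sconf`-style output read from stdin; prints nothing if unset.
max_dyn_ec() {
    sed -n 's/^qmaster_params.*MAX_DYN_EC=\([0-9][0-9]*\).*/\1/p'
}
```

On a submit host this would be run as `qconf -secl | count_event_clients` and `qconf -sconf | max_dyn_ec`. If the first number gets close to the second while many `qsub -sync y` jobs are in flight, you are near the limit.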