Using call system() causes program to hang... 50% of the time


I have developed a Fortran code, compiled with ifort, whose memory requirements scale with the size of the problem. After initialization of the problem (allocation of arrays, etc.), the main part of the code loops through a series of function calls.

One of these functions includes 3-5 call system() commands. Some are simple and only copy directories, such as:

call system('cp -r plot_files plot_files1')

Another one actually calls mpiexec, which runs a separate program.
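For reference, the outcome of such a call can be made visible by checking its status. The following is a minimal sketch, assuming a compiler that supports the Fortran 2008 intrinsic execute_command_line in place of the non-standard call system(); the program and variable names are illustrative only. It does not address the cause of the hang, it only reports whether the command could be started and what it returned.

```fortran
program check_system_call
  implicit none
  integer :: exit_code, cmd_status
  character(len=256) :: cmd_msg

  ! Initialize, since exitstat and cmdmsg are only assigned in some cases.
  exit_code  = 0
  cmd_status = 0
  cmd_msg    = ''

  ! Run the same copy command as above and wait for it to finish.
  call execute_command_line('cp -r plot_files plot_files1', wait=.true., &
                            exitstat=exit_code, cmdstat=cmd_status,      &
                            cmdmsg=cmd_msg)

  if (cmd_status /= 0) then
    ! The command could not be executed at all.
    write(*,*) 'could not run command, cmdstat = ', cmd_status, ': ', trim(cmd_msg)
  else if (exit_code /= 0) then
    ! The command ran but reported a failure.
    write(*,*) 'cp returned a non-zero exit code: ', exit_code
  end if
end program check_system_call
```

A non-zero cmdstat would indicate that the command was never started, while a non-zero exitstat would indicate that cp itself failed.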

The problem is that the program 'hangs' on the system calls about 50% of the time, but only for large problems in which large arrays have been allocated (~ array(300000)).

By hang I mean that when I run qstat, the job is still shown as running, but searching for the PID using pstack, strace, or cat /proc/PID/status reveals that the PID no longer exists.

There are call system() commands earlier in the code, before the bulk of the initialization, and they never fail there; the failures occur only after the arrays have been allocated. This made me believe that it was a memory issue, but monitoring during the hang reveals that there is plenty of memory available.

[top -cbp PID before the program hangs][1]

I was originally compiling with OpenMP in the hope of parallelizing the code in the future. With OpenMP, the failure rate was around 80%. When OpenMP was taken out, the failure rate fell to around 50%. I've searched and searched for possible reasons for this problem and have come up empty-handed.

