Neither 'MPI_Barrier' nor 'BLACS_Barrier' stops a process from executing its commands


I'm working with ScaLAPACK and trying to get used to the BLACS routines, which are essential for using ScaLAPACK.

I've had an elementary course on MPI, so I have a rough idea of the MPI_COMM_WORLD stuff, but no deep understanding of how it works internally and so on.

Anyway, I'm trying the following code to say hello using BLACS routines.

   program hello_from_BLACS
     use MPI
     implicit none

     integer  :: info, nproc, nprow, npcol, &
                 myid, myrow, mycol, &
                 ctxt, ctxt_sys, ctxt_all

     call BLACS_PINFO(myid, nproc)

     ! get the internal default context
     call BLACS_GET(0, 0, ctxt_sys)

     ! set up a process grid for the process set
     ctxt_all = ctxt_sys
     call BLACS_GRIDINIT(ctxt_all, 'c', nproc, 1)
     call BLACS_BARRIER(ctxt_all, 'A')

     ! set up a process grid of size 3*2
     ctxt = ctxt_sys
     call BLACS_GRIDINIT(ctxt, 'c', 3, 2)

     if (myid .eq. 0) then
       write(6,*) '                          myid       myrow       mycol       nprow       npcol'
     endif

     call BLACS_BARRIER(ctxt_sys, 'A')    ! (**)

     ! all processes not belonging to 'ctxt' jump to the end of the program
     if (ctxt .lt. 0) goto 1000

     ! get the process coordinates in the grid
     call BLACS_GRIDINFO(ctxt, nprow, npcol, myrow, mycol)
     write(6,*) 'hello from process', myid, myrow, mycol, nprow, npcol

1000 continue

     ! return all BLACS contexts
     call BLACS_EXIT(0)
     stop
   end program

and the output with 'mpirun -np 10 ./exe' looks like this:

 hello from process           0           0           0           3           2
 hello from process           4           1           1           3           2
 hello from process           1           1           0           3           2
                           myid       myrow       mycol       nprow       npcol
 hello from process           5           2           1           3           2
 hello from process           2           2           0           3           2
 hello from process           3           0           1           3           2

Everything seems to work fine, except for the 'BLACS_BARRIER' line, which I marked with (**) in the code.

I put that line in so that the output would look like the one below, with the title line always printed at the top:

                           myid       myrow       mycol       nprow       npcol
 hello from process           0           0           0           3           2
 hello from process           4           1           1           3           2
 hello from process           1           1           0           3           2
 hello from process           5           2           1           3           2
 hello from process           2           2           0           3           2
 hello from process           3           0           1           3           2

So my questions are:

  1. I've tried BLACS_BARRIER with 'ctxt_sys', 'ctxt_all', and 'ctxt', but none of them produces output in which the title line is printed first. I've also tried MPI_Barrier(MPI_COMM_WORLD, info), but it didn't work either. Am I using the barriers in the wrong way?

  2. In addition, I got SIGSEGV when I used BLACS_BARRIER with 'ctxt' and ran mpirun with more than 6 processes. Why does the SIGSEGV occur in this case?

Thank you for reading this question.

1 Answer

Ian Bush (accepted answer)

To answer your 2 questions (in future it is best to give them separate posts):

1) MPI_Barrier, BLACS_Barrier, and any barrier in any parallel programming methodology I have come across only synchronise the set of processes that actually calls them. However, I/O is not dealt with just by the calling process, but by at least one, and quite possibly more, processes within the OS which actually handle the I/O request. These are NOT synchronised by your barrier, and so the ordering of I/O is not ensured by a simple barrier. The only standard-conforming ways that I can think of to ensure ordering of I/O are

  • Have one process do all the I/O (see the sketch after this list), or
  • Better, use MPI I/O, either directly or indirectly via e.g. NetCDF or HDF5
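
For the first option, a minimal, untested sketch of your hello program rewritten so that process 0 does all the printing might look like the following. It assumes, as your code already does, the MPI-based BLACS, so that MPI_COMM_WORLD is usable once BLACS_PINFO has been called; the program and variable names are just placeholders, and it should be run on at least 6 processes (e.g. mpirun -np 10).

   program hello_one_writer
     use MPI
     implicit none

     integer :: nproc, myid, nprow, npcol, myrow, mycol, &
                ctxt, ierr, p
     integer :: mine(5)
     integer, allocatable :: everyone(:,:)

     call BLACS_PINFO(myid, nproc)
     call BLACS_GET(0, 0, ctxt)
     call BLACS_GRIDINIT(ctxt, 'c', 3, 2)

     ! processes outside the 3*2 grid just report -1 coordinates
     ! (see point 2 below for why they must not touch ctxt)
     nprow = 3; npcol = 2; myrow = -1; mycol = -1
     if (myid < 3*2) call BLACS_GRIDINFO(ctxt, nprow, npcol, myrow, mycol)

     ! funnel everything through process 0: gather the data first,
     ! then let a single process do all the writing
     mine = (/ myid, myrow, mycol, nprow, npcol /)
     allocate(everyone(5, nproc))
     call MPI_Gather(mine, 5, MPI_INTEGER, everyone, 5, MPI_INTEGER, &
                     0, MPI_COMM_WORLD, ierr)

     if (myid == 0) then
       write(6,*) '                          myid       myrow       mycol       nprow       npcol'
       do p = 1, nproc
         write(6,*) 'hello from process', everyone(:, p)
       end do
     end if

     call BLACS_EXIT(0)
   end program hello_one_writer

Because only process 0 writes, the header and the per-process lines now come out in one well-defined order, no barrier needed.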

2) Your second call to BLACS_GRIDINIT

 call BLACS_GRIDINIT(ctxt, 'c', 3, 2)

creates a context for a 3 by 2 process grid, so holding 6 processes. If you call it with more than 6 processes, only 6 will be returned a valid context; for the others, ctxt should be treated as an uninitialised value. So, for instance, if you call it with 8 processes, 6 will return with a valid ctxt and 2 will return with ctxt having no valid value. If these 2 now try to use ctxt anything is possible, and in your case you are getting a seg fault. You do seem to see that this is an issue, as later you have

 ! all processes not belonging to 'ctxt' jump to the end of the program
 if (ctxt .lt. 0) goto 1000

but I see nothing in the description of BLACS_GRIDINIT that ensures ctxt will be less than zero for non-participating processes - at https://www.netlib.org/blacs/BLACS/QRef.html#BLACS_GRIDINIT it says

This routine creates a simple NPROW x NPCOL process grid. This process grid will use the first NPROW x NPCOL processes, and assign them to the grid in a row- or column-major natural ordering. If these process-to-grid mappings are unacceptable, BLACS_GRIDINIT's more complex sister routine BLACS_GRIDMAP must be called instead.

There is no mention of what ctxt will be if the process is not part of the resulting grid - this is the kind of problem I find regularly with the BLACS documentation. Also please don't use goto, for your own sake. You WILL regret it later. Use If ... End If. I can't remember when I last used goto in Fortran, it may well be over 10 years ago.
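
For instance, relying only on the documented behaviour that the grid takes the first NPROW x NPCOL processes, the goto section of your program could be replaced by an If ... End If along these lines (an untested sketch, keeping your variable names):

     ! only the first 3*2 processes belong to 'ctxt', so test myid
     ! rather than relying on an undocumented value of ctxt
     if (myid .lt. 3*2) then
       ! get the process coordinates in the grid
       call BLACS_GRIDINFO(ctxt, nprow, npcol, myrow, mycol)
       write(6,*) 'hello from process', myid, myrow, mycol, nprow, npcol
     end if

With that in place the goto and the 1000 continue label disappear, and no process ever touches a ctxt it was not given.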

Finally, good luck in using BLACS! In my experience the documentation is often incomplete, and I would suggest only using those calls that are absolutely necessary for ScaLAPACK, and using MPI, which is much, much better defined, for the rest. It would be so much nicer if ScaLAPACK just worked with MPI nowadays.
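
To make that last suggestion concrete, here is the rough skeleton I have in mind - an untested sketch, not a recipe. It assumes the MPI-based BLACS, where the BLACS process numbering matches the rank in MPI_COMM_WORLD: MPI owns start-up, shutdown and synchronisation, and the BLACS calls are kept to the context and grid handling that ScaLAPACK needs. The non-zero argument to BLACS_EXIT tells the BLACS to release their resources but leave MPI running, so it can be finalised by the caller.

   program scalapack_skeleton
     use MPI
     implicit none

     integer, parameter :: nprow = 3, npcol = 2
     integer :: ierr, myid, nproc, ctxt, myrow, mycol, nr, nc

     ! MPI owns start-up and shutdown
     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

     ! only the BLACS calls ScaLAPACK actually needs: a context and a grid
     call BLACS_GET(0, 0, ctxt)
     call BLACS_GRIDINIT(ctxt, 'c', nprow, npcol)

     if (myid < nprow*npcol) then
       call BLACS_GRIDINFO(ctxt, nr, nc, myrow, mycol)
       ! ... ScaLAPACK calls on the nprow*npcol grid would go here ...
       call BLACS_GRIDEXIT(ctxt)
     end if

     ! synchronisation, I/O control etc. done with MPI, not BLACS
     call MPI_Barrier(MPI_COMM_WORLD, ierr)

     ! non-zero argument: release BLACS resources but leave MPI running
     call BLACS_EXIT(1)
     call MPI_Finalize(ierr)
   end program scalapack_skeleton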