As I have found no answer to my question so far and am on the edge of going crazy about the problem, I'll just ask the question tormenting my mind ;-)
I'm working on parallelizing a node-elimination algorithm I have already programmed. The target environment is a cluster.
In my parallel program I distinguish between the master process (in my case rank 0) and the working slaves (every rank except 0). My idea is that the master keeps track of which slaves are available and then sends them work. For this and some other reasons I am trying to establish a workflow based on passive RMA with lock-put-unlock sequences. I use an integer array named schedule in which, for every position representing a rank, there is either a 0 for a working process or a 1 for an available process (so if schedule[1]=1, rank 1 is available for work). If a process is done with its work, it puts a 1 into the array on the master, signalling its availability. The code I tried for that is as follows:
MPI_Win_lock(MPI_LOCK_EXCLUSIVE,0,0,win); // an exclusive lock is taken on the window of process 0
printf("Process %d:\t exclusive lock on process 0 started\n",myrank);
MPI_Put(&schedule[myrank],1,MPI_INT,0,0,1,MPI_INT,win); // the line myrank of schedule is put into process 0
printf("Process %d:\t put operation called\n",myrank);
MPI_Win_unlock(0,win); // the window is unlocked
This worked perfectly, especially when the master process was synchronized with a barrier at the end of the lock, because then the master's output was produced after the put operation.
As a next step I tried to let the master check on a regular basis whether there are available slaves or not. For this I created a while loop that repeats until every process has signalled its availability (I repeat that this is a program teaching me the principles; I know the implementation still doesn't do what I want). In its basic variant the loop just prints my array schedule and then checks in a function fnz whether there are working processes other than the master:
while(j!=1){
    printf("Process %d:\t following schedule evaluated:\n",myrank);
    for(i=0;i<size;i++)printf("%d\t",schedule[i]); // print the schedule
    printf("\n");
    j=fnz(schedule);
}
And then the concept blew up. After inverting the process, so that the master fetches the required information with get from the slaves instead of the slaves putting it with put onto the master, I found out that my main problem is acquiring the lock: the unlock command doesn't succeed because, in the case of the put, the lock isn't granted at all, and in the case of the get, the lock is only granted when the slave process is done with its work and waiting in a barrier. In my opinion there has to be a serious error in my thinking. It can't be the idea of passive RMA that the lock can only be acquired when the target process is in a barrier synchronizing the whole communicator; then I could just as well go along with standard Send/Recv operations. What I want to achieve is that process 0 works all the time delegating work, and is able to identify, via RMA performed by the slaves, to whom it can delegate. Can someone please help me and explain how I can get a break on process 0 to allow the other processes to acquire locks?
Thank you in advance!
UPDATE: I'm not sure if you have ever worked with a lock, and I just want to stress that I'm perfectly able to get an updated copy of a remote memory window. If I get the availability from the slaves, the lock is only granted when the slaves are waiting in a barrier. What I got to work is that process 0 performs lock-get-unlock while processes 1 and 2 simulate work, such that process 2 is occupied remarkably longer than process 1. What I expect as a result is that process 0 prints the schedule (0,1,0), because process 0 isn't asked at all whether it's working, process 1 is done with its work and process 2 is still working. In the next step, when process 2 is ready, I expect the output (0,1,1), since both slaves are then ready for new work. What I actually get is that the slaves only grant the lock to process 0 when they are waiting in a barrier, so the first and only output I get at all is the last one I expect, showing me that the lock was granted for each individual process only once it was done with its work. So if someone could please tell me when a lock can be granted by the target process, instead of trying to confuse my knowledge about passive RMA, I would be very grateful.
First of all, the passive RMA mechanism does not somehow magically poke into the remote process' memory since not many MPI transports have real RDMA capabilities and even those that do (e.g. InfiniBand) require a great deal of not-that-passive involvement of the target in order to allow for passive RMA operations to happen. This is explained in the MPI standard but in the very abstract form of public and private copies of the memory exposed through an RMA window.
Achieving working and portable passive RMA with MPI-2 involves several steps.
Step 1: Window allocation in the target process
For portability and performance reasons, the memory for the window should be allocated with MPI_ALLOC_MEM.
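A sketch of such an allocation at the master, assuming the window win exposes the schedule array from the question (error checking omitted):

```c
int *schedule;
MPI_Win win;

// Allocate the window memory with MPI_Alloc_mem instead of malloc so the
// MPI library can place it in specially registered (e.g. pinned) memory.
MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);
for (int i = 0; i < size; i++)
    schedule[i] = 0;                     // all ranks start out as "working"

// Expose it through an RMA window; disp_unit is sizeof(int), so target
// displacements in MPI_Put/MPI_Get count array elements, not bytes.
MPI_Win_create(schedule, size * sizeof(int), sizeof(int),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);

/* ... RMA epochs ... */

MPI_Win_free(&win);
MPI_Free_mem(schedule);
```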
Step 2: Memory synchronisation at the target
The MPI standard forbids concurrent conflicting accesses to the same location in the window (§11.3 of the MPI-2.2 specification).
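In concrete terms, the master's polling loop from the question has to read schedule[] only inside a lock/unlock epoch on its own window; a minimal sketch, using the names from the question:

```c
// The master reads its own exposed memory only inside a lock/unlock epoch.
// A shared lock suffices here, since this code only reads the locations.
MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);   // lock own window on rank 0
for (int i = 0; i < size; i++)
    printf("%d\t", schedule[i]);            // now safe to read schedule[]
printf("\n");
MPI_Win_unlock(0, win);
```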
Therefore each access to schedule[] at the target has to be protected by a lock (a shared one, since it only reads the memory location).

Another reason for locking the window at the target is to provide entries into the MPI library and thus facilitate progression of the local part of the RMA operation. MPI provides portable RMA even when using transports without RDMA capabilities, e.g. TCP/IP or shared memory, and that requires a lot of active work (called progression) to be done at the target in order to support "passive" RMA. Some libraries provide asynchronous progression threads that can progress the operation in the background, e.g. Open MPI when configured with --enable-opal-multi-threads (disabled by default), but relying on such behaviour results in non-portable programs. That's why the MPI standard allows for relaxed semantics of the put operation (§11.7, p. 365); this is also illustrated in Example 11.12 in the same section of the standard (p. 367). And indeed, both Open MPI and Intel MPI do not update the value of schedule[] if the lock/unlock calls in the code of the master are commented out. The MPI standard gives further advice on this in §11.7, p. 366.

Step 3: Providing the correct parameters to
MPI_PUT at the origin

The call

MPI_Put(&schedule[myrank],1,MPI_INT,0,0,1,MPI_INT,win);

uses a target displacement of 0 and would therefore transfer everything into the first element of the target window. The correct invocation, given that the window at the target was created with disp_unit == sizeof(int), must pass myrank as the target displacement.
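A sketch of the corrected call, assuming an int variable one that holds the value 1 and the same window win as before:

```c
int one = 1;

// disp_unit at the target is sizeof(int), so the target displacement is
// counted in ints: "myrank" addresses the myrank-th element of schedule[].
MPI_Put(&one, 1, MPI_INT, 0, myrank, 1, MPI_INT, win);

// Had the window been created with disp_unit == 1 instead, the
// displacement would have to be given in bytes:
// MPI_Put(&one, 1, MPI_INT, 0, myrank * sizeof(int), 1, MPI_INT, win);
```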
The local value of one is thus transferred into the location rank * sizeof(int) bytes past the beginning of the window at the target. If disp_unit had been set to 1, the put would have to specify the displacement in bytes, i.e. myrank * sizeof(int).

Step 4: Dealing with implementation specifics
The above detailed program works out of the box with Intel MPI. With Open MPI one has to take special care. The library is built around a set of frameworks and implementing modules. The osc (one-sided communication) framework comes in two implementations, rdma and pt2pt. The default (in Open MPI 1.6.x and probably earlier) is rdma, and for some reason it does not progress the RMA operations at the target side when MPI_WIN_(UN)LOCK is called, which leads to deadlock-like behaviour unless another communication call is made (MPI_BARRIER in your case). On the other hand, the pt2pt module progresses all operations as expected. Therefore with Open MPI one has to launch the program in a way that specifically selects the pt2pt component.

A fully working C99 sample code follows:
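A minimal C99 sketch combining the four steps above (the program name sched, the sleep-based simulated work and the exact loop structure are assumptions, not the original sample):

```c
// Rank 0 exposes schedule[] through a window backed by MPI_Alloc_mem and
// polls it under a shared lock; every other rank "works" and then puts a 1
// into its own slot under an exclusive lock.
// Build: mpicc -std=c99 -o sched sched.c
// Run  : mpiexec --mca osc pt2pt -n 6 ./sched   (Open MPI: select pt2pt osc)
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *schedule = NULL;
    MPI_Win win;

    if (rank == 0) {
        // Step 1: allocate the exposed memory with MPI_Alloc_mem
        MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);
        for (int i = 0; i < size; i++) schedule[i] = 0;
        MPI_Win_create(schedule, size * sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        int done = 0;
        while (!done) {
            // Step 2: read the own window only inside a lock epoch; this
            // also progresses pending RMA operations at the target
            MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
            done = 1;
            printf("Schedule:");
            for (int i = 1; i < size; i++) {
                printf(" %d", schedule[i]);
                if (schedule[i] == 0) done = 0;
            }
            printf("\n");
            MPI_Win_unlock(0, win);
        }
        printf("All workers are available\n");
    } else {
        // Workers expose no memory of their own
        MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        sleep(rank);                       // simulate work of varying length

        // Step 3: put a 1 into slot "rank" of the window on rank 0
        int one = 1;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
        MPI_Win_unlock(0, win);
    }

    MPI_Win_free(&win);
    if (rank == 0) MPI_Free_mem(schedule);
    MPI_Finalize();
    return 0;
}
```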
Sample output with 6 processes: