Open MPI Broadcast Latency Measurement


From reading the documentation, MPI_Bcast is a blocking call (and therefore boost::mpi::broadcast is as well). Is measuring the amount of time the root node spends in the broadcast a good measure of the time it takes for the data to get from the root node to all the other nodes?

I.e.

int64_t t1 = Utility::picosecondTime(); //exactly what it sounds like; utility that measures current time in picoseconds
boost::mpi::broadcast(communicator, variable, 0);
std::cout << "Broadcast took " << Utility::picosecondTime()-t1 << std::endl;

Or, using plain MPI directly:

MPI_Comm comm; 
int array[100]; 
...
int64_t t1 = Utility::picosecondTime(); 
MPI_Bcast( array, 100, MPI_INT, 0, comm); 
std::cout << "Broadcast took " << Utility::picosecondTime()-t1 << std::endl;

There are 2 answers

Wesley Bland (BEST ANSWER)

An MPI_BCAST is usually implemented in some sort of tree fashion where processes at the top of the tree can exit the algorithm after they've finished their part of the broadcast. So if rank 0 sends messages to ranks 1 and n/2, then it can leave after those messages are done. So the answer to your question is: no, that's not an accurate measurement.

Without accurately synchronized clocks, it's difficult to actually measure the time that a full broadcast takes across all nodes. There are tricks you can do to get closer (use an MPI_BARRIER to synchronize before taking the start time and use the longest time spent by any process in the broadcast), but as clocks still tend to have some drift, nothing will be perfect.
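
A minimal sketch of that trick with plain MPI (using MPI_Wtime rather than the asker's picosecond utility; the buffer size is arbitrary): synchronize with a barrier, time the broadcast on every rank, then reduce with MPI_MAX so the reported figure reflects the slowest rank.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int array[100] = {0};

    MPI_Barrier(MPI_COMM_WORLD);          // synchronize before starting the clock
    double t0 = MPI_Wtime();
    MPI_Bcast(array, 100, MPI_INT, 0, MPI_COMM_WORLD);
    double local = MPI_Wtime() - t0;      // per-rank elapsed time

    // The broadcast is only globally complete once the slowest rank is done,
    // so report the maximum of the per-rank times.
    double longest = 0.0;
    MPI_Reduce(&local, &longest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("Broadcast took %g s (max over ranks)\n", longest);

    MPI_Finalize();
    return 0;
}

Taking the maximum of per-rank intervals also sidesteps comparing timestamps from different nodes; the remaining blur comes from how unevenly ranks leave the barrier, which is the imperfection mentioned above.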

Hristo Iliev

You are confusing blocking calls with globally synchronous calls. The only collective MPI operation that is guaranteed to be globally synchronous is MPI_BARRIER - it won't complete until all processes have called into it. MPI allows the other collective calls to return as soon as the calling process no longer needs to participate.

MPI_BCAST is pretty much an example of the latter. Open MPI provides several implementations, among them: basic linear, binary tree, binomial tree, and pipeline/chain. Each implementation can further segment the message. A hardcoded heuristic is used at runtime to select one particular implementation based on the message size and the size of the communicator. From the point of view of the root rank, the same broadcast operation could take a different amount of time depending on the algorithm used.

  • basic linear - this one uses a bunch of MPI_ISENDs followed by MPI_WAITALL and therefore completes only once all sends have completed and the information has reached all other ranks;
  • binary/binomial tree - this one completes once the message has been transmitted to the ranks that are direct descendants of the root node; it could still need more time to reach all nodes in the tree;
  • pipeline - this one segments the message and implements a pipeline that hands the segments from the root rank to the next rank, which in turn hands them on to the one after it, and so on; with large messages and on very fast networks (e.g. InfiniBand), completion of the operation at the root means that global completion is almost imminent;
  • chain - this one divides the processes into several groups and then implements a separate pipeline for each group.

You can enforce a certain algorithm to be used by Open MPI - just take a look at the possible values of the coll_tuned_bcast_algorithm MCA parameter and the other related parameters:

$ ompi_info --mca coll_tuned_use_dynamic_rules 1 --param coll tuned | grep bcast
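
For example, something along these lines forces one algorithm for a whole run (the executable name and process count are placeholders, and the numeric value-to-algorithm mapping is version-specific, so check the ompi_info output above for your installation):

$ mpirun -np 16 --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 6 ./bcast_bench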

The proper way to measure the time for MPI_BCAST would be to surround it with MPI_BARRIER calls, but then you'd also have to correctly measure the overhead of the barrier call itself and compensate for it.
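
A rough sketch of that compensation (assuming array and comm are set up as in the question and rank holds the calling process's rank; reps is an arbitrary repetition count, and MPI_Wtime stands in for the picosecond utility):

const int reps = 1000;

// First estimate the cost of a barrier on its own.
double tb = MPI_Wtime();
for (int i = 0; i < reps; ++i)
    MPI_Barrier(comm);
double barrier_cost = (MPI_Wtime() - tb) / reps;

// Then time barrier-bracketed broadcasts and subtract that overhead.
MPI_Barrier(comm);
double t0 = MPI_Wtime();
for (int i = 0; i < reps; ++i) {
    MPI_Bcast(array, 100, MPI_INT, 0, comm);
    MPI_Barrier(comm);                  // wait for the slowest rank each iteration
}
double per_bcast = (MPI_Wtime() - t0) / reps - barrier_cost;

if (rank == 0)
    std::cout << "Broadcast took ~" << per_bcast << " s on average" << std::endl;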