I have tried to write a simple program that scatters some data among MPI processes and reduces the partial results, while in the meantime sending a broadcast to the other processes in a non-blocking way.
Almost all of my attempts end in a deadlock, and I cannot understand why, since the broadcast call is non-blocking.
This is my code:
#include <iostream>
#include <algorithm>
#include <memory>
#include <random>
#include "mpi.h"

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int num_tasks;
    MPI_Comm_size(MPI_COMM_WORLD, &num_tasks);
    const int num_elements = 1 << 10;
    const int chunk_size = num_elements / num_tasks;

    int task_id;
    int local_buffer = 0;  // initialized so non-root ranks don't print garbage
    MPI_Request req;
    MPI_Comm_rank(MPI_COMM_WORLD, &task_id);

    std::unique_ptr<int[]> send_ptr;
    if (task_id == 0) {
        local_buffer = 42;
        send_ptr = std::make_unique<int[]>(num_elements);
        std::random_device rd;
        std::mt19937 mt(rd());
        std::uniform_int_distribution dist(1, 1);
        std::generate(send_ptr.get(), send_ptr.get() + num_elements,
                      [&] { return dist(mt); });
    }
    std::cout << "Processor : " << task_id << " declares : " << local_buffer << std::endl;

    auto recv_buffer = std::make_unique<int[]>(chunk_size);
    std::cout << "Before scatter" << std::endl;
    MPI_Scatter(send_ptr.get(), chunk_size, MPI_INT,
                recv_buffer.get(), chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
    std::cout << "After scatter" << std::endl;

    int local_result = 0;
    for (int i = 0; i < chunk_size; i++) {
        std::cout << "Processor: " << task_id << " completion: "
                  << ((float)i / chunk_size) * 100 << "%\n";
        local_result += recv_buffer[i];
    }

    MPI_Ibcast(&local_buffer, 1, MPI_INT, 0, MPI_COMM_WORLD, &req);

    int global_result;
    MPI_Reduce(&local_result, &global_result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (task_id == 0) {
        std::cout << "global result : " << global_result << "\n";
    }

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    std::cout << "Processor : " << task_id << " receives the value : " << local_buffer << std::endl;

    MPI_Finalize();
    return 0;
}
I have tried changing all of the calls to their non-blocking versions, but I often get wrong results. Getting rid of the final waits clears the deadlock, but at the cost of wrong output as well.
Removing the broadcast results in correct output with no deadlock, every time.
What I need is simply a program that sends a broadcast at the end of the sum; later I want to modify it so that the first process to finish its sum is the one that sends the broadcast.
I would also point out that removing the Scatter and Reduce calls allows the program to finish without deadlock. I thought the problem might be caused by some interaction between the Ibcast and the Scatter/Reduce operations, but that does not make much sense to me, since they work on different buffers and the broadcast is non-blocking.
Moving the broadcast call before the Scatter also results in the Scatter never being executed and a frozen execution.