I'm working on a non-distributed C++ library that offers an object serialization interface. This interface can serialize objects into byte streams roughly like this:
#include <sstream>
#include <string>

std::stringstream ss;
obj.writeTo(ss);
std::string serialized_obj = ss.str();
The size of these serialized objects varies, because computations between objects change the lengths of their internal data structures. Given this variability, I'm looking for recommendations on how to implement allreduce or allgather operations over these serialized objects.
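For context, the textbook pattern I'm aware of for variable-size allgather is to exchange byte counts first and then use MPI_Allgatherv. A minimal sketch of that pattern follows; `allgatherSerialized` is a hypothetical helper name, not part of my library:

```cpp
#include <mpi.h>
#include <numeric>
#include <string>
#include <vector>

// Gather one variably-sized serialized blob from every rank.
std::vector<std::string> allgatherSerialized(const std::string& mine,
                                             MPI_Comm comm) {
    int nranks = 0;
    MPI_Comm_size(comm, &nranks);

    // 1. Exchange the byte counts of each rank's blob.
    int my_size = static_cast<int>(mine.size());
    std::vector<int> sizes(nranks);
    MPI_Allgather(&my_size, 1, MPI_INT, sizes.data(), 1, MPI_INT, comm);

    // 2. Compute displacements: exclusive prefix sum of the sizes.
    std::vector<int> displs(nranks, 0);
    std::partial_sum(sizes.begin(), sizes.end() - 1, displs.begin() + 1);

    // 3. Exchange the payloads in a single collective.
    std::vector<char> recvbuf(displs.back() + sizes.back());
    MPI_Allgatherv(mine.data(), my_size, MPI_CHAR,
                   recvbuf.data(), sizes.data(), displs.data(),
                   MPI_CHAR, comm);

    // 4. Split the flat receive buffer back into per-rank strings.
    std::vector<std::string> out;
    out.reserve(nranks);
    for (int r = 0; r < nranks; ++r)
        out.emplace_back(recvbuf.data() + displs[r], sizes[r]);
    return out;
}
```

This costs one extra small collective for the size exchange; I don't know whether that is competitive with other approaches for my message sizes.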
Performance matters in my scenario, which currently relies on plain MPI. I'm unsure whether Boost.MPI could meet these requirements and whether it would cost (or perhaps gain) performance; in any case, introducing such a large dependency solely for these two operations doesn't seem like the right trade-off for me.
NOTE: I currently simulate allreduce with multiple rounds of butterfly-style MPI_Send/MPI_Recv, and allgather with multiple rounds of MPI_Bcast. Due to load imbalance, both implementations suffer from performance degradation.
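To make the current approach concrete, my butterfly allreduce looks roughly like the sketch below (simplified; it assumes the rank count is a power of two, and `reduce` stands in for my library's merge of two serialized objects):

```cpp
#include <mpi.h>
#include <string>

// log2(P) butterfly rounds; MPI_Probe learns the incoming message size,
// since serialized objects have variable length.
std::string butterflyAllreduce(std::string mine, MPI_Comm comm,
                               std::string (*reduce)(const std::string&,
                                                     const std::string&)) {
    int rank = 0, nranks = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nranks);  // assumed power of two here

    for (int mask = 1; mask < nranks; mask <<= 1) {
        int partner = rank ^ mask;

        // Nonblocking send so the symmetric exchange cannot deadlock.
        MPI_Request req;
        MPI_Isend(mine.data(), static_cast<int>(mine.size()), MPI_CHAR,
                  partner, 0, comm, &req);

        // Probe first so the receive buffer matches the sender's size.
        MPI_Status st;
        MPI_Probe(partner, 0, comm, &st);
        int incoming = 0;
        MPI_Get_count(&st, MPI_CHAR, &incoming);

        std::string theirs(incoming, '\0');
        MPI_Recv(theirs.data(), incoming, MPI_CHAR, partner, 0, comm,
                 MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        // Both partners apply the same merge, so all ranks converge.
        mine = reduce(mine, theirs);
    }
    return mine;
}
```

The problem is that a slow `reduce` on one rank stalls its partner in every subsequent round, which is where the load imbalance bites.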
Any insights or best practices on how to handle this situation efficiently would be greatly appreciated.