I'm stumped. `all_gather` works for primitives (e.g. `int`) but fails even for simple STL containers. valgrind claims that the container was not allocated/initialized, but that doesn't seem right.
In summary:
- I do some multi-threading with OpenMP, then rejoin threads.
- In serial, I try to `all_gather` a simple `std::map` using `boost::mpi::all_gather`. The MPI ranks are not the threads. (There are 2 MPI ranks, and each MPI rank has 4 threads.)
- Then I intend to do some more (isolated) multi-threading.
It seems so straightforward... what could possibly be going on here?
main.cpp
#include <openmpi/mpi.h>
#include <omp.h>
#include <boost/mpi.hpp>
#include "globals.h"
int main(int argc, char* argv[])
{
    int provided_MPI;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided_MPI);
    boost::mpi::environment my_boost_mpi_env(argc, argv);
    boost::mpi::communicator world_MPI_boost;
    world_MPI_boost_ptr = &world_MPI_boost;
    // ^^^ global variable of type boost::mpi::communicator *
    perform_complete_variable_elimination_schedule();
    //...
}
Conn_Comp.cpp
#include <boost/mpi.hpp>
#include <boost/mpi/collectives.hpp>
#include <boost/serialization/serialization.hpp>
#include <boost/serialization/vector.hpp>
#include <boost/serialization/map.hpp>
#include "globals.h"
...
void perform_complete_variable_elimination_schedule()
{
    // isolated work in parallel using OpenMP
    #pragma omp parallel
    {
        //work
    }
    // SERIAL REGION (with respect to threading).
    std::map<uint,uint> my_map;
    std::vector< std::map<uint,uint> > vec_of_my_maps;
    boost::mpi::all_gather< std::map<uint,uint> >
        (*world_MPI_boost_ptr,
         my_map,
         vec_of_my_maps); // <--- line 293 (referenced by valgrind)
    // more isolated work in parallel using OpenMP
    #pragma omp parallel
    {
        //work
    }
}
valgrind complains that the `vector` of `map` results in an invalid read. But this `vector` was created immediately preceding the `all_gather` call, so it is obviously in scope and not in a parallel-threaded region.
Selected valgrind error output:
==12665== Use of uninitialised value of size 4
==12665== at 0x41C8D7A: boost::archive::detail::basic_iarchive::get_library_version() const (basic_iarchive.cpp:575)
==12665== by 0x41C92C6: boost::archive::detail::basic_iarchive::load_object(void*, boost::archive::detail::basic_iserializer const&) (basic_iarchive.cpp:399)
==12665== by 0x80F5696: void boost::mpi::all_gather<std::map<unsigned int, unsigned int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned int> > > >(boost::mpi::communicator const&, std::map<unsigned int, unsigned int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned int> > > const&, std::vector<std::map<unsigned int, unsigned int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned int> > >, std::allocator<std::map<unsigned int, unsigned int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned int> > > > >&) (iserializer.hpp:387)
==12665== by 0x80DEC83: Conn_Comp::perform_complete_variable_elimination_schedule() (Conn_Comp.cpp:**293**)
==12665== by 0x80C840A: main (main.cpp:695)
==12665==
==12665== Invalid read of size 2
==12665== at 0x41C8D7A: boost::archive::detail::basic_iarchive::get_library_version() const (basic_iarchive.cpp:575)
==12665== by 0x41C92C6: boost::archive::detail::basic_iarchive::load_object(void*, boost::archive::detail::basic_iserializer const&) (basic_iarchive.cpp:399)
==12665== by 0x80F5696: void boost::mpi::all_gather<std::map<unsigned int, unsigned int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned int> > > >(boost::mpi::communicator const&, std::map<unsigned int, unsigned int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned int> > > const&, std::vector<std::map<unsigned int, unsigned int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned int> > >, std::allocator<std::map<unsigned int, unsigned int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned int> > > > >&) (iserializer.hpp:387)
==12665== by 0x80DEC83: Conn_Comp::perform_complete_variable_elimination_schedule() (main.cpp:**293**)
==12665== by 0x80C840A: main (main.cpp:695)
==12665== Address 0x3580bece is not stack'd, malloc'd or (recently) free'd
==12665==
[drosphila:12665] *** Process received signal ***
[drosphila:12665] Signal: Segmentation fault (11)
[drosphila:12665] Signal code: Address not mapped (1)
[drosphila:12665] Failing at address: 0x3580bece
[drosphila:12665] [ 0] /lib/i686/cmov/libpthread.so.0(+0xe500) [0x44f8500]
[drosphila:12665] [ 1] /usr/lib/libboost_serialization.so.1.42.0(_ZN5boost7archive6detail14basic_iarchive11load_objectEPvRKNS1_17basic_iserializerE+0x1b7) [0x41c92c7]
[drosphila:12665] [ 2] ./detect_NAHR(_ZN5boost3mpi10all_gatherISt3mapIjjSt4lessIjESaISt4pairIKjjEEEEEvRKNS0_12communicatorERKT_RSt6vectorISD_SaISD_EE+0x587) [0x80f5697]
[drosphila:12665] [ 3] ./detect_NAHR(_ZN9Conn_Comp46perform_complete_variable_elimination_scheduleEv+0x534) [0x80dec84]
[drosphila:12665] [ 4] ./detect_NAHR(main+0xf5b) [0x80c840b]
[drosphila:12665] [ 5] /lib/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x4519ca6]
[drosphila:12665] [ 6] ./detect_NAHR() [0x80c73e1]
[drosphila:12665] *** End of error message ***
I use MPI_Init_thread based on a recommendation from a boost help page.
As I said at the top, if I use a primitive (i.e. just `uint`) instead of a map, then the `all_gather` works fine. Why should the map fail? Boost.Serialization already has methods for serializing STL containers, so that is not the problem...
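To make concrete why a container needs a serialization step while a primitive does not: a `uint` is a fixed-size block of bytes, but a `std::map` owns heap-allocated nodes, so it has to be turned into a flat buffer before it can be sent and then rebuilt on the receiving side. Below is a minimal sketch of that idea using only the standard library. `flatten_map` and `unflatten_map` are hypothetical helper names chosen for illustration; this is not how Boost.Serialization or Boost.MPI implement it.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper (illustration only, not part of Boost).
// Flattens a map into the layout [count, k0, v0, k1, v1, ...].
std::vector<unsigned> flatten_map(const std::map<unsigned, unsigned>& m)
{
    std::vector<unsigned> buf;
    buf.reserve(1 + 2 * m.size());
    buf.push_back(static_cast<unsigned>(m.size()));
    for (std::map<unsigned, unsigned>::const_iterator it = m.begin();
         it != m.end(); ++it) {
        buf.push_back(it->first);
        buf.push_back(it->second);
    }
    return buf;
}

// Hypothetical helper (illustration only, not part of Boost).
// Rebuilds a map from a buffer produced by flatten_map.
// Returns an empty map if the buffer length does not match its count field.
std::map<unsigned, unsigned> unflatten_map(const std::vector<unsigned>& buf)
{
    std::map<unsigned, unsigned> m;
    if (buf.empty() ||
        buf.size() != 1 + 2 * static_cast<std::size_t>(buf[0])) {
        return m;
    }
    for (std::size_t i = 1; i < buf.size(); i += 2) {
        m[buf[i]] = buf[i + 1];
    }
    return m;
}
```

Passing a map through `flatten_map` and then `unflatten_map` should give back an equal map. The flat buffer is the kind of data that can be handed to a primitive-only transfer.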
Note also that the vector which will hold all of the values is automatically resized in `all_gather` (I checked the implementation for `all_gather`) to be big enough to hold everything. Regardless, even if I initialize it myself, it still fails.
Finally, even if I use a plain old array (properly allocated), e.g. `std::map<uint,uint> *`, I get the same problem.
Well, this is embarrassing. I'm going to leave the question up in case anybody else has the same strange errors.
The problem with my code was actually in the makefile. I forgot to link to the boost library for MPI.
Incorrect makefile flags:
Apparently that line contains just enough information to allow the program to compile and run, but results in a runtime error.
Correct makefile flags:
(Notice the addition of the library linking flags).