I am serializing a set of contents, indexed by multiple properties using boost::multi_index_container, together with a params struct, into a binary archive which I want to deserialize later. But the archive created using Boost 1.74 is unreadable (Invalid or corrupted archive) when read using Boost 1.83.
I have included an MRE in a git repo. Although it is a single small cpp file, I made a repo so I could share it together with the CMakeLists.txt and the Dockerfile. The following is my content struct:
struct content{
    friend class boost::serialization::access;
    using angle_type = std::size_t;
    inline content(angle_type angle): _angle(angle) {}
    inline angle_type angle() const { return _angle; }
    void reset_angle_random(){
        static std::random_device dev;
        static std::mt19937 rng_angle(dev());
        std::uniform_int_distribution<> angle_dist(0, 180);
        _angle = angle_dist(rng_angle);
    }
    void freeze(){
        // complicated deterministic business logic
        _angle = 0;
    }
    content frozen() const{
        mre::content copy(*this);
        copy.freeze();
        return copy;
    }
    static content generate(){
        static std::random_device dev;
        static std::mt19937 rng(dev());
        std::uniform_real_distribution<> dist_length(-0.5f, 0.5f);
        mre::content content{0};
        content._length = dist_length(rng);
        content.reset_angle_random();
        return content;
    }
    template<class Archive>
    void serialize(Archive & ar, const unsigned int version) {
        ar & boost::serialization::make_nvp("length", _length);
        ar & boost::serialization::make_nvp("angle", _angle);
    }
    friend std::size_t hash_value(content const& c){
        std::size_t seed = 0;
        boost::hash_combine(seed, c._length);
        boost::hash_combine(seed, c._angle);
        return seed;
    }
    inline std::size_t hash() const { return boost::hash<mre::content>{}(*this); }
    inline std::size_t frozen_id() const { return frozen().hash(); }
    inline std::string id() const { return (boost::format("%1%~%2%-%3%") % frozen_id() % hash() % angle()).str(); }
    inline bool operator<(const content& other) const { return id() < other.id(); }
private:
    double _length;
    angle_type _angle;
private:
    content() = default;
};
The actual code I am working on is much larger and does not use the content struct mentioned here. The above mentioned content struct is a highly reduced version to make a minimal reproducible example. Following is my multi index container setup.
struct package{
    friend class boost::serialization::access;
    struct tags{
        struct id{};
        struct content{};
        struct angle{};
        struct frozen{};
    };
    using container = boost::multi_index_container<
        mre::content,
        boost::multi_index::indexed_by<
            boost::multi_index::ordered_unique<boost::multi_index::identity<mre::content>>,
            boost::multi_index::ordered_unique<boost::multi_index::tag<tags::id>, boost::multi_index::const_mem_fun<mre::content, std::string, &mre::content::id>>,
            boost::multi_index::ordered_non_unique<boost::multi_index::tag<tags::content>, boost::multi_index::const_mem_fun<mre::content, std::size_t, &mre::content::hash>>,
            boost::multi_index::ordered_non_unique<boost::multi_index::tag<tags::angle>, boost::multi_index::const_mem_fun<mre::content, mre::content::angle_type, &mre::content::angle>>,
            boost::multi_index::ordered_non_unique<boost::multi_index::tag<tags::frozen>, boost::multi_index::const_mem_fun<mre::content, std::size_t, &mre::content::frozen_id>>
        >
    >;
    inline explicit package(const mre::parameters& params): _loaded(false), _parameters(params) {}
    inline explicit package(): _loaded(false) {}
    void save(const std::string& filename) const;
    void load(const std::string& filename);
    inline std::size_t size() const { return _samples.size(); }
    inline bool loaded() const { return _loaded; }
    const mre::content& operator[](const std::string& id) const;
    const mre::parameters& params() const { return _parameters; }
    template<class Archive>
    void serialize(Archive & ar, const unsigned int version) {
        ar & boost::serialization::make_nvp("samples", _samples);
        ar & boost::serialization::make_nvp("params", _parameters);
    }
public:
    std::size_t generate(std::size_t contents, std::size_t angles);
private:
    bool _loaded;
    container _samples;
    mre::parameters _parameters;
};
I am also serializing the set of parameters shown below.
struct parameters{
    std::size_t degree;
    std::size_t frame_size;
    template<class Archive>
    void serialize(Archive & ar, const unsigned int version) {
        ar & boost::serialization::make_nvp("degree", degree);
        ar & boost::serialization::make_nvp("frame_size", frame_size);
    }
};
Saving, loading and generating are done as follows:
void mre::package::save(const std::string& filename) const {
    std::ofstream stream(filename, std::ios::binary);
    try{
        boost::archive::binary_oarchive out(stream, boost::archive::no_tracking);
        std::cout << "serialization library version: " << out.get_library_version() << std::endl;
        out << *this;
    } catch(const std::exception& e){
        std::cout << "Error saving archive: " << e.what() << std::endl;
    }
    stream.close();
}
void mre::package::load(const std::string& filename){
    std::ifstream stream(filename, std::ios::binary);
    try{
        boost::archive::binary_iarchive in(stream, boost::archive::no_tracking);
        std::cout << "serialization library version: " << in.get_library_version() << std::endl;
        in >> *this;
        _loaded = true;
    } catch(const std::exception& e){
        std::cout << "Error loading archive: " << e.what() << std::endl;
    }
    stream.close();
}
std::size_t mre::package::generate(std::size_t contents, std::size_t angles){
    std::size_t count = 0;
    std::size_t v_content = 0;
    while(v_content++ < contents){
        mre::content x = mre::content::generate();
        std::size_t v_angle = 0;
        while(v_angle++ < angles){
            mre::content x_angle = x;
            x_angle.reset_angle_random(); // commenting out this line makes it work
            if (_samples.insert(x_angle).second)
                ++count;
        }
    }
    return count;
}
It looks like a bug in Boost multi-index container, but I am unaware of any such existing bug. I can reproduce the problem by compiling the MRE on an Arch Linux machine, which has the latest version of the Boost libraries. The MRE also contains a docker target which compiles the same code in an Ubuntu 22.04 image, where the default Boost version is 1.74. The issue can be tested using the executable mre as follows.
cd build
cmake .. && make
./mre pack archive_name 10 # to serialize 10 randomly generated contents and save to file named archive_name
./mre unpack archive_name # to de-serialize
In order to test the incompatibility, it can be compiled using docker.
make docker # compiles and generates a file named arc inside build/archives directory of the host machine
./mre unpack archives/arc # which throws exception
Looking at this a long time, I couldn't see it. However, by fixing the seeds and verifying that we get deterministic data, I noticed that the results were "identical" but for the order.
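The seed-fixing step can be sketched roughly as follows. This is only an illustration, not the listing from this answer: `make_rng` and `draw` are hypothetical helper names, and the idea is simply to let the caller pass a fixed seed instead of always consulting `std::random_device`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <random>
#include <vector>

// Hypothetical helper: seeds the engine from `fixed_seed` when one is given,
// otherwise from std::random_device (which is what the question's code does).
inline std::mt19937 make_rng(std::optional<std::uint32_t> fixed_seed = std::nullopt) {
    if (fixed_seed) {
        return std::mt19937(*fixed_seed);
    }
    std::random_device dev;
    return std::mt19937(dev());
}

// Hypothetical helper: draws `n` raw engine values so two runs can be compared.
inline std::vector<std::mt19937::result_type> draw(std::mt19937 rng, std::size_t n) {
    std::vector<std::mt19937::result_type> values;
    values.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        values.push_back(rng());
    }
    return values;
}
```

With a fixed seed, two runs see the same raw engine output, which makes it possible to compare the generated data across builds. Note that the standard distributions (such as `std::uniform_int_distribution`) are implementation-defined, so identical engine output does not by itself guarantee identical distribution output across different standard libraries.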
I noticed the default index already relies on the hash indirectly, multiple times:
Since the first index is actually also unique, and its only constituent parts are hashes and the angle, this might cause different uniqueness across versions of Boost ContainerHash.
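To illustrate why a changed hash function changes what an ordered index sees, here is a toy sketch. `toy_hash_old` and `toy_hash_new` are made-up stand-ins for "the hash before and after a library upgrade"; they are not Boost's actual implementations, and `order_by` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins only; NOT the real boost::hash_combine of any Boost version.
inline std::size_t toy_hash_old(std::size_t x) { return (x * 3) % 7; }
inline std::size_t toy_hash_new(std::size_t x) { return (x * 5) % 7; }

// Orders the values by a hash-derived key, the way an ordered index keyed
// on a hash would order its elements.
template <class Hash>
std::vector<std::size_t> order_by(std::vector<std::size_t> values, Hash hash) {
    std::sort(values.begin(), values.end(),
              [&](std::size_t a, std::size_t b) { return hash(a) < hash(b); });
    return values;
}
```

The same three values, `{1, 2, 3}`, come out as `3, 1, 2` under the first toy hash and as `3, 2, 1` under the second. Data written in the order implied by one hash function is therefore not in the order that code using the other hash function expects.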
Boost hash_combine does not guarantee stability or portability. In fact, most common hash functions don't, e.g. std::hash. Persisting information that depends on a hash being deterministic is a logic error anywhere, unless you're only re-reading the same information in the same process, because std::hash is only required to produce the same result for the same input within a single execution of a program.
Specifically, hash_combine has received many changes between 1.74 and 1.83. You should rethink your indexes. In fact, I would consider it a smell that a hash depending on non-unique hashes is being used as the key (identity) to a unique index.
Fixing?
To avoid violating the total ordering contract that the index expects (it's basically like you edited the key fields by "editing" the hash function), I'd expect the hash to be something like
And then perhaps something more like:
Where I substituted libfmt for Boost Format, because it can directly format tuples without me doing the work :)
Basically, I'd not throw away the information, which seemed like code smell anyways, but also caused the indexes to rely on non-deterministic functions.
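A minimal sketch of the "don't throw away the information" idea, assuming the key is simply the tuple of the underlying fields. This is not the listing from this answer: `content_sketch`, `key()` and `frozen_key()` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <tuple>

// Hypothetical, reduced stand-in for the question's content struct.
struct content_sketch {
    double length;
    std::size_t angle;

    // Full identity: every field, compared lexicographically. No hashing is
    // involved, so the order does not depend on any hash implementation.
    auto key() const { return std::tie(length, angle); }

    // "Frozen" identity: the angle is forced to 0, mirroring freeze() in the
    // question, so elements differing only in angle share this key.
    std::tuple<double, std::size_t> frozen_key() const {
        return std::make_tuple(length, std::size_t{0});
    }

    bool operator<(content_sketch const& other) const { return key() < other.key(); }
};
```

Because the keys are the field values themselves, two distinct elements can never collide, and the ordering is the same for any library version that compares the same field values.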
Here's my motivating code listing, complete with the tweaks to optionally use a fixed seed:
Live On Coliru
PS: Come to think of it, that still leaves other indexes (like tags::id) that suffer from the hash-used-as-key problem. I don't think the data structure is sound [beyond using a non-unique non-deterministic hash as a unique key, as already observed]. E.g. the primary (identity<>-based) index is effectively the same as the one by id() in your original code, because operator< literally projects std::less over id(). That's
Here's a counter-gambit:
With
(Full listing Live On Coliru)
TL;DR
Basically, don't use non-perfect hashes as keys. Additionally, don't rely on the determinism of the hashing algorithm, except with published cryptographic digests.