I have input and target data represented as a MatrixXd (N x M) and a VectorXd (N). The goal is to create mini-batches of size K, each consisting of a subset of the input and target data shuffled in the same way. The ML model will then process these mini-batches in a loop. Could you recommend how to achieve this with as little copying as possible (ideally with a code example)?
My attempt to implement this kind of batching:
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>
#include <Eigen/Dense>

using Eigen::MatrixXd;
using Eigen::Ref;
using Eigen::VectorXd;

struct Batch {
    // An indexed view is not contiguous in memory, so Eigen evaluates it into
    // a temporary when it is bound to a Ref<const ...>.
    const Ref<const MatrixXd> input;
    const Ref<const VectorXd> target;
};

std::vector<Batch> generate_batches(const Ref<const MatrixXd>& input, const Ref<const VectorXd>& target, unsigned batch_size)
{
    const unsigned num_samples = static_cast<unsigned>(input.rows());
    // Integer ceiling division; the last batch may be smaller than batch_size.
    const unsigned num_batches = (num_samples + batch_size - 1) / batch_size;

    static std::default_random_engine engine;
    std::vector<unsigned> idxs(num_samples);
    std::iota(idxs.begin(), idxs.end(), 0u);
    std::shuffle(idxs.begin(), idxs.end(), engine);

    std::vector<Batch> batches;
    batches.reserve(num_batches);
    for (unsigned idx = 0; idx < num_batches; ++idx) {
        const unsigned start = idx * batch_size;
        const unsigned end = std::min(start + batch_size, num_samples);
        // Shuffled row indices that belong to this batch.
        std::vector<unsigned> batch_idxs(idxs.begin() + start, idxs.begin() + end);
        batches.push_back({ input(batch_idxs, Eigen::all), target(batch_idxs) });
    }
    return batches;
}
Eigen comes with a Transpositions type that does just that. It works in-place by swapping rows or columns. So you can just keep shuffling the same matrix over and over again.
See also "Permute Columns of Matrix in Eigen" and similar questions.
EDIT: An older version of this answer initialized the indices via std::shuffle, which I think is wrong.
Here is a second version that may offer a more palatable interface. In particular, the original matrix and vector can be restored without taking a copy.
Again, I chose to use the matrix column-wise unlike in your code attempt where you take rows. Eigen stores its matrices in column-major order (a.k.a. Fortran order). Taking row slices rather than column slices will significantly slow down pretty much everything you do with the data. So I really urge you to transpose your input generation and matrix use accordingly, if at all possible.