Armadillo C++ matrix multiplication with in-place assignment to submatrix

Maybe it's a trivial question, but suppose I have two matrices A, B of dimension N x s:

arma::mat A(N, s, arma::fill::randu);
arma::mat B(N, s, arma::fill::randu);

and I need to write a product of these matrices to a subblock of a larger matrix

arma::mat X(M * s, M * s, arma::fill::randu);

which I do in a loop as follows:

X.submat(j * s, j * s, (j + 1) * s - 1, (j + 1) * s - 1) = A.t() * B;

My question is whether Armadillo creates a copy of the product of these matrices and then assigns it to the submatrix, or whether the evaluation is performed in place. The latter would be much more memory- and CPU-efficient, since I perform this step in a loop over all the blocks (M ~ 1000 steps).
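
For concreteness, here is a minimal self-contained version of the loop I have in mind (the sizes and the per-iteration operands are just placeholders):

#include <armadillo>

int main()
  {
  const arma::uword N = 500, s = 8, M = 100;  // M ~ 1000 in the real problem

  arma::mat X(M * s, M * s, arma::fill::zeros);

  for(arma::uword j = 0; j < M; ++j)
    {
    // placeholders: in the actual algorithm A and B change on every iteration
    arma::mat A(N, s, arma::fill::randu);
    arma::mat B(N, s, arma::fill::randu);

    X.submat(j * s, j * s, (j + 1) * s - 1, (j + 1) * s - 1) = A.t() * B;
    }

  return 0;
  }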

I want to speed up this loop, and this is one of the few things I can change. I'd rather not deal with raw memory pointers myself and let Armadillo handle that. Can you recommend an efficient way to do this without creating a temporary copy, or is this way the best?

For anyone interested, the question arises in the context of the Block-Lanczos algorithm.

Answer by YurkoFlisk (accepted):

The official Armadillo site mentions:

Armadillo employs a delayed evaluation approach to combine several operations into one and reduce (or eliminate) the need for temporaries. Where applicable, the order of operations is optimised. Delayed evaluation and optimisation are achieved through recursive templates and template meta-programming.

However, the documentation is quite scarce on the details of how exactly this is done, to the point that even the return types of methods aren't specified there. I guess this is intentional, so as not to scare ordinary users of the library with a level of detail (template trickery in particular) that is typically absent from the documentation of similar libraries in the higher-level languages those users are expected to come from.

Using the source code, though, it isn't very hard to examine what happens, especially with IntelliSense and a debugger. It helps that Armadillo is templated and thus implemented in headers, so we can follow in the IDE what happens, right up to the point where the underlying (compiled, platform-specific) BLAS routines are called.

So, your A.t() * B represents a computation that is carried out only in a context where the result is actually needed. It is of type arma::Glue<arma::Op<arma::dmat, arma::op_htrans>, arma::dmat, arma::glue_times> and contains references to two arma::Base-derived operands, which in our case are an arma::Op<arma::dmat, arma::op_htrans> representing A.t() (which itself contains a reference to A, of course) and an arma::dmat representing B. The third template argument of arma::Glue specifies the operation itself, which is arma::glue_times for multiplication. arma::glue_times::apply eventually gets called to perform the multiplication, and, through several layers of indirection with various ifs, except in some special cases (e.g. small matrices), the BLAS [x]gemm function (dgemm for double matrices in our case) gets called like this (from arma::gemm<...>::apply_blas_type):

// ...
      #elif defined(ARMA_USE_BLAS)
        {
        arma_extra_debug_print("blas::gemm()");
        
        arma_debug_assert_blas_size(A,B);
        
        const char trans_A = (do_trans_A) ? ( is_cx<eT>::yes ? 'C' : 'T' ) : 'N';
        const char trans_B = (do_trans_B) ? ( is_cx<eT>::yes ? 'C' : 'T' ) : 'N';
        
        const blas_int m   = blas_int(C.n_rows);
        const blas_int n   = blas_int(C.n_cols);
        const blas_int k   = (do_trans_A) ? blas_int(A.n_rows) : blas_int(A.n_cols);
        
        const eT local_alpha = (use_alpha) ? alpha : eT(1);
        
        const blas_int lda = (do_trans_A) ? k : m;
        const blas_int ldb = (do_trans_B) ? n : k;
        
        const eT local_beta  = (use_beta) ? beta : eT(0);
        
        arma_extra_debug_print( arma_str::format("blas::gemm(): trans_A = %c") % trans_A );
        arma_extra_debug_print( arma_str::format("blas::gemm(): trans_B = %c") % trans_B );
        
        blas::gemm<eT>
          (
          &trans_A,
          &trans_B,
          &m,
          &n,
          &k,
          &local_alpha,
          A.mem,
          &lda,
          B.mem,
          &ldb,
          &local_beta,
          C.memptr(),
          &m
          );
        }
// ...

(where blas::gemm is a very thin wrapper that selects the appropriate [x]gemm for the number type and deals with Fortran's hidden string-length arguments if needed)

As you can see, the information about whether the argument matrices should be interpreted as transposed for the purposes of the multiplication is passed to gemm in the standard way, through one-character string arguments, so there is no need to create a temporary matrix for A.t(). And arma::glue_times::apply does not create one: instead, it "unwraps" the arma::Op<arma::dmat, arma::op_htrans>, setting the do_trans_A template argument to true (do_trans_B stays false, since the second argument is a plain arma::dmat) (see here); these are used above to construct the trans_A and trans_B characters.
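
If you want to observe the delayed evaluation yourself, a quick check (a minimal sketch; typeid prints a compiler-specific, usually mangled, name) is to inspect the static type of the unevaluated expression:

#include <armadillo>
#include <iostream>
#include <typeinfo>

int main()
  {
  arma::mat A(100, 5, arma::fill::randu);
  arma::mat B(100, 5, arma::fill::randu);

  // decltype is an unevaluated context: no multiplication happens here.
  // On GCC/Clang the mangled name corresponds to
  // arma::Glue<arma::Op<arma::Mat<double>, arma::op_htrans>,
  //            arma::Mat<double>, arma::glue_times>
  std::cout << typeid(decltype(A.t() * B)).name() << '\n';

  // Evaluation is deferred until the expression is consumed, e.g. by
  // assigning it to a Mat; only then does glue_times::apply run.
  arma::mat C = A.t() * B;
  std::cout << C.n_rows << " x " << C.n_cols << '\n';
  return 0;
  }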

Another aspect of the blas::gemm call shown earlier is that it uses (together with the dimensions of the matrices, of course) a single pointer, C.memptr(), as the output argument (where C is the Mat<eT>& output argument of apply_blas_type). So, in particular, the storage that blas::gemm writes to is assumed to be contiguous (otherwise it wouldn't know how to reach columns past the first). When you assign to some matrix D like D = A.t() * B, this isn't a problem, since D has contiguous storage and operator= for a Glue<T1, T2, glue_type> argument is implemented to just call glue_type::apply.
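
To make the contiguity point concrete, here is a small sketch (variable names are mine) showing that consecutive columns of an interior subview are separated by the parent's column stride, not the subview's:

#include <armadillo>
#include <cassert>
#include <cstddef>

int main()
  {
  arma::mat X(10, 10, arma::fill::zeros);

  // An interior block: it does not span all rows of X, hence is not contiguous.
  auto&& block = X.submat(2, 2, 4, 4);

  double* c0 = &block(0, 0);
  double* c1 = &block(0, 1);

  // Column-major layout: the next column of the subview starts X.n_rows
  // elements later in the parent's buffer, not block.n_rows elements later.
  assert(c1 - c0 == static_cast<std::ptrdiff_t>(X.n_rows));
  return 0;
  }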

However, an arma::subview can't be guaranteed to refer to contiguous storage, so when we assign to it, we can't call BLAS functions with something like its "memptr" as the output, as is done above for the arma::Mat argument. Thus, Armadillo creates a temporary output matrix and then copies from it into the subview. In particular, arma::subview<eT>::operator= (source) calls the helper inplace_op<op_internal_equ>, where the temporary matrix is created (B below) and copied into the subview (s) column-by-column (not "row-by-row", because matrices are stored in column-major order):

// ...
      if((s.aux_row1 == 0) && (s_n_rows == s.m.n_rows))
        {
        if(is_same_type<op_type, op_internal_equ  >::yes)  { arrayops::copy         ( s.colptr(0), B.memptr(), s.n_elem ); }
        if(is_same_type<op_type, op_internal_plus >::yes)  { arrayops::inplace_plus ( s.colptr(0), B.memptr(), s.n_elem ); }
        if(is_same_type<op_type, op_internal_minus>::yes)  { arrayops::inplace_minus( s.colptr(0), B.memptr(), s.n_elem ); }
        if(is_same_type<op_type, op_internal_schur>::yes)  { arrayops::inplace_mul  ( s.colptr(0), B.memptr(), s.n_elem ); }
        if(is_same_type<op_type, op_internal_div  >::yes)  { arrayops::inplace_div  ( s.colptr(0), B.memptr(), s.n_elem ); }
        }
      else
        {
        for(uword ucol=0; ucol < s_n_cols; ++ucol)
          {
          if(is_same_type<op_type, op_internal_equ  >::yes)  { arrayops::copy         ( s.colptr(ucol), B.colptr(ucol), s_n_rows ); }
          if(is_same_type<op_type, op_internal_plus >::yes)  { arrayops::inplace_plus ( s.colptr(ucol), B.colptr(ucol), s_n_rows ); }
          if(is_same_type<op_type, op_internal_minus>::yes)  { arrayops::inplace_minus( s.colptr(ucol), B.colptr(ucol), s_n_rows ); }
          if(is_same_type<op_type, op_internal_schur>::yes)  { arrayops::inplace_mul  ( s.colptr(ucol), B.colptr(ucol), s_n_rows ); }
          if(is_same_type<op_type, op_internal_div  >::yes)  { arrayops::inplace_div  ( s.colptr(ucol), B.colptr(ucol), s_n_rows ); }
          }
        }
// ...

(Note that in the case when s_n_rows == s.m.n_rows, i.e. when the subview is actually contiguous, a single arrayops::copy (itself just a memcpy wrapper) is called instead of one per column in a loop, but the temporary matrix is still created. Also, op_internal_schur is for element-wise multiplication.)
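
For illustration, a small sketch of a case that takes that single-memcpy branch (the sizes are mine): a subview produced by cols() spans all rows of its parent, so aux_row1 == 0 and s_n_rows == s.m.n_rows hold, and the copy-out is one arrayops::copy, although the temporary product matrix is still created first:

#include <armadillo>

int main()
  {
  arma::mat X(10, 100, arma::fill::zeros);
  arma::mat A(50, 10, arma::fill::randu);
  arma::mat B(50, 10, arma::fill::randu);

  // X.cols(20, 29) is a 10 x 10 subview covering all rows of X,
  // so the assignment below copies the temporary out with one memcpy.
  X.cols(20, 29) = A.t() * B;
  return 0;
  }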

So, summing up: no intermediate matrix is created when evaluating A.t() * B, but one temporary matrix is created before the actual assignment to the subview, and if you want to use the optimized BLAS functions, you can't get around it. You could try implementing the multiplication directly into your non-contiguous subview manually, using arma::Mat::at indexing (which isn't bounds-checked), and measure it if you want, though I doubt that avoiding one temporary matrix creation will outweigh the loss of the BLAS optimizations (among other things, the contiguity of the output buffer during the computation there may make it more cache-friendly).
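
For completeness, here is a sketch of that manual approach (the function name and signature are mine, not part of the library); it writes the dot products straight into the target block, skipping the temporary at the cost of losing the BLAS gemm:

#include <armadillo>

// Compute A.t() * B directly into the j-th diagonal s x s block of X,
// using unchecked at() indexing, with no temporary matrix.
void block_mult_into(arma::mat& X, const arma::mat& A, const arma::mat& B,
                     arma::uword j, arma::uword s)
  {
  const arma::uword r0 = j * s;   // top-left corner of the target block
  const arma::uword N  = A.n_rows;

  for(arma::uword col = 0; col < s; ++col)
    for(arma::uword row = 0; row < s; ++row)
      {
      double acc = 0.0;
      // (A.t() * B)(row, col) == dot(A.col(row), B.col(col))
      for(arma::uword k = 0; k < N; ++k)
        acc += A.at(k, row) * B.at(k, col);
      X.at(r0 + row, r0 + col) = acc;
      }
  }

Whether this ever wins depends on N and s; for the small block sizes typical of block methods it may be worth a measurement, but don't be surprised if the plain assignment with the temporary stays faster.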