Maybe it's a trivial question, but suppose I have two matrices A, B of dimension N x s:
arma::mat A(N, s, arma::fill::randu);
arma::mat B(N, s, arma::fill::randu);
and I need to write a product of these matrices to a subblock of a larger matrix
arma::mat X(M * s, M * s, arma::fill::randu);
which I do in a loop as follows:
X.submat(j * s, j * s, (j + 1) * s - 1, (j + 1) * s - 1) = A.t() * B;
My question is whether Armadillo creates a copy of the product of these matrices and then assigns it to the submatrix, or whether the evaluation is performed in place. The latter would be much more memory- and CPU-efficient, since I perform this step in a loop over all the blocks (M ~ 1000 steps).
I want to speed up this loop, and this is one of the few things I can change. I'd rather not deal with raw memory pointers myself and would prefer to let Armadillo handle that. Can you recommend an efficient way to do this without creating a temporary copy, or is this approach the best?
The question is related to the Block-Lanczos algorithm for anyone interested.
The official Armadillo site mentions that an expression evaluator based on template meta-programming is used to combine operations and reduce temporaries. However, the documentation is quite scarce about how exactly this is done, to the point that even the return types of methods aren't specified there. I guess this is intentional, so as not to scare ordinary users of the library with a level of detail (template trickery in particular) that is typically absent from the documentation of similar libraries in the higher-level languages those users are expected to come from.
Using the source code, though, it isn't very hard to examine what happens, especially with IntelliSense and a debugger. It helps that Armadillo is templated and thus implemented in headers, so we can follow in the IDE what happens, right up to the point where the underlying (compiled, platform-specific) BLAS routines are called.
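Before looking at Armadillo's actual types, it may help to see the mechanism in miniature. The following is a toy analogue in plain C++ (all names here -- `Mat`, `TransExpr`, `TransTimesExpr`, `eval` -- are hypothetical, not Armadillo's API): operator overloads build lightweight expression objects that merely record what to do, and the actual arithmetic happens only when the expression is finally evaluated.

```cpp
#include <cstddef>
#include <vector>

// Toy column-major matrix; a stand-in for arma::Mat, not the real thing.
struct Mat {
    std::size_t n_rows = 0, n_cols = 0;
    std::vector<double> mem;                    // column-major storage
    Mat(std::size_t r, std::size_t c) : n_rows(r), n_cols(c), mem(r * c, 0.0) {}
    double& at(std::size_t i, std::size_t j)       { return mem[i + j * n_rows]; }
    double  at(std::size_t i, std::size_t j) const { return mem[i + j * n_rows]; }
};

// Analogue of arma::Op<Mat, op_htrans>: records "transpose A" without doing it.
struct TransExpr { const Mat& A; };

// Analogue of arma::Glue<TransExpr, Mat, glue_times>: records "multiply" and
// holds references to both operands; no arithmetic has happened yet.
struct TransTimesExpr { const Mat& A; const Mat& B; };

TransExpr t(const Mat& A) { return {A}; }
TransTimesExpr operator*(TransExpr ta, const Mat& B) { return {ta.A, B}; }

// Analogue of glue_times::apply: only here, when the result is needed, is
// A^T * B computed -- reading A with swapped indices instead of materialising
// the transpose as a temporary matrix.
Mat eval(const TransTimesExpr& e) {
    Mat C(e.A.n_cols, e.B.n_cols);
    for (std::size_t j = 0; j < C.n_cols; ++j)
        for (std::size_t i = 0; i < C.n_rows; ++i) {
            double acc = 0.0;
            for (std::size_t k = 0; k < e.A.n_rows; ++k)
                acc += e.A.at(k, i) * e.B.at(k, j);   // (A^T)(i,k) == A(k,i)
            C.at(i, j) = acc;
        }
    return C;
}
```

With this sketch, `Mat C = eval(t(A) * B);` computes the product in one pass with no temporary for the transpose, which is the same idea Armadillo implements below, just with far more template machinery.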
So, your `A.t() * B` represents a computation which has to be done in a context where the result is finally needed. It is of type `arma::Glue<arma::Op<arma::dmat, arma::op_htrans>, arma::dmat, arma::glue_times>` and contains references to two `arma::Base`-derived operands, which in our case are an `arma::Op<arma::dmat, arma::op_htrans>` representing `A.t()` (which itself contains a reference to `A`, of course) and an `arma::dmat` representing `B`. The third template argument of `arma::Glue` specifies the operation itself, which is `arma::glue_times` for multiplication.

`arma::glue_times::apply` eventually gets called to perform the multiplication, and, through several layers of indirection with various `if`s, except in some special cases (e.g. small matrices), the BLAS `[x]gemm` function (`dgemm` for `double` matrices in our case) gets called from `arma::gemm<...>::apply_blas_type` (where `blas::gemm` is a very thin wrapper that chooses the appropriate `[x]gemm` depending on the number type and deals with the hidden Fortran string-length arguments if needed).

The information about whether the argument matrices have to be interpreted as transposed for the purposes of the multiplication is passed to `gemm` in the standard way, through one-character string arguments, so there is no need to create a temporary matrix for `A.t()`. And `arma::glue_times::apply` indeed does not create one; instead, it "unwraps" the `arma::Op<arma::dmat, arma::op_htrans>` into setting the `do_trans_A` template argument to `true` (`do_trans_B` is `false`, since the second argument is just an `arma::dmat`; see here), and these template arguments are used to construct the `trans_A` and `trans_B` characters passed to `gemm`.

Another aspect of how `blas::gemm` is called is that it takes (together with the dimensions of the matrices, of course) a single pointer `C.memptr()` as the output argument (where `C` is the `Mat<eT>&` output argument of `apply_blas_type`). So, in particular, the storage that `blas::gemm` writes to is assumed to be contiguous (otherwise, it wouldn't know how to reach the columns past the first). When you assign to some matrix `D`, as in `D = A.t() * B`, this isn't a problem, since `D` has contiguous storage, and `operator=` taking a `Glue<T1, T2, glue_type>` argument is implemented to just call `glue_type::apply`.

However, an `arma::subview` can't be guaranteed to refer to contiguous storage, so when we assign to it, a BLAS function can't be called with something like its "memptr" as the output, as is done for an `arma::Mat` argument. Thus, Armadillo creates a temporary output matrix and then copies from it into the subview. In particular, `arma::subview<eT>::operator=` (source) calls the helper `inplace_op<op_internal_equ>`, where the temporary matrix is created and then copied into the subview column-by-column (not "row-by-row", because matrices are stored in column-major order). (Note that in the case when `s_n_rows == s.m.n_rows`, i.e. the subview is actually contiguous, a single `arrayops::copy` (which itself is just a `memcpy` wrapper) is called instead of the per-column loop, but the temporary matrix is still created. Also, `op_internal_schur` is the analogous helper for element-wise multiplication.)

So, summing up: no intermediate matrix is created when evaluating `A.t() * B`, but one temporary matrix is created before the actual assignment to the subview, and if you want to use the optimized BLAS functions, you can't get around it. You can try implementing the multiplication directly into your non-contiguous subview manually, using `arma::Mat::at` indexing (which isn't bounds-checked), and measure it if you want, though I doubt that avoiding one temporary matrix creation will outweigh the loss of the BLAS optimizations (and, among other things, the contiguity of the output buffer during the computation may make it more cache-friendly).
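For reference, the manual alternative mentioned above would look roughly like the sketch below. It uses raw column-major arrays instead of Armadillo types so that it stands alone; with Armadillo you would write the same triple loop over `X.at(...)`, `A.at(...)`, `B.at(...)`. The function name `block_product` and its parameter layout are illustrative assumptions, not part of any library.

```cpp
#include <cstddef>
#include <vector>

// Compute A^T * B directly into the (j, j) diagonal s-by-s block of the
// column-major Xdim-by-Xdim matrix X, with no temporary result matrix.
// A and B are N-by-s, column-major. This trades the BLAS gemm call (and its
// cache blocking) for the absence of one temporary -- usually a bad trade,
// which is the point made above; measure before committing to it.
void block_product(const std::vector<double>& A,
                   const std::vector<double>& B,
                   std::size_t N, std::size_t s,
                   std::vector<double>& X, std::size_t Xdim, std::size_t j) {
    const std::size_t r0 = j * s, c0 = j * s;     // top-left corner of the block
    for (std::size_t c = 0; c < s; ++c)           // column within the block
        for (std::size_t r = 0; r < s; ++r) {     // row within the block
            double acc = 0.0;
            for (std::size_t k = 0; k < N; ++k)   // (A^T)(r,k) == A(k,r)
                acc += A[k + r * N] * B[k + c * N];
            // Write straight into X's strided storage; consecutive columns of
            // the block are Xdim elements apart, so for Xdim > s the output
            // region is not contiguous -- exactly the situation that forces
            // Armadillo's subview assignment to go through a temporary.
            X[(r0 + r) + (c0 + c) * Xdim] = acc;
        }
}
```

Whether this beats `X.submat(...) = A.t() * B;` depends on N, s, and the BLAS implementation; for the sizes in the question (s-by-N times N-by-s per block), the BLAS path is very likely to win.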