At the moment I try to refactor a code which consists of a loop of dlarfx calls to a single dlarft/dlarfb call to improve performance. The code transform a matrix A which is part of a product P * A * Q^T and the dlarft/dlarfb calls intend to modify the P and Q matrix so that the product is still valid.
My problem is that the refactored code results into completely different matrices. My questions is if there are known subtle differences or caveats of the dlarft/dlarfb code in comparison to a loop of dlarfx applications? Or do I have a misunderstand of the dlarft/dlarfb interface from the documentation and call the functions the wrong way.
Below I attach the relevant except of both versions of the code. It is assumed that the input matrix A is square with dimension (N, N). The code is written in C and uses the LAPACKE interface. Since the matrix is processed in backwards order, the Householder vector is of from (v, v, 1, 0).
Loop version
for (int i = N - 1; i >= 0; i--) {
// Calculation of Householder vector and tau, omitted
LAPACKE_dlarfx_work(LAPACK_COL_MAJOR,
'R',
nrowsq, // Number of rows of Q
N,
vec_array,// Householder vector
tau,
q, // Stored as Q => application from right
ldq,
work);
}
Blocked version
a is constructed as matrix with elements
( 1, 1, vp1, vp2)
(vq1, 1, 1, vp2)
(vq2, vq2, 1, 1)
(vq3, vq3, vq3, 1)
where vqX
are the vectors for application to Q.
LAPACKE_dlarft_work(LAPACK_COL_MAJOR,
'B',
'R',
N,
N,
a,
lda,
tauq_array,
t_matrix,
N);
LAPACKE_dlarfb_work(LAPACK_COL_MAJOR,
'R',
'N',
'B',
'R',
nrowsq,
N,
N,
a,
lda,
t_matrix,
N,
q,
ldq,
work,
nrowsq);
For any hint why the blocked version results into different results I would be very thankful. If you need another information to triangulate the error, I am happy to provide more info.