I am writing a small test code for parallel matrix diagonalization using ScaLAPACK's divide-and-conquer routine PDSYEVD in C. However, I am new to ScaLAPACK, and looking at the source there appears to be a rather scary number of variables to set, for which I could not find any good examples. The matrices I need to diagonalize are real, symmetric, rather sparse, and of size ~1000x1000.
A simple (serial) LAPACK-based code looks like this:
/* "matrix diagonalization using lapack" */
#include <stdio.h>
#include <stdlib.h>

/* DSYEV prototype */
extern void dsyev_(char* jobz, char* uplo, int* n, double* a, int* lda,
                   double* w, double* work, int* lwork, int* info );

#define n 4

int main(int argc, char *argv[])
{
    int i,j,info,lda,lwork,N;
    double A[n][n], a[n*n], work[100*n], w[n];

    /* initialize matrices */
    for(i=0;i<n;i++) { for(j=0;j<n;j++) A[i][j] = (i+j); }

    /* diagonalize */
    N=n; lda=n; lwork=10*n; info=0;
    /* copy A into a column-major 1D array, as LAPACK expects */
    for(i=0;i<n;i++) { for(j=0;j<n;j++) a[i+j*n]=A[i][j]; }
    dsyev_("V","L",&N,a,&lda,w,work,&lwork,&info);

    /* print result */
    for(i=0;i<n;i++) {
        printf("w[%d] = %8.4f | v[%d] = [ ",i,w[i],i);
        for(j=0;j<n;j++) printf("%6.2f ",a[j+i*n]);
        printf("]\n");
    }
    return 0;
}
From there I would like to go to pdsyevd, which should be called as:
SUBROUTINE PDSYEVD( JOBZ, UPLO, N, A, IA, JA, DESCA, W, Z, IZ, JZ, DESCZ, WORK, LWORK, IWORK, LIWORK, INFO )
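In C I assume this maps to something like the following declaration (with the usual trailing-underscore Fortran name mangling, 32-bit integers, every argument passed by pointer, and IWORK being an integer array):

extern void pdsyevd_( char* jobz, char* uplo, int* n,
                      double* a, int* ia, int* ja, int* desca,
                      double* w,
                      double* z, int* iz, int* jz, int* descz,
                      double* work, int* lwork,
                      int* iwork, int* liwork, int* info );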
A number of things are unclear to me. An answer to any or several of these questions would be much appreciated.
- Does anyone happen to have an example?
- In the ScaLAPACK documentation it says: "[...] PDSYEVD assumes a homogeneous system [...] a heterogeneous system may return incorrect results without any error message." What does a "homogeneous/heterogeneous" system mean in this context?
- At the end of the documentation it says: "The distributed submatrices must verify some alignment properties, namely [...]". The terms in this expression are not even among the inputs, so how do I know whether this is satisfied? MB_A/MB_Z?
- How do I choose the submatrices for each process? Is there any rationale? I suppose they should be more or less equal in size, but rows/columns/squares/...? Maybe larger for the sparser regions...?
- What are Z/IZ/JZ/DESCZ doing in the input? I am already giving each process a submatrix A/IA/JA/DESCA to work on, so why this Z? It appears to be the partial eigenvector matrix, but why is it not written to A like in DSYEV?
- Do all processes have to use the same block size? Can the submatrices overlap, and what do I do for prime-sized matrices, for example?
- What is in DESCA? Also, what is DLEN? It seems to be an input "array descriptor" for A. I am not sure what to make of this.
Here's more or less what I have so far, which is not doing much:
/* "matrix diagonalization using scalapack" */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
/* PDSYEVD prototype */
extern void pdsyevd_( char* jobz, char* uplo, int* n, double* a, int* ia, int* ja, int* desca,
                      double* w, double* z, int* iz, int* jz, int* descz, double* work,
                      int* lwork, int* iwork, int* liwork, int* info );
#define n 4
int main(int argc, char *argv[])
{
    int numprocs,myid,i,j,k,index,info,lwork,liwork,N;
    int ia,ja,desca[9];   /* a ScaLAPACK array descriptor has DLEN_ = 9 entries */
    int iz,jz,descz[9];
    double A[n][n],w[n];
    double *a,*z,*work;
    int *iwork;           /* IWORK is an INTEGER array */

    /* set up MPI stuff */
    MPI_Init(&argc,&argv);                    // initialize MPI
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);  // find out size of world
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);      // determine current mpi proc id
    /* initialize matrices */
    for(i=0;i<n;i++) { for(j=0;j<n;j++) A[i][j] = (i+j); }

    /* determine submatrices */
    /* let's use square blocks of equal size:
       the number of processes (mp^2) should be
       - a square number (mp^2 = 1,4,9,16,..)
       - with mp being a divisor of n
       In case we have 4 procs and a 4x4 matrix, for example,

               (A11 A12 A13 A14)
               (A21 A22 A23 A24)
           A = (A31 A32 A33 A34)
               (A41 A42 A43 A44)

               ( a1 a2 )
             = ( a3 a4 )

       where
               (A11 A12)
          a1 = (A21 A22)

       is the submatrix sent to process 1, for example.
       In this case:
           n=4
           nprocs=4 --> mp=2
           len = 2
    */
    int mp = sqrt(numprocs);
    if(numprocs!=mp*mp) { printf("ERROR: numprocs (%d) should be square.\n",numprocs); return 1; }
    if(n%mp!=0)         { printf("ERROR: mp (%d) should be divisor of n (%d).\n",mp,n); return 1; }
    int len = n/mp;
    a = malloc((n*n+1)*sizeof(double));
    z = malloc((n*n+1)*sizeof(double));
    //a = malloc((len*len+1)*sizeof(double));
    //z = malloc((len*len+1)*sizeof(double));
    ia=(myid*len)%n;         // from 0 to n in steps of len
    ja=(int)(myid/mp)*len;
    printf("proc %d has a submatrix of size %d, with starting indices (%d,%d)\n",myid,len,ia,ja);
    iz=ia;
    jz=ja;
    for(i=ia;i<ia+len;i++) { for(j=ja,k=0;j<ja+len;j++,k++) a[k]=A[i][j]; }
    for(i=iz;i<iz+len;i++) { for(j=jz,k=0;j<jz+len;j++,k++) z[k]=A[i][j]; }
    /* diagonalize */
    N=n; lwork=n*n; liwork=n*n; info=0;
    work  = malloc(lwork*sizeof(double));
    iwork = malloc(liwork*sizeof(int));
    for(i=0;i<n;i++) { for(j=0;j<n;j++) a[i+j*n]=A[i][j]; }
    pdsyevd_("V","U",&len,a,&ia,&ja,desca,w,z,&iz,&jz,descz,work,&lwork,iwork,&liwork,&info);
    if(!myid) printf("info = %d\n",info);
    /* gather & print results */
    double wf[n];
    MPI_Reduce(w,wf,n,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
    if(myid==0) {
        for(i=0;i<n;i++) {
            printf("wf[%d] = %8.4f | v[%d] = [ ",i,wf[i],i);
            for(j=0;j<n;j++) printf("%6.2f ",a[j+i*n]);
            printf("]\n");
        }
    }
    free(a); free(z); free(work); free(iwork);

    /* clean up */
    MPI_Finalize(); /* MPI programs end with MPI_Finalize(); this is a weak synchronization point */
    return 0;
}
Two useful references:
IBM Parallel ESSL PDSYEVD documentation
Intel LAPACK DSYEVD documentation
I believe the behavior described in the IBM docs matches that of ScaLAPACK, and it is much more thoroughly documented.
For work, lwork, iwork, etc.: set lwork=0 and they should be allocated internally by the subroutine as needed; there is no need to size them yourself.
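With the reference (netlib) ScaLAPACK, an alternative is a workspace query: call the routine once with lwork = liwork = -1, and the optimal sizes come back in work[0] and iwork[0]. A rough sketch, assuming a, z, w and the descriptors desca/descz have already been set up and that iwork is declared as int*:

/* first call: workspace query only, nothing is computed */
double wkopt;
int    iwkopt;
int    lwork = -1, liwork = -1, info = 0;
pdsyevd_("V", "U", &N, a, &ia, &ja, desca, w, z, &iz, &jz, descz,
         &wkopt, &lwork, &iwkopt, &liwork, &info);

/* allocate workspaces of the returned sizes */
lwork  = (int)wkopt;   /* optimal LWORK  returned in work[0]  */
liwork = iwkopt;       /* optimal LIWORK returned in iwork[0] */
double *work  = malloc(lwork  * sizeof(double));
int    *iwork = malloc(liwork * sizeof(int));

/* second call: the actual computation */
pdsyevd_("V", "U", &N, a, &ia, &ja, desca, w, z, &iz, &jz, descz,
         work, &lwork, iwork, &liwork, &info);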
For z, iz, jz, etc.: if jobz = 'V', z contains "the updated local part of the global matrix Z, where columns jz to jz+n-1 of the global matrix Z contain the orthonormal eigenvectors of the global matrix A". If jobz != 'V', these are not used.
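So Z needs its own local array and its own descriptor; unlike DSYEV, PDSYEVD does not overwrite A with the eigenvectors. A minimal sketch of setting up Z distributed exactly like A, assuming the usual BLACS/ScaLAPACK setup has already provided the context ictxt, the grid coordinates myrow/mycol, the grid size nprow/npcol, a block size nb, and the global size N:

extern int  numroc_(int* n, int* nb, int* iproc, int* isrcproc, int* nprocs);
extern void descinit_(int* desc, int* m, int* n, int* mb, int* nb,
                      int* irsrc, int* icsrc, int* ictxt, int* lld, int* info);

int izero = 0, info;
int iz = 1, jz = 1;   /* global (Fortran, 1-based) starting indices of the whole matrix */

/* local dimensions of the block-cyclically distributed Z on this process */
int loc_rows = numroc_(&N, &nb, &myrow, &izero, &nprow);
int loc_cols = numroc_(&N, &nb, &mycol, &izero, &npcol);
int lld = (loc_rows > 1) ? loc_rows : 1;

double *z = malloc(loc_rows * loc_cols * sizeof(double));
int descz[9];
descinit_(descz, &N, &N, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);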
More answers to questions: