scalapack matrix diagonalization (pdsyevd)

2.5k views Asked by At

I am writing a small test code for parallel matrix diagonalization using ScaLAPACK's divide-and-conquer algorithm PDSYEVD in C. However I am new to ScaLAPACK and looking at the source it appears a rather scary amount of variables to set for which I could not find good any examples. The type of matrices I need to diagonalize are real, symmetric, rather sparse and of size ~1000x1000.

A simple (serial) LAPACK-based code looks like this:

/* "matrix diagonalization using lapack" */
#include <stdio.h>
#include <stdlib.h>

/* DSYEV prototype */
extern void dsyev_(char* jobz, char* uplo, int* n, double* a, int* lda, 
                double* w, double* work, int* lwork, int* info );
#define n 4     

int main(int argc, char *argv[])
{
  int i,j,info,lda,lwork,N;
  double A[n][n], a[n*n], work[100*n],w[n];

  /* initialize matrices */
  for(i=0;i<n;i++) { for(j=0;j<n;j++) A[i][j] = (i+j); }

  /* diagonalize */
  N=n;  lda=n;  lwork=10*n;  info=0;
  for(i=0;i<n;i++) { for(j=0;j<n;j++) a[i+j*n]=A[i][j]; }
  dsyev_("V","L",&N,a,&lda,w,work,&lwork,&info);

  /* print result */
  for(i=0;i<n;i++) {
    printf("w[%d] = %8.4f  | v[%d] = [ ",i,w[i],i);
    for(j=0;j<n;j++) printf("%6.2f ",a[j+i*n]);
    printf("]\n");
  }

  return 0;
}

From there I would like to go pdsyevd, which should be called as as :

SUBROUTINE PDSYEVD( JOBZ, UPLO, N, A, IA, JA, DESCA, W, Z, IZ, JZ, DESCZ, WORK, LWORK, IWORK, LIWORK, INFO )

A number of things are unclear to me. An answer to any or several of these questions would be much appreciated.

  1. Does anyone happen to have an example?
  2. In the ScaLAPACK documentation it says : "[...] PDSYEVD assumes a homogeneous system [...] a hetereogeneous system may return incorrect results without any error message.". What does it mean in this context, a "homo-/hetereogeneous" system?
  3. At the end of the documentation it says : "The distributed submatrices must verify some alignment properties, namely [...] ". The terms in this expression aren't even in the input, how do I know if this is satisfied? MB_A/MB_Z?
  4. How do I chose the submatrices for each process? Is there any rationale? I suppose they should be more or less equal in size, but rows/columns/squares/...? Maybe larger for more sparse regions...?
  5. What is Z/IZ/IJ/DESCZ doing in the input? I am already giving each process a submatrix A/IA/JA/DESCA to work on, so why this Z? It appears to be the partial eigenvector matrix, but why is not written to A like in DSYEV?
  6. Do all processes have to have the same block size? Can there be overlap of the submatrices or what do I do for prime-sized matrices for example?
  7. What is in DESCA? Also, what is DLEN? It seems to be an input "array descriptor" for A. I am not sure what to make of this.

Here's more or less what I have so far, which is not doing much:

/* "matrix diagonalization using scalapack" */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* PDSYEVD prototype */
extern void pdsyevd_( char* jobz, char* uplo, int* n, double* a, int* ia, int* ja, int* desca,
                double* w, double* z, int* iz, int* jz, int* descz, double* work,
                int* lwork, double* iwork, int* liwork, int* info );

#define n 4

int main(int argc, char *argv[])
{
  int numprocs,myid,i,j,k,index,info,lwork,liwork,N;
  int ia,ja,desca[n];
  int iz,jz,descz[n];
  double A[n][n],w[n];
  double *a,*z,*work,*iwork;

  /* set up  MPI stuff */
  MPI_Init(&argc,&argv);                    // initialize MPI
  MPI_Comm_size(MPI_COMM_WORLD,&numprocs);  // find out size of world
  MPI_Comm_rank(MPI_COMM_WORLD,&myid);      // determine current mpi proc id

  /* initialize matrices */
  for(i=0;i<n;i++) { for(j=0;j<n;j++) A[i][j] = (i+j); }

  /* determine submatrices */

  /* let's use square blocks, of equal size:
     there number of processors (mp^2) should be
      - square (mp^2 = 1,2,4,9,16,..)
      - with mp being a divisor of n

     In case we have 4 procs and a 4x4 matrix for example

         (A11  A12  A13  A14)
         (A21  A22  A23  A24)
     A = (A31  A32  A33  A34)
         (A41  A42  A43  A44)

         ( a1  a2 )
       = ( a3  a3 )

     where      (A11  A12)
           a1 = (A21  A22)

     is the submatrix sent to process 1 for example 

    in this case :
      n=4
      nprocs=4 --> mp=2
      len = 2
    */

  int mp = sqrt(numprocs);
  if(numprocs!=mp*mp) {printf("ERROR: numprocs (%d) should be square.\n",numprocs); return 1; }
  if(n%mp!=0) {printf("ERROR: mp (%d) should be divisor of n (%d).\n",mp,n); return 1; }
  int len = n/mp;
  a = malloc((n*n+1)*sizeof(double));
  z = malloc((n*n+1)*sizeof(double));
  //a = malloc((len*len+1)*sizeof(double));
  //z = malloc((len*len+1)*sizeof(double));

  ia=(myid*len)%n; // from 0 to n in steps of len
  ja=(int)(myid/mp)*len;
  printf("proc %d has a submatrix of size %d, with starting indices (%d,%d)\n",myid,len,ia,ja);

  iz=ia;
  jz=ja;

  for(i=ia;i<ia+len;i++) { for(j=ja,k=0;j<ja+len;j++,k++) a[k]=A[i][j]; }
  for(i=iz;i<iz+len;i++) { for(j=jz,k=0;j<jz+len;j++,k++) z[k]=A[i][j]; }

  /* diagonalize */
  N=n;  lwork=n*n;  liwork=n*n;  info=0;
  work = malloc(lwork*sizeof(double));
  iwork = malloc(liwork*sizeof(double));

  for(i=0;i<n;i++) { for(j=0;j<n;j++) a[i+j*n]=A[i][j]; }
  pdsyevd_("V","U",&len,a,&ia,&ja,desca,w,z,&iz,&jz,descz,work,&lwork,iwork,&liwork,&info);
  if(!myid)printf("info = %d\n",info);

  /* gather & print results */
  double wf[n];
  MPI_Reduce(w,wf,n,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
  if(myid==0) {
    for(i=0;i<n;i++) {
      printf("wf[%d] = %8.4f  | v[%d] = [ ",i,wf[i],i);
      for(j=0;j<n;j++) printf("%6.2f ",a[j+i*n]);
      printf("]\n");
    }
  }

  free(a); free(z); free(work); free(iwork);

  /* clean up */
  MPI_Finalize(); /* MPI Programs end with MPI Finalize; this is a weak synchronization point */
  return 0;
}
1

There are 1 answers

1
Alex I On

Two useful references:

I believe the behavior described in the IBM docs matches ScaLAPACK, while being much more completely documented.

For work, lwork, iwork, etc: set lwork=0 and they should be allocated internally by the subroutine as needed, no need to pass them in.

For z, iz, jz, etc: if jobz = 'V', z contains "the updated local part of the global matrix Z, where columns jz to jz+ n-1 of the global matrix Z contain the orthonormal eigenvectors of the global matrix A". if jobz != 'V' these are not used.

More answers to questions:

  1. No example, sorry.
  2. Homogeneous system as in http://en.wikipedia.org/wiki/System_of_linear_equations#Homogeneous_systems
  3. Sorry, I'm not sure how to answer this.
  4. Squares with roughly equal number of non-zero elements. Roughly equal sized squares is also probably ok.
  5. The usage of Z is described above and in the docs I linked. I think it is provided as a convenience to implementers.
  6. No, any block size should be ok. I'm guessing break up prime sized matrix into two slightly unequal parts.
  7. DESCA is an array of 9 integers containing various sizes (global matrix rows/columns, block rows/columns, etc), described in the IBM docs.