Running a matrix algorithm on how many cores?


I am running a program called dnadist from PHYLIP (http://evolution.genetics.washington.edu/phylip/doc/dnadist.html). It creates a DNA distance matrix from the sequences you input.

Currently, I want to create a matrix from 14,778 sequences. I am submitting this to run on my University's HPCC and based on my calculated estimate it will take 10 days to run.

I want to request more cores to reduce the run time, but I am not sure whether the algorithm can be split across cores at all, or whether it has to run entirely on 1 core. My assumption is that I would have to alter the algorithm itself to split up the matrix being produced and then concatenate the pieces back together. Is this correct?


There are 2 answers

Oscar Foley On

Yes, you can parallelize; that is the main point of using HPCC. Without reading your code it is hard to answer. I assume your ECL code looks something like:

EXPORT CalculateDistances(parameters) := FUNCTION
    // For each parameter do your DNA magic (matrix calculation)
    RETURN something;
END;

result:= CalculateDistances(A,B,C,D...);
OUTPUT(result, named('result'));

You can parallelize by writing your function so that it performs only the basic pairwise calculation, invoking it through ECL's PARALLEL action, and running the workunit on Thor (not on hThor).

EXPORT CalculateDistance(Matrix1, Matrix2) := FUNCTION
    // Your DNA magic (basic matrix calculation) for Matrix1 & Matrix2
    RETURN something;
END;

result01:= CalculateDistance(A,B);
result02:= CalculateDistance(A,C);
result03:= CalculateDistance(A,D);
result04:= CalculateDistance(B,A);
result05:= CalculateDistance(B,C);
result06:= CalculateDistance(B,D);
result07:= CalculateDistance(C,A);
result08:= CalculateDistance(C,B);
result09:= CalculateDistance(C,D);
result10:= CalculateDistance(D,A);
result11:= CalculateDistance(D,B);
result12:= CalculateDistance(D,C);

executeCalculates := PARALLEL(
        result01,
        result02,
        result03,
        result04,
        result05,
        result06,
        result07,
        result08,
        result09,
        result10,
        result11,
        result12
);

executeCalculates;
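
For readers on a general-purpose cluster rather than ECL, the same idea can be sketched in Python: every pairwise distance is independent of the others, so the list of pairs can be cut into chunks and handed to separate worker processes. This is only a minimal sketch and not dnadist's algorithm. `p_distance` (the fraction of differing sites) is a simple stand-in for a real distance model, and all function names here are hypothetical. For a symmetric distance only the pairs with i < j are needed, which would cut the twelve calls listed above to six.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned sites that differ (stand-in for a real DNA distance model)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def chunk_distances(args):
    """Compute the distances for one chunk of (i, j) index pairs."""
    seqs, pairs = args
    return [(i, j, p_distance(seqs[i], seqs[j])) for i, j in pairs]


def split_pairs(n, n_chunks):
    """Deal all unordered pairs (i < j) round-robin into n_chunks lists."""
    pairs = list(combinations(range(n), 2))
    return [pairs[k::n_chunks] for k in range(n_chunks)]


def assemble(n, chunk_results):
    """Fill a symmetric n x n matrix from the per-chunk results."""
    matrix = [[0.0] * n for _ in range(n)]
    for chunk in chunk_results:
        for i, j, d in chunk:
            matrix[i][j] = matrix[j][i] = d
    return matrix


def parallel_distance_matrix(seqs, workers=4):
    """Compute the full matrix, one chunk of pairs per worker process."""
    jobs = [(seqs, pairs) for pairs in split_pairs(len(seqs), workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return assemble(len(seqs), pool.map(chunk_distances, jobs))


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "TCGA", "TTTT"]  # toy aligned sequences, not real data
    for row in parallel_distance_matrix(toy, workers=2):
        print(row)
```

Sending the whole sequence list to every worker keeps the sketch short; with thousands of long sequences you would more likely have each worker load the alignment from disk itself.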
DataMacGyver On

I'm not sure of a few things here: how PHYLIP runs the pairwise comparisons (if it's all at once, that's a helluva calculation!), what you are sequencing (bacterial proteins are orders of magnitude easier to fit in memory than the wheat genome would be), and how this is set up to run on HPCC (PHYLIP is written in C, I believe, so how has it been deployed?).

In short, genetic analyses are done all the time, so writing a bespoke program to do it yourself is probably a non-starter. There are other tools, such as MEGA, that can chew through distance calculations for you, but it's worth seeing what is being used in the literature for your problem and on what hardware. Perhaps also try R's dist.dna() function (from the ape package)? You could parallelise this if you wanted to (link), but you'd need some jiggery-pokery to make sure all the distances are done before you combine them.
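
The "make sure all the distances are done before you combine them" step might look roughly like the sketch below. It is written in Python rather than R, and every name in it is hypothetical. The assumption is that each cluster job handles an interleaved set of rows and returns (or, in practice, writes to a file) its own partial result, and that the combine step refuses to build the matrix if any pair is missing. `mismatch_fraction` is a toy stand-in for a real DNA distance.

```python
def mismatch_fraction(a, b):
    """Toy distance: fraction of aligned sites that differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def block_rows(n, n_jobs, job_id):
    """Rows for one job: job_id, job_id + n_jobs, ... (interleaving balances the triangle)."""
    return list(range(job_id, n, n_jobs))


def run_job(seqs, n_jobs, job_id, distance):
    """One independent job: distances from each assigned row i to every j > i."""
    out = {}
    for i in block_rows(len(seqs), n_jobs, job_id):
        for j in range(i + 1, len(seqs)):
            out[(i, j)] = distance(seqs[i], seqs[j])
    return out


def combine(n, partials):
    """Merge partial results; raise if any of the n*(n-1)/2 pairs is missing."""
    merged = {}
    for part in partials:
        merged.update(part)
    expected = n * (n - 1) // 2
    if len(merged) != expected:
        missing = [(i, j) for i in range(n) for j in range(i + 1, n)
                   if (i, j) not in merged]
        raise RuntimeError(
            f"{len(missing)} of {expected} pairs missing, e.g. {missing[:3]}")
    matrix = [[0.0] * n for _ in range(n)]
    for (i, j), d in merged.items():
        matrix[i][j] = matrix[j][i] = d
    return matrix


seqs = ["ACGT", "ACGA", "TCGA"]  # toy aligned sequences, not real data
partials = [run_job(seqs, 2, job_id, mismatch_fraction) for job_id in range(2)]
matrix = combine(len(seqs), partials)
```

Here the jobs run one after another in a list comprehension; on a cluster each `run_job` call would be a separate task, and `combine` would run only after all of them had finished.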

Is speed of calculation important? If you have 15,000 whole bacterial sequences (assuming 1,300 kbp each), they will fit in memory on a decent machine. Again, my guess is that someone has already built something that will do this. A few days on a desktop you've got lying around is fine, and it gives you a chance to write up your intro and methods!
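
A rough back-of-envelope check of the figures in this thread, as a Python sketch. The storage assumptions (one byte per base, a 2-bit packed encoding, 8-byte floats for the matrix) and the perfectly even split across cores are assumptions made for illustration, not properties of PHYLIP.

```python
# Sequence storage for the answer's example: 15,000 sequences of 1,300 kbp each.
n_seqs = 15_000
bases_per_seq = 1_300_000
total_bases = n_seqs * bases_per_seq       # 19_500_000_000 bases
gb_one_byte_per_base = total_bases / 1e9   # assuming 1 byte per base
gb_two_bits_per_base = total_bases / 4 / 1e9  # assuming 2-bit packing of A/C/G/T

# Work and output size for the question's 14,778 sequences.
n = 14_778
n_pairs = n * (n - 1) // 2                 # unique pairwise distances
matrix_gb = n * n * 8 / 1e9                # full matrix, assuming 8-byte floats

# The 10-day estimate implies an average cost per pair on one core.
seconds_total = 10 * 24 * 3600
seconds_per_pair = seconds_total / n_pairs


def ideal_days(cores):
    """Best-case wall time if the pairs split perfectly evenly across cores."""
    return 10 / cores


print(f"sequences: {gb_one_byte_per_base:.1f} GB raw, {gb_two_bits_per_base:.2f} GB packed")
print(f"pairs: {n_pairs:,}; matrix: {matrix_gb:.2f} GB")
print(f"about {seconds_per_pair * 1000:.1f} ms per pair; 16 cores: {ideal_days(16):.3f} days at best")
```

The ideal figure ignores job start-up, file I/O and the final combine step, so a real run would take somewhat longer.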