Fast Perl t-test function

I'm using Perl and R to analyze a large dataset of samples. For each pair of samples, I calculate the t-test p-value. Currently, I use the Statistics::R module to export values from Perl to R and then call R's t.test function. However, this process is extremely slow. Does anyone know of a Perl function that does the same thing more efficiently?

Thanks!

3

There are 3 answers

0
choroba On

You can also try PDL, in particular PDL::Stats.

4
Vincent Zoonekynd On

The volume of data, the number of dataset pairs, and perhaps even the code you have written would probably help us identify why your code is slow. For instance, sending many small datasets to R would be slow, but can probably be sped up simply by sending all the data at once.

For a pure Perl solution, you first need to compute the test statistic (that is easy, and already done in Statistics::TTest, for instance), and then convert it to a p-value. For that you need something like R's pt function (the cumulative distribution function of Student's t; qt is its inverse, the quantile function), but I am not sure it is readily available in Perl. Alternatively, you could send all the t-values to R in one block at the end and convert them to p-values there.
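The two steps described above (test statistic, then p-value) can be sketched in pure Perl without calling R. This is only a rough illustration under some assumptions: it uses Welch's unequal-variance test, and it gets the two-sided p-value from the regularized incomplete beta function, with a Lanczos log-gamma and a continued-fraction expansion in the style of Numerical Recipes. The sub names (mean_var, log_gamma, beta_cf, inc_beta, t_pvalue, welch_t_test) are made up for this example and do not come from any CPAN module. Compare it against R on a few pairs before relying on it.

```perl
use strict;
use warnings;

# Return (n, mean, unbiased variance) for an array reference.
sub mean_var {
    my ($x) = @_;
    my $n = scalar @$x;
    my $mean = 0;
    $mean += $_ / $n for @$x;
    my $ss = 0;
    $ss += ($_ - $mean) ** 2 for @$x;
    return ($n, $mean, $ss / ($n - 1));
}

# Natural log of the gamma function (Lanczos approximation, x > 0).
sub log_gamma {
    my ($x) = @_;
    my @cof = (76.18009172947146, -86.50532032941677, 24.01409824083091,
               -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5);
    my $y   = $x;
    my $tmp = $x + 5.5;
    $tmp -= ($x + 0.5) * log($tmp);
    my $ser = 1.000000000190015;
    $ser += $_ / ++$y for @cof;
    return -$tmp + log(2.5066282746310005 * $ser / $x);
}

# Continued fraction for the incomplete beta function (modified Lentz method).
sub beta_cf {
    my ($p, $q, $x) = @_;
    my ($tiny, $eps) = (1e-300, 1e-12);
    my $c = 1;
    my $d = 1 - ($p + $q) * $x / ($p + 1);
    $d = $tiny if abs($d) < $tiny;
    $d = 1 / $d;
    my $h = $d;
    for my $m (1 .. 500) {
        my $m2 = 2 * $m;
        # Even step of the recurrence.
        my $aa = $m * ($q - $m) * $x / (($p - 1 + $m2) * ($p + $m2));
        $d = 1 + $aa * $d;  $d = $tiny if abs($d) < $tiny;
        $c = 1 + $aa / $c;  $c = $tiny if abs($c) < $tiny;
        $d = 1 / $d;
        $h *= $d * $c;
        # Odd step of the recurrence.
        $aa = -($p + $m) * ($p + $q + $m) * $x / (($p + $m2) * ($p + 1 + $m2));
        $d = 1 + $aa * $d;  $d = $tiny if abs($d) < $tiny;
        $c = 1 + $aa / $c;  $c = $tiny if abs($c) < $tiny;
        $d = 1 / $d;
        my $del = $d * $c;
        $h *= $del;
        last if abs($del - 1) < $eps;
    }
    return $h;
}

# Regularized incomplete beta function I_x(p, q).
sub inc_beta {
    my ($p, $q, $x) = @_;
    return 0 if $x <= 0;
    return 1 if $x >= 1;
    my $bt = exp(log_gamma($p + $q) - log_gamma($p) - log_gamma($q)
               + $p * log($x) + $q * log(1 - $x));
    return $x < ($p + 1) / ($p + $q + 2)
        ? $bt * beta_cf($p, $q, $x) / $p
        : 1 - $bt * beta_cf($q, $p, 1 - $x) / $q;
}

# Two-sided p-value for a t statistic with (possibly fractional) df.
sub t_pvalue {
    my ($t, $df) = @_;
    return inc_beta($df / 2, 0.5, $df / ($df + $t * $t));
}

# Welch's unequal-variance t-test; returns (t, df, two-sided p-value).
sub welch_t_test {
    my ($x, $y) = @_;
    my ($n1, $m1, $v1) = mean_var($x);
    my ($n2, $m2, $v2) = mean_var($y);
    my ($s1, $s2) = ($v1 / $n1, $v2 / $n2);   # squared standard errors
    my $t  = ($m1 - $m2) / sqrt($s1 + $s2);
    my $df = ($s1 + $s2) ** 2
           / ($s1 ** 2 / ($n1 - 1) + $s2 ** 2 / ($n2 - 1));
    return ($t, $df, t_pvalue($t, $df));
}

my ($t, $df, $p) = welch_t_test([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]);
printf "t = %.4f, df = %.4f, p = %.4f\n", $t, $df, $p;
```

Since everything stays inside one Perl process, looping welch_t_test over all sample pairs avoids the per-pair round trip to R, which is probably where most of the time goes.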

0
flies On

The Statistics::TTest module gives you a p-value.

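use strict;
use warnings;
use feature 'say';
use Statistics::TTest;

# Two samples of 32 uniform random values; the second is shifted down by 2.
my @r1 = map { rand(10)     } 1..32;
my @r2 = map { rand(10) - 2 } 1..32;

my $ttest = Statistics::TTest->new;
$ttest->load_data(\@r1, \@r2);
say "p-value = prob > |T| = ", $ttest->{t_prob};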

Playing around a bit, I find that the p-values this gives you are slightly lower than the ones R reports. R is apparently doing something that reduces the degrees of freedom, but my knowledge of statistics is insufficient to explain what it's doing or why. (In the above example, the difference is about 1%. With samples of 320 floats instead of 32, the relative difference can reach 50% or more, but that is the difference between 1e-12 and 1.5e-12.) If you need precise p-values, take care to check which variant of the test you are getting.
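One likely source of the gap: R's t.test defaults to Welch's test (var.equal = FALSE), which does not assume equal variances and uses the Welch-Satterthwaite degrees of freedom, usually a non-integer value below the pooled n1 + n2 - 2. Fewer degrees of freedom give a heavier-tailed reference distribution and therefore a larger p-value for the same t. The sketch below just computes both df values for a pair of samples so you can see the size of the reduction; the sub names (variance, pooled_df, welch_df) are invented for this illustration.

```perl
use strict;
use warnings;

# Unbiased sample variance of an array reference.
sub variance {
    my ($x) = @_;
    my $n = scalar @$x;
    my $mean = 0;
    $mean += $_ / $n for @$x;
    my $ss = 0;
    $ss += ($_ - $mean) ** 2 for @$x;
    return $ss / ($n - 1);
}

# Degrees of freedom for the pooled (equal-variance) t-test.
sub pooled_df {
    my ($x, $y) = @_;
    return scalar(@$x) + scalar(@$y) - 2;
}

# Welch-Satterthwaite degrees of freedom for the unequal-variance t-test.
sub welch_df {
    my ($x, $y) = @_;
    my ($n1, $n2) = (scalar @$x, scalar @$y);
    my $s1 = variance($x) / $n1;   # squared standard error of mean 1
    my $s2 = variance($y) / $n2;   # squared standard error of mean 2
    return ($s1 + $s2) ** 2
         / ($s1 ** 2 / ($n1 - 1) + $s2 ** 2 / ($n2 - 1));
}

my @s1 = (1, 2, 3, 4, 5);
my @s2 = (2, 4, 6, 8, 10);
printf "pooled df = %d, Welch df = %.4f\n",
    pooled_df(\@s1, \@s2), welch_df(\@s1, \@s2);
```

In R, passing var.equal = TRUE to t.test switches to the pooled test, which is a quick way to check whether this accounts for the difference on your data.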