Invalid use of group function; attempting to find pearson correlation

212 views Asked by At

I'm trying to figure how to calculate the pearson correlation coefficient using sql. Here is the formula I'm using: enter image description here and here is the table I'm using: enter image description here

This is what I have so far for a query but it's giving me this message: Invalid use of group function

select first_id, second_id, movie_id, first_score, second_score,  count(*) as n, 
sum((first_score-avg(first_score))*(second_score-avg(second_score)))/
(
sqrt(sum(first_score-avg(first_score)))*
sqrt(sum(second_score-avg(second_score))))
as pearson
from connections
group by second_id

Thanks for helping

2

There are 2 answers

0
Gordon Linoff On BEST ANSWER

Here is a query that does the calculation in the formula:

select sum((first_score - avg_first_score)*(second_score - avg_second_score)) /
       (sqrt(sum(pow((first_score - avg_first_score), 2)))*
        sqrt(sum(pow((second_score - avg_second_score), 2)))
       ) as r      
from connections c cross join
     (select avg(first_score) as avg_first_score, avg(second_score) as avg_second_score
      from connections
     ) const;

There are numerous issues with your attempt. This precalculates the average values for the two scores. It then applies the formula pretty much as written.

0
Bohemian On

From a purely syntactic perspective, you've got a problem with your group by clause. It should list every non-aggregated column to work properly. It should be:

group by first_id, second_id, movie_id, first_score, second_score