Difference in values between numpy.correlate and numpy.corrcoef?


It was my understanding that numpy.correlate and numpy.corrcoef should yield the same result for aligned normalized vectors. Two immediate cases to the contrary:

from math import isclose as near
import numpy as np


def normalizedCrossCorrelation(a, b):
    assert len(a) == len(b)
    normalized_a = [aa / np.linalg.norm(a) for aa in a]
    normalized_b = [bb / np.linalg.norm(b) for bb in b]
    return np.correlate(normalized_a, normalized_b)[0]


def test_normalizedCrossCorrelationOfSimilarVectorsRegression0():
    v0 = [1, 2, 3, 2, 1, 0, -2, -1, 0]
    v1 = [1, 1.9, 2.8, 2, 1.1, 0, -2.2, -0.9, 0.2]
    assert near(normalizedCrossCorrelation(v0, v1), 0.9969260391224474)
    print(f"{np.corrcoef(v0, v1)=}")
    assert near(normalizedCrossCorrelation(v0, v1), np.corrcoef(v0, v1)[0, 1])


def test_normalizedCrossCorrelationOfSimilarVectorsRegression1():
    v0 = [1, 2, 3, 2, 1, 0, -2, -1, 0]
    v1 = [0.8, 1.9, 2.5, 2.1, 1.2, -0.3, -2.4, -1.4, 0.4]
    assert near(normalizedCrossCorrelation(v0, v1), 0.9809817769512982)
    print(f"{np.corrcoef(v0, v1)=}")
    assert near(normalizedCrossCorrelation(v0, v1), np.corrcoef(v0, v1)[0, 1])

Pytest output:

E       assert False
E        +  where False = near(0.9969260391224474, 0.9963146417122921)
E        +    where 0.9969260391224474 = normalizedCrossCorrelation([1, 2, 3, 2, 1, 0, ...], [1, 1.9, 2.8, 2, 1.1, 0, ...])


E       assert False
E        +  where False = near(0.9809817769512982, 0.9826738919606931)
E        +    where 0.9809817769512982 = normalizedCrossCorrelation([1, 2, 3, 2, 1, 0, ...], [0.8, 1.9, 2.5, 2.1, 1.2, -0.3, ...])

There is 1 answer

Ruggero Turra On

I think your formula with np.correlate is wrong: it does not yield the correlation coefficient.

Consider the first example

import numpy as np

v0 = np.array([1, 2, 3, 2, 1, 0, -2, -1, 0])
v1 = np.array([1, 1.9, 2.8, 2, 1.1, 0, -2.2, -0.9, 0.2])

np.correlate(v0 / np.linalg.norm(v0), v1 / np.linalg.norm(v1))[0]  # 0.9969260391224474
# equivalently, you can use the following, which gives the same number:
#    np.correlate(v0, v1, mode='valid') / np.linalg.norm(v0) / np.linalg.norm(v1)
np.corrcoef(v0, v1)[0][1]                                          # 0.9963146417122921

The correct answer, computed exactly (without floating point), is 59 Sqrt[5/17534], which is approximately 0.99631464171229218403. That matches the np.corrcoef result, not the np.correlate one.
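The exact value can be reproduced with rational arithmetic. The following is a minimal sketch using only the standard library; the variable names (sx, sxy, r_squared and so on) are illustrative, and the decimal inputs are passed as strings so that Fraction stores them exactly.

```python
from fractions import Fraction
import math

# Parse the inputs as exact rationals (e.g. "1.9" becomes 19/10).
v0 = [Fraction(x) for x in ("1", "2", "3", "2", "1", "0", "-2", "-1", "0")]
v1 = [Fraction(x) for x in ("1", "1.9", "2.8", "2", "1.1", "0", "-2.2", "-0.9", "0.2")]
n = len(v0)

# Raw sums needed by the textbook formula for Pearson's r.
sx, sy = sum(v0), sum(v1)
sxy = sum(x * y for x, y in zip(v0, v1))
sxx = sum(x * x for x in v0)
syy = sum(y * y for y in v1)

# r^2 = (n*Sxy - Sx*Sy)^2 / ((n*Sxx - Sx^2) * (n*Syy - Sy^2)), kept exact.
r_squared = (n * sxy - sx * sy) ** 2 / ((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Only the final square root uses floating point (the numerator is positive here).
r = math.sqrt(r_squared)
print(r_squared, r)
```

Here r_squared reduces to 59**2 * 5 / 17534, so r is 59 Sqrt[5/17534].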

Note that

np.correlate(a, b)

returns, when a and b are 1-D arrays of the same size, a one-element array holding their scalar product (i.e. np.dot(a, b)). The covariance can be computed (although this is not recommended) as E[v0 v1] - E[v0]E[v1]. This can be done as

(np.correlate(v0 , v1 , mode='valid') / len(v0) - np.mean(v0) * np.mean(v1))[0]

This is equal to np.cov(v0, v1, ddof=0)[0][1], so you can compute the correlation as

((np.correlate(v0 , v1 , mode='valid') / len(v0) - np.mean(v0) * np.mean(v1)) / np.std(v0) / np.std(v1))[0]
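As a quick check of the steps above, this sketch compares each intermediate value against the corresponding NumPy routine. The names dot_via_correlate, cov and r are illustrative. It assumes the population convention (ddof=0) throughout, which is also the default for np.std.

```python
import numpy as np

v0 = np.array([1, 2, 3, 2, 1, 0, -2, -1, 0], dtype=float)
v1 = np.array([1, 1.9, 2.8, 2, 1.1, 0, -2.2, -0.9, 0.2])

# For equal-length 1-D inputs the default mode ('valid') returns a
# one-element array holding the dot product.
dot_via_correlate = np.correlate(v0, v1)[0]
print(np.isclose(dot_via_correlate, np.dot(v0, v1)))

# Covariance as E[v0 v1] - E[v0] E[v1] (population form).
cov = dot_via_correlate / len(v0) - v0.mean() * v1.mean()
print(np.isclose(cov, np.cov(v0, v1, ddof=0)[0, 1]))

# Dividing by the two (population) standard deviations gives Pearson's r.
r = cov / (v0.std() * v1.std())
print(np.isclose(r, np.corrcoef(v0, v1)[0, 1]))
```

The ddof choice cancels in the final ratio as long as the covariance and both standard deviations use the same one.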

In practice, though, just use np.corrcoef or np.cov.

Math explanation

Your formula using np.correlate is equivalent to:

E[v0 * v1] / sqrt( E[v0 ** 2] E[v1 ** 2] )

where E is the sample mean. But the correlation coefficient can be computed as

(E[v0 * v1] - E[v0] * E[v1]) / sqrt( (E[v0 ** 2] - E[v0] ** 2) * (E[v1 ** 2] - E[v1] ** 2) )
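The two expressions differ only in the mean terms, so they coincide when both vectors have zero mean. One possible repair of the function in the question, sketched here under that observation, is to subtract each vector's mean before normalizing. The function name centered_normalized_cross_correlation is hypothetical, not something from NumPy.

```python
import numpy as np


def centered_normalized_cross_correlation(a, b):
    """Zero-lag normalized cross-correlation of mean-centered 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    assert a.ndim == 1 and a.shape == b.shape
    # Remove the means so that E[a] = E[b] = 0 and both formulas agree.
    a = a - a.mean()
    b = b - b.mean()
    # Equal-length inputs: np.correlate returns a single value, the dot product.
    return np.correlate(a / np.linalg.norm(a), b / np.linalg.norm(b))[0]


v0 = [1, 2, 3, 2, 1, 0, -2, -1, 0]
v1 = [0.8, 1.9, 2.5, 2.1, 1.2, -0.3, -2.4, -1.4, 0.4]
r = centered_normalized_cross_correlation(v0, v1)
print(r, np.corrcoef(v0, v1)[0, 1])
```

With the centering step in place, the value should agree with np.corrcoef(v0, v1)[0, 1] up to rounding. A constant vector has zero norm after centering, so that case would need separate handling.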