Force intercept at origin with scipy.stats.pearsonr or alternatives?


I checked the documentation of scipy.stats.pearsonr but did not find any option to force the intercept through the origin (0, 0).

If this is impossible with scipy.stats.pearsonr, does anyone know of any alternatives?

Nick ODell On

"I checked the documentation of scipy.stats.pearsonr but did not find any options to force intercept at origin 0,0."

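The Pearson correlation coefficient has the property that you can add any constant to either sample, or multiply either sample by any positive number, and the calculated coefficient won't change. (Multiplying by a negative number flips its sign, and multiplying by zero makes the sample constant, which leaves the coefficient undefined.)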

For example, if you compare a sample to the same sample plus 10, it still has a correlation of 1.0 with the original.

from scipy.stats import pearsonr
import numpy as np


a = np.array([0, 1, 1, 2, 3, 5, 8, 13, 21, 34])
b = a + 10
print(pearsonr(a, b))
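The same property can be checked for scaling as well as shifting. The following is a minimal sketch that reuses the sample above; the variable names r_shift, r_scale and r_flip are only illustrative and are not part of the original answer.

```python
import numpy as np
from scipy.stats import pearsonr

a = np.array([0, 1, 1, 2, 3, 5, 8, 13, 21, 34], dtype=float)

# pearsonr returns (statistic, p-value); tuple unpacking keeps only r.
r_shift, _ = pearsonr(a, a + 10)      # adding a constant
r_scale, _ = pearsonr(a, 3 * a)       # multiplying by a positive number
r_flip, _ = pearsonr(a, -2 * a + 5)   # a negative factor flips the sign

# Expect roughly 1.0, 1.0 and -1.0 (up to floating-point error).
print(r_shift, r_scale, r_flip)
```

Because r ignores both the offset and the (positive) scale, it carries no information about where a fitted line would cross the axes.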

I assume what you're asking for is a version of Pearson's correlation coefficient where being wrong by a constant does matter. Scikit-learn has something similar to this.

Example:

from sklearn.metrics import r2_score
import numpy as np


a = np.array([0, 1, 1, 2, 3, 5, 8, 13, 21, 34])
b = a + 10
print(r2_score(a, b))

In this example, being wrong by a constant does matter, and it gets an R^2 of about 0.087.
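To see where the 0.087 comes from, here is a sketch that computes the same quantity by hand with NumPy only, assuming the usual definition R^2 = 1 - SS_res / SS_tot (the names ss_res, ss_tot and r2 are just illustrative).

```python
import numpy as np

a = np.array([0, 1, 1, 2, 3, 5, 8, 13, 21, 34], dtype=float)
b = a + 10

# Residual sum of squares: 10 points, each off by exactly 10.
ss_res = np.sum((a - b) ** 2)
# Total sum of squares: spread of the true values around their own mean.
ss_tot = np.sum((a - a.mean()) ** 2)

r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # 0.087
```

The constant offset goes straight into ss_res, which is why it lowers the score here even though it leaves the Pearson correlation at 1.0.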

You should be aware of four gotchas about this:

  1. The r2_score() function is affected by both scaling the sample and adding a constant. In other words, r2_score(a, a * 2) will be a score less than 1.0.
  2. This is R^2, not R.
  3. This score is in the range (-∞, 1]. A score of 1.0 means a perfect match. A score of 0.0 means the match is only as accurate as predicting the mean of the true values (the first argument) for every point. Unlike pearsonr(), -1 does not mean that the samples are perfectly negatively correlated. Instead, a negative score means the match is worse than predicting the mean, and the score can be arbitrarily negative, because the predictions can be arbitrarily bad.
  4. Unlike Pearson correlation, this score is not symmetric. In other words, r2_score(a, b) is not necessarily the same thing as r2_score(b, a).
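The gotchas above can be illustrated with a small sketch. It uses a hypothetical helper, r2_manual, that follows the usual 1 - SS_res / SS_tot definition rather than calling scikit-learn directly, so treat it as an approximation of what r2_score does with its default settings.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Hypothetical helper: R^2 as 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

a = np.array([0, 1, 1, 2, 3, 5, 8, 13, 21, 34], dtype=float)

perfect = r2_manual(a, a)          # identical samples score 1.0
forward = r2_manual(a, 2 * a)      # scaling matters: the score is negative
backward = r2_manual(2 * a, a)     # swapping arguments gives a different score
far_off = r2_manual(a, a + 1000)   # a large offset drives the score far below -1

print(perfect, forward, backward, far_off)
```

Only the first argument's spread enters SS_tot, which is why swapping the arguments changes the result.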

See the scikit-learn documentation for r2_score for more.