Minimal p-value for scipy.stats.pearsonr

2.4k views Asked by At

I am running scipy.stats.pearsonr on my data, and I get

(0.9672434106763087, 0.0) # (r, p)

It is reasonable that the r-value is high and the p-value is very low. However, the true p-value is obviously not exactly 0, so I would like to know what p=0.0 means. Is it p<1e-10, p<1e-100, or what is the limit?

2

There are 2 answers

0
Bob On

As pointed out by @MB-F in the comments it is calculated analytically.

In the code for the version 0.19.1, you could isolate that part of the code and plot the p-value in terms of r

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import betainc
r = np.linspace(-1, 1, 1000)*(1-1e-10);

for n in [10, 100, 1000]:
    df = n - 2
    t_squared = r**2 * (df / ((1.0 - r) * (1.0 + r)))
    prob = betainc(0.5*df, 0.5, df/(df+t_squared))
    plt.semilogy(r, prob, label=f'n={n}')
plt.axvline(0.9672434106763087, ls='--', color='black', label='r value')
plt.legend()
plt.grid()

The current stable version 1.9.3 uses a different formula

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import btdtr
r = np.linspace(-1, 1, 1000)*(1-1e-10);
for n in [10, 100, 1000]:
    ab = 0.5*n
    prob = btdtr(ab, ab, 0.5*(1-abs(r)))
    plt.semilogy(r, prob, label=f'n={n}')
plt.axvline(0.9672434106763087, ls='--', color='black', label='r value')
plt.legend()
plt.grid()

But yield the same results.

You can see that if you have 1000 points and your correlation, the p value will be less than the minimum floating value.

The beta distribution

Scipy provides a collection of probability distributions, among them, the beta distribution.

The line

    prob = btdtr(ab, ab, 0.5*(1-abs(r)))

could be replaced by

from scipy.stats import beta
prob = beta(ab, ab).cdf(0.5*(1-abs(r)))

There you can get much more information about it.

0
Matt Haberland On

If you're interested in the magnitude of p-values below the floating point minimum, you can use mpmath.

The pearsonr documentation states that the two-sided p-value is given by.

dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
p = 2*dist.cdf(-abs(r))

According to Wikipedia, the CDF of the beta distribution is the regularized incomplete beta function.

Putting it together:

from mpmath import mp
mp.dps = 1000

r = mp.mpf(0.9672434106763087)
n = 1000

x = (-abs(r) + 1)/2  # shift per `loc=-1`, scale per `scale=2`
p = 2*mp.betainc(n/2 - 1, n/2 - 1, 0, x, regularized=True)
# 1.534373680362601e-596

If you want to stick with floating point arithmetic, you'll have to work in log-space. Unfortunately, the log of the regularized incomplete beta function is not implemented in SciPy, but you can compute the log of the regularized incomplete beta function from the log of the integrand using quadrature.

import numpy as np
from scipy.integrate._tanhsinh import _tanhsinh
from scipy import special

r = 0.9672434106763087
n = 1000

x = (-abs(r) + 1)/2 
a = b = n/2 - 1
def logf(t):
    # log of incomplete beta function integrand
    return (a-1)*np.log(t) + (b-1)*np.log1p(-t)

# multiplying by 2 -> adding log(2)
# regularization / dividing by beta(a, b) -> subtracting betaln(a, b)
_tanhsinh(logf, 0, x, log=True).integral + np.log(2) - special.betaln(a, b)
# -1371.9125931524995  # log(1.534373680362601e-596)