Is the scipy.stats.ks_2samp function supposed to take in raw data arrays or the ECDF of the data?

55 views Asked by At

I'm confused on how to use the scipy stat function ks_2samp. I have 2 samples of discrete data and want to see if they are significantly different. I have tried 3 things:

  1. directly input the raw data
  2. input the ecdfs of the data
  3. input pdf fit of the data

I'm getting different values each time and I do not know which is the right one and the right method. From the scipy.stats.ks_2samp documentation, it looks like data1 and data2 are raw data samples. But the Wikipedia page says that I need to input empirical data. I'm working with a huge dataset. Also, please suggest if there's any other method to achieve the same.

Following is my code snippet:

    import numpy as np 
    import matplotlib.pyplot as plt 
    import scipy.stats as ss 
    from scipy.stats import norm 
    import pandas as pd
        
    # x0 is a dataframe column
    
    mean0 = np.mean(x0) 
    std0 = np.std(x0) 
    pdf0 = ss.norm.pdf(x0.sort_values(), mean0, std0)
        
    mean1 = np.mean(x1) 
    std1 = np.std(x1) 
    pdf1 = ss.norm.pdf(x1.sort_values(), mean1, std1)
        
    def ecdf(x):
            xs = np.sort(x)
            ys = np.arange(1, len(xs)+1)/float(len(xs))
            return xs, ys
        
      
    xs0, ys0 = ecdf(x0) xs1, ys1 = ecdf(x0)
        
    rslt1=ss.ks_2samp(x0,x1) 
    rslt2=ss.ks_2samp(ys0,ys1) 
    rslt3=ss.ks_2samp(pdf0,pdf1)
        
        
    print(rslt1) 
    print(rslt2) 
    print(rslt3)
1

There are 1 answers

0
Matt Haberland On

scipy.stats.ks_2samp accepts raw samples. You do not need to do any fitting, parametric or non-parametric (e.g. ecdf), beforehand. So #1 of the things you tried and rslt1 of your code look valid.

One way you can confirm this is by repeatedly generating samples under the null hypothesis and checking the distribution of p-values. Roughly speaking, they should be uniformly distributed (although there are some footnotes in the case of this test due to the discrete nature of the statistic).

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(459324692354693456)
pvals = []
for i in range(10000):
  x = rng.normal(size=500)
  y = rng.normal(size=500)
  res = stats.ks_2samp(x, y, method='asymp')
  pvals.append(res.pvalue)

res = stats.ecdf(pvals)
res.cdf.plot()
plt.plot([0, 1], [0, 1])
plt.title("ECDF of p-values under H0")
plt.xlabel("p-value")
plt.ylabel("Empirical CDF")
plt.legend(["ECDF", "Expected"])

ECDF of p-values under H0

If you were to pre-process the data according to ideas #2 or #3, the distribution of p-values would not be as expected.

Other tests of the null hypothesis that the two samples were drawn from the same distribution are scipy.stats.cramervonmises_2samp, scipy.stats.anderson_ksamp, and scipy.stats.epps_singleton_2samp. The last two do not assume the underlying distribution is continuous.

There are many other independent sample tests that may be more sensitive to specific deviations, but these are the ones that are most similar in usage to Kolmogorov-Smirnov.

Note that there is also a scipy.stats.ecdf as of SciPy 1.11.0, and if you know what distribution you want to fit, you can use the distribution's fit method or scipy.stats.fit.