I'm confused on how to use the scipy stat function ks_2samp. I have 2 samples of discrete data and want to see if they are significantly different. I have tried 3 things:
- directly input the raw data
- input the ecdfs of the data
- input pdf fit of the data
I'm getting different values each time and I do not know which is the right one and the right method. From the scipy.stats.ks_2samp documentation, it looks like data1 and data2 are raw data samples. But the Wikipedia page says that I need to input empirical data. I'm working with a huge dataset. Also, please suggest if there's any other method to achieve the same.
Following is my code snippet:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as ss
from scipy.stats import norm
import pandas as pd
# x0 is a dataframe column
mean0 = np.mean(x0)
std0 = np.std(x0)
pdf0 = ss.norm.pdf(x0.sort_values(), mean0, std0)
mean1 = np.mean(x1)
std1 = np.std(x1)
pdf1 = ss.norm.pdf(x1.sort_values(), mean1, std1)
def ecdf(x):
xs = np.sort(x)
ys = np.arange(1, len(xs)+1)/float(len(xs))
return xs, ys
xs0, ys0 = ecdf(x0) xs1, ys1 = ecdf(x0)
rslt1=ss.ks_2samp(x0,x1)
rslt2=ss.ks_2samp(ys0,ys1)
rslt3=ss.ks_2samp(pdf0,pdf1)
print(rslt1)
print(rslt2)
print(rslt3)
scipy.stats.ks_2samp
accepts raw samples. You do not need to do any fitting, parametric or non-parametric (e.g. ecdf), beforehand. So #1 of the things you tried andrslt1
of your code look valid.One way you can confirm this is by repeatedly generating samples under the null hypothesis and checking the distribution of p-values. Roughly speaking, they should be uniformly distributed (although there are some footnotes in the case of this test due to the discrete nature of the statistic).
If you were to pre-process the data according to ideas #2 or #3, the distribution of p-values would not be as expected.
Other tests of the null hypothesis that the two samples were drawn from the same distribution are
scipy.stats.cramervonmises_2samp
,scipy.stats.anderson_ksamp
, andscipy.stats.epps_singleton_2samp
. The last two do not assume the underlying distribution is continuous.There are many other independent sample tests that may be more sensitive to specific deviations, but these are the ones that are most similar in usage to Kolmogorov-Smirnov.
Note that there is also a
scipy.stats.ecdf
as of SciPy 1.11.0, and if you know what distribution you want to fit, you can use the distribution'sfit
method orscipy.stats.fit
.