Why does the Kolmogorov-Smirnov test fail in this case?


Asked by mat on 12 April 2021 at 10:37

I have these two time series, and I want to test if they come from the same distribution. So I applied the scipy.stats.ks_2samp() test. But the test returns a p-value of 0.0028, whereas describe() gives these statistics:

count   120.000000  120.000000
mean    0.785867    0.774267
std     0.323941    0.304894
min     0.610000    0.610000
25%     0.619000    0.610000
50%     0.619000    0.619000
75%     0.749000    0.769500
max     1.812000    1.742000

So I don't get why the test rejects the null hypothesis, when mean and standard deviation are pretty similar. Also plots of the (cumulative) distributions look very similar.

Can anybody help me?

Here are my data and the test call:

import pandas as pd
from scipy import stats

df = pd.DataFrame(data=[[
    0.62, 0.61, 0.61, 0.619, 0.619, 0.619, 0.62, 0.619, 0.61,
    0.619, 0.62, 0.619, 0.619, 0.62, 0.611, 0.62, 0.62, 0.61,
    0.619, 0.61, 0.619, 0.62, 0.642, 0.67, 0.749, 0.838, 0.862,
    0.804, 0.89, 0.942, 1.012, 1.13, 1.14, 1.191, 1.201, 1.123,
    1.299, 1.359, 1.411, 1.362, 1.352, 1.44, 1.451, 1.46, 1.557,
    1.491, 1.622, 1.639, 1.787, 1.812, 1.665, 1.612, 1.253, 0.936,
    0.704, 0.643, 0.62, 0.619, 0.62, 0.61, 0.619, 0.62, 0.619,
    0.62, 0.61, 0.619, 0.61, 0.619, 0.62, 0.619, 0.62, 0.62,
    0.619, 0.62, 0.62, 0.619, 0.62, 0.619, 0.619, 0.62, 0.619,
    0.619, 0.619, 0.619, 0.61, 0.61, 0.619, 0.619, 0.619, 0.62,
    0.619, 0.619, 0.619, 0.619, 0.61, 0.619, 0.619, 0.62, 0.619,
    0.61, 0.619, 0.619, 0.619, 0.619, 0.61, 0.619, 0.619, 0.62,
    0.619, 0.61, 0.619, 0.619, 0.62, 0.619, 0.749, 0.63, 0.62,
    0.61, 0.619, 0.619],
    [0.801, 0.644, 0.62, 0.62, 0.61, 0.61, 
    0.619, 0.62, 0.61, 0.61, 0.61, 0.61, 0.619, 0.619, 0.62,
    0.61, 0.619, 0.61, 0.619, 0.62, 0.62, 0.629, 0.689, 0.759,
    0.849, 0.84, 0.918, 1.019, 0.967, 0.92, 0.976, 1.089, 1.062,
    1.219, 1.202, 1.261, 1.387, 1.422, 1.39, 1.264, 1.281, 1.35,
    1.32, 1.419, 1.568, 1.554, 1.623, 1.592, 1.709, 1.742, 1.535,
    1.123, 0.84, 0.682, 0.63, 0.62, 0.61, 0.61, 0.619, 0.62,
    0.61, 0.61, 0.61, 0.61, 0.619, 0.62, 0.61, 0.619, 0.61,
    0.62, 0.61, 0.62, 0.61, 0.61, 0.619, 0.62, 0.62, 0.61,
    0.61, 0.61, 0.619, 0.62, 0.61, 0.619, 0.62, 0.61, 0.61,
    0.61, 0.61, 0.61, 0.619, 0.62, 0.62, 0.61, 0.61, 0.61,
    0.619, 0.619, 0.619, 0.61, 0.618, 0.61, 0.61, 0.619, 0.61,
    0.61, 0.61, 0.61, 0.619, 0.619, 0.62, 0.61, 0.619, 0.62,
    0.62, 0.61, 0.619, 0.61, 0.61, 0.61]]).T

# Two-sample Kolmogorov-Smirnov test on the two columns; print only the p-value
print(stats.ks_2samp(df.iloc[:, 1], df.iloc[:, 0]).pvalue)


**Arne** · Accepted Answer · 2021-04-12T22:52:55+00:00

The Kolmogorov-Smirnov test did not fail. The seemingly flat tails of the two series really are substantially different from each other. We can see this by zooming in on the tails (starting at index 60) and sorting the values in each series for ease of comparison:

import matplotlib.pyplot as plt

plt.plot(df.iloc[60:, 0].sort_values(ignore_index=True))
plt.plot(df.iloc[60:, 1].sort_values(ignore_index=True), color='orange')
plt.ylim([0.605, 0.625])
plt.show()
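The difference in the tails can also be checked numerically. The KS statistic is the largest vertical gap between the two empirical CDFs, so it reacts to how the observations are spread over the baseline readings, not to the mean or standard deviation. Here is a minimal sketch with synthetic data (not the series from the question; the proportions and the helper `ecdf` are made up for illustration):

```python
import numpy as np
from scipy import stats

# Synthetic samples (NOT the asker's data): both sit on the same three
# baseline readings, but in different proportions.
a = np.array([0.61] * 10 + [0.619] * 70 + [0.62] * 20)
b = np.array([0.61] * 60 + [0.619] * 25 + [0.62] * 15)


def ecdf(sample, points):
    """Fraction of `sample` that is <= each value in `points`."""
    return np.searchsorted(np.sort(sample), points, side="right") / len(sample)


# The KS statistic is the largest vertical gap between the two ECDFs,
# evaluated at every observed value.
grid = np.union1d(a, b)
gaps = np.abs(ecdf(a, grid) - ecdf(b, grid))
d_manual = gaps.max()
worst_value = grid[gaps.argmax()]

result = stats.ks_2samp(a, b)
print(f"means: {a.mean():.4f} vs {b.mean():.4f}")
print(f"largest ECDF gap {d_manual:.2f} at value {worst_value}")
print(f"ks_2samp: statistic={result.statistic:.2f}, pvalue={result.pvalue:.2g}")
```

In this toy case the means differ by less than 0.01, yet half of the probability mass sits on a different reading, and that is the kind of gap the test picks up. The tails of your two series show the same pattern.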

I don't know whether this is an artefact of how the data were recorded or a real effect. In any case, note that the Kolmogorov-Smirnov test is not appropriate here, because it assumes that each sample consists of independent random draws, whereas what you have are time series in which time is clearly a significant factor.
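One rough way to see that the independence assumption is violated is to look at the lag-1 autocorrelation, which should be near zero for independent draws. The sketch below uses a synthetic series shaped loosely like yours (flat baseline, slow rise, quick fall); the numbers are made up for illustration and are not your data:

```python
import numpy as np
import pandas as pd

# Synthetic series (NOT the asker's data): flat baseline, slow rise, fast fall.
series = pd.Series(np.concatenate([
    np.full(20, 0.62),
    np.linspace(0.62, 1.8, 30),
    np.linspace(1.8, 0.62, 6),
    np.full(64, 0.62),
]))

# Lag-1 autocorrelation: correlation between each value and the next one.
lag1 = series.autocorr(lag=1)

# Shuffling destroys the time ordering but keeps the same set of values.
rng = np.random.default_rng(0)
shuffled = pd.Series(rng.permutation(series.to_numpy()))
lag1_shuffled = shuffled.autocorr(lag=1)

print(f"lag-1 autocorrelation, original order: {lag1:.2f}")
print(f"lag-1 autocorrelation, shuffled:       {lag1_shuffled:.2f}")
```

A value close to 1 in the original order means consecutive observations carry almost the same information, so the effective sample size is much smaller than 120 and the p-value from `ks_2samp` cannot be taken at face value.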
