Not sure Kolmogorov Smirnov Test is working as it should


I changed the code to pass the Gaussian args following Sam Mason's comment. The results still look wrong, since I know from QQ-plots that the data is probably a decent Gaussian. I'm posting my updated code and attaching the data file. Perhaps it's obvious, but I don't see how the KS test gets it so wrong (or I do). The .csv data file can be found here: https://ln5.sync.com/dl/658503c20/5fek5x39-y8aqbkfu-tqptym98-nz75wikq

import pandas as pd
import numpy as np
from scipy import stats

alpha = 0.05
df = pd.read_csv("Z079_test_mc.csv")
with open('matrix.txt', 'a') as f:
    for col in df.columns:
        print([col])
        values = df[col].dropna().values  # 1-D array, NaNs removed
        a, b = stats.kstest(values, stats.norm.cdf, args=(np.mean(values), np.std(values)))
        print('Statistics', a, 'p-value', b)
        if b < alpha:
            print('The null hypothesis can be rejected' + '\n')
            f.write(str(col) + ',' + 'Kolmogorov Smirnov' + '\n' +
                    '        ' + ',' + str(a) + ',' + str(b) + ',' +
                    'The null hypothesis can be rejected' + '\n')
        else:
            print('The null hypothesis cannot be rejected')
            f.write(str(col) + ',' + 'Kolmogorov Smirnov' + '\n' +
                    '        ' + ',' + str(a) + ',' + str(b) + ',' +
                    'The null hypothesis cannot be rejected' + '\n')

There are 2 answers

Sam Mason

The parameters for a Gaussian distribution in SciPy are the location and scale (in stats speak, mu and sigma), so passing the min and max as args breaks things.

Probably the easiest fix is just args=stats.norm.fit(values); alternatively, you can compute the parameters manually via args=(np.mean(values), np.std(values)). As a more complete example:

import numpy as np
import scipy.stats as sps

# generate some values from something almost Gaussian
#   1 = Cauchy, +Inf = Gaussian
values = 1e9 + np.random.standard_t(10, size=1000) * 1e9

# perform test
sps.kstest(values, 'norm', sps.norm.fit(values))

or

# parameterize distribution
dist = sps.norm(*sps.norm.fit(values))

# perform test
sps.kstest(values, dist.cdf)
pjs

I don't know what's going on with Python's KS test beyond your initial use of min/max rather than location/scale as arguments. A quick web review suggests the Shapiro-Wilk test is preferred over KS for sample sizes below 50, which is what you have.
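As a sketch of what that alternative might look like in SciPy (this uses a synthetic sample rather than the question's data, and omits the per-column loop and file handling from the question's code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=30)  # small Gaussian sample

# Shapiro-Wilk is generally recommended over KS for small samples
stat, p = stats.shapiro(sample)
print(f"W = {stat:.4f}, p = {p:.4f}")
if p < 0.05:
    print("The null hypothesis (normality) can be rejected")
else:
    print("The null hypothesis cannot be rejected")
```

The W statistic is bounded by 1, with values near 1 indicating the sample is consistent with normality.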

I did a quick analysis in JMP, and have pasted the results below. I suspect your results are inconclusive due to the small sample sizes. My experience with distribution fitting for simulation models is that the results are often ambiguous unless you have sample sizes in the hundreds or even thousands. With sample sizes in the 20s-40s, each histogram bin only has a few observations in it. With that said, normality was not the top choice for any of your three columns of data. I've provided histograms with both the recommended best fit and the best fit normal superimposed, along with QQ plots and associated test statistics for recommended and normal.
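To illustrate that sample-size point with a hypothetical simulation (not the question's data): with clearly non-normal input, the same KS call rejects normality far more reliably at n = 1000 than at n = 30:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def rejection_rate(n, reps=500, alpha=0.05):
    """Fraction of runs in which the KS test rejects normality
    for exponential (clearly non-normal) samples of size n."""
    rejections = 0
    for _ in range(reps):
        x = rng.exponential(scale=1.0, size=n)
        _, p = stats.kstest(x, 'norm', args=(x.mean(), x.std()))
        if p < alpha:
            rejections += 1
    return rejections / reps

small = rejection_rate(30)
large = rejection_rate(1000)
print(f"n=30:   rejected in {small:.0%} of runs")
print(f"n=1000: rejected in {large:.0%} of runs")
```

Note also that estimating mu and sigma from the same sample being tested makes the standard KS p-values conservative (the Lilliefors issue), which further reduces power at small n.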

Despite inconclusive statistical tests on two of the three columns of data, I stand by what I said in comments -- the histograms do not look normal. The Z79V0001 data is heavy in the tails and has a huge dip near what should be the mode; the Z79V0003_1 data looks multimodal with big gaps; and the Z79V0003_2 data is clearly skewed right (plus it fails the Shapiro-Wilk test at the 0.05 level even with a very small sample size).

Without further ado, here are screenshots:

Z79V0001 results

Z79V0003_1 results

Z79V0003_2 results