Different results of SPSS and Python KS-test to assess normality

Suppose that I have a series of data:

age;height
8;120
8;123
8;130
8;125
10;160
9;158
8;120
7;126
6;98
5;97
7;115
7;120
7;118
8;117
6;97
6;99
9;123
10;157
10;155
9;155
9;153
5;96
7;115
6;94
6;94
5;87
8;117
6;96
5;97
6;91
6;88
9;149
6;94
8;117
10;156
10;160
6;90
6;90
7;116
5;89
6;90
7;118
10;162

And I would like to assess the normality using Kolmogorov-Smirnov using both SPSS and Python. SPSS yielded a result of:

variable   statistic   sig.
age        0.190       0.000
height     0.173       0.002

I tried to compare using Python with this code:

import numpy as np
import pandas as pd
from scipy.stats import kstest
from scipy.stats import norm

data = pd.DataFrame([[8, 120], [8, 123], [8, 130], [8, 125], [10, 160], [9, 158], [8, 120], [7, 126], [6, 98], [5, 97], [7, 115], [7, 120], [7, 118], [8, 117], [6, 97], [6, 99], [9, 123], [10, 157], [10, 155], [9, 155], [9, 153], [5, 96], [7, 115], [6, 94], [6, 94], [5, 87], [8, 117], [6, 96], [5, 97], [6, 91], [6, 88], [9, 149], [6, 94], [8, 117], [10, 156], [10, 160], [6, 90], [6, 90], [7, 116], [5, 89], [6, 90], [7, 118], [10, 162]], columns=['age', 'height'])
x = np.log(data.age)
n = norm(loc=0, scale=1)  # standard normal
kstest(x, n.cdf)

which gives:

KstestResult(statistic=0.9462396895483368, pvalue=5.139087762288979e-55)

Even if I don't log-transform the data, the result is still different:

kstest(data.age, n.cdf)

which gives:

KstestResult(statistic=0.9999997133484281, pvalue=9.27397852188504e-282)

There is 1 answer

Answer by Matt Haberland:

The SciPy calculation is correct given your input: the KS test statistic is the maximum difference between the empirical CDF of the data and the CDF you provided, evaluated at the data points.

import numpy as np
from scipy import stats

# reference distribution: the standard normal, as in the question
dist = stats.norm(loc=0, scale=1)

data = np.asarray([[8, 120], [8, 123], [8, 130], [8, 125], [10, 160], [9, 158], [8, 120], [7, 126], [6, 98], [5, 97], [7, 115], [7, 120], [7, 118], [8, 117], [6, 97], [6, 99], [9, 123], [10, 157], [10, 155], [9, 155], [9, 153], [5, 96], [7, 115], [6, 94], [6, 94], [5, 87], [8, 117], [6, 96], [5, 97], [6, 91], [6, 88], [9, 149], [6, 94], [8, 117], [10, 156], [10, 160], [6, 90], [6, 90], [7, 116], [5, 89], [6, 90], [7, 118], [10, 162]])
logage = np.log(data[:, 0])

# reproduce the D- part of the KS statistic by hand
x = np.sort(logage)
cdfvals = dist.cdf(x)                      # reference CDF at the sorted data
n = len(cdfvals)
dminus = cdfvals - np.arange(0.0, n) / n   # gap between reference CDF and empirical CDF from below
dminus.max()  # 0.9462396895483368, the statistic reported by kstest
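
For completeness, the two-sided statistic that kstest reports is the larger of D+ and D-; a minimal sketch continuing from the arrays above:

# D+ is the largest gap in the other direction; the two-sided KS statistic
# is the larger of D+ and D-
dplus = np.arange(1.0, n + 1) / n - cdfvals
max(dplus.max(), dminus.max())  # 0.9462396895483368, matching kstest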

The SPSS code is not provided, so I cannot assess the reason for the discrepancy. Perhaps in SPSS you are not testing the null hypothesis that the data follow the standard normal distribution, which they clearly do not. Instead, SPSS may be performing Lilliefors' test, which uses the KS statistic to test the null hypothesis that the data follow a normal distribution whose loc and scale parameters are treated as unknown and estimated from the data.

# goodness_of_fit treats loc and scale as unknown (fit to the data) and
# calibrates the null distribution of the KS statistic by Monte Carlo simulation
res = stats.goodness_of_fit(stats.norm, logage, statistic='ks')
res.statistic  # 0.1821555634826541
res.pvalue     # 0.001
# p-value computed using Monte Carlo simulation, so results may vary.
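
If the goal is to reproduce the SPSS table more directly, note that SPSS's standard Tests of Normality output applies the Lilliefors significance correction to the untransformed variables; assuming that is what was run, a rough sketch using statsmodels (if it is installed), applied to the columns of the data array defined above, would be:

from statsmodels.stats.diagnostic import lilliefors

# Lilliefors test: KS statistic against a normal distribution whose
# mean and standard deviation are estimated from the sample
stat_age, p_age = lilliefors(data[:, 0], dist='norm')        # untransformed age
stat_height, p_height = lilliefors(data[:, 1], dist='norm')  # untransformed height
print(stat_age, p_age)
print(stat_height, p_height)

This should be closer in spirit to the SPSS output than a test against the standard normal, though the exact p-values may still differ depending on how each package approximates them.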

If you want to perform such a test, there are many more powerful options available. Consider the Shapiro-Wilk test.
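
For example, a minimal sketch with SciPy's shapiro, again applied to the logage array defined above:

from scipy import stats

# Shapiro-Wilk tests the null hypothesis that the sample was drawn from
# a normal distribution with unspecified mean and variance
res = stats.shapiro(logage)
print(res.statistic, res.pvalue)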