How to modify the Kolmogorov-Smirnov test?


I won't go into much detail, but here is the problem. I have the following data: the words from a big text and the frequency of each word. The main problem is to check (rigorously) whether the frequencies follow Zipf's law.

First of all, when I plot the data in log(rank)-log(probability) coordinates, I get a line with slope -1, which is good. I then run the test: kstest(prob, stats.powerlaw.cdf, args=(0.01,)), which outputs statistic=0.9313565666905572, pvalue=0.0. This looked strange, so I decided to plot the empirical CDF of my data along with the empirical CDF of a randomly generated sample from the powerlaw distribution:

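import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

# Empirical CDF of the observed word probabilities
ecdf_data = sm.distributions.ECDF(prob)
x = np.linspace(min(prob), max(prob))
y = ecdf_data(x)
plt.step(x, y)

# Empirical CDF of a random sample from the powerlaw distribution
powerlaw_sample = stats.powerlaw.rvs(0.01, size=34000)
ecdf_powerlaw = sm.distributions.ECDF(powerlaw_sample)
x_p = np.linspace(min(powerlaw_sample), max(powerlaw_sample))
y_p = ecdf_powerlaw(x_p)
plt.step(x_p, y_p)
plt.show()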

And it looks like they are pretty close:

[Figure: CDF of my data and CDF of the powerlaw sample]

I think that the large value statistic=0.9313565666905572 comes from the points around zero, so my question is: can I just truncate the range that is used for the KS statistic? Does that make any sense when it comes to rejecting or not rejecting the null hypothesis?
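To make the question concrete, here is a minimal sketch of what I mean by truncating. It keeps only the observations inside an interval and compares them with the reference distribution conditioned on that same interval, so the reference CDF is rescaled to run from 0 to 1 on it. The helper name truncated_ks, the cutoffs lo and hi, and the synthetic sample standing in for my data are all hypothetical and only for illustration. As far as I understand, the resulting p-value could only be taken at face value if the cutoffs and the shape parameter were fixed in advance rather than chosen from the same data.

```python
import numpy as np
from scipy import stats


def truncated_ks(sample, dist, lo, hi):
    """KS test restricted to [lo, hi] (hypothetical helper).

    Observations outside [lo, hi] are dropped, and the reference CDF is
    replaced by the CDF of `dist` conditioned on lying in [lo, hi].
    """
    sample = np.asarray(sample)
    kept = sample[(sample >= lo) & (sample <= hi)]
    f_lo, f_hi = dist.cdf(lo), dist.cdf(hi)

    def conditional_cdf(x):
        # CDF of dist given lo <= X <= hi, clipped to [0, 1] for safety
        return np.clip((dist.cdf(x) - f_lo) / (f_hi - f_lo), 0.0, 1.0)

    return kept, stats.kstest(kept, conditional_cdf)


# Synthetic stand-in for the real data: a sample from the same powerlaw
# distribution used in the question (shape parameter 0.01).
rng = np.random.default_rng(0)
dist = stats.powerlaw(0.01)
sample = dist.rvs(size=5000, random_state=rng)

# Illustrative cutoffs only; they are assumptions, not recommendations.
kept, result = truncated_ks(sample, dist, lo=1e-3, hi=1.0)
print(len(kept), result.statistic, result.pvalue)
```

With my real data, sample would be prob, and lo would be whatever lower cutoff is chosen to exclude the points around zero.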
