Matlab: Chi-squared fit (chi2gof) to test if data is exponentially distributed


I guess this is a simple question, but I can't sort it out. I have a vector, the first elements of which look like:

V = [31 52 38 29 29 34 29 24 25 25 32 28 24 28 29 ...];

and I want to perform a chi2gof test in Matlab to test if V is exponentially distributed. I did:

[h,p] = chi2gof(V,'cdf',@expcdf);

but I get a warning message saying:

Warning: After pooling, some bins still have low expected counts.
The chi-square approximation may not be accurate

Have I defined the chi2gof call incorrectly?

There is 1 answer

Answer by horchler (best answer)

At 36 values, you have a very small sample set. From the second sentence of Wikipedia's article on the chi-squared test (emphasis added):

It is suitable for unpaired data from large samples.

"Large" in this case usually means at least about 100. The test has further assumptions that are worth reading about as well.


Alternatives

You might try kstest in Matlab, which is based on the Kolmogorov-Smirnov test:

[h,p] = kstest(V,'cdf',[V(:) expcdf(V(:),expfit(V))])

Or try lillietest, which is based on the Lilliefors test and has an option specifically for exponentially distributed data:

[h,p] = lillietest(V,'Distribution','exp')

If you can increase your sample size, note that you are doing one thing wrong with chi2gof. From the help for the 'cdf' option:

A fully specified cumulative distribution function. This can be a ProbabilityDistribution object, a function handle, or a function name. The function must take X values as its only argument. Alternately, you may provide a cell array whose first element is a function name or handle, and whose later elements are parameter values, one per cell. The function must take X values as its first argument, and other parameters as later arguments.

You're not supplying any additional parameters, so expcdf is using the default mean parameter of mu = 1. Your data values are very large and don't correspond at all to an exponential distribution with this mean. You need to estimate the parameter as well. Using the expfit function, which is based on maximum likelihood estimation, you might try something like this:

% Fit the mean to the data V, not to the x that chi2gof passes to the handle
[h,p] = chi2gof(V,'cdf',@(x)expcdf(x,expfit(V)),'nparams',1)

However, with only 36 samples you may not get a very good estimate for a distribution like this and still may not get expected results even for data sampled from a known distribution, e.g.:

V = exprnd(10,1,36);
[h,p] = chi2gof(V,'cdf',@(x)expcdf(x,expfit(V)),'nparams',1)  % mean fitted to the data V
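
For comparison, here is a minimal sketch of the same test on a larger synthetic sample, using the cell-array form of the 'cdf' option described in the help text quoted above. The sample size of 500, the seed, and the variable names mu and stats are assumptions chosen for illustration. They are not taken from the question. The mean is estimated once from the data and passed to expcdf as a fixed parameter.

```matlab
rng(1);                  % arbitrary seed so the run is repeatable
V = exprnd(10,1,500);    % synthetic exponential data with mean 10; 500 is an assumed "large" sample

mu = expfit(V);          % maximum likelihood estimate of the mean, computed from the data

% Cell-array form: chi2gof calls expcdf(x,mu) wherever it needs the CDF.
% 'NParams',1 removes one degree of freedom for the estimated mean.
[h,p,stats] = chi2gof(V,'CDF',{@expcdf,mu},'NParams',1);

% Observed and expected counts per bin after pooling
disp([stats.O(:) stats.E(:)]);
```

Checking the expected counts in stats.E is a quick way to see whether the low-count warning from the question is still likely to apply to a given sample.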