Mismatch between the best-fit distribution from SciPy stats (KS test) and the histogram of the dataset


I have a dataset like this:

y = array([ 25.,  20.,  10.,  31.,  30.,  66.,  13.,   5.,   9.,   2.,   4.,
     9.,   6.,  26.,  72.,   7.,   5.,  18.,   8.,  12.,   4.,   7.,
   114.,   5.,   6.,  17.,  39.,   4.,   5.,  42.,  63.,   3.,   6.,
    16.,  17.,   4.,  27.,  18.,   3.,   7.,  48.,  24.,  72.,  21.,
    12.,  13., 106., 120.,   5.,  34.,  52.,  22.,   2.,   8.,   9.,
     5.,  35.,   4.,   4.,   1.,  56.,   1.,  17.,  34.,   3.,   5.,
    17.,  17.,  10.,  48.,   9., 195.,  20.,  60.,   5.,  77., 114.,
    59.,   1.,   1.,   1.,  67.,   9.,   4.,   1.,  13.,   6.,  46.,
    40.,   8.,   6.,   1.,   2.,   1.,   1.,   1.,   7.,   6.,  53.,
     6.,   3.,   4.,   2.,   1.,   1.,   5.,   1.,   5.,   1.,   7.,
     1.,   1.])

The histogram of this data is produced as follows:

import numpy as np
import matplotlib.pyplot as plt

number_of_bins = len(y)
bin_cutoffs = np.linspace(np.percentile(y, 0), np.percentile(y, 99), number_of_bins)
h = plt.hist(y, bins=bin_cutoffs, color='red')

[image: histogram of y]

I tested the dataset with the Kolmogorov-Smirnov test from scipy.stats to find the best-fitting distribution and its parameters, using the following code (taken from "How to find probability distribution and parameters for real data? (Python 3)"):

import scipy.stats as st

def get_best_distribution(data):
    dist_names = ["norm", "exponweib", "weibull_max", "weibull_min", "expon", "pareto", "genextreme", "gamma", "beta"]
    dist_results = []
    params = {}
    for dist_name in dist_names:
        dist = getattr(st, dist_name)
        # Fit the distribution's parameters to the data
        param = dist.fit(data)

        params[dist_name] = param
        # Applying the Kolmogorov-Smirnov test
        D, p = st.kstest(data, dist_name, args=param)
        print("p value for " + dist_name + " = " + str(p))
        dist_results.append((dist_name, p))

    # select the best fitted distribution (largest p value)
    # and store its name and p value
    best_dist, best_p = max(dist_results, key=lambda item: item[1])

    print("Best fitting distribution: " + str(best_dist))
    print("Best p value: " + str(best_p))
    print("Parameters for the best fit: " + str(params[best_dist]))
    return best_dist, best_p, params[best_dist]

The function returns genextreme as the best fit, together with its p value and the fitted parameters (shape, loc, scale):

('genextreme',
0.1823402997669471,
(-1.119997717132149, 5.036499415233003, 6.2122664378291175))

The curve fitted with these parameters is as follows:

[image: fitted genextreme curve]
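For readers who cannot see the image, a curve like this could be reproduced roughly as sketched below. This is only an illustration, not necessarily how the original plot was made: the helper names `fitted_pdf_curve` and `plot_fit`, the grid size, and the choice of 30 bins are assumptions. The parameters are the ones reported above, and `density=True` is used so that the histogram and the pdf share the same vertical scale.

```python
import numpy as np
import scipy.stats as st

# Parameters reported in the question for genextreme: (shape c, loc, scale).
GENEXTREME_PARAMS = (-1.119997717132149, 5.036499415233003, 6.2122664378291175)


def fitted_pdf_curve(params, lo, hi, n_points=200):
    """Evaluate the genextreme pdf with the given parameters on an even grid.

    Hypothetical helper; n_points=200 is an arbitrary choice.
    """
    x = np.linspace(lo, hi, n_points)
    pdf = st.genextreme.pdf(x, *params)
    return x, pdf


def plot_fit(y, params=GENEXTREME_PARAMS, n_bins=30):
    """Overlay the fitted pdf on a density-normalised histogram of y.

    Hypothetical helper; n_bins=30 is an assumption, not the binning
    used in the question.
    """
    # Imported here so that fitted_pdf_curve needs only NumPy and SciPy.
    import matplotlib.pyplot as plt

    y = np.asarray(y, dtype=float)
    x, pdf = fitted_pdf_curve(params, y.min(), y.max())
    # density=True rescales the bars so the histogram integrates to 1,
    # which puts it on the same vertical scale as a pdf.
    plt.hist(y, bins=n_bins, density=True, color="red", alpha=0.5)
    plt.plot(x, pdf, color="black")
    plt.xlabel("y")
    plt.ylabel("density")
    plt.show()
```

Calling `plot_fit(y)` with the array from the top of the question would draw both on one set of axes.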

From my understanding, the histogram suggests an exponential distribution, but the KS test selects a different one. Can anyone explain why this happens, or whether something is wrong with my approach?
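To make the comparison concrete, the two candidates could be checked side by side, as in the sketch below. The helper name `compare_candidates` and the use of the log-likelihood as a second yardstick are assumptions added for illustration; they are not part of the code above. Note also that the p value from `kstest` is computed as if the parameters were fixed in advance, so when they are fitted to the same data it tends to be optimistic and is probably better read as a rough ranking than as a formal test.

```python
import numpy as np
import scipy.stats as st


def compare_candidates(data, dist_names=("expon", "genextreme")):
    """Fit each candidate and report KS statistic, KS p value and log-likelihood.

    Hypothetical helper for a side-by-side look at two fits.
    """
    data = np.asarray(data, dtype=float)
    results = {}
    for name in dist_names:
        dist = getattr(st, name)
        # Maximum-likelihood fit of all parameters (shape(s), loc, scale).
        params = dist.fit(data)
        # KS test against the distribution with the fitted parameters.
        D, p = st.kstest(data, name, args=params)
        # Total log-likelihood of the data under the fitted distribution.
        loglik = np.sum(dist.logpdf(data, *params))
        results[name] = {"params": params, "D": D, "p": p, "loglik": loglik}
    return results
```

For example, `compare_candidates(y)` returns one entry per distribution, which can be printed to see how close the exponential fit is to the genextreme fit on each measure.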
