Fit the gamma distribution only to a subset of the samples


I have the histogram of my input data (in black) given in the following graph:

histogram

I'm trying to fit a Gamma distribution, not to the whole data set but only to the first peak of the histogram (the first mode). The green curve in the graph above shows the result of fitting the Gamma distribution to all the samples, using the following Python code based on scipy.stats.gamma:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma
from geotiff.io import IO

# read image
img = IO.read(input_file)
data = img.flatten() + abs(np.min(img)) + 1

# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1

# data histogram
n, bins, patches = plt.hist(data, 1000, density=True)

# slice histogram here

# estimation of the parameters of the gamma distribution
fit_alpha, fit_loc, fit_beta = gamma.fit(data, floc=0)
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, fit_loc, fit_beta)
print('(alpha, beta): (%f, %f)' % (fit_alpha, fit_beta))

# plot estimated model
plt.plot(x, y, linewidth=2, color='g')
plt.show()

How can I restrict the fitting only to the interesting subset of this data?

Update 1 (slicing):

I sliced the input data, keeping only the values below the location of the histogram's peak, but the results were not really convincing:

histogram2

This was achieved by inserting the following code below the # slice histogram here comment in the previous code:

max_data = bins[np.argmax(n)]
data = data[data < max_data]

Update 2 (scipy.optimize.minimize):

The code below shows how scipy.optimize.minimize() is used to minimize an energy function to find (alpha, beta):

import matplotlib.pyplot as plt
import numpy as np
from geotiff.io import IO
from scipy.stats import gamma
from scipy.optimize import minimize


def truncated_gamma(x, max_data, alpha, beta):
    gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
    norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    return np.where(x < max_data, gammapdf / norm, 0)


# read image
img = IO.read(input_file)

# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1

# data histogram
n, bins = np.histogram(data, 100, density=True)

# using minimize on a slice data below max of histogram
max_data = bins[np.argmax(n)]
data = data[data < max_data]

data = np.random.choice(data, 1000)
energy = lambda p: -np.sum(np.log(truncated_gamma(data, max_data, *p)))
initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x

# plot data histogram and model
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, 0, fit_beta)
plt.hist(data, 30, density=True)
plt.plot(x, y, linewidth=2, color='g')
plt.show()

The algorithm above converged on the random subset of the data, and the output in o was:

x: array([ 16.66912781,   6.88105559])

But as the screenshot below shows, the fitted gamma curve doesn't match the histogram:

minimize


There are 2 answers

Best answer, by AudioBubble:

Here's another possible approach, using a dataset created manually in Excel that more or less matches the plot given.

Raw Data

[raw data screenshots]

Outline

  • Import the data into a Pandas dataframe.
  • Mask the indices after the max response index.
  • Create a mirror image of the remaining data.
  • Append the mirror image while leaving a buffer of empty space.
  • Fit the desired distribution to the modified data. Below I do a normal fit by the method of moments and adjust the amplitude and width.
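
In symbols (my own summary of what the script below computes), with y_i the response values and i the index positions:

    \bar{i} = \frac{1}{N}\sum_{i=0}^{N-1} i,
    \qquad
    \hat{\sigma} = \sqrt{\frac{\sum_i (i - \bar{i})^2 \, y_i}{\sum_i y_i}},
    \qquad
    f(i) = A \exp\!\left(-\frac{(i - \bar{i})^2}{2\hat{\sigma}^2}\right)

where A and \hat{\sigma} are then adjusted by hand (amplitude to 120% of the maximum y, width to 70% of the moment estimate), as in the script.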

Working Script

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Import data to dataframe.
    df = pd.read_csv('sample.csv', header=0, index_col=0)
    # Mask indices after the index at max Y.
    mask = df.index.values <= df.Y.idxmax()
    df = df.loc[mask, :]
    scaled_y = 100*df.Y.values

    # Create new df with mirror image of Y appended.
    sep = 6
    app_zeroes = np.append(scaled_y, np.zeros(sep, dtype=float))
    mir_y = np.flipud(scaled_y)
    new_y = np.append(app_zeroes, mir_y)

    # Using Scipy-cookbook to fit a normal by method of moments.
    idxs = np.arange(new_y.size)  # idxs = [0, 1, 2, ..., len(new_y) - 1]
    mid_idxs = idxs.mean()  # midpoint of the index range
    # idxs - mid_idxs are the offsets of each position from that midpoint
    scaling_param = np.sqrt(np.abs(np.sum((idxs-mid_idxs)**2*new_y)/np.sum(new_y)))

    # adjust amplitude
    fmax = new_y.max()*1.2 # adjusted function max to 120% max y.
    # adjust width
    scaling_param = scaling_param * 0.7  # scaled to 70% of the moment estimate.
    # Fit normal.
    fit = lambda t: fmax*np.exp(-(t-mid_idxs)**2/(2*scaling_param**2))

    # Plot results.
    plt.plot(new_y, '.')
    plt.plot(fit(idxs), '--')
    plt.show()

Result

[plot of the mirrored data with the fitted curve]

See the scipy-cookbook fitting data page for more on fitting a normal using method of moments.

Answer by periphreal:

You can use a general optimization tool such as scipy.optimize.minimize to fit a truncated version of the desired function, resulting in a nice fit: Truncated fit

First, the modified function:

def truncated_gamma(x, alpha, beta):
    # max_data (the truncation point) comes from the enclosing scope
    gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
    norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    return np.where(x < max_data, gammapdf / norm, 0)

This selects values from the gamma distribution where x < max_data, and zero elsewhere. The np.where part is not actually important here, because the data is exclusively to the left of max_data anyway. The key is normalization, because varying alpha and beta will change the area to the left of the truncation point in the original gamma.
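
If you want to convince yourself the normalization is right, a quick numerical sanity check (the parameter values below are arbitrary; max_data is the truncation point defined earlier) could look like this:

from scipy.integrate import quad

# The truncated density should integrate to ~1 over [0, max_data]
# for any positive alpha and beta (arbitrary test values here).
area, err = quad(lambda t: truncated_gamma(t, 2.0, 3.0), 0, max_data)
print(area)  # expect a value very close to 1.0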

The rest is just optimization technicalities.

It's common practice to work with logarithms, so I used what's sometimes called the "energy": the negative logarithm of the probability density.
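
In symbols (my notation), with x_j the data samples and f_trunc the truncated density above, the quantity being minimized is the negative log-likelihood:

E(\alpha, \beta) = -\sum_j \log f_{\mathrm{trunc}}(x_j;\, \alpha, \beta)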

energy = lambda p: -np.sum(np.log(truncated_gamma(data, *p)))

Minimize:

initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x

My output is (alpha, beta): (11.595208, 824.712481). Like the original gamma.fit, this is a maximum-likelihood estimate.

If you're not happy with the convergence rate, you may want to

  1. Select a sample from your rather big dataset: data = np.random.choice(data, 10000)

  2. Try different algorithms using the method keyword argument.

Some optimization routines output a representation of the inverse Hessian, which is useful for uncertainty estimation. Enforcing nonnegativity of the parameters may also be a good idea; a sketch of both is given below.
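
For example, a variant that keeps both parameters positive via bounds and pulls a rough uncertainty estimate from the inverse Hessian could look like this (a sketch, not part of the original fit; L-BFGS-B is just one bounded method):

# Bounded optimization: keep alpha and beta strictly positive.
o = minimize(energy, initial_guess, method='L-BFGS-B',
             bounds=[(1e-6, None), (1e-6, None)])
fit_alpha, fit_beta = o.x

# L-BFGS-B exposes the inverse Hessian as a linear operator; its diagonal
# gives an approximate covariance for the maximum-likelihood parameters.
cov = o.hess_inv.todense()
print(np.sqrt(np.diag(cov)))  # rough standard errors for (alpha, beta)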

A log-scaled plot without truncation shows the entire distribution:

Log-scaled fit
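
For reference, a plot along those lines could be produced roughly like this (variable names follow the question's Update 2; full_data is my own placeholder for the unsliced data array):

# Full histogram on a log scale with the fitted (untruncated) gamma overlaid.
x = np.linspace(0, 100, 500)
plt.hist(full_data, 100, density=True)
plt.plot(x, gamma.pdf(x, fit_alpha, 0, fit_beta), linewidth=2, color='g')
plt.yscale('log')
plt.show()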