MLE for censored distributions of the exponential family


With the scipy.stats package it is straightforward to fit a distribution to data; e.g. scipy.stats.expon.fit() can be used to fit an exponential distribution to data.
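For reference, a plain (uncensored) fit looks like this; the synthetic sample here is just for illustration:

from scipy.stats import expon

# draw a synthetic exponential sample and recover (loc, scale) via the built-in MLE
sample = expon.rvs(loc=0, scale=4, size=1000, random_state=0)
loc_hat, scale_hat = expon.fit(sample)
print(loc_hat, scale_hat)  # estimates should be close to (0, 4)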

However, I am trying to fit data to a censored/conditional distribution in the exponential family. In other words, using MLE, I am trying to find the maximum of

$$\ell(\theta) = \sum_{i=1}^{n} \left[ \log f(x_i; \theta) - \log F(T; \theta) \right],$$

where $f(\,\cdot\,;\theta)$ is a PDF of a distribution in the exponential family, $F(\,\cdot\,;\theta)$ is its corresponding CDF, and $T$ is the censoring point (max_delays in the code below).

Mathematically, I have found that the log-likelihood function is convex in the parameter space $\Theta$, so my assumption was that it should be relatively straightforward to apply the scipy.optimize.minimize function. Notice in the above log-likelihood that by taking $T \to \infty$ we get $F(T; \theta) \to 1$, so its log vanishes and we recover the traditional/uncensored MLE problem.
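For concreteness, specializing to the exponential distribution with loc fixed at $0$ and rate $\lambda = 1/\text{scale}$ (my notation for this example), the censored log-likelihood becomes

$$\ell(\lambda) = n \log \lambda - \lambda \sum_{i=1}^{n} x_i - n \log\left(1 - e^{-\lambda T}\right),$$

since $f(x; \lambda) = \lambda e^{-\lambda x}$ and $F(T; \lambda) = 1 - e^{-\lambda T}$.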

However, I find that even for simple distributions, the Nelder-Mead simplex algorithm, for example, does not always converge, or it converges but the estimated parameters are far off from the true ones. I have attached my code below. Notice that one can choose a distribution, and that the code is generic enough to fit the loc and scale parameters, as well as the optional shape parameters (e.g. for a Beta or Gamma distribution).

My question is: what am I doing wrong that I obtain these bad estimates or, sometimes, convergence issues? I have tried a few algorithms, but none of them works out of the box, which surprises me since the problem is convex. Are there smoothness issues, meaning that I need a generic way to supply the Jacobian and Hessian for this problem?

Are there other methods to tackle this problem? Initially I thought of overriding the fit() method of the scipy.stats.rv_continuous class to take care of the censoring with the CDF, but that seemed quite cumbersome. Since the problem is convex, I would guess that scipy's minimize function should give me the results easily...

Comments and help are very welcome!

import numpy as np
from scipy.stats import expon, gamma, beta, norm
from scipy.optimize import minimize
from scipy.stats import rv_continuous as rv


def log_likelihood(func: rv, delays, max_delays=10**8, **func_pars) -> float:
    # Conditional log-likelihood: sum of log f(x_i) - log F(T) over all observations
    return np.sum(np.log(func.pdf(delays, **func_pars)) - np.log(func.cdf(max_delays, **func_pars)))


def minimize_log_likelihood(func: rv, delays, max_delays):

    # Determine number of parameters to estimate (always 'loc', 'scale', sometimes shape parameters)
    n_pars = 2 + func.numargs

    # Initialize guess (loc, scale, [shapes])
    x0 = np.ones(n_pars)

    def wrapper(params, *args):
        # Unpack the fixed arguments passed via minimize(..., args=...)
        func, delays, max_delays = args
        loc, scale = params[0], params[1]

        # Set 'loc' and 'scale' parameters
        kwargs = {'loc': loc, 'scale': scale}

        # Add shape parameters if existing to kwargs
        if func.shapes is not None:
            for i, s in enumerate(func.shapes.split(', ')):
                kwargs[s] = params[2+i]
        
        return -log_likelihood(func=func, delays=delays, max_delays=max_delays, **kwargs)

    # Find maximum of log-likelihood (thus minimum of minus log-likelihood; see wrapper function)
    return minimize(wrapper, x0, args=(func, delays, max_delays), options={'disp': True}, 
                    method='nelder-mead', tol=1e-8)




# Test: generate a grid of points spanning a known distribution, then try to recover its parameters
distribution = expon
dist_pars = {'loc': 0, 'scale': 4}
x = np.linspace(distribution.ppf(0.0001, **dist_pars), distribution.ppf(0.9999, **dist_pars), 1000)

res = minimize_log_likelihood(distribution, x, 10**8)
print(res)
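Note that x above is a deterministic grid between two quantiles rather than a random draw; one could also run the same check on an actual random sample (illustrative variant):

# variant: fit on a random sample instead of the quantile grid
sample = distribution.rvs(size=1000, random_state=0, **dist_pars)
res_sample = minimize_log_likelihood(distribution, sample, 10**8)
print(res_sample)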

1 Answer

Answered by flow_me_over:

I have found that the convergence is bad due to numerical inaccuracies. It is best to replace

np.log(func.pdf(x, **func_kwargs))

with

func.logpdf(x, **func_kwargs)

This leads to correct estimation of the parameters. The same holds for the CDF: use func.logcdf instead of np.log(func.cdf(...)). The scipy documentation also indicates that the log-variants are numerically more accurate.
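Putting both replacements into the helper function from the question, the corrected version would look like:

def log_likelihood(func: rv, delays, max_delays=10**8, **func_pars) -> float:
    # logpdf/logcdf evaluate the logarithm directly, avoiding the underflow
    # that occurs when pdf/cdf are computed first and then logged
    return np.sum(func.logpdf(delays, **func_pars) - func.logcdf(max_delays, **func_pars))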

This all works nicely with the Exponential, Normal, Gamma, and chi2 distributions. The Beta distribution still gives me issues, but I think this is again due to some (other) numerical inaccuracies, which I will analyse separately.