It has been a while since I last tackled any non-linear function fitting (optimization) problems. Previously I was familiar with them in the context of C or C++ libraries. The algorithm I am most familiar with is the Levenberg-Marquardt algorithm, which is a damped least-squares method that combines gradient descent with Gauss-Newton steps. Implementations typically provide the ability either to specify function derivatives analytically, or to calculate them numerically.
I am now working on an optimization problem using Python. It seems that Python provides support for optimization with the Levenberg-Marquardt algorithm via scipy.optimize.least_squares (with method='lm'). The way the interface is formulated, this function expects to receive an array of residuals, where a residual is (data - model). It appears to minimize the sum of the squared residuals, multiplied by a factor of 1/2.
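For reference, a minimal example of the interface looks something like this (the exponential-decay model and noisy data below are made up purely for illustration):

```python
import numpy as np
from scipy.optimize import least_squares

# Illustrative only: an exponential-decay model y = a * exp(-b * t)
# fitted to made-up noisy data.
rng = np.random.default_rng(0)
t_data = np.linspace(0.0, 5.0, 50)
y_data = 2.5 * np.exp(-1.3 * t_data) + 0.05 * rng.normal(size=t_data.size)

def residuals(params, t, y):
    a, b = params
    return y - a * np.exp(-b * t)   # residual = data - model

# method='lm' selects the MINPACK Levenberg-Marquardt implementation;
# the default ('trf') is a trust-region method that also handles bounds.
fit = least_squares(residuals, x0=[1.0, 1.0], args=(t_data, y_data), method='lm')
print(fit.x)   # fitted (a, b)
```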
Unless I have misunderstood something, this makes the API quite specific to least-squares fitting problems, and it cannot be used in a more general way, such as for finding maximum values rather than minimum values. (This is because the residuals are squared: even if the residuals consist of a single scalar value, we cannot find a maximum by minimizing the function multiplied by -1, which would be the usual approach, since the sign is lost when the residual is squared.)
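A toy illustration of the problem (the function and starting point here are arbitrary):

```python
import numpy as np
from scipy.optimize import least_squares

def f(x):
    # toy function with a clear maximum at x = 3
    return np.exp(-(x - 3.0) ** 2)

# Naive attempt to maximize f by feeding -f to least_squares. Because the
# "residual" is squared, minimizing 0.5 * (-f(x))**2 is exactly the same
# problem as minimizing 0.5 * f(x)**2, so the optimizer drives f towards 0
# rather than towards its maximum at x = 3.
res = least_squares(lambda p: [-f(p[0])], x0=[0.0])
print(res.x, f(res.x[0]))   # ends up where f is tiny, nowhere near x = 3
```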
This directs me towards two possible lines of thought:
- I have misunderstood the documentation, or more likely, not found the function/API I am looking for.
- The Levenberg-Marquardt algorithm is not exactly the right tool for the job. I have thought of LM as a general-purpose, robust, gradient-based optimizer, but it may be that something else is more appropriate, and that I have overlooked the reason why another gradient-based algorithm would be the better choice here.
For context, I am trying to optimize (maximize) a likelihood, or log-likelihood, function. I have some statistical data which I am trying to model with non-linear curve/model fitting methods.
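To make that concrete, the obvious workaround I can see is to minimize the negative log-likelihood with a general-purpose optimizer. A rough sketch, with a made-up Gaussian model standing in for my real one:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Made-up data and a Gaussian model, standing in for the real model.
rng = np.random.default_rng(1)
data = rng.normal(loc=4.0, scale=2.0, size=500)

def negative_log_likelihood(params, x):
    mu, sigma = params
    if sigma <= 0:
        return np.inf            # keep the optimizer out of invalid territory
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(negative_log_likelihood, x0=[0.0, 1.0],
                  args=(data,), method='Nelder-Mead')
print(result.x)   # maximum-likelihood estimates of (mu, sigma)
```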
My question is: does Python/SciPy provide an interface to a Levenberg-Marquardt implementation which can be used for such purposes? If not, then why not, and what should I look to use as an alternative?
I have a partial answer to this question, although it is not a concrete answer in the sense that I cannot be completely certain.
Most of this information comes from Numerical Recipes in C++.
Information on function minimization problems is contained in two chapters: there is a chapter on function minimization, and a chapter on statistical modelling which includes a section on the LM algorithm.
The first point to note:
The LM algorithm uses gradient descent far from the minimum to approach the region in which the minimum lies, then gradually transitions to behaving as a quadratic (Gauss-Newton) solver. This is appropriate in the vicinity of the minimum because, by Taylor expansion, the gradient is approximately linear and the function approximately quadratic there.
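To illustrate the damping idea, a single LM update can be sketched as follows (this is only a sketch of the textbook formula, not how scipy actually implements it):

```python
import numpy as np

# J: Jacobian of the residual vector r with respect to the parameters.
def lm_step(J, r, lam):
    JTJ = J.T @ J
    # Marquardt damping: large lam -> a short step roughly along the negative
    # gradient; lam -> 0 -> the pure Gauss-Newton (quadratic model) step.
    A = JTJ + lam * np.diag(np.diag(JTJ))
    delta = np.linalg.solve(A, -J.T @ r)
    return delta   # parameter update: p_new = p + delta
```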
Further, a least-squares problem typically has a single global minimum (this is guaranteed when the model is linear in its parameters), so there is little risk of accidentally converging to a local minimum.***
*** Note: I am not totally sure of this last point; if someone knows of a counter-example, please leave a comment. Having performed a number of least-squares minimization problems (a number of years ago), I do not recall ever encountering one where convergence to a local minimum instead of the global one was a practical concern.
In summary: LM is highly suited to the problem of least-squares. This does not explain why it may be less suited to a more general minimization problem.
Numerical Recipes does not include LMA in the section on general function optimization.
However, that section does note that greedy algorithms, such as gradient descent, have disadvantages in this area. For functions with local minima or maxima, gradient descent has a high chance of converging to a local extremum instead of the global one.
Quite why this is less of a concern for other algorithms, such as the downhill simplex (Nelder-Mead) method or simulated annealing, I do not fully understand.
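Whatever the reason, a common practical mitigation for greedy local optimizers is to restart them from many starting points. SciPy exposes one version of this idea as optimize.basinhopping (random perturbation plus repeated local minimization, loosely in the spirit of simulated annealing). A toy sketch, with a made-up objective:

```python
import numpy as np
from scipy.optimize import basinhopping

def objective(x):
    # toy multi-modal function, standing in for a real negative log-likelihood
    return np.sin(3.0 * x[0]) + 0.1 * x[0] ** 2

# Random perturbation + repeated local minimization from each perturbed point.
result = basinhopping(objective, x0=[2.0], niter=50)
print(result.x, result.fun)
```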
Another disadvantage of gradient descent is its tendency to take a zig-zag (or staircase) path towards the minimum, which is not the most efficient route. With an exact line search, each step only minimizes the function along the current downhill direction, and each new gradient is orthogonal to the previous step direction. In a long, narrow valley whose axis is not aligned with those directions, the iterates therefore bounce from side to side across the valley rather than heading straight for the minimum.
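A small demonstration of this zig-zag behaviour, using steepest descent with an exact line search on an arbitrary badly scaled quadratic:

```python
import numpy as np

# Steepest descent with an exact line search on f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 25.0])            # badly scaled (long, narrow valley)
x = np.array([5.0, 1.0])
for i in range(10):
    g = A @ x                       # gradient of f at x
    alpha = (g @ g) / (g @ A @ g)   # exact minimizer along -g
    x = x - alpha * g
    print(i, x)                     # the second component keeps changing sign
```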
Given the above information, which is admittedly incomplete, we might conclude that there are other algorithms besides LMA which, in practice, give better performance on general non-linear optimization problems.
It should be noted that, in the context of highly non-linear optimization problems, determining the "best" algorithm is as much an art as it is a science. It usually involves assessing several different algorithms as well as tuning their parameters.
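In that spirit, a quick way to start is to run scipy.optimize.minimize with several different method choices on the same objective and compare. The sketch below uses scipy's built-in Rosenbrock function as a stand-in for a real objective such as a negative log-likelihood:

```python
from scipy.optimize import minimize, rosen

# Compare a few general-purpose minimizers on the same objective.
x0 = [1.3, 0.7, 0.8, 1.9, 1.2]
for method in ("Nelder-Mead", "Powell", "BFGS", "L-BFGS-B"):
    result = minimize(rosen, x0, method=method)
    print(f"{method:12s}  f = {result.fun:.3e}  nfev = {result.nfev}")
```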