Python: Fitting a polynomial to the natural log of data, then reversing the transformation back into non-log space?


I'm trying to create a smooth function to represent some data I am working with. Trouble is the data is pretty noisy, and simply using the least squares method to create a polynomial line of best fit does not come out that well.

polynomial = np.polyfit(df.x, df.y, 5)  # np.polyfit needs a degree; 5 here is just illustrative

Doing some digging, it seems that for this sort of thing some people first take the natural log of their data, which produces a nicer curve.

df['y'] = np.log(df['y'])
log_coeffs = np.polyfit(df.x, df.y, 5)  # again, the degree is illustrative

And lo and behold, it seems to work. When I turn this polynomial into a function and graph a bunch of points (by just taking e to the power of the output), it looks like the kind of thing I want:

log_polynomial = np.poly1d(log_coeffs)

x_values = np.arange(0, 100, .01)
plt.plot(
    x_values,
    np.exp(log_polynomial(x_values))
)

However, this requires me to transform the input data into log space, find a polynomial of best fit, turn that polynomial into a function, turn that function into a bunch of discrete points, and then make another polynomial of best fit around those points. Because this is something I am going to run a lot of times, that's a lot of extra steps.
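Spelled out, those extra steps after the plotting code above look roughly like this (the degree is just a placeholder):

# Sample the back-transformed curve at discrete points...
y_values = np.exp(log_polynomial(x_values))

# ...then fit yet another polynomial to those points in non-log space
non_log_polynomial = np.poly1d(np.polyfit(x_values, y_values, 5))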

Is there a way to take a polynomial that I got using log inputs and transform it back into non-log space without manually doing an exponent on all of the inputs to get discrete points then trying to create a line of best fit for those points in non-log space?


edit* here is some more background on my use case. I have a set of discrete points that form a curve. The second derivative of this curve contains important information that I want to extract cleanly, but if I take the second derivative without modification it ends up very noisy, over-interpreting small kinks in the curve as huge rough spikes in the second derivative.

I have tried various things to smooth out the curve, but all the other ones I've tried either failed to adequately address the noise or disrupted the inherent signal I am trying to isolate. Here's what I've tried so far:

Least squares line of best fit in non-log space

polynomial = np.polyfit(df.x, df.y, 5)  # degree again illustrative

1 dimensional gaussian filter

scipy.ndimage.gaussian_filter1d(df.y, 3)

Cubic spline

t, c, k = scipy.interpolate.splrep(df.x, df.y, k=3)

B-spline

spline = scipy.interpolate.BSpline(t, c, k, extrapolate=False)
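For reference, the second derivative of those spline fits can be evaluated directly with splev and der=2; a sketch, reusing the t, c, k returned by splrep above:

x_dense = np.linspace(df.x.min(), df.x.max(), 500)
y_smooth = scipy.interpolate.splev(x_dense, (t, c, k))        # smoothed curve
d2y_dx2 = scipy.interpolate.splev(x_dense, (t, c, k), der=2)  # its second derivative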

The reason I am trying to log-transform the data before fitting the polynomial is that I found some research papers that use this method for the exact use case I am working on, and when I test it on my data it seems to work well.


*edit2 As @NickODell astutely pointed out, a polynomial fitted in log space generally cannot be transformed back to non-log space and still be represented in NumPy's polynomial format (a list of coefficients for integer powers of x), because the exponential of a polynomial is not itself a polynomial. So turning the log-space polynomial directly back into a polynomial in non-log space is effectively impossible as far as I can tell.

What I really want, though, is the second derivative of my input data, free of noise and retaining all the signal. I was hopeful I might be able to get this by transforming the log-space polynomial directly back to non-log space and differentiating it directly, but apparently that's not going to be possible. However, the back-transformed curve can still be represented as a function that I can numerically differentiate using np.gradient, which would effectively allow the same thing. But I'm less confident this would actually significantly increase the efficiency of the program. I'll try messing around with some profiling and see if I can get a handle on how big a deal this is.
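Concretely, I mean something like this, reusing the exponentiated log-space fit from above and differencing it numerically:

y_values = np.exp(log_polynomial(x_values))
# Numerical second derivative of the back-transformed curve
second_deriv = np.gradient(np.gradient(y_values, x_values), x_values)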

*edit3 I want to thank @jlandercy and @Nick ODell for their clear and thorough answers. I wish I could select both of them as the answer.


There are 2 answers

jlandercy (accepted answer)

If I correctly understand your requirements, you want to:

  • Apply a log transform on the y axis;
  • Fit a polynomial to data with a lot of noise;
  • Predict values at new x points;
  • Take the second derivative of the fitted curve;
  • Automate the process, which must be performed multiple times.

Scikit-learn is well suited to this kind of workflow through its Pipelines.

First we create some noisy polynomial data:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

np.random.seed(12345)
x = np.linspace(-1, 1, 50).reshape(-1, 1)
P = np.poly1d([6, 5, 4, 3, 2, 10])
y = P(x[:, 0])
n = 0.5 * np.random.randn(y.size)
yn = y + n

Simple regression

As a baseline, we create a pipeline to linearly regress polynomial from your data:

pipeline_lin = Pipeline([
    ("transformer", PolynomialFeatures(5)),
    ("model", LinearRegression(fit_intercept=False))
])

And fit to dataset:

pipeline_lin.fit(x, yn)

pipeline_lin["model"].coef_
# array([10.0226118 ,  1.51830909,  2.23638955,  2.91665499,  5.99904944, 7.88614861])

Log Transform

If you wish to apply a logarithmic transformation to the target (the y axis; note this requires the target to be positive), the pipeline becomes:

pipeline_log = Pipeline([
    ("transformer", PolynomialFeatures(5)),
    ("model",
         TransformedTargetRegressor(
             regressor=LinearRegression(fit_intercept=False),
             func=np.log,
             inverse_func=np.exp
         )
    )
])
pipeline_log.fit(x, yn)

pipeline_log["model"].regressor_.coef_
# array([ 2.2901081 ,  0.15049063,  0.46685674,  0.32678866, -0.12522207, 0.34130133])

Naturally, the coefficients are not directly related to those of the original polynomial, since we applied the log transformation.

Now we can predict target values at new feature points without having to handle the forward and backward log transforms ourselves:

xlin = np.linspace(-1, 1, 200).reshape(-1, 1)
yh_lin = pipeline_lin.predict(xlin)
yh_log = pipeline_log.predict(xlin)

fig, axe = plt.subplots()
axe.scatter(x, yn, marker=".", label="Data")
axe.plot(xlin, yh_lin, label="Linear")
axe.plot(xlin, yh_log, label="Log Linear")
axe.legend()
axe.grid()

[figure: noisy data with the Linear and Log Linear fits]

Automation

The process can be repeated multiple times, simply chaining fit and predict:

# Iterate datasets:
for i in range(10):
    x = ...   # Select X data
    yn = ...  # Select Y data
    # Fit and predict:
    yhat = pipeline_log.fit(x, yn).predict(xlin)

Alternative

If you don't know the polynomial order and don't require the fit to be a polynomial regression (e.g. you just want to predict plausible values), then a Gaussian Process can be a good alternative:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

pipeline_gpr = Pipeline([
    ("model",
         TransformedTargetRegressor(
             regressor=GaussianProcessRegressor(
                 kernel=1.0*RBF(),
                 alpha=0.01
             ),
             func=np.log,
             inverse_func=np.exp
         )
    )
])
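The GPR pipeline is fitted and used for prediction exactly like the polynomial ones; presumably this is how the yh_gpr used in the derivative section below is produced:

# Fit on the noisy data and predict on the dense grid (same pattern as above)
yh_gpr = pipeline_gpr.fit(x, yn).predict(xlin)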

[figure: Gaussian Process regression fit to the data]

Taking second derivative

Numerically

A potential candidate for estimating the second derivative numerically is savgol_filter from scipy:

from scipy import signal

dx = xlin[1,0] - xlin[0,0]
yh_lin_d2 = signal.savgol_filter(yh_lin, 15, 3, deriv=2, delta=dx)
yh_log_d2 = signal.savgol_filter(yh_log, 15, 3, deriv=2, delta=dx)
yh_gpr_d2 = signal.savgol_filter(yh_gpr, 15, 3, deriv=2, delta=dx)

[figure: second-derivative estimates for the Linear, Log Linear, and GPR fits]

Caution: taking the derivative (or worse, the second derivative) of a fit is not a trivial operation. As expected, the second derivative of the 5th-order polynomial (Linear) is a 3rd-order polynomial.
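To illustrate why the caution matters, compare with differencing the raw noisy samples directly, which amplifies the noise dramatically (a quick sketch):

# Second derivative of the raw samples via repeated finite differencing: very noisy
raw_d2 = np.gradient(np.gradient(yn, x[:, 0]), x[:, 0])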

From regressed coefficients

If the fitted model is a polynomial, we can use the regressed polynomial coefficients to deduce a derivative of any order.

Phat = np.poly1d(pipeline_lin["model"].coef_[::-1])
Phat_xx = Phat.deriv(m=2)

Which can be used to evaluate the second derivative at new points:

yh_d2 = Phat_xx(np.squeeze(xlin))

This solution perfectly agrees with yh_lin_d2.
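One quick sanity check, away from the window edges where the Savitzky-Golay estimate is least accurate, is to compare the two arrays directly, for example:

# Largest discrepancy between the analytical and filtered second derivatives
print(np.max(np.abs(yh_d2[20:-20] - yh_lin_d2[20:-20])))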

Proposal

Here is a complete proposal for obtaining the second derivative of a set of points that follow a polynomial trend:

def fit_diff(x, y, poly_order=5, resolution=200):
    pipeline = Pipeline([
        ("transformer", PolynomialFeatures(poly_order)),
        ("model", LinearRegression(fit_intercept=False))
    ])
    xlin = np.linspace(x.min(), x.max(), resolution).reshape(-1, 1)
    yhat = pipeline.fit(x, y).predict(xlin)
    dx = xlin[1, 0] - xlin[0, 0]
    dy2dx2 = signal.savgol_filter(yhat, 15, 3, deriv=2, delta=dx)
    return dy2dx2
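Usage would then look something like this (note that, as written, the function returns only the derivative values, not the xlin grid they correspond to):

d2y_dx2 = fit_diff(x, yn, poly_order=5)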
Nick ODell

In addition to the numerical solution that @jlandercy has suggested, there is an analytical solution. Unfortunately, as I discussed in the comments, it's not a nice polynomial. For that reason, we need to bring in a new library, SymPy, which is capable of representing, simplifying, and differentiating arbitrary expressions.

I'm going to start by creating the same dataset he uses, to make the answers easier to compare.

import sympy as sm
import numpy as np
import matplotlib.pyplot as plt


np.random.seed(12345)
x_data = np.linspace(-1, 1, 50)
P = np.poly1d([6, 5, 4, 3, 2, 10])
y_data = P(x_data) + 0.5 * np.random.randn(x_data.size)

Then, fit a polynomial to a logged version of the data:

poly_coeffs = np.polyfit(x_data, np.log(y_data), deg=2)

Next, declare a SymPy symbol x and create a polynomial in terms of this variable, then exponentiate it (i.e. raise e to the power of that polynomial).

x = sm.Symbol('x')
logged_poly = sm.Poly(poly_coeffs, x).as_expr()
non_logged_poly = sm.exp(logged_poly)

Next, take the second derivative of this expression with respect to x.

twice_diff = sm.diff(non_logged_poly, x, 2)

This gives us the expression

[image: the differentiated SymPy expression; its numeric form appears in the lambdified source below]

Which is pretty ugly, but at least avoids the use of np.gradient() to find the second derivative.

Now, to use this in your code, you need to turn it back into a Python function. The SymPy function lambdify() can be used to do this.

f = sm.lambdify(x, twice_diff, 'numpy')

If you want to see the source code of the generated function, you can use help(f) to do so. (You don't need to copy this source code into your program. You can use f() directly. It's just to show what SymPy is doing internally.)

def _lambdifygenerated(x):
    return 9.99113566549129*(0.504928487945428*(x + 0.721464680150301)**2 + 0.710583202690176)*exp(x*(0.355291601345088*x + 0.512660683049044))

You can use this function either by passing one number at a time, or by passing an array of numbers.

f(0)
f(np.array([-1, 0, 1]))

If you want to plot the result of this function, you can do so like this:

plt.plot(x_data, f(x_data))