Scipy curve_fit giving wrong results for powerlaw


I have a list of degrees of nodes in a network, and I want to fit a power-law distribution to this empirical degree distribution. I can do it using the powerlaw library, but for some reason I get bad results when I try with scipy's curve_fit function.

I have a list of node degrees called degree. I can fit a power-law distribution to these degrees via the powerlaw library as follows:

import powerlaw as pwl

fit_function = pwl.Fit(degree,xmin=1)

alpha = fit_function.power_law.alpha
sigma = fit_function.power_law.sigma
xmin = fit_function.power_law.xmin

print("alpha = ", alpha)

I end up obtaining alpha = 2.5, which is exactly in the range one would expect for a social network. I am then able to plot the data against this fitted distribution on a log-log plot and get a nice-looking result:

import numpy as np
import matplotlib.pyplot as plt

def powerlaw(x, alpha):
    # Continuous power-law PDF with xmin = 1
    y = (alpha - 1) * x**(-alpha)
    return y

x = np.linspace(min(unique_degrees), max(unique_degrees), 1000)
y = powerlaw(x, alpha)

plt.loglog(unique_degrees, counts_pdf, 'o', label='Degree Distribution')
plt.loglog(x, y, c='r', label='Fitted Power Law')

plt.legend()
plt.show()
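
For reference, unique_degrees and counts_pdf above are the distinct degree values and their normalized frequencies; a rough sketch of how they can be computed from the degree list:

import numpy as np

# Sketch: empirical degree distribution from the list of node degrees
unique_degrees, counts = np.unique(degree, return_counts=True)
counts_pdf = counts / counts.sum()  # normalize counts to an empirical PDF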

Now, when I try the same thing using the scipy curve_fit function, I run into problems:

from scipy.optimize import curve_fit

def powerlaw(x, alpha):
    y = (alpha - 1) * x**(-alpha)
    return y

bounds = (2, 3)   # lower and upper bound on alpha
p0 = [2.5]
params, _ = curve_fit(powerlaw, unique_degrees, counts_pdf, p0=p0, bounds=bounds)

print("alpha = ", params[0])

When I do this without the bounds, I get a value of alpha = 1.46. When I include bounds, alpha always ends up equal to the lower bound, provided that lower bound is greater than 1.46. I've tried plotting the data on a log-log plot, and the result does not look good:

x = np.linspace(min(unique_degrees), max(unique_degrees), 1000)
y1 = powerlaw(x, *params)

plt.loglog(unique_degrees, counts_pdf, 'o', label='Degree Distribution')
plt.loglog(x, y1, c='g', label='Fitted Power Law 1')

plt.legend()
plt.show()

Am I doing something wrong?


1 Answer

Answered by jlandercy:

When you formulate a regression problem in a nonlinear fashion, you can easily create multiple minima in the error function being minimized.

Below is an example of such a case:

MCVE

Let's create a plausible dataset:

import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize

def model(x, alpha):
    return (alpha - 1.) * np.power(x[:, 0], -alpha)

# Generate synthetic data with proportional noise
np.random.seed(12345)
X = np.linspace(1, 50, 30).reshape(-1, 1)
y = model(X, 2.5)
s = 0.05 * np.abs(y)
yn = y + s * np.random.normal(size=y.shape)

If we fit without any precaution, we get:

# Optimization with default initial guess (unitary)
popt, pcov = optimize.curve_fit(model, X, yn, sigma=s, absolute_sigma=True)
# (array([1.00692318]), array([[6.30590102e-09]]))

(figure: data and fitted curve with the default initial guess)

If we plot the error function, we can see that there are two minima in your NLLS formulation:

(figures: error function versus alpha, showing the two minima)
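
Something along these lines produces that scan (a rough sketch, reusing the variables from the MCVE above):

# Sketch: scan the chi-square error over alpha to expose both minima
alphas = np.linspace(1.001, 4.0, 300)
errors = [np.sum(((yn - model(X, a)) / s) ** 2) for a in alphas]

plt.plot(alphas, errors)
plt.xlabel("alpha")
plt.ylabel("Chi-square error")
plt.yscale("log")
plt.show()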

To reach the optimum, it is sufficient to move the initial guess to the right of the peak:

# Optimization with educated guess:
popt, pcov = optimize.curve_fit(model, X, yn, sigma=s, p0=(3.,), absolute_sigma=True)
# (array([2.49840598]), array([[1.36086962e-05]]))

(figure: data and fitted curve with the educated initial guess)
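
As a quick check, the uncertainty on the fitted exponent can be read off the covariance matrix returned by curve_fit:

# Report the fitted exponent with its 1-sigma uncertainty
alpha_hat = popt[0]
alpha_err = np.sqrt(np.diag(pcov))[0]
print(f"alpha = {alpha_hat:.3f} +/- {alpha_err:.1e}")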