Is there an easy way in Python to extrapolate data points into the future?


I have a simple numpy array with one data point for every date. Something like this:

>>> import numpy as np
>>> from datetime import date
>>> x = np.array( [(date(2008,3,5), 4800 ), (date(2008,3,15), 4000 ), (date(2008,3,20), 3500 ), (date(2008,4,5), 3000 ) ] )

Is there an easy way to extrapolate data points into the future, e.g. to date(2008,5,1), date(2008, 5, 20), etc.? I understand it can be done with mathematical algorithms, but here I am looking for some low-hanging fruit. Actually I like what numpy.linalg.solve does, but it does not look applicable to extrapolation. Maybe I am absolutely wrong.

To be more specific, I am building a burn-down chart (an XP term): 'x=date and y=volume of work to be done'. I have the sprints that are already done, and I want to visualise how the future sprints will go if the current situation persists. Finally, I want to predict the release date. By its nature, the 'volume of work to be done' always goes down on a burn-down chart. I also want to get the extrapolated release date: the date when the volume becomes zero.

This is all for showing the dev team how things are going. Precision is not so important here :) The motivation of the dev team is the main factor. That means I am absolutely fine with a very approximate extrapolation technique.


There are 4 answers

1
denis On BEST ANSWER

It's all too easy for extrapolation to generate garbage; try this. Many different extrapolations are of course possible; some produce obvious garbage, some non-obvious garbage, many are ill-defined.


""" extrapolate y,m,d data with scipy UnivariateSpline """
import numpy as np
from scipy.interpolate import UnivariateSpline
    # pydoc scipy.interpolate.UnivariateSpline -- fitpack, unclear
from datetime import date
from pylab import *  # ipython -pylab

__version__ = "denis 23oct"


def daynumber( y,m,d ):
    """ 2005,1,1 -> 0  2006,1,1 -> 365 ... """
    return date( y,m,d ).toordinal() - date( 2005,1,1 ).toordinal()

days, values = np.array([
    (daynumber(2005,1,1), 1.2 ),
    (daynumber(2005,4,1), 1.8 ),
    (daynumber(2005,9,1), 5.3 ),
    (daynumber(2005,10,1), 5.3 )
    ]).T
dayswanted = np.array([ daynumber( year, month, 1 )
        for year in range( 2005, 2006+1 )
        for month in range( 1, 12+1 )])

np.set_printoptions( 1 )  # .1f
print( "days:", days )
print( "values:", values )
print( "dayswanted:", dayswanted )

title( "extrapolation with scipy.interpolate.UnivariateSpline" )
plot( days, values, "o" )
for k in (1,2,3):  # line parabola cubicspline
    extrapolator = UnivariateSpline( days, values, k=k )
    y = extrapolator( dayswanted )
    label = "k=%d" % k
    print( label, y )
    plot( dayswanted, y, label=label  )  # pylab

legend( loc="lower left" )
grid(True)
savefig( "extrapolate-UnivariateSpline.png", dpi=50 )
show()

Added: a Scipy ticket says, "The behavior of the FITPACK classes in scipy.interpolate is much more complex than the docs would lead one to believe" -- imho true of other software doc too.

1
ty812 On

Mathematical models are the way to go in this case. For instance, if you have only three data points, you have absolutely no indication of how the trend will unfold (it could be either of two parabolas).

Take some statistics courses and try to implement the algorithms. Try Wikibooks.

0
Luka Rahne On

You have to specify which kind of function you want to extrapolate with. Then you can use regression http://en.wikipedia.org/wiki/Regression_analysis to find the parameters of that function, and extrapolate it into the future.

For instance: translate the dates into x values, using the first day as x=0. For your problem the values should be approximately (0,1.2), (400,1.8), (900,5.3).

Now you decide that these points lie on a function of type a+bx+cx^2.

Use the method of least squares to find a, b and c http://en.wikipedia.org/wiki/Linear_least_squares (I will provide full source, but later, because I do not have time for this).
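
A minimal sketch of the regression steps above, assuming NumPy's np.polyfit as the least-squares solver and the four data points from the question. Fitting a straight line (degree 1) rather than a+bx+cx^2 is an assumption made here because a falling line always reaches zero, which suits a burn-down chart; passing 2 as the degree would fit the quadratic form instead. The names points, start, predicted and release_date are only illustrative.

```python
from datetime import date, timedelta
import math

import numpy as np

# Data points from the question: (date, volume of work remaining).
points = [(date(2008, 3, 5), 4800), (date(2008, 3, 15), 4000),
          (date(2008, 3, 20), 3500), (date(2008, 4, 5), 3000)]

# Translate dates into x values, with the first day as x=0.
start = points[0][0]
x = np.array([(d - start).days for d, _ in points], dtype=float)
y = np.array([v for _, v in points], dtype=float)

# Least-squares fit of y = intercept + slope*x.
# np.polyfit returns coefficients with the highest power first.
slope, intercept = np.polyfit(x, y, 1)

# Extrapolate to the future dates mentioned in the question.
predicted = {}
for d in (date(2008, 5, 1), date(2008, 5, 20)):
    t = (d - start).days
    predicted[d] = float(intercept + slope * t)
    print(d, round(predicted[d]))

# Release date estimate: first whole day on which the fitted line is <= 0.
days_to_zero = -intercept / slope
release_date = start + timedelta(days=math.ceil(days_to_zero))
print("estimated release date:", release_date)
```

With only four points the estimate is very rough, and a higher-degree fit can curve back upwards and never reach zero, so it is worth plotting the fitted curve before trusting the date.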

0
Eric O. Lebigot On

A simple way of doing extrapolations is to use interpolating polynomials or splines: there are many routines for this in scipy.interpolate, and they are quite easy to use (just give the (x, y) points, and you get a function [a callable, to be precise]).

Now, as has been pointed out in this thread, you cannot expect the extrapolation to be always meaningful (especially when you are far from your data points) if you don't have a model for your data. However, I encourage you to play with the polynomial or spline interpolations from scipy.interpolate to see whether the results you obtain suit you.
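
As a rough illustration of the "give the points, get a callable" idea, here is a sketch that assumes scipy.interpolate.InterpolatedUnivariateSpline with k=1 (a piecewise-linear spline) and the data from the question. The choice of this particular class and of k=1 is an assumption, not the only option; outside the data range the callable simply continues the last segment.

```python
from datetime import date

import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

# Data points from the question: (date, volume of work remaining).
points = [(date(2008, 3, 5), 4800), (date(2008, 3, 15), 4000),
          (date(2008, 3, 20), 3500), (date(2008, 4, 5), 3000)]

# Convert dates to day offsets; x must be strictly increasing.
start = points[0][0]
x = np.array([(d - start).days for d, _ in points], dtype=float)
y = np.array([v for _, v in points], dtype=float)

# k=1: piecewise-linear spline that passes through every data point.
# By default it extrapolates beyond the data instead of raising an error.
spline = InterpolatedUnivariateSpline(x, y, k=1)

# The spline object is a callable: evaluate it at future day offsets.
for d in (date(2008, 5, 1), date(2008, 5, 20)):
    t = (d - start).days
    print(d, float(spline(t)))
```

Trying k=2 or k=3 on the same points shows how quickly the extrapolated values diverge once you move away from the data.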