Python Pandas Linear Interpolate Y over X


I'm trying to answer this Udacity question: https://www.udacity.com/course/viewer#!/c-st101/l-48696651/e-48532778/m-48635592

I like Python & Pandas so I'm using Pandas (version 0.14)

I have this DataFrame:

import pandas as pd

df = pd.DataFrame(dict(size=(1400,
                             2400,
                             1800,
                             1900,
                             1300,
                             1100),
                       cost=(112000,
                             192000,
                             144000,
                             152000,
                             104000,
                             88000)))

I added a 2,100 sq ft house to my DataFrame. Notice there is no cost; that is the question: what would you expect to pay for a house of 2,100 sq ft?

df = df.append(pd.DataFrame({'size':(2100,)}), True)

The question wants you to answer what cost/price you expect to pay, using linear interpolation.

Can Pandas interpolate? And how?

I tried this:

df.interpolate(method='linear')

But it gave me a cost of 88,000; just the last cost value repeated

I tried this:

df.sort('size').interpolate(method='linear')

But it gave me a cost of 172,000, which is just halfway between the costs of 152,000 and 192,000. Closer, but not what I want. The correct answer is 168,000 (because there is a "slope" of $80/sqft).

EDIT:

I checked these SO questions


There are 3 answers

Nate Anderson On BEST ANSWER

Pandas' method='linear' interpolation will do what I call "1D" interpolation: it treats the values as equally spaced and ignores the index.
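To illustrate the difference, here is a minimal sketch on a made-up Series with an unevenly spaced index (the numbers and the names s, by_position and by_index are illustrative, not taken from the question):

```python
import numpy as np
import pandas as pd

# Unevenly spaced index: the gap 0 -> 1 is much smaller than 1 -> 10.
s = pd.Series([0.0, np.nan, 100.0], index=[0, 1, 10])

# method='linear' treats the points as equally spaced and ignores the index.
by_position = s.interpolate(method='linear')

# method='index' uses the index values as the x-coordinates.
by_index = s.interpolate(method='index')

print(by_position.loc[1])  # 50.0, halfway between the neighbours
print(by_index.loc[1])     # 10.0, one tenth of the way from x=0 to x=10
```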

If you want to interpolate a "dependent" variable over an "independent" variable, make the "independent" variable the Index of a Series, and use method='index' (or method='values'; they're the same).

In other words:

(pd.Series(index=df['size'].values, data=df['cost'].values)  # Make size the independent variable
    # order() from Pandas 0.14 is deprecated (see the answer below); sort_index() sorts by the index
    .sort_index()  # Sorts by the index, which is size in sq ft; interpolation depends on order (see OP)
    .interpolate(method='index')[2100])  # Interpolate using method 'index'

This returns the correct answer 168,000

This is not clear to me from the example in Pandas Documentation, where the Series' data and index are the same list of values.
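As a cross-check outside Pandas, the same number can be reproduced with NumPy's np.interp, which interpolates y over explicit x-coordinates. This is only a sketch using the question's data; the variable names are illustrative.

```python
import numpy as np

sizes = np.array([1400, 2400, 1800, 1900, 1300, 1100])
costs = np.array([112000, 192000, 144000, 152000, 104000, 88000])

# np.interp needs increasing x-coordinates, so sort both arrays by size.
order = np.argsort(sizes)
estimate = np.interp(2100, sizes[order], costs[order])

print(estimate)  # 168000.0
```

The value falls between the 1,900 and 2,400 sq ft houses: 152,000 + 200 * 80 = 168,000.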

frederikwillersinn On

In my version of Pandas (1.1.1), order() is no longer available. You should use sort_values() instead. This does the job:

df = df.append(pd.DataFrame({'size':(2100,)}), True)
pd.Series(index=df['size'].values,
          data=df['cost'].values).sort_values().interpolate(method='index')[2100]

=168000.0
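On Pandas 2.0 and later, DataFrame.append() has also been removed, so the snippet above needs pd.concat() instead. A minimal self-contained sketch, assuming a recent Pandas version (the names cost_by_size and estimate are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'size': [1400, 2400, 1800, 1900, 1300, 1100],
    'cost': [112000, 192000, 144000, 152000, 104000, 88000],
})

# DataFrame.append() was removed in pandas 2.0; pd.concat() replaces it.
# The new row has no 'cost', so that cell becomes NaN.
df = pd.concat([df, pd.DataFrame({'size': [2100]})], ignore_index=True)

# Index the costs by size, sort by that index, then interpolate over it.
cost_by_size = pd.Series(df['cost'].values, index=df['size'].values)
estimate = cost_by_size.sort_index().interpolate(method='index').loc[2100]

print(estimate)  # 168000.0
```

Sorting with sort_index() rather than sort_values() orders the rows by size directly, so it does not rely on cost happening to increase with size.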

Luca Rigazio On

With my version of Pandas (0.19.2), index=df.size breaks. It's an unlucky choice of column name: df.size is the size of the table (its number of elements), not the 'size' column. So this works:

df=df.append(pd.DataFrame({'size':(2100,)}), True)
pd.Series(index=df['size'].values, 
data=df['cost'].values).order().interpolate(method='index')[2100]

=168000.0