Binning data into equal box sizes and apply OLS to each bin

Question

543 views Asked by Miquel At 07 January 2025 at 08:20

I have a DataFrame df1:

import pandas as pd
import numpy as np
import statsmodels.formula.api as sm

df1 = pd.DataFrame(np.random.randn(3000, 1), index=pd.date_range('1/1/1990', periods=3000), columns=["M"])

I would like to group the elements into boxes of size 10, fit each box using OLS, and compute Y_t, where Y_t stands for the series of straight-line fits.

In other words, I would like to take the first 10 values, fit them using OLS (Y_t = b*X_t + a_0), and obtain the fitted values Y_t for these 10 points. Then do the same for the next 10 values (not a rolling window!), and so on.

My approach

The first issue that I faced was that I could not fit elements using DateTime values as predictors, so I defined a new DataFrame df_fit that contains two columns, A and B. Column B contains the integers from 1 to 10 (the position within the box), and column A the values of df1 in groups of 10 elements:

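def compute_yt(df, i, bs):
    # Values of the i-th box: rows i*bs .. (i+1)*bs - 1 (.loc is end-inclusive)
    df_fit = pd.DataFrame({"B": np.arange(1, bs + 1),
                           "A": df.reset_index().loc[i*bs:((i+1)*bs - 1), "M"]})

    # Regress the values A on the within-box position B
    fit = sm.ols(formula="A ~ B", data=df_fit).fit()
    yt = fit.params.B * df_fit["B"] + fit.params.Intercept

    return yt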

Here bs is the box size (10 in this example) and i is an index that sweeps over all the boxes.

Finally,

bs = 10
result = [compute_yt(df1, n, bs) for n in range(len(df1) // bs)]

 result =    
      Name: B, dtype: float64, 840   -0.249590
      841   -0.249935
      842   -0.250280
      843   -0.250625
      844   -0.250970
      845   -0.251315
      846   -0.251660
      847   -0.252005
      848   -0.252350
      849   -0.252695
      Name: B, dtype: float64, 850   -0.252631
      851   -0.252408
      ...    ...

Here result is a list that should contain the values of the straight-line fits.

So, my questions are the following:

1. Is there a way to run an OLS using DateTime values as predictors?
2. I would like to use the list comprehension to build a DataFrame (with the same shape as df1) containing the values of Y_t. This relates to question (1) in the sense that I would like to obtain a time series for these values.
3. Is there a more "pythonic" way to write this code? The way I have sliced the DataFrame does not seem very suitable.

Original Q&A

There is 1 answer

**Ted Petrou** · Accepted Answer · 2016-12-20T15:51:47+00:00

Not really sure if this is what you wanted to do but I first added a group number and an observation number to each row of your dataframe and then pivoted it so that every row had 10 observations.

df1 = pd.DataFrame( data={'M':np.random.randn(3000)}, index= pd.date_range('1/1/1990', periods=3000))

df1['group_num'] = np.repeat(range(300), 10)
df1['obs_num'] = np.tile(range(10), 300)

df_pivot = df1.pivot(index='group_num', columns='obs_num')
print(df_pivot.head())

Output

                  M                                                    \
obs_num           0         1         2         3         4         5   
group_num                                                               
0         -0.063775 -1.293410  0.395011 -1.224491  1.777335 -2.395643   
1         -1.111679  1.668670  1.864227 -1.555251  0.959276  0.615344   
2         -0.213891 -0.733493  0.175590  0.561410  1.359565 -1.341193   
3          0.534735 -2.154626 -1.226191 -0.309502  1.368085  0.769155   
4         -0.611289 -0.545276 -1.924381  0.383596  0.322731  0.989450   


obs_num           6         7         8         9  
group_num                                          
0         -1.461194 -0.481617 -1.101098  1.102030  
1         -0.120995 -1.046757  1.286074 -0.832990  
2          0.322485 -0.825315 -2.277746 -0.619008  
3          0.794694  0.912190 -1.006603  0.572619  
4         -1.191902  1.229913  1.105221  0.899331
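The labels above hard-code 300 groups of 10. A minimal sketch, assuming the number of rows is a multiple of the box size, derives the same labels from the row position instead. The names `bs` and `pos` are illustrative and not part of the original answer, and passing `values="M"` to `pivot` gives flat column labels rather than the two-level header shown above.

```python
import numpy as np
import pandas as pd

bs = 10  # box size (assumed to match the question)
df = pd.DataFrame({"M": np.random.randn(3000)},
                  index=pd.date_range("1/1/1990", periods=3000))

pos = np.arange(len(df))       # row position 0, 1, 2, ...
df["group_num"] = pos // bs    # 0 for rows 0-9, 1 for rows 10-19, ...
df["obs_num"] = pos % bs       # 0..bs-1 inside every group

# One row per group, one column per observation within the group
df_pivot = df.pivot(index="group_num", columns="obs_num", values="M")
```

Changing `bs` is then enough to re-bin the data, as long as `len(df)` is divisible by it.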

I then wrote a function to do ordinary least squares with statsmodels - not the formula type.

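import statsmodels.api as sm

def compute_yt(row):
    # Predictor: position 0..9 within the group, plus a constant column for the intercept
    X = sm.add_constant(np.arange(len(row)))
    fit = sm.OLS(row.values, X).fit()
    # Fitted line evaluated at each position: intercept + slope * x
    yt = fit.params[0] + fit.params[1] * X[:, 1]
    return pd.Series(yt, index=row.index)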

I then called this function over all the rows via apply.

df_pivot.apply(compute_yt, axis=1)
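As an alternative to fitting one statsmodels model per row, the straight-line fits can probably be done in a single call, because `np.polyfit` accepts a 2-D `y` and fits each column separately. This is only a sketch using NumPy in place of statsmodels; the array `values` is a random stand-in for the pivoted data (one bin per row), and the names are illustrative.

```python
import numpy as np

bs = 10
values = np.random.randn(300, bs)   # stand-in for the pivoted data, one bin per row
x = np.arange(bs)                   # predictor 0..9 inside each bin

# polyfit treats every column of a 2-D y as its own dataset,
# so the bins are passed as columns (hence the transpose)
slope, intercept = np.polyfit(x, values.T, deg=1)   # each has shape (300,)

# Fitted line per bin: intercept + slope * x, giving shape (300, 10)
fitted = intercept[:, None] + slope[:, None] * x
```

With `values = df_pivot.values`, `fitted` has the same shape as the pivoted frame and can be wrapped back into a DataFrame with the same index and columns.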

The result has the same shape as df_pivot, with one fitted value for each of the original 10 values in every group.
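To get the fitted values back as a time series shaped like df1, and to use the dates themselves as the predictor, one option is to convert each bin's timestamps to elapsed days and fit per group. The sketch below assumes a daily DatetimeIndex and swaps statsmodels for `np.polyfit` for brevity; `fit_line`, `bins` and `y_t` are hypothetical names.

```python
import numpy as np
import pandas as pd

bs = 10
df1 = pd.DataFrame({"M": np.random.randn(3000)},
                   index=pd.date_range("1/1/1990", periods=3000))

def fit_line(s):
    # Numeric predictor: days elapsed since the first timestamp of the bin
    x = ((s.index - s.index[0]) / pd.Timedelta(days=1)).to_numpy()
    slope, intercept = np.polyfit(x, s.values, deg=1)
    # Fitted values, indexed by the original dates of this bin
    return pd.Series(intercept + slope * x, index=s.index)

bins = np.arange(len(df1)) // bs            # bin label for every row
y_t = df1["M"].groupby(bins).transform(fit_line)
```

`y_t` keeps the DatetimeIndex of df1, so it can be stored next to the raw data, for example with `df1["Y_t"] = y_t`.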
