Resampling and merging data frame with python

2.2k views Asked by At

Hi I have created a dictionary of dataFrame with this code

import os
import pandas 
import glob

path="G:\my_dir\*"
dataList={}

for files in glob.glob(path):
    dataList[files]=(read_csv(files,sep=";",index_col='Date'))

The different dataframe present in the dictory have different time sample. An example of dataFrame(A) is

Date               Volume   Value 
2014-01-04 06:00:02 6062   108000.0
2014-01-04 06:06:05 6062   107200.0
2014-01-04 06:12:07 6062   97400.0
2014-01-04 06:18:10 6062   99200.0
2014-01-04 06:24:12 6062   91300.0
2014-01-04 06:30:14 6062   84100.0
2014-01-04 06:36:17 6062   57000.0

Example of dataFrame(B) is

Date                Volume Value 
2014-01-04 05:52:50  6062   4.7
2014-01-04 05:58:53  6062   4.7
2014-01-04 06:04:56  6062   4.9 
2014-01-04 06:10:58  6062   5.1
2014-01-04 06:17:01  6062   5.2
2014-01-04 06:23:03  6062   5.2
2014-01-04 06:29:05  6062   5.5
2014-01-04 06:35:08  6062   5.5

The different data frame don't have the same number of rows. I want to merge the different data frame in a single one like this:

    Data                 Volume       B               A               Value(DataframeN)
2014/04/01 05:52:50      6062        4.70            NaN
2014/04/01 05:58:53      6062        4.70            NaN
2014/04/01 06:04:56      6062        4.90            107465.51
2014/04/01 06:10:58      6062        5.10            100652.60
2014/04/01 06:17:01      6062        5.20            98899.57
2014/04/01 06:23:03      6062        5.20            92618.56
2014/04/01 06:29:05      6062        5.50            85301.73
2014/04/01 06:35:08      6062        5.50            61523.06

I have done this easily with Matlab using with the command

ts_A=timeseries(ValueA,datenum(DateA));
ts_B=timeseries(ValueB,datenum(DateB));
res_A=resample(ts_A,datenum(DateB));

I have to do this for several sets of csv files so I wanted to automate the process with python.

Tnx

1

There are 1 answers

4
TheBlackCat On BEST ANSWER

You can concat the two DataFrames, interpolate, then reindex on the DataFrame you want.

I assume we have a certain number of DataFrames, where the Date is a DateTimeIndex in all of them. I will use two in this example, since you used two in the question, but the code will work for any number.

df_a:

                     Volume   Value
Date                               
2014-01-04 06:00:02    6062  108000
2014-01-04 06:06:05    6062  107200
2014-01-04 06:12:07    6062   97400
2014-01-04 06:18:10    6062   99200
2014-01-04 06:24:12    6062   91300
2014-01-04 06:30:14    6062   84100
2014-01-04 06:36:17    6062   57000

df_b:

                     Volume  Value
Date                              
2014-01-04 05:52:50    6062    4.7
2014-01-04 05:58:53    6062    4.7
2014-01-04 06:04:56    6062    4.9
2014-01-04 06:10:58    6062    5.1
2014-01-04 06:17:01    6062    5.2
2014-01-04 06:23:03    6062    5.2
2014-01-04 06:29:05    6062    5.5
2014-01-04 06:35:08    6062    5.5

And I will put these into a dict for the example. You read them directly into a dict, so you don't need to do this step. I just want to show how my example dict is formatted. The dict keys don't matter, any valid dict key will work:

dataList = {'a': df_a,
            'b': df_b}

This gets us to where you currently are, with my dataList hopefully having the same format as yours.

The first thing you need to do is to combine the DataFrames. I use the dict keys as MultiIndex column names so you can keep track of which instance of a given column came from which DataFrame. You can do that like so:

df = pd.concat(dataList.values(), axis=1, keys=dataList.keys())

This gives you a DataFrame like this:

                         a              b      
                    Volume   Value Volume Value
Date                                           
2014-01-04 05:52:50    NaN     NaN   6062   4.7
2014-01-04 05:58:53    NaN     NaN   6062   4.7
2014-01-04 06:00:02   6062  108000    NaN   NaN
2014-01-04 06:04:56    NaN     NaN   6062   4.9
2014-01-04 06:06:05   6062  107200    NaN   NaN
2014-01-04 06:10:58    NaN     NaN   6062   5.1
2014-01-04 06:12:07   6062   97400    NaN   NaN
2014-01-04 06:17:01    NaN     NaN   6062   5.2
2014-01-04 06:18:10   6062   99200    NaN   NaN
2014-01-04 06:23:03    NaN     NaN   6062   5.2
2014-01-04 06:24:12   6062   91300    NaN   NaN
2014-01-04 06:29:05    NaN     NaN   6062   5.5
2014-01-04 06:30:14   6062   84100    NaN   NaN
2014-01-04 06:35:08    NaN     NaN   6062   5.5
2014-01-04 06:36:17   6062   57000    NaN   NaN

Next, you need to interpolate to fill in the missing values. I interpolate using 'time' mode so it properly handles the time indexes:

df = df.interpolate('time')

This gives you a DataFrame like this:

                         a                     b          
                    Volume          Value Volume     Value
Date                                                      
2014-01-04 05:52:50    NaN            NaN   6062  4.700000
2014-01-04 05:58:53    NaN            NaN   6062  4.700000
2014-01-04 06:00:02   6062  108000.000000   6062  4.738017
2014-01-04 06:04:56   6062  107352.066116   6062  4.900000
2014-01-04 06:06:05   6062  107200.000000   6062  4.938122
2014-01-04 06:10:58   6062   99267.955801   6062  5.100000
2014-01-04 06:12:07   6062   97400.000000   6062  5.119008
2014-01-04 06:17:01   6062   98857.851240   6062  5.200000
2014-01-04 06:18:10   6062   99200.000000   6062  5.200000
2014-01-04 06:23:03   6062   92805.801105   6062  5.200000
2014-01-04 06:24:12   6062   91300.000000   6062  5.257182
2014-01-04 06:29:05   6062   85472.375691   6062  5.500000
2014-01-04 06:30:14   6062   84100.000000   6062  5.500000
2014-01-04 06:35:08   6062   62151.239669   6062  5.500000
2014-01-04 06:36:17   6062   57000.000000   6062  5.500000

I think generally it would be best to stop here, since you keep all data from all csv files. But you said you want only the time points from the longest csv. To get that, you need to find the longest DataFrame, and then get the rows corresponding to its indexes. Finding the longest DataFrame is easy, you just find the one with the maximum length. Keeping only the time points in that index is also easy, you just slice using that index (you use the loc method for slicing in this way).

longind = max(dataList.values(), key=len).index
df = df.loc[longind]

This gives you the following final DataFrame:

                         a                     b      
                    Volume          Value Volume Value
Date                                                  
2014-01-04 05:52:50    NaN            NaN   6062   4.7
2014-01-04 05:58:53    NaN            NaN   6062   4.7
2014-01-04 06:04:56   6062  107352.066116   6062   4.9
2014-01-04 06:10:58   6062   99267.955801   6062   5.1
2014-01-04 06:17:01   6062   98857.851240   6062   5.2
2014-01-04 06:23:03   6062   92805.801105   6062   5.2
2014-01-04 06:29:05   6062   85472.375691   6062   5.5
2014-01-04 06:35:08   6062   62151.239669   6062   5.5

This can be combined into one line if you want:

df = pd.concat(dataList.values(), axis=1, keys=dataList.keys()).interpolate('time').loc[max(dataList.values(), key=len).index]

Or, perhaps a slightly clearer 4 lines:

names = dataList.keys()
dfs = dataList.values()
longind = max(dfs, key=len).index
df = pd.concat(dfs, axis=1, keys=names).interpolate('time').loc[longind]

I am not sure why my final results are different than what you show. I ran your example in MATLAB (R2015A) myself and got the same results as I get here, so I suspect you generated the final data with a different data set than the example.