Hi, I have created a dictionary of DataFrames with this code:
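import os
import glob
import pandas
# Raw string so the backslashes in the Windows path are kept literally.
path = r"G:\my_dir\*"
dataList = {}
# Read every file matching the pattern into a dict keyed by its path.
for files in glob.glob(path):
    dataList[files] = pandas.read_csv(files, sep=";", index_col='Date')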
The DataFrames in the dictionary are sampled at different times. An example DataFrame (A) is:
Date Volume Value
2014-01-04 06:00:02 6062 108000.0
2014-01-04 06:06:05 6062 107200.0
2014-01-04 06:12:07 6062 97400.0
2014-01-04 06:18:10 6062 99200.0
2014-01-04 06:24:12 6062 91300.0
2014-01-04 06:30:14 6062 84100.0
2014-01-04 06:36:17 6062 57000.0
An example DataFrame (B) is:
Date Volume Value
2014-01-04 05:52:50 6062 4.7
2014-01-04 05:58:53 6062 4.7
2014-01-04 06:04:56 6062 4.9
2014-01-04 06:10:58 6062 5.1
2014-01-04 06:17:01 6062 5.2
2014-01-04 06:23:03 6062 5.2
2014-01-04 06:29:05 6062 5.5
2014-01-04 06:35:08 6062 5.5
The DataFrames don't have the same number of rows. I want to merge them into a single DataFrame like this:
Date Volume B A Value(DataframeN)
2014/04/01 05:52:50 6062 4.70 NaN
2014/04/01 05:58:53 6062 4.70 NaN
2014/04/01 06:04:56 6062 4.90 107465.51
2014/04/01 06:10:58 6062 5.10 100652.60
2014/04/01 06:17:01 6062 5.20 98899.57
2014/04/01 06:23:03 6062 5.20 92618.56
2014/04/01 06:29:05 6062 5.50 85301.73
2014/04/01 06:35:08 6062 5.50 61523.06
I have done this easily in MATLAB with the commands:
ts_A=timeseries(ValueA,datenum(DateA));
ts_B=timeseries(ValueB,datenum(DateB));
res_A=resample(ts_A,datenum(DateB));
I have to do this for several sets of CSV files, so I want to automate the process with Python.
Thanks
You can concat the two DataFrames, interpolate, then reindex on the DataFrame you want.
I assume we have a certain number of DataFrames, where the Date is a DatetimeIndex in all of them. I will use two in this example, since you used two in the question, but the code will work for any number.
df_a:
df_b:
And I will put these into a dict for the example. You read them directly into a dict, so you don't need to do this step. I just want to show how my example dict is formatted. The dict keys don't matter; any valid dict key will work:
This gets us to where you currently are, with my dataList hopefully having the same format as yours.
The first thing you need to do is to combine the DataFrames. I use the dict keys as MultiIndex column names so you can keep track of which instance of a given column came from which DataFrame. You can do that like so:
This gives you a DataFrame like this:
Next, you need to interpolate to fill in the missing values. I interpolate using 'time' mode so it properly handles the time indexes:
This gives you a DataFrame like this:
I think generally it would be best to stop here, since you keep all data from all csv files. But you said you want only the time points from the longest csv. To get that, you need to find the longest DataFrame, and then get the rows corresponding to its indexes. Finding the longest DataFrame is easy: you just find the one with the maximum length. Keeping only the time points in that index is also easy: you just slice using that index (you use the loc method for slicing in this way).
This gives you the following final DataFrame:
This can be combined into one line if you want:
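The original snippet is not reproduced here, so the following is only a sketch of how the concat, interpolate and loc steps could be chained into a single expression. The function name merge_on_longest and the tiny demo frames are hypothetical and use made-up values purely for illustration; the sort_index call is an added precaution so that time interpolation always sees a monotonic index.

```python
import pandas as pd


def merge_on_longest(frames):
    """Hypothetical helper: merge a dict of time-indexed DataFrames,
    interpolate gaps in time, and keep only the rows whose timestamps
    belong to the longest input frame."""
    return (
        pd.concat(frames, axis=1)        # dict keys become the outer column level
        .sort_index()                    # precaution: keep timestamps in order
        .interpolate(method="time")      # fill gaps linearly in time
        .loc[max(frames.values(), key=len).index]
    )


# Made-up demo data (not from the question), just to exercise the helper.
idx_a = pd.DatetimeIndex(["2014-01-04 06:00:00", "2014-01-04 06:10:00"], name="Date")
idx_b = pd.DatetimeIndex(
    ["2014-01-04 05:55:00", "2014-01-04 06:05:00", "2014-01-04 06:15:00"], name="Date"
)
demo = {
    "A": pd.DataFrame({"Value": [100.0, 200.0]}, index=idx_a),
    "B": pd.DataFrame({"Value": [1.0, 2.0, 3.0]}, index=idx_b),
}
merged = merge_on_longest(demo)
print(merged)
```

Rows of the longest frame that fall before the first sample of another frame stay NaN in that frame's columns, which matches the leading NaN values in the desired output shown in the question.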
Or, perhaps a slightly clearer 4 lines:
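Spelled out step by step, the approach might look like the sketch below. It rebuilds the two sample frames from the question by hand instead of reading CSV files; the dict keys "A" and "B" and the names combined, filled, longest and result are my own choices, and sort_index is an added precaution rather than something the steps above require.

```python
import pandas as pd

# Sample rows copied from the question (DataFrame A and DataFrame B).
df_a = pd.DataFrame(
    {"Volume": [6062] * 7,
     "Value": [108000.0, 107200.0, 97400.0, 99200.0, 91300.0, 84100.0, 57000.0]},
    index=pd.DatetimeIndex(
        ["2014-01-04 06:00:02", "2014-01-04 06:06:05", "2014-01-04 06:12:07",
         "2014-01-04 06:18:10", "2014-01-04 06:24:12", "2014-01-04 06:30:14",
         "2014-01-04 06:36:17"], name="Date"),
)
df_b = pd.DataFrame(
    {"Volume": [6062] * 8,
     "Value": [4.7, 4.7, 4.9, 5.1, 5.2, 5.2, 5.5, 5.5]},
    index=pd.DatetimeIndex(
        ["2014-01-04 05:52:50", "2014-01-04 05:58:53", "2014-01-04 06:04:56",
         "2014-01-04 06:10:58", "2014-01-04 06:17:01", "2014-01-04 06:23:03",
         "2014-01-04 06:29:05", "2014-01-04 06:35:08"], name="Date"),
)
dataList = {"A": df_a, "B": df_b}  # stands in for the dict built from the CSV files

# The four steps: combine, interpolate, find the longest frame, slice.
combined = pd.concat(dataList, axis=1).sort_index()
filled = combined.interpolate(method="time")
longest = max(dataList.values(), key=len)
result = filled.loc[longest.index]
print(result)
```

The columns of result form a MultiIndex, so a single series is selected with a tuple such as result[("A", "Value")].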
I am not sure why my final results are different than what you show. I ran your example in MATLAB (R2015A) myself and got the same results as I get here, so I suspect you generated the final data with a different data set than the example.