I have a data frame with tens of thousands of rows and a thousand columns, along the lines of the following.
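A minimal synthetic stand-in (hourly timestamps as the index, sparse columns with column-specific gaps; the column names and values here are made up):

    import numpy as np
    import pandas as pd

    idx = pd.date_range("2008-01-01", periods=6, freq="h")
    dummydata = pd.DataFrame(
        {
            "A": [np.nan, 1.0, np.nan, 2.0, np.nan, 3.0],
            "B": [np.nan, np.nan, 4.0, np.nan, np.nan, np.nan],
            "C": [np.nan] * 6,  # some columns have no values yet
        },
        index=idx,
    )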
For an LSTM, I would like to
- extract only the non-NaN values per column,
- stack them together at the top of the DataFrame, newest value first, and
- pad with 0 for the period before the values started, up to index 99, so that every column has exactly 100 rows.

For example, a column whose non-NaN values are 2.0 and then 3.0 should become [3.0, 2.0, 0, 0, ..., 0] with 100 entries.
Note, however, that not every column has the same number of values: some already have many, some have none yet, and the timestamps at which the values appear are column specific. I did achieve the desired result with the code below, but it is really slow (an estimated 700 hours), because I want to compute this result for every hourly timestamp from 2008 to 2020, which is more than 100,000 timestamps, each one running the per-column loop over all 1,000 columns.
Is there any way to make the code significantly faster?
    import pandas as pd

    # i = cutoff timestamp, T = 100 = window length (both set in the outer per-timestamp loop)
    df1 = pd.DataFrame(index=range(100), columns=dummydata.columns)
    for j in dummydata.columns:
        # last T non-NaN values up to timestamp i, newest first
        df1[j] = dummydata.loc[dummydata.index <= i, j].dropna().iloc[-T:].iloc[::-1].reset_index(drop=True)
    df1 = df1.fillna(0)  # slots before the values started become 0
Can you try this to see if it is faster?
    # stack each column's non-NaN values at the top; NaN slots below become 0
    result = dummydata.apply(lambda x: pd.Series(x.dropna().values)).fillna(0)
Then you can keep only the first 100 rows using

    # .iloc is end-exclusive; .loc[0:100] would include label 100 and return 101 rows
    result.iloc[:100]
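Note that, unlike the loop in the question, this keeps each column's values oldest first and does not restrict the data to timestamps up to a cutoff i. If you need the exact output of the original loop (values up to i, newest first, always 100 rows), the same idea extends to that. A sketch, where the helper name window_at is made up and T = 100 as in the question:

    import pandas as pd

    T = 100  # window length from the question

    def window_at(dummydata, i):
        # keep only the history up to the cutoff timestamp i
        hist = dummydata[dummydata.index <= i]
        # per column: last T non-NaN values, newest first
        out = hist.apply(lambda s: pd.Series(s.dropna().to_numpy()[-T:][::-1]))
        # always return exactly T rows, 0-padded below the values
        return out.reindex(range(T)).fillna(0)

This still loops over the columns inside apply, so for all hourly timestamps from 2008 to 2020 it may additionally be worth precomputing each column's non-NaN values and their timestamps once, instead of calling dropna in every window.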