How to transform wide date data to long format in Pandas


I have a large weather dataset in this format:

'Daily Mean Temp for place name 2015'   # One table title per year
'Day'  'JAN'  'FEB'  'MAR'  ...  'DEC'
 1      23     26     21    ...   14
 2      20     30     22    ...   12
 3      26     27     22    ...   16
 ...    ...    ...    ...   ...   ...
 31     28     -      19    ...   11

And I want to get it into this format:

'date'         'mean_temp'
2015-01-01      23
2015-01-02      20
2015-01-03      26

I have been unable to find a solution and would appreciate any ideas.


There are 2 answers

Answer by Rostyslav (best answer):

First convert the month column names to title case ('JAN' becomes 'Jan', matching the %b format used below), then unpivot the DataFrame:

import pandas as pd

df.columns = [col.title() for col in df.columns]
df_unpivot = df.melt(id_vars=["Day"], var_name="month", value_name="mean_temp")

Then add a new column with the date (if the date does not exist, e.g. 31 February, it becomes NaT):

df_unpivot['date'] = pd.to_datetime(df_unpivot["Day"].map(str) + "-" + df_unpivot["month"] + "-2015", format='%d-%b-%Y', errors="coerce")
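
To illustrate what errors="coerce" does in the line above, here is a minimal sketch. The three sample strings are hypothetical values in the same day-month-year layout, not data from the question:

```python
import pandas as pd

# Hypothetical sample strings in the "day-Mon-year" layout built above.
raw = pd.Series(["28-Feb-2015", "31-Feb-2015", "31-Mar-2015"])

# errors="coerce" turns impossible dates (31 February) into NaT
# instead of raising an exception.
parsed = pd.to_datetime(raw, format="%d-%b-%Y", errors="coerce")
print(parsed)
```

Only the middle entry should come back as NaT, which is why the later dropna() call removes the padding rows of the shorter months.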

Drop the unneeded columns and the invalid dates:

df_unpivot.drop(["Day", "month"], axis=1, inplace=True)
df_unpivot.dropna(inplace=True)

Set the date as the index:

df_unpivot.set_index("date", inplace=True)
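
The steps above can be collected into one helper that takes the year as a parameter instead of hard-coding 2015. This is only a sketch: the function name wide_to_long, the MONTHS list, and the sample table are assumptions for illustration, and the pd.to_numeric call is an extra step (not in the answer) for turning a "-" placeholder into NaN.

```python
import numpy as np
import pandas as pd

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def wide_to_long(df: pd.DataFrame, year: int) -> pd.DataFrame:
    """Melt one year's Day x month table into a date-indexed frame."""
    tidy = df.rename(columns=str.title)  # 'JAN' -> 'Jan', 'Day' stays 'Day'
    tidy = tidy.melt(id_vars=["Day"], var_name="month", value_name="mean_temp")
    # Build strings such as "1-Jan-2015"; impossible dates become NaT.
    tidy["date"] = pd.to_datetime(
        tidy["Day"].astype(str) + "-" + tidy["month"] + f"-{year}",
        format="%d-%b-%Y",
        errors="coerce",
    )
    # Assumption: placeholders such as "-" should become NaN.
    tidy["mean_temp"] = pd.to_numeric(tidy["mean_temp"], errors="coerce")
    return (
        tidy.dropna(subset=["date"])      # drop only the impossible dates
        .drop(columns=["Day", "month"])
        .set_index("date")
        .sort_index()
    )


# Hypothetical sample table: 31 rows, one column per month.
sample = pd.DataFrame(
    {"Day": range(1, 32), **{m: np.arange(31) + i for i, m in enumerate(MONTHS)}}
)
long_2015 = wide_to_long(sample, 2015)
print(long_2015.head())
```

Unlike the plain dropna() above, this version drops rows only when the date is invalid, so a genuinely missing reading on a real date is kept as NaN.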

Answer by PieCot:

In a single expression:

pd.concat([
    pd.concat((pd.Series(pd.date_range(start=f'{y}/{m}', end=pd.Timestamp(f'{y}/{m}') + pd.offsets.MonthEnd(0), freq='D'), name='Day'),
               df[c].rename('mean_temp')[:pd.Period(f'{y}/{m}').days_in_month]), axis=1)
    for y, df in sorted(temp_dfs.items())
    for m, c in enumerate(['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC'], start=1)
], axis=0)

Here is the result:

         Day  mean_temp
0  2015-01-01           16
1  2015-01-02           29
2  2015-01-03           33
3  2015-01-04           28
4  2015-01-05           17
..        ...          ...
26 2019-12-27           32
27 2019-12-28           39
28 2019-12-29           -2
29 2019-12-30           39
30 2019-12-31            1

The code above assumes that temp_dfs is a dict that maps each available year to the DataFrame for that year. I used this code to produce a sample dict:

import pandas as pd
import numpy as np

temp_dfs = {
    y: pd.DataFrame(
        data=np.column_stack((np.arange(1, 32), np.random.randint(-3, 40, (31, 12)))),
        columns=['Day', 'JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
    ) for y in range(2015, 2020)
}
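
For the same temp_dfs layout, a melt-based alternative is also possible. The sketch below rebuilds a sample dict so that it runs on its own; the names stacked, long_df and result are illustrative, and the seeded generator is an assumption made only for reproducibility:

```python
import numpy as np
import pandas as pd

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# Sample dict in the same shape as above: year -> 31 x 13 table.
rng = np.random.default_rng(0)
temp_dfs = {
    y: pd.DataFrame(
        np.column_stack((np.arange(1, 32), rng.integers(-3, 40, (31, 12)))),
        columns=["Day"] + MONTHS,
    )
    for y in range(2015, 2020)
}

# Stack the yearly tables, keeping the dict key as a 'year' column.
stacked = pd.concat(temp_dfs, names=["year"]).reset_index(level="year")

# Unpivot the month columns and replace month names with month numbers.
long_df = stacked.melt(id_vars=["year", "Day"], var_name="month", value_name="mean_temp")
long_df["month"] = long_df["month"].map({m: i for i, m in enumerate(MONTHS, start=1)})

# pd.to_datetime can assemble dates from year/month/day columns;
# impossible dates such as 30 February become NaT and are dropped.
long_df["date"] = pd.to_datetime(
    long_df[["year", "month", "Day"]].rename(columns={"Day": "day"}),
    errors="coerce",
)
result = (
    long_df.dropna(subset=["date"])
    .sort_values("date")[["date", "mean_temp"]]
    .reset_index(drop=True)
)
print(result)
```

This yields one row per calendar day across all years, with a continuous index rather than one that restarts every month.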