I have weather data with the following columns; the first 3 rows look like this:
| date | hour | city | condition | snow | rain |
|---|---|---|---|---|---|
| 2023-01-30 | 3 | berlin | snow | 1 | 0 |
| 2023-01-30 | 6 | berlin | rain | 0 | 1 |
| 2023-01-30 | 9 | berlin | clear | 0 | 0 |
I want to write code that creates rows for the missing hours and fills them with the values from the closest hour for the same city and date. The resulting dataframe should look like this:
| date | hour | city | condition | snow | rain |
|---|---|---|---|---|---|
| 2023-01-30 | 3 | berlin | snow | 1 | 0 |
| 2023-01-30 | 4 | berlin | snow | 1 | 0 |
| 2023-01-30 | 5 | berlin | snow | 1 | 0 |
| 2023-01-30 | 6 | berlin | rain | 0 | 1 |
| 2023-01-30 | 7 | berlin | rain | 0 | 1 |
| 2023-01-30 | 8 | berlin | rain | 0 | 1 |
| 2023-01-30 | 9 | berlin | clear | 0 | 0 |
| 2023-01-30 | 10 | berlin | clear | 0 | 0 |
| 2023-01-30 | 11 | berlin | clear | 0 | 0 |
Note: I have many cities and many rows.
I tried this, but it didn't give the right result, and it is not efficient for a large number of rows (cities and hours):
df_expanded = df.set_index(['date', 'city', 'condition'])\
.hour.unstack().reset_index().melt(id_vars=['date', 'city', 'condition'], value_name='hour')\
.dropna()\
.drop(columns=['variable'])
df_expanded = df_expanded.sort_values(by=['date', 'city', 'condition', 'hour'])\
.ffill()
result = df_expanded.merge(df, on=['date', 'city', 'condition', 'hour'], how='left')\
.dropna()\
.drop_duplicates()
I am open to easier and simpler solutions.
It is easiest to `ffill` the missing data like below, but I will also try to think of a solution for the closest time.
If you want the original columns of date and hour, then add the following
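
For illustration, here is a minimal sketch of the forward-fill idea, not necessarily the exact code this answer had in mind. It assumes each city has at most one row per date and hour, and that the result should extend two hours past the last observation, as in the expected table. The helper column `ts` and the parameter `tail_hours` are hypothetical names introduced for this example.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "date": ["2023-01-30"] * 3,
    "hour": [3, 6, 9],
    "city": ["berlin"] * 3,
    "condition": ["snow", "rain", "clear"],
    "snow": [1, 0, 0],
    "rain": [0, 1, 0],
})

# Assumption: how many hours to extend past the last observation per city.
tail_hours = 2

# Combine date and hour into a single timestamp (hypothetical helper column).
df["ts"] = pd.to_datetime(df["date"]) + pd.to_timedelta(df["hour"], unit="h")

parts = []
for _, g in df.groupby("city", sort=False):
    g = g.set_index("ts").sort_index()
    # Build the complete hourly index for this city.
    full_index = pd.date_range(
        g.index.min(),
        g.index.max() + pd.Timedelta(hours=tail_hours),
        freq=pd.Timedelta(hours=1),
    )
    # Insert the missing hours as empty rows, then carry the last
    # observed values forward into them.
    parts.append(g.reindex(full_index).ffill())

result = pd.concat(parts).rename_axis("ts").reset_index()

# Restore the original date and hour columns from the timestamp.
result["date"] = result["ts"].dt.strftime("%Y-%m-%d")
result["hour"] = result["ts"].dt.hour
# Reindexing introduces NaN, which turns integer columns into floats.
result[["snow", "rain"]] = result[["snow", "rain"]].astype(int)
result = result[["date", "hour", "city", "condition", "snow", "rain"]]
```

Because the timestamp combines date and hour, gaps that cross midnight are filled as well. To take the closest observation instead of the previous one, `g.reindex(full_index, method="nearest")` could replace the `reindex(...).ffill()` step.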