Python df add rows by date, so each group ends on the same date. Ffill remaining rows

230 views Asked by At

to use the geo-plot animation frame I want all my groups to end on the same date. This would avoid the last frame to grey out certain countries. Currently, the newest data point according to date is 'Timestamp('2021-05-13 00:00:00')'.

So in the next step I want to add new rows according to all countries so that they have rows until the newest date in the df. Columns 'people_vaccinated_per_hundred' and 'people_fully_vaccinated_per_hundred' can be populated using ffill.

Data: Data Set

So ideally, if e.g. norway has 1 day less than the newest data point '2021-05-13', then it should add a new row as shown below. And this should be done for all other countries in the df.

Example

    country iso_code    date    people_vaccinated_per_hundred   people_fully_vaccinated_per_hundred
12028   Norway  NOR 2021-05-02  0.00    NaN
12029   Norway  NOR 2021-05-03  0.00    NaN
12188   Norway  NOR ...         ...     ...
12188   Norway  NOR 2021-05-11  27.81   9.55
12189   Norway  NOR 2021-05-12  28.49   10.42

Add new row
12189   Norway  NOR 2021-05-13  28.49   10.42
2

There are 2 answers

1
richardjmorton On BEST ANSWER

A blunt approach to this might be to create a cartesian product of the countries and dates, then join on this to create empty values for every missing combination of dates and country.

countries = df.loc[:, ['country', 'iso_code']].drop_duplicates()
dates = df.loc[:, 'date'].drop_duplicates()
all_countries_dates = countries.merge(dates, how='cross')

df.merge(all_countries_dates, how='right', on=['country', 'iso_code', 'date'])

With a dataset like:

country       iso_code  date        people_vaccinated   people_fully_vaccinated
Norway        NOR       2021-05-09  0.00                1.00
Norway        NOR       2021-05-10  0.00                3.00
Norway        NOR       2021-05-11  27.81               9.55
Norway        NOR       2021-05-12  28.49               10.42
Norway        NOR       2021-05-13  28.49               10.42
United States USA       2021-05-09  23.00               3.00
United States USA       2021-05-10  23.00               3.00

This transformation would give you:

country       iso_code  date        people_vaccinated   people_fully_vaccinated
Norway        NOR       2021-05-09  0.00                1.00
Norway        NOR       2021-05-10  0.00                3.00
Norway        NOR       2021-05-11  27.81               9.55
Norway        NOR       2021-05-12  28.49               10.42
Norway        NOR       2021-05-13  28.49               10.42
United States USA       2021-05-09  23.00               3.00
United States USA       2021-05-10  23.00               3.00
United States USA       2021-05-11  NaN                 NaN
United States USA       2021-05-12  NaN                 NaN
United States USA       2021-05-13  NaN                 NaN

After this you could use fillna to change the empty values for the added rows.

0
Sebastian ten Berge On

Code for the cross join in versions older than pandas 1.1.5

#creating a df with all unique countries and iso_codes
#creating a new table with all the dates in the original dataframe
countries = animation_covid_df.loc[:, ['country', 'iso_code']].drop_duplicates()
dates_df = animation_covid_df.loc[:, ['date']].drop_duplicates()

#creating an index called row number to later merge the dates table with the countries table on
dates_df['row_number'] = dates_df.reset_index().index

number_of_dates = dates_df.max() #shows the number of dates or rows in the the dates table

#creating an equivalent number of rows for each country as there are dates in the dates_df 
indexed_country = countries.append([countries]*number_of_dates[1],ignore_index=True)
indexed_country = indexed_country.sort_values(['country', 'iso_code'], ascending=True)
#creating a new column called 'row_number' to join the indexed_country df with the dates_df
indexed_country['row_number'] = indexed_country.groupby(['country', 'iso_code']).cumcount()+1

#merging all the indexed countries with all the possible dates on the row number
indexed_country_date_df = indexed_country.merge(dates_df, on='row_number', how='left', suffixes=('_1', '_2'))

#setting the 'date' column in both tables to datetime so they can be merged on
animation_covid_df['date'] = pd.to_datetime(animation_covid_df['date'])
indexed_country_date_df['date'] = pd.to_datetime(indexed_country_date_df['date'])