I'm working on a data base with some columns, and I drop duplicates after sorting values by date (format Y-m-d). My df is like the following :
id date name firstname
01 2020-04-01 max smith
04 2020-08-04 georges yellow
01 2020-05-31 smith
03 2020-02-24 emma blue
01 2020-01-02 maxime smith
02 2020-12-17 matt green
02 2020-04-05 matt
My code is the following :
liste_id= list(df['id'].drop_duplicates())
df_ind = pd.DataFrame()
for i in range(len(liste_id)):
df_int = df[df['id'] == liste_id[i]]
df_int = df_int.loc[:,].sort_values(['date'], ascending=True).drop_duplicates("id", keep='last')
df_ind = df_ind.append(df_int)
I obtain df_ind:
id date name firstname
01 2020-05-31 smith
02 2020-12-17 matt green
03 2020-02-24 emma blue
04 2020-08-04 georges yellow
But my aim is to complete my results. I've got some NaN in the most recent values, and I want to replace them with the data of the older rows. For the first id of my example, I want to complete my row of the 2020-05-31 with the firstname "max', because it's the firstname completed most recently (2020-04-01). Maxime is completed but the date is older (2020-01-02). I want df_ind to look like this:
id date name firstname
01 2020-05-31 max smith
02 2020-12-17 matt green
03 2020-02-24 emma blue
04 2020-08-04 georges yellow
My problematic is that I have a lot of rows (more than 600 000...) so did anybody have an idea to reach my goal ? And optimise the time of running ?
Thanks in advance and sorry for my bad english..!
You can optimize a little your code by removing the for loop. I think it is a good idea to use
fillna(method='bfill')
as you mentioned it in your comment. You can do something like that :