Drop duplicates and complete nan with oldest values and optimise runing time

78 views Asked by At

I'm working on a data base with some columns, and I drop duplicates after sorting values by date (format Y-m-d). My df is like the following :

id    date         name     firstname
01    2020-04-01   max      smith
04    2020-08-04   georges  yellow
01    2020-05-31            smith
03    2020-02-24   emma     blue
01    2020-01-02   maxime   smith
02    2020-12-17   matt     green
02    2020-04-05   matt       

My code is the following :

liste_id= list(df['id'].drop_duplicates())
df_ind = pd.DataFrame()
for i in range(len(liste_id)):
    df_int = df[df['id'] == liste_id[i]]
    df_int = df_int.loc[:,].sort_values(['date'], ascending=True).drop_duplicates("id", keep='last')
    df_ind = df_ind.append(df_int)

I obtain df_ind:

id    date         name     firstname
01    2020-05-31            smith
02    2020-12-17   matt     green
03    2020-02-24   emma     blue
04    2020-08-04   georges  yellow

But my aim is to complete my results. I've got some NaN in the most recent values, and I want to replace them with the data of the older rows. For the first id of my example, I want to complete my row of the 2020-05-31 with the firstname "max', because it's the firstname completed most recently (2020-04-01). Maxime is completed but the date is older (2020-01-02). I want df_ind to look like this:

id    date         name     firstname
01    2020-05-31   max      smith
02    2020-12-17   matt     green
03    2020-02-24   emma     blue
04    2020-08-04   georges  yellow

My problematic is that I have a lot of rows (more than 600 000...) so did anybody have an idea to reach my goal ? And optimise the time of running ?

Thanks in advance and sorry for my bad english..!

1

There are 1 answers

3
Pierre-Loic On

You can optimize a little your code by removing the for loop. I think it is a good idea to use fillna(method='bfill') as you mentioned it in your comment. You can do something like that :

df_ind = df.sort_values(by=["firstname", "date"], ascending=[True, False]).fillna(method="bfill")
df_ind = (df_ind.sort_values(by=["id", "date"], ascending=[True, False])
            .drop_duplicates(subset=["id"]))