Drop duplicates and complete nan with oldest values and optimise runing time

Question

Drop duplicates and complete nan with oldest values and optimise runing time

76 views Asked by mathilde At 17 December 2020 at 11:00

I'm working on a data base with some columns, and I drop duplicates after sorting values by date (format Y-m-d). My df is like the following :

id    date         name     firstname
01    2020-04-01   max      smith
04    2020-08-04   georges  yellow
01    2020-05-31            smith
03    2020-02-24   emma     blue
01    2020-01-02   maxime   smith
02    2020-12-17   matt     green
02    2020-04-05   matt

My code is the following :

liste_id= list(df['id'].drop_duplicates())
df_ind = pd.DataFrame()
for i in range(len(liste_id)):
    df_int = df[df['id'] == liste_id[i]]
    df_int = df_int.loc[:,].sort_values(['date'], ascending=True).drop_duplicates("id", keep='last')
    df_ind = df_ind.append(df_int)

I obtain df_ind:

id    date         name     firstname
01    2020-05-31            smith
02    2020-12-17   matt     green
03    2020-02-24   emma     blue
04    2020-08-04   georges  yellow

But my aim is to complete my results. I've got some NaN in the most recent values, and I want to replace them with the data of the older rows. For the first id of my example, I want to complete my row of the 2020-05-31 with the firstname "max', because it's the firstname completed most recently (2020-04-01). Maxime is completed but the date is older (2020-01-02). I want df_ind to look like this:

id    date         name     firstname
01    2020-05-31   max      smith
02    2020-12-17   matt     green
03    2020-02-24   emma     blue
04    2020-08-04   georges  yellow

My problematic is that I have a lot of rows (more than 600 000...) so did anybody have an idea to reach my goal ? And optimise the time of running ?

Thanks in advance and sorry for my bad english..!

Original Q&A

There are 1 answers

**Pierre-Loic** · Answer 1 · 2020-12-17T14:09:04+00:00

You can optimize a little your code by removing the for loop. I think it is a good idea to use fillna(method='bfill') as you mentioned it in your comment. You can do something like that :

df_ind = df.sort_values(by=["firstname", "date"], ascending=[True, False]).fillna(method="bfill")
df_ind = (df_ind.sort_values(by=["id", "date"], ascending=[True, False])
            .drop_duplicates(subset=["id"]))

TechQA.

Drop duplicates and complete nan with oldest values and optimise runing time

There are 1 answers

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in DROP-DUPLICATES

Popular Questions

Popular Tags

Trending Questions