How to remove duplicate rows in pandas with multiple conditions

import pandas as pd

df = pd.DataFrame(
    [
        ['China', 'L', '08/06/2022 20:00', '08/10/2022 20:00'],
        ['China', 'L', '8/13/2022 00:54', '8/14/2022 00:54'],
        ['China', 'M', '8/14/2022 00:54', '8/14/2022 12:54'],
        ['United Kingdom', 'L', '8/27/2022 06:36', '8/31/2022 21:08'],
        ['United Kingdom', 'L', '9/01/2022 21:08', '09/02/2022 21:38'],
        ['China', 'D', '09/04/2022 21:38', '09/06/2022 21:38']
    ],
    columns=['Country', 'Function', 'Arrival', 'Departure']
)

In this case, I want to collapse each run of consecutive duplicate rows into a single row, keeping the first row's arrival time and replacing the departure time with that of the last duplicate, under two conditions:

  1. Do not remove duplicates that are not consecutive.
  2. If the 'Function' column changes, do not treat the rows as duplicates even if they are consecutive.

So it should look like this:

df = pd.DataFrame(
    [
        ['China', 'L', '08/06/2022 20:00', '8/14/2022 00:54'],
        ['China', 'M', '8/14/2022 00:54', '8/14/2022 12:54'],
        ['United Kingdom', 'L', '8/27/2022 06:36', '09/02/2022 21:38'],
        ['China', 'D', '09/04/2022 21:38', '09/06/2022 21:38']
    ],
    columns=['Country', 'Function', 'Arrival', 'Departure']
)

There is 1 answer

mozway

You can use groupby.idxmax to select, for each Country/Function group, the row with the latest Departure:

idx = (pd.to_datetime(df['Departure'])
         .groupby([df['Country'], df['Function']], sort=False)
         .idxmax()
       )

out = df.loc[idx]

Output:

          Country Function           Arrival         Departure
1           China        L   8/13/2022 00:54   8/14/2022 00:54
2           China        M   8/14/2022 00:54   8/14/2022 12:54
4  United Kingdom        L   9/01/2022 21:08  09/02/2022 21:38
5           China        D  09/04/2022 21:38  09/06/2022 21:38
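
Note that the idxmax output above keeps the Arrival of the last row in each group, while the expected result in the question keeps the first Arrival, and grouping on Country and Function alone would also merge duplicates that are not consecutive. A possible alternative, offered as a sketch rather than as the answerer's method, is to label each run of consecutive rows and aggregate per run. It assumes the rows are already in chronological order, so that the last Departure in a run is also the latest one; the names keys and run_id are made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['China', 'L', '08/06/2022 20:00', '08/10/2022 20:00'],
        ['China', 'L', '8/13/2022 00:54', '8/14/2022 00:54'],
        ['China', 'M', '8/14/2022 00:54', '8/14/2022 12:54'],
        ['United Kingdom', 'L', '8/27/2022 06:36', '8/31/2022 21:08'],
        ['United Kingdom', 'L', '9/01/2022 21:08', '09/02/2022 21:38'],
        ['China', 'D', '09/04/2022 21:38', '09/06/2022 21:38']
    ],
    columns=['Country', 'Function', 'Arrival', 'Departure']
)

# A new run starts whenever Country or Function differs from the previous row.
keys = df[['Country', 'Function']]
run_id = keys.ne(keys.shift()).any(axis=1).cumsum()

# Per run: keep the first Arrival and the last Departure.
# Assumption: rows are sorted chronologically within each run.
out = (
    df.groupby(run_id, sort=False)
      .agg({'Country': 'first', 'Function': 'first',
            'Arrival': 'first', 'Departure': 'last'})
      .reset_index(drop=True)
)

print(out)
```

Because run_id only increases when a row differs from the one directly before it, a later China/L row that is separated from the first two by other rows would start its own run and stay in the result.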