Python - How to extract subset under first row condition

47 views Asked by At

I'm looking for a way in order to take subset with a condition on the first row obtain by a groupby, let's see an example to be more clear:

    ID         Date             Diff     Class
 0  010101     11/12/2008       NaN      Good
 1  010102     02/07/2008       NaN      Very Good
 2  010102     08/07/2008       6.0      Bad
 3  010103     12/02/2009       NaN      Very Bad
 4  010104     07/05/2007       NaN      Very Good

So I'm looking for a way to obtain the following subset, due to the first class per ID, here, Very Good for the first time it appears:

   ID         Date             Diff     Class
   010102     02/07/2008       NaN      Very Good
   010102     08/07/2008       6.0      Bad
   010104     07/05/2007       NaN      Very Good

Thanks for your help,

import pandas as pd

df = pd.DataFrame(
    [
        {"ID": 010101, "Date": '11/12/2008', "Class" : "Good"},
        {"ID": 010102, "Date": '02/07/2008', "Class" : "Very Good"},
        {"ID": 010102, "Date": '08/07/2008', "Diff": 6.0 , "Class" : "Bad"},
        {"ID": 010103, "Date": '12/02/2009', "Class" : "Very Bad"},
        {"ID": 010104, "Date": '07/05/2007', "Class" : "Very Good"},
    ],
    columns=["ID", "Date", "Diff", "Class"],
)
1

There are 1 answers

1
asongtoruin On BEST ANSWER

Based on comments, it seems as though you want to filter to any ID where the first dated entry for that ID is "Very Good". I'm going to assume your dataframe is called df.

First, we should convert your Date column to a proper pandas datetime, to ensure it is properly ordered (I am assuming your dates are dd/mm/yyyy here):

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

Then, we sort by ID and Date, group on ID and return the first row we get (which, due to our sorting, will be the first date for that ID).

first_entry = df.sort_values(by=['ID', 'Date']).groupby('ID', as_index=False).first()

From this, we can work out which of the first rows has a Class of Very Good, and select those ID values:

vg = first_entry[first_entry['Class'] == 'Very Good']['ID']

And finally, we can use those ID values to filter the original dataframe.

matches = df[df['ID'].isin(vg)]

This gives us:

      ID       Date  Diff      Class
1  10102 2008-07-02   NaN  Very Good
2  10102 2008-07-08   6.0        Bad
4  10104 2007-05-07   NaN  Very Good