Drop duplicates in lists within dataframes in Python


I have a dataframe that I have grouped by textbook ISBN, with the schools, states and grades that those books are used in collected into lists. I want to remove the duplicates within the lists in the dataframe. I have tried the steps shown in the screenshots for the State column as a test, but I'm not sure whether each cell is a list, a dataframe or a series, as I tried a number of code snippets to see if any would work. I was wondering if someone can explain the structure of these "lists" within a dataframe, and suggest code to drop the duplicates. (Screenshots of steps 1 through 4 not reproduced.)
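For context, a grouped frame like the one described might be built along the following lines; the column names ISBN, School, State and Grade and the values are assumptions standing in for the actual data:

import pandas as pd

# Hypothetical raw data resembling the description (names and values assumed)
books = pd.DataFrame(
    {
        "ISBN": ["978-0-1", "978-0-1", "978-0-1", "978-0-2"],
        "School": ["A", "B", "A", "C"],
        "State": ["NY", "NY", "NY", "CA"],
        "Grade": [9, 10, 9, 11],
    }
)

# Grouping by ISBN and collecting the other columns into lists
# produces list-valued cells, which can contain duplicates
grouped = books.groupby("ISBN").agg(list)
print(grouped)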


1 Answer

Tanishq Chaudhary

The df['State'] column is a <class 'pandas.core.series.Series'>. But each element of this series is a list, since you converted it to one during aggregation. Therefore, when you .apply() a lambda on df['State'], it sees each x as a plain Python list.
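You can see this concretely with a minimal sketch (made-up data standing in for the question's State column):

import pandas as pd

# A column whose cells are already lists, as after the aggregation
df = pd.DataFrame({"State": [["NY", "NY", "CA"], ["TX"]]})

print(type(df["State"]))          # <class 'pandas.core.series.Series'>
print(type(df["State"].iloc[0]))  # <class 'list'> -- each cell is a list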

You can .apply() lambda x: list(set(x)) instead of lambda x: x.drop_duplicates(). It will do the same job of removing duplicates; .drop_duplicates() fails here because it is a Series/DataFrame method and each x is a plain list, while set() works on any iterable.

A runnable example:

import pandas as pd

df = pd.DataFrame(
    {
        "val": [1, 1, 2, 3, 4, 3, 2],
        "data": ["X", "Y", "X", "X", "X", "X", "X"],
    }
)

# Collect the "data" values for each "val" into a list,
# mirroring the aggregation described in the question
df = df.groupby(["val"]).agg(lambda x: x.tolist())

# The column itself is a Series, but each cell is a plain Python list
print(type(df["data"]))

# Deduplicate each list with set(); note that set() may reorder elements
print(df["data"].apply(lambda x: list(set(x))))

Output:

<class 'pandas.core.series.Series'>
val
1    [Y, X]
2       [X]
3       [X]
4       [X]
Name: data, dtype: object
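
One caveat: set() does not guarantee the original element order. If order matters, a lambda using dict.fromkeys (insertion-ordered in Python 3.7+) drops duplicates while keeping the first occurrence; a minimal sketch:

import pandas as pd

s = pd.Series([["Y", "X", "Y", "X"], ["X", "X"]])

# dict.fromkeys keeps first-seen order while removing duplicates
print(s.apply(lambda x: list(dict.fromkeys(x))))
# 0    [Y, X]
# 1       [X]
# dtype: object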