Given two sets 'A' and 'B', how to create a set C = A minus B in pandas

Suppose I have two sets 'A' and 'B'. How do I create a set C = A minus B in pandas? Here A and B are dataframes. A has First name and Last name as a MultiIndex. B has an integer index, and First name and Last name are columns in B.

I tried converting the MultiIndex of A to a column with A['index']=A.index and then merging B and A, but it does not work.

A:

csv for A.csv

B:

csv for B.csv

The columns of B (F_name and L_name) form the MultiIndex of A.

I want as output all rows in A whose F_name and L_name do not exist in B. I have tried the following code:

A['index']=A.index

my_df=pd.merge(A,B,left_on=['F_name','L_name'],right_index=True,how='left')

ans_df=A[~A.index.isin(my_df.index)]

but len(ans_df) is the same as len(A), which is not correct. The length of ans_df should be less than that of A, since a few F_name and L_name pairs exist in B.

There are 2 answers

Abhishek Anand (BEST ANSWER)

Here are the dataframes A and B

import pandas as pd
import numpy as np

A
               Age  Gender
F_name  L_name      
Josh    Crammer 25  M
John    Smith   29  M
Mellisa Simpson 32  F
Ahemed  Khan    26  M
Frank   J       25  M
Charles Brown   26  M
William Gibson  26  M

B
    F_name  L_name
0   Josh    Crammer
2   Mellisa Simpson
4   Frank   J
5   Charles Brown
6   William Gibson

We can reset the index of A in place, which turns the index levels into ordinary columns:

A.reset_index(level=A.index.names, inplace=True)
A
    F_name  L_name  Age Gender
0   Josh    Crammer 25  M
1   John    Smith   29  M
2   Mellisa Simpson 32  F
3   Ahemed  Khan    26  M
4   Frank   J       25  M
5   Charles Brown   26  M
6   William Gibson  26  M

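All that remains is to negate an isin condition to fetch the rows we require. Note that this tests each column independently: a row is dropped whenever its F_name appears anywhere in B.F_name and its L_name appears anywhere in B.L_name, even if the two never occur together in one row of B.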

A[~((A.F_name.isin(B.F_name)) & (A.L_name.isin(B.L_name)))]
    F_name  L_name  Age Gender
1   John    Smith   29  M
3   Ahemed  Khan    26  M
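
Because the condition above checks each name column separately, a stricter variant compares whole (F_name, L_name) pairs. The following is a minimal sketch of one way to do that, a left merge with `indicator=True` used as an anti-join. It is an alternative technique, not the method used in this answer; the sample data is copied from the frames shown above, and the variable names `keys` and `merged` are chosen only for illustration.

```python
import pandas as pd

# Sample data copied from the answer above; A has a (F_name, L_name) MultiIndex.
A = pd.DataFrame(
    {"Age": [25, 29, 32, 26, 25, 26, 26],
     "Gender": ["M", "M", "F", "M", "M", "M", "M"]},
    index=pd.MultiIndex.from_tuples(
        [("Josh", "Crammer"), ("John", "Smith"), ("Mellisa", "Simpson"),
         ("Ahemed", "Khan"), ("Frank", "J"), ("Charles", "Brown"),
         ("William", "Gibson")],
        names=["F_name", "L_name"],
    ),
)
B = pd.DataFrame(
    {"F_name": ["Josh", "Mellisa", "Frank", "Charles", "William"],
     "L_name": ["Crammer", "Simpson", "J", "Brown", "Gibson"]},
    index=[0, 2, 4, 5, 6],
)

keys = ["F_name", "L_name"]  # illustrative name for the matching columns

# Left-merge A with the distinct key pairs of B. indicator=True adds a
# "_merge" column that says whether each row of A found a match in B.
merged = A.reset_index().merge(
    B[keys].drop_duplicates(), on=keys, how="left", indicator=True
)

# Keep only the rows of A with no matching pair in B, then restore the index.
C = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
C = C.set_index(keys)
print(C)
```

On this sample data the result contains the same two people as the isin version (John Smith and Ahemed Khan); the two approaches would differ only if B held a first name and a last name that match a row of A but come from different rows of B.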

Marjan Moderc

Solution using fake column

Disclaimer: Below is an example of the "fake column" approach, which may not be suitable for huge dataframes with many matching columns of complex types. Besides, I prefer to work with simple indexes and keep as much data as possible in columns rather than in indexes.

So, let's create two datasets: A will contain a few random Family Guy characters, and B will contain a few Family Guy family members. Hope you are familiar with this awesome TV series! :)

# Create a DF A with some Quahog Family guy citizens (with multiindex)
multiindexA = pd.MultiIndex.from_tuples([["Peter","Griffin"],["Glenn","Quagmire"],["Joe","Swanson"],["Cleveland","Brown"],["Brian","Griffin"],["Stewie","Griffin"],["Lois","Griffin"]],names=["Name","Surname"])
A=pd.DataFrame([40,35,38,45,8,2,35],index=multiindexA, columns=["Age"])
print(A)

                    Age
Name      Surname      
Peter     Griffin    40
Glenn     Quagmire   35
Joe       Swanson    38
Cleveland Brown      45
Brian     Griffin     8
Stewie    Griffin     2
Lois      Griffin    35


# Create a DF B with some Family Guy inner family members (with a simple index)
B = pd.DataFrame(data=[["Peter","Griffin",40],["Lois","Griffin",35],["Brian","Griffin",8],["Stewie","Griffin",2]], columns=["Name","Surname","Age"])
print(B)

     Name  Surname  Age
0   Peter  Griffin   40
1    Lois  Griffin   35
2   Brian  Griffin    8
3  Stewie  Griffin    2

Let's find the Family Guy characters that are not members of the Griffin family. Firstly, we will use reset_index to normalize dataframes into the same structure since this is going to make our lives much easier:

# Reset index to move multiindex into columns in order to normalize dataframes
A = A.reset_index()
print(A)

        Name   Surname  Age
0      Peter   Griffin   40
1      Glenn  Quagmire   35
2        Joe   Swanson   38
3  Cleveland     Brown   45
4      Brian   Griffin    8
5     Stewie   Griffin    2
6       Lois   Griffin   35

Since you are matching on two (or even more) columns, one (possibly dirty and memory-wasting) solution is to create a fake index column by combining the columns of interest into one with .apply(lambda x: ...). Keep in mind that you have to convert any non-string fields into strings with .astype(str):

#Create a new dummy column by merging all matching columns into one (in both dataframes!)
A["fake_index_col"]=A[["Name","Surname","Age"]].astype(str).apply(lambda x: "".join(x),axis=1)
B["fake_index_col"]=B[["Name","Surname","Age"]].astype(str).apply(lambda x: "".join(x),axis=1)

This adds a dummy column to both dataframes, in which all the matching data is compressed into a single field.

        Name   Surname  Age    fake_index_col
0      Peter   Griffin   40    PeterGriffin40
1      Glenn  Quagmire   35   GlennQuagmire35
2        Joe   Swanson   38      JoeSwanson38
3  Cleveland     Brown   45  ClevelandBrown45
4      Brian   Griffin    8     BrianGriffin8
5     Stewie   Griffin    2    StewieGriffin2
6       Lois   Griffin   35     LoisGriffin35

This allows you to apply the negation of isin to find the citizens of Quahog that are not Griffins. Finally, drop the fake column and/or recreate the multiindex to restore the initial state of the dataframe.

# Keep rows of A whose fake key is not in B, then drop the helper column
C = A[~A["fake_index_col"].isin(B["fake_index_col"])].drop(columns="fake_index_col")
print(C)



        Name   Surname  Age
1      Glenn  Quagmire   35
2        Joe   Swanson   38
3  Cleveland     Brown   45
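
If building a string column feels too dirty, the same row-wise comparison can be sketched without one. The example below is an alternative, not part of the approach above: it assumes a pandas version that provides `pd.MultiIndex.from_frame`, builds a temporary MultiIndex from the matching columns of each dataframe, and tests membership with `isin`. Since it compares tuples of values rather than concatenated strings, it also avoids the (here purely theoretical) risk of two different rows producing the same joined string. The names `keys` and `in_B` are illustrative.

```python
import pandas as pd

# Same sample data as above, already in "flat" form (after reset_index).
A = pd.DataFrame({
    "Name": ["Peter", "Glenn", "Joe", "Cleveland", "Brian", "Stewie", "Lois"],
    "Surname": ["Griffin", "Quagmire", "Swanson", "Brown",
                "Griffin", "Griffin", "Griffin"],
    "Age": [40, 35, 38, 45, 8, 2, 35],
})
B = pd.DataFrame({
    "Name": ["Peter", "Lois", "Brian", "Stewie"],
    "Surname": ["Griffin", "Griffin", "Griffin", "Griffin"],
    "Age": [40, 35, 8, 2],
})

keys = ["Name", "Surname", "Age"]  # columns that must all match

# Build a temporary MultiIndex from the key columns of each dataframe and
# check, row by row, whether A's (Name, Surname, Age) tuple occurs in B.
in_B = pd.MultiIndex.from_frame(A[keys]).isin(pd.MultiIndex.from_frame(B[keys]))

# in_B is a boolean array aligned with A's rows; negate it for "A minus B".
C = A[~in_B]
print(C)
```

No helper column is added to A or B, so there is nothing to delete afterwards.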