How to handle unordered sets of tags in Python Pandas Dataframe in terms of merging

68 views Asked by At

I have two Dataframes which contain tags in one of the columns. Tags can be present or absent in the list and are stored in an unordered manner. I would like to get the merge of the two datasets with rows that have the same tags. In plain python I can store them as the set datatype.

tags1 = {"tag1","tag2"}
tags2 = {"tag2","tag1"}
tags1 == tags2
True

However I cant do the same in Pandas since sets are not hashable.

df1 = pd.DataFrame([{'name': 'foo', 'tags': {"tag1","tag2"}}, {'name': 'bar', 'tags': {"tag3","tag4"}}])
df2 = pd.DataFrame([{'name': 'baz', 'tags': {"tag4","tag3"}}, {'name': 'qux', 'tags': {"tag5","tag6"}}])
df1.merge(df2, how="inner", indicator=True, on="tags")

What I would like to see

     name_x  value           name_y _merge
0    bar     {"tag3","tag4"} baz    both

What I get

TypeError: unhashable type: 'set'

What would be the best approach here? Unfortunatelly I cant define a fixed amount of columns since the quantity of the tags is not fixed.

1

There are 1 answers

0
jezrael On BEST ANSWER

Use frozensets and assign to left_on and right_on parameters if need original sets columns:

out = (df1.merge(df2, how="inner", indicator=True, 
                left_on=df1['tags'].apply(frozenset),
                right_on=df2['tags'].apply(frozenset)))
print (out)
          key_0 name_x        tags_x name_y        tags_y _merge
0  (tag3, tag4)    bar  {tag3, tag4}    baz  {tag3, tag4}   both

Or assign frozensets to tags columns:

out = (df1.assign(tags=df1['tags'].apply(frozenset))
          .merge(df2.assign(tags=df2['tags'].apply(frozenset)), 
                how="inner", 
                indicator=True, 
                on="tags"))
print (out)
  name_x          tags name_y _merge
0    bar  (tag3, tag4)    baz   both