How can I merge two dataframes based on multiple columns with fuzzy string matching?

38 views Asked by At

I have two dataframes of which a subset is shown below:

df1:


        bo_beheerobjectsoort    IH_onderdeel_naam   IH_materiaal
1916    Duiker                  Voegovergang        bitumen
1917    Duiker                  Voegovergang        bitumen
1918    Duiker                  Voegovergang        staal
1919    Duiker                  Voegovergang        nan
1920    Duiker                  Voegovergang        staal geconserveerd

df2:

    GUID Objecttype (Element) sam.          Objecttype (Element) sam.                                       Materiesoort    GUID Objecttype (Element)                Objecttype (Element)   GUID Objecttype (Beheerobject)          Objecttype (Beheerobject)
12  196ed841-8791-ed11-b699-001dd8d7027d    Voegovergangsysteem|kunststof,-|flexibele voegovergang 4.2c     kunststof       f26cd841-8791-ed11-b699-001dd8d7027d    Voegovergangsysteem     39e6e058-e301-ed11-b691-001dd8d70290    Duiker
13  976ed841-8791-ed11-b699-001dd8d7027d    Voegovergangsysteem|bitumen,-|integraalvoegovergang 6.1         bitumen         f26cd841-8791-ed11-b699-001dd8d7027d    Voegovergangsysteem     39e6e058-e301-ed11-b691-001dd8d70290    Duiker
14  c16ed841-8791-ed11-b699-001dd8d7027d    Voegovergangsysteem|staal,-|open voegovergang                   staal           f26cd841-8791-ed11-b699-001dd8d7027d    Voegovergangsysteem     39e6e058-e301-ed11-b691-001dd8d70290    Duiker


I want to merge the two dataframes based on multiple columns. For df1 this would be on: bo_beheerobjectsoort, IH_onderdeel_naam, IH_materiaal. For df 2 this would be on: Objecttype (Beheerobject), Objecttype (Element), Materiesoort.

The problem is that the values in the columns are not completely similar, resulting in a no match. Especially when mergeing IH_onderdeel_naam, IH_materiaal and Objecttype (Element), Materiesoort. Also when no value is stated in a row for column "IH_materiaal" the corresponding GUID of objecttype (element) should be stated instead of GUID Objecttype (Element) sam.

I want the result to be like:

bo_beheerobjectsoort    IH_onderdeel_naam   IH_materiaal          GUID Objecttype (Element) sam.           GUID Objecttype (Element)
Duiker                  Voegovergang        bitumen               976ed841-8791-ed11-b699-001dd8d7027d     nan
Duiker                  Voegovergang        bitumen               976ed841-8791-ed11-b699-001dd8d7027d     nan
Duiker                  Voegovergang        staal                 c16ed841-8791-ed11-b699-001dd8d7027d     nan
Duiker                  Voegovergang        nan                   nan                                      f26cd841-8791-ed11-b699-001dd8d7027d
Duiker                  Voegovergang        staal geconserveerd   c16ed841-8791-ed11-b699-001dd8d7027d     nan

I tried to obtain the above result with the following code, but without luck

Code

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

beheerobject = ["Duiker", "Onderdoorgang", "Verzorgingsplaats", "Viaduct", "Weg"]
# Export to basisregister format for Duiker, Onderdoorgang, Verzorgingsplaats, Viaduct, Weg
filtered_group_maatregelen = group_maatregelen[group_maatregelen["bo_beheerobjectsoort"].isin(beheerobject)]


def fuzzy_merge(row, choices, scorer, threshold):
    matches = process.extract(row, choices, scorer=scorer)
    best_match = max(matches, key=lambda x: x[1])
    if best_match[1] >= threshold:
        return best_match[0]
    return None

# Create a new DataFrame to store approximate matches for each row in df1
matches_df = pd.DataFrame()

# Loop through each row of df1 and find approximate matches for each pair of columns
for col1, col2 in zip(filtered_group_maatregelen["IH_onderdeel_naam"], filtered_group_maatregelen["IH_materiaal"]):
    match1 = fuzzy_merge(col1, basisregister_data["Objecttype (Element)"], scorer=fuzz.token_sort_ratio, threshold=70)
    match2 = fuzzy_merge(col2, basisregister_data["Materiesoort"], scorer=fuzz.token_sort_ratio, threshold=70)
    matches_df = matches_df.append({"Approximate_match1": match1, "Approximate_match2": match2}, ignore_index=True)

# Merge df1 and matches_df based on the approximate matches
merged_df = pd.concat([filtered_group_maatregelen, matches_df], axis=1)
merged_df = pd.merge(merged_df, basisregister_data, how="left", left_on=["bo_beheerobjectsoort", "Approximate_match1", "Approximate_match2"], right_on=["Objecttype (Beheerobject)", "Objecttype (Element)", "Materiesoort"])
0

There are 0 answers