I have two dataframes of which a subset is shown below:
df1:
bo_beheerobjectsoort IH_onderdeel_naam IH_materiaal
1916 Duiker Voegovergang bitumen
1917 Duiker Voegovergang bitumen
1918 Duiker Voegovergang staal
1919 Duiker Voegovergang nan
1920 Duiker Voegovergang staal geconserveerd
df2:
GUID Objecttype (Element) sam. Objecttype (Element) sam. Materiesoort GUID Objecttype (Element) Objecttype (Element) GUID Objecttype (Beheerobject) Objecttype (Beheerobject)
12 196ed841-8791-ed11-b699-001dd8d7027d Voegovergangsysteem|kunststof,-|flexibele voegovergang 4.2c kunststof f26cd841-8791-ed11-b699-001dd8d7027d Voegovergangsysteem 39e6e058-e301-ed11-b691-001dd8d70290 Duiker
13 976ed841-8791-ed11-b699-001dd8d7027d Voegovergangsysteem|bitumen,-|integraalvoegovergang 6.1 bitumen f26cd841-8791-ed11-b699-001dd8d7027d Voegovergangsysteem 39e6e058-e301-ed11-b691-001dd8d70290 Duiker
14 c16ed841-8791-ed11-b699-001dd8d7027d Voegovergangsysteem|staal,-|open voegovergang staal f26cd841-8791-ed11-b699-001dd8d7027d Voegovergangsysteem 39e6e058-e301-ed11-b691-001dd8d70290 Duiker
I want to merge the two dataframes based on multiple columns. For df1 this would be on: bo_beheerobjectsoort, IH_onderdeel_naam, IH_materiaal. For df 2 this would be on: Objecttype (Beheerobject), Objecttype (Element), Materiesoort.
The problem is that the values in the columns are not completely similar, resulting in a no match. Especially when mergeing IH_onderdeel_naam, IH_materiaal and Objecttype (Element), Materiesoort. Also when no value is stated in a row for column "IH_materiaal" the corresponding GUID of objecttype (element) should be stated instead of GUID Objecttype (Element) sam.
I want the result to be like:
bo_beheerobjectsoort IH_onderdeel_naam IH_materiaal GUID Objecttype (Element) sam. GUID Objecttype (Element)
Duiker Voegovergang bitumen 976ed841-8791-ed11-b699-001dd8d7027d nan
Duiker Voegovergang bitumen 976ed841-8791-ed11-b699-001dd8d7027d nan
Duiker Voegovergang staal c16ed841-8791-ed11-b699-001dd8d7027d nan
Duiker Voegovergang nan nan f26cd841-8791-ed11-b699-001dd8d7027d
Duiker Voegovergang staal geconserveerd c16ed841-8791-ed11-b699-001dd8d7027d nan
I tried to obtain the above result with the following code, but without luck
Code
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
beheerobject = ["Duiker", "Onderdoorgang", "Verzorgingsplaats", "Viaduct", "Weg"]
# Export to basisregister format for Duiker, Onderdoorgang, Verzorgingsplaats, Viaduct, Weg
filtered_group_maatregelen = group_maatregelen[group_maatregelen["bo_beheerobjectsoort"].isin(beheerobject)]
def fuzzy_merge(row, choices, scorer, threshold):
matches = process.extract(row, choices, scorer=scorer)
best_match = max(matches, key=lambda x: x[1])
if best_match[1] >= threshold:
return best_match[0]
return None
# Create a new DataFrame to store approximate matches for each row in df1
matches_df = pd.DataFrame()
# Loop through each row of df1 and find approximate matches for each pair of columns
for col1, col2 in zip(filtered_group_maatregelen["IH_onderdeel_naam"], filtered_group_maatregelen["IH_materiaal"]):
match1 = fuzzy_merge(col1, basisregister_data["Objecttype (Element)"], scorer=fuzz.token_sort_ratio, threshold=70)
match2 = fuzzy_merge(col2, basisregister_data["Materiesoort"], scorer=fuzz.token_sort_ratio, threshold=70)
matches_df = matches_df.append({"Approximate_match1": match1, "Approximate_match2": match2}, ignore_index=True)
# Merge df1 and matches_df based on the approximate matches
merged_df = pd.concat([filtered_group_maatregelen, matches_df], axis=1)
merged_df = pd.merge(merged_df, basisregister_data, how="left", left_on=["bo_beheerobjectsoort", "Approximate_match1", "Approximate_match2"], right_on=["Objecttype (Beheerobject)", "Objecttype (Element)", "Materiesoort"])