Rapidfuzz match merge

1.3k views Asked by At

Very new to this, would appreciate any advice on the following:

I have a dataset 'Projects' showing list of institutions with project IDs:

project_id  institution_name
0           somali national university
1           aarhus university
2           bath spa
3           aa school of architecture
4           actionaid uk

I would like to fuzzy match merge this with the following dataset of 'Universities' and their country codes:

institution_name                      country_code
a tan kapuja buddhista foiskola             HU
aa school of architecture                   UK
bath spa university                         UK
aalto-yliopisto                             FI
aarhus universitet                          DK

And get back this:

project_id  institution_name           Match    organisation               country_code
0           somali national university []       NaN                        NaN
1           aarhus university          [(91)]   aarhus universitet         DK
2           bath spa                   [(90)]   bath spa university        UK
3           aa school of architecture  [(100)]  aa school of architecture  UK
4           actionaid uk               []       NaN                        NaN

Using rapidfuzz:

import pandas as pd

import numpy as np

from rapidfuzz import process, utils as fuzz_utils

def fuzzy_merge(baseFrame, compareFrame, baseKey, compareKey, threshold=90, limit=1, how='left'):
    #   baseFrame: the left table to join
    #   compareFrame: the right table to join
    #   baseKey: key column of the left table
    #   compareKey: key column of the right table
    #   threshold: how close the matches should be to return a match, based on Levenshtein distance
    #   limit: the amount of matches that will get returned, these are sorted high to low
    #   return: dataframe with boths keys and matches
    s_mapping = {x: fuzz_utils.default_process(x) for x in compareFrame[compareKey]}

    m1 = baseFrame[baseKey].apply(lambda x: process.extract(
      fuzz_utils.default_process(x), s_mapping, limit=limit, score_cutoff=threshold, processor=None
    ))
    baseFrame['Match'] = m1

    m2 = baseFrame['Match'].apply(lambda x: ', '.join(i[2] for i in x))
    baseFrame['organisation'] = m2

    return baseFrame.merge(compareFrame, on=baseKey, how=how)

Merged = fuzzy_merge(Projects, Universities, 'institution_name', 'institution_name')

Merged

I got this (with some extra text in the match column but won't go into that now). It's nearly what I want, but the country code only matches up when it's a 100% match:

project_id  institution_name           Match    organisation               country_code
0           somali national university []       NaN                        NaN
1           aarhus university          [(91)]   aarhus universitet         NaN
2           bath spa                   [(90)]   bath spa university        NaN
3           aa school of architecture  [(100)]  aa school of architecture  UK
4           actionaid uk               []       NaN                        NaN

I reckon this is an issue with how I'm comparing my basekey to the compareframe to create my merged dataset. I can't sort out how to return it on 'organisation' instead though - attempts to plug it in result in varying errors.

1

There are 1 answers

0
StrangeBadger On

Never mind, figured it out - I didn't account for the empty cells! Replacing them with NaN worked out perfectly.

def fuzzy_merge(baseFrame, compareFrame, baseKey, compareKey, threshold=90, limit=1, how='left'):
    s_mapping = {x: fuzz_utils.default_process(x) for x in compareFrame[compareKey]}

    m1 = baseFrame[baseKey].apply(lambda x: process.extract(
      fuzz_utils.default_process(x), s_mapping, limit=limit, score_cutoff=threshold, processor=None
    ))
    baseFrame['Match'] = m1

    m2 = baseFrame['Match'].apply(lambda x: ', '.join(i[2] for i in x))
    baseFrame['organisations'] = m2.replace("",np.nan)

    return baseFrame.merge(compareFrame, left_on='organisations', right_on=compareKey, how=how)