Fuzzy Logic to match the records in a dataframe

65 views Asked by At

I have a huge datasets of 2 Million and i want to match the records based on the fuzzy logic I have my original dataframe like

+---------+---------------+
|     name|        address|
+---------+---------------+
|   Arvind|      Kathmandu|
|   Arvind|      Kathmands|
|   Arbind|      Kathmandu|
|  Arvinds|      Kathmandu|
|   Arveen|      Kathmandu|
|   Arvins|      Kathmandu|
|   Arvind|Kathmandu Nepal|
| Abhishek|        Pokhara|
|Abhisheks|        Pokhara|
|Abhishek1|        Pokhara|
|Abhishek2|        Pokhara|
|Abhishek3|        Pokhara|
+---------+---------------+

I tried using pyspark windows function but windows function takes partition based on the exact match and i want to have may records matched based on the fuzzy logic and wanted to have my output as a dataframe like this:-

+---------+---------------+
|     name|        address|uuid_for_match_record
+---------+---------------+
|   Arvind|      Kathmandu| uuid_1
|   Arvind|      Kathmands|uuid_1
|   Arbind|      Kathmandu|uuid_1
|  Arvinds|      Kathmandu|uuid_1
|   Arveen|      Kathmandu|uuid_1
|   Arvins|      Kathmandu|uuid_1
|   Arvind|Kathmandu Nepal|uuid_1
| Abhishek|        Pokhara|uuid_2
|Abhisheks|        Pokhara|uuid_2
|Abhishek1|        Pokhara|uuid_2
|Abhishek2|        Pokhara|uuid_2
|Abhishek3|        Pokhara|uuid_2`

How can it be achieved based on the huge 2Million datasets Here is the image of my dataframe and what i want to achieve: image of dataframe and output i want

2

There are 2 answers

0
user238607 On

I have taken inspiration from this blogpost to write the following code.

https://leons.im/posts/a-python-implementation-of-simhash-algorithm/

The cluster_names function just clusters the strings within the list based on the cluster_threshold value. You can tweak this value to get good results. You can also play around with shingling_width in name_to_features. You can create features of width=2,3,4,5 and so on and concatenate together.

Once you are satistifed your with your clusters, then you can further do fuzzywuzzy (this library has been renamed to thefuzz) matching to find more exact matches.

https://github.com/seatgeek/thefuzz

First install simhash python library, then run the following code.

pip install simhash

from simhash import Simhash


def simhash_distance(hash1, hash2):
    return hash1.distance(hash2)


def name_to_features(name, shingling_width=2):
    name = name.lower()
    return [name[i:i + shingling_width] for i in range(len(name) - shingling_width + 1)]


def cluster_names(names_list, cluster_threshold=20):
    clusters_internal = []
    name_hashes = [(name, Simhash(name_to_features(name))) for name in names_list]

    for name, hash_val in name_hashes:
        found_cluster = False
        for cluster_ele in clusters_internal:
            if simhash_distance(cluster_ele['centroid'], hash_val) <= cluster_threshold:
                cluster_ele['names'].append(name)
                found_cluster = True
                break
        if not found_cluster:
            clusters_internal.append({'centroid': hash_val, 'names': [name]})
    return clusters_internal



names = ["Alice", "Alicia", "Alise", "Alyce", "Bob", "Bobb"]
clusters = cluster_names(names)
for i, cluster in enumerate(clusters, 1):
    print(f"Cluster {i}: {cluster['names']}")

data = [
    "Arvind Kathmandu",
    "Arvind Kathmands",
    "Arbind Kathmandu",
    "Arvinds Kathmandu",
    "Arveen Kathmandu",
    "Arvins Kathmandu",
    "Arvind Kathmandu Nepal",
    "Abhishek Pokhara",
    "Abhisheks Pokhara",
    "Abhishek1 Pokhara",
    "Abhishek2 Pokhara",
    "Abhishek3 Pokhara"
]

clusters_data = cluster_names(data)
for i, cluster in enumerate(clusters_data, 1):
    print(f"Cluster {i}: {cluster['names']}")

Output :

Cluster 1: ['Alice', 'Alicia', 'Alise', 'Alyce']
Cluster 2: ['Bob', 'Bobb']
Cluster 1: ['Arvind Kathmandu', 'Arvind Kathmands', 'Arbind Kathmandu', 'Arvinds Kathmandu', 'Arveen Kathmandu', 'Arvins Kathmandu', 'Arvind Kathmandu Nepal']
Cluster 2: ['Abhishek Pokhara', 'Abhisheks Pokhara', 'Abhishek1 Pokhara', 'Abhishek2 Pokhara', 'Abhishek3 Pokhara']
0
user238607 On

Here's another implementation which does the same thing. This time using MinHash and LSH.

Here's an article which explains this.

https://spotintelligence.com/2023/01/02/minhash/

First, install datasketch and networkx

pip install networkx

pip install datasketch

import networkx as nx
from datasketch import MinHash, MinHashLSH
import pprint
import random
names = [
    "Arvind Kathmandu",
    "Arvind Kathmands",
    "Arbind Kathmandu",
    "Arvinds Kathmandu",
    "Arveen Kathmandu",
    "Arvins Kathmandu",
    "Arvind Kathmandu Nepal",
    "Abhishek Pokhara",
    "Abhisheks Pokhara",
    "Abhishek1 Pokhara",
    "Abhishek2 Pokhara",
    "Abhishek3 Pokhara"
]

# just to test stuff
random.shuffle(names)
print(f"shuffled names list = {names}")
def get_shingles(name_arg):
    name_internal = name_arg.lower()
    shingle_list = [2, 3, 4]
    list_given = []
    for shingle_width in shingle_list:
        list_internal = [name_internal[i:i + shingle_width] for i in range(max(len(name_internal) - shingle_width + 1, 1))]
        list_given.extend(list_internal)

    return set(list_given)

signatures = {}

for name in names:
    m = MinHash(num_perm=128)
    for shingle in get_shingles(name):
        m.update(shingle.encode('utf8'))
    signatures[name] = m


lsh = MinHashLSH(threshold=0.5, num_perm=128)
for name, minhash in signatures.items():
    lsh.insert(name, minhash)

G = nx.Graph()

for name in signatures.keys():
    similar_names = lsh.query(signatures[name])
    for sim_name in similar_names:
        G.add_edge(name, sim_name)


clusters = list(nx.connected_components(G))


pp = pprint.PrettyPrinter(indent=2)
print("Final results:")
pp.pprint(clusters)

Output :

shuffled names list = ['Arvind Kathmands', 'Arvins Kathmandu', 'Abhishek3 Pokhara', 'Abhishek1 Pokhara', 'Abhishek2 Pokhara', 'Arvind Kathmandu', 'Arvind Kathmandu Nepal', 'Arbind Kathmandu', 'Abhisheks Pokhara', 'Arvinds Kathmandu', 'Arveen Kathmandu', 'Abhishek Pokhara']
Final results:
[ { 'Arbind Kathmandu',
    'Arveen Kathmandu',
    'Arvind Kathmands',
    'Arvind Kathmandu',
    'Arvind Kathmandu Nepal',
    'Arvinds Kathmandu',
    'Arvins Kathmandu'},
  { 'Abhishek Pokhara',
    'Abhishek1 Pokhara',
    'Abhishek2 Pokhara',
    'Abhishek3 Pokhara',
    'Abhisheks Pokhara'}]