I have a DataFrame with a column A as below:

Column A
Carrefour supermarket
Carrefour hypermarket
Carrfour downtown
Carrfor market
Lulu Hyper
Lulu dxb
lulu airport
k.m trading
KM Trading
KM trade
K.M.  Trading

I want to derive the below "Column A":

Column A
KM Trading
KM Trading
KM Trading
KM Trading
KM Trading

To do this, I wrote the code below:

import re

MERCHANT_NAME_DICT = {"lulu": "Lulu", "carrefour": "Carrefour", "km": "KM Trading"}

def replace_merchant_name(row):
    """Provided a long merchant name replace it with short name."""
    processed_row = re.sub(r'\s+|\.', '', row.lower()).strip()
    for key, value in MERCHANT_NAME_DICT.items():
        if key in processed_row:
            return value

    return row

frame['MERCHANT_NAME'] = frame['MERCHANT_NAME'].astype(str).apply(replace_merchant_name)

But I want to use NLP logic and make this a generic function (instead of hard-coding the mapping values): just call the generic function on any similar data column and get the desired result. I am pretty new to NLP concepts, so I am looking for some help with it, friends.

NOTE: Basically I want a generic NLP approach to find all similar words in a given column (or in a list).

2 Answers

Jindřich

You can do something like this: for each word that is not in a spellchecking vocabulary (and is thus likely to be misspelled), look in your list of merchant names for a name with a small edit distance. You can also normalize the words for the similarity search, i.e., lowercase everything and remove punctuation.

You can use the textdistance package, which implements plenty of string distances. I would probably use the Jaro-Winkler distance for this purpose.

import string

import textdistance

# Minimum similarity required before a name is replaced; tune this on your data
DISTANCE_THRESHOLD = 0.85

MERCHANT_NAMES = [("lulu", "Lulu"), ("carrefour", "Carrefour"), ("km", "KM")]

def normalize(orig_name):
    # Lowercase and strip punctuation to get a comparable signature
    name_sig = orig_name.translate(str.maketrans('', '', string.punctuation)).lower()

    best_score = DISTANCE_THRESHOLD
    replacement = orig_name  # keep the original if nothing matches well enough

    for sig, name in MERCHANT_NAMES:
        # jaro_winkler returns a similarity score: higher means more similar
        score = textdistance.jaro_winkler(name_sig, sig)
        if score > best_score:
            best_score = score
            replacement = name
    return replacement

You will probably need to tune the acceptable threshold for word replacement and do something with multi-word expressions (e.g., throw away words similar to "supermarket", "hypermarket", etc.).
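A minimal sketch of that per-token idea, using `difflib.SequenceMatcher` from the standard library as a stand-in for Jaro-Winkler so it runs without extra packages. The `GENERIC_WORDS` stop list and the signatures in `MERCHANT_NAMES` are made-up examples you would adapt to your own data:

```python
import string
from difflib import SequenceMatcher

# Hypothetical list of generic retail words to drop before matching
GENERIC_WORDS = {"supermarket", "hypermarket", "market", "hyper", "downtown", "airport"}

MERCHANT_NAMES = [("lulu", "Lulu"), ("carrefour", "Carrefour"), ("km", "KM Trading")]

def canonical_name(raw, threshold=0.8):
    """Match each non-generic token against known signatures; return the best canonical name."""
    cleaned = raw.translate(str.maketrans('', '', string.punctuation)).lower()
    tokens = [t for t in cleaned.split() if t not in GENERIC_WORDS]
    best_score, best_name = threshold, raw  # fall back to the original text
    for token in tokens:
        for sig, name in MERCHANT_NAMES:
            score = SequenceMatcher(None, token, sig).ratio()
            if score > best_score:
                best_score, best_name = score, name
    return best_name

print(canonical_name("Carrfour downtown"))  # 'Carrefour'
print(canonical_name("K.M.  Trading"))      # 'KM Trading'
print(canonical_name("random shop"))        # 'random shop' (no match above threshold)
```

Matching token by token means a misspelled brand word ("Carrfour") still scores highly against its signature, while generic words never get a chance to pull the score toward the wrong merchant.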

David Dale

If you don't have a golden set of "right" merchant names, this sounds like a clustering problem. It can be solved with a clever distance function (like Jaro-Winkler from Jindřich's answer) and a simple clustering algorithm (e.g. agglomerative clustering).

After clustering the texts, you can find the most representative text of each cluster and replace the whole cluster with it.

import numpy as np
import re
import textdistance
# we will need scikit-learn>=0.21
from sklearn.cluster import AgglomerativeClustering  

texts = [
  'Carrefour supermarket', 'Carrefour hypermarket', 'Carrefour', 'carrefour', 'Carrfour downtown', 'Carrfor market',
  'Lulu', 'Lulu Hyper', 'Lulu dxb', 'lulu airport',
  'k.m trading', 'KM Trading', 'KM trade', 'K.M.  Trading', 'KM.Trading',
]

def normalize(text):
  """ Keep only lower-cased text and numbers"""
  return re.sub('[^a-z0-9]+', ' ', text.lower())

def group_texts(texts, threshold=0.4):
  """ Replace each text with the representative of its cluster"""
  normalized_texts = np.array([normalize(text) for text in texts])
  # pairwise distance matrix: 1 - similarity
  distances = 1 - np.array([
      [textdistance.jaro_winkler(one, another) for one in normalized_texts]
      for another in normalized_texts
  ])
  clustering = AgglomerativeClustering(
    distance_threshold=threshold,  # this parameter needs to be tuned carefully
    affinity="precomputed",  # on scikit-learn >= 1.2, use metric="precomputed" instead
    linkage="complete", n_clusters=None
  ).fit(distances)
  centers = dict()
  for cluster_id in set(clustering.labels_):
    index = clustering.labels_ == cluster_id
    # the center is the text with the smallest total distance to its cluster mates
    centrality = distances[:, index][index].sum(axis=1)
    centers[cluster_id] = normalized_texts[index][centrality.argmin()]
  return [centers[i] for i in clustering.labels_]

print(group_texts(texts))


The code above prints:

['carrefour', 'carrefour', 'carrefour', 'carrefour', 'carrefour', 'carrefour', 
 'lulu', 'lulu', 'lulu', 'lulu', 
 'km trading', 'km trading', 'km trading', 'km trading', 'km trading']

As a baseline, this function will do. You may want to improve it by modifying the distance function so that it reflects your domain more closely. For example:

  • take into account synonyms: supermarket=hypermarket=market
  • lemmatize words (so that trading=trade)
  • give smaller weight to the non-important words (IDF?)

Unfortunately, most such adjustments are domain-specific, so you will have to tune them on your own dataset.