Checking and Removing NoneTypes for Jaro String Similarity


I'm trying to discern the string similarity between two strings (using Jaro). Each string resides in a separate column in my dataframe.

String 1 = df['name_one'] 

String 2 = df['name_two']

When I try to run my string similarity logic:

from pyjarowinkler import distance
df['distance'] = df.apply(lambda d: distance.get_jaro_distance(str(d['name_one']),str(d['name_two']),winkler=True,scaling=0.1), axis=1)

I get the following error:

 **error: JaroDistanceException: Cannot calculate distance from NoneType (str, str)**

Great, so there is a NoneType somewhere in the columns, and the first thing I do is check for it:

maskone = df['name_one'] == None
df[maskone]

masktwo = df['name_two'] == None
df[masktwo]

This finds no None values. I'm scratching my head at this point, but I proceed to clean the two columns anyway.

df['name_one'] = df['name_one'].fillna('').astype(str)
df['name_two'] = df['name_two'].fillna('').astype(str) 

And yet, I'm still getting:

error: JaroDistanceException: Cannot calculate distance from NoneType (str, str)

Am I removing NoneTypes correctly?

There is 1 answer

Answer by ggordon (best answer)

Problem

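The problem is not limited to None values: empty strings raise the same exception, as the implementation of distance.get_jaro_distance shows. Your lambda wraps each value in str(), so a real None would arrive as the non-empty string 'None' and pass the check. The (str, str) in the error message points to empty strings instead, and fillna('') turns every remaining missing value into one more empty string.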

if not first or not second:
    raise JaroDistanceException("Cannot calculate distance from NoneType ({0}, {1})".format(
        first.__class__.__name__,
        second.__class__.__name__))
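
To make the misleading message easier to see, here is a minimal sketch of the same truthiness check. The function name guard is hypothetical and not part of pyjarowinkler; it only mirrors the condition quoted above.

```python
def guard(first, second):
    """Hypothetical stand-in that mirrors the truthiness check quoted above."""
    # `not value` is True for None AND for the empty string ''
    if not first or not second:
        return "rejected ({0}, {1})".format(
            first.__class__.__name__,
            second.__class__.__name__)
    return "accepted"


print(guard("abc", ""))     # rejected (str, str)
print(guard("abc", None))   # rejected (str, NoneType)
print(guard("abc", "abd"))  # accepted
```

The first call reproduces the (str, str) wording from the question: both arguments are strings, but one of them is empty.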

Option 1

Try replacing None values and/or empty strings with 'NA', or filter those rows out of your dataset.
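
A rough sketch of both variants of Option 1, assuming a small made-up dataframe in place of yours. The sample rows and the names cols, cleaned, is_blank, bad_rows, filtered and replaced are illustrative only. Treating whitespace-only names as blank is a choice made here, not something the library requires.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's dataframe
df = pd.DataFrame({
    "name_one": ["martha", None, "dwayne", ""],
    "name_two": ["marhta", "jones", "", "duane"],
})

cols = ["name_one", "name_two"]

# Turn None/NaN into empty strings, then flag every blank cell
cleaned = df[cols].fillna("").astype(str)
is_blank = cleaned.apply(lambda s: s.str.strip() == "")

# A row cannot be scored if either of its two names is blank
bad_rows = is_blank.any(axis=1)

# Variant A: drop the rows that cannot be scored
filtered = df[~bad_rows]

# Variant B: keep every row, but put the placeholder 'NA' in blank cells
replaced = cleaned.mask(is_blank, "NA")

print(filtered)
print(replaced)
```

One caveat with the placeholder: two rows that are both 'NA' would compare as identical strings, so filtering may be the safer choice when a high score is supposed to mean a real match.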

Option 2

Use a flag value for the distance on rows that would raise this exception. The example below uses 999.

from pyjarowinkler import distance

df['distance'] = df.apply(
    lambda d: 999 if not str(d['name_one']) or not str(d['name_two'])
    else distance.get_jaro_distance(str(d['name_one']), str(d['name_two']), winkler=True, scaling=0.1),
    axis=1)
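
The same idea can be written as a named helper that also catches None and NaN before they are turned into the strings 'None' and 'nan'. This is only a sketch: is_blank, safe_score and stand_in_scorer are hypothetical names, and the stand-in scorer uses difflib from the standard library, which is not Jaro, purely so the example runs without pyjarowinkler installed.

```python
from difflib import SequenceMatcher


def is_blank(value):
    """True for None, NaN, and empty or whitespace-only strings."""
    if value is None:
        return True
    # NaN is the only float that is not equal to itself
    if isinstance(value, float) and value != value:
        return True
    return not str(value).strip()


def safe_score(first, second, scorer, flag=999):
    """Return `flag` when either input is blank, else scorer(first, second)."""
    if is_blank(first) or is_blank(second):
        return flag
    return scorer(str(first), str(second))


def stand_in_scorer(a, b):
    # NOT Jaro: a stdlib similarity ratio used only to keep the sketch runnable
    return SequenceMatcher(None, a, b).ratio()


print(safe_score(None, "jones", stand_in_scorer))
print(safe_score("", "jones", stand_in_scorer))
print(safe_score("martha", "marhta", stand_in_scorer))
```

With pyjarowinkler installed, the scorer passed in would presumably be lambda a, b: distance.get_jaro_distance(a, b, winkler=True, scaling=0.1), applied row by row with df.apply(..., axis=1) as in the one-liner above.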