I'm trying to discern the string similarity between two strings (using Jaro). Each string resides in a separate column in my dataframe.
String 1 = df['name_one']
String 2 = df['name_two']
When I try to run my string similarity logic:
from pyjarowinkler import distance
df['distance'] = df.apply(lambda d: distance.get_jaro_distance(str(d['name_one']),str(d['name_two']),winkler=True,scaling=0.1), axis=1)
I get the following error:
**error: JaroDistanceException: Cannot calculate distance from NoneType (str, str)**
Great, so there is a nonetype in the columns, so the first thing I do is check for this:
maskone = df['name_one'] == None
df[maskone]
masktwo = df['name_two'] == None
df[masktwo]
This finds no None types. I'm scratching my head at this point, but I proceed to clean the two columns anyway.
df['name_one'] = df['name_one'].fillna('').astype(str)
df['name_two'] = df['name_two'].fillna('').astype(str)
And yet, I'm still getting:
error: JaroDistanceException: Cannot calculate distance from NoneType (str, str)
Am I removing NoneTypes correctly?
Problem
The issue isn't only NoneTypes: empty strings also throw this exception, as you can see in the implementation of distance.get_jaro_distance. That means fillna('') doesn't help here, because it turns every missing value into an empty string, which raises the same error.
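To see which rows are affected, something like the following may help. It is only a sketch: the sample dataframe is made up for illustration, and only the column names come from the question. It uses isna() to detect None/NaN and a separate check for empty strings.

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "name_one": ["alice", "", None, "dave"],
    "name_two": ["alicia", "bob", "carol", ""],
})

# isna() flags both None and NaN.
missing = df["name_one"].isna() | df["name_two"].isna()

# Empty strings once missing values are filled the way the question does it.
empty = (
    df["name_one"].fillna("").astype(str).eq("")
    | df["name_two"].fillna("").astype(str).eq("")
)

# Rows that would hit the exception.
problem_rows = df[missing | empty]
print(problem_rows)
```

Any row that shows up in problem_rows has at least one name that is missing or empty.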
Option 1
Try replacing your None types and/or empty strings with 'NA', or filter them out of your dataset.
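A rough sketch of both variants, again on made-up sample data (the names filled, cleaned and filtered are just for illustration):

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "name_one": ["alice", "", None, "dave"],
    "name_two": ["alicia", "bob", "carol", ""],
})
cols = ["name_one", "name_two"]

# Variant A: replace missing values and empty strings with the placeholder 'NA'.
filled = df.copy()
for col in cols:
    as_text = filled[col].fillna("").astype(str)
    filled[col] = as_text.mask(as_text.eq(""), "NA")

# Variant B: keep only the rows where both names are non-empty.
cleaned = df.copy()
for col in cols:
    cleaned[col] = cleaned[col].fillna("").astype(str)
keep = cleaned["name_one"].ne("") & cleaned["name_two"].ne("")
filtered = cleaned[keep]
```

One thing to keep in mind with the placeholder: two rows that are both 'NA' are identical strings, so they will score as a perfect match. Filtering avoids that, at the cost of dropping rows.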
Option 2
Use a flag value/distance for rows that may raise this exception. In the example below, I will use 999.
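A minimal sketch of the idea, assuming a small wrapper (safe_distance is a hypothetical helper, not part of pyjarowinkler). So that the snippet runs without the library installed, it uses a made-up stand-in scorer, exact_match; in your case you would pass a lambda around the distance.get_jaro_distance call from your question instead.

```python
import pandas as pd

FLAG = 999  # flag value for rows that cannot be scored

def safe_distance(first, second, scorer, flag=FLAG):
    """Return scorer(first, second), or flag if either value is missing or empty."""
    if pd.isna(first) or pd.isna(second):
        return flag
    first, second = str(first), str(second)
    if not first or not second:
        return flag
    return scorer(first, second)

# Stand-in scorer for illustration only (hypothetical). With pyjarowinkler,
# pass something like:
#   lambda a, b: distance.get_jaro_distance(a, b, winkler=True, scaling=0.1)
def exact_match(a, b):
    return 1.0 if a == b else 0.0

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "name_one": ["alice", "", None, "dave"],
    "name_two": ["alicia", "bob", "carol", ""],
})

df["distance"] = df.apply(
    lambda d: safe_distance(d["name_one"], d["name_two"], exact_match),
    axis=1,
)
print(df)
```

Since a real Jaro score stays between 0 and 1, a value like 999 is easy to filter out afterwards, e.g. df[df['distance'] != 999].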