I am working with a dataframe which has 2 columns of city names which should be equal. But they are not due to administrative errors, spelling mistakes or name changes. I am trying to see when those city names are 'equal enough' to be assumed equal. Using SequenceMatcher I can divide the list in roughly 3 parts: Everything is wrong, some are wrong, some are right, everything is right.
In a perfect world I would want the list to be divided in: Everything is wrong, Everything is right. Where the split can be made around a certain ratio/matching value.
Therefore, SequenceMatcher does not do the trick for me. I found textdistance but I get overwhelmed by the possibilities and probably there are more possibilities. An example where it goes wrong:
'Zeddam' and 'Didam' are classified with a ratio of 0.72 (Which are not equal). 'Nes' and 'Nes gem dongeradeel' with a ratio of 0.45 (Which are equal, the second just specifies it's province)
Just checking if one of the strings is a subset of the other string does not do the trick since it will give problems in other cases.
Do you guys have a suggestion on what string comparing algorithm is appropriate and why? I am comparing multiple columns in my dataframe, which is around 1000 rows.