I have 250,000 rows of first and last names. The first and last names are in separate columns, but they can be inconsistent, e.g.:
John Smith, John-Smith, John M. Smith, Jhon Smith
How do I identify these near-duplicates and remove or merge them using OpenRefine?
I tried sorting and then using "Blank down", but that only appears to work for exact-match duplicates.
OpenRefine implements several clustering methods for identifying and merging near-duplicate values. See the OpenRefine documentation for details.
Each clustering method has its own strengths and weaknesses, so it is usually recommended to combine them and run several iterations.
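To give a feel for why a single method is rarely enough, here is a minimal Python sketch of key-collision clustering, loosely modelled on the idea behind OpenRefine's fingerprint method. It is not OpenRefine's actual code. The function names `fingerprint` and `cluster` are made up for illustration, the sketch assumes first and last name have already been joined into one string, and it replaces punctuation with a space (an assumption made here so that "John-Smith" splits into two tokens).

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key; values with the same key are cluster candidates."""
    # Fold accented characters to plain ASCII where possible.
    text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Assumption of this sketch: punctuation and underscores become spaces.
    text = re.sub(r"[^\w\s]|_", " ", text.strip().lower())
    # Split into tokens, drop repeats, sort so word order does not matter.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values by fingerprint and keep only groups with more than one member."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]


# The example names from the question.
names = ["John Smith", "John-Smith", "John M. Smith", "Jhon Smith"]
clusters = cluster(names)
print(clusters)
```

With this sketch only "John Smith" and "John-Smith" share a key. "John M. Smith" has an extra token and "Jhon Smith" has a transposed letter, so a key-based pass misses both. Those are the kinds of cases where a distance-based (nearest neighbor) method is worth trying in a later iteration.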
In your case I would use the following workflow: