I have a Spark DataFrame that contains customer information. Some clients are duplicates, but it's hard for the computer to determine that without some form of fuzzy matching such as Levenshtein distance.
In the example below, John Smith and Johnny Smith are the same person, but their "first_name" and "address" fields are slightly different. Other details like birthdate and phone number might not necessarily be present. Therefore, I can only identify the same person with some percentage of probability.
+----------+---------+-----------------------+-------------------+------------+-----------+
|first_name|last_name|birthdate              |address            |phone_number|client_uuid|
+----------+---------+-----------------------+-------------------+------------+-----------+
|John      |Smith    |1998-01-01 12:29:42.835|123 Bakersville    |555-555-5555|null       |
|Jay       |Leno     |1955-11-12 12:30:12.946|null               |null        |null       |
|Johnny    |Smith    |null                   |123 Bakersville St.|null        |null       |
+----------+---------+-----------------------+-------------------+------------+-----------+
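To make the kind of fuzzy matching I mean concrete, here is a minimal pure-Python sketch (not Spark) of Levenshtein distance applied to the fields above. The helper names `levenshtein` and `field_similarity` are my own, and normalizing by the longer string's length is just one assumption about how to turn a distance into a score.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]; 1.0 means identical after lowercasing."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


print(levenshtein("John", "Johnny"))
print(field_similarity("123 Bakersville", "123 Bakersville St."))
```

A score like this only compares one field at a time, so missing values (the nulls above) would still need separate handling.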
Let's say I want to attempt this problem anyway. I would like my end result to fill in the final field, "client_uuid". My ideal result would look something like this:
+----------+---------+-----------------------+-------------------+------------+-----------+
|first_name|last_name|birthdate              |address            |phone_number|client_uuid|
+----------+---------+-----------------------+-------------------+------------+-----------+
|John      |Smith    |1998-01-01 12:29:42.835|123 Bakersville    |555-555-5555|CLIENT_123 |
|Jay       |Leno     |1955-11-12 12:30:12.946|null               |null        |CLIENT_456 |
|Johnny    |Smith    |null                   |123 Bakersville St.|null        |CLIENT_123 |
+----------+---------+-----------------------+-------------------+------------+-----------+
I realize that this is not an easy problem and that it involves tackling many smaller problems at once. In fact, it is not really a Spark DataFrame problem, but bonus points if someone finds a solution using Spark DataFrames.
A solution I am contemplating is to transform each customer record into a vector and then use cosine similarity to measure how similar each pair of records is. If the similarity of two records is above some threshold, I would assign them the same generated UUID.
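As a rough illustration of that idea, here is a small pure-Python sketch (not Spark). Everything in it is an assumption on my part: records are plain dicts, each record becomes a bag of character trigrams, the threshold of 0.7 is arbitrary, and the names `trigram_vector`, `cosine_similarity` and `assign_client_uuids` are made up for this example. Records linked through a chain of matches end up sharing one UUID.

```python
import math
import uuid
from collections import Counter


def trigram_vector(record, fields):
    """Bag of character trigrams over the chosen fields; null (None) values are skipped."""
    text = " ".join(str(record[f]).lower() for f in fields if record.get(f) is not None)
    padded = f"  {text} "  # padding so leading/trailing characters form trigrams too
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def assign_client_uuids(records, fields, threshold):
    """Give records whose similarity reaches the threshold the same generated UUID."""
    vectors = [trigram_vector(r, fields) for r in records]
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Compare every pair; this is quadratic in the number of records.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if cosine_similarity(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    ids = {}  # one UUID per group root
    return [dict(r, client_uuid=ids.setdefault(find(i), str(uuid.uuid4())))
            for i, r in enumerate(records)]


customers = [
    {"first_name": "John", "last_name": "Smith", "address": "123 Bakersville"},
    {"first_name": "Jay", "last_name": "Leno", "address": None},
    {"first_name": "Johnny", "last_name": "Smith", "address": "123 Bakersville St."},
]
for row in assign_client_uuids(customers, ["first_name", "last_name", "address"], 0.7):
    print(row["first_name"], row["last_name"], row["client_uuid"])
```

Because this compares every pair of records, I assume it would not scale to a large DataFrame as written; some way of limiting comparisons to likely candidates would be needed.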
I'm sure this isn't a new problem, so I would be interested in hearing other approaches as well. If this is a solved problem and there is already a snippet or library for it, that would be even better.