I have two datasets, each containing Name, First Name, Street, House Number, Postal Code, and City. I have noticed that these datasets contain many duplicate records. For instance, in one dataset the first name is "John" while in the other it is "Jon", with the same last name, street, house number, and city, and a postal code that differs in one digit. Since I have millions of records, there are many ways in which the same person could appear differently across the two datasets.

I believe I need to do five things:

  1. Identify all the cases that occur in the data and assign each record pair to its case.

Examples of cases: Name different; Name and Postal Code different; First Name different; First Name and Name different; City different, etc.
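
As a concrete starting point, here is a minimal sketch of step 1 in plain Python. It assumes the record pairs have already been matched; generating candidate pairs across millions of rows (blocking/indexing) is a separate problem not covered here. The field names and sample records are hypothetical.

```python
# Hypothetical field names matching the two datasets.
FIELDS = ["Name", "First Name", "Street", "House Number", "Postal Code", "City"]

def classify_case(record_a: dict, record_b: dict) -> str:
    """Return a label naming exactly which fields differ between two records."""
    differing = [f for f in FIELDS if record_a[f] != record_b[f]]
    if not differing:
        return "Identical"
    return " and ".join(differing) + " different"

# Hypothetical sample records for illustration.
a = {"Name": "Smith", "First Name": "John", "Street": "Main St",
     "House Number": "1", "Postal Code": "34567", "City": "Springfield"}
b = {"Name": "Smith", "First Name": "Jon", "Street": "Main St",
     "House Number": "1", "Postal Code": "34568", "City": "Springfield"}

print(classify_case(a, b))  # -> "First Name and Postal Code different"
```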

  2. Use distance measures to calculate how many characters differ between those data points.

For instance, to get from "Jon" to "John" I need to add 1 letter, and to get from postal code 34567 to 34568 I need to change 1 digit, resulting in two changes for this case.
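
For step 2, the edit (Levenshtein) distance counts exactly these insertions, deletions, and substitutions. A minimal sketch using the rapidfuzz library (`pip install rapidfuzz`; python-Levenshtein is an alternative):

```python
from rapidfuzz.distance import Levenshtein

# Edit distance counts the insertions, deletions, and substitutions
# needed to turn one string into the other.
d_name = Levenshtein.distance("Jon", "John")       # 1 insertion
d_postal = Levenshtein.distance("34567", "34568")  # 1 substitution

print(d_name + d_postal)  # -> 2 changes in total for this pair
```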

  3. Calculate how often each case occurs, for instance:

| Case | Frequency (%) |
|---|---|
| Name different | 50 |
| Name and Postal Code different | 30 |
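
A minimal sketch of step 3 with pandas, assuming a hypothetical DataFrame `pairs` that holds one row per matched record pair together with its case label from step 1:

```python
import pandas as pd

# Hypothetical data: one row per matched pair with its case label.
pairs = pd.DataFrame({
    "case": ["Name different", "Name different",
             "Name and Postal Code different", "Name different"],
})

# Relative frequency of each case, in percent.
case_freq = pairs["case"].value_counts(normalize=True) * 100
print(case_freq)
# Name different                    75.0
# Name and Postal Code different    25.0
```
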
  4. Calculate how often each distance measure occurs within each case:

| Case | Distance Measure | Frequency (%) |
|---|---|---|
| Name different | 1 | 50 |
| Name different | 2 | 30 |
| Name different | 3 | 20 |
| Name and Postal Code different | 3 | 100 |
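
Step 4 is a grouped version of the same count. A minimal sketch, again on the hypothetical `pairs` DataFrame, now extended with the total edit distance per pair from step 2:

```python
import pandas as pd

# Hypothetical data: case label plus total edit distance per pair.
pairs = pd.DataFrame({
    "case": ["Name different", "Name different", "Name different",
             "Name and Postal Code different"],
    "distance": [1, 2, 3, 3],
})

# Relative frequency of each distance within each case, in percent.
dist_freq = pairs.groupby("case")["distance"].value_counts(normalize=True) * 100
print(dist_freq)
```
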
  5. Combine this information into a matrix, such as:

| Case / Distance Measure | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Name different | 20% | 10% | 10% | 5% | 5% | 50% |
| Name and Postal Code different | 10% | 30% | 10% | 20% | 5% | 25% |
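
For step 5, pandas can build the case-by-distance matrix directly with a crosstab, normalizing each row so that the percentages sum to 100% per case. A sketch on the same hypothetical data:

```python
import pandas as pd

pairs = pd.DataFrame({
    "case": ["Name different", "Name different", "Name different",
             "Name and Postal Code different"],
    "distance": [1, 2, 3, 3],
})

# Rows: cases; columns: distance values; cells: row-wise percentages.
matrix = pd.crosstab(pairs["case"], pairs["distance"], normalize="index") * 100
print(matrix.round(1))
```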

Could you please help me identify which libraries I would need to perform these steps with Python and Jupyter?

For now I do not have access to the data; I am still in the research phase.
