For my application, using Postgres...
The problem
Each user should be associated with N-number of cases, defining a one-to-many relationship, but due to faulty application logic, users are commonly duplicated in the DB, resulting in multiple ids for any given human.
Given that these types of near-duplicates for most users, this has caused each user to be almost always be represented by Y-number of ids in the users table.
In this context, a near duplicate means two rows that are mostly similar. Here is an example of a near-duplicate.
| id | first_name | last_name | str_adrr |
------------------------------------------------
| 1 | Mary | Doe | 124 Main Ave |
| 2 | Mary | Doe | 124 Main St |
The goal is to remove all near-duplicate users, except for one, leaving a single user while associating all relevant cases to that single user. Ending up with a one-to-many relationship between users, and cases.
My approach
First Step
I fuzzy matched the users and grouped them by cluster_id as identifier. Where cluster_id is used to indicate the grouping itself; all rows with cluster_id 1 are considered duplicates of each other.
Here is a sample of the users table
| id | first_name | last_name | str_adrr | group | cluster_id
-------------------------------------------------------------------------------------
| 1 | Mary | Doe | 124 Main Ave | Mary Doe 124 Main Ave | 1
| 2 | Mary | Doe | 124 Main St | Mary Doe 124 Main Ave | 1
| 7 | Mary | Doe | 124 Main Ave | Mary Doe 124 Main Ave | 1
| 4 | Mary | Does | 124 Main Ave | Mary Doe 124 Main Ave | 1
| 5 | James | Smith | 14 Street NW |James Smith 14 Street NW | 2
| 6 | James | Smith | 14 Street NW |James Smith 14 Street NW | 2
| 10 | James | Smth | 14 Street NW |James Smith 14 Street NW | 2
| 11 | Paula | James | 21 River SW | Paula James21 River SW | 3
| 45 | Paula | James | 21 River SW | Paula James21 River SW | 3
Given another table called cases. Here is a sample of the relevant columns from that table:
| id | user_id
---------------
| 1 | 1 # corresponds to mary
| 2 | 2 # corresponds to mary
| 3 | 4 # corresponds to mary
| 4 | 7 # corresponds to mary
| 5 | 10 # corresponds to james
| 6 | 11 # corresponds to paula
| 7 | 45 # corresponds to paula
| 8 | 1 # corresponds to mary
| 9 | 10 # corresponds to james
| 10 | 10 # corresponds to james
| 11 | 6 # corresponds to james
user_id in this cases table corresponds with id from the users table
A user_id can have many (up to a few thousand) cases.
Step 2
I joined the users and cases tables
Here is a sample of the resultant table, users_cases:
|cluster_id| user_id| case_id
----------------------------------
| 1 | 1 | 1
| 1 | 1 | 8
| 1 | 2 | 2
| 1 | 4 | 3
| 1 | 7 | 4
| 2 | 10 | 5
| 2 | 10 | 9
| 2 | 10 | 10
| 2 | 6 | 11
| 3 | 11 | 6
| 3 | 11 | 7
Step 3
I needed to determine which user_id within a given cluster_id grouping is associated with the greatest amount of cases from the user_cases table.
I was able to do so and ended up with a max_cluster_user table with the following shape
|cluster_id| user_id| case_id_count
-------------------------------------
| 1 | 1 | 2
| 2 | 10 | 3
| 3 | 11 | 1
To translate. The first row indicates that for cluster_id with a value of 1, the user_id with the greatest number of cases is 2 and the quantity of cases is represented by case_id_count which has the value 2.
Step 4: Where I need help
I need to then update the user_cases table (or make a new table with the same shape) such that each user_id is the same for each row in a cluster_id group. The result should be something like
|cluster_id| user_id| case_id
----------------------------------
| 1 | 1 | 1
| 1 | 1 | 8
| 1 | 1 | 2
| 1 | 1 | 3
| 1 | 1 | 4
| 2 | 10 | 5
| 2 | 10 | 9
| 2 | 10 | 10
| 2 | 10 | 11
| 3 | 11 | 6
| 3 | 11 | 7
I'm at a loss for how to make this happen. The constraint is that it has to be done via Postgresql compatible SQL.
Procedural code solution for Step 4
I did sketch this out as code to think about it procedurally, this might help. Though I am aware this is not a viable solution as with > 500k records it would take days for this type of logic to run as-is.
# max_cluster_user refers to the table of the same name
for cluster in max_cluster_user:
# get the users within a specific cluster
cluster_users = [user for user in users if user['cluster_id'] == cluster['cluster_id']]
# users refers to the table of the same name
for user in cluster_users:
# get the cases associated with the given id
user_cases = [case for case in cases if case['user_id'] == user['id']
for user_case in user_cases:
# update the user_id for a case
user_case['user_id = cluster['user_id']
Thanks in advance
I think you just need
updatewith join for step 4: