A postgres users table with unintended similar-duplicates has led to another table with messy foreign-keys, how to fix and consolidate foreign-keys?

98 views Asked by At

For my application, using Postgres...

The problem

Each user should be associated with N-number of cases, defining a one-to-many relationship, but due to faulty application logic, users are commonly duplicated in the DB, resulting in multiple ids for any given human.

Given that these types of near-duplicates for most users, this has caused each user to be almost always be represented by Y-number of ids in the users table.

In this context, a near duplicate means two rows that are mostly similar. Here is an example of a near-duplicate.

|  id | first_name | last_name | str_adrr      | 
------------------------------------------------
|  1  | Mary       | Doe       | 124 Main Ave  | 
|  2  | Mary       | Doe       | 124 Main St   |

The goal is to remove all near-duplicate users, except for one, leaving a single user while associating all relevant cases to that single user. Ending up with a one-to-many relationship between users, and cases.

My approach

First Step

I fuzzy matched the users and grouped them by cluster_id as identifier. Where cluster_id is used to indicate the grouping itself; all rows with cluster_id 1 are considered duplicates of each other.

Here is a sample of the users table

|  id | first_name | last_name | str_adrr      | group                   | cluster_id
-------------------------------------------------------------------------------------
|  1  | Mary       | Doe       | 124 Main Ave  | Mary Doe 124 Main Ave   | 1
|  2  | Mary       | Doe       | 124 Main St   | Mary Doe 124 Main Ave   | 1
|  7  | Mary       | Doe       | 124 Main Ave  | Mary Doe 124 Main Ave   | 1
|  4  | Mary       | Does      | 124 Main Ave  | Mary Doe 124 Main Ave   | 1
|  5  | James      | Smith     | 14 Street NW  |James Smith 14 Street NW | 2
|  6  | James      | Smith     | 14 Street NW  |James Smith 14 Street NW | 2
| 10  | James      | Smth      | 14 Street NW  |James Smith 14 Street NW | 2
| 11  | Paula      | James     | 21 River SW   | Paula James21 River SW  | 3
| 45  | Paula      | James     | 21 River SW   | Paula James21 River SW  | 3

Given another table called cases. Here is a sample of the relevant columns from that table:

|  id | user_id
---------------
|  1  | 1  # corresponds to mary
|  2  | 2  # corresponds to mary
|  3  | 4  # corresponds to mary
|  4  | 7  # corresponds to mary
|  5  | 10 # corresponds to james
|  6  | 11 # corresponds to paula
|  7  | 45 # corresponds to paula
|  8  | 1  # corresponds to mary
|  9  | 10 # corresponds to james
|  10 | 10 # corresponds to james
|  11 | 6  # corresponds to james

user_id in this cases table corresponds with id from the users table

A user_id can have many (up to a few thousand) cases.

Step 2

I joined the users and cases tables

Here is a sample of the resultant table, users_cases:

|cluster_id| user_id| case_id
----------------------------------
|  1       | 1      | 1
|  1       | 1      | 8
|  1       | 2      | 2
|  1       | 4      | 3
|  1       | 7      | 4
|  2       | 10     | 5
|  2       | 10     | 9
|  2       | 10     | 10
|  2       | 6      | 11
|  3       | 11     | 6
|  3       | 11     | 7

Step 3

I needed to determine which user_id within a given cluster_id grouping is associated with the greatest amount of cases from the user_cases table.

I was able to do so and ended up with a max_cluster_user table with the following shape

|cluster_id| user_id| case_id_count
-------------------------------------
| 1        | 1      | 2 
| 2        | 10     | 3 
| 3        | 11     | 1 

To translate. The first row indicates that for cluster_id with a value of 1, the user_id with the greatest number of cases is 2 and the quantity of cases is represented by case_id_count which has the value 2.

Step 4: Where I need help

I need to then update the user_cases table (or make a new table with the same shape) such that each user_id is the same for each row in a cluster_id group. The result should be something like

|cluster_id| user_id| case_id
----------------------------------
|  1       | 1      | 1
|  1       | 1      | 8
|  1       | 1      | 2
|  1       | 1      | 3
|  1       | 1      | 4
|  2       | 10     | 5
|  2       | 10     | 9
|  2       | 10     | 10
|  2       | 10     | 11
|  3       | 11     | 6
|  3       | 11     | 7

I'm at a loss for how to make this happen. The constraint is that it has to be done via Postgresql compatible SQL.

Procedural code solution for Step 4

I did sketch this out as code to think about it procedurally, this might help. Though I am aware this is not a viable solution as with > 500k records it would take days for this type of logic to run as-is.

# max_cluster_user refers to the table of the same name
for cluster in max_cluster_user: 
    # get the users within a specific cluster
    cluster_users = [user for user in users if user['cluster_id'] == cluster['cluster_id']]
    # users refers to the table of the same name
    for user in cluster_users:
        # get the cases associated with the given id
        user_cases = [case for case in cases if case['user_id'] == user['id']
        for user_case in user_cases:
            # update the user_id for a case
            user_case['user_id = cluster['user_id']

Thanks in advance

1

There are 1 answers

1
Gordon Linoff On BEST ANSWER

I think you just need update with join for step 4:

update user_cases uc
    set user_id = mcu.user_id
    from max_cluster_user mcu
    where mcu.cluster_id = uc.cluster_id and
          uc.user_id <> mcu.user_id;