Anonymise data in python across multiple tables and keep the relationships between related columns post anonymisation?

316 views Asked by At

I am working on a project where there are two seperate csv files which I have pulled from a database. I want to load the data in python using pandas and anonymise contents of some of the columns in both tables. Some of these columns that has data that will be anonymised already exist in one of the other tables that shall also be anonymised. But I want them to anonymise to the same thing.

Is there a way to keep the relationships post anonymisation? ie the data in both tables that sit in different columns get anonymised to the same thing?

I have seen many examples online but only for anonymising single tables. How can this be done for two tables and keeping the relationships between the columns so they anonymise to the same thing, as in the example below?

Example:

Both tables, both columns pre anonymisation, but colA in Table 1 is related to colC in Table 2.

Pre anonymisation:

Table 1

colA colB
123456 abcdefg
789123 hijklm

Table 2:

colC colD
123456 xyz123
789123 abc456

Post anonymisation:

Table 1

colA colB
912056 zxcvbn
450912 poiuyt

Table 2:

colC colD
912056 qwe098
450912 asd321
2

There are 2 answers

0
gtomer On BEST ANSWER

You need to first set a dictionary for replacement. If you'll be using it for changing values in all your tables, you will keep relationships:

import pandas as pd
import random
df = pd.DataFrame({'ID': [1,1,2,3],
                  'value': ['sdfsdfs', 'fhgdhds', 'rgeerfdg', 'regfdd']})
# create dictionary to replace ID column with
IDs = df['ID'].unique()
anon = {}
for ID in IDs:
    annon[ID] = random.randint(111111,999999)
df.replace({'ID': anon}, inplace=True)
df

       ID     value
0  373005   sdfsdfs
1  373005   fhgdhds
2  152871  rgeerfdg
3  934125    regfdd
0
Blach15 On

Depending on how secure the anonymity must be. You can just use a hash function on the data like sha256. In python you can import and use hashlib.

import hashlib

def sha256Hash(id):
    return hashlib.sha256((f"{id}").encode("utf-8")).hexdigest()

IDs = [12345, 563279]

newIDs = [sha256Hash(id) for id  in IDs]
print(newIDs)

This will always change the id too the same newId