Match hex string with list index


I'm building a de-identification tool. It replaces all names with other names. For example:

We got a report that <name>Peter</name> met <name>Jane</name> yesterday. <name>Peter</name> is suspicious.

Output:

We got a report that <name>Billy</name> met <name>Elsa</name> yesterday. <name>Billy</name> is suspicious.

It can be run on multiple documents, and a given name is always replaced by the same counterpart, so you can still understand who the text is talking about. BUT every document has an ID referring to the person the file is about (I'm working with files in a public service), and only documents with the same person ID are de-identified the same way, with the same names (the goal is to follow people's history and how it evolves). This is a security measure: when I hand the tool over to a third party, I don't hand over the key to my own documents with it.

So the same input, with a different ID, produces:

We got a report that <name>Henry</name> met <name>Alicia</name> yesterday. <name>Henry</name> is suspicious.

Right now, I hash each name with the document ID as a salt, convert the hash to an integer, then subtract the length of the name list until I can look up a name with that integer as an index. But I feel like there should be a quicker, more straightforward approach?
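A minimal sketch of the scheme described above, with the repeated subtraction replaced by a single modulo operation (which yields the same index). The pool `REPLACEMENT_NAMES`, the function name `pseudonym`, the `":"` separator and the choice of SHA-256 are all assumptions made for illustration, not details from the actual tool:

```python
import hashlib

# Hypothetical replacement pool; a real tool would use a much larger list.
REPLACEMENT_NAMES = ["Billy", "Elsa", "Henry", "Alicia", "Oscar", "Nina"]


def pseudonym(name, doc_id, pool=REPLACEMENT_NAMES):
    """Map (doc_id, name) to a stable entry of pool."""
    # Salt the name with the document ID, then hash to a hex string.
    salted = (doc_id + ":" + name).encode("utf-8")
    digest = hashlib.sha256(salted).hexdigest()
    # int(..., 16) parses the hex string; % folds the integer into a valid
    # list index in one step instead of subtracting len(pool) repeatedly.
    return pool[int(digest, 16) % len(pool)]


# Same name and same ID always give the same counterpart.
print(pseudonym("Peter", "doc-17") == pseudonym("Peter", "doc-17"))
```

Note that with any scheme of this kind, two different real names can land on the same index, so the pool should be large compared to the number of names per document, or collisions need to be handled separately.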

It's really more of an algorithmic question, but in case it's relevant, I'm working with Python 2.7. Please ask for more explanation if needed. Thank you!


I hope it's clearer this way ô_o Sorry, when you are neck-deep in your code you forget that others need the bigger picture to understand how you got there.


There are 2 answers

1
Marcus Müller

As @LutzHorn pointed out, you could just use a dict to map real names to false ones.

You could also just do something like:

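existing_names = []
# original_text is assumed to be an iterable of name-occurrence objects
for name_occurrence in original_text:
    if name_occurrence.name not in existing_names:
        # first time this name is seen: give it the next free index
        name_occurrence.id = len(existing_names)
        existing_names.append(name_occurrence.name)
    else:
        name_occurrence.id = existing_names.index(name_occurrence.name)

# swap every real name for a random one; the ids still point at the right slot
# (gimme_random_name() is a placeholder for your own name generator)
for idx, _ in enumerate(existing_names):
    existing_names[idx] = gimme_random_name()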

0
AudioBubble

Try using a dictionary of names.

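import re

s = ("We got a report that <name>Peter</name> met <name>Jane</name> "
     "yesterday. <name>Peter</name> is suspicious.")
names = {"Peter": "Billy", "Jane": "Elsa"}

# Replace every tagged name in a single pass; the callback looks up the
# substitute, so a name is never substituted twice.
s = re.sub(r"<name>([a-zA-Z]+)</name>",
           lambda m: "<name>" + names[m.group(1)] + "</name>", s)
print(s)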

Output:

We got a report that <name>Billy</name> met <name>Elsa</name> yesterday. <name>Billy</name> is suspicious.