Python script to search PII

I would like to write a script that searches a file system for Personally Identifiable Information (PII), such as card numbers, and reports what it finds. It should cover plain-text files as well as Excel (XLS), Word, and PDF files.

Any tips on where to start or which libraries to use are welcome.

I'd also like advice on an efficient way to scan large files for patterns such as credit card numbers.

There are 3 answers

0
mikelus On

We are implementing a similar system which allows data entry from dynamic forms and CSV imports. Each field is classified as list, numeric range, or free-text, and the data ends up in one field in a DB table. We scan the free-text entries to find PHI.

The data is entered via a website and stored in SQL Server. For each new import batch we fire off a command that adds the batch id to a RabbitMQ queue and flags all free-text fields in the batch as pending examination, which prevents them from being displayed or exported. All fields considered "safe", such as those generated from dropdowns or based on number ranges, are immediately ready for export or display in charts. Only free-text fields are locked temporarily.

A Python Windows service then pulls from the RabbitMQ queue, scans each text field for PHI, and flags it accordingly. If any fields look suspect, I get a report and check the entire text import batch manually. I am currently using spaCy for entity recognition, and aspects of Deduce to find other PHI types.

As the analysis is carried out asynchronously, I am able to put the data through multiple scan approaches without impacting performance.
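To make the lock-then-scan flow concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration: `queue.Queue` stands in for RabbitMQ, the `fields` dict stands in for the SQL Server table, and `SUSPECT_RE` is a placeholder regex rather than the spaCy/Deduce detection described above.

```python
import queue
import re
import threading

# Placeholder detector (assumption): flags email-like strings or runs of 9+ digits.
# The real system described above uses spaCy and Deduce instead.
SUSPECT_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+|\d{9,}")

# Hypothetical in-memory "table": field id -> record (SQL Server in the real system).
fields = {
    1: {"batch": 7, "text": "Patient prefers morning visits", "status": "pending"},
    2: {"batch": 7, "text": "Contact jane.doe@example.com", "status": "pending"},
}

batch_queue = queue.Queue()  # stand-in for the RabbitMQ queue


def worker():
    """Pull batch ids and mark each pending field in the batch clean or suspect."""
    while True:
        batch_id = batch_queue.get()
        if batch_id is None:  # sentinel: stop the worker
            batch_queue.task_done()
            break
        for rec in fields.values():
            if rec["batch"] == batch_id and rec["status"] == "pending":
                hit = SUSPECT_RE.search(rec["text"])
                rec["status"] = "suspect" if hit else "clean"
        batch_queue.task_done()


def exportable():
    """Return ids of fields that were scanned and found clean."""
    return sorted(fid for fid, rec in fields.items() if rec["status"] == "clean")


def scan_batches(batch_ids):
    """Enqueue the batch ids, run one worker thread, and wait for it to finish."""
    t = threading.Thread(target=worker)
    t.start()
    for batch_id in batch_ids:
        batch_queue.put(batch_id)
    batch_queue.put(None)
    t.join()
```

Before `scan_batches([7])` runs, `exportable()` returns nothing because both fields are still pending; afterwards only the field without a suspect match is released, and the other stays flagged for manual review.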

0
Don Johnson On

Give piianalyzer a shot: https://pypi.python.org/pypi/piianalyzer/0.1.0

Or you can write your own and use a common regular expression dataset like https://github.com/madisonmay/CommonRegex
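If you go the write-your-own route, a rough sketch for card numbers might look like the following. It does not use piianalyzer or CommonRegex; the pattern and the function names (`luhn_ok`, `find_cards`, `scan_file`) are my own assumptions. The Luhn checksum filters out most random digit runs, and reading one line at a time keeps memory use flat on large files.

```python
import re
from typing import Iterator, List, Tuple

# Candidate card numbers: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(line: str) -> List[str]:
    """Return the bare digits of every Luhn-valid card-like number in one line."""
    hits = []
    for m in CARD_RE.finditer(line):
        digits = re.sub(r"[ -]", "", m.group())
        if luhn_ok(digits):
            hits.append(digits)
    return hits


def scan_file(path: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, digits) pairs, reading the file one line at a time."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            for digits in find_cards(line):
                yield lineno, digits
```

This only handles text you can read line by line; XLS, Word, and PDF files would first need their text extracted by a suitable library, and a number split across two lines would be missed.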

0
Ken On

If you're working for a company, you could consider buying a packaged solution. One I've seen advertised is Nuix. Also, Oracle has an end-to-end solution for GDPR (the new EU privacy law), which includes the kind of functionality you describe. See http://www.oracle.com/technetwork/database/security/wp-security-dbsec-gdpr-3073228.pdf.

If you have the Oracle RDBMS, there is a package called CTXSYS (now called Oracle Text) which has amazing search capabilities across documents, including PDFs, the entire Office suite, and many more. CTXSYS is included in the regular license. If you're a home user, you can download Oracle server (the Express version is fine for this function).

If you're using regexes as suggested above, one simple approach would be to search for words that are capitalized in mid-sentence, but that only helps with documents (not so much with XLS, for example). You could also build a dictionary of common names (first/last names, streets, towns). The credit cards and SSNs should be readily regex-able.
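As a rough illustration of those three ideas, here is a minimal sketch. The patterns and the tiny `KNOWN_NAMES` set are assumptions made up for the example; a real dictionary would be loaded from a much larger list, and the SSN pattern only covers the dashed AAA-GG-SSSS form.

```python
import re
from typing import List

# US SSN written as AAA-GG-SSSS. The lookaheads reject values that are never
# issued: area 000, 666, or 9xx; group 00; serial 0000.
SSN_RE = re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")

# A capitalized word that follows a lowercase letter or comma plus a space,
# i.e. one that does not start a sentence.
MID_SENTENCE_CAP_RE = re.compile(r"(?<=[a-z,] )[A-Z][a-z]+")

# Hypothetical, tiny name dictionary for illustration only.
KNOWN_NAMES = {"alice", "bob", "smith"}


def find_ssns(text: str) -> List[str]:
    """Return all SSN-shaped strings in the text."""
    return SSN_RE.findall(text)


def find_name_candidates(text: str) -> List[str]:
    """Return words capitalized in mid-sentence (possible proper nouns)."""
    return MID_SENTENCE_CAP_RE.findall(text)


def find_known_names(text: str) -> List[str]:
    """Return mid-sentence capitalized words that appear in the name dictionary."""
    return [w for w in find_name_candidates(text) if w.lower() in KNOWN_NAMES]
```

The capitalization heuristic alone is noisy (it also picks up places, months, and product names), which is why intersecting it with a dictionary helps narrow the candidates.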