I would like to write a script that can search a file system for Personally Identifiable Information (PII), such as card numbers, and report on what it finds. I would like it to cover plain .txt files as well as Excel (.xls), Word, and PDF files.
Any starting tips or library suggestions are welcome.
I'd also like advice on an efficient way to scan large files for patterns such as credit card numbers.
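For the "large files" part of the question, one common approach is to read the file in fixed-size chunks, keep a small overlap between chunks so a card number split across a boundary is not missed, and validate each regex candidate with the Luhn checksum to cut down false positives. A minimal stdlib-only sketch (the pattern, chunk size, and overlap length are illustrative choices, not a complete PAN detector):

```python
import re

# Candidate card pattern: 13-19 digits, optionally separated by spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(number[::-1]):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def scan_file(path: str, chunk_size: int = 1 << 20, overlap: int = 32):
    """Scan a large text file in chunks, carrying a small tail of the
    previous chunk forward so boundary-spanning matches are still seen.
    (A production version should dedupe hits by byte offset, since a
    match inside the overlap region can be reported twice.)"""
    hits = []
    tail = ""
    with open(path, "r", errors="ignore") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            for m in CARD_RE.finditer(buf):
                digits = re.sub(r"[ -]", "", m.group())
                if luhn_valid(digits):
                    hits.append(digits)
            tail = buf[-overlap:]
    return hits
```

For Word, Excel, and PDF you would first extract text (e.g. with python-docx, openpyxl, and a PDF text extractor) and then feed the extracted text through the same scanner.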
We are implementing a similar system that accepts data entry from dynamic forms and CSV imports. Fields are classified as list, numeric range, or free-text, and the data ends up in one field in a DB table. We scan the free-text entries to find PHI.

The data is entered via a website and stored in SQL Server. For each new import batch we push the batch id onto a RabbitMQ queue and flag all free-text fields in the batch as pending examination, which prevents them from being displayed or exported. Fields considered "safe", such as those generated from dropdowns or constrained to number ranges, are immediately available for export or display in charts; only free-text fields are locked temporarily.

A Python Windows service then pulls from the RabbitMQ queue and scans each text field for PHI, flagging it accordingly. If any fields look suspect, I get a report and check the entire import batch manually. I am currently using spaCy for entity recognition, plus aspects of Deduce to find other PHI types.
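The quarantine-then-scan flow described above can be sketched with stdlib stand-ins. Here an in-memory dict plays the role of the SQL Server table, `queue.Queue` stands in for RabbitMQ, and two toy regexes stand in for the spaCy/Deduce scanners; all names and patterns are hypothetical:

```python
import queue
import re
import threading

# Hypothetical in-memory stand-ins for the DB table and message queue.
FIELDS = {}               # field_id -> {"text": ..., "status": ...}
BATCH_QUEUE = queue.Queue()

# Toy PHI patterns (SSN-like, phone-like) standing in for spaCy/Deduce.
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
]

def import_batch(batch_id, texts):
    """Store free-text fields as 'pending' (hidden from display/export)
    and enqueue the batch id for asynchronous scanning."""
    for i, text in enumerate(texts):
        FIELDS[f"{batch_id}:{i}"] = {"text": text, "status": "pending"}
    BATCH_QUEUE.put(batch_id)

def scan_worker():
    """Pull batch ids from the queue and flag each field safe or suspect."""
    while True:
        batch_id = BATCH_QUEUE.get()
        if batch_id is None:       # sentinel: shut the worker down
            break
        for fid, field in FIELDS.items():
            if not fid.startswith(f"{batch_id}:"):
                continue
            hit = any(p.search(field["text"]) for p in PHI_PATTERNS)
            field["status"] = "suspect" if hit else "safe"
        BATCH_QUEUE.task_done()
```

In the real system the worker would be a pika consumer and the status updates would be SQL Server writes, but the shape of the flow is the same: nothing leaves "pending" until a scanner has looked at it.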
As the analysis is carried out asynchronously, I was able to put the data through multiple scan approaches without impacting performance.
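Running multiple scan approaches just means unioning their findings: a field is suspect if any scanner flags it. A tiny sketch with two hypothetical scanners (the keyword list and the SSN pattern are illustrative only):

```python
import re

def regex_scanner(text):
    """Flag SSN-like patterns (illustrative, not exhaustive)."""
    return [("ssn", m.group()) for m in re.finditer(r"\b\d{3}-\d{2}-\d{4}\b", text)]

def keyword_scanner(text):
    """Flag PHI-adjacent keywords that merit manual review (hypothetical list)."""
    keywords = {"diagnosis", "mrn", "dob"}
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [("keyword", w) for w in sorted(words & keywords)]

def run_all(text, scanners=(regex_scanner, keyword_scanner)):
    """Union the findings from every scanner; any hit marks the field suspect."""
    findings = []
    for scanner in scanners:
        findings.extend(scanner(text))
    return findings
```

Because each field is scanned off the request path, adding another scanner to the tuple only lengthens the background job, not the user-facing import.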