Create Data Catalog column tags by inspecting BigQuery data with Cloud Data Loss Prevention

550 views Asked by At

I want to use DLP to inspect my tables in BigQuery, and then write the findings to policy tags on the columns of the table. For example, I have a (test) table that contains data including an email address and a phone number for individuals. I can use DLP to find those fields and identify them as emails and phone numbers, and I can do this in the console or via the API (I'm using NodeJS). When creating this inspection job, I know I can configure it to automatically write the findings to the Data Catalog, but this generates a tag on the table, not on the columns. I want to tag the columns with the specific type of PII that has been identified.

I found this tutorial that appears to achieve exactly that - but tutorial is a strong word; it's a script written in Java and a basic explanation of what that script does, with the only actual instructions being to clone the git repo and run a few commands. There's no information about which API calls are being made, not a lot of comments in the code, and no links to pertinent documentation. I have zero experience with Java, so I'm not able to work out the process and translate it into NodeJS for my own purposes.

I also found this similar tutorial which also utilises Dataflow, and again the instructions are simply "clone this repo, run this script". I've included the link because it features a screenshot showing what I want to achieve: tagging columns with PII data found by DLP

So, what I want to do appears to be possible, but I can't find useful documentation anywhere. I've been through the DLP and Data Catalog docs, and through the API references for NodeJS. If anyone could help me figure out how to do this, I'd be very grateful.

UPDATE: I've made some progress and changed my approach as a result.

DLP provides two methods to inspect data: dlp.inspectContent() and dlp.createDlpJob(). The latter takes a storageItem which can be a BigQuery table, but it doesn't return any information about the columns in the results, so I don't believe I can use it.

inspectContent() cannot be run on a BigQuery table; it can inspect structured text, which is what the Java script I linked above is utilising; that script is querying the BigQuery table, and constructing a Table from the results, then passing that Table into inspectContent() which then returns a Findings object which contains fieldnames. I want to do exactly that, but in NodeJS. I'm struggling to convert the BigQuery results into the format of a Table as NodeJS doesn't appear to have a constructor for that type, like Java does.

1

There are 1 answers

2
Felix L. On

I was unable to find node.js documentation implementing column level tags.

However, you might find the Policy Tags official documentation helpful to point you in the right direction. Specifically, you might lack some roles to manage column-level tags.