GCP PubSub to DLP Integration via Dataflow


I have a situation here. I want to figure out the best way to ingest streaming API data from an application into GCP BigQuery while masking sensitive data. However, some downstream admin users will also need to see the unmasked data.

What I am thinking is to implement event-based ingestion: Pub/Sub triggers a Dataflow pipeline as soon as a new file is published, and the pipeline has two branches inside it.

Branch 1: Call DLP to mask the incoming data and load it into table T1 in BigQuery.
Branch 2: Use the "Pub/Sub Topic to BigQuery" template to load the unmasked (as-is) data from the source into another table T2 in BigQuery.

I can later use role-based access control to give general users access to T1 and admin users access to T2.
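
For example, I could grant table-level access roughly like this with the BigQuery Python client (the project, table, and group names below are just placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="your-project")  # placeholder project

# Give a general-user group read access on the masked table T1 only;
# an equivalent binding on T2 would be limited to the admin group.
table_id = "your-project.your_dataset.T1"
policy = client.get_iam_policy(table_id)
policy.bindings.append(
    {
        "role": "roles/bigquery.dataViewer",
        "members": {"group:general-users@example.com"},  # placeholder principal
    }
)
client.set_iam_policy(table_id, policy)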

My question to you is about the first branch in the Dataflow pipeline. Is there any template available to call DLP and mask the incoming data row by row? How can this be done? Do I need to use Apache Beam here?

Or is it the case that my entire design is wrong and a better approach can be implemented as a whole? Please guide me.

Any direction here will help me plan this project and build the Dataflow pipeline accordingly.


1 Answer

XQ Hu answered:

Your approach seems reasonable. I do not think there is an existing template for this, but it is quite easy to create one. For example, if you use Python, it will look roughly like this:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Pub/Sub sources require a streaming pipeline.
with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    raw_messages = (p
        | 'Read from PubSub' >> beam.io.ReadFromPubSub(topic='projects/your-project/topics/your-topic')
        | 'Parse JSON' >> beam.Map(json.loads))  # message bytes -> dict rows for BigQuery
    # branch 1: mask each row with DLP, then load the masked table T1
    _ = (raw_messages
        | 'Mask with DLP' >> beam.Map(process_message_using_dlp)
        | 'Write masked to BigQuery' >> beam.io.WriteToBigQuery('your-project:your_dataset.t1_masked'))
    # branch 2: load the unmasked rows as-is into the restricted table T2
    _ = (raw_messages
        | 'Write raw to BigQuery' >> beam.io.WriteToBigQuery('your-project:your_dataset.t2_raw'))
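
The process_message_using_dlp function is something you write yourself. A rough sketch using the google-cloud-dlp client is below; the infoTypes, masking character, project id, and the JSON round trip are just assumptions to illustrate the idea:

import json

from google.cloud import dlp_v2

def process_message_using_dlp(record, project='your-project'):
    """Mask sensitive values in one row by calling the Cloud DLP de-identify API."""
    # Creating a client per element keeps the sketch short; in a real pipeline
    # you would wrap this in a DoFn and create the client once in setup().
    dlp = dlp_v2.DlpServiceClient()
    response = dlp.deidentify_content(
        request={
            'parent': f'projects/{project}',
            # Which infoTypes to look for -- adjust to your data.
            'inspect_config': {
                'info_types': [{'name': 'EMAIL_ADDRESS'}, {'name': 'PHONE_NUMBER'}]
            },
            # Replace every finding with '#' characters.
            'deidentify_config': {
                'info_type_transformations': {
                    'transformations': [{
                        'primitive_transformation': {
                            'character_mask_config': {'masking_character': '#'}
                        }
                    }]
                }
            },
            # Send the row as a JSON string; DLP also accepts structured table items.
            'item': {'value': json.dumps(record)},
        }
    )
    return json.loads(response.item.value)

Calling DLP once per row is the simplest option but adds per-request latency and cost; batching several rows into one deidentify_content call (or using DLP's table item type) is a common optimization.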