Google Cloud Data Loss Prevention (DLP) scanning .parquet files in GCS


I am new to Google Cloud DLP. I ran a POST to https://dlp.googleapis.com/v2beta1/inspect/operations to scan a .parquet file in a Google Cloud Storage directory, and I also used cloudStorageOptions to save the .csv output.

The .parquet file is 53.93 MB.

When I make the API call on the .parquet file, I get:

"processedBytes": "102308122",
"infoTypeStats": [{
   "infoType": {
      "name": "AMERICAN_BANKERS_CUSIP_ID"
   },
   "count": "1"
}, {
   "infoType": {
      "name": "IP_ADDRESS"
   },
   "count": "17"
}, {
   "infoType": {
      "name": "US_TOLLFREE_PHONE_NUMBER"
   },
   "count": "148"
}, {
   "infoType": {
      "name": "EMAIL_ADDRESS"
   },
   "count": "30"
}, {
   "infoType": {
      "name": "US_STATE"
   },
   "count": "22"
}]

When I convert the .parquet file to .csv, I get a 360.58 MB file. If I then make the API call on the .csv file, I get:

"processedBytes": "377530307",
"infoTypeStats": [{
   "infoType": {
      "name": "CREDIT_CARD_NUMBER"
   },
   "count": "56546"
}, {
   "infoType": {
      "name": "EMAIL_ADDRESS"
   },
   "count": "372527"
}, {
   "infoType": {
      "name": "NETHERLANDS_BSN_NUMBER"
   },
   "count": "5"
}, {
   "infoType": {
      "name": "US_TOLLFREE_PHONE_NUMBER"
   },
   "count": "1331321"
}, {
   "infoType": {
      "name": "AUSTRALIA_TAX_FILE_NUMBER"
   },
   "count": "52269"
}, {
   "infoType": {
      "name": "PHONE_NUMBER"
   },
   "count": "28"
}, {
   "infoType": {
      "name": "US_DRIVERS_LICENSE_NUMBER"
   },
   "count": "114"
}, {
   "infoType": {
      "name": "US_STATE"
   },
   "count": "141383"
}, {
   "infoType": {
      "name": "KOREA_RRN"
   },
   "count": "56144"
}],

Clearly, the scan of the .parquet file detects far fewer infoTypes and far fewer findings than the scan of the .csv file, where I verified that all email addresses were detected.
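For reference, the gap between the two results can be tabulated with a short Python sketch. The helper name `compare_counts` is hypothetical, and the two dictionaries simply restate the counts from the responses above; nothing here calls the DLP API.

```python
# Counts copied from the two infoTypeStats responses above.
parquet_stats = {
    "AMERICAN_BANKERS_CUSIP_ID": 1,
    "IP_ADDRESS": 17,
    "US_TOLLFREE_PHONE_NUMBER": 148,
    "EMAIL_ADDRESS": 30,
    "US_STATE": 22,
}
csv_stats = {
    "CREDIT_CARD_NUMBER": 56546,
    "EMAIL_ADDRESS": 372527,
    "NETHERLANDS_BSN_NUMBER": 5,
    "US_TOLLFREE_PHONE_NUMBER": 1331321,
    "AUSTRALIA_TAX_FILE_NUMBER": 52269,
    "PHONE_NUMBER": 28,
    "US_DRIVERS_LICENSE_NUMBER": 114,
    "US_STATE": 141383,
    "KOREA_RRN": 56144,
}


def compare_counts(a, b):
    """Hypothetical helper: map each infoType seen in either scan
    to a (count_in_a, count_in_b) pair, using 0 when it is absent."""
    return {name: (a.get(name, 0), b.get(name, 0))
            for name in sorted(set(a) | set(b))}


# Print one row per infoType, side by side.
for name, (p, c) in compare_counts(parquet_stats, csv_stats).items():
    print(f"{name:28} parquet={p:>4} csv={c:>8}")
```

Six infoTypes reported for the .csv file do not appear at all in the .parquet result, and the three infoTypes reported in both scans differ by several orders of magnitude.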

I couldn't find any documentation on compressed files such as parquet, so I am assuming that Google Cloud DLP doesn't offer this capability.

Any help would be greatly appreciated.


There is 1 answer

Jordanna Chord On

Parquet files are currently scanned as binary objects, because the system does not yet parse their structure. The file types supported by the v2 API are listed here: https://cloud.google.com/dlp/docs/reference/rpc/google.privacy.dlp.v2#filetype.
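Given that, one workaround is to scan a text copy of the data (such as the .csv conversion from the question) instead of the .parquet file. Below is a minimal Python sketch of a request body for a v2 inspect job restricted to text files. The helper name `build_inspect_job` and the `gs://` path are hypothetical, and the field names reflect my reading of the v2 reference, so they should be checked against the link above. It assumes the .csv copy is picked up under the `TEXT_FILE` file type, and it only builds the JSON without sending it.

```python
import json


def build_inspect_job(gcs_url, info_types):
    """Hypothetical helper: build a request body for a v2 inspect job
    that scans only text files under the given Cloud Storage URL."""
    return {
        "inspectJob": {
            "storageConfig": {
                "cloudStorageOptions": {
                    "fileSet": {"url": gcs_url},
                    # Assumption: limit the scan to text files so the
                    # content is parsed as text, not as a binary object.
                    "fileTypes": ["TEXT_FILE"],
                }
            },
            "inspectConfig": {
                "infoTypes": [{"name": name} for name in info_types],
            },
        }
    }


# Placeholder bucket and object path; substitute the real CSV location.
body = build_inspect_job("gs://my-bucket/exports/data.csv",
                         ["EMAIL_ADDRESS", "CREDIT_CARD_NUMBER"])
print(json.dumps(body, indent=2))
```

As I understand the v2 API, a body like this would be sent as a POST to the `dlpJobs` collection of a project. Output actions are left out here, so this sketch does not save findings anywhere.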