Google Cloud Data Loss Prevention (DLP) scanning .parquet files in GCS

Question

Asked by Kenzie Tahiri on 01 September 2017 at 22:41

I am new to Google Cloud DLP. I sent a POST request to https://dlp.googleapis.com/v2beta1/inspect/operations to scan a .parquet file in a Google Cloud Storage directory, and I used cloudStorageOptions to save the .csv output.

The .parquet file is 53.93 MB.

When I make the API call on the .parquet file, I get:

"processedBytes": "102308122",
"infoTypeStats": [{
   "infoType": {
      "name": "AMERICAN_BANKERS_CUSIP_ID"
   },
   "count": "1"
}, {
   "infoType": {
      "name": "IP_ADDRESS"
   },
   "count": "17"
}, {
   "infoType": {
      "name": "US_TOLLFREE_PHONE_NUMBER"
   },
   "count": "148"
}, {
   "infoType": {
      "name": "EMAIL_ADDRESS"
   },
   "count": "30"
}, {
   "infoType": {
      "name": "US_STATE"
   },
   "count": "22"
}]

When I convert the .parquet file to .csv, I get a 360.58 MB file. If I then make the API call on the .csv file, I get:

"processedBytes": "377530307",
"infoTypeStats": [{
   "infoType": {
      "name": "CREDIT_CARD_NUMBER"
   },
   "count": "56546"
}, {
   "infoType": {
      "name": "EMAIL_ADDRESS"
   },
   "count": "372527"
}, {
   "infoType": {
      "name": "NETHERLANDS_BSN_NUMBER"
   },
   "count": "5"
}, {
   "infoType": {
      "name": "US_TOLLFREE_PHONE_NUMBER"
   },
   "count": "1331321"
}, {
   "infoType": {
      "name": "AUSTRALIA_TAX_FILE_NUMBER"
   },
   "count": "52269"
}, {
   "infoType": {
      "name": "PHONE_NUMBER"
   },
   "count": "28"
}, {
   "infoType": {
      "name": "US_DRIVERS_LICENSE_NUMBER"
   },
   "count": "114"
}, {
   "infoType": {
      "name": "US_STATE"
   },
   "count": "141383"
}, {
   "infoType": {
      "name": "KOREA_RRN"
   },
   "count": "56144"
}]

Clearly, the scan of the .parquet file misses most of the infoTypes that the scan of the .csv file detects. For the .csv file, I verified that all email addresses were detected.
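To make the gap concrete, here is a minimal sketch that lines up the two sets of counts quoted above. The counts are copied from the responses; the helper name `compare_counts` is hypothetical and only for illustration.

```python
# Counts copied from the .parquet scan response quoted above.
parquet_counts = {
    "AMERICAN_BANKERS_CUSIP_ID": 1,
    "IP_ADDRESS": 17,
    "US_TOLLFREE_PHONE_NUMBER": 148,
    "EMAIL_ADDRESS": 30,
    "US_STATE": 22,
}

# Counts copied from the .csv scan response quoted above.
csv_counts = {
    "CREDIT_CARD_NUMBER": 56546,
    "EMAIL_ADDRESS": 372527,
    "NETHERLANDS_BSN_NUMBER": 5,
    "US_TOLLFREE_PHONE_NUMBER": 1331321,
    "AUSTRALIA_TAX_FILE_NUMBER": 52269,
    "PHONE_NUMBER": 28,
    "US_DRIVERS_LICENSE_NUMBER": 114,
    "US_STATE": 141383,
    "KOREA_RRN": 56144,
}


def compare_counts(a, b):
    """Return {infoType: (count_in_a, count_in_b)} for every infoType seen in either scan."""
    return {name: (a.get(name, 0), b.get(name, 0)) for name in sorted(set(a) | set(b))}


comparison = compare_counts(parquet_counts, csv_counts)
for name, (p, c) in comparison.items():
    print(f"{name:28} parquet={p:>8} csv={c:>8}")
```

An infoType that one scan never reported shows up with a count of 0 on that side, which makes it easy to see which detectors only fired on the .csv file.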

I couldn't find any documentation on compressed files such as Parquet, so I assume that Google Cloud DLP doesn't offer this capability.

Any help would be greatly appreciated.

There is 1 answer

**Jordanna Chord** · Answer 1 · 2018-04-18T22:12:42+00:00

Parquet files are currently scanned as binary objects because the system does not yet parse their structure. The file types supported by the V2 API are listed here: https://cloud.google.com/dlp/docs/reference/rpc/google.privacy.dlp.v2#filetype.
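Given that, one workaround is to scan the converted .csv object instead of the .parquet file. The following is a rough sketch, not an official recipe, of a request body for a V2 inspect job over a single Cloud Storage object. The helper name `build_inspect_job_body`, the bucket, and the object path are hypothetical. The field names (`inspectJob`, `storageConfig`, `cloudStorageOptions`, `fileSet`, `inspectConfig`, `infoTypes`) reflect my reading of the V2 reference and should be checked against it before use. Authentication and the actual POST to the `projects/{project}/dlpJobs` endpoint are not shown.

```python
import json


def build_inspect_job_body(gcs_url, info_type_names):
    """Build a JSON-serializable body for a DLP V2 inspect job over one GCS object.

    Hypothetical helper: it only assembles the dictionary and sends nothing.
    """
    return {
        "inspectJob": {
            "storageConfig": {
                "cloudStorageOptions": {"fileSet": {"url": gcs_url}},
            },
            "inspectConfig": {
                "infoTypes": [{"name": name} for name in info_type_names],
            },
        }
    }


# Hypothetical bucket and object: point the scan at the CSV export,
# not at the .parquet file, which would be treated as a binary object.
body = build_inspect_job_body(
    "gs://example-bucket/export/data.csv",
    ["EMAIL_ADDRESS", "CREDIT_CARD_NUMBER"],
)
print(json.dumps(body, indent=2))
```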
