I am new to Google Cloud DLP. I ran a POST https://dlp.googleapis.com/v2beta1/inspect/operations request to scan a .parquet file in a Google Cloud Storage directory, using cloudStorageOptions to save the .csv output.
The .parquet file is 53.93 MB.
When I make the API call on the .parquet file I get:
"processedBytes": "102308122",
"infoTypeStats": [{
  "infoType": {
    "name": "AMERICAN_BANKERS_CUSIP_ID"
  },
  "count": "1"
}, {
  "infoType": {
    "name": "IP_ADDRESS"
  },
  "count": "17"
}, {
  "infoType": {
    "name": "US_TOLLFREE_PHONE_NUMBER"
  },
  "count": "148"
}, {
  "infoType": {
    "name": "EMAIL_ADDRESS"
  },
  "count": "30"
}, {
  "infoType": {
    "name": "US_STATE"
  },
  "count": "22"
}]
When I convert the .parquet file to .csv I get a 360.58 MB file. If I then make the API call on the .csv file I get:
"processedBytes": "377530307",
"infoTypeStats": [{
  "infoType": {
    "name": "CREDIT_CARD_NUMBER"
  },
  "count": "56546"
}, {
  "infoType": {
    "name": "EMAIL_ADDRESS"
  },
  "count": "372527"
}, {
  "infoType": {
    "name": "NETHERLANDS_BSN_NUMBER"
  },
  "count": "5"
}, {
  "infoType": {
    "name": "US_TOLLFREE_PHONE_NUMBER"
  },
  "count": "1331321"
}, {
  "infoType": {
    "name": "AUSTRALIA_TAX_FILE_NUMBER"
  },
  "count": "52269"
}, {
  "infoType": {
    "name": "PHONE_NUMBER"
  },
  "count": "28"
}, {
  "infoType": {
    "name": "US_DRIVERS_LICENSE_NUMBER"
  },
  "count": "114"
}, {
  "infoType": {
    "name": "US_STATE"
  },
  "count": "141383"
}, {
  "infoType": {
    "name": "KOREA_RRN"
  },
  "count": "56144"
}],
Clearly, the scan of the .parquet file misses most of the infoTypes that the scan of the .csv file detects. For the .csv file I verified that all email addresses were detected.
I couldn't find any documentation on scanning compressed files such as Parquet, so I am assuming that Google Cloud DLP doesn't offer this capability.
Any help would be greatly appreciated.
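Parquet files are currently scanned as binary objects because the service does not yet parse the format. That likely explains the gap in your results: values held in encoded or compressed column data are not visible to the detectors, so only fragments that happen to appear as plain text are matched. In the V2 API, the supported file types are listed here: https://cloud.google.com/dlp/docs/reference/rpc/google.privacy.dlp.v2#filetype.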
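Until Parquet is parsed natively, scanning the converted .csv (as you already did) is the practical workaround. Below is a minimal sketch of a request body for a V2 inspect job that points at the converted file and restricts the scan to text files. The helper name build_inspect_job_request and the bucket path are hypothetical, and the field names (inspectJob, storageConfig, cloudStorageOptions, fileSet, fileTypes, inspectConfig) reflect my reading of the V2 reference linked above, so please check them against that page before relying on this.

```python
import json


def build_inspect_job_request(gcs_url, info_type_names, file_types=("TEXT_FILE",)):
    """Build a JSON-serializable body for a V2 inspect job (hypothetical helper).

    Assumption: the field names mirror the V2 API reference; verify them
    against the linked documentation before sending a real request.
    """
    return {
        "inspectJob": {
            "storageConfig": {
                "cloudStorageOptions": {
                    # Point at the converted .csv file, not the .parquet file.
                    "fileSet": {"url": gcs_url},
                    # Restrict the scan to file types the service parses as text.
                    "fileTypes": list(file_types),
                }
            },
            "inspectConfig": {
                "infoTypes": [{"name": name} for name in info_type_names],
            },
        }
    }


if __name__ == "__main__":
    # The bucket and object names here are placeholders.
    body = build_inspect_job_request(
        "gs://my-bucket/exports/data.csv",
        ["EMAIL_ADDRESS", "CREDIT_CARD_NUMBER"],
    )
    print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of the job creation request for your project. Listing only the infoTypes you care about should also cut down on noisy matches such as AUSTRALIA_TAX_FILE_NUMBER or KOREA_RRN in the .csv results.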