GCP BigQuery Error in Load Operation: Bytes are Missing


I am very new to Google Cloud Platform and I'm trying to create a table in BigQuery from ~60,000 .csv.gz files stored in a GCP bucket.

To do this, I've opened Cloud Shell, and I'm trying the following:

$ bq --location=US mk my_data
$ bq --location=US \
     load --null_marker='' \
     --source_format=CSV --autodetect \
     my_data.my_table gs://my_bucket/*.csv.gz

This throws the following error:

BigQuery error in load operation: Error processing job 'my_job:bqjob_r3eede45779dc9a51_0000017529110a63_1': 
Error while reading data, error message:
FAILED_PRECONDITION: Invalid gzip file: bytes are missing

I don't know how to find which file is causing the load to fail. I've spot-checked a few of the files, and they are all valid .gz archives that I can open with any CSV reader after decompression, but I don't know how to go through all ~60,000 files to find a problematic one.

Thank you in advance for any help with this!


There are 2 answers

Ksign (accepted answer)

To loop through your bucket, you can use the eval command:

#!/bin/bash
# List every object in the bucket (replace YOUR_BUCKET with your bucket name).
FILES="gsutil ls gs://YOUR_BUCKET"
RESULTS=$(eval "$FILES")
for f in $RESULTS
do
    # Decompress the object and count the bytes; 0 means it decompressed to nothing.
    check="gsutil cat $f | zcat | wc -c"
    if [[ $(eval "$check") == "0" ]]
    then
        # Process it: print the name, or delete it from the bucket as below.
        delete="gsutil rm $f"
        eval "$delete"
    fi
done
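
A byte count of zero only catches archives that decompress to nothing, though. Since the error complains about missing bytes, the culprit may be a truncated archive; zcat exits non-zero on those even when it emits partial output, so a minimal sketch that checks the exit status instead (same YOUR_BUCKET placeholder) would be:

for f in $(gsutil ls gs://YOUR_BUCKET/*.csv.gz)
do
    # zcat returns a non-zero exit status on truncated or corrupt input;
    # flag any object whose gzip stream does not decompress cleanly.
    if ! gsutil cat "$f" | zcat > /dev/null 2>&1
    then
        echo "Problematic archive: $f"
    fi
done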

Another option is to download all your files locally, if possible, and process from there:

gsutil -m cp -R gs://YOUR_BUCKET .
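
Once the files are local, gzip's built-in test mode gives a quick integrity pass. A sketch, assuming the bucket was copied into a local YOUR_BUCKET directory:

for f in YOUR_BUCKET/*.csv.gz
do
    # gzip -t tests archive integrity without writing any decompressed output;
    # it exits non-zero for truncated or corrupt files.
    gzip -t "$f" 2>/dev/null || echo "Corrupt: $f"
done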
Rally H

There might be .gz files that do not contain any data. You might want to write a script that checks whether each .gz file is valid.

This sample bash script loops through the .gz files in a directory and deletes the ones that are empty.

for f in dir/*.gz
do
    # head -c1 reads at most one byte; a count of 0 means the archive decompressed to nothing.
    if [[ $(gunzip -c "$f" | head -c1 | wc -c) == "0" ]]
    then
        rm "$f"   # delete the empty archive
    fi
done
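
Once the empty or truncated archives have been removed (or re-uploaded intact), re-running the bq load command from the question should no longer hit the FAILED_PRECONDITION error.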