How to read every file in a Google Storage bucket from BigQuery and create one table for each file in the bucket?


I have a bucket that contains 28 files and I would like to load them into different tables (one table per file in the bucket).

Each file in the bucket looks like this:

file_1_2020.csv
file_2_2020.csv
..
file_28_2020.csv

I would like each table to be named with the first 6 characters of the filename plus _table (they are all CSVs with headers). For example, file_1_table.

I read in a blog that I could achieve this by creating a federated query that lists all files in the bucket, then looping through the names and creating the tables I need. The thing is, all the GCP documentation I have found covers federated queries against Cloud SQL, so I don't really know how to make BigQuery read my bucket.

How can I achieve this? It doesn't have to be done with a federated query (although I would like to know how, because it's always good to learn new things). I would also like the tables to be partitioned and clustered by datetime, which is the first column of each file (the schema of each file is datetime:TIMESTAMP,col1:FLOAT,col2:FLOAT).
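Something like this is what I have in mind for each table (a sketch only; I'm assuming FLOAT maps to FLOAT64 and that partitioning on DATE(datetime) with clustering on the same datetime column is what I want):

-- Hypothetical target shape for one file; table name and partitioning scheme are assumptions
CREATE TABLE mydataset.file_1_table (
  datetime TIMESTAMP,
  col1 FLOAT64,
  col2 FLOAT64
)
PARTITION BY DATE(datetime)
CLUSTER BY datetime;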


There are 2 answers

Soumendra Mishra (BEST ANSWER)

You can use EXECUTE IMMEDIATE to achieve this:

Step-1: Create an external table to get the list of files in the storage bucket

CREATE EXTERNAL TABLE mydataset.mytable OPTIONS (format = 'CSV', uris = ['gs://bucket_name/*.csv']);
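If you want to sanity-check what the wildcard picked up before looping, the external table exposes the _FILE_NAME pseudo-column, which returns the gs:// URI each row came from (same dataset and table names as assumed above):

-- List the distinct source files visible through the external table
SELECT DISTINCT _FILE_NAME AS file_uri
FROM mydataset.mytable;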

Step-2: Use EXECUTE IMMEDIATE to create tables dynamically

DECLARE FILE_LIST ARRAY<STRING>;
DECLARE TABLE_NAME STRING;
DECLARE I INT64 DEFAULT 1;
DECLARE CNT INT64 DEFAULT 0;

-- _FILE_NAME is a pseudo-column of the external table holding the gs:// URI each row came from
SET FILE_LIST = ARRAY(SELECT DISTINCT _FILE_NAME AS FILENAME FROM mydataset.mytable);
SET CNT = ARRAY_LENGTH(FILE_LIST);

WHILE I <= CNT DO
  -- Take the part of the URI after the last '/', strip '.csv', keep the first 6 characters, append '_table'
  SET TABLE_NAME = CONCAT(SUBSTR(REPLACE(SUBSTR(FILE_LIST[ORDINAL(I)], INSTR(FILE_LIST[ORDINAL(I)], '/', -1) + 1), '.csv', ''), 1, 6), '_table');
  -- Create one external table per source file
  EXECUTE IMMEDIATE "CREATE EXTERNAL TABLE mydataset." || TABLE_NAME || " OPTIONS (format = 'CSV', uris = ['" || FILE_LIST[ORDINAL(I)] || "'])";
  SET I = I + 1;
END WHILE;
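The loop above creates one external table per file. If you would rather end up with native tables partitioned and clustered by datetime, as the question asks, one possible variation (a sketch only, assuming the stated schema, that skip_leading_rows handles the header row, and that the LOAD DATA statement is available to you) is to swap the EXECUTE IMMEDIATE statement inside the loop for something like:

  -- Sketch: load each file into a native table, partitioned by day and clustered on datetime
  EXECUTE IMMEDIATE FORMAT("""
    LOAD DATA INTO mydataset.%s (datetime TIMESTAMP, col1 FLOAT64, col2 FLOAT64)
    PARTITION BY DATE(datetime)
    CLUSTER BY datetime
    FROM FILES (format = 'CSV', skip_leading_rows = 1, uris = ['%s'])
  """, TABLE_NAME, FILE_LIST[ORDINAL(I)]);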
Pentium10

You could leverage Cloud Workflows, a new Google Cloud product, for this use case.

In a nutshell, Workflows allows you to connect services together: anything that has a public API. You can integrate any Google Cloud API, SaaS API, or private API.

Essentially you need to connect Cloud Storage with BigQuery to load the files.


You will need to have the following steps defined in the Workflow:

  • use the Cloud Storage API to fetch the files from your bucket/folder
  • extract segments from the filename (e.g. 12 and 2020) into variables for later use
  • use the BigQuery Load API to submit the source URIs obtained in the first step, combining the segments from step 2 to build your destination table name (a sketch follows this list)
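A minimal sketch of such a workflow, assuming the Cloud Storage objects.list and BigQuery jobs.insert connectors, a bucket named bucket_name, a dataset named mydataset, a project named my-project, and the first-6-characters naming from the question (swap in your own segment extraction as needed):

main:
  steps:
    # Step 1: list the CSV files in the bucket
    - listFiles:
        call: googleapis.storage.v1.objects.list
        args:
          bucket: "bucket_name"
        result: filesList
    # Steps 2 and 3: derive a table name per file and submit a BigQuery load job
    - loadEachFile:
        for:
          value: file
          in: ${filesList.items}
          steps:
            - loadFile:
                call: googleapis.bigquery.v2.jobs.insert
                args:
                  projectId: "my-project"
                  body:
                    configuration:
                      load:
                        sourceUris:
                          - '${"gs://bucket_name/" + file.name}'
                        destinationTable:
                          projectId: "my-project"
                          datasetId: "mydataset"
                          tableId: '${text.substring(file.name, 0, 6) + "_table"}'
                        sourceFormat: "CSV"
                        skipLeadingRows: 1
                        autodetect: true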

I have a full blog post about this: Using Cloud Workflows to load Cloud Storage files into BigQuery.