Load multiple folders containing several jsonl files into a Cloud Database in GCP (e.g.: BigQuery)


Data

The data is formatted as several .jsonl.gz files in multiple folders, in a Google Cloud Storage bucket.

For example, the files I would load, if I loaded the data file-by-file would be something like:

gs://my-bucket/folder1/something1-10.jsonl.gz
gs://my-bucket/folder1/something11-20.jsonl.gz
gs://my-bucket/folder1/something21-30.jsonl.gz
gs://my-bucket/folder2/something1-10.jsonl.gz
gs://my-bucket/folder2/something11-20.jsonl.gz
gs://my-bucket/folder2/something21-30.jsonl.gz

The contents of these JSONL files are nothing unusual, i.e. one JSON object per line:

{"name": "Foo", "surname": "Loo"}
{"name": "Lauren", "surname": "Impsum"}
...

The files are not large individually, but together they add up to well over 500 GB, so loading everything into memory, concatenating it into a single file, and uploading that to the cloud is not practical.

BigQuery

It seems that BigQuery allows users to load data from JSONL files, whether compressed or not. However, it does not seem to support reading from multiple files or multiple folders.

Other Tools

I am not restricted to BigQuery, but I am restricted to GCP (even if that means running a database on a GKE cluster).

Does anyone know how I can load this data into a database?


There are 2 answers

Steven Ensslen

The bq command-line tool natively supports loading a set of gzipped JSONL files.

You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard.

Try the command

bq load \
    --autodetect \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://my-bucket/folder*
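
If you prefer to do this from code instead of the CLI, here is a minimal sketch with the BigQuery Python client that performs the same wildcard load; the dataset and table names are the placeholders from the command above:

from google.cloud import bigquery

client = bigquery.Client()

# Same settings as the bq command: schema autodetection, newline-delimited JSON.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)

# One wildcard URI matches every object under both folders, because Cloud Storage
# object names are flat ("folder1/something1-10.jsonl.gz" is a single object name).
load_job = client.load_table_from_uri(
    "gs://my-bucket/folder*",
    "mydataset.mytable",
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete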
Pentium10

You could leverage Cloud Workflows, a Google Cloud product, for this use case.

In a nutshell, Workflows lets you connect services together: anything that has a public API, including any Google Cloud API, SaaS API, or private API.

You will need the following steps defined in the workflow:

  • use the Cloud Storage API to list the files in your bucket/folders
  • use the BigQuery load API, passing the URIs obtained in the first step as the sourceUris of the load job (deriving the destination table name from the folder/file names if needed)

I have a full blog post about this for CSV files, which can quickly be adapted to JSON imports: Using Cloud Workflows to load Cloud Storage files into BigQuery
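
For illustration, here is roughly what those two steps do, sketched with the Cloud Storage and BigQuery Python clients rather than a Workflows YAML definition; the bucket and table names are placeholders taken from the question:

from google.cloud import bigquery, storage

storage_client = storage.Client()
bq_client = bigquery.Client()

# Step 1: list the .jsonl.gz objects in the bucket via the Cloud Storage API.
uris = [
    f"gs://my-bucket/{blob.name}"
    for blob in storage_client.list_blobs("my-bucket")
    if blob.name.endswith(".jsonl.gz")
]

# Step 2: pass all of them as sourceUris of a single BigQuery load job.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)
job = bq_client.load_table_from_uri(uris, "mydataset.mytable", job_config=job_config)
job.result()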