Data
The data is stored as several .jsonl.gz files across multiple folders in a Google Cloud Storage bucket.
For example, if I were to load the data file by file, the files would look something like:
gs://my-bucket/folder1/something1-10.jsonl.gz
gs://my-bucket/folder1/something11-20.jsonl.gz
gs://my-bucket/folder1/something21-30.jsonl.gz
gs://my-bucket/folder2/something1-10.jsonl.gz
gs://my-bucket/folder2/something11-20.jsonl.gz
gs://my-bucket/folder2/something21-30.jsonl.gz
These jsonl files contain nothing unusual; each file's content looks something like:
{"name": "Foo", "surname": "Loo"}
{"name": "Lauren", "surname": "Impsum"}
...
The individual files are not large, but together they add up to well over 500 GB, which makes it impractical to load them into memory, concatenate them into a single file, and re-upload the result.
BigQuery
It seems that BigQuery allows users to load data from jsonl files, whether they are compressed or not. However, it does not seem to support reading from multiple files, nor from multiple folders.
Other Tools
I am not restricted to BigQuery, but I am restricted to GCP (even if that means creating a database instance backed by a GKE cluster).
Does anyone know how I can load this data into a database?
The bq command-line tool supports loading a set of gzipped JSONL files natively. Try the command below.