Data
The data is stored as several .jsonl.gz files across multiple folders in a Google Cloud Storage bucket.
For example, if I were to load the data file by file, the files would look something like:
gs://my-bucket/folder1/something1-10.jsonl.gz
gs://my-bucket/folder1/something11-20.jsonl.gz
gs://my-bucket/folder1/something21-30.jsonl.gz
gs://my-bucket/folder2/something1-10.jsonl.gz
gs://my-bucket/folder2/something11-20.jsonl.gz
gs://my-bucket/folder2/something21-30.jsonl.gz
These jsonl files contain nothing unusual; each file's content looks something like:
{"name": "Foo", "surname": "Loo"}
{"name": "Lauren", "surname": "Impsum"}
...
The individual files are not large, but together they add up to well over 500 GB, which makes it impractical to load them into memory, concatenate them into a single file, and re-upload the result.
BigQuery
It seems that BigQuery allows users to load data from jsonl files, whether they are compressed or not. However, it does not seem to support reading from multiple files, nor from multiple folders.
Other Tools
I am not restricted to BigQuery, but I am restricted to GCP (even if that means creating a database instance backed by a GKE cluster).
Does anyone know how I can load this data into a database?
The bq command-line tool supports loading a set of gzipped JSONL files natively. Try the command below.