Concat Avro files in Google Cloud Storage


I have some big .avro files in Google Cloud Storage and I want to concatenate all of them into a single file.

I found this avro-tools command:

java -jar avro-tools.jar concat

However, since my files are in Google Cloud Storage, under a path such as gs://files.avro, I can't concatenate them with avro-tools. Any suggestions on how to solve this?

There are 2 answers

Donnald Cucharo (best answer)

You can use the gsutil compose command. Its general form is:

gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite

Note: For extremely large files and/or very low per-machine bandwidth, you may want to split the file and upload it from multiple machines, and later compose these parts of the file manually.

In my case, I tested it with two files: foo.txt contains the word Hello and bar.txt contains the word World. I ran this command:

gsutil compose gs://bucket/foo.txt gs://bucket/bar.txt gs://bucket/baz.txt

Afterwards, baz.txt contains:

Hello
World
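The test above can be reproduced with a short script. This is only a sketch: BUCKET is a hypothetical environment variable (for example gs://my-bucket) that is not part of the original answer. When it is unset, the script falls back to a local cat, which is only a stand-in showing the same in-order byte concatenation and does not touch GCS.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Create the two input files used in the test above.
tmp="$(mktemp -d)"
echo "Hello" > "$tmp/foo.txt"
echo "World" > "$tmp/bar.txt"

if [ -n "${BUCKET:-}" ]; then
  # BUCKET is a hypothetical variable, e.g. BUCKET=gs://my-bucket
  gsutil cp "$tmp/foo.txt" "$tmp/bar.txt" "$BUCKET/"
  gsutil compose "$BUCKET/foo.txt" "$BUCKET/bar.txt" "$BUCKET/baz.txt"
  result="$(gsutil cat "$BUCKET/baz.txt")"
else
  # Local stand-in: joins the same bytes in the same order, without GCS.
  result="$(cat "$tmp/foo.txt" "$tmp/bar.txt")"
fi

printf '%s\n' "$result"
```

Note that the two words end up on separate lines only because echo writes a trailing newline into foo.txt; nothing is added between the parts.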

Note: GCS does not support inter-bucket composing, so the source objects and the composite object must all be in the same bucket.

If you encounter an exception related to integrity checks, run gsutil help crcmod for instructions on how to fix it.
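One caveat for Avro specifically: compose joins the source objects byte for byte, and each Avro container file starts with its own header, so it is worth checking that the composite still reads back as a single Avro file. If it does not, a fallback is to copy the files to a local disk, run the avro-tools concat command from the question, and upload the result. The sketch below assumes that approach. The bucket paths, the WORK and OUT locations, and the run and merge_avro helpers are all hypothetical names, and by default the script only prints the commands (DRY_RUN=1). As far as I know, concat expects all inputs to share the same schema and codec.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# merge_avro SRC_PREFIX DST_OBJECT
# All names and default paths below are hypothetical placeholders.
merge_avro() {
  local src="$1" dst="$2"
  local jar="${JAR:-avro-tools.jar}"
  local work="${WORK:-./avro-work}"
  local out="${OUT:-./merged.avro}"

  run mkdir -p "$work"
  # Quoted so that gsutil, not the shell, expands the wildcard.
  run gsutil -m cp "$src/*.avro" "$work/"
  # Unquoted glob: the shell expands it to the downloaded files.
  run java -jar "$jar" concat "$work"/*.avro "$out"
  run gsutil cp "$out" "$dst"
}

# Dry run: prints the four commands without touching GCS.
merge_avro gs://my-bucket/input gs://my-bucket/merged.avro
```

To actually execute the steps, set DRY_RUN=0 and point JAR at your avro-tools jar. The local disk needs enough free space for all inputs plus the merged file.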

punkrockpolly

Check out https://github.com/spotify/gcs-tools

It is a lightweight wrapper that adds Google Cloud Storage (GCS) support to common Hadoop tools, including avro-tools, parquet-cli, proto-tools for Scio's Protobuf in Avro file, and magnolify-tools for Magnolify code generation, so that they can be used from regular workstations or laptops, outside of a Google Compute Engine (GCE) instance.