use s3-dist-cp to merge parquet files


I'm wondering if it's possible to use the s3-dist-cp tool to merge Parquet files (Snappy compressed). I tried the "--groupBy" and "--targetSize" options, and they did merge the small files into bigger ones, but I then can't read the output in Spark or AWS Athena. In Athena I get the following error:

HIVE_CURSOR_ERROR: Expected 246379 values in column chunk at s3://my_analytics/parquet/auctions/region=us/year=2017/month=1/day=1/output123 offset 4 but got 247604 values instead over 1 pages ending at file offset 39

This query ran against the "randomlogdatabase" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 4ff77c55-3b69-414d-8fd9-a3d135f5ff2f.
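For reference, the invocation was something like the following (the paths and target size here are illustrative, not the exact command; --targetSize is in mebibytes):

```
s3-dist-cp \
  --src s3://my_analytics/parquet/auctions/ \
  --dest s3://my_analytics/parquet-merged/auctions/ \
  --groupBy '.*/(region=us/year=2017/month=1/day=1)/.*' \
  --targetSize 512
```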

Any help is appreciated.


2 Answers

Steve McKay:

Parquet files have significant structure. This page covers it in detail, but the upshot is that the metadata is stored at the end like a zip file and concatenating Parquet files will break them. To merge Parquet files you need to use something like Spark that understands Parquet's file format.
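A minimal PySpark sketch of that approach (the input path is the one from the error message above; the output path and single output partition are arbitrary examples):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-parquet").getOrCreate()

# Read the many small Parquet files for one partition of the dataset.
df = spark.read.parquet(
    "s3://my_analytics/parquet/auctions/region=us/year=2017/month=1/day=1/"
)

# Spark rewrites the rows through its Parquet writer, producing valid
# footers and row-group metadata instead of a raw byte concatenation.
df.coalesce(1).write.mode("overwrite").parquet(
    "s3://my_analytics/parquet-merged/auctions/region=us/year=2017/month=1/day=1/"
)
```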

jmng:

According to the AWS docs:

S3DistCp does not support concatenation for Parquet files

Note that the recommendation on that page, which is to read the files into a Spark DataFrame and call coalesce(n) before writing, can be problematic for large datasets, as the API docs warn:

if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1).
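If that warning applies, one option is to use repartition(n) instead, which keeps the write distributed across the cluster. A sketch under the same assumptions as above (the partition count is an arbitrary example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-parquet").getOrCreate()

df = spark.read.parquet(
    "s3://my_analytics/parquet/auctions/region=us/year=2017/month=1/day=1/"
)

# repartition() performs a full shuffle, so unlike a drastic coalesce()
# the computation stays spread across the cluster; the cost is the
# shuffle itself.
df.repartition(8).write.mode("overwrite").parquet(
    "s3://my_analytics/parquet-merged/auctions/region=us/year=2017/month=1/day=1/"
)
```

The trade-off is the extra shuffle, but for large datasets that is usually cheaper than funneling the entire write through one executor.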