I'm wondering if it's possible to use the s3-dist-cp tool to merge Parquet files (Snappy compressed). I tried the "--groupBy" and "--targetSize" options, and they did merge the small files into bigger ones, but I then can't read the merged files with Spark or AWS Athena. In Athena I get the following error:
HIVE_CURSOR_ERROR: Expected 246379 values in column chunk at s3://my_analytics/parquet/auctions/region=us/year=2017/month=1/day=1/output123 offset 4 but got 247604 values instead over 1 pages ending at file offset 39
Any help is appreciated.
Parquet files have significant internal structure. This page covers it in detail, but the upshot is that the metadata lives in a footer at the end of the file, much like a zip archive, so byte-level concatenation of Parquet files (which is what s3-dist-cp's "--groupBy" does) produces corrupt files. To merge Parquet files you need a tool that understands Parquet's format, such as Spark: read the small files in and write them back out as fewer, larger ones.