s3-dist-cp groupBy Regex Capture


I'm using EMR to combine hundreds of thousands of very small CSV files (1-5 rows each). I want to concatenate them into files of roughly 100 MB so they are easier to work with.

My EMR job uses command-runner.jar with args:

s3-dist-cp --src=s3://surge-experiment-hourly-data/snapshot_day_zl_2023-09-17/ --dest=s3://surge-experiment-hourly-data/combined_test/ --groupBy='.*(csv)' --targetSize=100
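One thing worth ruling out is whether the single quotes around the pattern reach s3-dist-cp. This is only a sketch of that check: the question does not say how the args were supplied to the step, so the "naive" split below is an assumption about one possible way they could have been passed. It compares what a POSIX shell would hand over (quotes stripped) with a plain whitespace split (quotes kept as part of the regex), and tests both against the file name from the example key.

```python
import re
import shlex

# The command exactly as written in the question.
COMMAND = (
    "s3-dist-cp "
    "--src=s3://surge-experiment-hourly-data/snapshot_day_zl_2023-09-17/ "
    "--dest=s3://surge-experiment-hourly-data/combined_test/ "
    "--groupBy='.*(csv)' --targetSize=100"
)

# File name taken from the example key in the question.
SAMPLE_NAME = "part-07747-e878e452-0754-44f6-ae9a-d822e42f2bf1.c000.csv"


def group_by_value(args):
    """Return the value of the --groupBy argument, or None if absent."""
    for arg in args:
        if arg.startswith("--groupBy="):
            return arg.split("=", 1)[1]
    return None


# What a POSIX shell would pass: the quotes are consumed by the shell.
shell_pattern = group_by_value(shlex.split(COMMAND))

# Assumption: args split on whitespace with no shell involved,
# so the quotes stay inside the pattern.
naive_pattern = group_by_value(COMMAND.split())

for label, pattern in (("shell", shell_pattern), ("naive", naive_pattern)):
    matched = re.fullmatch(pattern, SAMPLE_NAME) is not None
    print(f"{label}: pattern={pattern!r} matches={matched}")
```

If the quoted form is what the step actually receives, the pattern would require a literal `'` in the path and match nothing, which would be consistent with the warning in the logs. Whether that is what happened here cannot be told from the question alone.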

The logs show that it fetches the info for every file, but then it reports:

2023-10-30 21:33:34,930 WARN com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): No. of files to copy should be greater than 0

Example file:

Bucket = fake-data-bucket, Key = snapshot_day_zl_2023-09-17/snapshot_day=2023-09-18 00%3A00%3A00/site_id=DZZ1/part-07747-e878e452-0754-44f6-ae9a-d822e42f2bf1.c000.csv

I suspect the problem is the regex passed to --groupBy somehow failing to identify these files as matches. However, when I test the pattern against the keys on regex101, they do match.
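The regex101 check can be reproduced locally so that several forms of the path are tested at once. This is a minimal sketch, not a statement of how s3-dist-cp works internally: which string it actually matches against (bare key, full URI, or a URL-decoded form) is an assumption here, so all three are tried. Python's `re` is used instead of Java's regex engine; for a pattern this simple the two should behave the same.

```python
import re
from urllib.parse import unquote

# Pattern from the question, without the shell quotes.
PATTERN = re.compile(r".*(csv)")

# Bucket and key from the example file in the question.
BUCKET = "fake-data-bucket"
KEY = (
    "snapshot_day_zl_2023-09-17/snapshot_day=2023-09-18 00%3A00%3A00/"
    "site_id=DZZ1/part-07747-e878e452-0754-44f6-ae9a-d822e42f2bf1.c000.csv"
)


def candidates(bucket, key):
    """Forms of the path the tool might match against (all assumptions)."""
    return {
        "key": key,
        "uri": f"s3://{bucket}/{key}",
        "decoded_key": unquote(key),  # %3A becomes ':'
    }


def check(pattern, text):
    """Report the captured group for a whole-string and a partial match."""
    full = pattern.fullmatch(text)
    part = pattern.search(text)
    return {
        "fullmatch_group": full.group(1) if full else None,
        "search_group": part.group(1) if part else None,
    }


results = {
    name: check(PATTERN, text)
    for name, text in candidates(BUCKET, KEY).items()
}
for name, outcome in results.items():
    print(name, outcome)
```

Every form matches under both whole-string and partial matching, which agrees with the regex101 result and suggests the pattern itself is not what excludes the files. Note that the capture group is the literal text `csv` for every file, so all matching files would share one group value.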

There are 0 answers