s3-dist-cp takes a long time to copy a large number of small files from one bucket to another


I need to copy a large number of small files from one S3 bucket to another. I'm using the S3DistCp (s3-dist-cp) command provided by AWS:

s3-dist-cp --src=s3://some-bucket/ --dest=s3://another-bucket/ --groupBy=<some-pattern> --targetSize=<size> --deleteOnSuccess
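For context, here is a rough sketch of how I understand --groupBy to decide which files get merged: keys whose regex capture groups concatenate to the same string end up in the same output file. This is only a model of that behaviour in plain Python, not S3DistCp's actual code. The key names, the date-based pattern, and the group_keys helper are all hypothetical, and I'm assuming unmatched keys are simply left out of the merge.

```python
import re
from collections import defaultdict


def group_keys(keys, pattern):
    """Group object keys by the concatenation of the pattern's capture groups.

    Hypothetical helper that models --groupBy semantics; it does not call S3.
    """
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for key in keys:
        match = regex.search(key)
        if match is None:
            # Assumption: keys that do not match the pattern are not merged.
            continue
        group_name = "".join(g or "" for g in match.groups())
        groups[group_name].append(key)
    return dict(groups)


# Hypothetical keys, standing in for the small files in the source bucket.
keys = [
    "logs/2020-01-01/part-0001.json",
    "logs/2020-01-01/part-0002.json",
    "logs/2020-01-02/part-0001.json",
    "other/readme.txt",
]

# Hypothetical pattern: one merged output per date directory.
merged = group_keys(keys, r".*/(\d{4}-\d{2}-\d{2})/.*\.json")

for name, members in sorted(merged.items()):
    print(name, len(members))
```

With a pattern like this, every new file written into an existing date directory belongs to a group that has already been (or is being) merged, which may be part of why the job never seems to finish.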

The problem is that this command takes an extremely long time to copy and merge all the small files.

Note: the source bucket is continuously being written to by another job, and I think s3-dist-cp never catches up with the latest files.

Is there any workaround for this problem? The destination bucket will be read by a Spark job that processes these files.
