Spark partition by files


I have several thousand compressed CSV files in an S3 bucket, each approximately 30 MB in size (around 120-160 MB after decompression), which I want to process with Spark.

In my Spark job, I am running simple filter/select queries on each row.
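For reference, the job is essentially the following (a minimal sketch; the bucket path and column names are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("csv-filter").getOrCreate()

// Read the compressed CSVs directly from S3 (path is a placeholder).
val df = spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/data/*.csv.gz")

// Simple per-row filter/select (column names are placeholders).
val result = df
  .filter(col("status") === "ACTIVE")
  .select("id", "event_time", "status")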

When partitioning, Spark splits each file into two or more parts and creates a task for each partition. Each task takes around 1 minute just to process 125K records. I want to avoid splitting a single file across multiple tasks.

Is there a way to fetch the files and partition the data so that each task works on one complete file, i.e. number of tasks = number of input files?


1 Answer

Answered by stevel:

As well as playing with Spark options, you can tell the s3a filesystem client to report a "block size" of 128 MB for files in S3; Spark uses that value when deciding how to split the input. The default is 32 MB, which is close enough to your "approximately 30 MB" number that Spark could be splitting each file in two.

spark.hadoop.fs.s3a.block.size 134217728
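For example, this can be set on the SparkSession builder, or passed with --conf on spark-submit (the app name here is a placeholder):

import org.apache.spark.sql.SparkSession

// Report a 128 MB block size for S3 objects so each ~30 MB file stays in one split.
val spark = SparkSession.builder()
  .appName("csv-filter")
  .config("spark.hadoop.fs.s3a.block.size", "134217728")
  .getOrCreate()

// Equivalent on the command line:
//   spark-submit --conf spark.hadoop.fs.s3a.block.size=134217728 ...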

Using the wholeTextFiles() operation is safer, though, since it always reads each file as a single record and never splits a file across tasks.
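A rough sketch of that approach (assuming each decompressed file fits comfortably in memory as one record; the path, header handling, and filter are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("whole-files").getOrCreate()

// Each element is (filePath, entireFileContents); a file is never split across tasks.
val files = spark.sparkContext.wholeTextFiles("s3a://my-bucket/data/")

val matching = files.flatMap { case (_, content) =>
  content.split("\n").drop(1)               // drop the header line (placeholder CSV layout)
}.filter(line => line.contains("ACTIVE"))   // placeholder filter

matching.count()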