How to configure Executor in Spark Local Mode


In Short

I want to configure my application to use lz4 compression instead of snappy. What I did is:

SparkSession session = SparkSession.builder()
        .master(SPARK_MASTER) // local[1]
        .appName(SPARK_APP_NAME)
        .config("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
        .getOrCreate();

But looking at the console output, it's still using snappy in the executor:

org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY

and

[Executor task launch worker-0] compress.CodecPool (CodecPool.java:getCompressor(153)) - Got brand-new compressor [.snappy]

According to this post, what I did here only configures the driver, not the executor. The solution in that post is to change the spark-defaults.conf file, but I'm running Spark in local mode and I don't have that file anywhere.

Some more detail:

I need to run the application in local mode (for the purpose of unit testing). The tests work fine locally on my machine, but when I submit them to a build engine (RHEL5_64), I get the error

snappy-1.0.5-libsnappyjava.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found

I did some research, and it seems the simplest fix is to use lz4 instead of snappy as the codec, so I tried the approach above.

I have been stuck on this issue for several hours; any help is appreciated, thank you.


There are 2 answers

Ning Lin (BEST ANSWER)

Posting my solution here. @user8371915 did answer the question, but it did not solve my problem, because in my case I can't modify the property files.

What I ended up doing was adding another configuration:

SparkSession session = SparkSession.builder()
        .master(SPARK_MASTER) // local[1]
        .appName(SPARK_APP_NAME)
        .config("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
        .config("spark.sql.parquet.compression.codec", "uncompressed")
        .getOrCreate();
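
If you only need to control a specific Parquet write rather than the whole session, a per-write option works as well. This is a minimal sketch, assuming the SparkSession named session from above and a hypothetical input file; the "compression" option on the Parquet writer overrides spark.sql.parquet.compression.codec for that write only:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hypothetical input; any Dataset<Row> works here.
Dataset<Row> df = session.read().json("src/test/resources/input.json");

// Override the Parquet compression codec for this write only.
df.write()
        .option("compression", "uncompressed")
        .parquet("target/output.parquet");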
Alper t. Turker

what I did here only configures the driver, not the executor.

In local mode there is only one JVM which hosts both driver and executor threads.
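
As a quick sanity check (a minimal sketch, assuming the SparkSession named session built above), you can read the setting back from the shared SparkConf; because driver and executor threads live in the same JVM in local mode, they both see this value:

// In local mode the driver and executor threads share one JVM, so the
// value passed via .config(...) is visible on the session's SparkConf.
String codec = session.sparkContext().getConf()
        .get("spark.io.compression.codec", "<not set>");
System.out.println("spark.io.compression.codec = " + codec);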

the spark-defaults.conf file, but I'm running Spark in local mode and I don't have that file anywhere.

Mode is not relevant here. Spark in local mode uses the same configuration files. If you go to the directory where you keep the Spark binaries, you should see a conf directory:

spark-2.2.0-bin-hadoop2.7 $ ls
bin  conf  data  examples  jars  LICENSE  licenses  NOTICE  python  R  README.md  RELEASE  sbin  yarn

This directory contains a number of template files:

spark-2.2.0-bin-hadoop2.7 $ ls conf
docker.properties.template  log4j.properties.template    slaves.template               spark-env.sh.template
fairscheduler.xml.template  metrics.properties.template  spark-defaults.conf.template

If you want to set a configuration option, copy spark-defaults.conf.template to spark-defaults.conf and edit it according to your requirements.
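
For the case in the question, the copied file could then contain something like the following (a sketch; these are standard Spark property names, and the values are just the ones discussed above):

# conf/spark-defaults.conf
spark.io.compression.codec              lz4
spark.sql.parquet.compression.codec     uncompressed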