Cannot get faster results via yarn when running spark in a hadoop cluster

Question


490 views. Asked by mlee_jordan on 19 December 2016 at 18:17

Using an LSH algorithm in Spark 1.4 (https://github.com/soundcloud/cosine-lsh-join-spark/tree/master/src/main/scala/com/soundcloud/lsh), I process a 4 GB text file in LIBSVM format (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) to find duplicates. First, I ran my Scala script on a server using a single executor with 36 cores. I got my results in 1.5 hours.

To get my results much faster, I tried running my code on a Hadoop cluster via YARN, on an HPC system with 3 nodes where each node has 20 cores and 64 GB of memory. Since I have little experience running code on an HPC system, I followed the suggestions given here: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

As a result, I submitted the Spark job as follows:

spark-submit --class com.soundcloud.lsh.MainCerebro --master yarn-cluster --num-executors 11 --executor-memory 19G --executor-cores 5 --driver-memory 2g cosine-lsh_yarn.jar

As I understand it, this assigns 3 executors per node and 19 GB to each executor.

However, I still had no results after more than 2 hours.
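The sizing in that command can be sanity-checked with a little arithmetic. The following is a rough sketch, not a description of how YARN actually schedules this cluster: it assumes 1 core and 1 GB per node are left to the OS and Hadoop daemons, and that each executor container needs its heap plus an off-heap overhead of about 10% (with a 384 MB floor). `ExecutorFit`, `Node`, and the default parameters are hypothetical names and values chosen for illustration; the real limits come from the cluster's YARN configuration.

```scala
// Sketch: how many executors of a given size fit on one YARN node.
// All defaults are assumptions, not values read from the asker's cluster.
object ExecutorFit {
  final case class Node(cores: Int, memGb: Double)

  def executorsPerNode(
      node: Node,
      executorCores: Int,
      executorMemGb: Double,
      overheadFraction: Double = 0.10, // assumed off-heap overhead factor
      minOverheadGb: Double = 0.384    // assumed overhead floor (384 MB)
  ): Int = {
    // Memory the container needs: executor heap plus off-heap overhead.
    val overheadGb = math.max(minOverheadGb, executorMemGb * overheadFraction)
    val containerGb = executorMemGb + overheadGb
    val byCores = node.cores / executorCores
    val byMemory = math.floor(node.memGb / containerGb).toInt
    // The scarcer resource decides how many executors fit.
    math.min(byCores, byMemory)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: 1 core and 1 GB per node are reserved for the OS and daemons.
    val usable = Node(cores = 20 - 1, memGb = 64 - 1)
    val perNode = executorsPerNode(usable, executorCores = 5, executorMemGb = 19)
    println(s"executors per node: $perNode, across 3 nodes: ${perNode * 3}")
  }
}
```

Under these assumptions, fewer than 11 executors of this size fit on three such nodes, and in yarn-cluster mode the driver also occupies a container, so the request may not be granted in full.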

My spark configuration is:

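import org.apache.spark.SparkConf

val conf = new SparkConf()
      .setAppName("LSH-Cosine")
      .setMaster("yarn-cluster")
      // 0 removes the size limit on results collected to the driver
      .set("spark.driver.maxResultSize", "0")

How can I investigate this issue? Where should I start in order to improve the computation time?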

EDIT:

1)

I have noticed that coalesce is much slower on YARN:

  entries.coalesce(1, true).saveAsTextFile(text_string)

2)

EXECUTORS AND STAGES FROM HPC: [Spark UI screenshot]

EXECUTORS AND STAGES FROM SERVER: [Spark UI screenshot]


There is 1 answer

**loneStar** · Answer 1 · 2017-07-19T19:03:31+00:00

Too much memory is reserved as storage memory, and you are not using that memory efficiently (storage memory is what holds cached data). Less than 10 GB of the 40 GB total is in use. You could reduce the storage memory and give that space to execution memory instead.
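In Spark 1.x, the split between the two regions is controlled by `spark.storage.memoryFraction` (default 0.6) and `spark.shuffle.memoryFraction` (default 0.2). A minimal configuration sketch of shifting memory from storage to execution follows; the values 0.3 and 0.4 are illustrative assumptions, not tuned recommendations for this job.

```scala
import org.apache.spark.SparkConf

// Sketch for Spark 1.x (legacy memory management).
// The fractions below are hypothetical; adjust them based on the Spark UI.
val conf = new SparkConf()
  .setAppName("LSH-Cosine")
  // Shrink the share of the heap reserved for cached RDDs.
  .set("spark.storage.memoryFraction", "0.3")
  // Grow the share available to shuffles, joins, and aggregations.
  .set("spark.shuffle.memoryFraction", "0.4")
```

The same keys can also be passed on the command line with `spark-submit --conf key=value`, which avoids recompiling the jar while experimenting.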

Even though you requested 11 executors, only 4 were started, judging from the first Spark UI screenshot. Spark is using only 19 cores in total across all executors, and the total number of cores equals the number of tasks that can run at the same time.
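To make the effect of the core count concrete, here is a small sketch comparing how many sequential "waves" of tasks a stage needs under different numbers of task slots. It assumes one core per task (the Spark default for `spark.task.cpus`); `TaskSlots` and the figure of 200 tasks are hypothetical and chosen only for illustration.

```scala
// Sketch: task slots and the number of waves needed to finish a stage.
object TaskSlots {
  // Concurrent task slots, assuming each task uses one core.
  def slots(executors: Int, coresPerExecutor: Int): Int =
    executors * coresPerExecutor

  // Sequential waves needed to run all tasks (ceiling division).
  def waves(tasks: Int, slots: Int): Int = (tasks + slots - 1) / slots

  def main(args: Array[String]): Unit = {
    val tasks = 200                   // hypothetical number of tasks in a stage
    val requested = slots(11, 5)      // what the spark-submit command asked for
    val observedOnCluster = 19        // cores in use according to the Spark UI
    val singleServer = 36             // cores on the original server

    println(s"requested slots: $requested, waves: ${waves(tasks, requested)}")
    println(s"cluster (observed): $observedOnCluster slots, waves: ${waves(tasks, observedOnCluster)}")
    println(s"single server: $singleServer slots, waves: ${waves(tasks, singleServer)}")
  }
}
```

With only 19 cores actually in use, the cluster runs fewer tasks in parallel than the 36-core server did, which is consistent with the job being slower rather than faster.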

Please go through the following link.

https://community.hortonworks.com/articles/42803/spark-on-yarn-executor-resource-allocation-optimiz.html
