Using an LSH implementation for Spark 1.4 (https://github.com/soundcloud/cosine-lsh-join-spark/tree/master/src/main/scala/com/soundcloud/lsh), I process a 4 GB text file in LIBSVM format (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) to find duplicates. First, I ran my Scala script on a server using a single executor with 36 cores. I got my results in 1.5 hours.
To get my results much faster, I tried running my code on a Hadoop cluster via YARN on an HPC system with 3 nodes, each with 20 cores and 64 GB of memory. Since I do not have much experience running code on HPC systems, I followed the suggestions given here: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
As a result, I submitted the Spark job as below:
spark-submit --class com.soundcloud.lsh.MainCerebro --master yarn-cluster --num-executors 11 --executor-memory 19G --executor-cores 5 --driver-memory 2g cosine-lsh_yarn.jar
As I understand it, I have assigned 3 executors per node and 19 GB to each executor.
However, I still had no results after more than 2 hours.
My spark configuration is:
val conf = new SparkConf()
  .setAppName("LSH-Cosine")
  .setMaster("yarn-cluster")
  .set("spark.driver.maxResultSize", "0")
How can I dig into this issue? Where should I start in order to improve the computation time?
EDIT:
1) I have noticed that coalesce is much slower on YARN:
entries.coalesce(1, true).saveAsTextFile(text_string)
2)
EXECUTORS AND STAGES FROM HPC:
EXECUTORS AND STAGES FROM SERVER:
Too much memory is tied up as storage memory, and it is not being used efficiently (storage memory is the part that holds cached data). Less than 10 GB of the 40 GB total is in use. You could reduce the storage memory and give the difference to execution memory instead.
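As a rough illustration of shifting memory from storage to execution: in Spark 1.x the split is controlled by `spark.storage.memoryFraction` and `spark.shuffle.memoryFraction`, which default to 0.6 and 0.2. The values below are placeholders I picked for the sketch, not tuned recommendations, and the right numbers depend on how much your job actually caches.

```scala
import org.apache.spark.SparkConf

// Hypothetical values for illustration only; check the Storage tab
// of the Spark UI before and after changing them.
val conf = new SparkConf()
  .setAppName("LSH-Cosine")
  // Heap share reserved for cached RDDs (default 0.6 in Spark 1.x).
  .set("spark.storage.memoryFraction", "0.3")
  // Heap share for shuffle aggregation buffers (default 0.2 in Spark 1.x).
  .set("spark.shuffle.memoryFraction", "0.4")
```

The same settings can also be passed on the command line with `--conf key=value` on `spark-submit`.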
Even though you specified 11 executors, only 4 were started; this can be inferred from the first Spark UI screenshot. Spark is using only 19 cores in total across all executors, and the total number of cores equals the number of tasks that can run at the same time.
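To see how many executors can realistically fit, here is a minimal sketch of the sizing arithmetic from the Cloudera post linked in the question. The object and method names are made up for this example. It assumes 1 core and 1 GB per node are left for the OS and Hadoop daemons, one executor slot is given up for the YARN application master, and memory overhead is 7% of the container (the figure used in that post; your Spark version may use a different default).

```scala
object ExecutorSizing {
  // Result of the heuristic: values for --num-executors,
  // --executor-cores and --executor-memory (in GB).
  final case class Plan(numExecutors: Int, executorCores: Int, executorMemoryGb: Int)

  def plan(nodes: Int,
           coresPerNode: Int,
           memPerNodeGb: Int,
           executorCores: Int = 5,           // assumption: 5 cores per executor
           overheadFraction: Double = 0.07   // assumption: YARN memory overhead share
          ): Plan = {
    // Leave 1 core and 1 GB per node for the OS and Hadoop daemons.
    val usableCores = coresPerNode - 1
    val usableMemGb = memPerNodeGb - 1

    // How many executors of the requested width fit on one node.
    val executorsPerNode = usableCores / executorCores

    // Give up one executor slot for the YARN application master.
    val totalExecutors = nodes * executorsPerNode - 1

    // Split node memory between executors, then subtract the overhead.
    val containerGb = usableMemGb.toDouble / executorsPerNode
    val heapGb = math.floor(containerGb * (1 - overheadFraction)).toInt

    Plan(totalExecutors, executorCores, heapGb)
  }
}
```

For the cluster in the question, `ExecutorSizing.plan(3, 20, 64)` gives `Plan(8, 5, 19)`, so under these assumptions `--num-executors 8` is what fits on three nodes, and asking for 11 requests more containers than the nodes can hold. What YARN will actually grant also depends on `yarn.nodemanager.resource.memory-mb` and `yarn.nodemanager.resource.cpu-vcores` on your cluster.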
Please go through the following link.
https://community.hortonworks.com/articles/42803/spark-on-yarn-executor-resource-allocation-optimiz.html