I am trying to run a PySpark job on a cluster with 2 nodes and 1 master (all with 16 GB RAM). I submitted the job with the command below.
spark-submit --master yarn --deploy-mode cluster --name "Pyspark" --num-executors 40 --executor-memory 2g CD.py
However, my code runs very slowly; it takes almost 1 hour to parse 8.2 GB of data. I then tried to change my YARN configuration and modified the following properties.
yarn.scheduler.minimum-allocation-mb = 2 GiB
yarn.scheduler.increment-allocation-mb = 2 GiB
yarn.scheduler.maximum-allocation-mb = 2 GiB
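For reference, if these are set directly in yarn-site.xml rather than through a management UI, the values are given in MB (2 GiB = 2048 MB); a minimal sketch for the two standard properties would be:

<!-- yarn-site.xml: container sizing, values in MB (2 GiB = 2048 MB) -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
</property>

(As far as I know, yarn.scheduler.increment-allocation-mb only applies when the Fair Scheduler is used, e.g. on CDH, so it may not have any effect on your cluster.)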
Even after these changes, my Spark job is still very slow and takes more than 1 hour to parse the 8.2 GB of files.
Could you please try the configuration below?
spark.executor.memory 5g
spark.executor.cores 5
spark.executor.instances 3
spark.driver.cores 2
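For example, assuming the same CD.py script from the question, these settings could be passed straight to spark-submit (a sketch; tune the numbers to what your nodes can actually hold):

spark-submit --master yarn --deploy-mode cluster --name "Pyspark" \
  --num-executors 3 \
  --executor-cores 5 \
  --executor-memory 5g \
  --driver-cores 2 \
  CD.py

The idea is that a few larger executors can actually be scheduled on your two 16 GB worker nodes, whereas YARN cannot run 40 executors of 2 GB each at the same time on that hardware.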