My PySpark application is performing slowly. I have a function that involves 5 DataFrames, with joins and aggregations inside. When I call this function only once, it runs successfully. But when I call it more than once within the same process (changing only a parameter; the data volume is the same), it does not terminate. It stalls at some point that I am not able to identify. My question is: how can I debug my Spark application to identify this bottleneck?

1 Answer

Ranga Vure

I generally test and troubleshoot my Spark applications using the steps below:

  1. master=local

    Run the application/pipeline with master=local and small datasets. With this option I can run the Spark application from my favorite IDE on my local desktop and also use its debugger.

   from pyspark.sql import SparkSession
   spark = SparkSession.builder.appName("MyApp").master("local").getOrCreate()

  2. --deploy-mode client

    Once the issues are resolved and the application works locally, package it, deploy it to the edge node, and run it in client mode with small datasets. Any error messages or stack traces then appear in the console.

  3. --deploy-mode cluster

    Now run it in cluster mode with the large/actual datasets, and tune the Spark performance settings such as the number of executors and executor memory.
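
The client-mode and cluster-mode steps above both go through spark-submit. As a rough sketch only, the two invocations could be assembled as shown here. The helper `build_spark_submit`, the script name `my_app.py`, the `yarn` master, and the resource values are all assumptions for illustration and do not come from the answer; `--num-executors` is the executor-count flag on YARN.

```python
import shlex


def build_spark_submit(app_path, deploy_mode, master="yarn",
                       num_executors=None, executor_memory=None):
    """Return a spark-submit argument list (hypothetical helper)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Resource settings are optional; omit them to use the cluster defaults.
    if num_executors is not None:
        cmd += ["--num-executors", str(num_executors)]
    if executor_memory is not None:
        cmd += ["--executor-memory", executor_memory]
    cmd.append(app_path)
    return cmd


# Client mode: small datasets, driver output visible in the console.
client_cmd = build_spark_submit("my_app.py", "client")

# Cluster mode: actual datasets, explicit resources (placeholder values).
cluster_cmd = build_spark_submit("my_app.py", "cluster",
                                 num_executors=10, executor_memory="4g")

# Print the commands as they would be typed in a shell.
print(shlex.join(client_cmd))
print(shlex.join(cluster_cmd))
```

On the edge node, either list could be passed to `subprocess.run(cmd, check=True)`, or the printed command could be pasted into a shell. Keeping the resource flags optional makes it easy to start from defaults and raise them only once the client-mode run is clean.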