How can I know if spark application/job is idle/waiting for resources?


I'm very new to spark, so forgive me if some of my concepts are not right.

My main "problem" is that I want to know whether my Spark application is idle, waiting for resources. Will the state of the job/application always indicate that? I don't think so.

In the project I'm working on right now, there are going to be a LOT of Spark applications. Those applications can run at the same time, or within a small interval of each other.

My cluster is configured with 1 master and 2 workers (each on a different machine). I'm running my applications on a Spark Standalone cluster. The Spark master UI (port 8080) shows some additional information:

[Screenshot of the Spark master UI showing the registered workers and applications]

Here is my spark master conf file:

spark.master spark://myServer:7077
spark.sql.caseSensitive false
spark.executor.heartbeatInterval 90000
spark.network.timeout 400000
spark.executor.heartbeat.maxFailures 10
spark.shuffle.registration.timeout 500000
spark.shuffle.push.finalize.timeout 600s
spark.files.fetchTimeout 600s
spark.rpc.lookupTimeout 600s
spark.scheduler.excludeOnFailure.unschedulableTaskSetTimeout 600s
spark.eventLog.enabled true
spark.eventLog.dir file:/opt/spark/logs/spark-events/
spark.history.fs.logDirectory file:/opt/spark/logs/spark-events/
spark.executor.logsDirectory /opt/spark/logs
spark.sql.adaptive.enabled true
spark.sql.adaptive.skewJoin.enabled true
spark.sql.adaptive.localShuffleReader.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.executor.memory 5g
spark.executor.cores 2
spark.driver.memory 8g
spark.driver.cores 4
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.shuffleTracking.enabled true
spark.dynamicAllocation.executorIdleTimeout 600s
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.port 7078
spark.blockManager.port 7087
spark.driver.blockManager.port 7011
spark.shuffle.service.enabled true
spark.shuffle.service.port 7337
spark.submit.deployMode client
spark.worker.cleanup.enabled true

OK, so now you know that I have:

  • 3 workers (including the master)
  • 30 cores
  • 90GB of memory
  • Using standalone cluster
  • Dynamic allocation is enabled
  • 15 executors (I don't know how this number of executors was set, but I have this information).
  • Each executor has 5GB of memory
  • Each partition of my data has 4GB
  • Spark version is 3.5
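
As a side note on the "15 executors" point: that number is consistent with sizing executors by cores. With 30 cores in the cluster and spark.executor.cores set to 2, at most 30 / 2 = 15 executors fit, and memory is not the binding limit (15 × 5 GB = 75 GB ≤ 90 GB). A quick sanity check of that arithmetic, assuming cores and memory are the only constraints (real scheduling also reserves resources for drivers):

```python
# Rough executor-count check for the standalone cluster described above.
total_cores = 30
total_memory_gb = 90
executor_cores = 2       # spark.executor.cores
executor_memory_gb = 5   # spark.executor.memory

max_by_cores = total_cores // executor_cores            # 30 // 2 = 15
max_by_memory = total_memory_gb // executor_memory_gb   # 90 // 5 = 18

max_executors = min(max_by_cores, max_by_memory)
print(max_executors)  # 15 -> cores are the binding constraint
```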

And as I already explained, in a real scenario I can run 2, 3 or more applications at the same time. If one of those applications reads a very large table, with many partitions and a lot of data, it will consume a lot of resources. If I run several large applications at the same time, some of them will be left waiting for resources.
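
One detail worth knowing for this scenario: in standalone mode, an application that does not set spark.cores.max will by default try to grab every core the master can offer (spark.deploy.defaultCores defaults to unlimited), which is exactly how one large application can starve the others. Capping cores per application is a common mitigation; the value of 10 below is only illustrative, not a recommendation:

```
# spark-defaults.conf (illustrative value)
# Cap each application at 10 of the cluster's 30 cores, so up to three
# applications can hold executors concurrently instead of one taking all 30.
spark.cores.max 10
```

With dynamic allocation enabled, spark.dynamicAllocation.maxExecutors can serve a similar per-application limiting role.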

I forward the Spark logs to Livy, so in Livy I was able to capture the following log:

2024-03-25 09:11:31 INFO  - Requesting 1 new executor because tasks are backlogged (new desired total will be 1 for resource profile id: 0)
...
2024-03-25 09:11:46 WARN  - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2024-03-25 09:12:01 WARN  - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2024-03-25 09:12:06 INFO  - Executor added: app-x-y/0 on worker-x-IP-z (IP:z) with 2 core(s)
2024-03-25 09:12:06 INFO  - Granted executor ID app-x-y/0 on hostPort IP:z with 2 core(s), 4.0 GiB RAM
2024-03-25 09:12:06 INFO  - Executor updated: app-x-y/0 is now RUNNING
2024-03-25 09:12:09 INFO  - Executor added: app-x-y/1 on worker-x-IP-z (IP:z) with 2 core(s)
2024-03-25 09:12:09 INFO  - Granted executor ID app-x-y/1 on hostPort IP:z with 2 core(s), 4.0 GiB RAM
2024-03-25 09:12:09 INFO  - Executor updated: app-x-y/1 is now RUNNING
2024-03-25 09:12:11 INFO  - Registered executor NettyRpcEndpointRef(spark-client://Executor) (IP:u) with ID 0,  ResourceProfileId 0
2024-03-25 09:12:11 INFO  - New executor 0 has registered (new total is 1)
2024-03-25 09:12:11 INFO  - Registering block manager IP:k with 2.2 GiB RAM, BlockManagerId(0, IP, k, None)
2024-03-25 09:12:12 INFO  - Starting task 0.0 in stage 0.0 (TID 0) (IP, executor 0, partition 0, PROCESS_LOCAL, 7865 bytes) 
2024-03-25 09:12:12 INFO  - Starting task 1.0 in stage 0.0 (TID 1) (IP, executor 0, partition 1, PROCESS_LOCAL, 7865 bytes) 
2024-03-25 09:12:12 INFO  - Added broadcast_0_piece0 in memory on IP:k (size: 12.0 KiB, free: 2.2 GiB)
2024-03-25 09:12:14 INFO  - Registered executor NettyRpcEndpointRef(spark-client://Executor) (IP:u) with ID 1,  ResourceProfileId 0
2024-03-25 09:12:14 INFO  - New executor 1 has registered (new total is 2)

From what I understand, the log message "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources" indicates that this application is not following the normal flow: it is waiting for resources.
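
Since that WARN line is the clearest signal available in the logs, one low-tech option is to scan the driver/Livy log for it programmatically. A minimal sketch (the helper function is mine, not a Spark API; only the warning text itself comes from Spark):

```python
# Detect the "waiting for resources" condition by scanning driver log lines
# for the warning Spark emits when no executor has registered yet.
STARVATION_MSG = "Initial job has not accepted any resources"

def is_waiting_for_resources(log_lines):
    """Return True if any log line contains Spark's resource-starvation warning."""
    return any(STARVATION_MSG in line for line in log_lines)

sample = [
    "2024-03-25 09:11:46 WARN  - Initial job has not accepted any resources; "
    "check your cluster UI to ensure that workers are registered "
    "and have sufficient resources",
    "2024-03-25 09:12:06 INFO  - Executor updated: app-x-y/0 is now RUNNING",
]
print(is_waiting_for_resources(sample))  # True
```

The same check could run periodically against the Livy session log to raise an alert while an application is stuck.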

My problem is not the amount of resources, as the machines have a fixed specification that I cannot change. What I would like is an easy way to tell when an application is waiting for resources. Is there any functionality in the Spark Web UI? Any specific setting in the spark-defaults.conf file? Any function in the Spark API?
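
One avenue worth checking: the standalone master's web UI also serves a machine-readable status page at http://&lt;master&gt;:8080/json, which lists active applications together with a state field (WAITING while the master has not granted the app executors, RUNNING afterwards). A sketch of a poller built on that, assuming the field names "activeapps" and "state" as returned by recent Spark versions (verify against your own /json output; the host below matches the spark.master from the conf file and is illustrative):

```python
# Query the standalone master's JSON status endpoint and report applications
# that are still waiting for executors.
import json
from urllib.request import urlopen

MASTER_JSON_URL = "http://myServer:8080/json"  # adjust to your master host/port

def fetch_master_status(url=MASTER_JSON_URL):
    """Fetch the master's status JSON (call this from a monitoring job)."""
    with urlopen(url) as resp:
        return json.load(resp)

def waiting_apps(master_json):
    """Return (id, name) of active applications still in the WAITING state."""
    return [
        (app.get("id"), app.get("name"))
        for app in master_json.get("activeapps", [])
        if app.get("state") == "WAITING"
    ]
```

Usage would be something like `for app_id, name in waiting_apps(fetch_master_status()): ...` on a timer; any application that stays in the list for more than a few seconds is stuck waiting for resources.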


There are 0 answers