How does Spark determine the number of slave node simultaneous threads?

The two relevant parameters seem to me to be spark.default.parallelism and spark.cores.max.

spark.default.parallelism sets the number of partitions of the in-memory data, and spark.cores.max sets the number of CPU cores available to the application. However, in traditional parallel computing I would explicitly launch a specific number of threads.
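For reference, this is roughly how I set those two properties on the application side; the values here are placeholders, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only: cap the application at 8 cores cluster-wide
    // and default to 16 partitions when none are requested explicitly.
    val conf = new SparkConf()
      .setAppName("parallelism-question")
      .set("spark.cores.max", "8")
      .set("spark.default.parallelism", "16")

    val sc = new SparkContext(conf)

    // parallelize without an explicit numSlices falls back to spark.default.parallelism
    val rdd = sc.parallelize(1 to 1000000)
    println(rdd.getNumPartitions)   // 16 with the configuration above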

Will Spark launch one thread per partition, regardless of the number of available cores? If there are 1 million partitions, will Spark limit the number of threads to some reasonable multiple of the number of available cores?

How is the number of threads determined?

zero323 (accepted answer):

"The two relevant parameters seem to me to be spark.default.parallelism and spark.cores.max."

They are almost completely irrelevant.

The number of data processing threads on each worker depends mostly on three factors (a sketch after the list shows where each one is set):

  • The number of cores (threads) that a particular worker advertises.

    This is the maximum number of threads that can be used at any given time, not counting threads used for secondary purposes. In standalone mode it is determined by the number of cores the worker is started with (SPARK_WORKER_CORES, its advertised parallelization capability); other cluster managers have equivalent properties.

  • The number of cores on this worker that are allocated to executors.

    This is the maximum number of threads that active applications can actually use, less than or equal to the first number.

  • The number of active tasks scheduled on the executors running on this particular worker.

    This is the actual number of threads in use at a given time, less than or equal to the previous number.
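To make the three layers concrete, here is a minimal sketch for a standalone cluster; the core counts and partition count are assumptions chosen purely for illustration.

    // 1. Cores the worker advertises -- set when the worker process is started,
    //    not by the application, e.g. SPARK_WORKER_CORES=16 in conf/spark-env.sh
    //    (or --cores 16 on the worker start script).

    import org.apache.spark.{SparkConf, SparkContext}

    // 2. Cores this application may take across the cluster
    //    (at most what the workers advertise in total).
    val conf = new SparkConf()
      .setAppName("thread-layers-sketch")
      .set("spark.cores.max", "8")
    val sc = new SparkContext(conf)

    // 3. Active tasks actually running at a given moment: with only 4 partitions,
    //    at most 4 tasks (threads) run concurrently, even though 8 cores were granted.
    val count = sc.parallelize(1 to 100, numSlices = 4).map(_ * 2).count()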

This assumes that the application is honest and uses only the allocated cores, i.e. that tasks don't attempt to start threads beyond what has been requested with spark.task.cpus.
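For tasks that legitimately need more than one thread, the per-task core request can be raised; a minimal sketch, with illustrative values:

    import org.apache.spark.{SparkConf, SparkContext}

    // Declare that every task will use 2 cores, so the scheduler books 2 cores
    // per task instead of 1. The value 2 is illustrative.
    val conf = new SparkConf()
      .setAppName("multithreaded-task-sketch")
      .set("spark.task.cpus", "2")
    val sc = new SparkContext(conf)

    // On an executor with 8 cores, at most 8 / 2 = 4 such tasks run concurrently,
    // leaving headroom for each task body to spawn a second thread "honestly".
    val squares = sc.parallelize(1 to 8, 8).map(i => i * i).collect()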