The two relevant parameters seem to me to be spark.default.parallelism and spark.cores.max. spark.default.parallelism sets the number of partitions of the in-memory data, and spark.cores.max sets the number of available CPU cores. However, in traditional parallel computing I would explicitly launch some number of threads.
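For concreteness, here is a minimal sketch of how I would supply those two settings through SparkConf (the app name and the numeric values are illustrative assumptions, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

object ParallelismDemo {
  def main(args: Array[String]): Unit = {
    // Illustrative values only. spark.default.parallelism is the default
    // number of partitions used for shuffles and for parallelize() when no
    // partition count is given; spark.cores.max caps the total number of
    // cores the application may claim across the cluster (standalone and
    // Mesos coarse-grained modes).
    val conf = new SparkConf()
      .setAppName("parallelism-demo")           // hypothetical app name
      .set("spark.default.parallelism", "200")  // default partition count
      .set("spark.cores.max", "16")             // cluster-wide core cap

    val sc = new SparkContext(conf)

    // An RDD created without an explicit partition count picks up the default.
    val rdd = sc.parallelize(1 to 1000000)
    println(rdd.getNumPartitions)               // 200 with the settings above

    sc.stop()
  }
}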
Will Spark launch one thread per partition, regardless of the number of available cores? If there are 1 million partitions, will Spark limit the number of threads to some reasonable multiple of the number of available cores?
How is the number of threads determined?
Those are almost completely irrelevant.
The number of data processing threads on each worker depends mostly on three factors (a toy illustration follows the list):

1. The number of cores (threads) that a particular worker advertises. This is the maximum number of threads that can be in use at any time, not counting threads used for secondary purposes. Determined by CORES (the advertised parallelization capability) in standalone mode, and by the equivalent properties in other cluster managers.
2. The number of cores allocated on this worker to executors. This is the maximum number of threads that active applications can actually use (less than or equal to the first number).
3. The number of active tasks scheduled on the executors assigned to this particular worker. This is the actual number of threads in use at the moment (less than or equal to the previous number).
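To make the ordering concrete, here is a toy calculation (the numbers are made up) showing how each factor bounds the next one:

object WorkerThreadBudget {
  def main(args: Array[String]): Unit = {
    // Hypothetical numbers for a single worker.
    val advertisedCores       = 16 // factor 1: what the worker offers to the master
    val coresGivenToExecutors = 8  // factor 2: granted to this app's executors here
    val activeTasks           = 5  // factor 3: tasks currently running on those executors

    // Each factor is bounded by the one above it, so the number of busy
    // data-processing threads right now is simply the last (smallest) one.
    val busyThreads = List(advertisedCores, coresGivenToExecutors, activeTasks).min
    println(s"busy data-processing threads on this worker: $busyThreads") // 5
  }
}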
This assumes that the application is honest and uses only the allocated cores, and that tasks don't attempt to start threads which have not been requested with spark.task.cpus.
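As a rough sketch (the executor size and spark.task.cpus value below are assumed, not defaults), the number of tasks, and therefore Spark-managed threads, that can run at once in a single executor follows from a simple division:

object TaskSlotSketch {
  def main(args: Array[String]): Unit = {
    // Assumed configuration for one executor.
    val executorCores = 8 // spark.executor.cores
    val taskCpus      = 2 // spark.task.cpus: scheduler slots reserved per task

    // The scheduler places a task only when taskCpus free slots are available,
    // so at most this many tasks run concurrently inside the executor:
    val concurrentTasks = executorCores / taskCpus
    println(s"concurrent tasks per executor: $concurrentTasks") // 4

    // A task that spawns its own extra threads (e.g. inside a map closure)
    // exceeds this budget without the scheduler knowing about it, which is
    // the "dishonest application" case mentioned above.
  }
}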