Apache storm: why and how to choose number of tasks per executor?

2k views Asked by At

According to the official documentation:

How many instances to create for a spout/bolt. A task runs on a thread with zero or more other tasks for the same spout/bolt. The number of tasks for a spout/bolt is always the same throughout the lifetime of a topology, but the number of executors (threads) for a spout/bolt can change over time. This allows a topology to scale to more or less resources without redeploying the topology or violating the constraints of Storm (such as a fields grouping guaranteeing that the same value goes to the same task)

My questions are:

  1. Under what circumstances would I choose to run multiple tasks in one executor?
  2. If I do use multiple tasks in one executor, what might be reasons that I would choose different number of tasks per executor between my spout and my bolt (such as 2 tasks per bolt executor but only 1 task per spout executor)?
3

There are 3 answers

1
Stig Rohde Døssing On BEST ANSWER

I thought https://stackoverflow.com/a/47714449/8845188 was a fine answer, but I'll try to reword it as examples:

The number of tasks for a component (e.g. spout or bolt) is set in stone when you submit the topology, while the number of executors can be changed without redeploying the topology. The number of executors is always less than or equal to the number of tasks for a component.

Question 1

You wouldn't normally have a reason to choose running e.g. 2 tasks in 1 executor, but if you currently have a low load but expect a high load later, you may choose to submit the topology with a high number of tasks but a low number of executors. You could of course just submit the topology with as many executors as you expect to need, but using many threads when you only need a few is inefficient due to context switching and/or potential resource contention.

For example, lets say you submit your topology so the spout has 4 tasks and 4 executors (one per). When your load increases, you can't scale further because 4 is the maximum number of executors you can have. You now have to redeploy the topology in order to scale with the load.

Let's say instead you submit your topology so the spout has 32 tasks and 4 executors (8 per). When the load increases, you can increase the number of executors to 32, even though you started out with only 4. You can do this scaling up without redeploying the topology.

Question 2

Let's say your topology has a spout A, and a bolt B. Let's say bolt B does some heavyweight work (e.g. can do 10 tuples per executor per second), while the spout is lightweight (e.g. can do 1000 tuples per executor per second). Let's say your load is initially 20 messages per second into the topology, but you expect that to grow.

In this case it makes sense that you might configure your spout with 1 executor and 1 task, since it's likely to be idle most of the time. At the same time you want to configure your bolt with a high number of tasks so you can scale the number of executors for it, and at least 2-3 executors to start.

1
Suresh Rajagopal On
Config#TOPOLOGY_TASKS -> How many tasks to create per component.    

A task performs the actual data processing and is run within its parent executor’s thread of execution. Each spout or bolt that you implement in your code executes as many tasks across the cluster.

The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads <= #tasks.

By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread (which is usually what you want anyways).

Also be aware that:

  • The number of executor threads can be changed after the topology has been started.

  • The number of tasks of a topology is static.

0
Harsh Kumar On

There is another reason where having tasks in place of executors makes more sense. Lets suppose you have 2 tasks of the same bolt running on a single executor(thread). Lets suppose you are calling a relatively long running(1 second maybe) database subroutine and the result is needed before proceeding further.

Case 1 - Your database call would be running on the executor thread and it would pause for a while and you would not gain anything by running 2 tasks. Case 2 - You refactor your database call code to spawn a new thread and execute. In this case, your main executor thread would not hang and it would be able to start processing of the second bolt task while the newly spawned thread would be fetching data from database.

Unless you introduce your own parallelism within the component, I do not see a performance gain and no reason to run multiple tasks apart from maintenance reasons as mentioned in other answers.