I cannot understand the difference between multi-threading and partitioning in Spring Batch. The implementation is of course different: in partitioning you need to prepare the partitions and then process them. I want to know what the difference is and which one is the more efficient way to process when the bottleneck is the item processor.
Spring batch difference between Multithreading vs partitioning
Asked by mettok (10.7k views)
TL;DR;
Neither approach is intended to help when the bottleneck is in the processor. You will see some gains by having multiple items going through a processor at the same time, but both of the options you point out get their full benefits when used in processes that are I/O bound. The AsyncItemProcessor/AsyncItemWriter may be a better option.
Overview of Spring Batch Scalability
There are five options for scaling Spring Batch jobs:
- Multithreaded step
- Parallel steps
- Partitioning
- Remote chunking
- AsyncItemProcessor/AsyncItemWriter
Each has its own benefits and disadvantages. Let's walk through each:
Multithreaded step
A multithreaded step takes a single step and executes each chunk within that step on a separate thread. This means that the same instances of each of the batch components (readers, writers, etc) are shared across the threads. This can increase performance by adding some parallelism to the step at the cost of restartability in most cases. You sacrifice restartability because in most cases, the ability to restart is based on the state maintained within the reader/writer/etc. With multiple threads updating that state, it becomes invalid and useless for restart. Because of this, you typically need to turn save state off on individual components and set the restartable flag to false on the job.
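To make the shared-state point concrete, here is a minimal plain-Java sketch of the idea. It is not Spring Batch code: `MultithreadedStepSketch`, `SharedReader`, and `run` are hypothetical names, and the "processing" is just doubling each number. One reader instance is shared by every worker thread, so reads have to be synchronized, and chunks finish in whatever order the threads get to them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: hypothetical names, not the Spring Batch API.
class MultithreadedStepSketch {

    /** A single reader instance shared by all threads; reads are synchronized. */
    static final class SharedReader {
        private final Iterator<Integer> source;

        SharedReader(List<Integer> items) {
            this.source = items.iterator();
        }

        /** Returns up to chunkSize items, or an empty list once the input is exhausted. */
        synchronized List<Integer> readChunk(int chunkSize) {
            List<Integer> chunk = new ArrayList<>();
            while (chunk.size() < chunkSize && source.hasNext()) {
                chunk.add(source.next());
            }
            return chunk;
        }
    }

    /** Processes every item once; the order of the returned list is not deterministic. */
    static List<Integer> run(List<Integer> items, int chunkSize, int threads) {
        SharedReader reader = new SharedReader(items);
        List<Integer> written = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> workers = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                workers.add(pool.submit(() -> {
                    List<Integer> chunk;
                    while (!(chunk = reader.readChunk(chunkSize)).isEmpty()) {
                        List<Integer> processed = new ArrayList<>();
                        for (int item : chunk) {
                            processed.add(item * 2); // stand-in for the item processor
                        }
                        written.addAll(processed);   // stand-in for the item writer
                    }
                }));
            }
            for (Future<?> worker : workers) {
                worker.get(); // wait for all threads and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return written;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            items.add(i);
        }
        System.out.println(run(items, 3, 2)); // chunk order may differ between runs
    }
}
```

Because the reader's position only says how far the threads have read, not which chunks have actually finished, a saved "last item read" value cannot be trusted on restart. That is the restartability cost described above.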
Parallel steps
Parallel steps are achieved via a split. It allows you to execute multiple, independent steps in parallel via threads. This does not sacrifice restartability, but does not help improve the performance of a single step or piece of business logic.
Partitioning
Partitioning is the dividing of data, in advance, into smaller chunks (called partitions) by a master step and then having slaves work independently on the partitions. In Spring Batch, the master and each slave are independent steps, so you can get the benefits of parallelism within a single step without sacrificing restartability. Partitioning also provides the ability to scale beyond a single JVM in that the slaves do not have to be local (you can use various communication mechanisms to communicate with remote slaves).
An important note about partitioning is that the only communication between the master and slave is a description of the data and not the data itself. For example, the master may tell slave1 to process records 1-100, slave2 to process records 101-200, etc. The master does not send the actual data, only the information required for the slave to obtain the data it is supposed to process. Because of this, the data must be local to the slave processes and the master can be located anywhere.
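As a rough illustration of "a description of the data and not the data itself", the sketch below computes the id ranges a master could hand out. It is plain Java with hypothetical names (`RangePartitionSketch`, `Range`, `partition`), and it assumes records are identified by contiguous numeric ids. In Spring Batch the equivalent logic would live in a `Partitioner`, whose `partition(int gridSize)` method returns one `ExecutionContext` per partition, with the range stored as entries in that context.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: hypothetical names, not the Spring Batch API.
class RangePartitionSketch {

    /** Describes one partition as an inclusive range of record ids. */
    static final class Range {
        final long min;
        final long max;

        Range(long min, long max) {
            this.min = min;
            this.max = max;
        }

        @Override
        public String toString() {
            return "records " + min + "-" + max;
        }
    }

    /**
     * Splits the inclusive id range [first, last] into at most gridSize
     * contiguous ranges of near-equal size. No record data is loaded here.
     */
    static Map<String, Range> partition(long first, long last, int gridSize) {
        if (gridSize < 1 || last < first) {
            throw new IllegalArgumentException("need gridSize >= 1 and last >= first");
        }
        long total = last - first + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        Map<String, Range> partitions = new LinkedHashMap<>();
        long start = first;
        for (int i = 0; start <= last; i++) {
            long end = Math.min(start + size - 1, last);
            partitions.put("partition" + i, new Range(start, end));
            start = end + 1;
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Two slaves, 200 records: each receives only a range description.
        partition(1, 200, 2).forEach((name, range) -> System.out.println(name + ": " + range));
    }
}
```

Each slave would then use its own range (for example in a query's WHERE clause) to read the records itself, which is why the data has to be reachable from wherever the slave runs.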
Remote chunking
Remote chunking allows you to scale the process and optionally the write logic across JVMs. In this use case, the master reads the data and then sends it over the wire to the slaves where it is processed and then either written locally to the slave or returned to the master for writing local to the master.
The important difference between partitioning and remote chunking is that instead of a description going over the wire, remote chunking sends the actual data over the wire. So instead of a single packet saying process records 1-100, remote chunking is going to send the actual records 1-100. This can have a large impact on the I/O profile of a step, but if the processor is enough of a bottleneck, this can be useful.
AsyncItemProcessor/AsyncItemWriter
The final option for scaling Spring Batch processes is the AsyncItemProcessor/AsyncItemWriter combination. In this case, the AsyncItemProcessor wraps your ItemProcessor implementation and executes the call to your implementation in a separate thread. The AsyncItemProcessor then returns a Future that is passed to the AsyncItemWriter, where it is unwrapped and passed to the delegate ItemWriter implementation.
Because of the nature of how data flows through this option, certain listener scenarios are not supported (since we don't know the outcome of the ItemProcessor call until inside the ItemWriter), but overall it can provide a useful tool for parallelizing just the ItemProcessor logic in a single JVM without sacrificing restartability.
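The Future hand-off can be sketched in plain Java with `java.util.concurrent`. This is only an illustration of the pattern, not how Spring Batch implements it: `AsyncHandOffSketch`, `process`, `write`, and `run` are hypothetical names, and a `Function` stands in for the delegate ItemProcessor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustration only: hypothetical names, not the Spring Batch API.
class AsyncHandOffSketch {

    /** "Processor" side: submit each item to the pool and return Futures right away. */
    static <I, O> List<Future<O>> process(List<I> chunk, Function<I, O> delegate, ExecutorService pool) {
        List<Future<O>> futures = new ArrayList<>();
        for (I item : chunk) {
            Callable<O> task = () -> delegate.apply(item);
            futures.add(pool.submit(task));
        }
        return futures;
    }

    /** "Writer" side: unwrap the Futures in input order; processing failures surface here. */
    static <O> List<O> write(List<Future<O>> futures) {
        List<O> results = new ArrayList<>();
        for (Future<O> future : futures) {
            try {
                results.add(future.get());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for an item", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("item processing failed", e.getCause());
            }
        }
        return results;
    }

    /** Runs one chunk through both sides using a fixed-size thread pool. */
    static <I, O> List<O> run(List<I> chunk, Function<I, O> delegate, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            return write(process(chunk, delegate, pool));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Function<Integer, Integer> square = x -> x * x; // stand-in for slow business logic
        System.out.println(run(Arrays.asList(1, 2, 3, 4), square, 4));
    }
}
```

Note that an exception thrown by the delegate only appears when the writer side calls `get()`, which mirrors the reason given above for why some listener scenarios are not supported.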