Spring Batch: reading an enormous volume of data from a REST web service

I need to process data from a REST web service. Here is a basic example:

import org.springframework.batch.item.ItemReader;
import org.springframework.http.ResponseEntity;
import org.springframework.web.client.RestTemplate;

import java.util.Arrays;
import java.util.List;

class RESTDataReader implements ItemReader<DataDTO> {

    private final String apiUrl;
    private final RestTemplate restTemplate;

    // Position of the next item to hand out from the fetched list
    private int nextDataIndex;
    // Whole response of the last API call; null until the first read()
    private List<DataDTO> data;

    RESTDataReader(String apiUrl, RestTemplate restTemplate) {
        this.apiUrl = apiUrl;
        this.restTemplate = restTemplate;
        nextDataIndex = 0;
    }

    @Override
    public DataDTO read() throws Exception {
        // Lazily load the complete result set on the first call
        if (dataIsNotInitialized()) {
            data = fetchDataFromAPI();
        }

        DataDTO nextData = null;

        if (nextDataIndex < data.size()) {
            nextData = data.get(nextDataIndex);
            nextDataIndex++;
        }
        else {
            // Exhausted: reset the state and return null to signal end of input
            nextDataIndex = 0;
            data = null;
        }

        return nextData;
    }

    private boolean dataIsNotInitialized() {
        return this.data == null;
    }

    // Loads the entire response into memory in a single call
    private List<DataDTO> fetchDataFromAPI() {
        ResponseEntity<DataDTO[]> response = restTemplate.getForEntity(apiUrl,
                DataDTO[].class
        );
        DataDTO[] body = response.getBody();
        // Guard against an empty response body to avoid a NullPointerException
        return Arrays.asList(body == null ? new DataDTO[0] : body);
    }
}

However, my fetchDataFromAPI method is called with a date range, and it can return more than 20 million objects.

For example, if I call it between 01012020 and 01012021, I get 80 million records.

PS: the web service paginates by single day, i.e. to retrieve the data between 01/09/2020 and 07/09/2020 I have to call it several times (01/09 to 02/09, then 02/09 to 03/09, and so on until 06/09 to 07/09).

My problem is that the job runs out of heap space when the data is this large.

To avoid this problem, I had to create one step per month in my BatchConfiguration (12 steps): the first step calls the web service between 01/01/2020 and 01/02/2020, and so on.

Is there a way to read this whole volume of data in a single step before it goes to the processor?

Thanks in advance

1 answer

Mahmoud Ben Hassine:

Since your web service does not provide pagination within a single day, you need to ensure that the process that calls it (i.e. your Spring Batch job) has enough memory to hold all the items it returns.

Quoting the question: "For example, if I call it between 01012020 and 01012021, I get 80 million records."

This means that if you call this web service with curl on a machine that does not have enough memory to hold the result, the curl command will fail. The point I want to make is that the only way to solve this issue is to give the JVM that runs your Spring Batch job enough memory to hold such a big result set.
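
To put rough numbers on "enough memory", here is a back-of-the-envelope sketch in Java. The class name HeapEstimate, the figure of 500 bytes per item, and the even spread of the 80 million records over 365 days are all assumptions made for illustration; measure the real footprint of your DataDTO before sizing the heap.

```java
// Hypothetical helper, not part of Spring Batch: a rough heap estimate.
class HeapEstimate {

    // Bytes needed to hold itemCount items at once.
    // bytesPerItem is an assumed average, not a measured value.
    static long requiredBytes(long itemCount, long bytesPerItem) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(itemCount, bytesPerItem);
    }

    // True if the estimate is below the maximum heap of the running JVM.
    // This ignores everything else living on the heap, so treat it as optimistic.
    static boolean fitsInHeap(long itemCount, long bytesPerItem) {
        return requiredBytes(itemCount, bytesPerItem) < Runtime.getRuntime().maxMemory();
    }
}
```

With these assumed figures, a whole year (80,000,000 items) would need about 40 GB, while a single day (80,000,000 / 365, about 219,178 items) would need about 110 MB. The maximum heap itself is set with the JVM's -Xmx option.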

As a side note: if you have control over this web service, I highly recommend improving it by introducing a more granular pagination mechanism.
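
Because the service already accepts one-day ranges, a single reader could also walk the date range itself and request one day per call, so that only one day's items sit on the heap at a time. The following is a minimal sketch of that idea under stated assumptions, not a tested Spring Batch component: DayWindowReader and fetchDay are hypothetical names, the class only mirrors the ItemReader contract (read() returns null at the end of the input) so that it compiles without Spring, and in a real job it would implement ItemReader<DataDTO> with fetchDay wrapping the restTemplate call.

```java
import java.time.LocalDate;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical reader that pulls one day of data per web service call.
class DayWindowReader<T> {

    private final LocalDate endExclusive;
    // Assumed callback: (from, to) -> items for that one-day range
    private final BiFunction<LocalDate, LocalDate, List<T>> fetchDay;

    private LocalDate nextDay;
    private Iterator<T> current = Collections.emptyIterator();

    DayWindowReader(LocalDate startInclusive, LocalDate endExclusive,
                    BiFunction<LocalDate, LocalDate, List<T>> fetchDay) {
        this.nextDay = startInclusive;
        this.endExclusive = endExclusive;
        this.fetchDay = fetchDay;
    }

    // Returns the next item, or null once every day in the range is consumed.
    public T read() {
        // Loop so that days with no data are skipped instead of ending the input
        while (!current.hasNext()) {
            if (!nextDay.isBefore(endExclusive)) {
                return null;
            }
            List<T> page = fetchDay.apply(nextDay, nextDay.plusDays(1));
            nextDay = nextDay.plusDays(1);
            if (page != null) {
                // The previous day's list becomes unreachable here
                current = page.iterator();
            }
        }
        return current.next();
    }
}
```

The heap still has to hold the largest single day plus the current chunk, so this does not remove the memory requirement described above; it only bounds it by one day instead of the whole range. The sketch also keeps no restart state; a restartable job would have to save nextDay in the step's execution context.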