I've written a Spring Batch configuration that is intended to merge the contents of multiple CSV files into a single file. The goal is to process files with a specific naming pattern (temp*.csv), combine the contents while skipping the header of subsequent files, and then produce a consolidated output.csv.
Here's the main logic I've implemented:
File Discovery: It looks for all files in a configured directory (D:\files in the code below) that match the naming pattern temp*.csv.
File Aggregation:
It sorts the discovered files by name.
It reads each file one by one:
If it's the first file, it retains the header.
For subsequent files, it skips the header.
The trailer (last line) of the last file is captured but not immediately written.
Each line is written to the output.csv.
The trailer, with its record count rewritten to the total number of merged records, is written at the end of output.csv.
Each processed temp file is then deleted (the delete() call is currently commented out in the code below while I test).
Spring Batch Configuration:
A Tasklet step (fileAggregatorTasklet) contains the logic for file aggregation.
A single step job (fileAggregatorJob) is defined that uses the above tasklet.
Here's the code:
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.launch.support.RunIdIncrementer;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class BatchConfig {

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    private StepBuilderFactory stepBuilderFactory;

    @Bean
    public Tasklet fileAggregatorTasklet() {
        return (contribution, chunkContext) -> {
            File outputFile = new File("D:\\files\\output.csv");
            try (FileWriter writer = new FileWriter(outputFile)) {
                int totalRecords = 0;
                File[] tempFiles = new File("D:\\files")
                        .listFiles((dir, name) -> name.startsWith("temp") && name.endsWith(".csv"));
                if (tempFiles == null || tempFiles.length == 0) {
                    // listFiles() returns null if the directory doesn't exist
                    return RepeatStatus.FINISHED;
                }
                Arrays.sort(tempFiles, Comparator.comparing(File::getName));
                boolean isFirstFile = true;
                String trailer = "";
                for (File tempFile : tempFiles) {
                    List<String> lines = Files.readAllLines(tempFile.toPath());
                    if (!isFirstFile) {
                        lines.remove(0); // Skip the header of every file after the first
                    } else {
                        isFirstFile = false; // Only keep the header from the first file
                    }
                    if (tempFile == tempFiles[tempFiles.length - 1]) {
                        trailer = lines.remove(lines.size() - 1); // Save and remove trailer line
                    }
                    for (String line : lines) {
                        writer.write(line + System.lineSeparator());
                        totalRecords++;
                    }
                    //tempFile.delete();
                }
                // Trailer processing: overwrite the count at the end of the trailer line
                if (!trailer.isEmpty()) {
                    // Assuming the count is at the end of the trailer line and has a fixed width
                    int countLength = 15; // As specified by the "%015d" format
                    int indexToInsertCount = trailer.length() - countLength;
                    String countStr = String.format("%015d", totalRecords); // New count as a string
                    // Ensure that indexToInsertCount is non-negative
                    if (indexToInsertCount >= 0) {
                        // Construct the new trailer
                        String newTrailer = trailer.substring(0, indexToInsertCount) + countStr;
                        writer.write(newTrailer); // Write the new trailer
                    } else {
                        // Handle the case where trailer is shorter than expected
                        throw new IllegalStateException("Trailer is shorter than expected count length.");
                    }
                }
            } catch (IOException e) {
                throw new RuntimeException("An I/O error occurred", e);
            }
            return RepeatStatus.FINISHED;
        };
    }

    @Bean
    public Step fileAggregatorStep(Tasklet fileAggregatorTasklet) {
        return stepBuilderFactory.get("fileAggregatorStep")
                .tasklet(fileAggregatorTasklet)
                .build();
    }

    @Bean
    public Job fileAggregatorJob(Step fileAggregatorStep) {
        return jobBuilderFactory.get("fileAggregatorJob")
                .incrementer(new RunIdIncrementer())
                .flow(fileAggregatorStep)
                .end()
                .build();
    }
}
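To make the trailer handling concrete, here is the count-rewrite arithmetic pulled out into a standalone demo (same logic as in the tasklet; the sample trailer string is made up for illustration):

```java
public class TrailerDemo {

    // Replace the last 15 characters of the trailer with the zero-padded
    // record count, mirroring the "%015d" logic in the tasklet.
    static String rewriteTrailer(String trailer, int totalRecords) {
        int countLength = 15; // "%015d" produces exactly 15 digits
        int index = trailer.length() - countLength;
        if (index < 0) {
            throw new IllegalStateException("Trailer is shorter than expected count length.");
        }
        return trailer.substring(0, index) + String.format("%015d", totalRecords);
    }

    public static void main(String[] args) {
        String trailer = "TRAILER|000000000000007"; // old per-file count of 7
        System.out.println(rewriteTrailer(trailer, 42)); // TRAILER|000000000000042
    }
}
```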
What I Tried: Before developing this Spring Batch solution, I experimented with manually aggregating files using various file stream operations in plain Java. However, due to the large number of files and the need for better scalability, I decided to leverage Spring Batch for its chunk-based processing and job management capabilities.
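For reference, the stream-based merge I experimented with looked roughly like this (a reconstructed sketch, not my exact code; the temp-directory setup in main is only there to make the demo self-contained, and it does not handle the trailer):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PlainJavaMerge {

    // Merge all temp*.csv files in dir into output: keep only the first file's
    // header and stream line by line, so no file is fully loaded into memory.
    static void merge(Path dir, Path output) throws IOException {
        List<Path> inputs;
        try (Stream<Path> files = Files.list(dir)) {
            inputs = files
                    .filter(p -> p.getFileName().toString().matches("temp.*\\.csv"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        try (BufferedWriter writer = Files.newBufferedWriter(output)) {
            boolean first = true;
            for (Path input : inputs) {
                try (BufferedReader reader = Files.newBufferedReader(input)) {
                    String header = reader.readLine();
                    if (first && header != null) {
                        writer.write(header); // keep the header from the first file only
                        writer.newLine();
                    }
                    first = false;
                    String line;
                    while ((line = reader.readLine()) != null) {
                        writer.write(line); // copy data lines one at a time
                        writer.newLine();
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo setup with throwaway files; the real job points at D:\files
        Path dir = Files.createTempDirectory("merge-demo");
        Files.write(dir.resolve("temp1.csv"), List.of("id,name", "1,a"));
        Files.write(dir.resolve("temp2.csv"), List.of("id,name", "2,b"));
        Path out = dir.resolve("output.csv");
        merge(dir, out);
        System.out.println(Files.readAllLines(out)); // [id,name, 1,a, 2,b]
    }
}
```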
What I'm Expecting:
File Handling: I expect the code to handle any number of temp*.csv files in the directory without running into memory issues. My main concern is that the current version reads each file fully into memory via Files.readAllLines, so regardless of the number of files or their individual sizes, the solution should merge them efficiently.
Consistency: The header should be taken from the first file, and subsequent file headers should be ignored. The trailer from the last file should be the final line in the output.csv.