In an xlsx workbook there are some cells with some unbounded SUMIF formulas like this: SUMIF(MySheetname!$B:$B,$E4,MySheetname!$I:$I)
.
The evaluation of one SUMIF function with Apache POI 5.0.0 lasts 100ms and the evaluation of the given workbook lasts some minutes.
One approach to improve the execution duration is to bound the formulas to something like this: SUMIF(MySheetname!$B1:$B100,$E4,MySheetname!$I1:$I100)
. In my case this is not a solution as I'm not the author of the xlsx files and the system gets unknown xlsx files from unknown persons (so I can't just tell them to limit the SUMIF ranges).
The current implementation of org.apache.poi.ss.formula.functions.Sumif
iterates all cells in the given (unbounded) ranges, so every evaluation iterates 1048576 cells.
This is part of the implementation of method sumMatchingCells(AreaEval, I_MatchPredicate, AreaEval)
:
for (int r=0; r<height; r++) {
for (int c=0; c<width; c++) {
result += accumulate(aeRange, mp, aeSum, r, c);
}
}
I would like to improve the performance of this method by checking whether the row or the column actually exists in the sum range. Maybe something like this (with a non existing method sheetContainsRowIndex
):
for (int r = 0; r < height; r++) {
if (aeSum.sheetContainsRowIndex(aeSum.getFirstRow() + r)) {
for (int c = 0; c < width; c++) {
if (aeSum.sheetContainsColumnIndex(aeSum.getFirstColumn() + c)) {
[...]
The LazyAreaEval
contains a SheetRangeEvaluator
and this contains SheetRefEvaluator
s and these contain an EvaluationSheet
and this knows at least the getLastRowNum()
. Unfortunately this attribute chain is private.
Any idea how to achieve this? Or any other idea how to improve performance for SUMIF execution?
Patching the
apache poi
formula evaluation needs a deep immersion into the sources and rummaging through the evaluation process. That is nothing I will do.But a workaround could be to replace all full column references in formulas by area references from row 1 to last row in sheet before evaluation.
If you only read the workbook then this only affects the random access memory and not the stored file. Of course, if you need to save the changed workbook, then it will affect the stored file. Then the workaround might be not usable.
This has noticeable effect on process duration when there are multiple formulas having full column references in the worksheet, at least using
*.xlsx
(XSSF
) and although the additional replacement process for each formula needs be done.Complete code example:
Comment out the part
to see the difference.