Accelerate Apache POI SUMIF

361 views Asked by At

In an xlsx workbook there are some cells with some unbounded SUMIF formulas like this: SUMIF(MySheetname!$B:$B,$E4,MySheetname!$I:$I). The evaluation of one SUMIF function with Apache POI 5.0.0 lasts 100ms and the evaluation of the given workbook lasts some minutes.

One approach to improve the execution duration is to bound the formulas to something like this: SUMIF(MySheetname!$B1:$B100,$E4,MySheetname!$I1:$I100). In my case this is not a solution as I'm not the author of the xlsx files and the system gets unknown xlsx files from unknown persons (so I can't just tell them to limit the SUMIF ranges).

The current implementation of org.apache.poi.ss.formula.functions.Sumif iterates all cells in the given (unbounded) ranges, so every evaluation iterates 1048576 cells.

This is part of the implementation of method sumMatchingCells(AreaEval, I_MatchPredicate, AreaEval) :

for (int r=0; r<height; r++) {
    for (int c=0; c<width; c++) {
        result += accumulate(aeRange, mp, aeSum, r, c);
    }
}

I would like to improve the performance of this method by checking whether the row or the column actually exists in the sum range. Maybe something like this (with a non existing method sheetContainsRowIndex):

for (int r = 0; r < height; r++) {
    if (aeSum.sheetContainsRowIndex(aeSum.getFirstRow() + r)) {
        for (int c = 0; c < width; c++) {
            if (aeSum.sheetContainsColumnIndex(aeSum.getFirstColumn() + c)) {
               [...]

The LazyAreaEval contains a SheetRangeEvaluator and this contains SheetRefEvaluator s and these contain an EvaluationSheet and this knows at least the getLastRowNum(). Unfortunately this attribute chain is private.

Any idea how to achieve this? Or any other idea how to improve performance for SUMIF execution?

1

There are 1 answers

3
Axel Richter On BEST ANSWER

Patching the apache poi formula evaluation needs a deep immersion into the sources and rummaging through the evaluation process. That is nothing I will do.

But a workaround could be to replace all full column references in formulas by area references from row 1 to last row in sheet before evaluation.

If you only read the workbook then this only affects the random access memory and not the stored file. Of course, if you need to save the changed workbook, then it will affect the stored file. Then the workaround might be not usable.

This has noticeable effect on process duration when there are multiple formulas having full column references in the worksheet, at least using *.xlsx (XSSF) and although the additional replacement process for each formula needs be done.

Complete code example:

import java.io.FileInputStream;

import org.apache.poi.ss.formula.*;
import org.apache.poi.ss.formula.ptg.*;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.*;
import org.apache.poi.hssf.usermodel.*;
import org.apache.poi.ss.SpreadsheetVersion;

public class ExcelEvaluateFullColumnFormulas {

 private static String replaceFullColumnReferences(XSSFSheet sheet, String formula) {
  //System.out.println(formula);
     
  XSSFWorkbook workbook = sheet.getWorkbook();
  XSSFEvaluationWorkbook evaluationWorkbook = XSSFEvaluationWorkbook.create(workbook);
  
  Ptg[] ptgs = FormulaParser.parse(formula, (FormulaParsingWorkbook)evaluationWorkbook, 
   FormulaType.CELL, sheet.getWorkbook().getSheetIndex(sheet));
   
  for (int i = 0; i < ptgs.length; i++) {
   if (ptgs[i] instanceof AreaPtgBase) { // the operand Ptg is an area reference
    AreaPtgBase ref = (AreaPtgBase) ptgs[i];
    if (ref.getFirstRow() == 0 && ref.getLastRow() == SpreadsheetVersion.EXCEL2007.getLastRowIndex()) { // only for full column area references
     int lastRowInSheet = SpreadsheetVersion.EXCEL2007.getLastRowIndex();
     if (ref instanceof Area2DPtgBase) { // the area reference is a 2D area reference in same sheet
      lastRowInSheet = sheet.getLastRowNum(); // get last row of this sheet
     } else if (ref instanceof Area3DPxg) { // the area reference is a 3D area reference in another sheet
      Area3DPxg ref3D = (Area3DPxg)ref; 
      String sheetName = ref3D.getSheetName();
      lastRowInSheet = workbook.getSheet(sheetName).getLastRowNum(); // get last row of referenced sheet
     }      
     ref.setLastRow(lastRowInSheet);
     formula = FormulaRenderer.toFormulaString((FormulaRenderingWorkbook)evaluationWorkbook, ptgs);
    }
   }
  }
  //System.out.println(formula);
  return formula;
  
 }

 public static void main(String[] args) throws Exception {

  DataFormatter formatter = new DataFormatter();
  Workbook workbook = WorkbookFactory.create(new FileInputStream("test.xlsx"));  
  FormulaEvaluator evaluator = workbook.getCreationHelper().createFormulaEvaluator();

  Sheet sheet = workbook.getSheetAt(0);
  java.time.LocalDateTime startTime = java.time.LocalDateTime.now();
  for (Row row : sheet) {
   for (Cell cell : row) {
    ///*
    if (cell.getCellType() == CellType.FORMULA) {
     if (sheet instanceof XSSFSheet){ // do it for XSSF only, not necessary for HSSF.
      String formula = cell.getCellFormula();
      formula = replaceFullColumnReferences((XSSFSheet)sheet, formula);
      cell.setCellFormula(formula);
     }
    }
    //*/
    String value = formatter.formatCellValue(cell, evaluator);
    System.out.print(value + "\t");
   }
   System.out.println();
  }
  
 java.time.LocalDateTime endTime = java.time.LocalDateTime.now();
 java.time.Duration duration = java.time.Duration.between(startTime, endTime);
 System.out.println("process duration: " + duration);
 
 workbook.close();
 }
}

Comment out the part

...
    /*
    if (cell.getCellType() == CellType.FORMULA) {
     if (sheet instanceof XSSFSheet){ // do it for XSSF only, not necessary for HSSF.
      String formula = cell.getCellFormula();
      formula = replaceFullColumnReferences((XSSFSheet)sheet, formula);
      cell.setCellFormula(formula);
     }
    }
    */
...

to see the difference.