We are trying to read one specific sheet from an Excel file (.xlsx with 3 sheets) into a Spark DataFrame using org.zuinnote.spark.office.excel.
We are using the MSExcelLowFootprintParser parser.
The code used is:
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.types._

val hadoopConf = new Configuration()
val spark = SparkSession.builder()
  .appName("ExcelReadingExample")
  .master("local[*]")
  .getOrCreate()
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.lowFootprint", "true")
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.header.read", "true")
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.sheets", "Most Runs Over - 2010")
spark.sparkContext.setLogLevel("INFO")
val schema = StructType(Seq(
  StructField("col", IntegerType, nullable = true),
  StructField("col1", StringType, nullable = true),
  StructField("Runs", IntegerType, nullable = true),
  StructField("BF", IntegerType, nullable = true),
  StructField("SR", DoubleType, nullable = true),
  StructField("s4", IntegerType, nullable = true),
  StructField("s6", IntegerType, nullable = true),
  StructField("Against", StringType, nullable = true),
  StructField("Venue", StringType, nullable = true),
  StructField("Match Date", StringType, nullable = true),
  StructField("Match Time", StringType, nullable = true),
  StructField("Match Partition Time", StringType, nullable = true)
))
val df: Dataset[Row] = spark.read
  .format("org.zuinnote.spark.office.excel")
  .option("hadoopoffice.read.sheets", "Most Runs Over - 2010")
  .option("read.spark.simpleMode", "true")
  .option("read.lowFootprint", "true")
  .option("multiLine", "true")
  .option("read.spark.simpleMode.maxInferRows", "1000")
  .option("read.header.read", "true")
  .schema(schema)
  .load("D:\\excel\\spark-hadoopoffice-ds-s2-ho-1.3.9\\spark-hadoopoffice-ds-s2-ho-1.3.9\\src\\resources\\MostRuns_Over2008.xlsx")
df.show()
We are facing this exception:
java.lang.NullPointerException
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetName(XSSFReader.java:325)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:374)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.parse(MSExcelLowFootprintParser.java:267)
at org.zuinnote.hadoop.office.format.common.OfficeReader.parse(OfficeReader.java:92)
at org.zuinnote.hadoop.office.format.mapreduce.AbstractSpreadSheetDocumentRecordReader.initialize(AbstractSpreadSheetDocumentRecordReader.java:138)
at org.zuinnote.spark.office.excel.HadoopFileExcelReader.<init>(HadoopFileExcelReader.scala:55)
at org.zuinnote.spark.office.excel.DefaultSource.$anonfun$buildReader$4(DefaultSource.scala:322)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:185)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:240)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:832)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1449)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2281)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
It works if we read all sheets instead of one sheet. It also works if we don't use the low-footprint parser.
- org.apache.poi version 4.1.2
- spark-hadoopoffice-ds version 1.3.9
- spark version 3.1.2
Well observed. Please file a bug about this at https://github.com/ZuInnoTe/hadoopoffice/issues.
Reason for the NullPointerException
`MSExcelLowFootprintParser` at line 374 calls `iter.getSheetName()`, where `iter` is an `XSSFReader.SheetIterator`. `XSSFReader.SheetIterator.getSheetName` then calls `xssfSheetRef.getName()`, but the `XSSFSheetRef xssfSheetRef` field is never assigned and stays null unless `XSSFReader.SheetIterator.next` has been called.

It works if you read all sheets instead of one sheet because line 374 of `MSExcelLowFootprintParser.java` is never reached when no sheet names are given. In that case, the first call after `while (iter.hasNext()) {` is `InputStream rawSheetInputStream = iter.next();`, which invokes `XSSFReader.SheetIterator.next` and sets `xssfSheetRef`.

It also works if you don't use the low-footprint parser because that parser does not use `org.apache.poi.xssf.eventusermodel.XSSFReader` at all.

Solution

`InputStream rawSheetInputStream = iter.next();` needs to be called immediately after `while (iter.hasNext()) {` in `org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage`. That sets `XSSFReader.SheetIterator.xssfSheetRef` and so avoids the NullPointerException.
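To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern involved. The class below is not POI's actual code; `SheetIteratorLike`, `getCurrentSheetName`, and the field names are illustrative stand-ins that mimic how `XSSFReader.SheetIterator` only assigns its current-element reference inside `next()`:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Minimal stand-in for XSSFReader.SheetIterator. As in POI, the reference
// to the current sheet is only assigned inside next(); a getter called
// before next() therefore dereferences null. All names are illustrative.
class SheetIteratorLike implements Iterator<String> {
    private final String[] sheetRefs = {"Sheet1", "Most Runs Over - 2010"};
    private int pos = 0;
    private String current; // stays null until next() is called (like xssfSheetRef)

    @Override
    public boolean hasNext() {
        return pos < sheetRefs.length;
    }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        current = sheetRefs[pos++]; // only here does 'current' get assigned
        return current;
    }

    // Mirrors getSheetName(): dereferences the current element.
    public String getCurrentSheetName() {
        return current.trim(); // NullPointerException if next() was never called
    }
}

public class SheetIteratorDemo {
    public static void main(String[] args) {
        SheetIteratorLike iter = new SheetIteratorLike();
        while (iter.hasNext()) {
            // Buggy order: calling iter.getCurrentSheetName() here would throw
            // a NullPointerException on the first pass, just like line 374.
            // Correct order: advance the iterator first, then read its state.
            iter.next();
            System.out.println(iter.getCurrentSheetName());
        }
    }
}
```

This is the same fix proposed above: move the `iter.next()` call so it runs before any per-sheet getter inside the loop.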