We are trying to read one specific sheet from an Excel file (.xlsx with 3 sheets) into a Spark DataFrame using org.zuinnote.spark.office.excel.
We are using the MSExcelLowFootprintParser parser.
The code used is:
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.types._

val hadoopConf = new Configuration()
val spark = SparkSession.builder()
  .appName("ExcelReadingExample")
  .master("local[*]")
  .getOrCreate()
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.lowFootprint", "true")
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.header.read", "true")
spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.sheets", "Most Runs Over - 2010")
spark.sparkContext.setLogLevel("INFO")
val schema = StructType(Seq(
  StructField("col", IntegerType, nullable = true),
  StructField("col1", StringType, nullable = true),
  StructField("Runs", IntegerType, nullable = true),
  StructField("BF", IntegerType, nullable = true),
  StructField("SR", DoubleType, nullable = true),
  StructField("s4", IntegerType, nullable = true),
  StructField("s6", IntegerType, nullable = true),
  StructField("Against", StringType, nullable = true),
  StructField("Venue", StringType, nullable = true),
  StructField("Match Date", StringType, nullable = true),
  StructField("Match Time", StringType, nullable = true),
  StructField("Match Partition Time", StringType, nullable = true)
))
val df: Dataset[Row] = spark.read
  .format("org.zuinnote.spark.office.excel")
  .option("hadoopoffice.read.sheets", "Most Runs Over - 2010")
  .option("read.spark.simpleMode", "true")
  .option("read.lowFootprint", "true")
  .option("multiLine", "true")
  .option("read.spark.simpleMode.maxInferRows", "1000")
  .option("read.header.read", "true")
  .schema(schema)
  .load("D:\\excel\\spark-hadoopoffice-ds-s2-ho-1.3.9\\spark-hadoopoffice-ds-s2-ho-1.3.9\\src\\resources\\MostRuns_Over2008.xlsx")
df.show()
We are facing this exception:
java.lang.NullPointerException
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetName(XSSFReader.java:325)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:374)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.parse(MSExcelLowFootprintParser.java:267)
at org.zuinnote.hadoop.office.format.common.OfficeReader.parse(OfficeReader.java:92)
at org.zuinnote.hadoop.office.format.mapreduce.AbstractSpreadSheetDocumentRecordReader.initialize(AbstractSpreadSheetDocumentRecordReader.java:138)
at org.zuinnote.spark.office.excel.HadoopFileExcelReader.<init>(HadoopFileExcelReader.scala:55)
at org.zuinnote.spark.office.excel.DefaultSource.$anonfun$buildReader$4(DefaultSource.scala:322)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:185)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:240)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:832)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1449)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2281)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
It works if we read all sheets instead of one sheet. It also works if we don't use the low-footprint parser.
- org.apache.poi version 4.1.2
- spark-hadoopoffice-ds version 1.3.9
- spark version 3.1.2
Well observed. Please file a bug about this at https://github.com/ZuInnoTe/hadoopoffice/issues.
Reason for the NullPointerException
`MSExcelLowFootprintParser` at line 374 calls `iter.getSheetName()`, where `iter` is an `XSSFReader.SheetIterator`. `XSSFReader.SheetIterator.getSheetName` then calls `xssfSheetRef.getName()`, but the `XSSFSheetRef xssfSheetRef` field is never assigned and stays null unless `XSSFReader.SheetIterator.next` has been called.

It works if you read all sheets instead of one sheet because line 374 of `MSExcelLowFootprintParser.java` is never reached when no sheet names are given. In that case, the first call after `while (iter.hasNext()) {` is `InputStream rawSheetInputStream = iter.next();`, which invokes `XSSFReader.SheetIterator.next` and sets `xssfSheetRef`.

It also works if you don't use the low-footprint parser because that parser does not use `org.apache.poi.xssf.eventusermodel.XSSFReader` at all.

Solution

`InputStream rawSheetInputStream = iter.next();` needs to be called immediately after `while (iter.hasNext()) {` in `org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage`. That sets `XSSFReader.SheetIterator.xssfSheetRef` and so avoids the NullPointerException.
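To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern involved. The class below is not POI's actual code; `SheetIteratorLike`, `getCurrentSheetName`, and the field names are illustrative stand-ins that mimic how `XSSFReader.SheetIterator` only assigns its current-element reference inside `next()`:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Minimal stand-in for XSSFReader.SheetIterator. As in POI, the reference
// to the current sheet is only assigned inside next(); a getter called
// before next() therefore dereferences null. All names are illustrative.
class SheetIteratorLike implements Iterator<String> {
    private final String[] sheetRefs = {"Sheet1", "Most Runs Over - 2010"};
    private int pos = 0;
    private String current; // stays null until next() is called (like xssfSheetRef)

    @Override
    public boolean hasNext() {
        return pos < sheetRefs.length;
    }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        current = sheetRefs[pos++]; // only here does 'current' get assigned
        return current;
    }

    // Mirrors getSheetName(): dereferences the current element.
    public String getCurrentSheetName() {
        return current.trim(); // NullPointerException if next() was never called
    }
}

public class SheetIteratorDemo {
    public static void main(String[] args) {
        SheetIteratorLike iter = new SheetIteratorLike();
        while (iter.hasNext()) {
            // Buggy order: calling iter.getCurrentSheetName() here would throw
            // a NullPointerException on the first pass, just like line 374.
            // Correct order: advance the iterator first, then read its state.
            iter.next();
            System.out.println(iter.getCurrentSheetName());
        }
    }
}
```

This is the same fix proposed above: move the `iter.next()` call so it runs before any per-sheet getter inside the loop.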