Spark saveAsTable with LOCATION at S3 bucket's root causes NullPointerException


I am working with Spark 3.0.1 and my partitioned table is stored in S3. Here is a description of the issue.

Create Table

CREATE TABLE root_table_test_spark_3_0_1 (
    id string,
    name string
)
USING PARQUET
PARTITIONED BY (id)
LOCATION 's3a://MY_BUCKET_NAME/'

Code that is causing the NullPointerException on the second run

import org.apache.spark.sql.SaveMode
import spark.implicits._ // needed for .toDS(); assumes `spark` is the active SparkSession
case class MinimalObject(id: String, name: String) // assumed to match the table schema

Seq(MinimalObject("id_1", "name_1"), MinimalObject("id_2", "name_2"))
  .toDS()
  .write
  .partitionBy("id")
  .mode(SaveMode.Append)
  .saveAsTable("root_table_test_spark_3_0_1")

When the Hive metastore is empty everything is fine, but the issue occurs when Spark tries to run getCustomPartitionLocations in the InsertIntoHadoopFsRelationCommand phase (on the second run, for example).

Indeed, it calls the method below (from org.apache.hadoop.fs.Path):

/** Adds a suffix to the final name in the path.*/
public Path suffix(String suffix) {
    return new Path(getParent(), getName()+suffix);
}

But getParent() returns null when the path is the root, resulting in a NullPointerException. The only option I can think of at the moment is to override this method to do something like:

/** Adds a suffix to the final name in the path. */
public Path suffix(String suffix) {
    return isRoot()
        ? new Path(uri.getScheme(), uri.getAuthority(), suffix)
        : new Path(getParent(), getName() + suffix);
}

Has anyone had issues when the LOCATION of a Spark Hive table is at the root level? Any workaround? Is there a known issue open for this?

My runtime does not allow me to override the Path class and fix the suffix method, and I can't move my data out of the bucket's root since it has been there for 2 years now.

The issue appeared because I'm migrating from Spark 2.1.0 to Spark 3.0.1, and the behavior that checks custom partition locations was introduced in Spark 2.2.0 (https://github.com/apache/spark/pull/16460).

This whole context helps to understand the problem, but basically you can reproduce it easily with:

import org.apache.hadoop.fs.Path

val path: Path = new Path("s3a://MY_BUCKET_NAME/")
println(path.suffix("/id=id"))
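
The root cause is that getParent() returns null for a root path, so suffix() ends up constructing a Path with a null parent. A minimal check (my own sketch, reusing the same bucket name) shows this:

import org.apache.hadoop.fs.Path

val root = new Path("s3a://MY_BUCKET_NAME/")
println(root.getParent)   // prints "null": a root path has no parent
// root.suffix("/id=id")  // therefore throws the same NullPointerException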

FYI, the hadoop-common version is 2.7.4. The full stack trace is:

NullPointerException
    at org.apache.hadoop.fs.Path.<init>(Path.java:104)
    at org.apache.hadoop.fs.Path.<init>(Path.java:93)
    at org.apache.hadoop.fs.Path.suffix(Path.java:361)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.$anonfun$getCustomPartitionLocations$1(InsertIntoHadoopFsRelationCommand.scala:262)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.getCustomPartitionLocations(InsertIntoHadoopFsRelationCommand.scala:260)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:107)
    at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:575)
    at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:218)
    at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:166)

Thanks


1 Answer

Accepted answer, by stevel

Looks like a situation where the Spark code calls Path.suffix("something") and, because the root path has no parent, an NPE is triggered.

Long-term fix

  1. File a JIRA on issues.apache.org against HADOOP and provide a patch with a test that fixes suffix() to degrade gracefully when called on a root path. Best for everyone.
  2. Don't use the root path as the destination of a table.
  3. Do both of these.

Option #2 should also avoid other surprises about how tables are created/committed, etc. Some of the code may fail because an attempt to delete the root of the path (here s3a://some-bucket) won't actually delete the root, will it?
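
A minimal sketch of option #2: recreate the table under a subdirectory instead of the bucket root, after dropping the old table definition. The "tables/..." prefix is illustrative, not something from the question, and the existing data would need to be moved or re-registered under it:

// Sketch only: any non-root prefix works; this one is an assumed example.
spark.sql(
  """CREATE TABLE root_table_test_spark_3_0_1 (
    |  id string,
    |  name string
    |)
    |USING PARQUET
    |PARTITIONED BY (id)
    |LOCATION 's3a://MY_BUCKET_NAME/tables/root_table_test_spark_3_0_1/'""".stripMargin)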

Put differently: root directories have "odd" semantics everywhere. Most of the time you don't notice this on a local FS because you never try to use / as a destination of work, and you get surprised that rm -rf / is different from rm -rf /subdir, etc. Spark, Hive, etc. were never written to use / as a destination of work, so you get to see the failures.