AWS libraries compatible with Spark 3.1.1

I have a Spark operator with sparkVersion: "3.1.1" and would like to use it for Structured Streaming to/from MinIO. However, I have not been able to find a compatible combination of libraries for anything newer than Hadoop 2.7.0 (which does not support the new s3a:// paths).

Is there a compatible set of spark/hadoop/aws libraries for the 3.1.1 version of spark?
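
For context, this is roughly the s3a setup I am targeting. A minimal sketch, assuming MinIO is reachable at a placeholder endpoint http://minio:9000 with placeholder credentials:

import org.apache.spark.sql.SparkSession

// Sketch only: point s3a:// at a MinIO instance.
// The endpoint, access key, and secret key below are placeholders.
val spark = SparkSession.builder()
  .appName("minio-structured-streaming")
  .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
  .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
  .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
  // MinIO is typically addressed path-style rather than virtual-host-style
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .getOrCreate()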

My current dependencies in sbt should work based on https://mvnrepository.com/, but they don't (I get a NoSuchMethodError at runtime):

scalaVersion := "2.12.0"

lazy val Versions = new {
  val spark = "3.1.1"
  val hadoop = "3.2.0"
  val scalatest = "3.0.4"
}

    "org.apache.spark" %% "spark-core" % Versions.spark % Provided
  , "org.apache.spark" %% "spark-sql" % Versions.spark % Provided
  , "org.apache.spark" %% "spark-hive" % Versions.spark % Provided
  , "org.scalatest"  %% "scalatest" % Versions.scalatest % Test
  , "org.apache.hadoop" % "hadoop-aws" % Versions.hadoop
  , "org.apache.hadoop" % "hadoop-common" % Versions.hadoop
  , "org.apache.hadoop" % "hadoop-mapreduce-client-core" % Versions.hadoop
  , "org.apache.hadoop" % "hadoop-client" %  Versions.hadoop
  , "com.typesafe" % "config" % "1.3.1"
  , "com.github.scopt" %% "scopt" % "3.7.0"
  , "com.github.melrief" %% "purecsv" % "0.1.1"
  , "joda-time" % "joda-time" % "2.9.9"

Thanks a lot for any help.

1 Answer

Answered by Tomáš Sedloň:

This combo of libraries works:

"org.apache.spark" %% "spark-core" % "3.1.1" % Provided,

  "org.apache.spark" %% "spark-sql" % "3.1.1" % Provided,

  "org.apache.hadoop" % "hadoop-aws" % "3.2.0",

  "org.apache.hadoop" % "hadoop-common" % "3.2.0",

  "org.apache.hadoop" % "hadoop-client" % "3.2.0",

  "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "3.2.0",

  "org.apache.hadoop" % "hadoop-minikdc" % "3.2.0",

  "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.375",

  "com.typesafe" % "config" % "1.3.1"

  , "joda-time" % "joda-time" % "2.9.9"

The trick is to use the gcr.io/spark-operator/spark:v3.1.1-hadoop3 image for Spark, as the default image still ships Hadoop 2.7 even for Spark 3.1.1.
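
With those versions in place, a Structured Streaming job can read and write s3a:// paths directly. A minimal sketch, assuming a placeholder bucket my-bucket on MinIO and the s3a endpoint/credentials configured as in the question:

import org.apache.spark.sql.SparkSession

// Sketch only: stream JSON files from one MinIO prefix to Parquet in another.
// Bucket and path names are placeholders.
val spark = SparkSession.builder()
  .appName("minio-streaming-sketch")
  .getOrCreate()

val input = spark.readStream
  .schema("id LONG, payload STRING") // streaming file sources require an explicit schema
  .json("s3a://my-bucket/input/")

val query = input.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/output/")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
  .start()

query.awaitTermination()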