The common problems when building and deploying Spark applications are:
java.lang.ClassNotFoundException
object x is not a member of package y compilation errors
java.lang.NoSuchMethodError
How can these be resolved?
The dependency classes of your application should be specified in the application-jar option of your launch command.
More details can be found in the Spark documentation.
Taken from the documentation:
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes
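If it helps, here is a minimal Scala sketch of the same idea using Spark's SparkLauncher API, a programmatic counterpart to spark-submit; the paths, class name and master below are hypothetical placeholders, not taken from the question.
import org.apache.spark.launcher.SparkLauncher

object Launch {
  def main(args: Array[String]): Unit = {
    // setAppResource plays the role of application-jar: the bundled jar
    // containing your application and its (non-provided) dependencies.
    val process = new SparkLauncher()
      .setAppResource("/path/to/my-app-assembly.jar")   // hypothetical path
      .setMainClass("com.example.Main")                 // hypothetical class
      .setMaster("yarn")
      .setDeployMode("cluster")
      .addJar("/path/to/extra-dependency.jar")          // additional jars, like --jars
      .launch()
    process.waitFor()
  }
}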
In addition to the very extensive answer already given by user7337271: if the problem results from missing external dependencies, you can build a jar with your dependencies using e.g. the Maven assembly plugin.
In that case, make sure to mark all the core Spark dependencies as "provided" in your build system and, as already noted, make sure they correlate with your runtime Spark version.
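As a sketch of the same idea in sbt (assuming sbt-assembly builds the fat jar; the versions and the non-Spark dependency are illustrative placeholders):
// build.sbt fragment: Spark itself is marked "provided" (supplied by the
// cluster at runtime, so sbt-assembly leaves it out of the fat jar), while
// genuinely external dependencies get bundled.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.0" % "provided",
  "com.typesafe"     %  "config"     % "1.3.3"              // bundled into the assembly
)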
Apache Spark's classpath is built dynamically (to accommodate per-application user code) which makes it vulnerable to such issues. @user7337271's answer is correct, but there are some more concerns, depending on the cluster manager ("master") you're using.
First, a Spark application consists of these components (each one is a separate JVM, therefore potentially contains different classes in its classpath):
Driver: your application, creating a SparkSession (or SparkContext) and connecting to a cluster manager to perform the actual work
Executors: processes running on the cluster nodes, executing the actual Spark tasks
Cluster manager: e.g. Standalone, YARN or Mesos
The relationship between these is described in this diagram from Apache Spark's cluster mode overview:
Now - which classes should reside in each of these components?
This can be answered by the following diagram:
Let's parse that slowly:
Spark Code: these are Spark's libraries. They should exist in ALL three components, as they include the glue that lets Spark perform the communication between them. By the way - Spark's authors made a design decision to include code for ALL components in ALL components (e.g. to include code that should only run in an Executor in the driver too) to simplify this - so Spark's "fat jar" (in versions up to 1.6) or "archive" (in 2.0, details below) contains the necessary code for all components and should be available in all of them.
Driver-Only Code: this is user code that does not include anything that should be used on Executors, i.e. code that isn't used in any transformations on the RDD / DataFrame / Dataset. This does not necessarily have to be separated from the distributed user code, but it can be.
Distributed Code: this is user code that is compiled with the driver code, but also has to be executed on the Executors - everything the actual transformations use must be included in this jar (or jars); a short sketch of the distinction follows this list.
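To make the driver-only vs. distributed distinction concrete, here is a minimal Scala sketch; the input path and the Parsers helper are hypothetical illustrations, not part of the original answers.
import org.apache.spark.sql.SparkSession

object App {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("demo").getOrCreate()

    // Driver-only code: runs in the driver JVM and is never shipped to executors.
    val threshold = args.headOption.map(_.toInt).getOrElse(10)

    // Distributed code: the closures below (and everything they reference,
    // e.g. Parsers.parse) are serialized and executed on the executors, so
    // those classes must be present in the executors' classpath.
    val count = spark.sparkContext
      .textFile("hdfs:///data/input.txt")      // hypothetical path
      .map(line => Parsers.parse(line))
      .filter(_.length > threshold)
      .count()

    println(s"count = $count")                 // driver-only again
  }
}

// Hypothetical helper used inside a transformation: it must be shipped to the
// executors (e.g. inside the fat jar or via spark.jars).
object Parsers {
  def parse(line: String): String = line.trim.toLowerCase
}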
Now that we got that straight, how do we get the classes to load correctly in each component, and what rules should they follow?
Spark Code: as previous answers state, you must use the same Scala and Spark versions in all components.
1.1 In Standalone mode, there's a "pre-existing" Spark installation to which applications (drivers) can connect. That means that all drivers must use that same Spark version running on the master and executors.
1.2 In YARN / Mesos, each application can use a different Spark version, but all components of the same application must use the same one. That means that if you used version X to compile and package your driver application, you should provide the same version when starting the SparkSession (e.g. via the spark.yarn.archive or spark.yarn.jars parameters when using YARN). The jars / archive you provide should include all Spark dependencies (including transitive dependencies), and it will be shipped by the cluster manager to each executor when the application starts.
Driver Code: that's entirely up to you - driver code can be shipped as a bunch of jars or as a "fat jar", as long as it includes all Spark dependencies plus all the user code.
Distributed Code: in addition to being present on the Driver, this code must be shipped to the executors (again, along with all of its transitive dependencies). This is done using the spark.jars parameter.
To summarize, here's a suggested approach to building and deploying a Spark Application (in this case - using YARN):
Pass the right version of your distributed code as the value of the spark.jars parameter when starting the SparkSession.
Pass the location of an archive containing all the jars under the lib/ folder of the downloaded Spark binaries as the value of spark.yarn.archive.
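As a minimal Scala sketch of those last two steps (the HDFS paths below are hypothetical placeholders, assuming a YARN deployment):
import org.apache.spark.sql.SparkSession

object StartSession {
  def main(args: Array[String]): Unit = {
    // Driver code: start a session against YARN, pointing Spark at the
    // distributed-code jar and at the archive of Spark's own jars.
    val spark = SparkSession.builder()
      .appName("my-app")
      .master("yarn")
      .config("spark.jars", "hdfs:///apps/my-app/distributed-code.jar")   // hypothetical
      .config("spark.yarn.archive", "hdfs:///apps/spark/spark-libs.zip")  // hypothetical
      .getOrCreate()

    spark.range(10).count()   // trivial job to exercise the executors
    spark.stop()
  }
}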
I think this problem can be solved with the assembly plugin: you need to build a fat jar. For example, in sbt:
create $PROJECT_ROOT/project/assembly.sbt with the line addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.0")
add the libraries you need to build.sbt, e.g. libraryDependencies ++= Seq("com.some.company" %% "some-lib" % "1.0.0")
If you need more information, go to https://github.com/sbt/sbt-assembly
I have the following build.sbt
lazy val root = (project in file(".")).
  settings(
    name := "spark-samples",
    version := "1.0",
    scalaVersion := "2.11.12",
    mainClass in Compile := Some("StreamingExample")
  )
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.0",
"org.apache.spark" %% "spark-streaming" % "2.4.0",
"org.apache.spark" %% "spark-sql" % "2.4.0",
"com.couchbase.client" %% "spark-connector" % "2.2.0"
)
// META-INF discarding
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
I've created a fat jar of my application using the sbt assembly plugin, but when running it with spark-submit it fails with the error:
java.lang.NoClassDefFoundError: rx/Completable$OnSubscribe
at com.couchbase.spark.connection.CouchbaseConnection.streamClient(CouchbaseConnection.scala:154)
I can see that the class exists in my fat jar:
jar tf target/scala-2.11/spark-samples-assembly-1.0.jar | grep 'Completable$OnSubscribe'
rx/Completable$OnSubscribe.class
Not sure what I am missing here, any clues?
Add all the jar files from spark-2.4.0-bin-hadoop2.7\spark-2.4.0-bin-hadoop2.7\jars to the project. The spark-2.4.0-bin-hadoop2.7 distribution can be downloaded from https://spark.apache.org/downloads.html
When building and deploying Spark applications all dependencies require compatible versions.
Scala version. All packages have to use the same major (2.10, 2.11, 2.12) Scala version.
Consider the following (incorrect) build.sbt: we use spark-streaming built for Scala 2.10 while the remaining packages are built for Scala 2.11. A valid file uses the same Scala suffix for every package, but it is better to specify the Scala version globally and use %% (which appends the Scala version for you). The same applies in Maven, where the Scala version is part of each artifactId (e.g. spark-core_2.11). A sketch follows.
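A sketch of what such build.sbt files look like (the library versions here are illustrative, not taken from the question):
// Incorrect: spark-streaming is pinned to the Scala 2.10 artifact while the
// other packages use Scala 2.11 artifacts.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.1",
  "org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
  "org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
)

// Better: set scalaVersion once and let %% append the right suffix everywhere.
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.1",
  "org.apache.spark" %% "spark-streaming" % "2.0.1",
  "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.1"
)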
Spark version. All packages have to use the same major Spark version (1.6, 2.0, 2.1, ...).
Consider the following (incorrect) build.sbt: we use spark-core 1.6 while the remaining components are on Spark 2.0. A valid file uses the same Spark version for every component, and it is better to define that version once as a variable (a variable alone is still incorrect if the Scala suffixes remain mixed, as in the previous point). The same applies in Maven, where the Spark version can go into a property. A sketch follows.
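A sketch under the same assumptions (illustrative versions):
// Incorrect: spark-core is on Spark 1.6 while the other components are on 2.0.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1",
  "org.apache.spark" %% "spark-streaming" % "2.0.1",
  "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.1"
)

// Better: define the Spark version once and reuse it, so components cannot drift apart.
val sparkVersion = "2.0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.bahir" %% "spark-streaming-twitter" % sparkVersion
)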
The Spark version used in the Spark dependencies has to match the Spark version of the Spark installation. For example, if you use 1.6.1 on the cluster, you have to use 1.6.1 to build your jars. Minor version mismatches are not always accepted.
The Scala version used to build the jar has to match the Scala version used to build the deployed Spark. By default (downloadable binaries and default builds): Spark 1.x -> Scala 2.10, Spark 2.x -> Scala 2.11.
Additional packages that are not included in the fat jar have to be accessible on the worker nodes. There are a number of options, including:
--jars argument for spark-submit - to distribute local jar files.
--packages argument for spark-submit - to fetch dependencies from a Maven repository.
When submitting in cluster mode you should include the application jar in --jars.
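For illustration, the same options can also be set programmatically; a minimal Scala sketch, assuming hypothetical jar paths and package coordinates (spark.jars and spark.jars.packages are the configuration equivalents of --jars and --packages):
import org.apache.spark.sql.SparkSession

object DepsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("deps-demo")
      // Configuration equivalent of --jars: distribute local jar files.
      .config("spark.jars", "/path/to/extra-lib.jar")                        // hypothetical path
      // Configuration equivalent of --packages: resolve Maven coordinates.
      .config("spark.jars.packages", "com.some.company:some-lib_2.11:1.0.0") // hypothetical coordinates
      .getOrCreate()

    spark.stop()
  }
}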