I have an older version of Spark set up with YARN that I don't want to wipe out, but I still want to use a newer version. I found a couple of posts referring to how a fat jar can be used for this.
Many SO posts point to either Maven (officially supported) or sbt to build a fat jar, because it's not directly available for download. There seem to be multiple plugins to do it with Maven: maven-assembly-plugin, maven-shade-plugin, onejar-maven-plugin, etc.
However, I can't figure out whether I really need a plugin at all and, if so, which one and how exactly to go about it. I tried directly compiling the GitHub source using 'build/mvn' and 'build/sbt', but the resulting 'spark-assembly_2.11-2.0.2.jar' file is just 283 bytes.
My goal is to run the pyspark shell using the newer version's fat jar, in a similar way as mentioned here.
The easiest solution (without changing your Spark on YARN architecture or having to speak to your YARN admins) is to:

1. Define a library dependency on Spark 2 in your build system, be it sbt or Maven (see the sketch below).
2. Assemble your Spark application to create a so-called uber-jar or fatjar with the Spark libraries inside.
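For instance, with sbt those two steps amount to declaring the Spark 2 dependency and enabling the sbt-assembly plugin. A minimal sketch follows; the project name, Scala version, and version numbers are illustrative (Spark 2.0.2 / Scala 2.11 simply mirror the jar name in the question):

```scala
// project/plugins.sbt -- enables the "sbt assembly" task
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
```

```scala
// build.sbt
name := "my-spark-app"          // placeholder project name
scalaVersion := "2.11.8"

// Spark 2 is pulled into the fat jar. Do NOT mark it "provided" here --
// the whole point is to ship Spark 2 inside the jar to a cluster that
// only has the older Spark installed.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.2"

// Spark's transitive dependencies bring clashing META-INF entries, so a
// merge strategy is usually needed; this is a simplified example you may
// have to tune for your own dependencies.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}
```

With that in place, sbt assembly produces a single large jar under target/scala-2.11/ containing both your code and the Spark 2 classes.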
It works; I have personally tested it in at least one project.
The only (?) downside is that the build process takes longer (you have to run `sbt assembly`, not `sbt package`) and the deployable fatjar of your Spark application is...well...much bigger. That also makes the deployment longer, since you have to `spark-submit` it to YARN over the wire.

All in all, it works but takes longer (which may still be shorter than convincing your admin gods to, say, forget about what is available in commercial offerings like Cloudera's CDH, Hortonworks' HDP, or the MapR distro).