I'm trying to spark-submit a PySpark application, but every time I run it, it throws this error when it tries to download a pre-trained model from Spark NLP:
TypeError: 'JavaPackage' object is not callable
Any idea what might be causing this? Interestingly, I've been practicing with these pre-trained pipelines in a Jupyter notebook and there it worked fine.
In case it's relevant, I'm using Java 8, Spark 3.2.1, PySpark 3.2.1, Spark NLP 3.4.0 and Python 3.10 (I've also tried with 3.9).
I'm also using a pipenv environment.
This is my spark session configuration:
packages = ",".join(
    [
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1",
        "com.amazonaws:aws-java-sdk:1.11.563",
        "org.apache.hadoop:hadoop-aws:3.2.2",
        "org.apache.hadoop:hadoop-client-api:3.2.2",
        "org.apache.hadoop:hadoop-client-runtime:3.2.2",
        "org.apache.hadoop:hadoop-yarn-server-web-proxy:3.2.2",
        "com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2",
        "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1",
    ]
)
# Starts spark session
spark = (
    SparkSession.builder.appName("twitter_app_nlp")
    .master("local[*]")
    .config("spark.jars.packages", packages)
    .config("spark.streaming.stopGracefullyOnShutdown", "true")
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
    )
    .config("spark.hadoop.fs.s3a.access.key", ACCESS_KEY)
    .config("spark.hadoop.fs.s3a.secret.key", SECRET_ACCESS_KEY)
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.sql.shuffle.partitions", 3)
    .config("spark.driver.memory", "8G")
    .config("spark.driver.maxResultSize", "0")
    .config("spark.kryoserializer.buffer.max", "2000M")
    .config("spark.mongodb.input.uri", mongoDB)
    .config("spark.mongodb.output.uri", mongoDB)
    .getOrCreate()
)
I've seen that installing the fat JAR inside the environment normally solves this. I did that and added it to the Spark session like this:
.config('spark.jars.packages', '/Users/mac/.local/share/virtualenvs/tests-uPYwcfrj/lib/spark-nlp-assembly-3.4.3.jar')
But it still doesn't work. I've also tried passing that path to .config('spark.driver.extraClassPath'), but no luck either.
I also tried using --packages in the command:
spark-submit main.py --files config.json \
--packages com.johnsnowlabs.nlp:spark-nlp-spark_2.12:3.4.3
But that didn't work either.
Your spark-submit command passes --packages with a remote library coordinate that I'm not sure is in the correct format. Maybe try using an absolute path or a file:/ URI instead, because directory expansion does not work with --jars. Also note that spark-submit options have to come before the application script: anything placed after main.py is treated as an argument to your application, so the --files and --packages in your command are never seen by spark-submit (see Submitting Applications - Advanced Dependency Management).
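A sketch of a corrected command (the JAR path and the Maven coordinate are copied from your question and may need adjusting for your setup):

spark-submit \
  --files config.json \
  --jars file:///Users/mac/.local/share/virtualenvs/tests-uPYwcfrj/lib/spark-nlp-assembly-3.4.3.jar \
  main.py

Here every option comes before main.py, and the local fat JAR is referenced with a file:/ URI; alternatively, use --packages with the Maven coordinate (also before main.py) if you want Spark to resolve it from Maven.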
If this works, it might be that Spark was not able to find the JARs in your pipenv environment, so you need to configure the paths in the Spark configuration or use the --driver-class-path option. Note: setting spark.driver.extraClassPath in the configuration has no effect in client mode, and the default deploy-mode is client (see Submitting Applications).
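If you would rather keep everything in the SparkSession configuration, a minimal sketch (assuming the assembly JAR path from your question) would point spark.jars at the file, since spark.jars takes file paths/URIs while spark.jars.packages takes Maven coordinates:

from pyspark.sql import SparkSession

# Path copied from the question; adjust it for your environment.
fat_jar = "/Users/mac/.local/share/virtualenvs/tests-uPYwcfrj/lib/spark-nlp-assembly-3.4.3.jar"

spark = (
    SparkSession.builder.appName("twitter_app_nlp")
    .master("local[*]")
    # spark.jars expects local paths or URIs; spark.jars.packages expects Maven coordinates
    .config("spark.jars", fat_jar)
    .getOrCreate()
)

This avoids spark.driver.extraClassPath entirely, which, as noted above, is ignored when set programmatically in client mode.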