Spark NLP is not working in PySpark: TypeError: 'JavaPackage' object is not callable


I'm trying to spark-submit a PySpark application, but every time I run it, it throws this error when it tries to download a pre-trained model from Spark NLP:

TypeError: 'JavaPackage' object is not callable

Any idea what might be causing this? Interestingly, I've been using these same pre-trained pipelines in a Jupyter notebook and they worked fine there.

In case it's relevant, I'm using Java 8, Spark 3.2.1, PySpark 3.2.1, Spark NLP 3.4.0 and Python 3.10 (I've also tried with 3.9).

I'm also using a pipenv environment.

This is my spark session configuration:

packages = ",".join(
    [
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1",
        "com.amazonaws:aws-java-sdk:1.11.563",
        "org.apache.hadoop:hadoop-aws:3.2.2",
        "org.apache.hadoop:hadoop-client-api:3.2.2",
        "org.apache.hadoop:hadoop-client-runtime:3.2.2",
        "org.apache.hadoop:hadoop-yarn-server-web-proxy:3.2.2",
        "com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2",
        "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1",
    ]
)

# Start the Spark session

spark = (
    SparkSession.builder.appName("twitter_app_nlp")
    .master("local[*]")
    .config("spark.jars.packages", packages)
    .config("spark.streaming.stopGracefullyOnShutdown", "true")
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
    )
    .config("spark.hadoop.fs.s3a.access.key", ACCESS_KEY)
    .config("spark.hadoop.fs.s3a.secret.key", SECRET_ACCESS_KEY)
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.sql.shuffle.partitions", 3)
    .config("spark.driver.memory", "8G")
    .config("spark.driver.maxResultSize", "0")
    .config("spark.kryoserializer.buffer.max", "2000M")
    .config("spark.mongodb.input.uri", mongoDB)
    .config("spark.mongodb.output.uri", mongoDB)
    .getOrCreate()
)
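
While debugging, one quick way to check whether the Spark NLP JAR actually ended up on the classpath is to list the JARs the SparkContext registered. This is a diagnostic sketch only; it reaches through py4j into Spark's internal Scala API:

# Diagnostic sketch: list the JARs Spark registered after resolving
# spark.jars.packages; the spark-nlp artifact should appear here.
# _jsc is py4j internals, so treat this as debugging-only code.
print(spark.sparkContext._jsc.sc().listJars())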

I've seen that installing the fat JAR inside the environment normally solves this. I did that and added it to the Spark session like this:

.config('spark.jars.packages', '/Users/mac/.local/share/virtualenvs/tests-uPYwcfrj/lib/spark-nlp-assembly-3.4.3.jar')

But it still doesn't work. I've also tried passing that path to .config('spark.driver.extraClassPath'), but no luck either.
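
From the Spark configuration docs, spark.jars.packages expects Maven coordinates (group:artifact:version), while a local JAR path belongs in spark.jars, so presumably the fat-JAR config would have to look more like this sketch (same path as above):

# Local JAR paths go in spark.jars; spark.jars.packages is only
# for Maven coordinates of the form group:artifact:version.
.config(
    "spark.jars",
    "/Users/mac/.local/share/virtualenvs/tests-uPYwcfrj/lib/spark-nlp-assembly-3.4.3.jar",
)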

I also tried using --packages in the command:

spark-submit main.py --files config.json \
     --packages com.johnsnowlabs.nlp:spark-nlp-spark_2.12:3.4.3

But that didn't work either.


1 Answer

user2314737

In your spark-submit command the --packages option comes after main.py, but spark-submit options must come before the application file; anything after it is passed to your application as arguments. I'm also not sure com.johnsnowlabs.nlp:spark-nlp-spark_2.12:3.4.3 is the correct artifact coordinate.

Maybe try

export NLPJAR='/Users/mac/.local/share/virtualenvs/tests-uPYwcfrj/lib/spark-nlp-assembly-3.4.3.jar'

spark-submit --jars $NLPJAR main.py

using an absolute path or a file:/ URI, since directory expansion does not work with --jars (see Submitting Applications - Advanced Dependency Management).
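
If you'd rather stay with Maven coordinates, the same ordering fix applies: every option must come before the application file. A sketch reusing the options from your question (com.johnsnowlabs.nlp:spark-nlp-spark32_2.12 is the artifact for Spark 3.2, going by the Spark NLP docs):

spark-submit \
    --files config.json \
    --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.3 \
    main.py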

If this works, it might be that Spark was not able to find the JARs in your pipenv environment, so it's necessary to configure the paths in the Spark configuration or use the --driver-class-path option.
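
For example (a sketch; $NLPJAR is the fat JAR path exported above):

spark-submit \
    --jars $NLPJAR \
    --driver-class-path $NLPJAR \
    main.py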

Note: setting spark.driver.extraClassPath programmatically in the SparkSession configuration has no effect in client mode, because the driver JVM has already started by the time that configuration is read, and the default deploy mode is client (see Submitting Applications).
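
In client mode the classpath has to be supplied before the driver JVM starts, for instance in conf/spark-defaults.conf (a sketch, reusing the same JAR path):

# conf/spark-defaults.conf
spark.driver.extraClassPath  /Users/mac/.local/share/virtualenvs/tests-uPYwcfrj/lib/spark-nlp-assembly-3.4.3.jar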