SparkException on collect() in Spark Connect with PySpark

I'm developing an API that makes requests to a Spark cluster (Spark 3.5), but I'm running into a SparkException when trying to collect results from a DataFrame. I'm relatively new to Spark, and I'm using Spark Connect with PySpark. Here's the relevant part of my code:

from pyspark.sql import SparkSession

def make_request(self) -> list:
    # Connect to the Spark Connect server exposed by the cluster
    spark: SparkSession = SparkSession.builder.remote("sc://localhost:15002").appName("SimpleApp").getOrCreate()
    request = self.generate_query()  # builds a SELECT id FROM <iceberg catalog table> WHERE <some filters>
    result = spark.sql(request)  # type is <class 'pyspark.sql.connect.dataframe.DataFrame'>
    result.show()  # works fine and prints the expected result (one row of one string)
    return result.collect()  # this line crashes with the exception below

The show() method works correctly and displays the expected row. However, the program crashes at result.collect(), and I receive the following error:

(org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in stage 43.0 failed 4 times, most recent failure: Lost task 0.3 in stage 43.0 (TID 85) (10.233.117.232 executor 0): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD

From what I understand, this might be related to an RDD operation that isn't supported by Spark Connect (as described in the Spark Connect Overview). However, the DataFrame API's collect() method should be supported by Spark Connect.
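
To rule out the query itself, here is a minimal sanity check I would try (just a sketch: it assumes the same sc://localhost:15002 endpoint, and the range() DataFrame is only an illustration that never touches the Iceberg catalog):

from pyspark.sql import SparkSession

# Sketch: collect a DataFrame generated entirely on the cluster, no Iceberg involved.
# If this also raises the ClassCastException, the problem is in the Spark Connect /
# executor setup rather than in the query.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
rows = spark.range(5).collect()  # expected: [Row(id=0), ..., Row(id=4)]
print(rows)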

I've also tried other methods such as take() and count(); they fail with a similar error.
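
Roughly the calls that failed (a sketch, using the same result DataFrame as in the snippet above):

result.take(1)   # fails with a similar ClassCastException
result.count()   # fails as well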

I'm unsure what's wrong here, as I've only just started using Spark. Could this be an issue with how I'm using Spark Connect, with the collect() method specifically, or with a misconfigured Spark Connect setup? Any insights or suggestions would be greatly appreciated.

1 Answer

gtnchtb

I am facing exactly the same problem with Spark Connect 3.5.0. Processing itself works fine, but as soon as I use collect(), count(), etc., I get this RDD exception.

It looks like the issue is already known to Apache and tracked under this ticket: SPARK-46032

Have you resolved your problem?