Running the Spark Connect server on Kubernetes in cluster mode / high-availability mode


I am trying to figure out how to effectively use the new Spark Connect feature of Spark version >= 3.4.0. Specifically, I want to set up a Kubernetes Spark cluster to which various applications (mainly PySpark) will connect and submit their workloads. It is my understanding (and please correct me if I'm wrong) that by running the command

./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0

a shared Spark context is created, and it is not possible to submit further configuration (e.g. driver/executor cores and memory, packages, etc.) after it has been created.

The command creates a Spark driver instance inside the pod running the Spark Connect server (i.e. in client mode). I was also able to set Kubernetes as the master, and thus have Spark executors created dynamically upon task submission from my client applications.
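
For concreteness, a start-up command along these lines is what I have in mind; the API server address, container image, namespace and resource values below are placeholders rather than a configuration I am claiming works as-is:

./sbin/start-connect-server.sh \
  --master k8s://https://<kubernetes-api-server>:6443 \
  --packages org.apache.spark:spark-connect_2.12:3.4.0 \
  --conf spark.kubernetes.container.image=<my-spark-image> \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.executor.instances=2 \
  --conf spark.executor.memory=2g \
  --conf spark.driver.memory=2g

Depending on the cluster, additional settings (service account, spark.driver.host so that executors can reach the driver pod, etc.) are presumably needed as well.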

What I want to know is whether it is possible to configure the Spark cluster in "cluster mode" instead, so that the driver is instantiated in a separate pod from the Spark Connect server.

Also, is it possible to run the Spark Connect server in high-availability mode?

Finally, are there any configurations that can be passed from the SparkSession builder, something like:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .remote("sc://spark-connect.spark.svc.cluster.local:15002")
         .config("spark.xxx.yyy", "some-value")
         .getOrCreate())

Thanks to anyone who can answer!


1 Answer

Answered by scienceseba (accepted answer)

I ended up writing my own Helm chart for deploying a Spark Connect cluster on Kubernetes (EKS on AWS): https://github.com/sebastiandaberdaku/charts/tree/gh-pages/spark-connect.

To answer my questions in order:

It is not possible to run Spark Connect in "cluster mode": only "client mode" is supported as of v3.5.0.

The Spark Connect server cannot be set up in HA mode out of the box. In my Helm chart I am using a StatefulSet where the user can set the desired number of replicas, and the Service routes each client's requests to the same server pod based on the client's IP address (session affinity, see https://kubernetes.io/docs/reference/networking/virtual-ips/#session-affinity). So it is basically like having multiple independent clusters.

Finally, some level of dependency management seems to be available at the session level: https://www.databricks.com/blog/python-dependency-management-spark-connect.
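
As a rough sketch of what that looks like from the client side (the archive name, the #environment alias and the interpreter path are placeholders following the approach described in that blog post, not something specific to my chart):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .remote("sc://spark-connect.spark.svc.cluster.local:15002")
         .getOrCreate())

# Ship a packed Python environment (e.g. built with conda-pack or venv-pack)
# to the server; the artifact is only visible to this Spark Connect session.
spark.addArtifacts("pyspark_deps.tar.gz#environment", archive=True)

# Point the Python workers of this session at the shipped interpreter.
spark.conf.set("spark.sql.execution.pyspark.python", "environment/bin/python")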