I am trying to figure out how to effectively use the new Spark Connect feature available in Spark >= 3.4.0. Specifically, I want to set up a Kubernetes Spark cluster to which various applications (mainly PySpark) will connect and submit their workloads. It is my understanding (and please correct me if I'm wrong) that by running the command
./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0
a shared SparkContext is created, and it is not possible to submit further configuration (e.g. driver/executor cores and memory, packages, etc.) after it has been created.
The command creates a Spark driver instance inside the pod running the Spark Connect server (i.e. in client mode). I was also able to set Kubernetes as the master, so that Spark executors are created dynamically upon task submission from my client applications.
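For context, this is roughly how the server can be started with Kubernetes as the master; the image, namespace, service account and resource values below are just placeholders for my setup, and all of them have to be fixed at server startup:
./sbin/start-connect-server.sh \
  --master k8s://https://kubernetes.default.svc:443 \
  --packages org.apache.spark:spark-connect_2.12:3.4.0 \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.container.image=apache/spark:3.4.0 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.executor.instances=2 \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=2g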
What I want to know is whether it is possible to configure the Spark cluster in "cluster mode" instead, so that the driver is instantiated in a separate pod from the Spark Connect server.
Also, is it possible to run the Spark Connect server in high-availability mode?
Finally, are there any configurations that can be passed from the SparkSession builder, something like:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
    .remote("sc://spark-connect.spark.svc.cluster.local:15002")
    .config("spark.xxx.yyy", "some-value")
    .getOrCreate())
Thanks to anyone who can answer!
I ended up writing my own helm chart for deploying a Spark Connect cluster on Kubernetes (EKS on AWS): https://github.com/sebastiandaberdaku/charts/tree/gh-pages/spark-connect.
To answer my questions in order:
It is not possible to run Spark Connect in "cluster mode"; only "client mode" is supported as of v3.5.0.
The Spark Connect server cannot be set up in HA mode out of the box. In my Helm chart I use a StatefulSet where the user can set the desired number of replicas, and the Service routes each user's requests to a server pod based on the client's IP address (https://kubernetes.io/docs/reference/networking/virtual-ips/#session-affinity). So it is basically like having multiple independent clusters.
Finally, some level of dependency management seems to be available at the session level: https://www.databricks.com/blog/python-dependency-management-spark-connect.
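For example, following the approach in that blog post, you can pack a local Conda environment with conda-pack and ship it to the server for a single session. The snippet below is a sketch based on the blog (the archive name is just an example, and the exact addArtifact/conf API may vary slightly between Spark versions):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .remote("sc://spark-connect.spark.svc.cluster.local:15002")
    .getOrCreate())

# Upload the conda-pack'ed environment (e.g. created with `conda pack -f -o pyspark_conda_env.tar.gz`)
# so that it is available to this session's Python workers only.
spark.addArtifact("pyspark_conda_env.tar.gz#environment", archive=True)

# Point the Python runner for UDFs at the interpreter inside the unpacked archive.
spark.conf.set("spark.sql.execution.pyspark.python", "environment/bin/python")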