What to set Spark Master address to when deploying on Kubernetes Spark Operator?


The official Spark documentation only covers the spark-submit method for deploying code to a Spark cluster. It says we must prefix the Kubernetes API server address with k8s://. What should we do when deploying through the Spark Operator?

For instance, if I have a basic PySpark application that starts up like this, how do I set the master:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sc = SparkContext("local", "Big data App")
spark = SQLContext(sc)
spark_conf = SparkConf().setMaster('local').setAppName('app_name')

Here I have local; if I were running on a non-Kubernetes cluster I would set the master address with a spark:// prefix, or use yarn. Must I also use the k8s:// prefix when deploying through the Spark Operator? If not, what should be used for the master parameter?


1 Answer

Answered by Alex Ott

It's better not to use setMaster in the code; instead, specify the master when running the code via spark-submit, something like this (see the documentation for details):

./bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    your_script.py

I haven't used the Spark Operator, but as I understand from its documentation, it should set the master automatically.
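In practice that means your application code can leave the master unset and simply pick up whatever the submission environment injects. A minimal sketch (the app name is a placeholder, and the print is only there for illustration):

from pyspark.sql import SparkSession

# No setMaster() here: spark-submit / the Spark Operator supplies spark.master
spark = SparkSession.builder.appName("app_name").getOrCreate()

# On Kubernetes this should print something like k8s://https://<apiserver-host>:<port>
print(spark.sparkContext.master)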

You also need to convert this code:

sc = SparkContext("local", "Big data App")
spark = SQLContext(sc)
spark_conf = SparkConf().setMaster('local').setAppName('app_name')

to the more modern API (see the documentation):

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

as SQLContext is deprecated.
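Once you have the session, what you previously did through SQLContext (and the SparkContext) is available on it directly; a small sketch with made-up sample data:

# The old SparkContext is still reachable if you need it
sc = spark.sparkContext

# DataFrame creation and SQL now go through the session itself
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.createOrReplaceTempView("t")
spark.sql("SELECT id, value FROM t WHERE id = 1").show()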

P.S. I recommend working through the first chapters of Learning Spark, 2nd edition, which is freely available from the Databricks site.