I am trying to do some calculations using petastorm v0.11.4 in a Docker container on minikube v1.25.2.
As long as I run the process locally, everything works as expected. As soon as I try to distribute the work across the minikube cluster, kubelet reports the following error:
Error: failed to start container "spark-kubernetes-executor": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "executor": executable file not found in $PATH: unknown
The executor pods then terminate and new ones are created.
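For reference, this is roughly what the working local run looks like (a minimal sketch; the exact local configuration is an assumption, only the master URL and the missing Kubernetes settings differ from the cluster version below):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Local run for comparison: same workload, local master instead of k8s://...
local_conf = SparkConf().setMaster("local[2]").setAppName("PetastormDsCreator")
spark = SparkSession.builder.config(conf=local_conf).getOrCreate()
sc = spark.sparkContext

t = sc.parallelize(range(10))
print('Approximate sum: %s' % t.sumApprox(3))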
The code for the cluster run looks as follows:
from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_conf = SparkConf()
spark_conf.setMaster("k8s://https://kubernetes.default:443")
spark_conf.setAppName("PetastormDsCreator")
spark_conf.set(
    "spark.driver.memory",
    "2g"
)
# The Kubernetes settings are documented at
# https://spark.apache.org/docs/latest/running-on-kubernetes.html
spark_conf.set(
    "spark.kubernetes.namespace",
    "spark"
)
spark_conf.set(
    "spark.kubernetes.authenticate.driver.serviceAccountName",
    "spark-driver"
)
spark_conf.set(
    "spark.kubernetes.authenticate.caCertFile",
    "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
)
spark_conf.set(
    "spark.kubernetes.authenticate.oauthTokenFile",
    "/var/run/secrets/kubernetes.io/serviceaccount/token"
)
spark_conf.set(
    "spark.executor.instances",
    "2"
)
spark_conf.set(
    "spark.driver.host",
    "petastorm-ds-creator"  # must match the driver pod's name
)
spark_conf.set(
    "spark.driver.port",
    "20022"
)
spark_conf.set(
    "spark.kubernetes.container.image",
    "localhost:5000/petastorm:v0.0.1"
)
# Volume settings follow the pattern
# spark.kubernetes.{driver|executor}.volumes.[VolumeType].[VolumeName].{mount.path|options.path}
spark_conf.set(
    "spark.kubernetes.driver.volumes.hostPath.data.mount.path",
    "/data"
)
spark_conf.set(
    "spark.kubernetes.executor.volumes.hostPath.data.mount.path",
    "/data"
)
spark_conf.set(
    "spark.kubernetes.driver.volumes.hostPath.data.options.path",
    "/data"
)
spark_conf.set(
    "spark.kubernetes.executor.volumes.hostPath.data.options.path",
    "/data"
)

spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
sc = spark.sparkContext

t = sc.parallelize(range(10))
r = t.sumApprox(3)
print('Approximate sum: %s' % r)
Has anyone faced a similar issue? Unfortunately, I could not find many tutorials explaining how to configure or use petastorm on Kubernetes.