The Kubernetes "communicating between containers" tutorial defines the following Pod YAML:
apiVersion: v1
kind: Pod
metadata:
  name: two-containers
spec:
  restartPolicy: Never
  volumes:                  # <--- This is what I need
  - name: shared-data
    emptyDir: {}
  containers:
  - name: nginx-container
    image: nginx
    volumeMounts:
    - name: shared-data
      mountPath: /usr/share/nginx/html
  - name: debian-container
    image: debian
    volumeMounts:
    - name: shared-data
      mountPath: /pod-data
    command: ["/bin/sh"]
    args: ["-c", "echo Hello from the debian container > /pod-data/index.html"]
Note that the volumes key is defined under spec, and thus the volume is available to all containers defined in the pod.
I want to achieve the same behavior using kfp, the Python SDK for Kubeflow Pipelines.
However, I can only add volumes to individual containers, not to the whole workflow spec: kfp.dsl.ContainerOp.container.add_volume_mount can mount a previously created volume (kfp.dsl.PipelineVolume), but that volume only seems to be defined within a single container.
Here is what I have tried, but the volume is always defined in the first container, not at the "global" pod level. How do I make sure that op2 also has access to the volume?
I would have expected this to live in kfp.dsl.PipelineConf, but volumes cannot be added to it.
Is it just not implemented?
import kubernetes as k8s
from kfp import compiler, dsl
from kubernetes.client import V1VolumeMount
import pprint


@dsl.pipeline(name="debug", description="Debug only pipeline")
def pipeline_func():
    op = dsl.ContainerOp(
        name='echo',
        image='library/bash:4.4.23',
        command=['sh', '-c'],
        arguments=['echo "[1,2,3]"> /tmp/output1.txt'],
        file_outputs={'output': '/tmp/output1.txt'})

    op2 = dsl.ContainerOp(
        name='echo2',
        image='library/bash:4.4.23',
        command=['sh', '-c'],
        arguments=['echo "[4,5,6]">> /tmp/output1.txt'],
        file_outputs={'output': '/tmp/output1.txt'})

    mount_folder = "/tmp"

    # The volume is attached to op via add_pvolumes ...
    volume = dsl.PipelineVolume(volume=k8s.client.V1Volume(
        name="test-storage",
        empty_dir=k8s.client.V1EmptyDirVolumeSource()))
    op.add_pvolumes({mount_folder: volume})

    # ... but op2 only gets a volume mount; its pod never defines the volume itself.
    op2.container.add_volume_mount(volume_mount=V1VolumeMount(mount_path=mount_folder,
                                                              name=volume.name))
    op2.after(op)


workflow = compiler.Compiler().create_workflow(pipeline_func=pipeline_func)
pprint.pprint(workflow["spec"])
You might want to check the difference between Kubernetes pods and containers. The Kubernetes example you've posted shows a single two-container pod. You can recreate the same example in KFP by adding a sidecar container to an instantiated ContainerOp, along the lines of the sketch below. What your second example is doing is creating two single-container pods that do not see each other's filesystems by design.
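A minimal sketch of the sidecar approach, assuming the kfp v1 SDK (names, images, and paths here are illustrative, not from your pipeline): one ContainerOp whose pod defines an emptyDir volume shared between the main container and a sidecar, mirroring the two-container Kubernetes example above.

import kubernetes as k8s
from kfp import dsl


@dsl.pipeline(name="two-containers", description="Main container plus sidecar sharing an emptyDir")
def pipeline_func():
    # Pod-level emptyDir volume, visible to every container in the pod.
    shared_volume = k8s.client.V1Volume(
        name="shared-data",
        empty_dir=k8s.client.V1EmptyDirVolumeSource())

    # Main container: waits briefly, then reads the file written by the sidecar.
    op = dsl.ContainerOp(
        name='reader',
        image='library/bash:4.4.23',
        command=['sh', '-c'],
        arguments=['sleep 5; cat /shared/index.html'])
    op.add_volume(shared_volume)  # define the volume on the pod spec
    op.container.add_volume_mount(k8s.client.V1VolumeMount(
        name="shared-data", mount_path="/shared"))

    # Sidecar: runs in the same pod and writes into the same volume.
    sidecar = dsl.Sidecar(
        name='writer',
        image='debian',
        command=['/bin/sh'],
        args=['-c', 'echo Hello from the debian container > /data/index.html'])
    sidecar.add_volume_mount(k8s.client.V1VolumeMount(
        name="shared-data", mount_path="/data"))
    op.add_sidecar(sidecar)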
To exchange data between pods you'd need a real volume, not an emptyDir, which is only shared between the containers of a single pod.
Please do not use dsl.PipelineVolume or op.add_pvolumes unless you know what they are and why you want them. Just use the normal op.add_volume and op.container.add_volume_mount, as sketched below. Nevertheless, is there a particular reason you need to use volumes? Volumes make pipelines and components non-portable. No 1st-party components use volumes.
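A minimal sketch of that normal approach, assuming the kfp v1 SDK and a pre-existing PersistentVolumeClaim named "shared-pvc" (the claim name and the add_shared_volume helper are hypothetical, introduced only for illustration):

import kubernetes as k8s
from kfp import dsl


def add_shared_volume(op: dsl.ContainerOp, mount_path: str) -> dsl.ContainerOp:
    """Define the PVC-backed volume on the op's pod and mount it in its container."""
    op.add_volume(k8s.client.V1Volume(
        name="shared-data",
        persistent_volume_claim=k8s.client.V1PersistentVolumeClaimVolumeSource(
            claim_name="shared-pvc")))
    op.container.add_volume_mount(k8s.client.V1VolumeMount(
        name="shared-data", mount_path=mount_path))
    return op


@dsl.pipeline(name="shared-volume", description="Two steps sharing a PVC-backed volume")
def pipeline_func():
    op = add_shared_volume(dsl.ContainerOp(
        name='echo',
        image='library/bash:4.4.23',
        command=['sh', '-c'],
        arguments=['echo "[1,2,3]" > /data/output1.txt']), mount_path='/data')
    op2 = add_shared_volume(dsl.ContainerOp(
        name='echo2',
        image='library/bash:4.4.23',
        command=['sh', '-c'],
        arguments=['echo "[4,5,6]" >> /data/output1.txt']), mount_path='/data')
    op2.after(op)

Because both pods mount the same persistent claim, the second step sees the file written by the first, which an emptyDir cannot provide across pods.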
The KFP team encourages users to use the normal data passing methods instead: non-python, python. For example:
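A minimal sketch of volume-free data passing with lightweight Python components, assuming the kfp v1 SDK (the produce/consume component names are illustrative):

from kfp import dsl
from kfp.components import func_to_container_op


@func_to_container_op
def produce() -> str:
    # The returned value becomes the component's output artifact.
    return "[1,2,3]"


@func_to_container_op
def consume(numbers: str):
    print("received:", numbers)


@dsl.pipeline(name="data-passing", description="Pass data between steps without volumes")
def pipeline_func():
    produce_task = produce()
    # KFP wires the upstream output into the downstream step; no shared volume needed.
    consume(produce_task.output)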