How do I define pipeline-level volumes in kubeflow pipelines to share across components?


The Kubernetes "Communicating between containers" tutorial defines the following Pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: two-containers
spec:

  restartPolicy: Never

  volumes:                      <--- This is what I need
  - name: shared-data
    emptyDir: {}

  containers:

  - name: nginx-container
    image: nginx
    volumeMounts:
    - name: shared-data
      mountPath: /usr/share/nginx/html

  - name: debian-container
    image: debian
    volumeMounts:
    - name: shared-data
      mountPath: /pod-data
    command: ["/bin/sh"]
    args: ["-c", "echo Hello from the debian container > /pod-data/index.html"]

Note that the volumes key is defined under spec, so the volume is available to all containers in the pod. I want to achieve the same behavior using kfp, the Python SDK for Kubeflow Pipelines.

However, with kfp I can only add volumes to individual containers, not to the whole workflow spec: kfp.dsl.ContainerOp.container.add_volume_mount can point at a previously created volume (kfp.dsl.PipelineVolume), but that volume seems to be defined only within the container it was added to.

Here is what I have tried, but the volume always ends up defined in the first container, not at the "global" level. How do I give op2 access to the volume? I would have expected this to live in kfp.dsl.PipelineConf, but volumes cannot be added to it. Is it just not implemented?

import kubernetes as k8s
from kfp import compiler, dsl
from kubernetes.client import V1VolumeMount
import pprint

@dsl.pipeline(name="debug", description="Debug only pipeline")
def pipeline_func():
    op = dsl.ContainerOp(
            name='echo',
            image='library/bash:4.4.23',
            command=['sh', '-c'],
            arguments=['echo "[1,2,3]"> /tmp/output1.txt'],
            file_outputs={'output': '/tmp/output1.txt'})
    op2 = dsl.ContainerOp(
            name='echo2',
            image='library/bash:4.4.23',
            command=['sh', '-c'],
            arguments=['echo "[4,5,6]">> /tmp/output1.txt'],
            file_outputs={'output': '/tmp/output1.txt'})

    mount_folder = "/tmp"
    volume = dsl.PipelineVolume(volume=k8s.client.V1Volume(
            name=f"test-storage",
            empty_dir=k8s.client.V1EmptyDirVolumeSource()))
    op.add_pvolumes({mount_folder: volume})
    op2.container.add_volume_mount(volume_mount=V1VolumeMount(mount_path=mount_folder,
                                                              name=volume.name))
    op2.after(op)


workflow = compiler.Compiler().create_workflow(pipeline_func=pipeline_func)
pprint.pprint(workflow["spec"])

1 Answer

Answered by Ark-kun (7 votes):

You might want to check the difference between Kubernetes pods and containers. The Kubernetes example you've posted shows a two-container pod. You can recreate the same example in KFP by adding a sidecar container to an instantiated ContainerOp (see the sketch below). Your second example, by contrast, creates two single-container pods that do not see each other by design.
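For illustration, here is a minimal sketch of the sidecar approach, assuming the KFP v1 SDK (dsl.Sidecar and ContainerOp.add_sidecar); the op name "debian-writer" and the pipeline metadata are illustrative, not from the original example:

import kubernetes.client as k8s
from kfp import dsl

@dsl.pipeline(name="two-containers", description="Main container and sidecar sharing an emptyDir")
def pipeline_func():
    # Pod-level emptyDir volume, visible to every container in this single pod
    shared_volume = k8s.V1Volume(
        name="shared-data",
        empty_dir=k8s.V1EmptyDirVolumeSource())

    main = dsl.ContainerOp(
        name="debian-writer",  # illustrative name
        image="debian",
        command=["sh", "-c"],
        arguments=["echo Hello from the debian container > /pod-data/index.html"])
    main.add_volume(shared_volume)
    main.container.add_volume_mount(
        k8s.V1VolumeMount(name="shared-data", mount_path="/pod-data"))

    # The sidecar runs in the same pod as `main`, so it sees the same emptyDir
    nginx = dsl.Sidecar(name="nginx", image="nginx")
    nginx.add_volume_mount(
        k8s.V1VolumeMount(name="shared-data", mount_path="/usr/share/nginx/html"))
    main.add_sidecar(nginx)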

To exchange data between pods you'd need a real volume, not emptyDir, which only works for containers in a single pod.

volume = dsl.PipelineVolume(volume=k8s.client.V1Volume(
        name=f"test-storage",
        empty_dir=k8s.client.V1EmptyDirVolumeSource()))
op.add_pvolumes({mount_folder: volume})

Please do not use dsl.PipelineVolume or op.add_pvolumes unless you know what they are and why you want them. Just use the normal op.add_volume and op.container.add_volume_mount.
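As a sketch of that approach, the original two-op pipeline could mount the same PVC-backed volume in both ops via add_volume and add_volume_mount. The claim name "shared-pvc" below is an assumed, pre-existing PersistentVolumeClaim; KFP does not create it here:

import kubernetes.client as k8s
from kfp import dsl

@dsl.pipeline(name="shared-volume", description="Two ops mounting the same PVC")
def pipeline_func():
    # A PVC-backed volume can be shared across pods; an emptyDir cannot.
    # "shared-pvc" is assumed to already exist in the cluster.
    volume = k8s.V1Volume(
        name="shared-data",
        persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
            claim_name="shared-pvc"))
    mount = k8s.V1VolumeMount(name="shared-data", mount_path="/tmp/shared")

    op = dsl.ContainerOp(
        name="echo",
        image="library/bash:4.4.23",
        command=["sh", "-c"],
        arguments=['echo "[1,2,3]" > /tmp/shared/output1.txt'])
    op.add_volume(volume)
    op.container.add_volume_mount(mount)

    op2 = dsl.ContainerOp(
        name="echo2",
        image="library/bash:4.4.23",
        command=["sh", "-c"],
        arguments=['echo "[4,5,6]" >> /tmp/shared/output1.txt'])
    op2.add_volume(volume)
    op2.container.add_volume_mount(mount)
    op2.after(op)

With op2.after(op) the two pods run sequentially, but a PVC mounted from different nodes may still need a suitable access mode (e.g. ReadWriteMany), which is part of the portability cost mentioned below.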

Nevertheless, is there a particular reason you need to use volumes? Volumes make pipelines and components non-portable. No 1st-party components use volumes.

The KFP team encourages users to use the normal data passing methods instead (see the documentation on data passing for non-python and python components).
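As an illustration of that recommendation, here is a minimal sketch of the same two-op example using KFP's data passing instead of a volume: the first op declares a file output, and its value is passed to the second op as an argument. The op and output names are illustrative:

from kfp import dsl

@dsl.pipeline(name="data-passing", description="Pass data between ops without a shared volume")
def pipeline_func():
    producer = dsl.ContainerOp(
        name="produce",
        image="library/bash:4.4.23",
        command=["sh", "-c"],
        arguments=['echo "[1,2,3]" > /tmp/output1.txt'],
        file_outputs={"numbers": "/tmp/output1.txt"})

    # KFP transfers the output value to the consumer and infers the
    # dependency; no volume shared between the two pods is needed.
    consumer = dsl.ContainerOp(
        name="consume",
        image="library/bash:4.4.23",
        command=["sh", "-c"],
        arguments=['echo "received: %s"' % producer.outputs["numbers"]])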