I'm working on a Kubeflow pipeline, and one step is failing with the following error: "This step is in Error state with this message: Error (exit code 1): open /var/run/argo/outputs/artifacts/app/results_FC_shift_cycle/model/saved_model.pb.tgz: no such file or directory."
This step is supposed to produce a saved model, but the pipeline fails at this point.
I'm using the following code for this step:
import kfp
import kfp.dsl as dsl
from utils.utils import upload_pipe, compile
import argparse
parser = argparse.ArgumentParser(description='getting GitLab CI pipeline arguments')
parser.add_argument('--build_id', help='build id of the pipeline')
parser.add_argument('--commit_id', help='commit id of the pipeline')
parser.add_argument('--build_url', help='build url of the pipeline')
args = parser.parse_args()
build_id = args.build_id
commit_id = args.commit_id
build_url = args.build_url
# Define the pipeline function and its inputs and outputs
def shift_cycle_pipeline():
    # Create a Docker container with the necessary dependencies and the shift_cycle_kfp.py script
    return dsl.ContainerOp(
        name='shift-cycle',
        image=f'europe-west9-docker.pkg.dev/project_id/cycle:{build_id}',
        command=['python', '/app/main.py'],
        file_outputs={
            'output_mlmodel': '/app/results_FC_shift_cycle/model/saved_model.pb',
        }
    )
@dsl.pipeline(
    name="Shift Cycle Training Pipeline",
    description="A pipeline that uses a custom Docker image for Model Training."
)
def shift_cycle():
    train_task = shift_cycle_pipeline()
    train_task.container.set_image_pull_policy("Always")
pipeline_file = "shift_cycle.yaml"
pipeline_name = 'shift_cycle' # Specify a name for your pipeline
experiment_name = 'shift_cycle'
namespace = "kubeflow-user-example-com"
# Compile the pipeline
compile(shift_cycle, pipeline_file)
# Run pipeline
pipe_desc = f'This is the GitLab CI training pipeline info:\nCommit ID is: {commit_id}\nBuild ID is: {build_id}\nPipeline Build URL is: {build_url}'
upload_pipe(pipeline_file, pipeline_name, experiment_name, namespace, pipeline_version_name=build_id, pipe_desc=pipe_desc, run=True)
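Since Argo reports the declared output file as missing, one thing I plan to try is adding a small check like this at the end of main.py (my own debugging sketch; MODEL_PATH and check_output_artifact are names I made up, and the path is the same one declared in file_outputs), so the pod logs show what actually ends up on disk before the step exits:

import os

MODEL_PATH = '/app/results_FC_shift_cycle/model/saved_model.pb'  # same path as declared in file_outputs

def check_output_artifact():
    # Log what actually ended up on disk, so the pod logs show it
    # even if Argo fails to collect the artifact afterwards.
    model_dir = os.path.dirname(MODEL_PATH)
    print('working directory:', os.getcwd())
    print('model dir exists:', os.path.isdir(model_dir))
    if os.path.isdir(model_dir):
        print('model dir contents:', os.listdir(model_dir))
    print('saved_model.pb exists:', os.path.isfile(MODEL_PATH))

check_output_artifact()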
I've run the Docker image manually and it works without any problems.
The main script includes this step:
import os
from google.cloud import storage

def upload_model():
    print('UPLOAD MODEL TO GCP......')
    gcp_json = '/app/service_account_key.json'
    bucket_name = 'bucketname'
    filename = 'results_FC_shift_cycle/model/saved_model.pb'
    # check if the model exists at 'results_FC_shift_cycle/model/saved_model.pb'
    # if os.path.exists('results_FC_shift_cycle/model/saved_model.pb'):
    print('Model exists')
    client = storage.Client.from_service_account_json(gcp_json)
    bucket = client.get_bucket(bucket_name)
    blob = bucket.blob(filename)
    blob.upload_from_filename('/app/results_FC_shift_cycle/model/saved_model.pb')
    print('Model uploaded to GCP')
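For what it's worth, this is roughly how I would tighten that check so the step fails loudly before touching GCS if the file is missing (a sketch only; upload_model_checked is not in my code yet, and the bucket and paths are the same as above):

import os
from google.cloud import storage

MODEL_PATH = '/app/results_FC_shift_cycle/model/saved_model.pb'

def upload_model_checked():
    # Raise a clear error (including the directory contents) if the model
    # file is missing, instead of failing later inside the GCS client.
    if not os.path.isfile(MODEL_PATH):
        parent = os.path.dirname(MODEL_PATH)
        contents = os.listdir(parent) if os.path.isdir(parent) else '<directory missing>'
        raise FileNotFoundError(f'{MODEL_PATH} not found; {parent} contains: {contents}')
    client = storage.Client.from_service_account_json('/app/service_account_key.json')
    bucket = client.get_bucket('bucketname')
    bucket.blob('results_FC_shift_cycle/model/saved_model.pb').upload_from_filename(MODEL_PATH)
    print('Model uploaded to GCS')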
This is the Dockerfile I use to build the image:
FROM python:3.9-slim-buster
# Set the working directory to /app
WORKDIR /app
# Copy the current directory contents into the container at /app
COPY . /app
# Create a virtual environment and activate it
RUN python -m venv venv
ENV PATH="/app/venv/bin:$PATH"
# Install required packages
RUN pip install --no-cache-dir --trusted-host pypi.python.org -r requirements.txt
# Run the script
CMD ["python", "main.py"]
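Because WORKDIR is /app, I'm assuming the relative path used in upload_model() and the absolute path declared in file_outputs point to the same file; a throwaway check along these lines (my own snippet, not part of the repo) is what I'd run inside the container to confirm that:

import os

rel_path = 'results_FC_shift_cycle/model/saved_model.pb'       # path used in upload_model()
abs_path = '/app/results_FC_shift_cycle/model/saved_model.pb'  # path declared in file_outputs

print('cwd:', os.getcwd())
print('relative path resolves to:', os.path.abspath(rel_path))
print('matches file_outputs path:', os.path.abspath(rel_path) == abs_path)
print('file exists:', os.path.exists(abs_path))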
After the training step, the script should call a function that uploads the model to Google Cloud Storage (GCS), but this is not happening.
I expect the pipeline to produce the saved_model.pb file at the specified path (which Argo then archives as saved_model.pb.tgz) and to upload it to GCS.