DVC using cached run although parameter changed

I am trying to perform pipeline tracking with DVC. The problem is that if I change, for example, the size parameter in params.yaml, DVC does not rerun the stage but simply restores a cached run, although I have never run that stage with this specific size value. dvc status also tells me that the parameter has been modified, but the stage is still not rerun.

Reproducing the issue:

My pipeline tracking is set up using the following dvc.yaml file:

stages:
  data_ingestion:
    cmd: python 01_data_ingestion.py
    deps:
      - 01_data_ingestion.py
    params:
      - size
    outs:
      - artifacts/data_ingestion/

My params.yaml file simply looks like this:

size: 30

The 01_data_ingestion.py file creates a CSV file with as many rows as defined by the size key in params.yaml. Here is my code for that:

import os
import pandas as pd
import yaml


def main():
    # Read the pipeline parameters tracked by DVC
    with open("params.yaml") as yaml_file:
        config = yaml.safe_load(yaml_file)

    # Build a toy dataset with `size` rows: each number and its square
    numbers = []
    squares = []
    for i in range(config['size']):
        numbers.append(i)
        squares.append(i**2)
    df = pd.DataFrame({"number": numbers, "square": squares})

    # Write the result into the stage's output directory
    os.makedirs('artifacts/data_ingestion', exist_ok=True)
    df.to_csv('artifacts/data_ingestion/test_data.csv', index=False)


if __name__ == "__main__":
    main()

If I now run dvc repro from my terminal, it runs the stage and creates the output CSV file with 30 rows.

The problem is that if I change the size parameter to 40, dvc status tells me that the parameter has changed, but dvc repro does not rerun the stage and instead uses a cached run, although I have never run this stage with that parameter value. And in fact, if I check the CSV file, it still has 30 rows and not 40. Could someone please explain what I am misunderstanding about dvc repro, or where the problem could lie?
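For completeness, this is roughly the sequence of commands I run to reproduce the behaviour (the sed call is just shorthand for editing params.yaml by hand):

dvc repro                                      # first run: creates the CSV with 30 rows
sed -i 's/size: 30/size: 40/' params.yaml      # change the parameter
dvc status                                     # reports that params.yaml (size) has changed
dvc repro                                      # expected a rerun, but a cached run is used instead
wc -l artifacts/data_ingestion/test_data.csv   # still 31 lines (header + 30 rows), not 41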

I am using dvc==3.42.0 and Python 3.8.
