I am trying to perform pipeline tracking with DVC. The problem is that if I change, for example, the size parameter in params.yaml, DVC does not rerun the stage but simply restores a cached run, although I have never run that stage with this specific size value. dvc status even tells me that the parameter has been modified, but the stage is still not rerun.
Reproducing the issue:
My pipeline tracking is set up with the following dvc.yaml file:
```yaml
stages:
  data_ingestion:
    cmd: python 01_data_ingestion.py
    deps:
      - 01_data_ingestion.py
    params:
      - size
    outs:
      - artifacts/data_ingestion/
```
My params.yaml file simply looks like this:

```yaml
size: 30
```
The file 01_data_ingestion.py creates a CSV file with as many rows as the size key in params.yaml specifies. Here is the code:
```python
import os

import pandas as pd
import yaml


def main():
    # Read the size parameter from params.yaml
    with open("params.yaml") as yaml_file:
        config = yaml.safe_load(yaml_file)

    # Build a table of numbers and their squares, with `size` rows
    numbers = []
    squares = []
    for i in range(config['size']):
        numbers.append(i)
        squares.append(i**2)

    df = pd.DataFrame({"number": numbers, "square": squares})

    # Write the output into the directory tracked by the stage
    os.makedirs('artifacts/data_ingestion', exist_ok=True)
    df.to_csv('artifacts/data_ingestion/test_data.csv', index=False)


if __name__ == "__main__":
    main()
```
If I now run dvc repro from my terminal, it runs the stage and creates the output CSV file with 30 rows.
The problem: if I change the size parameter to 40, dvc status tells me that the parameter has changed, but dvc repro does not rerun the stage and uses a cached run instead, although I have never run this stage with that parameter value. And indeed: if I check the CSV file, it has only 30 rows, not 40. Could someone please explain what I am misunderstanding about dvc repro, or where the problem might lie?
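For completeness, here is roughly the sequence of commands I run (the wc -l check at the end is just one way to verify the row count; it counts the header line too):

```bash
dvc repro      # first run: stage executes, CSV with 30 rows is created
# edit params.yaml: change "size: 30" to "size: 40"
dvc status     # reports size in params.yaml as modified
dvc repro      # expected: stage reruns; observed: a cached run is used
wc -l artifacts/data_ingestion/test_data.csv  # still 31 lines (header + 30 rows), not 41
```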
I am using dvc==3.42.0 and Python 3.8.