KEDRO - How to specify an arbitrary binary file in catalog.yml?


I'm currently working on a data science project using LLMs (large language models). Model weights usually come in different formats, most frequently .bin or .gguf, and I'd like to keep it that way.

However, the only way I know to store binary files is to use type: pickle.PickleDataSet, like so:

test_model: # simple example without compression
  type: pickle.PickleDataSet
  filepath: data/07_model_output/test_model.pkl
  backend: pickle

I'm not okay with that, as I want my model files to be language-agnostic.

What would be the correct way to specify an arbitrary binary file in catalog.yml?

(Additional question: what if I want to fetch the file from a certain URL, or by running some kind of script that downloads it? Should I create a separate pipeline for that?)

There is 1 answer

mediumnok

You can implement your own custom dataset for a specific format. I am not familiar with LLM weight formats, but I don't think there is a universal format for binary files.
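
As a rough illustration of the custom-dataset idea, here is a minimal sketch that treats the file as opaque bytes. The class name BinaryFileDataset is hypothetical and not part of Kedro. The sketch assumes Kedro 0.19 or later, where the base class is called AbstractDataset (older releases spell it AbstractDataSet). The small stand-in base class exists only so the snippet can be tried without Kedro installed.

```python
from pathlib import Path
from typing import Any, Dict

try:
    # Kedro >= 0.19 naming; older releases spell it AbstractDataSet.
    from kedro.io import AbstractDataset
except ImportError:
    class AbstractDataset:  # stand-in so the sketch runs without Kedro installed
        def load(self) -> Any:
            return self._load()

        def save(self, data: Any) -> None:
            self._save(data)


class BinaryFileDataset(AbstractDataset):
    """Hypothetical dataset that reads and writes a file as raw bytes."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> bytes:
        # Return the file content unchanged, whatever the format (.bin, .gguf, ...).
        return self._filepath.read_bytes()

    def _save(self, data: bytes) -> None:
        # Create missing parent folders, then write the bytes as-is.
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        self._filepath.write_bytes(data)

    def _exists(self) -> bool:
        return self._filepath.exists()

    def _describe(self) -> Dict[str, Any]:
        # Shown by Kedro in logs and error messages.
        return {"filepath": str(self._filepath)}
```

In catalog.yml, the entry would then point at wherever the class lives in your project, for example type: my_project.datasets.BinaryFileDataset (a made-up module path) together with a filepath key. For multi-gigabyte weights it may be preferable to have _load return the path instead of the bytes, so that the downstream loader opens the file itself.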

For your second question, you could use APIDataset to fetch the file from an endpoint. kedro-datasets also ships a Hugging Face dataset (huggingface.HFDataset) that you could take as inspiration.
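
If neither of those fits, one possible alternative is a small download helper called from a node or from a custom dataset's load method. The sketch below uses only the Python standard library. The function name fetch_if_missing is an assumption made for illustration and is not a Kedro API.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_if_missing(url: str, target: Path) -> Path:
    """Download `url` to `target` unless `target` already exists."""
    if target.exists():
        return target  # reuse the cached copy, no network access
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file at the final path.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)  # streams in chunks
    partial.replace(target)
    return target
```

Because the helper returns early when the file is already present, repeated pipeline runs would not download the weights again.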