KEDRO - How to specify an arbitrary binary file in catalog.yml?

Question

Asked by Quakumei on 28 October 2023 at 11:32

I'm currently working on a data science project using LLMs (large language models). Model weights usually come in different formats, most frequently .bin or .gguf, and I'd like to keep them that way.

However, the only way I know to store binary files is to use type: pickle.PickleDataSet, like so:

test_model: # simple example without compression
  type: pickle.PickleDataSet
  filepath: data/07_model_output/test_model.pkl
  backend: pickle

I'm not okay with that, as I want my model files to be language-agnostic.

What would be the correct way to specify an arbitrary binary file in catalog.yml?

(Additional question: what if I want to fetch it from a certain URL, or by running some kind of script that fetches it from the net? Should I create a separate pipeline for that?)

There is 1 answer

**mediumnok** · Answer 1 · 2023-10-29T14:07:30+00:00

You can implement your own custom dataset for the specific format. I am not familiar with LLM weight formats, but I don't think there is a universal format for binary files.
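To illustrate the custom-dataset idea, here is a minimal sketch of a dataset that treats the file as opaque bytes. It assumes a Kedro release that exposes `AbstractDataset` in `kedro.io` and accepts the `_load`/`_save`/`_describe` hooks (older releases spell the base class `AbstractDataSet`, and newer ones may prefer `load`/`save`). The class name `BinaryFileDataset` is hypothetical, and the fallback base class is only a stand-in so the sketch runs without Kedro installed.

```python
from pathlib import Path
from typing import Any, Dict

try:
    # Assumption: this Kedro release exposes AbstractDataset in kedro.io
    # (older releases spell it AbstractDataSet).
    from kedro.io import AbstractDataset
except ImportError:
    # Stand-in used when the import fails, so the sketch still runs.
    # It only mimics the load/save/exists delegation and is NOT part of Kedro.
    class AbstractDataset:
        def load(self) -> Any:
            return self._load()

        def save(self, data: Any) -> None:
            self._save(data)

        def exists(self) -> bool:
            return self._exists()


class BinaryFileDataset(AbstractDataset):
    """Hypothetical dataset that loads and saves a file as raw bytes."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> bytes:
        # Return the file content untouched, whatever its format.
        return self._filepath.read_bytes()

    def _save(self, data: bytes) -> None:
        # Create parent folders on demand, then write the bytes as-is.
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        self._filepath.write_bytes(data)

    def _exists(self) -> bool:
        return self._filepath.exists()

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": str(self._filepath)}
```

Such a class could then be referenced from catalog.yml by its import path, for example a type like my_project.datasets.BinaryFileDataset (a hypothetical module path) together with a filepath key. For multi-gigabyte weights, reading everything into memory may be undesirable; a variant that simply returns the path and lets the node open the file is an equally plausible design.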

For your second question, you can use the APIDataset to fetch from an endpoint. There is also a HuggingfaceDataset that you could take as inspiration.
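If a plain script is preferred over a dataset, the fetch step could also be an ordinary function used as a node. The following is a sketch using only the Python standard library, not something the answer prescribes; the function name `download_file` and its parameters are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, destination: str) -> str:
    """Stream the resource at `url` to `destination` and return the path."""
    target = Path(destination)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so a large weights file is never held fully in memory.
    with urllib.request.urlopen(url) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)
    return str(target)
```

Whether this lives in its own pipeline or as the first node of an existing one is a design choice; keeping it separate makes it easy to skip the download when the file is already present.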
