I'm currently working on a data science project using LLMs (large language models). Model weights usually come in different formats, most frequently .bin or .gguf, and I'd like to keep them in those formats.
However, the only way I know to store binary files is to use type: pickle.PickleDataSet, like so:
test_model:  # simple example without compression
  type: pickle.PickleDataSet
  filepath: data/07_model_output/test_model.pkl
  backend: pickle
I'm not okay with that as I want my model files to be language agnostic.
What would be the correct way to specify an arbitrary binary file in catalog.yml?
(Additional question: what if I want to fetch it from a certain URL, or by running some kind of script that downloads it from the net? Should I create a separate pipeline for that?)
You can implement your own custom dataset for the specific format. I am not familiar with LLM weight formats, but I don’t think there is a universal format for binary files.
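As a rough illustration of the custom dataset idea, here is a minimal sketch that treats the file as raw bytes. The class name BinaryDataset is hypothetical, and the import assumes a Kedro version that exposes AbstractDataset from kedro.io (0.18.x releases spell it AbstractDataSet). The small stand-in class is not part of Kedro; it is only there so the snippet runs without Kedro installed.

```python
from pathlib import Path
from typing import Any, Dict

try:
    # Assumption: Kedro 0.19+ naming; 0.18.x releases call it AbstractDataSet.
    from kedro.io import AbstractDataset
except ImportError:
    class AbstractDataset:  # minimal stand-in, NOT the real Kedro base class
        def load(self):
            return self._load()

        def save(self, data):
            self._save(data)


class BinaryDataset(AbstractDataset):
    """Hypothetical dataset that reads and writes a file as raw bytes."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> bytes:
        # Return the file content untouched, with no pickling involved.
        return self._filepath.read_bytes()

    def _save(self, data: bytes) -> None:
        # Create the parent folder if needed, then write the bytes as-is.
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        self._filepath.write_bytes(data)

    def _describe(self) -> Dict[str, Any]:
        # Used by Kedro when printing or logging the dataset.
        return {"filepath": str(self._filepath)}
```

In catalog.yml the entry would then use the full dotted path to the class as its type (for example a hypothetical my_project.datasets.BinaryDataset) together with a filepath key. For multi-gigabyte weights, reading everything into memory may be wasteful; a variant whose _load returns the path and leaves the loading to the inference library might fit better.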
For your second question, you could use the APIDataset to fetch from an endpoint. There is also a HuggingfaceDataset that you can take as inspiration.
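If the weights are just a file behind a URL, another option is a small download helper called from a node or from a custom dataset's load method. The sketch below is not APIDataset; it uses only the Python standard library, and the function name fetch_if_missing is made up for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_if_missing(url: str, dest: str) -> Path:
    """Download url to dest unless dest already exists; return the local path."""
    target = Path(dest)
    if target.exists():
        # Already cached locally, so skip the download.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream into a temporary file so an interrupted download
    # never leaves a half-written file under the final name.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)
    return target
```

Streaming with shutil.copyfileobj avoids holding the whole file in memory, which matters for large model files. Whether this lives in its own pipeline or inside a dataset is a design choice; a separate "download" node keeps the fetch step visible in the pipeline.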