At the end of an ML experiment I store the trained sklearn pipeline in MLflow. Inside the pipeline I use a pre-trained embedder like so:
from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder
import mlflow
from sklearn.pipeline import make_pipeline

with mlflow.start_run() as run:
    model = make_pipeline(
        ColumnGrabber('X'),
        SentenceEncoder('distiluse-base-multilingual-cased-v2'),
        ...
    )
    model.fit(X, y)
    mlflow.sklearn.log_model(model, 'sk_pipeline')
This saves the rather large embedder weights with every run, even though they never change. Is there a way to store only the information needed to recreate the pipeline (the encoder class and the model name/URL) without storing the entire embedder? It should still be possible to deploy the model via MLflow, though.
This would also solve another problem: when using embedders from TensorFlow Hub in the pipeline, it is complicated to store the pipeline in MLflow because these embedders are not easily picklable.
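To make the question more concrete, the rough idea I have in mind is something like the sketch below (not working code, just an illustration): a custom mlflow.pyfunc wrapper that logs only the small, fitted tail of the pipeline as an artifact, keeps just the encoder's model name, and rebuilds the SentenceEncoder at load time so its weights are never stored. The `tail` object and the `LazyEncoderPipeline` name are made up for illustration; I don't know if this is the intended way to do it.

import joblib
import mlflow
import mlflow.pyfunc
from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

class LazyEncoderPipeline(mlflow.pyfunc.PythonModel):
    """Recreates the (never-changing) embedder at load time instead of pickling it."""

    def load_context(self, context):
        # Only the small, fitted downstream part of the pipeline is stored.
        self.tail = joblib.load(context.artifacts['tail'])
        # The embedder is rebuilt from its name, so its weights are not logged.
        self.embedder = SentenceEncoder('distiluse-base-multilingual-cased-v2')
        self.grabber = ColumnGrabber('X')

    def predict(self, context, model_input):
        texts = self.grabber.transform(model_input)
        return self.tail.predict(self.embedder.transform(texts))

# Hypothetical usage: `tail` is the fitted pipeline minus the grabber/encoder steps.
joblib.dump(tail, 'tail.joblib')
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        'sk_pipeline',
        python_model=LazyEncoderPipeline(),
        artifacts={'tail': 'tail.joblib'},
    )

Is something along these lines the recommended approach, or does MLflow offer a more direct mechanism for this?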