At the end of an ML experiment I store the trained sklearn pipeline in MLflow. Inside the pipeline I use a pre-trained embedder like so:
from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder
import mlflow
from sklearn.pipeline import make_pipeline

with mlflow.start_run() as run:
    model = make_pipeline(
        ColumnGrabber('X'),
        SentenceEncoder('distiluse-base-multilingual-cased-v2'),
        ...
    )
    model.fit(X, y)
    mlflow.sklearn.log_model(model, 'sk_pipeline')
This saves the rather large embedder weights with every run, even though they never change. Is there a way to store only the information needed to recreate the pipeline (the encoder class and the model name/URL) without storing the entire embedder? It should still be possible to deploy the model via MLflow, though.
This would also solve another problem: when using embedders from TensorFlow Hub in the pipeline, it is complicated to store the pipeline in MLflow because these embedders are not easily picklable.
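To make the question more concrete, the rough idea I have in mind is something like the sketch below (not working code, just an illustration): a custom mlflow.pyfunc wrapper that logs only the small, fitted tail of the pipeline as an artifact, keeps just the encoder's model name, and rebuilds the SentenceEncoder at load time so its weights are never stored. The `tail` object and the `LazyEncoderPipeline` name are made up for illustration; I don't know if this is the intended way to do it.

import joblib
import mlflow
import mlflow.pyfunc
from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

class LazyEncoderPipeline(mlflow.pyfunc.PythonModel):
    """Recreates the (never-changing) embedder at load time instead of pickling it."""

    def load_context(self, context):
        # Only the small, fitted downstream part of the pipeline is stored.
        self.tail = joblib.load(context.artifacts['tail'])
        # The embedder is rebuilt from its name, so its weights are not logged.
        self.embedder = SentenceEncoder('distiluse-base-multilingual-cased-v2')
        self.grabber = ColumnGrabber('X')

    def predict(self, context, model_input):
        texts = self.grabber.transform(model_input)
        return self.tail.predict(self.embedder.transform(texts))

# Hypothetical usage: `tail` is the fitted pipeline minus the grabber/encoder steps.
joblib.dump(tail, 'tail.joblib')
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        'sk_pipeline',
        python_model=LazyEncoderPipeline(),
        artifacts={'tail': 'tail.joblib'},
    )

Is something along these lines the recommended approach, or does MLflow offer a more direct mechanism for this?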