Save a CatBoost model in Python and load it in Spark


I have a use case where model training runs as a Python process. The model is a CatBoost regressor with categorical features.

In general, language-agnostic model formats like ONNX and PMML work well when model training and prediction happen in different processes.

But from what I can see, CatBoost's ONNX export doesn't work with categorical features (ref).

That leaves me with PMML and CatBoost's own binary format, CBM. From a quick look, I think PMML won't suit my use case either, because it needs the categorical features to be one-hot encoded, which would blow up the size of my model. So is CBM the only option I have?

I saved the model with save_model in Python and uploaded the binary to HDFS. Loading the model in Apache Spark fails with both approaches below.

Approach #1 (loadNativeModel)

import ai.catboost.spark._
val loadedModel = CatBoostRegressionModel.loadNativeModel("/path/to/model.cbm")

Traceback

ai.catboost.CatBoostError: /src/catboost/catboost/libs/model/model_import_interface.h:19: Model file doesn't exist: /path/to/model.cbm
    at ru.yandex.catboost.spark.catboost4j_spark.core.src.native_impl.native_implJNI.ReadModel__SWIG_0(Native Method)
    at ru.yandex.catboost.spark.catboost4j_spark.core.src.native_impl.native_impl.ReadModel(native_impl.java:193)
    at ai.catboost.spark.CatBoostRegressionModel$.loadNativeModel(CatBoostRegressor.scala:145)
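
One possible reading of this error, stated as an assumption rather than a confirmed diagnosis: loadNativeModel is documented as loading the model from a local file, so the path would be resolved on the driver's local filesystem and not on HDFS. A minimal sketch of a workaround under that assumption copies the file from HDFS into a local temporary directory first. It is written for spark-shell or a notebook; the temporary directory name and the use of the session's Hadoop configuration are illustrative choices, and the HDFS path is the same placeholder as above.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import ai.catboost.spark._

val spark = SparkSession.builder().getOrCreate()

// Placeholder HDFS location of the .cbm file written by Python's save_model.
val hdfsModelPath = new Path("/path/to/model.cbm")

// Local scratch location on the driver (directory name is an arbitrary choice).
val localDir = java.nio.file.Files.createTempDirectory("catboost-model")
val localModelFile = localDir.resolve("model.cbm").toString

// Resolve the filesystem that owns the HDFS path and copy the file to the driver.
// delSrc = false keeps the HDFS copy; useRawLocalFileSystem = true avoids a .crc sidecar file.
val fs = hdfsModelPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.copyToLocalFile(false, hdfsModelPath, new Path(localModelFile), true)

// Load from the local path, which is what the native reader is assumed to expect.
val loadedModel = CatBoostRegressionModel.loadNativeModel(localModelFile)
```

If this works, loadedModel can be used like any other Spark ML model, for example through transform on a DataFrame that has the expected features column.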

Approach #2 (load)

import ai.catboost.spark._
val loadedModel = CatBoostRegressionModel.load("/path/to/parent_directory")

Traceback

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://path/to/parent_directory/metadata
    at org.apache.hadoop.mapred.LocatedFileStatusFetcher.getFileStatuses(LocatedFileStatusFetcher.java:156)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:247)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
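
The missing metadata path suggests that load follows the usual Spark ML persistence convention: it seems to expect a directory previously written by Spark's own writer, not a directory that merely contains a .cbm file. The sketch below assumes that CatBoostRegressionModel supports the standard write/save API like other Spark ML models, which is not confirmed here. Under that assumption, a model that has already been loaded some other way can be saved in that layout and then read back with load. The helper name and its directory argument are hypothetical.

```scala
import ai.catboost.spark._

// Hypothetical helper: persist an already-loaded model in the Spark ML layout
// (a directory that includes a metadata/ subdirectory), then read it back with load.
def saveAndReload(model: CatBoostRegressionModel, sparkModelDir: String): CatBoostRegressionModel = {
  // Assumption: the model exposes the standard MLWritable writer.
  model.write.overwrite().save(sparkModelDir)
  CatBoostRegressionModel.load(sparkModelDir)
}
```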

Has anyone tried something like this?
