I have a use case where the model is trained in a Python process and used for prediction in a different one (Apache Spark). The model is a CatBoost regressor with categorical features.
In general, language-agnostic model formats like ONNX and PMML work well in such cases, where model training and prediction happen in different processes.
But from what I can see, CatBoost's ONNX export doesn't work with categorical features (ref).
That leaves me with PMML and CatBoost's own binary format, CBM. From a quick look, I don't think PMML suits my use case either, because it requires one-hot encoding the categorical features, which would blow up the size of my model. So is CBM the only option I have?
I saved the model with save_model in Python and uploaded the binary to HDFS. Loading that file in Apache Spark fails with both approaches I tried.
Approach #1 (loadNativeModel)
import ai.catboost.spark._
val loadedModel = CatBoostRegressionModel.loadNativeModel("/path/to/model.cbm")
Traceback
ai.catboost.CatBoostError: /src/catboost/catboost/libs/model/model_import_interface.h:19: Model file doesn't exist: /path/to/model.cbm
at ru.yandex.catboost.spark.catboost4j_spark.core.src.native_impl.native_implJNI.ReadModel__SWIG_0(Native Method)
at ru.yandex.catboost.spark.catboost4j_spark.core.src.native_impl.native_impl.ReadModel(native_impl.java:193)
at ai.catboost.spark.CatBoostRegressionModel$.loadNativeModel(CatBoostRegressor.scala:145)
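One possible explanation, offered as an assumption and not a confirmed diagnosis: the error is raised from native code (native_implJNI.ReadModel), so the path may be resolved against the driver's local filesystem and not against HDFS. A minimal sketch for checking that copies the file out of HDFS with Hadoop's FileSystem API before loading it. The temporary directory name and the reuse of an existing SparkSession are illustrative choices, not part of the original attempt.

```scala
import java.nio.file.Files

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

import ai.catboost.spark._

// Reuse the active session (spark-shell already provides one as `spark`).
val spark = SparkSession.builder().getOrCreate()

// Same placeholder path as above; here it is treated as an HDFS location.
val hdfsModelPath = new Path("/path/to/model.cbm")

// Hypothetical local destination on the driver, inside a fresh temp directory.
val localModelFile = Files.createTempDirectory("catboost-model-").resolve("model.cbm")

// Resolve the filesystem that owns the source path and copy the file to local disk.
val fs = hdfsModelPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.copyToLocalFile(hdfsModelPath, new Path(localModelFile.toString))

// Assumption under test: loadNativeModel reads from the local filesystem.
val loadedModel = CatBoostRegressionModel.loadNativeModel(localModelFile.toString)
```

If this call succeeds where the original one failed, the "Model file doesn't exist" error was most likely about the local disk and not about the file on HDFS.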
Approach #2 (load)
import ai.catboost.spark._
val loadedModel = CatBoostRegressionModel.load("/path/to/parent_directory")
Traceback
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://path/to/parent_directory/metadata
at org.apache.hadoop.mapred.LocatedFileStatusFetcher.getFileStatuses(LocatedFileStatusFetcher.java:156)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:247)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
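The missing .../metadata path suggests that load is the standard Spark ML reader, which expects a directory written by Spark's own model writer and not a raw .cbm file. That is an inference from the traceback, not something confirmed here. A sketch of what it would imply, assuming a natively loaded model can be re-saved through the Spark ML writer; both paths below are hypothetical.

```scala
import ai.catboost.spark._

// Hypothetical path to a .cbm file that the driver can read.
val nativeModel = CatBoostRegressionModel.loadNativeModel("/local/path/to/model.cbm")

// Hypothetical target directory for the Spark ML layout (metadata/ plus model data).
val sparkModelDir = "/path/to/spark_model_dir"

// Persist through the Spark ML writer so that a metadata directory is created.
nativeModel.write.overwrite().save(sparkModelDir)

// load reads the Spark ML layout, so it should now find the metadata it was missing.
val reloadedModel = CatBoostRegressionModel.load(sparkModelDir)
```

Under this reading, load is only useful for models that were already saved from Spark, while a file produced by Python's save_model would have to go through loadNativeModel first.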
Has anyone managed to load a CatBoost model that was trained in Python into Apache Spark this way?