When using CatBoost on data with a categorical variable, CTR values are calculated during training for each value of this categorical feature. These values are then used to determine paths through the tree at prediction time. Given a trained model, how can I access these CTR values?
(Please note that our model uses non-symmetric trees, for which model export to Python or C++ is not supported.)
What I've tried / partial progress: I can see the CTR values in the JSON export, but these are stored next to the hash of each feature value, not the feature value itself. If I knew how the hash was calculated (and what exactly was hashed, i.e. is it just the feature name?) then I would have the CTR values.
Solving this took some effort, so I'll answer here for others.
The CTR values are available in the JSON export of a CatBoost model. Specifically, you can find them in `jsonexport['ctr_data'][feature_identifier]['hash_map']`. This is a list in which hash values are stored alongside integers `ctr_i`. The hash values are hashes of the categorical feature values, while the integers `ctr_i` are raw counts, which are combined to form the true CTR values in the manner described here.

The hash values are computed with CityHash. Note that it is crucial to use a specific older version of CityHash. The hash function was deduced from the Python export and was correct for our model (which used counters of type "Buckets").
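To illustrate how the `hash_map` list could be read once it is loaded from the JSON export, here is a minimal sketch. It rests on several assumptions that are not confirmed by the text above: that the list is flat, with each entry made of one hash followed by a fixed number of raw counts; that the entry length is known (the `stride` parameter is a hypothetical name); and that the counts are combined as `(count_in_class + prior) / (total_count + 1)`. The helper names `parse_hash_map` and `smoothed_ctr` and the sample data are invented for illustration, so check the layout and formula against your own export before relying on them.

```python
from typing import Dict, List, Sequence, Union


def parse_hash_map(hash_map: Sequence[Union[int, str]], stride: int) -> Dict[int, List[int]]:
    """Group a flat [hash, count, ..., hash, count, ...] list by hash.

    Assumption: every entry occupies `stride` consecutive items, the first
    being the hash of a categorical value and the rest being raw counts.
    """
    if stride < 2 or len(hash_map) % stride != 0:
        raise ValueError("hash_map length must be a multiple of stride (>= 2)")
    counts_by_hash: Dict[int, List[int]] = {}
    for start in range(0, len(hash_map), stride):
        # Hashes may be serialized as strings, so normalize them to int.
        value_hash = int(hash_map[start])
        counts_by_hash[value_hash] = [int(c) for c in hash_map[start + 1:start + stride]]
    return counts_by_hash


def smoothed_ctr(counts: Sequence[int], class_index: int, prior: float) -> float:
    """Combine raw counts into a CTR value.

    Assumption: ctr = (count_in_class + prior) / (total_count + 1).
    """
    return (counts[class_index] + prior) / (sum(counts) + 1)


# Made-up data: two categorical values, two raw counts each (stride of 3).
example_hash_map = ["1001", 3, 1, "2002", 0, 4]
parsed = parse_hash_map(example_hash_map, stride=3)
ctr_by_hash = {h: smoothed_ctr(c, class_index=1, prior=0.5) for h, c in parsed.items()}
```

With a real model you would pass `jsonexport['ctr_data'][feature_identifier]['hash_map']` instead of the made-up list, then look up the hash of a given categorical value in the resulting dictionary.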