How does LGBM handle categorical features when they are not specified?


I am playing with LGBM and indexed my categorical features using StringIndexer, but after that I haven't told my model which features are categorical. So I am wondering how it knows which features are categorical.

Here is how I init my LGBM model.

val lgbm = new LightGBMClassifier("lgbm").
  setObjective("binary").
  setFeatureFraction(0.85).
  setFeaturesCol("features").
  setLabelCol("is_booker")

There is 1 answer

Answer by James Lamb (best answer)

If you are using mmlspark (you didn't mention how you're using LightGBM in Scala), LightGBM automatically figures out which columns should be treated as categorical, based on the attributes of the columns.

From Azure/mmlspark#559:

...if you use string indexer or our value indexer, categorical metadata will be automatically added to the dataframe and lightgbm will actually be able to interpret it and treat those columns as categoricals by splitting on the feature values directly (so you won't need to one-hot-encode them)
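To see the metadata that quote refers to, you can inspect the schema of the assembled features column yourself. The snippet below is only a sketch using plain Spark ML, not the mmlspark implementation: the toy data, the column names `country` and `nights`, and the helper `categoricalSlots` are made up for illustration, and it assumes that checking for `NominalAttribute` is a fair approximation of "categorical" here.

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NominalAttribute}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("categorical-metadata-sketch")
  .getOrCreate()
import spark.implicits._

// Toy data (hypothetical): one string column, one numeric column, one label.
val df = Seq(("US", 3.0, 1), ("FR", 1.5, 0), ("US", 2.0, 1))
  .toDF("country", "nights", "is_booker")

// StringIndexer attaches nominal (categorical) metadata to its output column.
val indexed = new StringIndexer()
  .setInputCol("country")
  .setOutputCol("country_idx")
  .fit(df)
  .transform(df)

// VectorAssembler carries that per-column metadata into the vector column.
val assembled = new VectorAssembler()
  .setInputCols(Array("country_idx", "nights"))
  .setOutputCol("features")
  .transform(indexed)

// Hypothetical helper: positions in the feature vector whose attribute is nominal.
def categoricalSlots(data: DataFrame, featuresCol: String): Array[Int] = {
  val group = AttributeGroup.fromStructField(data.schema(featuresCol))
  group.attributes
    .getOrElse(Array.empty[Attribute])
    .zipWithIndex
    .collect { case (_: NominalAttribute, slot) => slot }
}

println(categoricalSlots(assembled, "features").mkString(", "))
```

If the features vector is built some other way that drops the column metadata (for example, a UDF that constructs vectors by hand), there is nothing like this left to read, so it is worth checking the schema before relying on the automatic detection.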

The method that accomplishes that is called LightGBMUtils.getCategoricalIndexes(), and you can find it at https://github.com/Azure/mmlspark/blob/95c1f8a782191e3578587a49313e1d57abee5da3/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMUtils.scala#L74-L104.

That method is re-used by LightGBMBase.getCategoricalIndexes() during training.

If I'm right that you're using mmlspark and you have further questions about how this works, I recommend opening issues in Azure/mmlspark.