I have trained and am debugging a PySpark ML RandomForestClassificationModel, created by calling pyspark.ml.classification.RandomForestClassifier.fit(). I want to interpret the feature importances returned by the RandomForestClassificationModel.featureImportances property, which is a SparseVector.

As you can see in the notebook below, I had to transform my features in several stages to build the final Features_vec column that feeds the algorithm. What I want is a list of feature importances keyed by feature type and source column. How can I map the SparseVector of importances back to feature names, or to some other interpretable format?

The code is in a Jupyter Notebook here. Skip to the end.

This shouldn't be specific to PySpark, so if you know a Scala solution, please chime in.
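For reference, one direction I'm exploring (not sure it's the right one): VectorAssembler writes per-slot ML attribute metadata onto the output column's schema (`df.schema[col].metadata["ml_attr"]["attrs"]`), which maps vector indices to the original column names. Below is a sketch of pairing that metadata with the importance values; the metadata dict and importance values here are hypothetical placeholders shaped like what I believe Spark produces, not output from my actual pipeline.

```python
def feature_names_from_metadata(metadata):
    """Build {index: name} from a vector column's 'ml_attr' metadata,
    i.e. df.schema["Features_vec"].metadata, as written by VectorAssembler."""
    attrs = metadata.get("ml_attr", {}).get("attrs", {})
    names = {}
    for group in attrs.values():  # groups like "numeric", "binary", "nominal"
        for attr in group:
            names[attr["idx"]] = attr["name"]
    return names

def named_importances(importances, metadata):
    """Pair importance weights (index -> value, e.g. the nonzero entries of
    model.featureImportances) with feature names, sorted descending."""
    names = feature_names_from_metadata(metadata)
    return sorted(
        ((names.get(i, "feature_%d" % i), w) for i, w in importances.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )

# Hypothetical metadata in the shape VectorAssembler produces:
meta = {"ml_attr": {"attrs": {
    "numeric": [{"idx": 0, "name": "age"}, {"idx": 2, "name": "income"}],
    "binary": [{"idx": 1, "name": "gender_indexed"}],
}}}
# Hypothetical nonzero entries of the featureImportances SparseVector:
imp = {0: 0.2, 1: 0.05, 2: 0.6}
print(named_importances(imp, meta))
# → [('income', 0.6), ('age', 0.2), ('gender_indexed', 0.05)]
```

I haven't verified this covers every attribute group Spark emits, so corrections welcome.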
