I have created and am debugging a PySpark ML RandomForestClassificationModel, which was created by calling pyspark.ml.classification.RandomForestClassifier.fit(). I want to interpret the values returned by the RandomForestClassificationModel.featureImportances property, which come back as a SparseVector.
As you can see in the notebook below, I had to transform my features through several stages to produce the final Features_vec column that fed the algorithm. What I want is a list of feature importances keyed by feature type and column name. How can I map the indices of the SparseVector back to feature names, or otherwise get the importances into an interpretable format?
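For reference, here is one approach I have been considering: VectorAssembler (and the encoders feeding it) attach "ml_attr" metadata to the output vector column, which records a name for each vector index. A helper could join that metadata with the importances array. This is a sketch under assumptions: the column name "Features_vec" and the feature names below are placeholders, and the metadata layout shown is what I believe VectorAssembler produces.

```python
def importances_with_names(ml_attr, importances):
    """Pair each feature importance with its name, using the 'ml_attr'
    metadata that VectorAssembler attaches to its output vector column."""
    name_by_idx = {}
    # Attribute groups are typically keyed "numeric", "binary", "nominal"
    for group in ml_attr.get("attrs", {}).values():
        for attr in group:
            name_by_idx[attr["idx"]] = attr["name"]
    # Fall back to a positional label if an index has no recorded name
    pairs = [(name_by_idx.get(i, "feature_%d" % i), w)
             for i, w in enumerate(importances)]
    return sorted(pairs, key=lambda t: t[1], reverse=True)

# In a real pipeline (names hypothetical), the call would look like:
#   ml_attr = assembled_df.schema["Features_vec"].metadata["ml_attr"]
#   ranked = importances_with_names(ml_attr, model.featureImportances.toArray())

# Hand-built metadata mimicking what VectorAssembler produces:
example_ml_attr = {
    "attrs": {
        "numeric": [{"idx": 0, "name": "num_col"}],
        "binary": [{"idx": 1, "name": "cat_col_a"},
                   {"idx": 2, "name": "cat_col_b"}],
    }
}
print(importances_with_names(example_ml_attr, [0.2, 0.7, 0.1]))
# [('cat_col_a', 0.7), ('num_col', 0.2), ('cat_col_b', 0.1)]
```

I am not certain this metadata survives every transformation stage, which is part of what I am asking.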
The code is in a Jupyter Notebook here. Skip to the end.
This shouldn't be specific to PySpark, so if you know a Scala solution, please chime in.