I have trained a linear regression model in PySpark. Aside from the continuous predictors, it contains categorical features that I one-hot encoded. I'd like to look at the coefficient per input variable, e.g. for an input column "fruit" with values "apple" and "banana" I'd like to create a mapping like
- some_numerical_predictor: coefficient for predictor
- fruit_apple : coefficient for apple
- fruit_banana : coefficient for banana
- ...
However, due to the encoding I cannot simply stitch the input columns to the coefficients. Here are a few more details.
After putting the categorical predictors into string_columns, I preprocess like so:
string_feature_indexers = [
    StringIndexer(inputCol=col, outputCol=f"int_{col}").fit(df)
    for col in string_columns
]
onehot_encoder = [
    OneHotEncoder(inputCol=f"int_{col}", outputCol=f"onehot_{col}")
    for col in string_columns
]
all_columns = numeric_columns + boolean_columns + [f"onehot_{col}" for col in string_columns]
assembler = VectorAssembler(inputCols=all_columns, outputCol="features")
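To illustrate the mapping I'm after, here is a toy sketch in plain Python (the column names and category lists are hardcoded for illustration, not taken from my real pipeline). It shows how each one-hot column would need to expand into per-category names to line up with the assembled feature vector, assuming Spark's default dropLast=True (the last indexed category gets no slot):

```python
# Hypothetical inputs, for illustration only
numeric_columns = ["price"]
boolean_columns = ["organic"]
# what StringIndexer.labels would give per categorical column (hypothetical)
indexer_labels = {"fruit": ["apple", "banana", "cherry"]}

# Expand one-hot columns into per-category names so they line up
# slot-by-slot with the assembled feature vector.
expanded_names = numeric_columns + boolean_columns
for col, labels in indexer_labels.items():
    # dropLast=True: the last category ("cherry") has no coefficient slot
    expanded_names += [f"onehot_{col}_{label}" for label in labels[:-1]]

print(expanded_names)
# ['price', 'organic', 'onehot_fruit_apple', 'onehot_fruit_banana']
```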
After training the model, I access the coefficients and input columns like so:
coefficients = bestModel.coefficients.toArray() # this has length > 80
input_features = assembler.getInputCols() # this has length 39
Because getInputCols() doesn't account for the one-hot encoding, it's much shorter than the coefficient array. I read the documentation on OneHotEncoder, VectorAssembler, etc. to see if there is anything that would help me create the column_category-value: coefficient mapping, but didn't find anything.
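To make the mismatch concrete, here is a minimal pure-Python sketch (the lengths are made up, not those of my real model). zip silently truncates to the shorter sequence, so naively pairing names with coefficients drops the extra one-hot slots without raising any error:

```python
# Made-up lengths for illustration: 3 input names vs. 5 coefficient slots
input_features = ["age", "income", "onehot_fruit"]
coefficients = [0.5, 1.2, 0.3, -0.7, 2.1]

# zip stops at the shorter sequence: only 3 pairs come out,
# and the last two coefficients are silently unaccounted for
mapping = dict(zip(input_features, coefficients))
print(mapping)
# {'age': 0.5, 'income': 1.2, 'onehot_fruit': 0.3}
```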
Can anybody tell me whether there is a built-in (or custom) way to do this in the presence of one-hot encoding?
Thanks, Michael