Map One-Hot Encoded Features to Regression Coefficients in PySpark

I have trained a linear regression model in PySpark. Aside from continuous predictors, it contains categorical features that I one-hot encoded. I'd like to look at the coefficient for each input variable. For example, for an input column "fruit" with values "apple" and "banana", I'd like to create a mapping like:

  • some_numerical_predictor: coefficient for predictor
  • fruit_apple : coefficient for apple
  • fruit_banana : coefficient for banana
  • ...

However, due to the encoding I cannot simply pair the input columns with the coefficients. Here are a few more details.

After putting the categorical predictors into string_columns, I preprocess like so:


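    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    # Fit one indexer per categorical column (string values -> integer indices).
    string_feature_indexers = [
        StringIndexer(inputCol=col, outputCol=f"int_{col}").fit(df)
        for col in string_columns
    ]

    # One encoder per indexed column (integer index -> one-hot vector).
    onehot_encoder = [
        OneHotEncoder(inputCol=f"int_{col}", outputCol=f"onehot_{col}")
        for col in string_columns
    ]

    all_columns = numeric_columns + boolean_columns + [f"onehot_{col}" for col in string_columns]

    # Concatenate all predictors into a single "features" vector column.
    assembler = VectorAssembler(inputCols=all_columns, outputCol="features")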

After training the model I access the coefficients and input columns like so:


    coefficients = bestModel.coefficients.toArray() # this has length > 80
    input_features = assembler.getInputCols() # this has length 39

Because getInputCols() doesn't account for the one-hot encoding, it is much shorter than the coefficient array. I read the documentation for OneHotEncoder, VectorAssembler, etc. to see if anything would help me create the column_category-value: coefficient mapping, but didn't find anything.
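To make the length mismatch concrete, here is a minimal sketch of the arithmetic. It assumes each numeric or boolean column fills one slot of the assembled vector and that each one-hot column fills one slot per category, minus one when OneHotEncoder's default dropLast=True is in effect. The function name and the example counts are hypothetical, not taken from my data.

```python
def expected_vector_length(n_plain_columns, categories_per_column, drop_last=True):
    """Estimate how many slots the assembled feature vector has.

    n_plain_columns: count of numeric/boolean columns (one slot each).
    categories_per_column: dict of column name -> number of distinct categories.
    drop_last: mirrors OneHotEncoder's dropLast parameter (default True).
    """
    slots_per_onehot = [
        n - 1 if drop_last else n for n in categories_per_column.values()
    ]
    return n_plain_columns + sum(slots_per_onehot)


# Hypothetical example: one numeric column plus "fruit" with three categories.
example_length = expected_vector_length(1, {"fruit": 3})
```

So a handful of categorical columns with many distinct values is enough to push 39 input columns past 80 coefficients.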

Can anybody tell me if there is a built-in or custom way to do this in the presence of one-hot encoding?
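One possible approach, sketched here under assumptions rather than confirmed for this pipeline: VectorAssembler attaches "ml_attr" metadata to its output column, listing a name and index for each slot of the vector, and that metadata can be read from the schema of the transformed DataFrame. The helper name coefficients_by_feature and the hand-written metadata below are hypothetical. The exact slot names (for example "onehot_fruit_apple") are an assumption about how Spark names them.

```python
from itertools import chain


def coefficients_by_feature(ml_attr, coefficients):
    """Pair each named slot of the assembled vector with its coefficient.

    ml_attr: the "ml_attr" metadata dict of the assembled features column.
    coefficients: sequence of floats, e.g. bestModel.coefficients.toArray().
    """
    # Attributes are grouped by type (e.g. "numeric", "binary"); flatten the
    # groups and order them by each attribute's position in the vector.
    attrs = sorted(
        chain.from_iterable(ml_attr["attrs"].values()),
        key=lambda attr: attr["idx"],
    )
    return {attr["name"]: float(coefficients[attr["idx"]]) for attr in attrs}


# With Spark, the metadata would come from the transformed DataFrame, e.g.
# (assuming `transformed` is the output of the fitted preprocessing stages):
#   ml_attr = transformed.schema["features"].metadata["ml_attr"]
#   mapping = coefficients_by_feature(ml_attr, bestModel.coefficients.toArray())

# Hand-written stand-in for the metadata, for illustration only.
example_ml_attr = {
    "attrs": {
        "numeric": [{"idx": 0, "name": "some_numerical_predictor"}],
        "binary": [{"idx": 1, "name": "onehot_fruit_apple"}],
    },
    "num_attrs": 2,
}
example_mapping = coefficients_by_feature(example_ml_attr, [0.5, -1.25])
```

Note that with dropLast=True the last category of each column (the least frequent one under StringIndexer's default ordering) gets no slot of its own, so it would not appear in the mapping; it acts as the reference level.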

Thanks, Michael
