How do I return the actual statistical value computed by pyspark's ml Imputer class?


Rather than just writing the fitted imputer to disk and reading it back for downstream use, I'm being asked to save the statistic computed by Imputer from pyspark.ml.feature in a YAML file for later consumption. I don't see any obvious attribute that will output this for me.

So, let's use the example below. I want to extract the value the imputer computed for the chosen fill_method (here, the mean of daily_meals) and assign it to imputer_statistic. How on earth do I do this?

from pyspark.sql import SparkSession
from pyspark.ml.feature import Imputer

# Reuse the active session if there is one (e.g. in a notebook), else create it.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("joe", 34, 3), ("luisa", 22, 1), ("jonny", 21, 2), ("alice", 31, None), ("montey", 22, None)],
    ["first_name", "age", "daily_meals"],
)
fill_method = "mean"

imputer = Imputer(
    inputCol="daily_meals",
    outputCol="daily_meals",  # TODO: must test that replacement occurs.
).setStrategy(fill_method)
new_df = imputer.fit(df).transform(df)
imputer_statistic = imputer.get(fill_method)  # does not work

There are 0 answers