Rather than just saving and reloading the imputer for downstream use, I'm being asked to save the statistic computed by Imputer from pyspark.ml.feature in a YAML file for later consumption. I don't see any obvious attribute that exposes it.
So, let's use the example below. I want to extract the statistic computed for fill_method (here, the mean of daily_meals) from the imputer object and assign it to imputer_statistic. How on earth do I do this?
from pyspark.sql import SparkSession
from pyspark.ml.feature import Imputer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("joe", 34, 3), ("luisa", 22, 1), ("jonny", 21, 2), ("alice", 31, None), ("montey", 22, None)],
    ["first_name", "age", "daily_meals"],
)
fill_method = "mean"
imputer = Imputer(
    inputCol="daily_meals",
    outputCol="daily_meals",  # TODO: must test that replacement occurs.
).setStrategy(fill_method)
new_df = imputer.fit(df).transform(df)
imputer_statistic = imputer.get(fill_method)  # does not work
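
One possible route, sketched here rather than verified against every Spark version: the fitted model returned by fit() (an ImputerModel) exposes a surrogateDF attribute, a one-row DataFrame holding the computed fill value for each input column. The helper name imputer_statistics below is made up for illustration, and it assumes the caller keeps the fitted model instead of chaining fit().transform().

```python
from typing import Any, Dict


def imputer_statistics(model: Any) -> Dict[str, float]:
    """Return {input column: surrogate value} from a fitted ImputerModel.

    Assumes `model` is what Imputer.fit(df) returns, i.e. an object
    exposing `surrogateDF`, a one-row DataFrame with one column per input.
    """
    row = model.surrogateDF.first()  # the single row of fitted statistics
    return {col: float(value) for col, value in row.asDict().items()}


# Usage with the objects from the question (needs a running Spark session):
#   model = imputer.fit(df)          # keep the fitted model, not only the estimator
#   new_df = model.transform(df)
#   imputer_statistic = imputer_statistics(model)["daily_meals"]
```

The returned dict is plain Python, so it could then be written out with a YAML library such as PyYAML's yaml.safe_dump, assuming that package is available.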