I have example data in which one column contains string values (e.g. "34 12"). During preprocessing I create two new columns that store the left and right integers of the string column. At the end I want to get rid of the string column, but I don't know how to do this within the pipeline.
Here is a smaller version of the code that reproduces my problem. I tried using
("column_dropper", "drop", ["string1"])
in the ColumnTransformer.
But when I inspect x_transformed, it is a NumPy array that still contains the string values:
array([[1.0, 6.5, '34 12', 34, 12],
       [2.0, 6.0, '34 5', 34, 5],
       [1.5, 7.0, '56 6', 56, 6]], dtype=object)
Here is the code:
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
# creating example data
data = {"string1": ["34 12", "34 5", "56 6"], "age": [1, 2, None], "grade": [None, 6, 7]}
x_train = pd.DataFrame(data=data)
# define functions
def extract_int2(x):
    # last space-separated token, or 0 if it is not numeric
    num = x.split(" ")[-1]
    if num.isnumeric():
        return int(num)
    else:
        return 0
def extract_int1(x):
    # first space-separated token, or 0 if it is not numeric
    num = x.split(" ")[0]
    if num.isnumeric():
        return int(num)
    else:
        return 0
def int_features(df):
    df["num1"] = df["string1"].apply(extract_int1)
    df["num2"] = df["string1"].apply(extract_int2)
    return df
columns_to_drop = "string1"
# define Pipeline
num_vals = Pipeline([("imputer", SimpleImputer(strategy="mean"))])
features_vals = Pipeline([("new_features", FunctionTransformer(int_features, validate=False))])
preprocess_pipeline = ColumnTransformer(transformers=[
    ("num_preprocess", num_vals, ["age", "grade"]),
    ("feature_preprocess", features_vals, ["string1"]),
    ("column_dropper", "drop", ["string1"])])
preprocess_pipeline.fit(x_train)
x_transformed = preprocess_pipeline.transform(x_train)
x_transformed
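A note on why the "drop" entry has no visible effect here: ColumnTransformer hands each entry its own slice of the original input and concatenates the outputs, so "drop" only means that this one entry contributes no columns. The string column in the result comes from feature_preprocess, because int_features returns the whole frame it received, including "string1". Below is a minimal sketch of one possible fix, assuming it is acceptable for the feature function to return only the new columns. The names to_int and int_features_only are hypothetical and not part of the code above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer

x_train = pd.DataFrame({
    "string1": ["34 12", "34 5", "56 6"],
    "age": [1, 2, None],
    "grade": [None, 6, 7],
})

def to_int(token):
    # Same fallback as extract_int1/extract_int2: non-numeric tokens become 0.
    return int(token) if token.isnumeric() else 0

def int_features_only(df):
    # Hypothetical variant of int_features: build a new frame that holds only
    # num1 and num2, so "string1" never reaches the transformer's output.
    return pd.DataFrame({
        "num1": df["string1"].apply(lambda s: to_int(s.split(" ")[0])),
        "num2": df["string1"].apply(lambda s: to_int(s.split(" ")[-1])),
    }, index=df.index)

preprocess = ColumnTransformer(transformers=[
    ("num_preprocess", SimpleImputer(strategy="mean"), ["age", "grade"]),
    ("feature_preprocess", FunctionTransformer(int_features_only), ["string1"]),
])

# Columns of the result, in order: age, grade, num1, num2
x_transformed = preprocess.fit_transform(x_train)
```

No separate "drop" entry is needed in this sketch: "string1" is consumed by feature_preprocess, and any column not listed at all is dropped by the default remainder="drop".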
I also tried a user-defined dropping function with FunctionTransformer(), but it didn't work either.
def drop_column(df):
    df = df.drop(columns=["string1"])
    return df
# define Pipeline
num_vals = Pipeline([("imputer", SimpleImputer(strategy="mean"))])
features_vals = Pipeline([("new_features", FunctionTransformer(int_features, validate=False))])
dropping = Pipeline([("drop_string", FunctionTransformer(drop_column))])
preprocess_pipeline = ColumnTransformer(transformers=[
    ("num_preprocess", num_vals, ["age", "grade"]),
    ("feature_preprocess", features_vals, ["string1"]),
    ("drop_preprocess", dropping, ["string1"])]
)
preprocess_pipeline.fit(x_train)
x_transformed = preprocess_pipeline.transform(x_train)
x_transformed
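The second attempt fails for the same reason: drop_preprocess works on its own copy of "string1", while feature_preprocess still emits that column. Since the entries of a ColumnTransformer run side by side rather than one after another, another option is to chain the steps in a Pipeline. The following is only a sketch under that assumption: the features are added first, and a ColumnTransformer then selects the columns to keep. The names to_int, add_int_features and select_columns are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

x_train = pd.DataFrame({
    "string1": ["34 12", "34 5", "56 6"],
    "age": [1, 2, None],
    "grade": [None, 6, 7],
})

def to_int(token):
    # Non-numeric tokens fall back to 0, as in extract_int1/extract_int2.
    return int(token) if token.isnumeric() else 0

def add_int_features(df):
    # Work on a copy so the caller's DataFrame is not modified in place.
    df = df.copy()
    df["num1"] = df["string1"].apply(lambda s: to_int(s.split(" ")[0]))
    df["num2"] = df["string1"].apply(lambda s: to_int(s.split(" ")[-1]))
    return df

# Step 2: impute the numeric columns, pass the new features through unchanged.
# "string1" is not listed, so remainder="drop" (the default) removes it.
select_columns = ColumnTransformer(
    transformers=[
        ("num_preprocess", SimpleImputer(strategy="mean"), ["age", "grade"]),
        ("new_features", "passthrough", ["num1", "num2"]),
    ],
    remainder="drop",
)

preprocess_pipeline = Pipeline([
    ("add_features", FunctionTransformer(add_int_features)),  # step 1
    ("select", select_columns),                               # step 2
])

# Columns of the result, in order: age, grade, num1, num2
x_transformed = preprocess_pipeline.fit_transform(x_train)
```

This works because the FunctionTransformer step returns a DataFrame, so the ColumnTransformer that follows can still select columns by name.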
Try converting your data to an np.array; it works for me: