Dropping a column in sklearn Pipeline after using it to create new features

I have example data in which one column contains string values (e.g. "34 12"). During preprocessing I create two new columns that store the left and right integers from the string column. At the end I want to get rid of the string column, but I don't know how to do this within the pipeline.

Here is a smaller version of the code that reproduces my problem. I tried using ("column_dropper", "drop", ["string1"]) in the ColumnTransformer, but when I inspect x_transformed, it is a NumPy array that still contains the string values:

array([[1.0, 6.5, '34 12', 34, 12],
       [2.0, 6.0, '34 5', 34, 5],
       [1.5, 7.0, '56 6', 56, 6]], dtype=object)

Here is the code:

import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin

#creating example data
data= {"string1": ["34 12", "34 5", "56 6"], "age": [1, 2, None], "grade": [None, 6, 7]}
x_train = pd.DataFrame(data=data)

#define functions
def extract_int2(x):
    num = x.split(" ")[-1]
    if num.isnumeric():
        return int(num)
    else:
        return 0
            
def extract_int1(x):
    num = x.split(" ")[0]
    if num.isnumeric():
        return int(num)
    else:
        return 0

def int_features(df):
    df["num1"] = df["string1"].apply(extract_int1)
    df["num2"] = df["string1"].apply(extract_int2)
    return df

columns_to_drop="string1"

#define Pipeline
num_vals =  Pipeline([("imputer", SimpleImputer(strategy = "mean"))])
features_vals = Pipeline([("new_features", FunctionTransformer(int_features, validate=False))])


preprocess_pipeline = ColumnTransformer(transformers=[
    ("num_preprocess", num_vals, ["age", "grade"]),
    ("feature_preprocess", features_vals, ["string1"]),
    ("column_dropper", "drop", ["string1"])])

preprocess_pipeline.fit(x_train)

x_transformed = preprocess_pipeline.transform(x_train)
x_transformed

I also tried a user-defined dropping function with FunctionTransformer(), but it didn't work either.

def drop_column(df):
    df = df.drop(columns=["string1"])
    return df

#define Pipeline
num_vals =  Pipeline([("imputer", SimpleImputer(strategy = "mean"))])
features_vals = Pipeline([("new_features", FunctionTransformer(int_features, validate=False))])
dropping= Pipeline([("drop_string", FunctionTransformer(drop_column))])


preprocess_pipeline = ColumnTransformer(transformers=[
    ("num_preprocess", num_vals, ["age", "grade"]),
    ("feature_preprocess", features_vals, ["string1"]),
    ("drop_preprocess", dropping, ["string1"])]
                              )

preprocess_pipeline.fit(x_train)

x_transformed = preprocess_pipeline.transform(x_train)
x_transformed

There is 1 answer

zongfang liu

Try converting your data to a NumPy array and deleting the columns by index. This works for me:

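import numpy as np

# Positional indices of the columns to remove (from my data; adjust for yours)
pca_features = [17, 18, 19, 20, 21, 22]

def drop_column(data_array):
    # Delete the listed columns (axis=1) from the 2-D array
    data_array = np.delete(data_array, pca_features, axis=1)
    print(data_array.shape)
    return data_array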
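
For the data in the question, it may help to note why neither attempt removes the column: ColumnTransformer runs each entry on its own column subset and concatenates the outputs side by side, so a "drop" entry cannot remove a column that another entry (here int_features) returns. Below is a minimal sketch of two ways around that, not a definitive fix. Variant A has the feature function return only the derived columns. Variant B keeps the question's output layout and deletes the string column by position afterwards, in the spirit of the np.delete suggestion above. The names extract_int, int_features_only, int_features_keep, drop_by_position, and build_preprocessor are made up for illustration, and position 2 is an assumption based on the output layout shown in the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Same example data as in the question
x_train = pd.DataFrame({
    "string1": ["34 12", "34 5", "56 6"],
    "age": [1, 2, None],
    "grade": [None, 6, 7],
})


def extract_int(text, position):
    # Hypothetical helper: integer at `position` of a space-separated string, else 0
    num = text.split(" ")[position]
    return int(num) if num.isnumeric() else 0


def int_features_only(df):
    # Variant A: return ONLY the derived columns, so "string1" never reaches the output
    return pd.DataFrame({
        "num1": df["string1"].apply(lambda s: extract_int(s, 0)),
        "num2": df["string1"].apply(lambda s: extract_int(s, -1)),
    })


def int_features_keep(df):
    # Variant B, step 1: same layout as the question (string1, num1, num2),
    # built with assign() so the input frame is not modified in place
    return df.assign(
        num1=df["string1"].apply(lambda s: extract_int(s, 0)),
        num2=df["string1"].apply(lambda s: extract_int(s, -1)),
    )


def drop_by_position(data_array, positions=(2,)):
    # Variant B, step 2: remove columns by position from the stacked array.
    # Position 2 is assumed from the array printed in the question.
    return np.delete(data_array, list(positions), axis=1)


def build_preprocessor(feature_func):
    # Impute the numeric columns and derive features from the string column
    return ColumnTransformer(transformers=[
        ("num_preprocess", SimpleImputer(strategy="mean"), ["age", "grade"]),
        ("feature_preprocess", FunctionTransformer(feature_func), ["string1"]),
    ])


# Variant A: the string column is never emitted
x_a = build_preprocessor(int_features_only).fit_transform(x_train)

# Variant B: emit everything, then drop by position in a later Pipeline step
pipeline_b = Pipeline([
    ("preprocess", build_preprocessor(int_features_keep)),
    ("drop_string", FunctionTransformer(drop_by_position)),
])
x_b = pipeline_b.fit_transform(x_train)

print(x_a)
print(x_b)
```

Both variants should give a 3x4 array holding the imputed age and grade followed by num1 and num2. Variant A stays numeric, while Variant B comes out with dtype=object because the strings pass through the stacking step before being deleted. Variant A is probably the simpler choice when the string column is not needed by any later step.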