How do I create my own custom transformers and utilize them within a pipeline in scikit-learn?

Question

840 views Asked by Je Stra At 06 December 2024 at 11:06

I'm new to machine learning, so I apologize ahead of time if the question is silly and my code doesn't make much sense.

I have downloaded the Titanic data set from Kaggle and want to create a classifier model to predict whether or not a passenger survives based on a few important features. Some of the important features are:

Name
Sex
Age
Cabin
Ticket
Embarked

Here is what the data looks like:

Some of these features have empty values, and others I want to modify before I one-hot encode them (for example, I want to reduce the Name column to just the title, such as 'Mr' or 'Mrs', instead of the full name).

I know how to do all of this with ordinary pandas operations, i.e. fillna() and so on. But I want to create a Pipeline and a ColumnTransformer so that I can prepare the data quickly, rather than retyping the same code every time.

For the empty values in the 'Age' column I will use a SimpleImputer to fill them with the median, and for 'Embarked' and 'Cabin' I will fill the empty values with 'U':

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Fill empty age values with median
median_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median'))
])

# Fill cabin and embarked with 'U'
fill_pipeline = Pipeline([
    ('fill_U', SimpleImputer(strategy='constant', fill_value='U'))
])
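
As a quick check of what these two imputers produce, here is a minimal sketch on a made-up three-row frame. The toy values are assumptions for illustration, not rows from the Kaggle data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the Titanic columns
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 40.0],
    'Cabin': pd.Series(['C85', np.nan, np.nan], dtype=object),
    'Embarked': pd.Series(['S', 'C', np.nan], dtype=object),
})

median_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median'))
])
fill_pipeline = Pipeline([
    ('fill_U', SimpleImputer(strategy='constant', fill_value='U'))
])

# Each pipeline only sees the columns it is given
ages = median_pipeline.fit_transform(toy[['Age']])
filled = fill_pipeline.fit_transform(toy[['Cabin', 'Embarked']])

print(ages.ravel())  # the missing age becomes the median of 22 and 40
print(filled)        # every missing Cabin/Embarked value becomes 'U'
```

With the default scikit-learn configuration both results are NumPy arrays, not DataFrames, so the column names are gone after this step.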

Here I also create a custom transformer to reduce each name to its title ('Mr', 'Mrs', etc.). This code is probably wrong; I'm not sure what I need to return from transform.

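from sklearn.base import BaseEstimator, TransformerMixin

class change_name(BaseEstimator, TransformerMixin):
    def __init__(self, modify_name=True):
        self.modify_name = modify_name
    def fit(self, X, y=None):
        return self  # nothing to learn from the data
    def transform(self, X, y=None):
        X = X.copy()  # do not modify the caller's DataFrame in place
        if self.modify_name:
            # 'Braund, Mr. Owen Harris' -> 'Mr': text between the comma and the first period
            X['Name'] = X['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
        return X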
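
One caveat when pulling the title out by position: a minimal sketch, assuming every name follows the 'Surname, Title. Given names' layout, that compares a plain whitespace split with a regular expression anchored on the comma and the period. The three sample names are only illustrative.

```python
import pandas as pd

# Sample names in the 'Surname, Title. Given names' layout (assumed format)
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Vander Planke, Miss. Augusta Maria',
])

# Position-based: take the second whitespace-separated token
by_position = [n.split()[1].replace('.', '') for n in names]

# Pattern-based: capture the text between the comma and the first period
by_pattern = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip().tolist()

print(by_position)  # ['Mr', 'Mrs', 'Planke,']
print(by_pattern)   # ['Mr', 'Mrs', 'Miss']
```

The position-based version breaks on surnames that contain a space, which is why anchoring on the punctuation is the safer choice here.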

Next I create a new column indicating whether the passenger's ticket ID appears more than once, i.e. whether that person is travelling with others:

from sklearn.base import BaseEstimator, TransformerMixin

class create_multiple_ticket(BaseEstimator, TransformerMixin):
    def __init__(self, change=True):
        self.change = change
    def fit(self, X, y=None):
        # Learn which ticket IDs occur more than once in the training data
        counts = X['Ticket'].value_counts()
        self.tickets_ = set(counts[counts > 1].index)
        return self
    def transform(self, X, y=None):
        X = X.copy()  # do not modify the caller's DataFrame in place
        if self.change:
            # New boolean column: True if the ticket ID was shared in the training data
            X['multiple_tickets'] = X['Ticket'].isin(self.tickets_)
        return X.drop('Ticket', axis=1)

Next I reduce each Cabin value to its first letter.

class modify_cabin(BaseEstimator, TransformerMixin):
    def __init__(self, modify=True):
        self.modify = modify
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        if self.modify:
            X['Cabin'] = X['Cabin'].apply(lambda x: x[0])
        return X

Then I create my pipelines.

ticket_pipeline = Pipeline([
    ('modify_ticket', create_multiple_ticket()) 
])

name_pipeline = Pipeline([
    ('modify_name', change_name())
])

cabin_pipeline = Pipeline([
    ('modify_cabin', modify_cabin())
])

Lastly, I combine everything in a ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

median_attributes = ['Age']
fill_attributes = ['Cabin', 'Embarked']

onehot_encoded_attributes = ['Cabin', 'Name', 'Embarked', 'Sex']

full_pipeline = ColumnTransformer([
    ('age', median_pipeline, median_attributes), # fill empty values with age median
    ('fill', fill_pipeline, fill_attributes), # fill 'Cabin' and 'Embarked' with 'U value'
    ('name', name_pipeline, ['Name']), # Converts name column to just 'Mr', 'Miss' etc...
    ('ticket', ticket_pipeline, ['Ticket']), # Creates a new column called multiple tickets and returns true or false
    ('cabin', cabin_pipeline, ['Cabin']), # Cabin column gets modified to just one letter.
    ('categorical', OneHotEncoder(), onehot_encoded_attributes) # One_hot encodes the attributes
], remainder='passthrough')




X_train_prepared = full_pipeline.fit_transform(X_train)
X_train_prepared

Now when I run the code above I get the following error:

<ipython-input-68-a83b52eb52ee> in <lambda>(x)
      8     def transform(self, X, y=None):
      9         if self.modify:
---> 10             X['Cabin'] = X['Cabin'].apply(lambda x: x[0])
     11         return X

TypeError: 'float' object is not subscriptable

This is because the Cabin column still has NaN values, and you can't subscript NaN. But in my pipeline the empty Cabin values should first be filled with 'U' by the SimpleImputer before cabin_pipeline runs. Why is this still happening?

A couple of questions:

How do I write a custom transformer, and what should I return from the transform method?
Why isn't the pipeline running in order, i.e. first filling in the empty values and then taking the first character of the Cabin column?

Any help is greatly appreciated!

There is 1 answer

**kalbarena** · Answer 1 · 2020-03-12T14:05:22+00:00

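ColumnTransformer does not chain its transformers. Each entry is fitted and applied independently to the columns it selects from the original input, and the results are concatenated side by side. The output of one entry is never passed to the next, so the 'cabin' entry still sees the NaN values even though the 'fill' entry imputes them.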
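
This behaviour can be seen in a minimal sketch, assuming a made-up one-column frame: two entries select the same column, and the second one still receives the original, unfilled values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing value
toy = pd.DataFrame({'Cabin': pd.Series(['C85', np.nan, 'E46'], dtype=object)})

ct = ColumnTransformer([
    ('fill', SimpleImputer(strategy='constant', fill_value='U'), ['Cabin']),
    ('raw', 'passthrough', ['Cabin']),  # second entry selecting the same column
])

out = ct.fit_transform(toy)
print(out)  # first column is imputed, second column still contains NaN
```

The output has two columns because each entry contributes its own result; nothing is overwritten.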

Each entry in the ColumnTransformer should therefore cover a set of columns rather than a single transformation step. If you need multiple transformations on one column, pack them into a Pipeline first and then pass that Pipeline to the ColumnTransformer as one entry.
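
A minimal sketch of that layout for the Cabin column, on made-up toy data. FirstLetter is a hypothetical helper written for this example, not part of scikit-learn. It also illustrates the usual contract for a custom transformer: fit returns self, and transform returns a 2-D array-like with one row per input row.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class FirstLetter(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep only the first character of every value."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        # Works on arrays as well as DataFrames: by default SimpleImputer
        # hands the next step a NumPy array without column names
        X = np.asarray(X, dtype=object)
        return np.array([[str(v)[0] for v in row] for row in X], dtype=object)


# Hypothetical toy data standing in for the Titanic columns
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 40.0],
    'Cabin': pd.Series(['C85', np.nan, 'E46'], dtype=object),
})

# All Cabin steps run in order inside one Pipeline
cabin_pipeline = Pipeline([
    ('fill_U', SimpleImputer(strategy='constant', fill_value='U')),
    ('first_letter', FirstLetter()),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

full_pipeline = ColumnTransformer([
    ('age', SimpleImputer(strategy='median'), ['Age']),
    ('cabin', cabin_pipeline, ['Cabin']),
], sparse_threshold=0)  # always return a dense array, for easier inspection

prepared = full_pipeline.fit_transform(toy)
print(prepared)  # one Age column followed by one-hot columns for C, E, U
```

Because the imputer now runs inside the same Pipeline as the first-letter step, the NaN values are already replaced by 'U' when the subscript happens.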

Hope it helps!

I found some good examples here: https://towardsdatascience.com/using-columntransformer-to-combine-data-processing-steps-af383f7d5260
