I'm new to machine learning, so I apologize ahead of time if the question is silly and my code doesn't make much sense.
I have downloaded the Titanic dataset from Kaggle and want to create a classifier model to predict whether or not a passenger survives, based on a few important features. Some of the important features are:
- Name
- Sex
- Age
- Cabin
- Ticket
- Embarked
Some of these features have empty values, and others I want to modify before I one-hot encode them (for example, I want the Name column to contain just the title, 'Mr' or 'Mrs', instead of the full name).
I know how to do all of this with normal pandas operations, i.e. fillna() etc. But I want to create a Pipeline and a ColumnTransformer so that I can prepare the data quickly, rather than retyping the same code every time.
For the empty values in the 'Age' column I will use a SimpleImputer to fill them with the median, and for 'Embarked' and 'Cabin' I will fill the empty values with 'U':
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Fill empty age values with median
median_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median'))
])

# Fill cabin and embarked with 'U'
fill_pipeline = Pipeline([
    ('fill_U', SimpleImputer(strategy='constant', fill_value='U'))
])
Here I also create a custom transformer to modify the names so that they are just 'Mr', 'Mrs', etc. This code is probably wrong; I'm not sure what I need to return from transform:
from sklearn.base import BaseEstimator, TransformerMixin

class change_name(BaseEstimator, TransformerMixin):
    def __init__(self, modify_name=True):
        self.modify_name = modify_name

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if self.modify_name:
            X['Name'] = X['Name'].apply(lambda x: x.split()[1].replace('.', ''))
        return X
Next I create a new column that records whether a passenger's ticket ID is shared with someone else, i.e. whether they are travelling alone:
from sklearn.base import BaseEstimator, TransformerMixin

class create_multiple_ticket(BaseEstimator, TransformerMixin):
    def __init__(self, change=True):
        self.change = change

    def fit(self, X, y=None):
        # Learn which ticket IDs appear more than once in the training data
        counts = X['Ticket'].value_counts()
        self.tickets_ = set(counts[counts > 1].index)
        return self

    def transform(self, X, y=None):
        X = X.copy()  # do not modify the caller's DataFrame
        if self.change:
            # Creating a new column of boolean values
            X['multiple_tickets'] = X['Ticket'].isin(self.tickets_)
        return X.drop('Ticket', axis=1)
Next I shorten each Cabin value to just its first letter:
class modify_cabin(BaseEstimator, TransformerMixin):
    def __init__(self, modify=True):
        self.modify = modify

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if self.modify:
            X['Cabin'] = X['Cabin'].apply(lambda x: x[0])
        return X
Then I wrap each custom transformer in its own pipeline:
ticket_pipeline = Pipeline([
    ('modify_ticket', create_multiple_ticket())
])

name_pipeline = Pipeline([
    ('modify_name', change_name())
])

cabin_pipeline = Pipeline([
    ('modify_cabin', modify_cabin())
])
Lastly, I combine everything in a ColumnTransformer:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

median_attributes = ['Age']
fill_attributes = ['Cabin', 'Embarked']
onehot_encoded_attributes = ['Cabin', 'Name', 'Embarked', 'Sex']

full_pipeline = ColumnTransformer([
    ('age', median_pipeline, median_attributes),  # fill empty values with age median
    ('fill', fill_pipeline, fill_attributes),  # fill 'Cabin' and 'Embarked' with 'U value'
    ('name', name_pipeline, ['Name']),  # Converts name column to just 'Mr', 'Miss' etc...
    ('ticket', ticket_pipeline, ['Ticket']),  # Creates a new column called multiple tickets and returns true or false
    ('cabin', cabin_pipeline, ['Cabin']),  # Cabin column gets modified to just one letter.
    ('categorical', OneHotEncoder(), onehot_encoded_attributes)  # One_hot encodes the attributes
], remainder='passthrough')

X_train_prepared = full_pipeline.fit_transform(X_train)
X_train_prepared
Now when I run the code above I get the following error:
<ipython-input-68-a83b52eb52ee> in <lambda>(x)
      8     def transform(self, X, y=None):
      9         if self.modify:
---> 10             X['Cabin'] = X['Cabin'].apply(lambda x: x[0])
     11         return X

TypeError: 'float' object is not subscriptable
This happens because the Cabin column still contains NaN values, and you can't subscript NaN. But in my pipeline the empty Cabin values should first be filled with 'U' by the SimpleImputer, before cabin_pipeline runs. Why is it still doing this?
A couple of questions:
- How do I write a custom transformer, and what should I return from the transform method?
- Why isn't the pipeline going in order, i.e. first filling in the empty values and then taking the first character of the Cabin column?
Any help is greatly appreciated!
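ColumnTransformer does not chain its transformers. Each one is applied independently to the original columns you list for it, and the outputs are concatenated side by side, so your 'cabin' step receives the raw Cabin column (NaN included), not the output of the 'fill' step.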
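To see this concretely, here is a minimal sketch on a made-up three-row frame. The data and the helper name is_missing are my own assumptions for illustration, not from your post. Two transformers are pointed at the same Cabin column: one imputes, the other only reports which values it received as missing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer

# Toy stand-in for the Titanic data (values are made up).
df = pd.DataFrame({"Cabin": pd.Series(["C85", np.nan, "E46"], dtype=object)})

def is_missing(X):
    # Hypothetical helper: flag the values this transformer was handed as missing.
    return pd.DataFrame(X).isna().to_numpy()

ct = ColumnTransformer([
    ("fill", SimpleImputer(strategy="constant", fill_value="U"), ["Cabin"]),
    ("check", FunctionTransformer(is_missing), ["Cabin"]),
])

out = ct.fit_transform(df)
filled = list(out[:, 0])                         # output of the imputer
seen_missing = [bool(v) for v in out[:, 1]]      # what the second transformer saw
print(filled, seen_missing)
```

The second column still flags the middle row as missing, even though the imputer is listed first. That is the same situation your modify_cabin step is in.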
Each entry in the ColumnTransformer should handle a particular set of columns, rather than being one step in a sequence. If you need multiple transformations on one column, pack them into a Pipeline first and then pass that Pipeline to the ColumnTransformer.
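A rough sketch of what that could look like for Cabin and Age, assuming a toy frame in place of your X_train. The first_letter helper, the use of FunctionTransformer instead of your modify_cabin class, and sparse_threshold=0 (only there to get a plain dense array back) are my choices for brevity, not the only way to do it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

def first_letter(X):
    # X arrives from the imputer as a 2D array of strings; keep the first character.
    return np.array([[str(v)[0] for v in row] for row in np.asarray(X)], dtype=object)

# All Cabin steps live in ONE pipeline, so they run in order: fill, shorten, encode.
cabin_pipeline = Pipeline([
    ("fill_U", SimpleImputer(strategy="constant", fill_value="U")),
    ("first_letter", FunctionTransformer(first_letter)),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

age_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
])

preprocess = ColumnTransformer([
    ("age", age_pipeline, ["Age"]),
    ("cabin", cabin_pipeline, ["Cabin"]),
], sparse_threshold=0)  # always return a dense array

# Made-up rows standing in for the real training data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": pd.Series(["C85", np.nan, "E46"], dtype=object),
})

prepared = preprocess.fit_transform(toy)
print(prepared)
```

Each column now appears in exactly one entry of the ColumnTransformer, and the ordering you wanted is expressed inside that entry's Pipeline. Embarked, Name and Sex could get their own pipelines in the same way.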
Hope it helps!
I found some good examples here: https://towardsdatascience.com/using-columntransformer-to-combine-data-processing-steps-af383f7d5260
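On the first question (what transform should return): it should return the transformed data with the same number of rows as its input, as a 2D array or DataFrame, and ideally without modifying the object it was given. A minimal sketch under those assumptions follows. The class name TitleExtractor and the regular expression are hypothetical, and the regex assumes names shaped like 'Last, Title. First'.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class TitleExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: reduce 'Last, Title. First' names to the title."""

    def __init__(self, column="Name"):
        # Store constructor arguments unchanged so the estimator can be cloned.
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn here. A stateful transformer would compute its
        # values in fit and store them on self for transform to use.
        return self

    def transform(self, X):
        X = X.copy()  # work on a copy, not on the caller's DataFrame
        # Capture the text between the comma and the first period.
        X[self.column] = X[self.column].str.extract(r",\s*([^.]+)\.", expand=False)
        return X

# Made-up sample in the same shape as the Kaggle Name column.
names = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]})

titles = TitleExtractor().fit_transform(names)
print(titles["Name"].tolist())
```

Anything that depends on the training data (such as which tickets are shared) belongs in fit, so that transform applies the same learned values to the test set.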