How to encode a dataset having multiple datatypes?


I have a dataset like:

e = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'W', 'F', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})

I encoded the data using sklearn.preprocessing.LabelEncoder, with the following lines of code:

x = list(e.columns)

# Import the label encoder
from sklearn import preprocessing

# A LabelEncoder maps each distinct label to an integer.
label_encoder = preprocessing.LabelEncoder()
for i in x:
    # Encode the labels in column i.
    e[i] = label_encoder.fit_transform(e[i])
print(e)

But this encodes even the numeric columns of int type, which is not required.
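For reference, e.dtypes shows which columns hold strings (pandas object dtype) and which are numeric:

print(e.dtypes)
# col1    object
# col2     int64
# col3     int64
# col4    object
# dtype: object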

Encoded dataset:

   col1  col2  col3  col4
0     0     1     0     3
1     0     0     1     0
2     1     5     5     4
3     4     4     4     1
4     3     3     2     5
5     2     2     3     2

How can I rectify this?


There are 2 answers

paxton4416 (best answer)

One really simple possibility would be to encode only the columns with string values, e.g. by tweaking your code to:

import pandas as pd
from sklearn import preprocessing 


e = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'W', 'F', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})


label_encoder = preprocessing.LabelEncoder()
for col in e.columns:
    # 'O' is the pandas object dtype, which plain Python strings use.
    if e[col].dtype == 'O':
        e[col] = label_encoder.fit_transform(e[col])

print(e) 
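With the sample frame above, col2 and col3 are left untouched, and the printed result should look like:

   col1  col2  col3  col4
0     0     2     0     3
1     0     1     1     0
2     1     9     9     4
3     4     8     4     1
4     3     7     2     5
5     2     4     3     2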

or better yet:

import pandas as pd
from sklearn import preprocessing 


def encode_labels(ser):
    # Encode only object (string) columns; pass numeric columns through unchanged.
    if ser.dtype == 'O':
        return label_encoder.fit_transform(ser)
    return ser


label_encoder = preprocessing.LabelEncoder() 
e = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'W', 'F', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})


e_encoded = e.apply(encode_labels)
print(e_encoded)
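Note that reusing a single LabelEncoder this way refits it on every column, so only the last column's mapping is kept afterwards. If you need to invert the encoding later, a minimal variant (the encoders dict and encode_labels_keep are my additions, not part of the original answer) keeps one fitted encoder per column:

encoders = {}

def encode_labels_keep(ser):
    # Fit a fresh encoder per string column and remember it by column name.
    if ser.dtype == 'O':
        enc = preprocessing.LabelEncoder()
        encoders[ser.name] = enc
        return enc.fit_transform(ser)
    return ser

e_encoded = e.apply(encode_labels_keep)
# encoders['col1'].inverse_transform(e_encoded['col1']) recovers the original strings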
fpajot

Filtering and adapting your preprocessing to column types is the right idea, and the cleanest way to do it is with scikit-learn's ColumnTransformer. Note that LabelEncoder is meant for encoding target labels and cannot be used inside a ColumnTransformer; OrdinalEncoder is the feature-wise equivalent, so the examples below use it instead.

import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

Example 1: applying a transformer depending on the column name

my_transformer1 = ColumnTransformer(
    [
        # selecting with a list keeps the column 2-D, as OrdinalEncoder expects
        ('transformer_name_for_col1', OrdinalEncoder(), ['col1']),
        ('transformer_name_for_col2_and_col3', StandardScaler(), ['col2', 'col3'])
    ]
)

Example 2: applying a transformer depending on column type

my_transformer2 = ColumnTransformer(
    [
        ('transformer_name_for_categories', OrdinalEncoder(), make_column_selector(dtype_include=object)),
        ('transformer_name_for_numerical', StandardScaler(), make_column_selector(dtype_include=np.number))
    ]
)
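Either transformer is then fitted and applied in one call. ColumnTransformer returns a NumPy array whose columns follow the order in which the transformers were listed, so here the two encoded string columns come first:

encoded = my_transformer2.fit_transform(e)
# shape (6, 4): col1 and col4 ordinal-encoded, then col2 and col3 standardized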

Obviously, replace OrdinalEncoder and StandardScaler with the transformers of your choice, including a custom one:

from sklearn.base import BaseEstimator, TransformerMixin

class MyCustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        # set up any parameters here
        pass

    def fit(self, X, y=None):
        # learn whatever you need from X here
        return self

    def transform(self, X, y=None):
        # do something and return the transformed X
        return X

To go further, I recommend using a scikit-learn Pipeline to combine the different preprocessing steps depending on column and/or column type, which is far more flexible. See the Pipeline class documentation for details.
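As a minimal sketch of that combination (LogisticRegression is just a hypothetical stand-in for whatever estimator you actually use):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('preprocess', my_transformer2),  # encode strings, scale numbers
    ('model', LogisticRegression()),  # hypothetical downstream estimator
])
# pipe.fit(X_train, y_train) then pipe.predict(X_test)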