Sklearn ColumnTransformer output has more columns than expected, and some columns are not transformed


I am building a scikit-learn pipeline. I downloaded a dataset from an online ML repository and generated descriptive stats for it. I am using the processed.cleveland.data dataset found here: https://archive.ics.uci.edu/dataset/45/heart+disease.

I added the column names manually and converted the numeric codes of the categorical columns to strings. I converted the DataFrame to a NumPy array to separate the predictor and target variables. After that, I retrieved the lists of numeric and categorical columns for the pipeline.

I then build the pipeline and print the transformed result.

The result contains extra columns that I did not expect. Why are there extra columns other than the ones generated by OneHotEncoder?

Ideally, my output would contain the same number of columns as the original dataset, with the transformations (SimpleImputer) applied, plus the columns generated by OneHotEncoder for the categorical variables. Instead, the normalized columns still contain nulls, while the copies of the original columns contain the imputed median.

Could someone please let me know the issues?

import pandas as pd
import numpy as np
import os
from pathlib import Path

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

url = ...
names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang',
         'oldpeak', 'slope', 'ca', 'thal', 'num']

def getData():
    return pd.read_csv(url, sep=',', names=names)

input = getData()
print(input.info())
print(input.describe())

array = input.values
X = array[:,0:13]
y = array[:,13]

dataframe = pd.DataFrame.from_records(X)
dataframe[[1, 2, 5, 6, 8]] = dataframe[[1, 2, 5, 6, 8]].astype(str)


numerical = dataframe.select_dtypes(include=['int64', 'float64']).columns
categorical = dataframe.select_dtypes(include=['object', 'bool']).columns

print(numerical)
print(categorical)

t = [
    ('cat0', SimpleImputer(strategy='most_frequent'), [1, 2, 5, 6, 8]),
    ('cat1', OneHotEncoder(), categorical),
    ('num0', SimpleImputer(strategy='median'), numerical),
    ('num1', MinMaxScaler(), numerical),
]
column_transforms = ColumnTransformer(transformers=t)

pipeline = Pipeline(steps=[('t', column_transforms)])
result = pipeline.fit_transform(dataframe)

print(type(pd.DataFrame.from_records(result)))
print(pd.DataFrame.from_records(result).to_string())

I was expecting the DataFrame to be returned with its columns in the original order (after SimpleImputer and MinMaxScaler), plus the new variables created by OneHotEncoder.
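
For comparison, here is a minimal sketch of the two layouts on a small made-up frame. The `toy` data and its column names are hypothetical and are not the Cleveland file. It assumes the extra columns arise because ColumnTransformer runs each listed transformer independently on the original columns and concatenates the outputs, and that chaining the steps per column group with a nested Pipeline is closer to what I want. `sparse_threshold=0` is only there to force a dense array so the shapes are easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data standing in for a few heart-disease columns.
toy = pd.DataFrame({
    "age": [63.0, 67.0, np.nan, 41.0],
    "chol": [233.0, np.nan, 250.0, 204.0],
    "sex": ["1.0", "1.0", "0.0", "0.0"],
})
numerical = ["age", "chol"]
categorical = ["sex"]

# Flat layout, as in the question: four independent transformers.
# Each one receives the ORIGINAL columns, and the outputs are concatenated.
flat = ColumnTransformer(
    transformers=[
        ("cat0", SimpleImputer(strategy="most_frequent"), categorical),
        ("cat1", OneHotEncoder(), categorical),
        ("num0", SimpleImputer(strategy="median"), numerical),
        ("num1", MinMaxScaler(), numerical),
    ],
    sparse_threshold=0,  # always return a dense array
)
flat_result = flat.fit_transform(toy)

# Nested layout: one Pipeline per column group, so the imputed
# values are what the encoder or scaler actually sees.
nested = ColumnTransformer(
    transformers=[
        ("cat", Pipeline(steps=[
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical),
        ("num", Pipeline(steps=[
            ("impute", SimpleImputer(strategy="median")),
            ("scale", MinMaxScaler()),
        ]), numerical),
    ],
    sparse_threshold=0,
)
nested_result = nested.fit_transform(toy)

print(flat_result.shape)    # imputed copy + one-hot + imputed copies + scaled copies
print(nested_result.shape)  # one-hot columns + scaled numeric columns only
```

In this sketch the flat layout yields one block of columns per transformer, and the scaled block still carries the NaN values because MinMaxScaler never sees the imputed data. Note that the output column order follows the order of the `transformers` list, not the order of the original DataFrame.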
