I've been trying to run some ML code, but it keeps failing at the fitting stage, after the pipeline is built. I've looked around on various forums without much luck. What I've discovered is that some people say you can't use LabelEncoder within a pipeline; I'm not sure how true that is. If anyone has any insights on the matter, I'd be very happy to hear them.
I keep getting this error:
TypeError: fit_transform() takes 2 positional arguments but 3 were given
And so I'm not sure if the problem comes from me or from Python. Here's my code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from category_encoders import TargetEncoder, CatBoostEncoder

data = pd.read_csv("ks-projects-201801.csv",
                   index_col="ID",
                   parse_dates=["deadline", "launched"],
                   infer_datetime_format=True)
var = list(data)   # list of column names
data = data.drop(labels=[1014746686, 1245461087, 1384087152, 1480763647,
                         330942060, 462917959, 69489148])   # drop specific rows by ID
missing = [i for i in var if data[i].isnull().any()]   # columns containing any NaNs
data = data.dropna(subset=missing, axis=0)
le = LabelEncoder()
oe = OrdinalEncoder()
oh = OneHotEncoder()
y = data[var.pop(8)]   # pop "state" out of the feature list and use it as the target
p = pd.Series(le.fit_transform(y), index=y.index)   # label-encoded target as a Series
q = pd.read_csv("y.csv",index_col="ID")["0"]
label_y = le.fit_transform(y)
x = data[var]
obj_feat = x.select_dtypes(include="object")
dat_feat = x.select_dtypes(include="datetime64[ns]")
dat_feat = dat_feat.assign(dmonth=dat_feat.deadline.dt.month.astype("int64"),
                           dyear=dat_feat.deadline.dt.year.astype("int64"),
                           lmonth=dat_feat.launched.dt.month.astype("int64"),
                           lyear=dat_feat.launched.dt.year.astype("int64"))
dat_feat = dat_feat.drop(labels=["deadline","launched"],axis=1)
num_feat = x.select_dtypes(include=["int64","float64"])
u = {i: len(obj_feat[i].unique()) for i in obj_feat}   # cardinality of each object column
le_obj = [i for i in u if u[i] < 10]        # low cardinality: label encoding
oh_obj = [i for i in u if 10 < u[i] < 20]   # medium: one-hot encoding
te_obj = [i for i in u if 20 < u[i] < 25]   # higher: target encoding
cb_obj = [i for i in u if u[i] > 100]       # very high: CatBoost encoding
# Pipeline time
#Impute and encode
strat = ["constant","most_frequent","mean","median"]
sc = StandardScaler()
oh_unk = "ignore"
encoders = [LabelEncoder(),
            OneHotEncoder(handle_unknown=oh_unk),
            TargetEncoder(),
            CatBoostEncoder()]
#num_trans = Pipeline(steps=[("imp",SimpleImputer(strategy=strat[2])),
num_trans = Pipeline(steps=[("sc",sc)])
#obj_imp = Pipeline(steps=[("imp",SimpleImputer(strategy=strat[1]))])
oh_enc = Pipeline(steps=[("oh_enc",encoders[1])])
te_enc = Pipeline(steps=[("te_enc",encoders[2])])
cb_enc = Pipeline(steps=[("cb_enc",encoders[0])])
trans = ColumnTransformer(transformers=[
    ("num", num_trans, list(num_feat) + list(dat_feat)),
    #("obj", obj_imp, list(obj_feat)),
    ("onehot", oh_enc, oh_obj),
    ("target", te_enc, te_obj),
    ("catboost", cb_enc, cb_obj)
])
models = [RandomForestClassifier(random_state=0),
          KNeighborsClassifier(),
          DecisionTreeClassifier(random_state=0)]
model = models[2]
print("Check 4")
# Chaining it all together
run = Pipeline(steps=[("Transformation",trans),("Model",model)])
x = pd.concat([obj_feat,dat_feat,num_feat],axis=1)
print("Check 5")
run.fit(x,p)
It runs fine until run.fit, where it throws the error above. I'd love to hear any advice anyone might have, and any possible ways to resolve this problem would also be greatly appreciated! Thank you.
The problem is the same as the one spotted in this answer, but with a LabelEncoder in your case. (Note that your "catboost" pipeline step is actually built from encoders[0], which is the LabelEncoder, not the CatBoostEncoder.) The LabelEncoder's fit_transform method takes only two positional arguments:

fit_transform(self, y)

whereas Pipeline expects all of its transformers to take three positional arguments:

fit_transform(self, X, y)

That mismatch is exactly what the TypeError is complaining about.
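You can reproduce the mismatch outside the pipeline; here's a minimal sketch with made-up toy arrays:

import numpy as np
from sklearn.preprocessing import LabelEncoder

X = np.array(["a", "b", "a"])  # made-up toy feature column
y = np.array([0, 1, 0])        # made-up toy target

le = LabelEncoder()
le.fit_transform(X)      # fine: LabelEncoder expects a single array of labels
le.fit_transform(X, y)   # TypeError: fit_transform() takes 2 positional
                         # arguments but 3 were given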
You could make a custom transformer as in the aforementioned answer (a sketch follows below); however, a LabelEncoder should not be used as a feature transformer in the first place. An extensive explanation of why can be found in LabelEncoder for categorical features?.
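For completeness, here is a minimal sketch of such a wrapper, assuming one text column per transformer; the class name PipelineLabelEncoder is my own invention, not a scikit-learn API:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder

class PipelineLabelEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper giving LabelEncoder a Pipeline-compatible signature."""
    def __init__(self):
        self.encoder = LabelEncoder()

    def fit(self, X, y=None):
        # Accept (X, y) like any other transformer, but pass only the
        # flattened single column on to the underlying LabelEncoder.
        self.encoder.fit(np.asarray(X).ravel())
        return self

    def transform(self, X):
        # Return a 2-D array so ColumnTransformer can stack the output.
        return self.encoder.transform(np.asarray(X).ravel()).reshape(-1, 1)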
So I'd recommend not using a LabelEncoder here, and instead using one of the Bayesian encoders when a categorical feature's cardinality gets too high, such as the TargetEncoder, which you also have in your list of encoders.
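In your specific code, the quickest fix is probably just to stop putting encoders[0] into the "catboost" branch. A sketch, untested against your data:

from sklearn.pipeline import Pipeline
from category_encoders import CatBoostEncoder

# Both category_encoders transformers implement fit(X, y) / transform(X),
# so Pipeline can pass the target through without a signature clash.
cb_enc = Pipeline(steps=[("cb_enc", CatBoostEncoder())])  # was encoders[0], a LabelEncoder

With that change every branch of the ColumnTransformer exposes the signature Pipeline expects, and run.fit(x, p) should get past the transformation step.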