I am trying to train a model that takes a mixture of numerical, categorical and text features. Which of the following should I do to vectorize my text and categorical features?
- I split my data into `train`, `cv` and `test` before vectorizing the features, i.e. using `vectorizor.fit(train)`, then `vectorizor.transform(cv)` and `vectorizor.transform(test)`
- Use `vectorizor.fit_transform` on the entire data
My goal is to `hstack` all of the above features and apply Naive Bayes. I think I should split my data into train and test sets before this point, in order to find the optimal hyperparameter for NB.
Please share some thoughts on this. I am new to data science.
If you are going to fit anything to the data, such as an imputer or a `StandardScaler`, I recommend doing that after the split, so that nothing from the test set leaks into your training set. That includes a text vectorizer, since it learns its vocabulary from whatever data it is fit on. However, formatting, simple stateless transformations and one-hot encoding can generally be done safely on the entire dataset, which saves some extra work.
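To make the "fit after the split" advice concrete, here is a minimal sketch using scikit-learn and SciPy. The toy data, the manual 6/2/2 split, the choice of `CountVectorizer`, `OneHotEncoder` and `MinMaxScaler`, the helper name `build`, and the `alpha` grid are all assumptions made for illustration; they are not taken from the question. `MultinomialNB` needs non-negative inputs, so the scaled numeric column is clipped to [0, 1] after transforming.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data, invented purely for illustration.
texts = [
    "cheap pills buy now", "meeting at noon tomorrow", "win money fast now",
    "project report attached", "buy cheap watches", "lunch with the team",
    "fast cash win prize", "notes from the meeting",
    "cheap prize buy now", "team project meeting",
]
cats = [["promo"], ["work"], ["promo"], ["work"], ["promo"], ["work"],
        ["promo"], ["work"], ["promo"], ["social"]]
nums = np.array([[3.0], [0.0], [5.0], [1.0], [4.0], [0.0],
                 [6.0], [1.0], [7.0], [0.0]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Split first (a fixed split here, just to keep the sketch deterministic).
train, cv, test = slice(0, 6), slice(6, 8), slice(8, 10)

text_vec = CountVectorizer()
cat_enc = OneHotEncoder(handle_unknown="ignore")  # unseen categories become all zeros
num_scaler = MinMaxScaler()

def build(part, fit=False):
    """Vectorize one split; fit the transformers only when fit=True."""
    t, c, n = texts[part], cats[part], nums[part]
    if fit:
        text_vec.fit(t)
        cat_enc.fit(c)
        num_scaler.fit(n)
    # Values outside the training range are clipped so they stay non-negative.
    scaled = np.clip(num_scaler.transform(n), 0.0, 1.0)
    return hstack([text_vec.transform(t),
                   cat_enc.transform(c),
                   csr_matrix(scaled)]).tocsr()

X_train = build(train, fit=True)  # fit on the training split only
X_cv = build(cv)                  # transform only
X_test = build(test)              # transform only

# Pick the smoothing parameter on the cv split, never on the test split.
best_alpha, best_score = None, -1.0
for alpha in (0.01, 0.1, 1.0, 10.0):
    score = MultinomialNB(alpha=alpha).fit(X_train, y[train]).score(X_cv, y[cv])
    if score > best_score:
        best_alpha, best_score = alpha, score

final_model = MultinomialNB(alpha=best_alpha).fit(X_train, y[train])
test_accuracy = final_model.score(X_test, y[test])
print(best_alpha, test_accuracy)
```

Because every transformer is fit on `train` only, the three matrices share the same columns, and words or categories that appear only in `cv` or `test` (like `"social"` above) are simply ignored rather than influencing the model. In practice a scikit-learn `Pipeline` with a `ColumnTransformer` would likely do the same bookkeeping for you.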