What is the right time to perform train_test_split when building a model with text and categorical features?


I am trying to train a model which takes a mixture of numerical, categorical and text features. My question is which one of the following should I do for vectorizing my text and categorical features?

  1. Split my data into train, cv and test sets before vectorizing the features, i.e. use vectorizer.fit(train), then vectorizer.transform(cv) and vectorizer.transform(test)
  2. Use vectorizer.fit_transform on the entire data

My goal is to hstack all of the above features and apply Naive Bayes. I think I should split my data into train and test sets before this point, in order to find the optimal hyperparameter for NB.

Please share some thoughts on this. I am new to data science.


There are 2 answers

0
whege On

If you are going to fit anything to the data, such as an imputer or a StandardScaler, I recommend doing that after the split, so that no information from the test set leaks into your training set. However, formatting and simple transformations of the data, as well as one-hot encoding, can usually be done safely on the entire dataset, which saves some extra work.
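As a rough illustration of fitting after the split, here is a minimal sketch using scikit-learn. The toy data, the variable names, and the choice of CountVectorizer, OneHotEncoder and MultinomialNB are assumptions made for the example, not something taken from the question. Each transformer is fitted on the training rows only and then reused to transform the test rows before the blocks are stacked.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one text, one categorical and one numerical feature.
texts = np.array([
    "cheap pills online", "meeting at noon", "win money now", "lunch tomorrow",
    "free prize inside", "project status update", "claim your reward", "see you at dinner",
])
cats = np.array([["web"], ["mail"], ["web"], ["mail"],
                 ["web"], ["mail"], ["web"], ["mail"]])
nums = np.array([[3.0], [1.0], [4.0], [0.0], [5.0], [1.0], [2.0], [0.0]])  # non-negative for MultinomialNB
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Split row indices first, so every feature block uses the same partition.
idx_train, idx_test = train_test_split(
    np.arange(len(y)), test_size=0.25, random_state=0, stratify=y
)

text_vec = CountVectorizer()
cat_enc = OneHotEncoder(handle_unknown="ignore")

# Fit on the training rows only, then stack the sparse blocks column-wise.
X_train = hstack([
    text_vec.fit_transform(texts[idx_train]),
    cat_enc.fit_transform(cats[idx_train]),
    csr_matrix(nums[idx_train]),
]).tocsr()

# Only transform the test rows; nothing is fitted on them.
X_test = hstack([
    text_vec.transform(texts[idx_test]),
    cat_enc.transform(cats[idx_test]),
    csr_matrix(nums[idx_test]),
]).tocsr()

model = MultinomialNB(alpha=1.0).fit(X_train, y[idx_train])
pred = model.predict(X_test)
```

The same pattern extends to a separate cv set: transform it with the already fitted transformers and use it to compare values of alpha.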

0
Samruddhi Chitnis On

I think you should go with the 2nd option, i.e. vectorizer.fit_transform on the entire data. If you split the data first, some values that appear in the test set may not appear in the training set, and in that case those classes may remain unrecognised.
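The concern about unseen values can be checked on a small example. The sketch below uses hypothetical toy data and assumes scikit-learn's CountVectorizer and OneHotEncoder. It shows what happens when transformers fitted on training data meet a word or category they have not seen: the unknown word is dropped, and with handle_unknown="ignore" the unknown category becomes an all-zero row.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test text; "purple" never appears in training.
train_text = ["red apple", "green apple"]
test_text = ["purple apple"]

vec = CountVectorizer().fit(train_text)
# Words outside the fitted vocabulary are skipped at transform time.
test_counts = vec.transform(test_text).toarray()

# Hypothetical categorical feature; "medium" never appears in training.
enc = OneHotEncoder(handle_unknown="ignore").fit([["small"], ["large"]])
# With handle_unknown="ignore", an unseen category is encoded as all zeros.
unseen = enc.transform([["medium"]]).toarray()

print(test_counts)
print(unseen)
```

Without handle_unknown="ignore", OneHotEncoder raises an error on an unseen category by default, so the setting has to be chosen explicitly when fitting on the training split only.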