What is the right time to perform train_test_split when building a model with text and categorical features?

Question

Asked by Sandeep Maurya on 30 September 2020 at 17:04 · 487 views

I am trying to train a model that takes a mixture of numerical, categorical, and text features. My question is: which of the following should I do to vectorize my text and categorical features?

1. Split my data into train, cv, and test sets before vectorizing, i.e. call vectorizer.fit(train), then vectorizer.transform(cv) and vectorizer.transform(test).
2. Call vectorizer.fit_transform on the entire dataset.

My goal is to hstack all of the above features and apply Naive Bayes. I think I should split my data into train and test sets before this point, in order to find the optimal hyperparameters for NB.

Please share your thoughts on this. I am new to data science.

There are 2 answers

**whege** · Answer 1 · 2020-09-30T17:08:45+00:00

If you are going to fit anything to the data, such as an imputer or a StandardScaler, I recommend doing that after the split, so that no information from the test set leaks into your training set. However, formatting, simple transformations of the data, and one-hot encoding can be done safely on the entire dataset, which saves some extra work.
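As a rough illustration of fitting after the split, here is a minimal sketch using scikit-learn and SciPy. The toy texts, categories, and labels are invented for the example, and `alpha=1.0` is just a placeholder value rather than a tuned hyperparameter. It fits the vectorizer and the encoder on the training rows only, then reuses them to transform the test rows before stacking.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration only.
texts = np.array([
    "cheap pills buy now", "meeting at noon", "buy cheap watches",
    "lunch at noon tomorrow", "cheap offer buy today",
    "project meeting tomorrow", "buy now cheap deal", "notes from the meeting",
])
categories = np.array([["promo"], ["work"], ["promo"], ["work"],
                       ["promo"], ["work"], ["promo"], ["work"]])
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Split first, so nothing fitted below ever sees the test rows.
(text_train, text_test, cat_train, cat_test,
 y_train, y_test) = train_test_split(
    texts, categories, labels,
    test_size=0.25, random_state=0, stratify=labels)

vectorizer = CountVectorizer()
# handle_unknown="ignore" encodes categories unseen in training as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform on the training rows, transform only on the test rows.
X_train = hstack([vectorizer.fit_transform(text_train),
                  encoder.fit_transform(cat_train)]).tocsr()
X_test = hstack([vectorizer.transform(text_test),
                 encoder.transform(cat_test)]).tocsr()

# Counts and one-hot columns are non-negative, as MultinomialNB requires.
model = MultinomialNB(alpha=1.0).fit(X_train, y_train)
test_predictions = model.predict(X_test)
```

The same pattern extends to a separate cv split: transform it with the already fitted vectorizer and encoder, and use it to compare values of `alpha`.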

**Samruddhi Chitnis** · Answer 2 · 2020-09-30T17:58:17+00:00

I think you should go with the 2nd option, i.e. vectorizer.fit_transform on the entire data. If you split the data first, some values that appear in the test set may not appear in the training set, and in that case some classes may remain unrecognised.
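For reference, the unseen-value case can be checked directly. The sketch below uses invented toy strings and assumes scikit-learn's `CountVectorizer` and `OneHotEncoder` with `handle_unknown="ignore"`. It fits both on a small "training" sample and then transforms values that never appeared in it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Fit on a tiny "training" sample (toy values, for illustration only).
vectorizer = CountVectorizer().fit(["red apple", "green apple"])
encoder = OneHotEncoder(handle_unknown="ignore").fit([["red"], ["green"]])

# Transform inputs made entirely of words/categories unseen during fit.
unseen_text = vectorizer.transform(["purple plum"])
unseen_category = encoder.transform([["purple"]])

# Both rows keep the training-time column layout and contain only zeros.
print(unseen_text.toarray())
print(unseen_category.toarray())
```

In this sketch neither transformer raises an error: unseen tokens and categories are simply dropped, so the feature matrix keeps the same columns it had at training time.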
