Apply data preprocessing to both train and test sets?


I need help understanding the data preprocessing step when building predictive models.

Below is the scenario:

I built a decision tree:

I combined files from multiple CSVs, then did steps like creating multiple categories for the target variable, converting numeric variables to categorical, adding some new calculated columns, and imputing missing values in the input variables.

Then I partitioned the data: 70 percent training and 30 percent test.

Then, in KNIME, I connected the 70 percent training output of the Partitioning node to a Decision Tree Learner node, and the 30 percent test output to a Decision Tree Predictor node, and I got 100 percent accuracy, which is weird.

After digging, I found that all the data preprocessing I did (adding extra calculated columns, imputation, converting numeric columns to categorical) has to be done prior to partitioning the data into train and test sets.
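In Python terms (I actually work in KNIME, so this sketch and its file and column names are just made up to illustrate the order I mean):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical combined file and column names, only to illustrate the order
df = pd.read_csv("combined.csv")

# Preprocessing first, on the full table:
df["target"] = pd.cut(df["score"], bins=3, labels=["low", "mid", "high"])  # derive target categories
df["grade"] = df["grade"].astype("category")                               # numeric -> categorical
df["income"] = df["income"].fillna(df["income"].median())                  # impute an input variable

# Only then partition 70/30
train, test = train_test_split(df, train_size=0.7, random_state=42)
```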

My doubt is: if I do the preprocessing after splitting, how can the predictor node know about my changes? I created the new target variable column from a numeric column in the dataset, so when I preprocess only the test data, how can I even use a target variable that doesn't exist at the predictor node?

I also read somewhere that every step I did before must be repeated on the test data. Is that really the case? Replicating every data cleaning step on the test data feels too tedious.

And even if I do the cleaning after splitting, what difference does it really make?

I'm an absolute beginner, so please help me out; answers specific to the KNIME platform would be especially helpful. Thanks in advance.


1 Answer

Answer by Vatsal Maheshwari:

You should do all preprocessing steps on both train and test. Drop your prediction column (the Y variable) from the test set, train the model on the training set, and predict on the test set. Then compare the predictions (YP) against Y to derive goodness-of-fit metrics such as accuracy, RMSE, etc.
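A minimal sketch of that workflow in scikit-learn, with placeholder data standing in for your preprocessed table:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# `df` stands in for your fully preprocessed table (placeholder data here)
df = pd.DataFrame({"x1": range(100),
                   "x2": [v % 7 for v in range(100)],
                   "target": [v % 2 for v in range(100)]})

train, test = train_test_split(df, train_size=0.7, random_state=42)

X_train, y_train = train.drop(columns=["target"]), train["target"]
X_test, y_test = test.drop(columns=["target"]), test["target"]  # Y set aside for scoring only

model = DecisionTreeClassifier().fit(X_train, y_train)
y_pred = model.predict(X_test)            # predict on test features without Y
print(accuracy_score(y_test, y_pred))     # compare predictions (YP) with Y
```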

The reason you would want to do this separately is that some computed features can contribute to information leakage. For example, if you create an index of an independent variable as X/mean(X), the denominator leaks information from the entire dataset into the new variable. Hence, it is recommended to do the preprocessing separately. If you are doing record-level transformations only, I see no harm in preprocessing both together.
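To make the X/mean(X) example concrete, here is a small sketch with a made-up income column (column name and values are placeholders):

```python
import pandas as pd

train = pd.DataFrame({"income": [30.0, 50.0, 70.0]})
test = pd.DataFrame({"income": [40.0, 60.0]})

# Fit the statistic on training data only...
train_mean = train["income"].mean()
train["income_idx"] = train["income"] / train_mean
# ...and reuse the same denominator for test, so test rows leak nothing back
test["income_idx"] = test["income"] / train_mean

# Leaky version (avoid): a mean over the combined data lets every test row
# influence the feature values seen at training time:
#   full = pd.concat([train, test])
#   full["income_idx"] = full["income"] / full["income"].mean()
```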