Dropping duplicated rows


Which is more correct: dropping duplicated rows before or after splitting the data into train and test sets?

On the one hand, dropping them after the split seems better: the model will not be biased toward the duplicated rows during training, and the test score will not be falsely high. On the other hand, when I split the data I drop the label column, which normally increases the number of duplicated rows. So which is best?


There is 1 answer

Reza On

Absolutely before splitting the data.

If you don't, one copy of a duplicated row can end up in the train subset and another in the test subset. You then have identical samples in both subsets, so the model is evaluated on rows it has already seen during training, which biases the results.
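As a rough illustration of the order of operations, here is a minimal sketch using only the Python standard library. The toy rows and the helper names `drop_duplicate_rows` and `split_rows` are made up for this example and are not from the question.

```python
import random

def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: (feature_1, feature_2, label); every row appears twice.
rows = [(i % 10, (i % 10) * 2, i % 2) for i in range(20)]

# Deduplicate first, then split: no row can land in both subsets.
unique_rows = drop_duplicate_rows(rows)
train, test = split_rows(unique_rows)
overlap = set(train) & set(test)

# Splitting the raw data instead allows copies of a row on both sides.
leaky_train, leaky_test = split_rows(rows)
leaky_overlap = set(leaky_train) & set(leaky_test)

print(len(unique_rows), len(overlap), len(leaky_overlap))
```

With a pandas DataFrame, the same idea would presumably be `df.drop_duplicates()` followed by scikit-learn's `train_test_split`, with the deduplication done on the full rows before the label column is separated.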

Even if all copies of a duplicate stay in one subset, they still bias the results, because each copy is counted separately. If the prediction on them is good, they inflate your metrics, and if the prediction is not good, they worsen your metrics.
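A small worked example of that effect on accuracy, with made-up numbers: assume a test set of four unique rows, three of which the model predicts correctly, and then add four extra copies of a single row.

```python
def accuracy(correct_flags):
    """Fraction of test rows that were predicted correctly."""
    return sum(correct_flags) / len(correct_flags)

# Hypothetical outcome on 4 unique test rows: 3 correct, 1 wrong.
unique_flags = [True, True, True, False]
base = accuracy(unique_flags)                    # 3 / 4

# Four extra copies of a correctly predicted row.
inflated = accuracy(unique_flags + [True] * 4)   # 7 / 8

# Four extra copies of the wrongly predicted row.
deflated = accuracy(unique_flags + [False] * 4)  # 3 / 8

print(base, inflated, deflated)
```

The model and the unique rows are the same in all three cases; only the number of copies changes the score.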