Dropping duplicated rows

Question

18 views · Asked by Toqa Ghozlan on 25 March 2024 at 11:20

Which is more correct: dropping duplicated rows before or after splitting the data into train and test sets?

On the one hand, dropping them after the split seems better: in training, the model will not be biased toward the duplicates, and in testing there is no misleadingly high performance. On the other hand, when I split the data I drop the label column, which normally increases the number of duplicated rows. So which is best?

There is 1 answer

**Reza** · Answer 1 · 2024-03-25T11:25:54+00:00

Absolutely before splitting the data.

If you don't do that, one copy of a duplicated row can end up in the test subset and another in the train subset. You then have identical samples in both subsets, so the model is evaluated on rows it has already seen during training, which biases the results.
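To make that concrete, here is a minimal sketch of "deduplicate first, then split", written in plain Python so it runs without extra packages. The toy rows, the `dedupe_then_split` helper, and the 25% test fraction are all made up for illustration. It also assumes each row still contains its label, so two rows only count as duplicates when both the features and the label match.

```python
import random


def dedupe_then_split(rows, test_fraction=0.25, seed=0):
    """Drop exact duplicate rows, then split into (train, test) lists.

    Hypothetical helper for illustration, not a library function.
    """
    # dict.fromkeys keeps the first occurrence of each row, in order.
    unique_rows = list(dict.fromkeys(rows))

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(unique_rows)

    n_test = max(1, round(len(unique_rows) * test_fraction))
    return unique_rows[n_test:], unique_rows[:n_test]


# Hypothetical toy data: each row is (feature_1, feature_2, label).
rows = [
    (1.0, 2.0, "a"),
    (1.0, 2.0, "a"),  # exact duplicate of the first row
    (3.0, 4.0, "b"),
    (5.0, 6.0, "a"),
    (5.0, 6.0, "a"),  # exact duplicate of the fourth row
    (7.0, 8.0, "b"),
]

train, test = dedupe_then_split(rows)

# Because duplicates were removed before the split,
# no row can appear in both subsets.
overlap = set(train) & set(test)
print(len(train), len(test), overlap)
```

If your data is in a pandas DataFrame, `DataFrame.drop_duplicates()` called before the split plays the same role as the `dict.fromkeys` line here.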

Even if all the duplicates stay in one subset, they still bias the results. If the prediction on them is good, they inflate your metrics, and if it is not good, they worsen your metrics.
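A small worked example of that effect, using made-up labels and predictions (the numbers are purely illustrative, not from any real model). The same four test samples are scored three times: once as they are, once with a correctly predicted sample repeated, and once with a wrongly predicted sample repeated.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical test set with 4 distinct samples; the model is right on 2.
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 1]
acc_unique = accuracy(y_true, y_pred)

# Same set, but the first sample (predicted correctly) appears 3 extra times.
y_true_good_dup = [1, 1, 1, 1, 0, 1, 0]
y_pred_good_dup = [1, 1, 1, 1, 0, 0, 1]
acc_good_dup = accuracy(y_true_good_dup, y_pred_good_dup)

# Same set, but the third sample (predicted wrongly) appears 3 extra times.
y_true_bad_dup = [1, 0, 1, 1, 1, 1, 0]
y_pred_bad_dup = [1, 0, 0, 0, 0, 0, 1]
acc_bad_dup = accuracy(y_true_bad_dup, y_pred_bad_dup)

print(acc_unique, acc_good_dup, acc_bad_dup)
```

The model is identical in all three cases, yet accuracy moves from 2/4 to 5/7 or 2/7 depending only on which row happens to be duplicated.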
