Deduplicating content by removing similar rows of text in Python


I'm fairly new to Python. While I know it's possible to deduplicate rows in Pandas with drop_duplicates for identical text results, is there a way to drop similar rows of text?

E.g. for this fictional collection of online article headlines, populated in chronological order:

1 "The dog ate my homework" says confused child in Banbury

2 Confused Banbury child says dog ate homework

3 Why are dogs so cute

4 Teacher in disbelief as child says dog ate homework - Banbury Times

5 Dogs don't like eggs, here's why

6 The moment a senior stray is adopted - try not to cry

7 Dog smugglers in Banbury arrested in police sting operation

My ideal outcome would be that only rows 1, 3, 5, 6 and 7 remain: rows 1, 2 and 4 would be grouped for similarity, and only row 1, the oldest ('first') entry, would be kept.

(How) could I get there? Even advice purely about the grouping approach would be very helpful. I would want to be able to run this on hundreds of rows of text, without having a specific, manually pre-determined article or headline to measure similarity against, just group similar rows.

Thank you so much for your thoughts and time!


There is 1 answer

roddar92 On

You can try to turn your text into vectors with doc2vec (example of usage), then cluster the vectors by cosine distance using k-medoids or hierarchical algorithms.
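To make the idea concrete, here is a minimal sketch of the "vectorise, then group by cosine similarity" approach using only the standard library. It is not doc2vec and not k-medoids: plain word-count vectors are swapped in for the embeddings, and a simple greedy pass (keep a row only if it is not too similar to any earlier kept row) stands in for a real clustering algorithm. The function names and the 0.5 threshold are assumptions chosen for this example, and the threshold would likely need tuning on real data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_similar(rows, threshold=0.5):
    """Return the indices of rows kept after dropping near-duplicates.

    Rows are processed in order, so the earliest row of each group of
    similar rows is the one that survives. The threshold is an assumed
    value, not a recommended default.
    """
    kept = []  # list of (index, vector) for rows kept so far
    for index, text in enumerate(rows):
        vector = Counter(tokenize(text))
        # Keep the row only if it is dissimilar to every earlier kept row.
        if all(cosine_similarity(vector, kept_vector) < threshold
               for _, kept_vector in kept):
            kept.append((index, vector))
    return [index for index, _ in kept]


headlines = [
    '"The dog ate my homework" says confused child in Banbury',
    "Confused Banbury child says dog ate homework",
    "Why are dogs so cute",
    "Teacher in disbelief as child says dog ate homework - Banbury Times",
    "Dogs don't like eggs, here's why",
    "The moment a senior stray is adopted - try not to cry",
    "Dog smugglers in Banbury arrested in police sting operation",
]

kept_indices = drop_similar(headlines)
print([i + 1 for i in kept_indices])  # prints [1, 3, 5, 6, 7]
```

If the headlines live in a pandas DataFrame, the returned positions can be passed to `df.iloc[kept_indices]` to keep only those rows. Word counts only catch headlines that share literal words, so for paraphrases with different vocabulary you would want to replace `Counter(tokenize(text))` with real embeddings such as doc2vec vectors, keeping the same cosine comparison.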