I'm fairly new to Python. While I know it's possible to deduplicate rows in Pandas with drop_duplicates for identical text results, is there a way to drop similar rows of text?
E.g. for this fictional collection of online article headlines, populated in chronological order:
1 "The dog ate my homework" says confused child in Banbury
2 Confused Banbury child says dog ate homework
3 Why are dogs so cute
4 Teacher in disbelief as child says dog ate homework - Banbury Times
5 Dogs don't like eggs, here's why
6 The moment a senior stray is adopted - try not to cry
7 Dog smugglers in Banbury arrested in police sting operation
My ideal outcome would be that only rows 1, 3, 5, 6 and 7 remain, with rows 1, 2 and 4 grouped for similarity and only row 1, the oldest ('first') entry, kept.
(How) could I get there? Even advice purely about the grouping approach would be very helpful. I would want to be able to run this on hundreds of rows of text, without having a specific, manually pre-determined article or headline to measure similarity against, just group similar rows.
Thank you so much for your thoughts and time!
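You can try to turn your text into vectors with doc2vec (example of usage), then cluster the vectors by cosine distance with k-medoids or a hierarchical algorithm. Once every row has a cluster label, keeping the first row of each cluster gives you the result you describe.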
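To make the grouping idea concrete, here is a minimal standard-library sketch. It is not doc2vec or k-medoids: it swaps in plain word-count vectors for the embeddings and a greedy "first seen wins" pass for the clustering, still comparing rows by cosine similarity. The function names and the 0.5 threshold are assumptions chosen for this example, and the threshold would need tuning on real data.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count its words (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def first_of_each_group(texts, threshold=0.5):
    """Return positions of rows that are not similar to any earlier kept row.

    threshold is an assumed cut-off for this sketch, not a recommended value.
    """
    kept = []  # (position, vector) of the first row of each group
    for position, text in enumerate(texts):
        vector = word_counts(text)
        if all(cosine_similarity(vector, seen) < threshold for _, seen in kept):
            kept.append((position, vector))
    return [position for position, _ in kept]


headlines = [
    '"The dog ate my homework" says confused child in Banbury',
    "Confused Banbury child says dog ate homework",
    "Why are dogs so cute",
    "Teacher in disbelief as child says dog ate homework - Banbury Times",
    "Dogs don't like eggs, here's why",
    "The moment a senior stray is adopted - try not to cry",
    "Dog smugglers in Banbury arrested in police sting operation",
]

keep = first_of_each_group(headlines)
print(keep)
```

On the seven example headlines this keeps positions 0, 2, 4, 5 and 6, i.e. rows 1, 3, 5, 6 and 7. With a DataFrame df and a text column (say a hypothetical "headline" column), the same list can be used as df.iloc[first_of_each_group(df["headline"].tolist())]. Word counts only catch headlines that share actual words, so doc2vec or another embedding is still the better choice when similar rows are worded differently.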