Deduplicating content by removing similar rows of text in Python

Question

477 views · Asked by Gazzer on 01 December 2020 at 10:05

I'm fairly new to Python. While I know it's possible to deduplicate rows in Pandas with drop_duplicates for identical text results, is there a way to drop similar rows of text?

For example, take this fictional collection of online article headlines, listed in chronological order:

1 "The dog ate my homework" says confused child in Banbury

2 Confused Banbury child says dog ate homework

3 Why are dogs so cute

4 Teacher in disbelief as child says dog ate homework - Banbury Times

5 Dogs don't like eggs, here's why

6 The moment a senior stray is adopted - try not to cry

7 Dog smugglers in Banbury arrested in police sting operation

My ideal outcome would be that only rows 1, 3, 5, 6 and 7 remain: rows 1, 2 and 4 would be grouped for similarity, and only row 1, the oldest ('first') entry, would be kept.

(How) could I get there? Even advice purely about the grouping approach would be very helpful. I would want to be able to run this on hundreds of rows of text, without having a specific, manually pre-determined article or headline to measure similarity against, just group similar rows.

Thank you so much for your thoughts and time!

There is 1 answer

**roddar92** · Answer 1 · 2020-12-01T11:07:04+00:00

You can try to embed your texts with doc2vec (example of usage), then cluster them by cosine distance with k-medoids or hierarchical algorithms.
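A minimal sketch of the grouping step, using only the standard library. It is not the doc2vec pipeline suggested in this answer: as a stand-in, each headline becomes a bag-of-words count vector, and rows whose cosine similarity reaches a threshold are linked into one group (a single-linkage, hierarchical-style grouping done with union-find). The function names, the tokenizer and the 0.5 threshold are assumptions chosen for this example; real data would likely need tuning or proper embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase words (apostrophes kept) are a good enough token set.
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a, b):
    # a and b are Counter objects mapping token -> count.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar(texts, threshold=0.5):
    """Return one group label per text; the label is the index of the
    earliest row in that group (hypothetical helper, threshold is a guess)."""
    vectors = [Counter(tokenize(t)) for t in texts]
    parent = list(range(len(texts)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair; fine for hundreds of rows, quadratic beyond that.
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine_similarity(vectors[i], vectors[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    # Keep the smaller index as root so the oldest row wins.
                    parent[max(root_i, root_j)] = min(root_i, root_j)

    return [find(i) for i in range(len(texts))]


headlines = [
    '"The dog ate my homework" says confused child in Banbury',
    "Confused Banbury child says dog ate homework",
    "Why are dogs so cute",
    "Teacher in disbelief as child says dog ate homework - Banbury Times",
    "Dogs don't like eggs, here's why",
    "The moment a senior stray is adopted - try not to cry",
    "Dog smugglers in Banbury arrested in police sting operation",
]

labels = group_similar(headlines)
# Keep only the first (oldest) row of each group.
kept = [i for i, label in enumerate(labels) if label == i]
print(kept)  # [0, 2, 4, 5, 6], i.e. rows 1, 3, 5, 6 and 7
```

With a pandas DataFrame, the labels could be stored as a column and the rows reduced with drop_duplicates(subset=..., keep="first") on that column.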
