Remove rows containing identical or word-permuted sentences from a data frame in R

Question

82 views Asked by JoergP At 20 December 2016 at 13:38

I have a data frame with a text column:

TERM
good morning
hello
morning good
you're welcome
hello
hi

I would like to filter out all duplicates, as well as all entries that contain the same words in a different order, so that I get:

TERM
good morning
hello
you're welcome
hi

I know how to get the distance between two strings with stringdist:

stringdist(stringOriginal, stringCompare, method = "qgram")

But since my data frames are very long, I don't want to loop through all entries.

How can I filter out the similar terms?

Thx Joerg

There is 1 answer

**G. Grothendieck** · Answer 1 · 2016-12-20T13:47:59+00:00

Split each term into words, sort the words within each record, and keep only the rows whose sorted words have not already appeared. No packages are used.

subset(DF, !duplicated(lapply(strsplit(TERM, " "), sort)))

giving:

            TERM
1   good morning
2          hello
4 you're welcome
6             hi
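
To see why the one-liner works, it may help to run the three steps separately. The following is only an illustrative breakdown on a plain character vector; the intermediate names `terms`, `words`, `sorted` and `dup` are made up for this sketch and do not appear in the answer.

```r
# Same sample terms as in the question, as a plain character vector.
terms <- c("good morning", "hello", "morning good",
           "you're welcome", "hello", "hi")

words  <- strsplit(terms, " ")   # list with one character vector of words per term
sorted <- lapply(words, sort)    # sorting makes word order irrelevant
dup    <- duplicated(sorted)     # TRUE where an identical sorted vector occurred earlier

terms[!dup]                      # keeps elements 1, 2, 4 and 6
```

All three calls are vectorised over the whole column, so no explicit loop over pairs of entries is needed.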

Note: The input in reproducible form is:

Lines <- "TERM
good morning
hello
morning good
you're welcome
hello
hi"
DF <- read.csv(text = Lines, as.is = TRUE, strip.white = TRUE)
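
If the real data also differ in case or contain repeated spaces, a normalised key may be more robust than splitting on a single space. The sketch below is one possible variant, not part of the answer above: `canonical_key` is a hypothetical helper name, and treating upper and lower case as equal is an assumption about what should count as a duplicate.

```r
# Input from the question, built directly so the block runs on its own.
DF <- data.frame(
  TERM = c("good morning", "hello", "morning good",
           "you're welcome", "hello", "hi"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case, trim, split on runs of whitespace,
# sort the words, and paste them back into one string per term.
canonical_key <- function(x) {
  words <- strsplit(tolower(trimws(x)), "[[:space:]]+")
  vapply(words, function(w) paste(sort(w), collapse = " "), character(1))
}

key    <- canonical_key(DF$TERM)
result <- DF[!duplicated(key), , drop = FALSE]  # keep the first row for each key
print(result)
```

On the sample input this keeps rows 1, 2, 4 and 6, the same rows as the `subset` call. Because the key is an ordinary character vector, it can also be stored as a column and reused for grouping or joins.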