I have a data frame with a text column:
TERM
good morning
hello
morning good
you're welcome
hello
hi
I would like to remove all duplicates, including terms that contain the same words in a different order, so that I get:
TERM
good morning
hello
you're welcome
hi
I know how to get the distance between two strings with stringdist:
stringdist(stringOriginal, stringCompare, method = "qgram")
But since my data frames are very long, I don't want to loop through all pairs of entries.
How can I filter out the similar terms?
Thx Joerg
Split each term into words, sort the words within each term, and keep only the rows whose sorted words have not already appeared. No packages are used.
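A minimal base R sketch of that idea follows. The names `DF`, `words`, `key`, and `result` are assumptions made for illustration, and the sample data is copied from the question so the snippet runs on its own; this is one way to write it, not necessarily the original code.

```r
# Sample data from the question (the name DF is an assumption).
DF <- data.frame(
  TERM = c("good morning", "hello", "morning good",
           "you're welcome", "hello", "hi"),
  stringsAsFactors = FALSE
)

# Split each term into words on runs of whitespace.
words <- strsplit(DF$TERM, "\\s+")

# Sort the words within each term and paste them back together, so that
# "morning good" and "good morning" end up with the same key.
key <- vapply(words, function(w) paste(sort(w), collapse = " "), character(1))

# Keep the first row for each key; drop = FALSE keeps the result a data frame.
result <- DF[!duplicated(key), , drop = FALSE]
print(result)
```

Because `duplicated()` works on the whole vector of keys at once, no pairwise comparison of rows is needed. The match is exact, so terms that differ in case or punctuation are treated as different; lower-casing `DF$TERM` before splitting would be one way to relax that.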
giving the four terms requested in the question:
TERM
good morning
hello
you're welcome
hi
Note: The input in reproducible form is:
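One way to write the question's input reproducibly is sketched below. The data frame name `DF` is an assumption; the column name `TERM` and the six values come from the question.

```r
# The six terms from the question, in their original order.
DF <- data.frame(
  TERM = c("good morning", "hello", "morning good",
           "you're welcome", "hello", "hi"),
  # Keep TERM as character; before R 4.0.0 the default would make it a factor,
  # and strsplit() only accepts character input.
  stringsAsFactors = FALSE
)
```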