How to check if any words in a list of phrases are contained in a list in R?


I have a data frame with a column called listA, and a separate listB. I want to pull out only those rows of the data frame whose listA value matches an entry in listB, so I have:

newData <- mydata[mydata$listA %in% listB,]

However, some entries of listA are in the format "ABC /// DEF", where both ABC and DEF are possible entries in listB. I want to pull out the rows of the data frame where any of the words in listA matches an entry in listB. So if listB had "ABC" in it, the "ABC /// DEF" row would be included in newData. I found the strsplit function, but things like

strsplit(mydata$listA," ") %in% listB

always returns FALSE, presumably because it's checking if the whole list returned by strsplit is an entry in listB.

Answered by smci
  1. match(word_vector, target_vector) accepts vectors for both arguments, which is what you want (note: vectors, not lists). The %in% operator is a thin wrapper around match(), as its help page tells you: x %in% table is defined as match(x, table, nomatch = 0) > 0.
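  2. However, the stringi package's stri_match_* functions may well do what you want directly. They are all vectorized and are typically much faster than either match() or strsplit(): stri_match_all, stri_match_all_regex, stri_match_first, stri_match_first_regex, stri_match_last, stri_match_last_regex. Note that these return the matched text (and any capture groups) rather than TRUE/FALSE; for a logical row filter, the stri_detect_* functions from the same package are the closer fit.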
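
To illustrate point 1, here is a minimal sketch of applying %in% to each row's words. The example data is made up, the separator " /// " is assumed from the question, and base strsplit() is used only to keep the sketch free of dependencies:

```r
# Hypothetical example data; the names mydata, listA and listB follow the question.
mydata <- data.frame(
  listA = c("ABC /// DEF", "GHI", "XYZ", "JKL /// MNO"),
  value = 1:4,
  stringsAsFactors = FALSE
)
listB <- c("ABC", "XYZ")

# Split each entry on the literal separator " /// " (fixed = TRUE disables regex).
# The result is a list with one character vector of words per row.
words_per_row <- strsplit(mydata$listA, " /// ", fixed = TRUE)

# %in% works on a character vector, so apply it to each row's words
# and keep the row if any word is found in listB.
keep <- vapply(words_per_row, function(words) any(words %in% listB), logical(1))

newData <- mydata[keep, ]
print(newData)
```

Calling %in% on the list returned by strsplit() compares whole list elements against listB, which is why the attempt in the question returns FALSE; the per-row any() avoids that.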

Also, you probably won't need an explicit split function, but if you must use one, prefer stringi::stri_split_*() over base::strsplit().
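
If a split is unavoidable, one possible sketch with stringi::stri_split_fixed() looks like this. It assumes the stringi package is installed, uses made-up data, and flattens the split words so that %in% is called only once:

```r
library(stringi)  # assumption: the stringi package is installed

# Hypothetical example data; the names follow the question.
mydata <- data.frame(
  listA = c("ABC /// DEF", "GHI", "XYZ"),
  stringsAsFactors = FALSE
)
listB <- c("ABC", "XYZ")

# Split on the literal separator; returns a list with one character vector per row.
words_per_row <- stri_split_fixed(mydata$listA, " /// ")

# Record which row each word came from, then test all words in one vectorized call.
row_id <- rep(seq_along(words_per_row), lengths(words_per_row))
hit_rows <- unique(row_id[unlist(words_per_row) %in% listB])

# drop = FALSE keeps the result a data frame even with a single column.
newData <- mydata[hit_rows, , drop = FALSE]
print(newData)
```

Flattening with unlist() and tracking row_id avoids calling a function once per row, which tends to matter more as the data frame grows.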

Note on performance: avoid splitting strings in R whenever possible. Splitting wastes memory by creating a lot of unnecessary cons cells, as gc() will show you. That's yet another reason why stringi is very efficient.
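
One way to avoid splitting altogether is to test each string against a single pattern built from listB. The following is only a sketch under stated assumptions: stringi is installed, the data is made up, entries are separated by exactly " /// ", and no entry of listB contains the sequence \E (each entry is quoted literally with \Q...\E):

```r
library(stringi)  # assumption: the stringi package is installed

# Hypothetical example data; the names follow the question.
mydata <- data.frame(
  listA = c("ABC /// DEF", "GHI", "XYZ", "ABCD /// JKL"),
  value = 1:4,
  stringsAsFactors = FALSE
)
listB <- c("ABC", "XYZ")

# Quote every entry so regex metacharacters in listB are treated literally,
# then join the entries into one alternation.
alternatives <- paste0("\\Q", listB, "\\E", collapse = "|")

# An entry counts only as a whole word: it must be preceded by the start of the
# string or the separator, and followed by the separator or the end of the string.
pattern <- paste0("(^| /// )(", alternatives, ")( /// |$)")

# One vectorized call returns TRUE/FALSE per row, with no intermediate split.
keep <- stri_detect_regex(mydata$listA, pattern)

newData <- mydata[keep, ]
print(newData)
```

The anchoring on the separator is what keeps "ABCD /// JKL" from matching the entry "ABC"; a plain substring search would wrongly include that row.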