I have a data frame with a column called listA, and a listB. I want to pull out only those rows in the data frame which match to an entry in listB, so I have:
newData <- mydata[mydata$listA %in% listB,]
However, some entries of listA are in the format "ABC /// DEF", where both ABC and DEF are possible entries in listB. I want to pull out the rows of the data frame which have a listA for which any of the words match to an entry in listB. So if listB had "ABC" in it, that entry would be included in newData. I found the strsplit function, but things like
strsplit(mydata$listA," ") %in% listB
always returns FALSE, presumably because it's checking if the whole list returned by strsplit is an entry in listB.
match(word_vector, target_vector) allows both arguments to be vectors, which is what you want (note: that's vectors, not lists). In fact, the %in% operator is a thin wrapper around match(), as its help page tells you: x %in% table is defined as match(x, table, nomatch = 0) > 0.
The stringi package's stri_match_* functions may well do what you want directly. They are all vectorized, and are often much faster than either match() or strsplit():
stri_match_all, stri_match_all_regex, stri_match_first, stri_match_first_regex, stri_match_last, stri_match_last_regex
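To make the vectors-versus-lists point concrete, here is a minimal base-R sketch of applying %in% to each row's words separately instead of to the whole list at once. The sample data, the `value` column, and the assumption that entries are separated by exactly " /// " are illustrative and not taken from the question.

```r
# Hypothetical example data; mydata, listA and listB mirror the names in the question.
mydata <- data.frame(
  listA = c("ABC /// DEF", "GHI", "JKL /// ABC", "XYZ"),
  value = 1:4,
  stringsAsFactors = FALSE
)
listB <- c("ABC", "GHI")

# Split on the literal separator (assumed to be " /// "); fixed = TRUE disables regex.
# The result is a list with one character vector of words per row.
words <- strsplit(mydata$listA, " /// ", fixed = TRUE)

# %in% works on vectors, so apply it to each row's word vector in turn
# and keep the row if any of its words is found in listB.
keep <- vapply(words, function(w) any(w %in% listB), logical(1))

newData <- mydata[keep, ]
```

With this sample data, the rows "ABC /// DEF", "GHI" and "JKL /// ABC" are kept and "XYZ" is dropped.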
Also, you probably won't need to use an explicit split function, but if you must, then use stringi::stri_split_*() and avoid base::strsplit().
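If a split is needed after all, one possible sketch with stringi looks like the following. It assumes the stringi package is installed, that the separator is exactly " /// ", and uses made-up sample data. It does the membership test in a single vectorized %in% call rather than once per row.

```r
library(stringi)

# Hypothetical example data; names mirror the question.
mydata <- data.frame(
  listA = c("ABC /// DEF", "GHI", "JKL /// ABC", "XYZ"),
  stringsAsFactors = FALSE
)
listB <- c("ABC", "GHI")

# stri_split_fixed() returns a list with one character vector per input string.
words <- stri_split_fixed(mydata$listA, " /// ")

# Record which row each word came from, then flatten and test all words at once.
row_id <- rep(seq_along(words), lengths(words))
hit <- unlist(words) %in% listB

# A row is kept if at least one of its words matched.
keep <- seq_len(nrow(mydata)) %in% row_id[hit]

newData <- mydata[keep, , drop = FALSE]
```

The drop = FALSE is only there because the sample data frame has a single column; without it, the subset would collapse to a plain vector.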
Note on performance: avoid splitting strings in R whenever possible. Splitting allocates a large number of small, short-lived objects (cons cells) that the garbage collector then has to clean up, as gc() will show you. That's yet another reason why stringi is very efficient.