I have a data frame of tweets. I am searching the tweet text in this data frame for identifiers defined in a set of character vectors, one per country. The vectors look like this:
chnID <- c("china", "beijing", "中国", "日中")
jpnID <- c("japan", "tokyo")
usaID <- c("united states", "washington", "アメリカ", "米国", "日米")
allID <- c(chnID, jpnID, usaID)
Example tweet text might look something like this:
usaTW <- data.frame(text = "beijing, tokyo, & washington to discuss economic concerns this weekend in japan.")
Next I filter the data frame to find all tweets containing references to China or Japan contained in the allID vector (while excluding matches in the usaID vector):
usaPD <- filter(usaTW, str_detect(text, paste(allID[which(!allID %in% usaID)], collapse = "|")))
Given that the above example does contain values from the allID vector, it is identified and placed in the newly created usaPD data frame. Next I would like to convert this text into an edge list for network graphing. To do this, I do the following:
usaEL <- data.frame(source = "United States", dest = str_extract(usaPD$text, paste(allID[which(!allID %in% usaID)], collapse = "|")))
This returns the following:
| source | dest |
|---|---|
| United States | beijing |
However, I'd like to achieve the following three things:
- I'd instead like to create multiple rows, one for each match found.
- Additionally, instead of listing the found value, I'd like it to return the country name.
- Finally, I'd like it to exclude additional matches from the same country list. In this way, it would ignore "japan" since it has already found "tokyo" earlier in the text.
Ultimately, the result for the example text should look like the following:
| source | dest |
|---|---|
| United States | China |
| United States | Japan |
For each row, you need to identify which country patterns are required (using
source) and check whether any of them occurs in text (using stringi::stri_detect_regex).
Data:
country patterns:
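One way the country patterns might be set up is as a named vector with one regular expression per country, keyed by the country name to report. This is only a sketch: the names country_ids and country_patterns are assumptions for illustration, and the identifiers are copied from the vectors in the question.

```r
# Hypothetical lookup: display name -> identifiers, taken from the question's vectors.
country_ids <- list(
  "China"         = c("china", "beijing", "中国", "日中"),
  "Japan"         = c("japan", "tokyo"),
  "United States" = c("united states", "washington", "アメリカ", "米国", "日米")
)

# Collapse each identifier vector into a single alternation regex.
# vapply keeps the list names, so the result is a named character vector.
country_patterns <- vapply(country_ids, paste, character(1), collapse = "|")

country_patterns[["Japan"]]
```

Keying the patterns by country name means a match can be reported as "Japan" whether the text contained "tokyo" or "japan".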
find which country pattern is not required for each source:
for each text, see whether any of the required country patterns is detected:
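The two steps above could be sketched roughly as follows. The helper detect_countries and the vector country_patterns are hypothetical names, and base R's grepl is used here as a stand-in for stringi::stri_detect_regex so that the sketch runs without extra packages.

```r
# Assumed lookup: display name -> regex of identifiers (from the question's vectors).
country_patterns <- c(
  "China"         = "china|beijing|中国|日中",
  "Japan"         = "japan|tokyo",
  "United States" = "united states|washington|アメリカ|米国|日米"
)

# Hypothetical helper: which countries other than `source` are mentioned in `text`?
detect_countries <- function(text, source) {
  # The source's own pattern is not required for this row, so drop it.
  required <- country_patterns[setdiff(names(country_patterns), source)]
  # One logical per remaining country; several hits for one country count once.
  hit <- vapply(required, grepl, logical(1), x = text)
  names(required)[hit]
}

detect_countries(
  "beijing, tokyo, & washington to discuss economic concerns this weekend in japan.",
  "United States"
)
```

Because each country is tested once against its whole alternation, "tokyo" and "japan" together yield a single "Japan" result.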
create final result:
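Putting it together, the edge list might be built along these lines. This is a sketch under assumptions rather than the original answer's code: the data frame tweets with source and text columns, and the names country_patterns and edge_list, are invented for illustration, and grepl again stands in for stringi::stri_detect_regex.

```r
# Assumed lookup: display name -> regex of identifiers (from the question's vectors).
country_patterns <- c(
  "China"         = "china|beijing|中国|日中",
  "Japan"         = "japan|tokyo",
  "United States" = "united states|washington|アメリカ|米国|日米"
)

# Assumed input: one row per tweet, with the country whose feed it came from.
tweets <- data.frame(
  source = "United States",
  text = "beijing, tokyo, & washington to discuss economic concerns this weekend in japan.",
  stringsAsFactors = FALSE
)

# For each tweet, emit one row per non-source country detected in the text.
rows <- lapply(seq_len(nrow(tweets)), function(i) {
  required <- setdiff(names(country_patterns), tweets$source[i])
  hit <- vapply(country_patterns[required], grepl, logical(1), x = tweets$text[i])
  found <- required[hit]
  data.frame(
    source = rep(tweets$source[i], length(found)),
    dest = found,
    stringsAsFactors = FALSE
  )
})

edge_list <- do.call(rbind, rows)
edge_list
```

For the example tweet, this gives one row for China and one for Japan, both with United States as the source. A tweet with no matches contributes zero rows.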