kwic() function returns fewer rows than it should


I'm currently trying to perform a sentiment analysis on a kwic object, but I'm afraid that the kwic() function does not return all the rows it should. I'm not sure what exactly the issue is, which makes it hard to post a reproducible example, so I hope a detailed explanation of what I'm trying to do will suffice.

I subsetted the original dataset containing speeches I want to analyze to a new data frame that only includes speeches mentioning certain keywords. I used the following code to create this subset:

ostalgie_cluster <- full_data %>%
  filter(grepl('Schwester Agnes|Intershop|Interflug|Trabant|Trabi|Ostalgie',
                speechContent,
                ignore.case = TRUE))

The resulting data frame consists of 201 observations. However, when I run kwic() on the same initial dataset using the following code, it returns a data frame with only 82 observations. Does anyone know what might cause this? Again, I'm sorry I can't provide a reproducible example, but when I try to create a reprex from scratch, it just works.

#create quanteda corpus object
qtd_speeches_corp <- corpus(full_data,
                            docid_field = "id",
                            text_field = "speechContent")

#tokenize speeches
qtd_tokens <- tokens(qtd_speeches_corp, 
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE,
                     padding = FALSE) %>%
  tokens_remove(stopwords("de"), padding = FALSE) %>%
  tokens_compound(pattern = phrase(c("Schwester Agnes")), concatenator = " ")

ostalgie_words <- c("Schwester Agnes", "Intershop", "Interflug", "Trabant", "Trabi", "Ostalgie")

test_kwic <- kwic(qtd_tokens,
                  pattern = ostalgie_words,
                  window = 5)

Answer by Ken Benoit (best answer)

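This is something of a guess without a reproducible example (namely your input full_data), but here is what I think is happening. Your kwic() call is using the default "glob" pattern matching, which compares each pattern against whole tokens, whereas your grepl() filter matches the pattern anywhere in the text. What you want is a regular expression match instead.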

Fix it this way:

kwic(qtd_tokens, pattern = ostalgie_words, valuetype = "regex",
     window = 5)
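
To see the difference on a small scale, here is a minimal sketch. The three texts are invented for illustration and are not taken from full_data, so this only shows the matching behaviour, not the cause of the exact 201 vs. 82 discrepancy. It assumes the quanteda package is installed.

```r
library(quanteda)

# Toy texts, invented for illustration only
txt <- c(d1 = "Der Trabant war beliebt.",
         d2 = "Viele Trabanten stehen heute im Museum.",
         d3 = "Die Intershops sind verschwunden.")
toks <- tokens(txt, remove_punct = TRUE)
words <- c("Trabant", "Intershop")

# Default valuetype = "glob": a pattern has to match a whole token,
# so inflected forms such as "Trabanten" or "Intershops" are skipped
kw_glob <- kwic(toks, pattern = words, window = 2)

# valuetype = "regex": a pattern may match anywhere inside a token,
# which is closer to what grepl() does on the raw text
kw_regex <- kwic(toks, pattern = words, valuetype = "regex", window = 2)

# Compare the number of matches under each setting
nrow(kw_glob)
nrow(kw_regex)

# kwic() returns one row per match, not one row per document, so to
# compare against a filtered data frame, count distinct documents instead
length(unique(kw_regex$docname))
```

Because kwic() produces one row per keyword occurrence, its row count will not necessarily equal the number of documents kept by filter() even after switching to regex matching; comparing the unique docname values against the ids in the filtered data frame is probably the fairer check.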