R: Filtering Bigrams and Different Results with Quanteda and Tidytext


I must be making a mistake. I want to find, in a text, all the bigrams whose first term is "europe" (after converting all the words to lowercase).

I tried to achieve the same goal with both quanteda and tidytext, but for some reason the results do not coincide (in particular, my tidytext approach appears faulty).

You need to download a harmless text file (speech2023.txt) from

https://e.pcloud.link/publink/show?code=XZ51msZqHU5L5wMOEQiU3tGARoreFQuOecy

in order to run the reprex at the end of this post.

Any help is welcome!

library(quanteda)
#> Package version: 3.3.1
#> Unicode version: 15.0
#> ICU version: 72.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.

require(quanteda.textstats)
#> Loading required package: quanteda.textstats
library(readtext)
#> 
#> Attaching package: 'readtext'
#> The following object is masked from 'package:quanteda':
#> 
#>     texts
library(tidyverse)
library(tidytext)


### First identify the bigrams containing "europe" as the first word
### using quanteda

df1 <- readtext("speech2023.txt")


mycorpus <- corpus(df1)
summary(mycorpus)
#> Corpus consisting of 1 document, showing 1 document:
#> 
#>            Text Types Tokens Sentences
#>  speech2023.txt  1780   7531       471


toks <- tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE) |> 
    tokens_remove(pattern = stopwords("en", source = "marimo"))  |> 
    tokens_keep(pattern = "^[a-zA-Z]+$", valuetype = "regex") |>
    tokens_tolower() 


toks_eu_bigram <- tokens_compound(toks, pattern = phrase("europe *"))

toks_eu_bigram_select <- tokens_select(toks_eu_bigram, pattern = phrase("europe_*"))

toks_eu_bigram_select
#> Tokens consisting of 1 document.
#> speech2023.txt :
#>  [1] "europe_want"        "europe_answer"      "europe_must"       
#>  [4] "europe_know"        "europe_today"       "europe_bold"       
#>  [7] "europe_stark"       "europe_just"        "europe_honourable" 
#> [10] "europe_competition" "europe_global"      "europe_open"       
#> [ ... and 39 more ]
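
For reference, the selected bigrams can also be tabulated as a frequency table with quanteda.textstats (a sketch added for context, assuming dfm() and textstat_frequency() handle the selected tokens as in quanteda 3.x):

# Sketch: frequency table of the compounded "europe_*" bigrams
dfm(toks_eu_bigram_select) |>
    textstat_frequency() |>
    head(10)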

####

data(stop_words)


df2 <- readLines("speech2023.txt") |> 
        gsub(pattern = "’", replacement = "") |> 
        gsub(pattern = "[0-9]+", replacement = "") |> 
        gsub(pattern = "[[:punct:]]", replacement = " ")

text_df2 <- tibble(line = 1:length(df2), text = df2) |>
    mutate(text=tolower(text))

bigrams_tidy <- text_df2 |>
    unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
    separate(bigram, c("word1", "word2"), sep = " ") |>
    filter(!word1 %in% stop_words$word) |>
    filter(!word2 %in% stop_words$word) |>
    filter(word1 == "europe")

bigrams_tidy
#> # A tibble: 6 × 3
#>    line word1  word2     
#>   <int> <chr>  <chr>     
#> 1   381 europe approach  
#> 2   415 europe economic  
#> 3   417 europe faster    
#> 4   507 europe answering 
#> 5   525 europe responding
#> 6   624 europe stands
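
As a cross-check (a sketch, not output produced by the reprex above), the same tidytext pipeline run without the stop_words filters counts how many bigrams starting with "europe" exist before any filtering:

# Sketch: count the "europe *" bigrams before removing stopwords,
# to see how many rows the stop_words filters drop
text_df2 |>
    unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
    separate(bigram, c("word1", "word2"), sep = " ") |>
    filter(word1 == "europe") |>
    nrow()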

## Why are the bigrams found with quanteda and with the tidytext approach not the same?
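
One place the two pipelines clearly differ is the stopword list: the quanteda code removes the "marimo" stopwords before compounding, while the tidytext code filters the finished bigrams against tidytext's stop_words. The following sketch (assuming both objects behave as documented) compares the two lists:

# Sketch: words in one stopword list but not the other
length(setdiff(stop_words$word, stopwords("en", source = "marimo")))
length(setdiff(stopwords("en", source = "marimo"), stop_words$word))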

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 12 (bookworm)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] tidytext_0.4.1            lubridate_1.9.3          
#>  [3] forcats_1.0.0             stringr_1.5.0            
#>  [5] dplyr_1.1.3               purrr_1.0.2              
#>  [7] readr_2.1.4               tidyr_1.3.0              
#>  [9] tibble_3.2.1              ggplot2_3.4.4            
#> [11] tidyverse_2.0.0           readtext_0.90            
#> [13] quanteda.textstats_0.96.3 quanteda_3.3.1           
#> 
#> loaded via a namespace (and not attached):
#>  [1] janeaustenr_1.0.0  utf8_1.2.3         generics_0.1.3     stringi_1.7.12    
#>  [5] lattice_0.21-9     hms_1.1.3          digest_0.6.33      magrittr_2.0.3    
#>  [9] timechange_0.2.0   evaluate_0.22      grid_4.3.1         fastmap_1.1.1     
#> [13] Matrix_1.6-1.1     httr_1.4.7         stopwords_2.3      fansi_1.0.5       
#> [17] scales_1.2.1       cli_3.6.1          rlang_1.1.1        tokenizers_0.3.0  
#> [21] munsell_0.5.0      reprex_2.0.2       withr_2.5.1        yaml_2.3.7        
#> [25] tools_4.3.1        tzdb_0.4.0         colorspace_2.1-0   fastmatch_1.1-4   
#> [29] vctrs_0.6.4        R6_2.5.1           lifecycle_1.0.3    fs_1.6.3          
#> [33] pkgconfig_2.0.3    RcppParallel_5.1.7 pillar_1.9.0       gtable_0.3.4      
#> [37] data.table_1.14.8  glue_1.6.2         Rcpp_1.0.11        xfun_0.40         
#> [41] tidyselect_1.2.0   knitr_1.44         SnowballC_0.7.1    htmltools_0.5.6.1 
#> [45] rmarkdown_2.25     compiler_4.3.1     nsyllable_1.0.1

Created on 2023-10-24 with reprex v2.0.2
