I must be making a mistake somewhere. I want to find all the bigrams in a text whose first term is "europe" (after converting all words to lowercase).
I tried to achieve this with both quanteda and tidytext, but for some reason the results do not coincide (in particular, my tidytext approach appears to be faulty).
To run the reprex at the end of this post, you need to download a harmless text file (speech2023.txt) from
https://e.pcloud.link/publink/show?code=XZ51msZqHU5L5wMOEQiU3tGARoreFQuOecy
Any help is welcome!
library(quanteda)
#> Package version: 3.3.1
#> Unicode version: 15.0
#> ICU version: 72.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
require(quanteda.textstats)
#> Loading required package: quanteda.textstats
library(readtext)
#>
#> Attaching package: 'readtext'
#> The following object is masked from 'package:quanteda':
#>
#> texts
library(tidyverse)
library(tidytext)
### First identify the bigrams containing "europe" as the first word
### using quanteda
df1 <- readtext("speech2023.txt")
mycorpus <- corpus(df1)
summary(mycorpus)
#> Corpus consisting of 1 document, showing 1 document:
#>
#> Text Types Tokens Sentences
#> speech2023.txt 1780 7531 471
toks <- tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE) |>
  tokens_remove(pattern = stopwords("en", source = "marimo")) |>
  tokens_keep(pattern = "^[a-zA-Z]+$", valuetype = "regex") |>
  tokens_tolower()
toks_eu_bigram <- tokens_compound(toks, pattern = phrase("europe *"))
toks_eu_bigram_select <- tokens_select(toks_eu_bigram, pattern = phrase("europe_*"))
toks_eu_bigram_select
#> Tokens consisting of 1 document.
#> speech2023.txt :
#> [1] "europe_want" "europe_answer" "europe_must"
#> [4] "europe_know" "europe_today" "europe_bold"
#> [7] "europe_stark" "europe_just" "europe_honourable"
#> [10] "europe_competition" "europe_global" "europe_open"
#> [ ... and 39 more ]
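One thing that may explain part of the gap (my assumption, illustrated with a toy sentence rather than the actual speech): quanteda removes the stopwords *before* `tokens_compound()` runs, so a "bigram" such as `europe_want` can pair two words that were never adjacent in the raw text. A minimal sketch:

```r
library(quanteda)

# Toy sentence (not from speech2023.txt): "is" and "to" are stopwords
# sitting between "europe" and the next content word.
toy <- tokens("Europe is ready to answer") |>
  tokens_remove(pattern = stopwords("en")) |>
  tokens_tolower()
# After stopword removal the token stream is "europe" "ready" "answer",
# so tokens_compound() glues "europe" to "ready" -- two words that were
# not neighbours in the original sentence.
tokens_compound(toy, pattern = phrase("europe *"))
```

A tidytext bigram tokenizer working on the raw line would instead produce "europe is" here, which the stopword filter then drops entirely.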
####
data(stop_words)
df2 <- readLines("speech2023.txt") |>
  gsub(pattern = "’", replacement = "") |>
  gsub(pattern = "[0-9]+", replacement = "") |>
  gsub(pattern = "[[:punct:]]", replacement = " ")
text_df2 <- tibble(line = seq_along(df2), text = df2) |>
  mutate(text = tolower(text))
bigrams_tidy <- text_df2 |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  separate(bigram, c("word1", "word2"), sep = " ") |>
  filter(!word1 %in% stop_words$word) |>
  filter(!word2 %in% stop_words$word) |>
  filter(word1 == "europe")
bigrams_tidy
#> # A tibble: 6 × 3
#> line word1 word2
#> <int> <chr> <chr>
#> 1 381 europe approach
#> 2 415 europe economic
#> 3 417 europe faster
#> 4 507 europe answering
#> 5 525 europe responding
#> 6 624 europe stands
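On the tidytext side, a second likely source of divergence (again an assumption worth checking, not a confirmed diagnosis): `stop_words` bundles three lexicons (onix, SMART, snowball), a much larger and different collection than the marimo list used in the quanteda pipeline, and `unnest_tokens()` builds bigrams from words adjacent in the raw text, dropping the whole pair whenever either member is a stopword. A quick way to inspect the mismatch between the two stopword sets:

```r
library(tidytext)
library(stopwords)  # the backend quanteda uses for stopwords()

data(stop_words)
table(stop_words$lexicon)  # counts per bundled lexicon

marimo <- stopwords::stopwords("en", source = "marimo")

# Words treated as stopwords in only one of the two pipelines;
# a bigram containing one of them survives one approach but not the other.
only_marimo   <- setdiff(marimo, stop_words$word)
only_tidytext <- setdiff(stop_words$word, marimo)
length(only_marimo); length(only_tidytext)
```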
## Why are the bigrams found with the quanteda and the tidytext approaches not the same?
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 12 (bookworm)
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] tidytext_0.4.1 lubridate_1.9.3
#> [3] forcats_1.0.0 stringr_1.5.0
#> [5] dplyr_1.1.3 purrr_1.0.2
#> [7] readr_2.1.4 tidyr_1.3.0
#> [9] tibble_3.2.1 ggplot2_3.4.4
#> [11] tidyverse_2.0.0 readtext_0.90
#> [13] quanteda.textstats_0.96.3 quanteda_3.3.1
#>
#> loaded via a namespace (and not attached):
#> [1] janeaustenr_1.0.0 utf8_1.2.3 generics_0.1.3 stringi_1.7.12
#> [5] lattice_0.21-9 hms_1.1.3 digest_0.6.33 magrittr_2.0.3
#> [9] timechange_0.2.0 evaluate_0.22 grid_4.3.1 fastmap_1.1.1
#> [13] Matrix_1.6-1.1 httr_1.4.7 stopwords_2.3 fansi_1.0.5
#> [17] scales_1.2.1 cli_3.6.1 rlang_1.1.1 tokenizers_0.3.0
#> [21] munsell_0.5.0 reprex_2.0.2 withr_2.5.1 yaml_2.3.7
#> [25] tools_4.3.1 tzdb_0.4.0 colorspace_2.1-0 fastmatch_1.1-4
#> [29] vctrs_0.6.4 R6_2.5.1 lifecycle_1.0.3 fs_1.6.3
#> [33] pkgconfig_2.0.3 RcppParallel_5.1.7 pillar_1.9.0 gtable_0.3.4
#> [37] data.table_1.14.8 glue_1.6.2 Rcpp_1.0.11 xfun_0.40
#> [41] tidyselect_1.2.0 knitr_1.44 SnowballC_0.7.1 htmltools_0.5.6.1
#> [45] rmarkdown_2.25 compiler_4.3.1 nsyllable_1.0.1
Created on 2023-10-24 with reprex v2.0.2