How do I load large (25k+ words) .txt documents and then structure them as one token per row?


How can I load a big folder of files (more than 100 .txt files) for text mining (analysing the most frequent words, their evolution over time, word clustering, topics, POS, and so on) with the tidytext package?

I am currently using Silge and Robinson's "Text Mining with R" (https://www.tidytextmining.com), but I am facing some challenges. I am able to reproduce their examples, but I am not able to load my own files to work on them.

The files I want to work with are the .txt files of the Annual Reports of the Rockefeller Foundation, covering the first decade of the last century to the 1970s. They are quite long documents, between 250 and 450 pages. I suppose the documents should also be cleaned up with gsub(pattern = "\\W", replacement = " ", object), tolower(object), gsub(pattern = "\\b[A-z]\\b", replacement = " ", object), and stripWhitespace(object) to facilitate the analysis. I have been able to perform these last operations.

I tried cleaning one file with:

library(tm)   # removeWords(), stripWhitespace()

AR1960 <- readLines("Annual-Report-1960.txt")
AR1960.v1 <- gsub(pattern = "\\W", replacement = " ", AR1960)                 # drop non-word characters
AR1960.v2 <- gsub(pattern = "2003 The Rockefeller Foundation", replacement = " ", AR1960.v1)  # drop the recurring header
AR1960.v3 <- tolower(AR1960.v2)
AR1960.v4 <- gsub(pattern = "\\b[a-z]\\b", replacement = " ", AR1960.v3)      # drop single-letter tokens
AR1960.v5 <- removeWords(AR1960.v4, c("Inc", "inc", "xxiii", "xxii", "xxv", "xxiv", "xxi", "xix"))
AR1960.v6 <- stripWhitespace(AR1960.v5)

to ease the analysis, but I am still not able to load even a single file following Silge and Robinson's guidance.

Then I used

AR1960.v6 %>%
unnest_tokens(word, text)

and got:

AR1960.v6 %>%
+     unnest_tokens(word, text)
Error in UseMethod("pull") : 
  no applicable method for 'pull' applied to an object of class "function"

Thank you for your time and understanding.


1 Answer

Answered by danlooo

The function unnest_tokens() expects a data frame with a text column, not a bare character vector, which is why the pipe in the question fails.
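
For the single cleaned vector from the question, a minimal sketch (assuming AR1960.v6 is the character vector built above) is to wrap it in a tibble first:

library(tidytext)
library(tibble)

# one row per line of the cleaned file, in a column named "text"
tibble(text = AR1960.v6) %>%
  unnest_tokens(word, text)   # one token (word) per row

The same approach scales to every .txt file in the current working directory: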

library(tidytext)
library(tidyverse)

# terms to strip from the raw text before tokenizing
exclude_words <- c("inc", "xxii")

# every .txt file in the working directory, one tokenized tibble per file
list.files(pattern = "txt$") %>%
  map(~ {
    .x %>%
      read_file() %>%
      str_to_lower() %>%
      # remove the excluded terms (note: this matches substrings, not whole words)
      str_remove_all(exclude_words %>% paste0(collapse = "|")) %>%
      str_remove_all(pattern = "\t") %>%
      # wrap the text in a data frame so unnest_tokens() can work on it
      tibble(file = .x, text = .) %>%
      unnest_tokens(word, text)
  }) %>%
  bind_rows()
# A tibble: 20 × 2
#   file     word       
#   <chr>    <chr>      
# 1 text.txt because    
# 2 text.txt i          
# 3 text.txt could      
# 4 text.txt not
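
From there, counting the most frequent words (one of the stated goals) is a short step. A minimal sketch, assuming the combined tibble above is saved as tokens (a placeholder name):

library(tidytext)
library(tidyverse)

# assuming `tokens` holds the file/word tibble produced by the pipeline above
tokens %>%
  anti_join(get_stopwords(), by = "word") %>%   # drop common English stop words
  count(file, word, sort = TRUE)                # most frequent words per report

If the reports are not in the working directory, list.files(path = "path/to/reports", pattern = "txt$", full.names = TRUE) will pick up the file paths from another folder (the path here is a placeholder).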