Compiling and analysing a Corpus with R and koRpus


I'm a student of literature lost in data sciences. I'm trying to analyse a corpus of 70 .txt-files, which are all in one directory.

My final goal is to get a table containing the filename (or something similar), the sentence and word counts, a Flesch-Kincaid readability score and a MTLD lexical diversity score.

I've found the packages koRpus and tm (and tm.plugin.koRpus) and have tried to understand their documentation, but haven't come far. With the help of the RKWard IDE and the koRpus plugin I manage to get all of these measures for one file at a time and can copy the data into a table manually, but that is very cumbersome and still a lot of work.

What I've tried so far is this command to create a corpus of my files:

simpleCorpus(dir = "/home/user/files/", lang = "en", tagger = "tokenize",
             encoding = "UTF-8", pattern = NULL, recursive = FALSE,
             ignore.case = FALSE, mode = "text", source = "Wikipedia",
             format = "file", mc.cores = getOption("mc.cores", 1L))

But I always get the error:

Error in data.table(token = tokens, tag = unk.kRp) : column or argument 1 is NULL

If someone could help an absolute newbie to R I'd be incredibly grateful!


There are 3 answers

SamVimes (best answer)

I have found the solution with the help of unDocUMeantIt, the author of the package (thank you!). An empty file in the directory caused the error; after removing it, I managed to get everything running.
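If anyone else hits the same error, a quick base-R check for zero-byte files before building the corpus can save some debugging. This is a minimal sketch; the function name is my own, and the directory path is whatever your corpus directory is:

```r
# Return the paths of zero-byte .txt files in a directory.
# These are the files that make the tokenizer fail with
# "column or argument 1 is NULL".
find_empty_files <- function(dir) {
  txt_files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  # file.size() returns the size in bytes; 0 means the file is empty
  txt_files[file.size(txt_files) == 0]
}

# e.g. find_empty_files("/home/user/files/")
```

Delete (or fill) the files it reports, then re-run simpleCorpus().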

pyll

This is a very comprehensive walkthrough; I would go through it step by step if I were you.

http://tidytextmining.com/tidytext.html

Ken Benoit

I suggest you take a look at our vignette for quanteda, Digital Humanities Use Case: Replication of analyses from Text Analysis with R for Students of Literature, which replicates Matthew Jockers's book of the same title.

For what you are looking for above, the following would work:

require(readtext)
require(quanteda)

# reads in all of your texts and puts them into a corpus
mycorpus <- corpus(readtext("/home/user/files/*"))

# sentence and word counts
(output_df <- summary(mycorpus))

# to compute Flesch-Kincaid readability on the texts
textstat_readability(mycorpus, "Flesch.Kincaid")

# to compute lexical diversity on the texts
textstat_lexdiv(dfm(mycorpus))

The textstat_lexdiv() function does not currently have MTLD, but we are working on it, and it does have half a dozen others.
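Since the question asks for MTLD specifically, and koRpus does implement it, a per-file loop with koRpus alone could assemble the whole requested table. This is a sketch, not tested against your data: it assumes koRpus and its English support package (koRpus.lang.en, required by newer koRpus versions) are installed, and that the S4 slot names (@Flesch.Kincaid$grade, @MTLD$MTLD) match the current kRp.readability and kRp.TTR classes; run str() on the returned objects if they differ:

```r
library(koRpus)
library(koRpus.lang.en)  # English language support for koRpus

txt_files <- list.files("/home/user/files/", pattern = "\\.txt$",
                        full.names = TRUE)

results <- do.call(rbind, lapply(txt_files, function(f) {
  tagged <- tokenize(f, lang = "en")  # tokenize one file
  desc   <- describe(tagged)          # basic descriptive counts
  data.frame(
    file      = basename(f),
    sentences = desc$sentences,
    words     = desc$words,
    # Flesch-Kincaid grade level, from the readability object's slot
    FK.grade  = flesch.kincaid(tagged)@Flesch.Kincaid$grade,
    # MTLD lexical diversity, from the kRp.TTR object's slot
    MTLD      = MTLD(tagged)@MTLD$MTLD
  )
}))

results
```

The result is one data frame with a row per file, which you can write out with write.csv().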