Compiling and analysing a Corpus with R and koRpus


I'm a student of literature lost in data sciences. I'm trying to analyse a corpus of 70 .txt-files, which are all in one directory.

My final goal is to get a table containing the filename (or something similar), the sentence and word counts, a Flesch-Kincaid readability score and a MTLD lexical diversity score.

I've found the packages koRpus and tm (and tm.plugin.koRpus) and have tried to understand their documentation, but haven't come far. With the help of the RKWard IDE and the koRpus plugin I manage to get all of these measures for one file at a time and can copy the data into a table manually, but that is very cumbersome and still a lot of work.

What I've tried so far is this command to create a corpus of my files:

simpleCorpus(dir = "/home/user/files/", lang = "en", tagger = "tokenize",
             encoding = "UTF-8", pattern = NULL, recursive = FALSE,
             ignore.case = FALSE, mode = "text", source = "Wikipedia",
             format = "file", mc.cores = getOption("mc.cores", 1L))

But I always get the error:

Error in data.table(token = tokens, tag = unk.kRp) : column or argument 1 is NULL

If someone could help an absolute newbie to R I'd be incredibly grateful!


There are 3 answers

SamVimes (best answer)

I have found the solution with the help of unDocUMeantIt, the author of the package (thank you!). An empty file in the directory caused the error; after removing it, I managed to get everything running.
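If anyone else hits the same error, a quick base-R check for zero-byte files before building the corpus can save some debugging. This is a minimal sketch; the function name is my own, and the directory path is whatever your corpus directory is:

```r
# Return the paths of zero-byte .txt files in a directory.
# These are the files that make the tokenizer fail with
# "column or argument 1 is NULL".
find_empty_files <- function(dir) {
  txt_files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  # file.size() returns the size in bytes; 0 means the file is empty
  txt_files[file.size(txt_files) == 0]
}

# e.g. find_empty_files("/home/user/files/")
```

Delete (or fill) the files it reports, then re-run simpleCorpus().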

pyll

This is a very comprehensive walkthrough; I would go through it step by step if I were you.

http://tidytextmining.com/tidytext.html

Ken Benoit

I suggest you take a look at our vignette for quanteda, Digital Humanities Use Case: Replication of analyses from Text Analysis with R for Students of Literature, which replicates Matthew Jockers's book of the same title.

For what you are looking for above, the following would work:

require(readtext)
require(quanteda)

# reads in all of your texts and puts them into a corpus
mycorpus <- corpus(readtext("/home/user/files/*"))

# sentence and word counts
(output_df <- summary(mycorpus))

# to compute Flesch-Kincaid readability on the texts
textstat_readability(mycorpus, "Flesch.Kincaid")

# to compute lexical diversity on the texts
textstat_lexdiv(dfm(mycorpus))

The textstat_lexdiv() function does not currently have MTLD, but we are working on it, and it does have half a dozen others.
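Since the question asks for MTLD specifically, and koRpus does implement it, a per-file loop with koRpus alone could assemble the whole requested table. This is a sketch, not tested against your data: it assumes koRpus and its English support package (koRpus.lang.en, required by newer koRpus versions) are installed, and that the S4 slot names (@Flesch.Kincaid$grade, @MTLD$MTLD) match the current kRp.readability and kRp.TTR classes; run str() on the returned objects if they differ:

```r
library(koRpus)
library(koRpus.lang.en)  # English language support for koRpus

txt_files <- list.files("/home/user/files/", pattern = "\\.txt$",
                        full.names = TRUE)

results <- do.call(rbind, lapply(txt_files, function(f) {
  tagged <- tokenize(f, lang = "en")  # tokenize one file
  desc   <- describe(tagged)          # basic descriptive counts
  data.frame(
    file      = basename(f),
    sentences = desc$sentences,
    words     = desc$words,
    # Flesch-Kincaid grade level, from the readability object's slot
    FK.grade  = flesch.kincaid(tagged)@Flesch.Kincaid$grade,
    # MTLD lexical diversity, from the kRp.TTR object's slot
    MTLD      = MTLD(tagged)@MTLD$MTLD
  )
}))

results
```

The result is one data frame with a row per file, which you can write out with write.csv().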