Computing frequency of words for each document in a corpus/DFM for R

Question

Computing frequency of words for each document in a corpus/DFM for R

166 views Asked by yasiko41 At 29 November 2022 at 16:59

I want to replicate a measure of common words from a Paper in R.

They describe their procedure as follows: "To construct Common words,..., we first determine the relative frequency of all words occurring in all documents. We then calculate Common words as the average of this proportion for every word occurring in a given document. The higher the value of common words, the more ordinary is the documents’s language and thus the more readable it should be." (Loughran & McDonald 2014)

Can anybody help me with this? I work with corpus objects in order to make analysis with the text documents in R.

I have already computed the relative frequency of all words occurring in all documents as follows:

dfm_Notes_Summary <- dfm(tokens_Notes_Summary)
Summary_FreqStats_Notes <- textstat_frequency(dfm_Notes_Summary)

Summary_FreqStats_Notes$RelativeFreq <- Summary_FreqStats_Notes$frequency/sum(Summary_FreqStats_Notes$frequency)

-> I basically transformed the tokens object (tokens_Notes_Summary) into an dfm Object (dfm_Notes_Summary) and got the relative frequency of all words in all documents.

Now I struggle to calculate the average of this proportion for every word occurring in a given document.

Original Q&A

There are 1 answers

**Ken Benoit** · Answer 1 · 2022-11-30T15:18:49+00:00

I reread Loughran and McDonald (2014) meant, since I could not find code for that, but I think it's based on the average of a document's terms' document frequencies. The code will probably make this more clear:

library("quanteda")
#> Package version: 3.2.3
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

dfmat <- data_corpus_inaugural |>
    head(5) |>
    tokens(remove_punct = TRUE, remove_numbers = TRUE) |>
    dfm()

readability_commonwords <- function(x) {
    # compute document frequencies of all features
    relative_docfreq <- docfreq(x) / nfeat(x)
    # average of all words by the relative document frequency
    result <- x %*% relative_docfreq
    # return as a named vector
    structure(result[, 1], names = rownames(result))
}

readability_commonwords(dfmat)
#> 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson  1805-Jefferson 
#>       2.6090768       0.2738525       4.2026818       3.0928314       3.8256833

To know full details though you should ask the authors.

^{Created on 2022-11-30 with reprex v2.0.2}

TechQA.

Computing frequency of words for each document in a corpus/DFM for R

There are 1 answers

Related Questions in R

Related Questions in CORPUS

Related Questions in QUANTEDA

Related Questions in WORD-FREQUENCY

Popular Questions

Trending Questions