I tried the code from http://tidytextmining.com/tfidf.html. My result can be seen in this image.
My question is: How can I rewrite the code to produce the negative relationship between the term frequency and the rank?
My code and the term-document matrix follow. Any comments are highly appreciated.
# Zipf's law
freq_rk <- DTM_words %>%
  group_by(document) %>%
  mutate(rank = row_number(),
         term_frequency = count / total)
freq_rk %>%
  ggplot(aes(rank, term_frequency, color = document)) +
  geom_line(size = 1.2, alpha = 0.8)
DTM_words
# A tibble: 4,530 x 5
  document term       count     n total
  <chr>    <chr>      <dbl> <int> <dbl>
1 1        activ          1     1   109
2 1        agencydebt     1     1   109
3 1        assess         1     1   109
4 1        avail          1     1   109
5 1        balanc         2     1   109
# ... with 4,520 more rows
To use row_number() to get rank, you need to make sure that your data frame is ordered by n, the number of times a word is used in a document. Let's look at an example. It sounds like you are starting with a document-term matrix that you are tidying? (I'm going to use some example data that is similar to a DTM from quanteda.)
Notice that here, you have a tidy data frame with one word per row, but it is not ordered by count, the number of times that each word was used in each document. If we used row_number() here to try to assign rank, it wouldn't be meaningful because the words are all jumbled up in order.
Instead, we can arrange this by descending count.
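As a rough illustration of the arranging step, here is a minimal sketch. The data frame tidy_dtm and its values are made up for this example (they are not the asker's DTM_words); only the column names document, term, and count mirror the question.

```r
library(dplyr)

# Hypothetical tidied DTM: one row per document-term pair, in no useful order.
# The terms and counts are invented for illustration.
tidy_dtm <- tibble(
  document = c("1", "1", "1", "1", "2", "2", "2"),
  term     = c("activ", "balanc", "assess", "avail", "rate", "bank", "polici"),
  count    = c(1, 5, 2, 3, 4, 1, 6)
)

# Sort so that, within each document, the most frequent term comes first.
sorted_dtm <- tidy_dtm %>%
  arrange(document, desc(count))

sorted_dtm
```

After arrange(), the row order within each document reflects frequency, which is what a position-based rank relies on.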
Now we can use row_number() to get rank, because the data frame is actually ranked/arranged/ordered/sorted/however you want to say it.
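Putting the pieces together, a sketch of the whole pipeline might look like the following. The example_words data is a small invented stand-in for DTM_words, and total is assumed here to be the sum of count per document; with the real data you would presumably start from DTM_words and only add the arrange() step before group_by().

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for DTM_words; the values are invented for illustration.
example_words <- tibble(
  document = c("1", "1", "1", "2", "2", "2"),
  term     = c("activ", "balanc", "assess", "rate", "bank", "polici"),
  count    = c(1, 5, 2, 4, 1, 6)
) %>%
  group_by(document) %>%
  mutate(total = sum(count)) %>%  # assumption: total = words per document
  ungroup()

# Arrange by descending count first, then rank within each document.
freq_rk <- example_words %>%
  arrange(desc(count)) %>%
  group_by(document) %>%
  mutate(rank = row_number(),
         term_frequency = count / total) %>%
  ungroup()

# Term frequency now falls as rank increases within each document.
zipf_plot <- ggplot(freq_rk, aes(rank, term_frequency, color = document)) +
  geom_line(alpha = 0.8)
```

Printing zipf_plot draws one line per document. Because rank 1 is now the most frequent term, each line should slope downward.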