I am looking into metrics for measuring the quality of text summarization. For this, I have found this SO answer, which states:
Bleu measures precision: how much the words (and/or n-grams) in the machine generated summaries appeared in the human reference summaries.
Rouge measures recall: how much the words (and/or n-grams) in the human reference summaries appeared in the machine generated summaries.
However, in this SE answer I find the following:
ROUGE-n recall=40% means that 40% of the n-grams in the reference summary are also present in the generated summary.
ROUGE-n precision=40% means that 40% of the n-grams in the generated summary are also present in the reference summary.
ROUGE-n F1-score=40% is more difficult to interpret, like any F1-score.
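To check my understanding of the SE answer's definitions, here is a minimal sketch of how I would compute ROUGE-n precision and recall. The whitespace tokenization and the clipping of repeated n-grams are my own assumptions, not necessarily what the official ROUGE script does:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n=1):
    """Return (precision, recall) of the n-gram overlap between candidate and reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    # count each shared n-gram at most as often as it occurs in the reference
    overlap = sum(min(cand[g], ref[g]) for g in cand)
    precision = overlap / max(sum(cand.values()), 1)  # denominator: n-grams in the candidate
    recall = overlap / max(sum(ref.values()), 1)      # denominator: n-grams in the reference
    return precision, recall

print(rouge_n("the cat sat on the mat", "the cat is on the mat", n=1))
# -> (0.833..., 0.833...): 5 of 6 candidate unigrams and 5 of 6 reference unigrams match
```

If that reading is right, the precision branch above counts essentially the same thing BLEU does (ignoring BLEU's brevity penalty and its averaging over several n-gram orders), which is exactly what confuses me.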
This seems contradictory: ROUGE precision sounds like what the SO answer describes as BLEU, while ROUGE recall matches what the SO answer says about ROUGE. Is ROUGE precision the same as BLEU, i.e. does it compute the same thing BLEU does?
It is also stated in the ROUGE paper:
It is clear that ROUGE-N is a recall-related measure because the denominator of the equation is the total sum of the number of n-grams occurring at the reference summary side. A closely related measure, BLEU, used in automatic evaluation of machine translation, is a precision-based measure.
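If I read that correctly, the only formal difference the paper points to is which side supplies the denominator:

ROUGE-N (recall) = matching n-grams / total n-grams in the reference summary
BLEU-style precision = matching (clipped) n-grams / total n-grams in the candidate, combined over several n-gram orders and multiplied by a brevity penalty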
Even so, I don't understand this, as ROUGE (at least the implementations I have seen) returns both a precision and a recall value. Can anybody bring some clarity to this? Thank you!