I am trying to understand how machine translation quality is evaluated with automatic scores.
I understand what the BLEU score is trying to achieve. It looks at different n-grams (BLEU-1, BLEU-2, BLEU-3, BLEU-4) and tries to match them against the human-written reference translation.
However, I can't really understand what the METEOR score measures when evaluating MT quality. I am trying to understand the rationale intuitively. I have already looked at several blog posts but still can't figure it out.
How are these two evaluation metrics different, and how are they related?
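For example, this is how I have been computing BLEU in Python with nltk (a toy example, the sentences are made up), and there I can see the n-gram matching happening:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]   # human reference (tokenized)
hypothesis = ["the", "cat", "is", "on", "the", "mat"]     # MT output (tokenized)

smooth = SmoothingFunction().method1
# Cumulative BLEU-1 .. BLEU-4: uniform weights over the first n n-gram precisions
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    score = sentence_bleu(reference, hypothesis, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```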
Can anyone please help?
METEOR is a modification of the standard precision-recall type of evaluation for MT. You want every word from the translation hypothesis to have a counterpart in the reference translation (precision) and every word from the reference translation to be covered by the hypothesis (recall). Recall is weighted as nine times more important than precision in the harmonic mean.
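A minimal sketch of that part in Python, assuming plain exact-match unigram counting (real METEOR aligns words, and also matches stems, synonyms and paraphrases, see below):

```python
from collections import Counter

def unigram_precision_recall(hypothesis, reference):
    """Clipped exact unigram matches between hypothesis and reference."""
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    matches = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    precision = matches / len(hypothesis)
    recall = matches / len(reference)
    return precision, recall

def meteor_fmean(precision, recall):
    """Harmonic mean that weights recall 9x more than precision."""
    if precision == 0 or recall == 0:
        return 0.0
    return 10 * precision * recall / (recall + 9 * precision)

hyp = "the cat is on the mat".split()
ref = "the cat sat on the mat".split()
p, r = unigram_precision_recall(hyp, ref)
print(p, r, meteor_fmean(p, r))
```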
For this, a (monolingual) alignment between the words in the hypothesis and the reference is needed. This is not easy with machine translation because the translation might use different words to express the same meaning. METEOR therefore matches not only identical words but also entries from language-specific tables of word and n-gram paraphrases.
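To illustrate the idea, here is a toy matching stage with a made-up paraphrase table; the greedy one-to-one alignment and the PARAPHRASES dictionary are my own simplifications, not METEOR's actual matcher:

```python
# Hypothetical, language-specific paraphrase table (METEOR ships real ones).
PARAPHRASES = {"sofa": {"couch"}, "couch": {"sofa"}}

def align(hypothesis, reference):
    """Greedy one-to-one alignment of hypothesis words to reference positions."""
    used = set()
    alignment = []  # list of (hyp_index, ref_index) pairs
    for i, h in enumerate(hypothesis):
        for j, r in enumerate(reference):
            if j in used:
                continue
            # Accept an exact match or a paraphrase-table match.
            if h == r or r in PARAPHRASES.get(h, set()):
                alignment.append((i, j))
                used.add(j)
                break
    return alignment

print(align("the cat sat on the couch".split(),
            "the cat sat on the sofa".split()))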
Finally, there is a penalty for a fragmented alignment. If you randomly shuffled the translation, you could still get a perfect word-level alignment, but the sentence would obviously be broken. The aligned words are grouped into the smallest possible number of contiguous chunks; the penalty is proportional to the cube of the number of chunks divided by the total number of aligned words, and the final score is the weighted harmonic mean multiplied by one minus this penalty.
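Continuing the sketch, chunk counting and the penalty might look like this; the 0.5 factor and the cube are the constants from the original METEOR paper, and the alignment format follows the toy aligner above:

```python
def count_chunks(alignment):
    """Number of contiguous runs of matches that are adjacent in both sentences.
    `alignment` is a list of (hyp_index, ref_index) pairs sorted by hyp_index."""
    if not alignment:
        return 0
    chunks = 1
    for (h1, r1), (h2, r2) in zip(alignment, alignment[1:]):
        if not (h2 == h1 + 1 and r2 == r1 + 1):
            chunks += 1
    return chunks

def meteor_score(fmean, alignment):
    """Original METEOR: score = Fmean * (1 - 0.5 * (chunks / matches)^3)."""
    matches = len(alignment)
    if matches == 0:
        return 0.0
    penalty = 0.5 * (count_chunks(alignment) / matches) ** 3
    return fmean * (1 - penalty)

# A perfectly contiguous alignment of 6 words -> 1 chunk, so only a tiny penalty.
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
print(meteor_score(0.833, alignment))  # fmean taken from the earlier sketch
```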