What is the difference between mteval-v13a.pl and NLTK BLEU?


There is an implementation of BLEU score in Python NLTK, nltk.translate.bleu_score.corpus_bleu
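For reference, a minimal corpus_bleu call looks like this (the sentences are invented examples; note that both references and hypotheses must already be split into tokens):

```python
from nltk.translate.bleu_score import corpus_bleu

# Pre-tokenized input: a list of reference-lists (one per hypothesis),
# and a list of hypothesis token lists.
references = [[["the", "cat", "is", "on", "the", "mat"]]]
hypotheses = [["the", "cat", "is", "on", "the", "mat"]]

score = corpus_bleu(references, hypotheses)
print(score)  # 1.0 for an exact match
```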

But I am not sure whether it is the same as the mteval-v13a.pl script.

What is the difference between them?

1 answer

alvas (BEST ANSWER)

TL;DR

Use https://github.com/mjpost/sacrebleu when evaluating Machine Translation systems.

In Short

No, the BLEU in NLTK isn't exactly the same as mteval-v13a.pl.

But it can get really close; see https://github.com/nltk/nltk/issues/1330#issuecomment-256237324

nltk.translate.corpus_bleu corresponds to mteval-v13a.pl up to the 4th n-gram order, with some floating-point discrepancies

The details of the comparison and the dataset used can be downloaded from https://github.com/nltk/nltk_data/blob/gh-pages/packages/models/wmt15_eval.zip or:

import nltk
nltk.download('wmt15_eval')

The major differences:

(The original answer shows the comparison table as an image.)


In Long

There are several differences between mteval-v13a.pl and nltk.translate.corpus_bleu:

  • The first difference is that mteval-v13a.pl comes with its own NIST tokenizer, while the NLTK version of BLEU implements only the metric itself and assumes its input is pre-tokenized.

    • BTW, this ongoing PR will bridge the gap between NLTK and NIST tokenizers
  • The other major difference is that mteval-v13a.pl expects its input in .sgm format, while NLTK BLEU takes a Python list of lists of strings; see the README.txt in the zipball here for more information on how to convert a text file to SGM.

  • mteval-v13a.pl computes n-gram orders 1 to 4. If the sentence/corpus yields no matching n-grams at some order up to 4, the precision for that order is 0, whose logarithm is float('-inf'), and the script returns a 0 score. To emulate this behavior, NLTK added an _emulate_multibleu flag.

  • mteval-v13a.pl is able to generate NIST scores, while NLTK doesn't have a NIST score implementation (at least not yet)
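The zero-count point can be seen with a short sentence (a made-up two-token example; with the default unsmoothed settings, recent NLTK versions warn about the missing 3-/4-gram overlaps and the score collapses to effectively zero):

```python
from nltk.translate.bleu_score import sentence_bleu

# Two tokens means there are no 3-grams or 4-grams at all, so the
# higher-order precisions are 0 and the geometric mean collapses,
# mirroring mteval-v13a.pl's 0 score for such inputs.
reference = ["the", "cat"]
hypothesis = ["the", "cat"]

score = sentence_bleu([reference], hypothesis)
print(score)  # effectively 0 (NLTK also emits a warning)
```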

Beyond those differences, NLTK's BLEU implementation packs in more features, e.g. configurable n-gram weights and smoothing functions.
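For example, both custom n-gram weights and smoothing are exposed as parameters of corpus_bleu (the sentences below are invented; SmoothingFunction lives in nltk.translate.bleu_score):

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

references = [[["the", "quick", "brown", "fox"]]]
hypotheses = [["the", "fast", "brown", "fox"]]

score = corpus_bleu(
    references,
    hypotheses,
    weights=(0.5, 0.5),  # score unigrams and bigrams only, equally weighted
    # method1 adds a tiny epsilon to zero counts (no effect here since
    # every precision is non-zero; shown only to illustrate the parameter).
    smoothing_function=SmoothingFunction().method1,
)
print(score)  # sqrt(3/4 * 1/3) = 0.5
```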

Lastly, to validate the features added to NLTK's version of BLEU, a regression test was added to account for them; see https://github.com/nltk/nltk/blob/develop/nltk/test/unit/translate/test_bleu.py