There is an implementation of BLEU score in Python NLTK,
nltk.translate.bleu_score.corpus_bleu
But I am not sure if it is the same as the mtevalv13a.pl script.
What is the difference between them?
TL;DR
Use https://github.com/mjpost/sacrebleu when evaluating Machine Translation systems.
In Short
No, the BLEU in NLTK isn't exactly the same as `mteval-13a.pl`, but it can get really close; see https://github.com/nltk/nltk/issues/1330#issuecomment-256237324
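For reference, a sketch of the NLTK API itself (assuming NLTK is installed): note that references come first, hypotheses second, and everything must already be tokenized.

```python
from nltk.translate.bleu_score import corpus_bleu

# NLTK assumes pre-tokenized input:
# - hypotheses: a list of token lists, one per sentence
# - references: for each hypothesis, a list of reference token lists
#   (multiple references per sentence are allowed)
hypotheses = [["the", "cat", "sat", "on", "the", "mat"]]
references = [[["the", "cat", "sat", "on", "the", "mat"],
               ["there", "is", "a", "cat", "on", "the", "mat"]]]

print(corpus_bleu(references, hypotheses))  # 1.0 for a perfect match
```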
The details of the comparison and the dataset used can be downloaded from https://github.com/nltk/nltk_data/blob/gh-pages/packages/models/wmt15_eval.zip. The major differences are detailed below.
In Long
There are several differences between `mteval-13a.pl` and `nltk.translate.corpus_bleu`:

- `mteval-13a.pl` comes with its own NIST tokenizer, while the NLTK version of BLEU implements only the metric and assumes that the input is pre-tokenized.
- `mteval-13a.pl` expects the input to be in `.sgm` format, while NLTK BLEU takes a Python list of lists of strings; see the README.txt in the zipball above for more information on how to convert a text file to SGM.
- `mteval-13a.pl` expects an ngram order of at least 1-4. If the sentence/corpus cannot support ngrams up to order 4, it returns a 0 probability, i.e. `float('-inf')` in log space. To emulate this behavior, NLTK added an `_emulate_multibleu` flag.
- `mteval-13a.pl` is able to generate NIST scores, while NLTK doesn't have a NIST score implementation (at least not yet).

Beyond these differences, NLTK's BLEU packs in more features:
- It handles fringe cases that the original BLEU (Papineni, 2002) overlooked.
- For fringe cases where the largest ngram order available is < 4, the uniform weights of the individual ngram precisions are reweighted so that the mass of the weights sums to 1.0.
- While the NIST script has a single smoothing method (geometric sequence smoothing), NLTK provides an equivalent object with the same smoothing method plus several more smoothing methods for sentence-level BLEU from Chen and Cherry, 2014.
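As an illustration (a sketch assuming NLTK is installed), the smoothing methods live on `SmoothingFunction`; `method3` corresponds to NIST-style geometric sequence smoothing, and methods 1-7 cover the techniques surveyed by Chen and Cherry (2014).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]
hypothesis = ["the", "cat", "on", "the", "mat"]  # too short: no matching 4-grams

# Without smoothing, a zero 4-gram precision drives the score to ~0.
# method3 is the NIST-style geometric sequence smoothing.
smoother = SmoothingFunction().method3
print(sentence_bleu(reference, hypothesis, smoothing_function=smoother))
```

The smoothed score lands strictly between 0 and 1 instead of collapsing to zero, which is what makes sentence-level BLEU usable at all.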
Lastly, to validate the features added in NLTK's version of BLEU, a regression test was added to account for them; see https://github.com/nltk/nltk/blob/develop/nltk/test/unit/translate/test_bleu.py