Correct NMT metrics with Fairseq on non-Latin languages


As you may know, to calculate BLEU properly you need to pass it an appropriate tokenizer. I'm working with Korean, so I would expect to pass --tokenize ko-mecab to sacrebleu. I know that fairseq computes BLEU for the translation task during validation, but I found no way to pass that option through (I even opened an issue: https://github.com/facebookresearch/fairseq/issues/5308).

Another option I considered was using chrF, since it's not dependent on tokenization, but as far as I can tell from the code, fairseq only uses the BLEU metric from sacrebleu.
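To illustrate why a character n-gram F-score such as chrF does not depend on tokenization, here is a minimal pure-Python sketch. It is a simplified illustration under my own assumptions, not sacrebleu's implementation and not something fairseq exposes; the names char_ngrams and simple_chrf are hypothetical, and the defaults (n-gram orders 1 to 6, beta = 2) follow the commonly used chrF settings.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing all whitespace."""
    # Dropping whitespace means word segmentation cannot affect the counts.
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified sentence-level chrF-style score in the range [0, 100].

    Hypothetical helper for illustration only; scores will not match
    sacrebleu exactly.
    """
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # this order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    reference = "나는 학교에 간다"
    print(simple_chrf("나는 학교에 갔다", reference))
    # Same characters, different spacing: the score is unchanged.
    print(simple_chrf("나는학교에 갔다", reference))
```

Because whitespace is discarded before counting, two hypotheses that differ only in spacing receive the same score, which is the property I was hoping to get inside fairseq's validation loop.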

I'm also aware that there's an option to compute BLEU with your own tokenizer, but in that case the metric becomes tokenizer-dependent, which I also don't want.

I would be grateful for any kind of suggestions on the matter.
