OpenNMT-py: low BLEU scores for translation into German


I've trained OpenNMT-py models from English to German and from Italian to German on Europarl and got very low BLEU scores: 8.13 for English -> German and 4.79 for Italian -> German.

As I'm no expert in NNs (yet), I adopted the default configuration provided by the library. Training for 13 epochs took approximately 20 hours in each case. In both cases I used 80% of the dataset for training, 10% for validation, and 10% for testing.

Below are the commands I used to create the Italian -> German model; I used a similar sequence of commands for the other model. Can anybody give me advice on how to improve the effectiveness of my models?

# $ wc -l Europarl.de-it.de
# 1832052 Europarl.de-it.de

head -1465640 Europarl.de-it.de > train_de-it.de
head -1465640 Europarl.de-it.it > train_de-it.it

tail -n 366412 Europarl.de-it.de | head -183206 > dev_de-it.de
tail -n 366412 Europarl.de-it.it | head -183206 > dev_de-it.it

tail -n 183206 Europarl.de-it.de > test_de-it.de
tail -n 183206 Europarl.de-it.it > test_de-it.it

perl tokenizer.perl -a -no-escape -l de < ../data/train_de-it.de > ../data/train_de-it.atok.de
perl tokenizer.perl -a -no-escape -l de < ../data/dev_de-it.de > ../data/dev_de-it.atok.de
perl tokenizer.perl -a -no-escape -l de < ../data/test_de-it.de > ../data/test_de-it.atok.de

perl tokenizer.perl -a -no-escape -l it < ../data/train_de-it.it > ../data/train_de-it.atok.it
perl tokenizer.perl -a -no-escape -l it < ../data/dev_de-it.it > ../data/dev_de-it.atok.it
perl tokenizer.perl -a -no-escape -l it < ../data/test_de-it.it > ../data/test_de-it.atok.it

python3 preprocess.py \
-train_src ../data/train_de-it.atok.it \
-train_tgt ../data/train_de-it.atok.de \
-valid_src ../data/dev_de-it.atok.it \
-valid_tgt ../data/dev_de-it.atok.de \
-save_data ../data/europarl_de_it.atok.low \
-lower

python3 train.py \
-data ../data/europarl_de_it.atok.low.train.pt \
-save_model ../models_en_de/europarl_it_de_models \
-gpus 0
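
For completeness, the translations and BLEU scores above would be produced roughly along these lines (the checkpoint filename is a placeholder for whatever train.py saved, multi-bleu.perl comes from the Moses scripts, and the translate.py option names follow the early OpenNMT-py releases, so they may differ in newer versions):

python3 translate.py \
-model ../models_en_de/europarl_it_de_models_e13.pt \
-src ../data/test_de-it.atok.it \
-output ../data/pred_de-it.atok.de \
-gpu 0

# case-insensitive BLEU against the tokenized reference (data was lowercased by preprocess.py -lower)
perl multi-bleu.perl -lc ../data/test_de-it.atok.de < ../data/pred_de-it.atok.de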

1 Answer

Accepted answer (by Wiktor Stribiżew):

You can find a lot of hints in the Training Romance Multi-Way model and the Training English-German WMT15 NMT engine tutorials. The main idea is to learn BPE on a concatenated XXYY (source + target) training corpus and then tokenize the training corpora with the learned BPE model.

Byte Pair Encoding (BPE) tokenization should be beneficial for German because of its compounding: the algorithm segments words into subword units. The trick is that you need to train the BPE model on a single training corpus containing both source and target. See Jean Senellart's comment:

The BPE model should be trained on the training corpus only - and ideally, you train one single model for source and target so that the model learns easily to translate identical word fragments from source to target. So I would concatenate the source and target training corpora, then tokenize it once, then learn a BPE model on this single corpus, which you then use for tokenization of the test/valid/train corpora in source and target.
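
As a rough sketch of that workflow using the subword-nmt implementation of BPE (the merge count of 32000 and the file names are illustrative choices, not values from the question or the answer):

# learn one BPE model on the concatenation of the tokenized source and target training data
cat ../data/train_de-it.atok.it ../data/train_de-it.atok.de > ../data/train_de-it.atok.both
subword-nmt learn-bpe -s 32000 < ../data/train_de-it.atok.both > ../data/bpe.codes

# apply the same codes to every split, on both sides
for f in train dev test; do
  subword-nmt apply-bpe -c ../data/bpe.codes < ../data/${f}_de-it.atok.it > ../data/${f}_de-it.bpe.it
  subword-nmt apply-bpe -c ../data/bpe.codes < ../data/${f}_de-it.atok.de > ../data/${f}_de-it.bpe.de
done

preprocess.py and train.py are then run on the .bpe files instead of the .atok ones.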

Another idea is to tokenize with -case_feature, which helps for any language where letters can differ in case. See Jean's comment:

in general using -case_feature is a good idea for almost all languages (with case) - and shows good performance for dealing with, and rendering in the target, case variation in the source (for instance all uppercase/lowercase, or capitalized words, ...).
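
The -case_feature option belongs to OpenNMT's own tokenizer (tools/tokenize.lua in the original LuaTorch OpenNMT), so using it means replacing the Moses tokenizer.perl calls with something like the following (the exact option set may differ between releases; the paths are the same placeholders as above):

th tools/tokenize.lua -case_feature -joiner_annotate < ../data/train_de-it.de > ../data/train_de-it.tok.de

With case features, tokens are emitted lowercased and the original casing travels as a word feature, so the -lower flag passed to preprocess.py should no longer be needed.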

To improve MT quality, you might also try:

  1. Getting more corpora (e.g. the WMT16 corpora)
  2. Tuning with in-domain training data