I've trained OpenNMT-py models from English to German and from Italian to German on Europarl, and I got very low BLEU scores: 8.13 for English -> German and 4.79 for Italian -> German.
As I'm no expert in NNs (yet), I adopted the default configuration provided by the library. Training for 13 epochs took approximately 20 hours for each model, and in both cases I used 80% of the dataset for training, 10% for validation, and 10% for testing.
Below are the commands I used to create the Italian -> German model; I used a similar sequence of commands for the other one. Can anybody give me advice on how to improve the effectiveness of my models?
# $ wc -l Europarl.de-it.de
# 1832052 Europarl.de-it.de
head -n 1465640 Europarl.de-it.de > train_de-it.de
head -n 1465640 Europarl.de-it.it > train_de-it.it
tail -n 366412 Europarl.de-it.de | head -n 183206 > dev_de-it.de
tail -n 366412 Europarl.de-it.it | head -n 183206 > dev_de-it.it
tail -n 183206 Europarl.de-it.de > test_de-it.de
tail -n 183206 Europarl.de-it.it > test_de-it.it
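(Just as an illustrative sanity check, not part of my original run: the three splits should add up to the original line count.)
# expect 1465640 + 183206 + 183206 = 1832052 lines in total
wc -l train_de-it.de dev_de-it.de test_de-it.de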
perl tokenizer.perl -a -no-escape -l de < ../data/train_de-it.de > ../data/train_de-it.atok.de
perl tokenizer.perl -a -no-escape -l de < ../data/dev_de-it.de > ../data/dev_de-it.atok.de
perl tokenizer.perl -a -no-escape -l de < ../data/test_de-it.de > ../data/test_de-it.atok.de
perl tokenizer.perl -a -no-escape -l it < ../data/train_de-it.it > ../data/train_de-it.atok.it
perl tokenizer.perl -a -no-escape -l it < ../data/dev_de-it.it > ../data/dev_de-it.atok.it
perl tokenizer.perl -a -no-escape -l it < ../data/test_de-it.it > ../data/test_de-it.atok.it
python3 preprocess.py \
-train_src ../data/train_de-it.atok.it \
-train_tgt ../data/train_de-it.atok.de \
-valid_src ../data/dev_de-it.atok.it \
-valid_tgt ../data/dev_de-it.atok.de \
-save_data ../data/europarl_de_it.atok.low \
-lower
python3 train.py \
-data ../data/europarl_de_it.atok.low.train.pt \
-save_model ../models_en_de/europarl_it_de_models \
-gpus 0
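The BLEU scores were computed roughly along these lines (not my exact commands: the checkpoint name below is a placeholder for whichever .pt file train.py saved, multi-bleu.perl is the Moses scoring script, and translate.py options can differ between OpenNMT-py versions):
# translate the tokenized test set with a saved checkpoint
# (since -lower was used at preprocessing time, lowercasing the test input first keeps things consistent)
python3 translate.py \
-model ../models_en_de/<checkpoint>.pt \
-src ../data/test_de-it.atok.it \
-output ../data/pred_de-it.atok.de \
-gpu 0
# score against the tokenized German reference with the Moses script
perl multi-bleu.perl ../data/test_de-it.atok.de < ../data/pred_de-it.atok.de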
You can find a lot of hints in Training Romance Multi-Way model and Training English-German WMT15 NMT engine. The main idea is to run BPE tokenization on a concatenated XXYY training corpus (source and target sides together) and then tokenize the training corpora with the learned BPE model.
Byte Pair Encoding tokenization should be especially beneficial for German because of its compounding: the algorithm segments words into subword units. The trick is that you need to train the BPE model on a single training corpus containing both source and target. See Jean Senellart's comment.
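In practice the recipe looks roughly like this. I'm showing the subword-nmt scripts here rather than the exact tools from the tutorials, and the 32000 merge operations and the file names are only assumptions to adapt to your setup:
# learn one BPE model on the concatenation of the tokenized source and target training data
cat ../data/train_de-it.atok.it ../data/train_de-it.atok.de | subword-nmt learn-bpe -s 32000 -o ../data/bpe32k.codes
# apply the learned codes to every corpus before running preprocess.py
for f in train dev test; do
  for l in it de; do
    subword-nmt apply-bpe -c ../data/bpe32k.codes < ../data/${f}_de-it.atok.${l} > ../data/${f}_de-it.atok.bpe.${l}
  done
done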
Another idea is to tokenize with -case_feature. It is also a good idea for all languages where letters can have different case. See Jean's comment: To improve MT quality, you might also try
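To make the -case_feature point concrete: it is an option of OpenNMT's own tokenizer (tools/tokenize.lua in the Lua version), not of the Moses tokenizer.perl used above. A rough sketch, assuming that toolkit is checked out and Torch is installed (the file names are only illustrative):
# lowercase each token and attach its original casing as an extra word feature
th tools/tokenize.lua -case_feature < ../data/train_de-it.de > ../data/train_de-it.case.de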