I am using GIZA++ to align words in bitexts from the Europarl corpus. Before training the alignment model with GIZA++, I need to run the mkcls tool to create the word classes required by the Hidden Markov Model algorithm, like so:
mkcls -n10 -pcorp.tok.low.src -Vcorp.tok.low.src.vcb.classes
I tried it on a small corpus of 1,000 lines; it worked properly and completed in a few minutes. Now I am running it on a corpus of 1,500,000 lines, and it has been using 100% of one of my CPU cores (Six-Core AMD Opteron(tm) Processor 2431 × 12).
Before making the classes, I took the necessary preprocessing steps: tokenizing, lowercasing, and filtering out lines with more than 40 words.
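For reference, the lowercasing and length-filtering steps can be illustrated with standard Unix tools (this is only a sketch on a tiny made-up sample; for real data the Moses preprocessing scripts handle tokenization and lowercasing properly, and the file names here are placeholders):

```shell
# Tiny sample corpus: one short line and one 41-word line
printf 'Hello World\n' > corp.src
{ yes word | head -41 | tr '\n' ' '; echo; } >> corp.src

# Lowercase every character
tr '[:upper:]' '[:lower:]' < corp.src > corp.low.src

# Drop lines with more than 40 words (NF = number of whitespace-separated fields)
awk 'NF <= 40' corp.low.src > corp.low.filt.src

cat corp.low.filt.src   # only the short, lowercased line remains
```

Only "hello world" survives the filter, since the 41-word line exceeds the length cutoff.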
Does anyone have similar experience with mkcls for GIZA++? How did you solve it? If anyone has run it on the Europarl corpus, how long did mkcls take for you?
Because the mkcls script for MOSES and GIZA++ isn't parallelized, and given the number of sentences and words in the 1.5-million-line Europarl corpus, it takes around 1-2 hours to build the vocabulary classes. The other pre-GIZA++ processing steps (viz. plain2snt and snt2cooc) take far less time and processing power.
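For comparison, the whole pre-GIZA++ sequence looks roughly like this (a sketch based on the standard GIZA++ tool names; the file names follow the question's setup and should be adjusted to yours):

```shell
# Build word classes for the HMM alignment model (the slow, single-threaded step),
# once per language side
mkcls -n10 -pcorp.tok.low.src -Vcorp.tok.low.src.vcb.classes
mkcls -n10 -pcorp.tok.low.tgt -Vcorp.tok.low.tgt.vcb.classes

# Convert plain text to GIZA++'s snt format and build the vocabularies (fast)
plain2snt.out corp.tok.low.src corp.tok.low.tgt

# Build the cooccurrence file (fast)
snt2cooc.out corp.tok.low.src.vcb corp.tok.low.tgt.vcb \
    corp.tok.low.src_corp.tok.low.tgt.snt > corp.cooc
```

Only the two mkcls runs are expensive; everything after them is a matter of minutes even on a large corpus.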