I have tried to work with giza++ on window (using Cygwin compiler). I used this code:
//Suppose source language is French and target language is English
plain2snt.out FrenchCorpus.f EnglishCorpus.e
mkcls -c30 -n20 -pFrenchCorpus.f -VFrenchCorpus.f.vcb.classes opt
mkcls -c30 -n20 -pEnglishCorpus.e -VEnglishCorpus.e.vcb.classes opt
snt2cooc.out FrenchCorpus.f.vcb EnglishCorpus.e.vcb FrenchCorpus.f_EnglishCorpus.e.snt >courpuscooc.cooc
GIZA++ -S FrenchCorpus.f.vcb -T EnglishCorpus.e.vcb -C FrenchCorpus.f_EnglishCorpus.e.snt -m1 100 -m2 30 -mh 30 -m3 30 -m4 30 -m5 30 -p1 o.95 -CoocurrenceFile courpuscooc.cooc -o dictionary
But the after getting the output files from giza++ and evaluate output, I observed that the results were too bad.
My evaluation result was:
RECALL = 0.0889
PRECISION = 0.0990
F_MEASURE = 0.0937
AER = 0.9035
Dose any body know the reason? Could the reason be that I have forgotten some parameters or I should change some of them?
in other word:
first I wanted train giza++ by huge amount of data and then test it by small corpus and compare its result by desired alignment(GOLD STANDARD) , but I don't find any document or useful page in web.
can you introduce useful document?
Therefore I ran it by small courpus (447 sentence) and compared result by desired alignment.do you think this is right way?
Also I changed my code as follows and got better result but It's still not good:
GIZA++ -S testlowsf.f.vcb -T testlowde.e.vcb -C testlowsf.f_testlowde.e.snt -m1 5 -m2 0 -mh 5 -m3 5 -m4 0 -CoocurrenceFile inputcooc.cooc -o dictionary -model1dumpfrequency 1 -model4smoothfactor 0.4 -nodumps 0 -nsmooth 4 -onlyaldumps 1 -p0 0.999 -diagonal yes -final yes
result of evaluation :
// suppose A is result of GIZA++ and G is Gold standard. As and Gs is S link in A And G files. Ap and Gp is p link in A and G files.
RECALL = As intersect Gs/Gs = 0.6295
PRECISION = Ap intersect Gp/A = 0.1090
FMEASURE = (2*PRECISION*RECALL)/(RECALL + PRECISION) = 0.1859
AER = 1 - ((As intersect Gs + Ap intersect Gp)/(A + S)) = 0.7425
Do you know the reason?
Where did you get those parameters? 100 iterations of model1?! Well, if you actually manage to run this, I strongly suspect that you have a very small parallel corpus. If so, you should consider adding more parallel data in training. And how exactly do you calculate the recall and precision measures?
EDIT:
With less than 500 sentences you're unlikely to get any reasonable performance. The usual way to do it is not find a larger (unaligned) parallel corpus, run GIZA++ on both together and then evaluate the small part for which you have the manual alignments. Check Europarl or MultiUN, these are freely available corpora, both contain a relatively large amount of English-French parallel data. The instructions on preparing the data can be found on the websites.