I have been looking for the meaning of the numbers in the giza++ phrase-table output on the official website (and in the PDF manual): http://www.statmt.org/moses/?n=FactoredTraining.ScorePhrases
This is what I've come up with.
Let's say this is a line from the phrase-table:
načiniti na koji ||| way in which ||| 0.0833333 * 0.333333 * ||| * ||| 12 3 1
That means:
e = "načiniti na koji"
f = "way in which"
count(e) = 12
count(f) = 3
count(e, f) = 1
p(f|e) = count(f, e) / count(e) = 1/12 = 0.083333
p(e|f) = count(f, e) / count(f) = 1/3 = 0.333333
This all makes perfect sense.
Yet if I search the corpus with a text editor, I get:
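The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch under the reading described in this post: it assumes the last `|||`-separated field holds count(e), count(f) and count(e, f) in that order. The function names are made up for illustration and are not part of Moses or giza++.

```python
def parse_phrase_table_line(line):
    """Split one phrase-table line on '|||' and return (e, f, counts)."""
    fields = [part.strip() for part in line.split("|||")]
    e, f = fields[0], fields[1]
    # Assumption from this post: the last field is "count(e) count(f) count(e, f)".
    counts = [int(token) for token in fields[-1].split()]
    return e, f, counts


def scores_from_counts(counts):
    """Recompute p(f|e) and p(e|f) from the three counts."""
    count_e, count_f, count_joint = counts
    return count_joint / count_e, count_joint / count_f


# Scores are replaced by '*' placeholders here; only the counts are parsed.
LINE = "načiniti na koji ||| way in which ||| * * * * ||| * ||| 12 3 1"

e, f, counts = parse_phrase_table_line(LINE)
p_f_given_e, p_e_given_f = scores_from_counts(counts)
print(e, "|||", f, "|||", f"{p_f_given_e:.6f}", f"{p_e_given_f:.6f}")
```

Only the counts field is used, so the result does not depend on how the score columns are interpreted.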
count("načiniti na koji") = 4
count("way in which") = 9
i.e., totally different numbers.
Another strange thing is:
osnivanje i ||| the ||| 0.000124085 * 1 * ||| 0-0 ||| 8059 1 1
So, going by the explanation on the official website,
count("the") = 1
and
count("osnivanje i") = 8059.
One explanation could be that the two counts are simply the other way around.
But the real count("the") is 21466.
Are there any other tutorials or manuals that explain the contents of the giza++ output files more clearly?
So, I figured out it should go something like this:
Giza runs through the parallel corpus.
Whenever two phrases are aligned, the pair is written out to a text file; let's call it f_phrases.
Let the notation be:
e - foreign giza member
f - English giza member
Once this is done, f_phrases is sorted in two ways, and that's how we get the two table files.
Pairs are sorted so that all English translations of a certain foreign phrase (e) are next to each other, e.g.
therefore we conclude that
Afterwards, the pairs are sorted so that all foreign-language translations of a certain English phrase (f) are next to each other, e.g.
and we see that
count(f) = count("analysis and") = 14
Since it is the same table, just sorted in a different manner, we see that
count("analysis and", "analiza i") = count("analiza i", "analysis and") = 17
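The sort-and-count idea described above can be sketched roughly as follows. The phrase pairs are hypothetical toy data, not taken from any real f_phrases file, and `counts_by_side` is a made-up helper, not a Moses or giza++ function. The sketch only shows that sorting on one side gives count(e), sorting on the other gives count(f), and the joint count is the same in either order.

```python
from collections import Counter
from itertools import groupby

# Hypothetical extracted phrase pairs (e, f), one entry per extraction.
pairs = [
    ("analiza i", "analysis and"),
    ("analize i", "analysis and"),
    ("analiza i", "the analysis and"),
    ("analiza i", "analysis and"),
]


def counts_by_side(pairs, index):
    """Sort the pairs on one side and count each run of identical phrases."""
    ordered = sorted(pairs, key=lambda pair: pair[index])
    return {
        phrase: sum(1 for _ in run)
        for phrase, run in groupby(ordered, key=lambda pair: pair[index])
    }


count_e = counts_by_side(pairs, 0)  # sorted on the foreign side
count_f = counts_by_side(pairs, 1)  # sorted on the English side
count_joint = Counter(pairs)        # does not depend on the sort order

for (e, f), joint in sorted(count_joint.items()):
    p_f_given_e = joint / count_e[e]
    p_e_given_f = joint / count_f[f]
    # Simplified output, not the real phrase-table layout.
    print(f"{e} ||| {f} ||| {p_f_given_e:.6f} {p_e_given_f:.6f} "
          f"||| {count_e[e]} {count_f[f]} {joint}")
```

Under this reading, the counts reflect how often a phrase was extracted as part of an aligned pair, which need not match the number of times the string occurs in the raw text.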
The resulting phrase-table then looks like:
When the conditional probabilities are calculated, the inverse order is used, matching the order in the phrase-table: