What do the counts in giza++ phrase-table mean?

I have been searching for the meaning of the numbers in the giza++ phrase-table output on the official website (and in the pdf manual): http://www.statmt.org/moses/?n=FactoredTraining.ScorePhrases

And this is what I've come up with.

Let's say this is a line from the phrase-table:

načiniti na koji ||| way in which ||| 0.0833333 * 0.33333 * ||| * ||| 12 3 1

That means:

e = "načiniti na koji"
f = "way in which"

count(e) = 12
count(f) = 3
count(e, f) = 1

p(f|e) = count(f, e) / count(e) = 1/12 = 0.0833333
p(e|f) = count(f, e) / count(f) = 1/3 = 0.333333
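
The arithmetic above can be checked with a small sketch. It assumes the reading given in this question, namely that the last field holds count(e), count(f) and count(e, f) in that order; the function names are made up for illustration and are not part of giza++ or Moses.

```python
def parse_counts(counts_field):
    """Parse the last phrase-table field, e.g. '12 3 1', into three integers.

    Assumed order (the reading used in this question, not a confirmed spec):
    count(e), count(f), count(e, f).
    """
    count_e, count_f, count_ef = (int(token) for token in counts_field.split())
    return count_e, count_f, count_ef


def relative_frequencies(count_e, count_f, count_ef):
    """Return (p(f|e), p(e|f)) as plain relative frequencies."""
    return count_ef / count_e, count_ef / count_f


count_e, count_f, count_ef = parse_counts("12 3 1")
p_f_given_e, p_e_given_f = relative_frequencies(count_e, count_f, count_ef)
print(f"p(f|e) = {p_f_given_e:.6f}")
print(f"p(e|f) = {p_e_given_f:.6f}")
```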

This all makes perfect sense.

Yet, if I do a text search in a text editor, I get:

count("načiniti na koji") = 4
count("way in which") = 9

i.e., totally different numbers.

Another strange thing is:

osnivanje i ||| the ||| 0.000124085 * 1 * ||| 0-0 ||| 8059 1 1

so, considering the explanation from the official website,

count("the") = 1,

and

count("osnivanje i") = 8059.

One explanation could be that the two counts are simply the other way round.

But the real count("the") is 21466.

Are there any other tutorials/manuals that better explain the content of the giza++ output files?

There are 1 answers

Slana Girica On

So, I figured out it should go something like this:

  • Giza runs through the parallel corpus

  • whenever two phrases are aligned, the pair is written to a text file, let's name it f_phrases

Let the notation be:

e - the foreign member of an aligned pair

f - the English member of an aligned pair

After this is done, f_phrases is sorted in two ways, and that's how we get the two table files

  1. extract.o.sorted

Pairs are sorted so that all English translations of a certain foreign phrase (e) are next to each other, e.g.

analiza i      analysis and
analiza i      analysis and
analiza i      analysis and
analiza i      analysis and
analiza i      analysis and
analiza i      analysis and
analiza i      analysis and
analiza i      analysis and
analiza i      analysis and
analiza i      analysis and
analiza i      analysis and
analiza i      analysis and
analiza i      analysis and
analiza i      and
analiza i      evaluation and
analiza i      the analysis and
analiza i      through evaluation and

Therefore we conclude that

count(e) = count("analiza i") = 17

  2. extract.inv.sorted

Afterwards, the pairs are sorted so that all foreign-language translations of a certain English phrase (f) are next to each other, e.g.

analysis and              Analysis and
analysis and              analiza i
analysis and              analiza i
analysis and              analiza i
analysis and              analiza i
analysis and              analiza i
analysis and              analiza i
analysis and              analiza i
analysis and              analiza i
analysis and              analiza i
analysis and              analiza i
analysis and              analiza i
analysis and              analiza i
analysis and              analiza i

and we see that count(f) = count("analysis and") = 14

Considering that it is the same table, just sorted in a different manner, we see that count("analysis and", "analiza i") = count("analiza i", "analysis and") = 13

The resulting phrase-table then looks like:

analiza i ||| analysis and ||| 14       ||| 17       ||| 13
e         ||| f            ||| count(f) ||| count(e) ||| count(e, f) = count(f, e)
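
The counting described above can be sketched roughly as follows. This is only an illustration of the idea, not how giza++ or Moses actually implement it: the pair list is typed in by hand from the two listings above rather than read from the extract files, and the variable names are made up.

```python
from collections import Counter

# Extracted (e, f) pairs, copied from the example listings above.
# In a real run they would be read from the extract files instead.
pairs = (
    [("analiza i", "analysis and")] * 13
    + [
        ("analiza i", "and"),
        ("analiza i", "evaluation and"),
        ("analiza i", "the analysis and"),
        ("analiza i", "through evaluation and"),
        ("Analysis and", "analysis and"),
    ]
)

# count(e): how many extracted pairs have this foreign phrase.
count_e = Counter(e_phrase for e_phrase, _ in pairs)
# count(f): how many extracted pairs have this English phrase.
count_f = Counter(f_phrase for _, f_phrase in pairs)
# count(e, f): how many times this exact pair was extracted.
count_ef = Counter(pairs)

e, f = "analiza i", "analysis and"
print(e, "|||", f, "|||", count_f[f], "|||", count_e[e], "|||", count_ef[(e, f)])
```

Note that these are counts of extracted phrase pairs, not of occurrences of the string in the corpus, which would explain why a plain text search gives different numbers.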

When the conditional probabilities are calculated, the inverse order is used, matching the order in the phrase-table:

p(e|f) = p(e, f) / p(f) = count(e, f) / count(f)   phrase translation probability
p(f|e) = p(f, e) / p(e) = count(f, e) / count(e)   inverse phrase translation probability
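
As a quick check with the "analiza i" example, here is a small sketch showing that dividing the joint probability by the marginal gives the same value as dividing the counts directly, because the total number of pairs cancels. The total of 18 is an assumption that only holds for the toy listing above (17 pairs for "analiza i" plus the one "Analysis and" pair).

```python
from fractions import Fraction

# Counts taken from the example phrase-table line above.
count_f, count_e, count_ef = 14, 17, 13
# Assumed total number of extracted pairs in the toy listing only.
total_pairs = 18

# Relative-frequency estimates of the joint and marginal probabilities.
p_ef = Fraction(count_ef, total_pairs)
p_f = Fraction(count_f, total_pairs)
p_e = Fraction(count_e, total_pairs)

# The total cancels, leaving a plain ratio of counts.
p_e_given_f = p_ef / p_f
p_f_given_e = p_ef / p_e

print(f"p(e|f) = {p_e_given_f} = {float(p_e_given_f):.6f}")
print(f"p(f|e) = {p_f_given_e} = {float(p_f_given_e):.6f}")
```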