The URLs in its 'readme' file are not valid (http://www.fjoch.com/mkcls.html and http://www.fjoch.com/GIZA++.html). Is there a good tutorial about GIZA++? Or are there alternatives that have complete documentation?
Is there a tutorial about giza++?
11.7k views. Asked by Intelligence Gear.
There are 5 answers
0
On
This tutorial is very helpful: http://fabioticconi.wordpress.com/2011/01/17/how-to-do-a-word-alignment-with-giza-or-mgiza-from-parallel-corpus/
IIT-B scholars have put up nice, detailed presentations on setting up and using GIZA++ and MOSES.
Some of them are:
http://www.cse.iitb.ac.in/~pb/cs712-2013/potpouri/kashyap-giza-mozes-jan2013.pdf
http://www.cse.iitb.ac.in/~anoopk/publications/presentations/moses_giza_intro.pdf
1
On
There is a supplemental explanation of how to format input files and how to run GIZA++ over here:
http://www.tc.umn.edu/~bthomson/wordalignment/GIZAREADME.txt
The following is excerpted from a tutorial I'm putting together for a class. (NB: This assumes you have successfully installed GIZA++-v2 on a *nix system.)
Sample 1 - train.en
Sample 2 - train.fr
Use plain2snt.out to get target and source vocabulary files (*.vcb) as well as a sentence pair file (*.snt). From the GIZA++ directory, run plain2snt.out on TEXT1 and TEXT2, the data files described in step 1.
This produces four files in the same directory as TEXT1 and TEXT2 (assuming they are in the same directory): TEXT1.vcb, TEXT2.vcb, TEXT1_TEXT2.snt, and TEXT2_TEXT1.snt.
The vocab files contain a unique (integer) ID for each word in the text (NB: not tokenized/lemmatized), the word/string, and the number of times that string occurred. These are separated by a single space character.
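To make the vocab file layout concrete, here is a minimal Python sketch that parses lines of the form "ID word count" as described above. The helper name parse_vcb and the sample entries are invented for illustration; they do not come from GIZA++ or from any real corpus.

```python
from typing import Dict, List, NamedTuple


class VocabEntry(NamedTuple):
    word_id: int
    word: str
    count: int


def parse_vcb(lines: List[str]) -> Dict[int, VocabEntry]:
    """Parse *.vcb lines into a mapping from word ID to its entry."""
    vocab: Dict[int, VocabEntry] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # Each line is "<id> <word> <count>", separated by single spaces.
        word_id, word, count = line.split(" ")
        vocab[int(word_id)] = VocabEntry(int(word_id), word, int(count))
    return vocab


# Hypothetical vocab contents, made up for this example.
sample_vcb = ["2 the 3", "3 house 1", "4 is 2"]
vocab = parse_vcb(sample_vcb)
print(vocab[3].word, vocab[3].count)
```

With a real file, you would pass the lines read from TEXT1.vcb (for example via open("TEXT1.vcb", encoding="utf-8").readlines()) instead of the sample list.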
The sentence files contain numbers. For each sentence pair, there are three lines: the first is a count of the number of times that sentence pair occurs in the corpus, and the second and third are each a string of space-separated numbers corresponding to the entries for words in the vocab files. Based on the naming convention for *.snt files, the first file is assumed to be the source language and the second the target language. For example, in the file TEXT1_TEXT2.snt, the first line will be a count of the number of times the first sentence pair occurred in the corpus, the second line will be a string of numbers corresponding to words in the TEXT1.vcb file, and the third line will be a string of numbers corresponding to words in the TEXT2.vcb file.
TEXT1.vcb, TEXT2.vcb, and either of the two *.snt files can be used as input to GIZA++ to produce an alignment.
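As a rough sketch of such an invocation, the Python below assembles a GIZA++ command line from the three input files, using -S for the source vocab, -T for the target vocab, and -C for the sentence file. The helper name build_giza_command and the "./GIZA++" binary path are assumptions for illustration, and some builds may need further options, so check your binary's usage output.

```python
import shutil
import subprocess
from typing import List


def build_giza_command(source_vcb: str, target_vcb: str, snt: str,
                       binary: str = "./GIZA++") -> List[str]:
    """Assemble the argument list for a GIZA++ run (nothing is executed here)."""
    # -S: source vocabulary, -T: target vocabulary, -C: sentence-pair file
    return [binary, "-S", source_vcb, "-T", target_vcb, "-C", snt]


cmd = build_giza_command("TEXT1.vcb", "TEXT2.vcb", "TEXT1_TEXT2.snt")
print(" ".join(cmd))

# Only attempt the run when the binary is actually present (assumed path).
if shutil.which(cmd[0]) is not None:
    subprocess.run(cmd, check=False)
```

Building the argument list separately from running it makes it easy to print and inspect the exact command before committing to a long alignment run.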
But note that when I tried to run GIZA++ this way, I had to rename TEXT1_TEXT2.snt to something without an underscore in the name in order to get any proper output.
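If the output still looks wrong, one cheap check is to confirm that the *.snt file decodes back to the sentences you expect. The sketch below reads the three-line records described earlier and maps the IDs back to words. The function names and the toy vocabularies are hypothetical, and it assumes the file contains no empty sentences, since blank lines are skipped.

```python
from typing import Dict, Iterator, List, Tuple


def read_snt(lines: List[str]) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (count, source_ids, target_ids) for each three-line record."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected a multiple of three non-empty lines")
    for i in range(0, len(rows), 3):
        count = int(rows[i])  # how often this sentence pair occurs
        source_ids = [int(tok) for tok in rows[i + 1].split()]
        target_ids = [int(tok) for tok in rows[i + 2].split()]
        yield count, source_ids, target_ids


def decode(ids: List[int], vocab: Dict[int, str]) -> str:
    """Map word IDs back to strings, marking IDs missing from the vocab."""
    return " ".join(vocab.get(i, "<unknown>") for i in ids)


# Hypothetical vocabularies and one sentence pair, made up for this example.
source_vocab = {2: "the", 3: "house"}
target_vocab = {2: "la", 3: "maison"}
snt_lines = ["1", "2 3", "2 3"]

for count, src, tgt in read_snt(snt_lines):
    print(count, "|", decode(src, source_vocab), "|", decode(tgt, target_vocab))
```

Any "<unknown>" token in the decoded output suggests that the *.snt file and the vocab files were not produced together.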