I'm reading the first example (tokenization) at http://www.openfst.org/twiki/bin/view/FST/FstExamples.
In the example, they create three FSTs, Mars.fst, Martian.fst, and man.fst, and manually run a few fst commands to merge them into one big transducer. The words "Mars", "Martian", and "man" come from wotw.syms, which contains 7102 words.
My question is: is there a smart way to create a word.fst covering all 7102 words, so that all of them end up in one big automaton, or does it have to be done manually, as they did for the three words "Martian", "Mars", and "man"?
Update: they provide a script for exactly this, https://www.openfst.org/twiki/pub/FST/FstExamples/makelex.py.txt, which we may simply run or adapt; a sketch of the idea follows.
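For reference, here is a minimal sketch of what such a lexicon builder can look like. This is my own sketch, not the contents of makelex.py; it assumes the ascii.syms/wotw.syms symbol tables from the example, and the file names lexicon.txt and lexicon.fst are invented for illustration. Each word becomes a branch off a shared start state, with the word symbol emitted on the first arc and <epsilon> on the rest, exactly like the hand-written Mars.txt, Martian.txt, and man.txt.

```python
#!/usr/bin/env python3
# Sketch (not makelex.py itself): read every word from wotw.syms and
# emit one AT&T-format text FST containing all of them.

lines = []
next_state = 1  # state 0 is the shared start state

with open("wotw.syms") as syms:
    for entry in syms:
        if not entry.strip():
            continue  # skip blank lines in the symbol table
        word = entry.split()[0]
        if word == "<epsilon>":
            continue  # skip the reserved epsilon symbol
        src = 0
        for i, ch in enumerate(word):
            # Assumption: every character of every word appears in
            # ascii.syms, as in the original example.
            out = word if i == 0 else "<epsilon>"
            lines.append(f"{src} {next_state} {ch} {out}")
            src = next_state
            next_state += 1
        lines.append(str(src))  # mark this word's last state final

with open("lexicon.txt", "w") as out_file:
    out_file.write("\n".join(lines) + "\n")

# Then compile and optimize with the standard OpenFst tools, e.g.:
#   fstcompile --isymbols=ascii.syms --osymbols=wotw.syms lexicon.txt |
#     fstdeterminize | fstminimize > lexicon.fst
```

Building one text file and compiling it once avoids 7102 separate fstcompile/fstunion invocations, and the fstdeterminize/fstminimize pass merges the shared prefixes and suffixes of the branches into a compact lexicon transducer.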