Given a finite character vocabulary, what is the easiest way to represent arbitrarily long sequences of characters with uniform length?


I am working with a finite state transducer (FST) for a project. In constructing the FST, I need each output symbol to be an arbitrarily long sequence of input symbols, where the input symbols are simply the individual unique characters of an associated text corpus. I also need to represent these sequences uniformly, so that every sequence's representation has the same length. Truly arbitrary length is unbounded, so let us assume that no sequence can be longer than the longest document in the corpus.

In other words, given an input_vocabulary of ['a', 'b', 'c'] and an output_vocabulary of ['a', 'ab', 'acb', 'abcb'], each output symbol needs to be represented as a vector of length 4 whose items all come from the input_vocabulary. My only idea is to use a padded vector, which for this example gives [ [0, 3, 3, 3], [0, 1, 3, 3], [0, 2, 1, 3], [0, 1, 2, 1] ], where 'a', 'b', 'c' map to 0, 1, 2 and 3 is a pad token. I am very new to this, so any help would be greatly appreciated.
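For concreteness, the padded scheme described above can be sketched as follows. This is only a minimal illustration: the function name `encode_padded` is hypothetical, and it assumes the pad ID is simply the next integer after the last character ID and that the width is the length of the longest output symbol.

```python
from typing import List


def encode_padded(symbols: List[str], input_vocabulary: List[str]) -> List[List[int]]:
    """Map each symbol to character IDs, right-padded to a common width."""
    # Assumption: character IDs follow the order of input_vocabulary.
    char_to_id = {ch: i for i, ch in enumerate(input_vocabulary)}
    # Assumption: the pad token takes the first unused ID.
    pad_id = len(input_vocabulary)
    # The common width is the length of the longest symbol.
    width = max(len(s) for s in symbols)
    return [
        [char_to_id[ch] for ch in s] + [pad_id] * (width - len(s))
        for s in symbols
    ]


input_vocabulary = ['a', 'b', 'c']
output_vocabulary = ['a', 'ab', 'acb', 'abcb']
padded = encode_padded(output_vocabulary, input_vocabulary)
print(padded)
# [[0, 3, 3, 3], [0, 1, 3, 3], [0, 2, 1, 3], [0, 1, 2, 1]]
```

In a real corpus the width would be the length of the longest document, so most vectors would consist largely of pad IDs.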

To clarify, I want to know if there is a way to do this without pad tokens.
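One pad-free possibility, offered only as a sketch, is to stop encoding the characters at all and give every output symbol a single integer ID through a lookup table, so that each representation has length 1. This assumes the output vocabulary is finite and known before the FST is built, which the length bound above would allow. FST toolkits commonly label arcs with integer IDs looked up in a symbol table, but the helper `build_symbol_table` below is hypothetical plain Python, not any library's API.

```python
from typing import Dict, List, Tuple


def build_symbol_table(symbols: List[str]) -> Tuple[Dict[str, int], List[str]]:
    """Assign one integer ID per distinct symbol and keep the reverse mapping."""
    # Assumption: IDs are ordered by length, then alphabetically, purely
    # so that the assignment is deterministic.
    id_to_symbol = sorted(set(symbols), key=lambda s: (len(s), s))
    symbol_to_id = {s: i for i, s in enumerate(id_to_symbol)}
    return symbol_to_id, id_to_symbol


output_vocabulary = ['a', 'ab', 'acb', 'abcb']
symbol_to_id, id_to_symbol = build_symbol_table(output_vocabulary)

# Every output symbol is now a single integer, whatever its string length.
ids = [symbol_to_id[s] for s in output_vocabulary]
# The original strings are recovered by looking the IDs up again.
decoded = [id_to_symbol[i] for i in ids]
print(ids, decoded)
```

The trade-off is that the IDs carry no information about which characters a symbol contains, so this only helps if the character structure is needed at lookup time and not inside the fixed-length representation itself.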
