Add reserved tokens to `tft.vocabulary`


I would like to append words to the vocabulary created by tft.vocabulary that are not a part of the training samples (i.e. <mask> and <pad> tokens).

I see in the docs that the tft.vocabulary function takes a key_fn argument, which the docs describe as follows:

Supply key_fn if you would like to generate a vocabulary with coverage over specific keys.

but with the key_fn below it still does not append the <mask> and <pad> tokens to the vocabulary.


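import tensorflow as tf
import tensorflow_transform as tft

def _key_fn(x):
  # Attempt: return the reserved tokens, hoping they get added to the vocabulary.
  return tf.constant(['<mask>', '<pad>'])

vocab = tft.vocabulary(
  words,
  key_fn=_key_fn,
  top_k=config.VOCAB_SIZE,
)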


There is 1 answer

Zohar

What is it that you're trying to achieve?

I don't think key_fn is relevant here, since it only affects the ordering of the vocabulary (and which entries survive top_k, when that is provided); it does not add new tokens.

Could you compute the vocabulary after appending the added information?

tft.vocabulary(tf.strings.join([words, '<mask>', '<pad>']), ...)

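Note that tf.strings.join concatenates strings elementwise, so this would result in every vocabulary entry including the added suffix (for example "cat<mask><pad>").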
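If a suffix on every entry is not what you want, another option might be to leave tft.vocabulary as it is and add the reserved tokens to the resulting vocabulary afterwards. Below is a minimal sketch in plain Python, assuming the learned vocabulary has already been read into a list of strings (for instance from the vocabulary file). The helper name add_reserved_tokens and its parameters are hypothetical and not part of tensorflow_transform.

```python
from typing import List, Optional, Sequence


def add_reserved_tokens(
    vocab: Sequence[str],
    reserved: Sequence[str] = ("<pad>", "<mask>"),
    max_size: Optional[int] = None,
) -> List[str]:
    """Return a new vocabulary with the reserved tokens placed first.

    Hypothetical helper for illustration; not a tensorflow_transform API.
    """
    # Start with the reserved tokens (deduplicated, order preserved).
    merged = list(dict.fromkeys(reserved))
    seen = set(merged)
    # Append the learned tokens, skipping any that collide with a reserved one.
    for token in vocab:
        if token not in seen:
            merged.append(token)
            seen.add(token)
    # Trim from the tail so the reserved tokens always survive the size limit.
    if max_size is not None:
        merged = merged[:max_size]
    return merged


# Assumed input: tokens as they might come out of the learned vocabulary.
learned = ["the", "cat", "sat", "<pad>"]
full_vocab = add_reserved_tokens(learned, max_size=4)
token_to_id = {token: index for index, token in enumerate(full_vocab)}
```

Placing the reserved tokens first keeps their ids fixed (0 and 1 here) no matter what the training data contains. Any later lookup would then have to use the merged list rather than the original vocabulary file.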