Building a custom tokenizer from scratch with the HuggingFace Tokenizers library: some vocabulary items are added, but some are not


I am trying to create a custom tokenizer from scratch with the HuggingFace Tokenizers library, following this tutorial.

My dataset consists of 80 million Chinese sentences. My custom tokenizer is based on SentencePieceBPETokenizer and consists of a custom normalizer, pre-tokenizer and decoder.

Normalizer: The normalizer is responsible for cleaning up the sentences. It applies NFKC(), a numeric-character filter, Replace(Regex(r"\s+"), " ") and Lowercase() in sequence.

from tokenizers import NormalizedString, Regex

class CustomNormalizer:
    def normalize(self, normalized: NormalizedString):
        # NFKC unicode normalization
        normalized.nfkc()
        # Drop numeric characters
        normalized.filter(lambda char: not char.isnumeric())
        # Collapse consecutive whitespace into a single space
        normalized.replace(Regex(r"\s+"), " ")
        normalized.lowercase()
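To make sure the normalizer does what I expect, it can be checked on its own before being attached to the tokenizer. Below is a minimal sketch, assuming that normalize_str on a Normalizer.custom-wrapped object behaves the same way as it does for the built-in normalizers:

from tokenizers.normalizers import Normalizer

# Quick standalone check of the custom normalizer (assumes normalize_str
# works on a wrapped custom normalizer the same way as on built-in ones)
norm = Normalizer.custom(CustomNormalizer())
print(norm.normalize_str("ABC   123 件衫"))
# roughly expected: "abc 件衫" (digits dropped, whitespace collapsed, lowercased)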

Pre-Tokenizer: The pre-tokenizer is responsible for word segmentation, i.e. splitting a sentence into individual words with jieba. The code of the pre-tokenizer can be found here.

from typing import List

import jieba
from tokenizers import NormalizedString, PreTokenizedString

class JiebaPreTokenizer:
    def jieba_split(self, i: int, normalized_string: NormalizedString) -> List[NormalizedString]:
        splits = []
        # We need to call `str(normalized_string)` because jieba expects a str,
        # not a NormalizedString
        for token, start, stop in jieba.tokenize(str(normalized_string)):
            splits.append(normalized_string[start:stop])

        return splits
        # We can also easily do it in one line:
        # return [normalized_string[w[1] : w[2]] for w in jieba.tokenize(str(normalized_string))]

    def odd_number_split(
        self, i: int, normalized_string: NormalizedString
    ) -> List[NormalizedString]:
        # Just an odd example: split before every odd digit
        splits = []
        last = 0
        for idx, char in enumerate(str(normalized_string)):
            if char.isnumeric() and int(char) % 2 == 1:
                splits.append(normalized_string[last:idx])
                last = idx
        # Don't forget the last one
        splits.append(normalized_string[last:])
        return splits

    def pre_tokenize(self, pretok: PreTokenizedString):
        # Split the PreTokenizedString using `self.jieba_split`
        pretok.split(self.jieba_split)
        # `pretok.split` can be called multiple times to apply different
        # algorithms, but one call is usually enough.
        pretok.split(self.odd_number_split)

Decoder: The decoder simply joins the tokens back into text when needed.

from typing import List

class CustomDecoder:
    def decode(self, tokens: List[str]) -> str:
        return "".join(tokens)

    def decode_chain(self, tokens: List[str]) -> List[str]:
        return [f" {t}" for t in tokens]

The code that builds the tokenizer is:

from tokenizers import SentencePieceBPETokenizer
from tokenizers.decoders import Decoder
from tokenizers.normalizers import Normalizer
from tokenizers.pre_tokenizers import PreTokenizer

special_tokens = ["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"]
tk_tokenizer = SentencePieceBPETokenizer()
tk_tokenizer.normalizer = Normalizer.custom(CustomNormalizer())
tk_tokenizer.pre_tokenizer = PreTokenizer.custom(JiebaPreTokenizer())
tk_tokenizer.decoder = Decoder.custom(CustomDecoder())

To verify that the pre-tokenization process works, I used this code:

text = "件衫巢,幫我燙吓喇"
print(tk_tokenizer.pre_tokenizer.pre_tokenize_str(text))

which outputs:

[('件衫', (0, 2)), ('巢', (2, 5)), (',', (5, 6)), ('幫', (6, 7)), ('我', (7, 8)), ('燙', (8, 9)), ('吓', (9, 10)), ('喇', (10, 11))]

which correctly segments the words.

However, after training the custom tokenizer with the following code, some vocabulary items such as the characters '巢' (meaning 'wrinkled') and '燙' (meaning 'to iron') cannot be identified:

# Helper function that yields the corpus in batches
def get_training_corpus(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]

tk_tokenizer.train_from_iterator(
    get_training_corpus(),
    vocab_size=60000,
    min_frequency=1,
    show_progress=True,
    special_tokens=special_tokens,
)
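For clarity, train_from_iterator is being fed batches (list slices) of sentences here; a minimal sketch of the kind of dataset the helper assumes, which is only a simplified stand-in for the real 80-million-sentence corpus:

# Illustrative stand-in only: `dataset` is assumed to be a sequence of raw
# sentence strings, so each yielded batch is a list of str
dataset = [
    "件衫巢,幫我燙吓喇",
    "今日天氣好好",
    # ... about 80 million sentences in the real corpus
]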

After training, I tested the functionality of the Tokenizer:

encoding = tk_tokenizer.encode("件衫巢,幫我燙吓喇")
print(encoding.ids, encoding.tokens, encoding.offsets)

which outputs:

[3248, 13, 350, 406, 191, 222] ['件衫', ',', '幫', '我', '吓', '喇'] [(0, 2), (5, 6), (6, 7), (7, 8), (9, 10), (10, 11)]

About Vocab Size

The final vocabulary size of the custom tokenizer is 58,685. The reason for capping the vocabulary size at 60,000 is that GPT's vocabulary size is 40,478 while GPT-2's is 50,257. Modern Chinese has approximately 106,230 words, but fewer than half of them are in common use.
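The 58,685 figure was read off the trained tokenizer; get_vocab_size() on the tokenizer reports it, as in this small check:

# Report the final vocabulary size after training
print(tk_tokenizer.get_vocab_size())  # 58685 in my case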

The Missing Vocabulary Problem

I understand why '巢' cannot be identified: the dataset does not contain that character. However, the dataset contains over 100 instances of '燙'. In theory, the tokenizer should store it as one of the vocabulary entries instead of treating it as an unknown token.
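To illustrate the problem, the missing character can be looked up in the trained vocabulary and counted in the corpus directly. This is only a diagnostic sketch (it again assumes `dataset` is a list of sentence strings), not a fix:

# Diagnostic only: None from token_to_id means the token is not in the vocabulary
for ch in ["件衫", "巢", "燙"]:
    print(ch, tk_tokenizer.token_to_id(ch))

# Double-check how often 燙 actually occurs in the raw corpus
count = sum(sentence.count("燙") for sentence in dataset)
print("燙 occurrences:", count)  # over 100 in my dataset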

My Question

My question is: how can I improve the tokenizer code so that "燙" appears in the tokenizer's vocabulary list? Thanks.
