Comparing the vocabulary sizes of the WordPiece and BPE tokenizer algorithms


I have a text file. I used the Hugging Face tokenizers library to run the WordPiece and BPE tokenizer algorithms on it. I trained both tokenizers on the file and compared the resulting vocabulary sizes. Contrary to my expectation, the WordPiece vocabulary turned out to be larger. The result was:

WordPiece vocabulary size: 17555

Byte-Pair Encoding vocabulary size: 16553

Does anyone know the reason for this result? Here is my code for the WordPiece algorithm:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# WordPiece model with an unknown-token fallback
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

from tokenizers.trainers import WordPieceTrainer
trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

# path is the path to my text file
tokenizer.train([path], trainer)

And here is the code for Byte-Pair Encoding:

from tokenizers import Tokenizer
from tokenizers.models import BPE

# BPE model, also with an unknown-token fallback
bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

from tokenizers.trainers import BpeTrainer
bpe_trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

from tokenizers.pre_tokenizers import Whitespace
bpe_tokenizer.pre_tokenizer = Whitespace()

# same file, same pre-tokenizer, same special tokens
bpe_tokenizer.train([path], bpe_trainer)
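
For reference, the comparison itself can be done with the library's get_vocab_size() method. This is only a minimal sketch, assuming the tokenizer and bpe_tokenizer objects trained above:

# Minimal sketch (assumed): read the learned vocabulary sizes after training
wp_size = tokenizer.get_vocab_size()
bpe_size = bpe_tokenizer.get_vocab_size()
print(f"WordPiece vocabulary size: {wp_size}")
print(f"Byte-Pair Encoding vocabulary size: {bpe_size}")

# Note: both WordPieceTrainer and BpeTrainer also accept a vocab_size
# argument that caps the vocabulary; the code above relies on the defaults.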
