Comparing the vocabulary sizes of the WordPiece and BPE tokenizer algorithms


I have a text file. I used the Hugging Face tokenizers library to run the WordPiece and BPE tokenizer algorithms on it. I trained both tokenizers on the file and compared the resulting vocabulary sizes. Contrary to my expectation, the WordPiece vocabulary turned out to be larger. The result was:

WordPiece vocabulary size: 17555

Byte-Pair Encoding vocabulary size: 16553

Does anyone know the reason for this result? Here is my code for the WordPiece algorithm:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# WordPiece model with an unknown-token fallback
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

from tokenizers.trainers import WordPieceTrainer
trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

# path is the path to my text file
tokenizer.train([path], trainer)

And here is the code for Byte-Pair Encoding:

from tokenizers import Tokenizer
from tokenizers.models import BPE

# BPE model, also with an unknown-token fallback
bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

from tokenizers.trainers import BpeTrainer
bpe_trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

from tokenizers.pre_tokenizers import Whitespace
bpe_tokenizer.pre_tokenizer = Whitespace()

# same file, same pre-tokenizer, same special tokens
bpe_tokenizer.train([path], bpe_trainer)
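
For reference, the comparison itself can be done with the library's get_vocab_size() method. This is only a minimal sketch, assuming the tokenizer and bpe_tokenizer objects trained above:

# Minimal sketch (assumed): read the learned vocabulary sizes after training
wp_size = tokenizer.get_vocab_size()
bpe_size = bpe_tokenizer.get_vocab_size()
print(f"WordPiece vocabulary size: {wp_size}")
print(f"Byte-Pair Encoding vocabulary size: {bpe_size}")

# Note: both WordPieceTrainer and BpeTrainer also accept a vocab_size
# argument that caps the vocabulary; the code above relies on the defaults.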
