I am testing tokenizer behaviour with various pre-trained models on Chinese sentences. Here is my code:
from transformers import BartTokenizer, BertTokenizer
text_eng = 'I go to school by train.'
text_can = '我乘搭火車上學。'
text_chi = '我搭火車返學。'
# fnlp/bart-base-chinese ships a BERT-style vocabulary, so it is loaded with BertTokenizer
tokenizer_bartchinese = BertTokenizer.from_pretrained('fnlp/bart-base-chinese')
tokenizer_bertchinese = BertTokenizer.from_pretrained('bert-base-chinese')
# facebook/bart-large uses an English byte-level BPE tokenizer
tokenizer_fb = BartTokenizer.from_pretrained('facebook/bart-large')
# BART Chinese
print(tokenizer_bartchinese.tokenize(text_eng))
print(tokenizer_bartchinese.tokenize(text_can))
print(tokenizer_bartchinese.tokenize(text_chi))
# BERT Chinese
print(tokenizer_bertchinese.tokenize(text_eng))
print(tokenizer_bertchinese.tokenize(text_can))
print(tokenizer_bertchinese.tokenize(text_chi))
# BART Large
print(tokenizer_fb.tokenize(text_eng))
print(tokenizer_fb.tokenize(text_can))
print(tokenizer_fb.tokenize(text_chi))
Here are the results:
['I', 'go', 'to', 'school', 'by', 'train', '.']
['我', '乘', '搭', '火', '車', '上', '學', '。']
['我', '搭', '火', '車', '返', '學', '。']
['[UNK]', 'go', 'to', 'school', 'by', 't', '##rain', '.']
['我', '乘', '搭', '火', '車', '上', '學', '。']
['我', '搭', '火', '車', '返', '學', '。']
['I', 'Ġgo', 'Ġto', 'Ġschool', 'Ġby', 'Ġtrain', '.']
['æĪ', 'ij', 'ä¹', 'ĺ', 'æ', 'IJ', 'Ń', 'ç', 'ģ«', 'è»', 'Ĭ', 'ä¸Ĭ', 'åŃ', '¸', 'ãĢĤ']
['æĪ', 'ij', 'æ', 'IJ', 'Ń', 'ç', 'ģ«', 'è»', 'Ĭ', 'è¿', 'Ķ', 'åŃ', '¸', 'ãĢĤ']
Should the tokenizer recognize Chinese vocabulary, so that it segments the text into words instead of splitting it into individual characters? For example, 火車 (train) is a single word in Chinese; it should not be split into 火 (fire) and 車 (car).
My expected behaviour:
['我', '乘搭', '火車', '上', '學', '。']
which translates (just for reference):
['I', 'ride', 'train', 'to', 'school', '.']
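For context, a word-level segmentation like the one above can be produced with an external word segmenter such as jieba (a third-party library, not part of the transformers tokenizers); this is only a rough sketch of what I mean by word-level pieces, and the segmentation jieba actually produces for traditional characters may differ:
import jieba  # assumed installed via: pip install jieba
# text_can is the sentence defined in the code above
words = jieba.lcut(text_can)  # returns a list of words rather than single characters
print(words)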
Also, I noticed that facebook/bart-large generates weird characters (the last three lines of the output). Is this normal behaviour?
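For completeness, my understanding (please correct me if I am wrong) is that these are byte-level BPE pieces rather than corrupted text, and a round trip like the one below, using tokenizer_fb and text_can from my code above, should reconstruct the original sentence if nothing is actually lost:
tokens = tokenizer_fb.tokenize(text_can)
# the pieces look garbled, but mapping them back through the byte decoder should recover the original string
print(tokenizer_fb.convert_tokens_to_string(tokens))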