I am testing the tokenizers of various pre-trained models on Chinese sentences. Here is my code:

from transformers import BartTokenizer, BertTokenizer

text_eng = 'I go to school by train.'
text_can = '我乘搭火車上學。'
text_chi = '我搭火車返學。'

tokenizer_bartchinese = BertTokenizer.from_pretrained('fnlp/bart-base-chinese')
tokenizer_bertchinese = BertTokenizer.from_pretrained('bert-base-chinese')
tokenizer_fb = BartTokenizer.from_pretrained('facebook/bart-large') 

# BART Chinese
print(tokenizer_bartchinese.tokenize(text_eng))
print(tokenizer_bartchinese.tokenize(text_can))
print(tokenizer_bartchinese.tokenize(text_chi))

# BERT Chinese
print(tokenizer_bertchinese.tokenize(text_eng))
print(tokenizer_bertchinese.tokenize(text_can))
print(tokenizer_bertchinese.tokenize(text_chi))

# BART Large
print(tokenizer_fb.tokenize(text_eng))
print(tokenizer_fb.tokenize(text_can))
print(tokenizer_fb.tokenize(text_chi))

Here are the results:

['I', 'go', 'to', 'school', 'by', 'train', '.']
['我', '乘', '搭', '火', '車', '上', '學', '。']
['我', '搭', '火', '車', '返', '學', '。']
['[UNK]', 'go', 'to', 'school', 'by', 't', '##rain', '.']
['我', '乘', '搭', '火', '車', '上', '學', '。']
['我', '搭', '火', '車', '返', '學', '。']
['I', 'Ġgo', 'Ġto', 'Ġschool', 'Ġby', 'Ġtrain', '.']
['æĪ', 'ij', 'ä¹', 'ĺ', 'æ', 'IJ', 'Ń', 'ç', 'ģ«', 'è»', 'Ĭ', 'ä¸Ĭ', 'åŃ', '¸', 'ãĢĤ']
['æĪ', 'ij', 'æ', 'IJ', 'Ń', 'ç', 'ģ«', 'è»', 'Ĭ', 'è¿', 'Ķ', 'åŃ', '¸', 'ãĢĤ']

Should the tokenizer recognize multi-character words in Chinese, so that it segments the text by word instead of splitting it into individual characters? For example, 火車 (train) is a single word in Chinese; it should not be split into 火 (fire) and 車 (car).

My expected behaviour (for text_can, as an example):

['我', '乘搭', '火車', '上', '學', '。']

which, just for reference, translates to:

['I', 'ride', 'train', 'to', 'school', '.']
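
To make the expected behaviour concrete, here is a small sketch I could run. It checks whether 火車 exists as a single token in the bert-base-chinese vocabulary, and uses the jieba library (not part of my code above; an extra dependency I am assuming here) to show the kind of word-level split I have in mind. The exact jieba output depends on its dictionary.

import jieba
from transformers import BertTokenizer

tokenizer_bertchinese = BertTokenizer.from_pretrained('bert-base-chinese')

# If 火車 is not in the vocabulary as a single token, it maps to the
# unknown-token id, which would explain the per-character split.
word_id = tokenizer_bertchinese.convert_tokens_to_ids('火車')
print(word_id, word_id == tokenizer_bertchinese.unk_token_id)

# Word-level segmentation with jieba; ideally 火車 comes out as one
# word rather than 火 + 車 (the exact split depends on the dictionary).
print(jieba.lcut('我乘搭火車上學。'))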

Also, I noticed that facebook/bart-large produces strange-looking characters (the last three lines of the output). Is this normal behaviour?
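
In case it is relevant, here is a round-trip check I could run with facebook/bart-large to see whether those strange-looking tokens still decode back to the original sentence. My guess is that they are a byte-level representation rather than corruption, but I am not sure.

from transformers import BartTokenizer

tokenizer_fb = BartTokenizer.from_pretrained('facebook/bart-large')
text_can = '我乘搭火車上學。'

# Convert the odd-looking tokens back to a string.
tokens = tokenizer_fb.tokenize(text_can)
print(tokenizer_fb.convert_tokens_to_string(tokens))

# Round-trip through token ids as a second check.
ids = tokenizer_fb.encode(text_can)
print(tokenizer_fb.decode(ids, skip_special_tokens=True))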
