I want to use sentencepiece (https://github.com/google/sentencepiece) in a Google Colab project where I am training an OpenNMT model. I'm a little confused about how to set up the sentencepiece binaries in Google Colab. Do I need to build them with cmake?
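For what it's worth, the pip wheel seems to ship prebuilt binaries, because the module imports fine on its own. This is just the sanity check I ran (assuming the package exposes `__version__`):

import sentencepiece as spm

# The import succeeds straight from the pip wheel, with no cmake build step.
print(spm.__version__)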
When I install with pip install sentencepiece, include sentencepiece in the "transforms" list of my config, and then run this command (adapted from the OpenNMT translation tutorial):
!onmt_build_vocab -config en-sp.yaml -n_sample -1
I get:
Traceback (most recent call last):
  File "/usr/local/bin/onmt_build_vocab", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 63, in main
    build_vocab_main(opts)
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 32, in build_vocab_main
    transforms = make_transforms(opts, transforms_cls, fields)
  File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/transform.py", line 176, in make_transforms
    transform_obj.warm_up(vocabs)
  File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/tokenize.py", line 110, in warm_up
    load_src_model.Load(self.src_subword_model)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
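The last frame suggests Load is receiving something other than a path string. As a sanity check, I can reproduce the exact same TypeError by handing the processor None instead of a model path (which is presumably what happens when no subword model is configured):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
# Passing None instead of a model file path raises:
# TypeError: not a string
sp.Load(None)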
Below is my config file (en-sp.yaml). I'm not sure what in it ends up being passed as a non-string.
## Where the samples will be written
save_data: en-sp/run/example
## Where the vocab(s) will be written
src_vocab: en-sp/run/example.vocab.src
tgt_vocab: en-sp/run/example.vocab.tgt
## Where the model will be saved
save_model: drive/MyDrive/Europarl/model/model
# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    europarl:
        path_src: train_europarl-v7.es-en.es
        path_tgt: train_europarl-v7.es-en.en
        transforms: [sentencepiece, filtertoolong]
        weight: 1
    valid:
        path_src: dev_europarl-v7.es-en.es
        path_tgt: dev_europarl-v7.es-en.en
        transforms: [sentencepiece]

skip_empty_level: silent
world_size: 1
gpu_ranks: [0]
...
EDIT: I Googled the issue some more and found a Google Colab project that builds sentencepiece with cmake here: https://colab.research.google.com/github/mymusise/gpt2-quickly/blob/main/examples/gpt2_quickly.ipynb#scrollTo=dDAup5dxDXZW. However, even after building with cmake, I'm still getting the same error.
To fix this, I had to filter and tokenize my dataset and then train SentencePiece models on it. As far as I can tell, the root cause was that the sentencepiece transform expects paths to already-trained subword models (the src_subword_model and tgt_subword_model options), which my config never set, so warm_up ended up calling Load(None). I used the scripts from this helpful repository to do everything: https://github.com/ymoslem/MT-Preparation, and now my model is training!
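For anyone who lands here, this is roughly what the training step looks like with the plain sentencepiece Python API. It's a sketch under my own assumptions: the corpus file names match my config above, and the 32000 vocabulary size and model prefixes are arbitrary choices of mine, not anything OpenNMT requires:

import sentencepiece as spm

# Train one subword model per language on the (filtered) corpus.
# File names follow my config; vocab_size and character_coverage are my choices.
for lang, prefix in (("es", "source"), ("en", "target")):
    spm.SentencePieceTrainer.train(
        input=f"train_europarl-v7.es-en.{lang}",  # one sentence per line
        model_prefix=prefix,                      # writes source.model / target.model
        vocab_size=32000,
        character_coverage=1.0,
    )

With source.model and target.model on disk, the config also needs src_subword_model: source.model and tgt_subword_model: target.model, so that warm_up has real paths to Load instead of None.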