SentencePiece in Google Colab


I want to use SentencePiece (https://github.com/google/sentencepiece) in a Google Colab project where I am training an OpenNMT model. I'm a little confused about how to set up the SentencePiece binaries in Google Colab. Do I need to build with cmake?

When I install it with pip install sentencepiece and add sentencepiece to the "transforms" in my config, I get an error.

I run this command, following the OpenNMT translation tutorial:

!onmt_build_vocab -config en-sp.yaml -n_sample -1

and I get:

Traceback (most recent call last):
  File "/usr/local/bin/onmt_build_vocab", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 63, in main
    build_vocab_main(opts)
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 32, in build_vocab_main
    transforms = make_transforms(opts, transforms_cls, fields)
  File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/transform.py", line 176, in make_transforms
    transform_obj.warm_up(vocabs)
  File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/tokenize.py", line 110, in warm_up
    load_src_model.Load(self.src_subword_model)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

Below is my config file. I'm not sure where the "not a string" error is coming from.

## Where the samples will be written
save_data: en-sp/run/example

## Where the vocab(s) will be written
src_vocab: en-sp/run/example.vocab.src
tgt_vocab: en-sp/run/example.vocab.tgt

## Where the model will be saved
save_model: drive/MyDrive/Europarl/model/model

# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    europarl:
        path_src: train_europarl-v7.es-en.es
        path_tgt: train_europarl-v7.es-en.en
        transforms: [sentencepiece, filtertoolong]
        weight: 1

    valid:
        path_src: dev_europarl-v7.es-en.es
        path_tgt: dev_europarl-v7.es-en.en
        transforms: [sentencepiece]

skip_empty_level: silent

world_size: 1
gpu_ranks: [0]
...

EDIT: I searched further and found a Google Colab notebook that builds sentencepiece with cmake: https://colab.research.google.com/github/mymusise/gpt2-quickly/blob/main/examples/gpt2_quickly.ipynb#scrollTo=dDAup5dxDXZW. However, even after building with cmake, I still get the same error.

There is 1 answer

Jose Chavez (best answer)

To fix this issue, I had to filter and tokenize my dataset first, and then train with sentencepiece. I used the scripts from https://github.com/ymoslem/MT-Preparation for all of these steps, and now my model is training!
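
For anyone hitting the same traceback: the failing call is load_src_model.Load(self.src_subword_model), and the config in the question does not show any subword model path. A likely reading (an assumption, not something the question confirms) is that the path was never set, so None reached Load() and SentencePiece raised "TypeError: not a string". Below is a minimal sketch of a pre-flight check, assuming the YAML config has already been loaded into a plain dict. The helper name missing_subword_models is hypothetical, src_subword_model is the attribute name visible in the traceback, and tgt_subword_model is assumed by analogy.

```python
import os


def missing_subword_models(config):
    """Return a list of problems that would likely break the sentencepiece transform.

    `config` is a dict shaped like the YAML in the question. This helper is
    hypothetical and is not part of OpenNMT or SentencePiece.
    """
    # Simplification: only per-corpus "transforms" lists are inspected.
    uses_sentencepiece = any(
        "sentencepiece" in corpus.get("transforms", [])
        for corpus in config.get("data", {}).values()
    )
    if not uses_sentencepiece:
        return []

    problems = []
    # Key names: the first appears in the traceback, the second is assumed.
    for key in ("src_subword_model", "tgt_subword_model"):
        path = config.get(key)
        if not isinstance(path, str):
            problems.append(f"{key} is not set")
        elif not os.path.isfile(path):
            problems.append(f"{key} points to a missing file: {path}")
    return problems


# Same shape as the config in the question, which has no subword model paths.
config = {
    "data": {
        "europarl": {"transforms": ["sentencepiece", "filtertoolong"]},
        "valid": {"transforms": ["sentencepiece"]},
    },
}

print(missing_subword_models(config))
# -> ['src_subword_model is not set', 'tgt_subword_model is not set']
```

If the check reports missing paths, the SentencePiece model files have to be produced first (for example with the MT-Preparation scripts mentioned above) and then referenced in the config before running onmt_build_vocab.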