How do I train word embeddings within a large block of custom text using BERT?


I found a great tutorial on generating contextualized word embeddings for a custom sentence here: http://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/

However, it does not explain how to do this for a larger paragraph. I have around 1,000 tokens that I want the model to learn from. How can I adapt the linked code and apply it to a whole paragraph, so that each word picks up context from the whole document?


1 Answer

Answer by polm23:

The tutorial you link to currently uses Hugging Face Transformers. According to the authors, their BERT model is limited to 512 tokens, so a ~1,000-token paragraph will not fit in a single input. If you want to process longer sequences you'll need to train your own BERT from scratch.
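As a minimal sketch (not part of the original answer), here is how you could check your document against the 512-token limit and pull per-token contextual embeddings with the Hugging Face `transformers` library; the model name (`bert-base-uncased`) and the placeholder `document` string are assumptions for illustration:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Placeholder: your ~1,000-token paragraph would go here.
document = "Your long paragraph of custom text goes here."

# Tokenize without truncation first to see how long the document really is.
token_ids = tokenizer.encode(document, add_special_tokens=True)
print(f"Document length in BERT tokens: {len(token_ids)}")

# Standard BERT checkpoints have max_position_embeddings == 512.
if len(token_ids) > model.config.max_position_embeddings:
    print("Too long for a single forward pass; the input will be truncated.")

# Encode with truncation so the forward pass does not fail on long input.
inputs = tokenizer(document, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub)token: shape (1, num_tokens, hidden_size).
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
```

This only shows why the 512-token ceiling bites on your input and what the per-token output looks like; it does not get around the limit itself.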

Note that in general getting good embeddings for long documents is still an area of active research and you won't get good results just by changing some numbers in a configuration file.