Subsampling when training word embeddings


NLP newbie here with a question about word embeddings. As a learning exercise, I'm trying to train my own set of word embeddings based on word2vec. I have a corpus of English sentences that I've downloaded and cleaned, and I think I have a decent grasp of how the training is supposed to work, but there's one thing I still don't really understand.

As you might imagine, the corpus contains far more instances of common words like 'the' and 'and' than of rarer words; the word frequency distribution follows a fairly extreme power law, which makes sense. My question is this: what are the best practices for dealing with this imbalance when I'm generating samples to train the word embeddings?

I can see a few options:

  1. When I'm generating training samples, do some sort of probabilistic sampling based on the frequency of the input token in the dataset. My newbie intuition is that this makes some sense, but I'm not 100% sure how the sampling should work.
  2. With some probability, drop the most common words from the vocabulary altogether and don't learn embeddings for them at all. I've seen some guidance on the web (and in the original word2vec paper) that recommends doing this and then just treating them as OOV tokens when looking up embeddings, but it just feels ... weird. After all, I do want an embedding for the word 'the', even though it appears very frequently.
  3. Just power through and live with the fact that I'm going to have a lot more training samples for the word 'the' than the word 'persnickety'. This will make a training epoch take a lot longer.
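
For context, here's roughly how I'm generating skip-gram pairs right now, with no subsampling at all (a simplified sketch; my real tokenization and window size are a bit different, and `sentences` is just a list of token lists):

```python
def skipgram_pairs(sentences, window=2):
    """Yield (center, context) training pairs from tokenized sentences."""
    for tokens in sentences:
        for i, center in enumerate(tokens):
            # Look at neighbors within `window` positions on each side.
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield center, tokens[j]
```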

Can anyone give me some guidance here? How do people usually deal with this kind of imbalance?


1 Answer

Answered by Seth

I think I've figured out the answer, but I'm not 100% sure (since I'm pretty new at this), so please feel free to correct me.

I think my original question was based on a misreading of the word2vec paper: you're not supposed to drop common words from the vocabulary altogether; you're supposed to omit them, with some probability, when you're generating the training pairs (in my case, skip-grams). The words stay in the vocabulary and still get embeddings, but very frequent words are randomly (ahem) skipped when building the training data, and the probability of skipping a word grows with its frequency.

So the answer (according to the paper) is option 1.
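
To make that concrete, here's a rough sketch of what I think the subsampling looks like in code. It uses the discard probability from the paper, P(w) = 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold (the paper uses something around 1e-5). The function names and the `sentences` structure (a list of token lists) are just my own choices, not from any library, so treat it as a sketch rather than a reference implementation:

```python
import math
import random
from collections import Counter

def keep_probs(token_counts, t=1e-5):
    """Per-word probability of KEEPING an occurrence, following the paper:
    P(discard w) = 1 - sqrt(t / f(w)), so P(keep w) = sqrt(t / f(w)), capped at 1."""
    total = sum(token_counts.values())
    keep = {}
    for word, count in token_counts.items():
        freq = count / total
        keep[word] = min(1.0, math.sqrt(t / freq))  # rare words are always kept
    return keep

def subsampled_skipgrams(sentences, window=2, t=1e-5, seed=0):
    """Generate (center, context) pairs, randomly skipping very frequent tokens."""
    rng = random.Random(seed)
    counts = Counter(tok for sent in sentences for tok in sent)
    keep = keep_probs(counts, t)
    for tokens in sentences:
        # Drop frequent tokens from the sentence before building pairs,
        # so they show up less often as both centers and contexts.
        kept = [tok for tok in tokens if rng.random() < keep[tok]]
        for i, center in enumerate(kept):
            lo, hi = max(0, i - window), min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield center, kept[j]
```

One thing I like about this: because the filtering happens per occurrence rather than per word type, 'the' still shows up in plenty of training pairs and still gets an embedding; it just doesn't dominate every epoch.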