What is the best way to scale up Gensim Doc2Vec training?


I have one million documents. I chunk those documents up into 20 million paragraphs (each paragraph is about 50 words) to train a Doc2Vec model with. It currently takes about 20 hours to train the model with these paragraphs and I would like to speed this up (it only takes 15 mins to build the vocabulary). Gensim never uses more than 4 cores during training (no matter how many cores I have available). I know there are issues with the Python GIL that may be the cause of this.

Here's the code I use to train the Doc2Vec model:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=50, min_count=3, epochs=40, workers=24, dm=0, window=3, dbow_words=0, sample=0.00010, negative=15, ns_exponent=0.75)
model.build_vocab(training_corpus)
model.train(training_corpus, total_examples=model.corpus_count, epochs=40)

Is there a way to distribute this across multiple machines so I can train the model more quickly?


1 Answer

Answer by gojomo:

Gensim has no support for distributing Doc2Vec training over multiple machines.

With your workers=24, Gensim's Doc2Vec will spawn 24 worker threads – in addition to the main/master thread that's reading the corpus & parceling out batches of documents to each worker. So potentially up to 25 cores will be at least a little active – but when you see something in top like "400%", that may mean each of 25 cores is only about 16% in use, due to other bottlenecks.

In particular, the classic & Pythonic manner of supplying the corpus as an iterable, plus Gensim using a single thread to split up that corpus, and Gensim only having moved the innermost raw calculations – the parts fully immune to the Python GIL – into raw Cython/C code, means it has a hard time keeping more than around 12 threads busy at once (and thus ~1200% utilization as reported by top).

In fact, past a certain optimum number of workers – which is affected by your other corpus details & model parameters – trying to use more threads starts to hurt, from contention/switching overhead.

So: holding the rest of your setup/model-parameters constant, if you want to achieve the maximum training core utilization in this corpus mode, you'd need to experiment with different workers values & watch the logging-reported words-per-second for each over a few representative minutes, and pick the workers value that reports best. Notably:

  • The optimal value in the corpus mode you're using is likely to be in the range 4-16.
  • As long as your corpus isn't wildly different in its doc lengths and words used across different ranges, the 1st few minutes of train() should be representative of the entire run – so you can start a run, estimate its rate, & interrupt it.
  • To not have to pay the build_vocab() cost each time – a step which requires a full pass over the corpus and also remains single-threaded – you can .save() the model after .build_vocab(), then repeatedly re-.load() it, adjust the internal workers value, and run .train() for each trial (see the sketch just below this list).
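
For illustration, a minimal sketch of that trial loop – assuming training_corpus is the same re-iterable corpus from your question, the snapshot filename is arbitrary, and you interrupt each trial (Ctrl-C) once the logged words/sec rate looks stable:

import logging
from gensim.models.doc2vec import Doc2Vec

# Gensim reports progress, including words/sec rates, via INFO-level logging.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

# One-time: build the vocabulary (a single-threaded full pass), then snapshot
# the still-untrained model to disk so each trial can skip this step.
model = Doc2Vec(vector_size=50, min_count=3, epochs=40, workers=24, dm=0,
                window=3, dbow_words=0, sample=0.0001, negative=15, ns_exponent=0.75)
model.build_vocab(training_corpus)
model.save("d2v_vocab_only.model")

# Trials: reload the snapshot, change `workers`, start training, watch the
# logged words/sec for a few representative minutes, then interrupt & try the next value.
for n_workers in (4, 6, 8, 12, 16, 24):
    trial = Doc2Vec.load("d2v_vocab_only.model")
    trial.workers = n_workers
    trial.train(training_corpus, total_examples=trial.corpus_count, epochs=1)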

A few other speed-of-training factors to consider:

  • If your training_corpus is doing anything other than reading an already-tokenized corpus from RAM or a fast local volume – like accessing a DB or performing extra preprocessing on the fly – that will be an even bigger limit on how busy the workers can get. So it's often best to do those things in a prior step, saving the result as a plain-text, space-delimited corpus on a fast local volume – or even, in plentiful-RAM situations, as an already-in-memory Python list-of-lists-of-strings (a small sketch follows this list).
  • The optimal number of workers, given contention issues, is most affected by a few of the parameters that make each individual training-example intense, especially 'window', 'negative', & 'vector_size'. Each of these, when larger, typically slows training versus a smaller value, ceteris paribus. BUT, by making the blocks of optimized (Python-GIL-immune) calculation run longer, larger values here increase the achievable core-utilization. So in experimenting with parameters, if you have more cores than this corpus mode can usually saturate, it may be 'nearly free' to try higher values in some of these parameters, because all the extra calc-costs soak up cores that'd otherwise be idled by contention. But remember: a 'workers' count that's proven optimal-throughput for one set of parameters may be wrong when these other core-calculation-changing parameters are varied.
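
As a small sketch of that prior step – assuming a hypothetical read_raw_docs() generator standing in for your real source (DB query, file reads, etc.) and a trivial whitespace tokenization in place of your real preprocessing:

from gensim.models.doc2vec import TaggedDocument

def read_raw_docs():
    # hypothetical placeholder: replace with your real source of (doc_id, raw_text) pairs
    yield "doc_000001", "first example paragraph of raw text"
    yield "doc_000002", "second example paragraph of raw text"

# Do all the expensive access/preprocessing once, up front, so that train()'s
# worker threads only ever see ready-to-use tokenized documents held in RAM.
training_corpus = [
    TaggedDocument(words=text.lower().split(), tags=[doc_id])
    for doc_id, text in read_raw_docs()
]

The same loop could instead write each tokenized document as a space-delimited line to a local file, as in the corpus_file sketch further below.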

There is an alternate mode for specifying your corpus that more fully escapes Python GIL limitations: a corpus_file parameter that, instead of streaming documents from a Python iterable, lets as many worker threads as you choose read directly, independently, & without contention from a corpus file on disk. This can potentially saturate as many cores as you have, but has a few caveats:

  • the whole corpus must be a single file on disk, with one document per line, already tokenized by spaces – which can become unwieldy with exactly the giant corpora on which this mode would be most attractive
  • you lose the ability to specify your own document tags (keys to the doc-vectors), or more than one tag per document – instead each document's tag is only its single line-number
  • this newer, less-tested mode still shows some evidence of bugginess around the 'seams' between the ranges of the file that different threads train on, where tiny sections of the training data may be ignored – see project issue #2757 – and places where performance varies for unclear reasons – see project issue #3089

So: worth considering if your performance needs are acute & your hardware is plentiful, but you'll want to do some extra checks on the results.
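
For reference, a minimal sketch of the corpus_file path – reusing the training_corpus list from the earlier sketch, with a hypothetical corpus.txt output path:

from gensim.models.doc2vec import Doc2Vec

# Write one already-tokenized document per line, tokens separated by spaces.
with open("corpus.txt", "w", encoding="utf-8") as f:
    for doc in training_corpus:
        f.write(" ".join(doc.words) + "\n")

# With corpus_file, each worker thread reads its own range of the file
# independently, so more cores can stay busy; each document's tag becomes
# its 0-based line number, not the tags you assigned earlier.
model = Doc2Vec(corpus_file="corpus.txt", vector_size=50, min_count=3,
                epochs=40, workers=24, dm=0, window=3, dbow_words=0,
                sample=0.0001, negative=15, ns_exponent=0.75)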

Another consideration as your corpus grows: though training has some challenges to parallelize/distribute, and grows in duration with corpus size, inference of doc-vectors for arbitrary new documents is easier to distribute, including across machines – by just moving copies of the frozen, trained model to other machines. So if you've got a representative X million documents, sufficient in their coverage of vocabulary & themes to train a good model, you can then use that good model to infer 1000*X as many doc-vectors for other documents that were never part of training. (They'd not be available for lookup/most-similar ops from a single Doc2Vec model object, but could be accessed through other systems/vector-DBs.)
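
As a rough sketch of that pattern – assuming the trained model was saved under a hypothetical filename and copied to each inference machine, and new_docs is a list of already-tokenized unseen documents:

from gensim.models.doc2vec import Doc2Vec

# On each inference machine: load a copy of the frozen, already-trained model.
model = Doc2Vec.load("d2v_trained.model")

# hypothetical: new already-tokenized documents that were never part of training
new_docs = [
    ["an", "unseen", "paragraph", "of", "tokens"],
    ["another", "new", "document"],
]

# Each inference is independent, so this loop can be spread across as many
# processes/machines as you like; store the vectors in whatever external
# system (file, DB, vector index) fits your lookup needs.
new_vectors = [model.infer_vector(tokens) for tokens in new_docs]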

Finally, a couple other notes on your setup:

  • It almost always improves these models to increase the min_count over its default of 5 as your corpus grows, rather than decrease it. Rare words don't get good word-vectors or contribute meaningfully to doc-vector quality, but can (in aggregate) take up lots of the model state & training time. Pretending they aren't there, when too few to be helpful, tends to make the remaining vectors score better on downstream tasks. So don't assume "a lower value means more data used, and more data can only help" – evaluate alternatives, because some rare words are essentially 'noise'.
  • Often training on large corpora does fine with just 10-20 epochs, so your 40 may be overkill. Something to test.
  • With large corpora, more-aggressive (smaller) sample values are often a big win, both speeding training and improving results, by essentially spending less time on (& giving less influence to) very-frequent words. Again, something to test.
  • Make sure the total_examples you pass to train() matches the true document count, or else both the internal alpha learning-rate decay and the logged progress estimates will be handled wrong.
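
Putting a few of those notes together, one hypothetical revised configuration to evaluate – the exact min_count, epochs, sample & workers values here are only illustrative, and worth testing against your own downstream tasks:

model = Doc2Vec(vector_size=50, min_count=10, epochs=15, workers=12, dm=0,
                window=3, dbow_words=0, sample=1e-5, negative=15, ns_exponent=0.75)
model.build_vocab(training_corpus)
# Passing the model's own corpus_count & epochs keeps the learning-rate decay
# and progress logging consistent with the actual run.
model.train(training_corpus, total_examples=model.corpus_count, epochs=model.epochs)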

Good luck!