I have one million documents. I chunk those documents up into 20 million paragraphs (each paragraph is about 50 words) to train a Doc2Vec model with. It currently takes about 20 hours to train the model with these paragraphs and I would like to speed this up (it only takes 15 mins to build the vocabulary). Gensim never uses more than 4 cores during training (no matter how many cores I have available). I know there are issues with the Python GIL that may be the cause of this.
Here's the code I use to train the Doc2Vec model:
```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=50, min_count=3, epochs=40, workers=24, dm=0,
                window=3, dbow_words=0, sample=1e-4, negative=15, ns_exponent=0.75)
model.build_vocab(training_corpus)
model.train(training_corpus, total_examples=model.corpus_count, epochs=40)
```
Is there a way to distribute this across multiple machines so I can train the model more quickly?
Gensim has no support for distributing `Doc2Vec` training over multiple machines.

With your `workers=24`, Gensim's `Doc2Vec` will spawn 24 worker threads, in addition to the main/master thread that reads the corpus and parcels out batches of documents to each worker. So potentially up to 25 cores will be at least a little active, but when you see something in `top` like "400%", that may mean each of 25 cores is only about 16% in use, due to other bottlenecks.

In particular, the classic and Pythonic manner of supplying the corpus as an iterable, Gensim's use of a single thread to split up that corpus, and the fact that only the innermost raw calculations have been moved into fully-GIL-immune Cython/C code, together mean it has a hard time keeping more than around 12 threads busy at once (and thus reporting more than ~1200% utilization in `top`). In fact, past a certain optimum number of workers, which varies with your other corpus details and model parameters, trying to use more threads starts to hurt, from contention/switching overhead.
So: holding the rest of your setup/model parameters constant, if you want to achieve maximum training core utilization in this corpus mode, you'd need to experiment with different `workers` values, watch the logging-reported words-per-second for each over a few representative minutes, and pick the `workers` value that reports best. Notably:

- The rate logged early in `train()` should be representative of the entire run, so you can start a run, estimate its rate, and interrupt it.
- To avoid repeating the `build_vocab()` cost each time (a step which requires a full pass over the corpus and also remains single-threaded), you can `.save()` the model after `.build_vocab()`, then re-`.load()` it repeatedly, tamper with the internal `workers` value, and run `.train()` for each trial.

A few other speed-of-training factors to consider:
- If iterating over your `training_corpus` does anything other than read an already-tokenized corpus from RAM or a fast volume, such as accessing a DB or performing extra preprocessing, it will be an even bigger limit on how busy the workers can get. So it's often best to do those things in a prior step, saving the result as a plain-text, space-delimited corpus on a fast local volume, or even, in plentiful-RAM situations, as an in-memory Python list-of-lists-of-strings.
- There is an alternate mode to specify your corpus that more fully escapes Python GIL limitations: a
`corpus_file` parameter that, instead of streaming documents from a Python iterable, lets as many worker threads as you choose read directly, independently, and without contention from a corpus file on disk. This can potentially saturate as many cores as you have, but has caveats, notably: you can't supply your own document `tags` (the keys to the doc-vectors), or more than one tag per document; instead each document's tag is only its single line number. So: worth considering if your performance needs are acute and hardware is plentiful, but you'll want to do some extra checks on the results.
Another consideration as your corpus grows: though training has some challenges to parallelize/distribute, and grows in duration with corpus size, inference of doc-vectors for arbitrary new documents is easier to distribute, including across machines, by just moving copies of the frozen trained model to new machines. So if you've got a representative X million documents, sufficient in their coverage of vocabulary and themes to train a good model, you can then use that good model to infer 1000*X as many doc-vectors for other documents that were never part of training. (They'd not be available for lookup/most-similar ops from a single `Doc2Vec` model object, but could be accessed through other systems/vector-DBs.)

Finally, a couple of other notes on your setup:
- You may want to increase `min_count` over its default of 5 as your corpus grows, rather than decrease it. Rare words don't get good word-vectors or contribute meaningfully to doc-vector quality, but can (in aggregate) take up lots of model state and training time. Pretending they aren't there, when they're too few to be helpful, tends to make the remaining vectors score better on downstream tasks. So don't assume "a lower value means more data used, and more data used can only help"; evaluate alternatives, because some rare words are essentially noise.
- Smaller (more aggressive) `sample` values are often a big win, both speeding training and improving results, by essentially spending less time on (and giving less influence to) very-frequent words. Again, something to test.
- Be sure the `total_examples` passed to `train()` matches the true document count, or else both the internal `alpha` learning-rate decay and the logged progress estimates will be handled wrong.

Good luck!