I am trying to train a Doc2Vec model using gensim with 114M unique documents/labels and a vocabulary of around 3M unique words. I have a 115GB-RAM Linux machine on Azure. When I run build_vocab, the iterator parses all files and then throws the memory error listed below.
Traceback (most recent call last):
File "doc_2_vec.py", line 63, in <module>
model.build_vocab(sentences.to_array())
File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 579, in build_vocab
self.finalize_vocab(update=update) # build tables & arrays
File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 752, in finalize_vocab
self.reset_weights()
File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 662, in reset_weights
self.docvecs.reset_weights(self)
File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 390, in reset_weights
self.doctag_syn0 = empty((length, model.vector_size), dtype=REAL)
MemoryError
My code-
import glob
import parquet
import json
import collections
import multiprocessing

# gensim modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec

class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources
        flipped = {}

    def __iter__(self):
        for src in self.sources:
            with open(src) as fo:
                for row in parquet.DictReader(fo, columns=['Id', 'tokens']):
                    yield LabeledSentence(utils.to_unicode(row['tokens']).split('\x01'), [row['Id']])
## list of files to be opened ##
sources = glob.glob("/data/meghana_home/data/*")
sentences = LabeledLineSentence(sources)
#pre = Doc2Vec(min_count=0)
#pre.scan_vocab(sentences)
"""
for num in range(0, 20):
print('min_count: {}, size of vocab: '.format(num), pre.scale_vocab(min_count=num, dry_run=True)['memory']['vocab']/700)
print("done")
"""
NUM_WORKERS = multiprocessing.cpu_count()
NUM_VECTORS = 300

model = Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=15, window=3, size=NUM_VECTORS, sample=1e-4, negative=10, workers=NUM_WORKERS)
model.build_vocab(sentences)
print("built vocab.......")
model.train(sentences, total_examples=model.corpus_count, epochs=10)
Memory usage as per top is shown in an attached screenshot (not reproduced here).
Can someone please tell me how much memory I should expect this to use? Which is the better option: adding swap space and slowing the process down, or adding more memory (so that the cost of the cluster might eventually be equivalent)? What vectors does gensim store in memory? Is there any flag I am missing for memory-efficient usage?
114 million doctags will require at least

114,000,000 doctags * 300 dimensions * 4 bytes/float = 136GB

just to store the raw doctag-vectors during training.

(If the doctag keys row['Id'] are strings, there'll be extra overhead for remembering the string-to-int-index mapping dict. If the doctag keys are raw ints from 0 to 114 million, that will avoid filling that dict. If the doctag keys are raw ints, but include any int higher than 114 million, the model will attempt to allocate an array large enough to include a row for the largest int – even if many other lower ints are unused.)

The raw word-vectors and model output-layer (model.syn1) will require about another 8GB, and the vocabulary dictionary another few GB.

So you'd ideally want more addressable memory, or a smaller set of doctags.
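A quick back-of-the-envelope check of those figures (a sketch only; the 4-byte float and 300-dimension assumptions come from the question, and real usage adds per-key and dictionary overhead on top):

num_doctags = 114000000       # unique document tags, from the question
vocab_size = 3000000          # unique words, before min_count trimming
vector_size = 300
bytes_per_float = 4           # gensim's REAL is a 4-byte float32

doctag_vectors = num_doctags * vector_size * bytes_per_float   # docvecs.doctag_syn0
word_vectors = vocab_size * vector_size * bytes_per_float      # syn0
output_layer = vocab_size * vector_size * bytes_per_float      # syn1/syn1neg

gb = 1e9
print("doctag vectors: %.1f GB" % (doctag_vectors / gb))                               # ~136.8 GB
print("word vectors + output layer: %.1f GB" % ((word_vectors + output_layer) / gb))   # ~7.2 GB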
You mention a 'cluster', but gensim Doc2Vec does not support multi-machine distribution.

Using swap space is generally a bad idea for these algorithms, which can involve a fair amount of random access and thus become very slow during swapping. But for the case of Doc2Vec, you can set its doctags array to be served by a memory-mapped file, using the Doc2Vec.__init__() optional parameter docvecs_mapfile. In the case of each document having a single tag, and those tags appearing in the same ascending order on each repeated sweep through the training texts, performance may be acceptable.
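For instance, a sketch of that option, reusing the sentences iterable and hyperparameters from the question (the mapfile path is only illustrative, and this assumes a gensim release whose Doc2Vec constructor accepts docvecs_mapfile, as the one in the traceback does):

import multiprocessing
from gensim.models import Doc2Vec

# Serve the big doctag-vector array from a memory-mapped file on disk
# rather than holding it all in RAM; the path is only an example.
model = Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=15, window=3,
                size=300, sample=1e-4, negative=10,
                workers=multiprocessing.cpu_count(),
                docvecs_mapfile='/data/meghana_home/doctag_vectors.mmap')
model.build_vocab(sentences)  # sentences: the LabeledLineSentence iterable above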
Separately:

Your management of training iterations and the alpha learning-rate is buggy. You're achieving 2 passes over the data, at alpha values of 0.025 and 0.023, even though each train() call is attempting a default 5 passes but then just getting a single iteration from your non-restartable sentences.to_array() object.

You should aim for more passes, with the model managing alpha from its initial high value to the default final-tiny min_alpha value, in fewer lines of code. You need only call train() once unless you're absolutely certain you need to do extra steps between multiple calls. (Nothing shown here requires that.)
Make your sentences object a true iterable-object that can be iterated over multiple times, by changing to_array() to __iter__(), then passing the sentences object alone (rather than sentences.to_array()) to the model.

Then call train() once with this multiply-iterable object, and let it do the specified number of iterations with a smooth alpha update from high-to-low. (The default inherited from Word2Vec is 5 iterations, but 10 to 20 are more commonly used in published Doc2Vec work. The default min_alpha of 0.0001 should hardly ever be changed.)
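A minimal sketch of that pattern (the RestartableCorpus name and toy data are illustrative, not from the question; it assumes a gensim version where the pass count is set via iter in the constructor and train() accepts total_examples/epochs, as the traceback's version appears to):

from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec

class RestartableCorpus(object):
    """An object whose __iter__ builds a fresh generator each time,
    so build_vocab() and every training pass can sweep it again."""
    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        for i, tokens in enumerate(self.texts):
            # plain-int doctags (0..N-1) avoid the string-key lookup dict
            yield LabeledSentence(tokens, [i])

corpus = RestartableCorpus([['hello', 'world'], ['another', 'doc']])

model = Doc2Vec(size=300, window=3, min_count=1, negative=10, workers=4,
                iter=20)                     # passes managed by the model
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.iter)

With this shape, alpha decays smoothly from 0.025 to min_alpha across all 20 passes inside the single train() call, with no manual alpha bookkeeping.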