Gensim Doc2vec finalize_vocab Memory Error

I am trying to train a Doc2Vec model using gensim, with 114M unique documents/labels and a vocabulary of around 3M unique words. I have a 115GB RAM Linux machine on Azure. When I run build_vocab, the iterator parses all files and then throws the memory error listed below.

Traceback (most recent call last):
  File "doc_2_vec.py", line 63, in <module>
    model.build_vocab(sentences.to_array())
  File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 579, in build_vocab
    self.finalize_vocab(update=update)  # build tables & arrays
  File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 752, in finalize_vocab
    self.reset_weights()
  File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 662, in reset_weights
    self.docvecs.reset_weights(self)
  File "/home/meghana/.local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 390, in reset_weights
    self.doctag_syn0 = empty((length, model.vector_size), dtype=REAL)
MemoryError

My code:

import glob            # used to expand the /data/... file pattern below
import parquet
import json
import collections
import multiprocessing


# gensim modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec

class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources   
        flipped = {}

    def __iter__(self):
        for src in self.sources:
            with open(src) as fo:
                for row in parquet.DictReader(fo, columns=['Id','tokens']):
                    yield LabeledSentence(utils.to_unicode(row['tokens']).split('\x01'), [row['Id']])

## list of files to be open ##
sources =  glob.glob("/data/meghana_home/data/*")
sentences = LabeledLineSentence(sources)

#pre = Doc2Vec(min_count=0)
#pre.scan_vocab(sentences)
"""
for num in range(0, 20):
    print('min_count: {}, size of vocab: '.format(num), pre.scale_vocab(min_count=num, dry_run=True)['memory']['vocab']/700)
    print("done")
"""

NUM_WORKERS = multiprocessing.cpu_count()
NUM_VECTORS = 300
model = Doc2Vec(alpha=0.025, min_alpha=0.0001,min_count=15, window=3, size=NUM_VECTORS, sample=1e-4, negative=10, workers=NUM_WORKERS) 
model.build_vocab(sentences)
print("built vocab.......")
model.train(sentences,total_examples=model.corpus_count, epochs=10)

Memory usage as per top:

(screenshot of top output not included)

Can someone please tell me how much memory this is expected to need? Which is the better option: adding swap space and letting the process run slowly, or adding more memory so that the cost of the cluster eventually works out about the same? Which vectors does gensim keep in memory? Is there any flag I am missing for memory-efficient usage?

1 Answer

Answered by gojomo (best answer):

114 million doctags will require at least 114,000,000 doctags * 300 dimensions * 4 bytes/float = 136GB just to store the raw doctag-vectors during training.

(If the doctag keys row['Id'] are strings, there'll be extra overhead for remembering the string-to-int-index mapping dict. If the doctag keys are raw ints from 0 to 114 million, that will avoid filling that dict. If the doctag keys are raw ints, but include any int higher than 114 million, the model will attempt to allocate an array large enough to include a row for the largest int – even if many other lower ints are unused.)
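For illustration only, here is a minimal sketch (not part of the original answer) of an iterator that yields contiguous int doctags starting at 0, reusing the question's parquet reading and imports. If the original string Ids are needed later, they would have to be kept in a separate list or file outside the model.

# Sketch: yield plain int doctags 0..N-1 instead of string Ids, so gensim can
# index doctag rows directly without building a string-to-index dict.
class IntTaggedSentences(object):
    def __init__(self, sources):
        self.sources = sources

    def __iter__(self):
        doc_idx = 0
        for src in self.sources:
            with open(src) as fo:
                for row in parquet.DictReader(fo, columns=['Id', 'tokens']):
                    words = utils.to_unicode(row['tokens']).split('\x01')
                    yield LabeledSentence(words, [doc_idx])  # int tag, not row['Id']
                    doc_idx += 1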

The raw word-vectors and model output-layer (model.syn1) will require about another 8GB, and the vocabulary dictionary another few GB.
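As a rough back-of-the-envelope check of those figures (a sketch using the numbers quoted above, not exact gensim accounting, and ignoring dictionary overhead):

# Estimate of the main arrays gensim allocates: 114M doctags, ~3M vocabulary
# words, 300 dimensions, 4 bytes per float32.
BYTES_PER_FLOAT = 4
DIMS = 300
N_DOCTAGS = 114 * 10**6
VOCAB_SIZE = 3 * 10**6

doctag_vectors = N_DOCTAGS * DIMS * BYTES_PER_FLOAT     # docvecs.doctag_syn0
word_vectors = VOCAB_SIZE * DIMS * BYTES_PER_FLOAT      # syn0
output_layer = VOCAB_SIZE * DIMS * BYTES_PER_FLOAT      # syn1 / syn1neg

GB = 10**9
print("doctag vectors: %.1f GB" % (doctag_vectors / float(GB)))                               # ~136.8 GB
print("word vectors + output layer: %.1f GB" % ((word_vectors + output_layer) / float(GB)))   # ~7.2 GB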

So you'd ideally want more addressable memory, or a smaller set of doctags.

You mention a 'cluster', but gensim Doc2Vec does not support multi-machine distribution.

Using swap space is generally a bad idea for these algorithms, which can involve a fair amount of random access and thus become very slow during swapping. But for the case of Doc2Vec, you can set its doctags array to be served by a memory-mapped file, using the Doc2Vec.__init__() optional parameter docvecs_mapfile. In the case of each document having a single tag, and those tags appearing in the same ascending order on each repeated sweep through the training texts, performance may be acceptable.
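A minimal sketch of how that might look, assuming a gensim version whose Doc2Vec constructor accepts docvecs_mapfile (the mapfile path below is only an example):

# Back the doctag array with a memory-mapped file instead of holding it all in RAM.
model = Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=15, window=3,
                size=NUM_VECTORS, sample=1e-4, negative=10, workers=NUM_WORKERS,
                docvecs_mapfile='/data/meghana_home/doctag_vectors.mmap')  # example path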

Separately:

Your management of training iterations and the alpha learning-rate is buggy. You're achieving 2 passes over the data, at alpha values of 0.025 and 0.023, even though each train() call is attempting a default 5 passes but then just getting a single iteration from your non-restartable sentences.to_array() object.

You should aim for more passes, with the model managing alpha from its initial-high to default final-tiny min_alpha value, in fewer lines of code. You need only call train() once unless you're absolutely certain you need to do extra steps between multiple calls. (Nothing shown here requires that.)

Make your sentences object a true iterable-object that can be iterated over multiple times, by changing to_array() to __iter__(), then passing the sentences alone (rather than sentences.to_array()) to the model.

Then call train() once with this multiply-iterable object, and let it do the specified number of iterations with a smooth alpha update from high-to-low. (The default inherited from Word2Vec is 5 iterations, but 10 to 20 are more commonly used in published Doc2Vec work. The default min_alpha of 0.0001 should hardly ever be changed.)
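Putting those suggestions together, a sketch of the reworked flow, assuming the LabeledLineSentence class above (which already defines __iter__) and a gensim version whose train() takes total_examples and epochs:

# One restartable iterable, one build_vocab() pass, and a single train() call
# that manages the alpha decay across all epochs internally.
sentences = LabeledLineSentence(sources)          # __iter__ can be replayed each epoch

model = Doc2Vec(min_count=15, window=3, size=NUM_VECTORS,
                sample=1e-4, negative=10, workers=NUM_WORKERS)
model.build_vocab(sentences)                      # first full pass over the corpus
model.train(sentences,
            total_examples=model.corpus_count,
            epochs=20)                            # 10-20 passes; alpha decays to min_alpha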