Word2vec training using gensim starts swapping after 100K sentences


I'm trying to train a word2vec model using a file with about 170K lines, with one sentence per line.

I think I may represent a special use case because the "sentences" have arbitrary strings rather than dictionary words. Each sentence (line) has about 100 words and each "word" has about 20 characters, with characters like "/" and also numbers.

The training code is very simple:

# as shown in http://rare-technologies.com/word2vec-tutorial/
import gensim, logging, os

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            # use a context manager so each file handle is closed promptly
            with open(os.path.join(self.dirname, fname)) as f:
                for line in f:
                    yield line.split()

current_dir = os.path.dirname(os.path.realpath(__file__))

# each line represents a full chess match
input_dir = current_dir+"/../fen_output"
output_file = current_dir+"/../learned_vectors/output.model.bin"

sentences = MySentences(input_dir)

model = gensim.models.Word2Vec(sentences,workers=8)

Training runs quickly for roughly the first 100K sentences (with my RAM usage climbing steadily), but then I run out of RAM, the PC starts swapping, and training grinds to a halt. I only have about 4GB of RAM available, and word2vec uses all of it before the swapping starts.

I think I have OpenBLAS correctly linked to numpy: this is what numpy.show_config() tells me:

blas_info:
  libraries = ['blas']
  library_dirs = ['/usr/lib']
  language = f77
lapack_info:
  libraries = ['lapack']
  library_dirs = ['/usr/lib']
  language = f77
atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
  libraries = ['openblas']
  library_dirs = ['/usr/lib']
  language = f77
openblas_info:
  libraries = ['openblas']
  library_dirs = ['/usr/lib']
  language = f77
lapack_opt_info:
  libraries = ['lapack', 'blas']
  library_dirs = ['/usr/lib']
  language = f77
  define_macros = [('NO_ATLAS_INFO', 1)]
openblas_lapack_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

My question is: is this expected on a machine without much available RAM (like mine), meaning I should get more RAM or train the model in smaller pieces? Or does it look like my setup isn't configured properly (or my code is inefficient)?

Thank you in advance.


There are 2 answers

AudioBubble On

does it look like my setup isn't configured properly (or my code is inefficient)?

1) In general, I would say no. However, given how little RAM you have, I would use fewer workers. That will slow down training, but it may let you avoid swapping.

2) You can try stemming or, better, lemmatization. That reduces the number of distinct words since, for example, singular and plural forms are counted as the same word.

3) However, I think 4 GB of RAM is probably your main problem here (once the OS has taken its share, you probably have only 1-2 GB that the processes/threads can actually use). I would seriously consider investing in more RAM. For example, nowadays you can get good 16 GB RAM kits for < $100; if you have some money to invest in decent RAM for common ML/"data science" tasks, I'd recommend > 64 GB.

gojomo On

As a first principle, you should always get more RAM, if your budget and machine can manage it. It saves so much time & trouble.

Second, it's unclear if you mean that on a dataset of more than 100K sentences, training starts to slow down after the first 100K sentences are encountered, or if you mean that using any dataset larger than 100K sentences experiences the slowdown. I suspect it's the latter, because...

Word2Vec memory usage is a function of the vocabulary size (the number of unique tokens), not of the total amount of data used to train. So you may want to use a larger min_count to slim the number of tracked words and cap RAM usage during training. (Words not tracked by the model are silently dropped during training, as if they weren't there. Doing that for rare words doesn't hurt much and sometimes even helps, by putting other words closer to each other.)
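To make the dependence on vocabulary size concrete, here is a rough back-of-the-envelope sketch in plain Python (no gensim needed). It assumes the model holds two float32 arrays of shape (vocabulary size x vector size), which as far as I know matches the default setup, plus a guessed fixed overhead per vocabulary entry. The helper names count_tokens and estimate_bytes, the 500-byte overhead figure, and the toy corpus are all illustrative assumptions, not part of gensim.

```python
from collections import Counter


def count_tokens(sentences):
    """Count how often each token occurs across an iterable of token lists."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)
    return counts


def estimate_bytes(counts, min_count=5, vector_size=100, per_word_overhead=500):
    """Return (retained vocabulary size, rough byte estimate).

    Assumption: two float32 arrays of shape (vocab, vector_size), plus a
    guessed fixed overhead per vocabulary entry.
    """
    kept = sum(1 for c in counts.values() if c >= min_count)
    array_bytes = 2 * kept * vector_size * 4  # 4 bytes per float32
    return kept, array_bytes + kept * per_word_overhead


# Hypothetical toy corpus; real input would be streamed from disk.
toy_sentences = [
    ["a/1", "b/2", "c/3"],
    ["a/1", "b/2", "d/4"],
    ["a/1", "e/5", "f/6"],
]
toy_counts = count_tokens(toy_sentences)

for min_count in (1, 2):
    kept, total = estimate_bytes(toy_counts, min_count=min_count)
    print(f"min_count={min_count}: {kept} words kept, ~{total} bytes")
```

By the same arithmetic, a million retained tokens at 100 dimensions would need roughly 800 MB for the two arrays alone, so a corpus made up mostly of unique strings may well exhaust 4 GB.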

Finally, you may wish to avoid providing the corpus sentences in the constructor, which automatically scans and trains, and instead explicitly call the build_vocab() and train() steps yourself after model construction, so that you can examine the state/size of the model and adjust your parameters as needed.

In particular, in the latest versions of gensim (at the time of writing), you can also split the build_vocab(corpus) step into three steps: scan_vocab(corpus), scale_vocab(...), and finalize_vocab().

The scale_vocab(...) step can be called with a dry_run=True parameter that previews how large your vocabulary, subsampled corpus, and expected memory-usage will be after trying different values of the min_count and sample parameters. When you find values that seem manageable, you can call scale_vocab(...) with those chosen parameters, and without dry_run, to apply them to your model (and then finalize_vocab() to initialize the large arrays).
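If you just want a feel for what such a preview tells you, the min_count part can be approximated in plain Python from raw token counts. This is only a sketch of the idea, not gensim's implementation: preview_min_count is a hypothetical helper, the counts are made up, and the effect of the sample parameter is ignored entirely.

```python
from collections import Counter


def preview_min_count(counts, candidates):
    """For each candidate min_count, report how much would survive.

    Returns a list of (min_count, retained vocabulary size,
    fraction of corpus tokens retained) tuples.
    """
    total_tokens = sum(counts.values())
    rows = []
    for min_count in candidates:
        kept = [c for c in counts.values() if c >= min_count]
        rows.append((min_count, len(kept), sum(kept) / total_tokens))
    return rows


# Made-up token frequencies standing in for a real scanned corpus.
example_counts = Counter({"a/1": 50, "b/2": 30, "c/3": 15, "d/4": 4, "e/5": 1})
rows = preview_min_count(example_counts, [1, 5, 20])

for min_count, vocab_size, fraction in rows:
    print(f"min_count={min_count}: vocab={vocab_size}, tokens kept={fraction:.0%}")
```

The useful pattern to look for is a min_count that cuts the vocabulary sharply while still keeping most of the corpus tokens.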