I'm trying to train a word2vec model on a file with about 170K lines, one sentence per line.
I think mine may be a special use case because the "sentences" contain arbitrary strings rather than dictionary words. Each sentence (line) has about 100 words, and each "word" has about 20 characters, including characters like "/" and digits.
The training code is very simple:
# as shown in http://rare-technologies.com/word2vec-tutorial/
import gensim, logging, os

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class MySentences(object):
    """Streams one tokenized sentence (one line of a file) at a time,
    so the whole corpus never has to be loaded into memory."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

current_dir = os.path.dirname(os.path.realpath(__file__))

# each line represents a full chess match
input_dir = current_dir + "/../fen_output"
output_file = current_dir + "/../learned_vectors/output.model.bin"

sentences = MySentences(input_dir)
model = gensim.models.Word2Vec(sentences, workers=8)
The thing is, training moves quickly for the first ~100K sentences (with RAM usage steadily climbing), but then I run out of RAM, the PC starts swapping, and training grinds to a halt. I don't have much RAM available, only about 4 GB, and word2vec uses all of it before the swapping begins.
I think I have OpenBLAS correctly linked to numpy: this is what numpy.show_config()
tells me:
blas_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_info:
    libraries = ['lapack']
    library_dirs = ['/usr/lib']
    language = f77
atlas_threads_info:
    NOT AVAILABLE
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
openblas_lapack_info:
    NOT AVAILABLE
lapack_mkl_info:
    NOT AVAILABLE
atlas_3_10_threads_info:
    NOT AVAILABLE
atlas_info:
    NOT AVAILABLE
atlas_3_10_info:
    NOT AVAILABLE
blas_mkl_info:
    NOT AVAILABLE
mkl_info:
    NOT AVAILABLE
My question is: is this expected on a machine without much available RAM (like mine), meaning I should either get more RAM or train the model in smaller pieces? Or does it look like my setup isn't configured properly (or my code is inefficient)?
Thank you in advance.
1) In general, I would say no. However, given that you only have a small amount of RAM, I would use a lower number of workers. It will slow down training, but you may be able to avoid swapping this way (see the first sketch after this list).
2) You can try stemming or, better, lemmatization. This reduces the number of distinct words since, for example, singular and plural forms are counted as the same word (see the second sketch after this list).
3) However, I think 4 GB of RAM is probably your main problem here; after the OS takes its share, you probably only have 1-2 GB that your processes/threads can actually use. I would really think about investing in more RAM. For example, nowadays you can get a good 16 GB RAM kit for under $100; if you have some money to invest in a decent machine for common ML/"data science" tasks, I'd recommend 64 GB or more.
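A minimal sketch of point 1), reusing the MySentences iterator and input_dir from the question; workers=2 is just an illustrative value, not a recommendation for your exact machine:

import gensim

# Same training call as in the question, but with fewer worker threads:
# slower, but fewer sentence batches are in flight at once, which may be
# enough to keep the process from swapping on a 4 GB machine.
sentences = MySentences(input_dir)   # streaming iterator from the question
model = gensim.models.Word2Vec(sentences, workers=2)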
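And a hedged sketch of point 2), assuming NLTK's WordNetLemmatizer (any lemmatizer would do, and it only helps when the tokens are real dictionary words rather than FEN-like strings); the LemmatizedSentences wrapper is a hypothetical helper written for this answer, not part of gensim:

import gensim
from nltk.stem import WordNetLemmatizer   # needs the WordNet data: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

class LemmatizedSentences(object):
    """Wraps a sentence iterator and lemmatizes every token, so e.g.
    singular and plural forms map to the same vocabulary entry."""
    def __init__(self, sentences):
        self.sentences = sentences

    def __iter__(self):
        for sentence in self.sentences:
            yield [lemmatizer.lemmatize(word.lower()) for word in sentence]

# The vocabulary shrinks; the training call itself stays the same.
model = gensim.models.Word2Vec(LemmatizedSentences(MySentences(input_dir)), workers=2)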