Command-line parameters in word2vec


I want to use word2vec to create my own word-vector corpus from the current version of the English Wikipedia, but I can't find an explanation of the command-line parameters for that program. In the demo script you can find the following (text8 is an old Wikipedia corpus from 2006):

make
if [ ! -e text8 ]; then
  wget http://mattmahoney.net/dc/text8.zip -O text8.gz
  gzip -d text8.gz -f
fi
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./distance vectors.bin

What is the meaning of these command-line parameters:

-train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15

And what are the most suitable values when I have a Wikipedia text corpus of around 20 GB (a plain .txt file)? I have read that for bigger corpora a vector size of 300 or 500 would be better.


There are 2 answers

Answered by guest

You can check main() of word2vec.c, where each option is explained. Running ./word2vec without any arguments prints the same help text, which begins like this:

printf("WORD VECTOR estimation toolkit v 0.1c\n\n");
printf("Options:\n");
printf("Parameters for training:\n");
printf("\t-train <file>\n");
printf("\t\tUse text data from <file> to train the model\n");...`

About the most suitable values: I'm sorry, I don't know the answer, but you can find some hints in the 'Performance' paragraph of the source site (Word2Vec - Google Code). It says:

 - architecture: skip-gram (slower, better for infrequent words) vs CBOW (fast)
 - the training algorithm: hierarchical softmax (better for infrequent words) vs negative sampling (better for frequent words, better with low dimensional vectors)
 - sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in range 1e-3 to 1e-5)
 - dimensionality of the word vectors: usually more is better, but not always
 - context (window) size: for skip-gram usually around 10, for CBOW around 5 
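For illustration only (the best values depend on your data and goals), those hints might translate into a command like the following for a ~20 GB corpus; corpus.txt is a placeholder for your preprocessed Wikipedia text, and the thread count should match your CPU:

time ./word2vec -train corpus.txt -output vectors.bin -cbow 0 -size 300 -window 10 -negative 5 -hs 0 -sample 1e-5 -threads 12 -binary 1 -iter 5

This picks skip-gram (-cbow 0) with a window of 10, 300-dimensional vectors, and sub-sampling at 1e-5, per the hints above. The word2vec paper also notes that 2-5 negative samples are enough for large data sets (versus 5-20 for small ones), hence -negative 5 instead of the demo's 25.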
Answered by Eyad Shokry

Parameter meanings:

-train text8: the corpus on which the model is trained

-output vectors.bin: the file where the learned word vectors are saved after training, so they can be loaded and used later

-cbow 1: use the continuous-bag-of-words architecture (with -cbow 0 you get skip-gram instead)

-size 200: represent each word as a vector of 200 dimensions
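For completeness, here is what the remaining flags in the question's command do, following the built-in help that word2vec prints when run without arguments:

-window 8: use a context window of up to 8 words around the current word

-negative 25: draw 25 negative samples per training example (0 would disable negative sampling)

-hs 0: do not use hierarchical softmax

-sample 1e-4: threshold for randomly down-sampling very frequent words

-threads 20: train with 20 parallel threads

-binary 1: save the resulting vectors in binary rather than plain-text format

-iter 15: run 15 training passes over the corpus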

If you're new to word2vec, you can also use its implementation in Python through Gensim.
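As a minimal sketch (assuming Gensim 4.x; wiki.txt is a placeholder for a preprocessed corpus with one whitespace-tokenized sentence per line), training with settings equivalent to the demo command above would look like this:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream the corpus from disk instead of loading it all into memory
sentences = LineSentence('wiki.txt')

# Hyperparameters mirror the demo command: CBOW, 200 dimensions, window 8, etc.
model = Word2Vec(
    sentences,
    vector_size=200,  # -size 200
    window=8,         # -window 8
    negative=25,      # -negative 25
    hs=0,             # -hs 0
    sample=1e-4,      # -sample 1e-4
    workers=20,       # -threads 20
    sg=0,             # sg=0 selects CBOW, like -cbow 1
    epochs=15,        # -iter 15
)

# Save in the original word2vec binary format (like -binary 1) ...
model.wv.save_word2vec_format('vectors.bin', binary=True)

# ... and query nearest neighbours, analogous to ./distance
print(model.wv.most_similar('king'))

Because LineSentence streams the file, this approach scales to a 20 GB corpus as long as the text is one sentence per line.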