I want to use word2vec to create my own word vector corpus from the current version of the English Wikipedia, but I can't find an explanation of the command-line parameters for that program. In the demo script you can find the following:
(text8 is an old Wikipedia corpus from 2006)
if [ ! -e text8 ]; then
  wget http://mattmahoney.net/dc/text8.zip -O text8.gz
  gzip -d text8.gz -f
fi
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./distance vectors.bin
What do these command-line parameters mean?
vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
And what are the most suitable values when I have a Wikipedia text corpus of around 20 GB (a .txt file)? I read that for bigger corpora a vector size of 300 or 500 would be better.
You can check main() in word2vec.c, where an explanation of each option like the following can be found:
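As a rough guide, here is the demo command rebuilt from annotated variables. This is only a sketch: the descriptions are my paraphrase of the usage text in word2vec.c and may not match it word for word, and the variable names (TRAIN, SIZE, and so on) are purely illustrative and not part of word2vec.

```shell
# Each variable holds the value of one flag from the demo command.
TRAIN=text8        # -train:    text file to train the model on
OUTPUT=vectors.bin # -output:   file to save the resulting word vectors to
CBOW=1             # -cbow:     1 = continuous bag of words, 0 = skip-gram
SIZE=200           # -size:     dimensionality of the word vectors
WINDOW=8           # -window:   maximum skip length (context size) between words
NEGATIVE=25        # -negative: number of negative examples (0 = not used)
HS=0               # -hs:       1 = use hierarchical softmax, 0 = do not
SAMPLE=1e-4        # -sample:   threshold above which frequent words are randomly down-sampled
THREADS=20         # -threads:  number of training threads
BINARY=1           # -binary:   1 = save vectors in binary format, 0 = text
ITER=15            # -iter:     number of training iterations over the corpus

# Assemble the same command line the demo script uses.
CMD="./word2vec -train $TRAIN -output $OUTPUT -cbow $CBOW -size $SIZE -window $WINDOW -negative $NEGATIVE -hs $HS -sample $SAMPLE -threads $THREADS -binary $BINARY -iter $ITER"

# Print the command instead of running it; drop the echo to actually train.
echo "$CMD"
```

Running ./word2vec with no arguments prints the program's own usage text, which is the authoritative description of each option.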
As for the most suitable values, I'm sorry, I don't know the answer, but you can find some hints in the 'Performance' paragraph of the source site (Word2Vec - Google Code). It says: