Is context needed when using Word2Vec?


I have a large number of strings in a list. A small sample of the list contents:

["machine learning","Apple","Finance","AI","Funding"]

I wish to convert these into vectors and use them for clustering purposes. Is the context in which these strings appear in sentences considered when finding their respective vectors?

How should I go about getting vectors for these strings if I have just this list?

This is my code so far:

    from gensim.models import Word2Vec
    vec = Word2Vec(mylist)

P.S. Can anyone also recommend a good reference/tutorial on Word2Vec?


There are 4 answers

Beta On

Word2Vec is a neural-network-based method. It creates embeddings: dense vectors that reflect the relationships among words. The links below will help you find complete code for implementing Word2Vec.

Some good links are this and this. For the second link, try the author's GitHub repo for the detailed code; the blog post explains only the major parts. The main article is this.

You can use the following code to convert words to their corresponding numerical values:

from collections import Counter

# Map each word to an integer id, most frequent words first
word_counts = Counter(words)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
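As a self-contained sketch (with a made-up toy word list), the mappings above can then be used to encode any tokenized text as integer ids:

```python
from collections import Counter

# Toy example: a small tokenized text (hypothetical data for illustration)
words = "the cat sat on the mat the cat".split()

word_counts = Counter(words)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}

# Encode a tokenized sentence as integer ids
encoded = [vocab_to_int[w] for w in "the cat sat".split()]  # [0, 1, 2]
```

Because the sort is by descending frequency (and Python's sort is stable), the most frequent word gets id 0.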
mquantin On

Answers to your 2 questions:

  1. Is the context of these strings in the sentences considered while finding out their respective vectors?
    Yes, word2vec creates one vector per word (or per string, since it can treat a multiword expression such as New York as a single word); this vector describes the word by its context. The assumption is that similar words appear in similar contexts. The context consists of the surrounding words within a window, under either the CBOW (bag-of-words) or skip-gram formulation.

  2. How should I go about with getting the vectors of these strings if i have just this list containing the strings?
    You need more text. The quality of Word2Vec's output depends on the size of the training set; training Word2Vec on just this short list makes no sense.

The links provided by @Beta are a good introduction/explanation.

siddharth iyer On

To find word vectors using word2vec, you need a list of sentences, not a list of strings.

What word2vec does is go through every word in a sentence and, for each word, try to predict the words around it within a specified window (usually around 5), adjusting the vector associated with that word so that the prediction error is minimized.

Obviously, this means that the order of words matters when finding word vectors. If you just supply a list of strings without a meaningful order, you will not get a good embedding.

I'm not sure, but I think you will find LDA better suited in this case, because your list of strings has no inherent order.

Lightman On
word2vec + context = doc2vec

Build sentences from the text you have and tag them with labels.

Train doc2vec on the tagged sentences to get a vector for each label, embedded in the same space as the word vectors.

Then you can do vector inference and get labels for an arbitrary piece of text.