Word2Vec Sentiment Classification with R and H2O


I am trying to build a sentiment classification model with R and H2O. I have a data file with the format:

+-----------+------------------------------------------------------+
| Sentiment | Text                                                 |
+-----------+------------------------------------------------------+
| 1         | This is a sample text. This is another sentence.     |
+-----------+------------------------------------------------------+
| 0         | Another sentence. And another!                       |
+-----------+------------------------------------------------------+
| -1        | Text text and Text! Text everywhere! So much text... |
+-----------+------------------------------------------------------+

So the sentiment values are 1, 0 and -1, and the text in each row can consist of several sentences. I now want to prepare the dataset for use with the deeplearning function of H2O. For that I wanted to use the tmcn.word2vec R package, but I cannot transform the text row-wise with it. I could transform the whole text column into one word2vec document, but then the per-row sentiment information would be lost.

Is there another way to translate the text into numerical input for a deeplearning function in R? Especially for H2O?

Best regards

There are 3 answers

Avni

There are a few ways you can accomplish this task with H2O. First, though, you need to normalize the texts in your dataset.

I'm assuming you are doing some text cleaning / tokenization that produces a sequence of individual word strings, and that you then run your word2vec model on those strings. The problem is that each text document can be N words long, so you might want to average the word2vec vectors over each document.

So for the second example row: (v(another) + v(sentence) + v(and) + v(another)) / 4 (the number of words). This produces an average vector of fixed length (the embedding dimension) for each text document.
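That averaging step can be sketched in base R like this (toy 3-dimensional vectors stand in for real word2vec output; `embeddings` and `doc_vector` are illustrative names, not part of any package):

```r
# Toy word embeddings (in practice these come from a trained word2vec model)
embeddings <- list(
  another  = c(0.1, 0.3, -0.2),
  sentence = c(0.4, -0.1, 0.0),
  and      = c(0.0, 0.2, 0.1)
)

# Average the vectors of all in-vocabulary words in one document
doc_vector <- function(words, emb) {
  vecs <- emb[words[words %in% names(emb)]]  # drop out-of-vocabulary words
  Reduce(`+`, vecs) / length(vecs)
}

doc_vector(c("another", "sentence", "and", "another"), embeddings)
# one numeric vector with the same dimensionality as the embeddings
```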

After that you can use our h2o.cbind() function in R. Partition your dataset into two data frames, where frame 1 holds only the sentiment of each document (-1, 0, 1) and frame 2 holds the texts ('Another sentence. And another!'). Run the steps above on the text data frame and then cbind the two.

Be sure to import both data frames into H2O BEFORE using our h2o.cbind() command, and then you should be ready to run our h2o.deeplearning() model on your dataset!
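A minimal sketch of that workflow, assuming a running H2O cluster, the original data frame `data`, and a data frame `doc_vectors` holding one averaged word2vec vector per row (both illustrative names):

```r
library(h2o)
h2o.init()  # connect to / start a local H2O cluster

# Both frames must live in H2O before h2o.cbind() can join them
sentiment_hex <- as.h2o(data.frame(Sentiment = as.factor(data$Sentiment)))
vectors_hex   <- as.h2o(doc_vectors)

train_hex <- h2o.cbind(sentiment_hex, vectors_hex)

model <- h2o.deeplearning(
  x = 2:ncol(train_hex),   # the word-vector columns as predictors
  y = 1,                   # the Sentiment column as the response
  training_frame = train_hex
)
```

This is only a sketch; it will not run without an H2O cluster and your prepared frames.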

Good luck!

mukul

I have used the rword2vec package instead of tmcn.word2vec.

To train the word2vec model, for better results there should not be any punctuation marks and all words should be lowercase:

train = data$Text
train = tolower(train)                   # lowercase everything
train = gsub("[[:punct:]]", "", train)   # strip punctuation
write(train, "text_data.txt")            # one document per line

Now train the word2vec model on this text file. The output file can be .txt or .bin.

Pro of a .txt output file: you can easily inspect or operate on the word vectors.

Con of a .txt output file: you cannot use other rword2vec functions (distance, analogy) on a .txt file.

To train word2vec model:

model = word2vec(train_file = "text_data.txt",
                 output_file = "model1.bin",
                 layer1_size = 300,   # word vector dimensionality
                 min_count = 40,      # ignore words seen fewer than 40 times
                 num_threads = 4,
                 window = 10,         # context window size
                 sample = 0.001,      # subsampling of frequent words
                 binary = 1)          # write binary output

To get .txt file from the binary output file:

bin_to_txt("model1.bin","model1text.txt") 
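Once you have the text export, you can load it into a named matrix in plain R. This sketch assumes the standard word2vec text format (one header line with vocabulary size and dimensionality, then one word-plus-vector line per row); `read_w2v_txt` is an illustrative helper name, not part of rword2vec:

```r
# Read a word2vec text export into a matrix with one row per word.
# Assumed format: header "vocab_size dim", then "word v1 v2 ..." per line.
read_w2v_txt <- function(path) {
  tab <- read.table(path, skip = 1, header = FALSE,
                    stringsAsFactors = FALSE, quote = "", comment.char = "")
  mat <- as.matrix(tab[, -1])   # numeric vector columns
  rownames(mat) <- tab[[1]]     # index rows by word
  mat
}
```

`read_w2v_txt("model1text.txt")["sentence", ]` would then give the vector for a single word.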

We need "model1text.txt" to create the training dataset. There are two popular ways to create it:

  1. Vector Averaging (for each row create a feature vector, by taking average of all word vectors present in that row)
  2. Bag of Centroids (cluster word vocabulary and then create bag of centroids as similar to bag of words)
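The second option can be sketched in base R with kmeans (a toy embedding matrix `word_vecs` stands in for the loaded model; `bag_of_centroids` is an illustrative name):

```r
set.seed(42)

# Toy embedding matrix: one row per vocabulary word
word_vecs <- rbind(
  text     = c(0.9, 0.1),
  another  = c(0.1, 0.8),
  sentence = c(0.2, 0.9)
)

# 1. Cluster the vocabulary into k centroids
k  <- 2
km <- kmeans(word_vecs, centers = k)

# 2. For each document, count how many of its words fall in each cluster
bag_of_centroids <- function(words, clusters, k) {
  ids <- clusters[words[words %in% names(clusters)]]
  tabulate(ids, nbins = k)   # one count per centroid, like bag of words
}

bag_of_centroids(c("another", "sentence", "text"), km$cluster, k)
```

Each document then becomes a fixed-length vector of k centroid counts, which can be fed to a classifier the same way as bag-of-words features.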

For more info, check out this tutorial series:

I have built a sentiment classification model using the above methods for Kaggle's Bag of Words Meets Bags of Popcorn competition (Github Repo link). You can reuse that code to build a training dataset for your own text data with some necessary changes.

Finally, train on this dataset using h2o or any other machine learning algorithm to get a sentiment classification model.

Vasu Bandaru

https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-3-more-fun-with-word-vectors

The above Kaggle article explains a few ways to overcome this challenge (but in Python). They are:

  1. Vector averaging (as mentioned by Avni)
  2. Clustering
  3. Paragraph Vector (check this paper)

I think the ideas might help.