I am trying to build a sentiment classification model with R and H2O. I have a data file with the format:
+-----------+------------------------------------------------------+
| Sentiment | Text                                                 |
+-----------+------------------------------------------------------+
| 1         | This is a sample text. This is another sentence.     |
+-----------+------------------------------------------------------+
| 0         | Another sentence. And another!                       |
+-----------+------------------------------------------------------+
| -1        | Text text and Text! Text everywhere! So much text... |
+-----------+------------------------------------------------------+
So the sentiment values are 1, 0 and -1, and the text in each row can consist of several sentences. I now want to prepare the dataset for use with the deeplearning function of h2o. For that I wanted to use the tmcn.word2vec R package, but I cannot transform the text row-wise with it. I could take the whole text column and transform it into a single word2vec document, but then my sentiment information would be lost.
Is there another way to translate the text into numerical input for a deeplearning function in R? Especially for H2O?
Best regards
There are a few ways to use H2O for this application. First, though, you need to normalize the texts in your dataset.
I'm assuming you are doing some text cleaning and tokenization, which produces a sequence of individual word strings for each text, and that you then run your Word2Vec model on those words. The problem is that each text document can be any number of words long, while the model needs the same number of input features for every row. One way around this is to average the word2vec vectors of all the words in a document.
So for the second row in your example above: (v(another) + v(sentence) + v(and) + v(another)) / 4, where 4 is the number of individual words. This produces one average vector, X features long, for each text document, where X is the dimensionality of your word2vec vectors.
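Here is a minimal sketch of that averaging step in base R. The `embeddings` matrix is a made-up 3-dimensional toy table, not the output of any real word2vec model, and `tokenize()` and `doc_vector()` are hypothetical helper names chosen for this example. In practice the rows would come from whatever word2vec model you trained.

```r
# Toy embedding table: one row per word, 3 made-up dimensions.
# (Assumption: a real table would come from a trained word2vec model.)
embeddings <- rbind(
  another  = c(0.2, 0.4, 0.0),
  sentence = c(0.6, 0.0, 0.2),
  and      = c(0.0, 0.2, 0.4)
)

# Lower-case the text, replace everything except letters and spaces,
# then split on runs of spaces.
tokenize <- function(text) {
  cleaned <- gsub("[^a-z ]", " ", tolower(text))
  tokens <- strsplit(cleaned, " +")[[1]]
  tokens[tokens != ""]
}

# Average the vectors of all words of one document that the table knows.
doc_vector <- function(text, embeddings) {
  tokens <- tokenize(text)
  known <- tokens[tokens %in% rownames(embeddings)]
  if (length(known) == 0) {
    # No known word at all: fall back to a zero vector.
    return(rep(0, ncol(embeddings)))
  }
  colMeans(embeddings[known, , drop = FALSE])
}

texts <- c("Another sentence. And another!", "And another sentence.")

# One row per document, one column per embedding dimension.
features <- t(sapply(texts, doc_vector, embeddings = embeddings,
                     USE.NAMES = FALSE))
colnames(features) <- paste0("V", seq_len(ncol(features)))
```

Every document ends up as one row with the same number of columns, however many words it contains, so the row order of `features` still matches the row order of the sentiment column.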
After that you can use our h2o.cbind() function in R. Split your dataset into two data frames: the first holds just the sentiment of each document (-1, 0, 1) and the second holds the texts ('Another sentence. And another!'). Run the steps above on the text data frame, keeping the row order unchanged, and then cbind the two.
Be sure to load both data frames into H2O (for example with as.h2o()) BEFORE calling h2o.cbind(), since it operates on H2O frames. After that you should be ready to run our h2o.deeplearning() model on your dataset!
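Here is a rough sketch of the cbind and training steps, assuming the h2o package is installed and a local cluster can be started. The feature values are made up, and `train_sentiment_model()` is a hypothetical wrapper name used only here. Converting the sentiment column to a factor is an extra step, added so that H2O treats the task as classification rather than regression.

```r
# Labels and per-document feature vectors, kept in the same row order.
# (Made-up numbers; real features would be the averaged word vectors.)
labels <- data.frame(Sentiment = c(1, 0, -1))
features <- data.frame(
  V1 = c(0.25, 0.10, 0.40),
  V2 = c(0.25, 0.30, 0.05),
  V3 = c(0.15, 0.20, 0.35)
)

# Hypothetical wrapper: needs the h2o package and a working Java runtime.
train_sentiment_model <- function(labels, features) {
  h2o::h2o.init()

  # Both frames must live in H2O before they can be column-bound.
  labels_hex <- h2o::as.h2o(labels)
  features_hex <- h2o::as.h2o(features)
  train <- h2o::h2o.cbind(labels_hex, features_hex)

  # Treat -1 / 0 / 1 as classes, not as a numeric target.
  train$Sentiment <- h2o::h2o.asfactor(train$Sentiment)

  h2o::h2o.deeplearning(
    x = colnames(features),
    y = "Sentiment",
    training_frame = train
  )
}
```

Calling `model <- train_sentiment_model(labels, features)` starts (or connects to) a local H2O instance and returns the fitted model. Three rows are only enough to show the mechanics, not to train anything useful.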
Good luck!