How to use a Keras RNN for text classification on a dataset?


I have coded ANN classifiers using Keras, and I am now teaching myself to code RNNs in Keras for text and time series prediction. After searching the web for a while, I found this tutorial by Jason Brownlee, which is suitable for a novice in RNNs. The original article uses the IMDb dataset for text classification with an LSTM, but because that dataset is large, I switched to a small SMS spam detection dataset.

# LSTM with dropout for sequence classification in the IMDB dataset
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
import pandas as pd
from sklearn.model_selection import train_test_split

# fix random seed for reproducibility
numpy.random.seed(7)

url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/sms.tsv'
sms = pd.read_table(url, header=None, names=['label', 'message'])

# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
top_words = 5000

# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

# create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length, dropout=0.2))
model.add(LSTM(100, dropout_W=0.2, dropout_U=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, nb_epoch=3, batch_size=64)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

I have successfully split the dataset into training and test sets, but how should I now build my RNN for this dataset?


There are 2 answers

0
Brock

If you are still stuck on this, check out this example by Jason Brownlee. It looks like you are most of the way there. You need to add an LSTM layer and a Dense layer to get a model that should work.
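One gap worth noting in the question's code: pad_sequences expects lists of integers, while X_train there still holds raw message strings. The following is a minimal plain-Python sketch of the missing encoding step, not the tutorial's method. The helper names (build_word_index, encode, pad) and the three sample messages are made up for illustration, and the whitespace tokenization is a simplifying assumption. The padding mirrors the Keras defaults (zeros added at the front, oldest tokens cut first).

```python
from collections import Counter

def build_word_index(texts, top_words):
    """Map the most frequent words to ids 1..top_words-1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(top_words - 1)
    return {word: i for i, (word, _) in enumerate(most_common, start=1)}

def encode(text, word_index):
    """Turn one message into a list of integer ids, dropping unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

def pad(seq, maxlen):
    """Keep the last maxlen ids and left-pad with zeros to a fixed length."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

# Hypothetical sample data standing in for the SMS messages
messages = ["Free prize call now", "call me later", "free free prize"]
word_index = build_word_index(messages, top_words=5000)
encoded = [pad(encode(m, word_index), maxlen=5) for m in messages]
```

Each row of encoded now has the same length and contains only ids below top_words, which is the shape an Embedding layer followed by an LSTM consumes. In practice the Keras text preprocessing utilities can replace these helpers.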

0
gogs09

You need to represent the raw text as numeric vectors before training a neural network model. For this, you can use CountVectorizer or TfidfVectorizer from scikit-learn. After converting the raw text to a numeric vector representation, you can train an RNN, LSTM, or CNN for the text classification problem.
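To make the idea of a count-based representation concrete, here is a small plain-Python sketch of what count vectorization does conceptually. It is not scikit-learn's implementation: the helper names (fit_vocabulary, count_vector), the sample messages, and the whitespace tokenization are assumptions for illustration, and CountVectorizer itself applies its own tokenization rules and returns a sparse matrix.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Assign each distinct lowercase word a column index, in sorted order."""
    vocab = sorted({word for text in texts for word in text.lower().split()})
    return {word: i for i, word in enumerate(vocab)}

def count_vector(text, vocabulary):
    """Return a fixed-length list with one word count per vocabulary column."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words unseen at fit time are ignored
            vector[vocabulary[word]] = count
    return vector

# Hypothetical sample data standing in for the SMS messages
messages = ["Free prize call now", "call me later", "free free prize"]
vocabulary = fit_vocabulary(messages)
matrix = [count_vector(m, vocabulary) for m in messages]
```

Every row of matrix has one column per vocabulary word, so all messages end up with the same length. Note that a count vector drops word order, so it is a fixed-length input rather than a sequence of tokens.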