Use LSTM tutorial code to predict next word in a sentence?

8.9k views Asked by At

I've been trying to understand the sample code with https://www.tensorflow.org/tutorials/recurrent which you can find at https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py

(Using tensorflow 1.3.0.)

I've summarized (what I think are) the key parts, for my question, below:

 size = 200
 vocab_size = 10000
 layers = 2
 # input_.input_data is a 2D tensor [batch_size, num_steps] of
 #    word ids, from 1 to 10000

 cell = tf.contrib.rnn.MultiRNNCell(
    [tf.contrib.rnn.BasicLSTMCell(size) for _ in range(2)]
    )

 embedding = tf.get_variable(
      "embedding", [vocab_size, size], dtype=tf.float32)
 inputs = tf.nn.embedding_lookup(embedding, input_.input_data)

inputs = tf.unstack(inputs, num=num_steps, axis=1)
outputs, state = tf.contrib.rnn.static_rnn(
    cell, inputs, initial_state=self._initial_state)

output = tf.reshape(tf.stack(axis=1, values=outputs), [-1, size])
softmax_w = tf.get_variable(
    "softmax_w", [size, vocab_size], dtype=data_type())
softmax_b = tf.get_variable("softmax_b", [vocab_size], dtype=data_type())
logits = tf.matmul(output, softmax_w) + softmax_b

# Then calculate loss, do gradient descent, etc.

My biggest question is how do I use the produced model to actually generate a next word suggestion, given the first few words of a sentence? Concretely, I imagine the flow is like this, but I cannot get my head around what the code for the commented lines would be:

prefix = ["What", "is", "your"]
state = #Zeroes
# Call static_rnn(cell) once for each word in prefix to initialize state
# Use final output to set a string, next_word
print(next_word)

My sub-questions are:

  • Why use a random (uninitialized, untrained) word-embedding?
  • Why use softmax?
  • Does the hidden layer have to match the dimension of the input (i.e. the dimension of the word2vec embeddings)
  • How/Can I bring in a pre-trained word2vec model, instead of that uninitialized one?

(I'm asking them all as one question, as I suspect they are all connected, and connected to some gap in my understanding.)

What I was expecting to see here was loading an existing word2vec set of word embeddings (e.g. using gensim's KeyedVectors.load_word2vec_format()), convert each word in the input corpus to that representation when loading in each sentence, and then afterwards the LSTM would spit out a vector of the same dimension, and we would try and find the most similar word (e.g. using gensim's similar_by_vector(y, topn=1)).

Is using softmax saving us from the relatively slow similar_by_vector(y, topn=1) call?


BTW, for the pre-existing word2vec part of my question Using pre-trained word2vec with LSTM for word generation is similar. However the answers there, currently, are not what I'm looking for. What I'm hoping for is a plain English explanation that switches the light on for me, and plugs whatever the gap in my understanding is.  Use pre-trained word2vec in lstm language model? is another similar question.

UPDATE: Predicting next word using the language model tensorflow example and Predicting the next word using the LSTM ptb model tensorflow example are similar questions. However, neither shows the code to actually take the first few words of a sentence, and print out its prediction of the next word. I tried pasting in code from the 2nd question, and from https://stackoverflow.com/a/39282697/841830 (which comes with a github branch), but cannot get either to run without errors. I think they may be for an earlier version of TensorFlow?

ANOTHER UPDATE: Yet another question asking basically the same thing: Predicting Next Word of LSTM Model from Tensorflow Example It links to Predicting next word using the language model tensorflow example (and, again, the answers there are not quite what I am looking for).

In case it still isn't clear, what I am trying to write a high-level function called getNextWord(model, sentencePrefix), where model is a previously built LSTM that I've loaded from disk, and sentencePrefix is a string, such as "Open the", and it might return "pod". I then might call it with "Open the pod" and it will return "bay", and so on.

An example (with a character RNN, and using mxnet) is the sample() function shown near the end of https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter05_recurrent-neural-networks/simple-rnn.ipynb You can call sample() during training, but you can also call it after training, and with any sentence you want.

4

There are 4 answers

3
c2huc2hu On

Main Question

Loading words

Load custom data instead of using the test set:

reader.py@ptb_raw_data

test_path = os.path.join(data_path, "ptb.test.txt")
test_data = _file_to_word_ids(test_path, word_to_id)  # change this line

test_data should contain word ids (print out word_to_id for a mapping). As an example, it should look like: [1, 52, 562, 246] ...

Displaying predictions

We need to return the output of the FC layer (logits) in the call to sess.run

ptb_word_lm.py@PTBModel.__init__

    logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size])
    self.top_word_id = tf.argmax(logits, axis=2)  # add this line

ptb_word_lm.py@run_epoch

  fetches = {
      "cost": model.cost,
      "final_state": model.final_state,
      "top_word_id": model.top_word_id # add this line
  }

Later in the function, vals['top_word_id'] will have an array of integers with the ID of the top word. Look this up in word_to_id to determine the predicted word. I did this a while ago with the small model, and the top 1 accuracy was pretty low (20-30% iirc), even though the perplexity was what was predicted in the header.

Subquestions

Why use a random (uninitialized, untrained) word-embedding?

You'd have to ask the authors, but in my opinion, training the embeddings makes this more of a standalone tutorial: instead of treating embedding as a black box, it shows how it works.

Why use softmax?

The final prediction is not determined by the cosine similarity to the output of the hidden layer. There is an FC layer after the LSTM that converts the embedded state to a one-hot encoding of the final word.

Here's a sketch of the operations and dimensions in the neural net:

word -> one hot code (1 x vocab_size) -> embedding (1 x hidden_size) -> LSTM -> FC layer (1 x vocab_size) -> softmax (1 x vocab_size)

Does the hidden layer have to match the dimension of the input (i.e. the dimension of the word2vec embeddings)

Technically, no. If you look at the LSTM equations, you'll notice that x (the input) can be any size, as long as the weight matrix is adjusted appropriately.

LSTM equations

How/Can I bring in a pre-trained word2vec model, instead of that uninitialized one?

I don't know, sorry.

3
THN On

There are many questions, I would try to clarify some of them.

how do I use the produced model to actually generate a next word suggestion, given the first few words of a sentence?

The key point here is, next word generation is actually word classification in the vocabulary. So you need a classifier, that is why there is a softmax in the output.

The principle is, at each time step, the model would output the next word based on the last word embedding and internal memory of previous words. tf.contrib.rnn.static_rnn automatically combine input into the memory, but we need to provide the last word embedding and classify the next word.

We can use a pre-trained word2vec model, just init the embedding matrix with the pre-trained one. I think the tutorial uses random matrix for the sake of simplicity. Memory size is not related to embedding size, you can use larger memory size to retain more information.

These tutorials are high-level. If you want to deeply understand the details, I would suggest looking at the source code in plain python/numpy.

5
GeertH On

My biggest question is how do I use the produced model to actually generate a next word suggestion, given the first few words of a sentence?

I.e. I'm trying to write a function with the signature: getNextWord(model, sentencePrefix)

Before I explain my answer, first a remark about your suggestion to # Call static_rnn(cell) once for each word in prefix to initialize state: Keep in mind that static_rnn does not return a value like a numpy array, but a tensor. You can evaluate a tensor to a value when it is run (1) in a session (a session is keeps the state of your computional graph, including the values of your model parameters) and (2) with the input that is necessary to calculate the tensor value. Input can be supplied using input readers (the approach in the tutorial), or using placeholders (what I will use below).

Now follows the actual answer: The model in the tutorial was designed to read input data from a file. The answer of @user3080953 already showed how to work with your own text file, but as I understand it you need more control over how the data is fed to the model. To do this you will need to define your own placeholders and feed the data to these placeholders when calling session.run().

In the code below I subclassed PTBModel and made it responsible for explicitly feeding data to the model. I introduced a special PTBInteractiveInput that has an interface similar to PTBInput so you can reuse the functionality in PTBModel. To train your model you still need PTBModel.

class PTBInteractiveInput(object):
  def __init__(self, config):
    self.batch_size = 1
    self.num_steps = config.num_steps
    self.input_data = tf.placeholder(dtype=tf.int32, shape=[self.batch_size, self.num_steps])
    self.sequence_len = tf.placeholder(dtype=tf.int32, shape=[])
    self.targets = tf.placeholder(dtype=tf.int32, shape=[self.batch_size, self.num_steps])

class InteractivePTBModel(PTBModel):

  def __init__(self, config):
    input = PTBInteractiveInput(config)
    PTBModel.__init__(self, is_training=False, config=config, input_=input)
    output = self.logits[:, self._input.sequence_len - 1, :]
    self.top_word_id = tf.argmax(output, axis=2)

  def get_next(self, session, prefix):
    prefix_array, sequence_len = self._preprocess(prefix)
    feeds = {
      self._input.sequence_len: sequence_len,
      self._input.input_data: prefix_array,
    }
    fetches = [self.top_word_id]
    result = session.run(fetches, feeds)
    self._postprocess(result)

  def _preprocess(self, prefix):
    num_steps = self._input.num_steps
    seq_len = len(prefix)
    if seq_len > num_steps:
      raise ValueError("Prefix to large for model.")
    prefix_ids = self._prefix_to_ids(prefix)
    num_items_to_pad = num_steps - seq_len
    prefix_ids.extend([0] * num_items_to_pad)
    prefix_array = np.array([prefix_ids], dtype=np.float32)
    return prefix_array, seq_len

  def _prefix_to_ids(self, prefix):
    # should convert your prefix to a list of ids
    pass

  def _postprocess(self, result):
    # convert ids back to strings
    pass

In the __init__ function of PTBModel you need to add this line:

self.logits = logits

Why use a random (uninitialized, untrained) word-embedding?

First note that, although the embeddings are random in the beginning, they will be trained with the rest of the network. The embeddings you obtain after training will have similar properties than the embeddings you obtain with word2vec models, e.g., the ability to answer analogy questions with vector operations (king - man + woman = queen, etc.) In tasks were you have a considerable amount of training data like language modelling (which does not need annotated training data) or neural machine translation, it is more common to train embeddings from scratch.

Why use softmax?

Softmax is a function that normalizes a vector of similarity scores (the logits), to a probability distribution. You need a probability distribution to train you model with cross-entropy loss and to be able to sample from the model. Note that if you are only interested in the most likely words of a trained model, you don't need the softmax and you can use the logits directly.

Does the hidden layer have to match the dimension of the input (i.e. the dimension of the word2vec embeddings)

No, in principal it can be any value. Using a hidden state with a lower dimension than your embedding dimension, does not make much sense, however.

How/Can I bring in a pre-trained word2vec model, instead of that uninitialized one?

Here is a self-contained example of initializing an embedding with a given numpy array. If you want that the embedding remains fixed/constant during training, set trainable to False.

import tensorflow as tf
import numpy as np
vocab_size = 10000
size = 200
trainable=True
embedding_matrix = np.zeros([vocab_size, size]) # replace this with code to load your pretrained embedding
embedding = tf.get_variable("embedding",
                            initializer=tf.constant_initializer(embedding_matrix),
                            shape=[vocab_size, size],
                            dtype=tf.float32,
                            trainable=trainable)
3
H. Rev. On

You can find all the code at the end of the answer.


Most of your questions (why a Softmax, how to use pretrained embedding layer, etc...) were answered I reckon. However as you were still waiting for a concise code to produce generated text from a seed, here I try to report how I ended up doing it myself.

I struggled, starting from the official Tensorflow tutorial, to get to the point were I could easily generate words from a produced model. Fortunately after taking some bits of answer in practically all the answers you mentioned in your question, I got a better view of the problem (and solutions). This might contains errors, but at least it runs and generates some text...

how do I use the produced model to actually generate a next word suggestion, given the first few words of a sentence?

I will wrap the next word suggestion in a loop, to generate a whole sentence, but you will easily reduce that to one word only.

Let's say you followed the current tutorial given by tensorflow (v1.4 at time of writing) here, which will save a model after training it.

Then what is left for us to do is to load it from disk, and to write a function which take this model and some seed input and returns generated text.


Generate text from saved model

I assume we write all this code in a new python script. Whole script at the bottom as a recap, here I explain the main steps.

First necessary steps

FLAGS = tf.flags.FLAGS
FLAGS.model = "medium" # or whatever size you used

Now, quite importantly, we create dictionnaries to map ids to words and vice-versa (so we don't have to read a list of integers...).

word_to_id = reader._build_vocab('../data/ptb.train.txt') # here we load the word -> id dictionnary ()
id_to_word = dict(zip(word_to_id.values(), word_to_id.keys())) # and transform it into id -> word dictionnary
_, _, test_data, _ = reader.ptb_raw_data('../data')

Then we load the configuration class, also setting num_steps and batch_size to 1, as we want to sample 1 word at a time while the LSTM will process also 1 word at a time. Also creating the input instance on the fly:

eval_config = get_config()
eval_config.num_steps = 1
eval_config.batch_size = 1
model_input = PTBInput(eval_config, test_data)

Building graph

To load the saved model (as saved by the Supervisor.saver module in the tutorial), we need first to rebuild the graph (easy with the PTBModel class) which must use the same configuration as when trained:

sess = tf.Session()
initializer = tf.random_uniform_initializer(-eval_config.init_scale, eval_config.init_scale)
# not sure but seems to need the same name for variable scope as when saved ....!!
with tf.variable_scope("Model", reuse=None, initializer=initializer):
    tf.global_variables_initializer()
    mtest = PTBModel(is_training=False, config=eval_config, input=model_input)

Restoring saved weights:

sess.run(tf.global_variables_initializer())
saver = tf.train.Saver()
saver.restore(sess, tf.train.latest_checkpoint('../Whatever_folder_you_saved_in')) # the path must point to the hierarchy where your 'checkpoint' file is

... Sampling words from a given seed:

First we need the model to contain an access to the logits outputs, or more precisely the probability distribution over the whole vocabulary. So in the ptb_lstm.py file add the line:

# the line goes somewhere below the reshaping "logits = tf.reshape(logits, [self.batch_size, ..."
self.probas = tf.nn.softmax(logits, name="probas")

Then we can design some sampling function (you're free to use whatever you like here, best approach is sampling with a temperature that tends to flatten or sharpen the distributions), here is a basic random sampling method:

def sample_from_pmf(probas):
    t = np.cumsum(probas)
    s = np.sum(probas)
    return int(np.searchsorted(t, np.random.rand(1) * s))

And finally a function that takes a seed, your model, the dictionary that maps word to ids, and vice versa, as inputs and outputs the generated string of texts:

def generate_text(session, model, word_to_index, index_to_word, 
                  seed='</s>', n_sentences=10):
    sentence_cnt = 0
    input_seeds_id = [word_to_index[w] for w in seed.split()]
    state = session.run(model.initial_state)

    # Initiate network with seeds up to the before last word:
    for x in input_seeds_id[:-1]:
        feed_dict = {model.initial_state: state,
                     model.input.input_data: [[x]]}
        state = session.run([model.final_state], feed_dict)

    text = seed
    # Generate a new sample from previous, starting at last word in seed
    input_id = [[input_seeds_id[-1]]]
    while sentence_cnt < n_sentences:
        feed_dict = {model.input.input_data: input_id,
                     model.initial_state: state}
        probas, state = session.run([model.probas, model.final_state],
                                 feed_dict=feed_dict)
        sampled_word = sample_from_pmf(probas[0])
        if sampled_word == word_to_index['</s>']:
            text += '.\n'
            sentence_cnt += 1
        else:
            text += ' ' + index_to_word[sampled_word]
        input_wordid = [[sampled_word]]

    return text

TL;DR

Do not forget to add the line:

self.probas = tf.nn.softmax(logits, name='probas')

In the ptb_lstm.py file, in the __init__ definition of PTBModel class, anywhere after the line logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size]).

The whole script, just run it from the same directory where you have reader.py, ptb_lstm.py:

import reader
import numpy as np
import tensorflow as tf
from ptb_lstm import PTBModel, get_config, PTBInput

FLAGS = tf.flags.FLAGS
FLAGS.model = "medium"

def sample_from_pmf(probas):
    t = np.cumsum(probas)
    s = np.sum(probas)
    return int(np.searchsorted(t, np.random.rand(1) * s))

def generate_text(session, model, word_to_index, index_to_word, 
                  seed='</s>', n_sentences=10):
    sentence_cnt = 0
    input_seeds_id = [word_to_index[w] for w in seed.split()]
    state = session.run(model.initial_state)

    # Initiate network with seeds up to the before last word:
    for x in input_seeds_id[:-1]:
        feed_dict = {model.initial_state: state,
                     model.input.input_data: [[x]]}
        state = session.run([model.final_state], feed_dict)

    text = seed
    # Generate a new sample from previous, starting at last word in seed
    input_id = [[input_seeds_id[-1]]]
    while sentence_cnt < n_sentences:
        feed_dict = {model.input.input_data: input_id,
                     model.initial_state: state}
        probas, state = sess.run([model.probas, model.final_state],
                                 feed_dict=feed_dict)
        sampled_word = sample_from_pmf(probas[0])
        if sampled_word == word_to_index['</s>']:
            text += '.\n'
            sentence_cnt += 1
        else:
            text += ' ' + index_to_word[sampled_word]
        input_wordid = [[sampled_word]]

    print(text)

if __name__ == '__main__':

    word_to_id = reader._build_vocab('../data/ptb.train.txt') # here we load the word -> id dictionnary ()
    id_to_word = dict(zip(word_to_id.values(), word_to_id.keys())) # and transform it into id -> word dictionnary
    _, _, test_data, _ = reader.ptb_raw_data('../data')

    eval_config = get_config()
    eval_config.batch_size = 1
    eval_config.num_steps = 1
    model_input = PTBInput(eval_config, test_data, name=None)

    sess = tf.Session()
    initializer = tf.random_uniform_initializer(-eval_config.init_scale,
                                            eval_config.init_scale)
    with tf.variable_scope("Model", reuse=None, initializer=initializer):
        tf.global_variables_initializer()
        mtest = PTBModel(is_training=False, config=eval_config, 
                         input_=model_input)

    sess.run(tf.global_variables_initializer())

    saver = tf.train.Saver()
    saver.restore(sess, tf.train.latest_checkpoint('../models'))

    while True:
        print(generate_text(sess, mtest, word_to_id, id_to_word, seed="this sentence is"))
        try:
            raw_input('press Enter to continue ...\n')
        except KeyboardInterrupt:
            print('\b\bQuiting now...')
            break

Update

As for restoring old checkpoints (for me the model saved 6 months ago, not sure about exact TF version used then) with recent tensorflow (1.6 at least), it might raise an error about some variables not being found (see comment). In that case, you should update your checkpoints using this script.

Also, note that for me, I had to modify this even further, as I noticed the saver.restore function was trying to read lstm_cell variables although my variables were transformed into basic_lstm_cell which led also to NotFound Error. So an easy fix, just a small change in the checkpoint_convert.py script, line 72-73, is to remove basic_ in the new names.

A convenient way to check the name of the variables contained in your checkpoints is (CKPT_FILE is the suffix that comes before .index, .data0000-1000, etc..):

reader = tf.train.NewCheckpointReader(CKPT_FILE)
reader.get_variable_to_shape_map()

This way you can verify that you have indeed the correct names (or the bad ones in the old checkpoints versions).