I am using TensorFlow's tf.nn.ctc_beam_search_decoder() to decode the output of an RNN doing a many-to-many mapping (i.e., multiple softmax outputs for each network cell).
A simplified version of the network's output and the beam search decoder is:
import numpy as np
import tensorflow as tf
batch_size = 4
sequence_max_len = 5
num_classes = 3
y_pred = tf.placeholder(tf.float32, shape=(batch_size, sequence_max_len, num_classes))
y_pred_transposed = tf.transpose(y_pred,
                                 perm=[1, 0, 2])  # TF expects dimensions [max_time, batch_size, num_classes]
logits = tf.log(y_pred_transposed)
sequence_lengths = tf.to_int32(tf.fill([batch_size], sequence_max_len))
decoded, log_probabilities = tf.nn.ctc_beam_search_decoder(logits,
                                                           sequence_length=sequence_lengths,
                                                           beam_width=3,
                                                           merge_repeated=False,
                                                           top_paths=1)
decoded = decoded[0]
decoded_paths = tf.sparse_tensor_to_dense(decoded) # Shape: [batch_size, max_sequence_len]
with tf.Session() as session:
    tf.global_variables_initializer().run()
    softmax_outputs = np.array([[[0.1, 0.1, 0.8], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1]],
                                [[0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],
                                [[0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],
                                [[0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]]])
    decoded_paths = session.run(decoded_paths, feed_dict={y_pred: softmax_outputs})
    print(decoded_paths)
The output in this case is:
[[0]
[1]
[1]
[1]]
My understanding is that the output tensor should be of dimensions [batch_size, max_sequence_len], with each row containing the indices of the relevant classes in the found path.
In this case I would expect the output to be similar to:
[[2, 0, 0, 0, 0],
[2, 2, 2, 2, 2],
[1, 2, 2, 2, 2],
[2, 2, 2, 2, 2]]
What am I not understanding about how ctc_beam_search_decoder works?
As indicated in the tf.nn.ctc_beam_search_decoder documentation, the shape of the output is not [batch_size, max_sequence_len]. Instead, it is

[batch_size, max_decoded_length[j]]

(with j=0 in your case).

Based on the beginning of section 2 of this paper (which is cited in the GitHub repository), max_decoded_length[0] is bounded from above by max_sequence_len, but they are not necessarily equal. The relevant citation is:

    "The target sequence z = (z1, z2, ..., zU) is at most as long as the input sequence x = (x1, x2, ..., xT), i.e. U ≤ T."

In fact, max_decoded_length[0] depends on the specific matrix softmax_outputs. In particular, two such matrices with exactly the same dimensions can result in different max_decoded_length[0].
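You can check the actual value directly: decoded[0] is a SparseTensor, and its dense_shape holds [batch_size, max_decoded_length[0]]. A minimal check, reusing the session and tensors defined in the question:

# decoded[0] is a SparseTensor; its dense_shape is [batch_size, max_decoded_length[0]].
dense_shape = session.run(decoded[0].dense_shape, feed_dict={y_pred: softmax_outputs})
print(dense_shape)  # [4 1] here: max_decoded_length[0] is 1, not sequence_max_len = 5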
For example, if you replace the hard-coded softmax_outputs with randomly generated values of exactly the same dimensions (the original experiment used random logits), you will get an output with a different max_decoded_length[0]. On the other hand, changing the seed, e.g. to np.random.seed(50), gives yet another output, with yet another max_decoded_length[0]. A sketch of this experiment is shown below.
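Since the exact matrices from that experiment are not reproduced here, the following is only a minimal sketch of the idea: the seed 7 is an arbitrary placeholder (only 50 is mentioned above), and random probabilities are fed through the question's y_pred placeholder, since that graph applies tf.log itself.

import numpy as np
import tensorflow as tf

batch_size, sequence_max_len, num_classes = 4, 5, 3
y_pred = tf.placeholder(tf.float32, shape=(batch_size, sequence_max_len, num_classes))
logits = tf.log(tf.transpose(y_pred, perm=[1, 0, 2]))  # [max_time, batch_size, num_classes]
sequence_lengths = tf.to_int32(tf.fill([batch_size], sequence_max_len))
decoded, _ = tf.nn.ctc_beam_search_decoder(logits,
                                           sequence_length=sequence_lengths,
                                           beam_width=3,
                                           merge_repeated=False,
                                           top_paths=1)
dense_decoded = tf.sparse_tensor_to_dense(decoded[0])

with tf.Session() as session:
    for seed in (7, 50):  # 7 is a placeholder; 50 is the seed mentioned above
        np.random.seed(seed)
        r = np.random.rand(batch_size, sequence_max_len, num_classes)
        probs = r / r.sum(axis=-1, keepdims=True)  # rows become valid distributions
        paths = session.run(dense_decoded, feed_dict={y_pred: probs})
        # paths.shape[1] is max_decoded_length[0]: it can differ between seeds,
        # but it never exceeds sequence_max_len.
        print(seed, paths.shape)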
P.S. Regarding the last part of your question:
Note that, based on the documentation, num_classes actually represents num_labels + 1. Specifically:

    "The inputs Tensor's innermost dimension size, num_classes, represents num_labels + 1 classes, where num_labels is the number of true labels, and the largest value (num_classes - 1) is reserved for the blank label."

So the true labels in your case are 0 and 1, and 2 is reserved for the blank label. The blank label represents the situation of observing no label (section 3.1 here):

    "The activation of the extra unit is the probability of observing a 'blank', or no label."
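To make the effect of the blank concrete, here is a small illustrative sketch, not of the beam search itself but of plain best-path (greedy) collapsing: take the argmax class per time step, merge consecutive repeats, then drop the blank index num_classes - 1.

import numpy as np

def best_path_collapse(probs, blank):
    """Greedy CTC collapse: argmax per frame, merge consecutive repeats,
    then remove blanks. Illustrative only; ctc_beam_search_decoder instead
    sums probabilities over all alignments of each labeling."""
    path = np.argmax(probs, axis=-1)
    collapsed, prev = [], None
    for label in path:
        if label != prev and label != blank:
            collapsed.append(label)
        prev = label
    return collapsed

# First sequence from the question: the argmax path is [2, 0, 0, 0, 0],
# and 2 (the blank) is removed, leaving [0], which matches the first decoded row.
probs = np.array([[0.1, 0.1, 0.8], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1],
                  [0.8, 0.1, 0.1], [0.8, 0.1, 0.1]])
print(best_path_collapse(probs, blank=2))  # [0]

Note that this greedy collapse does not reproduce every row of your output: for the all-[0.1, 0.2, 0.7] rows the argmax path is all blanks, yet the beam search returns [1], because it sums probability over all alignments of a labeling rather than following the single best path. It does, however, show why the leading 2 in your first sequence disappears.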