I want to perform CTC beam search on matrices of phoneme probabilities (the output of an ASR model). TensorFlow has a CTC beam search implementation, but it is poorly documented and I have not managed to build a working example. I want to write code that uses it as a benchmark.
Here is my code so far:
import numpy as np
import tensorflow as tf


def decode_ctcBeam(matrix, classes):
    # Add a batch axis: (time, classes) -> (time, batch=1, classes).
    matrix = np.reshape(matrix, (matrix.shape[0], 1, matrix.shape[1]))
    aa_ctc_blank_aa_logits = tf.constant(matrix)
    # One entry per batch item: the number of time steps.
    sequence_length = tf.constant(np.array([len(matrix)], dtype=np.int32))
    (decoded_list,), log_probabilities = tf.nn.ctc_beam_search_decoder(
        inputs=aa_ctc_blank_aa_logits,
        sequence_length=sequence_length,
        merge_repeated=True,
        beam_width=25)
    out = list(tf.Session().run(tf.sparse_tensor_to_dense(decoded_list)[0]))
    print(out)
    return out


if __name__ == '__main__':
    classes = ['AA', 'B', 'CH']
    mat = np.array([[0.4, 0, 0.6, 0.2], [0.4, 0, 0.6, 0.2]], dtype=np.float32)
    actual = decode_ctcBeam(mat, classes)
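One thing I noticed while reading the documentation: tf.nn.ctc_beam_search_decoder describes its inputs as logits, while my mat holds probabilities, including exact zeros. A minimal sketch of converting one frame to log-probabilities before passing it on, assuming a small clamp value eps (my own choice, not something from the TensorFlow docs) to avoid log(0):

```python
import math


def probs_to_log_probs(frame, eps=1e-10):
    """Convert one frame of probabilities to log-probabilities.

    eps is a hypothetical clamp so that zero probabilities do not give -inf.
    """
    return [math.log(max(p, eps)) for p in frame]


frame = [0.4, 0.0, 0.6, 0.2]
log_frame = probs_to_log_probs(frame)
print(log_frame)
```

The ordering of the classes is unchanged by the logarithm, so the most likely class per frame stays the same. Whether this changes the beam search result in practice is something I have not verified.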
I'm having trouble understanding the code:
- In the example, mat has shape (2, 4), but the TensorFlow function needs shape (2, 1, 4), so I reshape mat with
matrix = np.reshape(matrix, (matrix.shape[0], 1, matrix.shape[1]))
What does this mean mathematically? Are mat and matrix the same, or am I mixing things up? As I understand it, the 1 in the middle is the batch size.
- The decode_ctcBeam function returns a list; in the example it gives [2], which should mean 'CH' from the defined classes. How do I generalize this and find the recognized phoneme sequences if I have a larger input matrix and, say, 40 phonemes?
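To make both points concrete, here is a small NumPy-only sketch. It checks that the reshape only inserts a batch axis of size 1 (the values are untouched) and shows one way to map decoded indices back to phoneme labels. The helper name indices_to_phonemes is made up for illustration, and I am assuming the last column of the matrix is the CTC blank, so classes has one entry fewer than the matrix has columns:

```python
import numpy as np

classes = ['AA', 'B', 'CH']  # column 3 (the last one) is assumed to be the blank
mat = np.array([[0.4, 0, 0.6, 0.2],
                [0.4, 0, 0.6, 0.2]], dtype=np.float32)

# Insert a batch axis of size 1: (time, classes) -> (time, batch, classes).
matrix = np.reshape(mat, (mat.shape[0], 1, mat.shape[1]))
print(mat.shape, matrix.shape)  # (2, 4) (2, 1, 4)

# Selecting the only batch item gives back the original matrix.
same_values = np.array_equal(matrix[:, 0, :], mat)
print(same_values)  # True


def indices_to_phonemes(indices, classes):
    """Map decoded class indices to their phoneme labels (hypothetical helper)."""
    return [classes[i] for i in indices]


phonemes = indices_to_phonemes([2], classes)
print(phonemes)  # ['CH']
```

With 40 phonemes the same mapping should work unchanged, as long as classes lists the labels in the same order as the columns of the matrix.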
Looking forward to your answers / comments! Thanks!
So, I've made some progress since I asked the question, but I still haven't figured out how to use TensorFlow's CTC beam search properly. It seems that setting top_paths = 1 and beam_width = 1 does give back the expected greedy search output as a list of ints, which can easily be mapped to the required phonemes stored in classes. The output in this case is:
In the case of Beam Search the results are bad
The reference is 'I'm good'. The list [1, 22, 39, 14, 32, 8] is inside the beam search result; are the other parts the alternative routes? It looks pretty suspicious to me. Does anyone have any ideas?
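To have something independent to compare against, here is a plain-Python sketch of greedy (best-path) decoding and of the textbook CTC prefix beam search. This is not TensorFlow's implementation: it works directly on probabilities rather than logits, the function names are my own, and I am assuming the blank is the last class. It returns every surviving prefix with its score, so the alternatives are kept as separate candidates instead of ending up in one list:

```python
from collections import defaultdict


def ctc_greedy(probs, blank):
    """Best-path decoding: argmax per frame, collapse repeats, drop blanks."""
    best = [max(range(len(f)), key=f.__getitem__) for f in probs]
    out, prev = [], None
    for c in best:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return out


def ctc_beam_search(probs, beam_width=25, blank=None):
    """Prefix beam search over a (time x classes) list of probabilities."""
    if blank is None:
        blank = len(probs[0]) - 1  # assumption: blank is the last class
    # Each prefix keeps two masses: paths ending in blank / in a non-blank.
    beam = {(): (1.0, 0.0)}
    for frame in probs:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beam.items():
            for c, p in enumerate(frame):
                if p == 0:
                    continue
                if c == blank:
                    # A blank leaves the prefix unchanged.
                    nxt[prefix][0] += (p_b + p_nb) * p
                elif prefix and c == prefix[-1]:
                    # Repeat without a blank in between collapses into the prefix;
                    # it extends the prefix only after a blank.
                    nxt[prefix][1] += p_nb * p
                    nxt[prefix + (c,)][1] += p_b * p
                else:
                    nxt[prefix + (c,)][1] += (p_b + p_nb) * p
        # Keep only the beam_width most probable prefixes.
        kept = sorted(nxt.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beam = {k: tuple(v) for k, v in kept[:beam_width]}
    ranked = sorted(beam.items(), key=lambda kv: sum(kv[1]), reverse=True)
    return [(list(prefix), sum(mass)) for prefix, mass in ranked]


classes = ['AA', 'B', 'CH']
mat = [[0.4, 0, 0.6, 0.2], [0.4, 0, 0.6, 0.2]]
results = ctc_beam_search(mat, beam_width=25, blank=3)
print(results[0][0])              # [2]
print(ctc_greedy(mat, blank=3))   # [2]
```

On the toy matrix from the question both decoders agree on [2], i.e. 'CH'. If the TensorFlow result for a real utterance differs a lot from what a sketch like this produces, that would at least narrow down whether the problem is in the inputs (probabilities versus logits) or in how the returned sparse tensors are unpacked.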