I am trying to understand the following code, which is written in Python and TensorFlow. I'm trying to implement handwritten text recognition, and I am referring to the following code.
I don't understand why the RNN output is passed through "atrous_conv2d".
This is the architecture of my model: it takes the CNN output, passes it through this RNN stage, and then passes the result to CTC.
def build_RNN(self, rnnIn4d):
    # squeeze removes size-1 dimensions; here it drops axis 2: BxTx1xF -> BxTxF
    rnnIn3d = tf.squeeze(rnnIn4d, axis=[2])

    n_hidden = 256
    n_layers = 2
    cells = []
    for _ in range(n_layers):
        cells.append(tf.nn.rnn_cell.LSTMCell(num_units=n_hidden))
    stacked = tf.nn.rnn_cell.MultiRNNCell(cells)  # stack the 2 LSTMCells created above

    # BxTxF -> BxTx2H
    ((fw, bw), _) = tf.nn.bidirectional_dynamic_rnn(cell_fw=stacked, cell_bw=stacked, inputs=rnnIn3d,
                                                    dtype=rnnIn3d.dtype)

    # BxTxH + BxTxH -> BxTx2H -> BxTx1x2H
    concat = tf.expand_dims(tf.concat([fw, bw], 2), 2)

    # project output to chars (including blank): BxTx1x2H -> BxTx1xC -> BxTxC
    kernel = tf.Variable(tf.truncated_normal([1, 1, n_hidden * 2, len(self.char_list) + 1], stddev=0.1))
    rnn = tf.nn.atrous_conv2d(value=concat, filters=kernel, rate=1, padding='SAME')
    return tf.squeeze(rnn, axis=[2])
The input to the CTC loss layer must have the shape B x T x C, where:
B - batch size
T - number of time steps, i.e. the maximum length of the output (twice the maximum word length, because of the blank character)
C - number of characters + 1 (the blank character)
The input to atrous_conv2d has shape (B, T, 1, 2H) == (batch, height, width, channels), where H is n_hidden. The filter we are using has shape (1, 1, 2H, C) == (height, width, input channels, output channels).
After the atrous convolution we get (B, T, 1, C), and squeezing axis 2 gives (B, T, C), which is the desired input for CTC.
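To make the shape bookkeeping concrete, here is a minimal pure-Python sketch (no TensorFlow) of what a 1x1 convolution does to a (B, T, 1, 2H) input. The helper names conv1x1 and squeeze_axis2 and the toy sizes are made up for illustration and are not part of the original code; the point is only that each time step's 2H-vector is multiplied by the same (2H, C) matrix.

```python
import random

def conv1x1(x, kernel):
    """Apply a 1x1 convolution.

    x:      nested list indexed [B][T][1][2H]
    kernel: nested list indexed [2H][C] (the two 1x1 spatial dims are dropped)
    Returns a nested list indexed [B][T][1][C].
    """
    n_out = len(kernel[0])
    return [[[[sum(v[i] * kernel[i][c] for i in range(len(v)))
               for c in range(n_out)]
              for v in row]
             for row in sample]
            for sample in x]

def squeeze_axis2(x):
    """Drop the singleton width axis: [B][T][1][C] -> [B][T][C]."""
    return [[row[0] for row in sample] for sample in x]

random.seed(0)
B, T, H, C = 2, 5, 3, 4  # toy sizes, chosen only for illustration
x = [[[[random.random() for _ in range(2 * H)]] for _ in range(T)]
     for _ in range(B)]
kernel = [[random.random() for _ in range(C)] for _ in range(2 * H)]

logits = squeeze_axis2(conv1x1(x, kernel))
shape = (len(logits), len(logits[0]), len(logits[0][0]))
print(shape)  # (2, 5, 4), i.e. B x T x C
```

Every output vector depends only on the input vector at the same batch index and time step, so the convolution here behaves like a dense layer shared across all time steps.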
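Note: we transpose the image before feeding it to the CNN, since TF is row major.
With rate=1, atrous_conv2d is the same as a normal convolution layer. Because the kernel is 1x1, it simply applies the same linear projection from 2H features to C character scores at every time step.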
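As a quick check on why the rate does not matter here, the sketch below computes the effective filter size of an atrous convolution, which the TensorFlow documentation gives as k + (k - 1) * (rate - 1). The function name effective_kernel_size is hypothetical and is used only for this illustration.

```python
def effective_kernel_size(k, rate):
    """Span covered by a k-tap filter when (rate - 1) zeros are inserted between taps."""
    return k + (k - 1) * (rate - 1)

print(effective_kernel_size(3, 1))  # 3: rate 1 is an ordinary convolution
print(effective_kernel_size(3, 2))  # 5: a 3-tap filter dilated to span 5 positions
print(effective_kernel_size(1, 4))  # 1: a 1x1 kernel is unaffected by the rate
```

With rate=1 the effective size equals the actual kernel size, and a 1x1 kernel has no gaps to dilate, so a plain conv2d would presumably give the same result in this model.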