Multiple issues with axes while implementing a Seq2Seq with attention in CNTK


I'm trying to implement a Seq2Seq model with attention in CNTK, something very similar to CNTK Tutorial 204. However, several small differences lead to various issues and error messages, which I don't understand. There are several questions here; they are probably interconnected and all stem from a single thing I don't understand.

Note (in case it's important): my input data comes from MinibatchSourceFromData, created from NumPy arrays that fit in RAM; I don't store it in a CTF file.

ins = C.sequence.input_variable(input_dim, name="in", sequence_axis=inAxis)
y = C.sequence.input_variable(label_dim, name="y", sequence_axis=outAxis)

Thus, the shapes are [#, *](input_dim) and [#, *](label_dim).
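For context, here is a minimal sketch of how the custom dynamic axes referenced above are declared (this mirrors Tutorial 204; InputSequence/OutputSequence are the type aliases used in my train_model code further down):

import cntk as C
from cntk.layers.typing import SequenceOver, Tensor

# Named dynamic axes for the source and target sequences (as in Tutorial 204).
inAxis = C.Axis('inAxis')
outAxis = C.Axis('outAxis')

# Type aliases used later in the @C.Function signature of train_model;
# Tensor[...] supplies the static shape.
InputSequence = SequenceOver[inAxis]
OutputSequence = SequenceOver[outAxis]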

Question 1: When I run the CNTK 204 Tutorial and dump its graph into a .dot file using cntk.logging.plot, I see that its input shapes are [#](-2,). How is this possible?

  • Where did the sequence axis (*) disappear?
  • How can a dimension be negative?
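For reference, this is roughly how I produce that dump (cntk.logging.plot is the actual API; model stands for the composed root Function):

import cntk as C

# Write the network graph to a GraphViz .dot file; the [#](-2,) shapes
# mentioned above appear in the node labels of this dump.
C.logging.plot(model, filename="graph.dot")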

Question 2: In the same tutorial, we have attention_axis = -3. I don't understand this. In my model there are 2 dynamic axes and 1 static axis, so the "third to last" axis would be #, the batch axis. But attention definitely shouldn't be computed over the batch axis.
I hoped that looking at the actual axes in the tutorial code would help me understand this, but the [#](-2,) issue above made this even more confusing.
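To double-check my reading of the axes, I inspect the variables like this (dynamic_axes and shape are standard Variable properties; the comments show what I expect, not verified output):

# Axes of the label variable declared earlier.
print(y.dynamic_axes)   # expecting: batch axis '#' plus sequence axis 'outAxis'
print(y.shape)          # static shape, i.e. (label_dim,)

# The default batch and sequence axes, for comparison.
print(C.Axis.default_batch_axis(), C.Axis.default_dynamic_axis())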

Setting attention_axis to -2 gives the following error:

RuntimeError: Times: The left operand 'Placeholder('stab_result', [#, outAxis], [128])'
              rank (1) must be >= #axes (2) being reduced over.

during creation of the training-time model:

def train_model(m):
    @C.Function
    def model(ins: InputSequence[Tensor[input_dim]],                  
              labels: OutputSequence[Tensor[label_dim]]):
        past_labels = Delay(initial_state=C.Constant(seq_start_encoding))(labels)
        return m(ins, past_labels)  #<<<<<<<<<<<<<< HERE
    return model

where stab_result is a Stabilizer right before the final Dense layer in the decoder. In the .dot file I can see spurious trailing dimensions of size 1 appearing in the middle of the AttentionModel implementation.

Setting attention_axis to -1 gives the following error:

RuntimeError: Binary elementwise operation ElementTimes: Left operand 'Output('Block346442_Output_0', [#, outAxis], [64])'
              shape '[64]' is not compatible with right operand 
              'Output('attention_weights', [#, outAxis], [200])' shape '[200]'.

where 64 is my attention_dim and 200 is my attention_span. As I understand it, the elementwise * inside the attention model should never conflate these two, so -1 is definitely not the right axis here.
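For concreteness, this is roughly how I construct the attention block (attention_dim=64 and attention_span=200 as stated above; attention_axis is the value in question):

from cntk.layers import AttentionModel

# Attention block as currently wired into my decoder; attention_axis is the
# parameter whose correct value (-3, -2, or -1) I'm trying to pin down.
attention = AttentionModel(attention_dim=64,
                           attention_span=200,
                           attention_axis=-1)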

Question 3: Is my understanding above correct? What should be the right axis and why is it causing one of the two exceptions above?

Thanks for the explanations!

1 Answer

Answer by Nikos Karampatziakis (accepted):

First, some good news: A couple of things have been fixed in the AttentionModel in the latest master (will be generally available with CNTK 2.2 in a few days):

  • You don't need to specify an attention_span or an attention_axis. If you leave them at their default values, attention is computed over the whole sequence. In fact, these arguments have been deprecated.
  • If you do the above, the 204 notebook runs 2x faster, so it no longer uses these arguments.
  • A bug has been fixed in the AttentionModel, and it now faithfully implements the Bahdanau et al. paper.

Regarding your questions:

The dimension is not negative. We use certain negative numbers in various places to mean certain things: -1 is a dimension that will be inferred once based on the first minibatch, -2 is, I think, the shape of a placeholder, and -3 is a dimension that will be inferred with each minibatch (such as when you feed variable-sized images to convolutions). I think if you print the graph after the first minibatch, you should see that all shapes are concrete.
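A minimal sketch of what I mean (trainer and the fed minibatch data are placeholders, not code from the question):

# After training on one real minibatch, the inferred/free dimensions get
# bound, so a second dump should show concrete shapes instead of -1/-2/-3.
trainer.train_minibatch({ins: x_batch, y: y_batch})          # placeholder data
C.logging.plot(trainer.model, filename="graph_after_mb.dot")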

attention_axis is an implementation detail that should have been hidden. Basically, attention_axis=-3 will create a shape of (1, 1, 200), attention_axis=-4 will create a shape of (1, 1, 1, 200), and so on. In general, anything greater than -3 is not guaranteed to work, and anything less than -3 just adds more 1s without any clear benefit. The good news, of course, is that you can just ignore this argument in the latest master.

TL;DR: If you are on master (or, in a few days, on CNTK 2.2), replace AttentionModel(attention_dim, attention_span=200, attention_axis=-3) with AttentionModel(attention_dim). It is faster and does not contain confusing arguments. Starting from CNTK 2.2, the original API is deprecated.
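In code, the change amounts to this (attention_dim is whatever you were already using, e.g. 64 above):

from cntk.layers import AttentionModel

# Before (CNTK 2.1 and earlier): explicit attention window and axis.
attn = AttentionModel(attention_dim, attention_span=200, attention_axis=-3)

# After (master / CNTK 2.2): attention over the whole sequence.
attn = AttentionModel(attention_dim)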