What is the reason for MultiHeadAttention having a different call convention than Attention and AdditiveAttention?


Attention and AdditiveAttention are called with their input tensors passed in a single list (the same as Add, Average, Concatenate, Dot, Maximum, Multiply, and Subtract).

MultiHeadAttention, by contrast, is called with the input tensors passed as separate arguments.

The following minimal example shows the difference in how the inbound_nodes are linked:

import json

from tensorflow.keras.layers import AdditiveAttention, Attention, Input, MultiHeadAttention
from tensorflow.keras.models import Model

inputs = [Input(shape=(8, 16)), Input(shape=(4, 16))]
outputs = [Attention()([inputs[0], inputs[1]]),
           AdditiveAttention()([inputs[0], inputs[1]]),
           MultiHeadAttention(num_heads=2, key_dim=2)(inputs[0], inputs[1])]
model = Model(inputs=inputs, outputs=outputs)

print(json.dumps(json.loads(model.to_json()), indent=4))
[...]
                "name": "attention",
                "inbound_nodes": [
                    [
                        [
                            "input_1",
                            0,
                            0,
                            {}
                        ],
                        [
                            "input_2",
                            0,
                            0,
                            {}
                        ]
                    ]
                ]
[...]
                "name": "additive_attention",
                "inbound_nodes": [
                    [
                        [
                            "input_1",
                            0,
                            0,
                            {}
                        ],
                        [
                            "input_2",
                            0,
                            0,
                            {}
                        ]
                    ]
                ]
[...]
                "name": "multi_head_attention",
                "inbound_nodes": [
                    [
                        [
                            "input_1",
                            0,
                            0,
                            {
                                "value": [
                                    "input_2",
                                    0,
                                    0
                                ]
                            }
                        ]
                    ]
                ]
[...]

What's the reason for MultiHeadAttention not following the convention of the other two attention layers, and what does it mean that input_2 is stored under "value" inside the input_1 entry of the inbound_nodes?

(Some context on why this is relevant to me: I'm maintaining this library and would like to implement support for MultiHeadAttention.)
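In case it helps to see what I'm after, here is a minimal sketch of how I imagine normalizing the two serialization styles on my side. It assumes that a tensor passed as a call keyword argument is always serialized as a plain [layer_name, node_index, tensor_index] list inside the kwargs dict, exactly as in the JSON above; the helper name iter_inbound_tensor_refs is just my own.

def iter_inbound_tensor_refs(inbound_nodes):
    """Yield (layer_name, node_index, tensor_index) for every tensor feeding a layer.

    Covers both serialization styles shown above: tensors passed positionally
    (plain [name, node_index, tensor_index, kwargs] entries, as for Attention
    and AdditiveAttention) and tensors passed as call keyword arguments
    (stored inside the kwargs dict, as for MultiHeadAttention).
    """
    for node in inbound_nodes:
        for layer_name, node_index, tensor_index, kwargs in node:
            yield layer_name, node_index, tensor_index
            for kwarg_value in kwargs.values():
                # Assumption: a keyword argument that is itself a tensor is
                # serialized as a plain [layer_name, node_index, tensor_index]
                # list, like the "value" entry in the JSON above.
                if (isinstance(kwarg_value, list) and len(kwarg_value) == 3
                        and isinstance(kwarg_value[0], str)):
                    yield tuple(kwarg_value)

Applied to the config above, this would yield ("input_1", 0, 0) and ("input_2", 0, 0) for all three layers: from the two positional entries for attention and additive_attention, and from the single positional entry plus the "value" keyword argument for multi_head_attention.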
