The BERT encoder takes the input and runs it through multi-head attention. But how does it maintain sequence order, given that the current word does not depend on a sequence of previous words? Besides, why is it called bidirectional? Does it maintain a forward and a backward pass over the sequence like a bidirectional LSTM?
How is BERT bidirectional?
It is bidirectional because it uses context from both sides of the current word (instead of e.g. using just the previous few words it uses the whole sequence).
How deep you need to go depends on how much detail you want, but basically there are two mechanisms, attention and self-attention, that make this "handle everything in the sequence at once" approach work.
In a nutshell, the attention mechanism means that instead of going through the sentence sequentially, word by word, the entire sequence is used to decode the currently handled word, with an attention weighting deciding how much say each word in the input gets in how the current word is handled.
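To make that weighting concrete, here is a minimal NumPy sketch of the idea (not BERT itself; the vectors are random stand-ins for learned representations): one query for the word currently being handled is scored against every input word, a softmax turns the scores into weights, and the result is a weighted sum over the whole sequence.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 3))   # 4 input words, one 3-dim vector each
query  = rng.normal(size=(3,))     # the word currently being handled

# Score every input word against the query, then normalise into weights:
# this is "how much say" each input word gets.
scores  = inputs @ query / np.sqrt(query.shape[0])
weights = softmax(scores)

# The context for the current word is a weighted mix of the whole sequence.
context = weights @ inputs
print(weights, context)
```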
The self-attention mechanism means that the context (the rest of the sentence) is already used for the encoding of the input sequence itself. So if a sentence contains an "it" used as a pronoun, the encoding of that token is going to be strongly context-dependent. Similarly to attention, self-attention has a weighting function that decides how relevant each other input token is for the encoding of the current input token.
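And a corresponding sketch of self-attention, where queries, keys and values are all projections of the same input sequence, so every token's new encoding is a context-weighted mix over all tokens at once (again just an illustration of the mechanism, with random numbers in place of learned projection matrices):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 5, 8                      # 5 tokens, 8-dim embeddings
X = rng.normal(size=(seq_len, d))      # token embeddings of ONE sequence

# In self-attention Q, K and V all come from the same X.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Every token attends to every other token, left and right at the same time.
weights = softmax(Q @ K.T / np.sqrt(d))   # shape (seq_len, seq_len)
encoded = weights @ V                     # context-dependent encodings
print(weights.round(2))
```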
A popular way to explain self-attention is this sentence: "The cat ran over the street, because it got startled." The encoding of "it" in this sentence is strongly dependent on "The cat" and a bit dependent on "the street", because during pre-training the model learnt that predicting masked words after/around "it" in this kind of sentence strongly depends on these nouns.
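If you want to see this on a real model, something along these lines works (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; which tokens actually get the most weight is empirical and varies per layer and head):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "The cat ran over the street, because it got startled."
inputs = tokenizer(sentence, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq, seq).
# Average the last layer's heads and look at the row for the "it" token.
att = outputs.attentions[-1][0].mean(dim=0)
it_idx = tokens.index("it")
for tok, w in sorted(zip(tokens, att[it_idx].tolist()), key=lambda p: -p[1])[:5]:
    print(f"{tok:>10s}  {w:.3f}")
```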
If you haven't yet, you should definitely check out the "Attention Is All You Need" paper as well as the BERT paper (at least the abstract); they explain in detail how the mechanisms and the pre-training process work.
Another great source for getting a better understanding of how it really works is The Illustrated Transformer.
The BERT pre-training process consists of two parts: 1. Mask LM (masked language modelling); 2. NSP (next sentence prediction). The bidirectional structure shows up in Mask LM. For example, the sentence "Tom likes to study [MASK] Learning" is fed into the model, and [MASK] combines information from the left and the right context through attention, which is what makes it bidirectional. Attention itself is two-way; GPT makes it one-way through an attention mask, i.e. the [MASK] position is not allowed to see "Learning" and only sees the preceding "Tom likes to study".
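A small sketch of that masking difference (illustrative NumPy only, not a real GPT or BERT implementation): without a mask every position, including [MASK], attends to both sides, while a GPT-style causal mask removes attention to everything on the right, so the masked position only sees "Tom likes to study".

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

tokens = ["Tom", "likes", "to", "study", "[MASK]", "Learning"]
n, d = len(tokens), 4
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
scores = Q @ K.T / np.sqrt(d)

# BERT-style (bidirectional): no mask, every token sees left AND right.
bert_weights = softmax(scores)

# GPT-style (one-way): a causal mask forbids attending to positions on the right.
causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)
gpt_weights = softmax(np.where(causal_mask, -1e9, scores))

i = tokens.index("[MASK]")
print("bidirectional:", dict(zip(tokens, bert_weights[i].round(2))))
print("causal (GPT): ", dict(zip(tokens, gpt_weights[i].round(2))))
# With the causal mask the weight on "Learning" is ~0: the masked position
# only sees "Tom likes to study" (and itself).
```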