Whiteboard Transformers & Attention
Attention explained by ChatGPT
<aside> 💡 Attention is a mechanism that allows neural networks to assign a different amount of weight or “attention” to each element in a sequence.
</aside>
For text sequences, the elements are token embeddings like the ones we encountered in Chapter 2, where each token is mapped to a vector of some fixed dimension. For example, in BERT (base) each token is represented as a 768-dimensional vector.
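To make the "token to fixed-size vector" idea concrete, here is a minimal sketch of an embedding lookup in plain Python. The vocabulary, the `embed` helper, and the random values are all hypothetical stand-ins for illustration; a real model learns these vectors during training and uses a far larger vocabulary.

```python
import random

EMBED_DIM = 768  # the fixed dimension from the BERT example; any size works

# Hypothetical toy vocabulary mapping token strings to integer ids.
vocab = {"time": 0, "flies": 1, "like": 2, "an": 3, "arrow": 4}

rng = random.Random(0)
# One row per token id. Random values stand in for learned weights.
embedding_table = [
    [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab
]

def embed(tokens):
    """Map each token string to its embedding vector via table lookup."""
    return [embedding_table[vocab[tok]] for tok in tokens]

vectors = embed(["time", "flies", "like", "an", "arrow"])
print(len(vectors), len(vectors[0]))  # 5 tokens, each a 768-dimensional vector
```

The sequence that attention operates on is exactly this list of vectors: one per token, all of the same length.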
The "self" part of self-attention refers to the fact that these weights are computed for all hidden states in the same set—for example, all the hidden states of the encoder. By contrast, the attention mechanism associated with recurrent models involves computing the relevance of each encoder hidden state to the decoder hidden state at a given decoding timestep.
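The difference between the two setups can be sketched in a few lines of plain Python. This is an illustrative simplification, not a full Transformer layer: it assumes scaled dot-product similarity as the scoring function and omits the learned query, key, and value projections. The function names and the tiny 2-dimensional vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_weights(query, keys):
    """Weights of one query vector over a list of key vectors."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, k) / scale for k in keys])

def self_attention(hidden_states):
    """Each state attends over all states in the same set."""
    dim = len(hidden_states[0])
    outputs = []
    for q in hidden_states:
        w = attention_weights(q, hidden_states)
        # The output is the weighted average of all hidden states.
        outputs.append(
            [sum(wi * h[d] for wi, h in zip(w, hidden_states)) for d in range(dim)]
        )
    return outputs

# Toy encoder hidden states (three tokens, two dimensions each).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Self-attention: queries and keys come from the same set.
self_out = self_attention(encoder_states)

# Recurrent-style attention: one decoder state queries the encoder states.
decoder_state = [0.5, -0.5]
cross_w = attention_weights(decoder_state, encoder_states)

print(len(self_out), [round(w, 3) for w in cross_w])
```

In the self-attention call every hidden state produces its own set of weights over the whole set, so three states yield three weight vectors. In the recurrent-style call there is a single set of weights per decoding timestep, one weight per encoder state.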