Summary from HuggingFace Book

Attention in simple words

Attention explained by Computerphile
Broad overview of the most important concepts (video)
Attention concept explained quickly and clearly → video

Self-Attention

<aside> 💡 Attention is a mechanism that allows neural networks to assign a different amount of weight or “attention” to each element in a sequence.

</aside>

Self-Attention in simple words → video

For text sequences, the elements are token embeddings like the ones encountered in Chapter 2 of the book, where each token is mapped to a vector of some fixed dimension. For example, in BERT each token is represented as a 768-dimensional vector.
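To make the "token → fixed-size vector" idea concrete, here is a minimal NumPy sketch of an embedding lookup. The vocabulary size, the token ids, and the random table are all made up for illustration; only the 768-dimensional width comes from the note above. A real model learns the table during training.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 1000   # hypothetical toy vocabulary (real models use far more entries)
hidden_dim = 768    # embedding width quoted above for BERT

# Embedding table: one row per token id. Random here, learned in a real model.
embedding_table = rng.normal(size=(vocab_size, hidden_dim))

# Hypothetical token ids for a 5-token input sequence.
token_ids = np.array([12, 407, 3, 998, 56])

# The lookup is plain row indexing: each id selects its own vector.
token_embeddings = embedding_table[token_ids]

print(token_embeddings.shape)  # (5, 768)
```

The result has one row per token, so a sequence of 5 tokens becomes a 5 × 768 matrix. This matrix is what the attention layers operate on.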

The "self" part of self-attention refers to the fact that these weights are computed for all hidden states in the same set—for example, all the hidden states of the encoder. By contrast, the attention mechanism associated with recurrent models involves computing the relevance of each encoder hidden state to the decoder hidden state at a given decoding timestep.
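The contrast between the two kinds of attention can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the book's implementation: scores are plain dot products (scaled by the square root of the dimension in the self-attention case), there are no learned projections, and the hidden states are random. Attention in recurrent models often uses a different scoring function; dot products are used here only to keep the comparison short.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, hidden_dim = 4, 8  # toy sizes chosen for illustration

# Hidden states of one set, e.g. all encoder hidden states.
encoder_states = rng.normal(size=(seq_len, hidden_dim))

# Self-attention: every state is scored against every state in the SAME set.
self_scores = encoder_states @ encoder_states.T / np.sqrt(hidden_dim)
self_weights = softmax(self_scores, axis=-1)      # shape (4, 4), rows sum to 1
contextualized = self_weights @ encoder_states    # shape (4, 8)

# Recurrent-style attention: ONE decoder state (a single decoding timestep)
# is scored against each encoder state.
decoder_state = rng.normal(size=(hidden_dim,))
cross_scores = encoder_states @ decoder_state     # shape (4,)
cross_weights = softmax(cross_scores)             # sums to 1
context_vector = cross_weights @ encoder_states   # shape (8,)

print(self_weights.shape, cross_weights.shape)
```

In the self-attention case the weights form a square matrix, with one row of weights per element of the sequence. In the recurrent case there is a single weight vector per decoding timestep, relating one decoder state to all encoder states.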