Whiteboard Transformers & Attention

Transformer Basics

Multi-headed attention

Transformer architecture

Summary from HuggingFace Book

Attention in simple words

Attention explained by ChatGPT

Self-Attention

<aside> 💡 Attention is a mechanism that allows neural networks to assign a different amount of weight or “attention” to each element in a sequence.

</aside>

For text sequences, the elements are token embeddings like the ones we encountered in Chapter 2, where each token is mapped to a vector of some fixed dimension. For example, in the base BERT model each token is represented as a 768-dimensional vector.
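To make "each token is mapped to a vector of some fixed dimension" concrete, here is a minimal sketch of an embedding lookup in plain Python. The toy vocabulary, the random initialisation, and the names `vocab`, `embedding_table`, and `embed` are all assumptions for illustration. A real model uses a learned table and a proper tokenizer.

```python
import random

EMBED_DIM = 768  # hidden size of the base BERT model

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
vocab = {"[CLS]": 0, "time": 1, "flies": 2, "like": 3, "an": 4, "arrow": 5, "[SEP]": 6}

rng = random.Random(0)
# One row per vocabulary entry. Random values stand in for learned weights.
embedding_table = [
    [rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)] for _ in vocab
]

def embed(tokens):
    """Map each token string to its fixed-size vector by table lookup."""
    return [embedding_table[vocab[token]] for token in tokens]

token_embeddings = embed(["time", "flies", "like", "an", "arrow"])
print(len(token_embeddings), len(token_embeddings[0]))  # 5 768
```

Whatever the sentence length, every token ends up as a vector of the same size, and those vectors are the "elements" that attention assigns weights to.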

The "self" part of self-attention refers to the fact that these weights are computed for all hidden states in the same set—for example, all the hidden states of the encoder. By contrast, the attention mechanism associated with recurrent models involves computing the relevance of each encoder hidden state to the decoder hidden state at a given decoding timestep.
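The difference between the two setups can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the book's implementation. It scores similarity with a scaled dot product followed by a softmax and skips the learned query/key/value projections that real transformer layers apply. The hidden-state values and the helper names (`attention_weights`, `weighted_sum`) are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_weights(queries, keys):
    """Return one row of weights per query, spread over all keys."""
    scale = math.sqrt(len(keys[0]))  # assumed scaling by sqrt(dimension)
    return [softmax([dot(q, k) / scale for k in keys]) for q in queries]

def weighted_sum(weights, values):
    """Mix the value vectors according to each row of weights."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]

# Toy encoder hidden states: 3 tokens, 4 dimensions each (made-up numbers).
encoder_states = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]

# Self-attention: queries and keys come from the same set of hidden states,
# so we get a 3 x 3 matrix of weights.
self_weights = attention_weights(encoder_states, encoder_states)
contextualized = weighted_sum(self_weights, encoder_states)

# Recurrent-style attention: a single decoder hidden state at one timestep
# is compared against every encoder hidden state, giving a 1 x 3 row.
decoder_state = [0.5, 0.5, 0.0, 0.0]
cross_weights = attention_weights([decoder_state], encoder_states)

for row in self_weights:
    print([round(w, 3) for w in row])
print([round(w, 3) for w in cross_weights[0]])
```

Each row of `self_weights` says how much one encoder state attends to every state in the same set, itself included. `cross_weights` is a single row, because only one decoder state is asking the question at that timestep.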