Transformer architecture

Videos

Videos: A step by step explanation
Jeremy Howard explains LLMs

The encoding component is a stack of encoders. The decoding component is a stack of the same number of decoders.

The encoders are all identical in structure, but they do not share weights. Each one is broken down into two sub-layers, a self-attention layer followed by a feed-forward neural network:

The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word.
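To make "looking at other words" concrete, here is a minimal sketch of single-head self-attention. It assumes the standard scaled dot-product form (these notes do not spell out the formula), uses plain Python lists instead of a tensor library, and the names `self_attention`, `Wq`, `Wk`, `Wv` and the toy identity projections are illustrative choices, not anything taken from a specific implementation.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a list of word vectors X.

    Assumption: scaled dot-product scoring, no masking, no multi-head split.
    Returns the new vector for each word and the attention weights it used.
    """
    Q = [matvec(Wq, x) for x in X]  # what each word is looking for
    K = [matvec(Wk, x) for x in X]  # what each word offers for matching
    V = [matvec(Wv, x) for x in X]  # the content that gets mixed together
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Score the current word against every word in the sentence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # The output is a weighted average of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy sentence: 3 "words" with 2-dimensional embeddings (made-up numbers).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I = [[1.0, 0.0], [0.0, 1.0]]  # identity projections, purely for illustration
Z, A = self_attention(X, I, I, I)
```

Each row of `A` holds how much one word attends to every word in the sentence (including itself), and each row of `Z` is that word's new encoding. In a trained model the three projection matrices are learned rather than fixed.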

The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
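The "same network, applied independently to each position" idea can be sketched as below. The two-layer shape with a ReLU in between is an assumption based on the usual Transformer design, and the weights and the helper names `feed_forward` and `encoder_ffn` are made up for the example.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def feed_forward(x, W1, b1, W2, b2):
    """Two-layer network applied to ONE position's vector (assumed ReLU activation)."""
    hidden = [max(0.0, h + b) for h, b in zip(matvec(W1, x), b1)]
    return [o + b for o, b in zip(matvec(W2, hidden), b2)]

def encoder_ffn(Z, params):
    """Apply the same weights to every position; positions never interact here."""
    return [feed_forward(z, *params) for z in Z]

# Toy weights: model dimension 2, hidden dimension 3 (illustrative values only).
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
b2 = [0.5, -0.5]
params = (W1, b1, W2, b2)

# Pretend these are self-attention outputs for three positions.
Z = [[1.0, 2.0], [3.0, 1.0], [1.0, 2.0]]
out = encoder_ffn(Z, params)
print(out)  # [[1.5, 1.5], [5.5, 0.5], [1.5, 1.5]]
```

Positions 0 and 2 hold the same vector and therefore get the same output, whatever sits between them. Any mixing of information across positions happens in the self-attention sub-layer, not here.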

The decoder has both of those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).
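A rough sketch of the order of the three decoder sub-layers follows. It is heavily simplified: learned projections, multiple heads, residual connections, layer normalization, and the masking a real decoder applies in its self-attention are all left out, and the function names and toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, memory):
    """Each query attends over `memory`, which serves as both keys and values.

    Simplification: no learned projections, so vectors are compared directly.
    """
    d = len(memory[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in memory]
        w = softmax(scores)
        out.append([sum(wi * m[j] for wi, m in zip(w, memory)) for j in range(d)])
    return out

def feed_forward(x):
    """Stand-in for the position-wise network: just a ReLU, for brevity."""
    return [max(0.0, a) for a in x]

def encoder_layer(x):
    z = attention(x, x)                  # self-attention over the input sentence
    return [feed_forward(v) for v in z]  # feed-forward, position by position

def decoder_layer(y, encoder_output):
    z = attention(y, y)                  # 1. self-attention over decoder positions
    z = attention(z, encoder_output)     # 2. attention over the encoder's output
    return [feed_forward(v) for v in z]  # 3. feed-forward, position by position

source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy input sentence, 3 positions
target = [[0.5, 0.5], [1.0, -1.0]]             # toy decoder input, 2 positions
memory = encoder_layer(source)
out = decoder_layer(target, memory)
```

The only structural difference from `encoder_layer` is step 2: the queries come from the decoder, while the vectors being attended to come from the encoder, which is how the decoder focuses on parts of the input sentence.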
