Whiteboard Transformers & Attention

Self-Attention

Definition: A Transformer is a type of neural network architecture developed by Vaswani et al. in 2017 [1]. Without going into too much detail, this model architecture consists of a multi-head self-attention mechanism combined with an encoder-decoder structure.

Performance: On machine translation, it achieves state-of-the-art (SOTA) results, outperforming various models based on recurrent (RNN) or convolutional (CNN) neural networks in terms of both evaluation score (BLEU) and training time.

Key advantage: A key advantage of a Transformer over other deep neural network (NN) structures is that it considers long-distance context around a word in a more computationally efficient way. For example, the phrase “making…more difficult” would be recognized in the sentence “making the registration or voting process more difficult” even though “more difficult” is a rather distant dependency of “making”. The relevant context around each word can be computed in parallel, which saves significant training resources. Because this computation can also be distributed across many machines, the complicated computations required for tasks such as automatic speech recognition become feasible in a reasonable amount of time, which is one reason Transformers are widely used in state-of-the-art applications.
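To make the parallelism concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification, not the full mechanism from [1]: it assumes a single head and uses the token vectors directly as queries, keys, and values, with no learned projection matrices. The function names and the toy vectors are illustrative only.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention with Q = K = V = x (simplifying assumption).

    x is a list of token vectors (lists of floats).
    Returns (outputs, weights), one row per token.
    """
    d = len(x[0])
    weights = []
    for q in x:
        # Each token scores every token in the sequence, near or far.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights.append(softmax(scores))
    # Each output is a weighted average of all token vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

No row of `weights` depends on any other row, so all positions can be computed at the same time. The distance between two tokens also plays no role in the score, which is why a distant dependency costs no more than an adjacent one.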

Transformer in NLP models: Current SOTA NLP models use the Transformer architecture in part or as a whole. The GPT model uses only the decoder of the Transformer structure (unidirectional), while BERT is based on the Transformer encoder (bidirectional). T5 uses an encoder-decoder Transformer structure very similar to the original implementation. These general architectures also differ in the number and dimension of the elements that make up an encoder or decoder (i.e., the number of layers, the hidden size, and the number of self-attention heads they employ). Aside from these variations in model structure, the language models also differ in the data and tasks used for pre-training.
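The unidirectional versus bidirectional distinction comes down to which positions each token is allowed to attend to. The sketch below builds the two kinds of attention mask as boolean matrices. The helper name `attention_mask` is hypothetical and not taken from any library.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix; entry [i][j] is True if
    position i may attend to position j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Decoder-style (GPT-like): each token sees only itself and earlier tokens.
decoder_mask = attention_mask(3, causal=True)

# Encoder-style (BERT-like): each token sees the whole sequence.
encoder_mask = attention_mask(3, causal=False)
```

In an actual model, the attention scores at positions marked `False` would be suppressed before the softmax so that they receive zero weight.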

Transfer learning: Many Transformer-based NLP models were created specifically for transfer learning. Transfer learning describes an approach in which a model is first pre-trained on large unlabeled text corpora using self-supervised learning and is then minimally adjusted during fine-tuning on a specific NLP (downstream) task. Labeled datasets for specific NLP tasks are usually relatively small, and a model trained only on such a small dataset, without pre-training, would perform worse than its pre-trained counterpart. The same pre-trained model can be fine-tuned for various downstream NLP tasks, including text classification, summarization, and question answering.

Transfer Learning

Different pre-training techniques: Various unlabeled data sources can be used for pre-training. They can be entirely unrelated to the data or task used during fine-tuning, as long as the dataset is large enough. GPT-2, for example, was pre-trained on 40 GB of text [6]. Consequently, pre-training is very time- and resource-intensive and is usually done on multiple GPUs over several hours or even days. The datasets and learning objectives used during pre-training differ considerably among models. While GPT used a standard language modeling objective [2], which predicts the next word in a sentence, BERT was trained on Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) [4]. The RoBERTa model replicated the BERT architecture but changed the pre-training by using more data, training for longer, and removing the NSP objective [7].
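The difference between the two main objectives can be seen in how training examples are derived from raw, unlabeled text. The following sketch uses hypothetical helper names and a whitespace-tokenized toy sentence. It fixes the masked positions by hand for clarity, whereas the actual BERT procedure [4] selects positions at random and includes further details omitted here.

```python
def next_token_examples(tokens):
    """Language modeling: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_positions, mask_token="[MASK]"):
    """Masked language modeling: hide chosen tokens and predict them
    from context on both sides."""
    masked = [
        mask_token if i in mask_positions else t
        for i, t in enumerate(tokens)
    ]
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return masked, targets

sentence = ["the", "cat", "sat", "down"]

# GPT-style: three (context, next word) pairs from one sentence.
lm_examples = next_token_examples(sentence)

# BERT-style: one masked input plus the hidden word to recover.
mlm_input, mlm_targets = masked_lm_example(sentence, {1})
```

In both cases the labels come from the text itself, which is why no manual annotation is needed for pre-training.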

Fine-Tuning: The model checkpoints of the pre-trained models serve as the starting point for fine-tuning. A labeled dataset for a specific downstream task is used as training data. There are several different fine-tuning approaches, including the following:

  1. Training the entire model on the labeled data.
  2. Training only higher layers and freezing the lower layers.