Overview of Transformer Models
The paper examines transformer language models, focusing mainly on attention-only models with two or fewer layers. Analyzing these simplified models is a stepping stone toward understanding large-scale models such as GPT-3. The primary goal is to reverse-engineer transformers and lay bare their computations, much as one might decompile a binary into readable source code.
Model Simplifications and Assumptions
To streamline the analysis, the research focuses on “attention-only” transformers, omitting MLP layers and biases. It describes the high-level structure of autoregressive transformer models, emphasizing the residual stream: each layer reads from the stream and adds its result back in, preserving a linear structure that is essential for decomposing the model’s computation.
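As a rough sketch of this structure (the weight names, dummy attention layer, and dimensions below are illustrative, not the paper’s notation), the forward pass of an attention-only model can be written so that each attention layer simply adds its output into the residual stream:

```python
import numpy as np

# A minimal sketch of an attention-only forward pass.  Names (W_embed,
# attention_layers, W_unembed) are illustrative, not the paper's notation.
def forward(tokens, W_embed, attention_layers, W_unembed):
    # The residual stream starts as the token embeddings: (seq_len, d_model).
    residual = W_embed[tokens]
    for attn in attention_layers:
        # Each layer reads the current stream and *adds* its result back in,
        # so earlier contributions are never overwritten.
        residual = residual + attn(residual)
    # The unembedding reads the final stream to produce next-token logits.
    return residual @ W_unembed

# Tiny usage example with a dummy "attention layer" that averages over
# earlier positions (purely to make the sketch runnable).
vocab, d_model = 10, 8
W_embed = np.random.randn(vocab, d_model)
W_unembed = np.random.randn(d_model, vocab)
dummy_attn = lambda x: np.cumsum(x, axis=0) / np.arange(1, len(x) + 1)[:, None]
logits = forward(np.array([1, 2, 3, 4, 5]), W_embed, [dummy_attn], W_unembed)
print(logits.shape)  # (5, 10)
```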
Residual Stream and Virtual Weights
The residual stream is best viewed as a communication channel: at any point it is simply the sum of the original token embedding and the outputs of all earlier layers. Because this structure is linear, one can compute “virtual weights” that directly connect any pair of layers, making explicit how the output of one layer is read by another.
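A minimal sketch of the idea, with hypothetical weight names and shapes: because layer outputs are simply summed into the stream, the effective “virtual weight” from an earlier layer’s output to a later layer’s input is just a product of the two weight matrices.

```python
import numpy as np

# Hypothetical shapes: an earlier layer writes into the residual stream with
# W_out_early (d_model x d_head); a later layer reads from the stream with
# W_in_late (d_head x d_model).  Random weights, purely for illustration.
d_model, d_head = 64, 16
W_out_early = np.random.randn(d_model, d_head)
W_in_late = np.random.randn(d_head, d_model)

# Because the stream is just a sum, the later layer sees the earlier layer's
# output through this "virtual weight" connecting the two layers directly.
W_virtual = W_in_late @ W_out_early   # shape: (d_head, d_head)
print(W_virtual.shape)
```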
Dividing Attention Heads’ Roles
Attention heads are analyzed as independent, additive operations. Each head is decomposed into a query-key (QK) circuit and an output-value (OV) circuit: the QK circuit determines which tokens the head attends to, while the OV circuit determines how information from the attended token affects the output. Because the two circuits operate independently, their influence on model behaviour can be isolated and studied separately.
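The two circuits can be written down as explicit token-to-token matrices. The sketch below uses small, randomly initialized weights purely for illustration; only the matrix products follow the paper.

```python
import numpy as np

# Illustrative sizes and random weights; only the matrix products matter here.
n_vocab, d_model, d_head = 1000, 64, 16
W_E = np.random.randn(d_model, n_vocab)   # token embedding
W_U = np.random.randn(n_vocab, d_model)   # unembedding
W_Q = np.random.randn(d_head, d_model)    # query projection
W_K = np.random.randn(d_head, d_model)    # key projection
W_V = np.random.randn(d_head, d_model)    # value projection
W_O = np.random.randn(d_model, d_head)    # output projection

# QK circuit: for every (query token, key token) pair, how strongly the head
# wants to attend.  Shape: (n_vocab, n_vocab).
QK_circuit = W_E.T @ W_Q.T @ W_K @ W_E

# OV circuit: if the head attends to a given token, how the logits for every
# possible next token shift.  Shape: (n_vocab, n_vocab).
OV_circuit = W_U @ W_O @ W_V @ W_E
```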
Mechanistic Interpretability via Path Expansion
The “Path Expansion Trick” is introduced to rewrite the transformer as a sum of simpler, end-to-end paths that can be interpreted individually. Applied to the OV and QK circuits, this analysis shows that one-layer models essentially predict “skip-trigram” sequences, while two-layer models implement more sophisticated algorithms by composing attention heads.
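For a one-layer attention-only model, the expansion takes the following form in the paper’s tensor-product notation (a sketch of the identity, where A^h is head h’s attention pattern): the first term is the direct embed-unembed path, and each remaining term is a path through a single attention head, which is what gives rise to skip-trigram behaviour.

```latex
% Logits of a one-layer attention-only model as a sum of paths:
% a direct embed-unembed path plus one path per attention head h.
T = \mathrm{Id} \otimes W_U W_E
    \;+\; \sum_{h \in H} A^{h} \otimes \left( W_U W_{OV}^{h} W_E \right)
```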
The Role of Induction Heads
Two-layer attention-only transformers are found to use induction heads to implement a simple in-context learning algorithm. An induction head attends to the token that followed a previous occurrence of the current token and predicts that it will occur again, allowing the model to repeat earlier sequences both exactly and approximately. This mechanism is substantially more powerful than the simple copying available to one-layer models.
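The behaviour can be summarized by a toy heuristic; the function below is a hand-written sketch of what an induction head accomplishes, not of the learned two-head circuit that implements it.

```python
# Toy heuristic: look for an earlier occurrence of the current token and
# predict the token that followed it.
def induction_prediction(tokens):
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards
        if tokens[i] == current:
            return tokens[i + 1]               # repeat what came next last time
    return None                                # no earlier occurrence found

# Having seen "A B C D ... A B", the heuristic predicts "C" after "A B".
print(induction_prediction(["A", "B", "C", "D", "A", "B"]))  # -> "C"
```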
Importance of Understanding Higher-Order Compositions
The research also investigates higher-order compositions such as virtual attention heads, though they appear to play only a minor role in these small models. They may, however, be important in larger, more complex transformers and are flagged as a focus for future work.
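A minimal sketch of one such composition (V-composition), with hypothetical weight names: when a second-layer head reads what a first-layer head wrote, the pair behaves like a single “virtual head” whose OV circuit is the product of the two heads’ OV matrices.

```python
import numpy as np

# Hypothetical OV matrices for a first-layer and a second-layer head, both
# viewed as maps on the residual stream (d_model x d_model); random values
# are purely illustrative.
d_model = 64
W_OV_layer1 = np.random.randn(d_model, d_model)
W_OV_layer2 = np.random.randn(d_model, d_model)

# If the second head reads what the first head wrote (V-composition), the
# pair acts like a single "virtual head" whose OV circuit is the product.
W_OV_virtual = W_OV_layer2 @ W_OV_layer1
```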
Future Directions and Practical Implications
This initial work restricts itself to attention-only models; understanding MLP layers remains an open challenge and will require a more holistic view of transformer computation. The ultimate goal is systematic interpretability tooling that can anticipate and mitigate safety issues in present and future AI models.
Resource
Read more in A Mathematical Framework for Transformer Circuits (https://transformer-circuits.pub/2021/framework/index.html).