Introduction
Large Language Models (LLMs) have revolutionized artificial intelligence and natural language processing. From GPT to BERT, these models demonstrate remarkable capabilities in understanding and generating human language. But what makes them work? In this post, we’ll explore the mathematical foundations and architectural principles that power these sophisticated systems.
The Foundation: Neural Networks
At their core, LLMs are built upon artificial neural networks. A neural network consists of layers of interconnected neurons, each performing a simple mathematical operation:
$$y = \sigma(Wx + b)$$
Where:
- $x$ is the input vector
- $W$ is the weight matrix
- $b$ is the bias vector
- $\sigma$ is the activation function
The activation function introduces non-linearity, allowing the network to learn complex patterns. Common choices include ReLU (Rectified Linear Unit): $\sigma(x) = \max(0, x)$ and the more traditional sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$.
From RNNs to Transformers
The Problem with RNNs
Early language models used Recurrent Neural Networks (RNNs), which process sequences step by step. However, RNNs suffer from two major problems:
- Vanishing Gradients: During backpropagation, gradients can become extremely small: $$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots \frac{\partial h_1}{\partial W}$$
When the terms $\frac{\partial h_t}{\partial h_{t-1}} < 1$, multiplying many of them together causes the gradient to vanish.
- Sequential Processing: RNNs can’t parallelize across time steps, making them slow for long sequences.
The Transformer Revolution
The Transformer architecture, introduced in “Attention Is All You Need” (2017), revolutionized NLP by eliminating recurrence entirely and relying on attention mechanisms.
The Attention Mechanism
Self-Attention
The core innovation of Transformers is the self-attention mechanism. For each word, we compute attention scores with all other words:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where:
- $Q$ (Query) represents the current word’s representation
- $K$ (Key) represents all words’ representations for matching
- $V$ (Value) represents all words’ representations for output
- $d_k$ is the dimension of the keys
The softmax function converts these scores into probabilities:
$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
Multi-Head Attention
Single attention isn’t enough for complex patterns. Multi-head attention runs multiple attention mechanisms in parallel:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, …, \text{head}_h)W^O$$
Where each head computes: $$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
This allows the model to focus on different aspects of the input simultaneously.
Positional Encoding
Since Transformers process all words simultaneously, they need a way to understand word order. Positional encoding adds position information to word embeddings:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$ $$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Where $pos$ is the position and $i$ is the dimension index. This creates a unique encoding for each position that the model can learn to recognize.
The Transformer Architecture
A Transformer consists of an encoder and decoder, each containing multiple identical layers. Each encoder layer has two sub-layers:
- Multi-Head Self-Attention
- Feed-Forward Network
The feed-forward network applies to each position independently:
$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
Training LLMs
Loss Function
LLMs are typically trained to minimize the cross-entropy loss:
$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{i=1}^{|V|} y_{t,i} \log(\hat{y}_{t,i})$$
Where:
- $T$ is the sequence length
- $|V|$ is the vocabulary size
- $y_{t,i}$ is the true probability (1 for correct token, 0 otherwise)
- $\hat{y}_{t,i}$ is the predicted probability
Backpropagation
The model learns by computing gradients of the loss with respect to all parameters:
$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial \theta}$$
These gradients are then used to update parameters using optimization algorithms like Adam:
$$\theta_{t+1} = \theta_t - \eta \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}$$
Where $m_t$ and $v_t$ are momentum estimates, $\eta$ is the learning rate, and $\epsilon$ prevents division by zero.
Scaling Laws
One fascinating aspect of LLMs is their predictable scaling behavior. Research has shown that:
$$\text{Loss}(N, D) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + \text{const}$$
Where:
- $N$ is the model size (number of parameters)
- $D$ is the dataset size
- $\alpha_N, \alpha_D$ are scaling exponents (typically around 0.07-0.35)
- $N_c, D_c$ are critical sizes
This means we can predict performance before training!
The Math Behind Tokenization
LLMs don’t see words directly; they see tokens. Byte-Pair Encoding (BPE) is a popular tokenization method that works by:
- Starting with individual characters
- Iteratively merging the most frequent adjacent pairs
- Building vocabulary until reaching desired size
The merge frequency is calculated as:
$$\text{freq}(xy) = \sum_{i} \mathbb{1}[\text{token}i = x \land \text{token}{i+1} = y]$$
Temperature and Sampling
When generating text, LLMs use the softmax output to sample the next token. Temperature controls randomness:
$$P(x_i) = \frac{\exp(x_i / T)}{\sum_j \exp(x_j / T)}$$
- $T < 1$: More conservative, focuses on high-probability tokens
- $T = 1$: Standard sampling
- $T > 1$: More diverse, lower-probability tokens get more chance
The Future: Understanding Emergent Abilities
Recent research shows that LLMs develop “emergent abilities” at certain scales—capabilities that weren’t explicitly trained but appear spontaneously. This might be related to the mathematical structure of high-dimensional spaces:
In high dimensions, the volume concentrates at the surface of a sphere. For a unit ball in $\mathbb{R}^n$, most volume is in the thin shell at radius approximately $\sqrt{1 - 1/n}$. This geometric property might explain why large models can generalize so well.
Conclusion
Large Language Models represent a beautiful fusion of mathematics, computer science, and linguistics. From the elegance of attention mechanisms to the predictability of scaling laws, these systems demonstrate how mathematical principles can create systems that appear to “understand” language.
As we continue to develop larger and more capable models, understanding their mathematical foundations becomes increasingly important—not just for building better models, but for ensuring their safety and alignment with human values.
The mathematics of LLMs isn’t just abstract theory; it’s the key to understanding how these remarkable systems work and how we can make them better.
Further Reading
- Attention Is All You Need - The original Transformer paper
- Scaling Laws for Neural Language Models - Understanding model scaling
- Language Models are Few-Shot Learners - GPT-3 paper
This post only scratches the surface of LLM mathematics. The field continues to evolve rapidly, with new architectures and training methods emerging regularly.
