Understanding Large Language Models: Mathematics and Architecture

Introduction

Large Language Models (LLMs) have revolutionized artificial intelligence and natural language processing. From GPT to BERT, these models demonstrate remarkable capabilities in understanding and generating human language. But what makes them work? In this post, we’ll explore the mathematical foundations and architectural principles that power these sophisticated systems.

The Foundation: Neural Networks

At their core, LLMs are built upon artificial neural networks. A neural network consists of layers of interconnected neurons, each performing a simple mathematical operation:

$$y = \sigma(Wx + b)$$

Where:

$x$ is the input vector
$W$ is the weight matrix
$b$ is the bias vector
$\sigma$ is the activation function

The activation function introduces non-linearity, allowing the network to learn complex patterns. Common choices include ReLU (Rectified Linear Unit): $\sigma(x) = \max(0, x)$ and the more traditional sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$.

From RNNs to Transformers

The Problem with RNNs

Early language models used Recurrent Neural Networks (RNNs), which process sequences step by step. However, RNNs suffer from two major problems:

Vanishing Gradients: During backpropagation, gradients can become extremely small: $$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots \frac{\partial h_1}{\partial W}$$

When the terms $\frac{\partial h_t}{\partial h_{t-1}} < 1$, multiplying many of them together causes the gradient to vanish.

Sequential Processing: RNNs can’t parallelize across time steps, making them slow for long sequences.

The Transformer Revolution

The Transformer architecture, introduced in “Attention Is All You Need” (2017), revolutionized NLP by eliminating recurrence entirely and relying on attention mechanisms.

The Attention Mechanism

Self-Attention

The core innovation of Transformers is the self-attention mechanism. For each word, we compute attention scores with all other words:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:

$Q$ (Query) represents the current word’s representation
$K$ (Key) represents all words’ representations for matching
$V$ (Value) represents all words’ representations for output
$d_k$ is the dimension of the keys

The softmax function converts these scores into probabilities:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

Multi-Head Attention

Single attention isn’t enough for complex patterns. Multi-head attention runs multiple attention mechanisms in parallel:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, …, \text{head}_h)W^O$$

Where each head computes: $$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

This allows the model to focus on different aspects of the input simultaneously.

Positional Encoding

Since Transformers process all words simultaneously, they need a way to understand word order. Positional encoding adds position information to word embeddings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$ $$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Where $pos$ is the position and $i$ is the dimension index. This creates a unique encoding for each position that the model can learn to recognize.

The Transformer Architecture

A Transformer consists of an encoder and decoder, each containing multiple identical layers. Each encoder layer has two sub-layers:

Multi-Head Self-Attention
Feed-Forward Network

The feed-forward network applies to each position independently:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Training LLMs

Loss Function

LLMs are typically trained to minimize the cross-entropy loss:

$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{i=1}^{|V|} y_{t,i} \log(\hat{y}_{t,i})$$

Where:

$T$ is the sequence length
$|V|$ is the vocabulary size
$y_{t,i}$ is the true probability (1 for correct token, 0 otherwise)
$\hat{y}_{t,i}$ is the predicted probability

Backpropagation

The model learns by computing gradients of the loss with respect to all parameters:

$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial \theta}$$

These gradients are then used to update parameters using optimization algorithms like Adam:

$$\theta_{t+1} = \theta_t - \eta \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}$$

Where $m_t$ and $v_t$ are momentum estimates, $\eta$ is the learning rate, and $\epsilon$ prevents division by zero.

Scaling Laws

One fascinating aspect of LLMs is their predictable scaling behavior. Research has shown that:

$$\text{Loss}(N, D) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + \text{const}$$

Where:

$N$ is the model size (number of parameters)
$D$ is the dataset size
$\alpha_N, \alpha_D$ are scaling exponents (typically around 0.07-0.35)
$N_c, D_c$ are critical sizes

This means we can predict performance before training!

The Math Behind Tokenization

LLMs don’t see words directly; they see tokens. Byte-Pair Encoding (BPE) is a popular tokenization method that works by:

Starting with individual characters
Iteratively merging the most frequent adjacent pairs
Building vocabulary until reaching desired size

The merge frequency is calculated as:

$$\text{freq}(xy) = \sum_{i} \mathbb{1}[\text{token}i = x \land \text{token}{i+1} = y]$$

Temperature and Sampling

When generating text, LLMs use the softmax output to sample the next token. Temperature controls randomness:

$$P(x_i) = \frac{\exp(x_i / T)}{\sum_j \exp(x_j / T)}$$

$T < 1$: More conservative, focuses on high-probability tokens
$T = 1$: Standard sampling
$T > 1$: More diverse, lower-probability tokens get more chance

The Future: Understanding Emergent Abilities

Recent research shows that LLMs develop “emergent abilities” at certain scales—capabilities that weren’t explicitly trained but appear spontaneously. This might be related to the mathematical structure of high-dimensional spaces:

In high dimensions, the volume concentrates at the surface of a sphere. For a unit ball in $\mathbb{R}^n$, most volume is in the thin shell at radius approximately $\sqrt{1 - 1/n}$. This geometric property might explain why large models can generalize so well.

Conclusion

Large Language Models represent a beautiful fusion of mathematics, computer science, and linguistics. From the elegance of attention mechanisms to the predictability of scaling laws, these systems demonstrate how mathematical principles can create systems that appear to “understand” language.

As we continue to develop larger and more capable models, understanding their mathematical foundations becomes increasingly important—not just for building better models, but for ensuring their safety and alignment with human values.

The mathematics of LLMs isn’t just abstract theory; it’s the key to understanding how these remarkable systems work and how we can make them better.

Understanding Large Language Models: Mathematics and Architecture

Introduction

The Foundation: Neural Networks

From RNNs to Transformers

The Problem with RNNs

The Transformer Revolution

The Attention Mechanism

Self-Attention

Multi-Head Attention

Positional Encoding

The Transformer Architecture

Training LLMs

Loss Function

Backpropagation

Scaling Laws

The Math Behind Tokenization

Temperature and Sampling

The Future: Understanding Emergent Abilities

Conclusion

Further Reading

What is cloud?

Understanding Large …

Introduction

Render Math With Mathjax

Mathjax