3. Introduction to Large Language Models#

(PyTorch nanoGPT code is from: karpathy/nanoGPT)

3.1. Intro#

This chapter serves as an introduction to the basic concepts behind Large Language Model (LLM) architectures and training. In the first section we introduce the basic blocks and mechanisms of the transformer architecture, on which most LLMs are based. Next, we introduce the basic ideas involved in training LLMs. In addition, this notebook provides a runnable code implementation of the nanoGPT model and a basic training pipeline.

3.2. Transformer Architecture#

Most modern LLMs are based on the decoder-only transformer architecture. Essentially, the decoder-only transformer (which we will refer to simply as the transformer) is a deep learning model built from a stack of transformer layers, with an embedding layer at the input and a Language Model (LM) head at the output.

The transformer model takes as input sequences of tokens and outputs a probability distribution over possible next tokens for each input position. Those probability distributions are then used to generate the output text of the model.

Next we will describe its components in detail and provide the corresponding code.

import math
import torch
import torch.nn as nn
from torch.nn import functional as F
from dataclasses import dataclass
import inspect
import numpy as np

3.2.1. Tokenizer#

Neural networks such as the transformer work with numbers, not text. The input text therefore needs to be converted into sequences of numbers before the transformer can process it. To do this, the text is first split into discrete units known as tokens, which constitute the model's vocabulary \(V\). Each token corresponds to a unique number, its token ID.

The module that performs this conversion is called the tokenizer. It is technically considered a preprocessing module, separate from the transformer architecture. There are a number of different approaches to tokenization, one of the most popular being Byte Pair Encoding (BPE), which performs subword tokenization by iteratively merging the most frequent pairs of symbols in the text. An important parameter of the tokenizer is the vocabulary size \(|V|\), the total number of unique tokens it can produce and the model can recognize.
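To make the merging idea concrete, below is a toy sketch of BPE-style merges (the corpus, the character-level starting point, and the number of merges are made up for illustration; the real GPT-2 tokenizer operates on bytes and uses a learned merge table):

from collections import Counter

# Toy corpus represented as symbol sequences (real BPE starts from bytes/characters)
corpus = [list("low"), list("lower"), list("lowest"), list("newer")]

def most_frequent_pair(seqs):
    pairs = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seqs, pair):
    merged = []
    for seq in seqs:
        out, i = [], 0
        while i < len(seq):
            if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])  # merge the pair into a new symbol
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

for step in range(3):  # perform 3 merges for illustration
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair} -> {corpus}")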

In short, given raw text sequences as input, the tokenizer outputs a batch of token sequences and their corresponding IDs, which normally have shape \([B, L]\). Here \(B\) is the batch size, i.e., the number of sequences (or samples) the transformer will process in parallel, and \(L\) is the number of tokens per sequence.

The batch of token ID sequences is then converted into a batch of embedding vectors (one embedding per token) by the embedding layer of the transformer.

from transformers import AutoTokenizer

# Load the GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

print("Pad token/ID:", tokenizer.pad_token,  tokenizer.convert_tokens_to_ids(tokenizer.pad_token))
print("Vocabulary size |V|:", tokenizer.vocab_size)

# Example batch of texts
texts = [
    "The cat sat on the mat",
    "A quick brown fox jumps over the lazy dog!"
]

# Tokenize the batch with truncation/padding to L=10
encoded = tokenizer(texts, return_tensors="pt", max_length=10, truncation=True, padding="max_length")

tokens = [tokenizer.convert_ids_to_tokens(ids) for ids in encoded["input_ids"]]
print("Tokens:", *tokens, sep="\n")
print("Input IDs:\n", encoded["input_ids"])
print("Attention Mask:\n", encoded["attention_mask"])
Pad token/ID: <|endoftext|> 50256
Vocabulary size |V|: 50257
Tokens:
['The', 'Ġcat', 'Ġsat', 'Ġon', 'Ġthe', 'Ġmat', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>']
['A', 'Ġquick', 'Ġbrown', 'Ġfox', 'Ġjumps', 'Ġover', 'Ġthe', 'Ġlazy', 'Ġdog', '!']
Input IDs:
 tensor([[  464,  3797,  3332,   319,   262,  2603, 50256, 50256, 50256, 50256],
        [   32,  2068,  7586, 21831, 18045,   625,   262, 16931,  3290,     0]])
Attention Mask:
 tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

3.2.2. Embedding Layer#

The input tokens are discrete and therefore not directly optimizable by gradient-based methods. Thus they need to be converted to continuous, dense vectors that capture semantic meanings and can be learned and optimized by the transformer. This is done by the embedding layer.

The embedding layer is essentially a trainable look-up table. It takes token IDs as input and outputs the corresponding rows, which are the embedding vectors for those token IDs. It has shape (\(|V|\), \(d_{\text{model}}\)), where \(d_{\text{model}}\) is the dimension of the embedding vectors. Below is an illustration of how it looks in code:

embedding.weight = [
    [e1_1, e1_2, ..., e1_d],   # embedding for token 1
    [e2_1, e2_2, ..., e2_d],   # embedding for token 2
    ...
    [en_1, en_2, ..., en_d],   # embedding for token n
]
# where n = |V| and d = d_model

Therefore, for an input tensor of shape \([B, L]\), the embedding layer outputs a tensor of shape \([B, L, d_{\text{model}}]\).

V = tokenizer.vocab_size  # |V| = 50257
d_model = 3 # Small value for illustration purposes. E.g., nanoGPT uses d_model=768

embedding = nn.Embedding(num_embeddings=V, embedding_dim=d_model)

# Token IDs from previous example. 
input_ids = encoded["input_ids"]

print("Token IDs:\n", input_ids)

embedded_vectors = embedding(input_ids)

print("Embedding vectors:\n", embedded_vectors)
print("Embedding Layer output shape:", embedded_vectors.shape) # Output shape [B=2, L=10, d_model=3]
Token IDs:
 tensor([[  464,  3797,  3332,   319,   262,  2603, 50256, 50256, 50256, 50256],
        [   32,  2068,  7586, 21831, 18045,   625,   262, 16931,  3290,     0]])
Embedding vectors:
 tensor([[[ 1.4466e+00, -6.2028e-01,  1.5941e+00],
         [-6.0530e-01, -1.3651e-01,  1.8417e+00],
         [-5.5886e-01,  4.4475e-01,  1.1693e+00],
         [-1.0221e+00, -1.5275e+00,  1.0596e+00],
         [-1.4098e-01,  4.8020e-01,  1.0143e+00],
         [-1.1285e+00, -6.4855e-01,  2.0758e-01],
         [ 2.7207e-01, -8.5640e-01, -1.1811e+00],
         [ 2.7207e-01, -8.5640e-01, -1.1811e+00],
         [ 2.7207e-01, -8.5640e-01, -1.1811e+00],
         [ 2.7207e-01, -8.5640e-01, -1.1811e+00]],

        [[-1.0824e-04, -3.3857e-01,  8.3713e-02],
         [ 5.3167e-01,  2.1249e-01,  1.9939e+00],
         [-1.2466e+00, -7.0938e-01, -4.7685e-01],
         [-6.9967e-01, -1.2469e+00,  6.3768e-01],
         [-5.4820e-01, -4.6216e-01,  2.3139e-01],
         [-9.9105e-01,  9.9384e-01, -4.1378e-01],
         [-1.4098e-01,  4.8020e-01,  1.0143e+00],
         [ 1.0743e+00,  5.0753e-01,  8.4159e-02],
         [-1.1365e-01,  1.3078e-01, -4.7219e-01],
         [ 7.4202e-01, -6.7998e-01, -6.0465e-01]]],
       grad_fn=<EmbeddingBackward0>)
Embedding Layer output shape: torch.Size([2, 10, 3])

3.2.3. Self-attention#

Self-attention is a mechanism that transforms the representation of each token in a sequence by relating it to the other tokens of the sequence. This new representation can then be used by the model to, e.g., predict the next word of the sequence.

For example, let's say we have the sentence “The cat sat on the mat”. Assume a simple word-level tokenizer and an embedding layer, which together produce one embedding vector per word:

\[ \begin{array}{c|c|c|c|c|c} \text{The} & \text{cat} & \text{sat} & \text{on} & \text{the} & \text{mat} \\ \hline e_1 & e_2 & e_3 & e_4 & e_5 & e_6 \end{array} \]

If we wanted to build a model that predicts the next word, we could simply feed each embedding vector into an MLP followed by a softmax, producing a probability distribution over the vocabulary:

\[\begin{split} \begin{array}{c} \text{Input Tokens} \quad [B, L] \\ \downarrow \\ \boxed{ \begin{array}{c} \text{Embedding Layer} \end{array} } \\ \downarrow \\ \text{Token Embeddings} \quad [B, L, d_{\text{model}}] \\ \downarrow \\ \boxed{ \begin{array}{c} \text{MLP (Feed-Forward Network)} \end{array} } \\ \downarrow \\ \text{MLP Output} \quad [B, L, |V|] \\ \downarrow \\ \boxed{ \begin{array}{c} \text{Softmax (next token probabilities)} \end{array} } \\ \downarrow \\ \text{Output Probabilities} \quad [B, L, |V|] \end{array} \end{split}\]
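As a shape check, here is a minimal sketch of that context-free baseline (the sizes below are arbitrary toy values, not nanoGPT's):

import torch
import torch.nn as nn
import torch.nn.functional as F

B, L, d_model, vocab = 2, 6, 16, 100                # toy sizes for illustration
token_ids = torch.randint(0, vocab, (B, L))         # [B, L]

embedding = nn.Embedding(vocab, d_model)
mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, vocab))

x = embedding(token_ids)           # [B, L, d_model]
logits = mlp(x)                    # [B, L, |V|]; each position is processed independently
probs = F.softmax(logits, dim=-1)  # [B, L, |V|] next-token probabilities per position

print(probs.shape)  # torch.Size([2, 6, 100])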

However, each embedding \(e_i\) currently contains information only about the particular word it embeds, regardless of the rest of the sequence. So, for example, given \(e_5\) the model would just predict the most probable word after a “the”, which is certainly not “mat”. Instead, with a self-attention layer \(e_5\) is transformed into \(e_5' = f(e_1, e_2, e_3, e_4, e_5)\), which now contains the context.

Below we describe how self-attention is computed.

3.2.4. Scaled Dot-Product Attention#

  • Assume batch size B=1

  • X is the embedding matrix (contains the embedding vectors of \(L\) tokens)

  • Assume a single attention layer after the embedding layer

Input matrix:

\[\begin{split} X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_L \end{bmatrix} \in \mathbb{R}^{L \times d_{\text{model}}} \end{split}\]

We apply learned projection matrices:

\[ W_Q \in \mathbb{R}^{d_{\text{model}} \times d_q}, \quad W_K \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_V \in \mathbb{R}^{d_{\text{model}} \times d_v} \]

To obtain the queries, keys, and values:

\[\begin{split} Q = X W_Q = \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_L \end{bmatrix} \in \mathbb{R}^{L \times d_q}, \quad K = X W_K = \begin{bmatrix} k_1 \\ k_2 \\ \vdots \\ k_L \end{bmatrix} \in \mathbb{R}^{L \times d_k}, \quad V = X W_V = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_L \end{bmatrix} \in \mathbb{R}^{L \times d_v} \end{split}\]

Attention logits matrix \(QK^T\):

\[\begin{split} QK^T \in \mathbb{R}^{L \times L} = \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_L \end{bmatrix} \cdot \begin{bmatrix} k_1^T & k_2^T & \cdots & k_L^T \end{bmatrix} = \begin{bmatrix} q_1 k_1^T & q_1 k_2^T & \cdots & q_1 k_L^T \\ q_2 k_1^T & q_2 k_2^T & \cdots & q_2 k_L^T \\ \vdots & \vdots & \ddots & \vdots \\ q_L k_1^T & q_L k_2^T & \cdots & q_L k_L^T \\ \end{bmatrix} \end{split}\]

Attention weight matrix \(A\):

\[\begin{split} A = [a_{ij}] \in \mathbb{R}^{L \times L} = \textrm{Softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) = \begin{bmatrix} \frac{\exp \left( \frac{q_1 k_1^T}{\sqrt{d_k}} \right)}{\sum\limits_{j=1}^{L} \exp \left( \frac{q_1 k_j^T}{\sqrt{d_k}} \right)} & \cdots & \frac{\exp \left( \frac{q_1 k_L^T}{\sqrt{d_k}} \right)}{\sum\limits_{j=1}^{L} \exp \left( \frac{q_1 k_j^T}{\sqrt{d_k}} \right)} \\ \vdots & \ddots & \vdots \\ \frac{\exp \left( \frac{q_L k_1^T}{\sqrt{d_k}} \right)}{\sum\limits_{j=1}^{L} \exp \left( \frac{q_L k_j^T}{\sqrt{d_k}} \right)} & \cdots & \frac{\exp \left( \frac{q_L k_L^T}{\sqrt{d_k}} \right)}{\sum\limits_{j=1}^{L} \exp \left( \frac{q_L k_j^T}{\sqrt{d_k}} \right)} \end{bmatrix} \end{split}\]

Note that \(a_{ij}\) is the attention weight of query \(q_i\) with respect to key \(k_j\) and indicates the level of attention that token \(i\) pays to token \(j\).

Attention output \(Z\):

\[\begin{split} Z \in \mathbb{R}^{L \times d_v} = \textrm{Softmax} \left( \frac{Q K^T}{\sqrt{d_k}} \right) \cdot V = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1L} \\ a_{21} & a_{22} & \cdots & a_{2L} \\ \vdots & \vdots & \ddots & \vdots \\ a_{L1} & a_{L2} & \cdots & a_{LL} \end{bmatrix} \cdot \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_L \end{bmatrix} = \begin{bmatrix} \sum\limits_{j=1}^{L} a_{1j} v_j \\ \sum\limits_{j=1}^{L} a_{2j} v_j \\ \vdots \\ \sum\limits_{j=1}^{L} a_{Lj} v_j \end{bmatrix} \end{split}\]

\(Z\) is the output of the attention layer, which weights the value vectors \(v_i\) based on the computed attention weights. Each \(z_i\) combines information from different value vectors (corresponding to different tokens) according to the attention given to each key by each query.

# Numpy implementation of dot-product attention:

# Define embedding vectors for tokens (seq_len=3 tokens, 4-dimensional embeddings)
embedding_dim = 4
X = np.array([[0.1, 0.2, 0.3, 0.4],   # Embedding for token 1
              [0.5, 0.6, 0.7, 0.8],   # Embedding for token 2
              [0.9, 1.0, 1.1, 1.2]])  # Embedding for token 3

# Define weight matrices for the transformations (W_Q, W_K, W_V)
W_Q = np.random.rand(embedding_dim, embedding_dim)  # Query weight matrix
W_K = np.random.rand(embedding_dim, embedding_dim)  # Key weight matrix
W_V = np.random.rand(embedding_dim, embedding_dim)  # Value weight matrix

# Compute the Q, K, V matrices by applying the transformations to the embeddings
Q = np.dot(X, W_Q)  # Query matrix (Q)
K = np.dot(X, W_K)  # Key matrix (K)
V = np.dot(X, W_V)  # Value matrix (V)

# Define the scaling factor (sqrt of key dimension)
d_k = K.shape[1] 
scaling_factor = np.sqrt(d_k)

# Compute the attention logits (Q K^T / sqrt(d_k))
logits = np.dot(Q, K.T) / scaling_factor

# Apply softmax to the logits to get attention weights
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)

attention_weights = softmax(logits)

# Compute the output Z (weighted sum of values)
Z = np.dot(attention_weights, V)

print("Embeddings:\n", X)
print("\nQuery Matrix (Q):\n", Q)
print("\nKey Matrix (K):\n", K)
print("\nValue Matrix (V):\n", V)
print("\nAttention Logits (Q K^T / sqrt(d_k)):\n", logits)
print("\nAttention Weights (Softmax of Logits):\n", attention_weights)
print("\nOutput Z (Weighted Sum of Values):\n", Z)
Embeddings:
 [[0.1 0.2 0.3 0.4]
 [0.5 0.6 0.7 0.8]
 [0.9 1.  1.1 1.2]]

Query Matrix (Q):
 [[0.31209963 0.53531405 0.51485473 0.68002436]
 [0.73840816 1.31008629 1.29053135 1.8029026 ]
 [1.16471669 2.08485852 2.06620796 2.92578083]]

Key Matrix (K):
 [[0.52286941 0.64123121 0.38116102 0.53391471]
 [1.23957825 1.58818893 0.94290048 1.15365426]
 [1.95628708 2.53514666 1.50463995 1.77339382]]

Value Matrix (V):
 [[0.2872904  0.47770439 0.56456457 0.39405629]
 [0.898091   1.15947937 1.43623228 0.96008464]
 [1.50889159 1.84125435 2.3079     1.52611298]]

Attention Logits (Q K^T / sqrt(d_k)):
 [[0.53288249 1.25351077 1.97413905]
 [1.34032786 3.14637407 4.95242028]
 [2.14777322 5.03923737 7.93070151]]

Attention Weights (Softmax of Logits):
 [[0.13733006 0.28231275 0.58035719]
 [0.02266042 0.13791889 0.83942069]
 [0.00290927 0.05242418 0.94466655]]

Output Z (Weighted Sum of Values):
 [[1.16869224 1.46152419 1.82240474 1.21085055]
 [1.39696866 1.71632609 2.14817585 1.4223941 ]
 [1.4733169  1.80154592 2.2571317  1.49314595]]

3.2.5. Masked (or causal) self-attention#

In practice, transformers use a version of self-attention called masked or causal self-attention. In contrast to (bidirectional) self-attention, which computes attention scores between all tokens in the sequence, masked self-attention uses a causal mask that hides future tokens, so each token can attend only to itself and earlier tokens.

Example of a causal mask:

\[\begin{split} \begin{array}{c|cccccc} & \text{The} & \text{cat} & \text{sat} & \text{on} & \text{the} & \text{mat} \\ \hline \text{The} & 1 & 0 & 0 & 0 & 0 & 0 \\ \text{cat} & 1 & 1 & 0 & 0 & 0 & 0 \\ \text{sat} & 1 & 1 & 1 & 0 & 0 & 0 \\ \text{on} & 1 & 1 & 1 & 1 & 0 & 0 \\ \text{the} & 1 & 1 & 1 & 1 & 1 & 0 \\ \text{mat}& 1 & 1 & 1 & 1 & 1 & 1 \end{array} \end{split}\]

Where 1 means the position can be attended to and 0 means it is masked out. Note that for simplicity each word represents one token. So,

  • “The” can attend only to itself.

  • “cat” can attend only to “The” and “cat”.

  • “mat” can attend to everything in this sentence.

Usually this masking is done by setting the entries \(q_i k_j^T\) with \(i < j\) to \(-\infty\) in the \(QK^T\) attention logits matrix, so that applying the softmax gives zero attention weight to those positions.

# Numpy implementation of masked self-attention:

# Define embedding vectors for tokens (seq_len=3 tokens, 4-dimensional embeddings)
embedding_dim = 4
X = np.array([[0.1, 0.2, 0.3, 0.4],   # Embedding for token 1
              [0.5, 0.6, 0.7, 0.8],   # Embedding for token 2
              [0.9, 1.0, 1.1, 1.2]])  # Embedding for token 3

# Define weight matrices for the transformations (W_Q, W_K, W_V)
W_Q = np.random.rand(embedding_dim, embedding_dim)  # Query weight matrix
W_K = np.random.rand(embedding_dim, embedding_dim)  # Key weight matrix
W_V = np.random.rand(embedding_dim, embedding_dim)  # Value weight matrix

# Compute the Q, K, V matrices by applying the transformations to the embeddings
Q = np.dot(X, W_Q)  # Query matrix (Q)
K = np.dot(X, W_K)  # Key matrix (K)
V = np.dot(X, W_V)  # Value matrix (V)

# Define the scaling factor (sqrt of key dimension)
d_k = K.shape[1] 
scaling_factor = np.sqrt(d_k)

# Compute the attention logits (Q K^T / sqrt(d_k))
logits = np.dot(Q, K.T) / scaling_factor

# Create causal mask: shape (seq_len, seq_len)
seq_len = X.shape[0]
mask = np.tril(np.ones((seq_len, seq_len)))  # Lower triangular matrix including diagonal

# Apply mask: set logits where mask==0 to very large negative value (simulate -inf)
logits_masked = np.where(mask == 1, logits, -1e9)

# Apply softmax to the masked logits to get attention weights
def softmax(x):
    e_x = np.exp(x - np.max(x, axis=1, keepdims=True))  # for numerical stability
    return e_x / np.sum(e_x, axis=1, keepdims=True)

attention_weights = softmax(logits_masked)

# Compute the output Z (weighted sum of values)
Z = np.dot(attention_weights, V)

print("Embeddings:\n", X)
print("\nQuery Matrix (Q):\n", Q)
print("\nKey Matrix (K):\n", K)
print("\nValue Matrix (V):\n", V)
print("\nAttention Logits (Q K^T / sqrt(d_k)):\n", logits)
print("\nMask (1=keep, 0=mask):\n", mask)
print("\nMasked Attention Logits:\n", logits_masked)
print("\nAttention Weights (Softmax of Masked Logits):\n", attention_weights)
print("\nOutput Z (Weighted Sum of Values):\n", Z)
Embeddings:
 [[0.1 0.2 0.3 0.4]
 [0.5 0.6 0.7 0.8]
 [0.9 1.  1.1 1.2]]

Query Matrix (Q):
 [[0.70789558 0.57123414 0.48801403 0.53045068]
 [1.77844482 1.62403702 1.26478696 1.18007677]
 [2.84899407 2.6768399  2.04155989 1.82970286]]

Key Matrix (K):
 [[0.47457039 0.54832416 0.47768742 0.59151911]
 [1.27620835 1.52578969 1.06025913 1.40333355]
 [2.07784632 2.50325522 1.64283083 2.215148  ]]

Value Matrix (V):
 [[0.40830593 0.34716112 0.41442188 0.27214191]
 [0.91654217 0.97609074 1.13588343 0.68737608]
 [1.42477841 1.60502037 1.85734498 1.10261025]]

Attention Logits (Q K^T / sqrt(d_k)):
 [[ 0.59802882  1.51841299  2.43879716]
 [ 1.51835338  3.87232416  6.22629495]
 [ 2.43867795  6.22623534 10.01379273]]

Mask (1=keep, 0=mask):
 [[1. 0. 0.]
 [1. 1. 0.]
 [1. 1. 1.]]

Masked Attention Logits:
 [[ 5.98028817e-01 -1.00000000e+09 -1.00000000e+09]
 [ 1.51835338e+00  3.87232416e+00 -1.00000000e+09]
 [ 2.43867795e+00  6.22623534e+00  1.00137927e+01]]

Attention Weights (Softmax of Masked Logits):
 [[1.00000000e+00 0.00000000e+00 0.00000000e+00]
 [8.67506706e-02 9.13249329e-01 0.00000000e+00]
 [5.01446068e-04 2.21380572e-02 9.77360497e-01]]

Output Z (Weighted Sum of Values):
 [[0.40830593 0.34716112 0.41442188 0.27214191]
 [0.87245234 0.92153068 1.07329615 0.65135424]
 [1.41301734 1.59046634 1.84064967 1.09300134]]

3.2.6. Multi-Head Attention#

\[\begin{split} \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O \quad \text{where} \quad \\ \text{head}_i = \text{Attention}(Q W_i^Q, \; K W_i^K, \; V W_i^V) \end{split}\]

where the projection matrices are: \( \quad W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}, \quad \text{and} \quad W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}} \)

Instead of computing a single attention function, multi-head attention projects the queries, keys, and values \(h\) times with different learned projection matrices, applies scaled dot-product attention to each head in parallel, concatenates the \(h\) head outputs, and projects the result back to \(d_{\text{model}}\) with \(W^O\). Typically \(d_k = d_v = d_{\text{model}} / h\), so the total cost is comparable to single-head attention, while each head can learn to attend to different aspects of the sequence. In the nanoGPT implementation below, a single linear layer (c_attn) produces \(Q\), \(K\), and \(V\) for all heads at once, and the heads are formed by reshaping the embedding dimension.
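The following sketch (toy sizes, no causal mask, chosen only for illustration) shows how the combined projection is split into per-head tensors by reshaping, which is what the forward pass of CausalSelfAttention below does:

import torch
import torch.nn as nn

B, T, d_model, n_head = 1, 5, 12, 3      # toy sizes; nanoGPT uses d_model=768, n_head=12
head_size = d_model // n_head            # 4

x = torch.randn(B, T, d_model)
c_attn = nn.Linear(d_model, 3 * d_model, bias=False)     # produces Q, K, V in one matmul

q, k, v = c_attn(x).split(d_model, dim=2)                # each (B, T, d_model)
q = q.view(B, T, n_head, head_size).transpose(1, 2)      # (B, n_head, T, head_size)
k = k.view(B, T, n_head, head_size).transpose(1, 2)
v = v.view(B, T, n_head, head_size).transpose(1, 2)

att = (q @ k.transpose(-2, -1)) / (head_size ** 0.5)     # (B, n_head, T, T): one attention map per head
att = torch.softmax(att, dim=-1)
y = att @ v                                              # (B, n_head, T, head_size)
y = y.transpose(1, 2).contiguous().view(B, T, d_model)   # concatenate heads back to d_model
print(y.shape)  # torch.Size([1, 5, 12])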

# Pytorch implementation of causal self-attention (multi-head)

class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        # flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            # causal mask to ensure that attention is only applied to the left in the input sequence
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                        .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        # ! n_embd = d_model, c_attn acts as all three W_q, W_k, W_v at once and all have output dim n_embd.
        q, k, v  = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels
            y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
        else:
            # manual implementation of attention
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y

3.2.7. LayerNorm#

Layer Normalization (LayerNorm) is used to stabilize and accelerate training by normalizing the input across features.

Given input vector \(x \in \mathbb{R}^d\) (e.g., the hidden state of one token), compute:

\[\mu = \frac{1}{d} \sum_{i=1}^{d} x_i, \ \ \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2\]

Then Layer Normalization is applied as:

\[ \textrm{LayerNorm}(x)_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma_i + \beta_i, \]

Where:

  • \(\gamma \in \mathbb{R}^d\): learnable scale

  • \(\beta \in \mathbb{R}^d\): learnable bias

Given an input \(X \in \mathbb{R}^{B \times L \times d_{\textrm{model}}}\), LayerNorm(\(X\)) is applied independently to each token's hidden state \(x_{b,t} \in \mathbb{R}^{d_{\textrm{model}}}\), unlike BatchNorm, which normalizes across the batch. Note that \(\gamma\) and \(\beta\) have dimension \(d_{\textrm{model}}\), as they are shared across all tokens and sequences.

# Pytorch implementation of Layer Normalization

class LayerNorm(nn.Module):
    """ LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False """

    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)
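As a sanity check of the formula above, the manual per-token computation below (on a random toy tensor) matches F.layer_norm with \(\gamma = 1\) and \(\beta = 0\):

import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 8)  # toy (B, L, d_model) tensor

# Manual LayerNorm over the last dimension, with gamma=1 and beta=0
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased variance, as in the formula
manual = (x - mu) / torch.sqrt(var + 1e-5)

reference = F.layer_norm(x, normalized_shape=(8,), eps=1e-5)
print(torch.allclose(manual, reference, atol=1e-6))  # True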

3.2.8. MLP Layer#

Each transformer block pairs self-attention with a position-wise MLP (feed-forward network). While self-attention mixes information across tokens, the MLP transforms each token's representation independently: it expands it to \(4 d_{\text{model}}\), applies a GELU non-linearity (a smooth alternative to ReLU), and projects back to \(d_{\text{model}}\), providing the block's non-linear per-token processing capacity. In addition, both the attention and the MLP sub-layers are wrapped in residual connections, \(x = x + \text{Sublayer}(\text{LayerNorm}(x))\), which preserve the original signal and improve gradient flow, making deep stacks of blocks easier to train.

3.2.8.1. MLP diagram:#

\[\begin{split} \begin{array}{c} \text{Input } x \quad (B \times L \times d_{\text{model}}) \\ \downarrow \\ \boxed{ \begin{array}{c} \text{Linear } (d_{\text{model}} \rightarrow 4 d_{\text{model}}) \\ (\text{c\_fc}) \end{array} } \\ \downarrow \\ \boxed{ \begin{array}{c} \text{GELU Activation} \end{array} } \\ \downarrow \\ \boxed{ \begin{array}{c} \text{Linear } (4 d_{\text{model}} \rightarrow d_{\text{model}}) \\ (\text{c\_proj}) \end{array} } \\ \downarrow \\ \boxed{ \begin{array}{c} \text{Dropout} \end{array} } \\ \downarrow \\ \text{Output } x \quad (B \times L \times d_{\text{model}}) \end{array} \end{split}\]

3.2.8.2. Block Layer diagram:#

\[\begin{split} \begin{array}{c} \text{Input } x \quad (B \times L \times d_{\text{model}}) \\ \downarrow \\ \boxed{ \begin{array}{c} \text{LayerNorm (ln\_1)} \end{array} } \\ \downarrow \\ \boxed{ \begin{array}{c} \text{Causal Self-Attention} \\ \text{(attn)} \end{array} } \\ \downarrow \\ \text{Residual: } x = x + \text{Attention Output} \\ \downarrow \\ \boxed{ \begin{array}{c} \text{LayerNorm (ln\_2)} \end{array} } \\ \downarrow \\ \boxed{ \begin{array}{c} \text{MLP (diagram above)} \end{array} } \\ \downarrow \\ \text{Residual: } x = x + \text{MLP Output} \\ \downarrow \\ \text{Output of Block } (B \times L \times d_{\text{model}}) \end{array} \end{split}\]
# Pytorch implementation of MLP Layer

class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu    = nn.GELU()
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x

class Block(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

3.2.9. Decoder-only transformer (nanoGPT)#

\[\begin{split} \begin{array}{c} \text{Input Tokens (indices)} \quad (B \times T) \\ \downarrow \\ \boxed{ \begin{array}{c} \text{Token Embedding (wte)} \end{array} } \\ + \\ \boxed{ \begin{array}{c} \text{Learned Positional Embedding (wpe)} \end{array} } \\ \downarrow \\ \text{Token Embeddings + Positional Embeddings (with Dropout)} \quad (B \times T \times d_{\text{model}}) \\ \downarrow \\ \boxed{ \begin{array}{c} \text{Block 1:} \\ x = x + \text{Causal Self-Attention}(\text{LayerNorm}(x)) \\ x = x + \text{MLP}(\text{LayerNorm}(x)) \end{array} } \\ \downarrow \\ \boxed{ \begin{array}{c} \text{Block 2:} \\ x = x + \text{Causal Self-Attention}(\text{LayerNorm}(x)) \\ x = x + \text{MLP}(\text{LayerNorm}(x)) \end{array} } \\ \downarrow \\ \vdots \\ \downarrow \\ \boxed{ \begin{array}{c} \text{Block N:} \\ x = x + \text{Causal Self-Attention}(\text{LayerNorm}(x)) \\ x = x + \text{MLP}(\text{LayerNorm}(x)) \end{array} } \\ \downarrow \\ \boxed{ \begin{array}{c} \text{Final LayerNorm (ln\_f)} \end{array} } \\ \downarrow \\ \boxed{ \begin{array}{c} \text{Linear Projection (lm\_head)} \\ \text{(weights tied with token embeddings)} \end{array} } \\ \downarrow \\ \text{Output Logits} \quad (B \times T \times |V|) \\ \downarrow \\ \boxed{ \begin{array}{c} \text{Softmax (next token probabilities)} \end{array} } \\ \downarrow \\ \text{Output Probabilities} \quad (B \times T \times |V|) \end{array} \end{split}\]
# Pytorch implementation of nanoGPT

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304 # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster


class GPT(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, bias=config.bias),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.transformer.wte.weight = self.lm_head.weight # https://paperswithcode.com/method/weight-tying

        # init all weights
        self.apply(self._init_weights)
        # apply special scaled init to the residual projections, per GPT-2 paper
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

        # report number of parameters
        print("number of parameters: %.2fM" % (self.get_num_params()/1e6,))

    def get_num_params(self, non_embedding=True):
        """
        Return the number of parameters in the model.
        For non-embedding count (default), the position embeddings get subtracted.
        The token embeddings would too, except due to the parameter sharing these
        params are actually used as weights in the final layer, so we include them.
        """
        n_params = sum(p.numel() for p in self.parameters())
        if non_embedding:
            n_params -= self.transformer.wpe.weight.numel()
        return n_params

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
        pos = torch.arange(0, t, dtype=torch.long, device=device) # shape (t)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)

        if targets is not None:
            # if we are given some desired targets also calculate the loss
            logits = self.lm_head(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-100)
        else:
            # inference-time mini-optimization: only forward the lm_head on the very last position
            logits = self.lm_head(x[:, [-1], :]) # note: using list [-1] to preserve the time dim
            loss = None

        return logits, loss
 
    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx

Positional embeddings: self-attention is permutation-invariant, so without explicit position information the model could not distinguish different orderings of the same tokens. nanoGPT therefore uses a learned positional embedding table (wpe) of shape (block_size, \(d_{\text{model}}\)): the embedding of position \(t\) is added to the token embedding at position \(t\) before the first transformer block.
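A minimal sketch (toy sizes, hypothetical tensors) of this addition, mirroring the wte/wpe lookups in GPT.forward:

import torch
import torch.nn as nn

d_model, block_size, vocab = 8, 16, 100   # toy sizes for illustration
wte = nn.Embedding(vocab, d_model)        # token embedding table
wpe = nn.Embedding(block_size, d_model)   # learned positional embedding table

idx = torch.tensor([[5, 7, 5]])                 # token 5 appears at positions 0 and 2
tok_emb = wte(idx)                              # (1, 3, d_model)
pos_emb = wpe(torch.arange(idx.size(1)))        # (3, d_model)
x = tok_emb + pos_emb                           # broadcast add: position info per token

print(torch.allclose(x[0, 0], x[0, 2]))  # False: same token, different positions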

Training and generation: during training, the targets are the input sequence shifted one position to the left, so the cross-entropy loss compares the predicted next-token distribution at each position with the token that actually follows (positions marked with -100 are ignored). During generation, the model is applied autoregressively: generate crops the context to block_size, predicts a distribution for the next token, samples from it (optionally with temperature scaling and top-k filtering), appends the sampled token to the sequence, and repeats. The next section sets up a basic training pipeline.
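Below is a minimal usage sketch of generate with a randomly initialized toy model (the configuration values are chosen only so that it runs quickly on CPU; since the model is untrained, the sampled tokens are meaningless):

# Tiny, untrained model just to demonstrate the generate() API
toy_conf = GPTConfig(block_size=32, vocab_size=50257, n_layer=2, n_head=2,
                     n_embd=32, dropout=0.0, bias=False)
toy_model = GPT(toy_conf)
toy_model.eval()

prompt = torch.tensor([[464, 3797]])  # GPT-2 token IDs for "The cat" (see tokenizer example above)
out = toy_model.generate(prompt, max_new_tokens=5, temperature=1.0, top_k=50)
print(out.shape)  # torch.Size([1, 7]): 2 prompt tokens + 5 generated tokens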

3.3. LLM Training#

from loguru import logger
import os 
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from datasets import load_dataset

from transformers import get_constant_schedule_with_warmup
from transformers import AutoTokenizer

from torch.utils.data import IterableDataset, get_worker_info
# Environment parameters:
device = "cuda:0"
workers = 4
# Data parameters:
batch_size = 64
# Tokenization parameters:
max_length = 256  # sequence length L
# Optimizer parameters
lr = 1e-3
weight_decay = 0.0 
# LR scheduler parameters
warmup_steps = 1000
# Training parameters:
num_training_steps = 100 # 10000
total_batch_size = 256
grad_clipping = 0.0 
print_freq = 10
data = load_dataset(
            "allenai/c4", "en", split="train", streaming=True
        )
val_data = load_dataset(
            "allenai/c4", "en", split="validation", streaming=True
        ) 

class PreprocessedIterableDataset(IterableDataset):
    def __init__(self, data, tokenizer, batch_size, max_length):
        super().__init__()
        self.data = data
        self.tokenizer = tokenizer
        self.batch_size = batch_size
        self.max_length = max_length

    def __iter__(self):
        iter_data = iter(self.data)

        batch = []
        for example in iter_data:
            tokenized_example = self.tokenizer(
                example["text"],
                max_length=self.max_length,
                truncation=True,
                padding="max_length",
                return_tensors="pt",
            )
            batch.append(tokenized_example)

            if len(batch) == self.batch_size:
                yield self._format_batch(batch)
                batch = []

        if batch:
            yield self._format_batch(batch)

    def _format_batch(self, batch):
        input_ids = torch.stack([item["input_ids"].squeeze(0) for item in batch])
        return input_ids    
    
# GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

dataset = PreprocessedIterableDataset(
                data, tokenizer, batch_size=batch_size, max_length=max_length
            )

dataloader = torch.utils.data.DataLoader(
    dataset, batch_size=None, num_workers=workers,
) 
# model parameters:
n_layer = 12
n_head = 12
n_embd = 768
dropout = 0.0 # for pretraining 0 is good, for finetuning try 0.1+
bias = False
block_size = 1024
vocab_size = tokenizer.vocab_size
# Note: nanoGPT's default config pads vocab_size to 50304 (the nearest multiple of 64) for efficiency; here we keep the tokenizer's vocab_size of 50257.

model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,
                  bias=bias, vocab_size=vocab_size, dropout=dropout)

gptconf = GPTConfig(**model_args)
model = GPT(gptconf).to(device)

n_total_params = sum(p.numel() for p in model.parameters())
trainable_params = [p for p in model.parameters() if p.requires_grad]
number of parameters: 123.55M
optimizer = torch.optim.Adam(
            trainable_params, lr=lr, weight_decay=weight_decay
        )

scheduler = get_constant_schedule_with_warmup(
            optimizer,
            num_warmup_steps=warmup_steps,
            last_epoch=-1,            
        )
global_step = 0
update_step = 0
tokens_seen = 0
tokens_seen_before = 0
world_size = 1

pad_idx = tokenizer.pad_token_id

gradient_accumulation = None

if  total_batch_size is not None:
    if  gradient_accumulation is None:
        assert (
            total_batch_size % world_size == 0
        ), "total_batch_size must be divisible by world_size"
        gradient_accumulation = total_batch_size // (
            batch_size * world_size
        )
        assert (
            gradient_accumulation > 0
        ), "gradient_accumulation must be greater than 0"

assert (
    gradient_accumulation * batch_size * world_size
    == total_batch_size
), "gradient_accumulation * batch_size * world_size must be equal to total_batch_size"
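# With the defaults above (total_batch_size=256, batch_size=64, world_size=1),
# gradient_accumulation = 256 // 64 = 4, so the optimizer takes one step every 4 micro-batches.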


# ##############################
# START of training loop
# ##############################

for batch_idx, batch in enumerate(dataloader):
    global_step += 1

    if update_step > num_training_steps:
        logger.info(
            f"Reached max number of update steps ({num_training_steps}). Stopping training."
        )
        break

    input_ids = batch.to(device)
    labels = input_ids.clone()
    labels[labels == pad_idx] = -100  # ignore padding positions in the loss
    tokens_seen += (input_ids != pad_idx).sum().item() * world_size

    # Next-token prediction: the logits at position t are trained to predict
    # the token at position t+1, so the targets are the inputs shifted left by one.
    logits, loss = model(input_ids[:, :-1], targets=labels[:, 1:])
    
    scaled_loss = loss / gradient_accumulation
    scaled_loss.backward()

    if global_step % gradient_accumulation != 0:
        continue

    if update_step % print_freq == 0:
        print(f"Update step: {update_step}/{num_training_steps}")

    #######
    if grad_clipping != 0.0:
        torch.nn.utils.clip_grad_norm_(trainable_params, grad_clipping)

    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
        
    update_step += 1
 
 
# ##############################
# END of training loop
# ##############################
logger.info("Training finished") 
Update step: 0/100
Update step: 10/100
Update step: 20/100
Update step: 30/100
Update step: 40/100
Update step: 50/100
Update step: 60/100
Update step: 70/100
Update step: 80/100
Update step: 90/100
2025-09-10 03:23:23.901 | INFO     | __main__:<cell line: 34>:38 - Reached max number of update steps (f100). Stopping training.
2025-09-10 03:23:24.054 | INFO     | __main__:<cell line: 74>:74 - Training finished
Update step: 100/100