3. Introduction to Large Language Models#
(Pytorch nanoGPT code is from: karpathy/nanoGPT)
3.1. Intro#
This chapter serves as an introduction to the basic concepts behind Large Language Model (LLM) architectures and training. In the first section we introduce the basic building blocks and mechanisms of the transformer architecture, on which most LLMs are based. Next, we introduce the basic ideas involved in training LLMs. In addition, this notebook provides a runnable code implementation of the nanoGPT model and a basic training pipeline.
3.2. Transformer Architecture#
Most modern LLMs are based on the decoder-only transformer architecture. Essentially, the decoder-only transformer (we will refer to it simply as the transformer) is a deep learning model built from a stack of transformer layers, with an embedding layer at the input and a Language Model (LM) head at the output.
The transformer model takes as input sequences of tokens and outputs a probability distribution over possible next tokens for each input position. Those probability distributions are then used to generate the output text of the model.
Next we will describe its components in detail and provide the corresponding code.
import math
import torch
import torch.nn as nn
from torch.nn import functional as F
from dataclasses import dataclass
import inspect
import numpy as np
3.2.1. Tokenizer#
Neural networks such as the transformer work with numbers, not text, so the input text needs to be converted to a sequence of numbers before the transformer can process it. The text is first split into discrete units known as tokens, which together constitute the model’s vocabulary \(V\). Each token corresponds to a unique number, its token ID.
The module that performs this conversion is called the tokenizer. It is technically considered a preprocessing module, separate from the transformer architecture. There are a number of different approaches to tokenization, one of the most popular being Byte Pair Encoding (BPE), which performs subword tokenization by iteratively merging the most frequent pairs of symbols in the text. An important parameter of the tokenizer is the vocabulary size \(|V|\), i.e., the total number of unique tokens it can produce and the model can recognize.
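To illustrate the core idea behind BPE, below is a small toy sketch (not the actual GPT-2 tokenizer, which additionally operates on bytes and uses a fixed set of learned merges): starting from characters, it repeatedly merges the most frequent adjacent pair of symbols in a tiny word-frequency table.

# Toy sketch of the BPE merge loop (illustrative only, not the real GPT-2 tokenizer)
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words (weighted by word frequency)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny "corpus": word -> frequency; each word starts as a sequence of characters
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for step in range(4):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(f"merge {step + 1}: {pair} -> {sorted(words)}")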
In short, given raw text sequences as input, the tokenizer outputs a batch of token sequences and their corresponding IDs, which normally have shape \([B, L]\). Here \(B\) is the batch size, i.e., the number of sequences (samples) the transformer will process in parallel, and \(L\) is the number of tokens per sequence.
The batch of token ID sequences will then be converted to a batch of embedding vectors (one embedding per token) by the embedding layer of the transformer.
from transformers import AutoTokenizer
# Load the GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
print("Pad token/ID:", tokenizer.pad_token, tokenizer.convert_tokens_to_ids(tokenizer.pad_token))
print("Vocabulary size |V|:", tokenizer.vocab_size)
# Example batch of texts
texts = [
"The cat sat on the mat",
"A quick brown fox jumps over the lazy dog!"
]
# Tokenize the batch with padding/truncation to max_length=10 (so L=10)
encoded = tokenizer(texts, return_tensors="pt", max_length=10, truncation=True, padding="max_length")
tokens = [tokenizer.convert_ids_to_tokens(ids) for ids in encoded["input_ids"]]
print("Tokens:", *tokens, sep="\n")
print("Input IDs:\n", encoded["input_ids"])
print("Attention Mask:\n", encoded["attention_mask"])
Pad token/ID: <|endoftext|> 50256
Vocabulary size |V|: 50257
Tokens:
['The', 'Ġcat', 'Ġsat', 'Ġon', 'Ġthe', 'Ġmat', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>']
['A', 'Ġquick', 'Ġbrown', 'Ġfox', 'Ġjumps', 'Ġover', 'Ġthe', 'Ġlazy', 'Ġdog', '!']
Input IDs:
tensor([[ 464, 3797, 3332, 319, 262, 2603, 50256, 50256, 50256, 50256],
[ 32, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290, 0]])
Attention Mask:
tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
3.2.2. Embedding Layer#
The input tokens are discrete and therefore not directly optimizable by gradient-based methods. Thus they need to be converted to continuous, dense vectors that capture semantic meanings and can be learned and optimized by the transformer. This is done by the embedding layer.
The embedding layer is essentially a trainable look-up table. It takes as input the token IDs and outputs the corresponding rows, which are the embedding vectors for those token IDs. Its weight matrix has shape (\(|V|\), \(d_{\text{model}}\)), where \(d_{\text{model}}\) is the dimension of the embedding vectors. Below is an example of how it looks in code:
embedding.weight = [
[e1_1, e1_2, ..., e1_d], # embedding for token 1
[e2_1, e2_2, ..., e2_d], # embedding for token 2
...
[en_1, en_2, ..., en_d], # embedding for token n
]
# where n = |V| and d = d_model
Therefore, for an input tensor of shape \([B, L]\), the embedding layer will output a tensor of shape \([B, L, d_{\text{model}}]\).
V = tokenizer.vocab_size  # |V| = 50257
d_model = 3 # Small value for illustration purposes. E.g., nanoGPT uses d_model=768
embedding = nn.Embedding(num_embeddings=V, embedding_dim=d_model)
# Token IDs from previous example.
input_ids = encoded["input_ids"]
print("Token IDs:\n", input_ids)
embedded_vectors = embedding(input_ids)
print("Embedding vectors:\n", embedded_vectors)
print("Embedding Layer output shape:", embedded_vectors.shape) # Output shape [B=2, L=10, d_model=3]
Token IDs:
tensor([[ 464, 3797, 3332, 319, 262, 2603, 50256, 50256, 50256, 50256],
[ 32, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290, 0]])
Embedding vectors:
tensor([[[ 1.4466e+00, -6.2028e-01, 1.5941e+00],
[-6.0530e-01, -1.3651e-01, 1.8417e+00],
[-5.5886e-01, 4.4475e-01, 1.1693e+00],
[-1.0221e+00, -1.5275e+00, 1.0596e+00],
[-1.4098e-01, 4.8020e-01, 1.0143e+00],
[-1.1285e+00, -6.4855e-01, 2.0758e-01],
[ 2.7207e-01, -8.5640e-01, -1.1811e+00],
[ 2.7207e-01, -8.5640e-01, -1.1811e+00],
[ 2.7207e-01, -8.5640e-01, -1.1811e+00],
[ 2.7207e-01, -8.5640e-01, -1.1811e+00]],
[[-1.0824e-04, -3.3857e-01, 8.3713e-02],
[ 5.3167e-01, 2.1249e-01, 1.9939e+00],
[-1.2466e+00, -7.0938e-01, -4.7685e-01],
[-6.9967e-01, -1.2469e+00, 6.3768e-01],
[-5.4820e-01, -4.6216e-01, 2.3139e-01],
[-9.9105e-01, 9.9384e-01, -4.1378e-01],
[-1.4098e-01, 4.8020e-01, 1.0143e+00],
[ 1.0743e+00, 5.0753e-01, 8.4159e-02],
[-1.1365e-01, 1.3078e-01, -4.7219e-01],
[ 7.4202e-01, -6.7998e-01, -6.0465e-01]]],
grad_fn=<EmbeddingBackward0>)
Embedding Layer output shape: torch.Size([2, 10, 3])
3.2.3. Self-attention#
Self-attention is a mechanism that transforms the representation of each token in a sequence by relating it to the other tokens of the sequence. This new representation can then be used by the model to, e.g., predict the next word of the sequence.
For example, let’s say we have the sentence “The cat sat on the mat”. We can assume a simple word-level tokenizer and an embedding layer, which results in one embedding vector per word:
| The | cat | sat | on | the | mat |
|---|---|---|---|---|---|
| \(e_1\) | \(e_2\) | \(e_3\) | \(e_4\) | \(e_5\) | \(e_6\) |
If we wanted to create a model that predicts the next word, we could simply use an MLP followed by a softmax that takes one of the embedding vectors above as input and outputs a probability distribution over the vocabulary, as sketched below.
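For illustration, here is a minimal sketch of such a naive predictor; the layer sizes are hypothetical and the model is untrained, so the resulting distribution is meaningless — the point is only the input/output shapes.

# Minimal sketch of the naive "MLP + softmax" next-word predictor described above
# (hypothetical sizes, untrained weights; for illustration only).
d_model, hidden = 3, 16
naive_predictor = nn.Sequential(
    nn.Linear(d_model, hidden),
    nn.ReLU(),
    nn.Linear(hidden, tokenizer.vocab_size),
)
e_5 = torch.randn(d_model)                       # stand-in for the embedding of the 5th word ("the")
probs = F.softmax(naive_predictor(e_5), dim=-1)  # probability distribution over the vocabulary
print(probs.shape)                               # torch.Size([50257])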
However, each embedding \(e_i\) currently contains information only about the particular word it embeds, regardless of the rest of the sequence. So, for example, given \(e_5\) the model would just predict the most probable word after a “the”, which is most likely not “mat”. With a self-attention layer, \(e_5\) is instead transformed into \(e_5' = f(e_1, e_2, e_3, e_4, e_5)\), which now contains the context of the preceding words.
Below we describe how self-attention is computed.
3.2.4. Scaled Dot-Product Attention#
For simplicity, assume:

- batch size \(B = 1\),
- \(X\) is the embedding matrix (it contains the embedding vectors of the \(L\) tokens of the sequence),
- a single attention layer placed directly after the embedding layer.

Input matrix:

\[
X = \begin{bmatrix} e_1^{\top} \\ e_2^{\top} \\ \vdots \\ e_L^{\top} \end{bmatrix} \in \mathbb{R}^{L \times d_{\text{model}}}
\]

We apply learned projection matrices:

\[
W^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}, \qquad W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, \qquad W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}
\]

To obtain the queries, keys, and values:

\[
Q = X W^Q, \qquad K = X W^K, \qquad V = X W^V
\]

Attention logits matrix \(QK^T \in \mathbb{R}^{L \times L}\), whose entries are the dot products \(q_i k_j^T\).

Attention weight matrix \(A\):

\[
A = \operatorname{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)
\]

where the softmax is applied row-wise and the scaling by \(\sqrt{d_k}\) keeps the logits in a numerically well-behaved range. Note that \(a_{ij}\) is the attention weight of query \(q_i\) with respect to key \(k_j\) and indicates the level of attention that token \(i\) pays to token \(j\).

Attention output \(Z\):

\[
Z = A V \in \mathbb{R}^{L \times d_v}
\]
\(Z\) is the output of the attention layer, which weights the value vectors \(v_i\) based on the computed attention weights. Each \(z_i\) combines information from different value vectors (corresponding to different tokens) according to the attention given to each key by each query.
# Numpy implementation of dot-product attention:
# Define embedding vectors for tokens (3 tokens/sq_len, 4-dimensional embeddings)
embedding_dim = 4
X = np.array([[0.1, 0.2, 0.3, 0.4], # Embedding for token 1
[0.5, 0.6, 0.7, 0.8], # Embedding for token 2
[0.9, 1.0, 1.1, 1.2]]) # Embedding for token 3
# Define weight matrices for the transformations (W_Q, W_K, W_V)
W_Q = np.random.rand(embedding_dim, embedding_dim) # Query weight matrix
W_K = np.random.rand(embedding_dim, embedding_dim) # Key weight matrix
W_V = np.random.rand(embedding_dim, embedding_dim) # Value weight matrix
# Compute the Q, K, V matrices by applying the transformations to the embeddings
Q = np.dot(X, W_Q) # Query matrix (Q)
K = np.dot(X, W_K) # Key matrix (K)
V = np.dot(X, W_V) # Value matrix (V)
# Define the scaling factor (sqrt of key dimension)
d_k = K.shape[1]
scaling_factor = np.sqrt(d_k)
# Compute the attention logits (Q K^T / sqrt(d_k))
logits = np.dot(Q, K.T) / scaling_factor
# Apply softmax to the logits to get attention weights
def softmax(x):
return np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)
attention_weights = softmax(logits)
# Compute the output Z (weighted sum of values)
Z = np.dot(attention_weights, V)
print("Embeddings:\n", X)
print("\nQuery Matrix (Q):\n", Q)
print("\nKey Matrix (K):\n", K)
print("\nValue Matrix (V):\n", V)
print("\nAttention Logits (Q K^T / sqrt(d_k)):\n", logits)
print("\nAttention Weights (Softmax of Logits):\n", attention_weights)
print("\nOutput Z (Weighted Sum of Values):\n", Z)
Embeddings:
[[0.1 0.2 0.3 0.4]
[0.5 0.6 0.7 0.8]
[0.9 1. 1.1 1.2]]
Query Matrix (Q):
[[0.31209963 0.53531405 0.51485473 0.68002436]
[0.73840816 1.31008629 1.29053135 1.8029026 ]
[1.16471669 2.08485852 2.06620796 2.92578083]]
Key Matrix (K):
[[0.52286941 0.64123121 0.38116102 0.53391471]
[1.23957825 1.58818893 0.94290048 1.15365426]
[1.95628708 2.53514666 1.50463995 1.77339382]]
Value Matrix (V):
[[0.2872904 0.47770439 0.56456457 0.39405629]
[0.898091 1.15947937 1.43623228 0.96008464]
[1.50889159 1.84125435 2.3079 1.52611298]]
Attention Logits (Q K^T / sqrt(d_k)):
[[0.53288249 1.25351077 1.97413905]
[1.34032786 3.14637407 4.95242028]
[2.14777322 5.03923737 7.93070151]]
Attention Weights (Softmax of Logits):
[[0.13733006 0.28231275 0.58035719]
[0.02266042 0.13791889 0.83942069]
[0.00290927 0.05242418 0.94466655]]
Output Z (Weighted Sum of Values):
[[1.16869224 1.46152419 1.82240474 1.21085055]
[1.39696866 1.71632609 2.14817585 1.4223941 ]
[1.4733169 1.80154592 2.2571317 1.49314595]]
3.2.5. Masked (or causal) self-attention#
In practice, transformers use a version of self-attention called masked or causal self-attention. In contrast to (bidirectional) self-attention, which computes attention scores over all tokens in the sequence, masked self-attention uses a causal mask that hides future tokens, so each token can attend only to itself and to earlier tokens.
Example of a causal mask for the sentence “The cat sat on the mat”, where rows are the attending (query) tokens and columns are the attended-to (key) tokens:

|  | The | cat | sat | on | the | mat |
|---|---|---|---|---|---|---|
| The | 1 | 0 | 0 | 0 | 0 | 0 |
| cat | 1 | 1 | 0 | 0 | 0 | 0 |
| sat | 1 | 1 | 1 | 0 | 0 | 0 |
| on | 1 | 1 | 1 | 1 | 0 | 0 |
| the | 1 | 1 | 1 | 1 | 1 | 0 |
| mat | 1 | 1 | 1 | 1 | 1 | 1 |

Here 1 means the token in the row can attend to the token in the column, and 0 means that position is masked out. Note that for simplicity each word represents one token. So,
“The” can attend only to itself.
“cat” can attend only to “The” and “cat”.
“mat” can attend to everything in this sentence.
In practice, this masking is done by setting the entries \(q_i k_j^T\) with \(j > i\) (i.e., future positions) to \(-\infty\) in the attention logits matrix \(QK^T\), so that the softmax assigns zero attention weight to those positions.
# Numpy implementation of masked self-attention:
# Define embedding vectors for tokens (3 tokens/sq_len, 4-dimensional embeddings)
embedding_dim = 4
X = np.array([[0.1, 0.2, 0.3, 0.4], # Embedding for token 1
[0.5, 0.6, 0.7, 0.8], # Embedding for token 2
[0.9, 1.0, 1.1, 1.2]]) # Embedding for token 3
# Define weight matrices for the transformations (W_Q, W_K, W_V)
W_Q = np.random.rand(embedding_dim, embedding_dim) # Query weight matrix
W_K = np.random.rand(embedding_dim, embedding_dim) # Key weight matrix
W_V = np.random.rand(embedding_dim, embedding_dim) # Value weight matrix
# Compute the Q, K, V matrices by applying the transformations to the embeddings
Q = np.dot(X, W_Q) # Query matrix (Q)
K = np.dot(X, W_K) # Key matrix (K)
V = np.dot(X, W_V) # Value matrix (V)
# Define the scaling factor (sqrt of key dimension)
d_k = K.shape[1]
scaling_factor = np.sqrt(d_k)
# Compute the attention logits (Q K^T / sqrt(d_k))
logits = np.dot(Q, K.T) / scaling_factor
# Create causal mask: shape (seq_len, seq_len)
seq_len = X.shape[0]
mask = np.tril(np.ones((seq_len, seq_len))) # Lower triangular matrix including diagonal
# Apply mask: set logits where mask==0 to very large negative value (simulate -inf)
logits_masked = np.where(mask == 1, logits, -1e9)
# Apply softmax to the masked logits to get attention weights
def softmax(x):
e_x = np.exp(x - np.max(x, axis=1, keepdims=True)) # for numerical stability
return e_x / np.sum(e_x, axis=1, keepdims=True)
attention_weights = softmax(logits_masked)
# Compute the output Z (weighted sum of values)
Z = np.dot(attention_weights, V)
print("Embeddings:\n", X)
print("\nQuery Matrix (Q):\n", Q)
print("\nKey Matrix (K):\n", K)
print("\nValue Matrix (V):\n", V)
print("\nAttention Logits (Q K^T / sqrt(d_k)):\n", logits)
print("\nMask (1=keep, 0=mask):\n", mask)
print("\nMasked Attention Logits:\n", logits_masked)
print("\nAttention Weights (Softmax of Masked Logits):\n", attention_weights)
print("\nOutput Z (Weighted Sum of Values):\n", Z)
Embeddings:
[[0.1 0.2 0.3 0.4]
[0.5 0.6 0.7 0.8]
[0.9 1. 1.1 1.2]]
Query Matrix (Q):
[[0.70789558 0.57123414 0.48801403 0.53045068]
[1.77844482 1.62403702 1.26478696 1.18007677]
[2.84899407 2.6768399 2.04155989 1.82970286]]
Key Matrix (K):
[[0.47457039 0.54832416 0.47768742 0.59151911]
[1.27620835 1.52578969 1.06025913 1.40333355]
[2.07784632 2.50325522 1.64283083 2.215148 ]]
Value Matrix (V):
[[0.40830593 0.34716112 0.41442188 0.27214191]
[0.91654217 0.97609074 1.13588343 0.68737608]
[1.42477841 1.60502037 1.85734498 1.10261025]]
Attention Logits (Q K^T / sqrt(d_k)):
[[ 0.59802882 1.51841299 2.43879716]
[ 1.51835338 3.87232416 6.22629495]
[ 2.43867795 6.22623534 10.01379273]]
Mask (1=keep, 0=mask):
[[1. 0. 0.]
[1. 1. 0.]
[1. 1. 1.]]
Masked Attention Logits:
[[ 5.98028817e-01 -1.00000000e+09 -1.00000000e+09]
[ 1.51835338e+00 3.87232416e+00 -1.00000000e+09]
[ 2.43867795e+00 6.22623534e+00 1.00137927e+01]]
Attention Weights (Softmax of Masked Logits):
[[1.00000000e+00 0.00000000e+00 0.00000000e+00]
[8.67506706e-02 9.13249329e-01 0.00000000e+00]
[5.01446068e-04 2.21380572e-02 9.77360497e-01]]
Output Z (Weighted Sum of Values):
[[0.40830593 0.34716112 0.41442188 0.27214191]
[0.87245234 0.92153068 1.07329615 0.65135424]
[1.41301734 1.59046634 1.84064967 1.09300134]]
3.2.6. Multi-Head Attention#
\[
\operatorname{MultiHead}(Q, K, V) = \operatorname{Concat}(\operatorname{head}_1, \ldots, \operatorname{head}_h)\, W^O, \qquad \operatorname{head}_i = \operatorname{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
\]

where the projection matrices are: \( \quad W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}, \quad \text{and} \quad W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}} \), and for self-attention \(Q = K = V = X\).
Instead of computing a single attention function over the full \(d_{\text{model}}\)-dimensional queries, keys, and values, multi-head attention projects them \(h\) times with different learned projections to \(d_k\) (and \(d_v\)) dimensions, applies scaled dot-product attention to each projection in parallel, concatenates the \(h\) head outputs, and projects the result back to \(d_{\text{model}}\) with \(W^O\). Typically \(d_k = d_v = d_{\text{model}} / h\), so the total amount of computation stays close to that of single-head attention, while each head can learn to attend to different aspects of the sequence (e.g., different positions or relations).
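To make the per-head projections of the formula above explicit, here is a small NumPy sketch with \(h = 2\) heads and \(d_k = d_v = d_{\text{model}}/h\), reusing the matrix X and the softmax function from the previous examples (the causal mask is omitted here for brevity; it would be applied to each head’s logits exactly as before). The fused nanoGPT implementation below is equivalent, but computes the projections for all heads with a single batched matrix multiplication.

# NumPy sketch of multi-head attention with explicit per-head projections (no causal mask here).
h = 2                           # number of heads
d_model = X.shape[1]            # 4 (from the example above)
d_k = d_v = d_model // h        # per-head dimension

np.random.seed(0)
heads = []
for i in range(h):
    W_Qi = np.random.rand(d_model, d_k)          # W_i^Q
    W_Ki = np.random.rand(d_model, d_k)          # W_i^K
    W_Vi = np.random.rand(d_model, d_v)          # W_i^V
    Q_i, K_i, V_i = X @ W_Qi, X @ W_Ki, X @ W_Vi
    A_i = softmax(Q_i @ K_i.T / np.sqrt(d_k))    # per-head attention weights, shape (L, L)
    heads.append(A_i @ V_i)                      # head_i, shape (L, d_v)

W_O = np.random.rand(h * d_v, d_model)           # output projection W^O
multi_head_output = np.concatenate(heads, axis=1) @ W_O
print("Multi-head output shape:", multi_head_output.shape)   # (L, d_model) = (3, 4)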
# Pytorch implementation of causal self-attention (multi-head)
class CausalSelfAttention(nn.Module):
def __init__(self, config):
super().__init__()
assert config.n_embd % config.n_head == 0
# key, query, value projections for all heads, but in a batch
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
# output projection
self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
# regularization
self.attn_dropout = nn.Dropout(config.dropout)
self.resid_dropout = nn.Dropout(config.dropout)
self.n_head = config.n_head
self.n_embd = config.n_embd
self.dropout = config.dropout
# flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
if not self.flash:
print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
# causal mask to ensure that attention is only applied to the left in the input sequence
self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
.view(1, 1, config.block_size, config.block_size))
def forward(self, x):
B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)
# calculate query, key, values for all heads in batch and move head forward to be the batch dim
# ! n_embd = d_model, c_attn acts as all three W_q, W_k, W_v at once and all have output dim n_embd.
q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
# causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
if self.flash:
# efficient attention using Flash Attention CUDA kernels
y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
else:
# manual implementation of attention
att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
att = F.softmax(att, dim=-1)
att = self.attn_dropout(att)
y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side
# output projection
y = self.resid_dropout(self.c_proj(y))
return y
3.2.7. LayerNorm#
Layer Normalization (LayerNorm) is used to stabilize and accelerate training by normalizing the input across features.
Given an input vector \(x \in \mathbb{R}^d\) (e.g., the hidden state of one token), compute its mean and variance:

\[
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2
\]

Then Layer Normalization is applied as:

\[
\operatorname{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
\]

Where:

- \(\gamma \in \mathbb{R}^d\): learnable scale
- \(\beta \in \mathbb{R}^d\): learnable bias
- \(\epsilon\): a small constant for numerical stability (\(10^{-5}\) in the code below)
Given an input \(X \in \mathbb{R}^{B \times L \times d_{\textrm{model}}}\), LayerNorm(\(X\)) is applied independently to each token’s hidden state \(x_{b,t} \in \mathbb{R}^{d_{\textrm{model}}}\), unlike BatchNorm, which normalizes across the batch. Note that \(\gamma\) and \(\beta\) keep dimension \(d_{\textrm{model}}\), as they are shared across all tokens and sequences.
# Pytorch implementation of Layer Normalization
class LayerNorm(nn.Module):
""" LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False """
def __init__(self, ndim, bias):
super().__init__()
self.weight = nn.Parameter(torch.ones(ndim))
self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
def forward(self, input):
return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)
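A quick sanity check of the behaviour described above, using the LayerNorm module just defined on a random tensor (the values are purely illustrative): after normalization, each token’s hidden state has approximately zero mean and unit standard deviation across the \(d_{\text{model}}\) dimension.

# Per-token normalization check on a random [B, L, d_model] tensor (illustrative values)
ln = LayerNorm(ndim=8, bias=True)
x = torch.randn(2, 5, 8) * 3.0 + 1.0            # arbitrary scale and shift
y = ln(x)
print(y.shape)                                   # torch.Size([2, 5, 8])
print(y.mean(dim=-1).abs().max())                # ~0: zero mean per token
print(y.std(dim=-1, unbiased=False).mean())      # ~1: unit std per token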
3.2.8. MLP Layer#
The attention sub-layer mixes information across tokens; the MLP (feed-forward) sub-layer then transforms each token’s representation independently. In GPT-2/nanoGPT it expands the hidden state to \(4 \times d_{\text{model}}\), applies a GELU non-linearity (a smooth alternative to ReLU used in GPT-style models), and projects back to \(d_{\text{model}}\). In the Block below, both the attention and the MLP sub-layers are preceded by LayerNorm and wrapped with residual (skip) connections, x = x + sublayer(LayerNorm(x)), which keep gradients flowing through deep stacks of blocks and let each block learn an incremental update to the token representations.
3.2.8.1. MLP diagram:#
3.2.8.2. Block Layer diagram:#
# Pytorch implementation of MLP Layer
class MLP(nn.Module):
def __init__(self, config):
super().__init__()
self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
self.gelu = nn.GELU()
self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
self.dropout = nn.Dropout(config.dropout)
def forward(self, x):
x = self.c_fc(x)
x = self.gelu(x)
x = self.c_proj(x)
x = self.dropout(x)
return x
class Block(nn.Module):
def __init__(self, config):
super().__init__()
self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
self.attn = CausalSelfAttention(config)
self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
self.mlp = MLP(config)
def forward(self, x):
x = x + self.attn(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))
return x
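As a quick usage example with a small, hypothetical configuration (the real hyperparameters are defined in GPTConfig below): a Block maps a \([B, L, d_{\text{model}}]\) tensor to a tensor of the same shape, which is what allows blocks to be stacked, and the x + ... additions in forward are the residual connections around the attention and MLP sub-layers.

# Shape check for a single transformer Block with a small, hypothetical config
from types import SimpleNamespace

toy_config = SimpleNamespace(n_embd=8, n_head=2, dropout=0.0, bias=False, block_size=16)
toy_block = Block(toy_config)
x = torch.randn(2, 10, toy_config.n_embd)   # [B=2, L=10, d_model=8]
print(toy_block(x).shape)                   # torch.Size([2, 10, 8]) -- shape preserved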
3.2.9. Decoder-only transformer (nanoGPT)#
# Pytorch implementation of nanoGPT
@dataclass
class GPTConfig:
block_size: int = 1024
vocab_size: int = 50304 # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
n_layer: int = 12
n_head: int = 12
n_embd: int = 768
dropout: float = 0.0
bias: bool = True # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
class GPT(nn.Module):
def __init__(self, config):
super().__init__()
assert config.vocab_size is not None
assert config.block_size is not None
self.config = config
self.transformer = nn.ModuleDict(dict(
wte = nn.Embedding(config.vocab_size, config.n_embd),
wpe = nn.Embedding(config.block_size, config.n_embd),
drop = nn.Dropout(config.dropout),
h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
ln_f = LayerNorm(config.n_embd, bias=config.bias),
))
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
self.transformer.wte.weight = self.lm_head.weight # https://paperswithcode.com/method/weight-tying
# init all weights
self.apply(self._init_weights)
# apply special scaled init to the residual projections, per GPT-2 paper
for pn, p in self.named_parameters():
if pn.endswith('c_proj.weight'):
torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))
# report number of parameters
print("number of parameters: %.2fM" % (self.get_num_params()/1e6,))
def get_num_params(self, non_embedding=True):
"""
Return the number of parameters in the model.
For non-embedding count (default), the position embeddings get subtracted.
The token embeddings would too, except due to the parameter sharing these
params are actually used as weights in the final layer, so we include them.
"""
n_params = sum(p.numel() for p in self.parameters())
if non_embedding:
n_params -= self.transformer.wpe.weight.numel()
return n_params
def _init_weights(self, module):
if isinstance(module, nn.Linear):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
if module.bias is not None:
torch.nn.init.zeros_(module.bias)
elif isinstance(module, nn.Embedding):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
def forward(self, idx, targets=None):
device = idx.device
b, t = idx.size()
assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
pos = torch.arange(0, t, dtype=torch.long, device=device) # shape (t)
# forward the GPT model itself
tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
pos_emb = self.transformer.wpe(pos) # position embeddings of shape (t, n_embd)
x = self.transformer.drop(tok_emb + pos_emb)
for block in self.transformer.h:
x = block(x)
x = self.transformer.ln_f(x)
if targets is not None:
# if we are given some desired targets also calculate the loss
logits = self.lm_head(x)
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-100)
else:
# inference-time mini-optimization: only forward the lm_head on the very last position
logits = self.lm_head(x[:, [-1], :]) # note: using list [-1] to preserve the time dim
loss = None
return logits, loss
@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
"""
Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
the sequence max_new_tokens times, feeding the predictions back into the model each time.
Most likely you'll want to make sure to be in model.eval() mode of operation for this.
"""
for _ in range(max_new_tokens):
# if the sequence context is growing too long we must crop it at block_size
idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
# forward the model to get the logits for the index in the sequence
logits, _ = self(idx_cond)
# pluck the logits at the final step and scale by desired temperature
logits = logits[:, -1, :] / temperature
# optionally crop the logits to only the top k options
if top_k is not None:
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
logits[logits < v[:, [-1]]] = -float('Inf')
# apply softmax to convert logits to (normalized) probabilities
probs = F.softmax(logits, dim=-1)
# sample from the distribution
idx_next = torch.multinomial(probs, num_samples=1)
# append sampled index to the running sequence and continue
idx = torch.cat((idx, idx_next), dim=1)
return idx
Positional embeddings: the attention and MLP computations by themselves do not tell the model where in the sequence each token occurs, so nanoGPT adds a learned position embedding to every token embedding. In the code above, wpe is an nn.Embedding with block_size rows and n_embd columns; in forward, the position indices 0, ..., t-1 are looked up and the resulting (t, n_embd) tensor is broadcast-added to the token embeddings before the first Block. This is also why block_size is the maximum sequence length the model can process.
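Below is a minimal illustration of this addition with small, hypothetical sizes (not the trained model): wpe_demo plays the role of transformer.wpe, and the position embeddings are broadcast-added over the batch dimension.

# Minimal illustration of learned positional embeddings (hypothetical sizes)
block_size_demo, d_model_demo, L_demo = 16, 3, 10
wpe_demo = nn.Embedding(block_size_demo, d_model_demo)   # like transformer.wpe above
tok_emb = torch.randn(2, L_demo, d_model_demo)           # stand-in token embeddings [B, L, d_model]
pos = torch.arange(0, L_demo)                            # position indices 0 .. L-1
pos_emb = wpe_demo(pos)                                  # [L, d_model]
x = tok_emb + pos_emb                                    # broadcast add over the batch dimension
print(x.shape)                                           # torch.Size([2, 10, 3])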
Training and generation: during training (next section), the model receives a batch of token IDs and is optimized with a cross-entropy loss to predict, at every position, the token that follows it (next-token prediction); positions labelled -100 (e.g., padding) are ignored by the loss. During generation, generate repeatedly feeds the current sequence through the model, takes the logits at the last position, applies temperature scaling and (optionally) top-k filtering, samples the next token from the resulting softmax distribution, and appends it to the sequence.
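As a quick illustration of the generation loop, the sketch below instantiates a small, randomly initialized GPT (hypothetical configuration) and completes a prompt; since the model is untrained, the continuation will be gibberish, but it shows how generate and the tokenizer fit together.

# Generation demo with a small, randomly initialized GPT (hypothetical config; untrained => gibberish)
demo_config = GPTConfig(block_size=64, vocab_size=tokenizer.vocab_size,
                        n_layer=2, n_head=2, n_embd=64, dropout=0.0, bias=False)
demo_model = GPT(demo_config)
demo_model.eval()

prompt_ids = tokenizer("The cat sat on", return_tensors="pt")["input_ids"]
out_ids = demo_model.generate(prompt_ids, max_new_tokens=5, temperature=1.0, top_k=10)
print(tokenizer.decode(out_ids[0]))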
3.3. LLM Training#
from loguru import logger
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from datasets import load_dataset
from transformers import get_constant_schedule_with_warmup
from transformers import AutoTokenizer
from torch.utils.data import IterableDataset, get_worker_info
# Environment parameters:
device = f"cuda:0"
workers = 4
# Data parameters:
batch_size = 64
# Tokenization parameters:
max_length = 256 # sequence length L
# Optimizer parameters
lr = 1e-3
weight_decay = 0.0
# LR scheduler parameters
warmup_steps = 1000
# Training parameters:
num_training_steps = 100 # 10000
total_batch_size = 256
grad_clipping = 0.0
print_freq = 10
data = load_dataset(
"allenai/c4", "en", split="train", streaming=True
)
val_data = load_dataset(
"allenai/c4", "en", split="validation", streaming=True
)
class PreprocessedIterableDataset(IterableDataset):
def __init__(self, data, tokenizer, batch_size, max_length):
super().__init__()
self.data = data
self.tokenizer = tokenizer
self.batch_size = batch_size
self.max_length = max_length
def __iter__(self):
iter_data = iter(self.data)
batch = []
for example in iter_data:
tokenized_example = self.tokenizer(
example["text"],
max_length=self.max_length,
truncation=True,
padding="max_length",
return_tensors="pt",
)
batch.append(tokenized_example)
if len(batch) == self.batch_size:
yield self._format_batch(batch)
batch = []
if batch:
yield self._format_batch(batch)
def _format_batch(self, batch):
input_ids = torch.stack([item["input_ids"].squeeze(0) for item in batch])
return input_ids
# GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
dataset = PreprocessedIterableDataset(
data, tokenizer, batch_size=batch_size, max_length=max_length
)
dataloader = torch.utils.data.DataLoader(
dataset, batch_size=None, num_workers=workers,
)
# model parameters:
n_layer = 12
n_head = 12
n_embd = 768
dropout = 0.0 # for pretraining 0 is good, for finetuning try 0.1+
bias = False
block_size = 1024
vocab_size = tokenizer.vocab_size
# ??? " vocab_size = 50304 # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency"
model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,
bias=bias, vocab_size=vocab_size, dropout=dropout)
gptconf = GPTConfig(**model_args)
model = GPT(gptconf).to(device)
n_total_params = sum(p.numel() for p in model.parameters())
trainable_params = [p for p in model.parameters() if p.requires_grad]
number of parameters: 123.55M
optimizer = torch.optim.Adam(
trainable_params, lr=lr, weight_decay=weight_decay
)
scheduler = get_constant_schedule_with_warmup(
optimizer,
num_warmup_steps=warmup_steps,
last_epoch=-1,
)
global_step = 0
update_step = 0
tokens_seen = 0
tokens_seen_before = 0
world_size = 1
pad_idx = tokenizer.pad_token_id
gradient_accumulation = None
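# Gradient accumulation: gradients from several consecutive mini-batches are accumulated
# before each optimizer update, so the effective batch size per update is
# gradient_accumulation * batch_size * world_size = total_batch_size.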
if total_batch_size is not None:
if gradient_accumulation is None:
assert (
total_batch_size % world_size == 0
), "total_batch_size must be divisible by world_size"
gradient_accumulation = total_batch_size // (
batch_size * world_size
)
assert (
gradient_accumulation > 0
), "gradient_accumulation must be greater than 0"
assert (
gradient_accumulation * batch_size * world_size
== total_batch_size
), "gradient_accumulation * batch_size * world_size must be equal to total_batch_size"
# ##############################
# START of training loop
# ##############################
for batch_idx, batch in enumerate(dataloader):
global_step += 1
if update_step > num_training_steps:
logger.info(
f"Reached max number of update steps (f{num_training_steps}). Stopping training."
)
break
input_ids = batch.to(device)
    labels = input_ids.clone()
    labels[labels == pad_idx] = -100  # ignore padding positions in the loss
    # Shift labels left by one so the logits at position i are trained to predict token i+1
    # (next-token prediction); the last position has no next token and is ignored.
    labels = torch.cat([labels[:, 1:], torch.full_like(labels[:, :1], -100)], dim=1)
tokens_seen += (input_ids != pad_idx).sum().item() * world_size
logits, loss = model(input_ids, targets=labels)
scaled_loss = loss / gradient_accumulation
scaled_loss.backward()
if global_step % gradient_accumulation != 0:
continue
if update_step % print_freq == 0:
print(f"Update step: {update_step}/{num_training_steps}")
#######
if grad_clipping != 0.0:
torch.nn.utils.clip_grad_norm_(trainable_params, grad_clipping)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
update_step += 1
# ##############################
# END of training loop
# ##############################
logger.info("Training finished")
Update step: 0/100
Update step: 10/100
Update step: 20/100
Update step: 30/100
Update step: 40/100
Update step: 50/100
Update step: 60/100
Update step: 70/100
Update step: 80/100
Update step: 90/100
2025-09-10 03:23:23.901 | INFO | __main__:<cell line: 34>:38 - Reached max number of update steps (f100). Stopping training.
2025-09-10 03:23:24.054 | INFO | __main__:<cell line: 74>:74 - Training finished
Update step: 100/100
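The validation split loaded above is not used in the training loop; below is a minimal sketch (reusing the objects defined above, with an arbitrary number of batches) of how one could estimate the validation loss with it.

# Minimal sketch: estimate validation loss on a few batches of the streaming C4 validation split
val_dataset = PreprocessedIterableDataset(val_data, tokenizer, batch_size=batch_size, max_length=max_length)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=None, num_workers=workers)

model.eval()
val_losses = []
with torch.no_grad():
    for i, val_batch in enumerate(val_loader):
        if i >= 10:  # arbitrary number of batches for a quick estimate
            break
        input_ids = val_batch.to(device)
        labels = input_ids.clone()
        labels[labels == pad_idx] = -100
        # same next-token shift as in the training loop
        labels = torch.cat([labels[:, 1:], torch.full_like(labels[:, :1], -100)], dim=1)
        _, val_loss = model(input_ids, targets=labels)
        val_losses.append(val_loss.item())
model.train()
print(f"Approximate validation loss: {sum(val_losses) / len(val_losses):.4f}")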