1.1. Introduction to Optimization#

1.1.1. Why Optimization?#

Optimization appears almost everywhere in science and engineering, in disciplines such as Electrical/Mechanical/Aerospace/Systems… Engineering and Computer/Physical/Data… Science. In particular, in machine learning and artificial intelligence, the main application areas of this course, optimization theory and algorithms are used in many different problems:

  • Training a linear regression model: minimizing squared error

  • Training neural networks: minimizing a cross-entropy loss

  • Aligning LLMs: optimizing reward functions under preference constraints

  • Adversarial robustness: maximizing model loss with respect to small input perturbations

These problems differ in their structure, but they all boil down to choosing parameters to minimize a loss (or maximize a reward), possibly under various constraints.

We start the lecture with several concrete examples of problem (1.1).

1.1.2. Examples#

Example 1.1 (Least Squares Regression)

One of the most fundamental optimization problems in machine learning is linear regression, where the goal is to fit a linear model to observed data.

Problem Setup. We are given a dataset \(\{(\va_i, b_i)\}_{i=1}^n\) where:

  • \(\va_i \in \RR^d\) is the feature vector for the \(i\)-th data point;

  • \(b_i \in \RR\) is the corresponding label or response;

We aim to find a weight vector \(\vw \in \RR^d\) that minimizes the average squared prediction error:

\[\min_{\vw \in \RR^d} \quad \frac{1}{2n} \sum_{i=1}^n (\vw^\top \va_i - b_i)^2.\]

This is an unconstrained, smooth, convex optimization problem. That is, it is an “easy” problem where both theory and algorithms are well-understood.

Vectorized Formulation. Let us define the matrix and vector notation:

  • \(A \in \RR^{n \times d}\): the data matrix whose \(i\)-th row is \(\va_i^\top\);

  • \(\vb \in \RR^n\): the label vector, whose \(i\)-th entry is \(b_i\);

Then the objective becomes:

\[\min_{\vw \in \RR^d} \quad \frac{1}{2n} \|A\vw - \vb\|^2.\]

This compact form makes derivation and implementation more efficient.

Closed-form Solution. To find the optimal \(\vw^*\), we take the gradient of the objective and set it to zero (we will shortly see why we should do this in Section Unconstrained Optimization):

\[\nabla_{\vw} \left( \frac{1}{2n} \|A\vw - \vb\|^2 \right) = \frac{1}{n} A^\top (A\vw - \vb) = 0.\]

Solving for \(\vw\) gives the normal equations:

\[A^\top A \vw = A^\top \vb.\]

Assuming \(A^\top A\) is invertible, the solution is:

\[\vw^* = (A^\top A)^{-1} A^\top \vb.\]

This closed-form solution is efficient when \(d\) is small and \(A^\top A\) is well-conditioned. In large-scale or ill-conditioned settings, iterative methods like gradient descent or conjugate gradient are preferred (to be discussed in Section Gradient Methods).
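To make the two solution routes concrete, here is a minimal NumPy sketch on synthetic data (the sizes and noise level are illustrative assumptions), comparing the normal-equations solution with a factorization-based least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5                                  # illustrative problem sizes
A = rng.normal(size=(n, d))                    # data matrix; row i is a_i^T
w_true = rng.normal(size=d)
b = A @ w_true + 0.01 * rng.normal(size=n)     # labels with small noise

# Normal equations: solve A^T A w = A^T b (fine when A^T A is well-conditioned)
w_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Usually preferred in practice: a solver based on orthogonal factorizations
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(w_normal, w_lstsq))          # True on this well-conditioned instance
```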

Augmented Form (with Intercept). If we wish to fit an intercept term \(w_0\) (bias), we can augment the input as follows:

Define \(\tilde{\va}_i := [\va_i^\top, 1]^\top \in \RR^{d+1}\)

Let \(\tilde{\vw} := [\vw^\top, w_0]^\top \in \RR^{d+1}\)

Then the problem becomes:

\[\min_{\tilde{\vw} \in \RR^{d+1}} \quad \frac{1}{2n} \sum_{i=1}^n (\tilde{\vw}^\top \tilde{\va}_i - b_i)^2.\]

Let \(\tilde{A} \in \RR^{n \times (d+1)}\) be the matrix with rows \(\tilde{\va}_i^\top\); then the closed-form solution becomes:

\[\tilde{\vw}^* = (\tilde{A}^\top \tilde{A})^{-1} \tilde{A}^\top \vb.\]
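In code, fitting the intercept amounts to appending a column of ones to the data matrix; here is a small sketch with toy numbers (chosen only for illustration):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 9.0]])  # toy features
b = np.array([1.0, 2.0, 3.0, 4.5])                              # toy labels

# Append a column of ones so the last coordinate of w~ plays the role of w_0
A_tilde = np.hstack([A, np.ones((A.shape[0], 1))])
w_tilde, *_ = np.linalg.lstsq(A_tilde, b, rcond=None)
w_hat, w0_hat = w_tilde[:-1], w_tilde[-1]                        # weights and intercept
```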

Some remarks about the above example are now in order:

  • Least squares regression is a convex, quadratic problem, and its solution is globally optimal (the notion of convexity will be introduced in Section Unconstrained Optimization).

  • It provides a clean entry point to gradient-based optimization and regularization techniques (e.g., ridge regression).

  • It forms the foundation for more advanced models like logistic regression, support vector machines, and neural networks.

Next, we consider a classical formulation of the classification problem, where the goal is to separate two classes of data points with a linear decision boundary.

Example 1.2 (Support Vector Machine (SVM))

Consider a binary classification task, where each data point is labeled with \(b_i \in \{-1, +1\}\) and described by a feature vector \(\va_i \in \RR^d\).

Our goal is to learn a linear classifier of the form:

\[f(\va) = \sign(\vw^\top \va + w_0)\]

where:

  • \(\vw\) is the weight vector that linearly combines the feature vector, and defines the orientation of the decision boundary,

  • \(w_0\) is the intercept (bias or offset term),

  • \(\sign(\cdot)\) is the sign function that maps outputs to \(\pm 1\).

Hard-Margin SVM Formulation. Suppose the data is linearly separable, that is, there exists a hyperplane that separates the two classes without error. If there are many hyperplanes that can separate the data, which one should we prefer? We prefer the one that maximizes the “margin”, defined as the distance between the decision boundary and the closest data points.

This leads to the following hard-margin SVM problem:

(1.2)#\[\begin{split}\begin{aligned} \min_{\vw \in \RR^d,\, w_0 \in \RR} \quad & \frac{1}{2} \|\vw\|^2 \\ \text{s.t.} \quad & b_i (\vw^\top \va_i + w_0) \ge 1, \quad \forall i = 1,\ldots,n \end{aligned}\end{split}\]

The objective \(\frac{1}{2} \|\vw\|^2\) minimizes the norm of the weight vector, which is inversely related to the margin width. The constraints ensure that all data points are classified correctly with a margin of at least 1. Below, we explain the intuition behind setting the constraint this way.

Geometric Intuition. The inequality constraint \(b_i(\vw^\top \va_i + w_0) \ge 1\) enforces:

\[\begin{split}\begin{aligned} b_i & = +1 \Rightarrow \vw^\top \va_i + w_0 \ge 1 \\ b_i & = -1 \Rightarrow \vw^\top \va_i + w_0 \le -1. \end{aligned}\end{split}\]

Hence, the points that lie on the margin boundaries satisfy:

\[\vw^\top \va + w_0 = \pm 1.\]

The distance from a point \(\va\) to the decision boundary \(\vw^\top \va + w_0 = 0\) is given by the point-to-hyperplane distance formula:

\[\frac{| \vw^\top \va + w_0 |}{\| \vw \|}.\]

The margin is therefore \(2 / \|\vw\|\), so minimizing \(\|\vw\|^2\) corresponds to maximizing the margin.
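For completeness, here is the short calculation behind the \(2/\|\vw\|\) claim. Take a point \(\va_+\) on the boundary \(\vw^\top \va + w_0 = 1\) and move along the normal direction \(\vw\) until hitting the boundary \(\vw^\top \va + w_0 = -1\):

\[\begin{split}\begin{aligned} \va_- &= \va_+ - t\,\vw \quad \text{for some } t > 0, \\ \vw^\top \va_- + w_0 = -1 \;&\Longrightarrow\; 1 - t\,\|\vw\|^2 = -1 \;\Longrightarrow\; t = \frac{2}{\|\vw\|^2}, \\ \text{margin} = \|\va_+ - \va_-\| &= t\,\|\vw\| = \frac{2}{\|\vw\|}. \end{aligned}\end{split}\]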

Why the “1” in the Constraint? In the hard-margin SVM formulation, the constraint for each data point is:

\[b_i(\vw^\top \va_i + w_0) \ge 1\]

A natural question arises: Why the constant “1”? What happens if we use another constant?

The key point is that the value “1” is arbitrary — it is a scaling convention, not a fixed physical threshold. In fact, we can multiply both sides of the inequality by any positive constant \(\alpha > 0\), and the set of feasible solutions remains the same if we scale \(\vw\) and \(w_0\) accordingly.

That is, we can consider the following set of constraints instead:

\[b_i(\alpha \vw^\top \va_i + \alpha w_0) \ge \alpha.\]

So, if we allow \(\vw\) and \(w_0\) to scale freely, the numerical value on the right-hand side can be absorbed into their magnitudes.

Then why do we fix it to 1? We fix the right-hand side to 1 for mathematical convenience and normalization:

  • It removes the ambiguity caused by arbitrary scaling of \(\vw\).

  • It ensures that the margin width becomes well-defined as \(2/\|\vw\|\).

Some remarks about the above example are now in order. First, consider the optimization structure of the problem. This is a quadratic program (QP): a convex quadratic objective with linear inequality constraints. It has a unique global minimum when the data is linearly separable. Solving it efficiently often involves Lagrangian duality; the dual SVM problem has attractive computational and theoretical properties. We will discuss duality later in the course.

Second, real-world data is often not linearly separable, and we must extend the formulation (1.2) to allow misclassifications using slack variables, which leads to the soft-margin SVM. We will talk about some of its variants in subsequent lectures.
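To make the QP structure concrete, the following is a minimal sketch that solves the hard-margin problem (1.2) with the cvxpy modeling library on a toy separable dataset; the data and dimensions are illustrative assumptions:

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data: two clusters with labels +1 / -1
A = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
b = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
w0 = cp.Variable()

# Hard-margin SVM (1.2): minimize (1/2)||w||^2 s.t. b_i (w^T a_i + w0) >= 1
constraints = [cp.multiply(b, A @ w + w0) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

print("w* =", w.value, " w0* =", w0.value,
      " margin =", 2 / np.linalg.norm(w.value))
```

The constraints that hold with equality at the optimum correspond to the support vectors, the points sitting exactly on the margin boundaries.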

Example 1.3 (Training Neural Networks)

One of the most important applications of optimization in modern machine learning is the training of neural networks — especially deep neural networks used in large-scale models like transformers and LLMs.

At the core of this task is the optimization problem:

\[\begin{aligned} \min_{\vw \in \mathcal{W}} \quad f(\vw) \end{aligned}\]

where in the above equation, we have defined:

  • \(\vw\) is the collection of model parameters (weights, biases, etc.) across all layers of the network.

  • \(f(\vw)\) is the objective function, typically defined as the sum of prediction losses over a dataset (e.g., cross-entropy or mean squared error).

  • \(\mathcal{W}\) is the feasible set — it may encode constraints such as: parameter bounds, sparsity structure, architectural constraints (e.g., weight sharing, normalization).

A Typical Supervised Learning Setup. Let us consider a simple form of the neural network optimization problem. Let \(\{(\va_i, b_i)\}_{i=1}^n\) be a training dataset:

  • \(\va_i \in \RR^d\): input feature vector (e.g., tokens, pixels)

  • \(b_i\): target label (e.g., a class, a next token, a regression value)

Let \(g(\vw; \va_i)\) denote the neural network output for input \(\va_i\) with parameters \(\vw\). The typical loss is:

\[f(\vw) = \sum_{i=1}^n \ell(g(\vw; \va_i), b_i)\]

where \(\ell(\cdot, \cdot)\) is the loss function measuring the error between prediction and label (e.g., cross-entropy for classification, squared loss for regression).

At this point, we have not specified what the function \(g\) looks like. That is fine; the key point to note is that, compared with the previous two examples, the optimization variable \(\vw\) no longer combines the features \(\va_i\) linearly. The learning problem therefore becomes nonlinear, and the loss is in general a highly complicated function of the optimization variable \(\vw\).

A 2-Layer Neural Network. Let’s consider a simple fully connected network with one hidden layer:

\[\begin{split}\begin{aligned} \vz &= \sigma(W_1 \va + \vv_1) \\ \hat{y} &= W_2 \vz + \vv_2 \end{aligned}\end{split}\]

where we have defined:

  • \(W_1, W_2\): weight matrices (part of the optimization variables to be optimized);

  • \(\vv_1, \vv_2\): bias vectors (part of the optimization variables to be optimized);

  • \(\sigma(\cdot)\): element-wise nonlinearity (e.g., ReLU);

  • \(\vw = \{W_1, \vv_1, W_2, \vv_2\}\): optimization variables.

The training objective becomes:

\[\min_{W_1, \vv_1, W_2, \vv_2} \quad \sum_{i=1}^n \ell\left( W_2 \sigma(W_1 \va_i + \vv_1) + \vv_2,\, b_i \right) \]
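To see what this objective looks like in code, below is a minimal PyTorch sketch of the 2-layer network trained by a few gradient steps on synthetic regression data; all sizes and hyperparameters are illustrative assumptions:

```python
import torch

torch.manual_seed(0)
n, d, h = 64, 10, 32                     # samples, input dim, hidden dim (assumptions)
a = torch.randn(n, d)                    # inputs a_i
b = torch.randn(n, 1)                    # regression targets b_i

# Optimization variables w = {W1, v1, W2, v2}
W1 = torch.randn(h, d, requires_grad=True)
v1 = torch.zeros(h, requires_grad=True)
W2 = torch.randn(1, h, requires_grad=True)
v2 = torch.zeros(1, requires_grad=True)

def model(x):
    z = torch.relu(x @ W1.T + v1)        # z = sigma(W1 a + v1), elementwise ReLU
    return z @ W2.T + v2                 # y_hat = W2 z + v2

opt = torch.optim.SGD([W1, v1, W2, v2], lr=1e-2)
for step in range(200):
    loss = ((model(a) - b) ** 2).sum()   # squared loss summed over the dataset
    opt.zero_grad()
    loss.backward()
    opt.step()
```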

Next, we provide an example about LLM training.

Example 1.4 (LLM training)

When training an LLM (e.g., GPT or LLaMA decoder-only transformer models), the objective is typically causal next-token prediction using the cross-entropy (CE) loss. More specifically, given an input sequence of tokens:

\[ x = [x_1, x_2, \dots, x_L] \]

The model is trained to predict the probability distribution over the next token \(x_{t+1}\), conditioned on the tokens before it (hence causal):

\[P_{\theta}( \cdot \mid x_{\leq t}) = \text{softmax}(f_\theta(x_{\leq t})) \in \mathbb{R}^{|V|}, \textrm{ for } t=1,...,L-1 \]

where:

  • \(x_{\leq t} = [x_1, x_2, \dots, x_t]\) are the previous tokens,

  • \(f_\theta(x_{\leq t})\) represents the output logits from the transformer (having trainable weights \(\theta\)),

  • \(\text{softmax}(\cdot)\) converts the logits (a vector of raw scores) into a probability distribution over the vocabulary of size \(|V|\).

More on the Softmax Operator. Given a vector of logits \(\mathbf{z} = [z_1, z_2, \dots, z_{|V|}] \in \mathbb{R}^{|V|}\), where \(|V|\) is the vocabulary size, the softmax function is defined component-wise as:

\[ \text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{|V|} e^{z_j}}, \quad \text{for } i = 1, \dots, |V|. \]

The output is a probability vector \(\mathbf{p} = \text{softmax}(\mathbf{z}) \in \mathbb{R}^{|V|} \) such that:

  • \( p_i > 0 \) for all \( i \)

  • \( \sum_{i=1}^{|V|} p_i = 1 \)

Thus, \(\mathbf{p} \in \Delta^{|V|-1} \), the probability simplex.

Suppose:

\[ \mathbf{z} = [2.0,\ 1.0,\ 0.1] \]

Then we compute:

\[\begin{split}\begin{aligned} e^{2.0} &\approx 7.39, \\ e^{1.0} &\approx 2.72, \\ e^{0.1} &\approx 1.105, \\ \text{sum} &= 7.39 + 2.72 + 1.105 \approx 11.215 \end{aligned}\end{split}\]

So the softmax operator computes:

\[\text{softmax}(\mathbf{z}) = \left[ \frac{7.39}{11.215},\ \frac{2.72}{11.215},\ \frac{1.105}{11.215} \right] \approx [0.659,\ 0.243,\ 0.098]\]
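The same computation takes a few lines of NumPy; the max-subtraction step below is a standard numerical-stability trick (softmax is invariant to shifting all logits by a constant) rather than part of the definition above:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)    # shift for numerical stability; output is unchanged
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)                 # close to the hand computation above
print(p.sum())           # 1.0: a point on the probability simplex
```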

Training Objective. So far, we know that the scalar probability of the correct next token is given by:

\[P_{\theta}( x_{t+1} \mid x_{\leq t}).\]

During training, the loss is computed using cross-entropy between the predicted distribution and the ground truth distribution, i.e. a one-hot vector for the correct token. Note that the typical cross-entropy definition is:

\[ H(q, p) = - \sum_{x \in \text{classes}} q(x) \log p(x) \]

where \(q(\cdot)\) is the true probability distribution and \(p(\cdot)\) the predicted probability distribution. In our case we have:

  • \(x \in \{0,1,2, \dots, |V|-1\}\) the token indices,

  • \(p(i) = P_\theta(x = i \mid x_{\leq t})\) is the predicted probability of token \(i\).

  • \(q(i)\) is the true probability of token \(i\):

\[\begin{split} q(i) = \begin{cases} 1 & \text{if } i = x_{t+1} \\ 0 & \text{otherwise} \end{cases} \end{split}\]

Therefore we can write the loss as:

\[ L_t(\theta) = - \sum_{i=0}^{|V|-1} q(i) \log p(i) = - \log p_{x_{t+1}} = -\log P_\theta(x_{t+1} \mid x_{\leq t}) \]

This loss is then minimized by adjusting the model’s weights using gradient-based optimizers, such as Adam.
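Below is a hedged sketch of how this per-token loss is typically computed in PyTorch for a whole sequence at once; the tiny vocabulary, random logits, and token ids are placeholders, not a real model:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, L = 8, 5                          # toy vocabulary size and sequence length
tokens = torch.randint(0, V, (L,))   # x_1, ..., x_L (placeholder token ids)
logits = torch.randn(L, V)           # stand-in for the logits f_theta(x_{<=t})

# Predict token t+1 from positions 1..L-1: shift predictions and targets by one
pred = logits[:-1]                   # predictions for x_2, ..., x_L
target = tokens[1:]                  # ground-truth next tokens

# cross_entropy applies log-softmax internally, so it computes
# L_t = -log P_theta(x_{t+1} | x_{<=t}), averaged over positions t
loss = F.cross_entropy(pred, target)
print(loss.item())
```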

Optimization Challenges. Neural network training raises several fundamental questions in optimization:

  • Existence: Does a minimizer \(\vw^*\) exist?

  • Computation: How do we find a good solution efficiently?

  • Convergence: Will gradient-based methods converge to a global or local minimum?

  • Robustness: How sensitive is the solution to noise, overfitting, and adversarial perturbations?

These, and many more related questions, are particularly challenging in large-scale settings such as LLM pretraining, where we may have trillions of tokens and billions of parameters. We will explore many topics related to neural network optimization in the subsequent Chapter LLM Training as Large-scale Optimization.

Example 1.5 (Adversarial Attacks)

Modern machine learning models — including deep neural networks — can be surprisingly fragile. Small, imperceptible perturbations to the input can cause a model to misclassify with high confidence. This phenomenon has motivated the study of adversarial attacks and the corresponding field of robust optimization.

Adversarial Attack Formulation. Suppose we have a trained model with parameters \(\vw\), and a loss function \(\ell(\vw; \va, b)\) defined over input \(\va\) and label \(b\).

An adversarial example is a perturbed version of the input:

\[\va_{\text{adv}} = \va + \vdelta\]

where \(\vdelta\) is crafted intentionally to increase the model’s loss while remaining small in magnitude. The attack is formulated as an optimization problem:

\[\begin{split}\begin{aligned} \max_{\vdelta} \quad & \ell(\vw; \va + \vdelta, b) \\ \text{s.t.} \quad & \|\vdelta\| \le \epsilon \end{aligned}\end{split}\]

where in the above we have defined:

  • \(\ell\) is the model’s loss function (e.g., quadratic loss, cross-entropy)

  • \(\vdelta\) is the adversarial perturbation

  • \(\epsilon\) is a bound on the perturbation (perturbation cannot be too large).

This problem seeks a perturbation within a ball of radius \(\epsilon\) that maximizes model error; it represents a worst-case perturbation.

A Robust Optimization Viewpoint. From a robust learning perspective, we are interested in training models that perform well under worst-case perturbations. This motivates a minimax formulation:

\[\min_{\vw} \; \max_{\|\vdelta\| \le \epsilon} \; \ell(\vw; \va + \vdelta, b)\]

This nested optimization problem describes adversarial training, where the inner maximization finds harmful perturbations, and the outer minimization adapts the model to defend against them.

Interpretation. The adversarial attack is a local search in input space for points that lie near the input data (often images) \(\va\) but cause the model to fail (by maximizing the loss). The constraint \(\|\vdelta\| \le \epsilon\) ensures imperceptibility: the perturbed input looks similar to the original one. The magnitude \(\epsilon\) controls the strength of the attack.
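As a concrete instance of the inner maximization, the following sketch implements a common one-step heuristic, the fast gradient sign method (FGSM), under an \(\ell_\infty\) constraint; the linear model and data are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, eps = 20, 0.1                           # input dim and budget epsilon (assumptions)
w = torch.randn(d)                         # a fixed, already-trained linear model
a = torch.randn(d, requires_grad=True)     # input to perturb
b = torch.tensor(1.0)                      # binary label for this toy loss

# One ascent step on the loss: delta = eps * sign(grad_a loss)
loss = F.binary_cross_entropy_with_logits(w @ a, b)
loss.backward()
delta = eps * a.grad.sign()                # maximizes loss to first order, ||delta||_inf <= eps
a_adv = (a + delta).detach()               # adversarial example a + delta
```

Iterating this step and projecting back onto the \(\epsilon\)-ball gives projected gradient ascent, a common way to approximate the inner maximization during adversarial training.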

1.1.3. Elements of an Optimization Problem#

Every optimization problem is characterized by:

  • The type of variables: continuous, discrete, mixed

  • The objective function: smooth/nonsmooth, convex/nonconvex, deterministic/stochastic

  • The constraints: equalities, inequalities, box, norm, combinatorial

  • The scale: number of variables and data points (relevant to LLMs!)

  • The desired accuracy and computation budget

Understanding these factors helps guide both the modeling and the algorithm selection process.

In the rest of this course, we will explore:

  • Core theoretical tools: optimality conditions, convexity, duality

  • Algorithms: gradient descent, Newton methods, proximal methods, ADMM

  • Modern extensions: stochastic optimization, large-scale methods, distributed training

  • Applications to LLM training: pretraining, fine-tuning, RLHF, unlearning, and personalization

Optimization is the common thread that binds these concepts; mastering it is essential for building scalable, robust, and aligned AI systems.