
# 4. LLM Training as Large-scale Optimization

In this chapter we discuss the training of LLMs as a large-scale optimization problem. LLM pretraining is naturally cast as stochastic optimization: a sample \(\xi\) is a token sequence drawn from the text corpus, and the loss \(\ell(\vw,\xi)\) is the average next-token cross-entropy (negative log-likelihood) that the model with parameters \(\vw\) assigns to the sequence. A code sketch of this per-sequence loss is given below.
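The following is a minimal sketch of the per-sequence next-token loss, assuming a generic autoregressive `model` that maps a batch of token ids to next-token logits of shape `[batch, length, vocab_size]`; the name `next_token_loss` is ours, chosen for illustration.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model: torch.nn.Module, xi: torch.Tensor) -> torch.Tensor:
    """Per-sample loss ℓ(w, ξ): average next-token cross-entropy of
    the token sequence `xi` (a 1-D LongTensor of shape [T])."""
    inputs, targets = xi[:-1], xi[1:]      # predict token t+1 from tokens up to t
    logits = model(inputs.unsqueeze(0))    # assumed shape: [1, T-1, vocab_size]
    return F.cross_entropy(logits.squeeze(0), targets)
```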

We have already discussed the stochastic optimization problem:

\[\min_{\vw}\ f(\vw) := \EE_{\xi\sim\mathcal{D}}[\ell(\vw,\xi)] \tag{4.1}\]

where \(\mathcal{D}\) is the data distribution. Note that for neural network training, we usually collect a training set \(\{\xi_i\}_{i=1}^{n}\) and minimize the training loss

\[\min_{\vw}\ f(\vw) := \frac{1}{n}\sum_{i=1}^{n}\ell(\vw, \xi_i) \tag{4.2}\]

which is also known as the finite-sum setting of stochastic optimization.
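As a sketch of how (4.2) is optimized in practice (the function `sgd` is ours, not from a particular library, and it reuses the hypothetical `next_token_loss` above), here is a mini-batch SGD loop; the gradient of the mini-batch average is an unbiased estimate of the gradient of the full training loss.

```python
import random
import torch

def sgd(model, data, lr=1e-3, batch_size=32, steps=1000):
    """Mini-batch SGD on the finite-sum objective (4.2), where `data`
    is a list of n token sequences {ξ_i}."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        # Sample a mini-batch of ξ_i uniformly at random.
        batch = random.sample(data, batch_size)
        loss = torch.stack([next_token_loss(model, xi) for xi in batch]).mean()
        opt.zero_grad()
        loss.backward()   # stochastic gradient: unbiased estimate of ∇f(w)
        opt.step()
```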