4. LLM Training as Large-scale Optimization#
In this chapter we discuss the training of LLMs as a large-scale optimization problem. In pretraining, the model parameters \(\vw\) are fit by next-token prediction: each sample \(\xi\) is a token sequence drawn from a large text corpus, and the per-sample loss \(\ell(\vw, \xi)\) is the average cross-entropy (negative log-likelihood) of the model's next-token predictions on that sequence. Pretraining is thus naturally modeled as minimizing the expected next-token loss over the corpus distribution.
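To make the per-sample loss concrete, here is a minimal sketch of \(\ell(\vw, \xi)\) for next-token prediction. The name `model` is a placeholder for any autoregressive network (its parameters play the role of \(\vw\)), assumed to map a tensor of token ids to per-position logits; it is not a specific library API.

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # Per-sample loss ell(w, xi) for a single token sequence xi.
    # `model` stands in for any autoregressive network: it maps a
    # (seq_len,) tensor of token ids to (seq_len, vocab_size) logits.
    inputs, targets = token_ids[:-1], token_ids[1:]  # predict token t from tokens < t
    logits = model(inputs)                           # shape: (seq_len - 1, vocab_size)
    # Mean cross-entropy over positions = per-token negative log-likelihood.
    return F.cross_entropy(logits, targets)
```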
We have already discussed the stochastic optimization problem:
(4.1)#\[\min_{\vw}\ f(\vw) := \EE_{\xi\sim\mathcal{D}}[\ell(\vw,\xi)]\]
where \(\mathcal{D}\) is the data distribution. Note that for neural network training, we usually collect a finite set of training samples \(\{\xi_i\}_{i=1}^{n}\) and minimize the training loss
(4.2)#\[\min_{\vw}\ f(\vw) := \frac{1}{n}\sum_{i=1}^{n}\ell(\vw, \xi_i)\]
which is also known as the finite-sum setting of stochastic optimization.
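The sketch below contrasts the full finite-sum objective (4.2) with a single mini-batch SGD step on it; `model` and `loss_fn` are placeholders for a PyTorch `nn.Module` and a per-sample loss such as the one above. Because the batch is drawn uniformly from the training set, the mini-batch average is an unbiased estimate of \(f(\vw)\), and its gradient is an unbiased estimate of \(\nabla f(\vw)\).

```python
import torch

def train_loss(model, data, loss_fn):
    # Full finite-sum objective f(w) = (1/n) * sum_i ell(w, xi_i).
    return sum(loss_fn(model, xi) for xi in data) / len(data)

def sgd_step(model, batch, loss_fn, lr=1e-3):
    # One SGD step: average ell(w, xi) over a uniformly sampled mini-batch,
    # backpropagate, and take a gradient step w <- w - lr * grad.
    loss = sum(loss_fn(model, xi) for xi in batch) / len(batch)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad
    return loss.item()
```

In practice one never evaluates `train_loss` exactly during LLM pretraining: \(n\) is far too large, which is precisely why stochastic (mini-batch) gradients are used instead of full gradients.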