5. References

[AGK+20]

Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer. Scalable second order optimization for deep learning. arXiv preprint arXiv:2002.09018, 2020.

[BN24]

Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: an anthology. arXiv preprint arXiv:2409.20325, 2024.

[Ber97]

Dimitri P. Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.

[BV04]

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[B+15]

Sébastien Bubeck and others. Convex optimization: algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.

[CLSH19]

Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of a class of Adam-type algorithms for non-convex optimization. In International Conference on Learning Representations. 2019. URL: https://openreview.net/forum?id=H1x-x309tm.

[DHS11]

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.

[DefossezBBU22]

Alexandre Défossez, Léon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of Adam and AdaGrad. Transactions on Machine Learning Research, 2022. URL: https://openreview.net/forum?id=ZPQhzTSWA7.

[GKS18]

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: preconditioned stochastic tensor optimization. In International Conference on Machine Learning, 1842–1850. PMLR, 2018.

[HSS12]

Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: overview of mini-batch gradient descent. Coursera lecture slides, 2012.

[JJB+24]

Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: an optimizer for hidden layers in neural networks. 2024. URL: https://kellerjordan.github.io/posts/muon/.

[KB15]

Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In International Conference on Learning Representations. 2015.

[Lan20]

Guanghui Lan. First-order and stochastic optimization methods for machine learning. Volume 1. Springer, 2020.

[LSY+25]

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, and others. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025.

[LH19]

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations. 2019. URL: https://openreview.net/forum?id=Bkg6RiCqY7.

[Nes18]

Yurii Nesterov. Lectures on convex optimization. Volume 137. Springer, 2018.

[RKK18]

Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations. 2018. URL: https://openreview.net/forum?id=ryQu7f-RZ.

[SLI+23]

Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. A distributed data-parallel PyTorch implementation of the distributed Shampoo optimizer for training neural networks at-scale. arXiv preprint arXiv:2309.06497, 2023.

[ZCL22]

Yushun Zhang, Congliang Chen, and Zhi-Quan Luo. Does Adam converge and when? In ICLR Blog Track. 2022. URL: https://iclr-blog-track.github.io/2022/03/25/does-adam/.

[ZCS+22]

Yushun Zhang, Congliang Chen, Naichen Shi, Ruoyu Sun, and Zhi-Quan Luo. Adam can converge without any modification on update rules. Advances in Neural Information Processing Systems, 35:28386–28399, 2022.