G-2025-44
Zeroth-order Kronecker optimization for pretraining language models
Training language models (LMs) under tight GPU memory budgets rules out standard back-propagation and motivates zeroth-order (ZO) optimization. While ZO methods have proven effective for fine-tuning, their potential during the more memory-intensive pretraining stage has received little attention. We first revisit the singular-value spectra of layer gradients during pretraining and show that gradient information is spread across many directions, so low-rank ZO methods risk discarding informative components. Building on this insight, we introduce KronZO, a Kronecker-structured ZO optimizer that (i) explores a full-rank search subspace with state-of-the-art storage compression and (ii) employs a criterion-driven directional update that selectively keeps only informative steps. When pretraining GPT-2 Small from scratch on OpenWebText, KronZO achieves a markedly lower training loss than all previous ZO baselines while consuming less GPU memory. Although it still trails first-order methods in final loss, KronZO substantially narrows the gap at a fraction of their memory footprint, extending ZO optimization to larger models and longer runs and paving the way for memory-efficient pretraining on commodity hardware.
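The abstract does not spell out KronZO's update rule, so the following is only a minimal sketch of the general idea it describes: a two-point (SPSA-style) zeroth-order estimate whose random search direction is the Kronecker product of two small Gaussian factors, plus a simple threshold that keeps only informative steps. The function names (kron_zo_step, factor_pair), factor shapes, and hyperparameters (eps, lr, keep_threshold) are illustrative assumptions, not details from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def factor_pair(d):
        # Split a dimension d into (d1, d2) with d1 * d2 == d and d1 near sqrt(d).
        d1 = int(np.sqrt(d))
        while d % d1 != 0:
            d1 -= 1
        return d1, d // d1

    def kron_zo_step(W, loss_fn, eps=1e-3, lr=1e-2, keep_threshold=0.0):
        # One hypothetical ZO update on a weight matrix W of shape (m, n).
        m, n = W.shape
        m1, m2 = factor_pair(m)
        n1, n2 = factor_pair(n)

        # Sample two small Gaussian factors instead of a dense m-by-n direction;
        # only m1*n1 + m2*n2 numbers need to be stored (or re-generated from a seed).
        A = rng.standard_normal((m1, n1))
        B = rng.standard_normal((m2, n2))
        # rank(kron(A, B)) = rank(A) * rank(B), so the direction can reach full rank;
        # the dense matrix is materialized here only for clarity.
        Z = np.kron(A, B)

        # Two-point finite-difference estimate of the directional derivative.
        g = (loss_fn(W + eps * Z) - loss_fn(W - eps * Z)) / (2.0 * eps)

        # Criterion-driven update: skip directions whose estimated signal is too weak.
        if abs(g) <= keep_threshold:
            return W
        return W - lr * g * Z

    # Toy usage: one step on a quadratic "loss".
    W = rng.standard_normal((64, 48))
    W = kron_zo_step(W, lambda M: float(np.sum(M ** 2)))

Under these assumptions, the Kronecker structure lets the perturbation reach full rank while the stored state scales with the factor sizes rather than with m * n, which is one way to read the abstract's "full-rank search subspace with state-of-the-art storage compression."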
Published in July 2025, 24 pages
This paper was revised in September 2025
Document
G2544R.pdf (960 KB)