
G-2025-44

Zeroth-order Kronecker optimization for pretraining language models



Training language models (LMs) under tight GPU memory budgets rules out standard back-propagation and motivates zeroth-order (ZO) optimization. While ZO methods have proven effective for fine-tuning, their potential during the more memory-intensive pretraining stage has received little attention. We first revisit the singular-value spectra of layer gradients during pretraining and show that the gradient information is spread across many directions; low-rank ZO methods may therefore discard informative components. Building on this insight, we introduce KronZO, a Kronecker-structured ZO optimizer that (i) explores a full-rank search subspace with state-of-the-art storage compression and (ii) employs a criterion-driven directional update that selectively keeps only informative steps. When pretraining GPT-2 Small from scratch on OpenWebText, KronZO achieves a markedly lower training loss than all previous ZO baselines while consuming less GPU memory. Although it still trails first-order methods in final loss, KronZO substantially narrows the gap at a fraction of their memory footprint, extending ZO optimization to larger models and longer runs and paving the way for memory-efficient pretraining on commodity hardware.
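The abstract describes the two ingredients of KronZO only at a high level. The toy sketch below is an illustration of the general recipe, not the paper's method: a perturbation for a weight matrix stored as two small Kronecker factors, a two-point (SPSA-style) directional estimate obtained from two loss evaluations, and a naive accept-only-if-the-loss-drops rule standing in for the criterion-driven update mentioned above. All names, shapes, and hyperparameters (kron_perturbation, zo_step, eps, lr) are hypothetical and not taken from the report.

```python
# Minimal sketch of a Kronecker-structured zeroth-order (SPSA-style) step.
# Hypothetical illustration only; the actual KronZO factorization, scaling,
# and acceptance criterion are specified in the full report.
import torch

def kron_perturbation(m1, n1, m2, n2, device="cpu"):
    """Random perturbation for a (m1*m2) x (n1*n2) weight, stored as two
    small Kronecker factors instead of one dense matrix."""
    A = torch.randn(m1, n1, device=device)
    B = torch.randn(m2, n2, device=device)
    return A, B  # implicit E = kron(A, B); full rank whenever A and B are

def zo_step(loss_fn, W, factors, eps=1e-3, lr=1e-2):
    """Two-point ZO estimate along the Kronecker direction E = kron(A, B)."""
    A, B = factors
    E = torch.kron(A, B)                      # materialized here only for this toy
    loss_plus = loss_fn(W + eps * E)
    loss_minus = loss_fn(W - eps * E)
    g = (loss_plus - loss_minus) / (2 * eps)  # scalar directional-derivative estimate
    W_new = W - lr * g * E
    # Naive stand-in for a criterion-driven update: keep the step only if it helps.
    return W_new if loss_fn(W_new) < loss_fn(W) else W

# Toy usage: drive a 6x6 matrix toward all-ones with loss evaluations only,
# no gradients. The 6x6 perturbation is kron(2x3, 3x2).
target = torch.ones(6, 6)
loss_fn = lambda W: ((W - target) ** 2).sum()
W = torch.zeros(6, 6)
for _ in range(300):
    W = zo_step(loss_fn, W, kron_perturbation(2, 3, 3, 2))
print(f"final toy loss: {loss_fn(W).item():.3f}")
```

Storing only the two factors A and B is where the memory saving comes from: for a real layer the full perturbation kron(A, B) never needs to be kept, even though this toy materializes it for clarity.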

24 pages

This paper was revised in September 2025.

Research axis

Research applications

Document

G2544R.pdf (960 KB)