
G-2025-44

Zeroth-order Kronecker optimization for pretraining language models



Training language models (LMs) under tight GPU memory budgets rules out standard back-propagation and motivates zeroth-order (ZO) optimization. While ZO methods have proven effective for fine-tuning, their potential during the more memory-intensive pretraining stage has received little attention. We first revisit the singular-value spectra of layer gradients during pretraining and show that the gradient information is spread across many directions; low-rank ZO methods may therefore discard informative components. Building on this insight, we introduce KronZO, a Kronecker-structured ZO optimizer that (i) explores a full-rank search subspace with state-of-the-art storage compression and (ii) employs a criterion-driven directional update that selectively keeps only informative steps. When pretraining GPT-2 Small from scratch on OpenWebText, KronZO achieves a markedly lower training loss than all previous ZO baselines while consuming less GPU memory. Although it still trails first-order methods in final loss, KronZO substantially narrows the gap at a fraction of their memory footprint, extending ZO optimization to larger models and longer runs and paving the way for memory-efficient pretraining on commodity hardware.
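The abstract describes the two ingredients of KronZO only at a high level. The toy sketch below is an illustration of the general recipe, not the paper's method: a perturbation for a weight matrix stored as two small Kronecker factors, a two-point (SPSA-style) directional estimate obtained from two loss evaluations, and a naive accept-only-if-the-loss-drops rule standing in for the criterion-driven update mentioned above. All names, shapes, and hyperparameters (kron_perturbation, zo_step, eps, lr) are hypothetical and not taken from the report.

```python
# Minimal sketch of a Kronecker-structured zeroth-order (SPSA-style) step.
# Hypothetical illustration only; the actual KronZO factorization, scaling,
# and acceptance criterion are specified in the full report.
import torch

def kron_perturbation(m1, n1, m2, n2, device="cpu"):
    """Random perturbation for a (m1*m2) x (n1*n2) weight, stored as two
    small Kronecker factors instead of one dense matrix."""
    A = torch.randn(m1, n1, device=device)
    B = torch.randn(m2, n2, device=device)
    return A, B  # implicit E = kron(A, B); full rank whenever A and B are

def zo_step(loss_fn, W, factors, eps=1e-3, lr=1e-2):
    """Two-point ZO estimate along the Kronecker direction E = kron(A, B)."""
    A, B = factors
    E = torch.kron(A, B)                      # materialized here only for this toy
    loss_plus = loss_fn(W + eps * E)
    loss_minus = loss_fn(W - eps * E)
    g = (loss_plus - loss_minus) / (2 * eps)  # scalar directional-derivative estimate
    W_new = W - lr * g * E
    # Naive stand-in for a criterion-driven update: keep the step only if it helps.
    return W_new if loss_fn(W_new) < loss_fn(W) else W

# Toy usage: drive a 6x6 matrix toward all-ones with loss evaluations only,
# no gradients. The 6x6 perturbation is kron(2x3, 3x2).
target = torch.ones(6, 6)
loss_fn = lambda W: ((W - target) ** 2).sum()
W = torch.zeros(6, 6)
for _ in range(300):
    W = zo_step(loss_fn, W, kron_perturbation(2, 3, 3, 2))
print(f"final toy loss: {loss_fn(W).item():.3f}")
```

Storing only the two factors A and B is where the memory saving comes from: for a real layer the full perturbation kron(A, B) never needs to be kept, even though this toy materializes it for clarity.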

24 pages

This paper was revised in September 2025.

Research axis

Research applications

Document

G2544R.pdf (960 KB)