Parameter-Free Adaptation: Model Complexity vs. Distance-Traveled Step Sizes

Tajinder Singh

Assistant Professor

Department of Mathematics

Government College, Hoshiarpur

Email: tajindersi786@gmail.com

Abstract

Selecting an appropriate step size remains one of the most computationally expensive aspects of training modern neural networks. Grid search over the learning rate, often coupled with warmup schedules and decay parameters, consumes resources that scale poorly with model size. For contemporary large-scale architectures, the cost of a single hyperparameter sweep can rival the cost of the final training run itself[1]. Practitioners typically sidestep this cost through heuristic transfer from smaller models, a procedure that introduces a measurable generalization gap and offers no formal guarantee that the chosen rate is near-optimal for the loss landscape at hand[2,3]. This review synthesizes two parallel lines of work that aim to eliminate the static learning rate as a tunable quantity. The first, which we group under the heading of distance-traveled step size mechanisms (exemplified by D-Adaptation and the Adam++ family), uses the observed displacement of the iterates from their initialization to construct an adaptive estimate of the optimal step. The second, captured by Adaptive Model Complexity (AMC) schemes, ties the effective step size to scale-dependent quantities such as the trace of the empirical Fisher information matrix or the spectral norm of the parameter matrices[7,8]. Although the two families originate from different theoretical principles (online learning regret bounds versus statistical learning theory), both suggest that the optimal step size at iteration t can be computed dynamically from quantities the optimizer already maintains. We formalize this correspondence, compare theoretical regret bounds whose constants typically scale with the Euclidean iterate diameter, and assess the empirical evidence from recent large-scale studies.
Our analysis suggests that the practical generalization gap between tuned and parameter-free methods has narrowed substantially, with the remaining discrepancies attributable to schedule-free dynamics rather than to step-size selection per se[11,12]. We close by identifying open theoretical questions, particularly the interaction between distance-traveled estimators and the non-convex curvature of transformer loss landscapes.

Keywords: parameter-free optimization, adaptive learning rate, D-Adaptation, distance-traveled step size, Adam++, Adaptive Model Complexity, stochastic gradient methods.
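
To make the distance-traveled idea in the abstract concrete, the following is a minimal sketch of a step size built from the displacement of the iterates from their initialization. It uses a DoG-style ("distance over gradients") rule as a stand-in; it is not the D-Adaptation or Adam++ update, and it is not taken from the manuscript. The function name `distance_traveled_gd`, the seed constant `r_eps`, and the toy quadratic objective are all assumptions chosen for illustration.

```python
import math


def distance_traveled_gd(grad, x0, steps=500, r_eps=1e-4):
    """Gradient descent with a DoG-style distance-traveled step size.

    At each iteration the step size is
        eta = max_dist / sqrt(sum of squared gradient norms so far),
    where max_dist is the largest distance from x0 reached so far.
    No learning rate is supplied by the caller.
    """
    x = list(x0)
    # Small positive seed so the first step is nonzero (assumed choice).
    max_dist = r_eps * (1.0 + math.sqrt(sum(v * v for v in x0)))
    grad_sq_sum = 0.0
    for _ in range(steps):
        g = grad(x)
        grad_sq_sum += sum(gi * gi for gi in g)
        if grad_sq_sum == 0.0:
            break  # zero gradient at the start: already stationary
        eta = max_dist / math.sqrt(grad_sq_sum)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        # Update the running maximum displacement from the initialization.
        dist = math.sqrt(sum((xi - x0i) ** 2 for xi, x0i in zip(x, x0)))
        max_dist = max(max_dist, dist)
    return x


# Toy objective (hypothetical): f(x) = 0.5 * ||x - target||^2.
target = [3.0, -2.0]


def quad_grad(x):
    return [xi - ti for xi, ti in zip(x, target)]


x_final = distance_traveled_gd(quad_grad, x0=[0.0, 0.0])
print(x_final)
```

In this sketch the only quantities tracked are the running maximum displacement and the accumulated squared gradient norms, which illustrates the abstract's point that the step can be computed from information the optimizer already has. The step starts tiny because of the small seed and grows as the iterates move away from the initialization.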
