Jamie Simon $^{*}$
UC Berkeley and Imbue
Daniel Kunin
UC Berkeley
Alexander Atanasov
Harvard University
Enric Boix-Adserà
University of Pennsylvania
Blake Bordelon
Harvard University
Jeremy Cohen
Flatiron Institute
Nikhil Ghosh
Flatiron Institute
Florentin Guth
NYU & Flatiron Institute
Arthur Jacot
New York University
Mason Kamb
Stanford University
Dhruva Karkada
UC Berkeley
Eric J. Michaud
Astera Institute
Berkan Ottlik
University of Pennsylvania
Joseph Turnbull
UC Berkeley
$^*$Correspondence to [email protected].
In this paper, we make the case that a scientific theory of deep learning is emerging. By this we mean a theory which characterizes important properties and statistics of the training process, hidden representations, final weights, and performance of neural networks. We pull together major strands of ongoing research in deep learning theory and identify five growing bodies of work that point toward such a theory: (1) solvable idealized settings that provide intuition for learning dynamics in realistic systems; (2) tractable limits that reveal insights into fundamental learning phenomena; (3) simple mathematical laws that capture important macroscopic observables; (4) theories of hyperparameters that disentangle them from the rest of the training process, leaving simpler systems behind; and (5) universal behaviors shared across systems and settings, which clarify which phenomena call for explanation. Taken together, these bodies of work share certain broad traits: they are concerned with the dynamics of the training process; they primarily seek to describe coarse aggregate statistics; and they emphasize falsifiable quantitative predictions. We argue that the emerging theory is best thought of as a mechanics of the learning process, and suggest the name learning mechanics. We assert that learning mechanics should be a mathematical theory, grounded in first-principles calculations that closely predict empirics, reliant on well-tested approximations and assumptions, and aiming for broad impact across the machine learning stack once it reaches maturity. We discuss the relationship between this mechanics perspective and other approaches for building a theory of deep learning, including the statistical and information-theoretic perspectives. In particular, we anticipate a symbiotic and mutually supportive relationship between learning mechanics and the developing discipline of mechanistic interpretability. Where mechanistic interpretability aims to be the biology of deep learning, learning mechanics should aspire to be its physics, mirroring the complementary relationship between biology and physics in the natural sciences. We also review and address common arguments that fundamental theory will not be possible or is not important. We conclude with a portrait of important open directions in learning mechanics and advice for beginners. We host further introductory materials, perspectives, and open questions at learningmechanics.pub.
Executive Summary: Deep learning powers transformative AI systems, from language models to image generators, yet it remains a black box: we train these models through trial and error without a clear scientific explanation of why they work so well. This gap hinders reliable engineering, raises safety risks as models grow more powerful, and leaves fundamental questions about intelligence unanswered. With AI scaling rapidly and influencing policy, the economy, and society, the need for a principled understanding has never been more urgent.
This document argues that a scientific theory of deep learning is emerging and will soon provide such understanding. The theory aims to characterize key aspects of neural networks, including how they learn during training, what internal representations they form, and how they perform on tasks.
The authors synthesize ongoing research without conducting new experiments. They draw on diverse studies, including simplified models like linear networks and infinite-width limits, empirical observations from large-scale training runs, and analyses of hyperparameters and data patterns. These efforts span math, physics, and computer science, focusing on credible patterns from real AI systems over the past decade, such as those trained on billions of data points.
The core findings highlight five converging lines of evidence for an emerging theory, dubbed "learning mechanics"—a physics-like framework for the "motion" of models through parameter space during training. First, simplified settings like deep linear networks reveal exact learning paths, showing networks prioritize simple patterns early, much like greedy algorithms in nature. Second, limits like infinite network width simplify dynamics into predictable regimes: "lazy" ones mimic fixed kernels for basic tasks, while "rich" ones enable adaptive feature learning, explaining why wider models improve. Third, simple laws govern big-picture outcomes, such as test error dropping predictably as a power law with more data or compute (e.g., each doubling of compute cutting error by a roughly constant factor), and training stabilizing at an "edge of stability" where loss curvature balances the step size. Fourth, hyperparameters like learning rates can be decoupled and scaled predictably, allowing settings from small tests to transfer to massive models, cutting trial-and-error costs. Fifth, universal patterns appear across tasks and architectures, like similar internal representations in vision and language models, suggesting shared data structures drive convergence.
These results imply deep learning exploits universal principles in data and training, not just brute force. Practically, a mature theory could slash development costs by guiding model design, reducing hyperparameter tuning by 50-80% in some cases, and accelerating timelines for reliable AI. For safety, it would clarify how models form behaviors, aiding oversight and reducing risks like unintended biases. Unlike classical learning theory, which focused on guarantees for simple models, this mechanics approach embraces complexity through empirical science, resolving why overparameterized networks generalize better than expected. It complements mechanistic interpretability, which dissects trained models like biology studies organisms, by explaining dynamics as the underlying "physics."
Leaders should invest in this research by funding interdisciplinary teams and collaborations between theorists and practitioners. Prioritize developing nonlinear toy models and data-structure theories to bridge simplified insights to real AI. Explore symbiotic ties with interpretability for AI safety pilots, and use scaling predictions to inform resource allocation, such as optimal compute-data ratios. Without such a theory, AI development risks remaining an alchemy; with it, decisions on investment and regulation become evidence-based.
The theory is nascent, strongest in idealized or large-scale regimes but weaker on nonlinear, small-data cases or diverse architectures. Uncertainties remain in predicting exact scaling exponents or feature emergence from data alone. Confidence is high in the trends—evidenced by consistent predictions across studies—but full maturity requires more empirical validation on frontier models. Proceed cautiously on safety claims until broader tests confirm universality.
Section Summary: Deep learning powers incredible feats in artificial intelligence but remains a mysterious black box, with no clear scientific explanation for why neural networks learn so effectively from data. While early machine learning theories focused on what simple models could achieve and how they generalize, deep learning's complexity and scale have outpaced those frameworks, shifting the field toward a more empirical, scientific study of training dynamics and behaviors. This paper argues that a unified theory is emerging—like a mechanics of learning—supported by solvable models, scaling limits, empirical patterns, simplified optimizations, and universal trends across applications.
Deep learning is famously a black-box learning method, the most powerful, most inscrutable, and now most technologically important member of the machine learning pantheon. Properly trained, neural networks learn to perform a wide array of tasks with superhuman performance, but we have no unified scientific framework that explains why or how. Motivated by both scientific curiosity and the promise of practical engineering benefit, the effort to put rigorous mathematical and scientific backing behind this applied discipline has spanned decades. Despite some progress, however, our understanding remains primitive: neural networks are still trained using methods discovered largely through trial and error rather than first principles, and theory plays little role in the day-to-day practice of deep learning. The challenge has only compounded as practice has advanced, and in the era of large language models and diffusion models, the mysteries are arguably deeper than they were one or two decades ago. Will we ever understand?
**This paper makes the case that, yes, there will be a scientific theory of deep learning; that we can see pieces of this theory starting to emerge; and that this theory will take the form of a *mechanics* of the learning process.**
The questions driving deep learning theory have changed over time, and to understand where the field is going, it is useful to first look back at how we got here. Deep learning theory is as old as machine learning itself, with roots in the McCulloch–Pitts neuron and the perceptron in the middle of the last century. The earliest theoretical questions in machine learning were about expressivity: what functions can simple models represent, and how can they be learned from data? As learning came to be understood as a statistical problem, and simple learning systems found practical success, the theoretical focus shifted to ask: when does learning from finite samples generalize? This gave rise to classical learning theory, including statistical and computational/PAC learning theory. Paired with classical optimization theory, these frameworks gave clean end-to-end guarantees of the optimization and generalization of simple learning systems. In parallel, a classical tradition of the statistical physics of machine learning developed satisfying theories of the average-case behavior of simple models.
While these classical theories built a strong foundation for understanding learning, the rise of deep learning through multilayer networks, backpropagation, and increasing scale in both data and compute exposed limitations in their explanatory power. Neural networks are complex, nonconvex, and overparameterized (in contrast with the simple, convex, parsimonious models for which classical learning theory excels), and they optimize and generalize better than these classical approaches can guarantee or explain. Furthermore, it became clear that neural networks were not merely fitting data or achieving low training error; they were learning structured internal representations and displaying striking regularities across tasks and scales. The classical questions of performance and efficiency remained important, but answering them would first require understanding a new host of phenomena shaped both by the dynamics of neural networks through training and by the structure of the data they are trained on.
This marked a transition in which deep learning theory changed in character from a largely mathematical study of what is possible to a truly scientific effort to describe, explain, and ultimately predict the behavior of complex empirical systems. New scientific endeavors often start with an empirical tension in which nature presents something interesting we cannot predict or explain with existing tools, and although neural networks are artificial computational systems, this same scientific tension is present here. We should thus approach this task as scientists, embracing empirics, seeking unifying principles, and identifying recurring motifs. We should also expect the path forward to look more like the development of a scientific field than the development of a mathematical one.
The purpose of this paper is to convince the reader that this scientific tension is gradually giving way to a scientific theory which resolves it. In Section 2, we pull together major strands of ongoing research and identify five lines of evidence that such a theory is emerging:

1. **Solvable idealized settings** that provide intuition for learning dynamics in realistic systems.
2. **Tractable limits** that reveal insights into fundamental learning phenomena.
3. **Simple mathematical laws** that capture important macroscopic observables.
4. **Theories of hyperparameters** that disentangle them from the rest of the training process, leaving simpler systems behind.
5. **Universal behaviors** shared across systems and settings, which clarify which phenomena call for explanation.
These lines of research broadly share several overarching characteristics: they are concerned with the dynamics of the training process; they primarily seek to describe coarse aggregate statistics of learning; and they emphasize accurate average-case predictions over rigorous worst-case bounds. In this sense, the emerging scientific theory of neural networks appears to have much in common with theories in physics such as classical mechanics, continuum mechanics, statistical mechanics, and quantum mechanics. We argue that this emerging theory is best understood as a mechanics of learning.
Mechanics is the branch of physics studying how forces acting on objects determine their movement through space and time. Neural network learning can be thought of in this way: much as an object moves continuously through physical space, learning involves a model moving through parameter space via discrete updates. In the physical sciences, forces come from interactions between components of a system. Similarly, the process of deep learning is shaped by interactions between the parameters, dataset, task, and learning rule. In physics, these forces are mediated by fields; in deep learning, they are mediated by gradients. In physics, systems settle into equilibria at local minima of a potential determined by internal interactions and external constraints; analogously, neural networks converge to local minima of a loss landscape shaped by their architecture and training data. While the systems under study are very different, since the key problems of both are essentially about movement and interaction, we might expect some features of the resulting sciences to be shared.
These analogies are not just speculation: we can see these similarities reflected in the lines of research listed above. All branches of mechanics (and especially classical mechanics) develop a library of analytically solvable settings to gain intuition; so too does learning mechanics. All branches of mechanics use limits as simplifying tools; so too does learning mechanics. Continuum and statistical mechanics, the branches which most directly deal with large numbers of interacting components, describe zoomed-out summary statistics rather than the motion of every particle; this has also proven a useful approach in dealing with the complexity of deep learning. Every physical system has one or more system parameters (characteristic scales, coupling constants, etc.) affecting its behavior, and some techniques for treating these are essentially the same as those used to study hyperparameters in deep learning. Finally, physics is full of cases in which the same phenomena show up in very different settings, and similarly we see universal behavior emerging across deep learning systems.
All considered, the emerging science shares deep similarities with established branches of mechanics. By analogy to classical, continuum, statistical, and quantum mechanics, we suggest the intended theory be called learning mechanics.
Seven desiderata for learning mechanics.
We should be clear at the outset what we want from a mechanics of learning. Assessing how mature branches of mechanics were motivated, developed, and succeeded, we can see what sort of goals to aim for. Here are seven desiderata for this research program:

1. **Fundamental:** grounded in first-principles calculations rather than ad hoc fits.
2. **Mathematical:** expressed in precise, quantitative terms.
3. **Predictive:** making falsifiable predictions that closely match empirics.
4. **Comprehensive:** characterizing the training process, hidden representations, final weights, and performance of neural networks.
5. **Intuitive:** supplying mental models that guide reasoning about real systems.
6. **Useful:** delivering practical benefit across the machine learning stack.
7. **Humble:** explicit about the approximations and assumptions on which it relies.
A mechanics of learning with these virtues — one that is fundamental, mathematical, predictive, comprehensive, intuitive, useful, and humble — would be transformative, paradigm-setting. We expect such a theory would resolve important open questions that have long remained out of reach, as we discuss in Section 5.
Building learning mechanics will not be easy. It will require sustained effort, both intellectual and institutional. It is therefore worth being clear about why such a project matters. The reasons to seek a mechanics of learning fall into three broad categories: scientific, practical, and safety-related.
The scientific reasons concern what such a theory could teach us about intelligence and the natural world. The striking engineering success of large neural networks suggests that they exploit deep principles of learning and representation that we do not yet understand. This has historical precedent: technology has often preceded scientific theory, as was the case with steam engines' role in motivating thermodynamics, which went on to explain much more than engine efficiency. A similar story played out in flight: the development of airplanes through trial and error and inspiration from the natural world helped motivate aerodynamic theory, which in turn enabled both better aircraft design and a deeper understanding of how birds themselves fly. In our case, the principles that govern learning in artificial neural networks may also shed light on our own biological intelligence, with potentially important implications for neuroscience and cognitive science.
The practical reasons concern the design and development of real-world AI systems. A mature theory of deep learning could guide model design, optimization, scaling, and deployment, replacing trial and error with more reliable principles. Theory has already begun to play this role in a limited but growing number of cases, including empirical scaling laws (Section 2.3), mathematical prescriptions for hyperparameter scaling (Section 2.4), and theoretically-motivated optimizers and methods for data attribution (Section 4). A deeper, more complete theory will give more such guidance and make it sharper and more predictive.
The safety reasons concern our ability to describe, characterize, and govern increasingly powerful AI systems. Some form of regulation will likely be necessary, but it is difficult to regulate a technology that we cannot clearly describe. A theory that identifies the relevant variables, mechanisms, and organizing principles of large models could help provide the clarity needed for reliability, oversight, and control. One avenue by which fundamental theory might aid in AI safety is by supporting mechanistic interpretability, a point to which we return in Section 3.
This paper is structured as follows. In Section 2, we present our five lines of evidence that a scientific theory of deep learning is beginning to emerge. We motivate each line of evidence with an intuitive explanation and highlight examples of research successes that illustrate the underlying principle. In Section 3, we discuss the relationship between learning mechanics and other perspectives on the science of deep learning, including a possible symbiotic relationship between learning mechanics and mechanistic interpretability. In Section 4, we review and address common arguments that fundamental theory will not be possible. In Section 5, we give a portrait of ten important open directions in learning mechanics, from predicting scaling laws to eliminating hyperparameters, where we expect to see major progress in the coming years. Finally, in Section 6, we offer some advice for young researchers looking to get involved in this scientific project and extend a hand with some introductory resources.
We write this paper for a broad audience. We hope the veteran scientist of deep learning will find something valuable in our synthesis of useful approaches and results, and feel galvanized by our depiction of an emerging science. We hope to convince the deep learning practitioner that theory is on a path to fulfilling its longstanding promise of practical utility and to encourage them to experiment with their systems with an eye for science. We hope to convince the AI safety or mechanistic interpretability researcher that white-box theory is difficult yet possible — that a first-principles study of dynamics can help put solid foundations beneath their important work, and that our communities should work together (see Section 3 for our vision of symbiosis). Lastly, we hope to make it easier for young students and newcomers to the field to get involved. This is an exciting and important area of work, and while it requires some mathematical maturity to get started, it is our belief that the barrier to entry could be much lower. Various deep intuitions about this science have been percolating inside the theory community for a while, and this paper is an attempt to state them clearly. We hope to make it easier for folks with the requisite background to quickly get up to speed and contribute.
Section Summary: Deep learning offers hope for developing a scientific "mechanics of learning" because its core elements—like the neural network structure, training data, performance goals, and update rules—are fully visible and easy to measure, unlike many complex systems where the inner workings are hidden. This transparency allows researchers to track every detail of the training process, run repeatable experiments, and hunt for patterns, even though the high-dimensional interactions create challenges. Evidence comes from simplified models that can be solved exactly, much like basic physics problems, and from broader parallels to concepts in classical and statistical mechanics, suggesting a unifying theory is within reach.
A great cause for optimism that a mechanics of learning is possible is the fact that the essential ingredients of deep learning are both explicit and measurable. A deep learning system is characterized by the following components:
$ \begin{aligned} \text{Architecture:} \quad & \text{a neural network } f({\bm{x}}; {\bm{\theta}}) \text{ specified as a composition of simple linear and nonlinear transformations.} \\ \text{Data:} \quad & \text{a dataset } \mathcal{D} = \{({\bm{x}}_i, {\bm{y}}_i)\}_{i=1}^n \text{ consisting of samples from an unknown data-generating distribution } ({\bm{x}}, {\bm{y}}) \sim \mathcal{P}_{\mathrm{data}}. \\ \text{Task:} \quad & \text{an objective } \mathcal{L}({\bm{\theta}}) \text{ measuring the performance of the network } f({\bm{x}}; {\bm{\theta}}) \text{ on the dataset } \mathcal{D}. \\ \text{Learning rule:} \quad & \text{a gradient-based update equation, e.g. } {\bm{\theta}}^{(t+1)} = {\bm{\theta}}^{(t)} - \eta \nabla \mathcal{L}({\bm{\theta}}^{(t)}), \text{ together with a parameter} \\ & \text{initialization, e.g. } {\bm{\theta}}^{(0)}_i \sim \mathcal{N}(0, \alpha^2_{\mathrm{init}}), \text{ and optimization hyperparameters, e.g. learning rate } \eta. \end{aligned} $
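As a concrete (and purely illustrative) rendition of these four ingredients, the following sketch wires them into the standard training loop; the toy dataset, two-layer ReLU architecture, squared-error objective, and all constants below are our own placeholder choices, not a canonical setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: n samples standing in for draws from an unknown distribution P_data
n, d = 256, 8
X = rng.normal(size=(n, d))
y = np.tanh(X @ rng.normal(size=d))            # unknown target function

# Architecture: a two-layer ReLU network f(x; theta) = W2 . relu(W1 x)
alpha_init = 0.1
W1 = alpha_init * rng.normal(size=(32, d))     # theta^(0) ~ N(0, alpha_init^2)
W2 = alpha_init * rng.normal(size=32)

# Task: a mean-squared-error objective L(theta) on the dataset D
def loss(W1, W2):
    preds = np.maximum(X @ W1.T, 0.0) @ W2
    return 0.5 * np.mean((preds - y) ** 2)

# Learning rule: gradient descent, theta^(t+1) = theta^(t) - eta * grad L
eta = 0.1
for t in range(1000):
    h = np.maximum(X @ W1.T, 0.0)              # hidden activations (measurable)
    err = (h @ W2 - y) / n                     # dL/d(output), per sample
    gW2 = h.T @ err                            # exact gradients (measurable)
    gW1 = (np.outer(err, W2) * (h > 0)).T @ X
    W1, W2 = W1 - eta * gW1, W2 - eta * gW2
```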
Nothing about the learning process is hidden. Unlike many complex systems where the equations governing dynamics must be inferred from observations, deep learning directly exposes its "equations of motion." Moreover, these dynamics are extraordinarily measurable: every weight, activation, gradient, and loss value can be recorded, along with arbitrary statistics derived from them. As a result, deep learning experiments are unusually easy to design, replicate, and interrogate, making it more straightforward to discover empirical regularities and rigorously test theoretical predictions. Few fast-moving scientific domains offer comparable transparency in their governing equations or comparable freedom in what can be measured.
What, then, stands in the way of a scientific theory of deep learning? The central challenge is not opacity, but complexity. While we have direct access to the architecture, data, task, and learning rule, the interaction of these components gives rise to learning dynamics that are nonlinear, coupled, and high-dimensional. These dynamics depend in subtle ways on the choice of hyperparameters. And even though we can inspect every training sample, data distributions are complex and have defied simple characterization.
Nevertheless, we argue that this complexity conceals underlying regularities, and that deep learning will indeed admit a scientific theory. In what follows, we present five broad observations that serve as evidence for an emerging mechanics of learning. Each of these admits direct analogies to tools and ideas in other disciplines of mechanics. These are summarized in Table 1.
: Table 1: Useful tools and ideas in the emerging science of deep learning closely resemble important tools and ideas from physics, particularly classical mechanics, continuum mechanics, statistical mechanics, and quantum mechanics. Extrapolating, this suggests that there will be a mechanics of learning which offers a unifying first-principles theory of the training process, hidden representations, final weights, and test-time performance of neural networks.
| Section | Approach | Examples in deep learning | **Examples from physics** |
|---|---|---|---|
| 2.1 | solvable settings | deep linear networks, kernel regression, multi-index models | harmonic oscillator, hydrogen atom, Ising model |
| 2.2 | simplifying limits | lazy vs. rich learning; width, depth $\rightarrow \infty$; small initialization | thermodynamic limit $(n, V \rightarrow \infty)$, classical limit $(\hbar \rightarrow 0)$, hydrodynamic limit $({\bm{k}}, \omega \rightarrow 0)$ |
| 2.3 | simple empirical laws | neural scaling laws, edge of stability, neural feature ansatz | the laws of Kepler, Snell, Boyle, Hooke, Newton, Faraday, Ohm, Poiseuille, Planck, Hubble, etc. |
| 2.4 | study of system parameters | step size as sharpness regularization, $\mu$P and width-scaling | scaling analysis, nondimensionalization, chaotic vs. ordered regimes |
| 2.5 | universal phenomena | common inductive biases and representations across models | critical phenomena, renormalization group flow |
A reliable way to build scientific understanding in complex systems is to study pared-down yet representative settings in which quantitative calculations are possible. For example, physics uses representative solvable settings like the harmonic oscillator and the hydrogen atom as sources of intuition for much broader classes of systems. Deep learning appears to be particularly amenable to this approach: scientists have identified a rich landscape of minimal models where the learning dynamics simplify and many quantities of interest become solvable. These analytically tractable cornerstones are useful because they reveal phenomena and mechanisms to look for when we turn to realistic deep learning. [^3]
[^3]: A complementary view is that any eventual complete theory of deep learning must encompass these simplified settings. Their solutions may provide conceptual scaffolding, serving as nucleation sites from which a more general theory crystallizes.
One particularly fruitful simplification is linearization. Here we discuss two distinct instantiations of this idea: linearization in the data, where $f({\bm{x}}; {\bm{\theta}})$ becomes linear in ${\bm{x}}$, and linearization in the parameters, where $f({\bm{x}}; {\bm{\theta}})$ becomes linear in ${\bm{\theta}}$.
![**Figure 1:** **Linearization yields exact solutions that match experiments.** (a) Canonical work by [1] showed that, under a task-aligned initialization ${\bm{\theta}}^{(0)}$ and whitened inputs ${\bm{x}} \sim \mathcal{N}(0, \mathbf{I})$, the gradient flow learning dynamics of deep linear networks decouple into independent solvable Bernoulli ODEs. This leads to sequential learning of singular modes, with larger-singular-value modes emerging first. Panel (a) reproduces Fig. 3 of [1]. (b) Linearizing a nonlinear network by truncating nonlinear terms in its Taylor expansion around initialization reduces least-squares training to kernel ridge regression with the neural tangent kernel (NTK). This analysis connects the network's architecture to its inductive bias through the NTK eigenstructure, enabling accurate predictions for the test performance of these networks. Panel (b) is based on Fig. 2 of [2].](https://ittowtnkqtyixxjxrhou.supabase.co/storage/v1/object/public/public-images/jkqvvdxd/complex_fig_f1a5f2e57246.png)
Linearization in the data.
A deep linear network is obtained by removing all nonlinearities from a neural network's architecture, yielding a model that is linear in its inputs ${\bm{x}}$ but remains highly nonlinear in its parameters ${\bm{\theta}}$:
$ f({\bm{x}}; {\bm{\theta}}) = \mathbf{W}_L \mathbf{W}_{L-1} \cdots \mathbf{W}_1 {\bm{x}}, \qquad \text{where } {\bm{\theta}} := \{\mathbf{W}_\ell\}_{\ell = 1}^L, \text{ each } \mathbf{W}_\ell \text{ is a linear transformation, and } L \ge 2. $
Deep linear networks have a long history of study because, despite their simplicity, they retain many hallmark behaviors of deep learning ([3]). These include saddle-point-dominated loss landscapes ([4]), dynamics with sharp phase transitions and separation of timescales ([5, 6]), edge-of-stability oscillations with gradient descent ([7]), and strong initialization-dependent inductive biases ([8, 9]). Analysis of these networks is typically carried out with the gradient flow learning rule — the continuous-time limit of gradient descent — under simplifying assumptions on the data distribution and with carefully chosen initializations ([10, 1, 11, 12]). In these regimes, the learning dynamics can often be solved exactly or reduced to low-dimensional dynamical systems.
Across many such analyses, a consistent lesson emerges: learning exhibits a greedy low-rank bias, acquiring some components of the task before others. Canonical work by [1] first showed how deep linear networks learn singular vectors of the input–output correlation sequentially during training, with learning prioritized toward modes associated with the largest singular values, as shown in Figure 1. This bias has been hypothesized to benefit generalization by separating the signal from the noise ([13]), and closely mirrors behavior observed in nonlinear networks, where simpler functions are often learned before more complex ones ([14, 15]). Moreover, a range of factors — including small initializations ([16, 17, 18, 19]), increased depth ([20, 21, 22]), stronger mini-batch noise ([23, 24]), and explicit $\ell_2$ regularization ([25, 26]) — have all been shown to further strengthen this greedy learning bias.
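As an illustrative numerical rendition of this greedy dynamic (our own toy setup, not the exact construction of [1]), the sketch below trains a two-layer linear network from a small near-balanced initialization on a target map with well-separated singular values, and prints the singular values of the end-to-end map as they emerge, largest first:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 6, 6
# Target linear map with well-separated singular values (whitened inputs assumed)
U = np.linalg.svd(rng.normal(size=(d, d)))[0]
Vh = np.linalg.svd(rng.normal(size=(d, d)))[2]
A = U @ np.diag([5.0, 3.0, 1.5, 0.7, 0.3, 0.1]) @ Vh

# Small, near-balanced initialization strengthens the greedy low-rank bias
W1 = 1e-3 * rng.normal(size=(h, d))
W2 = 1e-3 * rng.normal(size=(d, h))

eta = 0.05  # small-step gradient descent approximating gradient flow
for t in range(4001):
    E = A - W2 @ W1                      # error in the end-to-end map
    W1, W2 = W1 + eta * W2.T @ E, W2 + eta * E @ W1.T
    if t % 800 == 0:
        s = np.linalg.svd(W2 @ W1, compute_uv=False)
        print(t, np.round(s, 2))         # larger-singular-value modes appear first
```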
Linearization in the parameters.
A linearized network is obtained by truncating the nonlinear terms in a network's Taylor expansion around its initial parameters. This yields a model that is linear in its parameters ${\bm{\theta}}$ but remains highly nonlinear in the data ${\bm{x}}$:
$ f_\text{lin}({\bm{x}}; {\bm{\theta}}) = f({\bm{x}}; {\bm{\theta}}_0) + \nabla_{{\bm{\theta}}} f({\bm{x}}; {\bm{\theta}}_0)^\top ({\bm{\theta}} - {\bm{\theta}}_0), \qquad \text{where } \nabla_{{\bm{\theta}}} f(\cdot; {\bm{\theta}}_0) \text{ is the gradient at initialization.} $
This is not some contrived construction: in fact, there are settings in which a model is well-approximated throughout training by its linearization, i.e., $\forall t, \ f({\bm{x}}; {\bm{\theta}}_t) \approx f_\text{lin}({\bm{x}}; {\bm{\theta}}_t)$. For example, any neural network architecture can be driven into the linearized regime by taking suitable limits ([27, 28, 29, 30]), as discussed in Section 2.2. Additionally, recent evidence suggests that language model fine-tuning occurs in a near-linearized regime ([31, 32]).
Since a linearized network is linear in its parameters, its learning dynamics are identical to those of linear regression, with one key difference: while the dynamics of linear regression are driven by the Gram kernel, $K_\text{Gram}({\bm{x}}, {\bm{x}}') = {\bm{x}}^\top {\bm{x}}'$, linearized networks are described by the neural tangent kernel (NTK), $K_\text{NTK}({\bm{x}}, {\bm{x}}') \coloneqq \nabla_{{\bm{\theta}}} f({\bm{x}}; {\bm{\theta}}_0)^\top \nabla_{{\bm{\theta}}} f({\bm{x}}'; {\bm{\theta}}_0)$. When the task is least squares regression and training uses small-step gradient descent, the dynamics are analytically tractable and the final predictor is given by kernel ridge regression with the NTK ([27]).
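This correspondence is easy to express in code. The following sketch (illustrative sizes and target, our own construction) builds the empirical NTK feature map $\nabla_{{\bm{\theta}}} f(\cdot; {\bm{\theta}}_0)$ explicitly for a small two-layer ReLU network and forms the resulting kernel regression predictor, with a small jitter term standing in for explicit ridge regularization:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 100, 4, 512
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])                           # arbitrary target

W0 = rng.normal(size=(h, d)) / np.sqrt(d)     # theta_0: parameters at initialization
a0 = rng.normal(size=h) / np.sqrt(h)

def f0(X):
    """Network output at initialization."""
    return np.maximum(X @ W0.T, 0.0) @ a0

def ntk_features(X):
    """phi(x) = grad_theta f(x; theta_0), so that K_NTK = phi(X) phi(X)^T."""
    pre = X @ W0.T
    d_a = np.maximum(pre, 0.0)                           # df/da_j = relu(w_j . x)
    d_W = (a0 * (pre > 0))[:, :, None] * X[:, None, :]   # df/dW_jk = a_j 1[w_j.x>0] x_k
    return np.concatenate([d_a, d_W.reshape(len(X), -1)], axis=1)

Phi = ntk_features(X)
K = Phi @ Phi.T                               # empirical neural tangent kernel
coef = np.linalg.solve(K + 1e-8 * np.eye(n), y - f0(X))

# Predictions of the fully trained linearized model on fresh inputs
X_test = rng.normal(size=(10, d))
preds = f0(X_test) + ntk_features(X_test) @ Phi.T @ coef
```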
This setting yields insight into a variety of deep learning phenomena. For example, since the details of the network architecture influence the mathematical structure of the NTK through the fixed feature map $\nabla_{{\bm{\theta}}} f(\cdot; {\bm{\theta}}_0)$, one learns how the linearized model's inductive bias follows from its architecture ([33, 34]). Furthermore, one may accurately predict the model's expected generalization error on arbitrary targets $f^\star$ by accounting for the structure of the input data ([35, 36, 37, 38, 39, 2]), as shown in Figure 1. Applying this framework to realistic data distributions uncovers the origin of typical models' tendency to learn simple and generalizing functions ([40, 41]). Linearized models also capture relevant phenomena such as double descent ([42, 43]) and scaling laws ([44, 45, 46, 47]).
However, despite these theoretical merits, linearized networks are unrealistic in a few critical ways. Most notably, they do not capture the strong feature-learning capabilities that generic neural networks exhibit, often leading to overly pessimistic predictions for sample complexity ([48, 49]). Moreover, by reducing training to a tractable linear problem, these models sidestep the intrinsically nonconvex optimization phenomena of deep learning. To describe these and other aspects of deep learning, one must look beyond linearization.
Beyond linearization.
An important frontier for theory lies in developing analytically tractable toy models that remain genuinely nonlinear in both the data and the parameters (see Section 5). In these settings, the influence of the data distribution becomes more complex, making it difficult to obtain a unified and general framework. However, a growing body of work is progressing in this direction by isolating specific nonlinear mechanisms and making them solvable under assumptions on the data.
One line of work studies Gaussian inputs and structured targets (e.g., single- and multi-index models). Fully nonlinear neural networks provably outperform kernel methods, requiring fewer samples, because they exploit the structure in the target function to learn relevant features ([50, 51, 52, 53, 54]). Complementarily, methods from statistical physics enable computing exact asymptotics for Bayes-optimal inference and learning dynamics in these models ([55, 56, 57]). A related setting is two-layer neural networks with quadratic activation functions, where recent results have characterized the exact asymptotics, training dynamics, and scaling laws ([58, 59, 60, 61]). Several other lines of research isolate distinct nonlinear phenomena: the convergence of homogeneous networks trained on logistic losses to max-margin solutions ([62, 63]), the reduction of training dynamics to low-dimensional summary statistics in teacher-student models ([64, 65, 66, 67, 68]), memorization in associative memory models ([69]), learned algorithmic structure in modular arithmetic tasks ([70, 71, 72]), nonlinear solvable models of attention ([73, 74]), and improved scaling laws from nonlinear feature learning ([75]).
Taken together, these approaches illustrate both the promise and the limitations of current nonlinear toy models: each captures a slice of fully nonlinear learning dynamics, yet no unified framework has emerged. We view this space as an open and rapidly evolving area, and return to these challenges in our discussion of open problems in Section 5.
Modern deep learning systems are enormous: they regularly involve hundreds of interacting architectural components comprising hundreds of billions of parameters and trained on trillions of tokens. With so many interacting degrees of freedom, constructing detailed microscopic theories that track individual parameters in practical systems seems all but hopeless.
Fortunately, complex systems often simplify when approximated as effectively infinite in size, revealing simple mathematical structure that remains informative even for the original finite system. This strategy is well established in statistical and chemical physics: for example, the ideal gas law, $PV = nRT$, is derived in the limit of an infinite number of particles (often termed the thermodynamic limit), yet accurately describes real parcels of gas of finite volume. Limits are a central mathematical tool for managing the complexity of deep learning, and their recurring success in doing so provides strong evidence for an emerging theory.
Here we discuss the limit of infinite width in detail. We conclude by mentioning other limits and offering some unifying ideas.
The infinite width limit and the lazy/rich dichotomy.
The dynamics of a deep neural network often simplify when one takes the number of neurons in each hidden layer to infinity. Such a limit generally leads to so-called mean-field behavior in which we only need to describe the evolution of the neuron population as a whole (as e.g. a probability distribution) and we can ignore what each individual neuron is doing. However, achieving this limit requires shrinking the initialization scale as width increases to prevent activations in deeper layers from diverging. The key subtlety in taking the infinite width limit is that the rate at which we suppress these initial weights strongly influences the resulting training dynamics, leading to one of two qualitatively distinct limiting behaviors.
The lazy, kernel, or linearized regime. The first forays into the land of infinite width studied only a network's statistics at initialization, not its training dynamics ([76, 77]). These works found that, in order for the inputs to hidden neurons to neither vanish nor explode as width increases, the parameter size at initialization has to decay as $\text{[width]}^{-1/2}$. This is not a surprise: it is just the well-known LeCun initialization rule ([78]), which can be easily derived from the central limit theorem. Later works that naively trained the parameters of these infinite-width networks found the surprising fact that the network's weights and hidden representations change only negligibly, yet these small changes accumulate to produce substantial changes in the output function. As a result, the training dynamics are linear in the parameters in the sense discussed in Section 2.1, and the evolution of the output function may be expressed entirely in terms of the NTK ([27, 28]). While a network in this limit is wonderfully analytically tractable, the fact that its hidden representations change only negligibly means that it fails to exhibit feature learning. While the definition of feature learning is much debated (see Section 5), all agree that at minimum it requires the network's hidden activations on a given data sample to change from their values at initialization, which does not happen in this limit. This suggests the NTK infinite-width limit is not the right one to study. Networks in this linearized regime were later termed "lazy" by [29].
![**Figure 2:** **Large and small network output multipliers are sufficient to induce lazy and rich training dynamics.** We train a shallow student network $\hat{f}({\bm{x}}) = \frac{\alpha}{n} \sum_{i=1}^n a_i \text{ReLU}({\bm{w}}_i^\top {\bm{x}})$ with width $n = 200$ to match a teacher network $f^*({\bm{x}}) = \sum_{i=1}^3 a^*_i \text{ReLU}(({\bm{w}}^*_i)^\top {\bm{x}})$ on two-dimensional input data. We plot the training trajectories of the student weights $w_i$ (color denotes $\mathrm{sgn}(a_i)$) against the teacher feature directions. **Left:** with $\alpha = 0.1$, the dynamics are *rich:* the student weights grow significantly and cluster in angle around the teacher feature directions. **Right:** with $\alpha = 30$, the dynamics are *lazy:* the student weights move negligibly during training, even though the loss drops. Experiment reproduced from [29].](https://ittowtnkqtyixxjxrhou.supabase.co/storage/v1/object/public/public-images/jkqvvdxd/lazy_rich_plots.png)
The rich, active, or feature-learning regime. In answer to this, several authors identified an alternative infinite width limit in which training does induce feature learning. The key insight was essentially to downscale the final-layer weights by a factor of $\text{[width]}^{-1}$, rather than the earlier $\text{[width]}^{-1/2}$, thereby forcing the network weights to change more to compensate. [^4] While this makes the function trivial at initialization (at infinite width it is uniformly zero), it can still grow nontrivially during training, changing by an order-one amount upon each gradient step.
[^4]: The lazy vs. rich dichotomy is conceptually similar to elastic vs. plastic deformation in materials. A material will deform linearly in response to a small force, and its internal atomic structure will not change. In response to a larger force, it will deform nonlinearly, and its internal structure changes.
This downscale-the-network-output idea first appeared in the shallow "mean-field networks" of [79], [80], and [81]. [82] and [83] found that this idea also works for networks of arbitrary depth, bundling the resulting hyperparameter scaling factors together into the celebrated "Maximal Update Parameterization" discussed in Section 2.4. It is now widely accepted that infinite-width neural networks can learn features.
Wide networks in this "rich" regime display a huge range of interesting behaviors that their lazy counterparts do not. The most significant is certainly that the hidden features of these networks change over time, adapting to the structure in the input data, altering the internal geometry of hidden representations over the course of training ([84]). Subpopulations of neurons specialize, learning to attend to different features latent in the data ([56, 65, 61]). For instance, in tasks where the optimal predictions involve low-dimensional subspaces of high-dimensional data, the distribution over first-layer weights evolves to amplify weights in the subspace of interest ([85, 50, 86, 87, 60, 58, 88]). When the scale of the initialization is made even smaller, they often show the greedy low-rank bias discussed in Section 2.1, acquiring some components of the task before others ([89, 6, 90]). [^5]
[^5]: There is also a well-developed line of work studying the signatures of feature learning in large-width networks from a Bayesian perspective. Naively, infinite-width networks have simple Bayesian statistics given by Gaussian processes ([91]), which is analogous to the "lazy" limit of conventionally-trained networks. This view treats this Gaussian process limit as a solvable reference point ([92, 93]) and then reintroduces finite width, using mean-field and variational techniques to characterize aspects of feature adaptation to data ([92, 94, 95, 96, 97]). One may also induce feature learning by rescaling the total likelihood (see e.g. [98]), which is analogous to the final-layer downscaling which gives the rich limit in conventional training.
The lazy–rich dichotomy, and its dependence on initialization scale, emerged as a central finding of infinite-width analyses. Subsequent work has shown that analogous behavior appears even at finite width: scaling down the network output promotes feature learning, pushing models toward the rich regime, whereas increasing the output scale tends to linearize training dynamics and induce lazy behavior ([29]). This sensitivity to initialization scale connects to a broader literature on inductive bias, where seemingly small changes to the learning setup can steer training toward fundamentally different solution classes ([99, 8]). Figure 2 illustrates how the same finite network, trained with different output scalings, can exhibit either lazy or rich learning dynamics.
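The following sketch gives a simplified rendition of the experiment in Figure 2 (our own toy code, not that of [29]): the same finite student network is trained with a small and a large output multiplier $\alpha$, and we track how far the first-layer weights move from initialization. Following the convention of [29], the step size is scaled as $1/\alpha^2$ so the two runs are comparable; the step size, horizon, and sizes are hand-tuned for this toy and may need adjustment:

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, d, n_data = 200, 2, 512
X = rng.normal(size=(n_data, d))
W_star = rng.normal(size=(3, d))                 # teacher with 3 ReLU units
a_star = np.array([1.0, -1.0, 1.0])
y = np.maximum(X @ W_star.T, 0.0) @ a_star

def train(alpha, steps=20000, eta0=0.2):
    eta = eta0 / alpha ** 2                      # rescale time so runs are comparable
    W = rng.normal(size=(n_units, d))
    a = rng.normal(size=n_units)
    W0 = W.copy()
    for _ in range(steps):
        h = np.maximum(X @ W.T, 0.0)
        err = (alpha / n_units) * (h @ a) - y    # f_hat(x) - f*(x), per sample
        ga = (alpha / n_units) * h.T @ err / n_data
        gW = (alpha / n_units) * (np.outer(err, a) * (h > 0)).T @ X / n_data
        a, W = a - eta * ga, W - eta * gW
    return np.linalg.norm(W - W0) / np.linalg.norm(W0), np.mean(err ** 2)

for alpha in (0.1, 30.0):
    move, mse = train(alpha)
    # Expectation: small alpha => weights move substantially (rich);
    # large alpha => weights barely move even as the loss drops (lazy).
    print(f"alpha={alpha:5.1f}  relative weight movement={move:.3f}  mse={mse:.4f}")
```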
The infinite depth limit and other hyperparameter limits.
As with infinite width, one can arrive at a stable infinite depth limit of a deep residual network by downscaling the contribution of each layer so the residual stream does not blow up. Here, too, there are different limiting behaviors depending on the size of this downscaling factor: suppressing each layer by a factor of $\text{[depth]}^{-1}$ results in limiting dynamics in which the residual stream changes smoothly over depth ([100, 101, 102]) (reminiscent of Neural ODEs ([103])) while suppressing each layer by a factor of $\text{[depth]}^{-1/2}$ results in limiting dynamics in which the residual stream diffuses as if driven by a stochastic differential equation ([104, 105]). Networks in these two limits converge to qualitatively different solutions in realistic architectures such as transformers ([106]). It is not yet clear which is the more important limit to study.
Some deep learning architectures admit size limits other than those of large width or large depth. Instead of increasing the width or the number of distinct feedforward layers, one can also analyze the infinite limits of recurrent architectures using similar mean-field ideas ([107, 108]). State-of-the-art transformer models include more expressive constituent blocks such as multi-head self-attention layers and mixture-of-expert multi-layer perceptrons. These layers have multiple scaling directions, including head count, head size, and context length for attention ([109, 100]) and expert count, expert size, and sparsity for mixture-of-expert models ([110, 111]). Clarifying the interplay of different infinite limits in these models is important for making contact with modern practice and for disentangling various hyperparameters related to initialization and optimization (see Section 2.4).
Lastly, most optimization hyperparameters have an associated limit. As the batch size approaches infinity, we reach population gradient descent. As we take the learning rate to zero, we recover gradient flow. If we add an infinitesimal weight decay and take training time to infinity, we first optimize the loss to convergence, then perform parameter norm minimization conditioned on the final value of the loss. We discuss how to understand the corrections induced by having finite values for some of these hyperparameters in Section 2.4.
Joint scaling limits.
Sometimes scaling limits in multiple variables ($\nu_1, \nu_2$) play nicely, in the sense that $\lim_{\nu_1 \to \infty} \lim_{\nu_2 \to \infty}$ gives the same result as $\lim_{\nu_2 \to \infty} \lim_{\nu_1 \to \infty}$. For example, the infinite width and depth limits in residual networks usually commute in this way, so long as one takes a sensible parameterization ([112]). However, in many theoretical machine learning settings, different scaling dimensions do not commute, and the limiting behavior can depend on a limiting ratio $\nu_2 / \nu_1$. Such joint/proportional scaling limits are common in random matrix theory: for example, consider the SVD of a random matrix with $P$ rows and $N$ columns as $N, P \to \infty$ with $P/N$ held constant. In machine learning theory, neural networks trained with random data can often be described by a joint scaling limit where both the dataset size and parameter count are taken to infinity, but one or more of the ratios $\frac{\text{[data]}}{\text{[input dim]}}$, $\frac{\text{[data]}}{\text{[width]}}$, or $\frac{\text{[data]}}{\text{[parameters]}}$ is held at a finite value ([113, 64, 114, 115, 116, 117, 118]). This joint scaling is likely necessary in the study of compute-optimal neural scaling laws, where the training horizon (i.e. dataset size) is scaled linearly with the total parameters ([119]), and to theoretically characterize hyperparameter transfer phenomena ([120]). These joint (data & model size) limits are potentially important because infinite-parameter limits at fixed dataset size are capable of perfect interpolation and do not capture scaling law behaviors across model sizes (see Section 2.3). Other well-studied joint scaling quantities include the ratio $\frac{\text{[width]}}{\text{[depth]}}$ in non-residual networks ([121, 122, 123, 124]), the ratio $\frac{\text{[learning rate]}}{\text{[output multiplier]}}$ in the rich regime ([90]), and the "SGD noise temperature" $\frac{\text{[learning rate]}}{\text{[batch size]}}$ ([125, 126]).
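The random-matrix example above is easy to probe numerically. In the sketch below (plain NumPy, illustrative sizes), the squared singular values of a $P \times N$ Gaussian matrix at fixed aspect ratio $P/N$ concentrate, as the size grows, on the deterministic Marchenko-Pastur support $[(1-\sqrt{P/N})^2, (1+\sqrt{P/N})^2]$:

```python
import numpy as np

ratio = 0.5                                   # P / N, held constant along the limit
lam_minus = (1 - np.sqrt(ratio)) ** 2         # deterministic Marchenko-Pastur edges
lam_plus = (1 + np.sqrt(ratio)) ** 2

for N in (100, 1000, 4000):
    P = int(ratio * N)
    M = np.random.default_rng(0).normal(size=(P, N)) / np.sqrt(N)
    sq_svals = np.linalg.svd(M, compute_uv=False) ** 2
    print(f"N={N:5d}  spectrum in [{sq_svals.min():.3f}, {sq_svals.max():.3f}]"
          f"  vs. MP edges [{lam_minus:.3f}, {lam_plus:.3f}]")
```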
The Discretization Hypothesis.
Overall, the widespread use of limits to manage the complexity of deep learning reflects a recurring theme across scientific disciplines: appropriate asymptotic perspectives often render otherwise intractable systems analytically tractable. Many theorists share a heuristic belief that most practical neural networks can be understood as noisy, finite approximations to models of infinite size. [^6] By analogy, one numerically solves a partial differential equation by discretizing over space and time, and the finer the discretization, the smaller the numerical error from the desired continuum process. This is very possibly also true of deep neural networks, with width and depth taking the place of space and time. Other finite hyperparameters, such as the learning rate, batch size, and dataset size, might also be understood in this way.
[^6]: Works studying finite-size corrections to infinite limits include ([121, 127, 128, 129, 130, 131]).
We might call this belief the Discretization Hypothesis. While it has yet to be made precise or proven (see Section 5), this hypothesis has implicitly underpinned much important work, and little in the analytical study of large models makes sense without it.
The Discretization Hypothesis amounts to the statement that finite-size corrections from limits typically worsen performance while saving costs in data, time, memory, and compute. Showing that these finite-size effects deliver a general benefit that cannot be achieved any other way would falsify this hypothesis.
Deep learning is highly measurable: it is easy to track a vast array of quantities before, during, and after training. While any quantity can be measured, the most lawful are typically aggregate, macroscopic statistics over many weights and samples. For instance, the train and test losses are aggregates over many samples. These quantities are occasionally described by simple empirical laws relating one to another. Such laws have already played an important role in shaping both our understanding and practice of deep learning.
This pattern has ample precedent in the quantitative sciences. Many important physical and chemical laws were first discovered as empirical regularities and only later understood in terms of deeper principles, including laws due to Kepler, Snell, Boyle, Hooke, Faraday, Ohm, Poiseuille, and Planck. Given how often scientific fields have developed in this way, it seems likely that deep learning will continue to yield empirical laws as its science matures. Here, we highlight a handful of examples and conclude with takeaways for theorists.
![**Figure 3:** **The loss of large neural networks decays according to predictable *neural scaling laws*.** These neural scaling laws take the form of power laws (linear on log-log plots) in compute, dataset size, and parameter count. Reproduced from ([132]).](https://ittowtnkqtyixxjxrhou.supabase.co/storage/v1/object/public/public-images/jkqvvdxd/scaling_laws_original.png)
Neural scaling laws.
The single most important measurement of any machine learning system is its test loss. Given the complexity of large deep learning systems, one might expect the test loss to be a complex, unknowable function of the system's hyperparameters. This is not so: studies of neural scaling laws ([132, 133]) demonstrate that, within an architectural family, the final loss follows a predictable power law function governed by only three scalar variables: compute, the amount of data, and the network's size. These power laws are shown in Figure 3.
Why does test loss decay as a power law in these variables, and what determines the scaling law exponent? We still do not know! While scaling laws are often attributed to structure in the data, with candidate explanations in terms of the dimensionality of the data manifold ([134, 135]), feature superposition ([136]), and power laws latent in task structure ([137, 138, 139, 61, 60]), they may also depend on details of the architecture and optimizer ([140]). At present, no framework can robustly predict the observed exponents a priori from dataset and architectural properties across realistic settings (see Section 5), though recent progress has begun to move in this direction ([141]). The fact that test loss is so predictable strongly suggests that a simple underlying explanation remains to be found.
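In practice, measuring a scaling law amounts to fitting a power law (possibly with an irreducible-loss offset) to aggregate observables. A minimal sketch, with synthetic placeholder numbers standing in for measured (dataset size, test loss) pairs:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical measurements of test loss at increasing dataset sizes
D = np.array([1e4, 3e4, 1e5, 3e5, 1e6, 3e6])
L = np.array([4.16, 3.40, 2.78, 2.35, 2.00, 1.76])

def power_law(D, a, b, c):
    """L(D) = a * D^{-b} + c, with c the irreducible loss."""
    return a * D ** (-b) + c

(a, b, c), _ = curve_fit(power_law, D, L, p0=(10.0, 0.2, 1.0))
print(f"fitted exponent b = {b:.3f}, irreducible loss c = {c:.3f}")
# (L - c) vs. D is then linear on a log-log plot, as in Figure 3.
```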
Weight dynamics at the edge of stability.
Because every model is the result of a training process, we would like to understand the dynamics and trajectory of a model's weights during training. While there are simple cases where these dynamics are exactly solvable (see Section 2.1), this is usually well out of reach. The loss landscape dictates the network's dynamics, but a direct visualization of the loss, as is done in [142], suggests an immensely complicated landscape that is unlikely to have lawful regularities.
Nonetheless, some robust patterns in the coarse, aggregate properties of weight trajectories have been found. One of these is the sharpness of the network loss surface, defined as the largest eigenvalue of the Hessian with respect to the parameters. When a typical network is trained using full-batch gradient descent with learning rate $\eta$, the sharpness undergoes two distinct phases: a gradual increase (termed progressive sharpening) followed by a plateau near $2 / \eta$ ([143]; see Figure 4), called the edge of stability.
![**Figure 4:** **Gradient descent occurs near the edge of stability.** Three architectures are trained with full-batch gradient descent on CIFAR-10 with varying learning rate $\eta$. Plots show the train loss (top row) and Hessian sharpness (bottom row). For each step size $\eta$, observe that the sharpness rises to $2/\eta$ (dashed horizontal lines) and hovers at or just above this value. Reproduced from [143].](https://ittowtnkqtyixxjxrhou.supabase.co/storage/v1/object/public/public-images/jkqvvdxd/eos_original.png)
Having identified these regularities, we can begin to understand them. Progressive sharpening provably occurs in deep linear networks ([7, 144]), yet a quantitative explanation suitable to realistic nonlinear networks remains to be found (see Section 5). More is understood about why the sharpness stabilizes at $2 / \eta$. In particular, $2 / \eta$ is the maximum stable sharpness in convex optimization: any sharpness larger than $2/\eta$ would cause parameter oscillations of increasing magnitude. In more general cases, [145] showed how coarse properties of the third-order loss curvature can cause the (second-order) sharpness to stabilize at $2 / \eta$. Follow-up work reveals that loss dynamics at the edge of stability can be decomposed as smooth, time-averaged, gradient flow dynamics plus oscillations in unstable directions ([146]). These works make quantitative predictions about the parameter trajectory which closely match experiment.
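Sharpness is itself straightforward to measure, which is part of what makes the edge of stability such a crisp empirical target. Below is a minimal PyTorch sketch (toy regression data and sizes of our own choosing, not the setup of [143]) that estimates the top Hessian eigenvalue by power iteration on Hessian-vector products during full-batch gradient descent and compares it to $2/\eta$:

```python
import torch

torch.manual_seed(0)
X, y = torch.randn(256, 10), torch.randn(256)
model = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.Tanh(),
                            torch.nn.Linear(64, 1))
params = list(model.parameters())
eta = 0.02

def loss_fn():
    return ((model(X).squeeze(-1) - y) ** 2).mean()

def sharpness(iters=20):
    """Estimate the top Hessian eigenvalue by power iteration on HVPs."""
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        g = torch.autograd.grad(loss_fn(), params, create_graph=True)
        gv = sum((gi * vi).sum() for gi, vi in zip(g, v))
        Hv = torch.autograd.grad(gv, params)          # Hessian-vector product
        norm = torch.sqrt(sum((h ** 2).sum() for h in Hv))
        v = [h / norm for h in Hv]
    return norm.item()

for t in range(2001):
    g = torch.autograd.grad(loss_fn(), params)
    with torch.no_grad():
        for p, gi in zip(params, g):
            p -= eta * gi                             # full-batch gradient descent
    if t % 500 == 0:
        print(f"step {t}: loss {loss_fn().item():.4f}, "
              f"sharpness {sharpness():.1f}, threshold 2/eta = {2 / eta:.1f}")
```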
Coarse properties of hidden representations and weights.
There are a handful of other cases in which coarse properties of neural networks' hidden representations and weights are known to obey simple equations. We will briefly mention three of these.
Neural collapse. Consider a neural network classifier trained to choose among $C$ classes. [147] found that, at the end of training, the final-hidden-layer representations of samples from each class tend to cluster tightly around their class mean. Furthermore, the $C$ class mean vectors form a regular simplex. Later theoretical work has explained this geometric arrangement as the natural energy-minimizing configuration when (a) the loss used is cross-entropy and (b) a small amount of weight decay is applied ([148]).[^7]
[^7]: This parallels how gradient descent on separable logistic regression converges in direction to the max-margin separator ([149]).
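These geometric claims are easy to check on any trained classifier. The sketch below (NumPy, with a synthetic collapsed configuration standing in for real final-layer features) computes two simplified diagnostics: a within-class to between-class scatter ratio, which collapse drives toward zero, and the pairwise cosines of the centered class means, which for a regular simplex equal $-1/(C-1)$:

```python
import numpy as np

def collapse_metrics(H, labels, C):
    """Simplified neural-collapse diagnostics for features H of shape (n, p)."""
    means = np.stack([H[labels == c].mean(axis=0) for c in range(C)])
    M = means - H.mean(axis=0)                      # centered class means
    within = sum(((H[labels == c] - means[c]) ** 2).sum() for c in range(C))
    between = (M ** 2).sum() * (len(H) / C)
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
    cosines = (Mn @ Mn.T)[~np.eye(C, dtype=bool)]   # off-diagonal pairwise cosines
    return within / between, cosines.mean(), -1.0 / (C - 1)

# Toy usage: features clustered tightly around the vertices of a regular simplex
rng = np.random.default_rng(0)
C, p = 4, 10
simplex = np.zeros((C, p))
simplex[:, :C] = np.eye(C) - 1.0 / C                # centered regular simplex
labels = np.repeat(np.arange(C), 100)
H = simplex[labels] + 0.05 * rng.normal(size=(C * 100, p))
scatter, cos_mean, simplex_value = collapse_metrics(H, labels, C)
print(scatter, cos_mean, simplex_value)             # small scatter; cos_mean ~ -1/(C-1)
```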
The neural feature ansatz. At the other end of the network, there are some robust regularities known about the first-layer weights. [150] show that, after training, the Gram matrix of the first-layer weights ${\bm{W}}_1^\top {\bm{W}}_1$ aligns with the average gradient outer product:
$ {\bm{W}}_1^\top {\bm{W}}_1 \propto \mathbb{E}_{{\bm{x}} \sim \mathcal{P}_\text{data}} \! \left[ \nabla_{{\bm{x}}} f({\bm{x}}; {\bm{\theta}}) \, \nabla_{{\bm{x}}} f({\bm{x}}; {\bm{\theta}})^\top \right], $
where $\nabla_{{\bm{x}}} f({\bm{x}}; {\bm{\theta}})$ denotes the Jacobian of the network with respect to ${\bm{x}}$. While this rule is heuristic and inexact, it often makes strikingly accurate predictions for quantities like the top eigenvectors of ${\bm{W}}_1^\top {\bm{W}}_1$. Similar heuristics hold at deeper layers. At the time of writing, there are only partial theoretical explanations for this phenomenon; see [151, 152].
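The ansatz can be tested in a few lines. In the sketch below (an illustrative setup of our own choosing, not the experiments of [150]), a small two-layer ReLU network is trained on a target that depends on only two of the input directions, and we compare the top eigenvectors of ${\bm{W}}_1^\top {\bm{W}}_1$ and of the empirical average gradient outer product (AGOP); the step size and horizon are hand-tuned for this toy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 1024, 16, 64
X = rng.normal(size=(n, d))
y = np.maximum(X[:, 0], 0.0) * X[:, 1]        # target uses 2 of the 16 directions

W = 0.1 * rng.normal(size=(h, d))
a = 0.1 * rng.normal(size=h)
eta = 0.05
for _ in range(5000):                          # plain full-batch gradient descent
    H = np.maximum(X @ W.T, 0.0)
    err = (H @ a - y) / n
    a, W = a - eta * H.T @ err, W - eta * (np.outer(err, a) * (H > 0)).T @ X

# AGOP: E_x[grad_x f(x) grad_x f(x)^T], with grad_x f(x) = W^T (a * 1[Wx > 0])
G = (a * (X @ W.T > 0)) @ W                    # rows are grad_x f(x_i)
agop = G.T @ G / n
wtw = W.T @ W
v1 = np.linalg.eigh(wtw)[1][:, -1]             # top eigenvector of each matrix
v2 = np.linalg.eigh(agop)[1][:, -1]
print("top-eigenvector alignment:", abs(v1 @ v2))   # near 1 when the ansatz holds
```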
Gradient flow conservation laws. A striking regularity identified in linear networks is that the difference between the covariance and Gram matrices of consecutive layers, ${\bm{W}}_\ell {\bm{W}}_\ell^\top - {\bm{W}}_{\ell + 1}^\top {\bm{W}}_{\ell + 1}$, is conserved under gradient flow ([1, 153, 154]). What initially appeared to be a curiosity of linear networks was later shown to follow from continuous symmetries of the parameterization — an instance of the Noether principle — which could then be used to identify similar conserved quantities in nonlinear networks ([155, 156, 157, 158]). For instance, the rescaling symmetries in networks with homogeneous nonlinearities (e.g., ReLU), the scale symmetries preceding normalization layers (e.g., batch normalization), the translation symmetries in the logits preceding a softmax, and the rotation symmetries between key and query matrices in attention all lead to symmetry-specific statistics of the parameters that are conserved under gradient flow and weakly broken by SGD in predictable ways.
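This conservation is easy to verify numerically for a two-layer linear network trained with a small step size, so that gradient descent approximates gradient flow. A minimal numpy sketch, with arbitrary dimensions, data, and step count of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 200
W1 = rng.normal(size=(d, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, d)) / np.sqrt(d)
X = rng.normal(size=(d, n))
Y = rng.normal(size=(d, n))

def conserved(W1, W2):
    # Difference of covariance and Gram matrices of consecutive layers.
    return W1 @ W1.T - W2.T @ W2

Q0 = conserved(W1, W2)
eta = 1e-3  # small step size, so gradient descent approximates gradient flow
for _ in range(2000):
    R = W2 @ W1 @ X - Y          # residuals of the two-layer linear network
    gW1 = W2.T @ R @ X.T / n     # d(loss)/dW1 for loss = ||R||^2 / (2n)
    gW2 = R @ (W1 @ X).T / n     # d(loss)/dW2
    W1, W2 = W1 - eta * gW1, W2 - eta * gW2

print(np.abs(conserved(W1, W2) - Q0).max())  # ~0, up to discretization error
```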
Takeaways for theorists.
Theory can be built "bottom-up," starting from first-principles math as in Section 2.1 and Section 2.2, or "top-down," starting from empirical observations and attempting to explain them. In this section we have highlighted a few notable examples of top-down theories. We expect more to come. The measurability of deep learning makes observation and empiricism a particularly fruitful approach: experiments can be iterated quickly, and they often reveal mathematically simple relations and structure in trained models. Of course, some caution is necessary: most macroscopic statistics don't obey a simple and general mathematical law — or at least don't seem to until plotted against the right quantity — and so the challenge is to find those that do. We encourage theorists of deep learning to proactively use experiments to look for lawful regularities in neural networks.
Training a deep learning system involves many numerical knobs, termed "hyperparameters." These include optimization hyperparameters such as the learning rate, batch size, momentum, and initialization variance, as well as architecture hyperparameters such as width and depth. The large number of hyperparameters in deep learning presents a challenge not only for practitioners, who must tune them carefully in order to achieve optimal performance, but also for researchers, who must grapple with many confounding factors when trying to interpret the outcome of scientific experiments. It is only in the last few years that the theory community has come to realize that hyperparameters can be disentangled and understood, and that the resulting mathematics is often both useful for practitioners and clarifying for theorists.
This study of hyperparameters bears similarities to the study of the constant parameters governing the behavior of a physical dynamical system. For example, in a fluid flowing through a pipe, a dimensionless number called the Reynolds number, computed from the pipe diameter and the fluid's speed, density, and viscosity, determines whether the flow is laminar or turbulent. While solving for the trajectory of the turbulent fluid is extremely difficult, it is nonetheless very helpful to be able to quickly predict whether flow will be turbulent at all — and how things change if you scale up the pipe diameter or increase the flow rate. Similarly, while solving the optimization dynamics of a neural network is very difficult, it is often very helpful to quickly obtain a coarse picture of how things change if you change one or more hyperparameters. In this section we highlight two lines of work in which hyperparameters have been found to admit explanatory theory.
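For reference, the Reynolds number takes the form

$ \mathrm{Re} = \dfrac{\rho \, u \, D}{\mu}, $

where $\rho$ is the fluid density, $u$ the flow speed, $D$ the pipe diameter, and $\mu$ the dynamic viscosity; pipe flow is typically laminar below roughly $\mathrm{Re} \approx 2300$ and turbulent well above this threshold.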
Understanding optimization hyperparameters.
Stochastic gradient descent has two hyperparameters: learning rate and batch size. The algorithm's dynamics are often invariant under a simultaneous rescaling of both. That is, if one doubles both the learning rate and batch size, and halves the number of optimizer steps (or equivalently, keeps fixed the number of training examples processed), then the trajectory stays nearly the same. This so-called linear scaling rule ([159]) is useful for transferring a learning rate that was tuned for one batch size to a different one. A line of theoretical work has clarified this rule of thumb by interpreting SGD as a discretization of an underlying stochastic differential equation (SDE), a perspective that predicts the linear scaling rule ([125, 126, 160, 161, 162]). [163] extended this line of work from SGD to adaptive optimizers, for which they argued that the learning rate should scale with the square root of the batch size.
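Concretely, one common form of this approximation, under simplifying assumptions on the gradient noise, models SGD with learning rate $\eta$ and batch size $B$ as

$ \mathrm{d}{\bm{\theta}}_t = -\nabla L({\bm{\theta}}_t)\,\mathrm{d}t + \sqrt{\tfrac{\eta}{B}}\,{\bm{\Sigma}}({\bm{\theta}}_t)^{1/2}\,\mathrm{d}{\bm{W}}_t, $

where ${\bm{\Sigma}}({\bm{\theta}})$ is the covariance of the per-example gradients and ${\bm{W}}_t$ is a Wiener process. The noise scale depends on $\eta$ and $B$ only through the ratio $\eta / B$, so doubling both leaves the SDE unchanged; this is precisely the linear scaling rule.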
This invariance perspective explains how to adjust hyperparameters across batch sizes, but not how to choose the batch size itself. That choice involves an inherent tradeoff between two resources: serial time (the number of sequential training steps) and overall compute (the total amount of computation, often closely tied to cost) ([164, 165, 166, 167]). For a practitioner who cares only about serial time and not at all about cost, the optimal batch size is the full dataset. Conversely, for a practitioner who cares only about cost and not at all about serial time, the optimal batch size is 1. In reality, no practitioner falls exactly in either bucket; a practitioner might care more about one resource than the other, but is generally willing to accept some slack in return for a better deal on the other. A frequently discussed concept is the critical batch size: roughly, the batch size beyond which further parallelism buys little reduction in serial time per unit of additional compute. [166] proposed a simple model of this tradeoff under which the Pareto frontier between serial time and compute takes the form of a hyperbola.
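In the model of [166] (paraphrasing their result), the number of optimization steps $S$ and the number of processed examples $E$ required to reach a given loss approximately satisfy

$ \left(\dfrac{S}{S_{\min}} - 1\right)\left(\dfrac{E}{E_{\min}} - 1\right) = 1, $

where $S_{\min}$ is the minimum achievable number of serial steps (reached at very large batch size) and $E_{\min}$ is the minimum number of examples (reached at batch size 1). The critical batch size is then of order $E_{\min}/S_{\min}$: below it, extra parallelism is nearly free; above it, extra parallelism mostly wastes compute.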
Optimization hyperparameters in deep learning affect not just the speed and cost of training but also the trajectory that training follows. This in turn affects various properties of the learned network, including generalization performance ([168, 169]) and compressibility ([170, 171]). A fruitful line of work has sought to explain these effects through the hypothesis that many implicit effects of optimizer hyperparameters can be understood as implicit regularization of loss function curvature.[^8] Empirical studies initially observed that first-order optimizers regularize the curvature (i.e. Hessian) of the loss function, with larger learning rates and smaller batch sizes yielding stronger regularization strengths ([168, 126, 172, 143]). Meanwhile, theoretical works in simplified settings showed that this effect can be explained by Taylor-expanding the objective to third order, as such a calculation reveals that oscillating or fluctuating dynamics automatically induce curvature regularization ([173, 174, 175, 176, 177]). Building on this body of work, [146] recently showed that for several optimizers in the full-batch setting, the whole training trajectory on realistic neural nets is well-modeled by a curvature-penalized gradient flow, where the role of the hyperparameters is to modulate both the form and strength of the curvature penalty. As a result, we now have a mathematical understanding of the learning rate in full-batch gradient descent, and are mostly free to instead study the simpler dynamics of gradient flow plus a loss curvature penalty.[^9] Other analyses have developed analogous characterizations for stochastic dynamics in more specialized settings ([23, 24]). Fully extending this characterization to stochastic and adaptive optimizers would give us a common language for reasoning about the implicit effects of optimization hyperparameters on the training trajectory. It then remains to understand how these modifications to the training trajectory influence properties of the learned network (see Section 5).
[^8]: An additional, but apparently weaker, effect is captured by an implicit regularization of the gradient norm ([178, 179]).
[^9]: This perspective is reminiscent of the Itô correction in stochastic calculus: after a nonlinear transformation, noise can contribute an additional deterministic drift. Likewise, stochastic or oscillatory optimization dynamics may be described by an effective flow on a modified loss.
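Schematically, and as an assumption-laden summary of these works rather than a statement of any one result, the time-averaged trajectory can be pictured as a penalized gradient flow

$ \dfrac{\mathrm{d}{\bm{\theta}}}{\mathrm{d}t} = -\nabla \left( L({\bm{\theta}}) + \lambda \, S({\bm{\theta}}) \right), $

where $S({\bm{\theta}})$ is a measure of local loss curvature, such as the largest Hessian eigenvalue, and $\lambda$ is an effective penalty strength set by the optimizer and its hyperparameters (growing, for instance, with the learning rate). The precise form and strength of the penalty depend on the optimizer; in [146] they are determined self-consistently along the trajectory.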
![**Figure 5:** **The theory of network parameterization permits learning rate transfer across widths.** Transformers of varying widths trained on WikiText-2 under standard parameterization (left) and $\mu$P (right). Under standard parameterization, the optimal learning rate decreases as model width increases. Under $\mu$P, by contrast, the optimal learning rate remains nearly constant across widths, making it possible to predict the learning rate for wide networks from experiments on narrower, cheaper models. Reproduced from [180].](https://ittowtnkqtyixxjxrhou.supabase.co/storage/v1/object/public/public-images/jkqvvdxd/mup_img_flattened.png)
Disentangling architecture hyperparameters from optimization hyperparameters.
There has been a highly successful line of work aimed at disentangling architecture hyperparameters, such as width, depth, and output multiplier (see the lazy/rich dichotomy in Section 2.2), from optimization hyperparameters, such as the learning rate and initialization variance. The Tensor Programs framework ([83, 181]) makes this separation explicit, writing hyperparameters such as the learning rate in the form $\eta = \eta_0 \cdot [\mathrm{width}]^{c}$, separating a scale-independent coefficient $\eta_0$ from a width-dependent factor with exponent $c$. This line of work then asks: how can we set these exponents such that we retain interesting training behavior at infinite width? A remarkable insight from this analysis is that all non-trivial and non-explosive scalings give one of two limiting behaviors, analogous to the rich/lazy dichotomy in Section 2.2: in the Neural Tangent Parameterization (NTP), features are frozen during training, and in the Maximal Update Parameterization ($\mu$P), features evolve. Since feature learning is essential for most tasks, this analysis tells us that $\mu$P is the scaling to use, resolving how hyperparameters should scale with model width. This understanding enables hyperparameter transfer: we can tune hyperparameters on small proxy models and then transfer them to large, production-size models, where they remain near-optimal when both models are sufficiently wide ([180]; Figure 5).
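For illustration, here is a minimal sketch of what such a width-aware parameterization can look like in code for a three-layer MLP trained with Adam. The specific exponents below are our summary of commonly cited $\mu$P prescriptions, not a statement from the text; the base learning rate `eta0` is tuned once at a small `base_width` and reused at every width.

```python
import torch

def mup_mlp(width, base_width=64, eta0=1e-2, d_in=32, d_out=1):
    """Sketch of muP-style width scaling for a three-layer MLP with Adam,
    using commonly cited muP prescriptions (assumed here): hidden weights
    get variance-1/fan_in initialization and a 1/width learning-rate
    factor, and the readout is scaled by a 1/width multiplier."""
    m = width / base_width                            # width multiplier
    W_in = (torch.randn(width, d_in) / d_in ** 0.5).requires_grad_(True)
    W_hid = (torch.randn(width, width) / width ** 0.5).requires_grad_(True)
    W_out = (torch.randn(d_out, width) / width ** 0.5).requires_grad_(True)

    def forward(x):                                   # x: (batch, d_in)
        h = torch.relu(torch.relu(x @ W_in.T) @ W_hid.T)
        return (h @ W_out.T) / m                      # 1/width output scaling

    opt = torch.optim.Adam([
        {"params": [W_in], "lr": eta0},               # input layer: O(1) lr
        {"params": [W_hid], "lr": eta0 / m},          # hidden: lr scaled 1/width
        {"params": [W_out], "lr": eta0 / m},          # readout: lr scaled 1/width
    ])
    return forward, opt
```

Under a parameterization of this kind, sweeping `eta0` at `base_width` and reusing the optimum at larger widths is the transfer procedure illustrated in Figure 5.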
At the same time, the theory underpinning this result is asymptotic and does not fully account for its empirical effectiveness. In practice, models are trained at widths far smaller than the dataset size, and the usefulness of transfer depends on how quickly optimal hyperparameters stabilize with width. [182], [183], and [184] take steps toward closing this gap, providing evidence that a small set of spectral statistics stabilizes rapidly across widths under $\mu$P and approximately governs the optimal hyperparameters. This scaling-centric approach to hyperparameters was later extended to depth scaling ([105, 104, 106]), and extending it to other scaling dimensions remains an important future direction (see Section 5).
![**Figure 6:** **Universality across architectures and data modalities.** (a): Different diffusion model architectures (from top to bottom: DDPM, a consistency model—both based on UNet—and U-ViT) converge to the same learned distribution and produce identical images when given the same input seed. Adapted from [185]. (b): As language model performance (horizontal axis) increases, their internal representations become increasingly similar to those of vision models, and more so for larger models (from yellow to purple lines). Adapted from [186].](https://ittowtnkqtyixxjxrhou.supabase.co/storage/v1/object/public/public-images/jkqvvdxd/complex_fig_7dba8ebb88af.png)
Deep learning is not a single recipe followed exactly every time: different systems use very different architectures, datasets, training algorithms, and objectives, with ingredients combined in creative ways. This versatility has enabled successes on many tasks and modalities including vision, language, speech, time series, protein sequences, and games, but the resulting model diversity makes it less clear how to approach the development of scientific theory. Do these diverse settings share deep commonalities we might hope to capture scientifically?
Here, we review a growing body of evidence that there are indeed universal phenomena at play in these diverse settings. This is good news for theory: when many different complex systems exhibit the same universal behavior, it suggests that a simple underlying explanation may exist. We highlight this universality through three different viewpoints: (1) different architectures reach comparably good performance on many tasks; (2) different datasets share similar statistical properties; and (3) the learned representations and weights across different architectures and datasets are surprisingly alike. This roughly echoes examples of universality in which disparate physical systems share deep commonalities or display similar behavior at large scales.[^10] We end by highlighting a few theoretical successes in modeling universal phenomena.
[^10]: Universal behavior across physical systems can often be understood with the renormalization group, a technique which formalizes the idea that, as one examines a system from a more and more zoomed-out perspective, most details "wash out" and only a handful of aggregate effects remain important. We note that another apt analogy for universality in deep learning, this one from biology, is convergent evolution: species that "solve similar problems" tend to "find similar solutions" after many generations.
Universal inductive biases.
Performance on a given task is often robust to variations in architectures, training algorithms, and objectives, in the sense that many alternate choices still lead to models that can solve the task. A well-known example is the choice between convolutional networks and transformers in computer vision, which after much debate have been shown to obtain similar performance when compute, dataset size, and training recipes are matched ([187, 188]). In diffusion models, this similarity has been further shown to hold at the level of input-output mappings, with transformers and UNets generating near-identical images when fed the same noise samples ([185]), as shown in Figure 6. These results strongly indicate that different architectures share similar inductive biases despite their apparent differences. As a partial explanation, recent work has shown that assuming inductive biases towards locality and adaptivity to geometric structures leads to accurate quantitative predictions about the behavior of diffusion generative models ([189, 190, 191]).
Universal structure in data.
The no-free-lunch theorem states that generalization on completely arbitrary data with a common learning strategy is not possible ([192]). Therefore, deep learning must rely on particular features of the data present across all datasets and modalities on which it succeeds. For instance, many classes of images and audio signals share power-law spectral properties, sparsity patterns, and multiscale structures, and can be analyzed with general-purpose wavelet bases ([193, 194]). A similar phenomenon in text data is the ubiquity of Zipf's law (word frequencies follow a power-law distribution), which holds across many natural and artificial languages ([195, 196]). Hierarchical, compositional structure is also routinely used to model both images and text, which can sometimes be related through a common model ([197, 198, 199]). These shared statistical properties are a partial explanation for the ability of a single learning algorithm (say, a transformer trained with SGD) to tackle seemingly unrelated datasets, leaving only the finer-grained differences between them to be learned.
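Zipf's law is straightforward to check on any corpus: rank words by frequency and fit a line in log-log space. A minimal sketch using only the standard library (`corpus.txt` is a hypothetical file; natural text typically yields an exponent close to 1):

```python
from collections import Counter
import math

def zipf_exponent(tokens):
    """Fit the exponent s in freq(rank) ~ rank^(-s) by least squares
    in log-log space."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
        / sum((x - xbar) ** 2 for x in xs)
    return -slope

# e.g. zipf_exponent(open("corpus.txt").read().lower().split())
```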
Universality in representations.
Going deeper into the internals of the network, it has been observed that representations learned by different networks can be similar across random initializations, widths, and architectures ([200, 201, 202, 186, 203]). It has been shown that networks trained to solve different tasks learn similar representations across training datasets (ImageNet and Places-365; [204]), objectives (supervised or self-supervised; [202]), and modalities (vision or language; [186]). Furthermore, this similarity grows as model size and performance increase, hinting that neural activations converge towards a universal ("Platonic") representation ([202, 186]), as shown in Figure 6. In simplified settings such as random feature representations, this convergence is a consequence of the law of large numbers applied to the feature kernels ([205, 206]); in deep linear networks, it can be proven to arise from the implicit regularization of SGD ([207]); in more diverse settings, recent evidence suggests that the universality of representations may ultimately trace its origins to universal structure in data ([186, 208]). Recent advances in identifiability theory ([209]) have also shown that representational convergence happens at the global optimum of unsupervised ([210]), self-supervised ([211]), and supervised ([212]) objective functions under a suitable data generating process ([213]). Several works have also shown empirically that this similarity can extend to the level of individual neurons ([214, 215, 216]). In some cases, similar representations have been found in both artificial neural networks and biological neural networks ([193, 217, 218]), though the extent of this correspondence remains controversial ([219]). While a global trend towards similarity is emerging, it should be noted that the range of settings in which this convergence is observed, and its extent, are not fully known (see Section 5). In particular, recent work has shown that this apparent convergence to universal representations depends crucially on the choice of comparison metric ([220]). A growing literature is devoted to understanding which representation similarity metrics one should choose in different circumstances ([221, 222]) and highlighting the cases where they can be unified ([223, 224]).
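To make one such metric concrete, linear centered kernel alignment (CKA) compares two sets of representations of the same inputs. A minimal numpy sketch follows; this is one of many possible choices, each of which probes a different aspect of representational geometry.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representation
    matrices X (n, d1) and Y (n, d2) of the same n inputs. Returns a
    value in [0, 1]; higher means more similar representations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") *
                    np.linalg.norm(Y.T @ Y, "fro"))
```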
If the mechanisms learned by large models are indeed universal, this is very encouraging for theory: behavior shared across many systems should depend primarily on the features common to all such systems, and thus admit a description simpler than any particular model in isolation. Moreover, if the internal structure of trained neural networks primarily reflects the structure of data, then in studying neural networks we may ultimately be studying the structure of data and its generating processes (see Section 5). In particular, since language data comes directly from humans, understanding its structure may teach us something new and fundamental about ourselves.
Section Summary: Different approaches to understanding deep learning, such as statistical methods that balance data complexity and efficiency, information theory viewing learning as data compression, physics-inspired models for average behaviors, and neuroscience ideas borrowing from brain functions, all offer valuable insights into how these systems work. The authors see these perspectives as complementary, each either directly aiming for or benefiting from a deeper "mechanics of learning" that examines the training process, architectures, and data interactions. Ultimately, they argue that exploring these connections will require precise studies of what happens during training to explain why deep learning succeeds so well.
There are several ongoing approaches to developing explanatory scientific theory of deep learning, each adopting a different perspective and using different sets of tools. We believe that these perspectives are essentially all complementary: all either directly seek a mechanics of learning or would symbiotically benefit from one.
The statistical perspective.
The rich tradition of classical learning theory remains influential today.[^11] [225] offer a lucid summary of its central framing: any statistical prediction method must balance expressivity (to represent the richness of real data), complexity control (to make the most of finite training data), and computational efficiency (to yield practical algorithms). It is apparent that deep learning is sufficiently expressive, but it is not clear how a good function is selected from this enormous function class, nor why simple gradient methods suffice to train such complex beasts. The modern statistical viewpoint suggests two answers: deep learning has an implicit inductive bias towards simple, well-generalizing functions ([226]), and despite their nonconvexity, the very high dimensionality (overparameterization) of neural networks makes optimization easy.
[^11]: The Simons Institute for the Theory of Computing has provided an important substrate for developing this statistical perspective of deep learning through collaborations and seminars.
These questions are good ones, and we believe these answers are basically correct. The challenge is now to make them precise in the case of neural networks. It is clear that doing so will require taking a close look at the nature of the training process. Only once we have done so will we be able to back out how this implicit bias arises and why gradient methods suffice for optimization. We do not believe these answers will be generic statements, but instead critically rely on important properties of deep learning and of natural data. The statistical perspective thus leads naturally to a serious scientific study of the mechanics of training.
The information-theoretic perspective.
A closely-related approach seeks to explain deep learning in terms of information-theoretic ideas. In this view, learning is a process of extracting information from datasets, and a learning system works when it extracts information useful for prediction while discarding irrelevant information. This perspective hopes to understand learning as compression of the dataset into either the model's parameters or its hidden representations, with good generalization resulting when this compression is successful ([227, 228]).
We find this perspective insightful, and it seems likely to us that a picture of this nature will hold. As with the statistical perspective, a major remaining question is how to make this view concrete and actionable: how do the architecture and training process of deep learning interact to actually implement this compression, and what factors make it more or less successful? Doing this, too, will require taking a close look at the nature of the training process, the architecture, the data, and their interactions. The information-theoretic perspective thus also leads naturally to a serious scientific study of the mechanics of training.
Physics of deep learning.
This community descends from the older physics of machine learning lineage ([229, 230, 231]) and essentially seeks satisfying average-case theories of neural network learning ([232, 233, 234, 235]).[^12] The close relationship between physics and machine learning was recognized by the 2024 Nobel Prize in Physics. This approach is in line with (and has largely shaped) the perspective presented in this paper, and the project of this community is arguably the development of a mechanics of learning. The challenge then is to clarify important problems and coordinate effort for efficient progress.
[^12]: In the modern day, this community is mediated in part by recurring events at the Kavli Institute for Theoretical Physics, the Aspen Center for Theoretical Physics, and the Les Houches School of Physics, and organizations such as the NSF AI Institute for Artificial Intelligence and Fundamental Interactions and the Simons collaboration on the physics of learning and neural computation.
Perspectives from neuroscience.
Several approaches to developing a science of the brain suggest analogous routes to a science of deep learning. One approach starts from hypotheses about neural systems — for example, that their computation amounts to some form of approximate probabilistic inference — and seeks to make deductions and predictions from this hypothesis ([236, 237]). Some of these predictions seem to hold suspiciously well in deep learning: see, for example, the case of edge-selective cells in the visual cortex ([193]) and edge-selective receptive fields in convolutional networks (e.g. [238]). Another approach, termed systems neuroscience, seeks to directly decompose subsets of the brain into interpretable circuits and reverse-engineer the structure of their learned representations ([239, 240, 241]). This approach resembles mechanistic interpretability, which has adopted some of its methods and intuitions.
We expect and encourage this dialogue to continue, and it seems plausible that some of these high-level hypotheses about the brain — e.g., that the brain admits at least a partial decomposition into interpretable circuits and that local circuits implicitly solve inference tasks — will turn out to be true of deep learning. The reasons these facts are true, if indeed they are, are surely bound up in the dynamical way learning actually happens. A study of the mechanics of learning is thus important to the continued exploration of these ideas.
Developmental interpretability/singular learning theory.
This approach, which grew out of the mechanistic interpretability community, seeks first-principles predictive theories of neural network learning based on the singular learning theory framework of [242], emphasizing a Bayesian perspective and aiming to understand training as a process of sequential phase transitions mediated by the geometry of the loss landscape ([243]). We see this community as seeking the same goal we suggest here — a fundamental mechanics of learning, and a rigorous foundation for interpretability — but with a toolkit that differs from the other listed perspectives. There is potential for fruitful cross-pollination and tool-sharing between these different approaches.
Science of deep learning.
It has long been appreciated by practitioners that machine learning is largely a practice of trial and error and that it may be possible and beneficial to systematize it ([244, 245, 246, 247]). Indeed, much of the rapid empirical progress of the last decade resulted from systematic organization around agreed-upon benchmark tasks ([248]). Nonetheless, the training and application of large models remains more alchemy than science. We believe that a fundamental mechanics of the learning process is the foundation on which this science will finally be built.
We discuss mechanistic interpretability separately because there is a unique opportunity for cooperation. Mechanistic interpretability aims to understand trained neural networks by identifying the internal mechanisms — features, circuits, and learned algorithms — that give rise to their behavior. At its core, this approach is guided by the belief that neural networks admit a human-understandable, mechanistic description that can be uncovered through careful empirical reverse engineering.[^13] This approach has already borne fruit: many visually striking or interpretable mechanisms have been discovered in large models to date ([249, 250, 251, 252, 253]).[^14]
[^13]: The mechanistic interpretability community does not yet share a formal definition of what constitutes a "mechanistic description," though see [254] for a recently proposed causal framing. Informally, many researchers proceed under a set of working assumptions: (1) that neural networks encode the state of internal computational variables in their activations, often referred to as "features"; (2) that successive layers transform and combine these features in structured "circuits"; and (3) that, taken together, these circuits implement algorithms that admit some level of human-understandable description.
[^14]: Mechanistic interpretability is deeply associated with Anthropic and the AI safety and Effective Altruism communities, though it is increasingly pursued in academic labs. We also note that mechanistic interpretability has recently split into an ambitious camp that hopes to develop full, explanatory scientific theory and a pragmatic camp which is mostly interested in targeted interventions for particular cases. See also [255] for a discussion of the origin of the term "mechanistic interpretability" and the dynamics of its community.
This is a complementary perspective to our own and presents a wonderful opportunity for symbiosis. At time of writing, mechanistic interpretability remains largely a qualitative science, more reliant on human-judged empirics than on compact mathematical principles or simple governing laws. This is quite natural: semantically-meaningful functions resist mathematical characterization.[^15] On the other hand, a mechanics of learning would be quantitative by definition, but by the same token will be too low-level to answer important questions of semantic meaning on its own. These approaches study the same system — i.e., deep learning — at different levels of abstraction, and so of course they can (and should) work together for mutual gain. Calls for rigorous foundations for interpretability have been steadily growing ([256, 257, 258]), and this is one thing learning mechanics can and should seek to help provide. In turn, mechanistic interpretability offers learning mechanics a rich and growing tableau of empirical phenomena ripe for the development of explanatory mathematical theory.
[^15]: For example, try writing down a function that can classify dogs vs. cats from image pixel values. The difficulty of expressing such functions in mathematics is why we invented deep learning in the first place!
Learning mechanics $\rightarrow$ mechanistic interpretability.
We emphasize two complementary avenues through which learning mechanics can support mechanistic interpretability: formalizing core assumptions and explaining how mechanisms develop through training.
Formalizing core assumptions. Learning mechanics can make explicit, formalize, and, where necessary, challenge the core and often implicit assumptions that guide interpretability research. These include:
- Linear representability: networks encode internal computational variables ("features") as directions in activation space.
- Sparsity: only a small number of features are active on any given input.
- Locality: individual mechanisms are implemented by small, identifiable parts of the network.
- Compositionality: successive layers transform and combine features into structured "circuits" that implement algorithms.
These core assumptions underpin the identification, isolation, and analysis of the internal mechanisms of trained neural networks in mechanistic interpretability research. A mathematical theory of learning offers a way to clarify the regimes in which these assumptions hold, the conditions under which they fail, and the sense in which they can be derived from training dynamics and data statistics (see Section 5).
Explaining how mechanisms develop through training. Mechanistic interpretability has generally prioritized describing what mechanisms trained neural networks have learned, and there remains a rich opportunity for work which aims to explain how and why such mechanisms form in the first place. There is already substantial interest within parts of the interpretability community in this dynamical/theoretical perspective, including work on the formation of induction heads ([276, 277]), grokking and progress measures ([278]), sudden phase transitions in circuit formation ([279, 280, 281, 282]), and the research program of developmental interpretability ([243, 283]), discussed earlier. Our goal is not to replace these efforts but to encourage deeper engagement between mechanistic interpretability and the broader landscape of mathematically grounded ideas and tools in learning mechanics. Echoing [284], we hope that learning mechanics can play a role analogous to evolution in biology: just as "nothing in biology makes sense except in the light of evolution," the internal mechanisms of trained networks may be most naturally understood in the light of the processes that give rise to them.
Learning mechanics $\leftarrow$ mechanistic interpretability.
Conversely, learning mechanics has been deeply influenced by the empirical discoveries of mechanistic interpretability, which often identify concrete phenomena that invite first-principles explanation. Mechanistic interpretability places the structure of data at the center of its analyses, revealing settings in which the relationship between input structure and learned mechanisms is especially clear ([278, 285]). By contrast, much of classical deep learning theory has relied on highly simplified data models, leaving a gap between theoretical predictions and behaviors observed in practice. In this way, mechanistic interpretability helps bridge this gap by providing learning mechanics with concrete, well-defined targets for theoretical modeling.
Several such observations have already proven influential in stimulating work in learning mechanics, including the emergence of induction heads for in-context learning ([286, 287, 288]), the role of Fourier features in algebraic tasks ([70, 72, 289]), and the geometry of features arising from the structure of correlations in the data ([251, 290, 208]). Just as the development of physics was often driven by empirical discoveries in adjacent fields, we expect progress in learning mechanics to be driven by theorists who take seriously empirical phenomena, including those uncovered by the mechanistic interpretability community, and seek to explain them.
Section Summary: This section tackles common doubts about creating a mathematical theory for deep learning, such as why decades of effort haven't succeeded yet, why understanding simple models might not apply to complex ones like large language models, and arguments that we should focus on data or high-level behaviors instead. In response, the authors argue that recent breakthroughs and scaling have provided fresh empirical insights, drawn in experts from various fields, and allowed for useful "local theories" on specific aspects like optimization and scaling laws, even if a full theory takes time. They emphasize that studying deep learning at multiple levels—from basic mechanics to model psychology—will be essential, and theory remains valuable for practical impacts and AI safety before systems understand themselves.
We have made a case that an ambitious mathematical theory of deep learning is possible and that developing this theory is a worthwhile endeavor. This is far from a universal view, and so we now address common counterarguments that a theory of deep learning is either not possible or not a goal worthy of our effort.
Competent researchers have been trying to develop a theory of deep learning for decades, and we don't have one. Surely if there was a theory, we would have already found it.
It is true that machine learning theory is a field with a long history, and certain avenues for developing theory have been thoroughly explored. Why should now be different?
There are several reasons for optimism. First, the practical success of deep learning is comparatively recent, and we have a wealth of new empirical systems to study and mine for explainable phenomena. Some of these phenomena, like the apparent convergence to universal representations discussed in Section 2.5, were only revealed by the last few years of model scaling. These developments have turned the search for a theory of deep learning from a purely mathematical pursuit into an empirical science (and one with no lack of interesting things to measure). We now have much better means to ask questions and check our answers in a tight feedback loop.
Second, the field is much bigger: empirical successes have attracted researchers from physics, mathematics, neuroscience, and other adjacent fields, and so we have more and more diverse minds on the case. Third, it is worth noting that the development of major sciences has usually taken at least several decades, so we should not be too discouraged that we do not yet have all the answers.
The objects currently understood from theory are very primitive compared to e.g. LLMs. Surely first-principles understanding of large models is too heavy a lift.
Indeed, we expect that building up to LLMs will be a heavy lift and take considerable time. The near-term hope is instead that some understanding of the basic building blocks of deep learning will prove useful even without a constructive theory that explains the whole model. We can see this happening already in isolated pockets, including empirical scaling laws (Section 2.3), mathematical prescriptions for hyperparameter scaling (Section 2.4), neural-tangent-kernel-based methods for data attribution ([291]), and theoretically motivated optimizers ([292, 293]). These "local theories" of small pieces of the deep learning stack are already useful in the training of large models, even though they are in no way comprehensive theories of the model! One might hope for similarly useful "local theories" that treat subjects like training instabilities, dataset selection and attribution, or the effect of normalization layers.
It is also important to stress that the identification of the right basic objects in a field of science often makes it possible to ask applied questions in a more sensible way. Consider, for example, how the understanding that all matter is made of atoms underlies virtually all other basic science, and how knowledge of electromagnetism permits optical and radiological tools in countless applied disciplines. As discussed in Section 3.1, we hope that learning mechanics can offer tools that adjacent fields such as mechanistic interpretability can apply to better carry out their work. In this way, rigorous work on primitive objects can aid the applied science of large models even without a rigorous theory that builds all the way up.
What matters is a model's high-level behavior. Microscopic theories are too zoomed in to see this.
Models' high-level behavior is indeed important. How does this fit in with the lower-level sciences of deep learning? We argue that deep learning may be studied at the level of physics, biology, or psychology, with this last including the study of the model's capabilities, personality ([294]), and goals. It seems likely that study at all levels will be necessary. Learning mechanics (the physics of deep learning) is the farthest from model psychology, with mechanistic interpretability (the biology) lying in the middle and connecting the two.[^16]
[^16]: We note that these three levels of study of deep learning are roughly analogous to Marr's levels of analysis of a computational system: the physical implementation of the computation, how the computation is performed algorithmically, and what is being computed ([295]).
We don't need a theory of deep learning, we need a theory of data.
We think we need both: we need a theory of the structure in data and a theory of how a parameterized model learns it. We touch on the necessity of developing a useful theory of data in Section 2.5 and Section 5. These are both part of the project of developing a mechanics of learning.
AI will understand itself before we do. Why try to build theory?
This is a present concern for human intellectual endeavors across the board. Our response here has three parts. First, theory is already useful, and will continue to be more impactful as it develops, so this scientific work is likely to make a near-term impact. Second, it seems unlikely that AI working in isolation will suddenly and separately "solve deep learning theory." It seems more likely that breakthrough progress in a transitory period will come from human scientists using or working with AI, and expert humans will remain in the loop. Third, if one's goal is AI safety, some human oversight of AI systems will be necessary (unless one trusts the AIs to fully police themselves), and having a human-parseable theory of deep learning gives us a foot in the door.
Section Summary: This section outlines key unsolved challenges in understanding the inner workings of deep learning, aiming to guide researchers toward breakthroughs in the coming decade. It explores questions like developing simple models that capture both nonlinear parameter changes and data processing, figuring out how networks exploit patterns in real-world data, and clarifying if they naturally favor simpler solutions during training. Other areas include formally defining learned features, viewing finite networks as approximations of continuous systems, removing unnecessary tweaks in models, and predicting performance trends based on data and compute scale, all to build a stronger theoretical foundation.
It is important for any field, at any stage of development, to have a sense of its important open questions and goals. In this section, we present a curated list of open directions which we expect can be solved by a theory of the mechanics of learning in the next decade. These directions are loosely ordered by their connection to the lines of evidence introduced in Section 2. We hope this helps sharpen a shared research agenda. For a longer catalog and a forum for community discussion, see learningmechanics.pub/openquestions.
🔮 What are simple, solvable models of genuinely deep, nonlinear learning?
As discussed in Section 2.1, deep linear networks and kernel methods are the two main workhorse solvable models of learning mechanics.
The first captures nonlinear dynamics of the parameters, and the second learns nonlinear functions of the data.
While a few special cases of solvable models with both forms of nonlinearity are known, no unified framework has emerged.
Can we get the best of both worlds while maintaining some level of generality? Is there a class of solvable model that captures both deep, nonlinear dynamics and nonlinear function learning? Can such models illuminate new things about feature learning, the role of depth, optimization phenomena (e.g. progressive sharpening), and architectural innovations (e.g. normalization layers, residual streams, self-attention, and gated nonlinearities)? Can it be usefully applied to modern learning paradigms like self-supervised learning, reinforcement learning, and denoising diffusion?
🐘 What would a theory capable of capturing natural data look like?
Deep neural networks find and exploit structure in natural data.
This means that the structure of the data must somehow enter into our theories.
What is this structure, and how do we find it?
Despite the complexity of data, in many cases models appear to derive their learning signal from a small set of sufficient statistics.
What are these minimal data statistics, and how do they enter into a predictive theory of what the model learns? Are these statistics different for different models and at different stages of training? Can we describe the relevant structure in a dataset in terms of a model with free parameters found via an empirical fit?
🧮 Does deep learning implicitly minimize some notion of functional complexity?
Deep networks trained by conventional optimizers are widely believed to have some sort of bias towards learning simple functions.
This idea has surfaced many times under different names (e.g. implicit regularization, maximum margin bias, simplicity bias, and spectral bias), but has only been characterized precisely in highly specific settings, and a general picture has not been found.
Do deep neural networks broadly seek to minimize some precise notion of complexity among functions with low loss?
If so, what is the appropriate notion of complexity — Kolmogorov, circuit, weight norm, or something else? In what settings or limits is this minimization exact, and when is it only approximate? Do the sparse features and circuitry studied by mechanistic interpretability naturally emerge as the solution to this minimization problem?
🔬 How do we formally define the features learned by neural networks?
Mechanistic interpretability seeks to identify and disentangle the features, circuits, and mechanisms learned by neural networks.
Can these concepts be given precise mathematical definitions grounded in first principles? What formal structures naturally emerge from such a definition? Can we use these notions to evaluate and formalize central assumptions of mechanistic interpretability, including linear representability, locality, sparsity, and compositionality, as discussed in Section 3? How do these ideas connect with the less semantically-meaningful — but more precise — rich vs. lazy picture of feature learning discussed in Section 2.2?
♾️ Are finite neural networks properly understood as approximations to infinite limits?
In Section 2.2, we articulated the Discretization Hypothesis, which states that finite neural networks are simply discretized approximations to infinite networks, analogous to how a spatiotemporal discretization is used to numerically approximate the solution to a differential equation. For network width, the limiting continuous object is the measure of neuron activity in hidden layers, while finite depth in a residual network can be viewed as a discretization of a neural SDE or ODE.
Small step sizes can render stochastic optimization algorithms approximately equivalent to some kind of flow. In this view, increasing model size (and decreasing learning rate while commensurately increasing step count) serve essentially to improve model performance by decreasing discretization error, at the price of additional computation.
Is this the right way to understand width, depth, learning rate, and other finite hyperparameters in deep learning? What does the limiting continuum system look like?
🧹 Can we understand and eliminate all hyperparameters?
In Section 2.2 and Section 2.4, we outlined a research program in which hyperparameters are systematically analyzed, disentangled, and in some cases removed by taking appropriate limits.
How far can this program go? Can we reach zero hyperparameters, or are some hyperparameters irreducible? If we eliminate all hyperparameters, what remains?
📐 Can we predict scaling law exponents a priori?
As discussed in Section 2.3, large models exhibit robust power-law scaling of loss with respect to model size, data, and compute.
The observed exponents are nontrivial: they do not appear to be simple fractions which might result from elementary dimensionality arguments.
It is widely believed that these values are driven largely by structure latent in the dataset, but may also depend on details of the architecture and optimizer.
While many explanations for scaling laws have been proposed, a decisive test of any such theory is its ability to predict these exponents quantitatively from first principles.
At present, no framework can robustly do so across realistic settings.
Can we develop a theory of scaling laws that both explains why power laws arise and predicts their exponents a priori? What measurements of the dataset, architecture, and optimization are required to do so?
🎢 How does loss curvature interplay with architecture, features, and generalization?
As discussed in Section 2.3 and Section 2.4, a significant feature of deep learning optimization is that the optimizer implicitly regularizes the curvature (i.e. Hessian) along its trajectory, by steering towards regions of the loss landscape with lower curvature. While progress has been made on formalizing this effect using curvature-penalized gradient flows, it remains unclear how these curvature dynamics relate to other concerns in deep learning theory. Why does the curvature tend to rise in the absence of any such implicit regularization, and can this "progressive sharpening" be attributed to certain properties of the architecture or data distribution? How does the implicit curvature regularization affect the features that are learned? Why does it sometimes lead to improved generalization?
🏎️ What makes for a good optimizer in deep learning?
It remains fundamentally unclear why some deep learning optimizers work better than others.
Why do adaptive methods, such as Adam and Muon, consistently outperform simpler alternatives like SGD when training large language models? How does adaptive preconditioning in these optimizers interact with a network's architecture and loss landscape to lead to faster, more stable training? Can we identify fundamental principles that explain the success of modern optimizers, predict when they will fail, and guide the design of new ones?
👯‍♀️ In what sense do large models trained differently learn similar representations?
In Section 2.5, we discussed evidence that large models trained from different random seeds — and sometimes even with different widths, architectures, data, or objectives — tend to learn similar internal representations.
A precise version of this claim would be very powerful: understanding how representation learning is universal would give us confidence that theory developed for one model and setting transfers to many others.
The central difficulty here is methodological: how do we assess "similarity"? There is no single way to compare high-dimensional representations — metrics based on kernel alignment, nearest-neighbors, model stitching, and more compare different aspects of representation geometry. Which ones are stable across training regimes? What is the appropriate metric that quantifies this similarity? What is the largest range of experimental settings under which convergence is observed — what are the representation universality classes?
Section Summary: Getting involved in developing learning mechanics, a branch of deep learning theory, is open to newcomers without needing a specific academic background—just undergraduate math, some familiarity with deep learning, and a willingness to learn, with diverse perspectives from fields like physics or neuroscience adding value. The section offers practical guidance through six key principles: run frequent simple experiments to test ideas, prioritize clear insights over complex techniques, focus on deepening understanding rather than chasing top performance benchmarks, collaborate with others for mentorship and feedback, explore various problems early on before specializing, and build skills in fundamental tools from related disciplines like statistics or optimization. These tenets aim to foster long-term impact and community integration, with encouragement to watch talks and seek online resources for a smoother start.
It is always difficult to start doing research in a new field. Consequently, we would like to make it as easy as possible for newcomers to get started. In this section, we extend a hand with some encouragement and advice.
There is no specific academic background required to do useful work in this field. Well-regarded researchers in deep learning theory come from backgrounds in physics, mathematics, computer science, neuroscience, statistics, and more. Moreover, knowing another field well is useful, and established ideas from other fields can be applied to deep learning in some form or another, as the diversity of perspectives on deep learning theory attests (Section 3). Good things grow from cross-pollination. A firm grasp of undergraduate mathematics, a familiarity with deep learning, and a desire to learn are the only definite prerequisites.
If you want to join this field, you are more than welcome. While there is no single correct way to craft theory, there are plenty of pitfalls that many of us encountered when starting out. To help avoid some of them, we have compiled a shortlist of guiding principles for doing research in this field. These tenets are not intended to maximize your number of citations in the short term, and following them may involve some swimming against the current of academia. Instead, they are intended to maximize your impact in the long term and your ability to integrate and contribute to the community.
1. Run simple experiments, early and often. Fast, minimal experiments are the cheapest way to test an idea and keep your theory tethered to reality.
2. Prioritize clarity of insight over sophistication of technique. A simple argument that explains a phenomenon is worth more than a heavy calculation that obscures it.
3. Aim to understand, not to top benchmarks. Chasing state-of-the-art performance is a different game; ours is explanation.
4. Find collaborators and mentors. Feedback from the community will sharpen your questions and spare you known dead ends.
5. Explore broadly before specializing. Trying several problems early on is the best way to find where your taste and skills are most valuable.
6. Build fluency in the fundamental tools of adjacent disciplines, such as statistics and optimization; these pay dividends across many problems.
[^1]: For a good example of this, see [180].
[^2]: Most authors of this paper did this, as you can see from our respective research records!
We will put as much useful introductory material as we can on learningmechanics.pub, and we encourage discussion in the comments there. We also encourage taking a crack at the open directions in Section 5. Work hard, have fun, and best of luck — we hope to see a great deal more fundamental science of deep learning in the next few years!
Section Summary: The authors express thanks to a wide range of experts who provided feedback on their paper, including researchers in machine learning theory, AI safety and mechanistic interpretability, as well as practitioners in deep learning, neuroscientists, and physicists. This group encompasses overlapping communities from various fields, ensuring diverse perspectives. They specifically name individuals such as Alberto Bietti, Alex Infanger, and many others up to Zohar Ringel for their contributions.
We are grateful for feedback on this paper from many people from several overlapping groups: researchers working on the theory of machine learning, researchers working on AI safety and mechanistic interpretability, practitioners and applied deep learning scientists, neuroscientists, and physicists. This includes Alberto Bietti, Alex Infanger, Alex Williams, Amil Dravid, Anthony Thomas, Avrajit Ghosh, Bin Yu, Bruno Loureiro, Chandan Singh, Clémentine Dominé, David Berman, David Klindt, Denny Wu, Ev Gunter, Honam Wong, Itay Lavie, Jacob Yates, Jacob Zavatone-Veth, Jeff Gore, Jesse Hoogland, Jiechao Feng, Jingfeng Wu, Kaden Tro, Lauren Greenspan, Lenka Zdeborová, Lily Stelling, Lukas Bongartz, Nina Miolane, Noa Rubin, Peter Bartlett, Raymond Fan, Samyak Jain, Soufiane Hayou, Sultan Daniels, Wanyu Lei, Yasaman Bahri, and Zohar Ringel.
Section Summary: This section is a bibliography listing over 35 academic papers and conference proceedings on the mathematics and behavior of neural networks, especially deep linear models used in machine learning. The references explore topics like how these networks learn from data through processes such as gradient descent, implicit biases that guide training toward simpler solutions, and concepts like kernel methods that explain generalization and optimization in wide or overparameterized networks. Many entries focus on exact mathematical solutions and dynamical phenomena, drawing from sources like arXiv preprints and major conferences from 1989 to 2025.
[1] Saxe et al. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proceedings of the International Conference on Learning Representations 2014.
[2] Simon et al. (2023). The eigenlearning framework: A conservation law perspective on kernel ridge regression and wide neural networks. Transactions on Machine Learning Research.
[3] Nam et al. (2025). Position: Solve layerwise linear models first to understand neural dynamical phenomena (neural collapse, emergence, lazy/rich regime, and grokking). arXiv preprint arXiv:2502.21009.
[4] Baldi, Pierre and Hornik, Kurt (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks. 2(1). pp. 53–58.
[5] Gissin et al. (2019). The implicit bias of depth: How incremental learning drives generalization. arXiv preprint arXiv:1909.12051.
[6] Atanasov et al. (2021). Neural Networks as Kernel Learners: The Silent Alignment Effect. In International Conference on Learning Representations.
[7] Even et al. (2023). (S)GD over Diagonal Linear Networks: Implicit Bias, Large Stepsizes and Edge of Stability. Advances in Neural Information Processing Systems. 36. pp. 29406–29448.
[8] Woodworth et al. (2020). Kernel and rich regimes in overparametrized models. In Conference on Learning Theory. pp. 3635–3673.
[9] Kunin et al. (2024). Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning. Advances in Neural Information Processing Systems. 37. pp. 81157–81203.
[10] Fukumizu, Kenji (1998). Effect of batch learning in multilayer neural networks.
[11] Tarmoun et al. (2021). Understanding the dynamics of gradient flow in overparameterized linear models. In International Conference on Machine Learning. pp. 10153–10161.
[12] Dominé et al. (2025). From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks. In The Thirteenth International Conference on Learning Representations.
[13] Lampinen, Andrew K and Ganguli, Surya (2018). An analytic theory of generalization dynamics and transfer learning in deep linear networks. arXiv preprint arXiv:1809.10374.
[14] Kalimeris et al. (2019). SGD on neural networks learns functions of increasing complexity. Advances in Neural Information Processing Systems. 32.
[15] Simon et al. (2023). On the stepwise nature of self-supervised learning. In International Conference on Machine Learning. pp. 31852–31876.
[16] Gidel et al. (2019). Implicit regularization of discrete gradient dynamics in linear neural networks. Advances in Neural Information Processing Systems. 32.
[17] Li et al. (2021). Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning. In International Conference on Learning Representations.
[18] Jacot et al. (2021). Saddle-to-saddle dynamics in deep linear networks: Small initialization training, symmetry, and sparsity. arXiv preprint arXiv:2106.15933.
[19] Pesme, Scott and Flammarion, Nicolas (2023). Saddle-to-saddle dynamics in diagonal linear networks. Advances in Neural Information Processing Systems. 36. pp. 7475–7505.
[20] Gunasekar et al. (2018). Implicit bias of gradient descent on linear convolutional networks. Advances in Neural Information Processing Systems. 31.
[21] Arora et al. (2018). On the optimization of deep networks: Implicit acceleration by overparameterization. In International Conference on Machine Learning. pp. 244–253.
[22] Arora et al. (2019). Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems. 32.
[23] Pesme et al. (2021). Implicit bias of SGD for diagonal linear networks: a provable benefit of stochasticity. Advances in Neural Information Processing Systems. 34. pp. 29218–29230.
[24] Chen et al. (2024). Stochastic collapse: How gradient noise attracts SGD dynamics towards simpler subnetworks. Advances in Neural Information Processing Systems. 36.
[25] Ziyin et al. (2022). Exact solutions of a deep linear network. Advances in Neural Information Processing Systems. 35. pp. 24446–24458.
[26] Wang, Zihan and Jacot, Arthur (2024). Implicit bias of SGD in $L_2$-regularized linear DNNs: One-way jumps from high to low rank. In The Twelfth International Conference on Learning Representations.
[27] Jacot et al. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems. 31.
[28] Lee et al. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. Advances in Neural Information Processing Systems. 32.
[29] Chizat et al. (2019). On lazy training in differentiable programming. Advances in Neural Information Processing Systems. 32.
[30] Liu et al. (2020). On the linearity of large non-linear models: when and why the tangent kernel is constant. Advances in Neural Information Processing Systems. 33. pp. 15954–15964.
[31] Malladi et al. (2023). A kernel-based view of language model fine-tuning. In International Conference on Machine Learning. pp. 23610–23641.
[32] Ren, Yi and Sutherland, Danica J (2025). Learning Dynamics of LLM Finetuning. In The Thirteenth International Conference on Learning Representations.
[33] Arora et al. (2019). On exact computation with an infinitely wide neural net. Advances in Neural Information Processing Systems. 32.
[34] Geifman et al. (2020). On the similarity between the Laplace and neural tangent kernels. Advances in Neural Information Processing Systems. 33. pp. 1451–1461.
[35] Jacot et al. (2020). Kernel alignment risk estimator: Risk prediction from training data. Advances in Neural Information Processing Systems. 33. pp. 15568–15578.
[36] Canatar et al. (2021). Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature Communications. 12(1). pp. 2914.
[37] Loureiro et al. (2021). Learning curves of generic features maps for realistic datasets with a teacher-student model. Advances in Neural Information Processing Systems. 34. pp. 18137–18151.
[38] Hastie et al. (2022). Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics. 50(2). pp. 949–986.
[39] Wei et al. (2022). More than a toy: Random matrix models predict how real-world neural representations generalize. In International Conference on Machine Learning. pp. 23549–23588.
[40] Basri et al. (2020). Frequency bias in neural networks for input of non-uniform density. In International Conference on Machine Learning. pp. 685–694.
[41] Karkada et al. (2025). Predicting kernel regression learning curves from only raw data statistics. arXiv preprint arXiv:2510.14878.
[42] Belkin et al. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences. 116(32). pp. 15849–15854.
[43] Advani et al. (2020). High-dimensional dynamics of generalization error in neural networks. Neural Networks. 132. pp. 428–446.
[44] Caponnetto, Andrea and de Vito, Ernesto (2007). Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics. 7(3). pp. 331–368.
[45] Pillaud-Vivien et al. (2018). Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. Advances in Neural Information Processing Systems. 31.
[46] Cui et al. (2023). Error scaling laws for kernel classification under source and capacity conditions. Machine Learning: Science and Technology. 4(3). pp. 035033.
[47] Atanasov et al. (2024). Scaling and renormalization in high-dimensional regression. arXiv preprint arXiv:2405.00592.
[48] Ghorbani et al. (2020). When do neural networks outperform kernel methods? Advances in Neural Information Processing Systems. 33. pp. 14820–14830.
[49] Vyas et al. (2022). Limitations of the NTK for understanding generalization in deep learning. arXiv preprint arXiv:2206.10012.
[50] Abbe et al. (2022). The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks. In Conference on Learning Theory. pp. 4782–4887.
[51] Damian et al. (2022). Neural networks can learn representations with gradient descent. In Conference on Learning Theory. pp. 5413–5452.
[52] Bietti et al. (2022). Learning single-index models with shallow neural networks. Advances in Neural Information Processing Systems. 35. pp. 9768–9783.
[53] Ba et al. (2022). High-dimensional asymptotics of feature learning: How one gradient step improves the representation. Advances in Neural Information Processing Systems. 35. pp. 37932–37946.
[54] Dandi et al. (2023). How two-layer neural networks learn, one (giant) step at a time. arXiv preprint arXiv:2305.18270.
[55] Barbier et al. (2019). Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences. 116(12). pp. 5451–5460.
[56] Aubin et al. (2018). The committee machine: Computational to statistical gaps in learning a two-layers neural network. Advances in Neural Information Processing Systems. 31.
[57] Mignacco et al. (2020). Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification. Advances in Neural Information Processing Systems. 33. pp. 9540–9550.
[58] Erba et al. (2025). The nuclear route: Sharp asymptotics of ERM in overparameterized quadratic networks. arXiv preprint arXiv:2505.17958.
[59] Ben Arous et al. (2025). Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws. arXiv preprint arXiv:2508.03688.
[60] Defilippis et al. (2025). Scaling laws and spectra of shallow neural networks in the feature learning regime. arXiv preprint arXiv:2509.24882.
[61] Ren et al. (2025). Emergence and scaling laws in SGD learning of shallow neural networks. arXiv preprint arXiv:2504.19983.
[62] Soudry et al. (2018). The implicit bias of gradient descent on separable data. Journal of Machine Learning Research. 19(70). pp. 1–57.
[63] Lyu, Kaifeng and Li, Jian (2020). Gradient Descent Maximizes the Margin of Homogeneous Neural Networks. In International Conference on Learning Representations.
[64] Saad, David and Solla, Sara A (1995). Exact solution for on-line learning in multilayer neural networks. Physical Review Letters. 74(21). pp. 4337–4340.
[65] Goldt et al. (2019). Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. Advances in Neural Information Processing Systems. 32.
[66] Ben Arous et al. (2022). High-dimensional limit theorems for SGD: Effective dynamics and critical scaling. Advances in Neural Information Processing Systems. 35. pp. 25349–25362.
[67] Veiga et al. (2022). Phase diagram of stochastic gradient descent in high-dimensional two-layer neural networks. Advances in Neural Information Processing Systems. 35. pp. 23244–23255.
[68] Zavatone-Veth et al. (2025). Summary statistics of learning link changing neural representations to behavior. Frontiers in Neural Circuits. 19. pp. 1618351.
[69] Nichani et al. (2025). Understanding Factual Recall in Transformers via Associative Memories. In The Thirteenth International Conference on Learning Representations.
[70] Morwani et al. (2023). Feature emergence via margin maximization: case studies in algebraic tasks. arXiv preprint arXiv:2311.07568.
[71] Gromov, Andrey (2023). Grokking modular arithmetic. arXiv preprint arXiv:2301.02679.
[72] Kunin et al. (2025). Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks. arXiv preprint arXiv:2506.06489.
[73] Zhang et al. (2025). Training dynamics of in-context learning in linear attention. arXiv preprint arXiv:2501.16265.
[74] Boncoraglio et al. (2025). Single-Head Attention in High Dimensions: A Theory of Generalization, Weights Spectra, and Scaling Laws. In Workshop on Scientific Methods for Understanding Deep Learning.
[75] Bordelon et al. (2025). How Feature Learning Can Improve Neural Scaling Laws. In The Thirteenth International Conference on Learning Representations.
[76] Neal, Radford M (1996). Priors for infinite networks. In Bayesian Learning for Neural Networks. pp. 29–53. Springer.
[77] Poole et al. (2016). Exponential expressivity in deep neural networks through transient chaos. Advances in Neural Information Processing Systems. 29.
[78] LeCun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE. 86(11). pp. 2278–2324. doi:10.1109/5.726791.
[79] Mei et al. (2019). Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on Learning Theory. pp. 2388–2464.
[80] Rotskoff, Grant and Vanden-Eijnden, Eric (2018). Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. Advances in Neural Information Processing Systems. 31.
[81] Chizat, Lénaïc and Bach, Francis (2018). On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in Neural Information Processing Systems. 31.
[82] Geiger et al. (2020). Disentangling feature and lazy training in deep neural networks. Journal of Statistical Mechanics: Theory and Experiment. 2020(11). pp. 113301.
[83] Yang, Greg and Hu, Edward J (2021). Tensor programs IV: Feature learning in infinite-width neural networks. In International Conference on Machine Learning. pp. 11727–11737.
[84] Bordelon, Blake and Pehlevan, Cengiz (2022). Self-consistent dynamical field theory of kernel evolution in wide neural networks. Advances in Neural Information Processing Systems. 35. pp. 32240–32256.
[85] Mei et al. (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences. 115(33). pp. E7665–E7671.
[86] Moniri et al. (2023). A theory of non-linear feature learning with one gradient step in two-layer neural networks. arXiv preprint arXiv:2310.07891.
[87] Cui et al. (2024). Asymptotics of feature learning in two-layer networks after one gradient-step. arXiv preprint arXiv:2402.04980.
[88] Montanari, Andrea and Wang, Zihao (2026). Phase transitions for feature learning in neural networks. arXiv preprint arXiv:2602.01434.
[89] Saxe, Andrew Michael (2015). Deep linear neural networks: A theory of learning in the brain and mind. PhD thesis. Stanford University.
[90] Atanasov et al. (2025). The Optimization Landscape of SGD Across the Feature Learning Strength. In The Thirteenth International Conference on Learning Representations.
[91] Lee et al. (2017). Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165.
[92] Cohen et al. (2021). Learning curves for overparametrized deep neural networks: A field theory perspective. Physical Review Research. 3(2). pp. 023034.
[93] Lavie et al. (2024). Towards understanding inductive bias in transformers: A view from infinity. arXiv preprint arXiv:2402.05173.
[94] Seroussi et al. (2023). Separation of scales and a thermodynamic description of feature learning in some cnns. Nature Communications. 14(1). pp. 908.
[95] Rubin et al. (2023). Grokking as a first order phase transition in two layer networks. arXiv preprint arXiv:2310.03789.
[96] Rubin et al. (2025). From kernels to features: A multi-scale adaptive theory of feature learning. arXiv preprint arXiv:2502.03210.
[97] Rubin et al. (2025). Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity. arXiv preprint arXiv:2512.04165.
[98] Yang et al. (2023). A theory of representation learning gives a deep generalisation of kernel methods. In International Conference on Machine Learning. pp. 39380–39415.
[99] Maennel et al. (2018). Gradient descent quantizes ReLU network features. arXiv preprint arXiv:1803.08367.
[100] Bordelon et al. (2024). Infinite limits of multi-head transformer dynamics. Advances in Neural Information Processing Systems. 37. pp. 35824–35878.
[101] Chizat, Lénaïc (2025). The hidden width of deep ResNets: Tight error bounds and phase diagrams. arXiv preprint arXiv:2509.10167.
[102] Chaintron et al. (2026). ResNets of all shapes and sizes: Convergence of training dynamics in the large-scale limit. arXiv preprint arXiv:2603.18168.
[103] Chen et al. (2018). Neural ordinary differential equations. Advances in Neural Information Processing Systems. 31.
[104] Bordelon et al. (2023). Depthwise hyperparameter transfer in residual networks: Dynamics and scaling limit. arXiv preprint arXiv:2309.16620.
[105] Yang et al. (2023). Tensor programs VI: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244.
[106] Dey et al. (2025). Don't be lazy: CompleteP enables compute-efficient deep transformers. arXiv preprint arXiv:2505.01618.
[107] Clark et al. (2026). Structure, disorder, and dynamics in task-trained recurrent neural circuits. bioRxiv preprint.
[108] Bauer et al. (2026). A unified theory of feature learning in RNNs and DNNs. arXiv preprint arXiv:2602.15593.
[109] Hron et al. (2020). Infinite attention: NNGP and NTK for deep attention networks. In International Conference on Machine Learning. pp. 4376–4386.
[110] Małaśnicki et al. (2025). $\mu$-Parametrization for Mixture of Experts. arXiv preprint arXiv:2508.09752.
[111] Jiang et al. (2026). Hyperparameter Transfer with Mixture-of-Expert Layers. arXiv preprint arXiv:2601.20205.
[112] Hayou, Soufiane and Yang, Greg (2023). Width and depth limits commute in residual networks. In International Conference on Machine Learning. pp. 12700–12723.
[113] Seung et al. (1992). Statistical mechanics of learning from examples. Physical Review A. 45(8). pp. 6056.
[114] Zdeborová, Lenka and Krzakala, Florent (2016). Statistical physics of inference: Thresholds and algorithms. Advances in Physics. 65(5). pp. 453–552.
[115] Li, Qianyi and Sompolinsky, Haim (2021). Statistical mechanics of deep linear neural networks: The backpropagating kernel renormalization. Physical Review X. 11(3). pp. 031059.
[116] Maillard et al. (2024). Bayes-optimal learning of an extensive-width neural network from quadratically many samples. Advances in Neural Information Processing Systems. 37. pp. 82085–82132.
[117] Martin et al. (2024). On the impact of overparameterization on the training of a shallow neural network in high dimensions. In International Conference on Artificial Intelligence and Statistics. pp. 3655–3663.
[118] Barbier et al. (2025). Statistical physics of deep learning: Optimal learning of a multi-layer perceptron near interpolation. arXiv preprint arXiv:2510.24616.
[119] Hoffmann et al. (2022). Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
[120] Bordelon, Blake and Pehlevan, Cengiz (2025). Deep linear network training dynamics from random initialization: Data, width, depth, and hyperparameter transfer. arXiv preprint arXiv:2502.02531.
[121] Hanin, Boris and Nica, Mihai (2019). Finite depth and width corrections to the neural tangent kernel. arXiv preprint arXiv:1909.05989.
[122] Li et al. (2022). The neural covariance SDE: Shaped infinite depth-and-width networks at initialization. Advances in Neural Information Processing Systems. 35. pp. 10795–10808.
[123] Noci et al. (2023). The shaped transformer: Attention models in the infinite depth-and-width limit. Advances in Neural Information Processing Systems. 36. pp. 54250–54281.
[124] Hanin, Boris and Jiang, Tianze (2025). Global Universality of Singular Values in Products of Many Large Random Matrices. arXiv preprint arXiv:2503.07872.
[125] Mandt et al. (2017). Stochastic gradient descent as approximate bayesian inference. Journal of Machine Learning Research. 18(134). pp. 1–35.
[126] Jastrzebski et al. (2017). Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623.
[127] Roberts et al. (2022). The principles of deep learning theory. Cambridge University Press.
[128] Zavatone-Veth et al. (2021). Asymptotics of representation learning in finite Bayesian neural networks. Advances in Neural Information Processing Systems. 34. pp. 24765–24777.
[129] Segadlo et al. (2022). Unified field theoretical approach to deep and recurrent neuronal networks. Journal of Statistical Mechanics: Theory and Experiment. 2022(10). pp. 103401.
[130] Bordelon, Blake and Pehlevan, Cengiz (2023). Dynamics of finite width kernel and prediction fluctuations in mean field neural networks. Advances in Neural Information Processing Systems. 36. pp. 9707–9750.
[131] Glasgow et al. (2025). Propagation of chaos in one-hidden-layer neural networks beyond logarithmic time. arXiv preprint arXiv:2504.13110.
[132] Kaplan et al. (2020). Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.
[133] Hestness et al. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409.
[134] Sharma, Utkarsh and Kaplan, Jared (2022). Scaling laws from the data manifold dimension. Journal of Machine Learning Research. 23(9). pp. 1–34.
[135] Bahri et al. (2024). Explaining neural scaling laws. Proceedings of the National Academy of Sciences. 121(27). doi:10.1073/pnas.2311878121.
[136] Liu et al. (2025). Superposition Yields Robust Neural Scaling. arXiv preprint arXiv:2505.10465.
[137] Cui et al. (2021). Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. Advances in Neural Information Processing Systems. 34. pp. 10131–10143.
[138] Bordelon et al. (2024). A dynamical model of neural scaling laws. arXiv preprint arXiv:2402.01092.
[139] Michaud et al. (2023). The quantization model of neural scaling. Advances in Neural Information Processing Systems. 36. pp. 28699–28722.
[140] Barkeshli et al. (2026). On the origin of neural scaling laws: from random graphs to natural language. arXiv preprint arXiv:2601.10684.
[141] Cagnetta et al. (2026). Deriving Neural Scaling Laws from the statistics of natural language. arXiv preprint arXiv:2602.07488.
[142] Li et al. (2018). Visualizing the Loss Landscape of Neural Nets. Advances in Neural Information Processing Systems. 31.
[143] Cohen et al. (2021). Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065.
[144] Yoo et al. (2025). Understanding Sharpness Dynamics in NN Training with a Minimalist Example: The Effects of Dataset Difficulty, Depth, Stochasticity, and More. In International Conference on Machine Learning. pp. 72574–72617.
[145] Damian et al. (2022). Self-stabilization: The implicit bias of gradient descent at the edge of stability. arXiv preprint arXiv:2209.15594.
[146] Cohen et al. (2025). Understanding Optimization in Deep Learning with Central Flows. arXiv preprint arXiv:2410.24206.
[147] Papyan et al. (2020). Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences. 117(40). pp. 24652–24663. doi:10.1073/pnas.2015509117.
[148] Zhu et al. (2021). A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems. 34. pp. 29820–29834.
[149] Soudry et al. (2018). The implicit bias of gradient descent on separable data. Journal of Machine Learning Research. 19(70). pp. 1–57.
[150] Radhakrishnan et al. (2024). Mechanism for feature learning in neural networks and backpropagation-free machine learning models. Science. 383(6690). pp. 1461–1467. doi:10.1126/science.adi5639.
[151] Ziyin et al. (2024). Formation of representations in neural networks. arXiv preprint arXiv:2410.03006.
[152] Boix-Adserà et al. (2025). The Features at Convergence Theorem: a first-principles alternative to the Neural Feature Ansatz for how networks learn representations. arXiv preprint arXiv:2507.05644.
[153] Du et al. (2018). Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. Advances in Neural Information Processing Systems. 31.
[154] Arora et al. (2019). A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks. arXiv preprint arXiv:1810.02281.
[155] Kunin et al. (2021). Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics. In International Conference on Learning Representations.
[156] Tanaka, Hidenori and Kunin, Daniel (2021). Noether’s learning dynamics: Role of symmetry breaking in neural networks. Advances in Neural Information Processing Systems. 34. pp. 25646–25660.
[157] Marcotte et al. (2024). Abide by the Law and Follow the Flow: Conservation Laws for Gradient Flows. arXiv preprint arXiv:2307.00144.
[158] Marcotte et al. (2024). Keep the Momentum: Conservation Laws beyond Euclidean Gradient Flows. arXiv preprint arXiv:2405.12888.
[159] Goyal et al. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
[160] Chaudhari, Pratik and Soatto, Stefano (2018). Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In 2018 Information Theory and Applications Workshop (ITA). pp. 1–10.
[161] Li et al. (2019). Stochastic modified equations and dynamics of stochastic gradient algorithms i: Mathematical foundations. Journal of Machine Learning Research. 20(40). pp. 1–47.
[162] Li et al. (2021). On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs). arXiv preprint arXiv:2102.12470.
[163] Malladi et al. (2022). On the SDEs and Scaling Rules for Adaptive Gradient Algorithms. Advances in Neural Information Processing Systems. 35. pp. 7697–7711.
[164] Ma et al. (2018). The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In International Conference on Machine Learning. pp. 3325–3334.
[165] Jain et al. (2018). Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification. Journal of Machine Learning Research. 18(223). pp. 1–42.
[166] McCandlish et al. (2018). An empirical model of large-batch training. arXiv preprint arXiv:1812.06162.
[167] Shallue et al. (2019). Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research. 20(112). pp. 1–49.
[168] Keskar et al. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.
[169] Schulman, John and Thinking Machines Lab (2025). LoRA Without Regret. Thinking Machines Lab: Connectionism. doi:10.64434/tml.20250929.
[170] Catalan-Tatjer et al. (2025). Training Dynamics Impact Post-Training Quantization Robustness. arXiv preprint arXiv:2510.06213.
[171] Barsbey et al. (2025). Large learning rates simultaneously achieve robustness to spurious correlations and compressibility. In Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2055–2066.
[172] Jastrzebski et al. (2020). The break-even point on optimization trajectories of deep neural networks. arXiv preprint arXiv:2002.09572.
[173] Blanc et al. (2020). Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process. In Conference on Learning Theory. pp. 483–513.
[174] Li et al. (2021). What Happens after SGD Reaches Zero Loss?–A Mathematical Framework. arXiv preprint arXiv:2110.06914.
[175] Damian et al. (2021). Label noise SGD provably prefers flat global minimizers. Advances in Neural Information Processing Systems. 34. pp. 27449–27461.
[176] Wen et al. (2022). How does sharpness-aware minimization minimize sharpness? arXiv preprint arXiv:2211.05729.
[177] Li et al. (2025). Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold. arXiv preprint arXiv:2511.02773.
[178] Barrett, David GT and Dherin, Benoit (2020). Implicit gradient regularization. arXiv preprint arXiv:2009.11162.
[179] Smith et al. (2021). On the origin of implicit regularization in stochastic gradient descent. arXiv preprint arXiv:2101.12176.
[180] Yang et al. (2022). Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466.
[181] Yang, Greg and Littwin, Etai (2023). Tensor programs IVb: Adaptive optimization in the infinite-width limit. arXiv preprint arXiv:2308.01814.
[182] Noci et al. (2024). Super consistency of neural network landscapes and learning rate transfer. Advances in Neural Information Processing Systems. 37. pp. 102696–102743.
[183] Ghosh et al. (2025). Understanding the Mechanisms of Fast Hyperparameter Transfer. arXiv preprint arXiv:2512.22768.
[184] Hayou, Soufiane (2025). A Proof of Learning Rate Transfer under $\mu$P. arXiv preprint arXiv:2511.01734.
[185] Zhang et al. (2024). The emergence of reproducibility and consistency in diffusion models. In Forty-first International Conference on Machine Learning.
[186] Huh et al. (2024). Position: The platonic representation hypothesis. In Forty-first International Conference on Machine Learning.
[187] Liu et al. (2022). A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11976–11986.
[188] Smith et al. (2023). Convnets match vision transformers at scale. arXiv preprint arXiv:2310.16764.
[189] Kadkhodaie et al. (2024). Generalization in diffusion models arises from geometry-adaptive harmonic representations. In The Twelfth International Conference on Learning Representations.
[190] Kamb, Mason and Ganguli, Surya (2025). An analytic theory of creativity in convolutional diffusion models. In Forty-second International Conference on Machine Learning.
[191] Niedoba et al. (2025). Towards a Mechanistic Explanation of Diffusion Model Generalization. In Forty-second International Conference on Machine Learning.
[192] Wolpert, David H (1996). The lack of a priori distinctions between learning algorithms. Neural Computation. 8(7). pp. 1341–1390.
[193] Olshausen, Bruno A and Field, David J (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature. 381(6583). pp. 607–609.
[194] Mallat, Stéphane (1999). A wavelet tour of signal processing. Elsevier.
[195] Li, Wentian (2002). Zipf's Law everywhere. Glottometrics. 5. pp. 14–21.
[196] Piantadosi, Steven T (2014). Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review. 21(5). pp. 1112–1130.
[197] Cagnetta et al. (2024). How deep neural networks learn compositional data: The random hierarchy model. Physical Review X. 14(3). pp. 031001.
[198] Sclocchi et al. (2025). A phase transition in diffusion models reveals the hierarchical nature of data. Proceedings of the National Academy of Sciences. 122(1). pp. e2408799121.
[199] Cagnetta et al. (2025). Learning curves theory for hierarchically compositional data with power-law distributed features. arXiv preprint arXiv:2505.07067.
[200] Raghu et al. (2017). SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in Neural Information Processing Systems. 30.
[201] Kornblith et al. (2019). Similarity of neural network representations revisited. In International Conference on Machine Learning. pp. 3519–3529.
[202] Bansal et al. (2021). Revisiting model stitching to compare neural representations. Advances in Neural Information Processing Systems. 34. pp. 225–236.
[203] Moschella et al. (2022). Relative representations enable zero-shot latent space communication. arXiv preprint arXiv:2209.15430.
[204] Lenc, Karel and Vedaldi, Andrea (2015). Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 991–999.
[205] Rahimi, Ali and Recht, Benjamin (2007). Random features for large-scale kernel machines. Advances in Neural Information Processing Systems. 20.
[206] Guth et al. (2024). A rainbow in deep network black boxes. Journal of Machine Learning Research. 25(350). pp. 1–59.
[207] Ziyin, Liu and Chuang, Isaac (2025). Proof of a perfect platonic representation hypothesis. arXiv preprint arXiv:2507.01098.
[208] Karkada et al. (2026). Symmetry in Language Statistics Shapes the Geometry of Model Representations. arXiv preprint arXiv:2602.15029.
[209] Hyvärinen et al. (2024). Identifiability of latent-variable and structural-equation models: from linear to nonlinear. Annals of the Institute of Statistical Mathematics. 76(1). pp. 1–33.
[210] Klindt et al. (2020). Towards nonlinear disentanglement in natural data with temporal sparse coding. arXiv preprint arXiv:2007.10930.
[211] Zimmermann et al. (2021). Contrastive learning inverts the data generating process. In International Conference on Machine Learning. pp. 12979–12990.
[212] Reizinger et al. (2024). Cross-entropy is all you need to invert the data generating process. arXiv preprint arXiv:2410.21869.
[213] Reizinger et al. (2025). Position: An Empirically Grounded Identifiability Theory Will Accelerate Self-Supervised Learning Research. arXiv preprint arXiv:2504.13101.
[214] Li et al. (2015). Convergent Learning: Do different neural networks learn the same representations?. In Feature Extraction: Modern Questions and Challenges. pp. 196–212.
[215] Dravid et al. (2023). Rosetta neurons: Mining the common units in a model zoo. In Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1934–1943.
[216] Khosla et al. (2024). Privileged representational axes in biological and artificial neural networks. bioRxiv preprint.
[217] Yamins et al. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences. 111(23). pp. 8619–8624.
[218] McIntosh et al. (2016). Deep learning models of the retinal response to natural scenes. Advances in Neural Information Processing Systems. 29.
[219] Bowers et al. (2023). Deep problems with neural network models of human vision. Behavioral and Brain Sciences. 46. pp. e385.
[220] Gröger et al. (2026). Revisiting the Platonic Representation Hypothesis: An Aristotelian View. arXiv preprint arXiv:2602.14486.
[221] Sucholutsky et al. (2023). Getting aligned on representational alignment. arXiv preprint arXiv:2310.13018.
[222] Klabunde et al. (2025). Similarity of neural network models: A survey of functional and representational measures. ACM Computing Surveys. 57(9). pp. 1–52.
[223] Harvey et al. (2024). Duality of Bures and shape distances with implications for comparing neural representations. In Proceedings of UniReps: the First Workshop on Unifying Representations in Neural Models. pp. 11–26.
[224] Williams, Alex H (2024). Equivalence between representational similarity analysis, centered kernel alignment, and canonical correlations analysis. In Proceedings of UniReps: the Second Edition of the Workshop on Unifying Representations in Neural Models. pp. 10–23.
[225] Bartlett et al. (2021). Deep learning: a statistical viewpoint. Acta Numerica. 30. pp. 87–201.
[226] Wilson, Andrew Gordon (2025). Position: Deep Learning is Not So Mysterious or Different. In Forty-second International Conference on Machine Learning Position Paper Track.
[227] Shwartz-Ziv, Ravid and Tishby, Naftali (2017). Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
[228] Xu, Aolin and Raginsky, Maxim (2017). Information-theoretic analysis of generalization capability of learning algorithms. Advances in Neural Information Processing Systems. 30.
[229] Hopfield, John J (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences. 79(8). pp. 2554–2558.
[230] Amit et al. (1985). Spin-glass models of neural networks. Physical Review A. 32(2). pp. 1007.
[231] Gardner, Elizabeth (1988). The space of interactions in neural network models. Journal of Physics A: Mathematical and General. 21(1). pp. 257–270.
[232] Zdeborová, Lenka (2020). Understanding deep learning is also a job for physicists. Nature Physics. 16(6). pp. 602–604.
[233] Bahri et al. (2020). Statistical mechanics of deep learning. Annual Review of Condensed Matter Physics. 11(1). pp. 501–528.
[234] Michaud, Eric J (2024). A Physics of Systems that Learn. https://ericjmichaud.com/physics-of-learning.pdf.
[235] Ringel et al. (2025). Applications of statistical field theory in deep learning. arXiv preprint arXiv:2502.18553.
[236] Dayan et al. (1995). The Helmholtz machine. Neural Computation. 7(5). pp. 889–904.
[237] Friston, Karl (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience. 11(2). pp. 127–138.
[238] Zeiler, Matthew D. and Fergus, Rob (2014). Visualizing and Understanding Convolutional Networks. In European Conference on Computer Vision. pp. 818–833.
[239] Chung, SueYeon and Abbott, Larry F (2021). Neural population geometry: An approach for understanding biological and artificial neural networks. Current Opinion in Neurobiology. 70. pp. 137–144.
[240] Bernardi et al. (2020). The geometry of abstraction in the hippocampus and prefrontal cortex. Cell. 183(4). pp. 954–967.
[241] Kriegeskorte et al. (2008). Representational similarity analysis – connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience. 2. pp. 249.
[242] Watanabe, Sumio (2009). Algebraic geometry and statistical learning theory. Cambridge University Press.
[243] Hoogland et al. (2023). Towards Developmental Interpretability. LessWrong. Accessed: 2026-02-03. https://www.lesswrong.com/posts/TjaeCWvLZtEDAS5Ex/towards-developmental-interpretability.
[244] Langley, Pat (1988). Machine learning as an experimental science. Machine Learning. 3(1). pp. 5–8.
[245] Gal, Yarin (2015). The Science of Deep Learning. https://www.cs.ox.ac.uk/people/yarin.gal/website/blog_5058.html.
[246] Rahimi, Ali (2017). Let's take machine learning from alchemy to electricity. Test-of-Time Award presentation, NIPS 2017.
[247] Baraniuk et al. (2020). The science of deep learning. Proceedings of the National Academy of Sciences. 117(48). pp. 30029–30032.
[248] Donoho, David (2024). Data science at the singularity. Harvard Data Science Review. 6(1).
[249] Olah et al. (2020). Zoom in: An introduction to circuits. Distill. 5(3). pp. e00024–001.
[250] Templeton et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread. https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.
[251] Engels et al. (2024). Not all language model features are one-dimensionally linear. arXiv preprint arXiv:2405.14860.
[252] Gurnee et al. (2025). When Models Manipulate Manifolds: The Geometry of a Counting Task. Transformer Circuits Thread. https://transformer-circuits.pub/2025/linebreaks/index.html.
[253] Lindsey et al. (2025). On the Biology of a Large Language Model. Transformer Circuits Thread. https://transformer-circuits.pub/2025/attribution-graphs/biology.html.
[254] Geiger et al. (2025). Causal abstraction: A theoretical foundation for mechanistic interpretability. Journal of Machine Learning Research. 26(83). pp. 1–64.
[255] Saphra, Naomi and Wiegreffe, Sarah (2024). Mechanistic? In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. pp. 480–498.
[256] Sharkey et al. (2025). Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496.
[257] Joshi et al. (2026). Causality is Key for Interpretability Claims to Generalise. arXiv preprint arXiv:2602.16698.
[258] Greenspan et al. (2026). Towards Worst-Case Guarantees with Scale-Aware Interpretability. arXiv preprint arXiv:2602.05184.
[259] Mikolov et al. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 746–751.
[260] Park et al. (2023). The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658.
[261] Nanda et al. (2023). Emergent linear representations in world models of self-supervised sequence models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. pp. 16–30.
[262] Marks, Samuel and Tegmark, Max (2023). The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824.
[263] Jiang et al. (2024). On the origins of linear representations in large language models. arXiv preprint arXiv:2403.03867.
[264] Csordás et al. (2024). Recurrent neural networks learn to store and generate sequences using non-linear representations. arXiv preprint arXiv:2408.10920.
[265] Meng et al. (2022). Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems. 35. pp. 17359–17372.
[266] Wang et al. (2022). Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593.
[267] Conmy et al. (2023). Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems. 36. pp. 16318–16352.
[268] Arora et al. (2025). Language Model Circuits Are Sparse in the Neuron Basis. arXiv preprint arXiv:2601.22594. Blog post: https://transluce.org/neuron-circuits.
[269] Cunningham et al. (2023). Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600.
[270] Bricken et al. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.
[271] Thorpe, Simon (1989). Local vs. distributed coding. Intellectica. 8(2). pp. 3–40.
[272] Smolensky, Paul (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial intelligence. 46(1-2). pp. 159–216.
[273] Lepori et al. (2023). Break it down: Evidence for structural compositionality in neural networks. Advances in Neural Information Processing Systems. 36. pp. 42623–42660.
[274] Schug et al. (2023). Discovering modular solutions that generalize compositionally. arXiv preprint arXiv:2312.15001.
[275] Ramesh et al. (2023). Compositional capabilities of autoregressive transformers: A study on synthetic, interpretable tasks. arXiv preprint arXiv:2311.12997.
[276] Elhage et al. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
[277] Olsson et al. (2022). In-context Learning and Induction Heads. Transformer Circuits Thread.
[278] Nanda et al. (2023). Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217.
[279] Elhage et al. (2022). Toy Models of Superposition. arXiv preprint arXiv:2209.10652.
[280] Chen et al. (2023). Sudden drops in the loss: Syntax acquisition, phase transitions, and simplicity bias in MLMs. arXiv preprint arXiv:2309.07311.
[281] Gopalani et al. (2024). Abrupt learning in transformers: A case study on matrix completion. Advances in Neural Information Processing Systems. 37. pp. 55053–55085.
[282] Park et al. (2024). Emergence of hidden capabilities: Exploring learning dynamics in concept space. Advances in Neural Information Processing Systems. 37. pp. 84698–84729.
[283] Hoogland et al. (2025). Loss landscape degeneracy and stagewise development in transformers. Transactions on Machine Learning Research.
[284] Saphra, Naomi (2022). Interpretability Creationism. Blog post. Accessed: 2026-02-03. https://nsaphra.net/post/creationism/.
[285] Shai et al. (2024). Transformers represent belief state geometry in their residual stream. Advances in Neural Information Processing Systems. 37. pp. 75012–75034.
[286] Bietti et al. (2023). Birth of a transformer: A memory viewpoint. Advances in Neural Information Processing Systems. 36. pp. 1560–1588.
[287] Reddy, Gautam (2023). The mechanistic basis of data dependence and abrupt learning in an in-context classification task. arXiv preprint arXiv:2312.03002.
[288] Nichani et al. (2024). How transformers learn causal structure with gradient descent. In Proceedings of the 41st International Conference on Machine Learning. pp. 38018–38070.
[289] Marchetti et al. (2026). Sequential Group Composition: A Window into the Mechanics of Deep Learning. arXiv preprint arXiv:2602.03655.
[290] Prieto et al. (2025). Correlations in the data lead to semantically rich feature geometry under superposition. In Mechanistic Interpretability Workshop at NeurIPS 2025.
[291] Park et al. (2023). TRAK: Attributing model behavior at scale. arXiv preprint arXiv:2303.14186.
[292] Gupta et al. (2018). Shampoo: Preconditioned Stochastic Tensor Optimization. arXiv preprint arXiv:1802.09568.
[293] Jordan et al. (2024). Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/.
[294] Betley et al. (2026). Training large language models on narrow tasks can lead to broad misalignment. Nature. 649(8097). pp. 584–589.
[295] Marr, David (2010). Vision: A computational investigation into the human representation and processing of visual information. MIT Press.