Theory 3 papers

Theory Digest — Mar 23, 2026

Today’s Digest at a Glance

Today’s theory papers explore neural network training through continuous dynamical systems, introduce a novel projection-based algorithm for polynomial learning, and extend sparse regression methods to stochastic systems.

Mean-Field Theory for Neural Networks

Mean-field theory provides a mathematical framework for analyzing neural networks in the infinite-width limit by treating network parameters as particles in a probability distribution. The key insight is that as the number of neurons approaches infinity, the empirical distribution of parameters converges to a deterministic measure that evolves according to a partial differential equation. This approach transforms the discrete optimization problem of training a finite neural network into a continuous optimal control problem over probability measures.

The connection works through the Wasserstein gradient flow, where parameter updates correspond to moving probability mass along steepest descent directions in the space of probability measures equipped with the Wasserstein-2 metric. Mathematically, if $\mu_t$ represents the parameter distribution at time $t$, then training follows:

\[\frac{\partial \mu_t}{\partial t} = \nabla \cdot (\mu_t \nabla \frac{\delta \mathcal{L}}{\delta \mu})\]

where $\mathcal{L}$ is the loss functional and $\frac{\delta \mathcal{L}}{\delta \mu}$ is its functional derivative.

The power of this formulation lies in connecting neural network training to McKean-Vlasov optimal control theory, where the optimal policy for each particle (neuron) depends on the entire population distribution. This enables rigorous analysis of phenomena like implicit regularization that are difficult to study in the finite-width setting. Intuitively, mean-field theory reveals that neural network training naturally regularizes by preferring parameter distributions with lower “transportation cost” in the Wasserstein sense.

Spherical Harmonics in Neural Network Theory

Spherical harmonics provide the natural basis for analyzing functions on the unit sphere, playing a crucial role in understanding neural network expressivity for polynomial targets. These are the eigenfunctions of the Laplace-Beltrami operator on the sphere, forming a complete orthonormal system analogous to Fourier modes on the circle. For degree-$k$ spherical harmonics $Y_k^m(x)$ where $x \in \mathbb{S}^{d-1}$, we have:

\[-\Delta_{\mathbb{S}^{d-1}} Y_k^m = k(k+d-2) Y_k^m\]

In neural network theory, spherical harmonics become essential because they decompose the Neural Tangent Kernel (NTK) into frequency components. For ReLU networks on the sphere, the NTK can be written as:

\[K(x,x') = \sum_{k=0}^{\infty} \lambda_k P_k(\langle x,x' \rangle)\]

where $P_k$ are Gegenbauer polynomials and $\lambda_k$ are the eigenvalues corresponding to degree-$k$ spherical harmonics.

The key insight is that learning degree-$k_0$ polynomials requires the network to primarily use spherical harmonics up to degree $k_0$. However, standard gradient descent learns all frequency components simultaneously, leading to suboptimal rates. The Gradient Descent with Projection (GDP) algorithm addresses this by explicitly projecting updates onto the subspace spanned by harmonics of degree at most $k_0$, achieving the minimax optimal rate $O(d^{k_0}/n)$. This projection effectively focuses the network’s capacity on the relevant frequency components while avoiding interference from higher-order harmonics.

Reading Guide

The first paper provides theoretical foundations for understanding neural network training as a continuous dynamical system, while the second paper leverages this continuous perspective through spherical harmonic analysis to design an optimal algorithm for polynomial learning. The third paper extends regression techniques to stochastic systems, complementing the deterministic optimization focus of the neural network papers by addressing noisy dynamics identification.


Implicit Regularization of Large Neural Networks via Mean-Field Formulation

Authors: Beatrice Acciaio, Jakob Heiss, Gudmund Pammer, Qinxin Yan · Institution: ETH Zurich · Category: math.OC

Connects mean-field neural network training to McKean-Vlasov optimal control, showing early stopping implicitly regularizes via a new Finsler metric on probability measures.

Tags: mean-field theory neural network theory implicit regularization optimal transport McKean-Vlasov control gradient flows early stopping Wasserstein space

arXiv · PDF

Problem Formulation
  1. Motivation (2–3 sentences): Understanding implicit regularization in overparametrized neural networks is central to machine learning theory, as these networks generalize well despite fitting training data perfectly. Early stopping during training acts as an implicit regularizer, but the mechanism lacks rigorous mathematical explanation.

  2. Mathematical setup: Consider a one-hidden-layer neural network in the mean-field limit where parameters $\theta = (w,a,b) \in \Theta := \mathbb{R} \times \mathbb{R}^d \times \mathbb{R}$ follow a probability measure $\mu \in \mathcal{P}_2(\Theta)$. The network output is

    \[f_\mu(x) = \int_\Theta w \sigma(a \cdot x + b) \mu(dw,da,db)\]

    The empirical loss is

    \[L(\mu) = \frac{1}{M} \sum_{k=1}^M \tilde{L}(f_\mu(x_k) - y_k)\]

    Training evolves $\mu_t$ via gradient flow on $(\mathcal{P}_2(\Theta), W_2)$. Assumptions:

    1. $\sigma, \sigma’$ bounded and $\sigma$ Lipschitz with constant $S_0$
    2. Training data bounded: $\max_k |x_k| < \infty$
    3. Loss $\tilde{L}$ convex, $C^1$ with Lipschitz derivative
    4. $\min \tilde{L} = 0$
  3. Toy example: For $d=1$, single data point $(x_1,y_1) = (1,0)$, quadratic loss $\tilde{L}(r) = r^2/2$, and $\sigma(z) = \tanh(z)$, we get $L(\mu) = \frac{1}{2}(\int w \tanh(a + b) \mu(dw,da,db))^2$. The gradient flow must balance loss decrease against parameter drift.

  4. Formal objective: Characterize early stopping at time $T$ as the solution to

    \[\min_{\nu \in \mathcal{T}_T(\mu)} \{d_G^2(\mu,\nu) + e^{-T} L(\nu)\}\]

    where $d_G$ is the Finsler metric induced by the upper gradient $G(\mu) = \sqrt{2L(\mu)}$.

Method

The method establishes a bridge between Wasserstein gradient flows and McKean-Vlasov optimal control.

  1. For diffusive regime ($\epsilon > 0$), consider controlled SDE:

    \[dX_t = \alpha_t dt + \epsilon dW_t^{\alpha}\]

    with cost functional

    \[J_\epsilon(\mu,\alpha) = \mathbb{E}^{\mathcal{P}^\alpha} \left[\int_0^\infty e^{-t} \left(\frac{1}{2}\|\alpha_t\|^2 + \ell_\epsilon(\mu_t^\alpha)\right) dt\right]\]
  2. For deterministic regime ($\epsilon = 0$), the admissible pairs $(\mu,\alpha)$ satisfy continuity equation:

    \[\partial_t \mu_t + \nabla \cdot (\mu_t \alpha_t) = 0\]

    with cost

    \[J_0(\mu,\alpha) = \int_0^\infty e^{-t} \left(\frac{1}{2}\int \|\alpha_t(x)\|^2 \mu_t(dx) + \ell_0(\mu_t)\right) dt\]
  3. The value function $v_\epsilon(\mu) = \inf J_\epsilon(\mu,\alpha)$ satisfies HJB equation:

    \[w_\epsilon(\mu) = -\frac{1}{2}\int |\nabla \partial_\mu w_\epsilon(\mu)(x)|^2 \mu(dx) + \frac{\epsilon^2}{2}\int \Delta \partial_\mu w_\epsilon(\mu)(x) \mu(dx) + \ell_\epsilon(\mu)\]
  4. For any $V \in C^2(\mathcal{P}_2) \cap C_{quad}$, choose running cost:

    \[\ell_\epsilon(\mu) = V(\mu) + \frac{1}{2}\int \|\nabla_x \partial_\mu V(\mu)\|^2 d\mu - \frac{\epsilon^2}{2}\int \Delta_x \partial_\mu V(\mu) d\mu\]

    Then $v_\epsilon \equiv V$.

    Applied to toy example: With $L(\mu) = \frac{1}{2}(\int w \tanh(a+b) \mu(dw,da,db))^2$, the optimal control selects the trajectory minimizing kinetic energy while decreasing loss, yielding implicit $\ell^2$-regularization of parameters.

Novelty & Lineage

Step 1 — Prior work:

  • Mei et al. 2018, Chizat & Bach 2018: established mean-field limit of neural networks as gradient flow on probability measures
  • Ambrosio et al. 2008: developed theory of curves of maximal slope in metric spaces for gradient flows
  • Carmona & Delarue 2018: comprehensive McKean-Vlasov control theory

Step 2 — Delta: This paper connects gradient flows to McKean-Vlasov control via HJB equations, providing a variational characterization of early stopping. The key novelty is the endpoint formulation from dynamic programming principle and the identification of implicit regularization with a new Finsler metric $d_G$ on probability measures.

Step 3 — Theory-specific assessment:

  • Main theorem is somewhat predictable: connecting gradient flows to control problems is natural via calculus of variations
  • Proof technique assembles known results (viscosity solutions, dynamic programming) but the specific construction of running cost $\ell_\epsilon$ in Theorem 2.18 is novel
  • Bounds are non-asymptotic but no lower bounds are established to assess tightness

The theoretical contribution is solid but expected. The connection between early stopping and implicit regularization has been studied extensively, and this work provides one more perspective through optimal control theory. The framework is mathematically rigorous but the practical insights are limited.

Verdict: INCREMENTAL — provides a new but predictable connection between well-studied concepts using standard techniques.

Proof Techniques
  1. Viscosity solution theory on metric spaces: The HJB equation (2.18) lacks classical smooth solutions due to non-compactness of $\mathcal{P}_2(\Theta)$. The authors use viscosity solutions with comparison principles from Gangbo & Święch 2015 and Soner & Yan 2024 to prove uniqueness in the class $C_{quad}$.

  2. Dynamic Programming Principle: Key inequality from Djete et al. 2022:

    \[v_\epsilon(\mu) = \inf_{(\mu,\alpha)} \left\{\int_0^T e^{-t}\left[\frac{1}{2}\mathbb{E}\|\alpha_t\|^2 + \ell_\epsilon(\mu_t)\right]dt + e^{-T}v_\epsilon(\mu_T)\right\}\]
  3. Calibration Property technique: For gradient flows to be optimal, the energy-dissipation equality must hold:

    \[-\frac{d}{dt}v(\mu_t) = \frac{1}{2}|\partial v|(\mu_t)^2 + \frac{1}{2}|\mu'|_t^2\]

    This forces $\alpha_t^* = -\zeta_t$ where $\zeta_t \in \partial^\circ v(\mu_t)$ realizes the metric slope.

  4. $\Gamma$-convergence argument: As $\epsilon \to 0$, the diffusive functional $\nu \mapsto d_{T,\epsilon}^2(\mu,\nu) + e^{-T}V(\nu)$ $\Gamma$-converges to the deterministic limit. This uses compactness of sublevel sets and lower semicontinuity.

  5. Finsler metric construction: The key technical insight is defining

    \[d_G(\mu,\nu) = \inf_{\mu \in A_{1,0}(\mu,\nu)} \int_0^1 |\mu'|(t) G(\mu_t) dt\]

    where $G(\mu) = \sqrt{2L(\mu)}$ acts as a “speed limit” encoding the loss structure. This provides the regularization mechanism.

Experiments & Validation

Purely theoretical with brief numerical illustration in Section 5 (not detailed in provided text). The paper states numerical experiments demonstrate the implicit regularization behavior but provides no specific datasets, baselines, or quantitative results.

Empirical validation would require:

  1. comparing finite-width networks with mean-field predictions
  2. verifying that early-stopped solutions minimize the proposed Finsler distance, and
  3. demonstrating improved generalization consistent with the theoretical regularization bounds.
Limitations & Open Problems
  1. Bounded activation assumption - TECHNICAL: Required for proof techniques but ReLU networks are more practically relevant. Authors suggest clipped ReLU approximation.

  2. Mean-field limit validity - RESTRICTIVE: Theory applies only to infinite-width networks, but practical networks are finite. Gap between theory and practice unclear.

  3. Convex loss assumption - RESTRICTIVE: Real neural network training involves highly non-convex losses where gradient flows may not converge to global minima.

  4. Uniqueness of gradient flows not guaranteed - TECHNICAL: When displacement convexity fails, multiple gradient flow solutions exist, complicating the control formulation.

  5. Growth conditions (Assumption 4.12) - NATURAL: Standard in mean-field theory but may fail for some loss functions.

    Open problems:

  6. Extend to multi-layer networks beyond the two-layer case studied here.
  7. Establish finite-sample convergence rates relating finite-width networks to the mean-field limit with explicit dependence on network width.

Gradient Descent with Projection Finds Over-Parameterized Neural Networks for Learning Low-Degree Polynomials with Nearly Minimax Optimal Rate

Authors: Yingzhen Yang, Ping Li · Institution: Arizona State University · Category: stat.ML

First neural network training algorithm achieving nearly minimax optimal rate $\Theta(d^{k_0}/n)$ for learning degree-$k_0$ spherical polynomials via novel Gradient Descent with Projection.

Tags: neural tangent kernel overparameterized neural networks minimax optimal rates spherical harmonics polynomial regression gradient descent projection methods RKHS theory

arXiv · PDF

Problem Formulation

Motivation: Learning low-degree polynomials with neural networks is fundamental to understanding their generalization capabilities, particularly due to the spectral bias phenomenon where networks preferentially learn functions aligned with top eigenspaces of the Neural Tangent Kernel. Achieving minimax optimal rates for such problems has remained an open challenge.

Mathematical setup: Consider data uniformly distributed on unit sphere $X = S^{d-1} \subset \mathbb{R}^d$ with measure $\mu$. Training data is ${(\vec{x}_i, y_i)}_{i=1}^n$ where $y_i = f^*(\vec{x}_i) + w_i$ with $w_i$ i.i.d. sub-Gaussian noise (mean 0, variance proxy $\sigma_0^2$). Two-layer neural network with augmented feature:

\[f(W,x) = \frac{1}{\sqrt{m}}\sum_{r=1}^m a_r \sigma(\vec{w}_r^\top x) + \frac{1}{\sqrt{m}}\vec{w}_{m+1}^\top F(W^{(0)}, x)\]

where $\sigma(\cdot) = \max{\cdot, 0}$ is ReLU, $F(W^{(0)}, x) \in \mathbb{R}^m$ is augmented feature computed at initialization. Associated kernel:

\[K(u,v) = K^{(0)}(u,v) + K^{(1)}(u,v)\]

where:

\[K^{(0)}(u,v) = \frac{\pi - \arccos(u^\top v)}{2\pi}, \quad K^{(1)}(u,v) = u^\top v K^{(0)}(u,v)\]

Target function is degree-$k_0$ spherical polynomial:

\[f^*(x) = \sum_{\ell=0}^{k_0} \sum_{j=1}^{N(d,\ell)} a_{\ell j} Y_{\ell j}(x)\]

where ${Y_{\ell j}}$ are spherical harmonics. Function class constraint:

\[F^* = \left\{\sum_{\ell=0}^{k_0} \sum_{j=1}^{N(d,\ell)} a_{\ell j} Y_{\ell j} : \sum_{\ell=0}^{k_0} \sum_{j=1}^{N(d,\ell)} a_{\ell j}^2/\mu_\ell \leq \gamma_0^2\right\}\]

Assumptions:

  1. Features $\vec{x}_i$ drawn i.i.d. from uniform distribution on $S^{d-1}$
  2. Noise $w_i$ is sub-Gaussian with variance proxy $\sigma_0^2$
  3. Target function $f^* \in F^*$ with bounded RKHS norm $\gamma_0$
  4. No two training features coincide
  5. Degree $k_0 = \Theta(1) \geq 1$ is constant

    Toy example: When $d=2$ and $k_0=1$, target is $f^*(x) = a_0 Y_0(x) + a_{1,1} Y_{1,1}(x) + a_{1,2} Y_{1,2}(x)$ where $Y_0$ is constant and ${Y_{1,1}, Y_{1,2}}$ span linear functions on $S^1$. The relevant subspace has dimension $r_0 = m_{k_0} = N(2,0) + N(2,1) = 1 + 2 = 3$.

    Formal objective: Minimize regression risk:

    \[\mathbb{E}_P[(f_t - f^*)^2]\]

    where $f_t$ is network after $t$ steps of Gradient Descent with Projection (GDP).

Method

Gradient Descent with Projection (GDP): Novel training algorithm that projects gradient updates onto $r_0$-dimensional subspace corresponding to spherical harmonics up to degree $k_0$.

GDP Update Rules: At step $t+1$:

\[\text{vec}(W_S(t+1)) - \text{vec}(W_S(t)) = -\frac{\eta}{n}Z_S(t)P^{(r_0)}(\hat{y}(t) - y)\] \[\vec{w}_{m+1}(t+1) - \vec{w}_{m+1}(t) = -\frac{\eta}{n\sqrt{m}}F(W^{(0)}, S)^\top P^{(r)}(\hat{y}(t) - y)\]

where $P^{(r_0)} = U\Sigma^{(r_0)}U^\top$ is projection matrix onto top $r_0$ eigenspace of empirical NTK matrix $K_n$.

Key Components:

  1. Projection matrix $P^{(r_0)}$ ensures learning occurs in $r_0$-dimensional subspace
  2. Augmented feature $F(W^{(0)}, x)$ makes all NTK eigenvalues positive
  3. Symmetric initialization: $\hat{y}(0) = 0$

    Steps:

  4. Initialize weights symmetrically: $\vec{w}_{2r’-1}(0) = \vec{w}_{2r’}(0) \sim \mathcal{N}(0, \kappa^2 I_d)$, $a_{2r’-1} = -a_{2r’}$
  5. For $t = 1, \ldots, T$: Apply GDP updates with projection $P^{(r_0)}$
  6. Set $r_0 = m_{k_0} = \Theta(d^{k_0})$, learning rate $\eta = \Theta(1)$, steps $T \asymp n/d^{k_0}$

    Adaptive Degree Selection: When $k_0$ unknown, Algorithm 2 tests degrees $\ell = L, \ldots, 0$ and selects based on training loss criterion:

    \[\frac{E_\ell}{\mu_{\ell+1}} \geq \frac{\beta_0^2}{4} \text{ and } \frac{E_{\ell+1}}{\mu_{\ell+2}} \leq \frac{\beta_0^2}{8}\]

    Application to toy example: For $d=2, k_0=1$: Network learns 3-dimensional subspace spanned by ${Y_0, Y_{1,1}, Y_{1,2}}$. GDP projects gradients onto this subspace, avoiding contamination from higher-degree harmonics that vanilla GD would attempt to fit.

Novelty & Lineage

Prior work:

  1. Nichani et al. (2022): Achieved sample complexity $\Theta(d^{k_0} \max{\epsilon^{-2}, \log d})$ for learning dense polynomials, implying convergence rate $\Theta(\sqrt{d^{k_0}/n})$ - far from optimal.
  2. Ghorbani et al. (2021): Showed NTK alone can learn polynomials under restrictive conditions but without concrete rates or minimax optimality.
  3. Damian et al. (2022): Two-stage method requiring assumption that target depends on $r \ll d$ directions, achieving rate $\tilde{\Theta}(\sqrt{d^{k_0+1}/n})$.

    Delta: This paper introduces GDP algorithm achieving nearly minimax optimal rate $\log(4/\delta) \cdot \Theta(d^{k_0}/n)$ vs. known lower bound $\Theta(d^{k_0}/n)$, plus adaptive degree selection.

    Theory-specific assessment:

    • Main theorem: Highly surprising - first algorithm to achieve nearly optimal rate for learning general degree-$k_0$ spherical polynomials with neural networks. Prior work achieved only $\Theta(\sqrt{d^{k_0}/n})$ rates.
    • Proof technique: Novel projection-based training fundamentally different from standard NTK analysis. Key insight: project onto $r_0$-dimensional eigenspace rather than full RKHS. Non-trivial operator theory to bound information loss from projection.
    • Bound tightness: Nearly matches minimax lower bound $\Theta(d^{k_0}/n)$ up to $\log(1/\delta)$ factor. Lower bound is known and established.

    Verdict: SIGNIFICANT — First nearly minimax optimal algorithmic result for learning spherical polynomials with neural networks, substantially improving upon prior $\sqrt{d^{k_0}/n}$ rates to achieve $d^{k_0}/n$ rate.

Proof Techniques

Main Proof Strategy: Three-stage approach fundamentally different from standard NTK analysis.

Stage 1 - Neural Network Decomposition (Theorem 4.2): Key insight: GDP ensures network function decomposes as $f_t = h_t + e_t$ where $h_t \in H_K(B_h) \cap H_{S,r_0}$ (lives in $r_0$-dimensional RKHS subspace) and $|e_t|_\infty \leq w$ (small uniform error).

Critical uniform convergence bound:

\[\sup_{u,v \in X} |K(u,v) - \hat{h}(W^{(0)}, u, v)| \leq C_1(m/2, d, 1/n) \lesssim \sqrt{\frac{d \log m}{m}}\]

Stage 2 - Regression Risk via Local Rademacher Complexity (Theorem 4.3): For localized function class $F(B_h, w, S, r_0) = {f : f = h + e, h \in H_K(B_h) \cap H_{S,r_0}, |e|_\infty \leq w}$:

\[\mathbb{E}_P[(f_t - f^*)^2] - 2\mathbb{E}_{P_n}[(f_t - f^*)^2] \lesssim \sqrt{\frac{\log(2/\delta) \cdot d^{k_0}}{n}} + w\]

Stage 3 - Sharp Training Loss Bound (Theorem 4.4): Novel operator-theoretic approach. Key technical lemma bounds projection of target onto higher eigenspaces:

\[\|P_{U^{(-r_0)}}(f^*(S))\|_2^2 \leq n\gamma_0^2 \log(2/\delta) \cdot \Theta\left(\frac{d^{k_0}}{n}\right)\]

This uses bounded Hilbert-Schmidt norm of difference between projection operators:

\[\mathbb{E}_{P_n}[(f_t - f^*)^2] \leq \Theta\left(\frac{\gamma_0^2}{\eta t}\right) + \gamma_0^2 \log(2/\delta) \cdot \Theta\left(\frac{d^{k_0}}{n}\right)\]

Novel Operator Theory Result (Theorem 4.5): For orthonormal basis extension ${\Phi^{(k)}}_{k \geq 0}$ of RKHS:

\[\sum_{q=r_0}^\infty \langle f^*, \Phi^{(q)} \rangle_{H_K}^2 \leq \frac{32\gamma_0^2 \log(2/\delta)}{(\mu_{k_0} - \mu_{k_0+1})^2 n}\]

Key insight: Information loss from projecting to $r_0$-dimensional subspace is controlled and small enough to maintain near-optimality.

Experiments & Validation

Purely theoretical paper with no experiments.

Empirical validation would require:

  1. Synthetic data on unit spheres with known degree-$k_0$ polynomial targets
  2. Comparison of GDP vs. vanilla GD on regression risk convergence
  3. Verification of adaptive degree selection accuracy
  4. Scaling studies with dimension $d$ and degree $k_0$
  5. Comparison with kernel methods achieving minimax rates
Limitations & Open Problems

Limitations:

  1. RESTRICTIVE: Requires massive overparameterization $m \gtrsim (n/d^{k_0})^{25/2} d^{5/2}$ - extremely large network widths needed
  2. RESTRICTIVE: Limited to two-layer networks with specific augmented feature architecture
  3. TECHNICAL: Uniform distribution on sphere assumption - real data rarely satisfies this
  4. TECHNICAL: Sub-Gaussian noise assumption with known variance proxy
  5. NATURAL: Constant degree assumption $k_0 = \Theta(1)$ - standard in polynomial learning literature
  6. RESTRICTIVE: For adaptive algorithm, requires minimum signal strength condition $\min a_{\ell j} /\sqrt{\mu_\ell} \geq \beta_0$
  7. TECHNICAL: Analysis specific to ReLU activation function

    Open Problems:

  8. Extension to deep networks: Can similar rates be achieved with deeper architectures without exponential width requirements?
  9. Realistic data distributions: Develop theory for non-uniform distributions on manifolds while maintaining near-optimal rates

Sparse Weak-Form Discovery of Stochastic Generators

Authors: Eshwar R A, Gajanan V. Honnavar · Institution: PES University · Category: stat.ME

Extends weak-form sparse regression to stochastic differential equations by using spatial Gaussian test functions that avoid endogeneity bias, enabling joint identification of drift and diffusion from trajectory data.

Tags: stochastic differential equations sparse regression system identification weak-form methods SINDy infinitesimal generator data-driven discovery LASSO

arXiv · PDF

Problem Formulation
  1. Motivation: Most real-world dynamical systems are intrinsically stochastic (thermal fluctuations, market noise, demographic randomness). Data-driven discovery methods must identify both deterministic drift and stochastic diffusion components to construct reduced-order models that capture variability, not just average behavior.

  2. Mathematical setup: Let ${X_{t_0}, X_{t_1}, \ldots, X_{t_N}}$ be discrete observations of the scalar Itô diffusion:

    \[dX_t = b(X_t) dt + \sigma(X_t) dW_t\]

    where $b: \mathbb{R}^d \to \mathbb{R}^d$ is drift, $\sigma: \mathbb{R}^d \to \mathbb{R}^{d \times m}$ is diffusion matrix, $W_t \in \mathbb{R}^m$ is standard Wiener process. Define diffusion tensor $a(x) = \sigma(x)\sigma(x)^T$ and infinitesimal generator:

    \[\mathcal{L}f(x) = b(x) \cdot \nabla f(x) + \frac{1}{2} a(x) : \nabla^2 f(x)\]

    Assumptions:

    1. True functions are sparse in polynomial library: $b(x) = \Theta(x) c^*, a(x) = \Theta(x) d^*$
    2. Library $\Theta(x) = [f_1(x), \ldots, f_K(x)]$ spans true drift/diffusion
    3. Diffusion is ergodic with unique invariant measure $\mu$
  3. Toy example: For Ornstein-Uhlenbeck process with $d=1$, $b(x) = -\theta x$ and $a(x) = \sigma_0^2$ constant. When $\Theta(x) = [1, x, x^2, x^3, x^4]$, true coefficients are $c^* = [0, -\theta, 0, 0, 0]$ and $d^* = [\sigma_0^2, 0, 0, 0, 0]$. The core difficulty is that temporal test functions create endogeneity bias since future states depend on current Brownian innovations.

  4. Formal objective: Recover sparse coefficient vectors $c^*$ and $d^*$ such that:

    \[\hat{b}(x) = \sum_{k \in S_b} \hat{c}_k f_k(x), \quad \hat{a}(x) = \sum_{k \in S_a} \hat{d}_k f_k(x)\]
Method

The method uses spatial Gaussian test functions to create unbiased regression systems.

  1. Define spatial Gaussian kernels at $M$ centers $x_1, \ldots, x_M$:

    \[K_j(x) = \exp\left(-\frac{(x-x_j)^2}{2h^2}\right)\]
  2. From Euler-Maruyama discretization $\Delta X_n = b(X_{t_n}) \Delta t + \sigma(X_{t_n}) \xi_n \sqrt{\Delta t}$, multiply by $K_j(X_{t_n})$ and sum over steps to build drift system:

    \[B_j := \sum_n K_j(X_{t_n}) \Delta X_n \approx \sum_k c_k A_{jk}\]

    where design matrix is:

    \[A_{jk} = \sum_{n=0}^{N-1} K_j(X_{t_n}) f_k(X_{t_n}) \Delta t\]
  3. For diffusion identification via quadratic variation:

    \[Q_j := \sum_n K_j(X_{t_n}) (\Delta X_n)^2 \approx \sum_k d_k A_{jk}\]
  4. Apply finite-time-step bias correction by first estimating drift $\hat{b}$, then:

    \[Q_j^{\text{corr}} = Q_j - \sum_n K_j(X_n) \hat{b}(X_n)^2 \Delta t^2\]
  5. Solve both systems $B \approx Ac$ and $Q^{\text{corr}} \approx Ad$ via LASSO with grouped cross-validation, followed by OLS debiasing and STLSQ refinement.

    Toy example application: For OU process with centers on $[-2.5, 2.5]$ and $h=0.22$, the method produces design matrix where only the $x$ column contributes to drift identification and only the constant column contributes to diffusion identification, recovering $\hat{c}_x \approx -1.0$ and $\hat{d}_1 \approx \sigma_0^2$.

Novelty & Lineage

Step 1 — Prior work:

  • Weak SINDy (Messenger & Bortz 2021): Uses temporal test functions with integration-by-parts for deterministic ODEs, avoiding derivative estimation but limited to deterministic systems
  • Stochastic SINDy (Boninsegna & Clementi 2018): Extends SINDy to SDEs via Kramers-Moyal moments but uses individual-step increments vulnerable to noise amplification
  • Classical SINDy (Brunton et al. 2016): Sparse regression for deterministic dynamics using derivative estimates

Step 2 — Delta: This paper combines weak-form projection (from Weak SINDy) with stochastic identification (from Stochastic SINDy) by switching from temporal to spatial test functions. Key insight: temporal test functions create endogeneity bias in stochastic setting because $\phi_j(t)$ weights observations by time index, but future states depend on past Brownian innovations. Spatial kernels $K_j(X_{t_n})$ avoid this since they’re $\mathcal{F}_{t_n}$-measurable and independent of innovation $\xi_n$.

Step 3 — Theory-specific assessment:

  • Main theorem (unbiasedness): $E[K_j(X_{t_n}) \sigma(X_{t_n}) \xi_n \mathcal{F}_{t_n}] = 0$ is straightforward from independence properties, not surprising
  • Proof technique: Standard stochastic calculus and ergodic theorem applications, routine assembly of known results
  • Bounds: No explicit finite-sample rates provided; convergence relies on ergodic theorem without quantitative rates
  • The endogeneity bias identification is the main conceptual contribution but follows directly from basic martingale properties

Verdict: INCREMENTAL — solid combination of existing techniques with clear practical value but no fundamental theoretical breakthrough.

Proof Techniques

The main proof strategy relies on standard ergodic theory and martingale properties:

  1. Unbiasedness proof uses independence structure. For Brownian innovation $\xi_n = (W_{t_{n+1}} - W_{t_n})/\sqrt{\Delta t}$:

    \[E[K_j(X_{t_n}) \sigma(X_{t_n}) \xi_n | \mathcal{F}_{t_n}] = K_j(X_{t_n}) \sigma(X_{t_n}) E[\xi_n | \mathcal{F}_{t_n}] = 0\]

    since $\xi_n \sim N(0,1)$ independent of $\mathcal{F}_{t_n}$.

  2. Consistency uses ergodic theorem for diffusions. Under ergodicity with invariant measure $\mu$:

    \[\frac{1}{N} A_{jk} = \frac{1}{N} \sum_n K_j(X_{t_n}) f_k(X_{t_n}) \Delta t \to \int K_j(x) f_k(x) \mu(dx)\] \[\frac{1}{N} B_j = \frac{1}{N} \sum_n K_j(X_{t_n}) \Delta X_n \to \int K_j(x) b(x) \mu(dx)\]
  3. Noise robustness analysis: For additive observation noise $\tilde{X}_{t_n} = X_{t_n} + \eta_n$, the noise contribution to $\tilde{B}_j$ has variance:

    \[\text{Var}\left[\sum_n K_j(X_{t_n})(\eta_{n+1} - \eta_n)\right] = O(\sigma_\eta^2 \Delta t)\]

    which vanishes as $\Delta t \to 0$, contrasting with Kramers-Moyal variance $O(\sigma_\eta^2/\Delta t^2)$.

  4. Finite-time-step bias correction: Squaring Euler-Maruyama step gives:

    \[(\Delta X_n)^2 = a(X_{t_n}) \Delta t + 2b(X_{t_n}) \sigma(X_{t_n}) \xi_n \Delta t^{3/2} + b(X_{t_n})^2 \Delta t^2\]

    The drift-squared term creates $O(\Delta t^2)$ systematic bias requiring correction.

Experiments & Validation

Three benchmark SDEs tested:

  1. Ornstein-Uhlenbeck: $dX_t = -\theta X_t dt + \sigma_0 dW_t$ with $\theta=1.0, \sigma_0=0.7$ - Results: Drift error 2.0%, diffusion error 0.0%

  2. Double-well Langevin: $dX_t = (X_t - X_t^3) dt + \sigma_0 dW_t$ with $\sigma_0=0.5$
    - Results: Drift errors 2.7% (linear) and 2.9% (cubic), diffusion error 0.0%

  3. Multiplicative diffusion: $dX_t = -2X_t dt + \frac{1}{2}\sqrt{1+X_t^2} dW_t$ - Results: Drift error 3.9%, diffusion errors 0.3% and 0.4% (after bias correction)

    Experimental setup: $R=120$ trajectories, $N=50,000$ steps per trajectory, $\Delta t=0.002$, $T=100$. Polynomial library $\Theta(x) = [1,x,x^2,x^3,x^4]$. Spatial kernels: $M=50$ centers, bandwidth $h=0.22$ (OU, double-well) or $h=0.27$ (multiplicative).

    Validation metrics: Coefficient errors, stationary density total variation distances (<0.01 all cases), autocorrelation function matching. All active terms correctly identified with no false positives.

Limitations & Open Problems

Limitations:

  1. Library specification requirement - RESTRICTIVE: User must provide feature library $\Theta(x)$ spanning true drift/diffusion functions; misspecification not correctable post-hoc

  2. Hyperparameter tuning - TECHNICAL: Bandwidth $h$, number of centers $M$, and STLSQ threshold require tuning; currently uses heuristics rather than principled selection

  3. Scalar limitation - TECHNICAL: Current formulation handles scalar SDEs; multi-dimensional extension straightforward but diffusion parameters scale as $d(d+1)/2$

  4. Ergodicity assumption - NATURAL: Requires ergodic diffusion with unique invariant measure, standard assumption in stochastic analysis

  5. Polynomial library restriction - RESTRICTIVE: Method limited to polynomial generators; cannot identify exponential, trigonometric, or other non-polynomial terms

  6. No finite-sample rates - TECHNICAL: Convergence analysis relies on ergodic theorem without explicit finite-sample convergence rates

    Open problems:

  7. Develop principled cross-validation schemes for bandwidth and center selection rather than heuristic overlap ratios
  8. Extend to high-dimensional systems with structured (low-rank, sparse) diffusion tensors to overcome $O(d^2)$ parameter scaling