Apr 17, 2026 Theory 3 papers

Theory Digest — Apr 17, 2026

Today’s Digest at a Glance

Preliminary

Today’s digest features three theoretical advances: a Fokker-Planck transformation for nonlinear stochastic control, asymptotic theory for graphical SLOPE, and stochastic modified equations for infinite-dimensional SGD.

Fokker-Planck Transformation for Optimal Control

The Fokker-Planck transformation addresses a fundamental limitation in stochastic optimal control: nonlinear problems with distributional constraints are typically intractable because the Hamilton-Jacobi-Bellman equation couples with the evolution of the probability density. Classical approaches either ignore distributional effects or resort to computationally expensive particle methods.

The core mathematical insight exploits the duality between the controlled stochastic differential equation $dX_t = b(X_t, u_t)dt + \sigma dW_t$ and its associated Fokker-Planck equation $\partial_t \rho = -\nabla \cdot (b(x,u)\rho) + \frac{1}{2}\sigma^2 \Delta \rho$. For state-independent diffusion, this can be rewritten as a continuity equation $\partial_t \rho + \nabla \cdot (\rho v) = 0$ where $v(x,t) = b(x,u(x,t))$ is a deterministic velocity field. The transformation then converts the stochastic control problem into an equivalent deterministic optimal control problem over velocity fields, preserving the dynamic programming structure.

Intuition: Instead of controlling random particles, we control the deterministic flow of probability mass itself.

SLOPE Penalty for Graphical Models

The SLOPE (Sorted L-One Penalized Estimation) penalty extends the LASSO by applying decreasing penalty weights $\lambda_1 > \lambda_2 > \cdots$ to the sorted absolute values of parameters rather than uniform penalization. In precision matrix estimation for graphical models, this addresses LASSO’s limitation of treating all edges equally regardless of their strength.

For the precision matrix $\Theta$, the graphical SLOPE estimator minimizes the penalized negative log-likelihood:

\[\hat{\Theta} = \arg\min_{\Theta \succ 0} \left\{ \text{tr}(S\Theta) - \log\det(\Theta) + \sum_{j=1}^p \lambda_j |\Theta|_{(j)} \right\}\]

where $

\Theta

_{(j)}$ denotes the $j$-th largest off-diagonal entry in absolute value. The decreasing penalty sequence enables automatic selection of a “natural” sparsity level by more heavily penalizing the largest coefficients.

Intuition: SLOPE acts like an adaptive threshold that automatically determines how many edges to include based on their relative magnitudes.

Cylindrical Brownian Motion

Cylindrical Brownian motion extends standard Brownian motion to infinite-dimensional Hilbert spaces, addressing the fact that infinite-dimensional white noise cannot be defined as a classical stochastic process. In SGD analysis, this enables rigorous treatment of gradient noise in function spaces arising in inverse problems and PDE-constrained optimization.

Mathematically, cylindrical Brownian motion $W_t$ on a Hilbert space $H$ is defined through its action on a complete orthonormal basis $(e_k)$: $W_t = \sum_{k=1}^\infty \beta_k(t) e_k$ where $\beta_k(t)$ are independent scalar Brownian motions. For SGD in infinite dimensions, the diffusion operator takes the form $\sigma(\phi) = \sqrt{\frac{1}{n}\sum_{i=1}^n

\nabla f_i(\phi)

^2_{L^2_\Gamma}}$ where $L^2_\Gamma$ weights different frequency components.

Intuition: Cylindrical Brownian motion provides independent random perturbations along each coordinate direction of an infinite-dimensional function space.

Reading Guide

The first paper’s Fokker-Planck transformation provides a novel deterministic reformulation technique that could potentially be combined with the infinite-dimensional SGD analysis from the third paper for mean-field control problems. The second paper’s SLOPE methodology represents a different approach to structured sparsity that complements the distributional constraint handling in the first paper. All three papers advance theoretical understanding of high-dimensional optimization under different types of constraints.

Nonlinear Stochastic Optimal Control and Optimal Stopping using the Fokker-Planck Transformation

Authors: Akan Selim, Siddhartha Ganguly, Ali Pakniyat, Panagiotis Tsiotras · Institution: Georgia Institute of Technology, University of Alabama · Category: math.OC

Develops a novel Fokker-Planck transformation that converts nonlinear stochastic optimal control problems with stopping and distributional constraints into equivalent deterministic formulations preserving the same dynamic programming structure.

Tags: stochastic optimal control optimal stopping Fokker-Planck equations distributional constraints mean-field theory Hamilton-Jacobi-Bellman equations Skorokhod embedding optimal transport

arXiv · PDF

Problem Formulation

Motivation (2–3 sentences): Stochastic optimal control with stopping arises naturally when controlling the full distribution of states rather than single trajectories, as in robotic swarms, aerospace guidance under uncertainty, and finance applications. The coupling between control decisions and optimal stopping timing makes distribution-level analysis substantially more challenging than fixed-horizon control. Classical trajectory-based approaches are insufficient for problems requiring distributional constraints and endogenous stopping mechanisms.
Mathematical setup: Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a complete probability space and consider the controlled stochastic system:
\[d x_t = f(t, x_t, u_t) dt + \sigma_t d w_t\]
where $x_t \in \mathbb{R}^n$ is the state, $u_t \in \mathbb{R}^m$ is the control, $w_t$ is $n$-dimensional Brownian motion, and $\sigma_t \in \mathbb{R}^{n \times n}$ is a deterministic diffusion matrix. The drift function $f: \mathbb{R}_{\geq 0} \times \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ is deterministic.

Assumptions:
1. The drift $f$ is continuously differentiable with bounded derivatives and satisfies Lipschitz conditions
2. The diffusion matrix $\sigma_t$ is bounded, continuous, and uniformly elliptic: $\sigma_t \sigma_t^T \succ 0$
3. The cost functions $L: \mathbb{R}_{\geq 0} \times \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ (running cost) and $\Phi: \mathbb{R}_{\geq 0} \times \mathbb{R}^n \to \mathbb{R}$ (terminal cost) are continuously differentiable with bounded derivatives
For stopping boundary $g \in \mathcal{G}$, define the stopping time:
\[\tau_g := \inf\{t \geq 0 : g(t, x_t) \geq 0\}\]
The continuation region is $\mathcal{D}_g := {(t,x) \in \mathbb{R}_{\geq 0} \times \mathbb{R}^n : g(t,x) < 0}$.
Toy example: Consider $n=2$, $m=1$ with dynamics $dx_t = u_t dt + dw_t$ where $u_t \in [-1,1]$. Take $L(t,x,u) = \frac{1}{2}u^2 + c|x|^2$ and $\Phi(t,x) = |x|^2$. With stopping boundary $g(t,x) = |x|^2 - R^2(t)$ for some function $R(t)$, particles stop when exiting a time-varying ball. The core difficulty is that optimal control and stopping decisions must account for the full probability distribution, not just individual trajectories.
Formal objective: The main problem is to minimize:
\[\min_{g \in \mathcal{G}, u \in \mathcal{U}} \mathbb{E}^{\mathcal{F}_0}_{x \sim \mu_0}\left[\int_0^{\tau_g} L(t, x_t, u_t(x_t)) dt + \Phi(\tau_g, x_{\tau_g})\right]\]

Method

The method consists of a Fokker-Planck transformation that converts the stochastic control problem into an equivalent deterministic formulation:

Fokker-Planck Transformation: For state-independent diffusion, rewrite the controlled Fokker-Planck equation as a continuity equation. Given density $\rho(t,x)$ evolving according to:
\[\frac{\partial}{\partial t}\rho(t,x) = -\nabla_x \cdot [f(t,x,u_t(x))\rho(t,x)] + \frac{1}{2}\nabla_x \cdot [\sigma_t \sigma_t^T \nabla_x \rho(t,x)]\]
Define the score-corrected velocity field:
\[\tilde{f}(t,x,u) := f(t,x,u) - \frac{\sigma_t \sigma_t^T}{2} \nabla_x \log \rho(t,x)\]
Deterministic Reformulation: The Fokker-Planck equation becomes the Liouville equation:
\[\frac{\partial}{\partial t}\rho(t,x) = -\nabla_x \cdot [\tilde{f}(t,x,u_t(x))\rho(t,x)]\]
Characteristic Dynamics: This yields deterministic characteristic dynamics:
\[\frac{dx_t}{dt} = \tilde{f}(t,x_t,u_t(x_t))\]
whose flow reproduces the marginal law of the original stochastic system.
Distributional Value Function: Define the distributional value function:
\[\bar{V}_g(t,\mu) := \min_{u \in \mathcal{U}} \mathbb{E}^{\mathcal{F}_t}_{x \sim \mu}\left[\int_t^{\tau_g} L(s,x_s,u_s(x_s))ds + \Phi(\tau_g,x_{\tau_g})\right]\]
Variational Formulation: Reformulate as an optimal control problem with state-dependent terminal-time assignment $\tau_m \in \mathcal{T}_{ad}$ and terminal distributional constraints.

Application to toy example: For the 2D example, the score function $\nabla_x \log \rho(t,x)$ modifies the drift, creating a deterministic flow that maintains the same marginal distributions as the original Brownian motion with control. The stopping boundary becomes a time-assignment function $\tau_m(x_0)$ specifying when each initial condition should stop.

Novelty & Lineage

Prior work:

Classical Skorokhod Embedding Problem (Skorokhod 1961, Root 1969, Rost 1971): Addressed optimal stopping for single processes to achieve target distributions, but without control or distributional constraints.
Schrödinger Bridge Problems (recent works by Gelbrich et al. 2024, Chen et al. 2025): Handle distribution steering with fixed terminal times, but lack optimal stopping components.
Mean-field stochastic control (Pakniyat 2021, 2022): Considered distributional constraints with fixed horizons, but not optimal stopping.

Delta: This paper introduces three key advances:
- New transformation: The Fokker-Planck transformation converting stochastic control+stopping to deterministic density evolution is novel
- Unified framework: First to combine nonlinear stochastic control, optimal stopping, AND distributional constraints in a coherent theory
- Equivalence result: Proof that the deterministic reformulation preserves the same dynamic programming structure as the original stochastic problem
Theory-specific assessment:
- Main theorem surprising or predictable? The equivalence between stochastic HJB and deterministic distributional HJB (Theorem 4.1) is non-obvious. The decomposition $\bar{V}_g(t,\mu) = \mathbb{E}_{x \sim \mu}[V_g(t,x)]$ and preservation of the same second-order differential operator is genuinely surprising.
- Proof technique routine or novel? The proof technique combining Stein identities, Reynolds transport theorems in measure spaces, and functional derivatives is genuinely novel. The use of score-corrected velocity fields to maintain stochastic-deterministic equivalence requires new technical machinery.
- Bound tightness: This is a foundational theoretical paper establishing equivalence rather than providing bounds. No relevant lower bounds exist for comparison.
Verdict: SIGNIFICANT — The Fokker-Planck transformation approach to unify stochastic control, optimal stopping, and distributional constraints represents a clear non-obvious advance that most researchers in stochastic control and optimal transport should read.

Proof Techniques

The proof strategy employs several sophisticated techniques:

Distributional Dynamic Programming: Start with the decomposition ansatz:
\[\bar{V}_g(t,\mu) = \mathbb{E}_{x \sim \mu}[V_g(t,x)]\]
Use push-forward measure evolution under controlled dynamics to derive Reynolds transport-like formula:
\[\frac{d}{dt}\bar{V}_g(t,\mu) = \frac{\partial}{\partial t}\bar{V}_g(t,\mu) + \int_{S_\mu} \frac{\delta \bar{V}_g}{\delta \mu}(t,\mu)(t,x) \frac{\partial}{\partial t}\rho^a(t,x) dx\]
Stein-Type Identities: Key integration by parts formula for the diffusion term:
\[\int_{S_\mu} \frac{\delta \bar{V}_g}{\delta \mu}(t,\mu)(t,x) \nabla_x \cdot [\sigma_t \sigma_t^T \nabla_x \rho^a(t,x)] dx = \int_{S_\mu} \frac{1}{2}\text{trace}\left[\sigma_t \sigma_t^T \frac{\partial^2}{\partial x^2}\frac{\delta \bar{V}_g}{\delta \mu}(t,\mu)(t,x)\right] \rho^a(t,x) dx\]
Flux Analysis at Stopping Boundary: The probability mass conservation equation:
\[\frac{d}{dt}\Phi_g(t,\mu) = \int_{\partial S_{\mu^a}} \Phi(t,x)[J_{out}(t,x,u_t(x)) - \rho^a(t,x)v_{out}(t,x)] \cdot \hat{n}(t,x) dS(x)\]
where $J_{out}(t,x,u) = f(t,x,u)\rho^a(t,x) - \nabla_x \cdot [\frac{\sigma_t \sigma_t^T}{2}\rho^a(t,x)]$ is the outward flux density.
Functional Derivative Equivalence: The critical insight that:
\[\frac{\delta \bar{V}_g}{\delta \mu}(t,\mu)(t,x) = V_g(t,x)\]
This connects the distributional value function to the pointwise value function through functional differentiation.
Score Function Integration: The transformed drift becomes:
\[\tilde{f}(t,x,u) = f(t,x,u) - \frac{\sigma_t \sigma_t^T}{2}\nabla_x \log \rho(t,x)\]
where the score term $\nabla_x \log \rho(t,x)$ encodes the stochastic diffusion effect in the deterministic formulation.
Rockafellar-Wets Interchange: To exchange minimization and expectation:
\[\min_{u \in \Pi} \mathbb{E}_{x \sim \mu}[h(x,u(x))] = \mathbb{E}_{x \sim \mu}[\min_{u \in \mathbb{R}^m} h(x,u)]\]
This requires the integrand to be a normal integrand (measurable in $x$, continuous in $u$).

Experiments & Validation

Purely theoretical. The paper develops the mathematical framework without numerical experiments.

Empirical validation would require:

Computational algorithms to solve the coupled Fokker-Planck PDE and characteristic ODE system
Comparison studies showing equivalence between stochastic and deterministic formulations on test problems
Applications to concrete problems in robotics swarms, aerospace guidance, or quantitative finance
Convergence analysis of discretization schemes for the score-corrected velocity field

The authors note this work establishes theoretical foundations “with an eye toward the imminent development of tractable algorithmic architectures.”

Limitations & Open Problems

Limitations:

State-independent diffusion assumption - TECHNICAL: The Fokker-Planck transformation requires $\sigma_t$ independent of state and control. Extension to state-dependent diffusion would require different techniques and is noted as technically challenging.
Regularity assumptions on stopping boundary - RESTRICTIVE: Assumption 4.5 requires the stopping surface to be a $C^1$ hypersurface, which significantly narrows applicability. The authors acknowledge this is “heavily problem dependent” and don’t characterize when it holds.
Classical solution existence for Dirichlet problem - TECHNICAL: Assumption 4.7 requires classical solutions $\rho \in C^{1+\alpha/2,2+\alpha}_{loc}$ to uniformly parabolic Dirichlet problems, which may not hold for irregular boundaries.
Score function singularities - TECHNICAL: The score term $\nabla_x \log \rho(t,x)$ becomes unbounded near the Dirichlet boundary where $\rho(t,x) = 0$, requiring careful treatment in boundary layers.
No computational algorithms - NATURAL: This is a theoretical paper laying foundations, but practical implementation remains completely open.

Open Problems:
Algorithmic development: Design computationally tractable methods to solve the coupled nonlocal PDE-ODE system arising from the Fokker-Planck transformation.
Regularity theory for free boundaries: Develop sufficient conditions under which optimal stopping boundaries satisfy the required regularity assumptions for general cost functionals.

Asymptotic Theory for Graphical SLOPE: Precision Estimation and Pattern Convergence

Authors: Ivan Hejný, Giovanni Bonaccolto, Philipp Kremer, Sandra Paterlini et al. (6 authors) · Institution: Lund University, University of Enna, La Francaise Systematic Asset Management, University of Trento, University of Wroclaw · Category: math.ST

Develops asymptotic theory for SLOPE-penalized precision matrix estimation, establishing convergence of both parameter estimates and clustering patterns with robust variants for heavy-tailed data

Tags: precision matrix estimation graphical models SLOPE penalty pattern recovery heavy-tailed distributions robust statistics clustering M-estimators

arXiv · PDF

Problem Formulation

Motivation: This paper addresses precision matrix estimation for Gaussian graphical models, where the goal is not only to identify sparsity but also to recover clusters of edges with equal or similar strength. In financial networks, gene regulatory networks, and social networks, edges often naturally group into clusters with comparable magnitudes, requiring methods that can simultaneously estimate sparsity patterns and clustering structure.

Mathematical setup: Let $X^{(1)}, \ldots, X^{(n)}$ be i.i.d. observations of $X = (X_1, \ldots, X_p)^T \in \mathbb{R}^p$ with $E[X] = 0$ and full-rank covariance. The SLOPE estimator solves:

\[\hat{\Theta}_n \in \arg\min_{\Theta \in \text{Sym}^+(p)} \left\{ \frac{1}{n} \sum_{i=1}^n \ell(X^{(i)}, \Theta) + n^{-1/2} \text{Pen}(\Theta) \right\}\]

The SLOPE penalty is:

\[\text{Pen}(\Theta) = \sum_{j=1}^{p(p-1)/2} \lambda_j |\text{vec}^+(\Theta)|_{(j)}\]

where $\lambda_1 > \lambda_2 > \cdots > \lambda_{p(p-1)/2} > 0$ and $

\text{vec}^+(\Theta)

_{(1)} \geq \cdots \geq

\text{vec}^+(\Theta)

_{(p(p-1)/2)}$ are ordered absolute values of strictly lower-triangular entries.

Assumptions:

$\Theta \mapsto \ell(x, \Theta)$ is twice continuously differentiable in a neighborhood of $\Theta_0$
Bounded second derivatives: $|\nabla^2_\Theta \ell(X, \Theta)| \leq M(X)$ with $E[M(X)^2] < \infty$
Population Hessian $C := \nabla^2 G(\Theta_0)$ is positive definite, where $G(\Theta) = E[\ell(X, \Theta)]$
$E[\nabla_\Theta \ell(X, \Theta_0)] = 0$ and finite score covariance $C_\triangle$

Toy example: When $p = 2$ with $\Theta_0 = \begin{pmatrix} 2 & 1 \ 1 & 2 \end{pmatrix}$, there is one off-diagonal entry $\Theta_{12}$. The SLOPE pattern reduces to identifying whether this entry is zero or nonzero, similar to Lasso. The clustering aspect becomes meaningful for $p \geq 3$ where multiple off-diagonal entries can be tied to equal magnitudes.

Formal objective: Characterize the limiting distribution:
\[\sqrt{n}(\hat{\Theta}_n - \Theta_0) \overset{d}{\to} \hat{U}\]
where $\hat{U}$ solves a deterministic optimization problem involving the directional derivative of the SLOPE penalty.

Method

Core Method: The Graphical SLOPE estimator minimizes the penalized negative log-likelihood with ordered $\ell_1$ penalties on off-diagonal precision matrix entries. The key innovation is using decreasing penalty weights $\lambda_1 > \lambda_2 > \cdots$ applied to ordered absolute values, which encourages both sparsity and clustering of edge strengths.

Algorithm Steps:

Order off-diagonal entries: compute $

\text{vec}^+(\Theta)

_{(1)} \geq \cdots \geq

\text{vec}^+(\Theta)

_{(p(p-1)/2)}$

Apply penalty sequence: $\text{Pen}(\Theta) = \sum_{j=1}^{p(p-1)/2} \lambda_j

\text{vec}^+(\Theta)

_{(j)}$

Minimize objective: solve the convex optimization problem in equation (1)

For heavy-tailed data, use t-distribution loss instead of Gaussian loss (TSLOPE):

\[\ell(x, \Theta) = -\frac{1}{2}\log\det(\Theta) + \frac{\nu + p}{2}\log(\nu + x^T\Theta x)\]

Pattern Recovery: The method identifies the SLOPE pattern $\text{patt}(\theta^+)_i = \text{rank}(

\theta^+

)_i \text{sign}(\theta^+)_i$, which captures both sparsity (zero entries) and clustering (entries with equal absolute values).

Toy Example Application: For $p=2$ with one off-diagonal entry, SLOPE reduces to standard thresholding. For $p=3$ with entries ${\Theta_{12}, \Theta_{13}, \Theta_{23}}$, if the true values are ${0.5, 0.5, 0}$, SLOPE can recover the pattern where two entries are equal and nonzero while one is zero—something standard Lasso cannot guarantee.

Novelty & Lineage

Step 1 — Prior work:

“The graphical lasso: New insights and alternatives” (Yuan & Lin 2007): introduced $\ell_1$-penalized precision matrix estimation with sparsity but no clustering capability
“Sorted L-One Penalized Estimation” (Bogdan et al. 2015): developed SLOPE for regression with theoretical guarantees but not for matrix estimation
“Pattern identification and local convergence for SLOPE” (Larsson et al. 2021): established pattern convergence theory for SLOPE in regression settings

Step 2 — Delta: This paper extends SLOPE theory to precision matrix estimation, providing:

asymptotic distribution theory for graphical SLOPE in fixed-$p$ regime
explicit characterization under elliptical distributions with variance inflation quantification
robust t-distribution variant (TSLOPE), and
pattern convergence results for matrix-valued clustering.

Step 3 — Theory-specific assessment:
- Predictability: The main convergence result (Theorem 3.1) follows standard M-estimator asymptotics but requires non-trivial adaptation to matrix settings and directional derivatives of ordered penalties
- Proof technique: Largely assembles known techniques from polyhedral regularizer theory, but the explicit computation of Hessian and score covariance for elliptical models requires some technical work
- Bound tightness: No matching lower bounds provided; the results are distributional rather than rate-optimal
Verdict: INCREMENTAL — Solid extension of existing SLOPE theory to precision matrices with useful practical insights, but follows predictable technical patterns from the polyhedral regularizer literature.

Proof Techniques

Main Strategy: The proof relies on the general asymptotic theory for M-estimators with polyhedral penalties from Larsson et al. (2021). The key steps are:

M-estimator framework: Apply Theorem 3.1 by verifying regularity conditions and computing population quantities for the graphical SLOPE problem
Directional derivative computation: For SLOPE penalty, the directional derivative is:
\[\text{Pen}'(\Theta_0; U) = \sum_{j=1}^{p(p-1)/2} \lambda_j \sum_{i \in I_j(\Theta_0)} \text{sign}(\Theta_{0,ij}) U_{ij}\]
where $I_j(\Theta_0)$ indexes entries with $j$-th largest absolute value
Hessian calculation for elliptical models: Under $X = \Sigma^{1/2} uR$ with radius $R$, the key identity is:
\[C = \frac{1}{2}(\Sigma \otimes \Sigma), \quad C_\triangle = C + \left(\frac{E[R^4]}{p(p+2)} - 1\right)(C + \frac{1}{4}\text{vec}(\Sigma)\text{vec}(\Sigma)^T)\]
T-distribution analysis: For TSLOPE with constraint $E[R^2/(ν + R^2)] = p/(ν + p)$, compute:
\[C = \frac{1}{2}(\Sigma \otimes \Sigma) - \frac{E[\xi_R]}{p(p+2)}(\Sigma \otimes \Sigma + \frac{1}{2}\text{vec}(\Sigma)\text{vec}(\Sigma)^T)\]
where $\xi_R = (\nu + p)R^4/(ν + R^2)^2$
Pattern convergence: Follows from continuity of the pattern map and convergence of the optimization problem solution.

Experiments & Validation

Simulation Studies:

Setup: Multivariate Gaussian and t-distributed data with block-diagonal precision matrices ($p = 20$, block structure $I_2 \otimes \Sigma_0$ with compound symmetry)
Methods: GSLOPE, TSLOPE, Graphical Lasso, T-Lasso with BH penalty sequences
Key Results: TSLOPE substantially outperforms GSLOPE under heavy tails (t with ν=5), with improvements diminishing as ν increases. SLOPE methods achieve better clustering error vs total error trade-offs than Lasso methods
Hidden factor simulations: 300 assets, 6 factors, 3 groups with t(ν=4) factors. TSLOPE achieves lowest median Frobenius error

Empirical Application:

Data: 25 Fama-French portfolios sorted by size/book-to-market, daily returns 2000-2025 (6,498 observations)
Results: TSLOPE identifies economically meaningful clusters aligned with book-to-market characteristics, showing clear block structure in estimated precision matrices

Asymptotic Validation: Simulations confirm convergence of $\sqrt{n} \cdot \text{RMSE}(\hat{\Theta}_n)$ to theoretical limits as sample size increases.

Limitations & Open Problems

Limitations:

TECHNICAL: Fixed-$p$ asymptotic regime only; high-dimensional theory ($p \to \infty$) not developed, though simulations suggest similar behavior persists
RESTRICTIVE: Elliptical distribution assumption for heavy-tail analysis; more general heavy-tailed models not covered
TECHNICAL: TSLOPE requires constraint $E[R^2/(ν + R^2)] = p/(ν + p)$ which may not hold for general elliptical models
NATURAL: Positive definite constraint on precision matrix restricts to well-conditioned problems
TECHNICAL: Pattern convergence proven but no finite-sample pattern recovery guarantees provided

Open Problems:
Develop high-dimensional ($p/n \to c > 0$) theory for graphical SLOPE with minimax rates and pattern recovery thresholds
Extend robust analysis beyond elliptical models to more general heavy-tailed distributions and establish adaptive procedures that automatically detect tail behavior

Stochastic Modified Equations for Stochastic Gradient Descent in Infinite-Dimensional Hilbert Spaces

Authors: Sandra Cerrai, Qin Li, Anjali Nair, Jaeyoung Yoon · Institution: University of Maryland · Category: math.OC

Establishes that SGD in infinite-dimensional Hilbert spaces can be approximated by a cylindrical noise-driven SDE with second-order weak accuracy, extending finite-dimensional stochastic modified equation theory.

Tags: stochastic gradient descent infinite dimensional optimization stochastic differential equations weak error analysis inverse problems cylindrical brownian motion hilbert spaces stochastic modified equations

arXiv · PDF

Problem Formulation

Motivation: Inverse problems in scientific computing often require optimization over infinite-dimensional Hilbert spaces $H$ where parameters are functions rather than finite vectors. Stochastic Gradient Descent (SGD) is commonly used in such settings, but its continuous-time behavior in infinite dimensions requires rigorous analysis.
Mathematical setup: Consider the optimization problem
\[\min_{\phi \in H} F(\phi) := \mathbb{E} F_\gamma(\phi)\]
where ${F_\gamma : \gamma \in \Gamma}$ is a family of functionals from $H$ to $\mathbb{R}$, and $\gamma$ is a $\Gamma$-valued random variable. The SGD iteration is
\[\phi_{n+1} = \phi_n - \eta \nabla F_{\gamma_n}(\phi_n)\]
where ${\gamma_n}_{n \geq 0}$ are i.i.d. random variables with the same distribution as $\gamma$. The intrinsic noise is
\[V(\phi; \gamma) := \nabla F(\phi) - \nabla F_\gamma(\phi)\]
with covariance operator $Q(\phi) : H \to H$ defined by
\[\langle Q(\phi)u, v \rangle = \text{Cov}(\langle V(\phi; \gamma), u \rangle, \langle V(\phi; \gamma), v \rangle)\]
Assumptions:
1. $F$ is Fréchet differentiable on $H$
2. The initial iterate $\phi_0$ belongs to $H$
3. For every $\phi \in H$, $\mathbb{E}_\gamma |\nabla F_\gamma(\phi)|^2 < \infty$
4. There exist random variables $C_L(\gamma) \in L^2(\Gamma)$ and $C_G(\gamma) \in L^1(\Gamma)$ such that $|\nabla F_\gamma(\phi) - \nabla F_\gamma(\psi)| \leq C_L(\gamma)|\phi - \psi|$ and $\langle \nabla F_\gamma(\phi), \phi \rangle \leq C_G(\gamma)(1 + |\phi|^2)$
Toy example: When $H = \mathbb{R}^2$ and $F_\gamma(\phi) = \frac{1}{2}\langle \phi - \gamma, \Lambda(\phi - \gamma) \rangle$ with $\Lambda = I_2$ and $\gamma \sim \mathcal{N}(0, \sigma^2 I_2)$, the noise term becomes $V(\phi; \gamma) = -\gamma$, which is independent of $\phi$, and the covariance operator is $Q(\phi) = \sigma^2 I_2$.
Formal objective: Establish that the weak error between SGD iterates $\phi_n$ and discretized SME $\tilde{\phi}_n$ satisfies
\[\max_{k=0,\ldots,\lfloor T/\eta \rfloor} |\mathbb{E}[g(\phi_k)] - \mathbb{E}[g(\tilde{\phi}_k)]| = O(\eta^2)\]

Method

The authors derive a stochastic modified equation (SME) for SGD by matching first and second moments:

\[d\phi_t = -\nabla F(\phi_t) dt + \sqrt{\eta} \sigma(\phi_t) dW_t\]

where $W_t$ is a cylindrical Brownian motion on $U := L^2_\Gamma$ and the diffusion operator is constructed as:

\[\sigma(\phi)u := \mathbb{E}_\gamma[V(\phi; \gamma)u(\gamma)]\]

This construction ensures $\sigma(\phi)\sigma(\phi)^* = Q(\phi)$.

The discrete SME via Euler-Maruyama is:

\[\tilde{\phi}_{n+1} = \tilde{\phi}_n - \eta \nabla F(\tilde{\phi}_n) + \sqrt{\eta} \sigma(\tilde{\phi}_n) \Delta W_n\]

The weak error analysis proceeds by telescoping:

\[\mathbb{E}[g(\tilde{\phi}_k)] - \mathbb{E}[g(\phi_k)] = \sum_{l=1}^{k-1} G_l + G_k\]

where each $G_l$ represents the propagation error through smooth test functionals.

Applied to toy example: When $F_\gamma(\phi) = \frac{1}{2}\langle \phi - \gamma, \phi - \gamma \rangle$ with $\gamma \sim \mathcal{N}(0, \sigma^2 I_2)$, we have $V(\phi; \gamma) = -\gamma$, so $\sigma$ is constant with $\sigma\sigma^* = \sigma^2 I_2$. The SME becomes:

\[d\phi_t = -\phi_t dt + \sqrt{\eta} \sigma dW_t\]

which has an explicit solution allowing direct verification of the $O(\eta^2)$ weak error bound.

Novelty & Lineage

Step 1 — Prior work: The closest prior papers are:

Li et al. (2017) “Stochastic modified equations and adaptive stochastic gradient algorithms” - established SME theory for SGD in finite-dimensional Euclidean spaces
Feng et al. (2018) “Uniform-in-time weak error analysis for stochastic gradient methods” - provided weak error bounds for finite-dimensional SGD
Various works on infinite-dimensional SPDEs (Da Prato & Zabczyk, Prévôt & Röckner) - established well-posedness theory for stochastic evolution equations

Step 2 — Delta: This paper extends the SME framework from finite-dimensional Euclidean spaces to infinite-dimensional Hilbert spaces. Key technical contributions:
Construction of diffusion operator $\sigma: L^2_\Gamma \to H$ that ensures well-posedness under cylindrical noise
Weak error analysis requiring control of higher-order derivatives along discrete trajectories
Second-order weak approximation result (surprising since standard EM is only first-order)

Step 3 — Theory-specific assessment:
- The main theorem is somewhat predictable as an extension of finite-dimensional results, but the technical execution is non-trivial
- The proof requires genuinely new techniques: the cylindrical noise structure and the preservation of regularity classes under composition are novel challenges not present in finite dimensions
- No lower bounds are provided or known for this specific weak error problem
- The second-order rate is somewhat surprising since Euler-Maruyama schemes typically achieve first-order weak convergence
Verdict: INCREMENTAL — This is a solid technical extension of known finite-dimensional SME theory to infinite dimensions, but the core ideas and methodology follow established patterns.

Proof Techniques

The proof strategy consists of three main components:

Well-posedness of SME: Uses standard theory (Theorem 2.8) by verifying local Lipschitz and linear growth conditions for both drift $\nabla F$ and diffusion $\sigma$. The key inequality for the diffusion operator is:
\[\|\sigma(\phi) - \sigma(\psi)\|_{L^2(U,H)}^2 = \mathbb{E}\|V(\phi;\gamma) - V(\psi;\gamma)\|^2 \leq \mathbb{E}C_L(\gamma)^2 \|\phi - \psi\|^2\]
One-step error estimate (Lemma 4.2): Uses Taylor expansion to third order:
\[\mathbb{E}[g(\phi_1) - g(\tilde{\phi}_1)] = \mathbb{E}\left[\eta^3\left(D^3g(\hat{\phi}) (\sigma(\phi)(\Delta W))^3 - D^3g(\hat{\phi}) (V(\phi))^3\right)\right]\]
First and second-order terms cancel due to matching moments. This yields:
\[|\mathbb{E}[g(\phi_1) - g(\tilde{\phi}_1)]| \leq \eta^3 \|D^3g\|_\infty \mathbb{E}[\|\sigma(\phi)(\Delta W)\|^3 + \|V(\phi)\|^3]\]
Propagation of regularity (Lemma 4.3): Shows that the operator $A(g)(\phi) := \mathbb{E}[g(\tilde{\phi}_1(\phi,0))]$ maps the test function class $\mathcal{G}$ to itself. Key estimates involve controlling:
\[\|D^i A(g)\|_\infty \leq C_i(b,\sigma,\eta) \text{ combinations of } \|D^j g\|_\infty\]
where the constants $C_i$ grow controlled by the step size $\eta$.
Telescoping argument: Decomposes the global error as:
\[\mathbb{E}[g(\tilde{\phi}_k)] - \mathbb{E}[g(\phi_k)] = \sum_{l=1}^{k-1} G_l + G_k\]
Each $G_l$ is bounded using the one-step estimate applied to test functions that have been composed $k-l$ times, requiring the regularity propagation bounds.

Experiments & Validation

Two main experiments validate the theoretical $O(\eta^2)$ convergence rate:

Quadratic example with homogeneous noise: - Objective: $F(\phi) = \langle \phi, \Lambda \phi \rangle$ with $\Lambda_{ij} = (0.8)^{|i-j|}$ - Noise: Bernoulli-type with explicit covariance structure - Test function: $g(\phi) = |\phi|^4$ - Results: Clear $O(\eta^2)$ slope across truncation dimensions $D \in {5,10,20,30}$ - Monte Carlo: $N = 5 \times 10^6$ samples to minimize sampling error
Nonlinear inverse problem with state-dependent noise: - Objective: Optical ptychography problem $F_x(\phi) = \frac{1}{2}|A_x(\phi) - A_x(\bar{\phi})|^2$ - Sensing operator: Gaussian kernel $A_x(\phi) = \int K_\varepsilon(x,y)\phi(y)dy$ - Target function: Smooth function with three localized peaks - Results: Consistent $O(\eta^2)$ convergence across truncation levels $K \in {2,3,4,6,8}$

Key technical details: Finite-dimensional truncation with sinusoidal basis, careful Monte Carlo error control, and verification that convergence rate holds before saturation by sampling error.

Limitations & Open Problems

Limitations:

Bounded derivative assumption - TECHNICAL: The assumption that $b$ and $\sigma$ have uniformly bounded derivatives up to third order is needed for the discrete telescoping argument but could likely be relaxed to polynomial growth using continuous-time techniques.
Finite second moments - NATURAL: The assumption $\mathbb{E}_\gamma |\nabla F_\gamma(\phi)|^2 < \infty$ is standard in stochastic optimization and widely satisfied in practice.
Smooth test functionals - TECHNICAL: The analysis requires test functions $g$ with bounded derivatives up to third order, limiting the class of observables that can be analyzed.
Cylindrical noise structure - TECHNICAL: The specific construction $U = L^2_\Gamma$ and diffusion operator $\sigma: U \to H$ is tailored to match SGD moments but may not capture all possible noise structures in infinite-dimensional optimization.
Weak convergence only - RESTRICTIVE: The results only establish weak convergence through smooth functionals, not strong (pathwise) convergence of the processes.

Open problems:
Extend to polynomial growth conditions: Develop continuous-time analysis tools (e.g., Kolmogorov backward equation) to handle unbounded coefficients with polynomial growth.
Strong convergence rates: Establish pathwise convergence results between SGD and SME trajectories in infinite-dimensional settings.