Theory 3 papers

Theory Digest — Apr 6, 2026

Today’s Digest at a Glance

Today’s digest covers actor-critic methods for linear MDPs, spectral Wasserstein distances for matrix optimization, and minimum norm interpolation in uniformly convex spaces.

Linear Markov Decision Processes (Linear MDPs)

Linear MDPs provide a structured reinforcement learning setting where the transition dynamics and rewards have linear structure in a known feature representation. Specifically, the transition kernel satisfies $P_h(s’ s,a) = \langle \phi_h(s,a), \mu_h(s’) \rangle$ and rewards satisfy $r_h(s,a) = \langle \phi_h(s,a), \theta_h \rangle$ for known feature maps $\phi_h$ and unknown parameters $\mu_h, \theta_h$. This structure enables sample-efficient learning because the value functions also become linear in the features, allowing function approximation with polynomial sample complexity.

The key insight is that this linear structure propagates through the Bellman equations: if we define $Q_h^\pi(s,a) = \langle \phi_h(s,a), w_h^\pi \rangle$ for some weight vector $w_h^\pi$, then the Bellman consistency condition becomes a system of linear equations in the weight vectors. This allows algorithms to maintain explicit representations of value functions and policies using only $d$-dimensional parameter vectors (where $d$ is the feature dimension) rather than tabular representations over the full state-action space.

Traditionally, algorithms for linear MDPs either use confidence sets around the unknown parameters (requiring expensive optimization at each step) or maintain tabular representations that don’t fully exploit the linear structure. The linear structure enables both computational efficiency and strong theoretical guarantees.

2-Uniform Convexity in Normed Spaces

2-uniform convexity is a geometric property of normed spaces that strengthens the usual notion of strict convexity. A normed space $(X, |\cdot|)$ is 2-uniformly convex if there exists a constant $c > 0$ such that for all $x, y \in X$ with $|x| \leq 1, |y| \leq 1$:

\[\left\|\frac{x+y}{2}\right\|^2 \leq \frac{1}{2}(\|x\|^2 + \|y\|^2) - \frac{c}{2}\|x-y\|^2\]

This condition ensures that the unit ball is “uniformly round” rather than having flat sides or sharp corners. The parameter 2 indicates that the modulus of convexity grows quadratically. Important examples include Hilbert spaces (with $c = 1/4$) and $L^p$ spaces for $p \in [2, \infty)$.

The power of 2-uniform convexity lies in providing quantitative control over projections and approximations. In minimum norm interpolation problems, this property allows sharp analysis of the bias-variance decomposition because it bounds how much the minimum norm solution can deviate from the optimal predictor. The quadratic growth ensures that small perturbations in the constraints (due to noise) lead to small perturbations in the minimum norm solution, enabling precise finite-sample analysis.

Reading Guide: The first paper develops computationally efficient actor-critic methods that exploit linear MDP structure through explicit policy parameterization. The second paper introduces matrix-aware optimal transport distances that connect to modern optimizers. The third paper extends interpolation theory beyond Hilbert spaces using geometric properties of uniformly convex norms.


Optimistic Actor-Critic with Parametric Policies for Linear Markov Decision Processes

Authors: Max Qiushi Lin, Reza Asad, Kevin Tan, Haque Ishfaq et al. (6 authors) · Institution: Simon Fraser University, University of Pennsylvania, Mila McGill University, Google DeepMind, University of Alberta · Category: cs.LG

Introduces explicit log-linear policy parameterization for actor-critic in linear MDPs, achieving same sample complexity as prior work ($\tilde{O}(\epsilon^{-4})$ on-policy, $\tilde{O}(\epsilon^{-2})$ off-policy) but with significantly reduced computational cost per episode.

Tags: actor-critic linear-mdp policy-gradient exploration sample-complexity parametric-policies langevin-monte-carlo natural-policy-gradient

arXiv · PDF

Problem Formulation
  1. Motivation: Actor-critic methods for reinforcement learning are successful in practice but suffer from a theory-practice gap. Existing theoretical work either makes strong assumptions that sidestep exploration or analyzes impractical methods with implicit policies that are computationally expensive to sample from.

  2. Mathematical setup: Consider a finite-horizon linear MDP $M = (S, A, P, r, H)$ where $S$ is the state space, $A$ is finite with $ A $ actions, $H \in \mathbb{Z}_+$ is the horizon. The transition kernel satisfies:
    \[Ph(s' | s, a) = \langle \phi(s, a), \psi_h(s') \rangle\]

    The reward function satisfies:

    \[r_h(s, a) = \langle \phi(s, a), \upsilon_h \rangle\]

    where $\phi: S \times A \to \mathbb{R}^{d_c}$ is the feature map with $|\phi(s,a)|_2 \leq 1$. The policy is log-linear:

    \[\pi_h(a | s, \theta) = \frac{\exp(z_h(s, a | \theta_h))}{\sum_{a'} \exp(z_h(s, a' | \theta_h))}\]
    where $z_h(s, a \theta_h) = \langle \varphi(s, a), \theta_h \rangle$.

    Key assumptions:

    1. Linear MDP structure (Definition 2.1)
    2. Bias assumption: $\min_\theta \sup_{t,h} \tilde{\ell}_h^t(\theta) \leq \epsilon_{\text{bias}}$ (bounded approximation error)
    3. Optimization error: Actor optimization error bounded by $\epsilon_{\text{opt}}$
  3. Toy example: When $d_c = 2$, $ A = 2$, and $H = 2$, the feature map might be $\phi(s,a) = [s; a] \in \mathbb{R}^2$. The Q-function becomes linear: $Q_h^\pi(s,a) = \langle \phi(s,a), w_h \rangle$ for some $w_h \in \mathbb{R}^2$.
  4. Formal objective: Learn an $\epsilon$-optimal policy minimizing the optimality gap:

    \[\text{OG}(T) := \mathbb{E}[V_1^*(s_1) - V_1^{\pi_T}(s_1)] \leq \tilde{O}(\epsilon)\]
Method

The algorithm alternates between actor and critic updates:

Actor (Projected NPG): At episode $t$, solve the logit-matching regression:

\[\tilde{\ell}_h^t(\theta) = \frac{1}{2} \sum_{(s,a) \in D_{\text{exp}}} \rho_{\text{exp}}(s,a) \left[\langle \varphi(s,a), \theta - \tilde{\theta}_h^t \rangle - \eta \hat{Q}_h^t(s,a)\right]^2\]

Update via gradient descent:

\[\theta_h^{t,k} \leftarrow \theta_h^{t,k-1} - \alpha_a^t \nabla_\theta \tilde{\ell}_h^t(\theta_h^{t,k-1})\]

Critic (LMC): For each step $h$, define ridge regression loss:

\[L_h^t(w) = \frac{1}{2} \sum_{i=1}^{|D^t|} \left[r_h(s_h^i, a_h^i) + \hat{V}_{h+1}^t(s_{h+1}^i) - \langle \phi(s_h^i, a_h^i), w \rangle\right]^2 + \frac{\lambda}{2}\|w\|^2\]

Update via Langevin Monte Carlo:

\[w^{t,m,j}_h \leftarrow w^{t,m,j-1}_h - \alpha_c^{h,t} \nabla_w L_h^t(w^{t,m,j-1}_h) + \sqrt{\alpha_c^{h,t}/\zeta} \nu^{t,m,j}_h\]

where $\nu^{t,m,j}_h \sim N(0,I)$. Take maximum over $M$ samples:

\[\hat{Q}_h^t(s,a) = \min\left\{\max_{m \in [M]} \langle \phi(s,a), w_h^{t,m,J_t} \rangle, H\right\}_+\]

Toy example application: For $d_c = d_a = 2$, the actor solves a 2D regression problem over the coreset $D_{\text{exp}}$, while the critic maintains 2 parallel LMC chains for each horizon step.

Novelty & Lineage

Prior work:

  1. Liu et al. (2023): Achieved $\tilde{O}(\epsilon^{-4})$ on-policy sample complexity using implicit NPG policies and UCB bonuses, but with $O(d_c^2 H A \epsilon^{-2})$ computational cost per episode.
  2. Sherman et al. (2023): Achieved $\tilde{O}(\epsilon^{-2})$ off-policy sample complexity with implicit policies and complicated algorithmic modifications.

    Delta: This paper introduces explicit log-linear policy parameterization for actor-critic in linear MDPs, replacing implicit NPG policies. Key contributions:

  3. projected NPG that maintains explicit parameterization while controlling approximation error
  4. LMC-based critic replacing UCB bonuses for more practical exploration.

    Theory-specific assessment:

    • Main theorems are not surprising given prior work on implicit policies. The sample complexity bounds exactly match Liu et al. (2023) and Sherman et al. (2023).
    • Proof technique is mostly routine: standard regret decomposition plus projection error control. The LMC analysis adapts known concentration inequalities.
    • Bounds are tight relative to existing work but no fundamental lower bounds are established.
    • The main novelty is computational: reducing per-episode cost from $O(d_c^2 H A \epsilon^{-2})$ to $O(d_a H A )$ while maintaining same statistical rates.

    Verdict: INCREMENTAL — This is a solid computational improvement that makes existing algorithms more practical, but the statistical guarantees are identical to prior work and the proof techniques are standard applications of known results.

Proof Techniques
  1. Regret decomposition: Split optimality gap into actor and critic errors:

    \[\mathbb{E}[V_1^* - V_1^{\pi_T}] = \frac{1}{T} \sum_{t=1}^T \sum_{h=1}^H \mathbb{E}_{\pi^*}[\langle \pi_h^*(s_h) - \pi_h^t(s_h), \hat{Q}_h^t(s_h, \cdot) \rangle] + \frac{1}{T} \sum_{t=1}^T \sum_{h=1}^H [\mathbb{E}_{\pi^*}[\iota_h^t] - \mathbb{E}_{\pi^t}[\iota_h^t]]\]
  2. Projected NPG regret bound: Uses generalized mirror descent analysis with projection error $\epsilon_t$:

    \[\sum_{t=1}^T \langle u - p^t, g^t \rangle \leq \frac{\log|A| + \sum_{t=1}^T \epsilon_t}{\eta} + \frac{\eta H^2 T}{2}\]
  3. Projection error control: Key inequality bounding KL divergence error:

    \[\epsilon_h^t(s) \leq 2(\phi_G + 1)\sqrt{\epsilon_{\text{bias}}} + 2\sqrt{\epsilon_{\text{opt}}}\]

    where $\phi_G = \sup_{(s,a)} |\varphi(s,a)|_{G^{-1}}$ with $G = \sum_{(s,a) \in D_{\text{exp}}} \rho_{\text{exp}}(s,a) \varphi(s,a) \varphi(s,a)^\top$.

  4. LMC optimism guarantee: Shows model prediction error satisfies:

    \[-\Gamma_{\text{LMC}} \cdot \|\phi(s,a)\|_{(\Lambda_h^t)^{-1}} \leq \iota_h^t(s,a) \leq 0\]

    using self-normalized concentration (on-policy) and uniform concentration via covering numbers (off-policy).

  5. Elliptic potential argument: For off-policy setting:

    \[\sum_{t=1}^T \sum_{h=1}^H \mathbb{E}_{\pi^t}[\|\phi(s,a)\|_{(\Lambda_h^t)^{-1}}] \leq \tilde{O}(\sqrt{d_c^3 H^4 T})\]
  6. Covering number bound: Critical for off-policy analysis:

    \[\log N_\Delta(V) \leq d_c \log\left(1 + \frac{4W + 4H\sqrt{2Z}}{\Delta}\right) + d_a \log\left(1 + \frac{4H\sqrt{2Z}}{\Delta}\right)\]
Experiments & Validation

Linear MDP experiments: Random MDP and Deep Sea environments with dimensions $d_c = 10, 50$ and horizon $H = 10, 50$. Compares LMC-NPG-EXP (proposed), LMC-NPG-IMP (implicit baseline), and LMC (value-based). Results show comparable performance to memory-intensive baseline.

Atari experiments: Extended algorithm to deep RL using neural networks. Tested on Atari games comparing to PPO (on-policy) and DQN (off-policy) baselines via Stable Baselines3. Claims similar or better performance but limited experimental details provided.

Key numbers: In Random MDP with $d_c = 50, H = 50$, proposed algorithm achieves similar final performance (~0.8 value) to implicit baseline while using $O(d_a H A )$ vs $O(d_c^2 H A \epsilon^{-2})$ computation per episode.

Limitations: Experiments primarily in small-scale linear MDPs. Atari results lack detailed comparison methodology and statistical significance testing. No computational time comparisons provided despite this being a key claimed advantage.

Limitations & Open Problems

Limitations:

  1. TECHNICAL: Requires access to coreset $D_{\text{exp}}$ and distribution $\rho_{\text{exp}}$ via experimental design, adding offline preprocessing cost.
  2. TECHNICAL: Assumptions 4.1 and 4.2 on bias and optimization error are not verifiable in practice without knowing the optimal solution.
  3. RESTRICTIVE: Log-linear policy class may be insufficient for complex environments where neural network policies are needed.
  4. TECHNICAL: LMC requires careful tuning of inverse temperature $\zeta$, learning rates $\alpha_c$, and number of samples $M$.
  5. NATURAL: Linear MDP assumption standard in theoretical RL but restrictive for real applications.

    Open problems:

  6. Eliminate dependence on experimental design: Develop algorithms that don’t require offline construction of $D_{\text{exp}}$ and $\rho_{\text{exp}}$, making the approach fully online.
  7. Extension to general function approximation: Prove similar sample complexity guarantees for neural network policies and critics beyond the log-linear case, while maintaining computational advantages.

Muon Dynamics as a Spectral Wasserstein Flow

Authors: Gabriel Peyré · Institution: ENS, PSL Université · Category: math.OC

Introduces Spectral Wasserstein distances parameterized by matrix norms, proving static-dynamic equivalence for monotone cases and connecting to matrix-aware optimizers like Muon.

Tags: optimal transport matrix optimization spectral normalization mean-field theory gradient flows Muon optimizer Wasserstein distances neural network training

arXiv · PDF

Problem Formulation
  1. Motivation: Deep learning optimizers often use spectral normalizations (like Muon) that respect matrix structure of parameters better than coordinatewise Euclidean methods. Understanding these methods through optimal transport theory could provide theoretical foundations and unify various spectral normalization schemes.

  2. Mathematical setup: Let $\mathcal{P}_2(\mathbb{R}^d)$ denote probability measures on $\mathbb{R}^d$ with finite second moments. Consider a norm $\gamma$ on the cone $S_+^d$ of positive semidefinite $d \times d$ matrices. The Spectral Wasserstein distance between $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$ is defined as:

    \[W_\gamma(\mu, \nu)^2 := \inf_{\pi \in \Pi(\mu,\nu)} \gamma\left(\int_{\mathbb{R}^d \times \mathbb{R}^d} (y-x)(y-x)^T d\pi(x,y)\right)\]

    where $\Pi(\mu,\nu)$ denotes couplings between $\mu$ and $\nu$. Key assumptions:

    1. $\gamma$ is a norm on $S_+^d$
    2. For metric properties, $\gamma$ is monotone: $0 \preceq S \preceq T \Rightarrow \gamma(S) \leq \gamma(T)$
    3. There exists a convex compact representing set $K_\gamma \subset S^d$ such that $\gamma(S) = \max_{Q \in K_\gamma} \text{tr}(QS)$
  3. Toy example: When $d=2$ and $\gamma(S) = \text{tr}(S)$ (trace norm), this reduces to classical quadratic Wasserstein distance $W_2$. When $\gamma(S) = |S|_{op}$ (operator norm), optimal transport must balance the largest eigenvalue of displacement covariance, creating collective interaction between particles.

  4. Formal objective: Characterize the distance $W_\gamma$, prove equivalence between static Kantorovich and dynamic Benamou-Brenier formulations, and identify the corresponding gradient flows:

    \[\partial_t \mu_t + \text{div}(\mu_t J_{\mu_t}(g_{\mu_t})) = 0\]
Method

The method develops Spectral Wasserstein theory through three formulations:

Static Kantorovich formulation: For measures $\mu, \nu$, solve

\[W_\gamma(\mu, \nu)^2 = \inf_{\pi \in \Pi(\mu,\nu)} \gamma\left(\int (y-x)(y-x)^T d\pi(x,y)\right)\]

Max-min representation: Using duality,

\[W_\gamma(\mu, \nu)^2 = \max_{Q \in K_\gamma} W_Q(\mu, \nu)^2\]

where $W_Q$ is quadratic transport with cost matrix $Q$.

Dynamic Benamou-Brenier formulation: For monotone $\gamma$,

\[W_\gamma^{BB}(\mu_0, \mu_1)^2 = \inf_{(\mu_t, v_t)} \int_0^1 \gamma\left(\int v_t(x)v_t(x)^T d\mu_t(x)\right) dt\]

subject to continuity equation $\partial_t \mu_t + \text{div}(\mu_t v_t) = 0$.

Gradient flow construction: The duality map $J_\mu(g)$ minimizes

\[v \mapsto \int g(x) \cdot v(x) d\mu(x) + \frac{1}{2}\gamma\left(\int v(x)v(x)^T d\mu(x)\right)\]

For force covariance $S_\mu(g) = \int g(x)g(x)^T d\mu(x)$ and active matrix $Q^* \in \arg\max_{Q \in K_\gamma} \text{tr}(QS_\mu(g))$:

\[J_\mu(g)(x) = -(Q^*)^{-1}g(x)\]

Application to toy example: For $d=2$, trace norm gives $J_\mu(g) = -g$ (standard gradient flow). Operator norm gives Muon-type normalization with collective matrix structure.

Novelty & Lineage

Step 1 — Prior work:

  • Chizat and Bach (2018): Mean-field analysis of neural networks using classical Wasserstein geometry
  • Ambrosio et al. (2019): Wasserstein gradient flows for neural network training
  • Paty and Cuturi (2019): Subspace robust Wasserstein distances via projection optimization

Step 2 — Delta: This paper introduces Spectral Wasserstein distances parameterized by matrix norms on $S_+^d$, proving static-dynamic equivalence for monotone norms, and connecting to matrix-aware optimization methods like Muon. The key insight is interpreting spectral normalizations as transport with matrix-structured costs rather than coordinatewise penalties.

Step 3 — Theory-specific assessment:

  • The main theorem (static-dynamic equivalence) extends classical Benamou-Brenier theory to matrix-norm geometries, which is non-trivial due to indefinite test matrices in general norms
  • Proof technique uses monotonicity crucially to restrict to PSD representing sets, enabling Jensen’s inequality in the dynamic-to-static direction
  • Bounds are explicit: $\sqrt{c_\gamma} W_2 \leq W_\gamma \leq \sqrt{C_\gamma} W_2$ with tight constants for Schatten norms
  • Gaussian case reduces to finite-dimensional covariance optimization, extending Bures distance

The theoretical development is solid but incremental - it systematically extends known optimal transport theory to matrix-norm settings without fundamentally new proof techniques.

Verdict: INCREMENTAL — Well-executed extension of Wasserstein theory to matrix geometries, but follows predictable paths from prior optimal transport literature.

Proof Techniques

The main proofs use several key techniques:

Static-Dynamic equivalence (Theorem 3.3):

  1. Static ≥ Dynamic: For any dynamic plan $(\mu_t, v_t)$, use superposition principle to construct measure $\eta$ on paths with $\mu_t = (e_t)_#\eta$ and $\dot{\gamma}_t = v_t(\gamma_t)$. Key inequality via Jensen for PSD matrices:

    \[\int_0^1 \dot{\gamma}_t^T Q \dot{\gamma}_t dt \geq (\gamma_1 - \gamma_0)^T Q (\gamma_1 - \gamma_0)\]

    This requires $Q \succeq 0$, which follows from monotonicity via Proposition 2.2.

  2. Dynamic ≥ Static: Any coupling $\pi$ induces displacement interpolation $\mu_t = ((1-t)x + ty)_#\pi$ with constant velocity $y-x$, giving

    \[\int_0^1 \gamma\left(\int v_t v_t^T d\mu_t\right) dt = \gamma\left(\int (y-x)(y-x)^T d\pi\right)\]

    Gaussian reduction (Theorem 2.12): Uses the fact that Gaussian couplings are characterized by cross-covariance matrix $K$ with block PSD constraint:

    \[\begin{pmatrix} \Sigma_0 & K \\ K^T & \Sigma_1 \end{pmatrix} \succeq 0\]

    Displacement covariance becomes $(m_1-m_0)(m_1-m_0)^T + \Sigma_0 + \Sigma_1 - K - K^T$.

    Comparison bounds (Proposition 2.8): Direct application of homogeneity and compactness:

    \[c_\gamma = \inf_{\text{tr}(S)=1, S \succeq 0} \gamma(S), \quad C_\gamma = \sup_{\text{tr}(S)=1, S \succeq 0} \gamma(S)\]

    Geodesic convexity (Theorem 5.2): For linear functionals $F_h(\mu) = \int h d\mu$, geodesic convexity reduces to pointwise convexity of $h$ since Spectral Wasserstein geodesics remain displacement interpolations.

Experiments & Validation

The paper presents two numerical studies:

  1. Static spectral couplings: Compares optimal transport plans for trace norm ($p=1$), Frobenius norm ($p=2$), and operator norm ($p=\infty$) between 2D Gaussian mixtures. Shows how different norms create different coupling structures - operator norm creates more “collective” transport respecting global covariance structure.

  2. MMD gradient flows: Studies gradient flows of Maximum Mean Discrepancy (MMD) functionals under the three Spectral Wasserstein geometries. Demonstrates convergence behavior and shows how operator-norm geometry leads to different particle dynamics compared to classical Wasserstein.

    The experiments are primarily illustrative rather than comprehensive validation. A more thorough empirical validation would require:

    • Large-scale comparison with actual Muon optimization on deep networks
    • Analysis of computational complexity vs. classical methods
    • Convergence rate comparisons across different spectral geometries
    • Applications to realistic neural network training scenarios

    Code is provided at https://github.com/gpeyre/spectral-wasserstein for reproducibility.

Limitations & Open Problems

Limitations:

  1. Monotonicity requirement: Metric properties only hold for monotone norms - TECHNICAL (needed for PSD reduction in proofs, but excludes some natural matrix norms)

  2. Computational complexity: No analysis of algorithmic complexity for computing Spectral Wasserstein distances - RESTRICTIVE (practical scalability unclear)

  3. Idealized Muon connection: Analysis considers deterministic continuous-time limit without momentum or stochasticity - TECHNICAL (simplification needed for theory)

  4. Gaussian preservation conditions: Active matrix must remain invertible for preserved Gaussian flows - TECHNICAL (limits applicability of explicit analysis)

  5. Finite-sample convergence: No rates for convergence of empirical measures to population level - NATURAL (standard in mean-field theory)

    Open problems:

  6. Algorithmic development: Design efficient algorithms for computing Spectral Wasserstein distances, especially for high-dimensional problems with computational complexity analysis.

  7. Non-monotone extensions: Develop metric theory for non-monotone matrix norms, possibly through regularization or alternative geometric constructions that avoid the PSD restriction.


Minimum Norm Interpolation via The Local Theory of Banach Spaces: The Role of $2$-Uniform Convexity

Authors: Gil Kur, Pierre Bizeul · Institution: ETH Zürich, Weizmann Institute · Category: math.FA

Extends minimum-norm interpolator theory to 2-uniformly convex norms under sub-Gaussian covariates, achieving sharp bounds that match prior Gaussian-specific results.

Tags: minimum-norm-interpolation uniform-convexity benign-overfitting high-dimensional-geometry sub-Gaussian-analysis overparameterized-models banach-space-theory

arXiv · PDF

Problem Formulation

Motivation: The minimum-norm interpolator (MNI) framework aims to understand generalization in overparameterized models like neural networks. Traditional theory requires norms induced by inner products, but many practical settings involve more general norms. Understanding MNIs under weaker assumptions is crucial for theoretical foundations of deep learning.

Mathematical setup: Consider a Banach space $(B(X), |\cdot|)$ with unit ball $F := {f \in B(X) : |f| \leq 1}$. The MNI is defined as:

\[\hat{f}_n(x, y) := \arg\min_{f \in B(X): f|_x = y} \|f\|\]

The regression model is:

\[Y = f^*(X) + \xi\]

where $X \sim P$, $\xi \sim N(0, 1)$, and $f^* \in F$.

The norm is 2-uniformly convex (UC(2)) with constant $t \geq 0$ if:

\[\left\|\frac{f + g}{2}\right\|^2 + t\left\|\frac{f - g}{2}\right\|^2 \leq \frac{\|f\|^2 + \|g\|^2}{2}\]

Assumptions:

  1. $|f^*|_P \asymp |f^*| \asymp 1$ (signal has order-one norm)
  2. $|\hat{f}_n(X, \xi)| \geq C$ with probability $\geq 1 - n^{-2}$ (inductive bias)
  3. $M_g(F_n) \asymp M_n(F)$ and $M_g^*(F_n) \asymp M_n^*(F)$ with high probability (regularity)
  4. Lower isometry remainder satisfies $I_L(n, F, P) \leq C_4 G(F)^2$

    Toy example: When $d = 2$ and we use $\ell_p$ norm with $p \in (1, 2]$, the UC(2) constant is $t = p - 1$. For sparse signal $w^* = (1, 0)$ and isotropic Gaussian covariates, the MNI exhibits bias-variance tradeoffs controlled by geometric properties of the $\ell_p$ ball.

    Formal objective: Bound the mean squared error:

    \[\mathbb{E}_D\|\hat{f}_n - f^*\|_{L^2(P)}^2 = T_1 + T_2\]

    where $T_1$ is structural error and $T_2$ is noise error.

Method

The approach decomposes the MSE as $T_1 + T_2$ where:

\[T_1 = \mathbb{E}\|P_{f^*}(\mathbb{E}_\xi[\hat{f}_n|X]) - f^*\|_P^2 + \mathbb{E}\|P_{(f^*)^\perp}(\mathbb{E}_\xi[\hat{f}_n|X])\|_P^2\] \[T_2 = \mathbb{E}_X \text{Var}_\xi[\hat{f}_n|X]\]

Key steps:

  1. Use 2-uniform convexity to control bias via quantitative Anderson inequality:

    \[\mathbb{E}\|\xi + x\|^2 \geq \mathbb{E}\|\xi\|^2 + t\|x\|^2\]
  2. Apply concentration properties through convex concentration property (CCP) for sub-Gaussian covariates.

  3. Control structural error $T_1$ using the ratio $R_{MM^*}(F) = M_n(F)M_n^*(F)/n$ where $M_n(F)$ is the Gaussian mean width.

  4. For noise error $T_2$, use K-convexity to bound expected conditional variance via localization radius $r^*$.

    Algorithm for linear case: In $\mathbb{R}^d$ with $\ell_p$ norm, solve:

    \[\hat{w}_p = \arg\min_{Xw = Y} \|w\|_p\]

    Applied to toy example: For $\ell_2$ case with $p = 2$, $t = 1$, and $d = 2$:

    • Structural error: $T_1 \lesssim G_n(F)^2/n \approx 1/n$
    • Noise error: $T_2 \lesssim G_n(F)^2 \approx 1/n$
    • Total MSE: $O(1/n)$ achieving parametric rate
Novelty & Lineage

Prior work:

  1. Bartlett et al. (2020): Proved benign overfitting for $\ell_2$-MNI under rapidly decaying eigenvalues, exploiting closed-form solutions from inner product structure.
  2. Koehler et al. (2021), Donhauser et al. (2022): Established sharp rates for $\ell_p$-MNI with $p \in [1,2]$ under Gaussian covariates using Convex Gaussian Minimax Theorem (CGMT).
  3. Wang et al. (2022): Extended CGMT approach but remained limited to Gaussian covariates.

    Delta: This work extends MNI analysis to:

    • Sub-Gaussian covariates (beyond Gaussians)
    • 2-uniform convex norms (beyond inner product norms)
    • General Banach spaces (beyond finite-dimensional linear models)

    Theory-specific assessment:

    • Main theorem: The structural error bound $T_1 \lesssim R_{MM^*} G_n(F)/t$ under UC(2) is somewhat predictable as it mirrors classical ERM bounds, but extending beyond Gaussian covariates required non-trivial geometric analysis.
    • Proof technique: The combination of UC(2) geometry with CCP for sub-Gaussian covariates is genuinely new. Prior CGMT-based proofs don’t extend to non-Gaussian settings.
    • Tightness: For $\ell_p$-MNI with $p \in (1 + C/\log d, 2]$, bounds match prior sharp rates, suggesting tightness. No lower bounds proven for general UC(2) case.

    Verdict: INCREMENTAL — Solid extension of existing MNI theory to broader geometric and distributional settings, but the core phenomenon and proof strategies build incrementally on established foundations.

Proof Techniques

Main proof strategy:

  1. Quantitative Anderson inequality: For UC(2) norms with constant $t > 0$:

    \[\mathbb{E}\|\xi + x\|^2 \geq \mathbb{E}\|\xi\|^2 + t\|x\|^2\]

    This provides sensitivity to signal vs. pure noise interpolation.

  2. Convex concentration property (CCP): For sub-Gaussian $X$ and convex 1-Lipschitz $F$:

    \[\Pr(|F(X) - \mathbb{E}[F(X)]| \geq t) \leq 2\exp(-t^2/L^2)\]
  3. Key structural error bound: Using UC(2) and symmetrization:

    \[T_1 \lesssim \frac{R_{MM^*}(F) \cdot G_n(F)}{t}\]

    where $R_{MM^*}(F) = M_n(F)M_n^*(F)/n$ controls the geometry.

  4. Localization via position theory: In isotropic/John’s position for UC(2) norms:

    \[R_{\tilde{M}M^*}(F) = \tilde{O}(1)\]

    achieved through Klartag-Milman results and Bourgain’s slicing conjecture resolution.

  5. K-convexity for noise error: The expected conditional variance satisfies:

    \[T_2 \gtrsim n \cdot \Psi_n\left(e_1, C \cdot \frac{K_{\|\cdot\|} \cdot M_n(F)}{t\sqrt{n}}\right)^2\]

    where $K_{|\cdot|}$ is the K-convexity constant.

  6. Technical lemma for sub-Gaussian projections: With probability $\geq 1 - \exp(-cn)$:

    \[1 \lesssim b^{-1}(PK) \leq M_s^*(PK) \lesssim \log(d)^3\]

    using Dudley’s integral and finite volume ratio results.

Experiments & Validation

Purely theoretical.

Empirical validation would require:

  1. Comparing $\ell_p$-MNI performance on synthetic data with sub-Gaussian vs. Gaussian covariates
  2. Testing bounds on real datasets with sparse ground truth signals
  3. Validating geometric assumptions (UC(2), position requirements) on practical norms
  4. Measuring localization radius $r^*$ empirically to verify theoretical predictions
Limitations & Open Problems

Limitations:

  1. 2-uniform convexity assumption is TECHNICAL - authors conjecture that cotype-2 suffices but UC(2) needed for current proof techniques with sub-Gaussian covariates.

  2. Position requirements (isotropic/John’s) are TECHNICAL - needed for sharp constants but likely artifacts of geometric analysis tools.

  3. Sparsity assumption (1-sparse signal) in linear case is RESTRICTIVE - limits applicability to structured signals.

  4. Dimension constraints $n \log(d)^C \leq d \leq n^{q/2} \log(d)^{-C}$ are RESTRICTIVE - narrow regime for optimal behavior.

  5. Sub-Gaussian covariates with CCP property is NATURAL - standard assumption but excludes some heavy-tailed distributions.

    Open problems:

  6. Quantitative Anderson inequality: For cotype-2 norms, does there exist a position such that $\mathbb{E}|\xi + x|^2 \geq \mathbb{E}|\xi|^2 + c(t)|x|^2$?
  7. Optimal position: For UC(2) norms under Gaussian covariates, what position minimizes $T_2 \lesssim C(1/t) \cdot (n/d)$?