Mar 17, 2026 Theory 15 papers

Theory Digest — Mar 17, 2026

SympFormer: Accelerated attention blocks via Inertial Dynamics on Density Manifolds

Authors: Viktor Stein, Wuchen Li, Gabriele Steidl · Institution: (high-relevance, institution unverified) · Category: cs.LG

Introduces accelerated attention blocks by deriving transformer architectures from inertial Nesterov-type dynamics on probability density manifolds, achieving improved convergence with same oracle complexity.

Tags: transformer-theory optimal-transport nesterov-acceleration hamiltonian-dynamics attention-mechanisms gradient-flows density-manifolds symplectic-integration

arXiv · PDF

Problem Formulation

Motivation: Standard transformer attention blocks correspond to gradient flows in probability density spaces, but lack acceleration mechanisms that could improve convergence rates. Transformer training efficiency remains limited by the O(1/k) convergence rate of standard gradient descent, while accelerated methods like Nesterov achieve O(1/k²) rates in convex optimization.
Mathematical setup: Let $P(\Omega)$ be the space of smooth probability densities on $\Omega = \mathbb{R}^d$. For a density $\rho \in P(\Omega)$, define the tangent space $T_\rho P(\Omega) = {\sigma \in C^\infty(\Omega) : \int_\Omega \sigma dx = 0}$. Given query, key, value matrices $Q, K \in \mathbb{R}^{m \times d}$, $V \in \mathbb{R}^{d \times d}$ with symmetric $A := K^T Q$, consider the attention kernel:
\[\kappa(x_i, x_j) = \frac{\exp(x_j^T A x_i)}{\sum_{\ell=1}^N \exp(x_\ell^T A x_i)}\]
The transformer PDE for softmax self-attention is:
\[\partial_t \rho_t = -\nabla \cdot \left( \rho_t(x) \int V y \frac{\exp(y^T A x)}{\int \exp(z^T A x) \rho_t(z) dz} \rho_t(y) dy \right)\]
Assumptions:
1. $A$ is symmetric and positive definite
2. $V$ is symmetric
3. $B := VA^{-1}$ is symmetric and positive definite
4. Densities have finite second moments
Toy example: When $d=2$, $N=2$, $A = I_2$, $V = I_2$, and initial tokens $X_1(0) = [1,0]^T$, $X_2(0) = [0,1]^T$, the softmax attention becomes $\kappa(X_1, X_2) = \exp(0)/(\exp(1) + \exp(0)) \approx 0.27$, so $X_1$ updates as $X_1 + h \cdot 0.27 \cdot X_2$.
Formal objective: Design accelerated attention blocks that approximate the solution to:
\[\min_{\rho \in P(\Omega)} F(\rho) := -\frac{1}{2} \iint \exp(y^T A x) \rho(x) \rho(y) dx dy\]

Method

The method derives accelerated attention blocks from inertial Nesterov-type dynamics on density manifolds.

Start with the Hamiltonian functional:
\[H(\rho_t, \Phi_t) := \frac{1}{2} \int \Phi_t G_{\rho_t}^{-1}[\Phi_t] dx + F(\rho_t)\]
The accelerated gradient flow satisfies the damped Hamiltonian system:
\[\partial_t \begin{pmatrix} \rho_t \\ \Phi_t \end{pmatrix} + \begin{pmatrix} 0 \\ \alpha_t \Phi_t \end{pmatrix} - \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix} \begin{pmatrix} \frac{\delta}{\delta \rho_t} H(\rho_t, \Phi_t) \\ \frac{\delta}{\delta \Phi_t} H(\rho_t, \Phi_t) \end{pmatrix} = 0\]
For softmax attention, this becomes:
\[\partial_t \rho_t + \nabla \cdot \left( \rho_t \frac{VA^{-1} \nabla \Phi_t}{\int \exp(z^T A x) \rho_t(z) dz} \right) = 0\] \[\partial_t \Phi_t + \alpha_t \Phi_t + \frac{1}{2} \frac{\|\nabla \Phi_t\|_B^2}{\int \exp(z^T A x) \rho_t(z) dz} + \text{correction terms} = 0\]
Discretize using empirical measure $\rho_t \approx \frac{1}{N} \sum_{j=1}^N \delta_{X_j(t)}$ and set $Y_i(t) := \nabla \Phi_t(X_i(t))$
Apply conformally symplectic Euler discretization:
\[X^{(k+1)} = X^{(k)} + h_k \frac{N}{diag(M^{(k)} \mathbf{1})} Y^{(k)} B^T\] \[Y^{(k+1)} = \sigma_k Y^{(k)} + h_k \frac{N}{2} \left( \{R^{(k)}, M^{(k)}\} + 2M^{(k)} \right) X^{(k)} A\]
where $M_{ij}^{(k)} = \exp(X_i^{(k)T} A X_j^{(k)})$.

Applied to toy example: With $X_1^{(0)} = [1,0]^T$, $X_2^{(0)} = [0,1]^T$, $Y_1^{(0)} = Y_2^{(0)} = 0$, the first momentum update gives $Y_1^{(1)} = h \cdot N/2 \cdot M_{11}^{(0)} \cdot X_1^{(0)} = h \cdot 1 \cdot [1,0]^T$, then position update $X_1^{(1)} = [1,0]^T + h \cdot Y_1^{(1)} = [1+h^2, 0]^T$.

Novelty & Lineage

This work extends the connection between transformers and optimal transport (Geshkovski et al. 2025, Castin et al. 2025) by incorporating Nesterov acceleration on density manifolds (Wang & Li 2022, Taghvaei & Mehta 2019).

Key novelties:

First application of accelerated gradient flows on density manifolds to transformer architectures
Introduction of momentum variables for tokens, creating “Hamiltonian momentum attention blocks”
Proof that linear attention preserves elliptically contoured distributions under acceleration
Non-separable Hamiltonian formulation for softmax attention case
Oracle-preserving time discretizations that maintain computational efficiency

Prior work: Sinkformer (Sander et al. 2022) connected transformers to Sinkhorn algorithm; Geshkovski et al. (2024, 2025) established transformer-OT connections; concurrent Yuriiformer (Zimin et al. 2026) applies classical Nesterov in Euclidean space rather than density manifolds.

The mathematical framework connecting accelerated density flows to transformer architectures is novel, though builds incrementally on established OT-transformer connections. SIGNIFICANT

Proof Techniques

The proofs use three main strategies:

Variational calculus on density manifolds: For softmax attention, show the transformer PDE is a gradient flow by computing the first variation:
\[\frac{\delta F_{SM}(\rho)}{\delta \rho} = -\int \exp(y^T A x) \rho(y) dy\]
and verifying:
\[\partial_t \rho_t = -(G^{SM}_{\rho_t})^{-1} \left[ \frac{\delta F_{SM}(\rho_t)}{\delta \rho_t} \right]\]
Preservation of elliptical distributions: For linear attention with initial density $\rho_0 = E(m, \Sigma, g)$, prove by ansatz that:
\[\rho_t = E(m_t, \Sigma_t, g), \quad \Phi_t(x) = \frac{1}{2} x^T P_t x\]
Substituting into the PDE yields the matrix ODEs:
\[\dot{\Sigma}_t = \kappa_g (\Sigma_t A \Sigma_t P_t + P_t \Sigma_t A \Sigma_t)\] \[\dot{P}_t = -\alpha_t P_t - \kappa_g (A \Sigma_t P_t^2 + P_t^2 \Sigma_t A) + V\]
Particle discretization via empirical measures: Replace $\rho_t$ with $\frac{1}{N} \sum_{j=1}^N \delta_{X_j(t)}$ in the PDE. The key technical insight is that $Y_i(t) := \nabla \Phi_t(X_i(t))$ transforms the PDE into the finite-dimensional system:
\[\dot{X}_i = \frac{1}{N} \sum_j Y_j X_j^T A X_i, \quad \dot{Y}_i = -\alpha_t Y_i - \frac{1}{N} \sum_j \langle Y_i, Y_j \rangle A X_j + V X_i\]
The non-separable Hamiltonian structure requires careful treatment of the mixed $X$-$Y$ dependence in the kinetic energy term.

Experiments & Validation

Experiments on TinyStories dataset (2000+ short stories) comparing:

Baseline transformer
Yuriiformer (Lie-Trotter Nesterov)
SympFormer variants: Plain Euler, Presymplectic Euler, Exponential Euler, Adams-Bashforth-2

Key results:

SympFormer Presymp ETD-AB2: 1.8386 validation loss (best)
Baseline: 2.4687 validation loss
Yuriiformer: 2.4041 validation loss
Wall-clock time: 2409s (baseline) vs 5798s (best SympFormer)

Additional experiments on linear attention variants showing similar improvements. All models use same oracle budget (attention computations per step). SympFormer achieves lower cross-entropy loss but at higher computational cost per iteration.

Architecture details: nanoGPT-style decoder-only, learnable damping schedules $\alpha(t) = r/t + m$, separate step sizes for position/momentum updates.

Limitations & Open Problems

Limitations:

RESTRICTIVE: Theory requires compact manifolds but experiments use $\Omega = \mathbb{R}^d$ - rigorous treatment needs layer normalization as sphere projection
TECHNICAL: Symmetric matrix assumptions ($A$, $V$, $B$ positive definite) needed for Hamiltonian structure but may be removable
RESTRICTIVE: Empirical measure approximation $\rho_t \approx \frac{1}{N} \sum \delta_{X_i(t)}$ lacks convergence guarantees as $N \to \infty$
NATURAL: Multi-head attention averaging approximation (theory for single-head, practice averages across heads)
RESTRICTIVE: Higher wall-clock time per iteration limits practical scalability
TECHNICAL: Log-linear damping schedule $\alpha(t) = r/t + m$ is heuristic, optimal schedules unknown

Open problems:
Establish rigorous convergence rates for the particle approximation and prove acceleration benefits carry over from continuous to discrete setting
Extend framework to non-symmetric attention matrices and derive accelerated dynamics for other transformer components (layer normalization, MLPs) within the density manifold formalism

Robust and Computationally Efficient Linear Contextual Bandits under Adversarial Corruption and Heavy-Tailed Noise

Authors: Naoto Tani, Futoshi Futami · Institution: (high-relevance, institution unverified) · Category: cs.LG

First computationally efficient algorithm for linear contextual bandits robust to both adversarial corruption and heavy-tailed noise with optimal $O(1)$ per-round updates.

Tags: contextual bandits adversarial robustness heavy-tailed noise online mirror descent computational efficiency regret bounds Huber loss

arXiv · PDF

Problem Formulation

Motivation: Linear contextual bandits are fundamental for sequential decision-making with side information (recommendation systems, online advertising). However, real-world applications face adversarial corruption (e.g., spam reviews, click fraud) and heavy-tailed noise (e.g., financial signals), often simultaneously.
Mathematical setup: At each round $t \in [T]$, the learner observes decision set $X_t \subset \mathbb{R}^d$ and selects action $X_t \in X_t$. The environment generates stochastic reward:
\[r'_t = \langle X_t, \theta^* \rangle + \eta_t\]
where $\theta^* \in \mathbb{R}^d$ is unknown parameter. An adversary adds corruption $c_t$, yielding observed reward:
\[r_t = \langle X_t, \theta^* \rangle + \eta_t + c_t\]
Assumptions:
1. Heavy-tailed noise: For some $\epsilon \in (0,1]$, $E[\eta_t \lvert X_{1:t}, \eta_{1:t-1}, c_{1:t-1}] = 0$ and $E[\lvert \eta_t \rvert^{1+\epsilon} \lvert X_{1:t}, \eta_{1:t-1}, c_{1:t-1}] \leq \nu_t^{1+\epsilon}$
2. Bounded contexts and parameter: $\lvert x \rvert_2 \leq L$ for all $x \in X_t$ and $\theta^* \in \Theta := {\theta : \lvert \theta \rvert_2 \leq S}$
3. Corruption budget: $C = \sum_{t=1}^T \lvert c_t \rvert$
Toy example: When $d=2$, $X_t = {e_1, e_2}$ (coordinate vectors), $\theta^* = [1,1]^T$, corruption $c_t = 1$ on first $T/2$ rounds, and $\eta_t \sim \text{Pareto}(\alpha=1.5)$ (giving $\epsilon = 0.5$), the learner must distinguish signal from heavy-tailed noise plus adversarial bias.
Formal objective: Minimize regret
\[\text{Reg}(T) = \sum_{t=1}^T \max_{x \in X_t} \langle x, \theta^* \rangle - \sum_{t=1}^T \langle X_t, \theta^* \rangle\]

Method

The algorithm (CR-Hvt-UCB) uses online mirror descent with adaptive Huber loss scaling to handle both corruptions and heavy tails.

Key components:

Huber loss on normalized prediction error $z_t(\theta) = (r_t - \theta^T X_t)/\sigma_t$:
\[\ell_t(\theta) = \begin{cases}\] \[\frac{1}{2}z_t(\theta)^2 & \text{if } |z_t(\theta)| \leq \tau_t \\\] \[\tau_t|z_t(\theta)| - \frac{\tau_t^2}{2} & \text{if } |z_t(\theta)| > \tau_t\] \[\end{cases}\]
Adaptive scale parameter:
\[\sigma_t = \max\left(\nu_t, \sigma_{\min}, \sqrt{\frac{2\beta_{t-1}}{\tau_0\sqrt{\alpha}}} t^{\frac{1-\epsilon}{2(1+\epsilon)}} \|X_t\|_{V_{t-1}^{-1}}, \sqrt{C\kappa^{-1/4}}\|X_t\|_{V_{t-1}^{-1}}^{1/2}\right)\]
Threshold parameter:
\[\tau_t = \tau_0 \sqrt{\frac{1+w_t^2}{w_t}} t^{\frac{1-\epsilon}{2(1+\epsilon)}}\]
OMD update with quadratic regularizer $\psi_t(\theta) = \frac{1}{2}\lvert \theta \rvert_{V_t}^2$:
\[\tilde{\theta}_{t+1} = \hat{\theta}_t - V_t^{-1} \nabla \ell_t(\hat{\theta}_t)\] \[\hat{\theta}_{t+1} = \arg\min_{\theta \in \Theta} \|\theta - \tilde{\theta}_{t+1}\|_{V_t}\]
where $V_t = V_{t-1} + \alpha^{-1} X_t X_t^T / \sigma_t^2$.
UCB arm selection:
\[X_t = \arg\max_{x \in X_t} \{\langle x, \hat{\theta}_t \rangle + \beta_{t-1} \|x\|_{V_{t-1}^{-1}}\}\]
Toy example application: With $d=2$, coordinate arms, the corruption-dependent term in $\sigma_t$ inflates the normalization when $\lvert c_t \rvert$ is large, reducing the gradient magnitude and limiting corruption’s influence on parameter updates.

Novelty & Lineage

This extends prior work on linear contextual bandits in two key directions:

Statistical robustness: Yu et al. [2025] studied simultaneous corruption and heavy-tailed noise but required finite variance ($\epsilon = 1$ only). This work relaxes to bounded $(1+\epsilon)$-moments for any $\epsilon \in (0,1]$, generalizing the noise assumption.
Computational efficiency: Yu et al. [2025] required solving convex optimization at each round ($O(t \log T)$ cost). This work achieves $O(1)$ per-round updates via online mirror descent, following the OMD framework of Wang et al. [2025] for heavy-tailed bandits.

The algorithm combines corruption-robust scaling (adapting He et al. [2022]) with heavy-tailed moment techniques (Wang et al. [2025]) in a unified OMD framework. When $\epsilon = 1$ and $C = 0$, it recovers existing bounds.

Prior work addressed either corruption (He et al. [2022]) or heavy tails (Wang et al. [2025]) separately, or both with restrictive assumptions (Yu et al. [2025]).

SIGNIFICANT

Proof Techniques

The proof extends OMD error decomposition from Wang et al. [2025] to handle corruption via careful separation of corruption-dependent and independent terms.

Key decomposition (Lemma B.1):

\[\|\hat{\theta}_{t+1} - \theta^*\|_{V_t}^2 \leq 4\lambda S^2 + \sum_{s=1}^t \|\nabla \ell_s(\hat{\theta}_s)\|_{V_s^{-1}}^2 + 2\sum_{s=1}^t \langle \nabla \tilde{\ell}_s(\hat{\theta}_s) - \nabla \ell_s(\hat{\theta}_s), \hat{\theta}_s - \theta^* \rangle + \left(\frac{1}{\alpha} - 1\right) \sum_{s=1}^t \|\hat{\theta}_s - \theta^*\|_{X_s X_s^T/\sigma_s^2}\]

Main technical insights:

Stability term analysis: Decompose gradient norm as:
\[\|\nabla \ell_s(\hat{\theta}_s)\|_{V_s^{-1}}^2 \leq 3\left(\min\left\{\left|\frac{\eta_s}{\sigma_s}\right|, \tau_s\right\}\right)^2 \left\|\frac{X_s}{\sigma_s}\right\|_{V_s^{-1}}^2 + 3\left(\frac{\langle X_s, \theta^* - \hat{\theta}_s \rangle}{\sigma_s}\right)^2 \left\|\frac{X_s}{\sigma_s}\right\|_{V_s^{-1}}^2 + 3\left(\frac{c_s}{\sigma_s}\right)^2 \left\|\frac{X_s}{\sigma_s}\right\|_{V_s^{-1}}^2\]
The corruption term is controlled by $\sigma_s \geq \sqrt{C\kappa^{-1/4}} \lvert X_s \rvert_{V_{s-1}^{-1}}^{1/2}$, giving $\sum_{s=1}^t c_s^2/\sigma_s^2 \leq \tilde{O}(d)$.
Generalization gap analysis: Use union bound to separate corruption and noise:
\[\mathbf{1}\{|z_s(\theta^*)| > \tau_s/2\} \leq \mathbf{1}\left\{\left|\frac{\eta_s}{\sigma_s}\right| > \tau_s/4\right\} + \mathbf{1}\left\{\left|\frac{c_s}{\sigma_s}\right| > \tau_s/4\right\}\]
Apply heavy-tailed concentration (Lemma D.2) to noise part and deterministic bounds to corruption part.
Key scaling insight: The corruption-aware scale $\sigma_t$ ensures corruption terms have same order as uncorrupted analysis, preserving $\tilde{O}(d t^{(1-\epsilon)/(2(1+\epsilon))})$ confidence radius.

Experiments & Validation

Synthetic experiments with $d=10$, $\theta^*$ uniform on unit sphere, 20 random unit vector arms per round, $T=5000$.

Noise models:

$\epsilon = 1$: Student’s t-distribution (3 d.o.f), $\nu = \sqrt{3}$
$\epsilon = 0.4$: Pareto distribution ($\alpha = 1.5$, $x_{\min} = 1$), $\nu = 15^{1/(1+\epsilon)}$

Corruption: $\theta$-flipping mechanism until budget $C \in {0, 100}$ exhausted.

Results:

When $\epsilon = 1$: CR-Hvt-UCB matches GAdaOFUL regret but is orders of magnitude faster
When $\epsilon < 1$: GAdaOFUL fails (linear regret), CR-Hvt-UCB achieves sublinear regret
10 independent runs, averaged results

Key insight: Demonstrates both computational and statistical advantages, especially for true heavy-tailed case ($\epsilon < 1$) where existing method completely fails.

Limitations & Open Problems

Limitations:

TECHNICAL: Requires knowledge of corruption budget $C$ for optimal guarantees (though unknown $C$ version provided with weaker bounds)
TECHNICAL: Moment parameter $\epsilon$ assumed known (standard in heavy-tailed bandit literature)
TECHNICAL: Constants in regret bound likely not tight (common in theoretical analysis)
NATURAL: Additive corruption model (standard assumption, covers many practical scenarios)
RESTRICTIVE: When both $C$ and $\nu_t$ unknown, regret bound $\tilde{O}(dT^{1/(1+\epsilon)})$ doesn’t achieve lower bound $\tilde{\Omega}(d^{2\epsilon/(1+\epsilon)} T^{1/(1+\epsilon)})$

Open problems:
Lower bound optimality: Can any algorithm for general linear contextual bandits with heavy tails and corruption achieve the $\tilde{O}(d^{2\epsilon/(1+\epsilon)} T^{1/(1+\epsilon)})$ lower bound, or does this require finite decision sets?
Extension to GLBs: Generalize the OMD framework to generalized linear bandits under corruption and heavy tails while maintaining $O(1)$ per-round complexity.

Saddle Point Evasion via Curvature-Regularized Gradient Dynamics

Authors: Liraz Mudrik, Isaac Kaminer, Sean Kragelund, Abram H. Clark · Institution: MIT · Category: math.OC

Introduces Curvature-Regularized Gradient Dynamics that augments the objective with a smooth penalty on negative eigenvalues, yielding the first deterministic saddle escape method with user-selectable convergence rates and escape time that improves with smaller eigenvalue gaps.

Tags: nonconvex optimization saddle point escape Lyapunov methods gradient dynamics curvature regularization prescribed-time stability second-order stationarity control theory

arXiv · PDF

Problem Formulation

Motivation: Nonconvex optimization underlies many machine learning and control tasks, where saddle points pose the dominant obstacle to reliable convergence in high-dimensional settings. Existing methods either lack deterministic guarantees (stochastic perturbation methods) or suffer from Hessian singularity issues (Newton-type approaches), making controllable escape from saddle points an open challenge.
Mathematical setup: Consider the unconstrained optimization problem:
\[\min_{x \in \mathbb{R}^n} J(x)\]
where $J : \mathbb{R}^n \to \mathbb{R}$ is the objective function with gradient $g(x) = \nabla J(x)$ and Hessian $H(x) = \nabla^2 J(x)$. A point $x^* \in \mathbb{R}^n$ is a strict saddle point if $g(x^*) = 0$ and $H(x^*)$ has at least one strictly negative eigenvalue. The set of second-order stationary points (SOSP) is:
\[\mathcal{X}^* := \{x \in \mathbb{R}^n \mid g(x) = 0, H(x) \succeq 0\}\]
Assumptions:
1. $J \in C^4(\mathbb{R}^n)$ (smoothness)
2. At every strict saddle point $x^*$, the minimum eigenvalue $\lambda_{\min}(H(x^*))$ is simple and satisfies $\nabla_x \lambda_{\min}(x^*) \neq 0$ (eigenvalue simplicity)
3. The sublevel sets ${x \in \mathbb{R}^n : J(x) \leq c}$ are compact for every $c \in \mathbb{R}$ (compact sublevel sets)
Toy example: When $n=2$ and $J(x) = \frac{1}{2}x_1^2 - \frac{1}{2}x_2^2$, we have $g(0) = 0$ and $H(0) = \text{diag}(1, -1)$ with $\lambda_{\min} = -1 < 0$, making the origin a strict saddle point. Standard gradient descent initialized near the origin along the unstable manifold (e.g., $x_0 = [\epsilon, 0]$) converges to the saddle with exponentially slow escape time $O(\frac{1}{\delta} \ln \frac{1}{\epsilon})$ where $\delta = 1$ is the eigenvalue gap.
Formal objective: Design a dynamical system $\dot{x} = u$ with feedback control $u$ such that trajectories converge to $\mathcal{X}^*$ at a user-selectable rate while avoiding entrapment at saddle points.

Method

The Curvature-Regularized Gradient Dynamics (CRGD) method augments the original objective with a smooth penalty on the most negative Hessian eigenvalue. Let $\lambda_{\min}(x) := \lambda_{\min}(H(x))$ denote the smallest eigenvalue of the Hessian. The augmented cost is:

\[\Phi(x) = J(x) + \frac{\beta^2}{2} [\max(0, -\lambda_{\min}(x))]^2\]

where $\beta > 0$ is the curvature penalty weight.

The gradient of $\Phi$ is:

\[\nabla \Phi = g - \beta^2 \max(0, -\lambda_{\min}) w_{\min}\]

where $w_{\min}$ is the curvature sensitivity vector with components $(w_{\min})_j = u_{\min}^T \frac{\partial H}{\partial x_j} u_{\min}$ and $u_{\min}(x)$ is a unit eigenvector associated with $\lambda_{\min}$.

The CRGD control law follows the Optimization Lyapunov Function framework:

Choose Lyapunov function $V = \Phi - \Phi^*$ where $\Phi^* := \inf_{x \in \mathcal{X}^*} \Phi(x)$
Select convergence law $\sigma(V,t) \geq 0$ (e.g., exponential: $\sigma = cV$)
Apply feedback law: $u = -\frac{\sigma(\Phi(x) - \Phi^*, t)}{\lvert \nabla \Phi(x) \rvert^2} \nabla \Phi(x)$

Applied to toy example: For $J(x) = \frac{1}{2}x_1^2 - \frac{1}{2}x_2^2$, we have $\lambda_{\min} = -1$ and $u_{\min} = [0, 1]^T$ at the origin. The curvature sensitivity is $w_{\min} = [0, -2]^T$, so $\nabla \Phi(0) = [0, 0]^T - \beta^2 \cdot 1 \cdot [0, -2]^T = [0, 2\beta^2]^T \neq 0$, providing a nonzero descent direction that escapes the saddle.

Novelty & Lineage

This work extends the Optimization Lyapunov Function (OLF) framework of Mudrik et al. to handle nonconvex optimization. Prior saddle escape methods include:

Perturbed Gradient Descent (Jin et al.) which uses stochastic noise but lacks deterministic guarantees
Cubic regularization methods (Nesterov-Polyak, Carmon et al.) which achieve optimal complexity but don’t offer selectable rates, and
Newton-based approaches (Paternain et al.) which suffer from Hessian singularity. The key novelty is constructing an augmented cost whose critical points coincide with SOSP of the original problem, while serving as a Lyapunov function with exact decay rates. This is the first method combining deterministic guarantees, selectable convergence rates, and escape time that improves with smaller eigenvalue gaps. SIGNIFICANT.

Proof Techniques

The main proof strategy proceeds in three stages:

Regularity analysis: Show that the augmented cost $\Phi$ is $C^{1,1}$ despite potential eigenvalue crossings. The key insight is that $\phi(\lambda) = \frac{1}{2}[\max(0,-\lambda)]^2$ has Lipschitz derivative $\phi’(\lambda) = \min(\lambda,0)$, so the composition $P(x) = \phi(\lambda_{\min}(H(x)))$ inherits $C^{1,1}$ regularity through the chain rule.
Saddle point elimination: At a strict saddle point where $g(x^*) = 0$ and $\lambda_{\min}(x^*) < 0$, the gradient becomes:
\[\nabla \Phi(x^*) = -\beta^2 |\lambda_{\min}(x^*)| w_{\min}(x^*) \neq 0\]
provided $w_{\min}(x^*) \neq 0$ (generic condition). This eliminates saddle points as equilibria.
Spurious critical point analysis: New critical points of $\Phi$ may satisfy $g = \beta^2 \lvert \lambda_{\min} \rvert w_{\min}$ with $\lambda_{\min} < 0$. The Hessian of $\Phi$ at such points is:
\[\nabla^2 \Phi = H + \beta^2[w_{\min} w_{\min}^T - |\lambda_{\min}| \nabla_x w_{\min}]\]
The quadratic form along $u_{\min}$ gives:
\[u_{\min}^T \nabla^2 \Phi u_{\min} = \lambda_{\min} + \beta^2 C(\bar{x})\]
where $C(\bar{x})$ contains curvature terms. For $\beta^2 < \lvert \lambda_{\min} \rvert/\lvert C(\bar{x}) \rvert$, this remains negative, proving spurious points are saddles of $\Phi$.

Experiments & Validation

Two main experiments:

2D three-fold symmetric potential with 7 critical points, comparing gradient descent vs. CRGD with different convergence laws (exponential, finite-time, fixed-time, prescribed-time). CRGD escapes all saddles while GD gets trapped.
High-dimensional matrix factorization problem $J(x) = \frac{1}{4}\lvert xx^T - M^* \rvert_F^2$ with tunable eigenvalue gap $\delta$. Monte Carlo study over 2000 trials shows CRGD achieves 100% SOSP convergence across all gap sizes, while GD success rate degrades from 99% to 80% as $\delta$ decreases. Key result: CRGD escape time scales as $O(\delta)$ vs. GD’s $O(1/\delta)$ scaling, confirmed across gap range $\delta \in [0.001, 0.1]$. Computational cost is $O(n^3)$ per step due to eigendecomposition, with 17-19× overhead vs. gradient descent.

Limitations & Open Problems

Limitations:

Computational cost is $O(n^3)$ per iteration due to Hessian eigendecomposition - RESTRICTIVE (significantly limits scalability to very high dimensions)
Assumption 2.3 (eigenvalue simplicity and non-degeneracy) - NATURAL (holds generically in the space of symmetric matrices)
Assumption 3.2 (well-posedness condition) requires gradient and curvature correction don’t cancel - TECHNICAL (non-generic codimension-n condition, analogous to hyperbolicity assumptions in dynamical systems)
Curvature penalty weight $\beta$ must satisfy $0 < \beta < \beta^*$ where $\beta^*$ depends on landscape properties - TECHNICAL (threshold can be estimated but requires landscape knowledge)

Open problems:
Develop computationally efficient approximations to avoid full Hessian eigendecomposition while preserving convergence guarantees
Extend the framework to constrained optimization problems where feasible sets may contain saddle points of the Lagrangian

Spatial Causal Tensor Completion for Multiple Exposures and Outcomes: An Application to the Health Effects of PFAS Pollution

Authors: Xiaodan Zhou, Brian J Reich, Shu Yang · Institution: (high-relevance, institution unverified) · Category: stat.ME

A spatial causal tensor completion framework that jointly models multiple exposures and outcomes while adjusting for unmeasured spatial confounding via graph Laplacian eigenvectors, with application to PFAS health effects.

Tags: causal-inference spatial-statistics tensor-completion environmental-health matrix-completion unmeasured-confounding multiple-treatments graph-laplacian

arXiv · PDF

Problem Formulation

Motivation: Environmental health studies face the challenge of estimating causal effects of multiple chemical exposures on multiple health outcomes using spatial data, where unmeasured spatial confounders can bias estimates. Traditional methods analyze one exposure-outcome pair at a time, failing to leverage shared structure across the high-dimensional counterfactual space.
Mathematical setup: Consider $N$ spatial units with $K$-dimensional binary exposure vectors $A_i \in {0,1}^K$ defining $L = 2^K$ exposure combinations indexed by $\ell_i \in {1,\ldots,L}$. Each unit has potential outcomes ${Y_{i\ell o} : \ell = 1,\ldots,L, o = 1,\ldots,O}$ organized into tensor $\mathcal{Y} \in \mathbb{R}^{N \times L \times O}$.

The observed outcome tensor follows a low-rank Tucker decomposition:
\[\mathcal{Y} = \mathcal{Y}^* + \mathcal{E} = \mathcal{G} \times_1 U_1 \times_2 U_2 \times_3 U_3 + \mathcal{E}\]
where $\mathcal{G} \in \mathbb{R}^{r_1 \times r_2 \times r_3}$ is the core tensor and $U_k$ are factor matrices. The spatial factor decomposes as:
\[U_1 = Z \eta_Z + S \eta_S\]
where $Z$ are measured covariates and $S$ are unmeasured spatial components approximated by graph Laplacian eigenvectors $S \approx \Phi_{1:k} \beta$.

Assumptions:
1. SUTVA: No interference between units and consistent treatment versions
2. Latent Ignorability: $A_i \perp {Y_{i\ell o} : \ell, o} \lvert (Z_i, S_i)$
3. Latent Positivity: $p_{\min} \leq P(A_i = a(\ell) \lvert Z_i, S_i) \leq 1 - p_{\min}$
Toy example: With $N=4$ spatial units on a line, $K=2$ exposures (PFOA, PFOS), and $O=2$ outcomes (hypertension, diabetes), we have tensor $\mathcal{Y} \in \mathbb{R}^{4 \times 4 \times 2}$. Only 4 entries are observed (one per unit), leaving 28 counterfactual outcomes to impute. Low-rank structure assumes outcomes share latent factors across space, exposures, and diseases.
Formal objective: Estimate average treatment effects:
\[\theta^*_{\ell o} = \mathbb{E}\left[\frac{1}{N} \sum_{i=1}^N (Y_{i\ell o} - Y_{iLo})\right]\]

Method

The method proceeds in three steps:

Unweighted spatial tensor completion: Minimize unweighted loss to learn initial spatial component:
\[\arg\min_{\mathcal{G},U_2,U_3,\eta_Z,\beta} \|\mathcal{A} \circ (\mathcal{Y} - \mathcal{G} \times_1 (Z\eta_Z + \Phi_{1:k}\beta) \times_2 U_2 \times_3 U_3)\|_F^2\]
Propensity score estimation: Fit multinomial logistic regression using initial spatial component:
\[\hat{\pi}(\ell|Z_i, \hat{S}^{(0)}_i) = P(A_i = a(\ell) | Z_i, \hat{S}^{(0)}_i)\]
Construct inverse probability weights:
\[\hat{W}^{(0)}_{i\ell o} = \frac{\mathbf{1}(A_i = a(\ell))}{\hat{\pi}(\ell|Z_i, \hat{S}^{(0)}_i)}\]
Weighted tensor completion: Re-estimate with exposure-corrected weights:
\[\arg\min_{\mathcal{G},U_2,U_3,\eta_Z,\beta} \|\sqrt{\hat{W}^{(0)}} \circ \mathcal{A} \circ (\mathcal{Y} - \mathcal{G} \times_1 (Z\eta_Z + \Phi_{1:k}\beta) \times_2 U_2 \times_3 U_3)\|_F^2\]
The Spatial Projected Gradient Descent (S-PGD) algorithm adaptively selects Laplacian eigenvectors using BIC criterion at each iteration.

Toy example application: For the 4×4×2 tensor, S-PGD would select among the 4 eigenvectors of the path graph Laplacian, likely choosing the first 1-2 smooth eigenvectors. The algorithm imputes the 28 missing entries by projecting onto the learned low-rank + spatial structure.

Novelty & Lineage

This work extends matrix/tensor completion to spatial causal inference by combining three existing ideas:

Tucker tensor decomposition for multi-way data (Kolda & Bader 2009)
spectral adjustment for spatial confounding using graph Laplacian eigenvectors (Reich et al. 2021), and
causal tensor completion frameworks (Abadie et al. 2024, Mao et al. 2024).

The key novelty is the integration of spatial confounding adjustment directly into the tensor completion objective, rather than treating spatial adjustment and tensor completion as separate steps. Recent parallel work includes Prim et al. (2025) who use tensors as linear model components, and Wu & Franks (2025) who focus on panel data with temporal variation. This paper uniquely addresses cross-sectional spatial data with multiple binary exposures.

Prior causal tensor work (Poulos et al. 2021, Auerbach et al. 2022) treated exposures independently. Recent advances (Gao et al. 2024, 2025) integrate exposures but don’t handle spatial confounding.

SIGNIFICANT

Proof Techniques

The main theoretical result establishes Frobenius error bounds for the tensor completion estimator. The proof strategy involves five key components:

Tensor concentration: Apply matrix/tensor concentration inequalities to bound deviations of the empirically observed entries from their expectations under the sampling model.
Spectral approximation error: Bound the error from approximating smooth spatial functions by truncated Laplacian eigenbasis. For signals with polynomial eigenvalue decay $\lvert c_j^{(\ell)} \rvert \leq C \cdot j^{-\alpha/\beta}$, this gives:
\[\|S - \Phi_{1:k}\beta\|_F \leq C k^{-\tau_2}\]
Propensity score approximation: Control the error propagation from using estimated propensity scores in the inverse probability weights:
\[\|\hat{W} - W^*\|_F \leq f(\|\hat{\pi} - \pi^*\|, p_{\min})\]
Weighted tensor completion: Extend standard tensor completion results to the weighted setting, where sampling probabilities vary across entries according to the exposure mechanism.
Error decomposition: The final bound decomposes as:
\[\frac{\|\hat{\mathcal{Y}} - \mathcal{Y}^*\|_F^2}{NO} \leq \max(A_1, A_2, A_3, A_4, A_5)\]
where $A_1$ represents statistical error, $A_2$ captures spectral approximation error, $A_3$ reflects propensity score estimation error, $A_4$ accounts for spatial smoothness assumptions, and $A_5$ represents tensor incoherence conditions.

The key technical insight is showing that the weighted tensor completion problem maintains the same statistical guarantees as the unweighted case when the weights are derived from consistent propensity score estimates.

Experiments & Validation

Datasets: EPA PFAS monitoring data (UCMR5) with 5,495 public water systems across 47 states, linked to CDC PLACES health outcomes (13 chronic diseases) and Census demographic data. Exposure: binary indicators for PFOA and PFOS detection above minimum reportable limits.

Baselines:

Spatial-Tensor (proposed)
Tensor completion without spatial adjustment
Spatial-PS (separate regressions with spatial propensity scores)
Standard regression
Augmented regression.

Key results:
- Simulation study on 20×20 spatial grids shows proposed method achieves lowest MSE across varying tensor ranks, spatial confounding strength, and exposure overlap
- PFAS application: Regression methods yield implausible odds ratios >3.0; tensor methods give conservative estimates <1.05
- Most PFAS-health associations attenuate substantially after spatial adjustment
- PFOS remains significantly associated with hypertension (OR=1.021, 95% CI: 1.008-1.033), tooth loss, asthma, and obesity
Cross-validation: 5-fold CV for Tucker rank selection over grids $r_1 \in {1,\ldots,10}$, $r_2 \in {1,\ldots,3}$, $r_3 \in {1,\ldots,5}$.

Limitations & Open Problems

Limitations:

NATURAL: No-interference assumption may be violated by population mobility or environmental contamination spreading between regions, though this is standard in environmental studies.
TECHNICAL: Polynomial decay rate assumption for spatial eigenfunctions ($\lvert c_j^{(\ell)} \rvert \leq C \cdot j^{-\alpha/\beta}$) is needed for spectral approximation bounds but could potentially be relaxed.
RESTRICTIVE: Binary exposure indicators aggregate heterogeneous concentration levels above detection limits, so effects represent averages over observed exposure distributions rather than effects of specific doses.
RESTRICTIVE: Temporal misalignment between exposure measurements (2023-2024) and health outcomes (2017-2018) threatens the no-unmeasured-confounding assumption, though PFAS persistence partially mitigates this.
TECHNICAL: Limited positivity with 10-48% of exposed units having propensity scores below 0.05 may affect finite-sample performance.
RESTRICTIVE: Graph Laplacian approximation only captures spatially structured confounders, missing non-spatial confounders or fine-scale spatial variation below the adjacency resolution.

Open problems:
Extend to continuous exposure levels while maintaining computational tractability of the tensor framework
Develop methods for spatiotemporal data that can exploit both spatial and temporal variation for identification

On Partition Functions for Time-Inhomogeneous Branching Random Walks

Authors: Qianrun Wu · Institution: (high-relevance, institution unverified) · Category: math.PR

Proves that time-inhomogeneous branching random walks with decreasing variance exhibit universal partition function behavior, maintaining the same critical parameter as homogeneous models despite losing martingale structure.

Tags: branching-random-walks phase-transitions gaussian-multiplicative-chaos directed-polymers martingale-theory random-matrix-theory statistical-mechanics

arXiv · PDF

Problem Formulation

Motivation: Branching random walks (BRWs) model population growth with spatial movement, crucial for understanding phase transitions in statistical physics and directed polymers. Time-inhomogeneous BRWs, where the variance of increments changes over time, naturally arise in two-dimensional directed polymers but lack the martingale structure that classical theory relies on.
Mathematical setup: Consider a supercritical Galton-Watson tree $T$ rooted at $o$ with offspring distribution satisfying $E[D_1] = d > 1$. For each vertex $x \in T$, assign independent weights $w(x) \sim N(0,1)$. For time-homogeneous BRW, the normalized partition function at generation $n$ is:
\[W_n^\beta = \frac{1}{d^n} \sum_{|x|=n} e^{\beta H_n(x) - \frac{1}{2}\beta^2 n}\]
where $H_n(x) = \sum_{i=1}^n w(x_i)$ is the Hamiltonian along branch $x = (x_1, \ldots, x_n)$.

For time-inhomogeneous BRW with decreasing variance function $f:[0,1] \to [0,1]$ satisfying:
$f$ is continuous and non-increasing
$f(0) = 1$ and $f(1) = 0$

The modified Hamiltonian is $\bar{H}_n(x) = \sum_{i=1}^n \sqrt{f(i/n)} w(x_i)$, giving partition function:
\[\bar{W}_n^\beta = \frac{1}{d^n} \sum_{|x|=n} e^{\beta \bar{H}_n(x) - \frac{\beta^2}{2}\sum_{i=1}^n f(i/n)}\]
Toy example: When $f(t) = 1-t$ and $n=2$, we have $\bar{H}_2(x) = \sqrt{1} w(x_1) + \sqrt{1/2} w(x_2) = w(x_1) + \frac{w(x_2)}{\sqrt{2}}$. The variance decreases from 1 to 1/2 across generations.
Formal objective: Determine the critical parameter $\bar{\beta}_c$ and establish universality:
\[\lim_{n \to \infty} E[|W_n^\beta - \bar{W}_n^\beta|] = 0 \text{ for } \beta < \beta_c\]

Method

The method adapts Berestycki’s GMC framework to discrete BRWs without relying on martingale properties.

Decompose partition functions into “good environment” and complement parts by defining for fixed $\alpha \in (\beta, 2\beta)$:
\[G_n^\alpha(x) = \{\forall T \in \{n_0, \ldots, n\}, H_T(x) < \alpha T\}\]
Split $W_n^\beta = J_n + K_n$ where $J_n$ restricts to good environments and $K_n$ is the complement.
Apply Cameron-Martin-Girsanov measure tilting: define tilted measure $\hat{P}$ by
\[\frac{d\hat{P}}{dP} = \frac{e^{\beta \bar{H}_n(x)}}{E[e^{\beta \bar{H}_n(x)}]}\]
Under $\hat{P}$, show $\bar{H}_T(x) \sim N(\beta \sum_{i=1}^T f(i/n), \sum_{i=1}^T f(i/n))$ with preserved covariance structure.
Establish L² convergence by showing both $J_n$ and $\bar{J}_n$ (time-inhomogeneous analog) converge to same limit using covariance computation:
\[E[(W_n^\beta)^2] = \sum_{h=0}^n d^{-h} e^{h(\beta^2 - \log d)}\]
Applied to toy example: With $f(t) = 1-t$ and $\alpha = 1.5\beta$, the good environment requires $H_1(x) < 1.5\beta$ and $H_2(x) < 3\beta$. The tilted measure shifts the mean of $\bar{H}_1(x)$ to $\beta$ and $\bar{H}_2(x)$ to $\beta(1 + 1/2) = 1.5\beta$.

Novelty & Lineage

This work extends classical BRW theory (Biggins, Hu-Shi) to time-inhomogeneous settings by adapting Berestycki’s martingale-free GMC approach. Key novelty: proving the critical parameter remains $\beta_c = \sqrt{2\log d}$ despite time-inhomogeneity, and establishing L¹ universality between homogeneous and inhomogeneous partition functions.

Prior work: Biggins established phase transition for time-homogeneous BRWs using spine decomposition. Berestycki (2017) developed martingale-free GMC convergence via thick points analysis. This paper bridges these by showing early generations dominate total mass universally.

The connection to two-dimensional directed polymers (Caravenna-Sun-Zygouras, Cosco-Donadini) motivates the decreasing variance model but previous work didn’t establish partition function universality.

SIGNIFICANT

Proof Techniques

L² critical parameter analysis: For time-inhomogeneous BRWs, compute
\[E[(\bar{W}_n^\beta)^2] = \sum_{h=0}^n d^{-h} e^{\beta^2 \sum_{i=1}^h f(i/n)}\]
```
Show this converges iff $\beta < \sqrt{\log d}$ by comparing with homogeneous case.
```
Good environment decomposition: For $\alpha \in (\beta, 2\beta)$, bound bad environment contribution using Gaussian tail estimate:
\[P\{(G_n^\alpha(x))^c\} \leq \sum_{T \geq n_0} e^{-\alpha^2 \sum_{i=1}^T f(i/n)} \to 0\]
Cameron-Martin-Girsanov tilting: Key insight is measure change preserves Gaussian structure while shifting means. Under tilted measure $\tilde{P}_h$ for coincidence depth $h$:
\[E[(J_n)^2] = \sum_{h=0}^n d^{-h} e^{\beta^2 \sum_{i=1}^h f(i/n)} \tilde{P}_h\{G_n^\alpha(x) \cap \tilde{G}_n^\alpha(x)\}\]
Uniform convergence of good events: Prove for fixed $h < n_1$:
\[\lim_{n \to \infty} \tilde{P}_h\{G_n^\alpha(x) \cap G_n^\alpha(y)\} = g_\alpha(h)\]
```
uniformly, using continuity of $f$ at 0.
```
Fractional moments at criticality: Apply Paley-Zygmund inequality with decomposition at early generation $\tau_n = \lceil 2\epsilon\lvert \log n \rvert/\log d \rceil$:
\[P\{\bar{W}_n^{\beta_c} > e^{-\epsilon|\log n|}\} \geq e^{-\epsilon|\log n|}\]

Experiments & Validation

Purely theoretical. Empirical validation would involve simulating BRWs on finite Galton-Watson trees with decreasing variance functions and comparing partition function behaviors across different $f$ functions. Key metrics would be convergence rates of $E[\lvert W_n^\beta - \bar{W}_n^\beta \rvert]$ and verification of critical parameter invariance.

Limitations & Open Problems

Decreasing variance assumption - TECHNICAL: proof extends to any $f$ continuous at 0 with $f(0)=1$, but monotonicity simplifies analysis
Gaussian increments - NATURAL: standard in BRW literature, though extensions to other distributions are possible
Supercritical regime restriction - NATURAL: subcritical case trivial as all partition functions vanish
Discrete time setting - TECHNICAL: continuous-time branching Brownian motion analog should follow similar approach

Open problems:
Extend universality to time-inhomogeneous branching Brownian motion and establish derivative martingale scaling at criticality
Rigorously connect to two-dimensional directed polymer partition functions and prove convergence to GMC measures beyond log-correlation

Design of Transit Networks: Global Optimization of Continuous Approximation Models via Geometric Programming

Authors: Haoyang Mao, Weihua Gu, Wenbo Fan, Zhicheng Jin et al. (5 authors) · Institution: (high-relevance, institution unverified) · Category: math.OC

First application of geometric programming to transit network design problems, providing globally optimal solutions where previous methods were limited to local optima.

Tags: transportation optimization geometric-programming urban-planning public-transit continuous-approximation network-design

arXiv · PDF

Problem Formulation

Motivation: Transit network design is crucial for urban transportation efficiency but faces computational challenges when using discrete optimization models that suffer from NP-hardness and curse of dimensionality. Continuous approximation (CA) models offer tractability but typically result in nonconvex optimization problems where existing gradient-based methods cannot guarantee global optimality.

Mathematical setup: Consider a square city $R$ with side length $\lvert R \rvert$. The transit network has line density $\delta_i(x,y)$ and service headway $h_i(x,y)$ for direction $i \in {E,W,N,S}$ at location $(x,y) \in [0,\lvert R \rvert]^2$. Vehicle flow is defined as:

\[q_i(x,y) = \frac{\delta_i(x,y)}{h_i(x,y)}\]

The total system cost combines agency cost $Z_A$ and passenger cost $Z_P$:

\[Z = \frac{1}{\mu}Z_A + Z_P\]

where $\mu$ is passengers’ value of time. Agency cost includes line infrastructure, stops, vehicle distance and time:

\[Z_A = \pi_l N_l + \pi_s N_s + \pi_k N_k + \pi_h N_h\]

Passenger cost includes access/egress time, waiting time, in-vehicle time, and transfer penalty.

Assumptions:

Passengers walk to nearest stops and arrive randomly
Bidirectional symmetry: $\delta_E = \delta_W$, $\delta_N = \delta_S$, $h_E = h_W$, $h_N = h_S$
Stops located only at line intersections
Capacity constraint: $\lambda_i^{fl}(x,y)/q_i(x,y) \leq C$ for vehicle capacity $C$

Toy example: For a homogeneous 2×2 km network with constant demand density $\lambda = 100$ passengers/km²/hr, the problem reduces to optimizing four variables: $\delta_{EW}, \delta_{NS}, h_{EW}, h_{NS}$, where the objective becomes a rational function with terms like $\delta_{EW} \delta_{NS}$ (stops) and $\frac{\delta_{EW}}{h_{EW}}$ (vehicle flow).

Formal objective: Minimize the total system cost:
\[\min_{\delta_i, h_i} Z = \frac{1}{\mu}Z_A + Z_P \quad \text{subject to capacity and positivity constraints}\]

Method

Geometric Programming Transformation: The method converts the nonconvex CA optimization problem into a geometric program (GP) that can be solved to global optimality.

Monomial/Posynomial Identification: Express objective terms as posynomials (sums of monomials). A monomial has the form:
\[g(r) = d r_1^{a_1} r_2^{a_2} \cdots r_L^{a_L}\]
Standard GP Form: Formulate as:
\[\min_r f_0(r) \quad \text{s.t.} \quad f_u(r) \leq 1, \quad g_w(r) = 1, \quad r > 0\]
Logarithmic Transformation: Apply change of variables $s_l = \ln r_l$ to obtain convex form:
\[\min_s \ln f_0(s) = \ln \sum_{k=1}^{K_0} e^{a_{0k}^T s + b_{0k}}\]
Solver Application: Use commercial GP solver (MOSEK) to find global optimum in polynomial time.

Application to toy example: For the 2×2 km homogeneous case, terms like $\delta_{EW} \delta_{NS}$ (number of stops) and $\frac{1}{\delta_{EW}}$ (access time) are monomials. The vehicle time term $\frac{\delta_{EW}}{h_{EW}} \cdot (\frac{1}{v} + \tau \delta_{NS})$ becomes a posynomial after expansion. The capacity constraint $\frac{\lambda_{EW}^{fl}}{\delta_{EW}/h_{EW}} \leq C$ transforms to the posynomial inequality $\frac{\lambda_{EW}^{fl} h_{EW}}{C \delta_{EW}} \leq 1$.

Novelty & Lineage

Prior Work: Continuous approximation for transit networks was established by Holroyd (1967) and Chien & Schonfeld (1997), with extensions by Ouyang et al. (2014) and Chen et al. (2015). Existing solution methods rely on coordinate descent, gradient-based algorithms, or general nonlinear solvers like fmincon, which cannot guarantee global optimality due to the nonconvex nature of the problem.

Novelty: This is the first application of geometric programming to continuous approximation transit network design. The key insight is recognizing that CA objective functions naturally have posynomial structure, enabling transformation to globally solvable convex problems. Previous approaches using coordinate descent or general NLP solvers were limited to local optima.

Assessment: SIGNIFICANT - Provides the first globally optimal solution method for an important class of transportation problems, with demonstrated 1-4% cost improvements over existing methods.

Proof Techniques

No Formal Proofs: This is primarily an applied paper demonstrating that CA transit network problems have posynomial structure compatible with geometric programming.

Key Technical Insights:

Posynomial Structure Verification: The CA objective function terms naturally form monomials and posynomials: - Line infrastructure: $\int \delta_i(x,y) dx dy$ (monomial) - Number of stops: $\int \delta_i(x,y) \delta_{\bar{i}}(x,y) dx dy$ (monomial product) - Access time: $\int \frac{1}{\delta_i(x,y)} dx dy$ (monomial with negative exponent)
Capacity Constraint Compatibility: The constraint $\frac{\lambda_i^{fl}(x,y)}{q_i(x,y)} \leq C$ can be rewritten as:
\[\frac{\lambda_i^{fl}(x,y) h_i(x,y)}{C \delta_i(x,y)} \leq 1\]
This is a posynomial inequality constraint, satisfying GP requirements.
Convex Transformation: The logarithmic change of variables $s_l = \ln r_l$ transforms the problem into:
\[\min_s \ln \sum_{k=1}^K e^{a_k^T s + b_k}\]
The log-sum-exp function is convex, guaranteeing global optimality via interior-point methods.

Experiments & Validation

Test Setup: Square 10×10 km city discretized into 0.5×0.5 km cells. Comprehensive evaluation across 72 test cases combining:

6 demand distributions (monocentric, commute, 4 chessboard patterns)
4 demand levels (5K, 10K, 50K, 100K passengers)
3 value-of-time parameters ($5, $20, $25 per hour)

Baselines: Coordinate descent method and fmincon nonlinear solver, both initialized from 10 different starting points.

Key Results:

GP consistently outperforms coordinate descent by 1-4% across all cases
GP matches fmincon in homogeneous networks but outperforms in heterogeneous cases, especially high-demand scenarios
fmincon shows >10% solution variability across initializations in heterogeneous networks
Computation time: GP achieves global optimum efficiently using MOSEK solver

Hardware: 2.5 GHz Intel i5-12500H CPU, 16GB memory.

Limitations & Open Problems

Limitations:

RESTRICTIVE: GP requires strict posynomial structure - objective and constraints must be expressible as sums of monomials with positive coefficients, limiting extensions to more complex transit models.
TECHNICAL: Discretization into uniform cells assumes spatial homogeneity within cells, potentially affecting solution accuracy in highly heterogeneous areas.
NATURAL: Assumes passengers walk to nearest stops and arrive randomly, which are standard assumptions in transit modeling.
TECHNICAL: Stops restricted to line intersections only - intermediate stops between intersections not considered.
NATURAL: Bidirectional symmetry assumption reflects real-world transit operations but may not capture asymmetric demand patterns.

Open Problems:
Extend GP framework to handle signomial (difference of posynomial) constraints for more complex transit models with capacity-constrained transfers
Develop sequential geometric programming (SGP) or mixed-integer GP approaches to incorporate discrete decision variables like stop locations and line routing choices

Delightful Policy Gradient

Authors: Ian Osband · Institution: (high-relevance, institution unverified) · Category: cs.LG

Introduces Delightful Policy Gradient which gates each gradient term by sigmoid of advantage×surprisal, provably reducing variance within contexts and improving expected direction across contexts.

Tags: policy-gradients reinforcement-learning variance-reduction gradient-estimation bandit-algorithms transformer-training optimization-theory

arXiv · PDF

Problem Formulation

Motivation: Standard policy gradients weight each sampled action by advantage alone, regardless of how likely that action was under the current policy. This creates pathologies where rare negative-advantage actions can distort update directions within a single context, and across contexts, gradient budget is over-allocated to contexts the policy already handles well.
Mathematical setup: Consider episodic reinforcement learning where at timestep $t$, the agent observes history $H_t$, samples action $A_t \sim \pi_\theta(\cdot \lvert H_t)$, and receives reward $R_t$. The standard policy gradient forms updates:
\[g_t = U_t \nabla_\theta \log \pi_\theta(A_t | H_t)\]
where $U_t$ denotes advantage estimate. Define action surprisal as:
\[\ell_t = -\log \pi_\theta(A_t | H_t)\]
Define delight as the product:
\[\chi_t = U_t \ell_t\]
The sigmoid gate is:
\[w_t = \sigma(\chi_t/\eta)\]
where $\sigma(x) = \frac{1}{1+e^{-x}}$ and $\eta > 0$ is temperature.

Assumptions:
1. Access to advantage estimates $U_t$
2. Policy $\pi_\theta$ is differentiable in parameters $\theta$
3. Actions have well-defined probabilities under current policy
Toy example: Consider a 3-armed bandit with correct action having probability $\pi(y^*) = 0.7$ and incorrect actions having $\pi(a) = 0.15$ each. If the correct action has advantage $U = 0.3$ and an incorrect action has advantage $U = -0.2$, then delight values are $\chi_{correct} = 0.3 \times (-\log 0.7) \approx 0.11$ and $\chi_{incorrect} = -0.2 \times (-\log 0.15) \approx 0.38$. The gates become $w_{correct} \approx 0.53$ and $w_{incorrect} \approx 0.59$.
Formal objective: The Delightful Policy Gradient update is:
\[\Delta\theta \propto \sum_{t \in B} w_t g_t = \sum_{t \in B} \sigma(\chi_t/\eta) U_t \nabla_\theta \log \pi_\theta(A_t | H_t)\]

Method

The Delightful Policy Gradient (DG) method modifies standard policy gradients by gating each gradient term with a sigmoid of delight:

Compute action surprisal: $\ell_t = -\log \pi_\theta(A_t \lvert H_t)$
Compute delight: $\chi_t = U_t \ell_t$ (product of advantage and surprisal)
Compute sigmoid gate: $w_t = \sigma(\chi_t/\eta)$ with temperature $\eta$
Apply gated update: $\Delta\theta \propto \sum_{t} w_t U_t \nabla_\theta \log \pi_\theta(A_t \lvert H_t)$

The key insight is asymmetric treatment: when a rare action has positive advantage (breakthrough), $w_t \approx 1$; when a rare action has negative advantage (blunder), $w_t \approx 0$.

Applied to toy example: For the 3-armed bandit with correct action probability 0.7 and advantage 0.3, we get $\chi = 0.3 \times 0.36 = 0.11$, so $w = \sigma(0.11) \approx 0.53$. For incorrect action with probability 0.15 and advantage -0.2, we get $\chi = -0.2 \times 1.90 = -0.38$, so $w = \sigma(-0.38) \approx 0.41$. The DG update amplifies the breakthrough while dampening the blunder compared to standard policy gradients which would weight both by advantage magnitude alone.

Novelty & Lineage

This work introduces a fundamentally new weighting scheme for policy gradients that incorporates action surprisal alongside advantage. Prior work includes:

Standard policy gradients (Williams, 1992; Sutton et al., 1999) weight by advantage alone
Trust-region methods (TRPO, PPO) clip importance ratios but remain probability-blind
Advantage-weighted methods (AWR, PMPO) reweight by advantage functions but ignore surprisal
Variance reduction methods (baselines, control variates) reduce noise without changing expected direction

The key novelty is using the product of advantage and surprisal to create asymmetric treatment of rare successes vs failures. This changes the expected gradient direction itself, not just its variance. The sigmoid gating mechanism is derived from entropy-regularized optimization and provides theoretical guarantees for both within-context variance reduction and cross-context directional improvement.

SIGNIFICANT

Proof Techniques

The main theoretical analysis proceeds through two complementary mechanisms:

Within-context variance reduction (Proposition 1): For symmetric K-armed bandits, uses the score identity:
\[\sum_a \pi(a) \phi(a) = 0\]
to show that expected gradients remain collinear. The key inequality is that for incorrect actions with gate $w_- < 1/2$:
\[\text{Var}_\perp(g_{DG}) = w_-^2 \cdot \text{Var}_\perp(g_{PG})\]
The cosine gap ratio becomes:
\[\frac{1 - E[\cos(\bar{g}_{DG}, g^*_{PG})]}{1 - E[\cos(\bar{g}_{PG}, g^*_{PG})]} \approx \frac{w_-^2}{s^2} < 1\]
Cross-context directional improvement (Proposition 2): For multiple independent contexts, the key insight is that PG and cross-entropy oracles have different weightings:
\[g^*_{PG} = \sum_n p_n v_n, \quad g^*_{CE} = \sum_n v_n\]
The critical monotonicity lemma shows that for orthogonal vectors $v_1, v_2$:
\[\cos(rv_1 + v_2, v_1 + v_2)\]
is uniquely maximized at $r = 1$. Since DG compresses the ratio $p_1/p_2$ toward 1 via the sigmoid gate, it achieves higher cosine with the cross-entropy oracle.
Path argument for general N: Uses antisymmetrization to show:
\[\Phi'(t) \propto \sum_{n<m} (\lambda_n - \lambda_m) a_n a_m c_n c_m (c_m - c_n) > 0\]
where $\lambda_n = \log \sigma((-\log p_n)/\eta)$ and the positivity follows from ordering preservation.

Experiments & Validation

Datasets and Tasks:

MNIST contextual bandits (28×28 images, 10 classes, 2-layer ReLU network)
Token Reversal with Transformers (sequences of length H from vocabulary size M)
DeepMind Control Suite (28 continuous control environments)
Tabular K-armed bandits for analytical validation

Baselines: REINFORCE, PPO, PMPO (advantage-weighted), entropy regularization variants

Key Numbers:

MNIST: DG closes ~50% gap to supervised cross-entropy oracle
Token Reversal: DG shows better scaling exponent on log-log plots as complexity increases
Control Suite: DG matches/exceeds best baseline on majority of 28 environments, never worst performer
Tabular: DG retains only 4% of PG’s cosine gap at ε=0.5, K=100; only 1% at ε=0.1

Implementation Details:

Temperature η=1 used throughout (robust across experiments)
Continuous actions use clipped log-densities [-10, 10]
Adam optimizer, various batch sizes (1-1000) and learning rates tested
30-100 seeds per experiment for statistical significance

Limitations & Open Problems

Limitations:

TECHNICAL: Analysis assumes normalized gradient steps $z \leftarrow z + \alpha g/\lvert g \rvert$ - needed for theoretical tractability but may not reflect all optimization scenarios in practice.
TECHNICAL: Tabular analysis requires orthogonality of gradient components across contexts - this is violated in neural networks but enables clean theoretical characterization.
TECHNICAL: Condition η > 1/2 required for directional improvement guarantees - ensures monotonicity of weight function but restricts temperature choices.
NATURAL: Continuous action spaces require log-density clipping or whitening to handle scaling - standard preprocessing for density-based methods.
RESTRICTIVE: No formal convergence guarantees provided - only local improvement properties shown, unlike standard policy gradient convergence theory.

Open Problems:
Establish formal convergence rates for DG under standard policy gradient assumptions (e.g., compatible function approximation).
Extend theoretical analysis beyond tabular setting to neural networks with shared representations across contexts.

Preconditioned One-Step Generative Modeling for Bayesian Inverse Problems in Function Spaces

Authors: Zilan Cheng, Li-Lian Wang, Zhongjian Wang · Institution: (high-relevance, institution unverified) · Category: stat.ML

Introduces a one-step generative model for Bayesian PDE inverse problems that learns conditional transports from prior-aligned reference measures, achieving 10^4x speedup over MCMC while maintaining posterior accuracy.

Tags: bayesian-inverse-problems pde-inverse-problems generative-modeling neural-operators function-space-methods uncertainty-quantification mean-flows one-step-sampling

arXiv · PDF

Problem Formulation

Motivation: Bayesian inverse problems arising from PDEs require sampling from high-dimensional posterior distributions to quantify uncertainty. Classical MCMC methods are computationally expensive due to repeated forward PDE solves, motivating the need for fast amortized posterior samplers.

Mathematical setup: Let $H$ be a separable Hilbert space and $x \in H$ be an unknown function (e.g., PDE coefficient). Consider the forward operator $G : H \to Y$ and observation operator $O : Y \to \mathbb{R}^m$. The observation model is:

\[y_{\text{obs}} = O(G(x)) + \eta\]

where $\eta \sim \mathcal{N}(0, \sigma_\eta^2 I_m)$ is additive Gaussian noise. The Bayesian prior is an anisotropic Gaussian $\gamma = \mathcal{N}(0, C)$ with $C = (-\Delta + \kappa^2 I)^{-d}$ for $d > 1$ (trace-class). The posterior measure is:

\[\pi(dx | y_{\text{obs}}) \propto \exp(-\Phi(x; y_{\text{obs}})) \gamma(dx)\]

where $\Phi(x; y_{\text{obs}}) = \frac{1}{2\sigma_\eta^2} \lvert y_{\text{obs}} - O(G(x)) \rvert_2^2$.

Assumptions:

The forward operator $G$ is well-defined from $H$ to $Y$
The posterior satisfies Gaussian-tail regularity in the $C$-geometry
The prior covariance $C$ is trace-class on $L^2(\Omega)$

Toy example: Consider the identity problem where $G = I$ and $O$ selects 49 random points in $\Omega = [0,1]^2$. With prior $x \sim \mathcal{N}(0, (-\Delta + 9I)^{-2})$ and noise $\eta \sim \mathcal{N}(0, 0.1^2 I)$, the posterior has closed-form expression via conditional Gaussian formula.

Formal objective: Learn a conditional transport map $T_\theta(\cdot; y_{\text{obs}}) : H \to H$ such that:
\[(T_\theta(\cdot; y_{\text{obs}}))_\# \rho \approx \pi(\cdot | y_{\text{obs}})\]
where $\rho$ is the reference measure and the approximation is in distribution.

Method

The method extends Mean Flows to learn a conditional one-step transport for Bayesian inverse problems.

Algorithm:

Sample training pairs $(x, y_{\text{obs}})$ from the joint distribution induced by the Bayesian model
For each pair, sample independent reference draw $\xi \sim \rho$ where $\rho = \gamma$ (prior-aligned)
Sample time parameters $0 \leq r \leq t \leq 1$ and form interpolation state:
\[z_t = (1-t)x + t\xi\]
Define the instantaneous velocity $v = \xi - x$ and average velocity target:
\[w_{r,t} = v - (t-r) \frac{d}{dt} u(z_t, r, t)\]
Train neural operator $w_\theta(z, r, t; y_{\text{obs}})$ to predict $w_{r,t}$ using loss:
\[\mathcal{L}_{EO}(\theta) = \mathbb{E}[\|w_\theta(z_t, r, t; \tilde{y}_{\text{obs}}) - \text{sg}(w^{\text{tgt}})\|^2]\]
At inference, generate posterior sample via one-step transport:
\[T_\theta(\xi; y_{\text{obs}}) = \xi - w_\theta(\xi, 0, 1; y_{\text{obs}})\]
Toy example application: For the identity problem with $49$ observation points, the method learns to map from prior-aligned reference draws $\xi \sim \mathcal{N}(0, (-\Delta + 9I)^{-2})$ to approximate posterior samples. The neural operator takes current state $\xi$, time parameters $(0,1)$, and normalized observations to output the correction term.

Novelty & Lineage

This work extends Mean Flows (Geng et al., 2025) from unconditional to conditional generation for Bayesian inverse problems. Key novel contributions include:

New elements:

First application of one-step generative transport to PDE-based Bayesian inverse problems
Theoretical analysis showing necessity of prior-aligned reference measures in function space
Conditional neural operator architecture for amortized posterior sampling
Encode-only training objective that respects reference geometry

Prior work: Builds on Mean Flows for one-step transport, diffusion models for inverse problems (Song et al., 2022; Cardoso et al., 2024), and neural operators (Li et al., 2021). Differs from multistep approaches like score-based diffusion in function space (Lim et al., 2025) by avoiding iterative sampling.

Assessment: SIGNIFICANT - introduces principled one-step sampling for PDE inverse problems with theoretical guarantees, though the core Mean Flows framework is incremental extension.

Proof Techniques

The main theoretical result establishes Lipschitz regularity for trace-class references while proving obstruction for white-noise references.

Trace-class regularity (Theorem 3.2):

Assume Gaussian-tail condition: $\sup_x \exp(\alpha \lvert x \rvert_H^2) \pi(x\lvert y_{\text{obs}}) < \infty$ for some $\alpha > 0$
Key inequality for average velocity Lipschitz bound:
\[\|w(z_1; 0,1) - w(z_2; 0,1)\|_H \leq L_0 \|z_1 - z_2\|_H\]
Transport map Lipschitz estimate:
\[\|T(ξ_1; y_{\text{obs}}) - T(ξ_2; y_{\text{obs}})\|_H \leq (1 + L_0/2)\|ξ_1 - ξ_2\|_H\]
where $L_0$ depends only on Gaussian-tail parameters, not discretization.

White-noise obstruction (Proposition 3.3):
Consider white noise $\rho_W$ and trace-class Gaussian $\rho_A$ on Sobolev space $H^{-s}$
Mutual singularity via Cameron-Martin theorem:
\[\rho_W \perp \rho_A \text{ on } H^{-s}\]
Since posterior $\pi \ll \rho_A$ and $\pi(H) = 1$, conclude $\rho_W \perp \pi$
No regular deterministic map $H \to H$ can transport between mutually singular measures

Technical insight: The proof exploits trace-class geometry where both prior and posterior live, while white noise has infinite Cameron-Martin norm and cannot be transported via Lipschitz maps.

Experiments & Validation

Datasets: Four PDE inverse problems - Darcy flow (permeability inference), Advection equation, Reaction-Diffusion (Allen-Cahn), Navier-Stokes (vorticity form). All use $64 \times 64$ or $65 \times 65$ grids with partial noisy observations.

Baselines: MCMC (emcee with 30k steps), multistep diffusion model (320 steps), alternative neural operator backbones (PODNO, U-Net).

Key results:

Accuracy: Relative L2 errors $\epsilon_\mu < 10\%$ and $\epsilon_\sigma < 10\%$ for posterior mean and std deviation across all PDE problems
Speed: Generates $64 \times 64$ posterior sample in $\sim 10^{-3}$s vs $\sim 10^1$s for MCMC
Reference measure validation: Prior-aligned reference $\rho_A = \gamma$ maintains correct spectral decay across grid refinements, while white-noise reference $\rho_W$ exhibits high-frequency artifacts
Energy spectra: Figure 1 shows $\rho_A$ preserves exact posterior spectrum, $\rho_W$ deviates at high frequencies, especially on finer grids

Architecture ablations: FNO backbone outperforms PODNO and U-Net. Encode-only training more stable than Encode-Decode for white references.

Limitations & Open Problems

Limitations:

Performance depends on neural operator expressivity - may struggle with highly anisotropic or multiscale posteriors (TECHNICAL - improvable with better architectures)
Lacks principled diagnostics for learned velocity approximation error (TECHNICAL - could develop posterior predictive checks)
Training requires offline PDE simulations, incurring computational/storage costs (NATURAL - standard for simulation-based methods)
Current experiments limited to KL truncation $N_{KL} = 64$ for computational tractability (TECHNICAL - MCMC becomes prohibitive at higher dimensions)
Method assumes additive Gaussian observation noise (RESTRICTIVE - many applications have non-Gaussian noise models)

Open problems:
Adaptive reference measures: How to automatically learn optimal reference geometries for arbitrary priors beyond the $(-\Delta + \kappa^2 I)^{-d}$ family?
Uncertainty quantification: Can we develop rigorous error bounds for the learned transport map’s approximation quality and propagate this to posterior credible intervals?

Trustworthy Koopman Operator Learning: Invariance Diagnostics and Error Bounds

Authors: Gustav Conradie, Nicolas Boullé, Jean-Christophe Loiseau, Steven L. Brunton et al. (5 authors) · Institution: (high-relevance, institution unverified) · Category: math.NA

Develops computable error bounds and invariance diagnostics for Koopman operator approximations using principal angles, enabling certified spectral analysis and forecasting from snapshot data.

Tags: koopman-operator-theory dynamic-mode-decomposition data-driven-dynamics spectral-methods error-bounds principal-angles reproducing-kernel-hilbert-spaces model-reduction

arXiv · PDF

Problem Formulation

Motivation: Data-driven Koopman operator methods like DMD and EDMD are widely used for analyzing dynamical systems, but their finite-dimensional approximations often produce spurious eigenvalues, misleading modes, and overconfident forecasts due to lack of invariance. This problem is critical in applications ranging from fluid dynamics to neuroscience where reliable spectral analysis and forecasting are essential.
Mathematical setup: Consider a discrete-time dynamical system
\[x_{n+1} = F(x_n), \quad n = 0, 1, 2, \ldots\]
with state space $X$ and evolution map $F: X \to X$. The Koopman operator $K$ acts on observables $g: X \to \mathbb{C}$ via
\[[Kg](x) = g(F(x))\]
Define the finite-dimensional subspace $V_N = \text{span}{\psi_1, \ldots, \psi_N}$ where ${\psi_j}_{j=1}^N$ is a user-chosen dictionary. Let $P_{V_N}$ denote orthogonal projection onto $V_N$ and $Q_{V_N} = I - P_{V_N}$.

Assumptions:
1. Snapshot data ${x^{(m)}, y^{(m)} = F(x^{(m)})}_{m=1}^M$ available
2. Dictionary functions ${\psi_j}$ are linearly independent in chosen Hilbert space $H$
3. Quadrature rule converges for $L^2$ setting or RKHS setting with bounded kernel
4. Koopman operator $K$ is bounded on $H$
Toy example: When $X = \mathbb{R}^2$, $F(x) = Ax$ with $A = \begin{pmatrix} \cos\theta & -\sin\theta \ \sin\theta & \cos\theta \end{pmatrix}$ (rotation), and dictionary $V_2 = \text{span}{x_1, x_2}$, the subspace is exactly invariant since $K[x_i] = [Ax]_i \in V_2$. The core difficulty arises when $V_N$ is not $K$-invariant, causing closure errors.
Formal objective: Bound the multi-step Koopman mode decomposition error
\[\|K^n P_{V_N}^* g - P_{V_N}^*(P_{V_N} K P_{V_N}^*)^n g\|\]
and quantify subspace invariance via principal angles between $V_N$ and $KV_N$.

Method

The method has three main components:

Principal Angle Computation: For subspaces $V = \text{span}{\Psi B}$ and $KV$, construct the mass matrix
\[J_V = \begin{pmatrix} G_V & A_V \\ A_V^* & L_V \end{pmatrix}\]
where $G_V = B^* G B$, $A_V = B^* A B$, $L_V = B^* L B$. Compute eigendecomposition $J_V = U D U^*$, then extract orthonormal representatives and apply subspacea algorithm.
Error Bounds: For observable $g \in V_N$, bound the $n$-step error using
\[\|K^n P_{V_N}^* g - P_{V_N}^*(P_{V_N} K P_{V_N}^*)^n g\| \leq \sqrt{\sum_{j=0}^{n-1} \|P_{V_N} K\|^{2j} \|Q_{V_N} K^{n-j} P_{V_N}^* g\|^2}\]
Principal Angle Decomposition: Select $r$ principal observables with smallest angles, form reduced matrices $A_{\text{pad}} = U^* A U$ where $U$ contains the $r$ best principal observables.
Expected Error Estimation: Replace worst-case bounds with expected norms $E_C[K^j] = E[\lvert K^j \zeta \rvert/\lvert \zeta \rvert]$ for Gaussian process $\zeta \sim GP(0, C)$.

Toy example application: For rotation $F(x) = Ax$ with dictionary ${x_1, x_2, x_1^2}$, the method computes principal angles between $\text{span}{x_1, x_2, x_1^2}$ and $\text{span}{[Ax]_1, [Ax]_2, [Ax]_1^2}$. Since $[Ax]_1^2 \notin \text{span}{x_1, x_2, x_1^2}$ generally, one principal angle will be non-zero, quantifying the invariance failure.

Novelty & Lineage

This work extends recent convergent Koopman methods (ResDMD [Colbrook et al.] and SpecRKHS [Colbrook et al.]) by developing a posteriori error certificates. Key novelties include:

Principal angle framework: First systematic use of principal angles to quantify Koopman subspace invariance, connecting to canonical correlation analysis but applied to dynamical systems
Multi-step error bounds: Rigorous bounds for iterated Koopman operators, extending beyond one-step residual methods
Unified L²/RKHS treatment: Single framework handling both classical ergodic theory setting and modern kernel methods
Expected error surrogates: Gaussian process methodology to tighten conservative worst-case bounds
Principal Angle Decomposition: Dynamics-informed alternative to SVD truncation

Prior work focused on specific dictionaries [Korda et al., Williams et al.] or asymptotic convergence [Klus et al.]. This provides general, computable, finite-sample certificates for arbitrary dictionaries.

SIGNIFICANT

Proof Techniques

The main proof strategy uses operator decomposition and geometric analysis:

Subspace decomposition: Key identity from Lemma 4.2:
\[P_{V_N} K^n P_{V_N}^* - (P_{V_N} K P_{V_N}^*)^n = \sum_{j=1}^{n-1} P_{V_N} (P_{V_N}^* P_{V_N} K)^j Q_{V_N}^* Q_{V_N} K^{n-j} P_{V_N}^*\]
Orthogonality exploitation: Uses $P_{V_N} Q_{V_N} = 0$ to separate invariant and non-invariant components:
\[\|K^n P_{V_N}^* g - P_{V_N}^*(P_{V_N} K P_{V_N}^*)^n g\|^2 = \left\|\sum_{j=0}^{n-1} (P_{V_N}^* P_{V_N} K)^j Q_{V_N}^* Q_{V_N} K^{n-j} P_{V_N}^* g\right\|^2\]
Recursive bound construction: Two complementary bounds via Cauchy-Schwarz vs. triangle inequality:
\[\sum_{j=0}^{n-1} \|(P_{V_N}^* P_{V_N} K)^j Q_{V_N}^* Q_{V_N} K^{n-j} P_{V_N}^* g\| \leq \sum_{j=0}^{n-1} \|P_{V_N} K\|^j \|Q_{V_N} K^{n-j} P_{V_N}^* g\|\]
Principal angle geometry: Connection to residuals via
\[\sin(\theta(V, KV)) = \inf_{\lambda \in \mathbb{C}\setminus\{0\}} \frac{\|Kg - \lambda g\|}{|\lambda|\|g\|}\]
establishing geometric interpretation of spectral approximation quality.
Matrix approximation: Quadrature convergence
\[\lim_{M \to \infty} G_{jk} = \langle\psi_k, \psi_j\rangle_{L^2}, \quad \lim_{M \to \infty} A_{jk} = \langle K\psi_k, \psi_j\rangle_{L^2}\]
enables finite-data computation of infinite-dimensional quantities.

Experiments & Validation

Datasets: Duffing oscillator (1000 snapshots, Chebyshev polynomials), Lorenz-63 system (exponential RBFs), high-dimensional cavity flow (144k DOF, Reynolds 7500), Pluto-Charon astronomical data.

Baselines: Standard SVD truncation vs. Principal Angle Decomposition (PAD), comparison of L² vs. RKHS error bounds.

Key numbers: PAD achieves ~50% reduction in short-term prediction error vs. SVD on Lorenz system, principal angle threshold θ ≈ 0.01 separates “good” vs. “bad” observables on Duffing system, error bounds are often conservative (10x overestimate) but expected error surrogates provide tighter approximation.

Validation includes synthetic examples with known ground truth (Lebesgue spectrum systems) where bounds are provably sharp, and real-world applications demonstrating practical utility.

Limitations & Open Problems

First-order approximation in error bounds - TECHNICAL (improves computational tractability but may lose tightness)
Large-data limit requirement for L² convergence - NATURAL (standard in ergodic theory, satisfied by long trajectories)
Bounded kernel assumption for RKHS setting - NATURAL (most practical kernels satisfy this)
Dictionary independence assumption - NATURAL (fundamental requirement for well-posed linear algebra)
Conservative worst-case bounds - TECHNICAL (geometric inequalities introduce slack, addressed by expected error methods)
Cut-off tolerance parameter selection - TECHNICAL (requires problem-specific tuning, could be automated)
Computational cost scales as O(N³) for N dictionary functions - RESTRICTIVE (limits scalability to very high-dimensional dictionaries)

Open problems:
Optimal dictionary selection: Given error tolerance, what is the minimal dictionary achieving desired accuracy?
Adaptive refinement: How to automatically grow/shrink dictionary based on principal angle diagnostics during computation?

$K-$means with learned metrics

Authors: Pablo Groisman, Matthieu Jonckheere, Jordan Serres, Mariela Sued · Institution: (high-relevance, institution unverified) · Category: math.ST

Establishes the first general consistency framework for k-means clustering when the distance metric itself must be learned from data, unifying results for Isomap, Fermat, and diffusion distances.

Tags: k-means metric-learning clustering manifold-learning consistency Gromov-Hausdorff diffusion-maps Isomap

arXiv · PDF

Problem Formulation

Motivation: Traditional k-means clustering assumes a known metric, but in modern applications the distance function itself must be learned from data (e.g., manifold learning, diffusion distances). Understanding when these learned-metric k-means procedures are consistent is crucial for clustering and dimensionality reduction.
Mathematical setup: Let $(X, d, \mu)$ be a compact metric measure space where $X$ is the space, $d$ is the metric, and $\mu$ is a Borel probability measure. Define the functional
\[\Phi_Y(S, \mu) = \int_Y \min_{x \in S} d^p(y, x) \mu(dy)\]
for $S \in C_k(Y)$ (subsets of $Y$ with cardinality at most $k$) and $p \geq 1$. The k-means set is
\[S_Y(k) = \{S \in C_k(Y) : \Phi_Y(S, \mu) \leq \Phi_Y(A, \mu) \text{ for all } A \in C_k(Y)\}\]
Assumptions:
1. All metric spaces are compact
2. Measures are normalized Borel probability measures
3. The sequence $X_n = (X_n, d_n, \mu_n)$ converges to $X = (X, d, \mu)$ in measured Gromov-Hausdorff (mGH) topology
The mGH convergence means there exist $\varepsilon_n \to 0$ and $\varepsilon_n$-isometries $h_n: X_n \to X$ such that $(h_n)_#(\mu_n) \to \mu$ weakly.
Toy example: Consider $X_n = {x_1, x_2, x_3} \subset \mathbb{R}$ with empirical measure $\mu_n = \frac{1}{3}(\delta_{x_1} + \delta_{x_2} + \delta_{x_3})$ and learned distance $d_n$ (e.g., diffusion distance). For $k=1$, we seek the point minimizing average squared distance. If $d_n$ converges uniformly to Euclidean distance $d$ and sample points fill $[0,1]$ as $n \to \infty$, the empirical 1-means should converge to the population 1-mean.
Formal objective: Establish that
\[\lim_{n \to \infty} \max_{S_n \in S_{X_n}(k)} \min_{S \in S_X(k)} d_H(h_n(S_n), S) = 0\]

Method

The main theoretical framework establishes continuity of k-means with respect to measured Gromov-Hausdorff topology:

Prove that the functional $\Phi_X(S, \mu)$ is continuous in both arguments under appropriate topologies
Use compactness of $C_k(X)$ and weak convergence $\mu_n \to \mu$ to show convergence
Apply $\varepsilon_n$-isometry properties to transfer distances between spaces

Key technical step: For any subsequence of $h_n(S_n)$ with $S_n \in S_{X_n}(k)$, show it has a convergent subsequence whose limit belongs to $S_X(k)$. This uses:
\[|\Phi_X(h_n(A), \mu_n^h) - \Phi_{X_n}(A, \mu_n)| \leq \tilde{C}\varepsilon_n\]
where $\mu_n^h = (h_n)_#(\mu_n)$ and the bound follows from $\varepsilon_n$-isometry properties.

For applications to specific distances (Isomap, Fermat, diffusion), the method requires:
Establish uniform convergence $\sup_{x,y \in X_n} \lvert d_n(x,y) - d(x,y) \rvert \to 0$ a.s.
Verify sample fills space: $\sup_{x \in X} d(x, X_n) \to 0$ a.s.
Apply main theorem

Toy example application: For $X_n = {0, 0.5, 1}$ with Euclidean distance and uniform weights, if learned distance $d_n$ satisfies $\lvert d_n(x,y) - \rvert x-y\rvert \rvert \leq \varepsilon_n \to 0$, then empirical 1-mean (which is 0.5) converges to population 1-mean of uniform distribution on $[0,1]$ (which is also 0.5).

Novelty & Lineage

This work significantly extends Pollard’s classical k-means consistency result and Jaffe’s work on Wasserstein k-means by allowing both the metric and measure to be unknown. Prior work assumed known distances:

Pollard (1981): consistency for empirical k-means in $\mathbb{R}^d$ with fixed Euclidean distance
Müller & Quintana-Fernández: extension to separable metric spaces with known metrics
Jaffe: Wasserstein k-means consistency with known ground metric
Bachoc et al.: relaxed k-means for metric spaces with Heine-Borel property

Novel contributions:

First general framework for k-means consistency under learned metrics via mGH topology
Unified treatment covering Isomap, Fermat distances, diffusion distances, Wasserstein with learned ground metrics
Cluster consistency results (Voronoi cells) under metric learning
Applications to first passage percolation and length space discretizations
Uniqueness criteria for Fermat distances using non-positive curvature

Most consistency results for specific learned distances (Isomap, Fermat, diffusion) appear to be new even for k=1.

SIGNIFICANT

Proof Techniques

The main proof strategy uses compactness arguments combined with measured Gromov-Hausdorff convergence properties:

Compactness extraction: Since $h_n(S_n) \in C_k(X)$ (compact space), extract convergent subsequence $h_n(S_n) \to S$ for some $S \in C_k(X)$.
Key continuity bound: Exploit $\varepsilon_n$-isometry property to show
\[|d^p(h_n(x), h_n(y)) - d_n^p(x,y)| \leq \tilde{C}\varepsilon_n\]
for all $x,y \in X_n$, where $\tilde{C}$ depends on diameter bounds and parameter $p$.
Transfer minimality: For limit point $S$, show $\Phi_X(S, \mu) \leq \Phi_X(S’, \mu)$ for all $S’ \in C_k(X)$ by: - Using weak convergence $(h_n)_#(\mu_n) \to \mu$ and continuity of $\Phi_X$ to get
\[\Phi_X(S, \mu) = \lim_{n \to \infty} \Phi_X(h_n(S_n), \mu_n^h)\]
```
- For any competitor $S'$, construct approximate pre-images $S'\_n \in C\_k(X\_n)$ with $d\_H(h\_n(S'\_n), S') \to 0$
- Apply optimality $\Phi\_{X\_n}(S\_n, \mu\_n) \leq \Phi\_{X\_n}(S'\_n, \mu\_n)$ and transfer via $\varepsilon\_n$-isometry bounds
```
Cluster stability: For Voronoi cells, use enlarged regions $W(\delta) = {x: d(x,b) \leq d(x,b’) + \delta \text{ for all } b’}$ and show $d_H(W, W(\delta)) \to 0$ as $\delta \to 0$ by contradiction using compactness.
Application-specific uniform convergence: For concrete distances (Fermat, Isomap, diffusion), leverage existing pointwise convergence results and strengthen to uniform convergence using: - Fermat: Scaling $n^{(\alpha-1)/\ell} d_{n,\alpha} \to c d_{f,\alpha}$ from Benard et al. - Diffusion: Spectral convergence of graph Laplacians from von Luxburg et al. - Isomap: Graph distance convergence under connectivity conditions

Experiments & Validation

Purely theoretical. The paper provides no empirical validation, focusing entirely on establishing consistency guarantees.

Empirical validation would naturally involve:

Synthetic data on known manifolds (e.g., Swiss roll, sphere) comparing learned-metric k-means to ground truth clusters
Real datasets where intrinsic geometry matters (single-cell RNA-seq, image patches) showing improved clustering over Euclidean k-means
Convergence rate experiments demonstrating how sample size affects cluster recovery
Robustness studies under model misspecification (e.g., manifold assumption violations)

Limitations & Open Problems

Limitations:

Compactness assumption - NATURAL: Standard for theoretical analysis of k-means, satisfied by bounded data after appropriate normalization
Measured Gromov-Hausdorff convergence requirement - TECHNICAL: Strong condition, but authors show it reduces to uniform distance convergence plus sample density for i.i.d. case
No convergence rates provided - TECHNICAL: Framework establishes consistency but not speed of convergence
Uniqueness assumptions needed for stronger results - NATURAL: Non-uniqueness is inherent to k-means optimization landscape, uniqueness criteria provided for Fermat case
Specific conditions for each distance type - TECHNICAL: Bandwidth choices for diffusion distances, connectivity for Isomap, smoothness for Fermat distances
No finite-sample guarantees - RESTRICTIVE: Only asymptotic results, no concentration bounds or sample complexity

Open problems:
Establish convergence rates for the various learned distance settings, particularly for diffusion and Fermat distances where rates are unknown
Extend uniqueness criteria beyond Fermat distances to other learned metrics, possibly using curvature or spectral properties

Nonlinear Information Theory: Characterizing Distributional Uncertainty in Communication Models with Sublinear Expectation

Authors: Wen-Xuan Lang, Shaoshi Yang, Jianhua Zhang, Zhiming Ma · Institution: (high-relevance, institution unverified) · Category: cs.IT

Establishes nonlinear information theory using sublinear expectation to handle distributional uncertainty in communication systems, extending Shannon theory beyond fixed probability distributions.

Tags: information theory nonlinear expectation distributional uncertainty sublinear expectation communication theory source coding channel coding rate distortion

arXiv · PDF

Problem Formulation

Motivation: Classical Shannon information theory assumes fixed, known probability distributions for sources and channels, but real communication systems often exhibit distributional uncertainty due to non-stationary, heterogeneous phenomena that cannot be characterized by single probability distributions. This limitation becomes critical in complex networks like 6G where higher precision requirements demand accounting for uncertainty in probability models themselves.
Mathematical setup: Consider a sublinear expectation space $(\Omega, \mathcal{H}, E)$ where $\Omega$ is a sample space, $\mathcal{H}$ is a linear space of real-valued functions on $\Omega$, and $E$ is a sublinear expectation satisfying:
\[E[X + Y] \leq E[X] + E[Y]\] \[E[\lambda X] = \lambda E[X] \text{ for } \lambda \geq 0\] \[E[c] = c \text{ for constants } c\]
By the representation theorem, there exists a family of probability measures $\mathcal{P} = {P_\theta}_{\theta \in \Theta}$ such that:
\[E[X] = \sup_{\theta \in \Theta} E_{P_\theta}[X]\]
For a discrete random variable $X$ on this space, the uncertain probability distributions are characterized by ${p_\theta(X)}_{\theta \in \Theta}$.

Assumptions:
1. The nonlinear entropy $\hat{H}(X) \leq \sup_{\theta \in \Theta} \sum_x p_\theta(x) \log \frac{1}{p_\theta(x)}$ and is continuous in each distribution
2. If uniform distribution $p(i) = \frac{1}{N}$ is in the family, then $\hat{H}(X) = \log N$
3. Stronger distributions (higher uncertainty) yield higher entropy
4. Sequential random choices satisfy subadditivity property
Toy example: Consider a binary random variable $X \in {0,1}$ where $\Pr(X=0) \in [p-\epsilon, p+\epsilon]$ for $p=0.5, \epsilon=0.1$. The uncertain distribution family is ${p_q = {q, 1-q} : q \in [0.4, 0.6]}$. The classical Shannon entropy would be $H(0.5) = 1$ bit, but the nonlinear entropy is $\hat{H}(X) = \max_{q \in [0.4,0.6]} [q \log \frac{1}{q} + (1-q) \log \frac{1}{1-q}] > 1$.
Formal objective: Define and characterize the nonlinear information entropy:
\[\hat{H}(X) = \sup_{\theta \in \Theta} \sum_x p_\theta(x) \log \frac{1}{p_\theta(x)}\]

Method

The method establishes nonlinear information theory through several key definitions based on sublinear expectation:

Nonlinear information entropy for uncertain-distribution sources:
\[\hat{H}(X) := \sup_{\theta \in \Theta} \sum_x p_\theta(x) \log \frac{1}{p_\theta(x)}\]
Nonlinear joint entropy for two random variables:
\[\hat{H}(X,Y) := \sup_{\theta \in \Theta} \sup_{\lambda \in \Lambda} \sum_{x,y} p_\theta(x) p_\lambda(y|x) \log \frac{1}{p_\theta(x) p_\lambda(y|x)}\]
Nonlinear conditional entropy:
\[\hat{H}(Y|X) := \sup_{\theta \in \Theta} \sum_x p_\theta(x) \hat{H}(Y|X=x)\]
Nonlinear mutual information:
\[I[X;Y] := \sup_{\theta \in \Theta} \sup_{\lambda \in \Lambda} \sum_{x,y} p_\theta(x) p_\lambda(y|x) \log \frac{p_\lambda(y|x)}{\sum_x p_\lambda(y|x) p_\theta(x)}\]
Source coding theorem uses strong law of large numbers under sublinear expectations to establish that $\hat{H}(X)$ is the upper bound for coding rate under maximum error criterion.

Applying to toy example: For the binary variable with $q \in [0.4, 0.6]$, we compute:
\[\hat{H}(X) = \max_{q \in [0.4,0.6]} [q \log_2 \frac{1}{q} + (1-q) \log_2 \frac{1}{1-q}]\]
Since the entropy function is concave and symmetric around 0.5, the maximum occurs at the boundary points, giving $\hat{H}(X) = 0.6 \log_2 \frac{1}{0.6} + 0.4 \log_2 \frac{1}{0.4} \approx 0.971$ bits.

Novelty & Lineage

This work introduces the first application of Peng’s nonlinear expectation theory to information theory and communications. Prior work includes:

Shannon’s classical information theory (1948) with deterministic probability distributions
Rényi entropy and Choquet capacity entropy as generalizations, but still requiring explicit probability models
Universal source coding (Ziv) and arbitrarily varying channels (Blackwell et al.) handling unknown parameters within fixed distributional families
Peng’s nonlinear expectation theory (2010) for financial mathematics

The key novelty is moving beyond single probability distributions or parametric families to directly model distributional uncertainty using sublinear expectations. This enables analysis of “uncertainty about uncertainty” - a second-order uncertainty that classical theory cannot handle. The work establishes fundamental limits (source coding, channel coding, rate-distortion theorems) in this new framework, showing both similarities and crucial differences from classical results.

SIGNIFICANT

Proof Techniques

The main proof techniques involve:

Axiomatic derivation of nonlinear entropy: Uses four fundamental assumptions about information measures under distributional uncertainty to prove:
\[\hat{H}(X) = \sup_{\theta \in \Theta} \sum_x p_\theta(x) \log \frac{1}{p_\theta(x)}\]
Strong Law of Large Numbers under sublinear expectations: Key theorem states that for IID sequence ${X_i}$ under sublinear expectation $E[\cdot] = \sup_{\theta \in \Theta} E_{P_\theta}[\cdot]$:
\[V\left(b \in C\left(\frac{\sum_{i=1}^n X_i}{n}\right)\right) = 1\]
for any $b \in [-E[-X_1], E[X_1]]$, where $C({x_n})$ denotes cluster points.
Mathematical capacity theory: Uses capacity generated by sublinear expectation:
\[V(A) := E[I_A], \quad v(A) := -E[-I_A]\]
where $V(A) = \sup_{\theta \in \Theta} P_\theta(A)$ and $v(A) = \inf_{\theta \in \Theta} P_\theta(A)$.
Supremum operations over probability families: Most properties proved by taking appropriate suprema over the underlying probability measure sets ${P_\theta}_{\theta \in \Theta}$.
Key technical insight: Unlike classical theory where chain rule holds with equality, nonlinear theory yields inequalities:
\[\hat{H}(X,Y) \leq \hat{H}(X) + \hat{H}(Y|X)\]
This reflects the additional uncertainty introduced by uncertain transition probabilities.

Experiments & Validation

The paper is primarily theoretical but includes:

Simulation of binary random variable example showing how nonlinear entropy $\hat{H}(X)$ increases with distribution uncertainty parameter $\epsilon$
Graphical illustration comparing classical Shannon entropy (blue solid line) with nonlinear entropy for different uncertainty levels ($\epsilon = 0.05$ red dashed, $\epsilon = 0.1$ yellow dash-dotted)
Results confirm theoretical prediction that greater distributional uncertainty leads to higher information entropy

Empirical validation would require:

Real communication systems with measurable distributional uncertainty
Performance comparison of coding schemes designed using nonlinear vs classical information theory
Verification of predicted fundamental limits in practical scenarios with non-stationary sources/channels

Limitations & Open Problems

Limitations:

Discrete random variables only - TECHNICAL (continuous extension likely feasible using same principles)
Requires explicit characterization of uncertainty families ${P_\theta}_{\theta \in \Theta}$ - RESTRICTIVE (difficult to specify in practice without some distributional knowledge)
Computational complexity of supremum operations over potentially infinite parameter sets - TECHNICAL (approximation schemes needed for practical implementation)
No constructive coding algorithms provided, only existence results - TECHNICAL (algorithmic development is natural next step)
Fundamental limits expressed as cluster points rather than achievable rates - NATURAL (reflects inherent uncertainty in the framework)

Open problems:
Develop computationally tractable approximation methods for the supremum operations in nonlinear information measures when the parameter set $\Theta$ is large or continuous.
Extend the theory to continuous random variables and establish nonlinear differential entropy with proper measure-theoretic foundations under sublinear expectations.

Weak Adversarial Neural Pushforward Method for the McKean-Vlasov / Mean-Field Fokker-Planck Equation

Authors: Andrew Qing He, Wei Cai · Institution: (high-relevance, institution unverified) · Category: math.NA

Extends neural pushforward methods to McKean-Vlasov equations, showing that quadratic interaction kernels reduce the mean-field nonlinearity to a simple batch mean with key training insights about gradient flow and frequency initialization.

Tags: mean-field-theory fokker-planck-equations neural-pde-solvers adversarial-training interacting-particle-systems pushforward-maps mckean-vlasov-equations

arXiv · PDF

Problem Formulation

Motivation: The McKean-Vlasov (mean-field) Fokker-Planck equation governs interacting particle systems where each particle’s drift depends on the entire population distribution, appearing in granular media, stochastic flocking, mathematical finance, and neural network analysis. Standard numerical methods suffer from statistical noise, require large particle counts, and don’t produce queryable functional representations.
Mathematical setup: Consider the stationary McKean-Vlasov equation:
\[-\nabla \cdot [b(x, \rho) \rho(x)] + \frac{\sigma^2}{2} \Delta \rho(x) = 0\] \[\int_{\mathbb{R}^n} \rho(x) dx = 1\]
For the quadratic interaction kernel $W(x-y) = \frac{1}{2}\lvert x-y \rvert^2$ with confinement $V(x) = \frac{\theta}{2}\lvert x \rvert^2$, the drift simplifies to:
\[b(x, \rho) = -\lambda x + m(\rho)\]
where $\lambda = \theta + 1$ and $m(\rho) = \mathbb{E}_{y \sim \rho}[y]$ is the mean of the distribution $\rho$.

Assumptions:
1. Quadratic interaction kernel and quadratic confinement potential
2. Constant isotropic diffusion coefficient $\sigma^2$
3. Existence and uniqueness of stationary solution
Toy example: When $n=1$, $\theta=1$, $\sigma=1$, we have $\lambda=2$ and the unique stationary solution is $\rho^* = \mathcal{N}(0, 0.25)$ with $m^*=0$, giving drift $b(x,\rho^*) = -2x$.
Formal objective: Find a neural pushforward map $F_\vartheta: \mathbb{R}^d \to \mathbb{R}^n$ such that if $r \sim \pi_{base}$, then $x = F_\vartheta(r)$ has distribution $\rho^*$ satisfying:
\[\mathbb{E}_{x \sim \rho^*}[L[\rho^*] f(x)] = 0 \quad \forall f \in C_c^\infty(\mathbb{R}^n)\]
where $L[\rho^*] f(x) = b(x,\rho^*) \cdot \nabla f(x) + \frac{\sigma^2}{2} \Delta f(x)$.

Method

The Weak Adversarial Neural Pushforward Method (WANPM) for McKean-Vlasov equations:

Parametrize the stationary distribution implicitly via neural pushforward map $F_\vartheta: \mathbb{R}^d \to \mathbb{R}^n$
Use plane-wave test functions $f^{(k)}(x) = \sin(w^{(k)} \cdot x + b^{(k)})$ with learnable frequencies $w^{(k)}$ and biases $b^{(k)}$
Compute batch mean estimate with gradient flow:
\[\hat{m} = \frac{1}{M} \sum_{m=1}^M F_\vartheta(r^{(m)})\]
Evaluate weak residual using the generator:
\[L[\rho] f^{(k)}(x) = (-\lambda x + \hat{m}) \cdot w^{(k)} \cos(w^{(k)} \cdot x + b^{(k)}) - \frac{\sigma^2}{2} \|w^{(k)}\|^2 \sin(w^{(k)} \cdot x + b^{(k)})\]
Minimize adversarial loss:
\[L_{total}[\vartheta, \eta] = \frac{1}{K} \sum_{k=1}^K \left[\frac{1}{M} \sum_{m=1}^M L[\rho_\vartheta] f^{(k)}(x^{(m)})\right]^2\]
Initialize test function frequencies at large scale (∼2.0) to avoid spurious two-point solutions

Applied to toy example: With $n=d=1$, $\lambda=2$, $\sigma=1$, the network learns to transform uniform $U[0,1)$ samples to $\mathcal{N}(0,0.25)$ samples. The method achieves mean error $3.5 \times 10^{-4}$ and variance error $8.0 \times 10^{-3}$.

Novelty & Lineage

This extends the Weak Adversarial Neural Pushforward Method (WANPM) from He & Cai [2025] for standard Fokker-Planck equations to the McKean-Vlasov setting. Key novel contributions:

shows that for quadratic kernels, the mean-field nonlinearity reduces to a simple batch mean requiring no secondary sampling
identifies that gradient must flow through the mean estimate for self-consistency
discovers that test function frequencies must be initialized at large scale to avoid spurious two-point solutions. Prior work includes particle methods exploiting propagation of chaos and other neural PDE solvers, but none specifically address the mean-field Fokker-Planck case with this adversarial pushforward approach. INCREMENTAL.

Proof Techniques

The paper is primarily algorithmic rather than theoretical, with key technical insights rather than formal proofs:

Quadratic kernel simplification via direct computation:
\[\int_{\mathbb{R}^n} \nabla W(x-y) \rho_t(y) dy = \int_{\mathbb{R}^n} (x-y) \rho_t(y) dy = x - m(t)\]
Spurious solution analysis showing two-point distribution $\rho_{two} = \frac{1}{2}\delta_{-a} + \frac{1}{2}\delta_{+a}$ with $a = \sigma/\sqrt{2\lambda}$ has residual:
\[R^{(k)}_{two} = \sin(wa)\left[\lambda aw \sin(b) - \frac{\sigma^2 w^2}{2} \cos(b)\right]\]
For small frequencies $wa \ll 1$, the factor $\sin(wa) \approx wa$ makes $\lvert R^{(k)}_{two} \rvert$ small compared to the Gaussian residual (which is exactly zero), creating spurious minima.
Self-consistency argument: detaching the mean estimate $\hat{m}$ from the gradient graph prevents the coupling $m^* = \mathbb{E}_{\rho}[x]$ from being enforced through optimization dynamics.

The analysis is computational rather than providing convergence guarantees or approximation bounds.

Experiments & Validation

Single numerical benchmark on 1D linear McKean-Vlasov equation with $\theta=1$, $\sigma=1$, exact solution $\mathcal{N}(0,0.25)$. Network architecture: (1,10,10,1) with Tanh activations. Training: 5000 epochs, batch size 1000, $K=1000$ test functions, Adam optimizer with learning rate $10^{-3}$ for generator and SGD with $10^{-2}$ for adversary. Results: mean error $3.5 \times 10^{-4}$, variance error $8.0 \times 10^{-3}$, training loss converges from $\sim 6$ to $9.6 \times 10^{-6}$ in 1.1 minutes on single GPU. Q-Q plot shows accurate distributional agreement including tails.

Limitations & Open Problems

Restriction to quadratic interaction kernel - RESTRICTIVE (excludes many physically relevant kernels like Keller-Segel $W(x) = -\log\lvert x \rvert$)
Focus on stationary problem only, with time-dependent extension sketched but not validated - TECHNICAL (extension appears straightforward but needs empirical validation)
Single 1D numerical experiment - TECHNICAL (more extensive benchmarking needed across dimensions and parameters)
No convergence theory or approximation bounds - TECHNICAL (typical for neural PDE methods but limits theoretical understanding)
Requires careful hyperparameter tuning (frequency initialization scale, learning rates) - NATURAL (common in adversarial training)

Open problems:
Extension to general interaction kernels beyond quadratic case, particularly handling kernels requiring secondary sampling for interaction integrals
Theoretical analysis of convergence rates and approximation error bounds for the adversarial training dynamics in the mean-field setting

Timely Best Arm Identification in Restless Shared Networks

Authors: Mengqiu Zhou, Vincent Y. F. Tan, Meng Zhang · Institution: (high-relevance, institution unverified) · Category: cs.IT

First theoretical analysis of best arm identification for minimizing age of information under unknown restless Markovian dynamics, with novel algorithms achieving optimal sample complexity.

Tags: restless-bandits best-arm-identification age-of-information markov-chains concentration-inequalities edge-computing sample-complexity information-theory

arXiv · PDF

Problem Formulation

Motivation: Real-time applications like autonomous vehicles and industrial control systems require fresh data updates to maintain safety and performance. Edge computing nodes exhibit time-varying congestion that affects data freshness, measured by age of information (AoI), and identifying the best node without prior knowledge of dynamics is critical.
Mathematical setup: Consider $K$ edge nodes modeled as restless multi-armed bandit (RMAB) arms. Each arm $a \in [K]$ evolves according to unknown Markov chain on finite state space $\mathcal{S} = {0,1,\ldots,S}$ with transition probabilities $P_{\theta_a}(s’\lvert s)$ parameterized by $\theta_a$. The hidden congestion state $X_{a,i} \in \mathcal{S}$ generates deterministic service delay $Y_{a,i} = f(X_{a,i})$. The AoI process $\Delta(t)$ has sawtooth evolution, and the time-average AoI is:
\[\Delta^{(ave)} = \lim_{T \to \infty} \frac{1}{T} \int_0^T \Delta(t) dt\]
This equals:
\[\gamma^* = \min_{a \in [K]} \frac{\mathbb{E}[Q(Y_{a,i}, Y_{a,i+1})]}{\mathbb{E}[Y_{a,i}]}\]
where $Q(Y_i, Y_{i+1}) = \frac{1}{2}Y_i^2 + Y_i Y_{i+1}$.

Assumptions:
1. Each arm’s transition matrix satisfies partial irreducibility conditions (Assumption 1)
2. Arms evolve independently with unknown parameters $\theta = [\theta_1, \ldots, \theta_K]^T$
3. Deterministic emission: $Y_{a,i} = f(X_{a,i})$ where $f$ is known monotone function
Toy example: When $K=2$ nodes, $\mathcal{S} = {0,1}$, and $f(0)=1, f(1)=2$, arm 1 has stationary probability $\mu_1(1) = 0.3$ and arm 2 has $\mu_2(1) = 0.7$. Then $\mathbb{E}[Y_1] = 1.3, \mathbb{E}[Y_2] = 1.7$, making arm 1 likely optimal since it has lower expected delay and thus lower AoI.
Formal objective: Using Dinkelbach transformation to handle fractional objective, find:
\[a^*(θ, γ) = \arg\min_{a \in [K]} g_{\theta_a}(γ) = \arg\min_{a \in [K]} [\eta_{\theta_a}(γ) + \Gamma_{\theta_a}]\]
where $\eta_{\theta_a}(γ) = \sum_{s \in \mathcal{S}} [\frac{1}{2}f^2(s) - γf(s)]\mu_{\theta_a}(s)$ and $\Gamma_{\theta_a} = \sum_{s,s’} f(s)f(s’)P_{\theta_a}(s’\lvert s)\mu_{\theta_a}(s)$.

Method

The Age-Aware LUCB algorithm has three main components:

Markov Regeneration Sampling: To handle temporal dependence, sample each arm in blocks between returns to regenerative state $\tilde{Y}_a$. Within each block, collect samples during regeneration phase and construct empirical estimators:
\[\hat{\eta}_{a,T_jb}(γ) = \frac{1}{T^a_{jb}} \sum_{i=1}^{T^a_{jb}} [\frac{1}{2}Y_{a,i}^2 - γY_{a,i}]\] \[\hat{\Gamma}_{a,T_jb} = \frac{1}{T^a_{jb}} \sum_{i=1}^{T^a_{jb}} Y_{a,i}Y_{a,i+1}\]
Confidence Bounds: Using concentration for Markov chains with pseudo spectral gap $β_{\min}$, construct confidence intervals:
\[\text{LCB}_a(b) = \hat{g}^b_a(γ) - \text{rad}_a(b)\] \[\text{UCB}_a(b) = \hat{g}^b_a(γ) + \text{rad}_a(b)\]
where $\text{rad}_a(b) = \sqrt{\frac{c \log(4j_bK/(δ\bar{\mu}))}{T^a_{jb}}}$.
LUCB Selection: At each block $b$, identify best arm candidate $h(b) = \arg\min_a \hat{g}^b_a(γ)$ and challenger $l(b) = \arg\min_{a \neq h(b)} \text{LCB}_a(b)$. Stop when $\text{UCB}_{h(b)}(b) \leq \text{LCB}_{l(b)}(b)$.
Dinkelbach Update: Use bisection search to find optimal $γ^*$ such that $J(γ^*) = 0$.

Applied to toy example: With two arms and regenerative state $\tilde{Y} = 1$, the algorithm alternates between arms, collecting regenerative samples. For arm 1 with observations $[1,2,1,1,2,1]$, it estimates $\hat{\eta}_1 ≈ 0.17$ and $\hat{\Gamma}_1 ≈ 1.4$, yielding $\hat{g}_1 ≈ 1.57$ (assuming $γ=1.4$). Similar estimation for arm 2 likely gives higher value, leading to selection of arm 1.

Novelty & Lineage

This is the first work to study best arm identification (BAI) for age of information minimization under unknown restless dynamics. Prior AoI-RMAB works (Kadota et al., Le et al.) assume known transition probabilities and use Whittle’s index policies. Prior RMAB-BAI work (Combes et al.) focuses on expected reward maximization, not structured objectives like AoI that depend on temporal trajectories.

The paper introduces:

novel age-optimal BAI problem formulation
Markovian regeneration sampling strategy to handle temporal dependence
instance-dependent sample complexity bounds incorporating Markov mixing behavior
information-theoretic lower bounds with explicit two-arm case study.

Technical novelty includes extending LUCB algorithms to fractional objectives via Dinkelbach transformation and developing concentration bounds for non-reversible Markov chains using pseudo spectral gaps.

SIGNIFICANT

Proof Techniques

The main proof strategy consists of several key stages:

Concentration Analysis: Establish concentration bounds for empirical estimators under Markovian dependence using regenerative structure. The key concentration inequality from [47, Theorem 3.3] gives:
\[P[|\hat{g}_{\theta_a}(γ) - g_{\theta_a}(γ)| > \epsilon] \leq 4\|\omega_a\|_2 \exp\left(-\frac{T^a_{jb}\epsilon^2\beta_{\theta_a}}{96(\bar{f}^2 - \underline{f}^2)^2}\right)\]
where $\beta_{\theta_a}$ is the pseudo spectral gap and $\omega_a$ captures initialization bias.
Union Bound Construction: Apply union bound over all $K$ arms and $j_b$ blocks to ensure simultaneous concentration. The global good event $E$ satisfies $P{E} \geq 1-δ$ where:
\[E = \bigcap_{a \in [K]} \bigcap_{b \in \mathbb{N}} \{|\hat{\eta}^b_a(γ) + \hat{\Gamma}^b_a - g_{\theta_a}(γ)| \leq \text{rad}_a(b)\}\]
Elimination Analysis: For suboptimal arm $a$ with gap $\Lambda_{\theta_a} = g_{\theta_a}(γ^*) - g_{\theta_1}(γ^*)$, show elimination occurs when confidence radius satisfies:
\[\text{rad}_a(b) \leq \frac{\Lambda_{\theta_a}}{4}\]
This uses midpoint threshold $ξ = \frac{1}{2}(g_{\theta_1}(γ^*) + g_{\theta_2}(γ^*))$ and the key inequality $g_{\theta_a}(γ^*) - ξ \geq \frac{\Lambda_{\theta_a}}{2}$.
Sample Complexity Bound: Combine elimination condition with concentration bounds and account for Markovian regeneration overhead characterized by expected hitting times $\Psi^{\max}_{\theta_a}$ and minimal stationary probabilities $\mu^{\min}_{\theta_a}$.
Lower Bound via Change of Measure: Construct alternative instances $θ’ \in \text{ALT}(θ)$ and use log-likelihood ratio analysis. For two-arm case, the key insight is Fisher information scaling:
\[I^{(d_2)}_{\theta_2} = \mu_{\theta_2}(1)(1-\mu_{\theta_2}(1))(2 + 2\lambda^{(2)}_2)\]
where $\lambda^{(2)}_2$ is the second eigenvalue capturing temporal correlation.

Experiments & Validation

The experiments use a 5-node mobile edge computing system with congestion states $\mathcal{S} = {0,1,2,3,4}$ and deterministic service delays $f(s) = 10 + 5s$ ms. The problem instances have increasing difficulty: $θ^{(κ)} = [0.1, 0.1 + 0.2(a-1)/(κ-1)]$ for $κ \in {2,3,4,5}$.

Key results:

Age-optimal BAI reduces sample complexity by 22-43% vs baselines (Markovian-LUCB, Markovian-UCB-BAI, Markovian-Uniform)
Savings increase with stricter confidence levels (δ from 10^-1 to 10^-7)
Performance gains are more pronounced for harder instances with smaller AoI gaps
Two-state mixing analysis shows oscillatory dynamics require highest sample complexity, consistent with lower bound predictions

Baselines: Three variants using Markovian regeneration sampling - Markovian-LUCB (classical LUCB with regenerative samples), Markovian-UCB-BAI (UCB adapted for BAI), and Markovian-Uniform (round-robin selection).

Evaluation metrics: Sample complexity (number of updates until termination) across 1000 trials with 95% confidence intervals.

Limitations & Open Problems

Limitations:

Deterministic emission model $Y_{a,i} = f(X_{a,i})$ - TECHNICAL (simplifies analysis but real systems have stochastic delays; extension to stochastic emissions is feasible)
Independent arm evolution assumption - NATURAL (reasonable for spatially distributed edge nodes with weakly correlated traffic)
Known function $f(\cdot)$ mapping states to delays - TECHNICAL (could be learned jointly with transition dynamics)
Finite state space assumption - NATURAL (practical systems have bounded congestion levels)
Zero-waiting policy assumption - RESTRICTIVE (optimal waiting strategies could further improve AoI performance)
One-parameter exponential family for transitions - TECHNICAL (enables tractable analysis but more general families are possible)

Open problems:
Optimal waiting strategy design: How should the scheduler optimally time update submissions to minimize AoI beyond just selecting the best node?
Joint learning of emission function and transition dynamics: When $f(\cdot)$ is unknown, how can both the state-to-delay mapping and Markov transitions be learned simultaneously while maintaining sample efficiency?

Extinction behaviour for mutually enhancing continuous-state population dynamics

Authors: Jie Xiong, Xu Yang, Xiaowen Zhou · Institution: (high-relevance, institution unverified) · Category: math.PR

Establishes sharp extinction/non-extinction criteria for two-dimensional continuous-state branching processes with mutually enhancing interactions using generalized Chen’s criteria and novel two-dimensional test functions.

Tags: stochastic differential equations continuous-state branching processes population dynamics extinction probability Lotka-Volterra models Chen's criteria martingale methods jump-diffusion processes

arXiv · PDF

Problem Formulation

Motivation (2–3 sentences): Understanding extinction behavior in population dynamics is fundamental for modeling biological systems, where species interact through mutual enhancement rather than competition. This study addresses a gap in the literature on continuous-state branching processes by analyzing when mutually enhancing populations survive or go extinct.

Mathematical setup: Consider a probability space $(\Omega, \mathcal{F}, \mathcal{F}_t, \mathbb{P})$ supporting two independent Brownian motions $(B_1(t))_{t \geq 0}$, $(B_2(t))_{t \geq 0}$ and two independent compensated Poisson random measures $\tilde{N}_1(ds,dz,du)$, $\tilde{N}_2(ds,dz,du)$ with intensity $ds\mu_i(dz)du$ where $\mu_i(dz) = \frac{\alpha_i(\alpha_i-1)}{\Gamma(\alpha_i)\Gamma(2-\alpha_i)}z^{-1-\alpha_i}\mathbf{1}_{{z>0}}dz$ for $\alpha_i \in (1,2)$. The two-dimensional population process $(X_t, Y_t)_{t \geq 0}$ satisfies the SDE system:

\[X_t = X_0 + a_1\int_0^t X_s^{\theta_1}Y_s^{\kappa_1}ds - b_{10}\int_0^t X_s^{r_{10}}ds + b_{11}\int_0^t\sqrt{2X_s^{r_{11}}}dB_1(s) + \int_0^t\int_0^\infty\int_0^{b_{12}X_{s-}^{r_{12}}} z\tilde{N}_1(ds,dz,du)\] \[Y_t = Y_0 + a_2\int_0^t Y_s^{\theta_2}X_s^{\kappa_2}ds - b_{20}\int_0^t Y_s^{r_{20}}ds + b_{21}\int_0^t\sqrt{2Y_s^{r_{21}}}dB_2(s) + \int_0^t\int_0^\infty\int_0^{b_{22}Y_{s-}^{r_{22}}} z\tilde{N}_2(ds,dz,du)\]

Mutual enhancement assumption: $a_1, a_2, \kappa_1, \kappa_2 > 0$
Non-degeneracy: $b_{11} + b_{12} > 0$ and $b_{21} + b_{22} > 0$
Parameter ranges: $\theta_i, r_{ij}, b_{ij} \geq 0$ for $i=1,2$, $j=0,1,2$
Initial conditions: $X_0, Y_0 > 0$ deterministic

Define $r_i := \min{r_{ij} - \varrho_{ij} : j \in {b_{ij} \neq 0, j = 0,1,2}}$ where $\varrho_{i0} = \varrho_{i1} = 1+j$ and $\varrho_{i2} = \alpha_i$.

Toy example: When $\theta_1 = \theta_2 = 1$, $\kappa_1 = \kappa_2 = 1$, $r_{ij} = 1$ for all $i,j$, and $b_{12} = b_{22} = 0$, the system becomes:
\[dX_t = (a_1Y_t - b_{10})X_tdt + b_{11}\sqrt{2}X_t^{1/2}dB_1(t)\] \[dY_t = (a_2X_t - b_{20})Y_tdt + b_{21}\sqrt{2}Y_t^{1/2}dB_2(t)\]
This illustrates the core difficulty: mutual enhancement can either promote survival through positive feedback or lead to extinction when negative drift dominates.

Formal objective: Determine conditions on the parameters such that:
\[\mathbb{P}\{\tau_0 < \infty\} = 0 \quad \text{or} \quad \mathbb{P}\{\tau_0 < \infty\} > 0\]
where $\tau_0 := \inf{t \geq 0 : X_t = 0 \text{ or } Y_t = 0}$.

Method

The analysis uses generalized Chen’s criteria adapted to two-dimensional processes, employing carefully constructed test functions to establish extinction/non-extinction dichotomies.

Method steps:

Non-extinction criterion (Proposition 2.1): For test function $g \in C^2((0,\infty) \times (0,\infty))$, if $\lim_{x \wedge y \to 0} g(x,y) = \infty$ and $\mathcal{L}g(x,y) \leq d_ng(x,y)$ for $(x,y) \in (0,n]^2$, then $\mathbb{P}{\tau_0 < \infty} = 0$.
Extinction criterion (Proposition 2.2): If $g(x_0,y_0) > 0$, $\sup_{x,y>0} g(x,y) < \infty$, $g(x,y) \leq 0$ for $x \vee y \geq v$, and $\mathcal{L}g(x,y) \geq dg(x,y)$ for $(x,y) \in (0,v]^2$, then $\mathbb{P}{\tau_0 < \infty} > 0$.
Test function construction: - For non-extinction: $g(x,y) = x^{-\rho_1} + y^{-\rho_2}$ with $\rho_1, \rho_2 > \theta_1 \vee \theta_2$ - For extinction: $g(x,y) = c + h(x,y)$ where $h(x,y) = -(x^{\rho_1} + y^{\rho_2})^\rho$ with modifications
Generator computation: The infinitesimal generator is:
\[\mathcal{L}g(x,y) = a_1x^{\theta_1}y^{\kappa_1}g_x'(x,y) + a_2y^{\theta_2}x^{\kappa_2}g_y'(x,y) - b_{10}x^{r_{10}}g_x'(x,y) - b_{20}y^{r_{20}}g_y'(x,y) + \text{diffusion and jump terms}\]
Toy example application: For the simplified case, using test function $g(x,y) = x^{-1} + y^{-1}$, the generator becomes:
\[\mathcal{L}g(x,y) = -a_1x^{-1}y + -a_2y^{-1}x + b_{10}x^{-1} + b_{20}y^{-1} + \text{diffusion terms}\]
Non-extinction occurs when the negative enhancement terms dominate the positive drift terms for small $x,y$.

Novelty & Lineage

This work extends the Chen’s criteria framework from one-dimensional continuous-state branching processes to two-dimensional mutually enhancing systems. Prior work includes:

Li (2000): Established extinction criteria for one-dimensional nonlinear CSBPs using Chen’s criteria
Palau & Pardo (2017, 2018): Studied CSBPs with competition (negative interaction)
Yang et al. (2022, 2023): Analyzed competitive Lotka-Volterra models with mutual inhibition

Key novelties:

First systematic study of mutually enhancing (positive interaction) continuous-state population dynamics
Development of new two-dimensional test functions that cannot be reduced to one-dimensional cases
Sharp characterization of critical cases where $(r_1 + 1 - \theta_1)(r_2 + 1 - \theta_2) = \kappa_1\kappa_2$
Novel technical conditions distinguishing drift-dominated vs. diffusion-dominated regimes

The test functions and technical conditions are substantially different from competitive models and require fundamentally new approaches due to positive feedback effects.

Assessment: SIGNIFICANT

Proof Techniques

The proofs employ generalized Chen’s criteria with carefully constructed two-dimensional test functions and martingale techniques.

Key proof strategy:

Test function design: For non-extinction, use functions of the form $g(x,y) = x^{-\rho_1} + y^{-\rho_2}$ where the negative powers ensure $g(x,y) \to \infty$ as $(x,y) \to 0$. For extinction, use $g(x,y) = c + h(x,y)$ with $h(x,y) = -(x^{\rho_1} + y^{\rho_2})^\rho$.
Generator bounds: The critical inequality for non-extinction is:
\[a_1\rho_1x^{\theta_1-1-\rho_1}y^{\kappa_1} + a_2\rho_2y^{\theta_2-1-\rho_2}x^{\kappa_2} \geq C(x^{r_1-\rho_1} + y^{r_2-\rho_2})\]
This is established using Young’s inequality and the key constraint:
\[\frac{r_2 + 1 - \theta_2}{\kappa_2} > \frac{1 + \rho_2 + \kappa_1 - \theta_2}{1 + \rho_1 + \kappa_2 - \theta_1}\]
Critical case analysis: When $(r_1 + 1 - \theta_1)(r_2 + 1 - \theta_2) = \kappa_1\kappa_2$, the dichotomy depends on:
\[\left(\frac{a_1}{b_1}\right)^{1/(r_1+1-\theta_1)}\left(\frac{a_2}{b_2}\right)^{1/\kappa_2} \gtrless 1\]
Martingale argument: For extinction, construct $M_t^g = g(X_t, Y_t) - g(X_0, Y_0) - \int_0^t \mathcal{L}g(X_s, Y_s)ds$ and show it’s a supermartingale when $\mathcal{L}g \geq dg$.
Jump term estimates: For $\rho \in (0,1)$ and test function $(x^{\rho_1} + y)^\rho$, the key jump integral bound is:
\[\int_0^\infty K_z^1(x,y)\mu_1(dz) \geq \rho(1-\rho)\rho_1^2 v^2 d_1 - \rho\rho_1(\rho_1-1)v\tilde{d}_1\]
where $v = \frac{x^{\rho_1}}{x^{\rho_1} + y}$ and the first term dominates for large $\rho_1$.

Experiments & Validation

Purely theoretical. The paper provides complete mathematical characterization of extinction behavior through parameter conditions, but no numerical simulations or empirical validation.

Potential empirical validation would involve:

Numerical simulation of the SDE system for various parameter regimes
Verification of theoretical extinction probabilities through Monte Carlo estimation
Application to real population data with mutual enhancement (e.g., symbiotic species, cooperative breeding)

Limitations & Open Problems

Limitations:

RESTRICTIVE: Analysis limited to mutually enhancing interactions ($a_1, a_2 > 0$); mixed interaction case $a_1a_2 < 0$ remains open and is significantly more challenging.
TECHNICAL: Power function coefficients assumption ($\gamma_{ij}(x) = b_{ij}x^{r_{ij}}$) needed for tractable analysis but excludes other natural functional forms.
TECHNICAL: Independence assumption on Brownian motions and jump processes; correlated noise would be more realistic but mathematically harder.
RESTRICTIVE: Spectral positivity constraint on Lévy measures ($\alpha_i \in (1,2)$) excludes other jump behaviors like infinite variation processes.
TECHNICAL: Test function construction is largely ad hoc without systematic design principles, limiting extension to higher dimensions.

Open problems:
Mixed interaction case: Establish extinction criteria when one population enhances the other but receives competitive pressure ($a_1 > 0, a_2 < 0$).
Complete extinction conditions: Determine when $\mathbb{P}{\tau_0 < \infty} = 1$ rather than just $> 0$, which requires more sophisticated techniques.

High-Dimensional Gaussian Mean Estimation under Realizable Contamination

Authors: Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas · Institution: (high-relevance, institution unverified) · Category: cs.LG

Establishes the first information-computation gap for Gaussian mean estimation under realizable contamination, showing exponential Statistical Query complexity and providing a matching near-polynomial time algorithm.

Tags: statistical-query-lower-bounds robust-statistics missing-data gaussian-mean-estimation information-computation-gaps spectral-methods hermite-polynomials non-gaussian-component-analysis

arXiv · PDF

Problem Formulation

Motivation: This paper studies mean estimation for Gaussian distributions when data can go missing in a structured way—an adversary can selectively remove samples based on their values with probability up to ε. This models realistic scenarios like survey non-response or dropout in clinical trials, where missingness depends systematically on the data values.
Mathematical setup: Let $P = N(\mu, I_d)$ be a $d$-dimensional Gaussian with unknown mean $\mu \in \mathbb{R}^d$ and identity covariance. An adversary chooses a function $r: \mathbb{R}^d \to [0, \varepsilon]$ and each sample $x \sim P$ goes missing with probability $r(x)$. Equivalently, the adversary chooses $f: \mathbb{R}^d \to \mathbb{R}_+$ such that:
\[(1-\varepsilon)p(x) \leq f(x) \leq p(x)\]
where $p$ is the pdf of $P$. The observed distribution has pdf proportional to $f(x)$, with probability $1 - \int f(x)dx$ of observing the special symbol $\perp$ (missing).

Assumptions:
1. The contamination rate $\varepsilon \in (0,1)$ is known
2. The true covariance is $I_d$ (known)
3. Samples are drawn i.i.d. from the contaminated distribution
Toy example: When $d=1$, $\mu = 0.5$, $\varepsilon = 0.3$, and we want error $\delta = 0.1$, the adversary might choose $f(x)$ to preferentially remove samples in the tail regions where the pdfs of $N(0,1)$ and $N(0.5,1)$ differ significantly. The information-theoretic sample complexity becomes roughly $1/(\delta^2(1-\varepsilon)) + \exp((\log(1+\varepsilon/(1-\varepsilon))/\delta)^2)$.
Formal objective: Find an estimator $\hat{\mu}$ such that with probability at least $0.9$:
\[\|\hat{\mu} - \mu\|_2 \leq \delta\]

Method

The algorithm has three main phases:

List-decoding phase: Use a robust mean estimator to produce a list $L$ of candidate means, one of which is $O(\sqrt{\log(1/(1-\varepsilon))})$-close to the true mean.
Tournament selection: Apply a tournament-based procedure to select $\hat{\mu}_0 \in L$ that is approximately as good as the best candidate in the list.
Spectral dimension reduction: - Recenter samples: $S’ = {x - \hat{\mu}_0 : x \in S}$ - For each $t = 0, 1, \ldots, k$ where $k = \lceil C(1/\delta \log(1 + \varepsilon/(1-\varepsilon)))^2 \rceil$:
- Compute empirical Hermite moment tensor:
\[\hat{T}_t = \frac{1}{n}\sum_{x \in S'} H_t(x)\]
- Flatten $\hat{T}_t$ into matrix $M(\hat{T}_t) \in \mathbb{R}^{d \times d^{t-1}}$
- Find subspace $V_t$ spanned by singular vectors with singular values $> \eta = \varepsilon/(C(k+1)^{k/2})$
  - Set $V = \text{span}(V_1, \cup \cdots \cup V_k)$
  - Apply brute-force estimator to $\text{Proj}_V(S’)$ with error tolerance $\delta/2$
Output: Return $\hat{\mu} + \hat{\mu}_0$

Applied to toy example: With $d=1$, $\mu=0.5$, $\varepsilon=0.3$, $\delta=0.1$, we get $k \approx 9$. The algorithm would identify that the 4th moment significantly deviates from the standard Gaussian, leading to $\dim(V) = 1$, and the brute-force step reduces to one-dimensional robust estimation.

Novelty & Lineage

This paper establishes the first computational-information gap for mean estimation under realizable contamination, a model introduced by Ma et al. (2024). Ma et al. provided information-theoretic bounds but their estimator required exponential time. The key novelties are:

First Statistical Query lower bound showing exponential query complexity or exponential sample complexity
First polynomial-time algorithm (up to dimension-independent exponential factors) achieving near-optimal sample complexity
Novel connection to Non-Gaussian Component Analysis for proving SQ hardness
Spectral dimension reduction technique using Hermite moment tensors

Prior work: Ma et al. (2024) studied information theory; Diakonikolas-Kane-Stewart (2017) established SQ hardness for robust statistics but in different contamination models; robust statistics literature (Huber, 1964) studied different adversarial models.

SIGNIFICANT

Proof Techniques

The proof has two main components:

SQ Lower Bound via Moment Matching: The key insight is to construct an adversary that makes a shifted Gaussian $N(\delta, 1)$ match many moments of $N(0,1)$. The construction uses two steps: - First, design $g(x)$ satisfying $(1-\varepsilon)\phi(x-\delta) \leq g(x) \leq \phi(x-\delta)$ and $g(x)/(1-\varepsilon/2) = \phi(x)$ on interval $[-B,B]$ where $B = (1/\delta)\log(1 + \varepsilon/(2(1-\varepsilon/2)))$ - Then add polynomial correction $p(x)$ on $[-1,1]$ to match first $m = \Omega(\gamma^2/\log \gamma)$ moments where $\gamma = (1/\delta)\log(1 + \varepsilon/2/(1-\varepsilon/2))$

Key inequality from polynomial bounds:
\[|p(x)| \leq 4^m e^{-\beta^2/4} m^{m+2}\]
where $\beta = B - \delta$, leading to $\lvert p(x) \rvert \leq \varepsilon$ for appropriate choice of $m$.
Algorithm Analysis: The correctness relies on a structural lemma showing that if $\lvert v^T\mu \rvert > \delta/2$, then some moment of order $t \leq k$ must deviate by at least $\varepsilon$:

Key moment bound:
\[\mathbb{E}_{x \sim P'}[x^k] - \mathbb{E}_{z \sim N(0,1)}[z^k] > \varepsilon\]
when $k \geq 3(1/\delta \log(1 + 2\varepsilon/(1-\varepsilon)))^2$.

The dimension bound uses:
\[\dim(V) \leq \frac{\|\hat{T}_t\|_2^2}{\eta^2} \leq \frac{e^{O(k)}}{\varepsilon^2/(k+1)^k} = e^{O(k)}/\varepsilon^2\]

Experiments & Validation

Purely theoretical. The paper provides no experiments. Empirical validation would require:

Synthetic experiments comparing sample complexity to information-theoretic bounds across different values of $\varepsilon, \delta, d$
Comparison with existing robust estimators under different contamination models
Real-world datasets with naturally occurring structured missingness patterns
Runtime comparisons showing the exponential dependence on $\exp(\tilde{O}(k))/\varepsilon^2$ factor

Limitations & Open Problems

Limitations:

The algorithm assumes knowledge of the contamination rate $\varepsilon$ - TECHNICAL (standard in robust statistics literature)
Covariance is assumed to be identity - TECHNICAL (can likely be extended using standard whitening)
Runtime has dimension-independent but potentially large factor $\exp(\tilde{O}(k))/\varepsilon^2$ - RESTRICTIVE (makes algorithm impractical for small $\delta$ or large $\varepsilon$)
Requires $\delta \leq \varepsilon$ - TECHNICAL (larger $\delta$ can use existing robust methods)
Only handles Gaussian distributions - RESTRICTIVE (many applications involve non-Gaussian data)

Open problems:
Close the remaining poly-logarithmic gaps between the SQ lower bound and algorithmic upper bound
Extend to unknown covariance matrices and broader distribution families beyond Gaussians