Mar 21, 2026 Theory 8 papers

Theory Digest — Mar 21, 2026

Today’s Digest at a Glance

Preliminary

Today’s digest spans optimization theory breakthroughs, novel neural network architectures for scientific computing, and advanced statistical methods for high-dimensional problems.

Physics-Informed Neural Networks (PINNs)

Physics-Informed Neural Networks solve partial differential equations by embedding physical constraints directly into the neural network loss function. Traditional neural networks struggle with PDE boundary conditions and physical laws because they optimize purely on data without enforcing mathematical consistency. The core insight is to augment the standard data loss with physics-based penalty terms that measure how well the network satisfies the governing PDE.

Mathematically, for a PDE $\mathcal{N}[u] = f$ with boundary conditions, the PINN loss becomes:

\[\mathcal{L} = \mathcal{L}_{data} + \lambda_{pde} \mathcal{L}_{pde} + \lambda_{bc} \mathcal{L}_{bc}\]

where $\mathcal{L}_{pde} = \lvert \mathcal{N}[u_{\theta}] - f \rvert^2$ penalizes PDE residuals and $\mathcal{L}_{bc}$ enforces boundary conditions. The network learns to approximate solutions that are both data-consistent and physically plausible.

The key challenge is that standard activation functions often fail to capture the complex oscillatory or multi-scale behavior inherent in many physical systems, leading to poor convergence or inaccurate solutions.

Forward-Backward Stochastic Differential Equations (FBSDEs)

Forward-Backward Stochastic Differential Equations provide a framework for solving optimal control problems where the system state evolves forward in time while the adjoint variables (representing optimal controls) evolve backward from terminal conditions. Classical approaches fail when dealing with mean-field interactions because the optimal strategy depends on the population distribution, creating a circular dependency.

An FBSDE system consists of a forward SDE:

\[dX_t = b(t, X_t, Y_t, Z_t) dt + \sigma(t, X_t, Y_t, Z_t) dW_t\]

and a backward SDE:

\[dY_t = -g(t, X_t, Y_t, Z_t) dt + Z_t dW_t, \quad Y_T = \xi\]

The forward equation describes state dynamics while the backward equation captures the adjoint process encoding optimal decisions. The coupling between $(X_t, Y_t, Z_t)$ makes these systems highly nonlinear and challenging to solve numerically.

Spike-and-Slab Priors

Spike-and-slab priors address variable selection in high-dimensional Bayesian models where most features are irrelevant. Standard continuous priors cannot effectively perform feature selection because they assign non-zero probability to all parameters. The spike-and-slab framework uses a mixture prior combining a “spike” (concentrated near zero) and a “slab” (diffuse) component.

For each parameter $\theta_j$, the prior is:

\[\pi(\theta_j) = (1-\gamma_j) \delta_0(\theta_j) + \gamma_j \mathcal{N}(0, \tau^2)\]

where $\delta_0$ is a point mass at zero (the spike), $\mathcal{N}(0, \tau^2)$ is a diffuse normal (the slab), and $\gamma_j \in [0,1]$ controls the selection probability. This creates a natural mechanism for automatic feature selection during posterior sampling.

Reading Guide

Papers 1 and 7 tackle different aspects of iterative optimization—Adam’s convergence guarantees for convex problems and bandit-based controller selection for dynamic systems. Papers 2 and 4 both address mathematical modeling on complex spaces: adaptive activations for PINNs and geometric formulations of stochastic processes on manifolds. Papers 3, 5, 6, and 8 focus on advanced statistical inference under various challenges—robust game theory, high-dimensional spline estimation, representation learning, and Bayesian variable selection for matrix completion.

Uniform a priori bounds and error analysis for the Adam stochastic gradient descent optimization method

Authors: Steffen Dereich, Thang Do, Arnulf Jentzen · Institution: University of Münster, Chinese University of Hong Kong Shenzhen · Category: cs.LG

First rigorous proof that Adam optimization iterates remain uniformly bounded for strongly convex problems, enabling unconditional convergence analysis.

Tags: optimization theory stochastic gradient descent Adam optimizer uniform bounds strongly convex optimization convergence analysis Lyapunov methods

arXiv · PDF

Problem Formulation

Motivation: The Adam optimizer is the most popular stochastic gradient descent method for training deep neural networks, yet even for strongly convex problems, all previous convergence analyses assume Adam remains bounded. This foundational assumption has never been rigorously justified.
Mathematical setup: Let $(\Omega, \mathcal{F}, P)$ be a probability space. Consider a stochastic optimization problem where $d \in \mathbb{N}$ is the parameter dimension and $\mathcal{D} \in \mathbb{N}$ is the data dimension. Let $X_{n,m}: \Omega \to U \subset \mathbb{R}^{\mathcal{D}}$ be i.i.d. random variables representing data, where $U$ is compact and convex. The loss function is $L: \mathbb{R}^d \times \mathbb{R}^{\mathcal{D}} \to \mathbb{R}$ of class $C^2$. The Adam iterates are defined as:
\[\Theta^{k,M,\beta}_{0} = \xi \in \mathbb{R}^d\] \[\Theta^{k,M,\beta}_{n,i} = \beta_k \Theta^{k,M,\beta}_{n-1,i} + (1-\beta_k)\left(\frac{1}{M}\sum_{m=1}^M (\nabla_{\theta_i}L)(\Theta^{0,M,\beta}_{n-1}, X_{n,m})\right)^k\] \[\Theta^{0,M,\beta}_{n,i} = \Theta^{0,M,\beta}_{n-1,i} - \gamma_n [1-(\beta_1)^n]^{-1} \left(\epsilon + \sqrt{[1-(\beta_2)^n]^{-1}\Theta^{2,M,\beta}_{n,i}}\right)^{-1} \Theta^{1,M,\beta}_{n,i}\]
Assumptions:
1. For all $x \in U$, the function $\theta \mapsto L(\theta,x) - q\lvert \theta \rvert^2$ is convex for some $q \in (0,1/2)$ (strong convexity)
2. The gradient satisfies local Lipschitz conditions with polynomial growth
3. The step sizes $(\gamma_n)_{n \in \mathbb{N}}$ are non-increasing and satisfy certain tail conditions
Toy example: When $d=2$, $L(\theta,x) = \lvert \theta - x \rvert^2$, and $X_{n,m}$ are i.i.d. Gaussian vectors, the problem reduces to finding $\mathbb{E}[X_{1,1}]$. The Adam updates become momentum-accelerated steps toward the sample mean, but without uniform bounds, the iterates could potentially diverge.
Formal objective: Establish that
\[\sup_{n \in \mathbb{N}} \|\Theta^{0,M,\beta}_n\| < \infty \quad \text{almost surely}\]
and prove convergence rates of the form
\[\mathbb{E}[\|\Theta^{0,M,\beta}_n - \vartheta\|^p]^{1/p} \leq \mathcal{C}_0 M^{-1}(1-\beta_2 \mathbf{1}_{[\mathcal{C}_0,\infty)}(M)) + \mathcal{C}_\beta \sqrt{\gamma_n}\]

Method

The method establishes uniform a priori bounds through a Lyapunov-like analysis. The key steps are:

Define an auxiliary energy function:
\[H_n = L(\Theta_n, x) + \frac{1}{2(1-\alpha)} \sum_{i=1}^d (m_{n,i})^2 V_{n,i}\]
where $V_{n,i}$ are scaled step sizes and $m_{n,i}$ are the first moment estimates.
Prove that $H_n$ is a discrete supermartingale by showing:
\[H_n - H_{n-1} \leq A\tau_1 \eta^2 (\gamma_n)^2 - \frac{1}{4\chi} \sum_{i=1}^d \left(\frac{1}{M}\sum_{j=1}^M G^1_i(\Theta_{n-1}, X_{n,j})\right)^2 V_{n,i}\]
Use the strong convexity to establish:
\[\kappa \|\theta - \vartheta\|^2 \leq \langle\theta - \vartheta, \nabla_\theta L(\theta,x) - \nabla_\theta L(\vartheta,x)\rangle\]
Combine energy decay with gradient bounds to prove:
\[L(\Theta_n, x) - L(\vartheta, x) \leq A(1 + \Gamma)\]
where $\Gamma = \sum_{n=1}^\infty \tau_1 \eta^2 (\gamma_n)^2$.

Applied to the toy example: For $L(\theta,x) = \lvert \theta - x \rvert^2$ with $d=2$, the energy function becomes $H_n = \lvert \Theta_n - \mathbb{E}[X] \rvert^2 + \text{momentum terms}$. The uniform bound gives $\lvert \Theta_n \rvert \leq \lvert \mathbb{E}[X] \rvert + C$ for some universal constant $C$.

Novelty & Lineage

This is the first work to establish uniform a priori bounds for Adam without assuming boundedness. Previous works by Chen et al. (2018) and Defossez & Bach (2022) provided conditional convergence analyses that assumed Adam stays bounded, but this assumption was never verified. The work extends Kingma & Ba (2014)’s original Adam proposal by providing the missing theoretical foundation.

The technical innovation is using a carefully constructed Lyapunov function that captures both the optimization progress and the adaptive momentum dynamics. Prior a priori bound attempts (e.g., Levy 2021) only worked for specific cases like $L(\theta,x) = \lvert \theta - x \rvert^2$ and failed for general strongly convex problems.

SIGNIFICANT

Proof Techniques

The proof uses a discrete Lyapunov analysis with several key technical innovations:

Energy function construction: The Lyapunov function combines the loss with scaled momentum terms:
\[H_n = L(\Theta_n, x) + \frac{1}{2(1-\alpha)} \sum_{i=1}^d (m_{n,i})^2 V_{n,i}\]
Supermartingale property: The key inequality establishing energy decay is:
\[H_n - H_{n-1} \leq \frac{K}{2} \sum_{i=1}^d (m_{n,i})^2 (V_{n,i})^2 \left(1 + \sum_{i=1}^d |m_{n,i}|V_{n,i}^{\tau_1} + 2\|\Theta_{n-1}\|^{\tau_1}\right) - \frac{1}{4} \sum_{i=1}^d (m_{n,i})^2 V_{n,i}\]
Strong convexity exploitation: Uses the fundamental bound:
\[\kappa \|\theta_1 - \theta_2\|^2 \leq \langle\theta_1 - \theta_2, \frac{1}{M}\sum_{j=1}^M \nabla_\theta L(\theta_1, x_j) - \frac{1}{M}\sum_{j=1}^M \nabla_\theta L(\theta_2, x_j)\rangle\]
Moment control via Lemma 2.2: For the Adam moment updates $m_n = \alpha m_{n-1} + (1-\alpha)x_n$ and $V_n = \beta V_{n-1} + (1-\beta)\lvert x_n \rvert^2$:
\[\frac{|m_n|}{1-\alpha^n} \left(\epsilon + \sqrt{\frac{V_n}{1-\beta^n}}\right)^{-1} \leq \frac{\sum_{k=0}^{n-1} \beta^{k/2}}{\sqrt{1-\alpha^2\beta^{-1}}} + \frac{|m_0|}{\epsilon(1-\alpha)}\]
Uniform bound extraction: The critical step showing that if the energy is controlled, then:
\[\|\Theta_n - \vartheta\|^2 \leq \frac{2}{\kappa}(L(\Theta_n, x) - L(\vartheta, x)) \leq \frac{2A(1+\Gamma)}{\kappa}\]
The proof handles three challenging cases: large parameter norm, large gradient norm, and the remaining case through careful polynomial growth estimates.

Experiments & Validation

Purely theoretical. The paper provides no experiments but gives concrete convergence rates for quadratic problems of the form:

\[\arg\min_{\theta \in \mathbb{R}^d} \mathbb{E}\left[\|A_0\theta - f_0(X)\|^2 + \sum_{i=1}^v (\|A_i\theta\|^2 + \mu_i)^{r_i} f_i(X)\right]\]

Empirical validation would require:

Verifying the uniform bounds hold in practice for neural network training
Comparing the theoretical convergence rates with observed rates
Testing whether the polynomial growth assumptions on gradients are satisfied in deep learning applications.

Limitations & Open Problems

Strong convexity assumption - RESTRICTIVE: Real neural networks are highly non-convex, severely limiting practical applicability.
Polynomial gradient growth conditions in equation (1) - TECHNICAL: The assumption $\sum_{i=1}^d [\lvert \nabla_{\theta_i}\nabla_\theta L \rvert_2 + (1+\lvert \theta \rvert)^{-r}\lvert \nabla_{\theta_i}\nabla_x L \rvert_2] \leq p$ may be removable with more sophisticated analysis.
Compact data domain $U$ - TECHNICAL: Real data is typically unbounded, though this could likely be relaxed.
Mini-batch size $M$ must be sufficiently large - NATURAL: The bound deteriorates as $M \to 1$, consistent with practice where larger batches help Adam.
Step size conditions - NATURAL: The requirements on $(\gamma_n)$ are standard for convergence analysis.

Open problems:
Extend the uniform bounds to non-convex objectives relevant to neural network training
Remove the polynomial growth assumptions and establish bounds under minimal regularity conditions

A Family of Adaptive Activation Functions for Mitigating Failure Modes in Physics-Informed Neural Networks

Authors: Krishna Murari · Institution: IIT Madras · Category: cs.LG

Introduces five adaptive wavelet-based activation functions that significantly improve PINN performance on benchmark PDEs by combining trainable wavelet components with hyperbolic tangent nonlinearity.

Tags: physics-informed-neural-networks activation-functions wavelets partial-differential-equations scientific-computing deep-learning adaptive-methods

arXiv · PDF

Problem Formulation

Motivation: Physics-Informed Neural Networks (PINNs) struggle with failure modes including poor convergence, inability to capture oscillatory patterns, and accuracy degradation over time when solving partial differential equations. These issues stem from the choice of activation functions, which critically affect gradient flow and expressiveness in neural networks.
Mathematical setup: Let $D \subset \mathbb{R}^d$ be an open bounded spatial domain with boundary $\partial D$. Consider a general time-dependent PDE:
\[\mathcal{D}[u(x,t)] = f(x,t), \quad (x,t) \in D\] \[\mathcal{B}[u(x,t)] = g(x,t), \quad (x,t) \in \partial D\] \[\mathcal{I}[u(x,0)] = h(x), \quad x \in D\]
where $u(x,t)$ is the unknown solution, $\mathcal{D}$ is the differential operator, $\mathcal{B}$ enforces boundary conditions, and $\mathcal{I}$ enforces initial conditions. Let $\hat{u}(x;\theta)$ represent a feed-forward network with parameters $\theta = {W_\ell, b_\ell}_{\ell=1}^L$. The PINN loss function is:
\[L_{\text{PINNs}} = \lambda_R L_R + \lambda_B L_B + \lambda_I L_I\]
where $L_R$, $L_B$, $L_I$ are mean-squared errors for residual, boundary, and initial condition terms respectively.
Toy example: For the 1D reaction equation with $\rho = 5$:
\[\frac{\partial u}{\partial t} - 5u(1-u) = 0, \quad x \in [0,2\pi], t \in [0,1]\]
with initial condition $u(x,0) = \exp(-(x-\pi)^2/(2(\pi/4)^2))$ and periodic boundary conditions. Standard tanh activation yields high relative error (rRMSE ≈ 0.973).
Formal objective: Design adaptive wavelet-based activation functions $\psi(x)$ that minimize the PINN loss while improving training stability and reducing approximation error compared to standard activations like tanh.

Method

The method introduces five novel adaptive activation functions combining wavelets with trainable parameters:

SoftMexTanh combines Mexican hat wavelet with hyperbolic tangent:
\[\psi_{\text{SoftMexTanh}}(x) = \tanh(\beta x)(1 - \gamma x^2)\exp(-\alpha x^2)\]
SoftMorTanh uses Morlet wavelet structure:
\[\psi_{\text{SoftMorTanh}}(x) = \cos(\omega x)\exp\left(-\frac{x^2}{2\sigma^2}\right)\tanh(\beta x)\]
SoftGaussTanh employs Gaussian envelope:
\[\psi_{\text{SoftGaussTanh}}(x) = \tanh(\beta x)\exp(-\alpha x^2)\]
SoftGaborTanh combines Gabor function elements:
\[\psi_{\text{SoftGaborTanh}}(x) = \tanh(\beta x)\exp\left(-\frac{x^2}{2\sigma^2}\right)\cos(\omega x)\]
SoftHerTanh uses Hermite polynomials:
\[\psi_{\text{SoftHerTanh}}(x) = \tanh(\beta x)H_n(x)\exp(-\alpha x^2)\]
All parameters $\alpha, \beta, \gamma, \omega, \sigma$ are made trainable using softplus transformation: $\alpha = \text{softplus}(\alpha_0) = \ln(1 + \exp(\alpha_0))$ to ensure positivity.

Applied to the toy reaction equation: Using SoftGaborTanh with trainable parameters reduces rRMSE from 0.973 (tanh) to 0.0021, demonstrating substantial improvement in approximation accuracy.

Novelty & Lineage

This work extends prior research on adaptive activation functions in PINNs. Previous work includes GaborPINN [19] using multiplicative filtered networks, wavelet-based activations in [43,57], and various adaptive schemes in [1,21,26,44]. The key novelty is the systematic combination of wavelet functions (Mexican hat, Morlet, Gaussian, Hermite, Gabor) with trainable hyperbolic tangent components and softplus parameterization. Unlike previous fixed wavelet activations, these functions adapt their shape during training through gradient-based optimization. The approach addresses specific PINN failure modes identified in [24,57] through principled wavelet design rather than architectural modifications. INCREMENTAL.

Proof Techniques

This is primarily an empirical study without formal theoretical proofs. The validation approach uses:

Experimental comparison across four benchmark PDEs (1D reaction, 1D wave, 1D convection, 2D Navier-Stokes)
Error metrics comparison using relative mean absolute error (rMAE) and relative root mean square error (rRMSE):
\[\text{rRMSE} = \sqrt{\frac{\sum_{n=1}^{N_P} |\hat{u}(x_n,t_n) - u(x_n,t_n)|^2}{\sum_{n=1}^{N_P} |u(x_n,t_n)|^2}}\]
Statistical analysis using bar plots to demonstrate consistent performance improvements
Convergence analysis through training loss trajectories over 1000 L-BFGS iterations

The main technical insight is that wavelet localization properties combined with adaptive parameterization provide better gradient flow and feature representation for PDE solutions with oscillatory or multi-scale behavior.

Experiments & Validation

Datasets: Four benchmark PDEs - 1D reaction equation (ρ=5), 1D wave equation (β’=3), 1D convection equation (β’=50), 2D Navier-Stokes equations (λ₁=1, λ₂=0.01). Training uses 10,201 collocation points for 1D cases, 2,500 for 2D case.

Baselines: Standard PINNs with tanh activation [37], PINNsFormer [57], QRes [4], FLS [51], PINN-Mamba [52], ML-PINN [14].

Key results: SoftGaborTanh achieves rRMSE of 0.0021 vs 0.973 for tanh on reaction equation. For wave equation, SoftHer2Tanh achieves 0.019 vs 0.223 for tanh. Training efficiency: PINN-SoftGaborTanh runs at 9.01 it/s vs 6.54 it/s for PINNsFormer.

Hardware: Single NVIDIA A100 GPU with 32GB memory, implemented in PyTorch, L-BFGS optimizer with 1000 iterations.

Limitations & Open Problems

Limited theoretical analysis - TECHNICAL (empirical validation without convergence guarantees or approximation theory)
Hyperparameter sensitivity - TECHNICAL (initial parameter values for ω₀ set empirically to 3 or 5 for Gabor functions)
Problem-specific performance - RESTRICTIVE (some activations require non-trainable β parameter for certain PDEs like convection)
Computational overhead - NATURAL (additional trainable parameters increase memory and computation compared to fixed activations)
Limited to specific PDE types - RESTRICTIVE (tested only on four benchmark problems, generalization unclear)

Open problems:
Develop theoretical framework connecting wavelet properties to PINN approximation guarantees for different PDE classes
Establish adaptive parameter initialization strategies that eliminate empirical tuning requirements

Robust mean-field games under entropy-based uncertainty

Authors: François Delarue, Pierre Lavigne · Institution: Université Côte d’Azur · Category: math.OC

Introduces entropy-penalized robust mean-field games where Nature distorts probability measures and equilibrium is defined under the effective measure rather than reference measure.

Tags: mean-field-games robust-control risk-aversion entropy-regularization FBSDE stochastic-control model-uncertainty quantitative-finance

arXiv · PDF

Problem Formulation

Motivation: In finance and economics, agents often face model uncertainty about market dynamics and interaction effects. Traditional mean-field games assume risk-neutral agents under a reference probability measure, but real decision-makers are risk-averse and must account for worst-case scenarios when the true model is unknown.
Mathematical setup: Let $(Ω, F, P)$ be a probability space supporting a $d$-dimensional Brownian motion $W$ and independent initial condition $η \in \mathbb{R}^n$. The representative agent controls process $\psi$ with state dynamics:
\[dX_t = b(t, X_t, \psi_t)dt + σ(t, \psi_t)dW_t, \quad X_0 = η\]
Nature controls density process $q$ satisfying:
\[q_t = 1 + \int_0^t q_s Y^*_s ds + \int_0^t q_s Z^*_s \cdot dW_s\]
The agent’s robust objective is:
\[\inf_{\psi \in A} \sup_{q \in Q} J[\mu](\psi, q) = E\left[q_T g(X^{\psi}_T, \mu) + \int_0^T q_s \ell(s, \psi_s) ds\right] - S(q)\]
where $S(q) = E\left[\int_0^T q_s f^*(s, Y^*_s, Z^*_s) ds\right]$ is the generalized entropy.

Assumptions:
1. Initial condition $η \in L^{\infty}(F_0, \mathbb{R}^n)$ and drift $b$ is linear
2. Volatility $σ$ is linear in control
3. Driver $f$ is progressively measurable, convex, with quadratic growth
4. Running cost $\ell$ is strongly convex with quadratic growth
5. Terminal cost $g$ is convex in state, continuous in measure
Toy example: When $n=d=1$, $b(t,x,\psi) = \psi$, $σ(t,\psi) = 1$, $f^*(t,y^*,z^*) = \frac{1}{2}\lvert z^* \rvert^2$, and $g(x,\mu) = \frac{1}{2}x^2 - \lambda x\bar{\mu}$ where $\bar{\mu}$ is the mean of $\mu$. This reduces to a risk-averse portfolio problem where Nature distorts probabilities via relative entropy.
Formal objective: Find equilibrium $(\psi^*, q^*, \mu^*) \in A \times Q \times M(\mathbb{R}^n)$ such that:
\[\mu^* = (q^*_T P) \circ (X^{\psi^*}_T)^{-1}\]

Method

The method uses forward-backward stochastic differential equations (FBSDEs) to characterize equilibria via Pontryagin’s principle.

For fixed measure $\mu$, solve the coupled FBSDE system: - Representative player FBSDE: $(ψ, p, k, X)$ solving
\[-dp_t = b_t^T p_t dt - k_t dW_t, \quad p_T = q_T ∇_x g(X_T, \mu)\] \[dX_t = b(t, X_t, ψ_t)dt + σ(t, ψ_t)dW_t, \quad X_0 = η\] \[ψ_t \in \arg\min_α H(t, X_t, α, p_t, k_t, q_t)\]
Nature’s FBSDE: $(Y, Z, q)$ solving
\[-dY_t = (f(t, Y_t, Z_t) + \ell(t, ψ_t))dt - Z_t \cdot dW_t, \quad Y_T = g(X_T, \mu)\] \[dq_t = q_t Y^*_t dt + q_t Z^*_t \cdot dW_t, \quad q_0 = 1\] \[(Y^*_t, Z^*_t) = (∂_y f(t, Y_t, Z_t), ∇_z f(t, Y_t, Z_t))\]
Apply Schauder fixed point theorem to map $Φ: μ ↦ (q^μ_T P) \circ (X^μ_T)^{-1}$

Toy example application: For the portfolio case, the optimal control becomes $ψ_t = -(q_t \nabla_ψ \ell(t, ψ_t) + p_t)^{-1}$, and Nature’s density follows $dq_t = q_t Z_t \cdot dW_t$ where $Z_t$ is determined by the quadratic BSDE structure.

Novelty & Lineage

This work extends mean-field games to incorporate explicit model uncertainty via entropy-penalized robust optimization. Prior risk-sensitive MFGs (Djehiche et al., Lacker) used exponential utilities, while robust MFGs (Bauso et al.) considered H-infinity control without entropy structure.

Key novelty: The equilibrium condition uses the effective measure $q_T P$ rather than reference measure $P$, ensuring consistency between agent’s risk perception and mean-field formation. This differs from classical approaches where agents are risk-averse individually but aggregate under the reference measure.

The FBSDE characterization extends beyond standard theory by coupling the agent’s and Nature’s optimization problems simultaneously, requiring new solvability results for quadratic BSDEs with unbounded terminal conditions.

SIGNIFICANT

Proof Techniques

Existence via Schauder fixed point theorem requires three key steps: - Uniform bounds: Establish entropy estimates
\[\sup_{\mu \in M^{2-r}(\mathbb{R}^n)} S(q^{\mu}) \leq C_1\]
```
using convexity of terminal cost $g$
```
Continuity of the fixed point map $Φ$ using stability estimates:
\[E\left[\tilde{q}^{\mu}_T (g(X^{\mu}_T, \tilde{\mu}) - g(X^{\mu}_T, \mu)) + q^{\mu}_T (g(X^{\tilde{\mu}}_T, \mu) - g(X^{\tilde{\mu}}_T, \tilde{\mu}))\right] \geq c E\left[\int_0^T (q^{\mu}_t + \tilde{q}^{\mu}_t)(|\tilde{\psi}^{\mu}_t - \psi^{\mu}_t|^2 + |\tilde{Y}^{*,\mu}_t - Y^{*,\mu}_t|^2) dt\right]\]
Strong convexity inequality from Fenchel duality:
\[f^*(t, ∂_y f(t,y,z) + h_y, ∇_z f(t,y,z) + h_z) \geq f^*(∂_y f(t,y,z), ∇_z f(t,y,z)) + y \cdot h_y + z \cdot h_z + c(|h_y|^2 + |h_z|^2)\]
Uniqueness via joint flat anti-monotonicity and displacement monotonicity:
\[\langle \mu_1 - \mu_2, G(\cdot, \mu_1) - G(\cdot, \mu_2) \rangle + \langle \mu_1 - \mu_2, \nabla \Phi(\mu_1) - \nabla \Phi(\mu_2) \rangle \leq 0\]
```
where $G$ captures mean-field interaction and $\Phi$ represents displacement structure.
```
Key technical insight: Pinsker’s inequality controls total variation of Nature’s measures, enabling tightness arguments for compactness.

Experiments & Validation

Purely theoretical. The paper provides no numerical experiments or empirical validation.

Empirical validation would require:

Finite-player game simulations to verify mean-field convergence rates
Portfolio optimization examples with specific utility functions and market parameters
Comparison with classical risk-sensitive and robust control benchmarks
Computational algorithms for solving the coupled FBSDE systems numerically

Limitations & Open Problems

Linear state dynamics required - TECHNICAL (needed for FBSDE solvability but could potentially be extended to nonlinear case with additional regularity)
Quadratic growth conditions on costs - TECHNICAL (standard in BSDE theory but limits applicability to exponential utilities)
Deterministic terminal cost function $g$ - TECHNICAL (could allow randomness with more complex analysis)
Continuity assumption A9 on mean-field coupling - RESTRICTIVE (significantly narrows class of admissible interaction functions)
No explicit convergence rates for $N$-player approximation - TECHNICAL (provides $ε$-Nash results but not quantitative bounds)

Open problems:
Extend to nonlinear dynamics while maintaining FBSDE solvability for quadratic BSDEs with unbounded terminals
Develop computational methods for solving the coupled FBSDE systems in practice

Global Tensor Field Formulation of the Fokker-Planck Equation on Riemannian Manifolds

Authors: Taeyoung Lee, Gregory S. Chirikjian · Institution: George Washington University, University of Delaware · Category: math.PR

Develops coordinate-free formulations of Fokker-Planck equations on Riemannian manifolds using tensor fields and geometric operators, unifying stochastic dynamics with differential geometry.

Tags: stochastic_processes differential_geometry fokker_planck riemannian_manifolds tensor_fields geometric_probability stochastic_analysis

arXiv · PDF

Problem Formulation

Motivation: Classical formulations of the Fokker-Planck equation on Riemannian manifolds are expressed in local coordinates, obscuring geometric structure and complicating analysis of globally defined processes. A coordinate-free, intrinsic formulation is needed to capture diffusion and probability transport consistent with manifold geometry.
Mathematical setup: Let $(M, g)$ be an $n$-dimensional Riemannian manifold with metric $g$. Consider a stochastic process $x(t)$ on $M$ with infinitesimal generator $A : C^\infty(M) \to C^\infty(M)$ defined by
\[(Af)(x_0) = \lim_{t \to 0^+} \frac{E[f(x(t)) | x(0) = x_0] - f(x_0)}{t}\]
for $f \in C^\infty(M)$ and $x_0 \in M$. The L2-inner product on $M$ is
\[\langle p, q \rangle = \int_M p(x)q(x) d\mu(x)\]
where $d\mu$ is the Riemannian volume form. The adjoint operator $A^*$ satisfies
\[\langle p, Aq \rangle = \langle A^*p, q \rangle\]
for all $p, q \in C^\infty_c(M)$. The probability density $p_t(x)$ evolves according to the abstract Fokker-Planck equation
\[\frac{\partial p_t}{\partial t} = A^* p_t\]
Assumptions:
1. $(M, g)$ is a complete Riemannian manifold
2. All vector fields are smooth
3. The stochastic process has continuous sample paths
4. The probability density has compact support
Toy example: Consider the 2-sphere $S^2$ with standard metric and orthonormal frame $E_\theta = \frac{\partial}{\partial \theta}$, $E_\phi = \frac{1}{\sin \theta} \frac{\partial}{\partial \phi}$. For isotropic diffusion with $\sigma_1 = E_\theta$, $\sigma_2 = E_\phi$, the diffusion tensor becomes
\[D = E_\theta \otimes E_\theta + E_\phi \otimes E_\phi\]
The core difficulty is expressing curvature-dependent terms like $\nabla_{E_\phi} E_\phi = -\cot \theta E_\theta$ in coordinate-free form.
Formal objective: Derive coordinate-free expressions for both Stratonovich and Itô forms of the Fokker-Planck equation:
\[\frac{\partial p_t}{\partial t} = A^* p_t\]
where $A^*$ is expressed intrinsically using geometric operators like divergence and Lie derivatives.

Method

The method develops two coordinate-free formulations:

Stratonovich Formulation:

Express the generator using Lie derivatives for SDE $dx = X(x)dt + \sum_{i=1}^m \sigma_i(x) \circ dW_i$:
\[Af = X[f] + \frac{1}{2} \sum_{i=1}^m \sigma_i[\sigma_i[f]]\]
Apply integration-by-parts identity $\langle p, X[q] \rangle = \langle -\text{div}_\mu(pX), q \rangle$ to derive adjoint:
\[\frac{\partial p_t}{\partial t} = -\text{div}_\mu(p_t X) + \frac{1}{2} \sum_{i=1}^m \text{div}_\mu(\text{div}_\mu(p_t \sigma_i) \sigma_i)\]
Itô Formulation:
Introduce diffusion tensor field for SDE $dx = \tilde{X}(x)dt + \sum_{i=1}^m \sigma_i(x)dW_i$:
\[D = \sum_{i=1}^m \sigma_i \otimes \sigma_i\]
Express generator using tensor contraction with Hessian:
\[Af = \tilde{X}[f] + \frac{1}{2}(D : \text{Hess}f)\]
Apply tensor field integration-by-parts $\langle \nabla T, S \rangle_T = \langle T, -\text{div}_\mu(S) \rangle_T$ to derive:
\[\frac{\partial p_t}{\partial t} = -\text{div}_\mu(p_t \tilde{X}) + \frac{1}{2} \text{div}_\mu(\text{div}_\mu(p_t D))\]
Application to toy example: On $S^2$ with $D = E_\theta \otimes E_\theta + E_\phi \otimes E_\phi$, the inner divergence yields:
\[\text{div}_\mu(p_t D) = \frac{\partial p_t}{\partial \theta} E_\theta + \frac{1}{\sin \theta} \frac{\partial p_t}{\partial \phi} E_\phi + p_t \cot \theta E_\theta\]
The outer divergence then produces the complete Fokker-Planck equation with curvature terms automatically incorporated.

Novelty & Lineage

This work extends classical formulations by Yosida and Itô who established Fokker-Planck equations on manifolds but in local coordinates. Recent work by Huang (2022) presented a coordinate-free Stratonovich form by converting coordinate expressions, but derived it indirectly.

Novel contributions:

Direct geometric derivation of Stratonovich Fokker-Planck from first principles using Lie derivatives and divergence theorem
Introduction of diffusion tensor fields $D = \sum_i \sigma_i \otimes \sigma_i$ as intrinsic generalization of Euclidean diffusion matrices
Tensor field integration-by-parts framework extending classical identities to arbitrary tensor fields
Double divergence representation $\text{div}_\mu(\text{div}_\mu(p_t D))$ for Itô diffusion term, providing geometric interpretation of curvature-adjusted probability spreading
Complete coordinate-free Itô formulation not previously reported in literature

The framework unifies stochastic dynamics, differential geometry, and tensor analysis under single intrinsic structure, with compact proofs using geometric identities.

SIGNIFICANT

Proof Techniques

The proofs rely on three main geometric techniques:

1. Integration-by-parts on manifolds (Lemma 1): Key identity derived from Stokes’ theorem and divergence properties:

\[\langle p, X[q] \rangle = \langle -\text{div}_\mu(pX), q \rangle\]

Uses Cartan’s formula $L_X \mu = d i_X \mu$ and product rule for divergence:

\[\text{div}_\mu(fX) = f \text{div}_\mu(X) + L_X f\]

2. Tensor field integration-by-parts (Lemma 2): Constructs auxiliary vector field $V^k = T^I_J S^{Jk}_I$ from tensor contraction and applies divergence theorem:

\[\int_M \text{div}_\mu(V) d\mu = 0\]

Product rule decomposition yields:

\[\text{div}_\mu(V) = \nabla T : S + T : \text{div}_\mu(S)\]

Leading to general identity:

\[\langle \nabla T, S \rangle_T = \langle T, -\text{div}_\mu(S) \rangle_T\]

3. Tensor contraction with Hessian (Theorem 4): Shows diffusion tensor contracts with Hessian via bilinearity:

\[D : \text{Hess}f = \sum_{i=1}^m \text{Hess}f(\sigma_i, \sigma_i)\]

Uses orthonormal frame expansion and symmetry of Hessian to establish:

\[\sum_{i=1}^m \text{Hess}f(\sigma_i, \sigma_i) = D^{jk}(\text{Hess}f)_{jk}\]

The key technical insight is extending classical divergence theorem identities to arbitrary tensor fields through systematic use of product rules and frame-independent contractions.

Experiments & Validation

Purely theoretical. The paper provides a detailed worked example on the 2-sphere $S^2$ demonstrating coordinate-free calculations, including explicit computation of Christoffel symbols, covariant derivatives of orthonormal frames, and resulting Fokker-Planck equations in both Stratonovich and Itô forms.

Empirical validation would involve:

Numerical verification that coordinate-free formulations match classical coordinate-based expressions
Implementation of geometric Brownian motion simulation on curved manifolds
Comparison with existing manifold-based filtering and estimation algorithms
Applications to robotics problems involving motion on configuration manifolds like $SO(3)$ or $SE(3)$

Limitations & Open Problems

Limitations:

Complete Riemannian manifolds assumption - NATURAL (standard in stochastic analysis)
Compact support for probability densities - TECHNICAL (needed for integration-by-parts, removable with careful boundary analysis)
Smooth vector fields and continuous sample paths - NATURAL (standard regularity assumptions)
Focus on time-homogeneous processes - TECHNICAL (extension to time-dependent coefficients straightforward)

Open Problems:
Jump-diffusion processes: Extend framework to include Lévy processes with discontinuous sample paths, requiring analysis of jump measures on manifolds
Infinite-dimensional manifolds: Develop tensor field formulation for stochastic PDEs on function spaces and path manifolds

Highly Adaptive Empirical Risk Minimization with Principal Components

Authors: Carlos García Meixide, Mingxun Wang, Alejandro Schuler, Mark J. van der Laan · Institution: UC Berkeley, ICMAT (Spain) · Category: math.ST

Introduces Principal Component Highly Adaptive (PC-HA) estimators that achieve HAL’s optimal convergence rates with drastically reduced computational cost through theoretically-grounded eigenspace dimension reduction.

Tags: nonparametric-statistics high-dimensional-inference spline-methods dimension-reduction empirical-risk-minimization adaptive-estimation principal-components sectional-variation

arXiv · PDF

Problem Formulation

Motivation: The Highly Adaptive Lasso (HAL) achieves optimal nonparametric rates but becomes computationally prohibitive in moderate-to-high dimensions due to exponential basis expansion. Existing dimension reduction strategies lack theoretical justification, creating a gap between HAL’s theoretical guarantees and practical implementation.
Mathematical setup: Let $O \sim P_0 \in \mathcal{M}$ be i.i.d. observations from a statistical model $\mathcal{M}$. Define the target functional $\Psi: \mathcal{M} \to D^{(0)}([0,1]^d)$ mapping data distributions to càdlàg functions with finite zero-order sectional variation norm. The loss function $L(\psi)(O)$ satisfies $\psi_0 = \Psi(P_0) = \arg\min_{\psi \in D^{(0)}([0,1]^d)} P_0 L(\psi)$. The loss-based dissimilarity is:
\[d_0(\psi, \psi_0) = P_0 L(\psi) - P_0 L(\psi_0)\]
The $k$-th order smoothness class $D^{(k)}_M([0,1]^d)$ contains functions with $k$-th order sectional variation norm bounded by $M$. HAL works with finite-dimensional spline submodels $D^{(k)}(\mathbb{R}^N)$ spanned by $N$ basis functions ${\phi_j : 1 \leq j \leq N}$, where typically $N = n(2^d - 1)$ for zero-order HAL.

Assumption 1 (Sufficiency): Loss $L(\psi_{N,\beta})(O_i)$ depends on $\psi_{N,\beta}$ only through design matrix $H_{mn \times N}(i,j) = \phi_j(x_i)$.

Assumption 2 (Approximate sufficiency): There exists approximate loss $L_n$ with minimizer $\psi_{0n}$ such that $d_0(\psi_{0n}, \psi_0) = o_P(n^{-\frac{2k^*}{2k^*+1}-\delta})$ for some $\delta > 0$.
Toy example: When $d=2$, $n=100$, and $k=0$, standard HAL requires $N = 100(2^2-1) = 300$ basis functions. The PC-HA approach reduces this to at most $n=100$ principal components while preserving the design matrix span under sufficiency.
Formal objective: Minimize empirical risk over PC-reduced space:
\[\hat{\alpha} = \arg\min_{\alpha: \|\alpha\| \leq C_n} P_n L(\theta_{n,\alpha})\]
where $\theta_{n,\alpha} = \sum_{m=1}^n \alpha(m) \tilde{\phi}_m$ with $\tilde{\phi}_m$ being PC basis functions.

Method

The PC-HA method performs principled dimension reduction via eigendecomposition:

Compute design matrix $H_{n \times N}(i,j) = \phi_j(x_i)$ from original $N$ spline basis functions
Form inner-product matrix $I_N = H^T H$ and compute its eigenvalue decomposition $I_N = E_N \Lambda_N E_N^T$
Extract $n$ leading eigenvectors $E^n_N$ (the $N \times n$ matrix of first $n$ columns)
Define PC basis functions: $\tilde{\phi}_m = \sum_{j=1}^N E^n_N(j,m) \phi_j$ for $m=1,\ldots,n$
Estimate via regularized minimization over PC space $D^{(k)}(E_n) = {\sum_m \alpha(m) \tilde{\phi}_m : \alpha \in \mathbb{R}^n}$

Three variants differ by constraint on $\alpha$:
- PC-HAGL: $\lvert \beta(\alpha) \rvert_1 \leq C$ where $\beta(\alpha)(j) = \sum_m \alpha(m) E^n_N(j,m)$
- PC-HAL: $\lvert \alpha \rvert_1 \leq C$
- PC-HAR: $\lvert \alpha \rvert_2 \leq C$
For computational efficiency, use SVD trick: $H = UDE^T$ gives PC basis as columns of $UD$ without storing full $H$.

Toy example application: With $d=2, n=100$, the $300 \times 300$ matrix $H^T H$ eigendecomposition yields 100 PC basis functions $\tilde{\phi}_m$. PC-HAGL then solves:
\[\hat{\alpha} = \arg\min_{\alpha: \|\sum_m \alpha(m) E^n_N(\cdot,m)\|_1 \leq C} \frac{1}{100} \sum_{i=1}^{100} L\left(\sum_{m=1}^{100} \alpha(m) \tilde{\phi}_m(x_i)\right)\]

Novelty & Lineage

This work extends the Highly Adaptive Lasso (van der Laan et al.) by introducing the first theoretically-grounded dimension reduction. Prior HAL implementations used ad-hoc screening methods without theoretical guarantees. The PC-HA family provides principled dimension reduction while preserving HAL’s optimal convergence rates and asymptotic properties.

The approach builds on classical principal component methods and recent advances in high-dimensional nonparametric estimation. Unlike previous HAL approximations that lacked theory, PC-HA maintains the connection to sectional variation norms and covering number bounds that drive HAL’s theoretical guarantees.

Key innovation is showing that PC dimension reduction preserves sufficiency for the empirical risk under appropriate conditions, enabling transfer of HAL’s efficiency and normality results.

SIGNIFICANT

Proof Techniques

The main proof strategy establishes rate preservation through complexity control and approximation bounds:

Sufficiency preservation: Under Assumption 1, the PC-reduced space $D^{(k)}(E_n)$ satisfies ${L(\theta_{n,\alpha})(O_n) : \alpha} = {L(\psi_{N,\beta})(O_n) : \beta}$, making PC reduction lossless for empirical risk minimization.
Sectional variation control: Key inequality bounding complexity:
\[\sup_{\|\alpha\| \leq C} \|\beta(\alpha)\|_1 = O(C)\]
This ensures PC-estimators remain in bounded variation classes $D^{(k)}_M(\mathbb{R}^N)$ with controlled covering numbers.
Covering number transfer: Lemma 29 from van der Laan et al. gives entropy bounds:
\[\log N(\epsilon, D^{(k)}_M([0,1]^d), \|\cdot\|_\infty) = O(\epsilon^{-\frac{2k+1}{2k+2}})\]
These bounds transfer to PC-subspaces via the embedding $D^{(k)}_{\text{norm},C}(E_n) \subset D^{(k)}_M(\mathbb{R}^N)$.
Oracle approximation: Under i.i.d. sampling from $P_{X,0}$ (Assumption 6), the approximation error satisfies:
\[d_0(\psi_{n,0}, \psi_{N,0}) = O_+(n^{-\frac{2k^*}{2k^*+1}})\]
where $\psi_{n,0} = \arg\min_{\psi \in D^{(k)}(E_n)} P_0 L(\psi)$.
Cross-validation oracle property: Standard techniques show the CV selector $C_{n,cv}$ achieves:
\[d_0(\hat{\psi}_{PC}, \psi_0) = O_+(n^{-\frac{2k^*}{2k^*+1}})\]
Score equation analysis: For plug-in efficiency, Theorem 3 shows PC-estimators solve empirical scores exactly on the tangent subspace $D_n(E(J_n)) = {f \in D(E(J_n)) : \langle f, f_{a_n} \rangle = 0}$ and approximately for bounded functions:
\[\sup_{\|f^*\|_2 \leq J_n^{-1/2}} |P_n \frac{d}{d\theta_n} L(\theta_n)(f^*)| = o_P(n^{-1/2})\]

Experiments & Validation

Purely theoretical. The paper establishes theoretical foundations without empirical validation.

Meaningful empirical validation would require:

Computational comparisons showing PC-HA achieves similar performance to full HAL with dramatically reduced runtime
Simulation studies across varying dimensions $d$ demonstrating rate preservation
Real data experiments on regression/classification tasks comparing PC-HA variants
Memory usage benchmarks showing scalability improvements
Cross-validation performance demonstrating practical benefits of the dimension reduction

Limitations & Open Problems

Sufficiency Assumption 1 requires losses depend only on function values at design points - NATURAL for regression/classification but RESTRICTIVE for density estimation with normalizing constants.
The i.i.d. sampling Assumption 6 from distribution $P_{X,0}$ equivalent to loss-based dissimilarity - TECHNICAL and may be hard to verify in practice.
Assumption 4 requiring PC-representation of HAL-MLE has bounded norm - TECHNICAL but plausible under standard HAL regularity conditions.
PC-HAL and PC-HAR require additional boundedness conditions on coefficients beyond what’s automatically guaranteed - TECHNICAL but potentially removable with refined analysis.
The method requires storing eigenvalues/eigenvectors when $N$ is very large, limiting scalability - RESTRICTIVE for ultra-high dimensions.

Open problems:
Develop adaptive procedures for choosing between PC-HA variants without knowing smoothness order $k$.
Extend results to non-sufficiency settings and establish constructive algorithms for selecting design points $x_n$.

Deep Tabular Representation Corrector

Authors: Hangting Ye, Peng Wang, Wei Fan, Xiaozhuang Song et al. (7 authors) · Institution: Jilin University · Category: cs.LG

TRC provides a model-agnostic post-hoc method to enhance deep tabular model representations by correcting shift and reducing redundancy without retraining the backbone.

Tags: tabular-learning representation-learning deep-learning model-agnostic post-hoc-correction singular-value-analysis

arXiv · PDF

Problem Formulation

Motivation: Deep learning methods for tabular data often underperform compared to tree-based methods, with existing approaches either requiring training from scratch (in-learning) or expensive pretraining (pre-learning). This limits the adoption of deep networks for structured/tabular datasets across domains like healthcare, finance, and engineering.
Mathematical setup: Consider a tabular dataset $D = {(x_i, y_i)}_{i=1}^N$ where $x_i = (x_i^{(num)}, x_i^{(cat)}) \in \mathcal{X}$ represents numerical and categorical features and $y_i \in \mathcal{Y}$ is the label. A trained deep tabular model consists of backbone $G_f(\cdot; \theta_f)$ and prediction head $G_h(\cdot; \theta_h)$ such that $F(\cdot; \theta) = G_h(G_f(\cdot; \theta_f); \theta_h)$.

Key assumptions:
Representation shift: inherent noise in tabular data causes learned representations $z_i = G_f(x_i; \theta_f)$ to deviate from optimal representations
Representation redundancy: deep models incorporate redundant information in representation space, measured by singular value entropy
\[\text{SVE} = -\sum_{i=1}^D \frac{\sigma_i}{\sum_{j=1}^D \sigma_j} \log \frac{\sigma_i}{\sum_{j=1}^D \sigma_j}\]
where ${\sigma_i}_{i=1}^D$ are singular values of representation matrix $Z \in \mathbb{R}^{N \times D}$.
Toy example: For a 2-feature classification problem with $N=100$ samples and representation dimension $D=10$, if SVE is high but accuracy is poor, this suggests redundant information hurts prediction performance.
Formal objective: Given frozen backbone parameters $\theta_f$, learn correction module $G_p(\cdot; \theta_p)$ and new head $G_h(\cdot; \theta_h)$ to minimize:
\[\min_{\theta_p, \theta_h} \mathbb{E}_{(x,y)} [L(G_h(G_p(G_f(x; \theta_f); \theta_p); \theta_h), y)]\]

Method

The Tabular Representation Corrector (TRC) consists of two sequential tasks:

Task 1: Tabular Representation Re-estimation

Identify approximated optimal representations $\mathcal{Z}_o = {z_k}$ from samples with lowest gradient norms satisfying:
\[\text{Rank}(\|\nabla_{\theta_f} L(F(x; \theta), y)\|_p^q)/|D_{val}| \leq \tau\]
Generate simulated sub-optimal representations by perturbing:
\[\tilde{z}_{km} = G_f(m_{km} \odot x_k + (1 - m_{km}) \odot \epsilon_k; \theta_f)\]
where $m_{km} \in {0,1}^d$ is a mask and $\epsilon_k$ is sampled noise.
Train shift estimator $\phi(\cdot; \theta_\phi)$ to minimize:
\[L_{shift} = \frac{1}{KM} \sum_{k,m=1}^{K,M} \|\phi(\tilde{z}_{km}; \theta_\phi) - \Delta_{km}\|_2^2\]
where $\Delta_{km} = \tilde{z}_{km} - z_k$.
Re-estimate representations: $\Phi(z_i; \theta_\phi) = z_i - \phi(z_i; \theta_\phi)$

Task 2: Tabular Space Mapping
Define light embedding space spanned by $\mathcal{B} = {\beta_t}_{t=1}^T$ where $T \ll D$.
Use coordinate estimator $s(\cdot; \theta_s)$ to compute coordinates:
\[r_i = s(\Phi(z_i; \theta_\phi); \theta_s)\]
Transform to light embedding space: $r_i^\mathcal{B} = \sum_{t=1}^T r_{it} \beta_t$
Joint optimization with orthogonality loss:
\[L_{orth} = \frac{\|A\|_1}{\|A\|_2^2} + (\|A\|_1 - T)^2\]
where $A_{ij} = \lvert \cos(\beta_i, \beta_j) \rvert$, and prediction loss:
\[L_{pred} = \frac{1}{N} \sum_{i=1}^N L(G_h(r_i^\mathcal{B}; \theta_h), y_i)\]
Applied to toy example: For the 2-feature problem, TRC would identify clean samples, simulate noise perturbations, learn to remove estimated shift from 10D representations, then compress to ~3D light embedding space while preserving classification information.

Novelty & Lineage

This work introduces a novel post-hoc representation correction approach for tabular deep learning, distinct from existing in-learning methods (AutoInt, FT-Transformer, PTaRL) that modify training procedures, and pre-learning methods (SCARF, SAINT, VIME) that require expensive pretraining.

Key novelties:

Model-agnostic post-hoc correction without altering backbone parameters
Explicit modeling of representation shift through gradient-based sample selection
Light embedding space compression with orthogonality constraints
Joint solution to representation shift and redundancy problems

Prior work focuses on either regularization during training (SNN, TANGOS) or self-supervised pretraining (TabNet, SubTab), while TRC operates on frozen representations. The gradient-norm based sample selection for shift estimation is novel.

INCREMENTAL - builds on established concepts but combines them in a new architecture for tabular data.

Proof Techniques

This is primarily an empirical paper with limited theoretical analysis. The main technical insights are:

Gradient-based optimality assumption: Samples with low gradient norms $\lvert \nabla_{\theta_f} L(F(x; \theta), y) \rvert_p^q$ are assumed to have representations closest to optimal, justified by the intuition that well-classified samples have smaller gradients.
Singular Value Entropy analysis: Uses SVD decomposition $Z = U\Sigma V^T$ to show that higher SVE doesn’t correlate with better performance in tabular models, motivating compression:
\[\text{SVE} = -\sum_{i=1}^D \frac{\sigma_i}{\sum_j \sigma_j} \log \frac{\sigma_i}{\sum_j \sigma_j}\]
Orthogonality regularization: The loss function encourages orthogonal embeddings:
\[L_{orth} = \frac{\|A\|_1}{\|A\|_2^2} + (\|A\|_1 - T)^2\]
where the first term promotes sparsity in the cosine similarity matrix $A$, and the second ensures only diagonal elements remain non-zero.
Shift estimation consistency: The method assumes that artificially corrupted representations $\tilde{z}_{km}$ exhibit similar shift patterns to naturally occurring representation deviations.

The paper lacks formal convergence guarantees or generalization bounds - the approach is validated primarily through extensive experiments across 13 backbone models and 11 datasets.

Experiments & Validation

Datasets: 11 tabular datasets including regression (CO, DI, QS, CA, PO, SU, YE), binary classification (AD, AU), and multi-class classification (GE, COV) tasks, ranging from 690 to 581,012 samples.

Baselines: 13 deep tabular models including in-learning methods (MLP, DCN2, SNN, ResNet, AutoInt, TANGOS, FT-Transformer, PTaRL) and pre-learning methods (SCARF, SAINT, VIME, MLP w/ SSL variants).

Key results:

TRC achieves consistent improvements across all 13 backbones and 11 datasets
Average relative improvement: 5.1% overall (6.5% regression, 2.4% classification)
Improvements statistically significant (p-value = 6.07e-9, Wilcoxon signed-rank test)
Outperforms fine-tuning approaches (LoRA, head replacement, full model fine-tuning)

Analysis: Ablation studies confirm both representation re-estimation and space mapping components are necessary. Visualizations show TRC reduces SVE while improving performance, confirming redundancy reduction. Works effectively even with missing values and reduced training data.

Limitations & Open Problems

NATURAL: Assumes gradient norms correlate with representation quality - standard assumption in optimization literature
TECHNICAL: Requires small validation subset (1%) for shift estimator training - could potentially be eliminated with unsupervised approaches
TECHNICAL: Fixed hyperparameters (τ=0.01, T=10, M=3) across datasets - likely tunable for better performance
RESTRICTIVE: Only tested on supervised tabular tasks, unclear how it extends to unsupervised or semi-supervised settings
RESTRICTIVE: Assumes representation shift follows patterns similar to artificial corruption - may not hold for all noise types
TECHNICAL: Limited theoretical justification for why orthogonal embeddings preserve critical information

Open problems:
Develop theoretical guarantees for when TRC preserves/improves generalization performance
Extend methodology to handle distribution shift between training and test data, where the shift estimator may not generalize properly

Online Learning for Supervisory Switching Control

Authors: Haoyuan Sun, Ali Jadbabaie · Institution: MIT · Category: math.OC

Achieves $\mathcal{O}(N \log N)$ sample complexity for supervisory switching control of partially-observed linear systems by adapting multi-armed bandit algorithms with observability-based evaluation criteria.

Tags: online learning control theory system identification multi-armed bandits switching control linear systems partial observability finite-time analysis

arXiv · PDF

Problem Formulation

Motivation: In many complex control systems like autonomous vehicles or power grids, no single controller performs well across all operating conditions. Supervisory switching control addresses this by maintaining multiple candidate controllers and adaptively selecting the best one for an unknown plant. However, classical estimator-based approaches only guarantee asymptotic stability without quantitative finite-time bounds.

Mathematical setup: Consider a collection of discrete-time, partially-observed linear systems ${(C_i, A_i, B_i)}_{i=1}^N$ where the unknown true system $(C_*, A_*, B_*)$ belongs to this collection with index $i^*$. The system evolves as:

\[x_{t+1} = A_{i^*} x_t + B_{i^*} \bar{u}_t + w_t\] \[y_t = C_{i^*} x_t + \eta_t\]

where $x_t \in \mathbb{R}^{d_x}$, $\bar{u}_t \in \mathbb{R}^{d_u}$, $y_t \in \mathbb{R}^{d_y}$, with zero initialization $x_0 = 0$, process noise $w_t \sim \mathcal{N}(0, \sigma_w^2 I)$, and observation noise $\eta_t \sim \mathcal{N}(0, \sigma_\eta^2 I)$. Each system is paired with a controller, and applying controller $j$ to system $i$ yields closed-loop dynamics $M_{[i,j]} = (C_{[i,j]}, A_{[i,j]}, B_{[i,j]})$.

For each $i$, the correctly matched system $M_{[i,i]}$ is $\varepsilon_c$-strictly observable with index $\nu$
Unstable closed-loop systems satisfy $\rho(A_{[i,j]}) \geq 1 + \varepsilon_a$ for some $\varepsilon_a > 0$
Markov parameters of different systems are sufficiently separated: $\lvert G_{[i,j]} - G_{[i’,j]} \rvert_{op} \geq \gamma$ for $i \neq i’$

Toy example: Consider $N=2$ systems with $d_x = d_u = d_y = 1$. System 1 has $A_1 = 0.5, B_1 = 1, C_1 = 1$ (stable), System 2 has $A_2 = 1.5, B_2 = 1, C_2 = 1$ (unstable). Controller 1 stabilizes System 1 but destabilizes System 2. The challenge is identifying which system is the true one by testing controllers that may cause instability.

Formal objective: Find the matching controller $i^*$ in $\mathcal{O}(N \log N)$ episodes while maintaining finite $L_2$-gain:
\[\sum_{t=1}^T \|x_t\|^2 \leq C_0 \sum_{t=1}^{\tau L} (\|u_t\|^2 + \|w_t\|^2) + C_1 \sum_{t=\tau L + 1}^T (\|u_t\|^2 + \|w_t\|^2)\]

Method

The algorithm operates in episodes of fixed length $\tau$, applying one controller per episode and scoring it using two criteria.

Algorithm steps:

For episodes $\ell = 1, \ldots, N$: test each controller once
For episodes $\ell > N$: select controller $i_\ell = \arg\max_i \bar{s}_i(q) + \sqrt{\frac{a_\ell}{Q_i(\ell-1)}}$ where $\bar{s}_i(q)$ is average score and $Q_i(\ell-1)$ counts previous applications
Apply selected controller for $\tau$ steps, observe outputs $y_{\tau(\ell-1)+1}, \ldots, y_{\tau\ell}$
Compute score $s_{i_\ell} = \lfloor\frac{1}{2}(S^{(1)} + S^{(2)})\rfloor$ using two criteria
After $L = \mathcal{O}(N \log N)$ episodes, commit to most frequently selected controller

Criterion 1 (Instability Detection): Estimates initial state using observability matrix:
\[\hat{x}_1 = O_{[j,j]}^\dagger [y_\nu; \ldots; y_1]^\top\]
Predicts trajectory $\hat{y}_t = C_{[j,j]} A_{[j,j]}^{t-1} \hat{x}_1$ and returns 0 if $\lvert y_\tau - \hat{y}_\tau \rvert > \sqrt{2d_y \Theta \log(2d_y/\delta)}$.

Criterion 2 (System Identification): Constructs shifted output:
\[\tilde{y}_t = y_t - C_{[j,j]} A_{[j,j]}^{t-1} \hat{x}_1 - G_{[j,j]} z_t\]
where $z_t = [u_{t-h}; \ldots; u_{t-1}]$ and estimates parameter difference $\hat{\Delta}_k$ via OLS along critical directions. Returns 0 if $\lvert \hat{\Delta}_k \rvert > \gamma$ for any $k$.

Toy example application: For the 2-system example, when testing Controller 1 on the true System 1, both criteria return 1 (stable, matching). When testing Controller 1 on true System 2, Criterion 1 detects instability (returns 0), yielding overall score 0. This separation enables identification.

Novelty & Lineage

This work bridges classical supervisory control theory with modern online learning. Prior supervisory control (Morse 1996, Hespanha 2001) provides asymptotic stability but no finite-time bounds. Recent non-asymptotic system identification (Simchowitz et al. 2018, Sarkar et al. 2021) requires closed-loop stability assumptions incompatible with testing destabilizing controllers.

Previous bandit approaches for control (Li et al. 2023) assume full observability, while the authors’ prior work (Sun & Jadbabaie 2024) achieved exponential sample complexity $\mathcal{O}(\exp(N))$.

Key innovations:

Novel evaluation criteria that leverage observability to isolate state history effects
Adaptation of multi-armed bandit algorithms to handle state dependencies in control
Dimension-free finite-time guarantees for partially-observed switching control

SIGNIFICANT - Major improvement from exponential to logarithmic sample complexity while removing full observability assumptions.

Proof Techniques

The analysis adapts UCB-style techniques to the control setting by decoupling episode dependencies through observability-based state estimation.

Key proof stages:

Criterion analysis: For Criterion 1, when controller matches true system ($j = i^*$), the residual satisfies concentration:
\[\mathbb{P}(\|y_\tau - \hat{y}_\tau\| \leq \sqrt{2d_y \Theta \log(2d_y/\delta)}) \geq 1-\delta\]
When controller is destabilizing, explosive eigenvalue dynamics ensure:
\[\|y_\tau - \hat{y}_\tau\| \geq \frac{1}{4}\varepsilon_c \varepsilon_a (1+\varepsilon_a)^{\tau-2\nu-2} \sigma_w\]
with probability $\geq 2/5$.
OLS concentration: For Criterion 2, apply self-normalized martingale bound (Abbasi-Yadkori et al. 2011):
\[\left\|\sum_{t=\nu+h+1}^\tau z_t r_t\right\|^2_{V^{-1}} \leq 2\sigma_r^2 \log\left(\frac{\det(V+V_0)^{1/2}}{\delta \det(V_0)^{1/2}}\right)\]
where $V = \sum_t z_t z_t^\top$.
UCB adaptation: Define “good” event $\mathcal{E}$ where empirical averages concentrate. For non-matching controller $j \neq i^*$ selected at episode $\ell$, one of three conditions holds: - Matching controller average too low: $\bar{s}_{i^*} \leq \mu^+ - \sqrt{\frac{a_\ell}{Q_{i^*}(\ell)}}$ - Non-matching average too high: $\bar{s}_j \geq \mu^- + \sqrt{\frac{a_\ell}{Q_j(\ell)}}$
- Exploration bonus too wide: $2\sqrt{\frac{a_\ell}{Q_j(\ell)}} > \mu^+ - \mu^-$
Martingale concentration: Apply Azuma-Hoeffding to bound frequency of condition violations, yielding $\mathbb{P}(\mathcal{E}) \geq 1-\delta$.

Key insight: Observability allows reconstructing initial states, effectively making episodes independent for analysis purposes despite physical state coupling.

Experiments & Validation

Purely theoretical. The paper provides no empirical validation.

Experimental validation would require:

Simulated linear systems with various dimensions and noise levels
Comparison against classical estimator-based supervisory control
Demonstration of the $\mathcal{O}(N \log N)$ scaling with number of controllers
Validation of finite $L_2$-gain bounds in practice
Robustness tests when assumptions (observability, spectral gaps) are approximately satisfied

Limitations & Open Problems

Limitations:

Requires exact knowledge of candidate models $(C_i, A_i, B_i)$ - RESTRICTIVE (limits practical applicability when models are uncertain)
Assumes true system belongs to finite candidate set - RESTRICTIVE (real systems may not match any candidate exactly)
Fixed episode length $\tau$ must satisfy $\tau \geq \max(T_1, T_2)$ where bounds depend on problem parameters - TECHNICAL (could potentially be made adaptive)
Observability assumption for matched controllers - NATURAL (standard controllability/observability assumptions in control theory)
Spectral separation assumption for different systems - TECHNICAL (needed for identification but could potentially be weakened)
Gaussian noise assumptions - NATURAL (standard in system identification literature)

Open problems:
Extend to continuous controller spaces rather than finite discrete sets
Develop adaptive episode length selection that doesn’t require knowing problem-dependent constants $T_1, T_2$

BVSIMC: Bayesian Variable Selection-Guided Inductive Matrix Completion for Improved and Interpretable Drug Discovery

Authors: Sijian Fan, Liyan Xiong, Dayuan Wang, Guoshuai Cai et al. (5 authors) · Institution: University of South Carolina · Category: cs.LG

Introduces BVSIMC, a Bayesian inductive matrix completion method using spike-and-slab priors for variable selection from high-dimensional side features in drug discovery applications.

Tags: matrix completion bayesian variable selection drug discovery spike and slab priors inductive matrix completion bioinformatics proximal gradient methods sparsity

arXiv · PDF

Problem Formulation

Motivation: Drug discovery often incorporates side information (chemical properties, genomic data) to improve prediction, but these high-dimensional features are frequently noisy and irrelevant. Existing methods either use all available side features or apply uniform regularization, leading to poor performance when side information is contaminated.
Mathematical setup: Let $Y \in {0,1}^{I \times J}$ be a binary matrix where $y_{ij} = 1$ indicates a known drug-disease interaction. Let $U \in \mathbb{R}^{I \times d_1}$ and $V \in \mathbb{R}^{J \times d_2}$ be side feature matrices for drugs and diseases respectively.

The model assumes each entry follows:
\[p_{ij} = \frac{\exp(m_{ij})}{1 + \exp(m_{ij})}\]
where the latent matrix is:
\[M = UAB^T V^T\]
with projection matrices $A \in \mathbb{R}^{d_1 \times r}$ and $B \in \mathbb{R}^{d_2 \times r}$.

Assumptions:
1. Known interactions ($y_{ij} = 1$) are more trustworthy than zeros
2. Many side features are irrelevant (sparsity in $A$ and $B$ rows)
3. Relevant features should share information across the latent space
Toy example: When $I=3$ drugs, $J=2$ diseases, $d_1=d_2=4$ side features, and $r=2$ latent dimensions, if only the first 2 drug features and first disease feature are relevant, then rows 3-4 of $A$ and row 2 of $B$ should be exactly zero.
Formal objective: Find posterior modes $\hat{A}, \hat{B}$ that maximize:
\[L(A,B) = \sum_{i,j} \left[\xi Y \odot (UAB^T V^T) - (\xi Y + 1 - Y) \odot \log(1 + \exp(UAB^T V^T))\right]_{ij} + \sum_k \log\pi(a_k) + \sum_\ell \log\pi(b_\ell)\]

Method

BVSIMC uses spike-and-slab group lasso (SSGL) priors for variable selection in latent factor matrices:

Each row $a_k$ of $A$ and $b_\ell$ of $B$ receives a mixture prior:
\[\pi(a_k | \tilde{\theta}) = (1-\tilde{\theta})\Psi(a_k | \tilde{\lambda}_0) + \tilde{\theta}\Psi(a_k | \tilde{\lambda}_1)\]
where $\Psi(x \lvert \lambda) \propto \lambda^r \exp(-\lambda \lvert x \rvert _2)$ is multivariate Laplace density.
Set $\tilde{\lambda}_0 \gg \tilde{\lambda}_1$ so spikes concentrate at zero, slabs allow large coefficients.
Place beta priors on mixing proportions:
\[\tilde{\theta} \sim \text{Beta}(\tilde{\alpha}, \tilde{\beta}), \quad \theta \sim \text{Beta}(\alpha, \beta)\]
Update via coordinate ascent with proximal gradient steps:
\[a_k^{(t)} = \text{argmin}_{x} \frac{1}{2\eta}\|x - (a_k^{(t-1)} - \eta\nabla f(a_k^{(t-1)}))\|_2^2 + \lambda^*(x; \tilde{\theta}_k, \tilde{\lambda}_0, \tilde{\lambda}_1)\|x\|_2\]
Apply refined thresholding:
\[a_k^{(t)} \leftarrow \left(1 - \frac{\lambda^*(a_k^{(t-1)}, \tilde{\theta}_k, \tilde{\lambda}_0, \tilde{\lambda}_1)}{\|z_k\|_2}\right)_+ z_k \mathbf{I}(\|z_k\|_2 > \Delta_U)\]
Toy example application: For the 3×2 drug-disease matrix with 4×4 side features, BVSIMC would shrink rows 3-4 of $A$ and row 2 of $B$ to zero vectors, effectively selecting only the first 2 drug features and first disease feature as relevant for prediction.

Novelty & Lineage

Extends prior IMC methods by adding Bayesian variable selection. DRIMC (Zhang et al., 2020) and NRLMF (Liu et al., 2016) use all side features without selection. SGIMC (Burkina et al., 2021) applies uniform ℓ_{2,1} penalties to all features. DirtyIMC (Chiang et al., 2015) balances data and side features but applies same regularization globally.

Key novelties:

spike-and-slab priors enable row-wise sparsity in latent factors
non-separable beta-Bernoulli priors allow information sharing across features
refined proximal updates eliminate suboptimal local modes
overcomes rotational ambiguity through row-wise rather than element-wise sparsity.

INCREMENTAL

Proof Techniques

The paper focuses on algorithm development rather than theoretical analysis. Key technical contributions:

Marginal prior derivation: Integrates out mixing proportions to obtain marginal SSGL penalty:
\[\text{pen}(a_k) = -\lambda^*(a_k; \tilde{\theta}_k, \tilde{\lambda}_0, \tilde{\lambda}_1) \|a_k\|_2\]
Refined proximal operator characterization: Uses Proposition 1 to find global mode:
\[\hat{a}_k = \begin{cases} 0_r, & \text{if } \|z_k\|_2 \leq \Delta \\ \left(1 - \frac{\lambda^*(\hat{a}_k)}{\|z_k\|_2}\right)_+ z_k, & \text{if } \|z_k\|_2 > \Delta \end{cases}\]
where threshold bounds:
\[\Delta_L < \Delta < \Delta_U = \sqrt{2\eta \log[1/p^*(0_r; \tilde{\theta}_k, \tilde{\lambda}_0, \tilde{\lambda}_1)]} + \eta\tilde{\lambda}_1\]
Conditional expectation approximation: When spike parameter is large:
\[E[\tilde{\theta} | \hat{A}] \approx \frac{\tilde{\alpha} + \hat{q}_A}{\tilde{\alpha} + \tilde{\beta} + d_1}\]
where $\hat{q}_A$ is number of nonzero rows in $\hat{A}$.
Nesterov acceleration: Applies momentum to proximal updates for faster convergence.

Experiments & Validation

Two simulation studies and two real applications:

Simulation: 800×1600 binary matrix with 100 side features each (only first 25 relevant). BVSIMC (ξ=10) achieves AUC=0.868 vs. best competitor SGIMC at 0.655.

M. tuberculosis drug resistance: 6949 strains × 13 drugs with 9684 SNP features and 33 drug chemical groups. BVSIMC consistently achieves highest AUC (>0.9 when ρ≥0.141 training proportion) vs. competitors (≤0.8). Identifies 68 relevant SNPs and 14 chemical groups.

Drug repositioning (Cdataset): 658 drugs × 409 diseases with 1899 drug features and 797 disease phenotype features. BVSIMC outperforms across all training sizes. DRIMC/NRLMF show poor performance (AUC≈0.5) due to noisy side information.

Key baselines: IMC (ridge penalties), SGIMC (ℓ_{2,1} penalties), DRIMC (similarity-based), NRLMF (neighborhood regularized). Hyperparameters tuned via grid search. Confidence parameter ξ=10 found optimal.

Limitations & Open Problems

Limitations:

Linearity assumption for latent matrix M = UAB^T V^T - RESTRICTIVE (may miss nonlinear drug-disease relationships)
Binary outcomes only - TECHNICAL (could extend to other discrete/continuous outcomes)
Fixed column dimension r requires tuning - NATURAL (standard hyperparameter in matrix factorization)
Coordinate ascent may find local optima despite refined updates - TECHNICAL (nonconvex optimization inherent difficulty)
Spike hyperparameters (λ₀, λ₁) require tuning - TECHNICAL (could use adaptive or data-driven selection)

Open problems:
Extend to nonlinear IMC using kernel methods or neural networks while maintaining interpretable variable selection
Develop theoretical guarantees for variable selection consistency and prediction accuracy under sparsity assumptions