Theory 15 papers

Theory Digest — Mar 18, 2026

Today’s Digest at a Glance

Today’s digest spans foundational advances in Bayesian methods and uncertainty quantification, alongside breakthrough results in geometric deep learning, causal inference, and high-dimensional statistical theory.

Bayesian Methods and Uncertainty Quantification

Bayesian statistics provides a principled framework for reasoning under uncertainty by treating unknown parameters as random variables with probability distributions. Instead of finding a single “best” parameter estimate, Bayesian methods maintain full probability distributions that quantify our uncertainty about parameter values. This is particularly powerful when dealing with complex, high-dimensional data where traditional frequentist approaches may struggle.

Several papers today advance Bayesian methodology in challenging settings. Variational Bayes is a key technique that approximates intractable posterior distributions $p(\theta\lvert x)$ by finding the closest member of a tractable family $q_\phi(\theta)$ by minimizing the Kullback-Leibler divergence $KL(q_\phi \lvert \rvert p)$. This transforms a difficult integration problem into an optimization problem. For high-dimensional inverse problems—where we observe $y = f(\theta) + \epsilon$ and want to recover $\theta$—the computational cost can be prohibitive. Tensor decompositions like the Tucker format represent high-dimensional arrays as products of smaller matrices, reducing complexity from $O(n^3)$ to $O(n_G^3)$ where $n_G \ll n$.

Gaussian Processes (GPs) are another cornerstone of modern Bayesian machine learning. A GP defines a distribution over functions $f \sim GP(m, k)$ specified by a mean function $m(x)$ and covariance kernel $k(x,x’)$. The kernel encodes our prior beliefs about function smoothness and structure. When observations contain input measurement uncertainty—where we observe noisy inputs $\tilde{x} = x + \delta$ rather than true inputs $x$—standard GP inference can be severely miscalibrated. The Wasserstein distance between probability measures provides a principled way to handle this uncertainty by measuring the optimal transport cost between distributions.

Causal Inference and Optimal Transport

Causal inference seeks to understand cause-and-effect relationships from data, going beyond mere correlation to answer questions like “What would happen if we intervened?” The fundamental challenge is the missing data problem: we can never observe the same unit under both treatment and control conditions. Modern causal inference relies heavily on counterfactual reasoning—imagining alternative worlds where different interventions occurred.

Optimal transport has emerged as a powerful mathematical framework for causal problems. Given two probability measures $\mu$ and $\nu$, optimal transport finds the minimum-cost way to transform one into the other. Mathematically, this solves $\inf_{\gamma \in \Gamma(\mu,\nu)} \int c(x,y) d\gamma(x,y)$ where $\Gamma(\mu,\nu)$ represents all couplings between the measures and $c(x,y)$ is a cost function. The Wasserstein distance is the square root of the optimal transport cost when $c(x,y) = \lvert x-y \rvert^2$.

In causal mediation analysis, we want to understand how a treatment $T$ affects an outcome $Y$ through an intermediate variable $M$ (the mediator). Traditional approaches require strong parametric assumptions about structural equation models. Transport-based methods can construct counterfactual mediator values by optimally moving probability mass along the causal graph structure. However, extreme interventions can cause manifold tearing—topological singularities where the smooth structure of the data manifold breaks down, establishing fundamental geometric limits on what counterfactuals are possible.

Graph Neural Networks and Structural Learning

Graph Neural Networks (GNNs) have revolutionized learning on graph-structured data, but they face fundamental limitations. Traditional message-passing GNNs iteratively aggregate information from neighboring nodes: $h_v^{(l+1)} = \sigma(W h_v^{(l)} + \sum_{u \in N(v)} h_u^{(l)})$. A critical problem is oversmoothing—as layers increase, node representations become increasingly similar, losing the ability to distinguish between nodes.

Neural Network Gaussian Processes (NNGPs) provide theoretical insight by studying the infinite-width limit of neural networks. Under certain conditions, random neural networks converge to Gaussian processes with specific kernel functions. This framework reveals why message-passing architectures suffer from rank collapse—the effective dimensionality of node representations shrinks exponentially with depth.

Graph Transformers offer a potential solution by using attention mechanisms instead of fixed message-passing rules. The attention weights $\alpha_{ij} = \text{softmax}(\frac{Q_i K_j^T}{\sqrt{d}})$ allow nodes to selectively focus on relevant parts of the graph rather than averaging over fixed neighborhoods. The NNGP analysis reveals that this structural difference can prevent the rank collapse that plagues traditional GNNs.

High-Dimensional Statistics and Missing Data

High-dimensional statistics studies settings where the number of parameters $p$ is comparable to or larger than the sample size $n$. Classical statistical theory assumes $p$ is fixed while $n \to \infty$, but modern applications often have $p \gg n$. This regime requires fundamentally different theoretical tools and reveals surprising statistical-computational gaps—problems that are information-theoretically solvable but computationally intractable.

The sum-of-squares hierarchy provides a systematic way to study these gaps. It considers polynomial-time algorithms that use degree-$d$ sum-of-squares relaxations, with higher degrees being more powerful but more expensive. Many high-dimensional problems exhibit thresholds where statistical recovery becomes possible at some sample complexity, but efficient algorithms require exponentially more samples.

Missing data compounds these challenges. Under realizable contamination, a fraction of the data is corrupted in worst-case fashion. The minimax framework asks: what is the best possible error rate any algorithm can achieve in the worst case? For contamination fraction $\epsilon$, rates often exhibit phase transitions—below certain thresholds, efficient recovery is impossible regardless of computational resources.

Reading Guide

For readers interested in Bayesian methodology: start with papers 1 and 2 for applications to neuroimaging and experimental design, then proceed to papers 4 and 5 for advanced kernel methods and computational techniques.

Those focused on causal inference should read papers 7 and 14 together—they provide complementary perspectives on geometric constraints and optimal transport methods for counterfactual reasoning.

For graph learning: paper 3 provides essential theoretical foundations that explain structural advantages of attention-based architectures over traditional message-passing approaches.

Readers interested in computational complexity and statistical theory should prioritize paper 8 for missing data theory, with papers 5 and 11 providing concrete algorithmic advances for high-dimensional problems.

The optimal transport theme connects papers 4, 7, 14, and 15, offering a unified mathematical perspective across uncertainty quantification, causal inference, and statistical methodology.


Bayesian Scalar-on-Tensor Quantile Regression for Longitudinal Data on Alzheimer’s Disease

Authors: Rongke Lyu, Marina Vannucci, Suprateek Kundu · Institution: Rice University, MD Anderson Cancer Center · Category: stat.ME

Develops the first Bayesian tensor quantile regression for longitudinal neuroimaging data, decomposing effects into visit-invariant and visit-specific components with specialized shrinkage priors.

Tags: quantile-regression tensor-methods neuroimaging longitudinal-data bayesian-methods alzheimers-disease high-dimensional-statistics spatial-statistics

arXiv · PDF

Problem Formulation
  1. Motivation: Alzheimer’s disease neuroimaging studies aim to understand how brain-behavior associations change over time, particularly how brain structure relates to cognitive decline. Traditional mean regression assumes normally distributed errors, which fails for skewed cognitive scores, motivating quantile regression approaches that model the full outcome distribution.

  2. Mathematical setup: Consider $n$ subjects with longitudinal MRI data over $T$ visits. For subject $i$ at visit $t$, let $X_{it} \in \mathbb{R}^{p_1 \times p_2}$ denote the brain image tensor, $y_{it}$ the scalar cognitive outcome, and $z_i \in \mathbb{R}^p$ time-invariant covariates. The quantile regression model is:

    \[Q_q(y_{i,t}) = b_0 + b_{0i} + b_1 T_{it} + z_i^T \eta + \langle X_{it}, B_0 \rangle + \langle X_{it}, B_t \rangle + \epsilon^q_{it}\]

    where $B_0$ captures visit-invariant effects, $B_t$ captures visit-specific effects, and $\epsilon^q_{it}$ follows asymmetric Laplace distribution. The tensor coefficients have PARAFAC decompositions:

    \[B_0 = \sum_{r=1}^R \beta_{01}^{(r)} \circ \cdots \circ \beta_{0D}^{(r)}\] \[B_t = \sum_{r_t=1}^{R_t} \chi_{t1}^{(r_t)} \circ \cdots \circ \chi_{tD}^{(r_t)}\]

    Assumptions:

    1. Regular visit times across subjects
    2. Missing visits are missing at random
    3. Tensor coefficients have low-rank PARAFAC structure
    4. Errors follow asymmetric Laplace distribution at quantile $q$
  3. Toy example: With $n=2$ subjects, $T=2$ visits, $2 \times 2$ brain images, and rank-1 decompositions, $B_0 = \beta_{01} \circ \beta_{02}$ where $\beta_{01}, \beta_{02} \in \mathbb{R}^2$. For a single voxel $(1,1)$, the coefficient is $\beta_{01}[1] \cdot \beta_{02}[1]$.

  4. Formal objective: Estimate posterior distributions of tensor coefficients $(B_0, B_1, \ldots, B_T)$ and predict quantiles:

    \[\hat{Q}_q(y_{it}) = \mathbb{E}[Q_q(y_{i,t}) | \text{data}]\]
Method

The method uses a Bayesian hierarchical model with specialized priors for tensor decomposition:

  1. Location-scale mixture representation of asymmetric Laplace distribution:

    \[Q_q(y_{i,t}) = b_0 + b_{0i} + b_1 T_{it} + z_i^T \eta + \langle X_{it}, B_0 \rangle + \langle X_{it}, B_t \rangle + \theta\nu_{it} + \rho\sqrt{\sigma\nu_{it}}u_{it}\]

    where $\nu_{it} \sim \text{Exp}(1)$, $u_{it} \sim N(0,1)$, $\theta = \frac{1-2q}{q(1-q)}$, $\rho^2 = \frac{2}{q(1-q)}$.

  2. Multiway Dirichlet generalized double Pareto (M-DGDP) prior for visit-invariant effects:

    \[\beta_{0j}^{(r)} \sim N(0, (\phi_r \tau) W_{jr})\] \[w_{jr,k} \sim \text{Exp}(\lambda_{jr}^2/2)\]
  3. Spike-and-slab prior for visit-specific effects:

    \[\chi_{tj}^{(r_t)} \sim N(0, (\phi_{tr_t}^{\chi} \tau_t^{\chi}) W_{tjr_t}^{\chi})\] \[w_{tjr_t,k}^{\chi} \sim (1-\pi_{tjr_t,k})\delta_\nu + \pi_{tjr_t,k}\text{Exp}(\lambda_{tjr_t}^2)\]
  4. MCMC algorithm alternates between Gibbs updates for continuous parameters and Metropolis-Hastings for variable selection using add/delete/swap moves.

    For the toy example with $2 \times 2$ images and rank-1 decomposition: $B_0 = \beta_{01} \circ \beta_{02}$ where each margin is 2-dimensional. The algorithm updates $\beta_{01}$ given $\beta_{02}$ and vice versa using conjugate normal updates, then updates the variance parameters $w_{jr,k}$ using generalized inverse Gaussian distributions.

Novelty & Lineage

This extends quantile regression to high-dimensional longitudinal tensor data. Prior work includes:

  • Yu and Moyeed (2001): Bayesian quantile regression with ALD
  • Guhaniyogi et al. (2017): Bayesian scalar-on-tensor regression for mean models
  • Li and Zhang (2021): Frequentist tensor quantile regression (cross-sectional)
  • Lu et al. (2020): Tucker decomposition for tensor quantile regression

Key novelties:

  1. First Bayesian approach for longitudinal tensor quantile regression
  2. Decomposition into visit-invariant and visit-specific effects with different priors
  3. Spike-and-slab priors on tensor margins for visit-specific sparsity
  4. MCMC algorithm combining tensor updates with variable selection.

    SIGNIFICANT

Proof Techniques

The paper is primarily methodological rather than theoretical, focusing on algorithm development. Key technical contributions:

  1. Location-scale mixture representation enables conjugate updates:

    \[f(\epsilon_{it}^q | \sigma, q) = \frac{q(1-q)}{\sigma} \exp\left(-\rho\frac{\epsilon_{it}^q}{\sigma}\right)\]
  2. Conditional posterior for tensor margins uses matrix normal form. For visit-invariant margins:

    \[\beta_{0j}^{(r)} | \cdot \sim N(\mu_{\beta_{0jr}}, \Sigma_{\beta_{0jr}})\]

    where the mean and covariance involve Kronecker products of other tensor margins.

  3. Spike-and-slab variable selection uses joint Metropolis-Hastings acceptance probability:

    \[\alpha = \min\left(1, \frac{p(\pi^*, w^* | \cdot)}{p(\pi, w | \cdot)} \cdot \frac{q(\pi, w | \pi^*, w^*)}{q(\pi^*, w^* | \pi, w)}\right)\]
  4. Multiplicity adjustment via simultaneous credible bands using maximal deviations:

    \[\zeta_{\alpha/2}^* = \max_{j=1,\ldots,J} |\hat{\Gamma}(\theta_j) - \zeta_{\alpha/2}(\theta_j)|\]

    The main technical insight is designing priors that respect tensor structure while enabling efficient MCMC updates through back-fitting procedures.

Experiments & Validation

Datasets: ADNI-1 neuroimaging study with T1-weighted MRI scans and ADAS cognitive scores for Alzheimer’s disease and mild cognitive impairment patients.

Simulations: 5 scenarios with different 2D signal shapes (rectangular, cross, triangular, circular) plus 3D cubic signals. 250 training samples per visit, 50 test samples. Quantile levels 0.2, 0.5, 0.8.

Baselines:

  1. Cross-sectional Bayesian tensor quantile regression variants (CsB1, CsB2)
  2. Frequentist tensor quantile regression (Li & Zhang 2021)
  3. Penalized quantile regression (rqPen) ignoring spatial structure.

    Key results: Proposed method achieves lower relative error (0.15 vs 0.25+ for competitors), higher correlation (0.85+ vs 0.70- for competitors), better feature selection (MCC scores 0.75+ vs 0.45- for competitors), and superior prediction accuracy (lower check loss) across all scenarios. Performance gains increase with signal complexity.

Limitations & Open Problems
  1. Regular visit times assumption - TECHNICAL (commonly satisfied in planned longitudinal studies like ADNI but could be relaxed with time-varying coefficients)

  2. PARAFAC decomposition assumption - NATURAL (standard low-rank assumption in tensor methods, validated empirically)

  3. Asymmetric Laplace distribution for quantile regression - NATURAL (standard choice in Bayesian quantile regression literature)

  4. Missing at random assumption for missing visits - NATURAL (reasonable for planned studies with administrative missingness)

  5. Fixed tensor ranks chosen via DIC - TECHNICAL (could be made adaptive but current approach works well in practice)

  6. Grid mapping of irregular brain surfaces - TECHNICAL (necessary for tensor structure but may lose some spatial precision)

    Open problems:

  7. Extend to irregular visit times with time-varying coefficient functions
  8. Develop adaptive rank selection methods that can vary across visits

Sequential Bayesian Experimental Design for Prediction in Physical Experiments Informed by Computer Models

Authors: Hao Zhu, Markus Hainy · Institution: Johannes Kepler University Linz · Category: stat.CO

Develops the first mutual information-based sequential Bayesian experimental design for Kennedy-O’Hagan models with computational acceleration strategies, theoretically connecting MI and IMSPE criteria.

Tags: bayesian-experimental-design gaussian-processes kennedy-ohagan-model mutual-information sequential-design model-calibration computer-experiments uncertainty-quantification

arXiv · PDF

Problem Formulation
  1. Motivation: In scientific and engineering domains, physical experiments are costly and time-consuming, while computer simulators provide cheap but imperfect approximations. The Kennedy-O’Hagan (KOH) framework combines simulator runs with limited experimental observations, but selecting which physical experiments to conduct to maximize predictive performance remains challenging.

  2. Mathematical setup: Let $X_p \in \mathbb{R}^{n \times d}$ be physical experiment design points and $X_c \in \mathbb{R}^{m \times d}$ be computer model design points. The KOH model is:

    \[y_p = y_c(X_p, \theta) + \delta(X_p) + \epsilon\] \[y_s = y_c(X_c, T)\]

    where $\theta \in \mathbb{R}^h$ are true calibration parameters, $T \in \mathbb{R}^{m \times h}$ are design settings for calibration parameters, $\delta(\cdot)$ captures model discrepancy, and $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$ is observation noise.

    The computer model and discrepancy follow Gaussian processes:

    \[y_c(X_c, T) \sim \mathcal{GP}(0, K_{\phi_1}((X_c, T), (X_c', T')))\] \[\delta(X_p) \sim \mathcal{GP}(m_\delta(X_p; \theta), K_{\phi_2}(X_p, X_p'))\]

    Assumptions:

    1. Both GPs are stationary with hyperparameters $\phi_1, \phi_2$
    2. Observation noise is independent Gaussian
    3. Candidate design points form a finite discrete set $\mathcal{D}_{cand} = {\xi_1, \ldots, \xi_M}$
  3. Toy example: When $d=1$, $n=2$, $m=1$, and we have one physical observation $y_p = 0.5$ at $x_p = 0.3$ and one simulator run $y_s = 0.2$ at $x_c = 0.1$ with $T = 0.5$. We want to select the next physical experiment location from candidates ${0.7, 0.9}$ to best predict at $x_* = 0.8$.

  4. Formal objective: Select design sequence $\mathcal{D}_{sel}^{(B)} = {\xi^{(1)}, \ldots, \xi^{(B)}}$ that maximizes mutual information:

    \[U(\xi^{(b)}) = I(y_*; y_{new}^{(b)} | y, \mathcal{D}_{sel}^{(b)}) = \mathbb{E}_{p(y_*, y_{new}^{(b)} | y)} \left[ \log \frac{p(y_* | y_{new}^{(b)}, y)}{p(y_* | y)} \right]\]
Method

The method proposes a mutual information-based sequential Bayesian experimental design with computational acceleration strategies:

  1. Mutual Information Criterion: At round $b$, select design point maximizing:

    \[U(\xi^{(b)}) = \int \int p(y_*, y_{new}^{(b)} | y) \log \frac{p(y_*, y_{new}^{(b)} | y)}{p(y_{new}^{(b)} | y) p(y_* | y)} dy_{new}^{(b)} dy_*\]
  2. Hybrid Extension: Incorporate local complexity measure combining slope and curvature:

    \[U_{hyb}(\xi^{(b)}) = (1-\alpha) U(\xi^{(b)}) + \alpha C(\xi^{(b)})\]

    where $C(\xi^{(b)}) = w_g \tilde{g}(\xi^{(b)}) + w_s \tilde{s}(\xi^{(b)})$ with local gradient $g$ and slope variation $s$.

  3. Nested Monte Carlo Estimation: Approximate mutual information using:

    \[\hat{U}_{S,J}(\xi^{(b)}) = \frac{1}{S} \sum_{s=1}^S \left[ \log \hat{p}_{joint}^J(y_s^*, y_{new,s}^{(b)}) - \log \hat{p}_*^J(y_s^*) - \log \hat{p}_{new}^J(y_{new,s}^{(b)}) \right]\]
  4. Gaussian Mixture Compression: Reduce inner Monte Carlo samples from $J$ to $J_{target}$ using k-means clustering and KNN-guided Runnalls algorithm.

  5. Schur Complement Updates: Precompute full covariance matrix and use rank-one updates to avoid cubic matrix inversions.

    Toy example application: For the 1D case with candidates ${0.7, 0.9}$, compute mutual information between prediction at $x_* = 0.8$ and potential observations at each candidate. The method would evaluate covariance structures, sample from posterior distributions, and select the candidate yielding higher information gain about $y_*$.

Novelty & Lineage

This work extends the Kennedy & O’Hagan framework by developing the first genuinely sequential Bayesian experimental design specifically for physical (not computer) experiments targeting predictive performance. Prior work includes: Leatherman et al. (2017) and Koermer et al. (2024) who used IMSPE minimization; Krishna et al. (2021) who combined D-optimal and space-filling designs; Ryan et al. (2016) who used mutual information for other contexts.

Key novelties:

  1. First mutual information-based criterion explicitly for KOH model prediction improvement
  2. Hybrid criterion incorporating local complexity measures
  3. Gaussian mixture compression using k-means initialization and KNN-guided Runnalls algorithm
  4. Theoretical connection between MI and IMSPE criteria via asymptotic analysis
  5. Schur complement acceleration for sequential covariance updates

    The computational acceleration strategies are novel combinations of existing techniques (Runnalls algorithm, k-means, Schur complements) tailored to the KOH framework.

    SIGNIFICANT

Proof Techniques

The main theoretical results establish the asymptotic equivalence between mutual information and IMSPE criteria:

  1. Theorem 1 Setup: Consider the covariance reduction at round $b$:

    \[\Delta\Sigma_*(\xi^{(b)}) := \Sigma_*^{(b-1)} - \Sigma_*^{(b)} \succeq 0\] \[\Delta IMSPE(\xi^{(b)} | \Omega) := \frac{1}{n_*} \text{tr}(\Delta\Sigma_*(\xi^{(b)}))\]
  2. Small Update Regime: As sequential rounds proceed, assume:

    \[\|(\Sigma_*^{(b-1)})^{-1/2} \Delta\Sigma_*(\xi^{(b)}) (\Sigma_*^{(b-1)})^{-1/2}\|_2 \to 0\]
  3. Key Asymptotic Result: Under joint Gaussianity and positive definiteness conditions:

    \[\frac{U(\xi^{(b)} | \Omega)}{\Delta IMSPE(\xi^{(b)} | \Omega)} \to C_{b-1}\]

    where the constant satisfies:

    \[\frac{n_*}{2\lambda_{max}(\Sigma_*^{(b-1)})} \leq C_{b-1} \leq \frac{n_*}{2\lambda_{min}(\Sigma_*^{(b-1)})}\]
  4. Convergence Analysis: For the nested Monte Carlo estimator, apply Hong & Juneja (2009) framework with outer function $f(x) = \log(x)$. The log function satisfies bounded derivative conditions on $[\tau, \infty)$ for $\tau > 0$, yielding:

    \[\mathbb{E}[(\hat{U}_{S,J} - U)^2] = O(S^{-1} + J^{-2})\]

    The proof uses Taylor expansion around the identity for small covariance updates and eigenvalue bounds to control the ratio between MI and IMSPE criteria.

Experiments & Validation

Synthetic Example: 1D test function with 30 candidate locations, comparing MI-based, IMSPE, D-optimal, and Maximin distance criteria under both sequential (SDE) and adaptive (ADE) settings. Evaluated using root mean squared prediction error (RMSPE).

Real Application: Biochemical case study involving enzyme kinetics modeling with physical laboratory experiments. Used actual experimental data to demonstrate practical effectiveness.

Baselines:

  • IMSPE minimization (Leatherman et al. 2017)
  • D-optimal design (Krishna et al. 2021)
  • Maximin distance space-filling design

Key Results:

  • MI-based criterion outperforms classical methods, especially in early design rounds
  • Hybrid MI criterion with local complexity shows further improvement
  • Computational acceleration reduces runtime by orders of magnitude while maintaining design quality
  • ADE generally outperforms SDE when real-time experimentation is feasible

Performance Metrics: Root mean squared prediction error (RMSPE), computational runtime, convergence behavior across design rounds.

Limitations & Open Problems

Limitations:

  1. Discrete candidate set assumption - RESTRICTIVE (limits applicability to continuous optimization problems common in many domains)

  2. Gaussian process assumptions for computer model and discrepancy - NATURAL (standard in KOH framework and widely satisfied)

  3. Joint Gaussianity assumption for theoretical results - TECHNICAL (needed for closed-form expressions but likely extendable to broader cases)

  4. Fixed hyperparameters during design process - TECHNICAL (could be relaxed with hyperparameter learning)

  5. Computational complexity still cubic in number of prediction points $n_*$ - TECHNICAL (inherent to GP inference but could be approximated)

  6. Nested Monte Carlo introduces bias for finite inner samples - TECHNICAL (bias vanishes asymptotically)

    Open Problems:

  7. Extend to continuous design spaces while maintaining computational efficiency and theoretical guarantees

  8. Develop adaptive hyperparameter learning within the sequential design framework to handle non-stationary model behavior


Gaussian Process Limit Reveals Structural Benefits of Graph Transformers

Authors: Nil Ayday, Lingchu Yang, Debarghya Ghoshdastidar · Institution: Technical University of Munich, University of Tokyo · Category: stat.ML

First theoretical analysis showing that graph transformers can structurally avoid oversmoothing through NNGP limits, unlike message-passing GNNs which are destined for rank collapse.

Tags: graph neural networks neural network gaussian processes attention mechanisms oversmoothing theoretical analysis graph transformers stochastic block models kernel methods

arXiv · PDF

Problem Formulation
  1. Motivation: Graph transformers (GAT, Graphormer, Specformer) achieve state-of-the-art performance on graph learning tasks and empirically avoid oversmoothing issues that plague message-passing GNNs. However, there is limited theoretical understanding of why attention-based architectures provide these structural benefits.

  2. Mathematical setup: Consider a graph $G = (V, E)$ with $n$ nodes and adjacency matrix $A \in \mathbb{R}^{n \times n}$. Node features are collected in $X \in \mathbb{R}^{n \times d_{in}}$. A graph transformer with $L$ layers computes representations via:

    \[g^{(\ell)}(X) = \sum_{h=1}^{d_{\ell,H}} S^{(\ell,h)}_{GNN} f^{(\ell-1)}(X) W^{\ell,h} W^{\ell,H}\] \[f^{(\ell)}(X) = \text{ReLU}(g^{(\ell)}(X))\]

    where $S^{(\ell,h)}_{GNN} \in \mathbb{R}^{n \times n}$ is the attention-based graph convolution operator for head $h$ at layer $\ell$.

    Assumptions:

    1. Weight matrices follow independent Gaussian priors: $W^{\ell,h}_{ij} \sim \mathcal{N}(0, \sigma^2_w/d_{\ell-1})$
    2. Infinite width and head limits: $d_\ell, d_{\ell,H} \to \infty$
    3. Graph structure is fixed and known
  3. Toy example: Consider a 2-community stochastic block model with $n=4$ nodes, communities $C_1 = {1,2}$, $C_2 = {3,4}$, intra-community edge probability $p$, inter-community probability $q < p$. The expected adjacency has block structure:

    \[A = \begin{pmatrix} p & p & q & q \\ p & p & q & q \\ q & q & p & p \\ q & q & p & p \end{pmatrix}\]

    This illustrates the core difficulty: deep GNNs should preserve community structure (maintain block structure in representations) rather than oversmoothing to uniform representations.

  4. Formal objective: Derive the Neural Network Gaussian Process (NNGP) kernel recursion ${K^{(\ell)}}_{\ell=0}^L$ such that as $d_\ell, d_{\ell,H} \to \infty$:

    \[g^{(\ell)}(X) \sim \mathcal{GP}(0, \Sigma^{(\ell)})\]

    and analyze when $\lim_{\ell \to \infty} K^{(\ell)}/\text{tr}(K^{(\ell)})$ preserves community structure.

Method

The method derives NNGP limits for three graph transformer architectures through kernel recursions:

  1. GAT-GP: Derive edge-level logit kernel and convolution kernel - Edge kernel: $K^{(\ell),E}_{ai,bj} = \sigma^2_w \sigma^2_v (K^{(\ell-1)}_{ab} + K^{(\ell-1)}_{ij})$ - Node kernel: $\Sigma^{(\ell)}_{ab} = \sigma^2_H \sigma^2_w \sum_{i \in N_a} \sum_{j \in N_b} K^{(\ell-1)}_{ij} C^{(\ell)}_{ai,bj}$

  2. Graphormer-GP: Incorporate positional encodings via augmented kernel - Augmented kernel: $\tilde{K}^{(\ell)} = \alpha K^{(\ell-1)} + (1-\alpha) R^{(\ell)}$ - Node kernel: $\Sigma^{(\ell)}_{ab} = \sigma^2_H \sigma^2_w \sum_{i,j \in V} \tilde{K}^{(\ell-1)}_{ij} C^{(\ell)}_{ai,bj}$

  3. Specformer-GP: Spectral domain operation with learned filters - Learned spectral kernel: $K_{\lambda,ab} = \mathbb{E}[\bar{\lambda}_a \bar{\lambda}_b]$ - Node kernel: $\Sigma^{(\ell)} = \sigma^2_H \sigma^2_w U(K_\lambda \odot (U^T K^{(\ell-1)} U))U^T$

  4. CSBM Analysis: Specialize kernels under 2-community block model structure

    For the toy 4-node CSBM example:

    • GAT-GP kernel becomes: $K^{(\ell)} = \begin{pmatrix} x_\ell \mathbf{1}\mathbf{1}^T & y_\ell \mathbf{1}\mathbf{1}^T \ y_\ell \mathbf{1}\mathbf{1}^T & x_\ell \mathbf{1}\mathbf{1}^T \end{pmatrix}$
    • Evolution: $x_\ell, y_\ell = \frac{1}{2} G^{2\ell} [(x_0 + y_0) \pm (x_0 - y_0) F^\ell]$
    • Where $F = 2\frac{p^2 - pq + q^2}{(p+q)^2}$ determines community preservation
Novelty & Lineage

This work provides the first NNGP analysis of graph transformers, building on:

Prior work:

  • Lee et al. (2018): NNGP limits for standard neural networks
  • Hron et al. (2020): NNGP limits for standard transformers
  • Niu et al. (2023), Sabanayagam et al. (2023): NNGP limits for GCNs
  • Various empirical work showing graph transformers avoid oversmoothing

Novel contributions:

  • First derivation requiring both infinite width AND infinite heads (unlike GCNs which only need infinite width)
  • Novel multi-level kernel structure: edge-level, convolution-level, and node-level kernels
  • First theoretical proof that graph transformers can structurally avoid oversmoothing under specific conditions
  • Unified analysis framework covering GAT, Graphormer, and Specformer

Assessment: SIGNIFICANT - provides fundamental theoretical foundation for understanding graph transformers, with novel technical contributions in the infinite-head limit analysis.

Proof Techniques

The proofs use several key techniques:

  1. Infinite-head CLT: Unlike standard NNGP analysis, requires careful handling of both width and head limits. Key insight is the aggregation mechanism:

    \[f^{(\ell)}(X) = \sigma\left(\sum_{h=1}^{d_{\ell,H}} S^{(\ell,h)}_{GNN} f^{(\ell-1)}(X) W^{\ell,h} W^{\ell,H}\right)\]

    The shared projection $W^{\ell,H}$ enables CLT despite $n \times n$ stochasticity in attention matrices.

  2. Multi-level kernel recursion: Establishes Gaussian limits at multiple levels: - Attention logits: $E^{(\ell)} \sim \mathcal{GP}(0, K^{(\ell),E})$ - Graph convolutions: $C^{(\ell)}_{ai,bj} = \mathbb{E}[S^{(\ell,1)}_{ai} S^{(\ell,1)}_{bj}]$ - Node representations: $g^{(\ell)} \sim \mathcal{GP}(0, \Sigma^{(\ell)})$

  3. CSBM spectral analysis: For oversmoothing analysis, uses eigendecomposition of block matrices. Key inequality for GAT-GP:

    \[\lim_{\ell \to \infty} \frac{y_\ell}{x_\ell} = \frac{1 - \gamma}{1 + \gamma}\]

    where $\gamma = \lim_{\ell \to \infty} \left(\frac{x_0 - y_0}{x_0 + y_0}\right) F^\ell$

  4. Rank preservation condition: Shows that if $F \geq 1$ (well-separated communities), then $\text{rank}(\lim_{\ell \to \infty} K^{(\ell)}) = 2$, avoiding oversmoothing.

  5. Spectral amplification mechanism: For Specformer-GP, proves that learned filters can amplify community-preserving spectral directions when $\lvert \bar{\lambda}_2 \rvert \geq \lvert \bar{\lambda}_1 \rvert$.

Experiments & Validation

Synthetic experiments:

  • Contextual Stochastic Block Model (CSBM) with 2 communities
  • Random features vs aligned features settings
  • Validation of GP convergence (Figure 1) as width and heads increase

Real-world datasets:

  • Pubmed (homophilic): 19,717 nodes, 44,338 edges
  • Chameleon (heterophilic): 2,277 nodes, 36,101 edges

Key results:

  • GAT-GP maintains stable performance with depth (confirms Corollary 2)
  • GCN-GP shows consistent performance degradation (confirms oversmoothing)
  • Graphormer-GP with Laplacian eigenvectors shows performance improvement with depth
  • Specformer-GP behavior depends on learned spectral weights

Baselines: Compared against GCN-GP as the message-passing baseline that suffers from oversmoothing

Validation methodology: Test accuracy vs number of layers (2-500), comparing original models and models augmented with Laplacian positional encodings

Limitations & Open Problems

Limitations:

  1. NNGP analysis requires infinite width and heads - TECHNICAL (enables tractable analysis but finite models may behave differently)

  2. Limited to ReLU and linear activations for closed-form kernels - TECHNICAL (more general activations could be handled numerically)

  3. CSBM analysis restricted to 2-community case - TECHNICAL (extension to multi-community straightforward but algebraically complex)

  4. Fixed graph assumption - NATURAL (standard in graph learning theory)

  5. Independence assumptions on weight priors - NATURAL (standard in NNGP literature)

  6. No training dynamics analysis - RESTRICTIVE (GP limit studies initialization, not optimization)

    Open problems:

  7. Finite-width finite-head analysis: Develop concentration bounds showing when finite graph transformers approximate their GP limits

  8. Heterophilic graph analysis: Extend CSBM results to graphs where connected nodes have dissimilar features, requiring more sophisticated random graph models beyond SBM


Wasserstein-type Gaussian Process Regressions for Input Measurement Uncertainty

Authors: Hengrui Luo, Xiaoye S. Li, Yang Liu, Marcus Noack et al. (6 authors) · Institution: Rice University, Lawrence Berkeley National Laboratory · Category: stat.ME

Introduces Projected Wasserstein ARD kernels for Gaussian processes that handle input measurement uncertainty through deterministic coordinate projections of Wasserstein distances, providing better-calibrated uncertainty quantification than standard approaches.

Tags: gaussian-processes uncertainty-quantification errors-in-variables wasserstein-distance optimal-transport kernel-methods statistical-learning-theory

arXiv · PDF

Problem Formulation
  1. Motivation: Standard Gaussian process regression assumes noise-free inputs, but when covariates are measured with error (errors-in-variables setting), this leads to overly narrow posterior intervals and poor uncertainty quantification. This matters for scientific applications like accelerator calibration and climate monitoring where reliable uncertainty quantification drives decision-making.

  2. Mathematical setup: Let $f: \mathcal{P} \to \mathbb{R}$ be a function defined on probability measures $\mathcal{P}$ on $\mathbb{R}^d$. Each noisy input is represented as a probability measure $\mu_i \in \mathcal{P}$ rather than a point in $\mathbb{R}^d$. The standard GP assumes observations:

    \[y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)\]

    where $x_i \in \mathbb{R}^d$ are noise-free. In the errors-in-variables setting we observe:

    \[y_i = f(X_i + \epsilon_{X,i}) + \epsilon_i\]

    where $X_i$ are latent true inputs and $\epsilon_{X,i} \sim N(0, \Sigma_X)$ is input measurement error. The Wasserstein distance between probability measures $\mu, \nu \in \mathcal{P}_p(\mathbb{R}^d)$ is:

    \[W_p(\mu, \nu) = \left(\inf_{\gamma \in \Gamma(\mu,\nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x-y\|^p d\gamma(x,y)\right)^{1/p}\]

    where $\Gamma(\mu,\nu)$ is the set of couplings with marginals $\mu$ and $\nu$.

    Assumptions:

    1. Input distributions have finite $p$-th moments
    2. For $p=1$ theory: quantile functions are uniformly Lipschitz
    3. Covering number condition: $M(\tau, \mathcal{P}) \leq C\tau^{-\alpha_0}$ for some $\alpha_0 > 0$
  3. Toy example: When $d=1$ and inputs are Gaussian $\mu_i = N(m_i, s_i^2)$, $\nu_j = N(m_j, s_j^2)$, the 1-Wasserstein distance has closed form:

    \[W_1(\mu_i, \nu_j) = |m_i - m_j| + |s_i - s_j|\]

    The core difficulty is that standard kernels $k(m_i, m_j) = \exp(-\lvert m_i - m_j \rvert^2/\ell^2)$ ignore the variance terms $s_i, s_j$, losing crucial uncertainty information.

  4. Formal objective: Define a positive definite kernel on probability measures that captures input uncertainty:

    \[k(\mu, \nu) = \lambda \exp(-\sigma W_p(\mu, \nu)^p)\]
Method

The method introduces the Projected Wasserstein ARD (PWA) kernel that operates on marginal distributions:

  1. For measures $\mu, \nu$ on $\mathbb{R}^d$, define the PWA kernel:

    \[k(\mu, \nu) = \lambda \prod_{i=1}^d \exp(-\sigma_i W_p(\mu_i, \nu_i)^p)\]

    where $\mu_i, \nu_i$ are the marginal distributions along dimension $i$.

  2. For 1D marginals with CDFs $F_\mu, F_\nu$, use the closed-form Wasserstein distance:

    \[W_p(\mu, \nu) = \left(\int_0^1 |F_\mu^{-1}(q) - F_\nu^{-1}(q)|^p dq\right)^{1/p}\]
  3. The principal component variant (PCPWA) projects onto fixed directions ${v_r}_{r=1}^m$:

    \[k_{PCPWA}(\mu, \nu) = \lambda \prod_{r=1}^m \exp(-\sigma_r W_p(\mu_{v_r}, \nu_{v_r})^p)\]

    where $\mu_{v_r}$ is the pushforward of $\mu$ along direction $v_r$.

  4. Fit the GP using standard likelihood maximization with kernel matrix $K_{ij} = k(\mu_i, \mu_j)$.

    Toy example application: For 1D Gaussians $\mu_1 = N(0, 0.1^2)$, $\mu_2 = N(0, 0.5^2)$:

    \[W_1(\mu_1, \mu_2) = |0.1 - 0.5| = 0.4\] \[k(\mu_1, \mu_2) = \lambda \exp(-\sigma \cdot 0.4)\]

    This captures that despite identical means, these distributions differ substantially in spread.

Novelty & Lineage

This extends Wasserstein GP work by Candelieri et al. (2022) for discrete measures and sliced Wasserstein kernels by Meunier et al. (2022). Key differences:

  1. PWA uses deterministic coordinate projections rather than random slicing, avoiding Monte Carlo error
  2. incorporates ARD-style dimension-specific parameters $\sigma_i$
  3. provides theoretical coverage guarantees via net-extension bounds. Prior EIV methods like Kennedy & O’Hagan
  4. use latent variable approaches requiring high-dimensional integration. The contribution is developing a computationally tractable alternative with theoretical guarantees. INCREMENTAL.
Proof Techniques

The main theoretical result is a uniform posterior band for the $p=1$ case:

  1. Net-extension argument: For a $\tau$-net ${\bar{\mu}_j}_{j=1}^M$ of the measure space $\mathcal{P}$, if point-wise bounds hold:

    \[|f(\bar{\mu}_j) - \nu_N(\bar{\mu}_j)| \leq B \sigma_N(\bar{\mu}_j)\]

    then by Lipschitz continuity:

    \[|f(\mu) - \nu_N(\mu)| \leq B \sigma_N(\mu) + (L_f + L_{\nu_N})\tau + B \omega_{\sigma_N}(\tau)\]
  2. Gaussian tail bound: Under well-specified GP prior $f \sim GP(0,k)$, for each net point:

    \[P(|f(\bar{\mu}_j) - \nu_N(\bar{\mu}_j)| > \sqrt{\beta(\tau)} \sigma_N(\bar{\mu}_j)) \leq \frac{\delta}{M(\tau, \mathcal{P})}\]

    where:

    \[\beta(\tau) = \left[\Phi^{-1}\left(1 - \frac{\delta}{2M(\tau, \mathcal{P})}\right)\right]^2\]
  3. Union bound over the net combined with the net-extension gives the uniform bound:

    \[|f(\mu) - \nu_N(\mu)| \leq \sqrt{\beta(\tau)} \sigma_N(\mu) + \gamma(\tau)\]

    with posterior probability $\geq 1-\delta$, where $\gamma(\tau) = (L_f + L_{\nu_N})\tau + \sqrt{\beta(\tau)} \omega_{\sigma_N}(\tau)$.

    Key insight: The deterministic projection avoids additional randomness while the ARD structure provides dimension-specific length scales.

Experiments & Validation

Datasets:

  1. 7 synthetic scenarios including 1D errors-in-variables, varying variance, skewed distributions, 2D anisotropic, and high-dimensional Ackley functions
  2. particle accelerator calibration with 5D magnet settings and beam emittance
  3. NOAA ocean drifter trajectories with sea surface temperature.

    Baselines: Regular GP on means, aggregated GP on samples, uncertain-input GP (Girard et al. 2002), sliced Wasserstein GP, KME/MMD distributional GPs.

    Key results: In 1D-EIV scenario, Regular GP achieves RMSE 0.309 but undercoverage (65% vs nominal 90%), while PWA-GP achieves similar RMSE (0.301) with proper coverage (97%). In accelerator data, uncertain-input GP performs best (RMSE 0.134, coverage 86%), but PWA methods provide conservative coverage (96-98%) with competitive accuracy. NOAA experiments show PWA-GP maintains robust performance across temporal shifts while standard GP degrades.

Limitations & Open Problems

Limitations:

  1. Theory restricted to $p=1$ case for covering number control - TECHNICAL (higher-order cases likely tractable with more sophisticated analysis)
  2. Requires representing input distributions, not applicable when only point observations available - NATURAL (matches the errors-in-variables problem setup)
  3. Computational cost scales with dimension through product kernel structure - NATURAL (standard for ARD kernels)
  4. Positive definiteness not guaranteed for general $W_p$ kernels when $p \neq 2$ in high dimensions - TECHNICAL (resolved by projection approach)

    Open problems:

  5. Extend uniform bounds to $p > 1$ and multivariate optimal transport settings
  6. Develop adaptive projection methods that learn optimal directions rather than using coordinates or PCA

Adaptive regularization parameter selection for high-dimensional inverse problems: A Bayesian approach with Tucker low-rank constraints

Authors: Qing-Mei Yang, Da-Qing Zhang · Institution: University of Science and Technology Liaoning · Category: cs.LG

Tucker-VB reduces computational complexity of variational Bayesian regularization from O(n^3) to O(n_G^3) by projecting inference onto low-rank tensor subspaces with adaptive per-mode regularization.

Tags: inverse problems variational Bayes tensor decomposition regularization high-dimensional inference image processing heat conduction computational efficiency

arXiv · PDF

Problem Formulation
  1. Motivation: High-dimensional inverse problems in medical imaging, signal processing, and remote sensing require efficient regularization parameter selection. Existing variational Bayesian methods become computationally intractable when the unknown dimension n reaches 10^4 to 10^6 due to O(n^3) complexity and O(n^2) memory requirements.

  2. Mathematical setup: Consider the discretized linear inverse problem

    \[y = Ax + \epsilon\]

    where $A \in \mathbb{R}^{m \times n}$ is the forward operator, $y \in \mathbb{R}^m$ is the observation vector, $\epsilon \sim N(0, \sigma^2 I)$ is Gaussian noise, and $x \in \mathbb{R}^n$ is the unknown. The unknown vector $x$ can be reshaped into a d-th order tensor $X \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_d}$ where $n = \prod_{k=1}^d I_k$. Under Tucker decomposition:

    \[X \approx G \times_1 U_1 \times_2 \cdots \times_d U_d\]

    where $G \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_d}$ is the core tensor with $R_k \ll I_k$.

    Assumptions:

    1. The unknown admits accurate low-rank Tucker approximation
    2. Factor matrices $U_k$ are predetermined from problem structure (orthonormal)
    3. Gaussian observation noise with unknown precision $\beta$
    4. Zero-mean Gaussian prior on core tensor coefficients with unknown precision $\alpha$
  3. Toy example: When $d=2$ with $I_1 = I_2 = 100$ and ranks $R_1 = R_2 = 10$, the original problem has $n = 10,000$ unknowns requiring a $10,000 \times 10,000$ posterior covariance matrix. The Tucker constraint reduces this to $n_G = 100$ core tensor coefficients with a $100 \times 100$ covariance matrix, reducing storage from 800MB to 80KB.

  4. Formal objective: Estimate the regularization parameter $\lambda = \alpha/\beta$ by maximizing the evidence lower bound:

    \[\mathcal{L}(q) = \int q(g,\alpha,\beta) \log \frac{p(y,g,\alpha,\beta)}{q(g,\alpha,\beta)} dg d\alpha d\beta\]
Method

The Tucker-VB method performs variational Bayesian inference in the reduced core tensor space:

  1. Transform the observation model using Tucker constraint $x = \tilde{U}g$ where $\tilde{U} = U_d \otimes \cdots \otimes U_1$:

    \[y = A\tilde{U}g + \epsilon = \tilde{A}g + \epsilon\]
  2. Use mean-field approximation $q(g,\alpha,\beta) = q(g)q(\alpha)q(\beta)$ with variational updates:

  3. Core tensor posterior (Gaussian):

    \[q^*(g) = N(g|\mu_g, \Sigma_g)\] \[\Sigma_g = (E[\beta]\tilde{A}^T\tilde{A} + E[\alpha]I)^{-1}\] \[\mu_g = E[\beta]\Sigma_g\tilde{A}^T y\]
  4. Precision hyperparameter posteriors (Gamma):

    \[q^*(\alpha) = \text{Gamma}(\alpha|\tilde{a}_\alpha, \tilde{b}_\alpha)\] \[\tilde{a}_\alpha = a_0 + \frac{n_G}{2}, \quad \tilde{b}_\alpha = b_0 + \frac{1}{2}E[\|g\|^2]\]
  5. Multi-hyperparameter extension assigns independent precisions ${\alpha_k}_{k=1}^d$ to each tensor mode with effective precision $\bar{\alpha} = \sum_{k=1}^d E[\alpha_k]$.

    Applied to toy example: With $n_G = 100$ and $\tilde{A} \in \mathbb{R}^{m \times 100}$, the method inverts a $100 \times 100$ matrix instead of $10,000 \times 10,000$, reducing complexity from O(10^12) to O(10^6) operations.

Novelty & Lineage

This work extends variational Bayesian regularization parameter selection (classical methods like empirical Bayes) by incorporating Tucker low-rank constraints for computational tractability. Prior work includes Calvetti-Somersalo distributed regularization and standard variational Bayes for inverse problems, but these operate in the original parameter space. The key innovation is projecting variational inference onto the Tucker subspace, reducing complexity from O(n^3) to O(n_G^3), and introducing per-mode precision parameters for anisotropic regularization. The multi-hyperparameter prior structure enabling mode-adaptive regularization is novel. SIGNIFICANT.

Proof Techniques

The derivation uses standard variational calculus with mean-field approximation:

  1. The optimal variational factor satisfies:

    \[\log q^*(z_i) = E_{q_{-i}}[\log p(y,z)] + C\]
  2. For the core tensor, substituting Gaussian likelihood and prior yields:

    \[\log q^*(g) \propto -\frac{E[\beta]}{2}\|y - \tilde{A}g\|^2 - \frac{E[\alpha]}{2}\|g\|^2\]
  3. Completing the square in the quadratic form:

    \[q^*(g) = N(g|\mu_g, \Sigma_g)\]

    where the precision matrix is $\Sigma_g^{-1} = E[\beta]\tilde{A}^T\tilde{A} + E[\alpha]I$.

  4. For hyperparameters, conjugate Gamma priors yield closed-form Gamma posteriors:

    \[\log q^*(\alpha) \propto \left(a_0 + \frac{n_G}{2} - 1\right)\log\alpha - \left(b_0 + \frac{E[\|g\|^2]}{2}\right)\alpha\]
  5. The multi-mode analysis exploits the identity:

    \[\sum_{i_k=1}^{R_k} \|g_{i_k}^{(k)}\|^2 = \|g\|^2\]

    showing the joint prior factorizes with effective precision $\bar{\alpha} = \sum_{k=1}^d \alpha_k$.

Experiments & Validation

Experiments include:

  1. Scalability comparison showing Tucker-VB scales O(n^2) vs direct VB’s O(n^6), enabling problems up to 110,000 variables
  2. 2D Fredholm integral equations with 500 trials per noise level, comparing against L-curve, GCV, UPRE, DP baselines - Tucker-VB shows 20-50% improvement
  3. Anisotropic image deblurring on 512×512 Cameraman image showing 0.73-2.09 dB PSNR gains
  4. 3D backward heat conduction (48×48×48 grid) demonstrating 6.75 dB average improvement and automatic learning of physical anisotropy parameters. Key metrics: PSNR, SSIM, relative reconstruction error, computational timing.
Limitations & Open Problems

Limitations:

  1. Requires predetermined Tucker factor matrices from problem structure - TECHNICAL (could be addressed with alternating optimization)
  2. Assumes low-rank Tucker structure exists - NATURAL (common in multi-dimensional data)
  3. Sensitivity to rank selection in Tucker decomposition - TECHNICAL (could be automated)
  4. Limited to Gaussian noise and priors - NATURAL (standard assumption)
  5. No theoretical convergence guarantees for variational updates - TECHNICAL (standard limitation)

    Open problems:

  6. Automated rank selection algorithms for Tucker decomposition without manual tuning
  7. Extension to non-Gaussian noise models and heavy-tailed priors for robust reconstruction

Variational Rectification Inference for Learning with Noisy Labels

Authors: Haoliang Sun, Qi Wei, Lei Feng, Yupeng Hu et al. (7 authors) · Institution: Shandong University, Nanyang Technological University · Category: cs.LG

Proposes variational rectification inference that prevents model collapse in probabilistic meta-learning for noisy labels by formulating rectification as amortized variational inference with KL regularization.

Tags: meta-learning noisy-labels variational-inference robust-learning deep-learning classification bayesian-methods bi-level-optimization

arXiv · PDF

Problem Formulation
  1. Motivation: Learning with noisy labels is a fundamental challenge in deep learning, where label corruption significantly degrades model performance. Meta-learning approaches have emerged to address this, but existing probabilistic methods suffer from model collapse where the posterior degenerates to a Dirac delta function.

  2. Mathematical setup: Consider a training dataset $D_N = {(x^{(i)}, y^{(i)})}_{i=1}^N$ with potentially noisy labels and a clean meta-dataset $D_M = {(\tilde{x}^{(i)}, \tilde{y}^{(i)})}_{i=1}^M$ where $M \ll N$. Let $F_\theta: \mathcal{X} \to \mathbb{R}^C$ be a classification network with parameters $\theta$. The goal is to learn a rectifying vector $v^{(i)} \in \mathbb{R}^C$ for each sample to modify the prediction via element-wise multiplication:

    \[\hat{y}^{(i)} = \text{softmax}(v^{(i)} \odot F_\theta(x^{(i)}))\]

    Assumptions:

    1. The rectifying vector follows a factorized Gaussian distribution $v \sim \mathcal{N}(\mu, \sigma^2)$
    2. Clean meta-data is available for meta-learning
    3. The smoothness assumption holds: similar features should produce similar rectification vectors
  3. Toy example: When $C=2$ classes and $v = [0.9, 1.1]$, if $F_\theta(x) = [2.0, 1.0]$, then the rectified logits become $[1.8, 1.1]$, reducing confidence in the potentially noisy prediction.

  4. Formal objective: Maximize the evidence lower bound (ELBO):

    \[\mathcal{L}_{ELBO} = \mathbb{E}_{q_\phi(v|x,y)}[\log p_\theta(y|x,v)] - D_{KL}[q_\phi(v|x,y) \| p_\omega(v|x)]\]
Method

Variational Rectification Inference (VRI) formulates rectification as an amortized variational inference problem with three networks:

  1. Classification network $F_\theta$ with feature extractor $F’_{\theta’}$
  2. Meta-network $V_\phi$ that outputs Gaussian parameters $(\mu, \sigma^2) = V_\phi(F’_{\theta’}(x), y)$
  3. Prior network $H_\omega$ that outputs prior parameters from features only

    The method proceeds as:

  4. Sample rectifying vector using reparameterization: $v = \mu + \sigma \cdot \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$
  5. Compute rectified prediction: $\hat{y} = \text{softmax}(\pi[v] \odot F_\theta(x))$ where $\pi$ is sigmoid
  6. Minimize empirical loss with KL regularization:

    \[\mathcal{L}_{emp}(\theta) = \frac{1}{N}\sum_{i=1}^N \mathcal{L}(y^{(i)}, \pi[v^{(i)}] \odot F_\theta(x^{(i)})) + \lambda D_{KL}[V_\phi(\cdot) \| H_\omega(\cdot)]\]
  7. Meta-learning via bi-level optimization where meta-network parameters are updated using clean meta-data performance.

    Toy example application: For $x$ with feature $F’_{\theta’}(x) = [0.5, 0.3, 0.8]$ and noisy label $y=1$, the meta-network outputs $\mu = [0.9, 1.1, 0.8], \sigma^2 = [0.1, 0.1, 0.1]$. Sampling gives $v = [0.88, 1.13, 0.79]$ which rectifies the original logits.

Novelty & Lineage

This work extends probabilistic meta-learning for noisy labels, specifically building on:

  • Meta-Weight-Net (Shu et al. 2019): introduced meta-learning for sample reweighting
  • WarPI (Sun et al. 2022a): used Monte Carlo approximation for rectification vectors
  • Probabilistic Meta-Weight-Net (Zhao et al. 2023): applied Bayesian modeling to meta-learning

The key novelty is formulating rectification as amortized variational inference rather than Monte Carlo approximation. This prevents model collapse where the posterior degenerates to a Dirac delta function. The variational formulation with KL regularization between posterior and prior is new in this context, as is the smoothness assumption enforcement through the prior network.

SIGNIFICANT

Proof Techniques

The theoretical analysis focuses on convergence guarantees for the bi-level optimization algorithm:

  1. Smoothness analysis: Establishes that the meta loss function is $\hat{\ell}$-smooth under standard assumptions:

    \[\|∇L_{meta}(\theta_1) - ∇L_{meta}(\theta_2)\| \leq \hat{\ell}\|\theta_1 - \theta_2\|\]

    Key technical requirement: The loss function $L$ is $\ell$-smooth and $\tau$-Lipschitz, the KL term has $o$-bounded gradient, and the meta-network $V_\phi$ is differentiable with $\delta$-bounded gradient and $\zeta$-bounded Hessian.

  2. Convergence rate proof: Uses stochastic gradient analysis with learning rate schedule $\alpha_t = \min{1, \kappa/T}$ and $\eta_t = \min{1/\hat{\ell}, C/(\sigma\sqrt{T})}$ to establish:

    \[\frac{1}{T}\sum_{t=1}^T \mathbb{E}[\|\nabla L_{meta}(\hat{\theta}^{(t)}(\phi^{(t)}))\|_2^2] \leq O(1/\sqrt{T})\]
  3. Key technical insight: The variational formulation prevents collapse because when the posterior approaches a Dirac delta (variance → 0), the KL divergence $D_{KL}[q_\phi \lvert p_\omega]$ becomes large, penalizing this degeneration.

    The proof technique follows standard stochastic optimization analysis but requires careful handling of the bi-level structure and the sampling operations via reparameterization.

Experiments & Validation

Datasets: CIFAR-10, CIFAR-100, Clothing1M, Food-101N, ANIMAL-10N, and CIFAR-80N (open-set noise)

Noise types: Flip noise (20%, 40%), uniform noise (20%, 40%), instance-dependent noise (20%, 40%), real-world noise, and open-set noise

Baselines: 20+ methods including DivideMix, ELR, Meta-Weight-Net, MSLC, WarPI, CORES, SFT, GSS-SSL, PES, CoDis, etc.

Key results:

  • CIFAR-10 flip 40%: 93.27% vs 92.48% (previous best MSLC)
  • CIFAR-100 flip 20%: 75.13% vs 74.20% (previous best ELR)
  • Clothing1M: 75.19% vs 74.98% (previous best WarPI)
  • Food-101N: 86.24% vs 86.13% (previous best CoDis)
  • CIFAR-80N open-set uniform 20%: 72.90% vs 71.54% (previous best USDNL)

Architectures tested: ResNet-18/32/34, Wide ResNet-28-10, ResNet-50

Ablation studies show VRI requires only k=1-2 samples vs k=5+ for MC methods, and prevents model collapse demonstrated by stable variance norms.

Limitations & Open Problems

Limitations:

  1. NATURAL: Requires clean meta-dataset for meta-learning, standard assumption in this field
  2. TECHNICAL: Assumes factorized Gaussian distributions for tractability, likely removable with more sophisticated variational families
  3. TECHNICAL: Uses sigmoid activation to constrain rectification vectors to [0,1], design choice that could be optimized
  4. NATURAL: Smoothness assumption that similar features should have similar rectifications, reasonable in most domains
  5. RESTRICTIVE: Limited to classification tasks, extension to regression or structured prediction unclear
  6. TECHNICAL: KL weight λ requires tuning, though shown to be robust across datasets

    Open problems:

  7. Extending VRI to semi-supervised learning scenarios where unlabeled data is also available during meta-learning
  8. Developing theoretical understanding of when the variational posterior provides better approximation than MC sampling beyond preventing collapse

The Causal Uncertainty Principle: Manifold Tearing and the Topological Limits of Counterfactual Interventions

Authors: Rui Wu, Hong Xie, Yongjun Li · Institution: University of Science and Technology of China · Category: cs.LG

Proves that deterministic generative models inevitably develop finite-time singularities under extreme causal interventions, establishing fundamental geometric limits and an uncertainty principle for continuous counterfactual inference.

Tags: causal-inference optimal-transport manifold-geometry generative-models flow-matching schrodinger-bridges stochastic-analysis riemannian-geometry

arXiv · PDF

Problem Formulation

Motivation: Continuous causal inference requires transporting probability measures on manifolds to simulate counterfactuals, but existing deterministic generative models (Flow Matching, Neural ODEs) fail under extreme interventions. Understanding the fundamental geometric limits of such transport is crucial for reliable causal AI in high-dimensional domains like genomics and medical imaging.

Mathematical Setup: Let $(M, g)$ be a complete Riemannian manifold with observational measure $\mu_0 \in \mathcal{P}(M)$ having smooth density $\rho_0 > 0$ supported on compact domain $\Omega_0$ with diameter $\Delta$. A continuous intervention $do(x^*)$ creates target measure:

\[\mu_{do(x^*)}^{1,\sigma}(x) = p_{\sigma^2/2}(x^*, x) d\text{vol}_g(x)\]

where $p_t(x,y)$ is the heat kernel and intervention distance $D = d_g(\text{supp}(\mu_0), x^*) \gg \Delta$.

Assumptions:

  1. Sectional curvature bounded below by $-\kappa^2$ for $\kappa \geq 0$
  2. Reference potential $V$ satisfies distant dissipativity: $V(x) \geq C_V d_g(x, x_{\text{obs}})^2$ for $x \notin K \supset \text{supp}(\mu_0)$
  3. Invariant measure satisfies Log-Sobolev inequality with constant $C_{LS} > 0$

    Toy Example: When $M = \mathbb{R}^2$, $\mu_0 = \mathcal{N}(0, I_2)$, and intervention target $x^* = (5, 0)^T$, the transport distance is $D = 5$ and the mollified target is $\mu_{do(x^*)}^{1,\sigma} = \mathcal{N}(x^*, \sigma^2 I_2)$.

    Formal Objective: Find the critical transport distance (Counterfactual Event Horizon) $\delta_{\text{crit}}$ and singularity time $t_c$ such that deterministic transport fails:

    \[\inf_{t < t_c} \det(\nabla_x \Phi_t(x)) = 0\]

    where $\Phi_t$ is the optimal transport flow map.

Method

Geometry-Aware Causal Flow (GACF): An adaptive algorithm that operates in deterministic ODE mode but injects entropy when manifold tearing is imminent.

Key Steps:

  1. Compute geometric viscosity requirement:

    \[\varepsilon_{\text{req}} = \frac{C_0 \Delta}{1 - \kappa_- \Delta^2} \cdot D\]
  2. Use Hutchinson trace estimator as topological radar:

    \[\tilde{\theta}_t = z^T \nabla_x u_t(x_t) z\]

    where $z \sim \text{Rademacher}({-1,1}^n)$ and $u_t$ is the velocity field.

  3. Adaptive entropy injection:

    \[\varepsilon_t = \begin{cases}\] \[0 & \text{if } \tilde{\theta}_t \geq \lambda_{\text{thresh}} \text{ (ODE mode)} \\\] \[\varepsilon_{\text{req}} & \text{if } \tilde{\theta}_t < \lambda_{\text{thresh}} \text{ (SDE mode)}\] \[\end{cases}\]
  4. Update state:

    \[x_{t+\Delta t} = x_t + u_t(x_t)\Delta t + \sqrt{2\varepsilon_t \Delta t} \xi\]

    Applied to Toy Example: For $D = 5$, $\Delta = 2$, the required entropy is $\varepsilon_{\text{req}} = C_0 \cdot 10$. The Hutchinson estimator tracks divergence $\tilde{\theta}_t \approx -5t/(1-5t)$ until threshold $\lambda_{\text{thresh}} = -10$ is crossed, triggering SDE mode.

Novelty & Lineage

Novel Contributions: First rigorous proof that deterministic causal transport develops finite-time singularities under extreme interventions, establishing fundamental geometric limits via manifold tearing theorems. Introduces the Causal Uncertainty Principle quantifying the trade-off between intervention extremity and identity preservation.

Prior Work Extensions: Builds on classical optimal transport regularity (Villani 2009, Loeper 2009) by connecting abstract singularities to physical causal mechanisms. Extends Schrödinger Bridge theory (Schiebinger et al. 2019, Bunne et al. 2023) from algorithmic heuristics to geometric necessity proofs.

Assessment: SIGNIFICANT - Provides foundational theoretical understanding of when and why continuous causal inference fails, with rigorous mathematical proofs and practical algorithmic solutions.

Proof Techniques

Main Proof Strategy - Three Theorems:

Theorem 5 (Event Horizon): Uses measure disintegration and data processing inequality:

\[\text{KL}(P^* \| Q) \geq \text{KL}(\mu_{do(x^*)}^{1,\sigma} \| Q_1) \geq \frac{C_V}{\varepsilon} D^2 + \frac{n}{\varepsilon} \log(1/\sigma)\]

Theorem 8 (Manifold Tearing):

  1. Bound initial Hessian via Brenier regularity and Monge-Ampère equation:

    \[\lambda_0 \geq \left(\frac{\max \rho_0}{m_0}\right)^{1/n} + \kappa D \coth(\kappa D)\]
  2. Apply Raychaudhuri equation along characteristics:

    \[\dot{\theta}(t) \leq -\frac{1}{n}\theta^2(t) + nKD^2\]
  3. Integrate Riccati ODE to get singularity time:

    \[t_c \leq \frac{n}{\sqrt{nKD}} \text{arccoth}\left(\frac{\lambda_0}{\sqrt{nKD}}\right)\]

    Theorem 13 (Uncertainty Principle): Combines Bochner-Weitzenböck identity with Bakry-Émery bounds:

    \[H(P_{1|0}(\cdot | x_0)) \geq \frac{n}{2} \log\left(4\pi e \cdot \frac{C_0\Delta}{1-\kappa_-\Delta^2} \cdot D\right)\]

    Key Technical Insight: The Cole-Hopf transformation connects SDEs to viscous Burgers’ equation, revealing that deterministic limit produces shockwaves governed by hyperbolic PDEs.

Experiments & Validation

Datasets: Synthetic 2D transport scenarios with controlled geometric voids, high-dimensional neural flows, real-world single-cell RNA sequencing (scRNA-seq) data.

Baselines: Standard deterministic ODEs (Flow Matching), fixed-entropy SDEs, standard Schrödinger Bridges.

Key Results:

  • Empirical validation of $t_c \propto 1/D$ scaling law for singularity times
  • GACF reduces identity loss by 58.4% compared to fixed-SDE baselines while maintaining topological validity
  • Demonstrates prevention of “biological chimeras” in scRNA-seq counterfactuals
  • Confirms geometric singularity at support diameter threshold $\Delta \approx 1/\sqrt{\kappa_-}$
Limitations & Open Problems

Limitations:

  1. TECHNICAL - Assumes distant dissipativity condition on reference potential (Assumption 3), needed for control energy bounds but likely relaxable via more sophisticated regularity theory.

  2. TECHNICAL - Requires sectional curvature bounds, standard in Riemannian optimal transport but restrictive for neural network geometries.

  3. RESTRICTIVE - Analysis focuses on single interventions; multi-variable causal graphs may have additional topological obstructions.

  4. NATURAL - Assumes manifold hypothesis for high-dimensional data, widely accepted in modern ML.

  5. RESTRICTIVE - Hutchinson estimator requires Lipschitz velocity fields, which may not hold for all neural architectures.

    Open Problems:

  6. Extend manifold tearing analysis to multiple simultaneous interventions with potential cohomological obstructions
  7. Develop adaptive curvature estimation for unknown manifold geometries in learned representations

High-dimensional estimation with missing data: Statistical and computational limits

Authors: Kabir Aladin Verchand, Ankit Pensia, Saminul Haque, Rohith Kuditipudi · Institution: USC, Carnegie Mellon, Stanford · Category: math.ST

Establishes tight statistical rates and identifies exponential statistical-computational gaps for high-dimensional parameter estimation under realizable contamination from missing data.

Tags: missing_data robust_statistics sum_of_squares statistical_computational_gaps minimax_theory high_dimensional_statistics contamination_models gaussian_estimation

arXiv · PDF

Problem Formulation
  1. Motivation: Real-world data collection often suffers from missing observations that depend on the underlying data values in complex ways. Classical missing data assumptions like MCAR (missing completely at random) are too restrictive, while MNAR (missing not at random) is too general to permit consistent estimation.

  2. Mathematical setup: Consider a base distribution $P$ on $\mathbb{R}^d$ and contamination parameter $\epsilon \in [0,1]$. Define the realizable contamination model as:

    \[\mathcal{R}(P, \epsilon) := (1-\epsilon)\text{MCAR}(P,1) + \epsilon\text{MNAR}_P\]

    where MCAR$(P,1) = {\text{Law}(X \odot_* \Omega) : X \sim P, \Omega \in {0^d, 1^d}, \Omega \perp X, \mathbb{P}(\Omega = 1^d) = 1}$ and MNAR$_P = {\text{Law}(X \odot_* \Omega) : X \sim P, \text{Law}(\Omega) \in \mathcal{P}({0^d, 1^d})}$. The operator $\odot_*$ sets missing coordinates to $*$.

    This model satisfies the sandwich relation: for $Q \in \mathcal{R}(P, \epsilon)$ and conditional distribution $Q_R = \text{Law}(X \lvert X \in \mathbb{R}^d)$:

    \[1 - \frac{\epsilon}{1-\epsilon} \leq \frac{dQ_R}{dP}(z) \leq 1 + \frac{\epsilon}{1-\epsilon}\]
  3. Toy example: When $d=2$, $P = N(0, I_2)$, $\epsilon = 0.1$, and we observe samples $(x_1, x_2), (*, *), (x_3, x_4)$, the likelihood ratio for observed points is bounded in $[0.89, 1.11]$, creating a biased but controlled sampling mechanism.

  4. Formal objective: Estimate population parameters $\theta(P)$ (mean, covariance, regression coefficients) from $n$ i.i.d. samples from some $Q \in \mathcal{R}(P, \epsilon)$ with minimax risk:

    \[M_n(\delta, \mathcal{P}, L) = \inf_{\hat{\theta}} \sup_{\theta} \sup_{P_\theta \in \mathcal{P}_\theta} \text{Quantile}(1-\delta, P_\theta, L(\hat{\theta}, \theta))\]
Method

The paper develops three main approaches:

Information-theoretic algorithms (Section 3):

  1. Construct optimal univariate estimators for projections $\langle v, \theta \rangle$ using minimum Kolmogorov distance
  2. Combine via variational principle: $\hat{\theta} \in \arg\min_\theta \sup_{v \in S^{d-1}} \lvert \langle v, \theta \rangle - \hat{\theta}_v \rvert$
  3. Use $\epsilon$-nets to approximate the unit sphere (exponential time)

    Sum-of-squares algorithms (Section 4): For mean estimation, solve the SoS program with constraint:

    \[\{\|v\|_2^2 = 1\}^{4k}_v \mathbb{E}_{x \sim T}[\langle v, x - \theta \rangle^{2k}] \leq \frac{(1+\epsilon)^2}{1-\epsilon} \mathbb{E}_{G \sim N(0,1)}[G^{2k}]\]

    The key insight is that polynomial moment constraints can separate contaminated from clean distributions, and SoS can efficiently solve the resulting optimization.

    Linear regression algorithm (Section 5): Minimize the strongly convex empirical risk:

    \[\hat{\theta} = \arg\min_\theta \frac{1}{|T|} \sum_{(x,y) \in T} \ell(y - x^\top \theta)\]

    Applied to toy example: With $d=2$, $\epsilon=0.1$, the SoS program uses degree-$k$ polynomials to test if sample moments match those of $N(0, I_2)$ within the contamination bounds, then outputs the center of the feasible region.

Novelty & Lineage

This work extends the realizable contamination model introduced by [MVBWS24]. Novel contributions include:

  1. Tight information-theoretic bounds: First optimal rates for high-dimensional mean/covariance estimation under realizable contamination, improving sample complexity from $n \gtrsim d^{36/31}$ [MVBWS24] to $n \gtrsim d$.

  2. Statistical-computational gaps: First evidence of exponential gaps between information-theoretic ($n \gtrsim de^{1/\rho^2}$) and computationally efficient ($n \gtrsim d^{1/\rho^2}$) sample complexities for mean/covariance estimation.

  3. SoS algorithms: Novel sum-of-squares based estimators that achieve computational lower bounds, extending SoS techniques from robust statistics [KSS18, HL18] to missing data.

  4. Linear regression optimality: Shows no statistical-computational gap persists for linear regression, contrasting with mean/covariance settings.

    The proof techniques build on recent advances in computational lower bounds via NGCA [DKS17] and adaptive contamination models [DGKPX26].

    SIGNIFICANT

Proof Techniques

Information-theoretic lower bounds:

  1. Construct hard instances using high-dimensional hidden direction distributions $P_{A,v}(x) := a(v^\top x)\phi_{v^\perp}(x)$ where $A$ matches many moments with $N(0,1)$
  2. Apply Fano’s inequality (Lemma 2.10) with loss separation:

    \[\max\{L(\theta, \theta_k), L(\theta, \theta_\ell)\} \geq \gamma\]
  3. Bound KL divergences: $\frac{1}{M}\sum_{\ell} \text{KL}(P_{\theta_\ell}, H) \leq \frac{1}{2}\log(M)$

    Upper bounds via variational principles: Use the characterization $\lvert \theta - \theta_*\ \rvert_2 = \sup_{v \in S^{d-1}} \langle v, \theta - \theta_* \rangle$ and oracle inequality:

    \[\|\hat{\theta} - \theta_*\|_2 \leq 2 \sup_{v \in S^{d-1}} |\langle v, \theta_* \rangle - \hat{\theta}_v|\]

    SoS upper bounds:

  4. Show polynomial constraints separate clean from contaminated data under good event $E$
  5. Apply SoS rounding: if $\tilde{\mathbb{E}}$ satisfies constraints, then $\tilde{\mathbb{E}}[\theta]$ approximates true parameter
  6. Key moment bound under contamination:

    \[\mathbb{E}_{x \sim Q_R}[\langle v, x \rangle^{2k}] \leq (1+\epsilon) \mathbb{E}_{z \sim P}[\langle v, z \rangle^{2k}]\]

    Computational lower bounds: Reduce from NGCA testing problem using moment-matching distributions. Any SQ algorithm requires either $2^{d^{\Omega(1)}}$ queries or tolerance $\kappa \leq d^{-\Omega(k)}$ where $k$ is the degree of moment-matching.

Experiments & Validation

Purely theoretical. The paper provides no empirical validation.

Experiments would naturally test:

  1. finite-sample performance of SoS algorithms vs information-theoretic bounds on synthetic Gaussian data with varying $d, n, \epsilon$
  2. comparison with existing robust estimators like Tukey median on real datasets with artificial missingness
  3. computational runtime scaling for SoS vs. exponential-time optimal algorithms
  4. performance under model misspecification (non-Gaussian base distributions).
Limitations & Open Problems

Limitations:

  1. RESTRICTIVE: Gaussian base distribution assumption - many real datasets are non-Gaussian
  2. TECHNICAL: All-or-nothing missingness pattern in main results - partially missing coordinates are deferred to appendix
  3. TECHNICAL: Known contamination parameter $\epsilon$ - adaptation to unknown $\epsilon$ not addressed
  4. NATURAL: High-dimensional regime focus - low-dimensional performance could differ significantly
  5. RESTRICTIVE: SoS computational model - practical implementations may face numerical issues

    Open problems:

  6. Can the statistical-computational gaps for mean/covariance estimation be closed, or do they reflect fundamental computational barriers?
  7. Do similar statistical-computational gaps exist for other parameter estimation problems under realizable contamination (e.g., principal components, sparse regression)?

A Portfolio-Anchored Frequency-Severity Risk Index for Trip and Driver Assessment Using Telematics Signals

Authors: Jongtaek Lee, Andrei Badescu, X. Sheldon Lin · Institution: University of Toronto · Category: stat.AP

Introduces a frequency-severity joint risk index for telematics-based trip assessment using MODWT multi-scale representation and Gaussian-Uniform mixture models with portfolio-level severity layers.

Tags: telematics insurance risk_assessment wavelet_transform mixture_models usage_based_insurance anomaly_detection sequential_bayesian_updating

arXiv · PDF

Problem Formulation
  1. Motivation: Traditional auto insurance pricing relies on static driver characteristics that cannot capture real-time driving behavior or provide trip-level risk assessment. With telematics data becoming available, there is a need for dynamic risk indices that can evaluate individual trips and update driver profiles over time using behavioral signals alone, particularly in usage-based insurance where claims are sparse.

  2. Mathematical setup: Let $X_i = (X_{i,t})_{t=0}^{T_i-1}$ be a longitudinal acceleration time series for trip $i$ of length $T_i$. The MODWT decomposes this into wavelet coefficients:

    \[\tilde{W}_{i,j,t} := \sum_{\ell=0}^{L_j-1} \tilde{h}_{j,\ell} X_{i,(t-\ell) \bmod T_i}\]

    for levels $j = 1, \ldots, J$. Aggregated coefficients are defined as:

    \[C_{i,t} := \mathfrak{g}(\{\tilde{W}_{i,j,t}\}_{j \in \mathcal{J}})\]

    The portfolio distribution of coefficients $C = {C_r}_{r=1}^n$ follows a Gaussian-Uniform mixture:

    \[f(\cdot; \eta) = \sum_{m'=1}^{M^-} \pi_{m'}^- u(\cdot; \theta_{m'}^-) + \sum_{g=1}^G \pi_g \phi(\cdot; \theta_g) + \sum_{m=1}^{M^+} \pi_m^+ u(\cdot; \theta_m^+)\]

    Multi-layer tail counts for trip $i$ are:

    \[N_{im} := \sum_{t=0}^{E_i-1} \mathbf{1}\{C_{it} \in \Theta_m\}\]

    where ${\Theta_m}_{m=1}^M$ are disjoint severity layers. Conditional on intensities $\lambda_{im}$:

    \[N_{im} | \lambda_{im}, E_i \sim \text{Poisson}(E_i \lambda_{im})\]

    with Gamma priors:

    \[\lambda_{im} \sim \text{Gamma}(\alpha_{0m}, \beta_{0m})\]

    Assumptions:

    1. Telematics signals exhibit short-term mean reversion
    2. Aggregated coefficients form an approximately i.i.d. portfolio sample after hierarchical sampling and thinning
    3. Tail counts across severity layers are conditionally independent given intensities
    4. Gamma-Poisson conjugacy holds for sequential updating
  3. Toy example: Consider a simple case with $J=2$ MODWT levels, $M=2$ severity layers (one mild tail layer $\Theta_1$ and one extreme tail layer $\Theta_2$), and $G=1$ Gaussian component. For a trip with $E_i=100$ exposure and tail counts $N_{i1}=5$, $N_{i2}=2$, the posterior intensities become $\lambda_{i1} \lvert N_{i1}, E_i \sim \text{Gamma}(\alpha_{01}+5, \beta_{01}+100)$ and $\lambda_{i2} \lvert N_{i2}, E_i \sim \text{Gamma}(\alpha_{02}+2, \beta_{02}+100)$.

  4. Formal objective: Estimate the trip-level frequency-severity risk index:

    \[\hat{S}_i = \sum_{m=1}^M w_m \frac{\alpha_{0m} + N_{im}}{\beta_{0m} + E_i}\]

    where severity weights $w_m = \frac{(1/\pi_m)^\gamma}{\sum_{l=1}^M (1/\pi_l)^\gamma}$ penalize rarer tail layers.

Method

The method consists of four main components:

  1. MODWT Signal Representation: Decompose longitudinal acceleration using maximal overlap discrete wavelet transform with Daubechies D4 filters up to level $J=6$. Aggregate coefficients across levels using:

    \[C_{i,t} = \mathfrak{g}(\{\tilde{W}_{i,j,t}\}_{j \in \mathcal{J}})\]
  2. Portfolio-Level Severity Modeling: Fit Gaussian-Uniform mixture using Multi-Uniform MEMR algorithm. The mixture density is:

    \[f(\cdot; \eta) = \sum_{m'=1}^{M^-} \pi_{m'}^- u(\cdot; \theta_{m'}^-) + \sum_{g=1}^G \pi_g \phi(\cdot; \theta_g) + \sum_{m=1}^{M^+} \pi_m^+ u(\cdot; \theta_m^+)\]

    Constraints ensure tail separation, consecutive layer coverage, and monotonic membership probabilities.

  3. Trip-Level Frequency Modeling: For each severity layer $m$, model tail counts using Poisson-Gamma conjugacy:

    \[N_{im} | \lambda_{im}, E_i \sim \text{Poisson}(E_i \lambda_{im})\] \[\lambda_{im} \sim \text{Gamma}(\alpha_{0m}, \beta_{0m})\]
  4. Risk Index Construction: Define severity weights and compute trip-level index:

    \[w_m = \frac{(1/\pi_m)^\gamma}{\sum_{l=1}^M (1/\pi_l)^\gamma}\] \[\hat{S}_i = \sum_{m=1}^M w_m \frac{\alpha_{0m} + N_{im}}{\beta_{0m} + E_i}\]

    Application to toy example: With $M=2$ layers, $N_{i1}=5$, $N_{i2}=2$, $E_i=100$, and suppose $\pi_1=0.1$, $\pi_2=0.05$, $\gamma=1$. Then $w_1 = 10/(10+20) = 1/3$, $w_2 = 20/30 = 2/3$. If priors are $\alpha_{01}=1$, $\beta_{01}=1$, $\alpha_{02}=1$, $\beta_{02}=1$, then:

    \[\hat{S}_i = \frac{1}{3} \cdot \frac{6}{101} + \frac{2}{3} \cdot \frac{3}{101} = \frac{2+6}{3 \cdot 101} = \frac{8}{303} \approx 0.026\]
Novelty & Lineage

This work extends the frequency-based telematics risk assessment literature by introducing formal severity modeling. Prior work by Chan et al. (2025a) proposed a Continuous-Time Hidden Markov Model for anomaly detection but was purely frequency-oriented and struggled with drowsy driving detection. Verbelen et al. (2018) and Denuit et al. (2019) developed actuarial frameworks for telematics but without multi-scale signal representation or severity layers.

Key novel contributions:

  1. Severity concept: Redefines severity as statistical rarity in portfolio distribution rather than claim amounts, using inverse-probability penalties through layered tail structure
  2. Multi-Uniform MEMR algorithm: Extends Coretto (2022)’s single Uniform component MEMR to multiple Uniform components with constraints ensuring interpretable severity ordering
  3. Multi-scale wavelet aggregation: Uses MODWT to capture risk-relevant patterns across temporal scales, addressing limitations of global representations
  4. Joint frequency-severity framework: Combines Poisson-Gamma modeling with portfolio-level severity weights for sequential updating

    The Gaussian-Uniform mixture approach builds on Coretto and Hennig (2011) but the multi-component extension with severity interpretation and the overall actuarial framework are novel.

    SIGNIFICANT

Proof Techniques

The paper primarily focuses on algorithmic development and empirical validation rather than theoretical proofs. The main technical contributions are:

  1. Multi-Uniform MEMR Algorithm: Extends the MEMR approach by managing multiple Uniform components through constrained parameter space:

    \[\Omega_n = \{\eta: v_j \geq \exp(-nd), \text{ tail separation, consecutive coverage, monotonicity}\}\]

    The algorithm uses isotonic regression to enforce monotonicity constraints:

    \[\pi_1^- \geq \pi_2^- \geq \cdots \geq \pi_{M^-}^-, \quad \pi_1^+ \geq \pi_2^+ \geq \cdots \geq \pi_{M^+}^+\]
  2. Likelihood Maximization: The constrained MLE is:

    \[\hat{\eta}_n = \arg\max_{\eta \in \Omega_n} \ell_n(\eta)\]

    solved through multiple EM runs with grid-based initialization over candidate endpoint combinations.

  3. Conjugacy Properties: Exploits Gamma-Poisson conjugacy for closed-form posterior updates:

    \[\lambda_{im} | E_i, N_{im} \sim \text{Gamma}(\alpha_{0m} + N_{im}, \beta_{0m} + E_i)\]

    enabling sequential Bayesian updating with credibility-style weighting.

  4. MODWT Energy Decomposition: Uses the established MODWT property:

    \[\|X_i\|^2 = \sum_{j=1}^J \|\tilde{W}_{i,j}\|^2 + \|\tilde{V}_{i,J}\|^2\]

    for principled level selection through variance contribution analysis.

Experiments & Validation

Dataset: UAH-DriveSet controlled experiment with 6 drivers, 40 trips total, three behavioral states (normal, aggressive, drowsy) on two road types. Longitudinal acceleration recorded at 10 Hz.

Setup: MODWT decomposition with $J=6$ levels, Daubechies D4 filters, average pooling aggregation. Portfolio sample constructed via hierarchical sampling with autocorrelation-based thinning. Gaussian-Uniform mixture with $G=1$, $M^-=M^+=2$ (4 severity layers total).

Key Results:

  1. Fitted Model: Portfolio distribution shows clear tail structure with severity layers at quantiles 0.05%, 0.1%, 0.5% on each tail. Severity weights increase for deeper layers ($w_4 > w_3 > w_2 > w_1$).

  2. Trip Classification: Binary classification (normal vs. risky) using proposed index as feature: - Accuracy: 85% (34/40 trips correctly classified) - Outperforms frequency-only approaches that miss drowsy driving patterns - Clear separation between behavioral states in risk index distributions

  3. Driver Ranking: Sequential updating produces coherent driver-level profiles that align with expected risk ordering based on observed behavioral patterns.

  4. Severity Detection: Successfully identifies drowsy trips that appear “normal” in distributional analysis but exhibit rare extreme tail events, addressing limitation noted in Chan et al. (2025a).

    Baselines: Compared against frequency-based anomaly detection from Chan et al. (2025a) and simple quantile-based tail counting methods.

Limitations & Open Problems

Limitations:

  1. Single-signal focus: Framework limited to longitudinal acceleration only - TECHNICAL (method is signal-agnostic and can be extended to multivariate case)

  2. Controlled dataset: Evaluation on small controlled experiment (40 trips, 6 drivers) may not generalize to real-world conditions - RESTRICTIVE (real telematics data has more noise, diverse conditions)

  3. Gaussian-Uniform mixture assumption: May not capture all distributional patterns in telematics signals - NATURAL (reasonable for mean-reverting acceleration data)

  4. Fixed severity layer structure: Number of layers $M^-, M^+$ chosen manually rather than data-driven - TECHNICAL (could be addressed with model selection)

  5. No ground-truth accident validation: Risk index validated only against instructed behavioral states, not actual claims or accidents - RESTRICTIVE (but typical in telematics research due to sparse claims)

  6. Hierarchical sampling sensitivity: Portfolio construction depends on sampling scheme choices - TECHNICAL (robustness could be studied)

    Open Problems:

  7. Multivariate extension: How to optimally combine frequency-severity indices across multiple telematics signals (speed, lateral acceleration, etc.) into a unified risk measure
  8. Dynamic severity modeling: Developing time-varying severity structures that adapt to changing traffic conditions, weather, or road types rather than static portfolio-level baselines

Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs

Authors: Abhishek Gupta, Aditya Mahajan · Institution: Ohio State University, McGill University · Category: cs.LG

Establishes operator-theoretic foundations for policy gradient methods in general state-action space MDPs, unifying existence theory and deriving new PPO-type algorithms using perturbation theory and integral probability metrics.

Tags: reinforcement-learning markov-decision-processes operator-theory policy-gradient continuous-control functional-analysis optimization-theory

arXiv · PDF

Problem Formulation
  1. Motivation: Proximal policy optimization (PPO) algorithms have shown empirical success in continuous state-action space MDPs, but their theoretical foundations for general (non-finite) state and action spaces remain unclear. This limits understanding of when and why PPO works beyond finite MDPs.

  2. Mathematical setup: Consider a discounted-cost Markov decision process $(S, A, K, P, c, \rho, \gamma)$ where:
    • $S$ and $A$ are complete separable metric spaces (Polish spaces)
    • $K \subset S \times A$ is the set of feasible state-action pairs
    • $P: K \to P(S)$ is the transition kernel
    • $c: K \to \mathbb{R}_{\geq 0}$ is the per-step cost function
    • $\rho \in P(S)$ is the initial state distribution
    • $\gamma \in [0,1)$ is the discount factor

    Let $\pi: S \to P(A)$ be a policy and define the value function:

    \[v^\pi(s_0) = \mathbb{E}_{a_t \sim \pi(s_t)}\left[\sum_{t=0}^\infty \gamma^t c(s_t, a_t) \mid s_0\right]\]

    The performance is:

    \[J^\pi(\rho) = \langle v^\pi, \rho \rangle\]

    Define weighted normed spaces $F_S$ and $F_K$ with weight functions $w_S(s) \geq 1$ and $w_K(s,a) \geq 1$.

    Assumptions:

    1. Initial distribution $\rho \in P_{w_S}(S)$ and cost $c \in Q$ is non-negative and inf-compact
    2. Set of stable policies $\Pi_{\text{stable}}$ is non-empty
    3. $PV \subset Q$ and $\pi Q \subset V$ for all $\pi \in \Pi$
    4. For every inf-compact $q \in Q$, there exists $\pi \in \Pi$ achieving the infimum
    5. $V$ is closed under bounded increasing limits
  3. Toy example: Linear Quadratic Regulator with $s_{t+1} = As_t + Ba_t$, quadratic cost $c(s,a) = s^T Qs + a^T Ra$, and linear policies $\pi(s) = -Ks$. Here $V = {v: v(s) = s^T Qs, Q \geq 0}$ and $Q = {q: q(s,a) = s^T Qs + a^T Ra + 2\lambda s^T A^T QBa}$.

  4. Formal objective: Find an optimal policy

    \[\pi^* \in \arg\min_{\pi \in \Pi} J^\pi(\rho)\]
Method

The method views MDPs through linear operator theory by defining:

  1. Linear operator $P: F_S \to F_K$ as $Pv = \int_S v(s’) P(ds’\lvert s,a)$
  2. Policy operator $\pi: F_K \to F_S$ as $\pi q = \int_A q(s,a) \pi(da\lvert s)$
  3. Composed operator $P^\pi = \pi P$
  4. Bellman operator $T^\pi v = c^\pi + \gamma P^\pi v$

    The algorithm proceeds via:

  5. Define stable policies as those with spectral radius $\text{spec}(\gamma P^\pi) < 1$
  6. Establish value function existence: $v^\pi = \sum_{t=0}^\infty \gamma^t (P^\pi)^t c^\pi$
  7. Derive policy difference lemma using perturbation theory:

    \[v^{\pi'} - v^\pi = \sigma_{\pi'} \Delta q^\pi\]

    where $\sigma_{\pi’} = (I - \gamma P^{\pi’})^{-1}$ and $\Delta q^\pi = (\pi’ - \pi)A^\pi$

  8. Use integral probability metrics to bound second-order terms:

    \[\|[\Delta P \sigma_{\pi'} \Delta q^\pi]\|_{F_S} \leq \beta(q, \pi') w_S(s) \text{IPM}_{\Pi, w_S}(\pi, \pi')^2\]
  9. Derive majorization-minimization update:

    \[\pi_{k+1} = \arg\min_{\pi' \in \Pi} \langle \pi' A^{\pi_k} + \beta w_S \text{IPM}_{\Pi, w_S}(\pi', \pi_k)^2, \sigma^*_{\pi_k} \rho \rangle\]

    Applied to LQR toy example: With quadratic value functions, the operators preserve the quadratic structure. The stable policy set becomes ${\pi: \text{spec}(A - BK) < 1}$ and the algorithm updates the gain matrix $K$ while maintaining stability.

Novelty & Lineage

This work extends policy gradient methods from finite MDPs (Schulman et al. TRPO, Kakade 2001) to general state-action spaces using operator theory. Prior results by Neu & Olkhovskaya (2020) covered some continuous cases with linear function approximation, while this work handles general Polish spaces.

Key novelties:

  • First operator-theoretic foundation for PPO in general MDPs
  • Unification of existence results across LQG, finite, and Lipschitz MDPs
  • Policy difference lemma via perturbation theory rather than finite-dimensional derivatives
  • Integral probability metrics framework generalizing total variation/KL divergence approaches
  • New MM-RKHS algorithm avoiding KL divergence second derivatives

The approach builds on Schäl (1975) for existence theory but adds computational algorithms. Extends Bertsekas & Shreve dynamic programming to structured function spaces.

SIGNIFICANT

Proof Techniques

The main proof strategy uses perturbation theory of linear operators in Banach spaces:

  1. Spectral stability for value function existence: For $\pi \in \Pi_{ss}$, show $\text{spec}(\gamma P^\pi) < 1$ implies convergence of Neumann series:

    \[v^\pi = \sum_{t=0}^\infty \gamma^t (P^\pi)^t c^\pi = (I - \gamma P^\pi)^{-1} c^\pi\]
  2. Policy difference via operator perturbation: Apply Lemma 2.1 with $A = I - \gamma P^\pi$, $B = -\gamma P(\pi’ - \pi)$:

    \[(I - \gamma P^{\pi'})^{-1} = (I - \gamma P^\pi)^{-1} + \epsilon (I - \gamma P^{\pi'})^{-1} P \Delta \sigma_\pi\]

    where $\Delta = \pi’ - \pi$ and $\epsilon$ parameterizes the perturbation.

  3. Majorization bounds using IPMs: Key inequality from Lemma 5.8:

    \[\|\Delta P \sigma_{\pi'} \Delta q^\pi\|_{F_S} \leq \beta(q, \pi') w_S(s) \text{IPM}_{\Pi, w_S}(\pi, \pi')^2\]

    This uses the bound $\lvert \pi q - \pi’ q\ \rvert_{F_S} \leq \varrho_{F_Q}(q) \text{IPM}_{\Pi, w_S}(\pi, \pi’)$.

  4. Value iteration convergence without contraction: Use increasing sequence argument instead of Banach fixed point theorem. Show ${T^k(0)}$ is increasing and bounded via Theorem 3.10, then apply Assumption 5 (closure under bounded increasing limits).

  5. Structured space preservation: Mathematical induction showing if $v^k \in V$, then $T v^k \in V$ using Assumptions 3-4 and measurable selection for inf-compact functions.

Experiments & Validation

Purely theoretical. The paper provides no numerical experiments or empirical validation.

Empirical validation would require:

  • Implementation on continuous control tasks (robotics, fluid dynamics)
  • Comparison of MM-RKHS algorithm against standard PPO/TRPO
  • Verification that theoretical assumptions hold in practice
  • Performance analysis on the motivating applications (soft robotics, mean field control)
Limitations & Open Problems

Limitations:

  1. TECHNICAL: Assumption 8 requires universal bound $\varrho_{F_Q}(Pv) \leq \kappa_P \lvert v\ \rvert_{F_S}$ - needed for majorization bounds but may be restrictive for some transition kernels

  2. TECHNICAL: Assumption 5 (closure under bounded increasing limits) - required for value iteration convergence but may not hold for all structured spaces like general Lipschitz functions

  3. RESTRICTIVE: Assumption 4 (structured selectors) - requires optimal actions lie in policy class $\Pi$, significantly limiting applicability

  4. TECHNICAL: Assumption 6 (IPM-continuity of transition kernel) - needed for inf-compactness but requires careful choice of generator sets $F_V$

  5. RESTRICTIVE: Assumption 7 (interchange of integral and minimization) - limits policy classes and may fail for complex constraints

    Open problems:

  6. Develop sample complexity bounds for the MM-RKHS algorithm in the finite-sample setting
  7. Extend results to unbounded costs (current framework requires $c \geq 0$ and inf-compact)

A deep backward regression-based scheme for high-dimensional nonlinear partial differential equations

Authors: Qiang Han, Shaolin Ji, Yunzhang Li · Institution: Yangzhou University, Shandong University, Fudan University · Category: math.NA

Proposes a deep backward regression method that uses conditional expectations to denoise neural network training for high-dimensional nonlinear PDE solving, achieving superior stability and scaling up to 200 dimensions.

Tags: high-dimensional PDEs backward stochastic differential equations deep neural networks numerical analysis quantitative finance curse of dimensionality Monte Carlo methods variational inequalities

arXiv · PDF

Problem Formulation
  1. Motivation: High-dimensional nonlinear parabolic PDEs arise in quantitative finance (option pricing, portfolio optimization) and scientific computing, but traditional numerical methods suffer from the curse of dimensionality. Existing deep learning approaches like DBDP face training instability due to noise in backward stochastic differential equation discretizations.

  2. Mathematical setup: Consider the quasilinear parabolic PDE on filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{0 \leq t \leq T}, \mathbb{P})$:

    \[\partial_t u + \mu \cdot D_x u + \frac{1}{2} \text{Tr}(\sigma \sigma^T D_x^2 u) = f(t, x, u, \sigma^T D_x u)\]

    with terminal condition $u(T, \cdot) = g(\cdot)$ on $\mathbb{R}^d$. By the nonlinear Feynman-Kac formula, this connects to the FBSDE:

    \[dX_t = \mu(t, X_t) dt + \sigma(t, X_t) dW_t, \quad X_0 = x_0\] \[-dY_t = f(t, X_t, Y_t, Z_t) dt - Z_t dW_t, \quad Y_T = g(X_T)\]

    where $Y_t = u(t, X_t)$ and $Z_t = \sigma^T(t, X_t) D_x u(t, X_t)$.

    Assumptions:

    1. $x_0$ is square integrable
    2. $\mu, \sigma$ are Lipschitz in space with uniform bounds
    3. $f$ is $\frac{1}{2}$-Hölder in time, uniformly Lipschitz in other variables
    4. $g$ has linear growth
  3. Toy example: When $d=2$, $\mu = 0$, $\sigma = I_2$, $f(t,x,y,z) = y$, and $g(x) = \lvert x \rvert^2$, we get the heat equation $\partial_t u + \frac{1}{2} \Delta u = u$ with $u(T,x) = \lvert x \rvert^2$. The exact solution is $u(t,x) = e^{(T-t)/2} \lvert x \rvert^2 + (T-t) e^{(T-t)/2}$, but computing this in high dimensions requires approximating conditional expectations in the BSDE.

  4. Formal objective: Minimize the error

    \[\max_{0 \leq i \leq N-1} \mathbb{E}[|Y_{t_i} - Y_i|^2] + \mathbb{E}\left[\sum_{i=0}^{N-1} \int_{t_i}^{t_{i+1}} |Z_s - Z_i|^2 ds\right]\]

    where $(Y_i, Z_i)$ are DBR approximations to the true BSDE solution $(Y_{t_i}, Z_{t_i})$.

Method

The Deep Backward Regression (DBR) method replaces noisy single-path BSDE discretizations with denoised conditional expectation targets.

Algorithm steps:

  1. Time discretization: $\pi = {0 = t_0 < t_1 < \cdots < t_N = T}$ with $h = T/N$
  2. Forward simulation: Generate paths $X^{m,k}_{i+1} = X^m_i + \mu(t_i, X^m_i)h + \sigma(t_i, X^m_i)\Delta W^{m,k}_i$
  3. Neural network training at each timestep $t_i$ (backward induction):

    Loss for $Y$-component:

    \[F_{y,i}(\theta_{y,i}) = \frac{1}{M} \sum_{m=1}^M \left|S_{y,i}(p^m_i, X^m_{i+1}) - \text{NN}_{y,i}(p^m_i, \theta_{y,i})\right|^2\]

    Loss for $Z$-component:

    \[F_{z,i}(\theta_{z,i}) = \frac{1}{M} \sum_{m=1}^M \left|S_{z,i}(p^m_i, \Delta W^m_i, X^m_{i+1}) - \text{NN}_{z,i}(p^m_i, \theta_{z,i})\right|^2\]

    where the denoised targets are:

    \[S_{y,i} = \frac{1}{K_i} \sum_{k=1}^{K_i} \text{NN}_{y,i+1}(p^{m,k}_{i+1}, \theta^*_{y,i+1}) + hf(t_i, X^m_i, \text{NN}_{y,i}, \text{NN}_{z,i})\] \[S_{z,i} = \frac{1}{K_i} \sum_{k=1}^{K_i} \frac{\text{NN}_{y,i+1}(p^{m,k}_{i+1}, \theta^*_{y,i+1}) (\Delta W^{m,k}_i)^T}{h}\]
  4. Output: $(Y_i, Z_i) = (\text{NN}_{y,i}(p^m_i, \theta^*_{y,i}), \text{NN}_{z,i}(p^m_i, \theta^*_{z,i}))$

    Toy example application: For the 2D heat equation case, at each timestep we generate $K_i$ paths from each base path, average the neural network predictions $\text{NN}_{y,i+1}(t_{i+1}, X^{m,k}_{i+1})$ to form the regression target, eliminating the Brownian noise $\Delta W^{m,k}_i$ that would corrupt single-path methods.

Novelty & Lineage

This extends the Deep BSDE method of E, Han, Jentzen (2017) and particularly builds on the DBDP method of Huré, Pham, Warin (2020). The key prior work uses loss functions of the form:

\[J^{HPW}_i(\theta_i) = \mathbb{E}[|Y_i - Y^*_{i+1} - hf(t_i, X^\pi_i, Y_i, Z_i) + Z_i \Delta W_i|^2]\]

The main novelty is replacing this noisy target (containing $Z_i \Delta W_i$) with conditional expectation-based regression targets that eliminate Brownian noise through multi-path averaging. This “denoising” mechanism is achieved by reformulating the stochastic optimization as deterministic function approximation.

The paper also provides rigorous error analysis with half-order convergence rates and extends to variational inequalities (reflected BSDEs). Related work includes SOC-MartNet (Cai, Fang, Zhou) but DBR specifically targets the variance reduction problem in backward induction schemes.

Classification: SIGNIFICANT

Proof Techniques

The error analysis uses a three-step approach with discrete Grönwall inequalities and neural network approximation theory.

Key proof steps:

  1. Euler discretization error analysis using Young’s inequality in the form $(a+b)^2 \leq (1+\gamma h)a^2 + (1+\frac{1}{\gamma h})b^2$ to control:

    \[\mathbb{E}[|Y_{t_i} - U^m_{t_i}|^2] \leq (1+Ch)\mathbb{E}[|Y_{t_{i+1}} - \text{NN}_{y,i+1}|^2] + Ch^2 + C\mathbb{E}\left[\int_{t_i}^{t_{i+1}} |Z_s - \bar{Z}_{t_i}|^2 ds\right]\]
  2. Neural network approximation error bounds using GroupSort networks. The key inequality from Lemma 3.2:

    \[\sup_{x \in [-R,R]^d} |\tilde{f}(x) - g(x)| \leq 2\sqrt{d_1}RK\varepsilon\]

    for $K$-Lipschitz functions with network parameters $\kappa = \lceil 2\sqrt{d}/\varepsilon \rceil$, depth $O(d^2)$, width $O((2\sqrt{d}/\varepsilon)^{d^2-1})$.

  3. Discrete Grönwall lemma application to obtain the final bound:

    \[\max_{0 \leq i \leq N-1} \mathbb{E}[|Y_{t_i} - Y_i|^2] \leq C\mathbb{E}[|g(X_T) - g(X^\pi_N)|^2] + Ch + C\sum_{i=0}^{N-1} \mathbb{E}[|\text{approximation errors}|^2]\]

    The key technical insight is showing that the conditional expectation formulation reduces variance while maintaining the same convergence rate. The proof establishes half-order ($h^{1/2}$) convergence by choosing network parameters $R = O(N^{3/2})$, $\varepsilon = O(N^{-3})$, $K = O(N^3)$ to balance discretization and approximation errors.

Experiments & Validation

Experiments test bounded and unbounded PDE solutions across dimensions $d = 10, 50, 100, 200$.

Benchmarks:

  • Allen-Cahn equation: $\partial_t u + \frac{1}{2} \Delta u = u - u^3$, $u(T,x) = (2+\frac{2}{3}\sin(0.5 \sum x_i))^{-1}$
  • Nonlinear Black-Scholes: pricing equations with nonlinear volatility
  • Hamilton-Jacobi-Bellman equations from optimal control

Key results:

  • Table 1: DBR maintains accuracy up to $d=200$ for bounded solutions with relative errors under 2%
  • Table 6: For complex unbounded PDEs where DBDP1 fails beyond $d=10$, DBR remains robust up to $d=20$ with relative errors under 9.7%
  • Training stability: DBR shows consistent convergence while DBDP1 exhibits oscillations and divergence in high dimensions

Baselines: DBDP1 method (Huré et al. 2020), standard deep BSDE approaches. Implementation uses Xavier initialization, Adam optimizer, batch sizes 1000-5000, learning rates $10^{-3}$ to $10^{-4}$.

Limitations & Open Problems

Limitations:

  1. Computational cost scales with $K_i$ (number of paths per timestep for denoising) - TECHNICAL (trade-off between accuracy and efficiency)
  2. Lipschitz assumptions on coefficients $\mu, \sigma, f, g$ - NATURAL (standard in BSDE theory)
  3. Half-order convergence rate matches Euler scheme limitations - NATURAL (inherent to first-order temporal discretization)
  4. GroupSort network architecture requirements scale exponentially with dimension for theoretical guarantees - TECHNICAL (practical networks often perform better than theory suggests)
  5. Method limited to $d \leq 200$ demonstrated - RESTRICTIVE (though significant improvement over $d \leq 10$ for unbounded cases)

    Open problems:

  6. Can higher-order temporal discretizations (Milstein, Runge-Kutta) be combined with the DBR denoising approach to achieve full-order convergence rates?
  7. Is it possible to extend DBR to McKean-Vlasov PDEs or systems with path-dependent coefficients where the Markovian structure is lost?

Collaborative Temporal Feature Generation via Critic-Free Reinforcement Learning for Cross-User Sensor-Based Activity Recognition

Authors: Xiaozhou Ye, Feng Jiang, Zihan Wang, Xiulai Wang et al. (6 authors) · Institution: Nanjing University of Information Science and Technology · Category: cs.LG

Introduces CTFG, a framework that reformulates cross-user activity recognition as critic-free reinforcement learning over autoregressive temporal feature generation, achieving state-of-the-art performance through group-relative advantage estimation.

Tags: human-activity-recognition domain-generalization reinforcement-learning temporal-modeling wearable-sensors transformers cross-user-adaptation policy-optimization

arXiv · PDF

Problem Formulation

Motivation: Human Activity Recognition (HAR) using wearable sensors faces a critical deployment challenge: models trained on source users suffer substantial performance degradation on unseen target users due to physiological diversity (limb length, body mass, muscle composition). Existing domain generalization methods either neglect temporal structure or require target annotations.

Mathematical setup: Consider $K$ source users with labeled data $D_k = {(x_i^k, y_i^k)}_{i=1}^{n_k}$ where $x_i^k \in \mathbb{R}^{l \times d}$ is a sensor sequence of length $l$ with $d$ channels and $y_i^k \in {1, \ldots, C}$ is the activity label. Each user $k$ induces distribution $P_k(x, y)$ with $P_i \neq P_j$ for $i \neq j$. Target users follow distributions ${Q_m}_{m=1}^M$ with $Q_m \neq P_k$ for all $m, k$.

Feature extraction is formulated as an MDP where for input $x_i$:

  • State at step $j$: $s_{i,j} = (h_i, z_{i,1:j-1})$ combining encoder representation $h_i \in \mathbb{R}^{l \times d_{\text{model}}}$ with prior tokens
  • Action: generate token $z_{i,j} \in \mathbb{R}^k$ from parameterized distribution
  • Reward: computed after complete sequence generation

Assumptions:

  1. Label space remains identical across users
  2. Activities have compositional temporal structure
  3. User-invariant dynamics exist in temporal relationships

    Toy example: For $d=6$ (3-axis accelerometer + 3-axis gyroscope), $l=75$ samples, $k=16$ token dimensions, and $s=4$ tokens, the decoder generates sequences $z_i = {z_{i,1}, z_{i,2}, z_{i,3}, z_{i,4}}$ where $z_{i,1}$ captures coarse periodicity and $z_{i,4}$ refines phase transitions.

    Formal objective: Find policy parameters $\theta^*$ that maximize expected batch-level reward:

    \[\theta^* = \arg\max_\theta \mathbb{E}_{\pi_\theta}[R(\{z_i\}_{i \in B}, \{h_i\}_{i \in B})]\]
Method

CTFG employs autoregressive feature generation with Group-Relative Policy Optimization:

  1. Encoder: Linear projection and positional encoding followed by Transformer encoder:

    \[h_i = \text{TransformerEncoder}(x_i W_{\text{in}} + PE(l))\]
  2. Decoder: Autoregressive generation with causal self-attention and cross-attention:

    \[[\mu_{i,j}, \log \sigma_{i,j}] = \text{TransformerDecoder}(h_i, z_{i,1:j-1})\] \[z_{i,j} \sim \mathcal{N}(\mu_{i,j}, \text{diag}(\exp(\log \sigma_{i,j})))\]
  3. Group-Relative Advantages: For each input, sample $G$ sequences and compute:

    \[\hat{A}^{(g)} = \frac{R^{(g)} - \bar{R}_G}{\hat{\sigma}_G + \epsilon_s}\]

    where $\bar{R}_G = \frac{1}{G}\sum_{g=1}^G R^{(g)}$ and $\hat{\sigma}_G = \sqrt{\frac{1}{G}\sum_{g=1}^G (R^{(g)} - \bar{R}_G)^2}$

  4. Tri-objective Reward:

    \[R = w_{\text{cls}} R_{\text{cls}} + w_{\text{inv}} R_{\text{inv}} + w_{\text{tmp}} R_{\text{tmp}}\]

    Toy example application: For a walking sequence with $G=4$ samples, if rewards are $[0.8, 0.6, 0.9, 0.5]$, then $\bar{R}_G = 0.7$, $\hat{\sigma}_G = 0.158$, yielding advantages $[0.63, -0.63, 1.27, -1.27]$.

Novelty & Lineage

The paper introduces several novel components:

  1. Reformulating feature extraction as sequential generation governed by RL, departing from single-pass methods
  2. First application of Group-Relative Policy Optimization (GRPO) to sensor-based domain generalization, eliminating critic networks that suffer from distribution-dependent bias
  3. Tri-objective reward combining class discrimination, cross-user invariance, and temporal fidelity.

    Prior work includes adversarial alignment methods (ACON), causal invariance learning (Xiong et al.), and contrastive approaches (CrossHAR, ContrastSense), but these operate on fixed representations without exploiting temporal structure. GRPO was introduced by Ahmadian et al. for language model alignment but never applied to feature generation.

    The autoregressive Transformer architecture builds on standard designs but is novel in its application to feature construction rather than prediction. The theoretical analysis of affine-invariant advantages is a new contribution.

    SIGNIFICANT

Proof Techniques

The main theoretical result (Proposition 1) establishes properties of group-relative advantages:

Key Result: For i.i.d. reward samples $R^{(1)}, \ldots, R^{(G)}$, the group-relative advantage satisfies:

\[\hat{A}^{(g)} = \frac{R^{(g)} - \bar{R}_G}{\hat{\sigma}_G + \epsilon_s}\]

Proof technique:

  1. Zero-centered property: Direct computation shows $\mathbb{E}[\hat{A}^{(g)}] = 0$ since $\mathbb{E}[R^{(g)} - \bar{R}_G] = 0$

  2. Variance convergence: As $G \to \infty$, $\bar{R}_G \to \mathbb{E}[R]$ and $\hat{\sigma}_G \to \text{Std}[R]$ by law of large numbers, so:

    \[\text{Var}(\hat{A}^{(g)}) \to \text{Var}\left(\frac{R - \mathbb{E}[R]}{\text{Std}[R]}\right) = 1\]
  3. Affine invariance: For transformed rewards $aR + b$ with $a > 0$:

    \[\hat{A}^{(g)}(aR + b) = \frac{(aR^{(g)} + b) - (a\bar{R}_G + b)}{a\hat{\sigma}_G} = \frac{a(R^{(g)} - \bar{R}_G)}{a\hat{\sigma}_G} = \hat{A}^{(g)}(R)\]

    The key insight is that normalization by empirical standard deviation absorbs reward scale changes, making advantages invariant to cross-user distribution shifts that would bias critic-based methods.

Experiments & Validation

Datasets: DSADS (8 subjects, 19 activities, 25 Hz, torso/arms/legs sensors) and PAMAP2 (6 subjects, 11 activities, 100 Hz, chest/wrist/ankle sensors). Leave-one-group-out cross-validation.

Baselines: ERM (77.56% DSADS, 64.18% PAMAP2), RSC (81.44%, 66.95%), ANDMask (82.05%, 64.87%), AdaRNN (79.38%, 67.59%), ACON (87.65%, 71.95%), PPO-variant (88.29%, 74.15%).

Key Results: CTFG achieves 88.53% on DSADS and 75.22% on PAMAP2, representing 10.97% and 11.04% improvements over ERM. Convergence is 2-3x faster than PPO with 84.8% reduction in inter-task variance. Performance remains stable across token counts s=5 to s=20, while PPO degrades catastrophically at high horizons.

Architecture: 1-layer Transformer, d_model=64, 4 heads, learning rate 10^-4. Group size G=4, tri-objective weights (3.0, 2.0, 1.0).

Limitations & Open Problems

Limitations:

  1. Static activities (sitting, standing) achieve lower performance (F1 ≈ 0.6) because they lack temporal structure the autoregressive decoder is designed to exploit - NATURAL

  2. Environmental confounding (elevator activities) introduces non-activity temporal dynamics that are preserved by temporal fidelity constraint - TECHNICAL

  3. Limited to datasets with sufficient user diversity for meaningful group statistics in GRPO - RESTRICTIVE

  4. Computational overhead of generating G sequences per input during training - TECHNICAL

  5. Fixed group size G may be suboptimal for different dataset scales - TECHNICAL

  6. Hyperparameter sensitivity in tri-objective reward weighting requires domain knowledge - NATURAL

    Open Problems:

  7. Adaptive group sizing strategies for datasets with limited user diversity that maintain statistical validity of group normalization
  8. Extension to multi-modal sensor configurations (visual + inertial) where temporal alignment across modalities becomes critical

Low-Complexity and Consistent Graphon Estimation from Multiple Networks

Authors: Roland Boniface Sogan, Tabea Rebafka · Institution: Sorbonne Université, AgroParisTech · Category: stat.ME

Introduces a fast, consistent histogram-based graphon estimator that jointly aligns nodes across multiple networks instead of analyzing them separately.

Tags: graphon-estimation network-analysis multiple-networks histogram-estimator consistency computational-complexity joint-alignment latent-positions

arXiv · PDF

Problem Formulation

Motivation: Graphon estimation from multiple networks is crucial for understanding common structural patterns across collections of networks in neuroscience, ecology, and sociology. Existing methods either analyze networks individually (losing joint information) or are computationally expensive, limiting scalability.

Mathematical setup: Let $W : [0,1]^2 \to [0,1]$ be a symmetric measurable graphon function. For each network $G^{(m)} = (V^{(m)}, E^{(m)})$ with $n^{(m)}$ nodes, generate latent positions:

\[U_1^{(m)}, \ldots, U_{n^{(m)}}^{(m)} \stackrel{i.i.d.}{\sim} \text{Uniform}[0,1]\]

The adjacency matrix $A^{(m)}$ satisfies:

\[A_{i,j}^{(m)} | U_{1:n^{(m)}}^{(m)} \sim \text{Bernoulli}(W(U_i^{(m)}, U_j^{(m)}))\]

for $i < j$, with $A_{j,i}^{(m)} = A_{i,j}^{(m)}$ and $A_{i,i}^{(m)} = 0$.

Assumptions:

  1. All networks are generated from the same graphon $W$
  2. Node sets $V^{(m)}$ are disjoint with potentially different sizes
  3. The normalized degree function $g(u) = \int_0^1 W(u,v) dv$ is strictly increasing
  4. $W$ is $L$-Lipschitz continuous
  5. The degree function satisfies $L_1\lvert x-y \rvert \leq \lvert g(x)-g(y) \rvert$ for some $L_1 > 0$

    Toy example: Consider $M=2$ networks with $n^{(1)}=3$, $n^{(2)}=4$ nodes, generated from $W(u,v) = uv$. The normalized degrees are $g(u) = u/2$. A node with latent position $u=0.7$ has expected normalized degree $0.35$.

    Formal objective: Estimate the graphon function by minimizing:

    \[\text{MISE}(\hat{W}_{JGS}, W) = \mathbb{E}[\|\hat{W}_{JGS} - W\|_{L^2}^2]\]
Method

The Joint Graph Sorting (JGS) estimator consists of three steps:

  1. Global sorting: Compute normalized empirical degrees for all nodes:

    \[\hat{d}_{i,(m)}^{\text{norm}} = \frac{1}{n^{(m)}-1} \sum_{j=1}^{n^{(m)}} A_{i,j}^{(m)}\]
  2. Latent position estimation: Sort all $N = \sum_{m=1}^M n^{(m)}$ nodes by their normalized degrees and assign positions:

    \[\hat{U}_{i,(m)}^{JGS} = \frac{(\hat{\sigma}_{JGS})^{-1}(i,m)}{N} - \frac{1}{2N}\]

    where $\hat{\sigma}_{JGS}$ is the global sorting permutation.

  3. Histogram construction: Partition $[0,1]^2$ into $k^2$ blocks and compute edge frequencies:

    \[\hat{H}_{s,t}^{JGS} = \frac{\sum_{m=1}^M \sum_{i \in \hat{J}_s^{(m)}} \sum_{j \in \hat{J}_t^{(m)}} A_{i,j}^{(m)}}{\max\{1, \sum_{m=1}^M |\hat{J}_s^{(m)}||\hat{J}_t^{(m)}|\}}\]

    The final estimator is:

    \[\hat{W}_{JGS}(u,v) = \hat{H}_{s,t}^{JGS} \text{ if } (u,v) \in I_s \times I_t\]

    Toy example application: With $k=2$ blocks and our toy networks, nodes are globally ranked by their empirical degrees. A node with high degree from either network gets positioned near 1, enabling joint information pooling across the sparse within-network observations.

Novelty & Lineage

This extends classical single-network graphon estimators like SAS (Chan & Airoldi, 2014), SBA (Airoldi et al., 2013), and USVT (Chatterjee, 2015). Recent multi-network approaches include Gromov-Wasserstein Barycenters (Xu et al., 2021) and neural methods like IGNR (Xia et al., 2023) and SIGL (Azizpour et al., 2025).

Key innovation: Unlike existing methods that estimate graphons separately per network then average, JGS performs joint node alignment across all networks simultaneously before histogram construction. This enables better latent position estimation with resolution $1/N$ instead of $1/n^{(m)}$, and computational complexity $O(N \log N + S)$ vs. expensive optimization-based alignment.

Assessment: SIGNIFICANT

Proof Techniques

The consistency proof follows a three-step decomposition strategy:

  1. Latent position consistency: Use Hoeffding’s inequality to bound degree concentration:

    \[\mathbb{P}[|\hat{d}_{i,(m)}^{\text{norm}} - g(U_{i,(m)})| \geq \varepsilon_n^{(m)}] \leq \delta\]

    where $\varepsilon_n^{(m)} = \sqrt{\frac{\log(2/\delta)}{2n^{(m)}}}$. The Lipschitz condition on $g$ converts degree errors to position errors:

    \[|\hat{U}_{i,(m)}^{JGS} - U_{i,(m)}| \leq \frac{2}{L_1}\left(\varepsilon_i^{(m)} + \frac{1}{N}\sum_{(j,\ell) \neq (i,m)} \varepsilon_j^{(\ell)}\right) + \varepsilon_N\]
  2. MISE decomposition: Bound the estimation error using triangle inequality:

    \[\mathbb{E}[\|\hat{W}_{JGS} - W\|_{L^2}^2] \leq 3\mathbb{E}[\|\hat{W}_{JGS} - W^{\text{oracle}}\|_{L^2}^2] + 3\mathbb{E}[\|W^{\text{oracle}} - W^{\text{approx}}\|_{L^2}^2] + 3\mathbb{E}[\|W^{\text{approx}} - W\|_{L^2}^2]\]
  3. Block-wise error control: For each term: - Discretization error: $\mathbb{E}[\lvert W^{\text{approx}} - W\ \rvert_{L^2}^2] \leq \frac{CL^2}{k^2}$ - Stochastic error: $\mathbb{E}[\lvert W^{\text{oracle}} - W^{\text{approx}}\ \rvert_{L^2}^2] \leq \frac{1}{4k^2}\sum_{s,t} \mathbb{E}[\frac{1}{M_{s,t}}]$
    - Misclassification error bounds via vertex boundary analysis

  4. Asymptotic convergence: With $k = o(\frac{N}{M + \log N})$ and $\frac{k}{n_{\max}} \to 0$, all error terms vanish as $n_{\max} \to \infty$.

Experiments & Validation

Synthetic data: 13 graphons (9 monotone, 4 non-monotone) from Azizpour et al. (2025), $M=200$ networks with sizes $n^{(m)} \in [10,100]$. MISE evaluation over 20 trials.

Baselines: SBA, SAS, USVT (via Xu et al. 2021 multi-network implementations), SGWB, SIGL.

Key results:

  • JGS achieves lowest MISE on 12/13 graphons (only fails on very sparse graphon 9)
  • 1-2 orders of magnitude faster than SIGL/SGWB
  • JGS-smooth (with total variation smoothing) further reduces errors
  • Robust performance even on non-monotone graphons despite theoretical assumptions

Real data: IMDB-BINARY (1000 graphs, 2 classes) and IMDB-MULTI (1500 graphs, 3 classes) for GNN classification via G-Mixup data augmentation. JGS achieves highest test accuracy: 75.72% vs 72.85% (SGWB baseline) on IMDB-BINARY.

Scaling studies: Confirms consistency when network sizes grow, but shows non-consistency when only number of networks grows (inherent limitation of the setting).

Limitations & Open Problems

Limitations:

  1. Strict monotonicity of degree function assumption - TECHNICAL (needed for identifiability, but many natural graphons satisfy this)

  2. Fixed number of networks $M$ for consistency - RESTRICTIVE (rules out growing network collections, though empirically JGS still improves with more networks)

  3. Poor performance on very sparse networks - NATURAL (sparse regimes are inherently challenging for degree-based methods)

  4. Lipschitz continuity assumption on graphon - NATURAL (standard smoothness condition in nonparametric estimation)

    Open problems:

  5. Develop consistent graphon estimators for the regime where $M \to \infty$ while network sizes remain bounded - fundamental theoretical challenge about what information is available

  6. Improve latent position estimation for sparse networks, potentially using higher-order graph statistics beyond degrees


Sequential Transport for Causal Mediation Analysis

Authors: Agathe Fernandes-Machado, Iryna Voitsitska, Arthur Charpentier, Ewen Gallic · Institution: Université du Québec à Montréal (UQAM), Ukrainian Catholic University, Kyoto University, Aix Marseille Université · Category: stat.ME

Sequential Transport provides a nonparametric framework for causal mediation analysis that constructs unit-level mediator counterfactuals by optimally transporting mediators along a DAG structure without requiring parametric structural equation models.

Tags: causal-inference mediation-analysis optimal-transport counterfactuals algorithmic-fairness nonparametric-methods sequential-transport

arXiv · PDF

Problem Formulation

Motivation: Classical causal mediation analysis relies on structural equation models (SEMs) to decompose treatment effects into direct and indirect pathways through mediators. However, these approaches require specifying parametric functional forms and cross-world counterfactuals that are difficult to define without full structural models, especially for high-dimensional or mixed-type mediators.

Mathematical setup: Let $(A, X, Y) \in \mathcal{A} \times \mathcal{X} \times \mathcal{Y}$ be a random vector where $A \in {0,1}$ is a binary treatment, $X = (X_1, \ldots, X_d)$ is a vector of mediators, and $Y$ is the outcome. We assume $(A, X, Y) \sim P$ with causal ordering $A \to X \to Y$. The mediators follow a directed acyclic graph (DAG) $G$ such that:

\[P(X | A = a) = \prod_{j=1}^d P(X_j | \text{parents}_G(X_j), A = a)\]

For unit $i$, let $X_i(a)$ denote potential mediators under treatment $a$ and $Y_i(a,x)$ potential outcomes under treatment $a$ and mediator vector $x$. The total effect decomposes as:

\[\tau_i = Y_i(1, X_i(1)) - Y_i(0, X_i(0)) = \delta_i(1) + \zeta_i(0)\]

where $\delta_i(a) = Y_i(a, X_i(1)) - Y_i(a, X_i(0))$ is the natural indirect effect and $\zeta_i(a) = Y_i(1, X_i(a)) - Y_i(0, X_i(a))$ is the natural direct effect.

Assumptions:

  1. Sequential ignorability: no unobserved confounding in treatment-outcome, treatment-mediator, and mediator-outcome relationships
  2. Treatment is randomized or ignorable given pre-treatment covariates
  3. Mediator DAG $G$ is known and acyclic

    Toy example: When $d=2$ with $A \to X_1 \to X_2 \to Y$ and Gaussian mediators, classical linear SEM gives $X_1(1) = \alpha_1 + \epsilon_1$ and $X_2(1) = \beta X_1(1) + \alpha_2 + \epsilon_2$, requiring specification of functional forms and error distributions.

    Formal objective: Construct unit-level mediator counterfactuals $x_i^*(1)$ without parametric assumptions such that:

    \[\delta_i = \mu_0(x_i^*(1)) - \mu_0(x_{0,i}), \quad \zeta_i = \mu_1(x_i^*(1)) - \mu_0(x_i^*(1))\]
Method

Sequential Transport (ST) constructs mediator counterfactuals by transporting mediators one at a time along a topological ordering of the DAG, preserving conditional dependencies.

Algorithm steps:

  1. Fix a topological ordering $\pi$ of the mediator DAG $G$
  2. For each mediator $X_j$ in order, construct conditional transport map $T_j(\cdot \lvert z_0, z_1)$ such that:

    \[T_j(\cdot | z_0, z_1)_\# P(X_j | Z_j = z_0, A = 0) = P(X_j | Z_j = z_1, A = 1)\]
  3. For numerical mediators with parents, use conditional quantile transport:

    \[T_j(x | z_0, z_1) = Q_{1,j}(F_{0,j}(x | z_0) | z_1)\]
  4. For categorical mediators, transport probability vectors on the simplex $S^{K-1}$
  5. Apply transport recursively for unit $i$ with $A_i = 0$:

    \[x_{1,i,\pi(1)}^* = T_{\pi(1)}(x_{0,i,\pi(1)})\] \[x_{1,i,\pi(j)}^* = T_{\pi(j)}(x_{0,i,\pi(j)} | x_{0,i,\text{parents}_G(X_{\pi(j)})}, x_{1,i,\text{parents}_G(X_{\pi(j)})}^*)\]

    Toy example application: For $A \to X_1 \to X_2 \to Y$ with Gaussian mediators, ST first transports $X_1$ marginally using $T_1(x_1) = Q_1(F_0(x_1))$, then transports $X_2$ conditionally on the already-transported $X_1^*$ using $T_2(x_2 \lvert x_1, x_1^*) = Q_2(F_0(x_2 \rvert x_1) \lvert x_1^*)$, avoiding parametric specification of the structural equations.

Novelty & Lineage

This work extends optimal transport (OT) methods to causal mediation analysis. Prior work includes Charpentier et al. (2023) and De Lara et al. (2024) who used multivariate OT for counterfactual construction, and Fernandes Machado et al. (2025b) who developed sequential transport for probabilistic graphical models. The key novelty is adapting sequential transport specifically for mediation analysis with mixed-type mediators and establishing theoretical guarantees.

The paper also connects to classical mediation literature (Robins and Greenland 1992, Pearl 2001) and recent work using normalizing flows (Zhou and Wodtke 2025, Pawlowski et al. 2020) for nonparametric mediation. The mutatis mutandis interpretation provides an alternative to cross-world counterfactuals without requiring full structural specification.

SIGNIFICANT - The theoretical analysis of sequential transport consistency and the application to mediation represents a substantial contribution, though it builds incrementally on existing OT and mediation frameworks.

Proof Techniques

The main theoretical results establish consistency of sequential transport maps and induced effect decompositions through several key steps:

  1. Conditional transport map consistency: For numerical mediators, establish uniform convergence:

    \[\sup_{x,z} |\hat{T}_j(x|z_0,z_1) - T_j(x|z_0,z_1)| \stackrel{p}{\to} 0\]

    using kernel-based estimators of conditional CDFs and quantile functions under regularity conditions.

  2. Sequential plug-in consistency: Use inductive argument along topological ordering. At step $j$, assume consistency for all parent nodes $k < j$:

    \[\hat{T}_{Z_k}(z_0) \stackrel{p}{\to} T_{Z_k}(z_0)\]

    Then apply continuity of conditional quantile functions in parent arguments (Assumption A5) to show:

    \[\hat{T}_j(x|z_0, \hat{T}_{Z_j}(z_0)) \stackrel{p}{\to} T_j(x|z_0, T_{Z_j}(z_0))\]
  3. Simplex transport stability: For categorical mediators, exploit continuity of optimal transport on the simplex and consistency of conditional probability estimators:

    \[\sup_{k,z} |\hat{p}_{a,j}(k|z) - p_{a,j}(k|z)| \stackrel{p}{\to} 0\]
  4. Effect decomposition consistency: Combine transported mediator consistency with outcome regression consistency (Assumption A8):

    \[\sup_x |\hat{\mu}_a(x) - \mu_a(x)| \stackrel{p}{\to} 0\]

    The key technical insight is the sequential stability: errors in parent transport propagate through Lipschitz-type continuity conditions rather than compounding exponentially, enabling the inductive proof structure.

Experiments & Validation

Experiments include:

  1. Gaussian simulations showing ST recovers classical mediation formulas in linear SEMs
  2. Nonlinear simulations with mixed-type mediators demonstrating good finite-sample performance
  3. Real-world application to COMPAS dataset for algorithmic fairness, using DAG structure: race → (prior_count, charge_degree) → (decile_score, two_year_recid) → compas_screening_date

    The COMPAS application shows ST provides interpretable unit-level attributions of racial disparities through mediator pathways. Key baseline comparisons are against classical linear mediation and normalizing flow approaches.

    Results demonstrate ST maintains good coverage and low bias across various nonlinear settings while providing deterministic, interpretable counterfactuals that respect the assumed causal structure.

Limitations & Open Problems

Limitations:

  1. Requires known mediator DAG structure - TECHNICAL (could potentially be learned but adds complexity)
  2. Assumes sequential ignorability for causal interpretation - NATURAL (standard assumption in mediation literature)
  3. Conditional transport estimation requires sufficient local sample sizes - TECHNICAL (common issue in nonparametric conditional estimation)
  4. Positivity assumption (A0) requires transported mediators stay in-support - NATURAL (standard overlap condition)
  5. Limited to binary treatment case - RESTRICTIVE (though extension to multi-valued treatments is conceptually straightforward)

    Open problems:

  6. Develop methods for simultaneous DAG structure learning and sequential transport estimation
  7. Extend theoretical guarantees to settings with high-dimensional mediators where curse of dimensionality affects conditional transport estimation

On Heterogeneity in Wasserstein Space

Authors: Kisung You · Institution: Baruch College, CUNY · Category: stat.ME

Introduces a simple U-statistic approach to measure heterogeneity in populations of probability measures using pairwise Wasserstein distances, with complete asymptotic theory and empirical eccentricity diagnostics.

Tags: wasserstein-distance u-statistics measure-valued-data heterogeneity optimal-transport bayesian-posteriors empirical-measures asymptotic-inference

arXiv · PDF

Problem Formulation
  1. Motivation: Data increasingly arise as probability measures (empirical distributions, Bayesian posteriors, feature representations of complex objects). Understanding heterogeneity within populations of such measures is crucial for statistical analysis and interpretation.

  2. Mathematical setup: Let $P_2(\mathbb{R}^d)$ denote the space of Borel probability measures on $\mathbb{R}^d$ with finite second moment, equipped with the quadratic Wasserstein distance $W_2$. Let $\Pi$ be a Borel probability law on $P_2(\mathbb{R}^d)$ and $\psi: [0,\infty) \to \mathbb{R}$ be measurable. Define the pairwise Wasserstein heterogeneity as:

    \[D_\psi(\Pi) = \mathbb{E}[\psi\{W_2(\mu, \nu)\}]\]

    where $\mu, \nu \stackrel{i.i.d.}{\sim} \Pi$. Given independent draws $\mu_1, \ldots, \mu_n \sim \Pi$, the natural estimator is the order-two U-statistic:

    \[U_n = \binom{n}{2}^{-1} \sum_{1 \leq i < j \leq n} h(\mu_i, \mu_j)\]

    where $h(\mu, \nu) = \psi{W_2(\mu, \nu)}$.

    Assumptions:

    1. $\lvert \psi(t) \rvert \leq a + bt^p$ for some $p \geq 1$ and constants $a, b \geq 0$
    2. $\mathbb{E}{W_2(\mu, \mu_0)^p} < \infty$ for some $\mu_0 \in P_2(\mathbb{R}^d)$
    3. $\mathbb{E}{W_2(\mu, \mu_0)^{2p}} < \infty$ for asymptotic normality
  3. Toy example: Consider random translations $\mu_\theta = T_\theta # \mu_0$ where $T_\theta(x) = x + \theta$ and $\theta \sim N(0, \sigma^2 I_d)$. Then $W_2(\mu_\theta, \mu_{\theta’}) = \lvert \theta - \theta’\ \rvert$. For $\psi(t) = t^2$, we get $D_{\psi}(\Pi) = 2d\sigma^2$.

  4. Formal objective: Estimate $D_\psi(\Pi)$ and provide asymptotic inference via:

    \[\mathbb{P}\left(D_\psi(\Pi) \in U_n \pm z_{1-\alpha/2} \hat{\sigma}_n n^{-1/2}\right) \to 1-\alpha\]
Method

The method uses U-statistic theory with empirical eccentricities for variance estimation.

  1. Compute all pairwise Wasserstein distances $W_2(\mu_i, \mu_j)$ for $1 \leq i < j \leq n$

  2. Form the U-statistic estimator:

    \[U_n = \binom{n}{2}^{-1} \sum_{1 \leq i < j \leq n} \psi\{W_2(\mu_i, \mu_j)\}\]
  3. Compute empirical eccentricities for each observation:

    \[\hat{g}_i = (n-1)^{-1} \sum_{j \neq i} h(\mu_i, \mu_j) - U_n\]
  4. Estimate the asymptotic variance:

    \[\hat{\sigma}_n^2 = 4(n-1)^{-1} \sum_{i=1}^n \hat{g}_i^2\]
  5. Form confidence intervals using $U_n \pm z_{1-\alpha/2} \hat{\sigma}_n n^{-1/2}$

    Applied to toy example: For $n=3$ random translations with $\psi(t) = t^2$, if the observed shifts are $\theta_1, \theta_2, \theta_3$, then:

    \[U_3 = \frac{1}{3}[\|\theta_1 - \theta_2\|^2 + \|\theta_1 - \theta_3\|^2 + \|\theta_2 - \theta_3\|^2]\]

    The empirical eccentricity $\hat{g}_1 = \frac{1}{2}[\lvert \theta_1 - \theta_2\ \rvert^2 + \lvert \theta_1 - \theta_3\ \rvert^2] - U_3$ measures how isolated $\mu_{\theta_1}$ is from the sample.

Novelty & Lineage

This paper introduces a novel approach to measuring heterogeneity in populations of probability measures using pairwise Wasserstein distances, avoiding expensive barycenter computations required in prior work (Agueh and Carlier 2011). The use of U-statistics for inference on Wasserstein functionals builds on classical theory (Hoeffding 1948, Serfling 1980) but the specific application to measure-valued data and the empirical eccentricity diagnostic are new. The stability result for plug-in approximations addresses practical concerns when measures are estimated. SIGNIFICANT

Proof Techniques

The proofs rely on classical U-statistic theory with careful moment control via Wasserstein geometry.

  1. Moment domination (Lemma 2.1): Uses triangle inequality for $W_2$ to show:

    \[|h(\mu, \nu)| \leq C(1 + W_2(\mu, \mu_0)^p + W_2(\nu, \mu_0)^p)\]
  2. Strong consistency: Apply strong law for U-statistics once $\mathbb{E}\lvert h(\mu, \nu) \rvert < \infty$ is established

  3. Asymptotic normality via Hoeffding decomposition:

    \[U_n - \theta = \frac{2}{n}\sum_{i=1}^n g_i + U_n^\circ\]

    Key step: Show degenerate term satisfies $\mathbb{E}[(U_n^\circ)^2] = O(n^{-1})$ by exploiting conditional independence structure

  4. Variance estimation: Decompose empirical eccentricities as:

    \[\hat{g}_i = \frac{n-2}{n-1}(g_i - \bar{g}_n) + r_{i,n}\]

    Show remainder term $r_{i,n}$ is negligible: $n^{-1}\sum_{i=1}^n r_{i,n}^2 \to 0$ in probability

  5. Plug-in stability uses Lipschitz property of $\psi$ and reverse triangle inequality:

    \[|U_n - \tilde{U}_n| \leq 2Ln^{-1}\sum_{i=1}^n W_2(\hat{\mu}_i, \mu_i)\]
Experiments & Validation

Two main experiments:

  1. Synthetic 2D Gaussian measures with known analytic heterogeneity values - validates estimator accuracy and eccentricity interpretation
  2. MNIST digit classification - 500 images per digit class, estimates within-class heterogeneity with digits 5,2,7 showing highest variability and digits 0,1 showing lowest. Key finding: empirical eccentricities successfully identify atypical images driving heterogeneity (e.g., unusual digit shapes). Also tests transform choice ($\psi(t) = t^2$ vs bounded transforms) and plug-in stability as inner sample size varies.
Limitations & Open Problems
  1. Computational cost scales as $O(n^2)$ due to all pairwise Wasserstein distances - RESTRICTIVE for large datasets
  2. Polynomial growth condition on $\psi$ - NATURAL, covers most practical transforms
  3. Finite second moment assumption on measures - NATURAL in applications
  4. First-order degeneracy (when $\text{var}{g(\mu_1)} = 0$) breaks normal approximation - TECHNICAL, occurs in highly symmetric populations
  5. Requires explicit Wasserstein distance computation - TECHNICAL, limits to moderate dimensions

    Open problems:

  6. Develop scalable approximations based on incomplete pairwise sampling or faster Wasserstein computation
  7. Characterize higher-order asymptotics for degenerate cases and develop appropriate inference procedures