Theory 7 papers

Theory Digest — Mar 22, 2026

Today’s Digest at a Glance

Preliminary

Today’s papers span optimal control theory for financial derivatives, robust machine learning formulations, differential privacy in high-dimensional statistics, Bayesian calibration methods, efficient neural architectures for biomedical signals, and interpretability tools for reinforcement learning.

Mamba/State Space Models (SSMs)

Mamba represents a recent breakthrough in sequence modeling that addresses the quadratic scaling limitations of transformers. While transformers require $O(n^2)$ attention computations for sequence length $n$, many sequence tasks don’t actually need the full expressivity of attending to every token pair. Mamba builds on the linear state space model foundation where hidden states evolve as $h_{t+1} = Ah_t + Bx_t$ and outputs are $y_t = Ch_t$, but makes the transition matrices input-dependent.

The key insight is to make the SSM parameters functions of the input: $A_t = \exp(\Delta_t \cdot \bar{A})$, $B_t = \Delta_t \cdot \bar{B}(x_t)$, and $C_t = \bar{C}(x_t)$, where $\Delta_t$ is an input-dependent “timescale” (step size) that determines how much the state should update, and $\bar{B}$ and $\bar{C}$ are themselves computed from the current input. This selectivity mechanism allows the model to focus computation on relevant parts of the sequence while maintaining linear complexity. The discretization step converts continuous-time parameters to discrete-time ones, and the resulting recurrence is computed efficiently via parallel scans.

Intuitively, Mamba learns to “forget” irrelevant information and “remember” important patterns through its selective state updates, combining the efficiency of RNNs with the parallelizability of transformers.
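The selective recurrence can be sketched in a few lines of NumPy. This is a sequential reference under simplifying assumptions, not Mamba's hardware-aware parallel scan: the state matrix is diagonal, there is a single input channel, and the projection weights `W_delta`, `W_B`, `W_C` and the function name `selective_ssm` are hypothetical names introduced for illustration.

```python
import numpy as np

def selective_ssm(x, A, W_delta, W_B, W_C):
    """Sequential reference for a single-channel selective SSM.

    x: (T,) input sequence; A: (N,) negative diagonal state matrix.
    W_delta (scalar), W_B (N,), W_C (N,) are hypothetical projection weights
    that make the step size, B, and C depend on the current input.
    """
    h = np.zeros_like(A, dtype=float)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        delta = np.log1p(np.exp(W_delta * x_t))  # softplus keeps the step size positive
        A_t = np.exp(delta * A)                  # discretized transition, entries in (0, 1)
        B_t = delta * (W_B * x_t)                # input-dependent input matrix
        C_t = W_C * x_t                          # input-dependent readout
        h = A_t * h + B_t * x_t                  # selective state update
        y[t] = C_t @ h
    return y
```

A large step size pushes `A_t` toward zero (the old state is forgotten and the new input dominates), while a small step size keeps `A_t` near one and leaves the state almost unchanged.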

Koopman Operator Theory

The Koopman operator provides a principled way to analyze nonlinear dynamical systems by embedding them into infinite-dimensional linear spaces. For a nonlinear system $x_{t+1} = f(x_t)$, traditional analysis focuses on the evolution of states in the original space. However, the Koopman operator $\mathcal{K}$ instead acts on functions of the state: $(\mathcal{K}g)(x) = g(f(x))$ for any observable function $g$.

The breakthrough insight is that even though the original dynamics are nonlinear, the Koopman operator is linear by construction. This means we can apply the full toolkit of linear operator theory—eigenvalues, eigenfunctions, spectral decomposition—to understand nonlinear systems. In practice, we approximate the infinite-dimensional operator using a finite set of basis functions $\psi(x) = [\psi_1(x), \ldots, \psi_n(x)]^T$ and learn a matrix $K$ such that $\psi(x_{t+1}) \approx K\psi(x_t)$.

For control systems where $x_{t+1} = f(x_t, u_t)$, the extended Koopman operator becomes $\mathcal{K}_u g(x) = g(f(x,u))$, allowing us to analyze how control inputs affect the lifted linear dynamics. The eigenvalues of $K$ reveal stability properties, while eigenfunctions corresponding to eigenvalues near the unit circle indicate slowly decaying modes that dominate long-term behavior.

Essentially, Koopman theory lets us “linearize” complex nonlinear systems by lifting them to a higher-dimensional space where linear analysis applies.
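As a small illustration of the finite-dimensional approximation $\psi(x_{t+1}) \approx K\psi(x_t)$, the sketch below fits $K$ by least squares (in the style of extended dynamic mode decomposition) on a textbook system whose dictionary is closed under the dynamics. The system, its coefficients, and the names `step`, `lift`, and `fit_koopman` are assumptions chosen for illustration.

```python
import numpy as np

def step(x, a=0.9, b=0.5, c=0.3):
    """Nonlinear map: x1' = a*x1, x2' = b*x2 + c*x1^2."""
    x1, x2 = x[..., 0], x[..., 1]
    return np.stack([a * x1, b * x2 + c * x1**2], axis=-1)

def lift(x):
    """Dictionary psi(x) = [x1, x2, x1^2]; the dynamics are linear in these."""
    x1, x2 = x[..., 0], x[..., 1]
    return np.stack([x1, x2, x1**2], axis=-1)

def fit_koopman(x_now, x_next):
    """Least-squares K with psi(x_next) ~= K psi(x_now)."""
    psi_now, psi_next = lift(x_now), lift(x_next)
    # Solve psi_now @ K.T ~= psi_next in the least-squares sense.
    k_t, *_ = np.linalg.lstsq(psi_now, psi_next, rcond=None)
    return k_t.T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # sampled states
K = fit_koopman(X, step(X))            # one-step pairs (x_t, x_{t+1})
eigvals = np.sort(np.linalg.eigvals(K).real)
```

For this system the lifted dynamics are exactly linear, so the fitted eigenvalues are $a$, $b$, and $a^2$, all inside the unit circle. With a dictionary that is not closed under the dynamics, $K$ is only an approximation.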

Simulation-Based Calibration (SBC)

Simulation-based calibration addresses a fundamental problem in Bayesian inference: how do we know if our approximate posterior (from MCMC, variational inference, etc.) actually reflects the true uncertainty? Traditional validation methods like cross-validation don’t directly test whether credible intervals have correct coverage or whether posterior samples represent the true posterior distribution.

SBC works by exploiting a key theoretical property: if we draw parameters $\theta \sim p(\theta)$ from the prior, generate data $y \sim p(y\lvert \theta)$, and then perform posterior inference to get samples $\{\theta^{(i)}\}_{i=1}^N$, then the “rank” of the true $\theta$ among these samples should be uniformly distributed. Specifically, we compute $r = \sum_i \mathbf{1}[\theta^{(i)} < \theta]$ for each parameter component, and across many simulation replications, these ranks should be uniformly distributed over the integers $\{0, 1, \ldots, N\}$, where $N$ is the number of posterior samples.

The procedure involves: (1) draw $\theta_\ell \sim p(\theta)$, (2) simulate $y_\ell \sim p(y\lvert \theta_\ell)$, (3) run inference on $y_\ell$ to get posterior samples, (4) compute ranks, and (5) check uniformity via histograms or statistical tests. Deviations from uniformity reveal specific problems: a histogram skewed toward low or high ranks indicates a biased posterior, a U-shaped histogram (too many ranks at both extremes) indicates an underdispersed posterior, and an inverted-U shape (too many central ranks) indicates overdispersion.

SBC essentially turns the abstract question “is my posterior correct?” into the concrete statistical question “are these ranks uniform?”
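The five steps can be sketched on a model whose exact posterior is known. The example below assumes the conjugate model $\theta \sim \text{normal}(0,1)$, $y \lvert \theta \sim \text{normal}(\theta, 1)$, whose exact posterior is $\text{normal}(y/2, 1/\sqrt{2})$, and mimics a faulty inference method by rescaling the posterior standard deviation. The function names and the `posterior_sd_scale` knob are hypothetical.

```python
import numpy as np

def sbc_ranks(posterior_sd_scale, n_sims=2000, n_draws=99, seed=0):
    """Ranks of the true theta among posterior draws for the normal-normal model."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 1.0, size=n_sims)       # (1) prior draws
    y = rng.normal(theta, 1.0)                      # (2) simulated data
    post_sd = posterior_sd_scale / np.sqrt(2.0)     # exact posterior sd is 1/sqrt(2)
    draws = rng.normal(y[:, None] / 2.0, post_sd,   # (3) "inference": sample the
                       size=(n_sims, n_draws))      #     (possibly mis-scaled) posterior
    return (draws < theta[:, None]).sum(axis=1)     # (4) ranks in {0, ..., n_draws}

def extreme_fraction(ranks, n_draws=99, width=10):
    """(5) Share of ranks in the lowest or highest `width` of the n_draws + 1 bins."""
    return np.mean((ranks < width) | (ranks > n_draws - width))

calibrated = extreme_fraction(sbc_ranks(1.0))       # exact posterior
too_narrow = extreme_fraction(sbc_ranks(0.5))       # posterior sd halved
```

Under uniform ranks the extreme bins hold 20% of the mass, so `calibrated` should sit near 0.2. Halving the posterior standard deviation piles ranks into both tails and pushes `too_narrow` well above that.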

Reading Guide

The American options and covariance estimation papers both tackle robustness—model misspecification risk and privacy constraints respectively—using minimax formulations. The Bayesian calibration and stock prediction papers share adaptive themes, with the former adjusting posterior widths and the latter switching between specialized model components. The EEG and RL interpretability papers both leverage state space representations, with Mamba providing efficient sequence modeling and Koopman operators revealing dynamical structure in learned policies.


If Not Now, Then When? Model Risk in the Optimal Exercise of American Options

Authors: Luna Rigby, Rüdiger Frey, Erik Schlögl · Institution: WU Vienna University of Economics and Business · Category: q-fin.MF

Perfect calibration of misspecified models to European option prices fails to eliminate model risk in American option exercise decisions due to stochastic volatility and correlation effects.

Tags: american-options model-risk stochastic-volatility heston-model optimal-stopping calibration finite-differences quantitative-finance


Problem Formulation
  1. Motivation: Model risk in American options arises when traders use misspecified models for optimal exercise decisions. While model risk for European options has been extensively studied, American options present additional challenges because the pricing model determines both continuation values and optimal exercise timing.

  2. Mathematical setup: We work on a filtered probability space $(\Omega, \mathcal{F}, \mathbb{F}, Q)$ where $\mathbb{F} = (\mathcal{F}(t))_{t \in [0,T]}$ is the augmented filtration and $Q$ is a risk-neutral measure. The true benchmark model follows Heston dynamics:

    \[dS(t) = rS(t)dt + \sqrt{v(t)}S(t)dW_1(t)\] \[dv(t) = \kappa(\theta - v(t))dt + \sigma_v\sqrt{v(t)}[\rho dW_1(t) + \sqrt{1-\rho^2}dW_2(t)]\]

    where $W_1, W_2$ are independent Brownian motions.

    Assumptions:

    1. The Feller condition holds: $2\kappa\theta > \sigma_v^2$
    2. Correlation parameter $\rho \in (-1,1)$
    3. Mean-reversion rate $\kappa > 0$, long-term mean $\theta > 0$, vol-of-vol $\sigma_v > 0$

    The optimal exercise time for an American put with strike $K$ and maturity $T$ under the true model is:

    \[\tau^* = \inf\{t \in [0,T] : V(t,S(t),v(t)) = (K-S(t))^+\}\]

    where $V(t,S,v)$ is the American put value in the Heston model.

    Misspecified models use either Black-Scholes dynamics:

    \[d\tilde{S}(t) = r\tilde{S}(t)dt + \sigma\tilde{S}(t)dW_1(t)\]

    or Dupire local volatility:

    \[d\tilde{S}(t) = r\tilde{S}(t)dt + \sigma(t,\tilde{S}(t))\tilde{S}(t)dW_1(t)\]
  3. Toy example: When $\rho = -0.5$, $\kappa = 5$, $\theta = 0.16$, $\sigma_v = 0.9$, $r = 0.1$, $S(0) = K = 10$, $v(0) = 0.25^2$, $T = 1$, the Black-Scholes model calibrates to ATM implied volatility $\sigma \approx 0.37$, while the Dupire model calibrates a local volatility surface $\sigma(t,S)$ to match all European option prices.

  4. Formal objective: Quantify model risk by comparing expected discounted payoffs:

    \[\mathbb{E}^Q[e^{-r\tau^*}(K - S(\tau^*))^+] \text{ vs } \mathbb{E}^Q[e^{-r\tilde{\tau}^*}(K - S(\tilde{\tau}^*))^+]\]

    where $\tilde{\tau}^*$ is the exercise time from misspecified models applied to true Heston paths.

Method

The method follows the Hull-Suo benchmark methodology with numerical computation of optimal exercise boundaries:

  1. Generate European option prices under the true Heston model for strikes $K_m$ and maturities $T_m$, $m = 1,\ldots,M$
  2. Calibrate misspecified models to these prices: Black-Scholes to ATM implied volatility, Dupire to full surface using Andersen-Brotherton-Ratcliffe approach
  3. Compute optimal exercise boundaries numerically via finite difference methods for the variational inequality:

    \[\max\left(\frac{\partial V}{\partial t} + \mathcal{L}V - rV, (K-S)^+ - V\right) = 0\]

    where $\mathcal{L}$ is the appropriate generator for each model

  4. For Heston model, use Alternating Direction Implicit (ADI) splitting with Modified Craig-Sneyd scheme on non-uniform grids
  5. Generate $N$ asset price paths under true Heston dynamics using Milstein discretization
  6. Apply each exercise strategy to the same set of Heston paths and compare payoff distributions

    Applied to toy example: With $\rho = -0.5$ baseline parameters, the method generates 1 million Heston paths with 300 time steps. The Heston exercise boundary $B(t,v)$ decreases in volatility $v$, while misspecified model boundaries $\tilde{B}(t)$ are one-dimensional. For high volatility states, the Heston boundary lies significantly below the Black-Scholes/Dupire boundaries, favoring later exercise and achieving mean discounted payoff 1.076 vs 1.061 (Black-Scholes) and 1.064 (Dupire).
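Step 6 above, applying an exercise strategy to a fixed set of simulated paths, can be sketched as follows. This is a minimal illustration assuming a one-dimensional boundary on the simulation time grid (the Black-Scholes or Dupire case); a state-dependent Heston boundary $B(t,v)$ would also need the simulated variance paths. The function name and argument layout are hypothetical.

```python
import numpy as np

def discounted_put_payoffs(paths, boundary, strike, r, dt):
    """Exercise an American put the first time S_t <= boundary[t].

    paths: (n_paths, n_steps + 1) simulated prices.
    boundary: (n_steps + 1,) exercise boundary on the same time grid.
    Paths that never hit the boundary receive the payoff at maturity.
    """
    n_paths = paths.shape[0]
    hit = paths <= boundary[None, :]
    hit[:, -1] = True                         # force a decision at maturity
    first = hit.argmax(axis=1)                # index of the first exercise time
    s_ex = paths[np.arange(n_paths), first]   # price at exercise
    return np.exp(-r * dt * first) * np.maximum(strike - s_ex, 0.0)
```

Calling this function with the same `paths` and different boundaries, then averaging the outputs, gives the kind of mean discounted payoff comparison reported above.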

Novelty & Lineage

This paper extends the Hull-Suo [2002] benchmark methodology for model risk to American options, which has received limited attention compared to European options. Prior work on American option model risk includes worst-case approaches (Nutz and Zhang [2015]) and parameter uncertainty (Ekstrom and Vannestal [2019]), but systematic analysis comparing calibrated misspecified models is novel.

The numerical methods build on standard approaches: Brennan-Schwartz algorithm for variational inequalities, ADI schemes from In ‘t Hout and Foulon [2010] for Heston PDEs, and Andersen-Brotherton-Ratcliffe [1998] for local volatility calibration.

Key novelty is demonstrating that perfect calibration to European option prices fails to mitigate model risk for American exercise decisions, and that stochastic volatility correlation has substantial impact on optimal exercise that misspecified models cannot capture even with full surface calibration.

SIGNIFICANT

Proof Techniques

This is primarily a numerical study with theoretical foundations rather than new proofs. The main theoretical results used are:

  1. Convexity of American put values from Lamberton-Terenzi [2019]:

    \[V(t,S,v) \text{ is convex in } S\]

    leading to unique exercise boundaries $B(t,v)$ such that the exercise region is $\{S \leq B(t,v)\}$

  2. Viscosity solution characterization (Touzi [1999]) for the variational inequality:

    \[\max\left(\frac{\partial V}{\partial t} + \mathcal{L}V - rV, (K-S)^+ - V\right) = 0\]
  3. For Dupire model, extension of convexity result (Proposition 3.1) using Ekstrom [2004]:

    Under Hölder continuity and growth conditions on $\sigma(t,S)$, convexity of payoff implies convexity of value function

  4. Numerical stability analysis relies on unconditional stability of Modified Craig-Sneyd ADI scheme when parameter $\lambda_2 \geq 1/3$ for pure diffusion, $\lambda_2 \geq 1/2$ for convection-diffusion

    The main technical insight is that two-dimensional Heston model separates volatility level effects from correlation effects, while one-dimensional calibrated models conflate these through the calibrated surface/parameter, leading to systematic exercise timing errors.

Experiments & Validation

Datasets: Synthetic data generated from Heston model with 1 million Monte Carlo paths using Milstein discretization (300 time steps)

Parameters: Baseline case with $\kappa=5$, $\theta=0.16$, $\sigma_v=0.9$, $\rho=-0.5$, $r=0.1$, $S(0)=K=10$, $v(0)=0.25^2$, $T=1$

Calibration: 100 European options (4 maturities × 25 strikes), Black-Scholes to ATM implied vol, Dupire to full surface with mean relative error 0.74%

Key results:

  • Expected discounted payoffs: Heston 1.076, Dupire 1.064, Black-Scholes 1.061
  • Correlation sensitivity: For $\rho \in \{-0.5, 0, 0.5\}$, Heston always optimal, Black-Scholes vs Dupire ranking reverses for $\rho > 0$
  • Recalibration: Weekly recalibration of Black-Scholes (100k paths, 5M total calibrations) shows no improvement over single calibration

Computational methods: Brennan-Schwartz finite differences for 1D problems, Modified Craig-Sneyd ADI for 2D Heston PDE, validation via Longstaff-Schwartz simulation

Limitations & Open Problems

Limitations:

  1. Single underlying asset with no dividends - NATURAL (standard in American option literature)
  2. Feller condition $2\kappa\theta > \sigma_v^2$ ensures positive volatility - NATURAL (standard assumption for Heston model)
  3. Constant risk-free rate - TECHNICAL (could be extended to stochastic rates)
  4. Perfect European option price observability for calibration - RESTRICTIVE (real markets have bid-ask spreads, limited strikes/maturities)
  5. No transaction costs or market frictions - RESTRICTIVE (significant in practice for exercise decisions)
  6. Recalibration study limited to Black-Scholes model due to computational cost - TECHNICAL (limitation of current implementation)
  7. Single calibration scheme per model (ATM for Black-Scholes, full surface for Dupire) - TECHNICAL (other calibration approaches could be tested)

Open problems:

  1. Extend analysis to jump-diffusion models where European calibration may be even less effective at capturing American exercise risks
  2. Develop computationally efficient methods for frequent recalibration of local volatility models to assess whether more sophisticated recalibration can mitigate model risk

Minimax Generalized Cross-Entropy

Authors: Kartheek Bondugula, Santiago Mazuelas, Aritz Pérez, Anqi Liu · Institution: Basque Center for Applied Mathematics, Johns Hopkins University · Category: stat.ML

Proposes a minimax formulation of generalized cross-entropy that achieves convex optimization over classification margins through adaptive link functions and uncertainty sets.

Tags: loss functions robust classification convex optimization minimax learning label noise classification calibration implicit differentiation bilevel optimization


Problem Formulation
  1. Motivation: Generalized cross-entropy (GCE) losses interpolate between cross-entropy and mean absolute error to balance optimization difficulty and robustness to label noise. However, existing GCE formulations result in non-convex optimization over classification margins, leading to underfitting on complex datasets. Convexity is highly desirable for loss functions as it enables tractable optimization.

  2. Mathematical setup: Let $X$ and $Y$ be the feature and label spaces with $k$ classes. A classifier $h: X \to [0,1]^k$ maps features to class probabilities. Classification margins are given by:

    \[f(x,\mu) = [\Phi(x,1)^\top\mu, \Phi(x,2)^\top\mu, \ldots, \Phi(x,k)^\top\mu]\]

    where $\Phi(x,y) \in \mathbb{R}^m$ is the feature mapping and $\mu \in \mathbb{R}^m$ are classifier parameters. The $\alpha$-loss (parameterized by $\beta = \alpha/(\alpha-1)$) is:

    \[\ell_\beta(h,(x,y)) = \beta(1 - h(x)_y^{1/\beta})\]

    The uncertainty set is:

    \[U = \{p \in \Delta(X \times Y) : |E_p[\Phi(x,y)] - \tau| \preceq \lambda, p(x) = p^*(x)\}\]

    where:

    • $p^*(x)$ is the underlying feature distribution
    • $\tau = \frac{1}{n}\sum_{i=1}^n \Phi(x_i,y_i)$ are empirical feature expectations
    • $\lambda \succeq 0$ is a confidence vector

  3. Toy example: Consider binary classification with $k=2$, linear features $\Phi(x,y) = [x; \mathbf{1}_{y=1}]$, and $\beta=2$. The $\alpha$-loss becomes $2(1-\sqrt{h(x)_y})$ and existing GCE uses softmax link $h(x)_y = e^{f(x,\mu)_y}/\sum_j e^{f(x,\mu)_j}$, making the loss non-convex in $\mu$.

  4. Formal objective: Find the minimax classifier:

    \[V_\beta = \min_h \max_{p \in U} E_p[\ell_\beta(h,(x,y))]\]

Method

The MGCE method solves the minimax problem via a bilevel convex optimization:

\[V_\beta = \min_{\mu \in \mathbb{R}^m} -\tau^\top\mu + \lambda^\top|\mu| - E_{p^*(x)}[\phi_\beta(x,\mu)]\]

where $\phi_\beta(x,\mu)$ is implicitly defined by:

\[\phi_\beta(x,\mu) = \max_\nu \nu \text{ s.t. } \sum_{y \in Y} \left(\frac{f(x,\mu)_y + \nu}{\beta}\right)_+^{\beta} \leq 1\]

The exponent $\beta$ comes from inverting $h \mapsto h^{1/\beta}$ in the loss $\beta(1 - h(x)_y^{1/\beta})$. The resulting classification rule assigns probabilities:

\[h_\beta(x)_y = \left(\frac{f(x,\mu^*)_y + \phi_\beta(x,\mu^*)}{\beta}\right)_+^{\beta}\]

Steps:

  1. For each training sample, solve for $\phi_\beta(x,\mu)$ using bisection method
  2. Compute worst-case distribution: $p_\beta(y\lvert x) = \frac{h_\beta(x)_y^{(\beta-1)/\beta}}{\sum_j h_\beta(x)_j^{(\beta-1)/\beta}}$
  3. Update parameters using stochastic gradient: $\frac{\partial\phi_\beta}{\partial\mu} = -\sum_y p_\beta(y\lvert x)\Phi(x,y)$

    Applied to toy example: With $\beta=2$, $\phi_2(x,\mu)$ satisfies $\sum_y \left(\frac{f(x,\mu)_y + \phi_2(x,\mu)}{2}\right)_+^{2} = 1$, a convex quadratic constraint.
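Steps 1 and 2 can be sketched for a single sample as follows. The sketch assumes $\beta > 1$ and that the constraint has the form $\sum_y ((f_y + \nu)/\beta)_+^{\beta} \leq 1$, which is the form under which the $\beta = 2$ case is quadratic. The bracketing interval and the names `phi_beta` and `mgce_probabilities` are illustrative choices, not taken from the paper.

```python
import numpy as np

def phi_beta(f, beta, tol=1e-10):
    """Largest nu with sum_y ((f_y + nu) / beta)_+^beta <= 1, found by bisection."""
    def g(nu):
        return np.sum(np.maximum((f + nu) / beta, 0.0) ** beta)
    # g is nondecreasing in nu, with g(lo) = 0 <= 1 <= g(hi) for this bracket.
    lo, hi = -np.max(f), beta - np.max(f)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) <= 1.0:
            lo = mid
        else:
            hi = mid
    return lo

def mgce_probabilities(f, beta):
    """Classification rule h and worst-case label distribution p for margins f."""
    nu = phi_beta(f, beta)
    base = np.maximum((f + nu) / beta, 0.0)
    h = base ** beta                     # class probabilities, summing to ~1
    w = base ** (beta - 1.0)             # equals h ** ((beta - 1) / beta)
    return h, w / w.sum()

h, p = mgce_probabilities(np.array([0.5, -0.2, 0.1]), beta=2.0)
```

The bracket works because at $\nu = -\max_y f_y$ every term vanishes, while at $\nu = \beta - \max_y f_y$ the largest term alone equals 1.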

Novelty & Lineage

This extends the minimax classification framework of Mazuelas et al. (2022, 2023) from MAE and CE losses to general $\alpha$-losses. Prior work by Zhang and Sabuncu (2018) introduced GCE with fixed softmax links leading to non-convex optimization. Sypherd et al. (2022a,b) studied $\alpha$-loss properties but didn’t address convexity. The key novelty is deriving a convex bilevel formulation for general $\alpha$-losses with adaptive link functions that depend on $\beta$, plus efficient optimization via implicit differentiation. The theoretical characterization of worst-case distributions and performance guarantees are also new contributions. SIGNIFICANT.

Proof Techniques

The main proof techniques are:

  1. Lagrangian duality transformation: Convert the minimax problem over probability distributions to optimization over parameters by taking the dual of the inner maximization, yielding:

    \[\min_{\mu,\nu} -\tau^\top\mu + \lambda^\top|\mu| - E_{p^*}[\nu_x]\]

    subject to margin constraints.

  2. Implicit function theorem: The constraint $\sum_y \left(\frac{f(x,\mu)_y+\nu}{\beta}\right)_+^{\beta} \leq 1$ is tight at the optimum and implicitly defines $\phi_\beta(x,\mu)$ via:

    \[F(x,\mu,\phi_\beta(x,\mu)) = 1\]

    where $F$ denotes the left-hand side of the constraint.

  3. KKT optimality conditions: For the worst-case distribution characterization, apply KKT conditions to:

    \[\min_h -p_\beta^\top h^{1/\beta} \text{ s.t. } h \succeq 0, \mathbf{1}^\top h = 1\]

    yielding $h_{\beta i} = \frac{p_{\beta i}^{\beta/(\beta-1)}}{\sum_j p_{\beta j}^{\beta/(\beta-1)}}$.

  4. Implicit differentiation: For stochastic gradients, differentiate $F(x,\mu,\phi_\beta(x,\mu)) = 1$ to get:

    \[\frac{\partial\phi_\beta}{\partial\mu} = -\frac{\sum_y \left(\frac{f(x,\mu)_y+\phi_\beta}{\beta}\right)_+^{\beta-1} \Phi(x,y)}{\sum_j \left(\frac{f(x,\mu)_j+\phi_\beta}{\beta}\right)_+^{\beta-1}}\]

  5. Convexity proof: Show $\phi_\beta(x,\mu)$ is concave in $\mu$ by demonstrating the feasible set in its definition is convex and applying the envelope theorem.

Experiments & Validation

Datasets: FashionMNIST, CIFAR-10, SVHN, CIFAR-100, Tiny ImageNet, WebVision (real-world noisy), Clothing-1M, plus tabular datasets (MNIST, Letter Recognition, Covertype, Adult).

Baselines: GCE (Zhang & Sabuncu 2018), Cross-entropy, MAE (Mazuelas et al. 2023).

Architectures: ResNet-18/34/50 for different datasets, MLP for tabular data.

Key numbers: MGCE achieves 91.01% vs 90.55% (GCE) on CIFAR-10 clean, 86.91% vs 86.95% under 20% noise. On complex WebVision dataset, MGCE significantly outperforms GCE which underfits. MGCE shows faster convergence and better calibration (lower SCE) especially under label noise. Synthetic symmetric noise rates: 20% and 40%.

Limitations & Open Problems
  1. Bisection method for computing $\phi_\beta(x,\mu)$ adds computational overhead - TECHNICAL (could potentially be optimized with better root-finding methods)

  2. Uncertainty set design requires tuning regularization parameter $\lambda_0$ - NATURAL (standard hyperparameter tuning, similar to existing methods)

  3. Theoretical guarantees assume underlying distribution lies in uncertainty set - NATURAL (standard assumption in robust optimization)

  4. Method requires differentiable feature mappings $\Phi(x,y)$ - TECHNICAL (standard in deep learning but excludes some discrete settings)

  5. Performance gains most significant under label noise, less clear for clean data - RESTRICTIVE (limits applicability to noisy scenarios)

    Open problems:

  1. Extend to multi-label classification where $\sum_y h(x)_y \neq 1$
  2. Develop adaptive schemes for automatically selecting $\beta$ during training rather than cross-validation

Minimax and Adaptive Covariance Matrix Estimation under Differential Privacy

Authors: T. Tony Cai, Yicheng Li · Institution: University of Pennsylvania, Tsinghua University · Category: math.ST

Establishes minimax-optimal differentially private estimation of bandable covariance matrices with novel van Trees inequality technique and adaptive procedures.

Tags: differential privacy covariance estimation minimax theory bandable matrices adaptive estimation high-dimensional statistics privacy-utility tradeoff


Problem Formulation
  1. Motivation: Covariance matrix estimation is fundamental for high-dimensional data analysis, but privacy constraints require adding noise that degrades accuracy. Bandable covariance matrices arise naturally in temporal and spatial data where correlations decay with separation.

  2. Mathematical setup: We observe i.i.d. random vectors $x_1, \ldots, x_n \in \mathbb{R}^d$ with population covariance matrix $\Sigma$. Each vector satisfies $\|x_i\|_{\psi_2} \leq K$ for some constant $K > 0$. The bandable covariance class is:

    \[\mathcal{F}_{\alpha} = \{\Sigma : \forall k\text{-off-diagonal block } R_k, \|\Sigma_{R_k}\| \leq C_1 k^{-\alpha} \text{ and } \|\Sigma\| \leq C_2\}\]

    where a $k$-off-diagonal block lies above the $k$-th super-diagonal. A randomized algorithm $M$ satisfies $\rho$-zCDP if for all adjacent datasets $S, S'$:

    \[D_{\alpha}(M(S) \| M(S')) \leq \rho \alpha, \quad \forall \alpha \in (1,\infty)\]

    where $D_{\alpha}$ is the $\alpha$-Rényi divergence.

  3. Toy example: When $d=3$ and $\Sigma = \begin{pmatrix} 1 & 0.5 & 0.1 \\ 0.5 & 1 & 0.5 \\ 0.1 & 0.5 & 1 \end{pmatrix}$ with decay parameter $\alpha=1$, the $(1,3)$ entry has magnitude $0.1 \leq 0.5 \cdot 2^{-1} = 0.25$, satisfying the bandable bound $C_1 k^{-\alpha}$ with $C_1 = 0.5$ and $k = 2$, so correlations decay as distance increases.

  4. Formal objective: Minimize the minimax risk:

    \[\inf_{\hat{\Sigma} \in \mathcal{M}_{\rho}} \sup_{\Sigma \in \mathcal{F}_{\alpha}} \mathbb{E}[\|\hat{\Sigma} - \Sigma\|^2]\]

    where $\mathcal{M}_{\rho}$ is the class of all $\rho$-zCDP estimators.

Method

The blockwise tridiagonal estimator partitions the covariance matrix into blocks of size $k$ and estimates only diagonal and first off-diagonal blocks:

  1. Partition indices into blocks: $I_{k,\ell} = [1 + (\ell-1)k, \ell k] \cap \{1,\ldots,d\}$
  2. Define block regions: $B_{k;\ell,\ell'} = I_{k,\ell} \times I_{k,\ell'}$
  3. For each block $B$, compute truncated sample covariance:

    \[\tilde{\Sigma}_B = \frac{1}{n} \sum_{i=1}^n \tilde{x}_{i,I} \tilde{x}_{i,J}^{\top} - \hat{\mu}_I \hat{\mu}_J^{\top}\]

    where $\tilde{x}_{i,I} = x_{i,I} \mathbf{1}\{\|x_{i,I}\|_2 \leq L\sqrt{\lvert I \rvert}\}$

  4. Add Gaussian noise for privacy:

    \[\hat{\Sigma}^{DP}_B = \tilde{\Sigma}_B + \sigma_M M_B\]

    where $M_B \sim \text{GUE}(d)_B$ and $\sigma_M^2 = \frac{18L^2\lvert B \rvert}{\rho_0 n^2}$

  5. Retain only tridiagonal blocks (diagonal and first super/sub-diagonal), set others to zero

    For the toy example with $d=3, k=1$: estimate blocks $(1,1), (2,2), (3,3), (1,2), (2,3)$ with noise, set $(1,3)$ entry to zero.
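The block-selection step can be sketched with a boolean mask. This is a simplified illustration, not the paper's estimator: truncation of large observations is omitted, the noise is a generic symmetric Gaussian matrix, and `noise_sd` is a caller-supplied placeholder rather than the calibrated privacy scale $\sigma_M$. Both function names are hypothetical.

```python
import numpy as np

def tridiagonal_block_mask(d, k):
    """Boolean (d, d) mask keeping diagonal and first off-diagonal k-blocks."""
    block = np.arange(d) // k                    # block index of each coordinate
    return np.abs(block[:, None] - block[None, :]) <= 1

def noisy_block_tridiagonal(x, k, noise_sd, rng):
    """Sample covariance plus symmetric Gaussian noise, restricted to kept blocks.

    Assumption: no truncation step, and `noise_sd` is NOT calibrated to a
    privacy budget here; it only marks where the noise enters.
    """
    d = x.shape[1]
    cov = np.cov(x, rowvar=False, bias=True)     # (1/n) sum x x^T - mean mean^T
    z = rng.normal(size=(d, d))
    noise = (z + z.T) / np.sqrt(2.0)             # symmetric Gaussian matrix
    return np.where(tridiagonal_block_mask(d, k), cov + noise_sd * noise, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
est = noisy_block_tridiagonal(x, k=1, noise_sd=0.1, rng=rng)
```

With $d = 3$ and $k = 1$ the mask keeps the main diagonal and the first off-diagonals, so the $(1,3)$ and $(3,1)$ entries of `est` are exactly zero, as in the toy example.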

Novelty & Lineage

Novel contributions:

  1. First minimax-optimal differentially private estimator for bandable covariance matrices achieving rate $n^{-\frac{2\alpha}{2\alpha+1}} + (\frac{d}{\rho n^2})^{\frac{\alpha}{\alpha+1}}$ without logarithmic losses.
  2. New DP van Trees inequality providing general lower bound technique for private estimation.
  3. Adaptive estimator achieving optimal rates without knowing decay parameter $\alpha$.

    Extends Cai et al. [15] (non-private bandable covariance estimation) and recent work on unstructured DP covariance estimation [24,36,38]. The blockwise tridiagonal structure is simpler than prior tapering estimators and better suited for private setting.

    SIGNIFICANT

Proof Techniques

Upper bound proof uses blockwise error decomposition:

  1. Individual block concentration: For each block $B$, bound $\|\hat{\Sigma}^{DP}_B - \Sigma_B\|$ using three error sources:

    • Bias from truncation: $O(n^{-1})$
    • Statistical variance: $O(\frac{k + \log d}{n})$
    • Privacy noise variance: $O(\frac{dk(k + \log d)}{\rho n^2})$

  2. Key structural lemma exploiting tridiagonal pattern:

    \[\|\hat{\Sigma}^{DP} - \Sigma\| \leq 4 \max_B \|\hat{\Sigma}^{DP}_B - \Sigma_B\|\]

  3. Bias control using bandable decay: off-diagonal blocks satisfy $\|\Sigma_{R_k}\| \leq C_1 k^{-\alpha}$

  4. Lower bound via novel DP van Trees inequality: For $\rho$-zCDP estimator $\hat{\theta}$:

    \[\mathbb{E}_{\pi}\mathbb{E}_{\theta}[\|\hat{\theta} - \theta\|_2^2] \geq \frac{p^2}{I + \text{Tr}(J_{\pi})}\]

    where $I = C_{\rho}\rho n^2 \int \|I_x(\theta)\| d\pi(\theta) \wedge n\int \text{Tr}(I_x(\theta)) d\pi(\theta)$

  5. Carefully constructed priors over bandable matrices to achieve matching lower bounds

Experiments & Validation

Synthetic experiments on multivariate Gaussian data with bandable covariance structure. Validates theoretical convergence rates under varying privacy budgets $\rho \in [0.1, 10]$, sample sizes $n \in [500, 3000]$, and dimensions $d \in [50, 500]$.

Key findings:

  1. Estimation error decreases as privacy budget increases, matching theory.
  2. Convergence rates match predicted slopes: $n^{-0.67}$ vs theoretical $n^{-\frac{2}{3}}$ for $\alpha=1$.
  3. Adaptive estimator performs robustly across different decay parameters without requiring $\alpha$ knowledge, though with slight performance cost compared to oracle tuned estimator.

Limitations & Open Problems
  1. Sub-Gaussian assumption $\|x_i\|_{\psi_2} \leq K$ - NATURAL (standard in high-dimensional statistics)

  2. Dimension constraint $d \lesssim n^{\gamma}$ for fixed $\gamma > 0$ - TECHNICAL (ensures minimax risk doesn’t diverge, likely improvable)

  3. Privacy budget constraint $\rho n^2/d \gtrsim (\log d)^{2(\alpha+1)}$ for optimal rates - TECHNICAL (mild condition for meaningful private estimation)

  4. Bandable structure assumption - NATURAL (common in temporal/spatial data)

  5. Gaussian noise mechanism only - TECHNICAL (could extend to other DP mechanisms)

    Open problems:

  1. Determine precise cost of adaptivity under DP (conjecture: logarithmic loss is unavoidable)
  2. Extend to other structured covariance classes (sparse, Toeplitz) under differential privacy

Approximate posterior recalibration

Authors: Tiffany Cai, Philip Greengard, Ben Goodrich, Andrew Gelman · Institution: Columbia University · Category: stat.ME

Introduces two practical methods to widen narrow approximate posterior intervals using simulation-based calibration, with theoretical analysis showing posterior recalibration fundamentally cannot work for simple models.

Tags: bayesian_inference calibration uncertainty_quantification variational_inference simulation_based_calibration approximate_inference hierarchical_models confidence_intervals


Problem Formulation
  1. Motivation: Approximate Bayesian inference methods (e.g., variational inference, MCMC approximations) often produce posterior intervals that are too narrow, failing to capture true posterior uncertainty. This leads to overconfident predictions and poor calibration in downstream applications across statistics and machine learning.

  2. Mathematical setup: Consider a Bayesian model $M$ with parameter vector $\Theta \in \mathbb{R}^d$ and scalar component of interest $\theta = g(\Theta)$ for some function $g: \mathbb{R}^d \to \mathbb{R}$. The true data-generating process follows:

    \[\Theta \sim p(\Theta)\] \[y | \Theta \sim p(y | \Theta)\]

    The exact posterior is:

    \[p(\Theta | y) \propto p(\Theta) p(y | \Theta)\]

    However, we only have access to an approximate model $M'$ that produces approximate posterior draws $\{\Theta^{post}_{l,s}\}_{s=1}^S$ when fitted to data $y_l$.

    Assumptions:

    1. The model $M$ is correctly specified (data truly generated from this model)
    2. Approximate posteriors have correct location but incorrect scale
    3. Adjustment can be parameterized by a single scaling factor $k$ around the posterior mean
  3. Toy example: Consider the one-parameter normal model:

    \[\theta \sim \text{normal}(0, 1)\] \[y | \theta \sim \text{normal}(\theta, 1)\]

    The exact posterior is $\theta \lvert y \sim \text{normal}(y/2, 1/\sqrt{2})$, but suppose our approximate method produces $\theta^{approx} \lvert y \sim \text{normal}(y/2, \sigma_{wrong})$ where $\sigma_{wrong} < 1/\sqrt{2}$.

  4. Formal objective: Find scaling factor $k$ such that adjusted posterior intervals achieve nominal coverage:

    \[P(\theta_l \in [Q_{\alpha/2}(k), Q_{1-\alpha/2}(k)]) = 1 - \alpha\]

    where $Q_{\alpha}(k)$ are quantiles of the $k$-scaled approximate posterior.

Method

The paper proposes two Approximate Posterior Calibration (APC) methods:

Method 1: Nominal Coverage Method

  1. Generate $L$ parameter draws $\Theta_1, \ldots, \Theta_L$ from prior $p(\Theta)$
  2. For each $l$, generate data $y_l \sim p(y \lvert \Theta_l)$ and fit approximate model to get posterior samples $\{\theta_{l,s}^{post}\}_{s=1}^S$
  3. Compute quantile $q_l = \frac{1}{S} \sum_{s=1}^S \mathbf{1}_{\theta_l > \theta_{l,s}^{post}}$
  4. Find scaling $k$ that minimizes:

    \[\left((q_{(1-\alpha/2)} - q_{(\alpha/2)}) - (1-\alpha)\right)^2\]
  5. Apply scaling to new data: $\theta_{adj} = \mu^{post} + k(\theta^{post} - \mu^{post})$

Method 2: Z-score Method

  1. Compute z-scores: $z_l = \frac{\theta_l - \mu_l^{post}}{\sigma_l^{post}}$
  2. Calculate $s_z = \text{sd}(z_1, \ldots, z_L)$
  3. Apply adjustment:

    \[\theta_{adj,l,s}^{post} = \mu_l^{post} + s_z(\theta_{l,s}^{post} - \mu_l^{post})\]

    Toy example application: For the normal model with true posterior $\theta \lvert y \sim \text{normal}(y/2, 1/\sqrt{2})$ but approximate posterior with standard deviation scaled down by factor 3, both methods would identify $k \approx 3$ to restore proper coverage.
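The z-score method can be sketched on this toy model. The example assumes the normal-normal setup above with the approximate posterior standard deviation artificially divided by 3. The function names `zscore_scale` and `rescale` are hypothetical.

```python
import numpy as np

def zscore_scale(theta_true, post_draws):
    """s_z: standard deviation of the z-scores of the true parameters."""
    mu = post_draws.mean(axis=1)
    sd = post_draws.std(axis=1, ddof=1)
    return np.std((theta_true - mu) / sd, ddof=1)

def rescale(post_draws, k):
    """Stretch (k > 1) or shrink draws around each posterior mean."""
    mu = post_draws.mean(axis=1, keepdims=True)
    return mu + k * (post_draws - mu)

rng = np.random.default_rng(0)
L, S = 1000, 1000
theta = rng.normal(0.0, 1.0, size=L)                  # prior draws
y = rng.normal(theta, 1.0)                            # simulated data
narrow_sd = (1.0 / np.sqrt(2.0)) / 3.0                # exact posterior sd divided by 3
draws = rng.normal(y[:, None] / 2.0, narrow_sd, size=(L, S))

k = zscore_scale(theta, draws)                        # should land near 3 here
adjusted = rescale(draws, k)
```

Because the approximate posterior is centered correctly and only its scale is wrong, the standard deviation of the z-scores recovers the narrowing factor directly, with no grid search needed.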

Novelty & Lineage

This work extends simulation-based calibration checking (SBC) from Talts et al. (2020) and Cook et al. (2006) from diagnostic to corrective use. Prior recalibration work includes Rodrigues et al. (2018) for ABC methods, Yu et al. (2021) using moment-based calibration, and Bon et al. (2023) with linear recalibration.

Key novel contributions:

  1. Two simple scaling-based recalibration methods extending SBC
  2. Theoretical analysis showing posterior recalibration fundamentally cannot work for simple models
  3. Mathematical characterization of why posterior predictive calibration fails in normal-normal setting

    The posterior recalibration analysis is novel, showing it reduces pooling toward the prior and creates overconfident intervals, though it might work for hierarchical models with internal replication.

    INCREMENTAL

Proof Techniques

The main theoretical contribution analyzes posterior vs. prior calibration for the normal-normal model:

\[\theta \sim \text{normal}(0, 1), \quad y | \theta \sim \text{normal}(\theta, \sigma)\]

Key analytical result: Under posterior calibration, z-scores have a non-standard distribution:

\[z_l = \frac{1}{\sigma^{post}}(\theta_l - \mu_l^{post}) \sim \text{normal}\left(\frac{\sigma y}{(1 + \sigma^2)^{3/2}}, \frac{\sqrt{\sigma^4 + \sigma^2 + 1}}{1 + \sigma^2}\right)\]

Derivation technique:

  1. Express posterior parameters: $\mu^{post} = \frac{y}{1 + \sigma^2}$, $\sigma^{post} = \frac{\sigma}{\sqrt{1 + \sigma^2}}$
  2. Under posterior calibration, $\theta_l \sim \text{normal}(\mu^{post}, \sigma^{post})$
  3. Generate $y_l \lvert \theta_l \sim \text{normal}(\theta_l, \sigma)$, compute $\mu_l^{post} = \frac{y_l}{1 + \sigma^2}$
  4. Analytical computation of $z_l$ distribution using change of variables

    Key insight: The non-unit variance proves posterior calibration cannot achieve nominal coverage except when $\sigma \to 0$ or $\sigma \to \infty$ (uninformative cases).

    Recalibration effect (for $\sigma = 1$): Posterior recalibration reduces pooling, moving the posterior mean from $y/2$ to $3y/4$, and understates uncertainty, shrinking the posterior standard deviation from $\sigma^{post}$ to $\sigma^{post} \sqrt{3}/2$.
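
The derivation steps can be checked by direct simulation. This is a minimal sketch, assuming $\sigma = 1$ and an observed value $y = 2$ (both illustrative choices); the function name `posterior_calibration_z` is hypothetical. It draws $\theta_l$ from the posterior, regenerates $y_l$, and reports the empirical mean and standard deviation of $z_l$.

```python
import numpy as np

def posterior_calibration_z(y=2.0, sigma=1.0, L=200_000, seed=0):
    """Simulate z-scores when theta_l is drawn from the posterior instead of the prior."""
    rng = np.random.default_rng(seed)
    mu_post = y / (1.0 + sigma**2)                  # posterior mean given the observed y
    sd_post = sigma / np.sqrt(1.0 + sigma**2)       # posterior sd (does not depend on y)
    theta = rng.normal(mu_post, sd_post, size=L)    # theta_l ~ normal(mu_post, sd_post)
    y_rep = rng.normal(theta, sigma)                # y_l | theta_l ~ normal(theta_l, sigma)
    mu_rep = y_rep / (1.0 + sigma**2)               # posterior mean refit on y_l
    z = (theta - mu_rep) / sd_post                  # z-scores z_l
    return float(z.mean()), float(z.std(ddof=1))

z_mean, z_sd = posterior_calibration_z()
print(f"mean(z) = {z_mean:.3f}, sd(z) = {z_sd:.3f}")
```

For these values the simulated z-scores have a mean near 0.71 and a standard deviation near 0.87. The standard deviation below 1 is what makes a z-score-based rescaling narrow the posterior rather than leave it unchanged.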

Experiments & Validation

Synthetic experiments:

  1. One-parameter Gaussian model (Section 4.1): Artificial narrowing by factor 3, recovery with both methods achieving $k \approx 3$
  2. 8-schools hierarchical model (Section 4.2): ADVI vs HMC comparison, finding scaling factors 2.4-2.5 needed for ADVI calibration

    Key experimental parameters:

    • $L = 1000$ prior/posterior draws
    • $S = 1000$ posterior samples per replication
    • Coverage levels $\alpha \in \{0.05, 0.1, 0.2, 0.5\}$
    • Grid search over scaling values 2.00 to 5.00

    Results:

    • Both methods achieve near-nominal coverage (e.g., 95% intervals get 95.1-95.2% coverage)
    • Z-score method computationally more efficient (no grid search)
    • Methods give similar scaling factors (difference typically $< 0.1$)

    Validation approach: Comparison against HMC “ground truth” for the 8-schools model shows that ADVI posteriors need ~2.5x widening even though they are only ~2x too narrow; the extra widening is attributed to systematic bias.

Limitations & Open Problems

Limitations:

  1. Assumes model is correctly specified - NATURAL (standard assumption in Bayesian calibration)
  2. Only corrects posterior variance, not bias or higher moments - TECHNICAL (authors suggest extensions possible)
  3. Universal scaling factor across all confidence levels - TECHNICAL (could vary scaling by $\alpha$)
  4. Single scalar parameter adjustment at a time - RESTRICTIVE (no joint calibration in high dimensions)
  5. Requires repeated approximate inference for calibration - NATURAL (computational cost standard for calibration procedures)
  6. Posterior recalibration fundamentally flawed for non-hierarchical models - NATURAL (theoretical limitation proven in paper)

Open problems:

  1. Extending to joint calibration of multiple parameters simultaneously while maintaining computational tractability
  2. Developing theoretically principled posterior recalibration methods for hierarchical models that leverage internal replication structure

LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling

Authors: Danaé Broustail, Anna Tegon, Thorir Mar Ingolfsson, Yawei Li et al. (5 authors) · Institution: ETH Zurich · Category: cs.AI

LuMamba combines topology-invariant cross-attention with linear-complexity Mamba blocks and introduces LeJEPA pre-training to EEG, achieving competitive performance with 377× fewer FLOPs than Transformer baselines.

Tags: EEG foundation models state-space models self-supervised learning biomedical signal processing topology-invariant encoding Mamba LeJEPA

Problem Formulation
  1. Motivation: EEG foundation models face two critical bottlenecks: (i) electrode topology heterogeneity across datasets causes performance degradation when transferring between different electrode configurations, and (ii) Transformer-based architectures incur quadratic complexity in sequence length, limiting scalability for long EEG recordings.

  2. Mathematical setup: Let $x \in \mathbb{R}^{B \times C \times T}$ be an EEG recording with batch size $B$, $C$ channels, and $T$ time points. Different datasets have varying channel counts $C \in \{16, 20, 22, 26\}$ with different electrode placements. The signal is tokenized into $S = T/P$ patches of length $P$, yielding $X_{tok} \in \mathbb{R}^{(B \cdot S) \times C \times E}$ after embedding to dimension $E$.

    Assumptions:

    1. EEG signals are sampled at 256 Hz and segmented into non-overlapping windows
    2. Channel configurations follow standard 10-20 system conventions
    3. Pre-training and downstream tasks have disjoint electrode topologies
  3. Toy example: When $B=1$, $C=20$, $T=1280$ (5-second window), and $P=64$, we have $S=20$ patches. The topology-invariant encoder must map this to a fixed latent space regardless of whether the input has 16, 20, or 26 channels.

  4. Formal objective: Minimize the combined loss

    \[\mathcal{L} = \mathcal{L}_{recon} + \lambda \cdot \mathcal{L}_{LeJEPA}\]

    where $\mathcal{L}_{recon}$ is masked reconstruction loss and $\mathcal{L}_{LeJEPA}$ includes both predictive alignment and SIGReg regularization terms.

Method

LuMamba combines LUNA’s topology-invariant encoding with FEMBA’s bidirectional Mamba blocks:

  1. LUNA Encoder: Input $x \in \mathbb{R}^{B \times C \times T}$ is tokenized and embedded with temporal, spectral, and 3D positional features to yield $X_{tok} \in \mathbb{R}^{(B \cdot S) \times C \times E}$

  2. Channel Unification: Cross-attention with $Q$ learnable queries projects across channel dimension:

    \[X_{lat} = \text{CrossAttn}(Q_{queries}, X_{tok}) \in \mathbb{R}^{(B \cdot S) \times Q \times E}\]
  3. Bi-Mamba Processing: Reshape to $B \times S \times (Q \cdot E)$ and apply bidirectional Mamba blocks for temporal modeling

  4. LeJEPA Pre-training: Sample $N_{global}=2$ global windows of size $T_{global}$ and $N_{local}=4$ local windows of size $T_{local} < T_{global}$. The predictive loss aligns local and global embeddings:

    \[\mathcal{L}_{JEPA} = \frac{1}{N_{local}} \sum_{i=1}^{N_{local}} \|\mu_{global} - v_{local,i}\|_2^2\]

    where $\mu_{global} = \frac{1}{N_{global}} \sum_{j=1}^{N_{global}} v_{global,j}$

  5. SIGReg: Project embeddings onto random 1D slices and measure discrepancy from isotropic Gaussian using Epps-Pulley test

    Toy example application: For a 20-channel input, the method maps it through cross-attention to a fixed $Q=64$ query space, processes temporally with bi-Mamba, then optimizes the combined reconstruction and LeJEPA objectives.
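
The shape bookkeeping of steps 1-3 and the predictive loss of step 4 can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: the embedding is a single random linear projection (the real encoder adds temporal, spectral, and 3D positional features), the cross-attention is single-head with no key/value projections, and the names `patchify`, `unify_channels`, `encode`, and `jepa_loss` as well as $E = 8$ are assumptions made for the example.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def patchify(x, P):
    """(B, C, T) -> (B*S, C, P) with S = T // P non-overlapping patches."""
    B, C, T = x.shape
    S = T // P
    x = x[:, :, : S * P].reshape(B, C, S, P)
    return x.transpose(0, 2, 1, 3).reshape(B * S, C, P)

def unify_channels(x_tok, queries):
    """Single-head cross-attention over channels: (N, C, E), (Q, E) -> (N, Q, E)."""
    E = x_tok.shape[-1]
    scores = np.einsum("qe,nce->nqc", queries, x_tok) / np.sqrt(E)
    weights = softmax(scores, axis=-1)          # normalise over the channel axis
    return np.einsum("nqc,nce->nqe", weights, x_tok)

def encode(x, P, W_embed, queries):
    """Tokenize, embed, unify channels, and reshape to (B, S, Q*E)."""
    B = x.shape[0]
    x_tok = patchify(x, P) @ W_embed            # (B*S, C, E), stand-in embedding
    x_lat = unify_channels(x_tok, queries)      # (B*S, Q, E), independent of C
    return x_lat.reshape(B, -1, x_lat.shape[1] * x_lat.shape[2])

def jepa_loss(v_global, v_local):
    """Mean squared distance between each local embedding and the mean global embedding."""
    mu_global = v_global.mean(axis=0)
    return float(np.mean(np.sum((mu_global - v_local) ** 2, axis=-1)))

rng = np.random.default_rng(0)
P, E, Q = 64, 8, 64
W_embed = rng.normal(size=(P, E))
queries = rng.normal(size=(Q, E))
for C in (16, 20, 26):
    x = rng.normal(size=(1, C, 1280))           # B=1, T=1280 as in the toy example
    print(C, encode(x, P, W_embed, queries).shape)
```

Each channel count yields the same output shape `(1, 20, 512)`, because the learnable queries, not the input channels, set the size of the latent axis.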

Novelty & Lineage

This work combines existing components in a novel way: LUNA’s topology-invariant encoding (Döner et al., 2025) with FEMBA’s bidirectional Mamba blocks (Tegon et al., 2025). The main novelty is the first adaptation of LeJEPA (Balestriero et al., 2025) from images/video to EEG time series, including temporal view construction and the systematic study of mixed reconstruction-LeJEPA objectives. Prior EEG foundation models like LaBraM, BIOT, and EEGFormer used only masked reconstruction or contrastive learning. The architectural fusion of linear-complexity SSMs with topology-invariant encoding is also novel, enabling efficient processing across heterogeneous electrode configurations. INCREMENTAL

Proof Techniques

This is an empirical paper with no formal proofs. The main technical contributions are:

  1. Architectural Design: Demonstrating that LUNA’s cross-attention mechanism can be successfully combined with bidirectional Mamba blocks while preserving topology invariance

  2. LeJEPA Adaptation: Showing that temporal window sampling (global vs local views) can replace spatial cropping from the original image-based LeJEPA

  3. Empirical Analysis: Using t-SNE visualizations and downstream task performance to show the complementary effects of reconstruction (structured embeddings) vs LeJEPA (isotropic regularization)

    The key technical insight is the trade-off discovery:

    \[\text{Reconstruction} \rightarrow \text{structured latents} \rightarrow \text{better in-distribution performance}\]
    \[\text{LeJEPA} \rightarrow \text{isotropic latents} \rightarrow \text{better cross-montage transfer}\]

Experiments & Validation

Datasets: Pre-trained on TUEG corpus (21,600 hours, 14,000+ patients). Evaluated on 5 downstream tasks: TUAB (normal/abnormal, 2,329 subjects), TUAR (artifact detection, 213 subjects), TUSL (slowing events, 38 patients), APAVA (Alzheimer’s detection, 23 patients, 16 channels), TDBrain (Parkinson’s detection, 72 patients, 26 channels).

Baselines: BENDR, EEGFormer, BIOT, LaBraM, LUNA, FEMBA, BioMamba, Medformer.

Key Results:

  • TUAB: 80.99% balanced accuracy (competitive with LaBraM’s 81.40%)
  • APAVA: 0.97 AUPR for AD detection (+4% over prior state of the art)
  • Efficiency: 377× fewer FLOPs than LaBraM, 12× longer sequences before OOM
  • Model size: 4.6M parameters vs 5.9M (LaBraM) and 7M (LUNA)

Ablations: Compared reconstruction-only, LeJEPA-only, and mixed objectives across all tasks. Mixed approach showed best cross-montage generalization.

Limitations & Open Problems
  1. Task-specific performance gaps: Underperforms on TUSL (highly imbalanced dataset) and TUAR compared to task-specific methods - TECHNICAL (methodological choice favoring generalization over task-specific optimization)

  2. Limited pre-training scale: Only 21,600 hours compared to potential larger corpora - TECHNICAL (scalable with more data)

  3. Electrode topology range: Evaluated only on 16-26 channels, not high-density arrays (64-256 channels) - RESTRICTIVE (limits applicability to research-grade EEG)

  4. LeJEPA hyperparameter sensitivity: Number of SIGReg projection slices affects performance significantly - TECHNICAL (requires tuning)

  5. Temporal window selection: Fixed global/local window sizes may not be optimal across all EEG phenomena - TECHNICAL (could be learned or adaptive)

Open Problems:

  1. Scaling to high-density EEG montages (128+ channels) while maintaining topology invariance
  2. Developing adaptive temporal view construction for LeJEPA that considers EEG-specific time scales and oscillatory patterns

Adaptive Regime-Aware Stock Price Prediction Using Autoencoder-Gated Dual Node Transformers with Reinforcement Learning Control

Authors: Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman · Institution: University of Ottawa · Category: cs.LG

Introduces an adaptive stock prediction framework that uses autoencoder-based regime detection to route data through specialized transformer pathways, with SAC reinforcement learning controlling the routing threshold and blending weights based on prediction feedback.

Tags: stock prediction regime switching autoencoder anomaly detection graph neural networks reinforcement learning financial time series transformer architecture adaptive systems

Problem Formulation
  1. Motivation: Stock markets exhibit regime-dependent behavior where prediction models optimized for stable conditions often fail during volatile periods. Existing approaches treat all market states uniformly or require manual regime labeling, which becomes stale as market dynamics evolve.

  2. Mathematical setup: Let $G = (V, E)$ be a graph with $N = 20$ stock nodes representing an investment universe. For each stock $i$ at time $t$, define the raw feature vector:

    \[x^{\text{raw}}_{i,t} = [O_t, H_t, L_t, C_t, V_t]\]

    where $O_t, H_t, L_t, C_t, V_t$ are open, high, low, close prices and volume. After feature engineering, we obtain prediction features $x_{i,t} \in \mathbb{R}^{17}$ and router features $x^{\text{router}}_{i,t} \in \mathbb{R}^6$. An autoencoder with encoder $f_{\text{enc}}: \mathbb{R}^{d_{in}} \to \mathbb{R}^{d_z}$ and decoder $f_{\text{dec}}: \mathbb{R}^{d_z} \to \mathbb{R}^{d_{in}}$, trained on stable market data, produces the reconstruction error:

    \[e_t = \|x_t - f_{\text{dec}}(f_{\text{enc}}(x_t))\|_2^2\]

    A routing threshold $\tau$ classifies market regimes as normal ($e_t < \tau$) or anomalous ($e_t \geq \tau$). Two specialized node transformers $\mathcal{T}_N$ and $\mathcal{T}_E$ process normal and event conditions respectively.

    Assumptions:

    1. Market regimes can be distinguished by reconstruction error from an autoencoder trained on stable periods
    2. Dual specialized pathways outperform single uniform models across regime transitions
    3. Optimal routing thresholds and blending weights can be learned through reinforcement learning feedback
  3. Toy example: Consider $N = 3$ stocks with $d_{in} = 5$ features each. During stable periods (VIX < 75th percentile), the autoencoder learns to reconstruct typical feature patterns with low error $e_t < 0.1$. When a market crash occurs, volatility spikes cause $e_t = 0.8 > \tau = 0.2$, triggering routing to the event pathway that incorporates crisis-specific context features.

  4. Formal objective: Minimize the prediction error across all stocks and time horizons while adapting to regime changes:

    \[\min_{\theta, \tau, \alpha} \mathbb{E}_{i,t,h}\left[\left(y_{i,t+h} - \hat{y}_{i,t+h}(\theta, \tau, \alpha)\right)^2\right]\]

    where $\hat{y}_{i,t+h}$ is the adaptively blended prediction from dual pathways.

Method

The method consists of three main components operating in sequence:

  1. Autoencoder Regime Detection: Train autoencoder on stable market data (VIX < 75th percentile) to learn normal patterns:

    \[z_t = f_{\text{enc}}(x_t) = \text{ReLU}(W_2 \cdot \text{ReLU}(W_1 x_t + b_1) + b_2)\]
    \[\hat{x}_t = f_{\text{dec}}(z_t) = W_4 \cdot \text{ReLU}(W_3 z_t + b_3) + b_4\]

    Compute anomaly score $e_t = \|x_t - \hat{x}_t\|_2^2$ and route based on threshold $\tau$.

  2. Dual Node Transformer Processing: For normal regime ($e_t < \tau$), process through standard node transformer with graph attention. For anomalous regime ($e_t \geq \tau$), augment input with event context:

    \[x^{\text{event}}_{i,t} = [x_{i,t} \| c_t]\]

    where $c_t$ includes VIX regime embedding, sentiment spikes, earnings proximity, and cross-asset stress measures.

  3. Adaptive Blending with SAC Control: Blend pathway outputs with learned weight:

    \[\hat{y}_{i,t+h} = \alpha_t \cdot y^{\text{normal}}_{i,t+h} + (1-\alpha_t) \cdot y^{\text{event}}_{i,t+h}\]

    SAC controller adjusts $\tau$ and $\alpha_t$ based on prediction performance feedback with entropy-regularized objective:

    \[J(\pi) = \sum_{t=0}^T \mathbb{E}[r_t + \alpha_{\text{ent}} H(\pi(\cdot|s_t))]\]

    Toy example application: For our 3-stock example with crash scenario, the autoencoder detects $e_t = 0.8 > \tau = 0.2$, routes to event pathway which receives augmented input $[x_t \| c_t]$ with crisis indicators, produces event-specific prediction $y^{\text{event}}$, and blends with $\alpha = 0.2$ giving final prediction $\hat{y} = 0.2 y^{\text{normal}} + 0.8 y^{\text{event}}$.
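
The detection, routing, and blending logic of the three components can be sketched as follows. This is a minimal illustration under stated assumptions: the autoencoder weights are random and untrained, the layer sizes are arbitrary, the two pathway predictions are passed in as plain numbers, and the helper names (`init_autoencoder`, `reconstruction_error`, `route`, `blend`) are hypothetical. The SAC controller that adapts $\tau$ and $\alpha_t$ is not modelled.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def init_autoencoder(d_in, d_hidden, d_z, rng):
    """Random (untrained) weights for a two-layer encoder and two-layer decoder."""
    shapes = [(d_hidden, d_in), (d_z, d_hidden), (d_hidden, d_z), (d_in, d_hidden)]
    return [(rng.normal(scale=0.1, size=s), np.zeros(s[0])) for s in shapes]

def reconstruction_error(x, params):
    """Anomaly score e_t: squared L2 norm of x minus its reconstruction."""
    (W1, b1), (W2, b2), (W3, b3), (W4, b4) = params
    z = relu(W2 @ relu(W1 @ x + b1) + b2)        # encoder
    x_hat = W4 @ relu(W3 @ z + b3) + b4          # decoder
    return float(np.sum((x - x_hat) ** 2))

def route(e_t, tau):
    """Normal pathway if e_t < tau, event pathway otherwise."""
    return "event" if e_t >= tau else "normal"

def blend(y_normal, y_event, alpha):
    """Convex combination of the two pathway predictions."""
    return alpha * y_normal + (1.0 - alpha) * y_event

rng = np.random.default_rng(0)
params = init_autoencoder(d_in=5, d_hidden=8, d_z=2, rng=rng)
e_t = reconstruction_error(rng.normal(size=5), params)
print(route(e_t, tau=0.2), blend(y_normal=1.0, y_event=2.0, alpha=0.2))
```

With $\alpha = 0.2$ the blended prediction sits four fifths of the way toward the event pathway, matching the weights in the toy example.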

Novelty & Lineage

This work extends prior regime-switching models [Hamilton 1989] and stock prediction transformers [prior work by same authors achieving 0.80% MAPE]. Key innovations over existing work:

  1. Autoencoder-based regime detection: Unlike Hidden Markov Models requiring manual regime specification, uses reconstruction error for weakly supervised anomaly detection
  2. Graph-aware dual pathways: Extends single-pathway transformers with specialized normal/event processing, unlike uniform approaches in Chen et al. [graph CNNs], Wang et al. [multi-graph architectures]
  3. Reinforcement learning meta-control: SAC controller learns adaptive regime boundaries from prediction feedback, contrasting with fixed thresholds in threshold models [SETAR] or manual labeling requirements in supervised approaches
  4. End-to-end regime-aware architecture: Integrates detection, specialized prediction, and adaptive control in unified framework, whereas prior work treats regime detection and prediction as separate problems

    The approach builds on node transformers [Shi et al. 2019], Soft Actor-Critic [Haarnoja et al. 2018], and autoencoder anomaly detection [Hawkins et al. 2002], but their combination for adaptive financial prediction is novel.

    Classification: SIGNIFICANT - meaningful architectural innovation with substantial empirical improvements over strong baselines.

Proof Techniques

This is primarily an empirical systems paper with limited theoretical analysis. The main theoretical components are:

  1. Autoencoder anomaly detection justification: Based on standard reconstruction error theory where anomalies yield higher reconstruction loss:

    \[\mathbb{E}_{x \sim p_{\text{normal}}}[\|x - f_{\text{dec}}(f_{\text{enc}}(x))\|_2^2] < \mathbb{E}_{x \sim p_{\text{anomaly}}}[\|x - f_{\text{dec}}(f_{\text{enc}}(x))\|_2^2]\]
  2. SAC convergence properties: Relies on established SAC theory [Haarnoja et al. 2018] showing convergence to entropy-regularized optimal policy:

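    \[\pi^* = \arg\max_\pi \mathbb{E}_{\tau \sim \pi}\left[\sum_t \left(r_t + \alpha_{\text{ent}} H(\pi(\cdot|s_t))\right)\right]\]

    where $\tau \sim \pi$ here denotes a trajectory sampled under $\pi$ (not the routing threshold) and $\alpha_{\text{ent}}$ is the entropy temperature.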
  3. Graph attention mechanism: Extends standard transformer attention with graph structure bias:

    \[A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M + E\right)V\]

    where $E$ captures learned edge relationships and $M$ provides causal masking.

    The paper lacks formal theoretical guarantees about regime detection accuracy, prediction error bounds, or SAC adaptation convergence in this specific financial setting. Theoretical validation relies primarily on empirical ablation studies rather than mathematical proofs.
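
The graph-biased attention in item 3 can be illustrated with a small NumPy sketch. It assumes a single attention head, an additive mask $M$ with $-\infty$ at blocked positions, and a dense edge-bias matrix $E$; the function name `graph_attention` and the lower-triangular mask used in the demo are illustrative assumptions, not details from the paper.

```python
import numpy as np

def graph_attention(Q, K, V, M, E):
    """softmax(Q K^T / sqrt(d_k) + M + E) V, returning the output and the weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + M + E
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)                               # exp(-inf) = 0 at masked entries
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

N, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
blocked = np.triu(np.ones((N, N), dtype=bool), k=1)        # block positions j > i
M = np.where(blocked, -np.inf, 0.0)
E = rng.normal(scale=0.1, size=(N, N))                     # stand-in for learned edge biases
out, weights = graph_attention(Q, K, V, M, E)
print(out.shape, np.round(weights, 2))
```

Masked positions receive exactly zero weight, while the edge bias shifts attention among the remaining positions before normalisation.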

Experiments & Validation

Datasets: 20 S&P 500 stocks spanning January 1982 to March 2025, including daily OHLCV data from Yahoo Finance and sentiment scores from 4.2M social media posts (2007-2025). Temporal splits: training (1982-2010), validation (2011-2016), test (2017-2025).

Baselines: Statistical methods (ARIMA, VAR, MS-VAR), classical ML (Random Forest, SVR, XGBoost), deep learning (LSTM, Simple Transformer), recent time-series transformers (TimesNet, PatchTST, iTransformer), multimodal approaches (BERT+LSTM), regime-switching models (HMM-LSTM), and prior work (Integrated NodeFormer-BERT).

Key Results:

  • 1-day ahead: 0.59% MAPE vs 0.80% baseline (26% improvement), 72% vs 65% directional accuracy
  • 5-day ahead: 1.05% MAPE vs 1.30% baseline (19% improvement)
  • 20-day ahead: 1.55% MAPE vs 1.90% baseline (18% improvement)
  • All improvements statistically significant (Diebold-Mariano test, p < 0.001)
  • Maintains Theil’s U < 1.0 across all stocks and horizons
  • Ablation: SAC controller contributes 15% relative improvement, autoencoder routing 36%, dual pathways 7%

Limitations & Open Problems

Limitations:

  1. Binary regime classification assumption - TECHNICAL: The autoencoder performs only binary normal/anomalous classification, potentially missing heterogeneous anomaly types that could benefit from finer-grained routing

  2. Limited stock universe - RESTRICTIVE: Evaluation limited to 20 S&P 500 stocks; scalability to broader universes (hundreds/thousands of stocks) with graph attention mechanisms unclear

  3. Sentiment data availability - RESTRICTIVE: Social media sentiment only available from 2007, limiting model’s effectiveness in earlier historical periods or markets without social media coverage

  4. Stable period definition dependency - TECHNICAL: Autoencoder training relies on VIX-based stable period identification (75th percentile threshold), which may not generalize across different market structures or time periods

  5. Regime detection evaluation - TECHNICAL: No ground truth labels for regime classification, making it difficult to validate autoencoder’s regime detection accuracy independently of downstream prediction performance

  6. Computational complexity - NATURAL: Multi-stage training pipeline with SAC controller increases training time and complexity compared to single-pathway models

Open Problems:

  1. Multi-class regime detection: Extend beyond binary classification to discover and route among multiple distinct market regimes without manual specification
  2. Cross-market generalization: Evaluate framework’s effectiveness on non-US equity markets, fixed income, commodities, or cryptocurrency markets with different regime dynamics

Interpreting Reinforcement Learning Model Behavior via Koopman with Control

Authors: William T. Redman · Institution: Johns Hopkins University · Category: math.OC

Applies Koopman operator theory with control to extract stability and controllability metrics that reveal interpretable information about RL model behavior evolution during training, including hidden progress measures.

Tags: reinforcement-learning interpretability koopman-theory control-theory dynamical-systems mechanistic-interpretability stability-analysis controllability

Problem Formulation
  1. Motivation: Reinforcement learning models achieve complex behaviors but quantitatively assessing these behaviors for safety assurance and novel strategy discovery remains challenging. Understanding how RL model behavior evolves during training is crucial for interpretability and mechanistic understanding.

  2. Mathematical setup: Consider an RL environment defined by the tuple $(O, S, A)$ where $O$ is the observation space, $S$ is the state space, and $A$ is the action space. The RL model operates as a control system:

    \[s_{t+1} = g(s_t, a_t)\]
    \[o_t = h(s_t, a_t)\]

    where $g: S \times A \to S$ defines state transitions and $h: S \times A \to O$ defines observations. The policy is $m: O \to A$ where $a_t = m(o_t)$.

    Assumptions:

    1. Deterministic environment: $g$ is deterministic and time-invariant
    2. Full observability: $h(s_t, a_t) = s_t$ (i.e., $o_t = s_t$)
    3. Discrete action spaces

    The Koopman with control framework lifts this nonlinear system to a linear representation:

    \[z_{t+1} = Az_t + Bu_t\]

    where $z_t = [f_1(s_t), \ldots, f_m(s_t)]$ is the lifted state built from observables $f_i \in F$ and $u_t$ is the control input encoding the action $a_t$.

  3. Toy example: In CartPole with a 4-dimensional state (cart position, cart velocity, pole angle, pole angular velocity) and binary actions $A = \{0, 1\}$ (left/right), we use one-hot encoding $u = [1,0]^T$ or $u = [0,1]^T$. With time-delay embeddings of length 4, the lifted space has dimension approximately 16, yielding matrices $A \in \mathbb{R}^{16 \times 16}$ and $B \in \mathbb{R}^{16 \times 2}$.

  4. Formal objective: Extract interpretable dynamical properties from the fitted matrices $A$ and $B$:

    \[\text{Stability} = \max_i |\lambda_i(A)|, \quad \text{Controllability} = \frac{\text{rank}([B, AB, A^2B, \ldots, A^{n-1}B])}{n}\]

    where $n$ is the dimension of the lifted state $z_t$.

Method

The method applies Koopman operator theory with control to analyze RL model behavior:

  1. Train RL models using standard optimizers (PPO, A2C) on benchmark environments
  2. At regular training intervals, collect trajectory data $(s_t, a_t)$ from model rollouts
  3. Construct lifted state representation using time-delay embeddings: $z_t = [s_t, s_{t-1}, \ldots, s_{t-n_{delay}+1}]$
  4. Apply Dynamic Mode Decomposition with Control (DMDc) to fit the linear system:

    \[z_{t+1} = Az_t + Bu_t\]
  5. Extract stability metric: maximum eigenvalue magnitude $\max_i \lvert \lambda_i(A) \rvert$
  6. Extract controllability metric: normalized rank of controllability matrix $[B, AB, \ldots, A^{n-1}B]$

    Applied to CartPole toy example: With 4-dimensional state and binary actions, using $n_{delay} = 4$ creates lifted dimension $\approx 16$. The DMDc algorithm fits $A \in \mathbb{R}^{16 \times 16}$ and $B \in \mathbb{R}^{16 \times 2}$ from trajectory data. For a well-trained model, eigenvalues cluster near the unit circle (stable), while the controllability matrix approaches full rank, indicating the system can reach all states through appropriate control sequences.
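
Steps 3-6 can be sketched with plain least squares. This is a minimal illustration rather than the paper's pipeline: it omits the SVD rank truncation that the PyDMD-based implementation applies, and the demo fits a small synthetic linear system with known matrices instead of RL rollouts. The function names (`delay_embed`, `fit_dmdc`, `stability`, `controllability`) and the matrices `A_true` and `B_true` are assumptions made for the example.

```python
import numpy as np

def delay_embed(states, n_delay):
    """(T, d) -> (T - n_delay + 1, d * n_delay); each row is [s_t, s_{t-1}, ...]."""
    T = states.shape[0]
    return np.hstack([states[n_delay - 1 - k : T - k] for k in range(n_delay)])

def fit_dmdc(Z, U):
    """Least-squares fit of z_{t+1} = A z_t + B u_t from rows of Z (T, n) and U (T, p)."""
    n = Z.shape[1]
    X = np.hstack([Z[:-1], U[:-1]])              # regressors [z_t, u_t]
    G, *_ = np.linalg.lstsq(X, Z[1:], rcond=None)
    return G[:n].T, G[n:].T                      # A is (n, n), B is (n, p)

def stability(A):
    """Largest eigenvalue magnitude of A."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))

def controllability(A, B):
    """Rank of [B, AB, ..., A^{n-1}B] divided by n."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return float(np.linalg.matrix_rank(np.hstack(blocks)) / n)

# Synthetic demo: a known 2D linear system driven by random binary actions.
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0, 0.1], [0.1, -0.1]])
T = 200
U = np.eye(2)[rng.integers(0, 2, size=T)]        # one-hot action encoding
Z = np.zeros((T, 2))
Z[0] = [1.0, -1.0]
for t in range(T - 1):
    Z[t + 1] = A_true @ Z[t] + B_true @ U[t]

A_hat, B_hat = fit_dmdc(Z, U)
print(stability(A_hat), controllability(A_hat, B_hat))
print(delay_embed(rng.normal(size=(10, 4)), n_delay=4).shape)
```

On noise-free data from a linear system the fit recovers `A_true` and `B_true` up to numerical precision. For real rollouts, `Z` would be the delay-embedded states and `U` the one-hot actions collected at a given training checkpoint.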

Novelty & Lineage

This work extends Koopman operator theory to RL interpretability. Prior work includes Koopman theory foundations (Mezić 2005, Budišić et al. 2012), Koopman with control (Proctor et al. 2018, Korda & Mezić 2018), and recent applications to neural networks (Ostrow et al. 2023, Huang et al. 2025).

The novelty lies in systematically applying Koopman with control specifically to understand RL model behavior evolution during training, demonstrating that stability/controllability metrics can serve as “hidden progress measures” when reward appears static. Previous work focused on individual system analysis rather than training dynamics comparison.

INCREMENTAL

Proof Techniques

This is primarily an empirical study with no formal proofs. The theoretical foundation relies on established Koopman operator theory:

Key theoretical components:

  1. Koopman operator linearity in lifted space:

    \[U^t f(x_0) = f[T^t(x_0)]\]
  2. Eigendecomposition enabling mode analysis:

    \[U^t f(x_0) = \sum_{k=1}^N \lambda_k^t \phi_k(x_0) v_k\]
  3. DMDc algorithm convergence to optimal least-squares solution for lifted linear system

    The main technical insight is interpreting eigenvalue magnitudes near unity as indicating stable control behavior, while controllability matrix rank measures the system’s ability to reach different states through control inputs. Empirical validation relies on correlation between these metrics and reward performance across multiple independent training runs.

Experiments & Validation

Datasets: Three standard RL environments - CartPole, Acrobot, LunarLander from Gymnasium library.

Baselines: PPO and A2C optimizers from Stable-Baselines3. 25 independently trained models per configuration.

Implementation: PyDMD package for DMDc algorithm with time-delay embeddings (4-5 delays), SVD rank 0.95-0.99.

Key numbers:

  • CartPole: 1000 training epochs, 200-step episodes, 4D state space
  • Acrobot: 10,000 epochs, 500-step episodes, 6D state space
  • LunarLander: 50,000 epochs, 1000-step episodes, 8D state space

Results show PPO generally achieves better stability/controllability alignment with performance. A2C exhibits rapid eigenvalue changes but may require longer training. Stability and controllability metrics successfully predict performance improvements even when reward appears static, demonstrating “hidden progress” detection capability.

Limitations & Open Problems

Limitations:

  1. Discrete action spaces only - TECHNICAL (DMDc can handle continuous actions but requires different encoding)
  2. Deterministic, time-invariant environments - RESTRICTIVE (many real RL applications have stochastic dynamics)
  3. Full observability assumption - RESTRICTIVE (partial observability common in practice)
  4. Limited to “physical” control tasks - RESTRICTIVE (unclear applicability to abstract domains like chess)
  5. Only two optimizers tested - NATURAL (PPO/A2C are standard baselines)
  6. Analysis at sparse training checkpoints - TECHNICAL (could analyze more frequently)

Open problems:

  1. Extend framework to stochastic, partially observable environments with richer observable function selection beyond time delays
  2. Apply to abstract RL domains (games, language) and investigate multi-scale analysis combining behavioral and neural activation levels