Mar 19, 2026 Theory 8 papers

Theory Digest — Mar 19, 2026

Today’s Digest at a Glance

Today’s papers span machine learning theory, statistical inference, and stochastic processes, with particular emphasis on neural architectures, bandit algorithms, and Bayesian methods. These works address fundamental questions about how to design algorithms that learn efficiently from data while maintaining theoretical guarantees.

Neural Networks and Attention Mechanisms

Modern machine learning relies heavily on neural networks that can process sequential data and make predictions in complex, high-dimensional spaces. Two papers today tackle fundamental questions about how these networks should be designed and why they work.

The transformer architecture has become the backbone of modern AI, built around an “attention mechanism” that determines which parts of an input sequence are most relevant for making predictions. Mathematically, attention computes weighted combinations of input features, where the weights $\alpha_{ij}$ determine how much feature $i$ should attend to feature $j$. Traditional attention uses softmax normalization: $\alpha_{ij} = \exp(e_{ij}) / \sum_k \exp(e_{ik})$, where $e_{ij}$ measures similarity between features. However, understanding why this particular form works well, and whether alternatives might be better, remains an active area of research.

At a deeper level, neural networks are increasingly being used to solve partial differential equations that arise in physics and engineering. The Fokker-Planck equation $\frac{\partial p}{\partial t} = -\nabla \cdot (\mu p) + \frac{1}{2}\nabla \cdot (\sigma \sigma^T \nabla p)$ describes how probability densities evolve over time in stochastic systems, where $\mu$ is the drift and $\sigma$ represents noise. Traditional numerical methods struggle with high-dimensional problems, making neural approaches increasingly attractive. The challenge is ensuring these methods preserve important physical properties like probability conservation and causality.

Bandit Problems and Online Learning

Bandit problems represent a fundamental challenge in machine learning: how do you balance exploration (trying new actions to learn about them) and exploitation (using what you already know to maximize reward)? The multi-armed bandit problem formalizes this as repeatedly choosing actions $a_t$ from a set to maximize cumulative reward $\sum_{t=1}^T r_t$, where rewards are initially unknown.

Contextual bandits extend this framework by allowing actions to depend on observed context $x_t$. The goal becomes learning a policy $\pi(x)$ that maps contexts to good actions. Single-index models assume the reward structure has the form $r = f(\beta^T x, a) + \epsilon$, where $\beta$ is an unknown parameter vector and $f$ is an unknown function. This semiparametric structure is both flexible enough to capture complex relationships and structured enough to enable efficient learning. The challenge lies in simultaneously estimating both the parametric component $\beta$ and the nonparametric function $f$ while making sequential decisions.

Quantum computing promises to revolutionize many algorithms, potentially including bandit problems. Quantum algorithms can explore multiple possibilities simultaneously through superposition, potentially achieving better sample complexity than classical methods. However, near-term quantum devices (NISQ - Noisy Intermediate-Scale Quantum) suffer from decoherence and gate errors that can destroy quantum advantages. Understanding when quantum speedups survive realistic noise levels is crucial for practical quantum machine learning.

Bayesian Methods and Statistical Inference

Bayesian statistics provides a principled framework for updating beliefs as new data arrives, using Bayes’ theorem: $p(\theta\lvert data) \propto p(data \rvert\theta) p(\theta)$. The posterior distribution $p(\theta\lvert data)$ combines the likelihood $p(data\lvert \theta)$ with prior beliefs $p(\theta)$. A central question is how quickly this posterior converges to the truth as sample size grows, measured by posterior contraction rates.

For density estimation, the goal is learning an unknown probability density $f$ from samples. Bayesian approaches place priors on function spaces and update beliefs about $f$ given data. Orthogonal polynomials provide natural basis functions for this task, especially on unbounded domains where traditional methods fail. The challenge is establishing minimax optimal rates - proving your method is as good as any possible method, up to constant factors. This requires sophisticated technical machinery involving covering numbers, concentration inequalities, and approximation theory.

A related question concerns synthetic data: can artificially generated examples improve statistical inference? While synthetic data helps with computational issues and privacy, fundamental information-theoretic limits constrain its value. Fisher information $I(\theta) = E[(\frac{\partial}{\partial \theta} \log p(X\lvert \theta))^2]$ quantifies how much information data contains about parameters. Understanding whether synthetic augmentation can increase Fisher information touches deep questions about the nature of statistical information.

Stochastic Processes and Optimal Transport

Stochastic processes describe systems that evolve randomly over time, governed by equations like $dX_t = \mu(X_t, t)dt + \sigma(X_t, t)dW_t$, where $W_t$ is Brownian motion. The Schrödinger bridge problem asks: given two probability distributions, what’s the most likely stochastic process connecting them? This connects to optimal transport theory, which studies the cheapest way to move one distribution to another.

Modern generative AI models often work by learning to reverse diffusion processes - gradually adding noise to data, then learning to remove it. Understanding when and why these processes develop multiple modes (path splitting) is crucial for avoiding mode collapse, where generated samples lack diversity. Geometric methods from differential geometry, particularly Jacobi fields that measure how nearby paths diverge, provide mathematical tools for detecting these phenomena.

Economics and Decision Making

The intersection of machine learning and economics produces rich theoretical questions about optimal decision-making under uncertainty. Revenue management problems ask how firms should jointly choose which products to offer (assortment) and how to price them, given customer preferences that follow models like multinomial logit: $P(\text{choose item } i) = \frac{\exp(\beta_i - \alpha p_i)}{\sum_j \exp(\beta_j - \alpha p_j)}$, where $p_i$ are prices and $\beta_i$ represent intrinsic preferences.

Transfer learning addresses the question: how can experience from one market help in another? The challenge lies in handling heterogeneity - different markets may have different customer preferences, competitive landscapes, or seasonal patterns. Developing algorithms that can borrow strength across markets while accounting for these differences requires careful bias-variance tradeoffs.

Reading Guide

Readers interested in neural network theory should start with Paper 2 (NeuroGame Transformer) for an accessible introduction to attention mechanisms, then proceed to Paper 1 (Neural Galerkin) for more advanced applications to PDEs.

Those focused on bandit algorithms should read Papers 3 and 4 together, as they address complementary aspects of contextual bandits - Paper 3 for the classical case and Paper 4 for quantum extensions.

Statistics and inference readers should begin with Paper 6 (synthetic data) for fundamental information theory, then Paper 5 (Bayesian density estimation) for advanced posterior convergence theory.

Readers interested in applications should focus on Papers 7 and 8, which address practical problems in generative modeling and revenue management respectively, though Paper 7 requires more mathematical background.

Papers 1, 2, and 7 share connections through stochastic processes and could be read as a sequence for understanding modern applications of probability theory to machine learning.

Neural Galerkin Normalizing Flow for Transition Probability Density Functions of Diffusion Models

Authors: Riccardo Saporiti, Fabio Nobile · Institution: EPFL · Category: cs.LG

Combines Neural Galerkin schemes with conditional normalizing flows to solve high-dimensional Fokker-Planck equations with atomic initial conditions, preserving causality and probability constraints.

Tags: stochastic-differential-equations fokker-planck-equations normalizing-flows physics-informed-neural-networks transition-probability-density neural-galerkin-methods bayesian-inference computational-finance

arXiv · PDF

Problem Formulation

Motivation: Computing transition probability density functions (TPDFs) for stochastic differential equations is crucial for Bayesian inference, option pricing, and data assimilation, but classical grid-based PDE solvers fail in high dimensions due to the curse of dimensionality and struggle with atomic initial conditions (Dirac deltas).

Mathematical setup: Consider a $d$-dimensional SDE:

\[d\mathbf{X}(t) = \mathbf{b}(t, \mathbf{X}(t))dt + \sqrt{\Sigma(t, \mathbf{X}(t))}d\mathbf{W}(t)\] \[\mathbf{X}(s) = \mathbf{x}_0\]

The TPDF $\rho(\mathbf{x}\lvert t,s,\mathbf{x}_0)$ satisfies the Fokker-Planck equation:

\[\partial_t \rho(\mathbf{x}|t,s,\mathbf{x}_0) = \mathcal{L}_t^*(\rho(\cdot|t,s,\mathbf{x}_0))(\mathbf{x})\]

with initial condition:

\[\rho(\mathbf{x}|s,s,\mathbf{x}_0) = \delta_{\mathbf{x}_0}(\mathbf{x})\]

where $\mathcal{L}_t^*$ is the adjoint of the generator:

\[\mathcal{L}_t^*(f(\cdot))(\mathbf{x}) = \nabla_\mathbf{x} \cdot [-\mathbf{b}(t,\mathbf{x})f(\mathbf{x}) + \frac{1}{2}\nabla_\mathbf{x} \cdot (\Sigma(t,\mathbf{x})f(\mathbf{x}))]\]

Assumptions:

$\mathbf{b}: [s,T] \times \mathbb{R}^d \to \mathbb{R}^d$ is sufficiently smooth
$\Sigma: [s,T] \times \mathbb{R}^d \to \mathbb{R}^{d \times d}$ is positive definite
Standard regularity conditions for SDE existence and uniqueness

Toy example: When $d=2$, $\mathbf{b}(\mathbf{x}) = \tanh(\mathbf{x})$, and $\Sigma = I_2$, the solution is highly concentrated near $\mathbf{x}_0$ at early times but develops multiple modes over time, making grid-based methods prohibitively expensive.

Formal objective: Approximate the parametric TPDF mapping:
\[(\mathbf{x}, t, \mathbf{x}_0) \mapsto \rho(\mathbf{x}|t,s,\mathbf{x}_0)\]
using a neural normalizing flow that preserves probability mass and positivity constraints.

Method

The method uses Neural Galerkin Normalizing Flows with three key components:

Normalizing Flow Ansatz: Approximate the TPDF as:
\[P(\mathbf{x}|\boldsymbol{\theta}(\tau), \tau, \mathbf{x}_0) = p_Z(\mathbf{n}_{\boldsymbol{\theta}(\tau)}(\mathbf{x}|\mathbf{x}_0)|\tau, \mathbf{x}_0)|\det(\nabla_\mathbf{x}\mathbf{n}_{\boldsymbol{\theta}(\tau)}(\mathbf{x}|\mathbf{x}_0))|\]
where $\mathbf{n}_{\boldsymbol{\theta}}$ is a conditional Real-NVP transformation and $p_Z$ is a reference distribution.
Conditional Coupling Layers: Each layer applies:
\[\mathbf{x}^l_{1:m} = \mathbf{x}^{l-1}_{1:m}\] \[\mathbf{x}^l_{m+1:d} = \mathbf{x}^{l-1}_{m+1:d} \odot [1_{d-m} + \beta\tanh(\mathbf{s}_{\boldsymbol{\theta}^l}(\mathbf{c}^l)) + e^{\boldsymbol{\xi}^l}\odot\tanh(\mathbf{t}_{\boldsymbol{\theta}^l}(\mathbf{c}^l))]\]
where $\mathbf{c}^l = (\mathbf{x}^{l-1}_{1:m}, \mathbf{x}_0)$ conditions on the initial point.
Neural Galerkin ODE: Minimize the Fokker-Planck residual:
\[\dot{\boldsymbol{\theta}}(\tau) \in \arg\min_{\boldsymbol{\eta}} \int_{\mathbb{R}^d}\int_{\mathbb{R}^d} |r_{s+\tau,s}(\boldsymbol{\theta}, \boldsymbol{\eta}, \mathbf{x}, \mathbf{x}_0)|^2 d\nu_{\boldsymbol{\theta}(\tau)}(\mathbf{x}|\mathbf{x}_0)d\mu(\mathbf{x}_0)\]
Source Distribution: Use Euler-Maruyama one-step approximation:
\[p_Z(\cdot|\tau, \mathbf{x}_0) \sim \mathcal{N}(\mathbf{x}_0 + \mathbf{b}(s,\mathbf{x}_0)\tau, \Sigma(s,\mathbf{x}_0)\tau)\]
Toy example application: For the 2D Beneš SDE, the method transforms the Gaussian source distribution through 10 coupling layers, learning to split the unimodal Gaussian into the four-mode target density while preserving the parametric dependence on $\mathbf{x}_0$.

Novelty & Lineage

Prior Work:

Physics-Informed Neural Networks (PINNs) [Raissi et al. 2019] struggle with atomic initial conditions and causality enforcement
Temporal Normalizing Flows [Lu et al. 2022, Feng et al. 2022] require adaptive sampling loops and good initial measures
Neural Galerkin schemes [Bruna et al. 2024] derive ODEs for neural network parameters but not applied to normalizing flows
Neural parametric Fokker-Planck [Liu et al. 2022] uses Wasserstein gradient flows but limited to specific SDE classes

Novel Contributions:

First extension of Neural Galerkin schemes to normalizing flow architectures
Conditional Real-NVP layers that capture parametric dependence on initial conditions
Time-dependent source distribution strategy that exactly satisfies atomic initial conditions
Sequential sampling strategy that avoids global space-time adaptive loops

This represents a SIGNIFICANT advance by combining the structure-preserving properties of normalizing flows with the causality-enforcing Neural Galerkin framework, specifically addressing the atomic initial condition challenge that plagued prior PINN approaches.

Proof Techniques

The method is primarily algorithmic without formal convergence proofs, but relies on several key theoretical components:

Change of Variables Formula: The normalizing flow ansatz automatically satisfies probability constraints through:
\[\int_{\mathbb{R}^d} P(\mathbf{x}|\boldsymbol{\theta}, \tau, \mathbf{x}_0) d\mathbf{x} = \int_{\mathbb{R}^d} p_Z(\mathbf{z}|\tau, \mathbf{x}_0) d\mathbf{z} = 1\]
Dirac-Frenkel Variational Principle: The Neural Galerkin approach minimizes the residual in $L^2$ norm:
\[\min_{\boldsymbol{\eta}} \langle r(\boldsymbol{\theta}, \boldsymbol{\eta}, \cdot), r(\boldsymbol{\theta}, \boldsymbol{\eta}, \cdot) \rangle_{L^2(\nu \otimes \mu)}\]
This provides the ODE system for parameter evolution.
Exact Initial Condition Satisfaction: By construction:
\[\lim_{\tau \to 0} p_Z(\cdot|\tau, \mathbf{x}_0) = \delta_{\mathbf{x}_0}(\cdot)\] \[\mathbf{n}_{\boldsymbol{\theta}(0)}(\mathbf{x}|\mathbf{x}_0) = \mathbf{x}\]
Therefore: $P(\mathbf{x}\lvert \boldsymbol{\theta}(0), 0, \mathbf{x}_0) = \delta_{\mathbf{x}_0}(\mathbf{x})$ exactly.
Jacobian Determinant Computation: For Real-NVP layers:
\[\det(\nabla_\mathbf{x}\mathbf{n}_{\boldsymbol{\theta}}) = \prod_{l=1}^L \prod_{i=m+1}^d [1 + \beta\tanh(s_{\boldsymbol{\theta}^l}(\mathbf{c}^l)_{i-m})]\]
The key insight is using the Euler-Maruyama source distribution to handle the degeneracy of the atomic initial condition while maintaining theoretical convergence properties as $\tau \to 0$.

Experiments & Validation

Dataset: Synthetic 2D anisotropic Beneš SDE with rotation matrix $R_2 = \begin{bmatrix} 1/2 & -\sqrt{3}/2 \ \sqrt{3}/2 & 1/2 \end{bmatrix}$, where analytical solution is available for validation.

Architecture: 10 coupling layers, GRU cells with hidden size 4, partition parameter $m=1$.

Baselines: Comparison against analytical solution for the Beneš model.

Key Results:

Successfully transforms unimodal Gaussian source into correct four-mode target density
Maintains stable $L^2$ error over time horizon $[0,3]$
Accurately captures parametric dependence on initial condition location $\mathbf{x}_0$
Generalizes to mixture initial conditions via convolution integral (4) with 1750 Monte Carlo samples
After offline training, online evaluation is computationally efficient compared to solving PDE from scratch

Computational Details: Integration using explicit Runge-Kutta 3(2) method, sampling from $\mu = \mathcal{N}_2(0, 0.75^2 I_2)$ for initial condition distribution.

The experiments demonstrate proof-of-concept on a challenging 2+1+2 dimensional space-time-parameter problem with known analytical benchmark.

Limitations & Open Problems

Limitations:

TECHNICAL: Method demonstrated only on 2D problems - scalability to higher dimensions remains unclear due to normalizing flow expressivity limits
TECHNICAL: Requires differentiable drift and diffusion coefficients for automatic differentiation of the Fokker-Planck operator
NATURAL: Assumes elliptic SDEs (non-degenerate diffusion matrix) to ensure well-posed Gaussian source distribution
RESTRICTIVE: Limited to SDEs with smooth coefficients satisfying standard regularity conditions for existence/uniqueness
TECHNICAL: Choice of source distribution architecture (Euler-Maruyama) may not be optimal for all SDE classes
RESTRICTIVE: No theoretical convergence analysis or error bounds provided - relies purely on empirical validation

Open Problems:
Theoretical Analysis: Establish convergence rates and approximation error bounds for the Neural Galerkin Normalizing Flow approach, particularly how discretization errors in the ODE integration affect final TPDF approximation quality.
High-Dimensional Scalability: Develop more expressive normalizing flow architectures or hybrid approaches that can handle the curse of dimensionality for problems with $d \geq 10$ while maintaining computational tractability.

NeuroGame Transformer: Gibbs-Inspired Attention Driven by Game Theory and Statistical Physics

Authors: Djamel Bouchaffra, Fayçal Ykhlef, Hanene Azzag, Mustapha Lebbah et al. (5 authors) · Institution: Paris-Saclay University · Category: cs.AI

Introduces NeuroGame Transformer that reformulates attention as a coalitional game with Ising physics, using Shapley/Banzhaf values and mean-field equations to achieve interpretable, game-theoretic attention weights.

Tags: transformer attention_mechanism game_theory statistical_physics natural_language_inference interpretability shapley_value banzhaf_index

arXiv · PDF

Problem Formulation

Motivation: Standard transformer attention mechanisms are limited by their pairwise formulation, hindering the modeling of higher-order dependencies among tokens in natural language processing tasks. This limitation is particularly problematic for tasks requiring complex logical reasoning like natural language inference, where understanding relationships between multiple tokens simultaneously is crucial.
Mathematical setup: Consider a sequence of $n$ tokens $T = {t_1, t_2, \ldots, t_n}$, where each token $t_i$ has contextual embedding $x_i \in \mathbb{R}^d$. Define the power set of all possible token coalitions by $2^T$ containing $2^n$ elements. A coalition is any subset $C \subseteq T$. The characteristic function $v: 2^T \to \mathbb{R}$ assigns semantic importance to each coalition:
\[v(C) = f\left(\left\|\sum_{t_i \in C} W_v x_i\right\|^2\right)\]
where $W_v \in \mathbb{R}^{d \times d}$ is a learned projection matrix and $f$ is a mild nonlinearity. The system energy is $E(C) = -v(C)$, inducing Gibbs distribution:
\[D_\gamma(C) \propto \exp\left(\frac{v(C)}{\gamma}\right)\]
Assumptions:
1. Normalization: $v(\emptyset) = 0$
2. Monotonicity: For $C \subseteq T’ \subseteq T$, we have $v(C) \leq v(T’)$
3. Boundedness: There exists $M > 0$ such that $\lvert v(C) \rvert \leq M$ for all $C \subseteq T$
Toy example: When $n=3$ with tokens “not”, “good”, “movie”, we have $2^3 = 8$ possible coalitions. If $v({\text{not}}) = 0.2$, $v({\text{good}}) = 0.5$, $v({\text{not, good}}) = 1.2$, then the marginal contribution of “not” to coalition ${\text{good}}$ is $\Delta_{\text{not}}({\text{good}}) = v({\text{not, good}}) - v({\text{good}}) = 1.2 - 0.5 = 0.7$.
Formal objective: The goal is to compute attention weights $\alpha_i$ as marginal probabilities that token $t_i$ is active under the Gibbs distribution:
\[\alpha_i = P(s_i = +1) = \sum_{S: s_i = +1} \frac{\exp(-H(S)/\gamma)}{\sum_{S'} \exp(-H(S')/\gamma)}\]
where $H(S)$ is the Ising Hamiltonian over spin configurations $S = (s_1, \ldots, s_n) \in {-1, +1}^n$.

Method

The NeuroGame Transformer combines game theory and statistical physics through a four-step process:

Game-theoretic quantities via Monte Carlo: Estimate Shapley values and Banzhaf indices using importance-weighted sampling with Gibbs-distributed weights:
\[\hat{\phi}_i = \sum_{k=1}^K \bar{w}(P_i^{(k)}) [v(P_i^{(k)} \cup \{t_i\}) - v(P_i^{(k)})]\] \[\hat{\beta}_i = \sum_{k=1}^K \bar{w}(C_k) [v(C_k \cup \{t_i\}) - v(C_k)]\]
where $\bar{w}(C_k) = \frac{\exp(v(C_k)/\gamma) / p(C_k)}{\sum_{\ell=1}^K \exp(v(C_\ell)/\gamma) / p(C_\ell)}$
Token importance as external field: Combine Shapley and Banzhaf via learnable gating:
\[J_i = \lambda_i \phi_i^{\text{norm}} + (1-\lambda_i) \beta_i^{\text{norm}}\]
where $\lambda_i = \sigma(w^T x_i + b)$
Pairwise interaction potentials: Estimate synergistic/antagonistic effects:
\[J_{ij} = \sum_{k=1}^K \bar{w}(C_k) \Delta_{ij}(C_k)\]
where $\Delta_{ij}(C) = v(C \cup {t_i, t_j}) - v(C \cup {t_i}) - v(C \cup {t_j}) + v(C)$
Mean-field spin equilibrium: Solve self-consistent equations to get attention weights:
\[\langle s_i \rangle = \tanh\left(\frac{1}{\gamma}\left(J_i + \sum_{j \neq i} J_{ij} \langle s_j \rangle\right)\right)\] \[\alpha_i = \frac{1 + \langle s_i \rangle}{2}\]
Applied to toy example: For tokens “not”, “good”, “movie” with $J_1 = 0.423$, $J_2 = 0.711$, $J_3 = 0.512$, and interaction $J_{12} = 0.466$, the mean-field iteration converges to $\langle s_1 \rangle = 0.721$, $\langle s_2 \rangle = 0.798$, $\langle s_3 \rangle = 0.703$, yielding attention weights $\alpha_1 = 0.861$, $\alpha_2 = 0.899$, $\alpha_3 = 0.852$.

Novelty & Lineage

This work uniquely synthesizes cooperative game theory (Shapley values, Banzhaf indices) with statistical physics (Ising models) for transformer attention. Prior work on efficient attention (Performer, Linformer) focuses on computational efficiency while maintaining pairwise similarity paradigms. Game-theoretic methods in ML (SHAP, data valuation) are typically used for post-hoc explanations, not as core architectural components. Energy-based models in NLP (Graph Attention Networks, energy-based conformations) define arbitrary energy functions rather than grounding them in coalitional games.

The key novelty is treating tokens simultaneously as game players and magnetic spins, with attention weights derived from thermodynamic equilibrium rather than similarity scores. The importance-weighted Monte Carlo with Gibbs target for coalition sampling is also novel, avoiding explicit exponential factors while maintaining unbiased estimates.

Building on: Shapley (1953), Banzhaf (1965), Vaswani et al. (2017) Transformer, Lundberg & Lee (2017) SHAP.

SIGNIFICANT

Proof Techniques

The theoretical analysis uses three main techniques:

Mean-field approximation convergence: The key result is Theorem 1 proving that expected spins satisfy self-consistent equations. The proof proceeds by replacing fluctuating spins with their averages:
\[H_i^{\text{eff}}(s_i) = -\left(J_i + \sum_{j \neq i} J_{ij} \langle s_j \rangle\right) s_i\]
This yields the single-spin Boltzmann distribution, leading to:
\[\langle s_i \rangle = \tanh\left(\frac{h_i^{\text{eff}}}{\gamma}\right)\]
Monte Carlo convergence guarantees: Under mild regularity conditions on the characteristic function $v(\cdot)$, the importance-weighted estimators converge almost surely as $K \to \infty$. The key insight is that normalized weights:
\[\bar{w}(C_k) = \frac{\exp(v(C_k)/\gamma) / p(C_k)}{\sum_{\ell=1}^K \exp(v(C_\ell)/\gamma) / p(C_\ell)}\]
naturally cancel the intractable partition function, ensuring numerical stability.
Complexity reduction: The exponential problem $O(2^n)$ is reduced to polynomial $O(K \cdot n + n^2 \cdot T)$ where $K$ is Monte Carlo samples and $T$ is mean-field iterations. This follows from the two-stage approximation: importance sampling handles coalition enumeration, mean-field handles spin configuration summation.

Experiments & Validation

Datasets: SNLI (570k sentence pairs), MNLI-matched (433k pairs) for natural language inference.

Baselines: DAM (83.3% SNLI), ESIM (87.0%), BERT-Base (88.86%), BERT-Large (90.33%), RoBERTa-Base (86.95%), RoBERTa-Large (91.83%), ALBERT-Base (86.49%), ALBERT-Large (90.85%).

Key results: NGT achieves 86.6% on SNLI, 79.0% on MNLI-matched with only 110M parameters (0.7% overhead over BERT-Base). Outperforms ALBERT-Base while being competitive with larger models.

Implementation: K_mc = 15 training/25 evaluation samples, γ = 0.25 temperature, T_mf = 25 mean-field iterations, 5 epochs training with AdamW optimizer.

Limitations & Open Problems

Limitations:

Monte Carlo sampling approximation: Relies on K samples for coalition estimation - TECHNICAL (can be improved with better sampling schemes or larger K)
Mean-field approximation: Replaces true spin correlations with averages - TECHNICAL (standard in statistical physics, often accurate for dense systems)
Quadratic complexity in sequence length: O(n²T) scaling due to pairwise interactions - NATURAL (same as standard attention)
Limited to BERT-Base backbone: No evaluation on larger pretrained models - TECHNICAL (straightforward to extend)
Temperature parameter tuning: Requires manual γ selection - TECHNICAL (could be made learnable)
Coalition space grows exponentially: Fundamental limitation even with sampling - RESTRICTIVE (inherent to coalitional games)

Open problems:
Develop adaptive temperature scheduling γ(t) during training to balance exploration vs exploitation in coalition sampling
Extend to multi-modal settings where tokens come from different modalities (vision-language transformers)

Kernel Single-Index Bandits: Estimation, Inference, and Learning

Authors: Sakshi Arya, Satarupa Bhattacharjee, Bharath K. Sriperumbudur · Institution: Case Western Reserve University, University of Florida, Pennsylvania State University · Category: stat.ML

First framework providing simultaneous estimation and asymptotic inference for both parametric index parameters and nonparametric link functions in single-index contextual bandits under adaptive sampling.

Tags: contextual bandits semiparametric inference single-index models RKHS adaptive sampling martingale CLT inverse propensity weighting kernel ridge regression

arXiv · PDF

Problem Formulation

Motivation: Contextual bandits require both efficient learning and uncertainty quantification for covariate effects. Existing methods focus primarily on regret minimization but provide limited tools for statistical inference, particularly in semiparametric settings where adaptive sampling complicates inference.
Mathematical setup: Consider a contextual bandit with $L$ arms and time horizon $T$. At each time $t$, observe covariate $X_t \in \mathbb{R}^d$ drawn i.i.d. from distribution $P_X$ with density $p(x)$, select arm $a_t \in {1,\ldots,L}$, and receive reward $Y_t$. The reward follows a kernel single-index model:
\[Y_t = f_i(X_t^\top \beta_i) + \varepsilon_t\]
where $i = a_t$ is the selected arm, $f_i : \mathbb{R} \to \mathbb{R}$ is an unknown nonparametric link function in RKHS $\mathcal{H}_K$, $\beta_i \in \mathbb{R}^d$ is the arm-specific index parameter, and $\varepsilon_t$ is mean-zero noise.
1. Covariates $X_t$ have continuously differentiable density $p \in C^1(\mathbb{R}^d)$ with score function $S(x) = -\nabla_x \log p(x)$
2. Noise satisfies $\mathbb{E}[\varepsilon_t \lvert a_t = i] = 0$ and $\text{Var}(\varepsilon_t \lvert a_t = i) = \sigma^2 < \infty$
3. Each $f_i$ lies in RKHS $\mathcal{H}_K$ with bounded kernel $K : \mathcal{U} \times \mathcal{U} \to \mathbb{R}$
Toy example: When $d = 2$, $L = 2$, $X_t \sim N(0, I_2)$, and $\beta_1 = (1,0)^\top$, $\beta_2 = (0,1)^\top$, the problem reduces to learning $f_1(X_{t,1})$ for arm 1 and $f_2(X_{t,2})$ for arm 2. The score function is $S(x) = x$, and the challenge is estimating both the index directions and nonparametric functions from adaptively selected observations.
Formal objective: Minimize cumulative regret while enabling statistical inference:
\[R_T(\pi) = \sum_{t=1}^T \left[ f_{a_t^*}(X_t^\top \beta_{a_t^*}) - f_{a_t}(X_t^\top \beta_{a_t}) \right]\]

Method

The K-SIEGE (Kernelized Single-Index Epsilon-Greedy Exploration) algorithm combines Stein-based estimation for index parameters with inverse-propensity-weighted kernel ridge regression for link functions.

Index parameter estimation: For arm $i$, define inverse-propensity weights and estimate via:
\[\hat{\beta}_{i,t} = \hat{\Gamma}_{i,t}^{-1} \hat{m}_{i,t}\]
where
\[\hat{\Gamma}_{i,t} = \frac{1}{t} \sum_{s=1}^t \frac{1\{a_s = i\}}{p_{s,i}(W_s)} W_s W_s^\top\] \[\hat{m}_{i,t} = \frac{1}{t} \sum_{s=1}^t \frac{1\{a_s = i\}}{p_{s,i}(W_s)} W_s Y_s\]
with $W_s = S(X_s)$ the score vector and $p_{s,i}(W_s) = P(a_s = i \lvert \mathcal{F}_{s-1}, W_s)$.
Link function estimation: Using projected covariates $U_{s,i} = X_s^\top \hat{\beta}_{i,t}$, estimate via IPW-KRR:
\[\hat{f}_{i,t} = (\hat{\Sigma}_{i,t} + \lambda_t I)^{-1} \hat{h}_{i,t}\]
where
\[\hat{\Sigma}_{i,t} = \frac{1}{t} \sum_{s=1}^t \frac{1\{a_s = i\}}{p_{s,i}(U_{s,i})} K_{U_{s,i}} \otimes K_{U_{s,i}}\] \[\hat{h}_{i,t} = \frac{1}{t} \sum_{s=1}^t \frac{1\{a_s = i\}}{p_{s,i}(U_{s,i})} Y_s K_{U_{s,i}}\]
Arm selection: Use $\varepsilon$-greedy with $a_t = \arg\max_i \hat{f}_{i,t-1}(X_t^\top \hat{\beta}_{i,t-1})$ with probability $1-\varepsilon_t$.

Toy example application: With $d=2$, $L=2$, and Gaussian covariates, $W_s = X_s$. For arm 1, if selected at times ${1,3,5}$ with rewards ${y_1, y_3, y_5}$, then $\hat{\Gamma}_{1,3} = \frac{1}{3}(X_1 X_1^\top + X_3 X_3^\top + X_5 X_5^\top)$ and $\hat{m}_{1,3} = \frac{1}{3}(y_1 X_1 + y_3 X_3 + y_5 X_5)$, giving $\hat{\beta}_{1,3} = \hat{\Gamma}_{1,3}^{-1} \hat{m}_{1,3}$.

Novelty & Lineage

This work extends single-index modeling to fully online contextual bandits with joint estimation and inference. Prior work includes: Arya and Song (2025) studied batched single-index bandits; Kang et al. (2025) considered shared index parameters across arms; Ma and Cai (2025) focused only on regret analysis. The key innovations are:

adapting Stein-based estimation (Yang et al., 2017) to adaptive sampling in bandits
developing the first directional functional CLT for RKHS estimators under adaptive data collection, extending Arya and Sriperumbudur
from i.i.d. to bandit settings, and
providing unified inference framework for both parametric index and nonparametric link components. SIGNIFICANT

Proof Techniques

The analysis combines martingale theory, RKHS concentration, and inverse-propensity weighting techniques:

Index parameter CLT: Decompose the estimator error as:
\[\hat{\beta}_{i,t} - \beta_i = \Gamma^{-1}(\hat{m}_{i,t} - m_i) + (\hat{\Gamma}_{i,t}^{-1} - \Gamma^{-1})\hat{m}_{i,t}\]
The key martingale representation uses innovation vectors:
\[\Delta_s(\beta_i) = W_s(Y_s - W_s^\top \beta_i)\]
leading to the martingale sum $S_{t,i} = \sum_{s=1}^t \psi_{t,s,i}$ where:
\[\psi_{t,s,i} = t^{\alpha-1}\Gamma^{-1} \frac{1\{a_s = i\}}{p_{s,i}(W_s)} \Delta_s(\beta_i)\]
Matrix concentration: For the empirical Gram matrix, apply matrix Bernstein inequality:
\[P\left(\|\hat{\Gamma}_{i,t} - \Gamma\| \geq u\right) \leq d \exp\left(-\frac{cu^2}{r_t + u}\right)\]
where $r_t = t^{-2}\sum_{s=1}^t 1/p_s^*$ measures exploration efficiency.
RKHS functional CLT: Linearize the IPW-KRR estimator as:
\[\hat{f}_{i,t} - f_i^\lambda = A_i(\lambda)^{-1}(\Delta h_{t,i} - \Delta \Sigma_{t,i} f_i^\lambda) + R_{t,i}\]
The martingale term satisfies:
\[t^\gamma L_{t,i} = \sum_{s=1}^t \xi_{t,s,i}\]
where $\xi_{t,s,i} = t^{\gamma-1}A_i(\lambda)^{-1}\frac{1{a_s = i}}{p_{s,i}(U_{s,i})}[\Delta_s^\lambda - \lambda f_i^\lambda]$.
Studentization via predictable quadratic variation:
\[V_t(\lambda) = t^{2\gamma-2}A_i(\lambda)^{-1}H_i(\lambda)A_i(\lambda)^{-1}\]
Key technical insight: The inverse-propensity weighting preserves martingale structure while correcting for selection bias, enabling standard CLT techniques despite adaptive sampling.

Experiments & Validation

The paper includes extensive simulations with synthetic data varying noise levels, context dimensions (d = 5, 10, 20), number of arms (L = 2, 3, 5), and link function complexity (polynomial, sinusoidal, step functions). Baselines include linear contextual bandits, neural bandits, and kernel bandits without single-index structure.

Key results: K-SIEGE achieves 15-30% regret reduction compared to standard kernel bandits and provides well-calibrated confidence intervals with 94-96% empirical coverage for nominal 95% level. Real data experiments on a medical treatment allocation dataset demonstrate interpretable index parameter estimates and reliable uncertainty quantification.

Computational complexity analysis shows O(t^2) per-round cost for kernel computations, which is standard for kernel methods in bandits.

Limitations & Open Problems

Known score function requirement - TECHNICAL: Needed for Stein-based estimation but could be addressed with score estimation techniques
Bounded RKHS assumption - NATURAL: Standard in kernel bandit literature and satisfied by common kernels like RBF
Finite action space - RESTRICTIVE: Limits applicability to continuous action settings, though extensions may be possible
Homoscedastic noise within arms - TECHNICAL: Could potentially be relaxed using robust estimation techniques
Exploration rate conditions $t^{2\alpha}r_t \to 0$ and $t^{2\gamma}\tilde{r}_t \to 0$ - TECHNICAL: Required for CLT but may be relaxable with refined martingale analysis
Ridge parameter selection requires knowledge of smoothness parameters - TECHNICAL: Adaptive selection could be developed

Open problems:
Extension to continuous action spaces while preserving single-index structure and inference guarantees
Development of data-driven procedures for ridge parameter selection that maintain theoretical guarantees

Towards Noise-Resilient Quantum Multi-Armed and Stochastic Linear Bandits

Authors: Zhuoyue Chen, Kechao Cai · Institution: Sun Yat-sen University · Category: cs.LG

Proposes noise-resilient quantum bandit algorithms using Bayesian quantum Monte Carlo to maintain quantum advantage under realistic NISQ device noise.

Tags: quantum computing multi-armed bandits NISQ quantum Monte Carlo noise robustness online learning amplitude estimation

arXiv · PDF

Problem Formulation

Motivation: Quantum multi-armed bandits can achieve quadratic speedups over classical bandits by using quantum Monte Carlo (QMC) for reward estimation. However, existing quantum bandit algorithms assume noise-free quantum circuits, which is unrealistic for current NISQ (Noisy Intermediate-Scale Quantum) devices where noise can eliminate theoretical advantages.
Mathematical setup: Consider a quantum multi-armed bandit with action set $X = {1, \ldots, K}$. Each arm $x \in X$ has unknown mean reward $a_x \in [0,1]$, with optimal arm $x^* = \arg\max_{x \in X} a_x$. In the ideal setting, each arm is modeled by a quantum reward oracle $O_x$ that prepares state
\[|ψ_x⟩ = \sqrt{1-a_x}|0⟩ + \sqrt{a_x}|1⟩\]
Under NISQ noise, the implemented oracle is $\tilde{O}_x = E_x \circ U_x$ where $U_x(ρ) = O_x ρ O_x^†$ is the ideal unitary channel and $E_x$ is a noise channel. The noisy output state is
\[\tilde{ρ}_x = E_x(O_x |0⟩⟨0| O_x^†)\]
The noisy reward probability is $\tilde{a}_x = \text{Tr}(\lvert 1⟩⟨1 \rvert \tilde{ρ}_x)$. We model $\tilde{a}_x = a_x + b_x + ξ_x$ where $b_x$ is deterministic bias with $\lvert b_x \rvert \leq δ$, and $ξ_x$ is stochastic noise with $E[ξ_x] = 0$ and $\text{Var}(ξ_x) \leq σ_x^2$.

Assumptions:
1. Quantum reward oracles encode mean rewards as amplitudes
2. Noise channels are CPTP (completely positive trace-preserving)
3. Bias is bounded and noise has finite variance
4. Oracle queries are independent across rounds
Toy example: When $K=2$ arms with $a_1 = 0.3, a_2 = 0.7$ under depolarizing noise with parameter $η = 0.1$, the noisy success probability becomes $\tilde{p}(θ_x, m) = (1-η)\sin^2((2m+1)θ_x) + η/2$ where $a_x = \sin^2 θ_x$.
Formal objective: Minimize cumulative regret
\[R_T = \sum_{t=1}^T (a_{x^*} - a_{x_t})\]
where $x_t$ is the arm selected at round $t$, while maintaining robustness under noisy oracle implementations.

Method

The method consists of three main components:

Bayesian Quantum Monte Carlo (BQMC): Instead of directly estimating $a_x$, perform Bayesian inference over the amplitude parameter $θ_x$ where $a_x = \sin^2 θ_x$. Use uniform prior $θ_x \sim U(0, π/2)$. With $h$ successes in $S$ shots at amplification depth $m$, the likelihood is
\[L(θ_x) = \binom{S}{h} \tilde{p}(θ_x, m)^h (1-\tilde{p}(θ_x, m))^{S-h}\]
Sequential Monte Carlo approximation: Represent posterior by particles ${θ^{(i)}, w^{(i)}}_{i=1}^M$. Update weights via Bayes rule:
\[w^{(i)} \leftarrow w^{(i)} \tilde{p}(θ^{(i)}, m)^h (1-\tilde{p}(θ^{(i)}, m))^{S-h}\]
Compute posterior mean $\hat{a}_x = \sum_{i=1}^M w^{(i)} \sin^2(θ^{(i)})$ and variance $s_x^2$.
Noise-Resilient UCB: Construct upper confidence bounds as
\[I_x = \hat{a}_x + z s_x\]
where $z = \sqrt{2\log(T/δ_r)}$ and select arm $x_t = \arg\max_x I_x$.

Application to toy example: For the 2-arm case with depolarizing noise, BQMC would maintain particle approximations for $θ_1, θ_2$. After observing outcomes, particle weights are updated according to the noisy likelihood. The UCB indices incorporate both the posterior means and the noise-aware uncertainty estimates $s_1, s_2$.

Novelty & Lineage

This work extends quantum multi-armed bandits (Wan et al. 2023, Casalé et al. 2020) by addressing NISQ noise robustness. Prior quantum bandit work assumed ideal noise-free quantum Monte Carlo, while this paper integrates noise-aware Bayesian estimation.

The Bayesian QMC approach builds on maximum-likelihood amplitude estimation (Suzuki et al. 2020) and Bayesian amplitude estimation (Ramôa & Santos 2025), but adapts these for online bandit settings rather than single estimation tasks.

The key novelty is the unified framework combining:

Particle-based Bayesian inference for noisy amplitude estimation
Integration with UCB/LinUCB exploration strategies
Theoretical and empirical analysis under realistic NISQ noise models

INCREMENTAL

Proof Techniques

The paper is primarily empirical and does not provide detailed theoretical analysis or proofs. The main technical contributions are algorithmic:

Noise modeling: Effective statistical model $\tilde{a}_x = a_x + b_x + ξ_x$ captures both systematic bias and stochastic fluctuations from NISQ noise.
Bayesian posterior approximation: Sequential Monte Carlo maintains particle representation of posterior distribution over amplitude parameters. Key update equation:
\[π(θ_x | h) ∝ π(θ_x) \tilde{p}(θ_x, m)^h (1-\tilde{p}(θ_x, m))^{S-h}\]
Confidence bound construction: Combines classical UCB confidence ellipsoids with Bayesian uncertainty:
\[I_x = φ_x^T \hat{θ} + \sqrt{β_t φ_x^T V^{-1} φ_x} + z s_x\]
The theoretical analysis focuses on algorithmic design rather than regret bounds. The robustness comes from the Bayesian framework naturally accounting for noise-induced uncertainty through the posterior variance $s_x$.

Experiments & Validation

Datasets: Simulated quantum bandit environments with K=3 arms (MAB) and K=10 arms (linear bandits) with feature vectors on unit circle.

Baselines:

Classical UCB/LinUCB
Canonical-QUCB (using QFT-based QAE)
MLE-QUCB (using maximum likelihood QAE)
Proposed NR-QUCB/NR-QLinUCB

Noise models: Four NISQ-realistic noise types:

Exponential decoherence (T_c = 2000)
Readout errors (bit-flip probability p = 0.03)
Depolarizing noise (p = 0.05)
Amplitude damping (γ = 0.08)

Key results:
- In noiseless settings: QUCB achieves lower regret than classical UCB
- Under noise: NR-QUCB consistently outperforms baseline quantum methods
- Performance gains more pronounced in MAB than linear bandits (due to noise averaging in linear case)
- MLE-based methods outperform canonical QAE due to shallower circuits
Implementation: MindSpore Quantum framework, 50 repetitions per configuration, T=5000 rounds.

Limitations & Open Problems

Limitations:

TECHNICAL: Assumes access to quantum reward oracles that encode rewards as amplitudes - this oracle model may not capture all practical quantum advantage scenarios.
TECHNICAL: Particle-based posterior approximation introduces computational overhead that may offset quantum speedups for small problem sizes.
RESTRICTIVE: Limited theoretical analysis - no formal regret bounds provided for the noise-robust algorithms.
TECHNICAL: Noise models are simplified and may not capture all correlations and non-Markovian effects in real quantum hardware.
NATURAL: Assumes bounded bias and finite noise variance, which are reasonable for most NISQ noise processes.

Open problems:
Derive formal regret bounds for NR-QUCB and NR-QLinUCB under the proposed noise models, characterizing when quantum advantage is preserved.
Extend the framework to more complex noise processes including correlated errors and non-Markovian decoherence effects.

The minimax optimal convergence rate of posterior density in the weighted orthogonal polynomials

Authors: Yiqi Luo, Xue Luo · Institution: Beihang University · Category: math.ST

Establishes minimax optimal posterior convergence rates for Bayesian density estimation using orthogonal polynomials on unbounded domains via a novel shifting method that circumvents the absence of uniform positive lower bounds.

Tags: bayesian-nonparametrics density-estimation minimax-theory orthogonal-polynomials sobolev-spaces posterior-convergence sieve-priors unbounded-domains

arXiv · PDF

Problem Formulation

Motivation: Bayesian nonparametric density estimation provides flexible modeling without restrictive parametric assumptions, but establishing minimax optimal posterior convergence rates is challenging, especially for densities on unbounded domains without strictly positive lower bounds.
Mathematical setup: Let $Y_n = (Y_1, \ldots, Y_n)$ be i.i.d. observations from an unknown distribution with density $g_0$ with respect to reference measure $dP = w(x)dx$ on domain $I$ (bounded or unbounded). The density belongs to the weighted Sobolev space:
\[\|g\|_{W^p_{2,w}}^2 = \sum_{l=0}^p \|g^{(l)}\|_{L_{2,w}}^2 = \sum_{l=0}^p \int_I |g^{(l)}(x)|^2 w(x) dx < \infty\]
The function class is:
\[G := \left\{g \in W^p_{2,w}(I) : g \geq 0 \text{ and } \int_I g(x)w(x)dx = 1\right\}\]
Any $g_0 \in G$ admits orthogonal polynomial expansion:
\[g_0(x) = \sum_{j=0}^\infty \theta_j q_j(x)\]
where ${q_j(x)}_{j \geq 0}$ are orthogonal polynomials with weight $w(x)$ satisfying:
\[\int_I q_i(x)q_j(x)w(x)dx = \delta_{ij}\gamma_j\]
Assumptions:
1. $g_0 \in W^p_{2,w}(I)$ for integer $p \geq 1$
2. Orthogonal polynomial system ${q_j}$ with weight $w(x)$ exists on $I$
3. Parameter space constraint $\sum_{j=0}^\infty \eta_j^2 \tilde{\gamma}_j < \infty$ for appropriate $\tilde{\gamma}_j$
Toy example: When $I = [-1,1]$, $w(x) = 1$, $p = 1$, and $g_0(x) = \frac{3}{4}(1-x^2)$ (beta distribution), the Legendre polynomial expansion gives finite coefficients $\theta_j$ with $\tilde{\gamma}_j \gtrsim j^{4p+1} = j^5$. The core difficulty is that without $g_0 \geq a > 0$ uniformly, the KL neighborhood conditions fail.
Formal objective: Establish posterior convergence rate $\varepsilon_n$ such that:
\[\pi(A_n^c | Y_n) \to 0\]
where $A_n^c = {g \in G : d_H(g, g_0) > K\varepsilon_n}$ and $d_H$ is Hellinger distance.

Method

The method uses two approaches depending on whether $g_0$ has a positive lower bound:

Case 1 (Positive lower bound): If $g_0 \geq a > 0$, apply Shen-Wasserman framework directly with sieve prior on coefficients $\eta_j$.

Case 2 (No positive lower bound): Novel shifting method:

Define shifted density for decreasing sequence $a_n \downarrow 0$:
\[g_n(x) := a_n + g(x)(1 - a_n\gamma_0)\]
Transform coefficients via:
\[\eta_{j,n} := \begin{cases}\] \[a_n + \eta_0(1 - a_n\gamma_0), & j = 0 \\\] \[\eta_j(1 - a_n\gamma_0), & j \geq 1\] \[\end{cases}\]
Establish modified convergence theorem (Theorem 3.3) with conditions: - Entropy control: $\int_{\varepsilon_n/28}^{\sqrt{2}\varepsilon_n} H_{[]}^{1/2}(u, \Omega_n, L_{2,w}) du \lesssim \sqrt{n}\varepsilon_n^2$ - Prior mass: $\pi(S_{a_n}(t_n)) \gtrsim \exp(-b_n a_n t_n)$ where $a_n b_n \leq n$ - Sieve condition: $\pi(\Omega_n^c) \leq \exp(-n\varepsilon_n^2)$
Prove shifted and original densities have comparable Hellinger distances when $a_n \lesssim \varepsilon_n^4$.
Construct Gaussian sieve prior:
\[\eta_0 = 1, \quad \eta_j \sim N(0, j^{-2p}\gamma_j^{-1}) \text{ for } 1 \leq j \leq k_n, \quad \eta_j = 0 \text{ for } j > k_n\]
with $k_n = O(n^{(6p+1)/(7p(2p+1))})$.

Toy example application: For the beta density case, the shifted version $g_n(x) = a_n + \frac{3}{4}(1-x^2)(1-2a_n)$ has uniform lower bound $a_n$, enabling application of the modified theorem with $a_n = n^{-4p/(2p+1)}$.

Novelty & Lineage

This extends Shen-Wasserman (2001) and Ghosal et al. (2000) frameworks for Bayesian nonparametric density estimation. Prior work either restricted to bounded domains (Ghosal 2001, Rousseau 2010) or required artificial truncation/restrictive assumptions (Kruijer et al. 2010, Scricciolo 2014).

Key novelty is the shifting method that lifts densities without positive lower bounds to proxy densities $g_{0,n} \geq a_n > 0$, combined with a modified convergence theorem (Theorem 3.3) that handles the $a_n$-dependent constants. This circumvents the fundamental issue that Hellinger and weighted $L_2$ distances are not equivalent without positive lower bounds.

Previous orthogonal polynomial approaches (like trigonometric bases in Scricciolo 2006) required bounded supports or strong positivity assumptions. This work provides the first minimax optimal rates for orthogonal polynomial expansions on unbounded domains without uniform positivity.

SIGNIFICANT

Proof Techniques

The proof uses several key techniques:

Entropy control via Sobolev embedding: For densities with positive lower bound $a > 0$:
\[H_{[]}(\varepsilon, \Omega_{a,n}, d_H) \leq H_{[]}(\varepsilon, \Omega_{a,n}, L_{2,w}) \leq H_{[]}(\varepsilon, W^p_{2,w}, L_{2,w}) \lesssim \varepsilon^{-1/p}\]
Key inequality connecting Hellinger and $L_{2,w}$ distances:
\[d_H^2(u_i, l_i) \leq \frac{1}{a}\|u_i - l_i\|_{L_{2,w}}^2\]
KL neighborhood embedding: Show $\bar{S}_a(t) \subset S_a(t)$ using:
\[D(g_0, g) \leq \frac{1}{a}\|g - g_0\|_{L_{2,w}}^2, \quad V(g_0, g) \leq \frac{1}{a}\|g - g_0\|_{L_{2,w}}^2\]
Shifting method analysis: Critical lemma establishing Hellinger distance comparison:
\[d_H(g, g_0) \lesssim d_H(g_n, g_{0,n})\]
when $a_n \leq \frac{K^4\varepsilon_n^4}{(16 + K^4\varepsilon_n^4)\gamma_0}$
Modified exponential bound: For likelihood ratios without uniform positivity, prove:
\[P_0^* \left(\sup_{\{g: d_H(g,g_0) \geq \varepsilon_n\}} \prod_{i=1}^n \frac{g(Y_i)}{g_{0,n}(Y_i)} \geq \exp(-\frac{1}{2}C_4 n\varepsilon_n^2)\right) \lesssim \exp(-C_5 n\varepsilon_n^2) + (1-a_n\gamma_0)^{-n}\exp(-\frac{1}{2}C_4 n\varepsilon_n^2)\]
Sieve prior construction: Gaussian coefficients $\eta_j \sim N(0, j^{-2p}\gamma_j^{-1})$ with truncation at $k_n = O(n^{(6p+1)/(7p(2p+1))})$ ensures both prior mass and tail conditions.

The key insight is that the $a_n$-dependence in constants can be controlled through the modified framework, preserving optimality.

Experiments & Validation

Numerical experiments include:

Density estimation comparison: Orthogonal polynomial expansions vs trigonometric bases from Scricciolo (2006), showing competitive performance for unbounded domain problems.
Convergence rate validation: Examination of decreasing Hellinger distance $d_H(\hat{g}_n, g_0)$ as sample size $n$ increases, confirming theoretical rate $n^{-p/(2p+1)}$.

The experiments are primarily illustrative rather than comprehensive, focusing on validating the theoretical convergence rate predictions and demonstrating practical applicability of the method.

Limitations & Open Problems

Limitations:

Weight function requirements - TECHNICAL: The orthogonal polynomial system must exist with appropriate growth conditions on $\tilde{\gamma}_j$, which varies significantly across polynomial families (e.g., $\tilde{\gamma}_j \gtrsim j^{4p+1}$ for Legendre, $\tilde{\gamma}_j \gtrsim j^{4p}2^j j!$ for Hermite).
Shifting parameter choice - TECHNICAL: The sequence $a_n \lesssim \varepsilon_n^4$ requires careful calibration and the bound $a_n b_n \leq n$ with $a_n b_n \varepsilon_n^2 \to \infty$ creates non-trivial constraints.
Sobolev regularity assumption - NATURAL: Assumes $g_0 \in W^p_{2,w}(I)$ which is standard for minimax theory but may not hold for all practical densities.
Computational complexity - RESTRICTIVE: The sieve truncation at $k_n = O(n^{(6p+1)/(7p(2p+1))})$ grows with sample size, making implementation challenging for large $n$.

Open problems:
Extend to adaptive estimation where regularity $p$ is unknown
Develop computationally efficient algorithms for posterior sampling with the constructed sieve priors

Synthetic Data, Information, and Prior Knowledge: Why Synthetic Data Augmentation to Boost Sample Doesn’t Work for Statistical Inference

Authors: Reid Dale, Jordan Rodu, Mike Baiocchi · Institution: Stanford University, University of Virginia · Category: stat.ME

Proves that synthetic data augmentation adds zero marginal Fisher information for statistical inference, providing fundamental limits on when such methods can improve parameter estimation.

Tags: synthetic-data fisher-information statistical-inference maximum-likelihood bootstrap information-theory bayesian-analysis

arXiv · PDF

Problem Formulation

Motivation: Synthetic data augmentation is increasingly popular for improving statistical models by boosting sample size at low cost. However, there lacks a rigorous information-theoretic analysis of whether such augmentation actually improves statistical inference about underlying parameters.
Mathematical setup: Let $X = (X_1, \ldots, X_n)$ be iid observations from distribution $\theta$ where $\theta \in \Theta$ is the parameter space. A synthetic distribution $S$ is a function $S: \bigcup_{n \in \mathbb{N}} \Omega^n \to P(\Xi)$ mapping empirical samples to probability distributions over sample space $\Xi$.

Let $S_i \sim S(X)$ be synthetic observations drawn from the synthetic distribution constructed from $X$. We consider Fisher information:
\[I_X(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2} \ell_X(\theta)\right]\]
Assumptions:
1. Original data $X_i$ are iid from $\theta$
2. Synthetic distribution $S(X)$ depends only on observed sample $X$
3. Synthetic observations $S_i$ are conditionally independent given $X$
Toy example: Consider $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and the nonparametric bootstrap synthetic distribution that assigns probability $1/n$ to each observed value. The synthetic data are just resamples of the original observations.
Formal objective: Determine the marginal Fisher information contributed by synthetic data:
\[I_{S|X}(\theta) = I_{X,S}(\theta) - I_X(\theta)\]

Method

The authors prove bounds on Fisher information contributed by synthetic data through the following analysis:

Define the joint likelihood of original and synthetic data:
\[q_\theta(X,S) = \prod_i p_\theta(X_i) \times \prod_j p_{S(X)}(S_j)\]
Show that synthetic data density is independent of $\theta$:
\[L_{X,S}(\theta) = L_X(\theta) \times \prod_j p_{S(X)}(S_j) = c_X(S) \cdot L_X(\theta)\]
Take log-likelihood and differentiate:
\[\ell_{X,S}(\theta) = \log(c_X(S)) + \ell_X(\theta)\] \[\frac{\partial^2}{\partial\theta^2}\ell_{X,S}(\theta) = \frac{\partial^2}{\partial\theta^2}\ell_X(\theta)\]
Apply Fisher information definition to conclude:
\[I_{X,S}(\theta) = I_X(\theta)\]
Applied to toy example: For Bernoulli bootstrap, $c_X(S) = 1/n^k$ where $k$ is number of synthetic samples. The Fisher information remains $I_X(p) = n/[p(1-p)]$ regardless of synthetic augmentation.

Novelty & Lineage

This paper provides the first rigorous information-theoretic analysis of synthetic data augmentation for statistical inference. While synthetic data generation methods (GANs, VAEs) and privacy-preserving synthetic data (differential privacy, data masking) are well-studied, the fundamental question of whether synthetic augmentation improves parameter estimation has not been formally addressed. The paper connects synthetic data to classical bootstrap theory and Fisher information bounds. The key insight that synthetic data acts as a form of prior knowledge encoding is novel. SIGNIFICANT

Proof Techniques

The main proof uses the following key steps:

Likelihood factorization: Show that joint likelihood factors as
\[q_\theta(X,S) = \left[\prod_i p_\theta(X_i)\right] \times \left[\prod_j p_{S(X)}(S_j)\right]\]
Parameter independence: Prove that $p_{S(X)}(S_j)$ does not depend on $\theta$ by construction of synthetic distribution $S(X)$.
Chain rule for Fisher information: Apply the identity
\[I_{X,S}(\theta) = I_X(\theta) + I_{S|X}(\theta)\]
Second derivative calculation: Show that
\[\frac{\partial^2}{\partial\theta^2}\ell_{X,S}(\theta) = \frac{\partial^2}{\partial\theta^2}\ell_X(\theta)\]
since $\log(c_X(S))$ is constant in $\theta$.
Upper bound via chain rule: For unconditional synthetic Fisher information, use
\[I_{X,S}(\theta) = I_S(\theta) + I_{X|S}(\theta) \geq I_S(\theta)\]
to conclude $I_S(\theta) \leq I_X(\theta)$.

Experiments & Validation

Purely theoretical. The paper provides mathematical analysis without empirical validation. Appropriate empirical validation would involve:

comparing confidence intervals from ML estimates using original vs. augmented data
measuring actual vs. predicted Fisher information on synthetic datasets
demonstrating the bounds hold across different synthetic generation methods (GANs, VAEs, bootstrap variants).

Limitations & Open Problems

Limitations:

Analysis limited to maximum likelihood estimation framework - TECHNICAL (could extend to other estimators)
Assumes synthetic distribution depends only on observed sample X - NATURAL (standard assumption for most synthetic data methods)
Does not address model selection or regularization effects - TECHNICAL (proof technique focuses on Fisher information)
Bayesian analysis incomplete, only sketches prior specification issues - TECHNICAL (could be developed further)

Open problems:
Extend analysis to other estimation frameworks beyond MLE (M-estimators, Bayesian with informative priors)
Characterize when synthetic augmentation can improve finite-sample performance despite zero asymptotic information gain

A Jacobi Field Approach to Splitting Detection in Schrödinger Bridge

Authors: Chunhai Jiao, Jin Guo, Haoyan Zhang, Jinqiao Duan et al. (5 authors) · Institution: Huazhong University of Science and Technology, Great Bay University, City University of Hong Kong · Category: math.DS

Introduces a Jacobi field-based geometric framework for detecting the onset of path splitting in stochastic interpolation using eigenvalue analysis of the velocity field’s strain tensor

Tags: stochastic-interpolation optimal-transport schrodinger-bridge flow-matching mode-collapse splitting-detection jacobi-fields generative-models

arXiv · PDF

Problem Formulation

Motivation: In stochastic interpolation between probability distributions, especially when target distributions are nonconvex or supported on disconnected components, interpolating trajectories may separate into distinct branches. Detecting the onset of this path splitting is crucial for understanding when dynamics begin to encode multi-modal geometry and can serve as early indicators of structural transitions in applications.
Mathematical setup: Let $\mu_0$ and $\mu_1$ be probability distributions on $\mathbb{R}^d$. Consider a stochastic interpolation ansatz of the form:
\[X_t = \alpha(t)X_0 + \beta(t)X_1 + \gamma(t)Z, \quad t \in [0,1]\]
where $(X_0, X_1) \sim \pi$ for some coupling $\pi \in \Pi(\mu_0, \mu_1)$, and $Z \sim \mathcal{N}(0, I_d)$ is independent. The interpolation coefficients satisfy:
Boundary conditions: $\alpha(0) = \beta(1) = 1$, $\alpha(1) = \beta(0) = \gamma(0) = \gamma(1) = 0$
Positivity: $\gamma(t) > 0$ for $t \in (0,1)$

The induced Eulerian velocity field is defined by conditional averaging:
\[v(x,t) := \mathbb{E}[\dot{X}_t | X_t = x]\]
The spatial Jacobian matrix is:
\[\nabla_x v(x,t) \in \mathbb{R}^{d \times d}\]
Define the symmetric part (strain tensor):
\[S(t) := \frac{\nabla_x v(x_t,t) + \nabla_x v(x_t,t)^T}{2}\]
Toy example: When $d=2$, $X_0 \sim \mathcal{N}(0, I_2)$, and $X_1$ is supported on two points ${y_1, y_2}$ with equal weights, the velocity field exhibits splitting behavior when trajectories separate toward the two target modes.
Formal objective: Detect the splitting time $t_s \in (0,1)$ and location $x_{t_s}$ by analyzing:
\[\lambda_{\max}(t) := \lambda_{\max}(S(t))\]

Method

The method consists of three main components:

Construct the stochastic interpolation using neural networks to parameterize $\alpha(t)$, $\beta(t)$, $\gamma(t)$ with discontinuous activation functions to capture abrupt transitions.
Compute the induced Eulerian velocity field via conditional expectation:
\[v(x,t) = A(t)(x - \beta(t)m(t,x)) + \beta'(t)m(t,x)\]
where $A(t) = \frac{\alpha’(t)\alpha(t) + \gamma’(t)\gamma(t)}{\sigma^2(t)}$ and $m(t,x) = \mathbb{E}[X_1\lvert X_t=x]$.
Derive the spatial Jacobian using Proposition 3.2:
\[\nabla_x v(t,x) = A(t)I_2 + \frac{(\beta'(t) - A(t)\beta(t))\beta(t)}{\sigma^2(t)} \text{Cov}_{w(t,x)}(Y)\]
Extract the splitting indicator as the largest eigenvalue of the symmetric part:
\[S(t) = \frac{\nabla_x v(x_t,t) + \nabla_x v(x_t,t)^T}{2}\]
Use Top-K selection strategy to focus on particles with highest $\lambda_{\max}(t)$ values, tracking their cumulative integral and rate of change to identify splitting onset.

Applied to the toy example: For two target points, the covariance matrix $\text{Cov}_{w(t,x)}(Y)$ captures the local posterior geometry, and $\lambda_{\max}(t)$ peaks when trajectories begin separating toward different modes.

Novelty & Lineage

This work introduces a novel Jacobi field approach to splitting detection in stochastic interpolation, extending classical differential geometric concepts to the stochastic setting. The connection to Schrödinger bridges builds on Chen et al. (2016, 2021) and De Bortoli et al. (2021), while the flow matching perspective relates to Lipman et al. (2022). The key novelty is formulating splitting detection as a variational problem using the strain tensor’s spectral properties. Prior work on optimal transport discontinuities (Lei et al., 2019; An et al., 2020) focused on mode collapse in GANs, while this paper provides a principled geometric framework for detecting splitting onset in continuous-time interpolation. The Jacobi field interpretation and the specific eigenvalue-based splitting criterion appear to be original contributions.

SIGNIFICANT

Proof Techniques

The main theoretical contributions are established through several key results:

Proposition 3.1 derives the fundamental variational equation by differentiating the flow equation with respect to variation parameter $s$:
\[\frac{d}{dt}J(t) = \nabla_x v(t, x(t)) J(t)\]
The second-order Jacobi equation follows by time differentiation:
\[\frac{d^2}{dt^2} J(t) + M(t)J(t) = 0\]
where:
\[M(t) := -\left[\partial_t \nabla_x v + (v \cdot \nabla_x)(\nabla_x v) + (\nabla_x v)^2\right](t, x(t))\]
Proposition 3.2 computes the spatial Jacobian for the discrete target case using Gaussian regression and Bayesian updating. The key technical insight is expressing posterior weights as:
\[w_i(t,x) = \frac{\pi_i \exp\left(-\frac{\|x-\beta(t)y_i\|^2}{2\sigma^2(t)}\right)}{\sum_{j=1}^N \pi_j \exp\left(-\frac{\|x-\beta(t)y_j\|^2}{2\sigma^2(t)}\right)}\]
Then applying the softmax gradient formula to derive:
\[\nabla_x m(t,x) = \frac{\beta(t)}{\sigma^2(t)} \text{Cov}_{w(t,x)}(Y)\]
Proposition 3.3 establishes the geometric interpretation using the Rayleigh quotient:
\[\lambda_{\max}(t) = \max_{\|e\|=1} e^T S(t) e\]
The stretching property follows from:
\[\frac{d}{dt}\frac{1}{2}\|\delta x_t\|^2 = \delta x_t^T S(t) \delta x_t = \lambda_{\max}(t)\|\delta x_t\|^2\]

Experiments & Validation

Four 2D synthetic datasets: Two Moons, Checkerboard, S-Curve, and Gaussian Mixture. The experiments track particle trajectories from Gaussian prior to target distributions, visualizing:

Grid evolution maps showing spatial particle density over time
Delta (δ) and eigenvalue heatmaps highlighting high-intensity regions
Top-30 eigenvalue integral and slope analysis

Key quantitative findings:

Splitting detection occurs around t ∈ [0.7, 0.8] for Two Moons
Eigenvalue peaks align with visual onset of branching in all datasets
Top-K selection strategy successfully localizes splitting regions
Slope changes provide early warning signals before visible splitting

Baselines: None explicitly compared, as this is primarily a methodological contribution introducing new splitting detection indicators. The validation is primarily qualitative through visualization of splitting phenomena.

Limitations & Open Problems

RESTRICTIVE: Limited to 2D experiments only - scalability to high-dimensional distributions unclear
TECHNICAL: Requires discrete target support assumption in Proposition 3.2 - continuous target case not fully addressed
RESTRICTIVE: Top-K selection strategy relies on manually chosen K parameter - no principled selection method provided
TECHNICAL: Numerical differentiation for discontinuous neural networks may be unstable at jump points
NATURAL: Assumes Gaussian noise in interpolation ansatz - reasonable for many applications
RESTRICTIVE: No rigorous threshold for splitting detection - relies on visual inspection of eigenvalue peaks

Open problems:
Extend the theoretical framework to continuous target distributions and prove convergence of the discrete approximation
Develop principled methods for automatically selecting the splitting threshold and Top-K parameter without manual tuning

Transfer Learning for Contextual Joint Assortment-Pricing under Cross-Market Heterogeneity

Authors: Elynn Chen, Xi Chen, Yi Zhang · Institution: NYU Stern, Tsinghua · Category: stat.ME

Introduces the first transfer learning framework for contextual joint assortment-pricing across heterogeneous markets, achieving optimal variance-bias tradeoffs through aggregate-then-debias estimation.

Tags: transfer learning contextual bandits revenue management assortment optimization dynamic pricing multinomial logit minimax regret sparse estimation

arXiv · PDF

Problem Formulation

Motivation: Digital platforms and retailers increasingly operate across multiple related markets (e.g., different cities, regions, or repeated experiments). While data from previous/related markets can accelerate learning in a new target market, naive pooling can introduce systematic bias when customer preferences differ across markets.
Mathematical setup: Consider $H$ source markets and one target market (indexed 0). At each time $t$, the seller observes contextual features ${x_{it}^{(h)}} \in \mathbb{R}^d$ for products $i \in [N]$ in market $h$. Customer utility follows a contextual multinomial logit model:
\[f_{it}^{(h)} = \langle x_{it}^{(h)}, \theta^{(h)} \rangle - \langle x_{it}^{(h)}, \gamma^{(h)} \rangle p_{it}^{(h)} + \varepsilon_{it}^{(h)}\]
where $\varepsilon_{it}^{(h)}$ are i.i.d. Gumbel shocks. Purchase probability is:
\[q_t^{(h)}(i | S_t^{(h)}, p_t^{(h)}) = \frac{\exp(v_{it}^{(h)})}{1 + \sum_{\ell \in S_t^{(h)}} \exp(v_{\ell t}^{(h)})}\]
with $v_{it}^{(h)} := \langle x_{it}^{(h)}, \theta^{(h)} \rangle - \langle x_{it}^{(h)}, \gamma^{(h)} \rangle p_{it}^{(h)}$.

Key assumptions:
1. Structured Preference Shift: There exists an index set $S^* \subseteq [2d]$ with $\lvert S^* \rvert \leq s_0$ such that $(\nu^{(0)} - \nu^{(h)})_j = 0$ for all $j \notin S^*$, where $\nu^{(h)} = (\theta^{(h)}, \gamma^{(h)})$.
2. Positive price sensitivity: $\min_{i,t} \langle x_{it}, \gamma \rangle \geq L_0 > 0$.
3. Bounded parameters: $\lvert x_{it} \rvert_\infty \leq 1$ and $\lvert (\theta, \gamma) \rvert \leq 1$.
Toy example: When $N = 2$ products, $d = 2$ features, $s_0 = 1$, and markets differ only in price sensitivity for the first product coordinate. Target market has $\nu^{(0)} = (1, 0.5, 0.8, 0.3)$ while source market has $\nu^{(1)} = (1, 0.5, 0.6, 0.3)$, differing only in coordinate 3 (price sensitivity for first product).
Formal objective: Minimize cumulative regret over horizon $T$:
\[\text{Regret}(T; \pi) = \sum_{t=1}^T \left[ R_t^{(0)}(S_t^*, p_t^*) - R_t^{(0)}(S_t^{(0)}, p_t^{(0)}) \right]\]
where $(S_t^*, p_t^*)$ is the optimal assortment-price pair knowing the true target parameters.

Method

The Transfer Joint Assortment-Pricing (TJAP) algorithm operates episodically with three main components:

Aggregate-then-debias estimation: At episode $m$, first aggregate source data to get:
\[\hat{\nu}_{m-1}^{(ag)} \in \arg\min_{\nu \in \mathbb{R}^{2d}} \frac{1}{H|T_{m-1}|} \sum_{h=1}^H \sum_{t \in T_{m-1}} \ell_t^{(h)}(\nu)\]
Then debias using target data with $\ell_1$ regularization:
\[\hat{\delta}_{m-1} \in \arg\min_{\delta \in \mathbb{R}^{2d}} \left\{ \frac{1}{|T_{m-1}^{(0)}|} \sum_{t \in T_{m-1}^{(0)}} \ell_t^{(0)}(\hat{\nu}_{m-1}^{(ag)} + \delta) + \lambda_{m-1} \|\delta\|_1 \right\}\]
Set final estimate: $\hat{\nu}_{m-1} = \hat{\nu}_{m-1}^{(ag)} + \hat{\delta}_{m-1}$.
Construct two-radius optimistic utility:
\[\bar{v}_{it}(p) = \langle \tilde{x}_{it}^{(0)}(p), \hat{\nu}_{m-1} \rangle + \alpha_{m-1} \|\tilde{x}_{it}^{(0)}(p)\|_{W_{m-1}^{-1}} + \beta_{m-1} \|\tilde{x}_{it}^{(0)}(p)\|_\infty\]
where $W_{m-1} = V_{\tau_{m-1}}^{(0)} + \sum_{h=1}^H V_{\tau_{m-1}}^{(h)}$ is the pooled Fisher information.
Apply monotone envelope to preserve price sensitivity structure:
\[\tilde{v}_{it}(p) = \min_{p' \leq p} \{\bar{v}_{it}(p') - L_0(p - p')\}\]
Optimize over assortments and prices:
\[(S_t^{(0)}, p_t^{(0)}) \in \arg\max_{S \in \mathcal{S}_K, p \in [0,\bar{P}]^N} \tilde{R}_t(S, p)\]
Toy example application: With $s_0 = 1$ heterogeneity, the aggregate step pools both markets’ data, then debiasing identifies the sparse difference in coordinate 3. The confidence bound separates statistical uncertainty (shrinking with $H$) from transfer bias (scaling with $s_0$).

Novelty & Lineage

This paper extends transfer learning to contextual joint assortment-pricing under multinomial logit bandits. Prior work includes:

MNL assortment/pricing bandits (Agrawal et al. 2017, Chen et al. 2020, Oh & Iyengar 2021) study single markets only
Transfer learning in bandits (Bastani 2021, Li et al. 2022, Bastani et al. 2022) focus on linear models or pricing-only problems
Meta-learning for dynamic pricing (Bu et al. 2020, Zhang et al. 2025) don’t handle joint assortment decisions

Key novelties:

structured utility shift model for cross-market heterogeneity
aggregate-then-debias estimation separating variance reduction from bias control
price-uniform confidence bounds for continuous pricing under discrete choice
matching minimax bounds revealing fundamental variance-bias tradeoff.

SIGNIFICANT

Proof Techniques

The proof uses a variance-bias decomposition tailored to the aggregate-then-debias structure:

Estimation error decomposition: The target estimation error decomposes as:
\[\hat{\nu}_{m-1} - \nu^{(0)} = (\hat{\nu}_{m-1}^{(ag)} - \nu_{m-1}^{(ag)}) + (\hat{\delta}_{m-1} - \delta_{m-1}^*)\]
Variance control via self-normalized concentration: For the aggregate term, pool source data to get:
\[\|\hat{\nu}_{m-1}^{(ag)} - \nu_{m-1}^{(ag)}\|_{W_{m-1}} \leq C\sqrt{d \log T}\]
with probability $1-\delta$, where $W_{m-1}$ benefits from $H$ source markets.
Bias control via $\ell_1$ analysis: Under restricted eigenvalue conditions on target Fisher information, the debiasing error satisfies:
\[\|\hat{\delta}_{m-1} - \delta_{m-1}^*\|_\infty \leq C\lambda_{m-1} \leq C\sqrt{\frac{s_0 \log d}{|T_{m-1}^{(0)}|}}\]
Price-uniform optimism: The monotone envelope construction ensures:
\[R_t^{(0)}(S_t^*, p_t^*) \leq \tilde{R}_t(S_t^{(0)}, p_t^{(0)})\]
uniformly over continuous prices.
Regret decomposition: Instantaneous regret decomposes as:
\[r_t \leq C_1 \alpha_{m-1} + C_2 \beta_{m-1} \leq C_1 \sqrt{\frac{d}{H \cdot 2^{m-1}}} + C_2 \sqrt{\frac{s_0}{2^{m-1}}}\]
Episodic summation: Summing over episodes with lengths $2^{m-1}$ yields:
\[\text{Regret}(T) \leq \tilde{O}\left(d\sqrt{\frac{T}{1+H}} + s_0\sqrt{T}\right)\]
Key technical insight: Separating shared learning (benefiting from pooled sample size) from sparse adaptation (paying target-specific cost) through the structured shift assumption.

Experiments & Validation

Datasets: Synthetic experiments with varying numbers of source markets $H \in {1, 5, 10}$, feature dimensions $d \in {5, 10, 20}$, and sparsity levels $s_0 \in {1, 2, 5}$.

Baselines:

Target-only learning (ignores source data)
Naive pooling (treats all markets identically)
Oracle policy (knows true parameters).

Key numbers: TJAP achieves 20-40% regret reduction vs target-only when $s_0 \ll d$ and $H \geq 5$. Performance gap between TJAP and naive pooling increases with sparsity level $s_0$, confirming the importance of bias correction. With $H=10$, $d=10$, $s_0=2$, TJAP attains regret of ~0.12 vs 0.18 for target-only and 0.25 for naive pooling by period $T=1000$.

Limitations & Open Problems

Structured preference shift assumption requiring common sparse support across all source markets - RESTRICTIVE (may not hold when markets have different types of heterogeneity)
Known sparsity level $s_0$ or ability to tune regularization parameter - TECHNICAL (could be addressed with adaptive regularization)
Episodic updates with geometric episode lengths - TECHNICAL (needed for analysis but potentially inefficient in practice)
Multinomial logit choice model - NATURAL (standard in revenue management literature)
Bounded feature and parameter spaces - TECHNICAL (standard normalization for regret analysis)
Homogeneous covariate distributions across markets in main analysis - RESTRICTIVE (though extension provided)

Open problems:
Adaptive learning of the sparse support set $S^*$ without knowing $s_0$
Extension to heterogeneous choice models (nested logit, mixed logit) across markets