Theory 3 papers

Theory Digest — Mar 25, 2026

Today’s Digest at a Glance

Today’s papers tackle three distinct challenges: projection-free optimization for contextual bandits, automated model selection in variational inference, and interpretability of diffusion transformers.

Contextual Bandits with Linear Payoffs

Contextual bandits extend multi-armed bandits by conditioning arm selection on observed context vectors, with applications ranging from recommendation systems to clinical trials. In the linear payoff setting, the expected reward for choosing arm $x$ given context is $\langle w^*, x \rangle$ for some unknown parameter vector $w^*$. The challenge is balancing exploration (gathering information about $w^*$) with exploitation (choosing seemingly optimal arms) while maintaining computational efficiency.

Traditional approaches like LinUCB require expensive matrix inversions and projections onto constraint sets (e.g., unit balls) at each timestep, scaling poorly with dimension $d$ and horizon $T$. Second-order methods can achieve better regret bounds by incorporating curvature information, but typically require even more expensive operations like computing Mahalanobis distances for projections.

The key insight in projection-free approaches is exploiting scale invariance: the recommended arm depends only on the direction of the estimate, so multiplying the estimate by any positive constant leaves the relative arm rankings unchanged. The estimate therefore never has to be projected back onto a constraint set, which removes explicit projections while keeping the benefits of second-order preconditioning.
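As a small illustration of the scale-invariance argument, the sketch below checks that the arm maximizing a linear score does not change when the estimate is rescaled by a positive constant. The arm set, the estimate, and the helper name `recommend` are placeholders invented for this example, not taken from the paper.

```python
import numpy as np

def recommend(w_hat, arms):
    """Return the index of the row of `arms` that maximizes <w_hat, x>."""
    return int(np.argmax(arms @ w_hat))

rng = np.random.default_rng(0)
arms = rng.normal(size=(8, 3))   # hypothetical feasible set: 8 arms in R^3
w_hat = rng.normal(size=3)       # hypothetical (unprojected) estimate

base = recommend(w_hat, arms)
# Positive rescaling changes the norm of the estimate but not its argmax.
rescaled = [recommend(c * w_hat, arms) for c in (1e-3, 0.5, 7.0, 1e3)]
```

Because only the direction matters, an estimate with a very large or very small norm yields the same recommendation as its normalized version.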

Vine Copulas for Variational Inference

Vine copulas provide a flexible framework for modeling complex multivariate dependencies by decomposing joint distributions into sequences of bivariate copulas arranged in tree structures. Unlike traditional copula models that assume specific parametric forms, vine copulas build dependencies incrementally: Tree 0 captures marginal distributions, Tree 1 models pairwise dependencies, Tree 2 models pairwise dependencies conditioned on one intermediate variable, and so forth, with each further tree adding one more conditioning variable.

In variational inference, the challenge is selecting an appropriately complex variational family—too simple and the approximation is poor, too complex and optimization becomes intractable. Standard approaches require pre-specifying the number of trees (truncation level) in the vine copula, which is often unknown a priori and requires expensive cross-validation.

Stepwise vine copula construction addresses this by growing the model incrementally: starting with independent marginals, it adds one tree at a time until a stopping criterion is met. At each step, parameters from previous trees are frozen while only the current tree’s parameters are optimized. This automatic truncation selection eliminates the need to pre-specify model complexity while maintaining computational tractability through the sequential optimization structure.

Reading guide: The first paper develops efficient algorithms for contextual recommendation by avoiding projections entirely, while the second automates model selection in copula-based variational inference. The third paper provides theoretical insights into how diffusion transformers implement hierarchical generation through attention mechanisms, connecting continuous diffusion theory to discrete architectural choices.


Simple Projection-Free Algorithm for Contextual Recommendation with Logarithmic Regret and Robustness

Authors: Shinsaku Sakaue · Institution: CyberAgent, National Institute of Informatics, RIKEN · Category: cs.LG

A projection-free second-order algorithm for contextual recommendation achieving $O(d \log T)$ regret by exploiting scale invariance, eliminating expensive Mahalanobis projections required by prior methods.

Tags: online learning contextual bandits inverse optimization projection-free algorithms second-order methods logarithmic regret contextual recommendation


Problem Formulation
  1. Motivation: Contextual recommendation is a variant of contextual linear bandits where the learner observes actions rather than reward scalars. This setting arises in inverse optimization, inverse reinforcement learning, and learning from revealed preferences, where systems record what users choose without directly observing utilities.

  2. Mathematical setup: Let $V$ be a real Hilbert space with inner product $\langle \cdot, \cdot \rangle$ and induced norm $\|\cdot\|$. At round $t$, the learner observes a nonempty, weakly compact feasible set $X_t \subseteq V$ and recommends an action

    \[\hat{x}_t \in \arg\max_{x \in X_t} \langle \hat{w}_t, x \rangle\]

    for some prediction $\hat{w}_t \in V$. The user then takes an action $x_t \in X_t$ that is optimal with respect to unknown preference $u \in V$:

    \[x_t \in \arg\max_{x \in X_t} \langle u, x \rangle\]

    The regret is defined as:

    \[R_T(u) := \sum_{t=1}^T \langle u, x_t - \hat{x}_t \rangle\]
  3. Assumption 1 (Optimal-action feedback): $x_t \in \arg\max_{x \in X_t} \langle u, x \rangle$ for every $t$.
  4. Assumption 2 (Boundedness): There exists $B > 0$ such that for all $t$ and all $x, x' \in X_t$, $\langle u, x - x' \rangle \leq B$.

  5. Toy example: When $V = \mathbb{R}^2$ and $X_t = \{x \in \mathbb{R}^2 : \|x\|_2 \leq 1\}$, if $u = (1,0)$ and the learner predicts $\hat{w}_t = (0.5, 0.5)$, then $\hat{x}_t = (1/\sqrt{2}, 1/\sqrt{2})$ while the user chooses $x_t = (1,0)$. The residual is $g_t = \hat{x}_t - x_t = (1/\sqrt{2} - 1, 1/\sqrt{2})$.

  6. Formal objective: Minimize the regret:

    \[R_T(u) = \sum_{t=1}^T \langle u, x_t - \hat{x}_t \rangle\]
Method

CoRectron (Algorithm 1) maintains:

  • Cumulative residual $\zeta_t := \sum_{s=1}^t g_s$ where $g_t := \hat{x}_t - x_t$
  • Second-order preconditioner $A_t = A_{t-1} + g_t \otimes g_t$ with $A_0 = \lambda I$

The update rule is:

\[\hat{w}_t = -A_{t-1}^{-1} \zeta_{t-1}\]

Complete algorithm steps:

  1. Compute $\hat{w}_t \leftarrow -A_{t-1}^{-1} \zeta_{t-1}$
  2. Observe context $X_t$
  3. Recommend $\hat{x}_t \in \arg\max_{x \in X_t} \langle \hat{w}_t, x \rangle$
  4. Observe $x_t$ and set $g_t \leftarrow \hat{x}_t - x_t$
  5. Update $A_t \leftarrow A_{t-1} + g_t \otimes g_t$ and $\zeta_t \leftarrow \zeta_{t-1} + g_t$

    Applied to the toy example: With $g_1 = (1/\sqrt{2} - 1, 1/\sqrt{2})$ and $\lambda = 1$, we get:

    \[A_1 = I + g_1 \otimes g_1\] \[\zeta_1 = g_1\] \[\hat{w}_2 = -A_1^{-1} g_1\]
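The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes every feasible set $X_t$ is a finite collection of arms (so the linear optimization oracle is a plain argmax over rows), and the function name `corectron`, the random contexts, and the simulated preference `u` are all invented for the example.

```python
import numpy as np

def corectron(contexts, u, lam=1.0):
    """Run the update on finite feasible sets (an assumption of this sketch).

    contexts: list of (K, d) arrays, each row one feasible action.
    u:        hidden preference vector, used only to simulate user feedback.
    Returns (cumulative regret, largest observed <g_t, A_{t-1}^{-1} zeta_{t-1}>).
    """
    d = u.shape[0]
    A = lam * np.eye(d)                       # A_0 = lambda * I
    zeta = np.zeros(d)                        # zeta_0 = 0
    regret, worst_sign = 0.0, -np.inf
    for X in contexts:
        precond = np.linalg.solve(A, zeta)    # A_{t-1}^{-1} zeta_{t-1}
        w_hat = -precond                      # step 1
        x_hat = X[np.argmax(X @ w_hat)]       # step 3: recommendation
        x_usr = X[np.argmax(X @ u)]           # step 4: optimal-action feedback
        g = x_hat - x_usr
        worst_sign = max(worst_sign, float(g @ precond))
        regret += float(u @ (x_usr - x_hat))
        A += np.outer(g, g)                   # step 5: rank-one update
        zeta += g
    return regret, worst_sign

rng = np.random.default_rng(0)
u = rng.normal(size=5)
contexts = [rng.normal(size=(10, 5)) for _ in range(200)]
regret, worst_sign = corectron(contexts, u)

# Toy example from the text: one observed residual, lambda = 1.
g1 = np.array([1 / np.sqrt(2) - 1, 1 / np.sqrt(2)])
A1 = np.eye(2) + np.outer(g1, g1)
w2 = -np.linalg.solve(A1, g1)
```

No projection appears anywhere in the loop; the only per-round linear algebra is one solve and one rank-one update. For the toy example, the Sherman-Morrison formula gives the same vector in closed form, $\hat{w}_2 = -g_1 / (1 + \|g_1\|^2)$.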
Novelty & Lineage

Step 1 — Prior work:

  • Sakaue et al. (2025) achieved $O(d \log T)$ regret using ONS but required expensive Mahalanobis projections at each round
  • Besbes et al. (2021) and Gollapudi et al. (2021) used cutting-plane methods achieving $O(d \log T)$ bounds but with per-round time polynomial in $T$
  • Bärmann et al. (2017) introduced the setting with $O(\sqrt{T})$ regret using online gradient descent

Step 2 — Delta: This paper eliminates Mahalanobis projections while maintaining $O(d \log T)$ regret. Key insight: exploit “improperness” from scale invariance in contextual recommendation—since recommendations are invariant under positive rescaling of $\hat{w}_t$, the learner can use unbounded utility vectors while the user’s preference $u$ has fixed scale.

Step 3 — Theory-specific assessment:

  • The main theorem achieving $O(d \log T)$ regret without projections is somewhat surprising given that prior efficient methods required projections
  • The proof technique using cumulative residuals $\zeta_t$ instead of per-round quadratic forms is genuinely new and enables the projection-free analysis
  • The bounds match the state-of-the-art $O(d \log T)$ rates from Sakaue et al. (2025)

Verdict: SIGNIFICANT — clear algorithmic improvement with matching theoretical guarantees and novel proof technique exploiting problem structure.

Proof Techniques

The main proof strategy uses three key components:

  1. Sign condition (Lemma 1): For each iteration $t$, we have $\langle g_t, A_{t-1}^{-1} \zeta_{t-1} \rangle \leq 0$. This follows from optimality of $\hat{x}_t$ for $\hat{w}_t = -A_{t-1}^{-1} \zeta_{t-1}$ over $X_t$.

  2. Cumulative-potential–elliptical-potential inequality (Lemma 2): The sign condition enables bounding the cumulative potential:

    \[\Phi_T = \langle \zeta_T, A_T^{-1} \zeta_T \rangle \leq \sum_{t=1}^T \langle g_t, A_t^{-1} g_t \rangle\]

    Key inequality using Sherman-Morrison formula:

    \[A_t^{-1} = A_{t-1}^{-1} - \frac{A_{t-1}^{-1} g_t \otimes A_{t-1}^{-1} g_t}{1 + \langle g_t, A_{t-1}^{-1} g_t \rangle}\]
  3. Elliptical potential lemma (Lemma 3): Standard technique showing:

    \[\sum_{t=1}^T \langle g_t, A_t^{-1} g_t \rangle \leq \log \det(I_T + \lambda^{-1} K_T)\]

    where $K_T = (\langle g_i, g_j \rangle)_{i,j=1}^T$ is the Gram matrix of residuals.

    Final bound combines these via Cauchy-Schwarz:

    \[R_T(u) = -\langle u, \zeta_T \rangle \leq \|u\|_{A_T} \sqrt{\Phi_T} \leq \|u\|_{A_T} \sqrt{\log \det(I_T + \lambda^{-1} K_T)}\]
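The Sherman-Morrison recursion and the elliptical potential bound can be checked numerically. The sketch below uses randomly drawn residuals as stand-ins for $g_1, \ldots, g_T$ (an assumption; these do not come from running the algorithm, so the sign condition of Lemma 1 is not exercised here). It also confirms that the Gram-matrix determinant agrees with the $d \times d$ form $\log\det(\lambda^{-1} A_T)$, which follows from Sylvester's determinant identity.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, lam = 50, 4, 1.0
G = rng.normal(size=(T, d))          # hypothetical residuals, one per row

A_inv = np.eye(d) / lam              # A_0^{-1}
potential_sum = 0.0
for g in G:
    Ag = A_inv @ g
    # Sherman-Morrison: rank-one update of the inverse, no fresh inversion.
    A_inv = A_inv - np.outer(Ag, Ag) / (1.0 + g @ Ag)
    potential_sum += g @ A_inv @ g   # <g_t, A_t^{-1} g_t>

A_T = lam * np.eye(d) + G.T @ G
K_T = G @ G.T                        # Gram matrix of residuals
logdet_gram = np.linalg.slogdet(np.eye(T) + K_T / lam)[1]
logdet_primal = np.linalg.slogdet(A_T / lam)[1]
```

In finite dimension the right-hand side is at most $d \log(1 + \sum_t \|g_t\|^2 / (d\lambda))$, which is where the $O(d \log T)$ rate comes from when the residuals are bounded.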
Experiments & Validation

Experiments on contextual $m$-out-of-$n$ problems with $n=10$, $m=5$, $p=10$ comparing CoRectron against ONS, OGD, and kernelized variants. Key results:

  • Linear setting ($T=10,000$): CoRectron achieves best final regret, roughly half that of ONS and OGD
  • Kernel setting ($T=1,000$, RBF kernel with bandwidth $\theta \in \{0.5, 1.0, 2.0\}$): CoRectron consistently faster than ONS-based methods
  • Runtime improvements: CoRectron shows 2-10x speedup over ONS while maintaining better or comparable regret
  • Stability: CoRectron more robust to hyperparameter choices than projection-based methods

However, this is primarily theoretical work with limited experimental validation on synthetic problems only.

Limitations & Open Problems

Limitations:

  1. Requires exact linear optimization oracle over each feasible set $X_t$ - TECHNICAL (standard assumption in the field but may be computationally hard for complex constraint sets)
  2. In kernelized implementation, per-round cost is still $O(t^2)$ - RESTRICTIVE (limits scalability to very long horizons)
  3. Analysis assumes weakly compact action sets - NATURAL (standard in optimization literature)
  4. Boundedness assumption on utility differences - NATURAL (prevents trivial unbounded regret)

Open problems:

  1. Develop approximation schemes for the linear optimization oracle while maintaining logarithmic regret guarantees
  2. Design sketching or compression techniques for kernelized implementation to achieve sublinear per-round cost in $t$

Stepwise Variational Inference with Vine Copulas

Authors: Elisabeth Griesbauer, Leiv Rønneberg, Arnoldo Frigessi, Claudia Czado et al. (5 authors) · Institution: University of Oslo · Category: stat.ML

Proposes stepwise vine copula variational inference with automatic truncation selection, eliminating the need to pre-specify variational family complexity.

Tags: variational_inference copula_models vine_copulas automatic_model_selection gaussian_processes renyi_divergence stepwise_optimization


Problem Formulation
  1. Motivation: Variational inference (VI) approximates intractable posteriors using tractable parametric families. Standard approaches require pre-specifying complexity hyperparameters (truncation levels, covariance structure, etc.), which is difficult without domain knowledge. The choice determines whether the model captures key posterior dependencies or becomes over-parametrized.

  2. Mathematical setup: Let $Z \in \mathbb{R}^d$ be latent variables, $x$ be observed data, $\pi(z)$ be the prior, and $p(x|z)$ be the likelihood. The true posterior is:
    \[p(z|x) = \frac{p(x|z)\pi(z)}{\int p(x|z')\pi(z')dz'}\]

    The variational distribution is a D-vine copula:

    \[q(z; \lambda, \eta) := \prod_{j=1}^d q_j(z_j; \lambda_j) \cdot \prod_{t=1}^{\tau} c_t(u(z; \lambda, \eta_1, \ldots, \eta_{t-1}); \eta_t)\]

    where $\lambda$ are marginal parameters, $\eta_t$ are tree-$t$ copula parameters, and $\tau$ is the (unknown) truncation level.

    Assumptions:

    1. D-vine structure and pair copula families are pre-specified
    2. Simplifying assumption: conditional copulas don’t depend on conditioning variables
    3. Reparameterization is available for all marginals
  3. Toy example: For $d=3$ with Gaussian marginals and Gaussian copulas, tree 1 models pairs $(Z_1, Z_2)$ and $(Z_2, Z_3)$ with correlations $\rho_{12}, \rho_{23}$. Tree 2 models $(Z_1, Z_3 \mid Z_2)$ with partial correlation $\rho_{13|2}$. When all correlations are small ($|\rho| < 0.1$), the method should stop at the mean-field level.
  4. Formal objective: Minimize Rényi $\alpha$-divergence via the VR-IWAE bound:

    \[\max_{\lambda,\eta} l_N^{(\alpha)}(\lambda,\eta; x) = \frac{1}{1-\alpha} \int \prod_{i=1}^N q(z_i; \lambda,\eta) \log\left(\frac{1}{N}\sum_{k=1}^N \left[\frac{p(x,z_k)}{q(z_k; \lambda,\eta)}\right]^{1-\alpha}\right) dz_{1:N}\]
Method

The method optimizes vine copula parameters tree-by-tree using a stepwise procedure:

Algorithm:

  1. Tree 0 (Mean-field): Optimize marginal parameters $\lambda$ until convergence (measured by $\hat{R} < R$)

  2. Tree $t$ ($t = 1, 2, \ldots$): Fix previous parameters and optimize current tree parameters $\eta_t$ via:

    \[\hat{\eta}_t \leftarrow \hat{\eta}_t + \gamma \widehat{\nabla}_{\eta_t} l_N^{(\alpha)}(\hat{\eta}_t; x)\]
  3. Global stopping: If all pair copulas in tree $t$ have $|\rho| < 0.1$, stop and use $\tau = t-1$ truncated vine
  4. Local stopping: Use $\hat{R}$ statistic from MCMC diagnostics to detect parameter convergence

    Key equations: The reparameterized gradient estimator is:

    \[\widehat{\nabla}_\phi l_N^{(\alpha)}(\phi; x) = \sum_{j=1}^N \frac{w_\phi(g(\epsilon_j, \phi))^{1-\alpha}}{\sum_{k=1}^N w_\phi(g(\epsilon_k, \phi))^{1-\alpha}} \nabla_\phi \log w_\phi(g(\epsilon_j, \phi))\]

    where $w_\phi(z) = p(x,z)/q(z;\phi)$.
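The self-normalized weights in the gradient estimator, and the Monte Carlo form of the bound itself, are easy to get wrong numerically because the raw weights $w_\phi$ can underflow. A minimal sketch, assuming the log-weights $\log w_\phi(z_k)$ have already been computed elsewhere; the function names are illustrative and not from the paper's code.

```python
import numpy as np

def vr_iwae_bound(log_w, alpha):
    """Single-batch estimate (1/(1-alpha)) * log(mean_k w_k^(1-alpha)), alpha != 1."""
    scaled = (1.0 - alpha) * np.asarray(log_w, dtype=float)
    m = scaled.max()
    log_mean = m + np.log(np.mean(np.exp(scaled - m)))   # stable log-mean-exp
    return log_mean / (1.0 - alpha)

def normalized_weights(log_w, alpha):
    """Weights w_j^(1-alpha) / sum_k w_k^(1-alpha) from the gradient estimator."""
    scaled = (1.0 - alpha) * np.asarray(log_w, dtype=float)
    e = np.exp(scaled - scaled.max())                    # shift avoids underflow
    return e / e.sum()

log_w = np.array([-1200.0, -1201.0, -1203.5])            # hypothetical log-weights
weights = normalized_weights(log_w, alpha=0.1)
bound = vr_iwae_bound(log_w, alpha=0.1)
```

The full gradient estimate would multiply each weight by the per-sample gradient $\nabla_\phi \log w_\phi(g(\epsilon_j, \phi))$, which in practice comes from an automatic differentiation library and is omitted here.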

    Toy example application: For the 3D Gaussian case, the method first fits marginal means/variances, then optimizes $\rho_{12}, \rho_{23}$ in tree 1. If these correlations are small, it stops; otherwise it proceeds to optimize $\rho_{13|2}$ in tree 2.
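For Gaussian pair copulas, the tree-$t$ parameters of a D-vine with variable order $1, \ldots, d$ are the partial correlations of $(Z_i, Z_{i+t})$ given the variables between them. The sketch below reads these off a known correlation matrix and applies the global stopping rule. It is only an illustration of the truncation logic: the paper obtains the parameters by variational optimization, not from a known matrix, and the helper names are hypothetical.

```python
import numpy as np

def dvine_partial_corr(R, i, j):
    """Partial correlation of Z_i and Z_j given the variables strictly between them."""
    idx = np.arange(i, j + 1)
    P = np.linalg.inv(R[np.ix_(idx, idx)])    # precision matrix of the sub-block
    return -P[0, -1] / np.sqrt(P[0, 0] * P[-1, -1])

def truncation_level(R, threshold=0.1):
    """Return tau = t - 1 for the first tree t whose |rho| values are all below threshold."""
    d = R.shape[0]
    for t in range(1, d):                     # tree t pairs (i, i + t)
        rhos = [dvine_partial_corr(R, i, i + t) for i in range(d - t)]
        if max(abs(r) for r in rhos) < threshold:
            return t - 1
    return d - 1                              # no tree was negligible: full vine

# Toy case d = 3: build R from rho_12, rho_23 and the partial correlation rho_13|2.
r12, r23, r13_2 = 0.5, 0.4, 0.3
r13 = r12 * r23 + r13_2 * np.sqrt((1 - r12**2) * (1 - r23**2))
R3 = np.array([[1.0, r12, r13], [r12, 1.0, r23], [r13, r23, 1.0]])

# AR(1)-type correlation: all dependence is captured by tree 1.
d = 4
R_ar1 = 0.6 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
tau_ar1 = truncation_level(R_ar1)
tau_indep = truncation_level(np.eye(d))
```

An independent target stops at the mean-field level ($\tau = 0$), while the Markov-structured AR(1) target needs exactly one tree.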
Novelty & Lineage

Prior work:

  1. Tran et al. (2015): “Copula Variational Inference” - first to use vine copulas for VI, but optimizes all parameters simultaneously and requires pre-specified truncation
  2. Chi et al. (2022): “Fast Copula Variational Inference” - alternates between mean-field and vine optimization, still requires fixed truncation level

Delta: This paper adds three key innovations:

  1. Stepwise estimation: Tree-by-tree optimization following vine structure (rather than simultaneous)
  2. Automatic truncation: Global stopping criterion eliminates need to pre-specify complexity
  3. Rényi divergence: Shows backward KL cannot recover true parameters; uses VR-IWAE bound instead

    Theory-specific assessment:

    • Main theorems (3.1-3.2) are predictable: It’s well-known that backward KL has mode-seeking behavior and struggles with correlation structure
    • Proof technique is routine: Standard matrix calculus for Gaussian case
    • No tightness analysis: No lower bounds provided for the approximation quality
    • The global stopping criterion ($|\rho| < 0.1$) is heuristic without theoretical justification

    The stepwise procedure is the main novelty, but it’s a natural adaptation of existing vine fitting methods to the VI setting. The automatic complexity selection is useful but incremental.

    Verdict: INCREMENTAL — Solid engineering contribution that combines existing techniques (vines + Rényi VI) with natural stepwise fitting, but lacks fundamental theoretical insights.

Proof Techniques

The main theoretical results use standard Gaussian distribution theory:

Theorem 3.1 (Forward KL recovers truth):

  1. Set up KL minimization for multivariate Gaussians:

    \[\min_{\nu,\Psi} \text{KL}(N(\mu,\Sigma)||N(\nu,\Psi))\]
  2. Apply matrix calculus to the Gaussian KL formula:

    \[\text{KL}(N(\mu,\Sigma)||N(\nu,\Psi)) = \frac{1}{2}[(\mu-\nu)^T\Psi^{-1}(\mu-\nu) + \text{tr}(\Psi^{-1}\Sigma) - \log|\Psi^{-1}\Sigma| - d]\]
  3. Set derivatives to zero, yielding $\nu = \mu$ and $\Psi = \Sigma$

Theorem 3.2 (Backward KL fails):

  1. For backward KL $\text{KL}(N(\nu,\Psi)||N(\mu,\Sigma))$, the closed form is:

    \[\text{KL}(q||p) = \frac{1}{2}[(\nu-\mu)^T\Sigma^{-1}(\nu-\mu) + \text{tr}(\Sigma^{-1}\Psi) - \log|\Sigma^{-1}\Psi| - d]\]
  2. Setting gradient w.r.t. $\nu$ to zero gives $\nu = \mu$ (mean is recovered)

  3. For covariance, the stepwise procedure constrains the optimization. The key insight is that when optimizing tree-by-tree, the backward KL objective becomes:

    \[\min_{\eta_t} \mathbb{E}_{q(\text{previous trees})}[\text{KL}(\text{current tree}||\text{true conditional})]\]
  4. This conditional KL structure prevents recovery of true correlations unless they are zero

    The proofs are routine applications of matrix differential calculus (Magnus & Neudecker, 2019) and standard Gaussian identities. No novel proof techniques are introduced.
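The closed-form Gaussian KL used in both theorems can be evaluated directly. The sketch below implements it and then illustrates a related, classical fact: when backward KL is minimized over diagonal covariances only, the optimal variances are $1/(\Sigma^{-1})_{ii}$, which are smaller than the true marginal variances. This is the textbook mean-field case and is offered as intuition only; it is not claimed to be the exact statement of the paper's Theorem 3.2, and the target covariance is a made-up example.

```python
import numpy as np

def gaussian_kl(mu0, S0, mu1, S1):
    """KL(N(mu0, S0) || N(mu1, S1)) via the closed form quoted above."""
    d = mu0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    logdet = np.linalg.slogdet(S1_inv @ S0)[1]
    return 0.5 * (diff @ S1_inv @ diff + np.trace(S1_inv @ S0) - logdet - d)

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])    # hypothetical correlated target

# Backward KL, q restricted to diagonal covariance: stationarity gives
# variances 1 / (Sigma^{-1})_ii.
precision = np.linalg.inv(Sigma)
Psi_mf = np.diag(1.0 / np.diag(precision))
kl_mf = gaussian_kl(mu, Psi_mf, mu, Sigma)

# Using the true marginal variances instead gives a larger backward KL.
Psi_marginal = np.diag(np.diag(Sigma))
kl_marginal = gaussian_kl(mu, Psi_marginal, mu, Sigma)
```

With correlation 0.8 the backward-KL-optimal diagonal variances are 0.36 instead of 1, so a mean-field first step fitted this way starts from marginals that are already too narrow.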

Experiments & Validation

Simulated Examples:

  1. MF Recovery: 4D regression where true posterior has independent components. Method correctly stops at tree 0.
  2. Needle Example: 4D regression with high correlation matrix. Shows vine captures dependencies that mean-field misses.

    Real Application:

    • Sparse Gaussian Process: Pumadyn32nm dataset (7168 train, 1024 test, 32 features)
    • Metrics: RMSE and negative log-predictive density (NLPD)
    • Key result: Method interpolates between mean-field and full-rank SGPR in NLPD performance

    Baselines: Mean-field VI (MFVI), Gaussian Copula VI (GC-VI), Masked Autoregressive Flows (MAF), NUTS as ground truth

    Limitations:

    • Only 2 synthetic examples, 1 real application
    • GP experiment shows “only small improvements past tree one”
    • Global stopping criterion failed in GP case (didn’t trigger until $t=46$)
    • No computational cost comparison
    • Limited to D-vine structure only

    Missing validations: No comparison of approximation quality vs. computational cost trade-offs, no analysis of when the method helps most.

Limitations & Open Problems

Limitations:

  1. D-vine structure assumption - RESTRICTIVE: Only considers path-like dependence graphs, may miss important conditional independence structure
  2. Pre-specified pair copula families - TECHNICAL: Could be learned but authors don’t attempt this
  3. Global stopping threshold $|\rho| < 0.1$ - TECHNICAL: Heuristic choice without theoretical justification
  4. Gaussian copula focus - NATURAL: Most experiments use Gaussian copulas, limiting flexibility gains
  5. Simplifying assumption for conditional copulas - TECHNICAL: Assumes copulas don’t depend on conditioning values
  6. $\alpha = 0.1$ fixed choice - TECHNICAL: Limited exploration of $\alpha$ parameter sensitivity

Open Problems:

  1. Theoretical stopping criterion: Derive principled threshold for global stopping based on approximation error rather than correlation magnitude
  2. Vine structure learning: Develop methods to automatically learn the tree structure rather than pre-specifying D-vine order

Interpreting the Synchronization Gap: The Hidden Mechanism Inside Diffusion Transformers

Authors: Emil Albrychiewicz, Andrés Franco Valiente, Li-Ching Chen, Viola Zixin Zhao · Institution: UC Berkeley · Category: cs.LG

Shows how the synchronization gap from continuous diffusion theory manifests in Diffusion Transformers through spatially-selective attention routing that resolves global structure before local details.

Tags: diffusion-models transformer-interpretability synchronization-phenomena mean-field-theory attention-mechanisms generative-modeling phase-transitions


Problem Formulation
  1. Motivation: Diffusion Transformers (DiTs) have achieved SOTA results in generative modeling, but how they resolve generative ambiguity during the reverse process remains poorly understood. Recent theoretical work on coupled Ornstein-Uhlenbeck processes predicts a “synchronization gap” between modes that commit at different stages, but it’s unclear how this manifests in discrete DiT architectures.

  2. Mathematical setup: Consider two replica trajectories $z^A_t, z^B_t \in \mathbb{R}^{d_z}$ following coupled reverse diffusion:

    \[dz^A_t = [f(z^A_t, t) + g(z^A_t - z^B_t)]dt + \sqrt{\beta_t}d\bar{W}^A_t\] \[dz^B_t = [f(z^B_t, t) + g(z^B_t - z^A_t)]dt + \sqrt{\beta_t}d\bar{W}^B_t\]

    Define common and difference modes:

    \[u_t = \frac{z^A_t + z^B_t}{\sqrt{2}}, \quad v_t = \frac{z^A_t - z^B_t}{\sqrt{2}}\]

    For DiT architecture, embed two replicas into joint token sequence:

    \[X_t = [X^A_t; X^B_t] \in \mathbb{R}^{2N \times d_{model}}\]

    Assumptions:

    1. Replicas are initialized with antisymmetric perturbation preserving marginal variance
    2. Local difference distribution follows symmetric two-component Gaussian mixture
    3. Spatial routing dominates pattern modulation for low-frequency modes
    4. Branch separation amplitude propagates multiplicatively
  3. Toy example: When $d_z = 2$ with two modes having different routing gains $\chi_{k_{hi}} > \chi_{k_{lo}}$, the synchronization gap arises because global modes (high routing gain) speciate before local modes (low routing gain).

  4. Formal objective: Measure the synchronization gap between leading and trailing modes:

    \[\Delta s_v(\ell; g) := s^{(k_{lo})}_{spec}(\ell; g) - s^{(k_{hi})}_{spec}(\ell; g)\]
Method

The method constructs an explicit architectural realization of replica coupling through symmetric cross-attention gates.

Key steps:

  1. Embed two generation trajectories into joint token sequence with block-wise attention structure
  2. Implement symmetric coupling via normalized mixture:

    \[\text{Attn}_g(X) = \frac{1}{1+g}\left[\begin{bmatrix}A_{AA}V^A \\ A_{BB}V^B\end{bmatrix} + g\begin{bmatrix}A_{AB}V^B \\ A_{BA}V^A\end{bmatrix}\right]\]
  3. Linearize attention difference around symmetric state:

    \[\text{Attn}^A_g - \text{Attn}^B_g = \frac{1-g}{1+g} \cdot 2A_0\delta h + \frac{1}{1+g} \cdot 2[\delta A^{(+)} + g\delta A^{(-)}]V_0\]
  4. Model local distribution as symmetric two-component Gaussian mixture and derive fixed point condition:

    \[u_k = \kappa_{v,k}(s,\ell;g)\tanh(u_k)\]
  5. Extract modewise signal-to-noise ratio:

    \[\text{SNR}_{v,k}(s,\ell;g) = \frac{m_k^2\mu_k^2}{\gamma_{s,\ell}\mu_k - \lambda^{MLP}_k - \rho(g)\chi_k - \xi(g)\pi_k}\]

    For the toy example: With two modes having routing gains $\chi_{k_{hi}} > \chi_{k_{lo}}$, the low-gain (local) mode has the lower SNR and therefore speciates later than the high-gain (global) mode, which creates the synchronization gap.
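The gated mixture in step 2 can be written out for a single attention head. The sketch below makes two simplifying assumptions that are not from the paper: queries, keys, and values are the raw token matrices (no learned projections), and each attention block is softmax-normalized separately. It checks two properties implied by the formula: at $g = 0$ the replicas decouple, and at the replica-symmetric state both outputs coincide with ordinary self-attention for any $g$.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def coupled_attention(XA, XB, g):
    """Symmetric gated attention for two replicas (Q = K = V = X is an assumption)."""
    scale = np.sqrt(XA.shape[-1])
    A_AA = softmax(XA @ XA.T / scale)         # within-replica blocks
    A_BB = softmax(XB @ XB.T / scale)
    A_AB = softmax(XA @ XB.T / scale)         # cross-replica blocks
    A_BA = softmax(XB @ XA.T / scale)
    out_A = (A_AA @ XA + g * (A_AB @ XB)) / (1.0 + g)
    out_B = (A_BB @ XB + g * (A_BA @ XA)) / (1.0 + g)
    return out_A, out_B

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # hypothetical tokens, replica A
Y = rng.normal(size=(4, 8))                   # hypothetical tokens, replica B

self_attn = softmax(X @ X.T / np.sqrt(8)) @ X
sym_A, sym_B = coupled_attention(X, X, 0.7)   # replica-symmetric state
dec_A, _ = coupled_attention(X, Y, 0.0)       # decoupled replicas
```

The $1/(1+g)$ normalization is what keeps the symmetric state unchanged as $g$ varies, so any dependence on $g$ enters only through the difference between replicas.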

Novelty & Lineage

Step 1 — Prior work:

  • Franco et al. (2024): “Coupled diffusion processes” - showed synchronization gap in continuous OU processes with symmetric coupling
  • Saremi & Jastrzebski (2023): “Diffusion speciation” - identified phase transitions in reverse diffusion using replica analysis
  • Luo et al. (2023): “Free entropy criteria” - extended speciation to general class structures

Step 2 — Delta: This work bridges continuous theoretical predictions to discrete DiT architectures by:

  1. constructing explicit attention-gated coupling mechanism
  2. deriving linearized attention difference decomposition
  3. validating gap existence and collapse empirically in pretrained models.

    Step 3 — Theory-specific assessment:

    • Main result is predictable extension of continuous theory to discrete architecture
    • Proof technique assembles known tools: linearization around symmetric state, mean-field bifurcation analysis, empirical mode decomposition
    • No lower bounds provided; gap scaling $O(\frac{1-g}{1+g})$ follows from routing term analysis
    • Empirical validation confirms theoretical predictions but doesn’t reveal surprising phenomena

    The theoretical framework is routine application of mean-field methods to attention mechanisms. The architectural mapping from OU coupling to symmetric cross-attention is clever but straightforward.

    Verdict: INCREMENTAL — solid theoretical bridge from continuous to discrete setting with thorough empirical validation, but no fundamental surprises.

Proof Techniques

Main proof strategy uses linearized mean-field analysis with empirical mode decomposition:

  1. Symmetric state linearization: Expand attention difference around replica-symmetric point $H^A = H^B = H_0$:

    \[\text{Attn}^A_g - \text{Attn}^B_g = \frac{1-g}{1+g}R_\ell h + \frac{1}{1+g}P_\ell(g)h + O(\|h\|^2)\]
  2. Key decomposition: Split linear response into spatial routing ($R_\ell$) and pattern modulation ($P_\ell$) terms with different coupling dependence.

  3. Propagator construction: Combine attention and MLP contributions:

    \[K_g = I + J^{MLP}_0 + \rho(g)R + \xi(g)P_g\]

    where $\rho(g) = \frac{1-g}{1+g}$ and $\xi(g) = \frac{1}{1+g}$.

  4. Mean-field bifurcation analysis: Model local distribution as symmetric mixture, derive scalar self-consistency:

    \[u_k = \kappa_{v,k}\tanh(u_k)\]

    with speciation parameter:

    \[\kappa_{v,k} = \frac{\gamma m_k^2}{c_k[(1-\eta_k)c_k + \gamma]}\]
  5. Modal SNR formula: Project onto empirical eigenmodes $\{r_k\}$ and establish routing dominance bound (Appendix B) showing $|\pi_k| \ll |\chi_k|$ for low-frequency modes.
  6. Gap collapse derivation: Under routing dominance, SNR difference scales as $O(\rho(g)) = O(\frac{1-g}{1+g})$, vanishing as $g \to 1$.
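The scalar self-consistency equation $u_k = \kappa_{v,k}\tanh(u_k)$ has a nonzero solution only when $\kappa_{v,k} > 1$, which is what makes $\kappa_{v,k}$ act as a speciation threshold. A minimal sketch of solving it by fixed-point iteration, together with the two coupling factors $\rho(g)$ and $\xi(g)$ defined in step 3; the function names and tolerances are illustrative choices.

```python
import math

def speciation_amplitude(kappa, tol=1e-12, max_iter=10_000):
    """Largest non-negative solution of u = kappa * tanh(u)."""
    if kappa <= 1.0:
        return 0.0                 # only the symmetric solution u = 0 exists
    u = kappa                      # start above the fixed point; iterates decrease
    for _ in range(max_iter):
        u_next = kappa * math.tanh(u)
        if abs(u_next - u) < tol:
            return u_next
        u = u_next
    return u

def rho(g):
    """Routing factor (1 - g) / (1 + g)."""
    return (1.0 - g) / (1.0 + g)

def xi(g):
    """Pattern-modulation factor 1 / (1 + g)."""
    return 1.0 / (1.0 + g)

u_below = speciation_amplitude(0.5)
u_above = speciation_amplitude(2.0)
```

Since $\rho(g)$ vanishes at $g = 1$ while $\xi(g)$ only drops to $1/2$, the routing contribution that separates the modes disappears at full coupling, consistent with the gap collapse described in step 6.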
Experiments & Validation

Experiments on pretrained DiT-XL/2 using two complementary protocols:

Protocol I (Behavioral Commitment): Initialize replica pairs with antisymmetric perturbation, couple for $t_{int}$ steps, then evolve independently. Measure final output agreement via ResNet-50 feature similarity and scale-dependent pixel discrepancies. Extract speciation time $\tau_{spec}(g)$ from sigmoid fits.

Protocol II (Internal Mode Tracking): Track hidden state difference energies across all 28 Transformer layers at speciation time. Measure normalized energies of leading vs trailing empirical modes using fixed basis decomposition.

Key findings:

  • Gap exists at $g=0$ (intrinsic property)
  • Gap collapses as $g \to 1$ (theoretical prediction confirmed)
  • Gap localized to final ~5 layers
  • Global structures commit before local details across all coupling strengths

Datasets: ImageNet-based experiments with variance-preserving initialization. Aggregated over multiple paired seeds for statistical robustness.

Baselines: Compared coupled vs decoupled trajectories, swept coupling strength $g \in [0,1]$, analyzed scale decomposition via adaptive pooling.

Limitations & Open Problems

Limitations:

  1. TECHNICAL: Linearization around symmetric state assumes small perturbations - may break down in strongly nonlinear regimes or with large initial differences.

  2. TECHNICAL: Mean-field modal decoupling assumes empirical modes remain approximately orthogonal eigenvectors of difference covariance - violated when strong cross-mode couplings develop.

  3. NATURAL: Requires pretrained DiT architecture - results may not generalize to other diffusion model architectures (U-Net, other transformer variants).

  4. RESTRICTIVE: Symmetric two-component Gaussian mixture assumption for local distribution - real data distributions likely more complex with multiple modes.

  5. TECHNICAL: Score gain parameter $\gamma_{s,\ell}$ treated phenomenologically - not derived from first principles or measured directly.

  6. NATURAL: Spatial routing dominance assumption $|\pi_k| \ll |\chi_k|$ only proven for low-frequency modes - may not hold across full spectrum.

Open problems:

  1. Extend analysis to asymmetric coupling and multimodal generation scenarios
  2. Derive principled estimate of score gain $\gamma_{s,\ell}$ from network activations rather than treating it as effective parameter