Apr 8, 2026 Theory 3 papers

Theory Digest — Apr 8, 2026

Today’s Digest at a Glance

Today’s digest explores stochastic optimization dynamics in neural networks, optimal control with constraints in insurance mathematics, and the connection between discrete and continuous optimal transport formulations.

Deep Linear Networks and Saddle-to-Saddle Dynamics

Deep linear networks (DLNs) are neural networks without nonlinear activation functions, where each layer performs a simple linear transformation. Despite their apparent simplicity, DLNs exhibit rich training dynamics that serve as a theoretical testbed for understanding feature learning in deeper architectures. The key challenge is that the loss landscape contains numerous saddle points corresponding to different factorizations of the same linear map, leading to complex “saddle-to-saddle” trajectories during optimization.

The saddle-to-saddle regime refers to the phenomenon where gradient-based training in DLNs follows extended trajectories that pass near multiple saddle points before converging. Each saddle corresponds to a different low-rank factorization, and the network must navigate between these saddles to learn features in a specific order determined by the data’s singular value decomposition. This creates a hierarchical learning process where features are acquired sequentially rather than simultaneously.

Analyzing stochastic gradient descent (SGD) in this setting requires decomposing the anisotropic noise structure that arises from finite batch sampling. The gradient noise covariance matrix captures how randomness affects different modes of the parameter space differently, with a noise strength that depends on each mode’s singular value and current amplitude. Intuitively, SGD acts like a noisy version of gradient flow where the noise can occasionally “kick” the optimization trajectory away from a saddle, potentially accelerating or hindering the saddle-to-saddle transitions.

Hamilton-Jacobi-Bellman Variational Inequalities for Jump-Process Control

Optimal control problems with state constraints often lead to Hamilton-Jacobi-Bellman (HJB) equations that take the form of variational inequalities rather than standard PDEs. This occurs when the control problem involves constraints like non-negativity of dividends, capital injection decisions, or other boundary conditions that create regions where different optimality conditions apply.

In insurance mathematics, the surplus process (assets minus liabilities) follows a compound Poisson process with drift, combining continuous premium collection with random claim jumps. The challenge is to simultaneously choose dividend payment rates and capital injection timing to maximize expected discounted dividends minus injection costs. This creates a free boundary problem where the optimal policy switches between different regimes (pay dividends, inject capital, or do nothing) depending on the current surplus level.

The HJB variational inequality encodes these regime-switching decisions through complementarity conditions: $\min\{\mathcal{L}^c v - \mathcal{T}v + h - c, \ell - v_x, -v_c\} = 0$, where the three terms correspond, in order, to the optimality equation at the current dividend rate, the capital injection constraint $v_x \leq \ell$, and the dividend ratcheting constraint $v_c \leq 0$. The jump integral operator $\mathcal{T}$ accounts for the discontinuous claim arrivals that can instantly change the surplus level. Constructing strong solutions requires sophisticated approximation techniques that preserve both the variational structure and the jump discontinuities.

Reading Guide

The first paper extends continuous-time gradient flow theory to the stochastic setting, showing how noise affects but doesn’t fundamentally change the sequential feature learning in deep linear networks. The second paper tackles a different type of stochastic control problem, constructing solutions for optimal dividend and capital policies under jump-driven surplus dynamics. The third paper connects these themes by providing multiple equivalent characterizations of martingale Schrödinger bridges, linking discrete-time optimal transport with continuous stochastic processes through entropic regularization principles.

Stochastic Gradient Descent in the Saddle-to-Saddle Regime of Deep Linear Networks

Authors: Guillaume Corlouer, Avi Semler, Alexander Strang, Alexander Gietelink Oldenziel · Institution: Moirai, University of Oxford, University of California Berkeley, Iliad · Category: cs.LG

Extends gradient flow analysis to SGD in deep linear networks, showing that noise-induced diffusion peaks predict when features will be learned but that the noise does not fundamentally alter saddle-to-saddle dynamics.

Tags: deep linear networks stochastic gradient descent feature learning saddle-to-saddle dynamics mode decomposition anisotropic noise SDE analysis


Problem Formulation

Motivation: Deep linear networks (DLNs) serve as analytically tractable models for understanding the training dynamics of deep neural networks. While gradient descent in DLNs exhibits well-understood saddle-to-saddle dynamics where modes are learned sequentially, the impact of SGD noise on this regime remains poorly understood, limiting our theoretical understanding of feature learning.
Mathematical setup: Consider a depth-$L$ deep linear network $f(x) = W_L W_{L-1} \cdots W_1 x$ with weight matrices $W_l \in \mathbb{R}^{d_l \times d_{l-1}}$. Data is generated by a teacher matrix $M \in \mathbb{R}^{d_L \times d_0}$ via $Y = MX + \xi_q$ where $X \sim \mathcal{N}(0, I_{d_0})$ and $\xi_q \sim \mathcal{N}(0, \sigma_q I)$ is label noise. The continuous-time SGD dynamics are modeled as the SDE:
\[d\theta_t = -g(\theta_t) dt + \sqrt{\eta} \Sigma_b(\theta_t) dW_t\]
Define the teacher-student gap $\Delta := M - W$ and mode amplitude $w_\alpha := u_\alpha^T W v_\alpha$ where $(u_\alpha, v_\alpha)$ are left/right singular vectors of $M$. The assumptions are:
1. Whitened Gaussian inputs: $X \sim \mathcal{N}(0, I_{d_0})$
2. Online learning: batches sampled from data distribution
3. Balanced weights: $W_{l+1}^T W_{l+1} = W_l W_l^T$ for all $l, t$
4. Aligned modes: cross-mode amplitudes $w_{\alpha\beta} = 0$ for $\alpha \neq \beta$
Toy example: When $d_0 = d_L = 2$ and $M = \text{diag}(2, 1)$, this reduces to learning two modes with singular values $s_1 = 2, s_2 = 1$. The larger mode should be learned first with faster dynamics, followed by the smaller mode.
Formal objective: The main quantity to analyze is the mode-wise diffusion coefficient:
\[D_\alpha(w_\alpha) = \beta \frac{L^2}{2} \left[(s_\alpha - w_\alpha)^2 + \sigma_q^2\right] w_\alpha^{\frac{4(L-1)}{L}}\]
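The sequential learning claimed for the toy example can be checked with a small noise-free experiment. The following is a minimal sketch, not the paper's code: it assumes depth $L = 2$, a balanced small initialization, and full-batch gradient descent on the population loss $\frac{1}{2}\|M - W_2 W_1\|_F^2$ (which is what whitened inputs give in expectation). The names `init_scale`, `lr`, and `n_steps` are illustrative choices.

```python
import numpy as np

# Teacher from the toy example: singular values s_1 = 2, s_2 = 1.
M = np.diag([2.0, 1.0])
s = np.diag(M)

# Assumptions (not from the paper): depth L = 2, balanced small initialization,
# full-batch gradient descent on the population loss 0.5 * ||M - W2 W1||_F^2.
init_scale, lr, n_steps = 1e-3, 0.01, 2000
W1 = init_scale * np.eye(2)
W2 = init_scale * np.eye(2)

amplitudes = np.empty((n_steps, 2))
for step in range(n_steps):
    gap = M - W2 @ W1                      # teacher-student gap Delta = M - W
    grad_W1 = -W2.T @ gap                  # gradient of the loss w.r.t. W1
    grad_W2 = -gap @ W1.T                  # gradient of the loss w.r.t. W2
    W1, W2 = W1 - lr * grad_W1, W2 - lr * grad_W2
    # Mode amplitudes w_alpha; the singular vectors of M are the standard basis here.
    amplitudes[step] = np.diag(W2 @ W1)

# First step at which each mode reaches half of its singular value.
half_time = [int(np.argmax(amplitudes[:, k] >= 0.5 * s[k])) for k in range(2)]
print("steps to half-learn each mode:", half_time)
print("final amplitudes:", amplitudes[-1])
```

Under these assumptions each amplitude follows a sigmoidal curve, and the mode with the larger singular value crosses its halfway point first.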

Method

The method decomposes the anisotropic Langevin dynamics of SGD into modewise SDEs.

Key steps:

Derive the gradient noise covariance matrix for DLNs:
\[\Sigma_{lm}(\theta) = J_l^T (I \otimes \Delta)(I_{d_0^2} + C)(I \otimes \Delta)^T J_m + (W_{<l}W_{<m}^T) \otimes (W_{>l}^T \Sigma_q W_{>m})\]
Under balance and alignment assumptions, show each mode evolves independently:
\[dw_\alpha = [\mu_\alpha^{\text{grad}}(w_\alpha) + \mu_\alpha^{\text{Ito}}(w_\alpha)] dt + \sqrt{D_\alpha(w_\alpha)} dB_{\alpha,t}\]
The drift components are:
\[\mu_\alpha^{\text{grad}}(w_\alpha) = (s_\alpha - w_\alpha) L w_\alpha^{\frac{2(L-1)}{L}}\]
\[\mu_\alpha^{\text{Ito}}(w_\alpha) = \beta (s_\alpha - w_\alpha) L(L-1) w_\alpha^{\frac{2(L-2)}{L}} + \beta L(L-1) w_\alpha^{\frac{2(L-2)}{L}} \sigma_q\]
The diffusion coefficient:
\[D_\alpha(w_\alpha) = \beta \frac{L^2}{2} \left[(s_\alpha - w_\alpha)^2 + \sigma_q^2\right] w_\alpha^{\frac{4(L-1)}{L}}\]
Applied to toy example: For $M = \text{diag}(2,1)$ and $L=2$, the diffusion for mode 1 peaks at $w_1^* = \frac{s_1}{2} = 1$ before the mode is fully learned at $w_1 = s_1 = 2$.
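To see how the modewise SDE behaves, one can integrate it with an Euler-Maruyama scheme. The sketch below uses the drift and diffusion expressions exactly as written above, for a single mode with $s_\alpha = 2$ and $L = 2$. Several things are assumptions for illustration only: the value of $\beta$ (treated here as a small noise-strength parameter), the step size `dt`, the initial amplitude `w0`, the absence of label noise, and the clipping at zero that keeps the amplitude non-negative.

```python
import numpy as np

def simulate_mode(s=2.0, L=2, beta=0.01, sigma_q=0.0, w0=1e-3,
                  dt=1e-3, n_steps=10_000, seed=0):
    """Euler-Maruyama integration of one modewise SDE (illustrative parameters)."""
    rng = np.random.default_rng(seed)
    w = np.empty(n_steps + 1)
    D = np.empty(n_steps + 1)
    w[0] = w0
    for k in range(n_steps + 1):
        wk = w[k]
        # Diffusion coefficient D_alpha(w_alpha) from the text.
        D[k] = beta * L**2 / 2 * ((s - wk)**2 + sigma_q**2) * wk**(4 * (L - 1) / L)
        if k == n_steps:
            break
        # Gradient drift and Ito correction from the text.
        mu_grad = (s - wk) * L * wk**(2 * (L - 1) / L)
        mu_ito = beta * L * (L - 1) * wk**(2 * (L - 2) / L) * ((s - wk) + sigma_q)
        dB = rng.normal(0.0, np.sqrt(dt))
        # Clipping at zero is a numerical safeguard, not part of the model.
        w[k + 1] = max(wk + (mu_grad + mu_ito) * dt + np.sqrt(D[k]) * dB, 0.0)
    return w, D

w, D = simulate_mode()
peak_step = int(np.argmax(D))
learned_step = int(np.argmax(w >= 0.95 * 2.0))
print("diffusion peaks at step", peak_step, "with w =", w[peak_step])
print("mode reaches 95% of s at step", learned_step)
```

In this sketch the diffusion coefficient is largest while the amplitude is still near $s_\alpha / 2$, which is the sense in which the diffusion peak comes before the mode is learned.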

Novelty & Lineage

Step 1 — Prior work: The closest papers are:

Saxe et al. (2013): “Exact solutions to the nonlinear dynamics of learning in deep linear networks” - solved gradient flow dynamics exactly, showing stagewise learning
Pesme et al. (2021): “Implicit bias of SGD for diagonal linear networks” - analyzed SGD in diagonal networks, showed sparsity bias
Jacot et al. (2021): “Saddle-to-saddle dynamics in deep linear networks” - characterized the saddle-to-saddle regime for gradient flow

Step 2 — Delta: This paper extends the gradient flow analysis to stochastic gradient flow with anisotropic, state-dependent noise. Key additions:

Exact closed-form SGD noise covariance for DLNs
Decomposition into modewise SDEs under balance/alignment
Prediction that diffusion peaks precede feature learning
Characterization of stationary distributions

Step 3 — Theory-specific assessment:
- The main theorem showing modewise decomposition is a natural extension of known gradient flow results to the stochastic case
- The proof technique is routine - it assembles known results from SDE theory and DLN analysis under standard assumptions
- The bounds are not particularly tight and no lower bounds are established
- The key insight about diffusion predicting feature learning is interesting but follows naturally from the derived expressions
The results are solid but represent expected extensions rather than surprising breakthroughs. The state-dependent noise characterization is valuable but the overall dynamics remain qualitatively similar to gradient flow.

Verdict: INCREMENTAL — solid extension of gradient flow analysis to SGD with useful technical contributions but no fundamental surprises.

Proof Techniques

The main proof strategy involves several standard techniques:

Gradient noise covariance derivation: Uses vectorization and Kronecker product identities. The key identity is the decomposition:
\[\Sigma_{lm} = J_l^T (I \otimes \Delta)(I_{d_0^2} + C)(I \otimes \Delta)^T J_m + \text{label noise term}\]
Modewise SDE decomposition: Applies Itô’s lemma to mode amplitudes $w_\alpha = u_\alpha^T W v_\alpha$. The critical step uses the balanced weight condition $W_l^T W_l = W_{l-1} W_{l-1}^T$ to simplify:
\[dw_\alpha = \nabla w_\alpha^T dW + \frac{1}{2} \text{tr}(\Sigma \nabla^2 w_\alpha) dt\]
Balance and alignment assumptions: Under these conditions, the cross-mode diffusion terms vanish:
\[D_{\alpha\beta} = \eta a_\alpha^T \Sigma a_\beta = 0 \text{ for } \alpha \neq \beta\]
Stationary distribution analysis: Uses detailed balance condition for the Fokker-Planck equation. The key insight is that without label noise:
\[D_\alpha(w) \sim (s_\alpha - w)^2 \text{ and } \mu_\alpha(w) \sim (s_\alpha - w)\]
leading to non-normalizable density except at $w = s_\alpha$.
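Diffusion maximum calculation: Sets the derivative of $D_\alpha$ to zero. Writing $a := \frac{4(L-1)}{L}$ for the exponent of $w_\alpha$ in $D_\alpha$, the stationarity condition is the quadratic $(a+2)w_\alpha^2 - 2(a+1)s_\alpha w_\alpha + a(s_\alpha^2 + \sigma_q^2) = 0$, whose smaller root is the local maximum:
\[\frac{dD_\alpha}{dw_\alpha} = 0 \Rightarrow w_\alpha^* = \frac{(a+1)s_\alpha - \sqrt{s_\alpha^2 - a(a+2)\sigma_q^2}}{a+2}\]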
The proofs are technically sound but use standard SDE and DLN techniques without novel mathematical insights.
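The location of the diffusion maximum can be sanity-checked numerically in the noise-free case. The sketch below evaluates $D_\alpha$ on a grid and compares the maximizer with $a s_\alpha / (a + 2)$, where $a = \frac{4(L-1)}{L}$ is taken to be the exponent of $w_\alpha$ in $D_\alpha$ (an inference from the formula, since the text does not define $a$). Setting $\beta = 1$ and $\sigma_q = 0$ is an illustrative choice; $\beta$ only rescales $D_\alpha$ and does not move the peak.

```python
import numpy as np

def diffusion(w, s, L, beta=1.0, sigma_q=0.0):
    """Mode-wise diffusion coefficient D_alpha(w_alpha) as written in the text."""
    return beta * L**2 / 2 * ((s - w)**2 + sigma_q**2) * w**(4 * (L - 1) / L)

def numerical_peak(s, L, n_grid=200_001):
    """Grid search for the maximizer of D on [0, s] (noise-free case)."""
    w = np.linspace(0.0, s, n_grid)
    return float(w[np.argmax(diffusion(w, s, L))])

for L in (2, 3, 4, 6):
    a = 4 * (L - 1) / L                 # exponent of w in D (assumed meaning of a)
    closed_form = a * 2.0 / (a + 2)     # noise-free stationary point for s = 2
    print(f"L={L}: grid peak {numerical_peak(2.0, L):.4f}, a*s/(a+2) = {closed_form:.4f}")
```

For $L = 2$ this reproduces the toy-example value $w_1^* = s_1 / 2$; for deeper networks the peak moves closer to $s_\alpha$.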

Experiments & Validation

Datasets: Synthetic data with Gaussian inputs $X \sim \mathcal{N}(0, I)$ and teacher matrices with specified singular value structure.

Baselines: Compares SGD with gradient descent, anisotropic Langevin dynamics (continuous limit), and isotropic Gaussian noise.

Key numbers:

4-6 layer linear networks with various widths
Learning rates $\eta \in [0.001, 0.1]$
Batch sizes $b \in [1, 64]$
Teacher singular values typically $[2, 1.5, 1, 0.5, 0.2]$

Main findings:

Diffusion peaks precede mode learning by ~20% of learning time
State-dependent noise model is more accurate than the isotropic one
End-of-training distributions concentrate at teacher singular values without label noise
Qualitative predictions hold even when balance/alignment assumptions are violated

Empirical validation confirms theoretical predictions about diffusion timing and stationary distributions, though discrete SGD shows slower timescales than continuous limits.

Limitations & Open Problems

Limitations:

Continuous-time approximation: TECHNICAL - ignores finite learning rate effects, though likely removable with effective potential methods
Balance and alignment assumptions: RESTRICTIVE - significantly limits applicability as these don’t hold exactly in practice (though experiments show qualitative results persist)
Whitened input assumption: NATURAL - standard in theoretical analysis and achievable via preprocessing
Online learning assumption: NATURAL - large-sample limit of realistic finite-dataset case
Gaussian noise assumption: TECHNICAL - excludes heavy-tailed SGD noise which may be important in practice
Linear architecture: RESTRICTIVE - unclear how results extend to nonlinear networks where most practical interest lies

Open problems:
Extension to nonlinear networks: Can modewise diffusion analysis be extended to two-layer ReLU networks or other nonlinear architectures to track stagewise feature learning?
Golden Path hypothesis verification: Under what conditions does SGD noise not affect generalization, making gradient flow analysis sufficient for understanding feature learning dynamics?

Dividend ratcheting and capital injection under the Cramér-Lundberg model: Strong solution and optimal strategy

Authors: Chonghu Guan, Zuo Quan Xu · Institution: Hong Kong Polytechnic University, Jiaying University · Category: math.OC

Constructs the first strong solution to the HJB equation for dividend ratcheting with capital injection under compound Poisson (Cramér-Lundberg) surplus dynamics via regime-switching approximation.

Tags: optimal_control insurance_mathematics jump_diffusion variational_inequalities free_boundary_problems HJB_equations dividend_optimization


Problem Formulation

Motivation: Insurance companies need optimal dividend policies while avoiding ruin through capital injection. Adding a “ratcheting constraint” (dividend rates cannot decrease) creates realistic contractual constraints but makes the mathematical problem significantly harder.

Mathematical setup: On a filtered probability space $(\Omega, \mathcal{F}, P, \{\mathcal{F}_t\}_{t \geq 0})$, the surplus follows the Cramér-Lundberg model:

\[X_t = x + \int_0^t (\mu - C_s) ds - \sum_{i=1}^{N_t} Z_i + D_t\]

where $\{N_t\}$ is a Poisson process with intensity $\lambda > 0$, $\{Z_i\}$ are i.i.d. claim sizes with distribution $F$, $C_t$ is the dividend rate, and $D_t$ is the cumulative capital injection.

Assumptions:

Dividend rate satisfies $c \leq C_t \leq \bar{c}$ and is non-decreasing (ratcheting)
Capital injection cost parameter $\ell > 1$
Claim distribution has density $p(x) > 0$ that is bounded and non-increasing on $(0,\infty)$
$\mathbb{E}[Z_1] = \gamma < \infty$ and $0 < c < \bar{c} < \mu$

Toy example: When $\mu = 2$, $\bar{c} = 1$, $\lambda = 1$, $\ell = 2$, and claims are exponential with rate 1, the company starts at $x = 0$ with initial dividend rate $c = 0.5$. If surplus hits a new maximum, the dividend rate can increase but never decrease.
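To get a feel for the surplus dynamics in the toy example, the process can be simulated event by event. The sketch below evaluates one simple, generally non-optimal policy: the dividend rate stays at its initial value $c = 0.5$ and capital is injected only to bring a negative surplus back to zero. The discount rate `r`, the finite `horizon`, and the function name are assumptions for illustration; the toy example does not specify them.

```python
import numpy as np

def simulate_fixed_policy(x0=0.0, mu=2.0, c=0.5, lam=1.0, ell=2.0,
                          r=0.05, horizon=50.0, seed=0):
    """Cramér-Lundberg surplus under a constant dividend rate and minimal injection."""
    rng = np.random.default_rng(seed)
    t, x = 0.0, x0
    discounted_injection = 0.0
    path = [(t, x)]
    while True:
        wait = rng.exponential(1.0 / lam)      # time to the next claim
        if t + wait >= horizon:
            x += (mu - c) * (horizon - t)      # drift up to the horizon
            path.append((horizon, x))
            break
        t += wait
        x += (mu - c) * wait                   # premium income minus dividends
        x -= rng.exponential(1.0)              # exponential claim with rate 1
        if x < 0.0:                            # inject just enough to avoid ruin
            discounted_injection += np.exp(-r * t) * (-x)
            x = 0.0
        path.append((t, x))
    # Constant dividend rate c paid continuously on [0, horizon].
    discounted_dividends = c * (1.0 - np.exp(-r * horizon)) / r
    value = discounted_dividends - ell * discounted_injection
    return np.array(path), discounted_dividends, discounted_injection, value

path, dividends, injections, value = simulate_fixed_policy()
print("claims simulated:", len(path) - 2)
print("discounted dividends minus injection cost:", value)
```

Averaging `value` over many seeds gives a Monte Carlo estimate of this fixed policy's payoff, which is a lower bound on $V(x, c)$ truncated at the chosen horizon.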

Formal objective:
\[V(x,c) = \sup_{(C_t,D_t) \in \Pi_{x,c}} \mathbb{E}\left[ \int_0^\infty e^{-rt} C_t dt - \ell \int_0^\infty e^{-rt} dD_t \right]\]

Method

The method constructs a strong solution to the Hamilton-Jacobi-Bellman variational inequality:

\[\min\{\mathcal{L}^c v - \mathcal{T}v + h - c, \ell - v_x, -v_c\} = 0\]

where:

\[\mathcal{L}^c v := -(\mu - c)v_x + (r + \lambda)v\]
\[\mathcal{T}v := \lambda \mathbb{E}[v((x - Z_1)^+, c)]\]
\[h(x) := \lambda \ell \mathbb{E}[(Z_1 - x)^+]\]
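For the exponential claims of the toy example (density $p(z) = e^{-z}$, which is a property of that example and not of the general method), the two expectation terms have simple closed forms. A short worked computation:

\[h(x) = \lambda \ell \int_x^\infty (z - x)\, e^{-z}\, dz = \lambda \ell\, e^{-x}\]
\[\mathcal{T}v(x,c) = \lambda \left[\int_0^x v(x - z, c)\, e^{-z}\, dz + v(0, c)\, e^{-x}\right]\]

With $\lambda = 1$ and $\ell = 2$ this gives $h(x) = 2e^{-x}$. The second term in $\mathcal{T}v$ collects the claims with $Z_1 \geq x$, for which $(x - Z_1)^+ = 0$; these occur with probability $e^{-x}$.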

Algorithm steps:

Solve boundary case $c = \bar{c}$ via ordinary integro-differential equation
Discretize dividend rates: $c_i = \bar{c} - (i-1)\Delta c$ for $i = 1,\ldots,n$
Solve regime-switching system of ODEs for each discretization level
Establish uniform bounds: $0 \leq v_x \leq \ell$, $-\lambda\ell/(\mu-\underline{c}) \leq v_{xx} \leq 0$
Pass to limit as $n \to \infty$ using compactness arguments

Application to toy example: With exponential claims, the optimal policy uses a free boundary $X(c)$ such that the dividend rate increases to $M(\max_{s \leq t} X_s, c)$ when the surplus hits a new maximum, and capital is injected only in the minimal amount needed to prevent ruin.

Novelty & Lineage

Prior work:

“Optimal dividend strategies under a generalized Cramér-Lundberg model with diffusion” (Albrecher-Azcue-Muler, ~2017) - used viscosity solutions for ratcheting under Brownian motion
“Dividend maximization under a drawdown constraint” (Guan-Xu, 2020) - PDE approach for strong solutions under diffusion
“Optimal dividend and capital injection under a drawdown constraint” (Sethi-Taksar, ~1983) - barrier strategies without ratcheting

Delta: This paper extends the PDE framework from Brownian motion to the compound Poisson (Cramér-Lundberg) jump model, adds costly capital injection, and constructs the first strong solution (not just viscosity) for this combination.

Theory-specific assessment:
- Main theorem is somewhat predictable given prior PDE methods, but technical execution is non-trivial
- Proof technique combines regime-switching approximation with careful uniform estimates - this is a reasonable extension of known methods rather than fundamentally new
- No lower bounds are established; the authors don’t compare to known optimality gaps
Verdict: INCREMENTAL — solid technical extension of known PDE methods to a more general setting, but the core approach follows established patterns from authors’ previous work.

Proof Techniques

Main strategy: Discretization and limiting argument with uniform bounds.

Key inequality 1 - Comparison principle for integro-differential operators:

\[\mathcal{L}^c \psi_1 - J\psi_1 + H(x,\psi_1) \leq \mathcal{L}^c \psi_2 - J\psi_2 + H(x,\psi_2)\]

implies $\psi_1 \leq \psi_2$ when $J$ satisfies a growth condition.

Key inequality 2 - Uniform derivative bounds:

\[0 \leq v_x \leq \ell, \quad v_{xx} \geq -\frac{\lambda\ell}{\mu - \underline{c}}\]

Key equation - Discrete regime-switching system:

\[\min\{\mathcal{L}^{c_i} v_i - \mathcal{T}v_i + h - c_i, v_i - v_{i-1}\} = 0\]

Proof stages:

Solve boundary problem via contraction mapping for OIDE
Establish existence/uniqueness for each discrete approximation using penalty methods
Derive uniform estimates independent of discretization parameter $n$
Use Arzelà-Ascoli compactness to extract convergent subsequence
Verify limit satisfies original variational inequality in strong sense

Technical insight: The regime-switching approximation preserves the gradient constraint $v_x \leq \ell$ automatically, avoiding need for explicit penalty terms in the capital injection constraint.

Experiments & Validation

Purely theoretical. Empirical validation would require:

calibrating Cramér-Lundberg parameters to real insurance data
comparing optimal ratcheting strategy performance against barrier strategies and constant dividend policies
sensitivity analysis for cost parameter $\ell$ and claim distributions.

Limitations & Open Problems

Limitations:

Boundedness assumption on claim density $p(x)$ - TECHNICAL (needed for uniform estimates but likely removable with more care)
Restriction $\ell > 1$ for capital injection cost - NATURAL (economically reasonable that external capital is expensive)
Upper bound $\bar{c} < \mu$ on dividend rate - NATURAL (prevents degenerate cases where income cannot offset claims)
Non-increasing assumption on $p(x)$ - TECHNICAL (simplifies proofs but excludes some realistic claim distributions)

Open problems:
Extension to spectrally negative Lévy processes beyond compound Poisson
Optimal dividend policies with both ratcheting and maximum drawdown constraints simultaneously

Bridging classical and martingale Schrödinger bridges

Authors: Julio Backhoff, Mathias Beiglböck, Giorgia Bifronte, Armand Ley · Institution: University of Vienna · Category: math.PR

Extends the martingale Schrödinger bridge to arbitrary dimension with five equivalent characterizations connecting discrete and continuous-time formulations via a base measure variational principle.

Tags: optimal transport martingale theory Schrödinger bridges entropy regularization stochastic control Föllmer processes convex duality mathematical finance


Problem Formulation

Motivation: The martingale Schrödinger bridge problem seeks a canonical martingale coupling between probability measures $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$ in convex order. This addresses the non-constructive nature of Strassen’s theorem while providing a principled approach for applications in mathematical finance and generative modeling.

Mathematical setup: Let $\mu \preceq_c \nu$ denote convex order, meaning $\int \phi d\mu \leq \int \phi d\nu$ for all convex functions $\phi$. The space of martingale couplings is

\[\text{MT}(\mu,\nu) = \{\pi \in \mathcal{P}(\mathbb{R}^d \times \mathbb{R}^d) : p_1(\pi) = \mu, p_2(\pi) = \nu, \int y \, \pi_x(dy) = x \text{ for } \mu\text{-a.e. } x\}\]

Assumptions:

$\mu, \nu$ have finite second moments
$\nu$ has finite exponential moments: $\int e^{q|y|} \, \nu(dy) < \infty$ for all $q > 0$
$\nu$ is not concentrated on any hyperplane

Toy example: When $d=1$, $\mu = \delta_0$, and $\nu = \frac{1}{2}(\delta_{-1} + \delta_1)$, the martingale constraint forces any coupling to satisfy $\mathbb{E}[Y \mid X=0] = 0$, making this a constrained entropy minimization over the two-point distribution.
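For finitely supported measures, membership in $\text{MT}(\mu,\nu)$ and the entropy objective can be checked directly on a coupling matrix. The following is a minimal sketch; the helper names `is_martingale_coupling` and `relative_entropy` are hypothetical, and the second example (a two-point $\mu$ and three-point $\nu$) is an illustrative case that does not come from the paper.

```python
import numpy as np

def is_martingale_coupling(pi, xs, ys, mu, nu, tol=1e-12):
    """Check marginals and the martingale constraint for a discrete coupling.

    pi[i, j] is the mass on (xs[i], ys[j]); mu is assumed strictly positive.
    """
    pi, xs, ys = (np.asarray(v, dtype=float) for v in (pi, xs, ys))
    marginals_ok = (np.allclose(pi.sum(axis=1), mu, atol=tol)
                    and np.allclose(pi.sum(axis=0), nu, atol=tol))
    cond_mean = (pi @ ys) / pi.sum(axis=1)     # E[Y | X = x_i]
    return bool(marginals_ok and np.allclose(cond_mean, xs, atol=tol))

def relative_entropy(pi, mu, nu):
    """H(pi | mu x nu) for discrete measures, with the convention 0 log 0 = 0."""
    pi = np.asarray(pi, dtype=float)
    ref = np.outer(mu, nu)
    mask = pi > 0
    return float(np.sum(pi[mask] * np.log(pi[mask] / ref[mask])))

# Toy example from the text: mu = delta_0, nu uniform on {-1, 1}.
toy = dict(pi=[[0.5, 0.5]], xs=[0.0], ys=[-1.0, 1.0], mu=[1.0], nu=[0.5, 0.5])
print(is_martingale_coupling(**toy), relative_entropy(toy["pi"], toy["mu"], toy["nu"]))

# Illustrative second example: mu on {-1/2, 1/2}, nu on {-1, 0, 1}.
two = dict(pi=[[0.25, 0.25, 0.0], [0.0, 0.25, 0.25]], xs=[-0.5, 0.5],
           ys=[-1.0, 0.0, 1.0], mu=[0.5, 0.5], nu=[0.25, 0.5, 0.25])
print(is_martingale_coupling(**two), relative_entropy(two["pi"], two["mu"], two["nu"]))
```

In the toy example the product measure is itself a martingale coupling, so the relative entropy is zero; in the second example the martingale constraint forces dependence between the coordinates and the entropy is strictly positive.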

Formal objective: Find the unique minimizer of the entropic martingale transport problem:

\[\inf_{m \in \text{MT}(\mu,\nu)} H(m|\mu \otimes \nu)\]

Method

The method provides five equivalent characterizations of the martingale Schrödinger bridge $m^{\text{SB}}$:

Entropic characterization:
\[m^{\text{SB}} = \arg\min_{m \in \text{MT}(\mu,\nu)} H(m|\mu \otimes \nu)\]
Gibbs density form: The optimizer has density
\[\frac{dm}{d\mu \otimes \nu}(x,y) = \exp(\varphi(x) + \psi(y) - h(x) \cdot (y-x))\]
where $(\varphi,\psi)$ are dual optimizers and $h(x) \cdot (y-x)$ enforces the martingale constraint.
Continuous-time minimization:
\[\inf_{M_0 \sim \mu, M_1 \sim \nu, M_t = M_0 + \int_0^t \sigma_s dB_s} \frac{1}{2}\mathbb{E}\left[\int_0^1 \frac{|\sigma_t - I|^2}{1-t} dt\right]\]

Föllmer martingale construction: For each $x$, construct the Föllmer process solving the classical Schrödinger problem, then take its Doob martingale $M_t^x = \mathbb{E}[X_1^{\text{SB}} \mid \mathcal{F}_t]$.

Base measure variational problem:
\[\sup_{\bar{\mu} \in \mathcal{P}_2(\mathbb{R}^d)} \{\text{MCov}(\mu, \bar{\mu}) - \mathcal{E}_\nu(\bar{\mu})\}\]
where $\text{MCov}(\mu,\bar{\mu}) = \sup_{\pi \in \text{Cpl}(\mu,\bar{\mu})} \int \bar{x} \cdot x \, d\pi$ and $\mathcal{E}_\nu(\bar{\mu})$ is the entropic cost.

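Application to toy example: In the discrete case with $\mu = \delta_0$, $\nu = \frac{1}{2}(\delta_{-1} + \delta_1)$, the only coupling of $\mu$ and $\nu$ is $\delta_0 \otimes \nu$, which is a martingale coupling because $\nu$ has mean $0$. The method therefore yields $m^{\text{SB}}(0, \cdot) = \frac{1}{2}(\delta_{-1} + \delta_1)$, i.e. weights $(\frac{1}{2}, \frac{1}{2})$ on the outcomes $\{-1, 1\}$, recovering the uniform distribution over the feasible outcomes.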

Novelty & Lineage

Prior work:

Nutz-Wiesel (2021): Introduced discrete-time martingale Schrödinger bridge in dimension $d=1$
Classical Schrödinger bridge theory: Föllmer (1985), Léonard (2014) for entropy-regularized optimal transport
Henry-Labordère (2019): Path-space martingale Schrödinger problem with fixed prior

Delta: This paper extends Nutz-Wiesel to arbitrary dimension $d$ and establishes the first rigorous connection between discrete and continuous-time formulations. The key advances are:
Five equivalent characterizations unifying static and dynamic viewpoints
Base measure variational principle connecting martingale and classical Schrödinger problems
Filtering representation showing scale-invariance of the Föllmer martingale
Complete duality theory for weak martingale transport

Theory-specific assessment: The main theorems are somewhat predictable extensions of classical Schrödinger bridge theory to the martingale constraint. The proof techniques largely adapt existing convex duality and variational methods. However, the base measure characterization via MCov-conjugation provides genuine structural insight. The connection to filtering theory is elegant but follows from known martingale representation results.

The bounds are optimal in the sense of solving the stated variational problems, but no non-trivial lower bounds are established for the computational complexity.

Verdict: INCREMENTAL — Solid multidimensional extension of Nutz-Wiesel with valuable unifying perspective, but the core techniques and results follow predictably from existing theory.

Proof Techniques

The proofs employ standard techniques from convex analysis and optimal transport theory:

Duality for weak transport: Establishes $\text{value}(P) = \text{value}(D)$ using Fenchel-Rockafellar duality. The key inequality is
\[H(p|\nu) \geq \int \psi dp - \log \int e^{\psi} d\nu\]
with equality characterizing the Gibbs form.
Variational calculus for base measure: The first-order condition for optimality of $\bar{\mu}$ in problem (VP) requires
\[\frac{\delta}{\delta\bar{\mu}}\text{MCov}(\mu,\bar{\mu}) = \frac{\delta}{\delta\bar{\mu}}\mathcal{E}_\nu(\bar{\mu})\]
This is verified by showing the transport map $T(\bar{x}) = \text{bary}(\pi^{\text{SB}}_{\bar{x}})$ satisfies $T_\# \bar{\mu} = \mu$.
Strict convexity argument: Uniqueness follows from the Hessian identity
\[\nabla^2 \varphi(\bar{x}) = -\text{Cov}_{\pi_{\bar{x}}}(Y)\]
Since $\nu$ charges no hyperplane, the covariance is positive definite, so the Hessian is negative definite and $-\varphi$ is strictly convex.
Stochastic control connection: The continuous-time characterization uses the identity
\[H(\mathbb{P}|\mathbb{Q}) = \mathbb{E}_\mathbb{P}\left[\frac{1}{2}\int_0^1 |\beta_t|^2 dt\right]\]
where $\beta$ is the Föllmer drift relating measures $\mathbb{P}$ and $\mathbb{Q}$.
Martingale representation: The bijection between drifted processes and martingales relies on the stochastic Fubini theorem:
\[u_s = \int_0^s \frac{\sigma_r - I}{1-r} dB_r\]
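The entropy inequality used in the duality step, $H(p|\nu) \geq \int \psi \, dp - \log \int e^{\psi} d\nu$, is easy to check numerically on a finite space. The sketch below draws random test functions $\psi$ and also evaluates the bound at $\psi^* = \log(dp/d\nu)$, where equality holds. The five-point state space, the Dirichlet-sampled measures, and the helper names are assumptions chosen only for illustration.

```python
import numpy as np

def relative_entropy(p, nu):
    """H(p | nu) for discrete measures, with the convention 0 log 0 = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / nu[mask])))

def dual_value(psi, p, nu):
    """Right-hand side of the inequality: int psi dp - log int exp(psi) dnu."""
    return float(psi @ p - np.log(np.exp(psi) @ nu))

rng = np.random.default_rng(0)
nu = rng.dirichlet(np.ones(5))     # reference measure on 5 points
p = rng.dirichlet(np.ones(5))      # target measure on the same points

H = relative_entropy(p, nu)
gaps = [H - dual_value(rng.normal(size=5), p, nu) for _ in range(1000)]
psi_star = np.log(p / nu)          # log-density, the maximizer of the dual value

print("smallest gap over random psi:", min(gaps))
print("gap at psi* = log(dp/dnu):", H - dual_value(psi_star, p, nu))
```

The gap is non-negative for every sampled $\psi$ and vanishes at the log-density, which is the discrete analogue of the Gibbs form characterizing the optimizer.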

Experiments & Validation

Purely theoretical. The paper develops two explicit examples:

Gaussian case: When $\mu = \mathcal{N}(0,\Sigma_1)$ and $\nu = \mathcal{N}(0,\Sigma_2)$ with $\Sigma_1 \preceq \Sigma_2$, closed-form expressions are derived for all five characterizations.
Discrete finite support: Comparison with Bass martingales in simple discrete settings, showing the entropic regularization effect.
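The Gaussian setting can be explored numerically. The sketch below checks the ordering $\Sigma_1 \preceq \Sigma_2$ through the eigenvalues of $\Sigma_2 - \Sigma_1$ and then samples one simple martingale coupling, $Y = X + Z$ with $Z \sim \mathcal{N}(0, \Sigma_2 - \Sigma_1)$ independent of $X$. This independent-increment coupling is only an illustration that $\text{MT}(\mu,\nu)$ is non-empty; it is not claimed to be the martingale Schrödinger bridge, and the covariance matrices are arbitrary example values.

```python
import numpy as np

# Arbitrary example covariances with Sigma1 <= Sigma2 in the Loewner order.
Sigma1 = np.array([[1.0, 0.2], [0.2, 1.0]])
Sigma2 = np.array([[2.0, 0.3], [0.3, 1.5]])

def in_loewner_order(A, B, tol=1e-12):
    """True if B - A is positive semidefinite."""
    return bool(np.all(np.linalg.eigvalsh(B - A) >= -tol))

def sample_martingale_coupling(Sigma1, Sigma2, n, seed=0):
    """Sample (X, Y) with X ~ N(0, Sigma1), Y = X + independent N(0, Sigma2 - Sigma1)."""
    rng = np.random.default_rng(seed)
    d = len(Sigma1)
    X = rng.multivariate_normal(np.zeros(d), Sigma1, size=n)
    Z = rng.multivariate_normal(np.zeros(d), Sigma2 - Sigma1, size=n)
    return X, X + Z                # E[Y | X] = X because Z is centered and independent

print("Sigma1 <= Sigma2:", in_loewner_order(Sigma1, Sigma2))
X, Y = sample_martingale_coupling(Sigma1, Sigma2, n=200_000)
print("sample covariance of Y:\n", np.cov(Y.T))
```

The sample covariance of $Y$ should be close to $\Sigma_2$, confirming that the second marginal is matched while the increment $Y - X$ stays uncorrelated with $X$.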

Empirical validation would require:
- Numerical implementation of the Sinkhorn-type algorithm (mentioned as ongoing work)
- Convergence analysis for discretized continuous-time processes
- Comparison with other canonical martingale constructions in mathematical finance applications

Limitations & Open Problems

Limitations:

TECHNICAL: Existence of dual optimizers requires regularity assumptions on $\nu$ (finite exponential moments, non-concentration on hyperplanes) that are needed for proof techniques but likely removable.
TECHNICAL: Uniqueness of solutions to the inner optimization problems $(I_x)$ is assumed rather than proven, though this is expected to hold generically.
RESTRICTIVE: The continuous-time formulation requires specific Brownian reference measure, limiting applicability to other noise structures.
NATURAL: Finite second moment assumptions are standard in optimal transport theory.

Open problems:
Computational complexity: Develop efficient algorithms for high-dimensional problems beyond the mentioned Sinkhorn approach.
Robustness: Analyze stability of the martingale Schrödinger bridge under perturbations of the marginal constraints $\mu, \nu$.