Theory 3 papers

Theory Digest — Apr 12, 2026

Time Series Gaussian Chain Graph Models

Authors: Qin Fang, Xinghao Qiao, Zihan Wang · Institution: University of Sydney, University of Hong Kong, Tsinghua University · Category: stat.ME

Introduces time series chain graphs that capture blockwise causal and conditional dependence structures via group sparse plus group low-rank decomposition of inverse spectral density matrices.

Tags: time series graphical models chain graphs causal inference spectral density group sparsity low-rank approximation frequency domain


Problem Formulation

Motivation: Many multivariate time series exhibit variable-partitioned blockwise dependence, where variables group into meaningful clusters (e.g., monetary policy variables vs. asset returns, brain networks) with distinct within-block and cross-block dependence patterns. Standard time series graphical models fail to capture this structure.

Mathematical setup: Consider a $p$-dimensional stationary time series $\{x_t\}_{t \in \mathbb{Z}}$ following:

\[x_t = Ax_t + Bx_{t-1} + e_t\]

where $A, B \in \mathbb{R}^{p \times p}$ are coefficient matrices and $\{e_t\}$ is a stationary Gaussian noise process. The nodes $\{1, \ldots, p\}$ partition into $G$ disjoint chain components $N = \bigcup_{g=1}^G \tau_g$. Key assumptions:

  1. Undirected edges exist only within chain components
  2. Directed edges exist only between chain components
  3. There exists a causal ordering $\pi = (\pi_1, \ldots, \pi_G)$ such that directed edges point only from higher- to lower-ordered components
  4. The triplet $(\Omega, A, B)$ is TSCG-feasible (admits common block structure under permutation)

Let $\Omega(\omega) = f_e^{-1}(\omega)$ denote the inverse spectral density matrix of $\{e_t\}$ at frequency $\omega$.

Toy example: When $p = 4$ with chain components $\{1,2\}$ and $\{3,4\}$, the inverse spectral density of $\{x_t\}$ becomes:

\[\Theta(\omega) = (I_4 - A - B e^{-i\omega})^H \Omega(\omega) (I_4 - A - B e^{-i\omega})\]

where $\Omega(\omega)$ is block-diagonal and $A, B$ are block lower-triangular.

Formal objective: Estimate the time series chain graph structure by solving:

\[\min_{(\Omega,A,B)} -\ell_M(\Omega + L) + \lambda_{1T}\sqrt{M}\sum_{k \neq \ell}\sqrt{\sum_{j=1}^M |\Omega_{k\ell}(\omega_j)|^2} + \lambda_{2T}\sqrt{M}\frac{\|L^{(1)}\|_* + \|L^{(2)}\|_*}{2}\]

Method

Algorithm Overview: Three-stage procedure built on a group sparse plus group low-rank decomposition.

Stage 1 - Undirected Edge Recovery: Minimize regularized Whittle likelihood:

\[(\hat{\Omega}, \hat{L}) = \arg\min_{\tilde{\Omega}, \tilde{L}} -\ell_M(\tilde{\Omega} + \tilde{L}) + P_1(\tilde{\Omega}, \lambda_{1T}) + P_2(\tilde{L}, \lambda_{2T})\]

where:

  • $P_1(\tilde{\Omega}, \lambda_{1T}) = \lambda_{1T}\sqrt{M}\sum_{k \neq \ell}\sqrt{\sum_{j=1}^M |\tilde{\Omega}_{k\ell}(\tilde{\omega}_j)|^2}$ (group lasso)
  • $P_2(\tilde{L}, \lambda_{2T}) = \lambda_{2T}\sqrt{M}(\|\tilde{L}^{(1)}\|_* + \|\tilde{L}^{(2)}\|_*)/2$ (tensor-unfolding nuclear norm)
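To make the group lasso term concrete, the following Python sketch evaluates $P_1$ for a stack of $M$ frequency-indexed matrices. The `(M, p, p)` array layout and the function name `group_lasso_penalty` are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def group_lasso_penalty(Omega, lam):
    """P_1 = lam * sqrt(M) * sum_{k != l} sqrt(sum_j |Omega_kl(omega_j)|^2).

    Omega: complex array of shape (M, p, p), one matrix per frequency
    (an assumed storage layout).
    """
    M, p, _ = Omega.shape
    # Group norm of each (k, l) entry, taken across the M frequencies.
    group_norms = np.sqrt((np.abs(Omega) ** 2).sum(axis=0))
    off_diag = ~np.eye(p, dtype=bool)          # penalize k != l only
    return lam * np.sqrt(M) * group_norms[off_diag].sum()

# Two frequencies, p = 2: the off-diagonal entry is 3 at one frequency, 4i at the other.
Omega = np.zeros((2, 2, 2), dtype=complex)
Omega[:, 0, 0] = Omega[:, 1, 1] = 1.0
Omega[0, 0, 1], Omega[0, 1, 0] = 3.0, 3.0
Omega[1, 0, 1], Omega[1, 1, 0] = 4.0j, -4.0j
penalty = group_lasso_penalty(Omega, lam=1.0)
```

Because each off-diagonal entry is grouped across all frequencies, the penalty either keeps an edge at every frequency or removes it at every frequency, which is what yields a single undirected graph rather than one graph per frequency.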

The key decomposition is:

\[\Theta(\omega) = \Omega(\omega) + L(\omega)\]

where $L(\omega) = (A + Be^{-i\omega})^H\Omega(\omega)(A + Be^{-i\omega}) - (A + Be^{-i\omega})^H\Omega(\omega) - \Omega(\omega)(A + Be^{-i\omega})$.
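The decomposition follows from expanding $(I - C)^H \Omega (I - C)$ with $C = A + Be^{-i\omega}$. A minimal numerical check on the $p = 4$ toy setting is sketched below; the specific entries of `A`, `B`, and `Omega`, and the choice of a frequency-constant $\Omega$, are illustrative assumptions.

```python
import numpy as np

def theta_and_lowrank(A, B, Omega, omega):
    """Return Theta(omega) and L(omega) for x_t = A x_t + B x_{t-1} + e_t."""
    p = A.shape[0]
    C = A + B * np.exp(-1j * omega)            # A + B e^{-i omega}
    I = np.eye(p)
    Theta = (I - C).conj().T @ Omega @ (I - C)
    L = C.conj().T @ Omega @ C - C.conj().T @ Omega - Omega @ C
    return Theta, L

# Toy setting: p = 4 with chain components {1,2} and {3,4} (0-based blocks below).
A = np.zeros((4, 4))
A[2:, :2] = [[0.5, 0.0], [0.0, -0.3]]          # directed edges, first block -> second
B = np.zeros((4, 4))
B[2:, :2] = [[0.2, 0.1], [0.0, 0.4]]
block = np.array([[2.0, 0.5], [0.5, 1.0]])
Omega = np.kron(np.eye(2), block)              # block-diagonal, frequency-constant (assumed)

Theta, L = theta_and_lowrank(A, B, Omega, omega=0.7)
decomposition_gap = np.abs(Theta - (Omega + L)).max()
```

In this sketch `decomposition_gap` should be at floating-point level, and `Theta` stays Hermitian because both `Omega` and `L` are.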

Stage 2 - Chain Component Ordering: Use conditional variance discrepancy:

\[\hat{D}(\hat{\tau}_g, M) = \max_{k \in \hat{\tau}_g} \max_{j \in [M]} |\hat{f}_{x,kk}(\tilde{\omega}_j) - \hat{f}_{x,kM}(\tilde{\omega}_j)\hat{f}_{x,MM}^{-1}(\tilde{\omega}_j)\hat{f}_{x,Mk}(\tilde{\omega}_j) - \hat{\Omega}_{kk}^{-1}(\tilde{\omega}_j)|\]

Iteratively select chain components by minimizing this discrepancy.

Stage 3 - Directed Edge Recovery: Multivariate regression followed by singular value and elementwise hard thresholding.
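The two thresholding operations named in Stage 3 can be sketched as follows. The summary does not give the regression design or the threshold levels, so the function `hard_threshold`, the thresholds `tau_sv` and `tau_el`, and the input matrix are all hypothetical placeholders.

```python
import numpy as np

def hard_threshold(M, tau_sv, tau_el):
    """Zero singular values below tau_sv, then zero entries with magnitude below tau_el."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.where(s >= tau_sv, s, 0.0)          # singular value hard thresholding
    low_rank = (U * s) @ Vt
    return np.where(np.abs(low_rank) >= tau_el, low_rank, 0.0)  # elementwise step

# Hypothetical regression estimate: a rank-one signal plus small perturbations.
signal = np.outer([1.0, 0.0, 2.0], [1.0, 1.0, 0.0])
noisy = signal + 0.01 * np.array([[1, -1, 1], [-1, 1, -1], [1, 1, -1]])
cleaned = hard_threshold(noisy, tau_sv=0.5, tau_el=0.1)
```

The singular value step removes low-energy directions, and the elementwise step then zeroes small entries so that the surviving support can be read as directed edges.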

Toy Example Application: For 2×2 blocks, Stage 1 recovers $\hat{\Omega}(\omega)$ as block-diagonal across frequencies. Stage 2 identifies which 2×2 block comes first in causal ordering. Stage 3 estimates non-zero entries in the lower-triangular blocks of $A$ and $B$.

Novelty & Lineage

Prior Work:

  1. Zhao et al. (2024): Gaussian chain graphs for i.i.d. data - established identifiability and estimation for classical chain graphs but ignores temporal dynamics
  2. Dahlhaus & Eichler (2003): Time-indexed chain graphs - each time point forms a chain component, missing variable-level groupings
  3. Jung et al. (2015), Tugnait (2022): Time series conditional independence graphs - undirected edges only, no causal structure

Delta: This paper introduces time series chain graphs where chain components are variables (not time indices), capturing both within-component conditional dependencies and cross-component causal relations with full temporal dynamics.

Theory-specific Assessment:

  • Main theorem: Identifiability result (Theorem 1) is conceptually predictable - extends matrix sparse+low-rank decomposition to the group setting with frequency domain
  • Proof technique: Novel transversality condition for continuous frequency domain is genuinely new; the primal-dual witness technique in tensor formulation provides new technical tools
  • Bound tightness: No lower bounds are established or cited for the group sparse plus group low-rank recovery problem

The identifiability framework introduces the first transversality condition for group sparse plus group low-rank decomposition in continuous frequency domain. However, the overall approach follows expected extensions of existing sparse+low-rank theory.

Verdict: INCREMENTAL. Solid extension combining known techniques (chain graphs + frequency domain graphical models) with a novel but expected tensor nuclear norm penalty.

Proof Techniques

Main Strategy: Establish identifiability via transversality condition, then prove consistency using primal-dual witness technique.

Key Technical Components:

  1. Transversality Condition (Assumption 1):

    \[\mathcal{S}(\Omega_0) \cap \mathcal{T}(L_0) = \{0(\cdot)\}\]

    where $\mathcal{S}(\Omega_0) = \{\Omega' \in C((0,2\pi]; H_p) : \text{gsupp}(\Omega') \subset \text{gsupp}(\Omega_0)\}$ and $\mathcal{T}(L_0)$ is the tangent space for group low-rank matrices.

  2. Irrepresentable Condition: For the discrete frequency setting, the condition:

    \[\|\mathcal{P}_{\mathcal{T}(\hat{L})}(\nabla \mathcal{S}(\hat{\Omega}))\|_{\infty} < 1\]

    is satisfied asymptotically, where $\mathcal{P}_{\mathcal{T}(\hat{L})}$ denotes projection onto the tangent space.

  3. Primal-Dual Witness Construction: The key inequality controlling estimation error:

    \[\|(\hat{\Omega}, \hat{L}) - (\Omega_0, L_0)\|_F \leq C \sqrt{\frac{(S + pR) \log p}{MT}}\]

    where the witness matrices satisfy KKT conditions:

    \[\nabla \ell_M(\Omega_0 + L_0) + \lambda_{1T}\tilde{Z}_{\Omega} + \lambda_{2T}\tilde{Z}_L = 0\]
  4. Frequency Domain Control: Bound discrepancies between continuous and discrete frequencies using:

    \[\sup_{\omega \in (0,2\pi]} \|\Theta(\omega) - \Theta(\tilde{\omega}_j)\| = O(M^{-1})\]

    for appropriately chosen block size $m$ and number of blocks $M$.

Experiments & Validation

Simulations: Extensive simulation study with varying:

  1. Dimensions $p \in \{20, 50, 100\}$
  2. Sample sizes $T \in \{200, 500, 1000\}$
  3. Signal-to-noise ratios
  4. Sparsity levels and rank structures

Performance metrics: F1-score, precision, recall for both undirected and directed edge recovery.

Real Data: U.S. macroeconomic data application with 20 quarterly time series (1960-2019) including:

  • Monetary policy variables (federal funds rate, money supply)
  • Economic indicators (GDP, unemployment, inflation)
  • Financial variables (stock returns, bond yields)

Shows interpretable monetary policy transmission mechanisms with policy variables affecting real economic variables, which then impact financial markets.

Baselines: Compared against:

  1. Standard VAR with LASSO (no chain structure)
  2. Separate estimation within/across components
  3. Time-indexed chain graphs
  4. Classical graphical LASSO on inverse spectral density

Limitations & Open Problems

Limitations:

  1. RESTRICTIVE: Assumes Gaussian distribution - limits applicability to heavy-tailed financial/economic time series where non-Gaussian models are often more appropriate

  2. TECHNICAL: Requires pre-specification of number of chain components $G$ - not estimated from data, though sensitivity analysis could address this

  3. RESTRICTIVE: Block-diagonal structure of $\Omega(\omega)$ assumes conditional independence across chain components given past values - may be too strong for some applications

  4. TECHNICAL: Choice of frequency blocking parameters $(m, M)$ requires careful tuning - no adaptive selection procedure provided

  5. NATURAL: Stationarity assumption is standard but limits applicability to non-stationary macroeconomic/financial series

  6. TECHNICAL: SVD thresholding in Stage 3 uses hard thresholding - could benefit from adaptive or soft thresholding approaches

Open Problems:

  1. Model Selection: Develop principled methods for selecting the number of chain components $G$ and optimal frequency blocking structure simultaneously

  2. Non-Gaussian Extensions: Extend the framework to handle heavy-tailed innovations using robust M-estimators or other non-Gaussian approaches suitable for financial time series


Value Mirror Descent for Reinforcement Learning

Authors: Zhichao Jia, Guanghui Lan · Institution: Georgia Institute of Technology · Category: math.OC

Integrates mirror descent into value iteration to achieve policy stability guarantees while matching optimal sample complexity for regularized MDPs.

Tags: reinforcement learning value iteration mirror descent sample complexity regularized MDPs policy optimization Bregman divergence concentration inequalities


Problem Formulation

Motivation: Classical value iteration methods are fundamental for reinforcement learning but lack the flexibility to handle regularized MDPs effectively. While policy optimization methods handle regularization well, value iteration-type methods achieve superior sample complexity under generative models, particularly in their dependence on the effective horizon $(1-\gamma)^{-1}$.

Mathematical setup: Consider a discounted Markov Decision Process (DMDP) with tuple $(S,A,P,c,\gamma)$ where:

  • $S$ is finite state space, $A$ is finite action space
  • $P \in \mathbb{R}^{|S| \times |A| \times |S|}$ is transition kernel with $P(s'|s,a) = P_{s,a}(s')$
  • $c(s,a) \in [0,1]$ is cost function
  • $\gamma \in (0,1)$ is discount factor
  • Regularizer $h: \Delta_A \to [0,\bar{h}]$ is continuous and convex

The state-value function for policy $\pi$ is:

\[V^\pi(s) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t[c(s_t,a_t) + h(\pi(\cdot|s_t))] \mid s_0 = s\right]\]

Assumptions:

  1. Finite state and action spaces
  2. Costs bounded: $c(s,a) \in [0,1]$
  3. Regularizer $h$ is $\mu$-strongly convex with modulus $\mu \geq 0$ (the case $\mu = 0$ is plain convexity)
  4. Access to generative model for sampling transitions

Toy example: When $|S|=2$, $|A|=2$, $\gamma=0.5$, and $h(\pi) = 0$ (no regularization), this reduces to standard value iteration finding optimal policy for 2-state 2-action MDP.

Formal objective: Find optimal policy $\pi^*$ minimizing:

\[\pi^* \in \arg\min_{\pi \in \Pi} V^\pi(s) \quad \forall s \in S\]

Method

Value Mirror Descent (VMD): The method alternates between two steps:

  1. Prox-mapping step: For each state $s \in S$:

    \[\pi_{t+1}(\cdot|s) = \arg\min_{\pi(\cdot|s) \in \Delta_A} \left\{\eta_t[\langle c(s,\cdot) + \gamma P_s V_t, \pi(\cdot|s)\rangle + h(\pi(\cdot|s))] + D_{\pi}^{\pi_t}(s)\right\}\]
  2. Value update step:

    \[V_{t+1}(s) = \langle c(s,\cdot) + \gamma P_s V_t, \pi_{t+1}(\cdot|s)\rangle + h(\pi_{t+1}(\cdot|s))\]

Stochastic VMD (SVMD): Uses estimated transition kernel $\hat{P}^{(m)}$ with variance reduction:

  • Estimate $\tilde{P}_0 V_0$ in first iteration using $m_{k,1}$ samples
  • For $t > 0$: compute $\tilde{P}_t V_t = \tilde{P}_0 V_0 + \hat{P}^{(m_{k,2})}(V_t - V_0)$
  • Component-wise minimum: $V_{t+1}(s) = \min\{\tilde{V}_{t+1}(s), V_t(s)\}$

Application to toy example: For 2-state 2-action MDP, the prox-mapping reduces to solving a simple constrained optimization over the probability simplex on two actions for each state, with value updates being weighted averages of immediate costs and discounted future values.
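The two VMD steps can be sketched on a 2-state, 2-action MDP with $\gamma = 0.5$ and $h = 0$. The sketch assumes the KL divergence as the Bregman divergence (so the prox-mapping has the closed form $\pi_{t+1} \propto \pi_t \exp(-\eta Q)$), a constant step size, and a hand-picked toy instance; none of these choices come from the paper.

```python
import numpy as np

def value_mirror_descent(P, c, gamma, eta=1.0, iters=200):
    """Deterministic VMD with KL Bregman divergence and h = 0 (assumed choices).

    P: transitions, shape (S, A, S); c: costs, shape (S, A).
    """
    S, A = c.shape
    log_pi = np.full((S, A), -np.log(A))       # uniform initial policy
    V = np.zeros(S)
    for _ in range(iters):
        Q = c + gamma * P @ V                  # c(s,.) + gamma * P_s V_t
        # Prox-mapping under KL: pi_{t+1} is proportional to pi_t * exp(-eta * Q).
        log_pi = log_pi - eta * Q
        log_pi -= np.logaddexp.reduce(log_pi, axis=1, keepdims=True)
        pi = np.exp(log_pi)
        # Value update: V_{t+1}(s) = <Q(s,.), pi_{t+1}(.|s)>.
        V = (pi * Q).sum(axis=1)
    return pi, V

# Hypothetical toy instance: action 0 stays in place, action 1 switches state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
c = np.array([[1.0, 0.5],
              [0.0, 1.0]])
pi, V = value_mirror_descent(P, c, gamma=0.5)
```

For this instance the optimal values can be found by hand: staying in state 2 costs nothing, so $V^*(2) = 0$, and moving there from state 1 costs $0.5$, so $V^*(1) = 0.5$. The iterates approach these values while the policy concentrates gradually on the greedy actions instead of switching abruptly.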

Novelty & Lineage

Prior work:

  1. Variance-reduced Q-value iteration (Wainwright & Xu 2019): achieved $\tilde{O}(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$ sample complexity
  2. Stochastic policy mirror descent (Shani et al. 2020): obtained policy convergence guarantees but with worse sample complexity dependence on $(1-\gamma)$
  3. Classical value iteration: $O((1-\gamma)^{-1})$ iteration complexity but no regularization handling

Delta: This paper adds:

  • Integration of mirror descent into value iteration framework
  • Bounded Bregman divergence guarantees: $\|D_{\pi^*}^{\hat{\pi}_K}\|_\infty \leq O((1-\gamma)^{-2})$
  • First value-based method achieving $\tilde{O}(\epsilon^{-1})$ sample complexity for strongly convex regularizers
  • Policy stability properties absent in existing value iteration methods

Theory-specific assessment:

  • Main theorem is incremental: sample complexity matches existing lower bounds, no breakthrough in fundamental limits
  • Proof technique combines standard mirror descent analysis with value iteration monotonicity - largely routine application of known techniques
  • Bounds are tight for general convex case, but strongly convex improvement is modest (helps only when $\epsilon < O((1-\gamma)^2)$)
  • No new lower bounds established

Verdict: INCREMENTAL. Solid technical contribution combining mirror descent with value iteration, but improvements are predictable extensions of existing methods without fundamental breakthroughs.

Proof Techniques

The proof strategy proceeds in three main stages:

  1. Single-epoch convergence via mirror descent analysis: Uses standard mirror descent inequality:

    \[\eta_t[\langle c + \gamma P_s V_t, \pi_{t+1} - \pi\rangle + h(\pi_{t+1}) - h(\pi)] + D_{\pi_{t+1}}^{\pi_t} \leq D_{\pi}^{\pi_t} - (1+\eta_t\mu)D_{\pi}^{\pi_{t+1}}\]
  2. Monotonicity preservation: Key technical insight showing approximate monotonicity:

    \[V_t \geq \Gamma_{\pi_t} V_t - \beta_t\]

    where $\beta_t$ captures estimation errors. This leads to:

    \[V_t \geq V^{\pi_t} - (I - \gamma P_{\pi_t})^{-1}\beta_t\]
  3. Error accumulation control: For the stochastic version, cumulative errors are bounded using concentration inequalities:
    • Hoeffding bound (with probability at least $1-\delta$): $|\hat{P}_{s,a}^{(m)} V - P_{s,a} V| \leq \|V\|_\infty\sqrt{2m^{-1}\ln(2\delta^{-1})}$
    • Bernstein bound: Exploits variance structure with key lemma:
    \[\|(I-\gamma P_\pi)^{-1}\sqrt{\sigma_V^\pi}\|_\infty^2 \leq \frac{(1+\bar{h})^2(1+\gamma)}{\gamma^2(1-\gamma)^3}\]
  4. Strongly convex analysis: Introduces virtual value sequences and exploits strong convexity via:

    \[h(\pi) - h(\pi') - \langle h'(\pi'), \pi - \pi'\rangle \geq \mu D_\pi^{\pi'}\]

    This enables $O(\epsilon^{-1})$ rate by faster policy convergence through Bregman divergence contraction.
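The step from approximate monotonicity to the bound on $V^{\pi_t}$ in item 2 can be spelled out as follows. This sketch assumes $\Gamma_\pi V = c_\pi + h_\pi + \gamma P_\pi V$, where $c_\pi$ and $h_\pi$ (notation introduced here, not taken from the paper) denote the expected one-step cost and regularizer under $\pi$:

\[V_t \geq c_{\pi_t} + h_{\pi_t} + \gamma P_{\pi_t} V_t - \beta_t \;\Longrightarrow\; (I - \gamma P_{\pi_t}) V_t \geq c_{\pi_t} + h_{\pi_t} - \beta_t\]

\[(I - \gamma P_{\pi_t})^{-1} = \sum_{k=0}^{\infty} \gamma^k P_{\pi_t}^k \geq 0 \;\Longrightarrow\; V_t \geq (I - \gamma P_{\pi_t})^{-1}(c_{\pi_t} + h_{\pi_t}) - (I - \gamma P_{\pi_t})^{-1}\beta_t = V^{\pi_t} - (I - \gamma P_{\pi_t})^{-1}\beta_t\]

The entrywise nonnegativity of the Neumann series is what allows the inequality to survive multiplication by the inverse.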

Experiments & Validation

Purely theoretical.

Empirical validation would require:

  1. Comparison with variance-reduced Q-value iteration on standard RL benchmarks
  2. Evaluation of policy stability via Bregman divergence tracking
  3. Testing continual learning performance after offline pre-training
  4. Verification of strongly convex regularizer improvements in high-accuracy regime

Limitations & Open Problems

Limitations:

  1. Generative model assumption - TECHNICAL (standard in value iteration literature but limits applicability to online settings)

  2. Finite state/action spaces - RESTRICTIVE (significantly narrows applicability to function approximation settings)

  3. Strongly convex improvement only helps when $\epsilon < O((1-\gamma)^2)$ - TECHNICAL (benefit limited to high-accuracy regime)

  4. Policy stability bound $O((1-\gamma)^{-2})$ - NATURAL (matches policy optimization methods)

  5. Sample complexity still has $(1-\gamma)^{-3}$ dependence - NATURAL (matches known lower bounds)

Open problems:

  1. Extend to function approximation settings while preserving sample complexity guarantees
  2. Develop online variants that maintain policy stability properties without generative model assumption

An Actor-Critic Framework for Continuous-Time Jump-Diffusion Controls with Normalizing Flows

Authors: Liya Guo, Ruimeng Hu, Xu Yang, Yi Zhu · Institution: Tsinghua University, UCSB · Category: math.OC

Extends actor-critic reinforcement learning to time-inhomogeneous infinite-horizon jump-diffusion control using little q-functions and normalizing flows for non-Gaussian stochastic policies.

Tags: continuous-time control jump-diffusion processes policy gradient methods actor-critic stochastic control reinforcement learning normalizing flows entropy regularization


Problem Formulation

Motivation: Continuous-time stochastic control with jump-diffusion dynamics is central in finance and economics, particularly for portfolio optimization and dynamic decision-making under discontinuous shocks and time-dependent parameters. Classical dynamic programming methods become intractable in high dimensions with unknown dynamics, motivating reinforcement learning approaches.

Mathematical setup: Let $(Ω, \mathcal{F}, \mathbb{F}, \mathbb{P})$ be a filtered probability space. The controlled state process $(X_t^π)_{t≥0} ∈ \mathbb{R}^d$ follows the Itô-Lévy process:

\[dX_t^π = b(t, X_{t-}^π, u_t) dt + σ(t, X_{t-}^π, u_t) dW_t + ∫_{\mathbb{R}^d} α(t, X_{t-}^π, u_t, z) \tilde{N}(dt, dz)\]

where $W = (W_t)_{t≥0}$ is a $d$-dimensional Brownian motion and $\tilde{N}(dt, dz) = N(dt, dz) - ν(dz) dt$ is the compensated Poisson random measure with Lévy measure $ν$. The policy $π(·|t,x) ∈ \mathcal{P}(A)$ is a time-dependent randomized feedback law.

The entropy-regularized reward functional is:

\[J(t, x; π) := \mathbb{E}\left[\int_t^∞ e^{-β(s-t)} \tilde{f}(s, X_s^π; π) ds | X_t^π = x\right]\]

where

\[\tilde{f}(s, y; π) := ∫_A [f(s, y, u) - γ \log π(u|s, y)] π(u|s, y) du\]

Assumptions:

  1. Standard Lipschitz and linear growth conditions on $(b, σ, α)$ for unique strong solution existence
  2. Lévy measure satisfies $∫_{\mathbb{R}^d} \min\{|z|^2, 1\} ν(dz) < ∞$
  3. Discount factor $β > 0$ and regularization parameter $γ ≥ 0$
  4. Policy parameterization via conditional normalizing flows for non-Gaussian distributions

Toy example: When $d=1$, $A = \mathbb{R}$, with linear dynamics $dX_t = u_t dt + σ dW_t + α dM_t$ and quadratic running cost $f(t, x, u) = -\frac{1}{2}(Qu x^2 + Ru u^2)$, the optimal policy becomes Gaussian with mean $-\frac{H(t)}{R} x$ where $H(t)$ solves a Riccati equation modified by jump terms.

Formal objective: Find the optimal policy

\[π^*(·|t,x) = \arg\max_π J(t, x; π)\]

Method

Core Method: The approach introduces a time-inhomogeneous “little” $q$-function and develops an actor-critic framework with policy gradient updates using conditional normalizing flows.

Key Steps:

  1. Define discounted occupation measure:

    \[μ^{π,t,x}(A) := \mathbb{E}\left[\int_t^∞ e^{-β(s-t)} \mathbf{1}_{\{(s,X_s^π) ∈ A\}} ds\right]\]
  2. Introduce little $q$-function:

    \[q(t, x, u; π) := ∂_t J(t, x; π) + \mathcal{H}(t, x, u, ∇_x J(t, x; π), ∇_x^2 J(t, x; π)) - β J(t, x; π)\]

    where $\mathcal{H}$ is the Hamiltonian including jump terms.

  3. Policy gradient formula (Theorem 3.1):

    \[∇_θ J(t, x; π_θ)|_{θ=θ_0} = \frac{1}{β} \mathbb{E}_{(s,y) \sim βμ_{θ_0,t,x}, u \sim π_{θ_0}(·|s,y)} [\nabla_θ \log π_θ(u|s,y)|_{θ=θ_0} A_{ent}(s, y, u; θ_0)]\]

    where

    \[A_{ent}(s, y, u; θ_0) = q(s, y, u; π_{θ_0}) - γ \log π_{θ_0}(u|s,y)\]
  4. Tractable $q$-approximation (Lemma 3.3):

    \[\tilde{q}_{δt}(t, x, u; π) = \frac{1}{δt}[f(t, x, u) δt + e^{-βδt} J(t+δt, X_{t+δt}^u; π) - J(t, x; π)]\]
  5. Conditional normalizing flow parameterization:

    \[π_θ(u|t,x) = \mathcal{N}(\bar{μ}_θ(t,x), \text{Std}_θ^2(t,x)) \circ F_θ^{-1}(·; t,x) \circ S^{-1}\]

    where $F_θ$ is learnable invertible flow and $S$ is optional squashing map.
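The flow parameterization in step 5 evaluates policy densities through the change-of-variables formula. A minimal one-dimensional sketch is given below; the affine map $F(z) = az + b$ is a hypothetical stand-in for the learnable flow $F_θ$, `tanh` is one possible squashing map $S$, and in the conditional setting `mu`, `std`, `a`, and `b` would be network outputs depending on $(t, x)$.

```python
import numpy as np

def flow_log_density(u, mu, std, a, b):
    """log pi(u) for u = tanh(a * z + b) with z ~ N(mu, std^2).

    F(z) = a * z + b is a hypothetical stand-in for the learnable flow;
    tanh plays the role of the squashing map S.
    """
    y = np.arctanh(u)                          # S^{-1}(u)
    z = (y - b) / a                            # F^{-1}(y)
    log_base = -0.5 * ((z - mu) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))
    # Change of variables: subtract log|F'(z)| = log|a| and log|S'(y)| = log(1 - u^2).
    return log_base - np.log(abs(a)) - np.log1p(-u ** 2)

# Sanity check: the induced density on (-1, 1) should integrate to about one.
u_grid = np.linspace(-0.999999, 0.999999, 200001)
density = np.exp(flow_log_density(u_grid, mu=0.0, std=0.5, a=1.5, b=0.2))
total_mass = np.sum(0.5 * (density[1:] + density[:-1]) * np.diff(u_grid))
```

The same log-density is what enters both the entropy term $γ \log π_θ$ and the score $∇_θ \log π_θ$ in the policy gradient, which is why an invertible map with a tractable Jacobian is needed.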

Application to toy example: For the linear-quadratic case with quadratic Hamiltonian, the optimal policy is Gaussian, so the flow reduces to identity and the method recovers the analytical solution $π^*(u|t,x) = \mathcal{N}(R^{-1} B^T H(t) x, \frac{γ}{2} R^{-1})$ where $H(t)$ solves the modified Riccati equation with jump terms.

Novelty & Lineage

Prior work:

  • Hu et al. (2021): Continuous-time policy gradients for finite-horizon diffusion control with Gaussian policies
  • Wang et al. (2020): “Little” $q$-function for finite-horizon jump-diffusion with entropy regularization
  • Jia et al. (2023): Time-homogeneous infinite-horizon diffusion control with generalized advantage estimation

Delta: This paper extends to time-inhomogeneous infinite-horizon jump-diffusion control with three key additions:

  1. Time-inhomogeneous little $q$-function including explicit $∂_t J$ term for infinite-horizon discounting
  2. Conditional normalizing flows for non-Gaussian stochastic policies (prior work limited to Gaussian)
  3. Rigorous treatment of time-dependent occupation measures under discounting

Theory-specific assessment:

  • Main theorem (policy gradient) is a predictable extension of [Wang et al. 2020] and [Jia et al. 2023] to the time-inhomogeneous setting
  • Proof technique assembles known martingale methods and occupation measure identities - no fundamentally new mathematical insights
  • The $\tilde{q}_{δt}$ approximation (Lemma 3.3) is standard first-order Taylor expansion, similar to GAE literature
  • No lower bounds are established; tightness of approximations unknown

Technical novelty: The time-inhomogeneous extension requires careful handling of discounted occupation measures on $[t,∞) \times \mathbb{R}^d$, but this is achieved through routine modifications of existing techniques. The normalizing flow parameterization is borrowed from the ML literature.

Verdict: INCREMENTAL. Solid but expected extension combining known techniques (little $q$-functions, normalizing flows, actor-critic) to a broader setting without fundamental theoretical breakthroughs.

Proof Techniques

Main proof strategy for policy gradient theorem:

  1. Occupation measure representation (Lemma 3.1):

    \[\mathbb{E}\left[\int_t^∞ e^{-β(s-t)} φ(s, X_s^π) ds\right] = \int_{[t,∞) × \mathbb{R}^d} φ(s,y) μ^{π,t,x}(ds, dy)\]
  2. Generator identity (Lemma 3.2): For $φ ∈ C^{1,2}([0,∞) × \mathbb{R}^d)$,

    \[\mathbb{E}\left[\int_t^∞ e^{-β(s-t)} [-∂_s φ - \mathcal{L}^π φ + β φ](s, X_s^π) ds\right] = φ(t,x)\]
  3. Policy difference decomposition: Apply Lemma 3.2 with $φ = J(·,·; \hat{π})$ under measure induced by baseline policy $π$:

    \[J(t,x; \hat{π}) - J(t,x; π) = \frac{1}{β} \mathbb{E}_{μ^{\hat{π},t,x}, u \sim \hat{π}} [q(s, X_s^{\hat{π}}, u; π) - γ \log \hat{π}(u|s, X_s^{\hat{π}})]\]
  4. Differentiation: For parameterized policies $π_θ$, differentiate the identity at $θ = θ_0$ using:

    \[\frac{d}{dθ} \mathbb{E}_{u \sim π_θ} [g(u)] = \mathbb{E}_{u \sim π_θ} [\nabla_θ \log π_θ(u) g(u)]\]
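The score-function identity used in step 4 can be checked by Monte Carlo on a simple case. The sketch below assumes $π_θ = \mathcal{N}(θ, 1)$ and the test function $g(u) = u^2$, both chosen only for illustration, so that $\mathbb{E}[g] = θ^2 + 1$ and the exact derivative is $2θ$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.0, 400_000
u = rng.normal(loc=theta, scale=1.0, size=n)   # u ~ pi_theta = N(theta, 1)
g = u ** 2                                     # test function g(u) = u^2
score = u - theta                              # grad_theta log pi_theta(u)
grad_estimate = np.mean(score * g)             # right-hand side of the identity
grad_exact = 2.0 * theta                       # d/dtheta of (theta^2 + 1)
```

The estimate only requires samples and the score of the policy, not the derivative of $g$, which is what makes the identity usable when $g$ is an advantage-type quantity learned by a critic.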

Key inequalities: The proof relies on standard martingale properties and does not require novel concentration bounds. The main technical work is in handling the time-derivative term $∂_t J$ which appears due to time-inhomogeneity.

GAE approximation proof (Lemma 3.3): Uses first-order Itô expansion:

\[\mathbb{E}[\tilde{q}_{δt}(t,x,u;π)] = q(t,x,u;π) + o(1)\]

as $δt \to 0$ through Taylor expansion of the value function increments.
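The one-step estimator and its limit can be illustrated on a noise-free case. The sketch below assumes dynamics $dX = u\,dt$ with no diffusion or jumps, $f = 0$, and a smooth test function $J(t,x) = x$ standing in for the value function; it also assumes the Hamiltonian has the usual form $f + b \cdot ∇_x J$ in this setting, so that $q = u - βx$. These are illustrative choices, not the paper's examples.

```python
import math

def q_tilde(f_val, J_now, J_next, beta, dt):
    """One-step estimate (f*dt + exp(-beta*dt) * J(t+dt, X_{t+dt}) - J(t, x)) / dt."""
    return (f_val * dt + math.exp(-beta * dt) * J_next - J_now) / dt

# Hypothetical noise-free check: dX = u dt, f = 0, J(t, x) = x, so q = u - beta * x.
beta, x, u = 0.5, 1.0, 2.0
errors = []
for dt in (1e-1, 1e-2, 1e-3):
    approx = q_tilde(0.0, J_now=x, J_next=x + u * dt, beta=beta, dt=dt)
    errors.append(abs(approx - (u - beta * x)))
```

In this deterministic case the entries of `errors` shrink roughly in proportion to `dt`; with diffusion and jumps the same statement holds only in expectation, as in the lemma.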

Technical insight: The main challenge is properly defining the exploratory SDE (2.10) to avoid measurability issues when continuously sampling from $π(·|t,x)$. This is handled by introducing auxiliary randomness through a Poisson measure on the extended space $\mathbb{R}^d × [0,1]^m$.

Experiments & Validation

Datasets and benchmarks:

  1. Linear-quadratic control: Time-homogeneous and time-inhomogeneous variants with analytical solutions from modified Riccati equations
  2. Merton portfolio optimization: Classical and entropy-regularized versions, with PINN solver providing high-accuracy benchmark for non-Gaussian optimal policies
  3. Multi-agent portfolio game: 25-agent strategic interaction with analytical Nash equilibrium characterized by coupled first-order conditions

Baselines: Exact analytical solutions for LQ problems and Merton (deterministic case), PINN-computed benchmarks for entropy-regularized Merton, game-theoretic equilibrium analysis for multi-agent setting.

Key metrics: Time-averaged relative RMSEs for state trajectories, value functions, and controls. For stochastic policies ($γ > 0$), KL divergence between learned and optimal policy distributions.

Representative results:

  • LQ control: Value error $E_V \approx 0.004$ across dimensions $d ∈ \{1,5,20,50\}$
  • Merton problem: Close agreement between learned flow-based policies and PINN benchmarks for non-Gaussian distributions
  • Multi-agent game: Runtime scales linearly with number of agents; stable performance up to $n=25$ agents

Computational details: PyTorch implementation on NVIDIA RTX 4090, step sizes $δt ∈ \{0.01, 0.05\}$, training iterations $N_{itr} ∈ \{1000, 3000\}$ depending on problem complexity.

Limitations & Open Problems

Limitations:

  1. TECHNICAL: Approximation quality of $\tilde{q}_{δt}$ depends on step size $δt$ with no explicit convergence rates provided - standard first-order error but could be restrictive in practice

  2. TECHNICAL: Normalizing flow parameterization adds computational overhead compared to Gaussian policies, though enables broader policy classes - trade-off between expressiveness and efficiency

  3. NATURAL: Requires knowledge of Lévy measure $ν$ for jump term evaluation in critic updates - standard assumption in jump-diffusion literature but limits model-free applicability

  4. RESTRICTIVE: Method assumes access to gradient information $∇_x V_ψ$ for martingale correction terms, which may be numerically unstable for complex value network architectures

  5. TECHNICAL: Evaluation of non-local compensator term $∫ [V(t,x+α(t,x,u,z)) - V(t,x)] ν(dz)$ is computationally expensive and approximated in practice

  6. NATURAL: Infinite-horizon setting requires careful discount factor choice $β > 0$ to ensure convergence - standard in discounted control but adds hyperparameter sensitivity

Open problems:

  1. Convergence analysis: Establish finite-sample convergence rates for the actor-critic algorithm under jump-diffusion dynamics with explicit dependence on approximation parameters $δt$, network capacity, and regularization strength $γ$

  2. Model-free extension: Develop methods to learn or approximate the Lévy measure $ν$ and SDE coefficients $(b,σ,α)$ from data, potentially through kernel density estimation or neural approaches for the jump measure