Apr 12, 2026 Theory 3 papers

Theory Digest — Apr 12, 2026

Time Series Gaussian Chain Graph Models

Authors: Qin Fang, Xinghao Qiao, Zihan Wang · Institution: University of Sydney, University of Hong Kong, Tsinghua University · Category: stat.ME

Introduces time series chain graphs that capture blockwise causal and conditional dependence structures via group sparse plus group low-rank decomposition of inverse spectral density matrices.

Tags: time series graphical models chain graphs causal inference spectral density group sparsity low-rank approximation frequency domain


Problem Formulation

Motivation: Many multivariate time series exhibit variable-partitioned blockwise dependence, where variables group into meaningful clusters (e.g., monetary policy variables vs. asset returns, brain networks) with distinct within-block and cross-block dependence patterns. Standard time series graphical models fail to capture this structure.

Mathematical setup: Consider a $p$-dimensional stationary time series $\{x_t\}_{t \in \mathbb{Z}}$ following:

\[x_t = Ax_t + Bx_{t-1} + e_t\]

where $A, B \in \mathbb{R}^{p \times p}$ are coefficient matrices and $\{e_t\}$ is a stationary Gaussian noise process. The node set $N = \{1, \ldots, p\}$ partitions into $G$ disjoint chain components, $N = \bigcup_{g=1}^G \tau_g$. Key assumptions:

Undirected edges exist only within chain components
Directed edges exist only between chain components
There exists a causal ordering $\pi = (\pi_1, \ldots, \pi_G)$ such that directed edges point only from higher- to lower-ordered components
The triplet $(\Omega, A, B)$ is TSCG-feasible (admits common block structure under permutation)

Let $\Omega(\omega) = f_e^{-1}(\omega)$ denote the inverse spectral density matrix of $\{e_t\}$ at frequency $\omega$.

Toy example: When $p = 4$ with chain components $\{1,2\}$ and $\{3,4\}$, the inverse spectral density of $\{x_t\}$ becomes:
\[\Theta(\omega) = (I_4 - A - B e^{-i\omega})^H \Omega(\omega) (I_4 - A - B e^{-i\omega})\]
where $\Omega(\omega)$ is block-diagonal and $A, B$ are block lower-triangular.

Formal objective: Estimate the time series chain graph structure by solving the penalized Whittle likelihood problem:
\[\min_{(\Omega,A,B)} -\ell_M(\Omega + L) + \lambda_{1T}\sqrt{M}\sum_{k \neq \ell}\sqrt{\sum_{j=1}^M |\Omega_{k\ell}(\tilde{\omega}_j)|^2} + \lambda_{2T}\sqrt{M}\frac{\|L^{(1)}\|_* + \|L^{(2)}\|_*}{2}\]
where $L(\omega)$ is the group low-rank term determined by $(\Omega, A, B)$.

Method

Algorithm Overview: Three-stage procedure combining group sparse plus group low-rank decomposition.

Stage 1 - Undirected Edge Recovery: Minimize regularized Whittle likelihood:

\[(\hat{\Omega}, \hat{L}) = \arg\min_{\tilde{\Omega}, \tilde{L}} -\ell_M(\tilde{\Omega} + \tilde{L}) + P_1(\tilde{\Omega}, \lambda_{1T}) + P_2(\tilde{L}, \lambda_{2T})\]

where:

$P_1(\tilde{\Omega}, \lambda_{1T}) = \lambda_{1T}\sqrt{M}\sum_{k \neq \ell}\sqrt{\sum_{j=1}^M |\tilde{\Omega}_{k\ell}(\tilde{\omega}_j)|^2}$ (group lasso)

$P_2(\tilde{L}, \lambda_{2T}) = \lambda_{2T}\sqrt{M}(\|\tilde{L}^{(1)}\|_* + \|\tilde{L}^{(2)}\|_*)/2$ (tensor-unfolding nuclear norm)
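To make the group lasso term concrete, here is a minimal sketch of evaluating $P_1$ on a stack of matrices, one per Fourier frequency. The nested-list layout, the function names, and the toy values are assumptions made for illustration; the nuclear-norm term $P_2$ is not covered because it needs a singular value decomposition.

```python
import math

def group_lasso_penalty(omega_by_freq, lam):
    """lam * sqrt(M) * sum_{k != l} sqrt(sum_j |Omega_kl(w_j)|^2).

    omega_by_freq: list of M square matrices (nested lists), one per frequency.
    This storage layout is an assumption of the sketch.
    """
    M = len(omega_by_freq)
    p = len(omega_by_freq[0])
    total = 0.0
    for k in range(p):
        for l in range(p):
            if k == l:
                continue  # diagonal entries are not penalized
            group_sq = sum(abs(omega_by_freq[j][k][l]) ** 2 for j in range(M))
            total += math.sqrt(group_sq)
    return lam * math.sqrt(M) * total

def group_support(omega_by_freq, tol=1e-12):
    """Pairs (k, l), k < l, whose entry is nonzero at some frequency."""
    M = len(omega_by_freq)
    p = len(omega_by_freq[0])
    return [(k, l) for k in range(p) for l in range(k + 1, p)
            if math.sqrt(sum(abs(omega_by_freq[j][k][l]) ** 2
                             for j in range(M))) > tol]

# Hypothetical toy input: p = 3 nodes, M = 2 frequencies, one edge (0, 1).
OMEGA_STACK = [
    [[2.0, 0.3 + 0.4j, 0.0], [0.3 - 0.4j, 2.0, 0.0], [0.0, 0.0, 1.0]],
    [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]],
]
penalty = group_lasso_penalty(OMEGA_STACK, lam=0.1)
edges = group_support(OMEGA_STACK)
```

Because each off-diagonal pair is penalized through its norm across all frequencies, an entry is either zeroed at every frequency or kept as a group, which is what lets the support be read as a single undirected graph.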

The key decomposition is:

\[\Theta(\omega) = \Omega(\omega) + L(\omega)\]

where $L(\omega) = (A + Be^{-i\omega})^H\Omega(\omega)(A + Be^{-i\omega}) - (A + Be^{-i\omega})^H\Omega(\omega) - \Omega(\omega)(A + Be^{-i\omega})$.
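The decomposition follows from expanding $(I - C)^H \Omega (I - C)$ with $C = A + Be^{-i\omega}$. The sketch below checks this numerically for a $p = 4$ example with two chain components. All matrix values are hypothetical, and $\Omega$ is taken constant in $\omega$ purely to keep the example short.

```python
import cmath

def matmul(X, Y):
    """Product of two matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def herm(X):
    """Conjugate transpose."""
    return [[complex(X[j][i]).conjugate() for j in range(len(X))]
            for i in range(len(X[0]))]

def combine(X, Y, sign=1.0):
    """Entrywise X + sign * Y."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def identity(p):
    return [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]

# Hypothetical values: Omega is block-diagonal, A and B are block lower-triangular.
OMEGA = [[2.0, 0.5, 0.0, 0.0], [0.5, 2.0, 0.0, 0.0],
         [0.0, 0.0, 1.5, -0.3], [0.0, 0.0, -0.3, 1.5]]
A = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0],
     [0.4, 0.0, 0.0, 0.0], [0.0, -0.2, 0.0, 0.0]]
B = [[0.3, 0.0, 0.0, 0.0], [0.0, 0.2, 0.0, 0.0],
     [0.1, 0.0, 0.3, 0.0], [0.0, 0.5, 0.0, 0.1]]

def transfer(omega):
    """C(omega) = A + B * exp(-i * omega)."""
    phase = cmath.exp(-1j * omega)
    return [[a + b * phase for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def theta_direct(omega):
    """(I - C)^H Omega (I - C)."""
    i_minus_c = combine(identity(4), transfer(omega), -1.0)
    return matmul(matmul(herm(i_minus_c), OMEGA), i_minus_c)

def theta_decomposed(omega):
    """Omega + L with L = C^H Omega C - C^H Omega - Omega C."""
    C = transfer(omega)
    CH = herm(C)
    L = combine(combine(matmul(matmul(CH, OMEGA), C), matmul(CH, OMEGA), -1.0),
                matmul(OMEGA, C), -1.0)
    return combine(OMEGA, L)

def max_abs_diff(X, Y):
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

max_err = max(max_abs_diff(theta_direct(w), theta_decomposed(w))
              for w in (0.1, 1.0, 2.5))
```

The two constructions agree up to floating-point rounding, so `max_err` should be at the level of machine precision.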

Stage 2 - Chain Component Ordering: Use conditional variance discrepancy:

\[\hat{D}(\hat{\tau}_g, \mathcal{M}) = \max_{k \in \hat{\tau}_g} \max_{j \in [M]} |\hat{f}_{x,kk}(\tilde{\omega}_j) - \hat{f}_{x,k\mathcal{M}}(\tilde{\omega}_j)\hat{f}_{x,\mathcal{M}\mathcal{M}}^{-1}(\tilde{\omega}_j)\hat{f}_{x,\mathcal{M}k}(\tilde{\omega}_j) - \hat{\Omega}_{kk}^{-1}(\tilde{\omega}_j)|\]

Here $\mathcal{M}$ is a conditioning set of node indices, distinct from the number of frequencies $M$. Chain components are selected iteratively by minimizing this discrepancy.

Stage 3 - Directed Edge Recovery: Multivariate regression followed by singular value and elementwise hard thresholding.

Toy Example Application: For 2×2 blocks, Stage 1 recovers $\hat{\Omega}(\omega)$ as block-diagonal across frequencies. Stage 2 identifies which 2×2 block comes first in causal ordering. Stage 3 estimates non-zero entries in the lower-triangular blocks of $A$ and $B$.
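The elementwise part of Stage 3 can be sketched as follows. This is only the hard-thresholding step applied to an already estimated coefficient matrix; the regression and the singular-value thresholding are omitted, and the matrix values and threshold are hypothetical.

```python
def hard_threshold(matrix, tau):
    """Keep entries with magnitude at least tau, set the rest to zero."""
    return [[a if abs(a) >= tau else 0.0 for a in row] for row in matrix]

def directed_edges(matrix, tau):
    """Edges (parent, child) implied by the surviving coefficients.

    In x_t = A x_t + B x_{t-1} + e_t, entry [i][j] multiplies x_j in the
    equation for x_i, so a nonzero entry is read as an edge j -> i.
    """
    kept = hard_threshold(matrix, tau)
    return [(j, i) for i, row in enumerate(kept)
            for j, a in enumerate(row) if a != 0.0]

# Hypothetical noisy estimate with one real effect, from node 0 to node 1.
A_HAT = [[0.01, -0.02], [0.45, 0.03]]
A_SPARSE = hard_threshold(A_HAT, tau=0.1)
EDGES = directed_edges(A_HAT, tau=0.1)
```

Unlike soft thresholding, the surviving entries are left unshrunk, so the retained coefficient stays at its regression value.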

Novelty & Lineage

Prior Work:

Zhao et al. (2024): Gaussian chain graphs for i.i.d. data - established identifiability and estimation for classical chain graphs but ignores temporal dynamics
Dahlhaus & Eichler (2003): Time-indexed chain graphs - each time point forms a chain component, missing variable-level groupings
Jung et al. (2015), Tugnait (2022): Time series conditional independence graphs - undirected edges only, no causal structure

Delta: This paper introduces time series chain graphs where chain components are variables (not time indices), capturing both within-component conditional dependencies and cross-component causal relations with full temporal dynamics.

Theory-specific Assessment:
- Main theorem: Identifiability result (Theorem 1) is conceptually predictable - extends matrix sparse+low-rank decomposition to the group setting with frequency domain
- Proof technique: The transversality condition for the continuous frequency domain is genuinely new; the primal-dual witness technique in tensor formulation provides new technical tools
- Bound tightness: No lower bounds are established or cited for the group sparse plus group low-rank recovery problem
The identifiability framework introduces the first transversality condition for group sparse plus group low-rank decomposition in continuous frequency domain. However, the overall approach follows expected extensions of existing sparse+low-rank theory.

Verdict: INCREMENTAL — Solid extension combining known techniques (chain graphs + frequency domain graphical models) with a novel but expected tensor nuclear norm penalty.

Proof Techniques

Main Strategy: Establish identifiability via transversality condition, then prove consistency using primal-dual witness technique.

Key Technical Components:

Transversality Condition (Assumption 1):
\[\mathcal{S}(\Omega_0) \cap \mathcal{T}(L_0) = \{0(\cdot)\}\]
where $\mathcal{S}(\Omega_0) = \{\Omega' \in C((0,2\pi]; H_p) : \text{gsupp}(\Omega') \subset \text{gsupp}(\Omega_0)\}$ and $\mathcal{T}(L_0)$ is the tangent space for group low-rank matrices.
Irrepresentable Condition: For the discrete frequency setting, the condition:
\[\|\mathcal{P}_{\mathcal{T}(\hat{L})}(\nabla \mathcal{S}(\hat{\Omega}))\|_{\infty} < 1\]
is satisfied asymptotically, where $\mathcal{P}_{\mathcal{T}(\hat{L})}$ denotes projection onto the tangent space.
Primal-Dual Witness Construction: The key inequality controlling estimation error:
\[\|(\hat{\Omega}, \hat{L}) - (\Omega_0, L_0)\|_F \leq C \sqrt{\frac{(S + pR) \log p}{MT}}\]
where the witness matrices satisfy KKT conditions:
\[\nabla \ell_M(\Omega_0 + L_0) + \lambda_{1T}\tilde{Z}_{\Omega} + \lambda_{2T}\tilde{Z}_L = 0\]
Frequency Domain Control: Bound discrepancies between continuous and discrete frequencies using:
\[\sup_{\omega \in (0,2\pi]} \|\Theta(\omega) - \Theta(\tilde{\omega}_j)\| = O(M^{-1})\]
for appropriately chosen block size $m$ and number of blocks $M$.

Experiments & Validation

Simulations: Extensive simulation study with varying:

Dimensions $p \in \{20, 50, 100\}$
Sample sizes $T \in \{200, 500, 1000\}$
Signal-to-noise ratios
Sparsity levels and rank structures

Performance metrics: F1-score, precision, recall for both undirected and directed edge recovery.
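For reference, the edge-recovery metrics can be computed from the true and estimated edge sets as in the sketch below. The function name and the example edge lists are hypothetical. For undirected edges each pair should be stored in a fixed order, for instance (smaller index, larger index), so that the same edge is not counted twice.

```python
def edge_scores(true_edges, est_edges):
    """Precision, recall and F1 of an estimated edge set."""
    true_set, est_set = set(true_edges), set(est_edges)
    tp = len(true_set & est_set)  # correctly recovered edges
    precision = tp / len(est_set) if est_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# Hypothetical example: two of three true edges recovered, one false positive.
TRUE_EDGES = [(0, 1), (2, 3), (0, 2)]
EST_EDGES = [(0, 1), (0, 2), (1, 3)]
precision, recall, f1 = edge_scores(TRUE_EDGES, EST_EDGES)
```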

Real Data: U.S. macroeconomic data application with 20 quarterly time series (1960-2019) including:
- Monetary policy variables (federal funds rate, money supply)
- Economic indicators (GDP, unemployment, inflation)
- Financial variables (stock returns, bond yields)
Shows interpretable monetary policy transmission mechanisms with policy variables affecting real economic variables, which then impact financial markets.

Baselines: Compared against:
Standard VAR with LASSO (no chain structure)
Separate estimation within/across components
Time-indexed chain graphs
Classical graphical LASSO on inverse spectral density

Limitations & Open Problems

Limitations:

RESTRICTIVE: Assumes Gaussian distribution - limits applicability to heavy-tailed financial/economic time series where non-Gaussian models are often more appropriate
TECHNICAL: Requires pre-specification of number of chain components $G$ - not estimated from data, though sensitivity analysis could address this
RESTRICTIVE: Block-diagonal structure of $\Omega(\omega)$ assumes conditional independence across chain components given past values - may be too strong for some applications
TECHNICAL: Choice of frequency blocking parameters $(m, M)$ requires careful tuning - no adaptive selection procedure provided
NATURAL: Stationarity assumption is standard but limits applicability to non-stationary macroeconomic/financial series
TECHNICAL: SVD thresholding in Stage 3 uses hard thresholding - could benefit from adaptive or soft thresholding approaches

Open Problems:
Model Selection: Develop principled methods for selecting the number of chain components $G$ and optimal frequency blocking structure simultaneously
Non-Gaussian Extensions: Extend the framework to handle heavy-tailed innovations using robust M-estimators or other non-Gaussian approaches suitable for financial time series

Value Mirror Descent for Reinforcement Learning

Authors: Zhichao Jia, Guanghui Lan · Institution: Georgia Institute of Technology · Category: math.OC

Integrates mirror descent into value iteration to achieve policy stability guarantees while matching optimal sample complexity for regularized MDPs.

Tags: reinforcement learning value iteration mirror descent sample complexity regularized MDPs policy optimization Bregman divergence concentration inequalities


Problem Formulation

Motivation: Classical value iteration methods are fundamental for reinforcement learning but lack the flexibility to handle regularized MDPs effectively. While policy optimization methods handle regularization well, value iteration-type methods achieve superior sample complexity under generative models, particularly in their dependence on the discount factor $(1-\gamma)$.

Mathematical setup: Consider a discounted Markov Decision Process (DMDP) with tuple $(S,A,P,c,\gamma)$ where:

$S$ is finite state space, $A$ is finite action space

$P \in \mathbb{R}^{|S||A| \times |S|}$ is transition kernel with $P(s'|s,a) = P_{s,a}(s')$

$c(s,a) \in [0,1]$ is cost function
$\gamma \in (0,1)$ is discount factor
Regularizer $h: \Delta_A \to [0,\bar{h}]$ is continuous and convex

The state-value function for policy $\pi$ is:

\[V^\pi(s) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t[c(s_t,a_t) + h(\pi(\cdot|s_t))] \mid s_0 = s\right]\]

Assumptions:

Finite state and action spaces
Costs bounded: $c(s,a) \in [0,1]$
Regularizer $h$ is $\mu$-strongly convex with modulus $\mu \geq 0$ ($\mu = 0$ is the merely convex case)

Access to generative model for sampling transitions

Toy example: When $|S|=2$, $|A|=2$, $\gamma=0.5$, and $h(\pi) = 0$ (no regularization), this reduces to standard value iteration finding the optimal policy for a 2-state 2-action MDP.
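A minimal sketch of the unregularized 2-state, 2-action case with $\gamma = 0.5$ is given below. The cost and transition values are hypothetical, chosen only so that the fixed point is easy to check by hand.

```python
GAMMA = 0.5
# Hypothetical costs c(s, a) in [0, 1] and transition rows P(. | s, a).
COST = [[1.0, 0.2], [0.5, 0.9]]
TRANS = [[[0.9, 0.1], [0.2, 0.8]],
         [[0.5, 0.5], [0.1, 0.9]]]

def q_values(V):
    """Q(s, a) = c(s, a) + gamma * sum_s' P(s' | s, a) V(s')."""
    return [[COST[s][a] + GAMMA * sum(TRANS[s][a][s2] * V[s2] for s2 in range(2))
             for a in range(2)] for s in range(2)]

def value_iteration(n_iters=100):
    """Apply the Bellman optimality operator for cost minimization."""
    V = [0.0, 0.0]
    for _ in range(n_iters):
        V = [min(q) for q in q_values(V)]
    greedy = [q.index(min(q)) for q in q_values(V)]
    return V, greedy

V_STAR, GREEDY = value_iteration()
```

For these values the greedy policy picks action 1 in state 0 and action 0 in state 1, and solving the two linear Bellman equations for that policy gives $V^* = (14/23, 20/23)$.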

Formal objective: Find optimal policy $\pi^*$ minimizing:

\[\pi^* \in \arg\min_{\pi \in \Pi} V^\pi(s) \quad \forall s \in S\]

Method

Value Mirror Descent (VMD): The method alternates between two steps:

Prox-mapping step: For each state $s \in S$:
\[\pi_{t+1}(\cdot|s) = \arg\min_{\pi(\cdot|s) \in \Delta_A} \left\{\eta_t[\langle c(s,\cdot) + \gamma P_s V_t, \pi(\cdot|s)\rangle + h(\pi(\cdot|s))] + D_{\pi}^{\pi_t}(s)\right\}\]
Value update step:
\[V_{t+1}(s) = \langle c(s,\cdot) + \gamma P_s V_t, \pi_{t+1}(\cdot|s)\rangle + h(\pi_{t+1}(\cdot|s))\]
Stochastic VMD (SVMD): Uses estimated transition kernel $\hat{P}^{(m)}$ with variance reduction:
- Estimate $\tilde{P}_0 V_0$ in first iteration using $m_{k,1}$ samples
- For $t > 0$: compute $\tilde{P}_t V_t = \tilde{P}_0 V_0 + \hat{P}^{(m_{k,2})}(V_t - V_0)$
- Component-wise minimum: $V_{t+1}(s) = \min\{\tilde{V}_{t+1}(s), V_t(s)\}$
Application to toy example: For 2-state 2-action MDP, the prox-mapping reduces to solving a simple constrained optimization over the probability simplex on two actions for each state, with value updates being weighted averages of immediate costs and discounted future values.
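The deterministic VMD iteration can be sketched on a small MDP as follows. Two assumptions go beyond the summary: the Bregman divergence is taken to be the KL divergence, and $h = 0$, in which case the prox-mapping has the closed form $\pi_{t+1}(a|s) \propto \pi_t(a|s)\exp(-\eta_t Q_t(s,a))$. The cost and transition values and the constant step size are hypothetical.

```python
import math

GAMMA = 0.5
# Hypothetical 2-state, 2-action MDP.
COST = [[1.0, 0.2], [0.5, 0.9]]
TRANS = [[[0.9, 0.1], [0.2, 0.8]],
         [[0.5, 0.5], [0.1, 0.9]]]

def vmd(n_iters=200, eta=1.0):
    """Value mirror descent with KL divergence and h = 0 (assumptions)."""
    n_s, n_a = 2, 2
    V = [0.0] * n_s
    pi = [[0.5, 0.5] for _ in range(n_s)]  # uniform initial policy
    for _ in range(n_iters):
        new_V = []
        for s in range(n_s):
            # Q_t(s, a) = c(s, a) + gamma * P_{s,a} V_t, using the old V_t.
            q = [COST[s][a] + GAMMA * sum(TRANS[s][a][s2] * V[s2]
                                          for s2 in range(n_s))
                 for a in range(n_a)]
            # Prox-mapping step: multiplicative-weights form under KL.
            w = [pi[s][a] * math.exp(-eta * q[a]) for a in range(n_a)]
            z = sum(w)
            pi[s] = [wa / z for wa in w]
            # Value update step: V_{t+1}(s) = <Q_t(s, .), pi_{t+1}(. | s)>.
            new_V.append(sum(q[a] * pi[s][a] for a in range(n_a)))
        V = new_V
    return V, pi

V_VMD, PI_VMD = vmd()
```

Unlike greedy value iteration, the policy changes gradually from one iterate to the next, which is the stability property the method is designed around. On this example the values approach $(14/23, 20/23)$, the fixed point of the Bellman optimality equation for these costs and transitions.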

Novelty & Lineage

Prior work:

Variance-reduced Q-value iteration (Wainwright & Xu 2019): achieved $\tilde{O}(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$ sample complexity

Stochastic policy mirror descent (Shani et al. 2020): obtained policy convergence guarantees but with worse sample complexity dependence on $(1-\gamma)$
Classical value iteration: $O((1-\gamma)^{-1})$ iteration complexity but no regularization handling

Delta: This paper adds:
- Integration of mirror descent into value iteration framework
- Bounded Bregman divergence guarantees: $\|D_{\pi^*}^{\hat{\pi}_K}\|_\infty \leq O((1-\gamma)^{-2})$
- First value-based method achieving $\tilde{O}(\epsilon^{-1})$ sample complexity for strongly convex regularizers
- Policy stability properties absent in existing value iteration methods
Theory-specific assessment:
- Main theorem is incremental: sample complexity matches existing lower bounds, no breakthrough in fundamental limits
- Proof technique combines standard mirror descent analysis with value iteration monotonicity - largely routine application of known techniques
- Bounds are tight for general convex case, but strongly convex improvement is modest (helps only when $\epsilon < O((1-\gamma)^2)$)
- No new lower bounds established
Verdict: INCREMENTAL — solid technical contribution combining mirror descent with value iteration, but improvements are predictable extensions of existing methods without fundamental breakthroughs.

Proof Techniques

The proof strategy proceeds in four main stages:

Single-epoch convergence via mirror descent analysis: Uses standard mirror descent inequality:
\[\eta_t[\langle c + \gamma P_s V_t, \pi_{t+1} - \pi\rangle + h(\pi_{t+1}) - h(\pi)] + D_{\pi_{t+1}}^{\pi_t} \leq D_{\pi}^{\pi_t} - (1+\eta_t\mu)D_{\pi}^{\pi_{t+1}}\]
Monotonicity preservation: Key technical insight showing approximate monotonicity:
\[V_t \geq \Gamma_{\pi_t} V_t - \beta_t\]
where $\beta_t$ captures estimation errors. This leads to:
\[V_t \geq V^{\pi_t} - (I - \gamma P_{\pi_t})^{-1}\beta_t\]

Error accumulation control: For stochastic version, bounds cumulative errors using concentration inequalities:

Hoeffding bound: $|\hat{P}_{s,a}^{(m)} V - P_{s,a} V| \leq \|V\|_\infty\sqrt{2m^{-1}\ln(2\delta^{-1})}$

Bernstein bound: Exploits variance structure with key lemma:

\[\|(I-\gamma P_\pi)^{-1}\sqrt{\sigma_V^\pi}\|_\infty^2 \leq \frac{(1+\bar{h})^2(1+\gamma)}{\gamma^2(1-\gamma)^3}\]

Strongly convex analysis: Introduces virtual value sequences and exploits strong convexity via:
\[h(\pi) - h(\pi') - \langle h'(\pi'), \pi - \pi'\rangle \geq \mu D_\pi^{\pi'}\]
This enables $O(\epsilon^{-1})$ rate by faster policy convergence through Bregman divergence contraction.
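To give a feel for the Hoeffding bound used in the error-accumulation stage, the sketch below evaluates its right-hand side and inverts it for the number of samples per state-action pair. The function names are hypothetical, and the bound is read in the usual way, as holding with probability at least $1-\delta$ for a fixed $V$.

```python
import math

def hoeffding_radius(v_sup, m, delta):
    """||V||_inf * sqrt(2 * ln(2 / delta) / m)."""
    return v_sup * math.sqrt(2.0 * math.log(2.0 / delta) / m)

def samples_for_accuracy(v_sup, eps, delta):
    """Smallest m for which the Hoeffding radius is at most eps."""
    return math.ceil(2.0 * v_sup ** 2 * math.log(2.0 / delta) / eps ** 2)

# Hypothetical setting: gamma = 0.5, h = 0, costs in [0, 1],
# so ||V||_inf <= 1 / (1 - gamma) = 2.
M_NEEDED = samples_for_accuracy(v_sup=2.0, eps=0.1, delta=0.05)
RADIUS = hoeffding_radius(2.0, M_NEEDED, 0.05)
```

The radius scales with $\|V\|_\infty$, which can be as large as $(1-\gamma)^{-1}$. A variance-aware Bernstein bound replaces that worst-case range with a variance term.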

Experiments & Validation

Purely theoretical.

Empirical validation would require:

Comparison with variance-reduced Q-value iteration on standard RL benchmarks
Evaluation of policy stability via Bregman divergence tracking
Testing continual learning performance after offline pre-training
Verification of strongly convex regularizer improvements in high-accuracy regime

Limitations & Open Problems

Limitations:

Generative model assumption - TECHNICAL (standard in value iteration literature but limits applicability to online settings)
Finite state/action spaces - RESTRICTIVE (significantly narrows applicability to function approximation settings)
Strongly convex improvement only helps when $\epsilon < O((1-\gamma)^2)$ - TECHNICAL (benefit limited to high-accuracy regime)
Policy stability bound $O((1-\gamma)^{-2})$ - NATURAL (matches policy optimization methods)
Sample complexity still has $(1-\gamma)^{-3}$ dependence - NATURAL (matches known lower bounds)

Open problems:
Extend to function approximation settings while preserving sample complexity guarantees
Develop online variants that maintain policy stability properties without generative model assumption

An Actor-Critic Framework for Continuous-Time Jump-Diffusion Controls with Normalizing Flows

Authors: Liya Guo, Ruimeng Hu, Xu Yang, Yi Zhu · Institution: Tsinghua University, UCSB · Category: math.OC

Extends actor-critic reinforcement learning to time-inhomogeneous infinite-horizon jump-diffusion control using little q-functions and normalizing flows for non-Gaussian stochastic policies.

Tags: continuous-time control jump-diffusion processes policy gradient methods actor-critic stochastic control reinforcement learning normalizing flows entropy regularization


Problem Formulation

Motivation: Continuous-time stochastic control with jump-diffusion dynamics is central in finance and economics, particularly for portfolio optimization and dynamic decision-making under discontinuous shocks and time-dependent parameters. Classical dynamic programming methods become intractable in high dimensions with unknown dynamics, motivating reinforcement learning approaches.

Mathematical setup: Let $(Ω, \mathcal{F}, \mathbb{F}, \mathbb{P})$ be a filtered probability space. The controlled state process $(X_t^π)_{t≥0} ∈ \mathbb{R}^d$ follows the Itô-Lévy process:

\[dX_t^π = b(t, X_{t-}^π, u_t) dt + σ(t, X_{t-}^π, u_t) dW_t + ∫_{\mathbb{R}^d} α(t, X_{t-}^π, u_t, z) \tilde{N}(dt, dz)\]

where $W = (W_t)_{t≥0}$ is a $d$-dimensional Brownian motion and $\tilde{N}(dt, dz) = N(dt, dz) - ν(dz) dt$ is the compensated Poisson random measure with Lévy measure $ν$. The policy $π(·|t,x) ∈ \mathcal{P}(A)$ is a time-dependent randomized feedback law.

The entropy-regularized reward functional is:

\[J(t, x; π) := \mathbb{E}\left[\int_t^∞ e^{-β(s-t)} \tilde{f}(s, X_s^π; π) ds | X_t^π = x\right]\]

where

\[\tilde{f}(s, y; π) := ∫_A [f(s, y, u) - γ \log π(u|s, y)] π(u|s, y) du\]

Assumptions:

Standard Lipschitz and linear growth conditions on $(b, σ, α)$ for unique strong solution existence
Lévy measure satisfies $∫_{\mathbb{R}^d} \min\{|z|^2, 1\} ν(dz) < ∞$
Discount factor $β > 0$ and regularization parameter $γ ≥ 0$
Policy parameterization via conditional normalizing flows for non-Gaussian distributions

Toy example: When $d=1$, $A = \mathbb{R}$, with linear dynamics $dX_t = u_t dt + σ dW_t + α dM_t$ and quadratic running cost $f(t, x, u) = -\frac{1}{2}(Qu x^2 + Ru u^2)$, the optimal policy becomes Gaussian with mean $-\frac{H(t)}{R} x$ where $H(t)$ solves a Riccati equation modified by jump terms.

Formal objective: Find the optimal policy
\[π^*(·|t,x) = \arg\max_π J(t, x; π)\]

Method

Core Method: The approach introduces a time-inhomogeneous “little” $q$-function and develops an actor-critic framework with policy gradient updates using conditional normalizing flows.

Key Steps:

Define discounted occupation measure:
\[μ^{π,t,x}(A) := \mathbb{E}\left[\int_t^∞ e^{-β(s-t)} \mathbf{1}_{\{(s,X_s^π) ∈ A\}} ds\right]\]
Introduce little $q$-function:
\[q(t, x, u; π) := ∂_t J(t, x; π) + \mathcal{H}(t, x, u, ∇_x J(t, x; π), ∇_x^2 J(t, x; π)) - β J(t, x; π)\]
where $\mathcal{H}$ is the Hamiltonian including jump terms.
Policy gradient formula (Theorem 3.1):
\[∇_θ J(t, x; π_θ)|_{θ=θ_0} = \frac{1}{β} \mathbb{E}_{(s,y) \sim βμ^{π_{θ_0},t,x}, u \sim π_{θ_0}(·|s,y)} [\nabla_θ \log π_θ(u|s,y)|_{θ=θ_0} A_{ent}(s, y, u; θ_0)]\]
where
\[A_{ent}(s, y, u; θ_0) = q(s, y, u; π_{θ_0}) - γ \log π_{θ_0}(u|s,y)\]
Tractable $q$-approximation (Lemma 3.3):
\[\tilde{q}_{δt}(t, x, u; π) = \frac{1}{δt}[f(t, x, u) δt + e^{-βδt} J(t+δt, X_{t+δt}^u; π) - J(t, x; π)]\]

Conditional normalizing flow parameterization:

\[π_θ(u|t,x) = \mathcal{N}(\bar{μ}_θ(t,x), \text{Std}_θ^2(t,x)) \circ F_θ^{-1}(·; t,x) \circ S^{-1}\]

where $F_θ$ is learnable invertible flow and $S$ is optional squashing map.
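The flow parameterization needs both sampling and an exact log-density, which come from the change-of-variables formula. The one-dimensional sketch below uses an affine map as a stand-in for $F_θ$ and $\tanh$ as the squashing map $S$; both choices, the constant parameters (which in the paper depend on $(t,x)$), and all names are assumptions made for illustration.

```python
import math

def gaussian_pdf(z, mean, std):
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def sample_action(rng, mean, std, a, b):
    """Draw z ~ N(mean, std^2), push it through y = a*z + b, then u = tanh(y)."""
    z = rng.gauss(mean, std)
    return math.tanh(a * z + b)

def policy_density(u, mean, std, a, b):
    """Density of u = tanh(a*z + b) at a point u in (-1, 1)."""
    y = math.atanh(u)                  # invert the squashing map
    z = (y - b) / a                    # invert the affine flow
    jac = abs(a) * (1.0 - u * u)       # |du/dz| = |a| * (1 - tanh(y)^2)
    return gaussian_pdf(z, mean, std) / jac

def total_mass(mean, std, a, b, n=20000):
    """Midpoint-rule integral of the density over (-1, 1)."""
    h = 2.0 / n
    return h * sum(policy_density(-1.0 + (i + 0.5) * h, mean, std, a, b)
                   for i in range(n))

# Hypothetical parameter values.
MASS = total_mass(mean=0.3, std=0.5, a=1.5, b=-0.2)
```

The density integrates to one up to quadrature error, which is a quick sanity check on the Jacobian term. The same $\log π_θ$ is what enters the entropy term $γ \log π_θ(u|s,y)$ in the advantage.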

Application to toy example: For the linear-quadratic case with quadratic Hamiltonian, the optimal policy is Gaussian, so the flow reduces to identity and the method recovers the analytical solution $π^*(u|t,x) = \mathcal{N}(R^{-1} B^T H(t) x, \frac{γ}{2} R^{-1})$ where $H(t)$ solves the modified Riccati equation with jump terms.

Novelty & Lineage

Prior work:

Hu et al. (2021): Continuous-time policy gradients for finite-horizon diffusion control with Gaussian policies
Wang et al. (2020): “Little” $q$-function for finite-horizon jump-diffusion with entropy regularization
Jia et al. (2023): Time-homogeneous infinite-horizon diffusion control with generalized advantage estimation

Delta: This paper extends to time-inhomogeneous infinite-horizon jump-diffusion control with three key additions:

Time-inhomogeneous little $q$-function including explicit $∂_t J$ term for infinite-horizon discounting
Conditional normalizing flows for non-Gaussian stochastic policies (prior work limited to Gaussian)
Rigorous treatment of time-dependent occupation measures under discounting

Theory-specific assessment:
- Main theorem (policy gradient) is a predictable extension of [Wang et al. 2020] and [Jia et al. 2023] to the time-inhomogeneous setting
- Proof technique assembles known martingale methods and occupation measure identities - no fundamentally new mathematical insights
- The $\tilde{q}_{δt}$ approximation (Lemma 3.3) is standard first-order Taylor expansion, similar to GAE literature
- No lower bounds are established; tightness of approximations unknown
Technical novelty: The time-inhomogeneous extension requires careful handling of discounted occupation measures on $[t,∞) \times \mathbb{R}^d$, but this is achieved through routine modifications of existing techniques. The normalizing flow parameterization is borrowed from the ML literature.

Verdict: INCREMENTAL — Solid but expected extension combining known techniques (little $q$-functions, normalizing flows, actor-critic) to a broader setting without fundamental theoretical breakthroughs.

Proof Techniques

Main proof strategy for policy gradient theorem:

Occupation measure representation (Lemma 3.1):
\[\mathbb{E}\left[\int_t^∞ e^{-β(s-t)} φ(s, X_s^π) ds\right] = \int_{[t,∞) × \mathbb{R}^d} φ(s,y) μ^{π,t,x}(ds, dy)\]
Generator identity (Lemma 3.2): For $φ ∈ C^{1,2}([0,∞) × \mathbb{R}^d)$,
\[\mathbb{E}\left[\int_t^∞ e^{-β(s-t)} [-∂_s φ - \mathcal{L}^π φ + β φ](s, X_s^π) ds\right] = φ(t,x)\]
Policy difference decomposition: Apply Lemma 3.2 with $φ = J(·,·; \hat{π})$ under measure induced by baseline policy $π$:
\[J(t,x; \hat{π}) - J(t,x; π) = \frac{1}{β} \mathbb{E}_{μ^{\hat{π},t,x}, u \sim \hat{π}} [q(s, X_s^{\hat{π}}, u; π) - γ \log \hat{π}(u|s, X_s^{\hat{π}})]\]

Differentiation: For parameterized policies $π_θ$, differentiate the identity at $θ = θ_0$ using:

\[\frac{d}{dθ} \mathbb{E}_{u \sim π_θ} [g(u)] = \mathbb{E}_{u \sim π_θ} [\nabla_θ \log π_θ(u) g(u)]\]
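The score-function identity in the differentiation step can be checked by Monte Carlo in a case where the derivative is known. The sketch assumes a one-dimensional Gaussian family $π_θ = \mathcal{N}(θ, 1)$ and the test function $g(u) = u^2$, neither of which comes from the paper. In that case $\mathbb{E}[g(u)] = θ^2 + 1$, so the derivative is $2θ$.

```python
import random

def score_function_gradient(theta, g, n_samples, seed=0):
    """Monte Carlo estimate of d/dtheta E_{u ~ N(theta, 1)}[g(u)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        u = rng.gauss(theta, 1.0)
        score = u - theta          # d/dtheta log N(u; theta, 1)
        total += score * g(u)
    return total / n_samples

# At theta = 1 the exact derivative of theta^2 + 1 is 2.
ESTIMATE = score_function_gradient(1.0, lambda u: u * u, n_samples=200000)
```

The estimator is unbiased but has high variance, which is one reason advantage-type terms such as $A_{ent}$ are used in place of raw returns.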

Key inequalities: The proof relies on standard martingale properties and does not require novel concentration bounds. The main technical work is in handling the time-derivative term $∂_t J$ which appears due to time-inhomogeneity.

GAE approximation proof (Lemma 3.3): Uses first-order Itô expansion:

\[\mathbb{E}[\tilde{q}_{δt}(t,x,u;π)] = q(t,x,u;π) + o(1)\]

as $δt \to 0$ through Taylor expansion of the value function increments.
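The limit can be seen in a deterministic special case where no expectation is needed. The sketch assumes drift $b = u$ with no diffusion or jumps, a running reward $f = -u^2$, and an arbitrary smooth function $J(t,x) = \sin(t)\,x^2$ standing in for the value function; all of these are hypothetical. Under them, $q = ∂_t J + f + u\,∂_x J - βJ$.

```python
import math

BETA = 0.1  # hypothetical discount rate

def J(t, x):
    """Smooth stand-in for J(t, x; pi); not an actual value function."""
    return math.sin(t) * x * x

def f(t, x, u):
    """Hypothetical running reward."""
    return -u * u

def q_exact(t, x, u):
    """q = dJ/dt + H - beta * J with H = f + u * dJ/dx (drift b = u only)."""
    dJ_dt = math.cos(t) * x * x
    dJ_dx = 2.0 * math.sin(t) * x
    return dJ_dt + f(t, x, u) + u * dJ_dx - BETA * J(t, x)

def q_tilde(t, x, u, dt):
    """One-step approximation from Lemma 3.3, with x moving along dx = u dt."""
    x_next = x + u * dt
    return (f(t, x, u) * dt + math.exp(-BETA * dt) * J(t + dt, x_next)
            - J(t, x)) / dt

T0, X0, U0 = 0.3, 1.0, 0.5
ERR_COARSE = abs(q_tilde(T0, X0, U0, 1e-2) - q_exact(T0, X0, U0))
ERR_FINE = abs(q_tilde(T0, X0, U0, 1e-4) - q_exact(T0, X0, U0))
```

In this smooth deterministic case the error shrinks roughly in proportion to $δt$. With diffusion and jumps the same comparison holds only in expectation, as stated in the lemma.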

Technical insight: The main challenge is properly defining the exploratory SDE (2.10) to avoid measurability issues when continuously sampling from $π(·|t,x)$. This is handled by introducing auxiliary randomness through Poisson measure on extended space $\mathbb{R}^d × [0,1]^m$.

Experiments & Validation

Datasets and benchmarks:

Linear-quadratic control: Time-homogeneous and time-inhomogeneous variants with analytical solutions from modified Riccati equations
Merton portfolio optimization: Classical and entropy-regularized versions, with PINN solver providing high-accuracy benchmark for non-Gaussian optimal policies
Multi-agent portfolio game: 25-agent strategic interaction with analytical Nash equilibrium characterized by coupled first-order conditions

Baselines: Exact analytical solutions for LQ problems and Merton (deterministic case), PINN-computed benchmarks for entropy-regularized Merton, game-theoretic equilibrium analysis for multi-agent setting.

Key metrics: Time-averaged relative RMSEs for state trajectories, value functions, and controls. For stochastic policies ($γ > 0$), KL divergence between learned and optimal policy distributions.

Representative results:
- LQ control: Value error $E_V \approx 0.004$ across dimensions $d ∈ \{1,5,20,50\}$
- Merton problem: Close agreement between learned flow-based policies and PINN benchmarks for non-Gaussian distributions
- Multi-agent game: Runtime scales linearly with number of agents; stable performance up to $n=25$ agents
Computational details: PyTorch implementation on NVIDIA RTX 4090, step sizes $δt ∈ \{0.01, 0.05\}$, training iterations $N_{itr} ∈ \{1000, 3000\}$ depending on problem complexity.

Limitations & Open Problems

Limitations:

TECHNICAL: Approximation quality of $\tilde{q}_{δt}$ depends on step size $δt$ with no explicit convergence rates provided - standard first-order error but could be restrictive in practice
TECHNICAL: Normalizing flow parameterization adds computational overhead compared to Gaussian policies, though enables broader policy classes - trade-off between expressiveness and efficiency
NATURAL: Requires knowledge of Lévy measure $ν$ for jump term evaluation in critic updates - standard assumption in jump-diffusion literature but limits model-free applicability
RESTRICTIVE: Method assumes access to gradient information $∇_x V_ψ$ for martingale correction terms, which may be numerically unstable for complex value network architectures
TECHNICAL: Evaluation of non-local compensator term $∫ [V(t,x+α(t,x,u,z)) - V(t,x)] ν(dz)$ is computationally expensive and approximated in practice
NATURAL: Infinite-horizon setting requires careful discount factor choice $β > 0$ to ensure convergence - standard in discounted control but adds hyperparameter sensitivity

Open problems:
Convergence analysis: Establish finite-sample convergence rates for the actor-critic algorithm under jump-diffusion dynamics with explicit dependence on approximation parameters $δt$, network capacity, and regularization strength $γ$
Model-free extension: Develop methods to learn or approximate the Lévy measure $ν$ and SDE coefficients $(b,σ,α)$ from data, potentially through kernel density estimation or neural approaches for the jump measure