Apr 10, 2026 Theory 3 papers

Theory Digest — Apr 10, 2026

Today’s Digest at a Glance

Preliminary

Today’s digest examines three methodological advances: extending contextual bandits to hidden Markov models, developing sensitivity analysis for instrumental variables without monotonicity assumptions, and creating neural networks that incorporate structural economic theory.

Hidden Markov Model Contextual Bandits

Traditional contextual bandits assume that contexts are directly observable and that expected rewards are determined by context-action pairs. However, many real-world scenarios involve latent states that evolve over time according to hidden Markov dynamics, where the observed context provides only partial information about the underlying state. The naive approach of treating each context as independent fails because it ignores the temporal correlation structure encoded in the latent state transitions.

The core challenge is that rewards depend on the true hidden state $h_t$, but we only observe contexts $x_t$ that are generated from the hidden states. The key insight is to maintain belief states $b_t(h) = P(h_t = h \mid x_{1:t})$ that represent the posterior probability distribution over hidden states given the history of observed contexts. The algorithm then applies contextual bandit methods (like LinUCB) to these estimated beliefs rather than the raw contexts.

The technical innovation involves a staged approach where parameters are updated only at specific intervals of length $\ell$, combined with a forgetting mechanism that leverages the mixing properties of the underlying HMM. This allows the algorithm to adapt to the latent dynamics while maintaining computational tractability. Intuitively, the method “smooths out” the hidden state uncertainty by working with probability distributions rather than point estimates.
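The belief state $b_t(h)$ described above can be computed recursively with the standard HMM forward filter. The sketch below is illustrative only: it assumes the HMM parameters are known and the contexts are discrete, and the names `forward_beliefs`, `emit`, and the example numbers are hypothetical rather than taken from the paper.

```python
import numpy as np

def forward_beliefs(pi, M, emission_lik):
    """Return beliefs[t, h] = P(h_t = h | x_{1:t}) via the HMM forward filter.

    pi           : (H,) initial state distribution
    M            : (H, H) transition matrix, M[i, j] = P(h_{t+1} = j | h_t = i)
    emission_lik : (T, H) array, emission_lik[t, h] = likelihood of x_t under state h
    """
    T, H = emission_lik.shape
    beliefs = np.zeros((T, H))
    prior = pi
    for t in range(T):
        unnorm = prior * emission_lik[t]    # Bayes update with the new context
        beliefs[t] = unnorm / unnorm.sum()  # normalize to a probability vector
        prior = beliefs[t] @ M              # one-step prediction for the next round
    return beliefs

# Hypothetical two-state example with binary contexts (all numbers are assumptions).
pi = np.array([0.5, 0.5])
M = np.array([[0.9, 0.1], [0.2, 0.8]])
emit = np.array([[0.8, 0.2], [0.3, 0.7]])  # emit[h, x] = P(x | h)
contexts = [0, 0, 1, 1, 1]
lik = emit[:, contexts].T                   # shape (T, H)
beliefs = forward_beliefs(pi, M, lik)
```

Each row of `beliefs` is a probability vector over hidden states; a bandit algorithm would consume these rows in place of point estimates of the state.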

Instrumental Variable Sensitivity Analysis

Instrumental variable (IV) methods estimate causal effects when treatment assignment is confounded, using an instrument that affects treatment but not the outcome directly (exclusion restriction) and is uncorrelated with unobserved confounders (exogeneity). Traditional IV analysis assumes these conditions hold exactly, but in practice they are often violated to unknown degrees.

Sensitivity analysis for IV estimators typically requires first-stage monotonicity (the instrument affects everyone in the same direction) to ensure well-defined complier subpopulations. However, this assumption is restrictive and often implausible. The challenge is to assess robustness to IV violations while allowing for heterogeneous treatment effects without monotonicity constraints.

The solution uses linear programming to characterize identified sets under sensitivity constraints. For binary outcomes, the method constructs bounds by intersecting Manski-style no-assumption bounds $H_x = \prod_{z=0,1} [P(Y=1,X=x \mid Z=z), P(Y=1,X=x \mid Z=z) + \pi(1-x \mid z)]$ with user-specified sensitivity parameters that bound the degree of exclusion and exogeneity violations. The key insight is that even without point identification, we can still obtain informative bounds on treatment effects by explicitly modeling the extent of assumption violations.

Structured-Knowledge-Informed Neural Networks (SKINNs)

Traditional neural networks are purely data-driven and lack mechanisms to incorporate domain-specific theoretical knowledge, while classical econometric methods can embed theory but have limited flexibility for complex nonlinear relationships. This creates a gap when we want both the flexibility of neural networks and the interpretability of structural economic models.

SKINNs address this by jointly optimizing neural network parameters $\theta$ and interpretable structural parameters $\phi$ through a composite loss function. The key innovation is the theory component that enforces consistency with economic theory via collocation methods. This involves evaluating theoretical relationships (like partial differential equations from option pricing theory) at randomly sampled points and penalizing deviations: $L_{theory} = \mathbb{E}[|F(x; NN(x; \theta), \phi)|^2]$ where $F$ represents the theoretical constraint.

The method maintains the universal approximation properties of neural networks while ensuring that the learned function respects economic principles. Intuitively, it “teaches” the neural network to follow economic theory while still allowing data-driven flexibility in regions where theory is silent or approximate.

Reading Guide

The first paper extends contextual bandits (covered previously) to handle latent state dynamics through belief estimation and staged parameter updates. The second develops linear programming methods for IV sensitivity analysis that unify existing approaches while removing monotonicity requirements. The third bridges machine learning and econometric theory by embedding structural constraints directly into neural network training via composite loss functions.

A Direct Approach for Handling Contextual Bandits with Latent State Dynamics

Authors: Zhen Li, Gilles Stoltz · Institution: BNP Paribas, Université Paris-Saclay, HEC Paris · Category: cs.LG

Extends contextual bandits with HMM dynamics to handle direct state-dependent rewards (not just belief-dependent), achieving T^{7/8} regret via staged LinUCB with HMM forgetting properties.

Tags: contextual bandits hidden Markov models regret bounds belief estimation sequential decision making latent state dynamics

Problem Formulation

Motivation: This paper studies linear contextual bandits where contexts and rewards are governed by a hidden Markov model. Such settings arise naturally when the environment has latent state dynamics that evolve over time, making standard contextual bandit approaches inadequate.
Mathematical setup: Consider a finite action set A and context space X ⊆ ℝᵐ. At each round t ≥ 1, there exists a latent state hₜ ∈ [H] following a hidden Markov model: the initial state h₁ ~ π, contexts xₜ are drawn from emission distributions νₕₜ, and transitions follow Markov matrix M. The reward model is:
\[r_t(a) = \phi(a, x_t)^T \theta^*_{h_t} + \eta_t(a)\]
where φ: A × X → ℝᵈ is a known transfer function, θ^*_h ∈ ℝᵈ are unknown state-dependent parameters, and η_t(a) is noise.

Assumptions:

Boundedness: |φ(a,x)^T θ^*_h| ≤ 1, ‖φ(a,x)‖₂ ≤ 1, ‖θ^*_h‖₂ ≤ C_{θ*}
Noise: E[η_t(a) | ℱ^all_t] = 0 and E[η_t(a)² | ℱ^all_t] ≤ C_η
Toy example: When d = 2, H = 2, and φ(a,x) = (a,x) with binary actions, the reward depends on both the observed context and the hidden state, creating complex dependencies that cannot be reduced to standard linear bandits.
Formal objective: The pseudo-regret is defined as:
\[R_T = \sum_{t=1}^T \max_{a∈A} \sum_{h∈[H]} b_t(h) \phi(a,x_t)^T \theta^*_h - \sum_{t=1}^T \sum_{h∈[H]} b_t(h) \phi(a_t,x_t)^T \theta^*_h\]

where b_t(h) = P(h_t = h | x_{1:t}) are beliefs based only on contexts.

Method

The method is a staged LinUCB algorithm on estimated beliefs that proceeds in stages of length ℓ.

Key components:

Belief estimation subroutine B that produces estimates b̂_t(h) of the beliefs b_t(h) = P(h_t = h | x_{1:t}) based only on contexts
Staged parameter updates: estimates θ̂_t are updated only at rounds t that are multiples of ℓ
Within each stage, rewards are estimated as:
\[\hat{r}_t(a) = \sum_{h∈[H]} \hat{b}_t(h) \phi(a,x_t)^T \hat{\theta}_{(s-1)ℓ,h} + \epsilon_{t,a}\]
where ε_{t,a} are confidence bonuses and s is the current stage.
The LinUCB-style parameter estimates are:
\[\hat{\theta}_t = G_t^{-1} \sum_{\tau=1}^t (\hat{b}_τ ⊗ \phi(a_τ, x_τ)) r_τ(a_τ)\]
where G_t = ∑_{τ=1}^t (\hat{b}_τ ⊗ φ(a_τ, x_τ))(\hat{b}_τ ⊗ φ(a_τ, x_τ))^T + λI_{dH}.

Applied to toy example: With d = 2, H = 2, binary actions, the algorithm maintains beliefs over 2 states, updates parameter estimates θ̂₁, θ̂₂ ∈ ℝ² every ℓ rounds, and picks actions optimistically based on estimated rewards plus confidence bounds.
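The staged update described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the confidence bonus uses the generic LinUCB form α·‖v‖ in the G⁻¹ norm, the beliefs are stand-ins for the output of the belief subroutine, and the class name and the parameters `alpha` and `lam` are assumptions.

```python
import numpy as np

class StagedBeliefLinUCB:
    """LinUCB on features b_hat ⊗ phi(a, x), with estimates frozen within each stage."""

    def __init__(self, d, H, stage_len, lam=1.0, alpha=1.0):
        self.stage_len, self.alpha = stage_len, alpha
        self.G = lam * np.eye(d * H)        # regularized Gram matrix G_t
        self.rhs = np.zeros(d * H)          # running sum of feature * reward
        self.G_inv = np.linalg.inv(self.G)  # frozen copy used inside a stage
        self.theta_hat = np.zeros(d * H)    # stacked estimates (theta_1, ..., theta_H)
        self.t = 0

    def select(self, belief, phis):
        """belief: (H,) estimated belief; phis: (n_actions, d) rows phi(a, x_t)."""
        feats = np.stack([np.kron(belief, phi) for phi in phis])
        means = feats @ self.theta_hat      # sum_h b_hat(h) * phi^T theta_hat_h
        widths = np.sqrt(np.einsum("ij,jk,ik->i", feats, self.G_inv, feats))
        return int(np.argmax(means + self.alpha * widths)), feats

    def update(self, feat, reward):
        self.G += np.outer(feat, feat)
        self.rhs += feat * reward
        self.t += 1
        if self.t % self.stage_len == 0:    # refresh only at stage boundaries
            self.G_inv = np.linalg.inv(self.G)
            self.theta_hat = self.G_inv @ self.rhs

# Toy run with d = 2, H = 2, two actions; the data are random placeholders.
rng = np.random.default_rng(0)
agent = StagedBeliefLinUCB(d=2, H=2, stage_len=5)
for _ in range(20):
    belief = rng.dirichlet(np.ones(2))      # stand-in for the belief subroutine
    phis = rng.normal(size=(2, 2))          # phi(a, x_t) for each action
    a, feats = agent.select(belief, phis)
    agent.update(feats[a], reward=rng.normal())
```

The Kronecker feature makes the belief-weighted reward linear in the stacked parameter vector, which is what lets a single ridge regression of dimension dH estimate all H state-dependent parameters at once.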

Novelty & Lineage

Step 1 — Prior work:

Nelson et al. (2022): “Linearizing contextual bandits with latent state dynamics” - studied a simplified model where rewards depend on beliefs rather than hidden states directly
Zhou et al. (2021): “Regime switching bandits” - considered binary rewards and spectral methods with T^{2/3} regret
Azizzadenesheli et al. (2016): context-free rewards depending only on states and actions, not contexts

Step 2 — Delta: This paper studies the more natural model where rewards depend directly on hidden states h_t rather than on posterior beliefs. It provides high-probability regret bounds (not just expected bounds) and handles the full complexity of belief-reward dependencies.

Step 3 — Theory-specific assessment:
- Main theorem: Somewhat predictable extension - the T^{7/8} rate for direct state dependence vs T^{3/4} for belief dependence follows expected complexity scaling
- Proof technique: Standard LinUCB analysis adapted to stages, combined with HMM forgetting properties. Uses L2-Markov inequalities instead of exponential concentration
- Bound tightness: No known lower bounds provided. The T^{7/8} vs T^{3/4} gap appears due to staging necessity rather than fundamental limits
Verdict: INCREMENTAL — solid extension of Nelson et al. (2022) to a more natural but predictably harder model, with expected rate degradation.

Proof Techniques

The proof uses a staged LinUCB analysis combined with HMM forgetting properties:

Staged filtration construction: Define augmented filtration U_t = σ(x_{1:t}, {θ̂_{sℓ}}_{s ≤ s_t-1}) such that actions a_t are U_t-measurable while beliefs b_t(h) and P(h_t = h | U_t) remain close.

Key concentration inequality: Show that confidence bonuses ε_{t,a} satisfy:
\[\left|\sum_{h∈[H]} b_t(h)φ(a,x_t)^T θ^*_h - \sum_{h∈[H]} \hat{b}_t(h)φ(a,x_t)^T \hat{\theta}_{(s_t-1)ℓ,h}\right| ≤ ε_{t,a} + 2T_0/λ\]
Three-term decomposition: The estimation error splits into:
- S_Δ = ∑_τ φ(a_τ,x_τ)^T(θ^*_{h_τ} - ∑_h b̄_τ(h)θ^*_h): direct state dependence term
- S_b: belief estimation error term
- S_η: noise term
L2-Markov bound for S_Δ: Since E[z_τ | U_τ] = 0 but z_τ is not U_τ-measurable, use:

\[E[S_Δ^2] = O(sℓ)\]

by HMM forgetting (Assumption 5.1), giving |S_Δ| = O(√(sℓ/δ_s)) with probability at least 1-δ_s.

Elliptic potential bound: Standard LinUCB bound ∑_t ‖G_t^{-1}(\hat{b}_t ⊗ φ(a_t,x_t))‖_2 ≤ √(2dHT ln(1 + T/(dHλ))).

The technical insight is constructing the staged filtration to balance measurability constraints with belief estimation accuracy.
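The L2-Markov step above amounts to Markov's inequality applied to $S_\Delta^2$. As a worked sketch, assume the second-moment bound holds with an explicit constant $C$ (a placeholder introduced here, not a value from the paper), so that $E[S_\Delta^2] \le C s\ell$. Then for any $u > 0$:

\[P\big(|S_\Delta| \ge u\big) = P\big(S_\Delta^2 \ge u^2\big) \le \frac{E[S_\Delta^2]}{u^2} \le \frac{C s\ell}{u^2}\]

Choosing $u = \sqrt{C s\ell/\delta_s}$ makes the right-hand side equal to $\delta_s$, so $|S_\Delta| \le \sqrt{C s\ell/\delta_s}$ with probability at least $1-\delta_s$. The price of using only a second moment is the $1/\sqrt{\delta_s}$ dependence, compared with the $\sqrt{\ln(1/\delta_s)}$ dependence that exponential concentration would typically give.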

Experiments & Validation

Purely theoretical. The paper provides no empirical validation.

Natural empirical validation would include:

Synthetic HMM environments with known parameters to verify regret rates
Comparison with standard contextual bandits that ignore state dynamics
Real applications like recommendation systems with user state evolution
Analysis of how belief estimation subroutines perform in practice

Limitations & Open Problems

Limitations:

HMM parameters must be estimable (TECHNICAL - standard assumption, could potentially be relaxed)
Finite action and state spaces (RESTRICTIVE - limits applicability to many continuous control problems)
Known horizon T (TECHNICAL - standard in regret analysis, removable with doubling tricks)
Exponential forgetting condition (NATURAL - widely satisfied by ergodic HMMs)
Bounded transfer functions and parameters (NATURAL - standard boundedness assumptions)

Open problems:
Close the gap between T^{7/8} upper bound and unknown lower bounds - is T^{3/4} achievable for direct state dependence?
Extend to infinite or continuous action/state spaces using function approximation techniques.

Assessing Sensitivity to IV Exclusion and Exogeneity without First Stage Monotonicity

Authors: Paul Diegert, Matthew A. Masten, Alexandre Poirier · Institution: Duke University · Category: econ.EM

Develops linear programming framework for IV sensitivity analysis under heterogeneous treatment effects without monotonicity, unifying MSM, c-dependence, and KS approaches for both discrete and continuous outcomes.

Tags: instrumental_variables sensitivity_analysis partial_identification linear_programming treatment_effects causal_inference robustness nonparametric_bounds

Problem Formulation

Motivation: Instrumental variable (IV) analysis relies on exclusion and exogeneity assumptions that are often debated empirically. Violations can invalidate causal inferences, creating a need for sensitivity analysis that does not rely on restrictive monotonicity assumptions.

Mathematical setup: Let $X \in \{0,1\}$ denote the binary treatment and $Z \in \{0,1\}$ the binary instrument. Define potential outcomes $\{Y(x,z)\}_{x,z \in \{0,1\}}$ with observed outcome

\[Y = Y(X,Z)\]

Let $p_Y(x,z) := P(Y(x,z) = 1 \mid Z = z)$ denote conditional probabilities and $p_Y := (p_Y(0,0), p_Y(0,1), p_Y(1,0), p_Y(1,1)) \in [0,1]^4$. Key assumptions:

Regularity: $p_Z \in (0,1)$ and $\pi(x \mid z) \in (0,1)$ for all $x,z \in \{0,1\}$, where $\pi(x \mid z) := P(X = x \mid Z = z)$
Instrument validity: $Z$ is exogenous OR weakly excluded (but not necessarily both)

Sensitivity model: $p_Y \in A_0(\theta) \times A_1(\theta)$ where $\theta \in [0,1]$ indexes violations

Toy example: When $p_Z = 0.5$ and the observed data show $P(Y=1,X=0 \mid Z=0) = 0.3$ and $P(Y=1,X=0 \mid Z=1) = 0.5$, with $\pi(1 \mid 0) = \pi(1 \mid 1) = 0.5$ as implied by the interval widths, the no-assumption bounds give $p_Y(0,0) \in [0.3, 0.8]$ and $p_Y(0,1) \in [0.5, 1]$. Under perfect validity ($\theta=0$), we require $p_Y(0,0) = p_Y(0,1)$, giving intersection $[0.5, 0.8]$.
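The arithmetic in this toy example can be reproduced in a few lines. The helper names below are hypothetical, and the values $\pi(1 \mid 0) = \pi(1 \mid 1) = 0.5$ are inferred from the widths of the stated intervals rather than given explicitly.

```python
def manski_interval(p_joint, pi_other):
    """No-assumption interval for p_Y(x, z).

    p_joint  : P(Y=1, X=x | Z=z), the observed part
    pi_other : pi(1-x | z) = P(X=1-x | Z=z), mass where Y(x, z) is unobserved
    """
    return (p_joint, p_joint + pi_other)

def intersect(a, b):
    """Intersection of two closed intervals, or None if it is empty."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

# Numbers from the toy example above (x = 0).
box_z0 = manski_interval(0.3, 0.5)    # bounds on p_Y(0,0): roughly [0.3, 0.8]
box_z1 = manski_interval(0.5, 0.5)    # bounds on p_Y(0,1): [0.5, 1.0]
common = intersect(box_z0, box_z1)    # values allowed when p_Y(0,0) = p_Y(0,1)
```

An empty intersection (a `None` return) would mean that perfect validity is rejected by the data for that treatment arm.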

Formal objective: Identify bounds for average treatment effect:

\[\text{ATE} = E[Y(1,Z) - Y(0,Z)] = \Gamma_{\text{ATE}}(p_Y)\]

Method

The method constructs identified sets via intersection of data-consistent bounds with sensitivity constraints. For binary outcomes:

No-assumption bounds: Compute Manski-style bounds
\[H_x = \prod_{z=0,1} [P(Y=1,X=x|Z=z), P(Y=1,X=x|Z=z) + \pi(1-x|z)]\]
Sensitivity constraints: Define convex polytopes $A_x(\theta)$ encoding violations. Three examples:
- MSM: $A_{MSM}(\lambda) = \{p \in [0,1]^2 : A_{MSM}(\lambda)p \leq a_{MSM}(\lambda)\}$ where $\lambda = 1-\Lambda^{-1}$
- c-dependence: $A_{c-dep}(c) = \{p \in [0,1]^2 : A_{c-dep}(c)p \leq a_{c-dep}(c)\}$
- KS distance: $A_{KS}(K) = \{p \in [0,1]^2 : |p_1 - p_0| \leq K\}$
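Linear programming: The identified set is $\Pi(\theta) = \Pi_0(\theta) \times \Pi_1(\theta)$ where $\Pi_x(\theta) = H_x \cap A_x(\theta)$. Bounds for functionals like the ATE are obtained by solving linear programs over this set; the upper bound is
\[\overline{\text{ATE}}(\theta) = \max_{p_Y \in \Pi(\theta)} \Gamma_{\text{ATE}}(p_Y)\]
and the lower bound replaces the max with a min.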
Continuous outcomes: Use a sieve approximation with basis functions $\{\phi_j\}_{j=1}^J$ to approximate the infinite-dimensional linear program.

Toy example application: As a hypothetical illustration with $c=0.2$ in the c-dependence model, if the baseline gives ATE $\in [0.1, 0.4]$, allowing moderate violations might expand the bounds to $[-0.1, 0.6]$, showing how sensitive the conclusion is to the assumption.
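The binary-outcome program with the KS constraint is small enough to solve by hand. The sketch below is a rough illustration under stated assumptions: it reads $\Gamma_{\text{ATE}}(p_Y)$ as $\sum_z P(Z=z)\,[p_Y(1,z) - p_Y(0,z)]$, which follows from the definitions above; it uses brute-force vertex enumeration of each two-dimensional polygon instead of an LP solver; and the $x=1$ box `H1` is made-up data. All function names are hypothetical.

```python
import itertools
import numpy as np

def max_linear_2d(w, A, b, tol=1e-9):
    """Maximize w @ p over the bounded polygon {p : A @ p <= b} by checking vertices."""
    best = None
    for i, j in itertools.combinations(range(len(b)), 2):
        sub = A[[i, j]]
        if abs(np.linalg.det(sub)) < tol:       # parallel constraints: no vertex
            continue
        p = np.linalg.solve(sub, b[[i, j]])
        if np.all(A @ p <= b + tol):            # keep feasible intersection points
            val = float(w @ p)
            best = val if best is None else max(best, val)
    return best                                  # None means the polygon is empty

def ks_polygon(box0, box1, K):
    """Linear constraints for H_x intersected with the KS set |p_1 - p_0| <= K."""
    A = np.array([[1, 0], [-1, 0], [0, 1], [0, -1], [-1, 1], [1, -1]], dtype=float)
    b = np.array([box0[1], -box0[0], box1[1], -box1[0], K, K], dtype=float)
    return A, b

def ate_bounds(H0, H1, p_z1, K):
    """Return (lower, upper) ATE bounds, or None if the model is falsified at K."""
    w = np.array([1.0 - p_z1, p_z1])            # weights P(Z=0), P(Z=1)
    ranges = []
    for H in (H0, H1):
        A, b = ks_polygon(H[0], H[1], K)
        hi, neg_lo = max_linear_2d(w, A, b), max_linear_2d(-w, A, b)
        if hi is None:
            return None
        ranges.append((-neg_lo, hi))
    (lo0, hi0), (lo1, hi1) = ranges
    return lo1 - hi0, hi1 - lo0

H0 = ((0.3, 0.8), (0.5, 1.0))   # x = 0 box from the earlier toy example
H1 = ((0.4, 0.9), (0.2, 0.7))   # x = 1 box: hypothetical numbers
wide = ate_bounds(H0, H1, p_z1=0.5, K=1.0)    # KS constraint not binding
tight = ate_bounds(H0, H1, p_z1=0.5, K=0.0)   # full validity, p_Y(x,0) = p_Y(x,1)
```

Because the identified set is a product $\Pi_0(\theta) \times \Pi_1(\theta)$ and the ATE is linear, the upper bound separates into a maximum over the $x=1$ polygon minus a minimum over the $x=0$ polygon. Sweeping `K` from 0 to 1 traces out how the bounds widen as the assumption is relaxed.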

Novelty & Lineage

Prior work:

Manski (1990): “Nonparametric Bounds on Treatment Effects” - derived sharp bounds under no assumptions vs. full IV validity
Conley, Hansen, Rossi (2012): “Plausibly Exogenous” - sensitivity analysis for linear IV models assuming treatment effect homogeneity
Kitagawa (2021): “A Test for Instrument Validity” - extended testable implications to continuous outcomes with binary treatment/instrument

Delta: This paper allows arbitrary treatment effect heterogeneity while providing continuous parameterization spanning no-assumptions to full validity. Extends beyond binary outcomes without monotonicity requirements. Introduces unified framework encompassing MSM, c-dependence, and KS approaches.

Theory-specific assessment:
- Main theorem (sharp characterization via linear programming) follows predictably from convex polytope structure, building incrementally on Manski bounds
- Proof technique assembles known results: linear programming duality, correspondence continuity lemmas, sieve approximation theory
- Bounds match known extremes (Manski no-assumptions bounds when $\theta=1$, standard IV bounds when $\theta=0$) but tightness vs. unknown lower bounds not established
The continuous outcome extension requires more technical machinery but uses standard sieve approximation methods. The unified sensitivity framework is organizationally useful but mathematically straightforward given individual models’ linearity.

Verdict: INCREMENTAL — Solid extension combining existing sensitivity approaches with useful computational framework, but core insights follow expectedly from prior Manski/Kitagawa work.

Proof Techniques

Main proof strategy proceeds through convex analysis and linear programming theory:

Polytope characterization: Show identified sets are intersections of convex polytopes. Key insight uses data bounds
\[H_x = [P(Y=1,X=x|Z=0), P(Y=1,X=x|Z=0) + \pi(1-x|0)] \times [P(Y=1,X=x|Z=1), P(Y=1,X=x|Z=1) + \pi(1-x|1)]\]
intersected with linear constraint sets $A_x(\theta)$ defined by matrix inequalities $A(\theta)p \leq a(\theta)$.
Falsification point characterization: Uses intersection emptiness. Define
\[\bar{\theta} = \inf\{\theta \in [0,1] : H_x \cap A_x(\theta) \neq \emptyset \text{ for all } x\}\]
Proved via compactness of constraint sets and continuity of correspondence.
Continuity properties: Apply Berge maximum theorem. For continuous functional $\Gamma$, bounds
\[\bar{\Gamma}(\theta) = \sup_{p_Y \in \Pi(\theta)} \Gamma(p_Y)\]
are continuous in $\theta$ when $\Pi(\theta)$ is continuous as set-valued correspondence.
Infinite-dimensional extension: For continuous outcomes, use sieve approximation. Key inequality shows
\[\left|\bar{\Gamma}_J(\theta) - \bar{\Gamma}(\theta)\right| \leq C \cdot J^{-r}\]
where $J$ is number of basis functions, $r$ depends on smoothness assumptions, using standard approximation theory for compact function spaces.
Linear programming duality: Exploits strong duality to show
\[\max_{p \in \Pi(\theta)} c^T p = \min_{\lambda \geq 0} b(\theta)^T \lambda\]
subject to $A(\theta)^T \lambda = c$, enabling efficient computation.

Experiments & Validation

Empirical Application: Gilchrist and Sands (2016) peer effects in movie viewership using weather instruments. Key findings:

Baseline IV estimate shows positive peer effect
Under c-dependence with $c=0.1$: bounds remain positive
Under c-dependence with $c=0.2$: bounds include zero
MSM sensitivity analysis shows robustness to moderate exclusion violations
KS bounds demonstrate similar pattern with threshold around $K=0.15$

Computational validation:

Linear programming solver (CPLEX) handles discrete cases efficiently
Sieve approximation with $J=50$ basis functions for continuous outcomes
Monte Carlo experiments verify theoretical coverage properties

Datasets: Weather data from NOAA, box office data from industry sources, sample size approximately 2,000 movie releases.

Baselines: Standard 2SLS estimates, Manski no-assumption bounds, confidence intervals from wild bootstrap.

Limitations & Open Problems

Limitations:

TECHNICAL: Compactness requirement (Assumption 7) for continuous outcomes needed for theoretical results but may exclude some practical density classes
TECHNICAL: Sieve approximation introduces additional tuning parameter $J$ requiring rate conditions on smoothness
RESTRICTIVE: Binary treatment/instrument restriction in main theorems, though generalization sketched in Section 2.4
NATURAL: No monotonicity assumptions maintained throughout, which is both strength and limitation for contexts where monotonicity plausible
TECHNICAL: Continuity assumptions on sensitivity parameter correspondences (Assumption 3.4) needed for smooth sensitivity plots

Open Problems:
Optimal sensitivity parameter selection: Develop principled methods for choosing $\theta$ values beyond ad-hoc robustness checks, possibly using external validation or cross-validation approaches
Multiple testing corrections: When conducting sensitivity analysis across multiple parameters simultaneously, appropriate multiple comparison adjustments remain underdeveloped

Bridging Structured Knowledge and Data: A Unified Framework with Finance Applications

Authors: Yi Cao, Zexun Chen, Lin William Cong, Heqing Shi · Institution: Xi’an Jiaotong-Liverpool University, University of Edinburgh, Nanyang Technological University · Category: stat.ML

Develops Structured-Knowledge-Informed Neural Networks (SKINNs) that jointly estimate neural network and interpretable structural parameters through a composite loss enforcing theory consistency via collocation, with M-estimator statistical properties.

Tags: machine-learning-theory econometrics m-estimation neural-networks finance option-pricing physics-informed-neural-networks transfer-learning

Problem Formulation

Motivation: Traditional neural networks achieve strong predictive performance but lack interpretability and struggle under distributional shifts, while theory-based models are interpretable but rely on oversimplified assumptions. This creates a critical gap in domains like finance where both flexibility and interpretability are essential.

Mathematical setup: Let $f(X; \theta): \mathbb{R}^d \to \mathbb{R}$ be a neural network mapping input features $X \in \mathbb{R}^d$ to targets $y \in \mathbb{R}$, with trainable parameters $\theta$. Let $g(X_{SK}; \phi): \mathbb{R}^{d_{SK}} \to \mathbb{R}$ be a structured-knowledge representation with observable features $X_{SK} \subseteq X$ and latent structural parameters $\phi \in \mathbb{R}^{d_\phi}$.

Assumptions:

The neural network class $\{f_\theta\}$ has sufficient approximation capacity
The structured-knowledge function $g_\phi$ is differentiable in both inputs and parameters
Collocation points $X_{Colloc}$ can be sampled from the valid input domain
Standard regularity conditions for M-estimation hold (compactness, continuity, uniform convergence)

Toy example: In option pricing with $d = 3$ features (spot price $S$, strike $K$, maturity $T$), the Black-Scholes model gives $g_\phi(S,K,T) = \text{BSM}(S,K,T;\sigma)$ where $\phi = \{\sigma\}$ is the volatility parameter. The neural network $f_\theta(S,K,T)$ learns flexible price patterns while being regularized by the Black-Scholes structure.

Formal objective: The composite loss function to minimize is:
\[L_{Total}(\theta, \phi) = \frac{1}{N_{Obs}} \sum_{i=1}^{N_{Obs}} \left(f_\theta(X_i^{Obs}) - y_i^{Obs}\right)^2 + \lambda \frac{1}{N_{Colloc}} \sum_{j=1}^{N_{Colloc}} \left(f_\theta(X_j^{Colloc}) - g_\phi(X_{j,SK}^{Colloc})\right)^2\]

Method

SKINNs jointly optimize neural network parameters $\theta$ and structural parameters $\phi$ via gradient descent on the composite loss. The method proceeds as follows:

Data component: Standard empirical risk minimization on observed data $(X_i^{Obs}, y_i^{Obs})$
Theory component: Enforce consistency between neural network and structured knowledge on collocation points
Joint optimization: Update both parameter sets simultaneously via automatic differentiation

The structured knowledge can take three forms:

Parametric: Closed-form solutions like Black-Scholes:
\[g_\phi(X_{SK}) = \text{BSM}(S, K, T; \sigma)\]
Semi-parametric: Deep surrogate for expensive models:
\[g_\phi(X_{SK}) = f_{SR}(X_{SK}, \phi)\]
where $f_{SR}$ is pre-trained on simulated data.

Non-parametric: High-dimensional representations like risk-neutral probabilities:
\[g_\phi(X_{SK}) = e^{-rT} \sum_{i=1}^{N_z} u(z_i, X_{SK}) Q(z_i)\]
where $\phi = \{Q(z_1), \ldots, Q(z_{N_z})\}$ are probability parameters.

Toy example application: For Black-Scholes option pricing, the method learns neural network weights $\theta$ for flexible price mapping and volatility parameter $\sigma$ simultaneously. The collocation points enforce that $f_\theta(S,K,T) \approx \text{BSM}(S,K,T;\sigma)$ over the entire input domain, not just observed data points.
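The composite loss for this Black-Scholes case can be evaluated as sketched below. This is an illustration under several assumptions: the tiny `mlp` architecture, the zero interest rate `r`, the sampling ranges, and all function names are hypothetical, and the code only evaluates the loss. Joint training of $\theta$ and $\sigma$ would additionally require an automatic differentiation framework.

```python
import math
import numpy as np

def bsm_call(S, K, T, sigma, r=0.0):
    """Black-Scholes call price; the rate r is an extra assumption, not part of phi."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    cdf = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))
    return S * cdf(d1) - K * np.exp(-r * T) * cdf(d2)

def mlp(X, theta):
    """One-hidden-layer network standing in for f_theta."""
    W1, b1, w2, b2 = theta
    return np.tanh(X @ W1 + b1) @ w2 + b2

def skinn_loss(theta, sigma, X_obs, y_obs, X_col, lam):
    """Return (total, data term, theory term) of the composite loss."""
    data_term = np.mean((mlp(X_obs, theta) - y_obs) ** 2)
    g = bsm_call(X_col[:, 0], X_col[:, 1], X_col[:, 2], sigma)
    theory_term = np.mean((mlp(X_col, theta) - g) ** 2)   # collocation penalty
    return data_term + lam * theory_term, data_term, theory_term

rng = np.random.default_rng(0)

def sample(n):
    """Columns: spot S, strike K, maturity T (hypothetical ranges)."""
    return np.column_stack([rng.uniform(80, 120, n),
                            rng.uniform(80, 120, n),
                            rng.uniform(0.1, 1.0, n)])

X_obs, X_col = sample(50), sample(200)
y_obs = bsm_call(X_obs[:, 0], X_obs[:, 1], X_obs[:, 2], sigma=0.2) + rng.normal(0, 0.5, 50)
theta = (0.1 * rng.normal(size=(3, 8)), np.zeros(8), rng.normal(size=8), 0.0)
total, data_term, theory_term = skinn_loss(theta, 0.25, X_obs, y_obs, X_col, lam=1.0)
```

The collocation sample `X_col` is drawn independently of the observed data, which is what lets the theory term constrain the network in regions with no observations.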

Novelty & Lineage

Prior work:

Physics-Informed Neural Networks (PINNs) (Raissi et al., 2019): Embed differential equations via automatic differentiation but suffer gradient pathologies and require observable state variables
Transfer Learning for Finance (Chen et al., 2023): Two-stage approach pre-training on theory then fine-tuning on data, but lacks joint parameter discovery
Functional GMM (Hansen, 1982): Combines empirical moments with theory restrictions but limited to parametric settings

Delta: SKINNs introduce three key advances:
Joint optimization of both neural and structural parameters in a single objective, unlike sequential approaches
Collocation-based enforcement of theory consistency over the entire input domain, not just data points
Flexible knowledge representations from parametric to machine-learned, accommodating high-dimensional latent structures.

Theory-specific assessment:
- The main result (M-estimator consistency with $\sqrt{N}$ convergence) is predictable from classical theory but the application to jointly learned neural-structural systems is novel
- The proof technique is largely routine, assembling standard M-estimation results with neural network approximation theory
- No lower bounds are established; optimality claims for the weighting parameter $\lambda$ are restricted to orthogonal moment conditions
- Identification results under joint flexibility (Proposition 1) provide useful sufficient conditions but are not surprising
Verdict: INCREMENTAL — Solid theoretical foundation for an important practical problem, but the statistical theory largely follows from existing M-estimation results and the novelty lies primarily in the architectural design rather than fundamental theoretical breakthroughs.

Proof Techniques

The main theoretical results rely on standard M-estimation theory with neural network approximation:

Consistency (Theorem 1): Uses uniform law of large numbers for the composite objective:
\[\sup_{(\theta,\phi) \in \Theta \times \Phi} |L_N(\theta,\phi) - L(\theta,\phi)| \to 0\]
Asymptotic normality (Theorem 2): Central limit theorem for the score function with sandwich covariance:
\[\sqrt{N}(\hat{\theta}_N - \theta^*, \hat{\phi}_N - \phi^*) \xrightarrow{d} N(0, H^{-1}\Xi H^{-1})\]
where $H$ is the Hessian and $\Xi$ is the score covariance.
Identification (Proposition 1): For squared loss, the optimal neural function is:
\[f^*_\theta(X) = \frac{E[y|X] + \lambda g_\phi(X_{SK})}{1 + \lambda}\]
The profiled criterion becomes:
\[Q(\phi) = \frac{\lambda}{1+\lambda} E[(E[y|X] - g_\phi(X_{SK}))^2] + \text{const}\]
Generalization bounds: Uses Rademacher complexity for the composite function class with structured regularization, showing improved generalization under distributional shift.
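The identification step above (Proposition 1) can be checked with a short pointwise calculation. The sketch assumes the population version of the loss and that collocation points follow the same distribution as $X$, and it writes $m(X) := E[y|X]$ and $g := g_\phi(X_{SK})$ as shorthand introduced here. For fixed $X$, the objective in $f$ is $(f - m)^2 + \lambda (f - g)^2$ plus the conditional variance of $y$, which does not depend on $f$. The first-order condition gives:

\[2(f - m) + 2\lambda(f - g) = 0 \quad\Longrightarrow\quad f^* = \frac{m + \lambda g}{1 + \lambda}\]

Substituting back, with $f^* - m = \frac{\lambda (g - m)}{1+\lambda}$ and $f^* - g = \frac{m - g}{1+\lambda}$:

\[(f^* - m)^2 + \lambda (f^* - g)^2 = \frac{\lambda^2 + \lambda}{(1+\lambda)^2}(m - g)^2 = \frac{\lambda}{1+\lambda}(m - g)^2\]

Taking expectations over $X$ recovers the profiled criterion $Q(\phi)$, so the structural parameters are chosen to bring $g_\phi$ as close as possible to the conditional mean in mean square.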

The key technical insight is treating the neural component as an infinite-dimensional nuisance parameter while maintaining standard parametric rates for the structural parameters $\phi$. The double-asymptotic framework requires $N_{Colloc}/N \to c$ for finite $c > 0$ to prevent discretization error from degrading convergence rates.

Experiments & Validation

Dataset: S&P 500 index options from OptionMetrics (1996-2022), covering 27 years including dot-com crash, 2008 crisis, and COVID-19. Focus on call options with 7-365 day maturities.

Setup: Rolling 3-month training windows with 1-month and 2-month ahead prediction horizons across 317 periods. Evaluation metrics: RMSE for pricing, Delta-hedging errors.

Baselines: Plain neural networks, boundary-constrained NNs, forward/inverse PINNs, transfer learning, structural models (BSM, Heston, SABR, etc.).

Key results:

10-15% RMSE reduction vs leading NN benchmarks at longer horizons
Countercyclical performance: Gains most pronounced during high-volatility periods (VIX > 80th percentile)
Statistical significance: All SKINN variants significantly outperform plain NNs at 1% level via Diebold-Mariano tests
Delta-hedging: Consistent improvements across all SKINN specifications
Parameter stability: SKINN-learned structural parameters exhibit smoother evolution than standalone calibration

Robustness: Tested across multiple structured-knowledge formats (parametric BSM, semi-parametric Heston surrogates, non-parametric martingale representations with 2,000 parameters). Simpler knowledge representations generally perform better for hedging applications.

Limitations & Open Problems

Limitations:

TECHNICAL: Double-asymptotic requirement $N_{Colloc}/N \to c$ needed for theoretical guarantees - can be relaxed in practice but affects theoretical validity
TECHNICAL: Hyperparameter $\lambda$ selection requires cross-validation or economic intuition - the “restricted-optimal” characterization only applies under orthogonal moment conditions
NATURAL: Differentiability requirement for structured knowledge excludes some theoretical representations (e.g., discrete choice models with non-smooth value functions)
RESTRICTIVE: Joint identification requires sufficient variation in the data to separately identify neural and structural components - may fail in low-noise or highly regular environments
TECHNICAL: Collocation point sampling strategy not theoretically characterized - uniform sampling used in practice but optimal designs unknown
NATURAL: Computational overhead from joint optimization and collocation evaluation increases training time relative to plain neural networks

Open problems:
Optimal collocation design: Develop theory for adaptive or importance-weighted collocation point selection to improve efficiency
High-dimensional asymptotics: Extend theoretical analysis to settings where neural network width grows with sample size, connecting to recent deep learning theory