Apr 29, 2026 Theory 3 papers

Theory Digest — Apr 29, 2026

Today’s Digest at a Glance

Preliminary

Today’s theory digest explores pathologies in variational inference, adapted optimal transport for stochastic processes, and simulation-based inference methods for handling selection bias.

Evidence Lower Bound (ELBO)

The Evidence Lower Bound is a fundamental quantity in variational inference that provides a tractable objective for approximate Bayesian inference. In Bayesian inference, we often cannot compute the marginal likelihood $p(x) = \int p(x

\theta)p(\theta)d\theta$ directly, so we introduce an approximate posterior $q(\theta)$ and optimize the ELBO:

\[\text{ELBO}(q) = \mathbb{E}_{q(\theta)}[\log p(x|\theta)] - D_{KL}(q(\theta)||p(\theta))\]

The ELBO decomposes into a data fit term (expected log-likelihood) and a complexity penalty (KL divergence from prior). Maximizing the ELBO is equivalent to minimizing the KL divergence between the approximate and true posterior.

The key insight is that while we cannot compute $\log p(x)$ directly, we know that $\log p(x) \geq \text{ELBO}(q)$ for any $q$, so maximizing the ELBO gives us the tightest lower bound. However, as today’s first paper shows, the choice of parameterization for $q$ can lead to unexpected overfitting behavior where simpler models achieve higher ELBO values than more complex ones, violating Occam’s razor.

Neural Posterior Estimation

Neural Posterior Estimation is a simulation-based inference method that learns approximate posterior distributions using neural networks when the likelihood function is intractable. Unlike traditional MCMC or variational inference, NPE works directly with simulator outputs without requiring a tractable likelihood.

The core idea is to train a neural network $q_\phi(\theta

x)$ to approximate the posterior $p(\theta

x)$ using pairs $(\theta_i, x_i)$ generated by sampling parameters from the prior and running forward simulations. The network is trained to minimize the KL divergence between the approximate and true posterior, typically using a normalizing flow architecture to ensure the output is a valid probability density.

NPE is particularly valuable in scientific applications where we have complex simulation models but cannot write down analytical likelihood functions—the neural network learns to invert the simulator and recover parameter distributions from observed data.

Reading guide: The first paper reveals fundamental limitations in how we optimize variational objectives, showing that common approximations can paradoxically favor overfitting. The second paper develops new theoretical foundations for optimal transport in filtered probability spaces. The third paper extends simulation-based inference to handle missing data mechanisms, connecting to the broader theme of making inference methods more robust to realistic data collection processes.

Occam’s Razor is Only as Sharp as Your ELBO

Authors: Ethan Harvey, Michael C. Hughes · Institution: Tufts University · Category: stat.ML

Shows that rank-1 covariance approximations in variational inference can cause ELBO-based overfitting through null-space collapse, undermining Occam’s razor.

Tags: variational inference ELBO Bayesian linear regression model selection overfitting approximate posterior low-rank approximation Occam's razor

arXiv · PDF

Problem Formulation

Motivation: Variational inference (VI) is widely used in Bayesian deep learning as a tractable approximation to exact inference. The Evidence Lower Bound (ELBO) is often described as embodying Occam’s razor, providing automatic model selection that balances fit and complexity. However, the quality of this approximation depends critically on the chosen variational family.

Mathematical setup: Consider Bayesian linear regression with design matrix $Φ ∈ ℝ^{N×R}$, targets $y ∈ ℝ^N$, and parameters $w ∈ ℝ^R$. The prior is:

\[p_η(w) = \mathcal{N}(w | 0_R, Σ_p)\]

The likelihood is:

\[p_η(y | w) = \mathcal{N}(y | Φw, σ_y^2 I_N)\]

The true posterior is:

\[p_η(w | y) = \mathcal{N}(w | μ^*, Σ^*)\]

where $μ^* = \frac{1}{σ_y^2} Σ^* Φ^⊤ y$ and $Σ^* = (Σ_p^{-1} + \frac{1}{σ_y^2} Φ^⊤Φ)^{-1}$.

VI approximates this with $q(w) = \mathcal{N}(w | μ_q, Σ_q)$. Key assumptions:

Over-parameterized regime: $R > N$
Prior covariance: $Σ_p = I_R$
Approximate posterior covariance has specific structure (diagonal vs rank-1 vs full-rank)

Toy example: When $N=20$, $R=1024$, and data follows $y = \sin(3x) + ε$ with $ε \sim \mathcal{N}(0, 0.01)$, the rank-1 covariance $q(w) = \mathcal{N}(μ_q, v_q v_q^⊤ + εI_R)$ leads to systematic underestimation of $σ_y^2$.

Formal objective: Maximize the ELBO:
\[\mathcal{J}(μ_q, Σ_q, η) = \mathbb{E}_{q(w)}[\log p_η(y | w)] - D_{KL}(q(w) \| p_η(w))\]

Method

The method involves optimizing the ELBO under different covariance structures for the approximate posterior $q(w)$.

Rank-1 covariance parameterization:

\[q(w) = \mathcal{N}(μ_q, v_q v_q^⊤ + εI_R)\]

where $v_q ∈ ℝ^R$ is the learnable uncertainty direction.

Key optimization steps:

Compute ELBO gradient with respect to $v_q$:
\[\frac{∂\mathcal{J}}{∂v_q} = -\frac{1}{σ_y^2}Φ^⊤Φv_q - v_q + \frac{1}{\|v_q\|_2^2 + ε}v_q\]
In over-parameterized setting ($R > N$), optimal $v_q$ satisfies: $|v_q|_2^2 = 1-ε$, $Φv_q = 0$, $v_q ≠ 0$
This causes null-space collapse where uncertainty aligns orthogonal to features
ELBO degenerates to MAP-like objective:
\[\mathcal{J}(μ_q, ..., η) = \text{const} - \frac{1}{2}\left[N\log(2πσ_y^2) + \frac{1}{σ_y^2}\|y - Φμ_q\|_2^2 + \|μ_q\|_2^2\right]\]
Application to toy example: With $N=20$, $R=1024$, the rank-1 method finds $v_q$ in the null space of $Φ$, allowing perfect interpolation ($|y - Φμ_q|_2^2 ≈ 0$) while maximizing entropy. This drives $σ_y → 0$, causing severe overfitting with $σ_y ≈ 5.54 × 10^{-4}$ compared to the true value $0.1$.

Novelty & Lineage

Prior work:

Coker et al. (2022): “Wide Mean-Field Bayesian Neural Networks Ignore the Data” - showed mean-field VI leads to underfitting
Harvey et al. (2024, 2026): Demonstrated underfitting with diagonal covariance and proposed tempered VI solutions
Dusenberry et al. (2020): “Efficient and Scalable Bayesian Neural Nets with Rank-1 Factors” - proposed rank-1 approximations for scalability

Delta: This paper provides the first concrete demonstration of ELBO-based overfitting (not just underfitting). The key insight is that rank-1 covariance restrictions can lead to null-space collapse, causing systematic underestimation of likelihood variance.

Theory-specific assessment:
- The main result (null-space collapse leading to overfitting) is somewhat predictable given the geometric constraints, but the specific manifestation through ELBO optimization had not been demonstrated
- The proof technique is straightforward: gradient analysis reveals that optimal $v_q$ lies in null space of $Φ$, reducing ELBO to MAP objective
- No formal bounds are provided; this is primarily an empirical demonstration with theoretical explanation
The paper correctly identifies a gap in understanding but the theoretical analysis is relatively shallow. The phenomenon, while important for practitioners, follows fairly directly from the geometry of over-parameterized systems.

Verdict: INCREMENTAL — Solid empirical demonstration of a practically relevant phenomenon, but the theoretical insight is modest and the analysis routine.

Proof Techniques

The main theoretical analysis relies on gradient-based optimization analysis:

Gradient computation: For rank-1 covariance $q(w) = \mathcal{N}(μ_q, v_q v_q^⊤ + εI_R)$, the ELBO gradient is:
\[\frac{∂\mathcal{J}}{∂v_q} = -\frac{1}{σ_y^2}Φ^⊤Φv_q - v_q + \frac{1}{\|v_q\|_2^2 + ε}v_q\]
Null-space analysis: In over-parameterized regime ($R > N$), setting $\frac{∂\mathcal{J}}{∂v_q} = 0$ yields the optimality conditions: $|v_q|_2^2 = 1-ε$, $Φv_q = 0$, $v_q ≠ 0$
Objective simplification: When $v_q$ lies in null space of $Φ$, the entropy term $-\frac{1}{2}\log\det(v_q v_q^⊤ + εI_R)$ becomes independent of data, causing ELBO to reduce to:
\[\mathcal{J} ≈ -\frac{1}{2}\left[N\log(2πσ_y^2) + \frac{1}{σ_y^2}\|y - Φμ_q\|_2^2 + \|μ_q\|_2^2\right]\]
Overfitting mechanism: Since $R > N$ allows perfect interpolation ($|y - Φμ_q|_2^2 ≈ 0$), the objective becomes dominated by the log-determinant term $N\log(2πσ_y^2)$, driving $σ_y → 0$.

The proof technique is elementary but correctly identifies the geometric cause of the pathology.

Experiments & Validation

Datasets: Synthetic data with $x \sim \mathcal{N}(0,1)$, $y = \sin(3x) + ε$ where $ε \sim \mathcal{N}(0, 0.01)$. Uses random Fourier features (RFFs) with $N=20$ data points and $R=1024$ features.

Baselines: Compares three approximate posterior covariance structures:

Diagonal (mean-field VI)
Rank-1 ($v_q v_q^⊤ + εI_R$)
Full-rank (exact)

Key results:
- Diagonal: underfitting with $σ_y = 0.39-0.48$ (overestimated)
- Rank-1: severe overfitting with $σ_y ≈ 5.54 × 10^{-4}$ (underestimated by factor of ~200)
- Full-rank: good fit with $σ_y = 0.09-0.13$ (close to true value 0.1)
Additional experiments:
- Scaling analysis showing ELBO vs log-marginal likelihood as function of $R$
- Empirical Bayes analysis showing prior variance regularization
- Tempered VI experiments demonstrating mitigation strategies
The experiments clearly demonstrate the claimed phenomenon but are limited to synthetic data and a single model class.

Limitations & Open Problems

Limitations:

RESTRICTIVE - Analysis limited to linear regression with Gaussian priors/likelihoods; unclear how results extend to nonlinear models or non-Gaussian settings
TECHNICAL - Requires over-parameterized regime ($R > N$) for null-space collapse; behavior in well-specified regimes unclear
RESTRICTIVE - Synthetic experiments only; no validation on real datasets or practical applications
TECHNICAL - Rank-1 parameterization with jitter $εI_R$ is somewhat artificial; pure rank-1 case not addressed
RESTRICTIVE - Focus on hyperparameter learning via ELBO maximization; other inference objectives not considered

Open problems:
Extension to nonlinear models: Does similar overfitting occur in Bayesian neural networks with low-rank posterior approximations?
Optimal rank selection: How should practitioners choose the rank of covariance approximations to balance computational efficiency and inference quality?

Adapted Wasserstein Barycenters of Gaussian Processes

Authors: Francesco Mattesini, Johannes Wiesel · Institution: · **Category:** `math.PR`

Establishes existence, uniqueness, and fixed-point characterization of adapted Wasserstein barycenters for Gaussian processes through multicausal reformulation and novel column decomposition of the adapted Bures-Wasserstein distance.

Tags: optimal_transport Wasserstein_barycenters Gaussian_processes stochastic_processes adapted_transport filtration_constraints mathematical_finance stochastic_control

arXiv · PDF

Problem Formulation

Motivation: In many applications involving stochastic processes (stochastic control, mathematical finance, sequential decision), classical Wasserstein distances ignore temporal structure and information flow. The adapted Wasserstein distance addresses this by enforcing filtration compatibility in transport plans, making it suitable for problems with nested information constraints.
Mathematical setup: Fix a probability space
\[(Ω, F, P)\]
with independent d-dimensional standard Gaussians
\[G_t \sim N(0, I_d)\]
for
\[t = 1,...,T\]
. Let
\[\mathcal{G}_t = σ(G_s : s ≤ t)\]
be the filtration. Define the set of block-lower triangular matrices:
\[\mathcal{L} = \{L = (L_{t,s})_{1 ≤ s ≤ t ≤ T} : L_{t,s} ∈ \mathbb{R}^{d×d}\}\]
Every Gaussian process
\[X = (X_t)_{t=1}^T\]
on this filtered space can be written as
\[X = a + LG\]
for some
\[a ∈ \mathbb{R}^{dT}, L ∈ \mathcal{L}\]
, where
\[X_t = a_t + \sum_{s=1}^t L_{t,s}G_s\]
. The adapted 2-Wasserstein distance between Gaussian processes
\[X^1 = a^1 + L^1G\]
and
\[X^2 = a^2 + L^2G\]
is:
\[AW_2^2(X^1, X^2) = ||a^1 - a^2||_2^2 + d_{ABW}^2([L^1], [L^2])\]
where
\[d_{ABW}\]
is the adapted Bures-Wasserstein distance:
\[d_{ABW}^2([L^1], [L^2]) = \min_{O ∈ \mathcal{O}} ||L^1 - L^2O||_F^2\]
and
\[\mathcal{O} = \{diag(O_1,...,O_T) : O_t ∈ O(d)\}\]
is the set of block-diagonal orthogonal matrices.

Assumptions:
\[N\]
Gaussian processes
\[X^i = a^i + L^iG\]
with
\[a^i ∈ \mathbb{R}^{dT}, L^i ∈ \mathcal{L}\]
Convex weights
\[λ_1,...,λ_N > 0\]
with
\[\sum_{i=1}^N λ_i = 1\]
Toy example: When
\[T=2, d=1\]
, we have
\[L = \begin{pmatrix} L_{1,1} & 0 \\ L_{2,1} & L_{2,2} \end{pmatrix}\]
representing process
\[X_1 = a_1 + L_{1,1}G_1, X_2 = a_2 + L_{2,1}G_1 + L_{2,2}G_2\]
. The adapted barycenter compares how each noise source
\[G_1, G_2\]
propagates forward in time across different models.
Formal objective: Find the adapted Wasserstein barycenter:
\[\inf_{X ∈ \mathcal{F}\mathcal{P}_2} \sum_{i=1}^N λ_i AW_2^2(X, X^i)\]

Method

The method establishes existence, uniqueness, and characterization of adapted Wasserstein barycenters for Gaussian processes through several key steps:

Multicausal reformulation: Convert the barycenter problem to multicausal optimal transport:
\[\inf_{π ∈ Cpl_{mc}(X^1,...,X^N)} \mathbb{E}_π\left[\sum_{i=1}^N λ_i ||X^i - φ_0(X^1,...,X^N)||_2^2\right]\]
where
\[φ_0(x^1,...,x^N) = \sum_{i=1}^N λ_i x^i\]
is the optimal aggregation map.
Column decomposition: For
\[L ∈ \mathcal{L}\]
, define truncated columns
\[\tilde{L}_t ∈ \mathbb{R}^{(T-t+1)d×d}\]
and column covariances
\[\tilde{Σ}_t^L = \tilde{L}_t\tilde{L}_t^⊤\]
. The key decomposition is:
\[d_{ABW}^2([L], [M]) = \sum_{t=1}^T d_{BW}^2(\tilde{Σ}_t^L, \tilde{Σ}_t^M)\]
Fixed-point characterization: The barycenter
\[[L^*]\]
satisfies:
\[L^* = \sum_{i=1}^N λ_i L^i O_i(L^*)\]
where
\[O_i(L^*) ∈ \arg\min_{O ∈ \mathcal{O}} ||L^* - L^i O||_F^2\]
.
Algorithm: Alternating minimization between updating Procrustes optimizers
\[O_i^{(k)}\]
and barycenter candidate
\[L^{(k+1)} = \sum_{i=1}^N λ_i L^i O_i^{(k)}\]
.

Application to toy example: For
\[T=2, d=1\]
with AR(1) processes
\[X_t^i = α_t^i X_{t-1}^i + σ_t^i G_t\]
, the Cholesky factor is
\[L^i = \begin{pmatrix} σ_1^i & 0 \\ α_2^i σ_1^i & σ_2^i \end{pmatrix}\]
. The column covariances are
\[\tilde{Σ}_1^L = L^i(L^i)^⊤\]
,
\[\tilde{Σ}_2^L = (σ_2^i)^2\]
, and the barycenter decomposes into independent problems for each column structure.

Novelty & Lineage

Prior work:

Agueh & Carlier (2011) established existence/uniqueness of classical Wasserstein barycenters between Gaussians
Backhoff et al. (2020-2022) introduced adapted Wasserstein distances for filtered processes
Acciaio et al. (2023) developed adapted Bures-Wasserstein geometry for Gaussian processes

Delta: This paper provides the first complete theory of adapted Wasserstein barycenters for Gaussian processes. Key additions:
- Proves barycenter existence and Gaussianity via multicausal reformulation
- Establishes uniqueness through novel column decomposition
\[d_{ABW}^2 = \sum_t d_{BW}^2(\tilde{Σ}_t^L, \tilde{Σ}_t^M)\]
- Derives fixed-point characterization enabling numerical computation
- Shows regularity preservation (non-degenerate inputs → non-degenerate barycenter)
Theory-specific assessment:
- Main theorem is not surprising given classical Wasserstein barycenter theory, but the temporal constraints create genuine technical challenges
- Proof technique combines multicausal transport (novel application) with backward induction and column decomposition (genuinely new)
- The column decomposition isometry
\[(L/O, d_{ABW}) \cong \prod_{t=1}^T (S_+^{(T-t+1)d}, d_{BW})\]
is a clean structural insight
- No lower bounds are established or known for this setting
Verdict: INCREMENTAL — This is a solid extension of classical barycenter theory to the adapted setting with clean technical contributions, but the main results follow expected patterns from the classical case once the appropriate framework is established.

Proof Techniques

The proof employs several key techniques:

Multicausal reformulation and backward induction: Uses dynamic programming principle to show Gaussianity propagates:
\[V(t, ω_{1:t}) = \inf_π \int V(t+1, ω_{t+1}) π(dω_{t+1})\]
The key insight is that conditional distributions remain Gaussian with affine means, so optimal couplings at each step are Gaussian.
Column decomposition via Procrustes structure: The critical technical lemma shows:
\[d_{ABW}^2(L,M) = \sum_{t=1}^T \min_{O_t ∈ O(d)} ||\tilde{L}_t - \tilde{M}_t O_t||_F^2\]
This reduces to
\[T\]
independent classical problems using the identity:
\[\min_{O ∈ O(d)} ||A - BO||_F^2 = ||A||_F^2 + ||B||_F^2 - 2\text{tr}(S)\]
where
\[S\]
contains singular values of
\[B^⊤A\]
.
Von Neumann trace inequality: For establishing the Bures-Wasserstein connection:
\[\max_{O ∈ O(d)} \text{tr}(\tilde{M}_t^⊤ \tilde{L}_t \cdot O^⊤) = \text{tr}(S_t)\]
Combined with eigenvalue identity for
\[(\tilde{Σ}_t^M)^{1/2} \tilde{Σ}_t^L (\tilde{Σ}_t^M)^{1/2}\]
.
Alternating minimization for fixed-point: The objective
\[F(L; O_1,...,O_N) = \sum_i λ_i ||L - L_i O_i||_F^2\]
decouples:
- Sub-problem 1:
\[L^* = \sum_i λ_i L_i O_i\]
(convex quadratic)
- Sub-problem 2:
\[O_i^* ∈ \arg\min_O ||L - L_i O||_F^2\]
(Procrustes)
Perturbation argument for regularity: If barycenter
\[\tilde{L}_t^*\]
has rank
\[< d\]
, construct perturbation
\[\tilde{L}_t^ε = \tilde{L}_t^* + ε hv^⊤\]
where
\[v ∈ \ker(\tilde{L}_t^*)\]
. The Schur complement analysis shows the perturbation decreases the objective:
\[\text{tr}(\tilde{S}_t^ε) = \text{tr}(\tilde{S}_t) + ε||c_{i,⊥}||^2 + O(ε^2)\]
contradicting optimality.

Experiments & Validation

Purely theoretical. The paper provides explicit characterizations for AR(1) processes and outlines Algorithm 1 for iterative computation via alternating minimization, but no numerical experiments are conducted.

Empirical validation would require:

Numerical implementation of Algorithm 1 with convergence analysis
Comparison with classical Wasserstein barycenters on time series data
Applications to scenario reduction in multi-stage stochastic optimization
Performance evaluation on financial stress testing problems

Limitations & Open Problems

Limitations:

Discrete time only - TECHNICAL: Extension to continuous-time processes (Ornstein-Uhlenbeck) requires handling infinite-dimensional operator equations
Gaussian restriction - RESTRICTIVE: Non-Gaussian processes lack explicit representations, though multicausal framework applies generally
No convergence rates - TECHNICAL: Algorithm 1 convergence follows from classical theory but specific rates not established
Finite horizon - NATURAL: Infinite horizon would require additional technical conditions on growth/stationarity

Open problems:
Continuous-time extension: Whether adapted barycenters of Ornstein-Uhlenbeck processes remain diffusions, and how the fixed-point equation extends to operator-valued settings
Non-Gaussian barycenters: Uniqueness and structural characterization beyond Gaussian case, potentially for Markov processes where conditional structure enables dynamic programming

Overcoming Selection Bias in Statistical Studies With Amortized Bayesian Inference

Authors: Jonas Arruda, Sophie Chervet, Paula Staudt, Andreas Wieser et al. (9 authors) · Institution: University of Bonn · Category: stat.ML

Embeds selection mechanisms into forward simulators for neural posterior estimation, enabling bias-aware Bayesian inference without tractable likelihoods.

Tags: simulation-based inference neural posterior estimation selection bias missing data epidemiology Bayesian inference amortized inference bias correction

arXiv · PDF

Problem Formulation

Motivation (2-3 sentences): Selection bias occurs when the probability of observation depends on variables related to quantities of interest, leading to systematic distortions in statistical inference. This is particularly problematic in epidemiological studies, survey research, and clinical trials where non-representative sampling or outcome-dependent inclusion can severely bias prevalence estimates and parameter inference. Classical corrections require tractable likelihoods, limiting their applicability to complex stochastic models.
Mathematical setup: Consider a probability space where we observe data $Y = {y_i}_{i=1}^N$ with selection indicator $S = 1$ and covariates $C = {c_i}_{i=1}^N$. Let $Y_{full} = (Y, Y_{mis})$ denote the complete data including missing values. The likelihood under selection is:
\[p(Y, S | \theta; C) = \int\int p(Y, Y_{mis} | \theta; C, C_{mis}) p(S | Y, Y_{mis}, \theta; C, C_{mis}) dC_{mis} dY_{mis}\]
The target posterior distribution is:
\[p(\theta | Y, S = 1; C) \propto p(\theta)p(Y, S | \theta; C)\]
Assumptions:
1. The selection mechanism $p(S Y, Y_{mis}, \theta; C, C_{mis})$ can be simulated
2. The forward model $p(Y \theta; C)$ can be simulated
3. The prior $p(\theta)$ is specified

Toy example: When $d=2$ with binary infection status and binary sex covariate, and selection depends only on infection status with probability $p(S=1

Y=1) = 0.8$ and $p(S=1

Y=0) = 0.5$, the observed prevalence will be biased upward compared to the true population prevalence.

Formal objective: Estimate the bias-corrected posterior:
\[\hat{p}(\theta | Y, S = 1; C) \approx q_\phi(\theta | Y; C)\]
where $q_\phi$ is a neural posterior estimator trained on simulated selected data.

Method

The method embeds selection mechanisms directly into forward simulations for neural posterior estimation:

Generate parameters $\theta \sim p(\theta)$ and covariates $C$ from data
Simulate full population data $Y_{full} \sim p(Y \theta; C)$
Apply selection mechanism to obtain $S \sim p(S Y_{full}, \theta; C)$
Retain only selected observations where $S = 1$

Train neural posterior estimator $q_\phi(\theta

Y, S=1; C)$ on selected simulation pairs

The NPE consists of:

Summary network: processes observed data $Y$ into low-dimensional embedding
Inference network: generative model (e.g., flow matching) mapping embeddings to posterior samples

Training objective minimizes:

\[\mathcal{L} = \mathbb{E}_{(\theta,Y,S)} [-\log q_\phi(\theta | Y, S=1; C)]\]

Applied to toy example: For binary infection with biased selection, the method learns to upweight observations from the under-represented group (uninfected individuals) when estimating population prevalence, automatically correcting for the $p(S=1

Y=1) > p(S=1

Y=0)$ bias.

Novelty & Lineage

Step 1 — Prior work:

Little & Rubin (2019): established theoretical framework for missing data and selection bias correction via likelihood-based methods
Lueckmann et al. (2021): developed neural posterior estimation for simulation-based inference but assumed unbiased observations
Wang et al. (2024): extended SBI to missing data under missing-at-random assumptions

Step 2 — Delta: This paper embeds selection mechanisms directly into the forward simulator for neural posterior estimation, enabling bias correction without tractable likelihoods. Key additions include structured diagnostics (C2ST, SBC) for bias detection and amortized inference across heterogeneous selection regimes.

Step 3 — Theory-specific assessment:

Main result is predictable: combining existing NPE methods with selection simulation is a natural extension
Proof technique is routine: assembles known SBI training procedures with forward selection modeling
No theoretical bounds provided; empirical validation only
No lower bounds established for when method succeeds/fails

The contribution is primarily methodological rather than theoretical. While the engineering is solid and applications are convincing, the core insight (simulate selection process) is straightforward.

Verdict: INCREMENTAL — solid engineering contribution combining existing NPE methods with selection modeling, but lacks theoretical novelty or surprising results.

Proof Techniques

No formal proofs provided. The paper is primarily empirical with methodological contributions.

Key technical components:

Forward simulation procedure:
\[(\theta, Y_{full}) \sim p(\theta) \times p(Y | \theta; C)\] \[S \sim p(S | Y_{full}, \theta; C)\]
NPE training on selected pairs ${(\theta_j, Y_j) : S_j = 1}_{j=1}^J$
Simulation-based calibration diagnostic using rank statistics:
\[R_m = \text{rank}(\theta_m \text{ in posterior samples from } Y_m)\]
Classifier two-sample test statistic:
\[T = \frac{1}{K} \sum_{k=1}^K \left(a_k - \frac{1}{2}\right)^2\]
where $a_k$ is classification accuracy distinguishing posterior-data pairs from joint samples.

The technical approach relies on established neural density estimation theory and assumes sufficient network capacity and training data for consistent posterior approximation. No convergence analysis or approximation error bounds provided.

Experiments & Validation

Three applications with real and simulated data:

KoCo19 seroprevalence study: 5577 participants across 5 rounds with increasing missingness. Comparison against inverse probability weighting and unadjusted estimates. Key result: NPE more accurate on simulated data with absolute error ~1-3% vs 2-8% for baselines.
Framingham dementia study: 2200+ patients with time-to-event analysis under death-induced censoring. Comparison against penalized likelihood IDM and naive Cox model. NPE recovers hazard parameters with NRMSE comparable to full-data training.
PedCovid household transmission: Complex stochastic transmission model with outcome-dependent selection. MCMC requires 12+ hours vs seconds for NPE. Shows systematically different parameter estimates under child selection bias.

Diagnostics: C2ST scores near 0.5-0.7 indicate well-calibrated posteriors. SBC ECDFs show uniform rank distributions for bias-aware NPE but systematic deviations for mismatched methods.

Limitations & Open Problems

Limitations:

Requires specifying structural form of selection mechanism — TECHNICAL (can be parameterized as unknown)
Performance depends on neural network expressiveness and training budget — NATURAL (standard in deep learning)
C2ST operates on learned embeddings, may miss discrepancies not captured by summary network — TECHNICAL (could use raw data diagnostics)
No theoretical guarantees on posterior quality or convergence — RESTRICTIVE (limits reliability assessment)
Assumes selection mechanism can be accurately simulated — RESTRICTIVE (strong modeling assumption)

Open problems:
Develop theoretical analysis of approximation error and convergence rates for NPE under selection bias
Design methods for automatic discovery/learning of selection mechanisms from data rather than requiring prior specification