Theory 3 papers

Theory Digest — Apr 15, 2026

Today’s Digest at a Glance

Preliminaries

Today’s digest explores three distinct approaches: optimal transport formulations for causal inference, diffusion-based counterfactual modeling, and Bayesian tensor regression for neuroimaging.

Policy-Relevant Treatment Effects (PRTE)

Policy-relevant treatment effects measure the expected impact of a policy change on the population it would actually affect, rather than the entire population. Unlike average treatment effects (ATE) which assume universal treatment adoption, PRTE focuses on individuals whose treatment status would change under the new policy. The challenge is that we observe each individual in only one treatment state, creating a fundamental identification problem.

The standard approach uses instrumental variables with moment inequalities, but these produce loose bounds because they don’t fully exploit the structure of the problem. The key insight is that PRTE identification can be recast as finding the worst-case joint distribution of potential outcomes consistent with observed data. This naturally leads to an optimal transport formulation where we seek the distribution that maximizes (or minimizes) the treatment effect while respecting observational constraints.

Intuition: We’re finding the most pessimistic and optimistic ways to “transport” probability mass between treatment and control groups while maintaining consistency with what we observe.

Denoising Diffusion Models

Denoising diffusion models generate data by learning to reverse a noise-adding process. The forward process gradually corrupts data $x_0$ by adding Gaussian noise over $T$ steps: $x_t = \sqrt{\alpha_t} x_0 + \sqrt{1-\alpha_t} \epsilon$ where $\epsilon \sim \mathcal{N}(0,I)$ and $\alpha_t$ decreases to zero. The reverse process learns a neural network $\epsilon_\theta(x_t, t)$ to predict the noise at each step, enabling generation by starting from pure noise and iteratively denoising.

The training objective is simple: $\mathbb{E}[|\epsilon - \epsilon_\theta(x_t, t)|^2]$ where $\epsilon$ is the true noise added at step $t$. During generation, we sample $x_T \sim \mathcal{N}(0,I)$ and compute $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t,t)) + \sigma_t z$ where $z \sim \mathcal{N}(0,I)$.

For counterfactual modeling, instead of generating generic data, we condition the diffusion process on treatment assignments and patient history to generate full probability distributions over counterfactual outcomes. Intuition: We’re learning to “denoise” counterfactual futures from random noise, guided by treatment and covariate information.

PARAFAC Tensor Decomposition

PARAFAC (Parallel Factor Analysis) decomposes a tensor into a sum of rank-1 components, where each component is the outer product of vectors from each mode. For a 3-way tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, PARAFAC writes:

\[\mathcal{X} = \sum_{r=1}^R \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r\]

where $\circ$ denotes outer product and $R$ is the tensor rank. In matrix form: $X_{(1)} = A(C \odot B)^T$ where $X_{(1)}$ is the mode-1 matricization and $\odot$ is the Khatri-Rao product.

Unlike matrix SVD, PARAFAC provides a unique decomposition under mild conditions (Kruskal’s theorem), making the factor matrices identifiable without rotation ambiguity. The standard algorithm alternates between solving least squares problems for each factor matrix while holding others fixed, though this can be slow to converge.

For neuroimaging tensors representing brain regions × time × subjects, PARAFAC identifies shared spatial-temporal patterns across individuals. Intuition: We’re finding the fundamental “building blocks” that when combined in different proportions can reconstruct each person’s brain activity pattern.

Reading Guide

The first paper develops sharp bounds for causal effects by reformulating the problem as optimal transport, while the second uses diffusion models to generate full counterfactual distributions rather than point estimates. The third paper tackles high-dimensional neuroimaging data through Bayesian tensor methods, representing a different approach to handling complex structured data in scientific applications.


Partial Identification of Policy-Relevant Treatment Effects with Instrumental Variables via Optimal Transport

Authors: Jiyuan Tan, Jose Blanchet, Vasilis Syrgkanis · Institution: Stanford · Category: stat.ME

Reformulates IV partial identification of policy-relevant treatment effects as optimal transport problems, yielding closed-form sharp bounds that are substantially tighter than existing moment-relaxation methods

Tags: instrumental variables partial identification optimal transport policy evaluation treatment effects econometrics causal inference generalized Roy model

arXiv · PDF

Problem Formulation

Motivation: Policy-Relevant Treatment Effects (PRTEs) measure how outcomes change under counterfactual policies that shift treatment take-up. Unlike Local Average Treatment Effects (LATE), PRTEs are typically not point-identified when instruments have limited support in treatment propensity, creating wide bounds that are uninformative for policy decisions.

Mathematical Setup: Consider the generalized Roy model with:

  • Observable variables: treatment $W \in {0, 1}$, instrument $Z \in \mathcal{Z}$, covariates $X \in \mathcal{X} \subset \mathbb{R}^{d_X}$, outcome $Y \in \mathcal{Y} \subset \mathbb{R}$
  • Unobservable potential outcomes $Y(0), Y(1)$ and selection variable $U$

Assumptions:

  1. Consistency: $Y = W Y(1) + (1-W) Y(0)$
  2. Conditional instrumental exogeneity: $Z \perp!!!\perp (U, Y(0), Y(1)) X$
  3. Threshold crossing: $W = \mathbf{1}(U \leq p(Z,X))$ where $p(Z,X) = P(W=1 Z,X)$ and $U X \sim \text{Unif}(0,1)$
  4. Common instrument support: $\text{Supp}(Z X=x) = \mathcal{Z}$ for all $x \in \mathcal{X}$
  5. Boundedness: $\mathcal{Y} \subseteq [y_{\min}, y_{\max}]$ for known constants

    The target parameter is:

    \[\theta_\omega = \mathbb{E}_{X,U}\left[\text{MTE}(X,U) \omega(X,U)\right]\]
    where $\text{MTE}(x,u) = \mathbb{E}[Y(1) - Y(0) X=x, U=u]$ and $\omega(x,u)$ is an identifiable weight function.

    Toy Example: When $X = \emptyset$, $Z \in {0,1}$ with propensity scores $p(0) = 1/4, p(1) = 3/4$, and weight $\omega(u) = \mathbf{1}(u \in (1/4, 1/2))$. This creates three intervals: $(0, 1/4]$ where only $p(0)$ matters, $(1/4, 3/4]$ where both instruments provide information, and $(3/4, 1]$ where $Y(1)$ is never observed.

    Formal Objective: The PRTE partial identification problem seeks:

    \[\inf_{\pi_0, \pi_1} / \sup_{\pi_0, \pi_1} \; \mathbb{E}_{\pi_1}[Y(1)\omega(X,U)] - \mathbb{E}_{\pi_0}[Y(0)\omega(X,U)]\]
Method

The method reformulates PRTE partial identification as a Constrained Conditional Optimal Transport (CCOT) problem, then proves it decomposes into separable 1D optimal transport subproblems.

Step 1 - CCOT Formulation: Search over all joint distributions $\pi_w$ of $(Y(w), U, X)$ consistent with observed data:

\[\int_0^{p(z,x)} \pi_1(dy | u, x) du = P_{\text{obs}}(dy, W=1 | p(Z,X) = p(z,x), X=x)\] \[\int_{p(z,x)}^1 \pi_0(dy | u, x) du = P_{\text{obs}}(dy, W=0 | p(Z,X) = p(z,x), X=x)\]

Step 2 - Decomposition: For discrete instruments with propensity values $0 = p_0 \leq p_1 < \cdots < p_K \leq 1$, the constraints decompose into interval-wise marginal constraints:

\[\frac{1}{p_{i+1} - p_i} \int_{p_i}^{p_{i+1}} \pi_1(dy | u) du = \mu_{1,i}(dy)\]

where $\mu_{1,i}$ is the identified conditional distribution for the $i$-th complier group.

Step 3 - 1D Optimal Transport: Each interval $I_i = (p_i, p_{i+1}]$ yields a standard 1D OT problem:

\[\min_{\gamma_i \in \Pi(\mu_{1,i}, \nu_i)} \int_{Y \times I_i} y \cdot \omega(u) d\gamma_i(y,u)\]

where $\nu_i = \text{Unif}(p_i, p_{i+1})$.

Closed-Form Solution: By Theorem 2.3, the optimal coupling is countermonotonic (comonotonic) for the lower (upper) bound:

\[\theta_{\omega,1}^{\text{lower}} = \sum_{i=0}^{K-1} (p_{i+1} - p_i) \int_0^1 Q_{Y,i}(t) Q_{\omega,i}(1-t) dt + \int_{p_K}^1 [y_{\min} \max\{0,\omega(u)\} + y_{\max} \min\{0,\omega(u)\}] du\]

Application to Toy Example: With $p(0) = 1/4, p(1) = 3/4$, and $\omega(u) = \mathbf{1}(u \in (1/4, 1/2))$, the bound decomposes as:

\[\theta_{\omega,1}^{\text{lower}} = \frac{1}{4}\mathbb{E}[Y|W=1] + \frac{1}{8}\text{CVaR}_{0.5}(Y|W=1) + \frac{1}{8}y_{\min}\]

This reveals three identification regions: direct mean identification, optimal transport (CVaR), and trivial bounds.

Novelty & Lineage

Prior Work:

  1. Magne et al. (2018): “Marginal Treatment Effects and Policy Relevant Treatment Effects” - established moment-relaxation approach optimizing over MTR functions subject to linear constraints
  2. Heckman and Vytlacil (2005): “Structural Equations, Treatment Effects, and Econometric Policy Evaluation” - developed MTE framework and partial identification under limited support
  3. Balke and Pearl (1997): “Bounds on Treatment Effects from Studies with Imperfect Compliance” - introduced linear programming for IV partial identification

    Delta: This paper reformulates the PRTE partial identification problem as a Constrained Conditional Optimal Transport (CCOT) problem instead of moment-relaxation. Key contributions:

  4. uses full distributional constraints rather than just first moments
  5. proves the multidimensional CCOT analytically reduces to separable 1D OT problems
  6. derives closed-form sharp bounds avoiding linear programming.

    Theory-Specific Assessment:

    • Surprising vs. Predictable: The connection between IV partial identification and optimal transport is genuinely novel. The analytical reduction from a heavily constrained multidimensional CCOT to separable 1D problems is mathematically non-obvious and requires careful exploitation of the threshold-crossing structure.
    • Proof Technique: The proof is not routine - it requires a novel insight that observational constraints in the generalized Roy model create a specific structure where the multidimensional transport problem decomposes. The key technical innovation is showing that the integral constraints difference out to isolate conditional distributions on each resistance interval.
    • Bound Tightness: The bounds are provably sharp (Proposition 3.1 establishes observational equivalence). The improvement over moment-relaxation is demonstrated both theoretically (Example 3.2) and empirically, with substantially tighter bounds in simulations.

    Verdict: SIGNIFICANT - Clear non-obvious advance connecting optimal transport to IV partial identification with practical improvements over existing methods; most researchers working on partial identification should read this.

Proof Techniques

Main Proof Strategy: The proof exploits the threshold-crossing structure of the generalized Roy model to decompose a multidimensional CCOT problem into separable 1D optimal transport subproblems.

Stage 1 - Observational Equivalence (Proposition 3.1): Establish that the CCOT feasible set exactly characterizes all observationally equivalent data-generating processes. Uses the consistency equation and conditional independence to link structural measures $\pi_w$ to observed distributions.

Stage 2 - Constraint Decomposition: For discrete instruments with propensity levels $0 = p_0 \leq p_1 < \cdots < p_K \leq 1$, the key insight is that the cumulative constraints:

\[\int_0^{p_i} \pi_1(dy | u) du = P_{\text{obs}}(dy, W=1 | p(Z) = p_i)\]

can be differenced to isolate interval-wise marginal constraints:

\[\frac{1}{p_{i+1} - p_i} \int_{p_i}^{p_{i+1}} \pi_1(dy | u) du = \mu_{1,i}(dy)\]

Stage 3 - Separability: The objective function decomposes additively across resistance intervals:

\[\mathbb{E}_{\pi_1}[Y(1)\omega(U)] = \sum_{i=0}^K (p_{i+1} - p_i) \mathbb{E}_{\pi_1}[Y(1)\omega(U) | U \in I_i]\]

Stage 4 - 1D Optimal Transport Application: Each subproblem becomes a standard 1D OT problem with product cost $c(y,u) = y \cdot \omega(u)$. Apply Theorem 2.3 (countermonotonic/comonotonic coupling):

\[\min_{\gamma \in \Pi(\mu, \nu)} \int y \cdot \omega(u) d\gamma(y,u) = \int_0^1 Q_\mu(t) Q_\omega(1-t) dt\]

Key Inequalities:

\[\theta_{\omega,1}^{\text{lower}} = \sum_{i=0}^{K-1} (p_{i+1} - p_i) \int_0^1 Q_{Y,i}(t) Q_{\omega,i}(1-t) dt + \int_{p_K}^1 [y_{\min} \max\{0,\omega(u)\} + y_{\max} \min\{0,\omega(u)\}] du\]

Technical Insight: The crucial observation is that the threshold-crossing model creates a natural stratification of the resistance distribution that perfectly aligns with the constraint structure, enabling the reduction to separable subproblems without loss of optimality.

Experiments & Validation

Simulations: The paper conducts extensive Monte Carlo simulations comparing their optimal transport bounds against the moment-relaxation approach of Magne et al. (2018). Key findings: OT bounds are substantially tighter across different data-generating processes, with the gap widening when outcome distributions have more concentrated support.

Real Application: Empirical validation using Dupas (2014) bed-net subsidy data from Kenya. The instrument is the offered price of insecticide-treated bed nets, and the treatment is purchase decision. The PRTE measures effects of uniform price subsidies. Results show OT bounds are 40-60% tighter than moment-relaxation bounds for policy-relevant treatment effects.

Baselines: Primary comparison is against IVMTE package implementing Magne et al. (2018). Also recovers classical Manski bounds as special cases, confirming theoretical consistency.

Key Numbers: In simulations with discrete outcomes, OT bounds can be 50-80% tighter than moment bounds. The computational advantage is substantial - closed-form evaluation vs. linear programming optimization.

Empirical Validation Note: The theoretical results require finite-sample estimation and inference theory, which the paper develops through Double Machine Learning for discrete instruments and nonparametric rates for continuous instruments.

Limitations & Open Problems

Limitations:

  1. TECHNICAL: Boundedness assumption on outcome domain $\mathcal{Y} \subseteq [y_{\min}, y_{\max}]$ - needed for informative bounds in unidentified regions, standard in partial identification literature but can be restrictive for unbounded outcomes.

  2. TECHNICAL: Common instrument support assumption - requires $\text{Supp}(Z X=x) = \mathcal{Z}$ for all $x$, needed for propensity score identification but may fail with complex covariate interactions.
  3. NATURAL: Threshold-crossing/monotonicity assumption - equivalent to Angrist-Imbens monotonicity, widely accepted in IV literature and empirically testable.

  4. RESTRICTIVE: For multi-valued treatments (Section 4), requires additional algebraic condition on propensity score ranges that may not hold in practice.

  5. TECHNICAL: Discrete instrument analysis requires finite instrument support - continuous instrument case has slower nonparametric convergence rates and more complex implementation.

    Open Problems:

  6. Extension to Dynamic Settings: Current framework is static - developing optimal transport bounds for dynamic treatment regimes with time-varying instruments and outcomes remains an open challenge.

  7. Computational Scalability: While closed-form for discrete instruments, extending to high-dimensional covariate spaces and complex instrument structures may require approximation methods that preserve the theoretical guarantees.

Causal Diffusion Models for Counterfactual Outcome Distributions in Longitudinal Data

Authors: Farbod Alinezhad, Jianfei Cao, Gary J. Young, Brady Post · Institution: Northeastern University · Category: stat.ML

First application of diffusion models to generate full probabilistic distributions of counterfactual outcomes in longitudinal causal inference, achieving superior distributional accuracy over point-estimate methods.

Tags: causal inference diffusion models time series counterfactual prediction uncertainty quantification longitudinal data healthcare treatment effects

arXiv · PDF

Problem Formulation

Motivation: Predicting counterfactual outcomes in longitudinal data is critical for personalized medicine and policy evaluation. Existing methods typically provide only point estimates without uncertainty quantification, limiting their utility for high-stakes decision-making where understanding risk is essential.

Mathematical setup: Consider longitudinal data with $N$ individuals over $T$ time steps. For individual $i$ at time $t$, observe:

  • Feature vector $X_{i,t}$ (time-varying covariates)
  • Treatment $A_{i,t}$
  • Outcome $Y_{i,t}$

History up to time $t_0$ is denoted:

\[H_{i,0:t_0} := \{(X_{i,t}, A_{i,t}, Y_{i,t})\}_{t=1}^{t_0}\]

Let $Y_i^{(a_{1:t})}$ be the potential outcome at time $t$ under treatment sequence $a_{1:t}$.

Assumptions:

  1. Sequential ignorability: $Y_i^{(a_{t_0+1:T})} \perp A_{i,t} \mid H_{i,0:t-1}$
  2. Consistency: observed outcome equals potential outcome under received treatment
  3. Positivity: all treatments have non-zero assignment probability

    Toy example: When $T=3$, $d=2$ features, and binary treatment, this reduces to predicting the distribution $P(Y_{i,2:3}^{(0,1)} \mid H_{i,0:1})$ where patient receives no treatment at $t=2$ and treatment at $t=3$, given their history through $t=1$.

    Formal objective: Estimate the full conditional distribution:

    \[P\left(Y_i^{(a_{t_0+1:T})} \mid H_{i,0:t_0}\right)\]
Method

Causal Diffusion Model (CDM) uses denoising diffusion to generate counterfactual outcome distributions.

Forward Process: Add Gaussian noise to true outcomes $Y^{(a)}_{t_0+1:T}$:

\[z_k = \sqrt{1-\beta_k} z_{k-1} + \sqrt{\beta_k} \epsilon_k\]

where $z_0 = Y^{(a)}_{t_0+1:T}$, $\epsilon_k \sim \mathcal{N}(0,I)$, and ${\beta_k}$ follows cosine schedule.

Architecture: Novel residual denoising network $f_\theta(z_k, k)$ with:

  1. Relational Self-Attention (RSA) for temporal dependencies
  2. Residual blocks with skip connections
  3. Selective masking for counterfactual coordinates

    Training Loss:

    \[L(\theta) = \mathbb{E}_{k,z_k}\left[\|f_\theta(z_k,k) - \epsilon_k\|_2^2\right]\]

    Sampling Steps:

  4. Initialize $z_K \sim \mathcal{N}(0,I)$
  5. For $k = K$ down to $1$: $z_{k-1} = \text{denoise}(z_k, f_\theta(z_k,k))$
  6. Return $z_0$ as counterfactual sample

    Toy Example Application: For the $T=3$ binary treatment case, CDM would:

    • Mask outcomes at $t=2,3$
    • Condition on history $H_{i,0:1}$ and planned treatments $(0,1)$
    • Generate distribution samples of $(Y_{i,2}^{(0,1)}, Y_{i,3}^{(0,1)})$
Novelty & Lineage

Prior Work:

  1. Recurrent Marginal Structural Network (R-MSN, Lim 2018): LSTM with propensity score reweighting for point estimates
  2. Counterfactual Recurrent Network (CRN, Bica et al. 2019): RNN with adversarial training for balanced representations, produces point estimates
  3. Causal Transformer (CT, Melnychuk et al. 2022): Transformer with domain confusion loss, state-of-the-art point estimation

    Delta: This paper applies diffusion models to causal inference for the first time, generating full outcome distributions rather than point estimates. Introduces RSA-based architecture for temporal modeling.

    Theory-Specific Assessment:

    • Main contribution: First diffusion approach for counterfactual distributions is novel but technically straightforward - applies existing diffusion framework to new domain
    • Proof technique: No theoretical analysis; purely algorithmic contribution with standard diffusion training
    • Bounds: No theoretical bounds provided; evaluation is purely empirical

    The core insight is domain adaptation of diffusion models rather than fundamental theoretical advance. The RSA architecture and conditioning mechanism are reasonable engineering choices but not theoretically deep.

    Verdict: INCREMENTAL — Solid application of diffusion models to new domain with good empirical results, but lacks theoretical novelty or surprising insights.

Proof Techniques

This is purely an algorithmic paper with no formal proofs or theoretical analysis. The approach relies on standard diffusion model theory without novel mathematical contributions.

Key Technical Components:

  1. Standard Diffusion Framework: Uses established forward/reverse process from Ho et al. (2020)
  2. Conditioning Mechanism: Incorporates patient history and planned treatments through context embeddings
  3. Selective Masking: Only applies noise to unknown/counterfactual coordinates
  4. RSA Architecture: Adapts relational self-attention from video understanding to capture temporal dependencies

    No Formal Guarantees: The paper provides no convergence analysis, approximation bounds, or theoretical justification for why diffusion should work well for counterfactual prediction.

    The technical contribution is primarily architectural - designing appropriate conditioning and masking schemes for the causal setting.

Experiments & Validation

Dataset: Pharmacokinetic-pharmacodynamic (PK-PD) tumor growth simulator with controlled time-dependent confounding.

Setup:

  • 10,000 training patients, 1,000 validation, 1,000 test
  • 60 time steps per patient
  • Confounding level $\gamma \in [0,15]$ controls treatment assignment bias

Tumor Growth Equation:

\[V_{t+1} = V_t\left([1 + \rho \log(K/V_t) - \beta_c \text{Chemo}_t - (\alpha \text{Radio}_t + \beta \text{Radio}_t^2)] + \epsilon_t\right)\]

Baselines: R-MSN, CRN, GNet, Causal Transformer (all use Monte Carlo dropout for uncertainty)

Key Results:

  • Distributional accuracy (1-Wasserstein): CDM consistently best across all $\gamma$ (15-30% improvement)
  • Point accuracy (RMSE): Competitive at low confounding, best at high confounding ($\gamma \geq 8$)
  • At $\gamma=0$: CDM achieves 0.08% vs 0.11% (next best)
  • At $\gamma=15$: CDM achieves 0.91% vs 1.30%+ (baselines)

Ablation Study: RSA backbone and cosine schedule provide largest gains.

Limitations & Open Problems

Limitations:

  1. Evaluation limited to simulation - RESTRICTIVE (only PK-PD simulator, may not capture real clinical complexity)

  2. Computational intensity - TECHNICAL (reverse diffusion sampling is slow, limits scalability)

  3. Sequential ignorability assumption - NATURAL (standard in causal inference, widely used)

  4. Limited seed evaluation - TECHNICAL (only 4 seeds at 3 confounding levels due to computational constraints)

  5. No theoretical analysis - TECHNICAL (lacks convergence guarantees or approximation bounds)

    Open Problems:

  6. Real-world validation: Apply CDM to large-scale clinical datasets with external validation against expert assessments or randomized trial data

  7. Scalability improvements: Develop faster sampling methods (e.g., DDIM, consistency models) to enable real-time clinical decision support


Bayesian Tensor-on-Tensor Varying Coefficient Model for Forecasting Alzheimer’s Disease Progression

Authors: Yajie Liu, Hengrui Luo, Suprateek Kundu · Institution: UT Health Science Center Houston, Rice University, MD Anderson Cancer Center · Category: stat.ME

Introduces patch-to-voxel Gaussian process priors in Bayesian tensor-on-tensor regression for forecasting brain changes in Alzheimer’s disease progression.

Tags: tensor-regression gaussian-processes neuroimaging alzheimers-disease bayesian-methods spatial-statistics longitudinal-modeling brain-aging

arXiv · PDF

Problem Formulation

Motivation: Alzheimer’s disease progression modeling requires forecasting future brain changes from longitudinal neuroimaging data. Existing tensor-on-tensor (TOT) models assume linear voxel-to-voxel relationships and fail to capture local spatial heterogeneity, limiting their ability to predict complex neurobiological changes.

Mathematical setup: Consider $N$ samples with $D$-dimensional input tensors $\mathbf{X}_n \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ and output tensors $\mathbf{Y}_n \in \mathbb{R}^{p_1 \times \cdots \times p_D}$, plus covariates $\mathbf{Z}_n \in \mathbb{R}^{S \times 1}$. For each voxel $v \in \mathcal{V}$, define input patch $\mathbf{X}_{P,n}(v) \in \mathbb{R}^{h^D}$ centered at voxel $v$ with size $h^D$.

Assumptions:

  1. Images are registered to common template with spatial coordinates preserved across samples
  2. Residuals follow $\text{vec}(\mathbf{E}_n) \stackrel{i.i.d.}{\sim} \mathcal{N}_V(0, \sigma_e^2 \mathbf{I}_V)$
  3. Tensor coefficients admit low-rank PARAFAC decomposition with rank $R$
  4. Voxel-wise nonlinear functions are independent across voxels

    Toy example: For $D=3$, patch size $h=3$, and voxel $v=(2,2,2)$ in a $5 \times 5 \times 5$ image, the input patch $\mathbf{X}_{P,n}(v)$ contains the $3 \times 3 \times 3$ neighborhood around voxel $(2,2,2)$. The nonlinear mapping $M_{n,v}(\mathbf{X}_{P,n}(v)): \mathbb{R}^{27} \to \mathbb{R}$ predicts the scalar output $Y_n(v)$ from this 27-dimensional patch.

    Formal objective:

    \[\mathbf{Y}_n = \boldsymbol{\Gamma} + \boldsymbol{\Theta} \odot \mathbf{M}_{n,\cdot}(\mathbf{X}_{P,n}) + \sum_{s=1}^S \mathbf{D}_s z_{ns} + \mathbf{E}_n\]
Method

The Bayesian tensor-on-tensor varying coefficient (BTOT-VC) model operates at the voxel level:

\[Y_n(v) = \Gamma(v) + \Theta(v)M_{n,v}(\mathbf{X}_{P,n}(v)) + \sum_{s=1}^S D_s(v)z_{ns} + E_n(v)\]

Key components:

  1. Low-rank tensor decomposition via PARAFAC:

    \[\boldsymbol{\Gamma} = \sum_{r=1}^R \boldsymbol{\gamma}_{1 \cdot,r} \circ \cdots \circ \boldsymbol{\gamma}_{D \cdot,r}\] \[\boldsymbol{\Theta} = \sum_{r=1}^R \boldsymbol{\theta}_{1 \cdot,r} \circ \cdots \circ \boldsymbol{\theta}_{D \cdot,r}\]
  2. Gaussian process priors for nonlinear patch-to-voxel mappings:

    \[M_{\cdot,v}(\mathbf{X}_P(v)) \stackrel{\text{indep}}{\sim} \text{GP}(0, K_v)\] \[K_v(i,j) = \phi_1 \exp\left(-\phi_2 \|\mathbf{X}_{P,i}(v) - \mathbf{X}_{P,j}(v)\|_2^2\right)\]
  3. Spatial smoothing priors on tensor margins:

    \[\boldsymbol{\gamma}_{d \cdot,r} \sim \mathcal{N}(0, \tau_\gamma \mathbf{W}^\gamma_{d,r})\]
    where $\mathbf{W}^\gamma_{d,r}(k_1,k_2) = w^\gamma_{d,r} \exp(-\alpha^\gamma_{d,r} k_1 - k_2 )$.

    Applied to toy example: For the $3 \times 3 \times 3$ patch at voxel $(2,2,2)$, the GP $M_{n,(2,2,2)}$ maps the 27-dimensional patch vector to a scalar via the squared exponential kernel, capturing nonlinear dependencies between the local neighborhood and the output voxel value.

Novelty & Lineage

Step 1 — Prior work:

  1. Lock (2018): “Tensor-on-tensor regression” — introduced basic TOT framework with linear voxel-to-voxel mapping
  2. Wang & Xu (2024): “Bayesian tensor-on-tensor regression using Tucker decomposition” — Bayesian TOT with automatic rank selection but linear assumptions
  3. Gahrooei et al. (2021): “Generalized tensor-on-tensor regression” — extended to multiple tensor covariates, still linear

    Step 2 — Delta: This paper adds: (i) Gaussian process priors for nonlinear voxel-level dependencies, (ii) patch-to-voxel mapping enabling spatial heterogeneity, (iii) embarrassingly parallel MCMC for scalability.

    Step 3 — Theory-specific assessment:

    • Main contribution is combining GP nonlinearity with tensor structure, which is conceptually straightforward but technically involved
    • Proof techniques are routine: standard Bayesian computation with GP predictive distributions
    • No theoretical bounds provided; purely methodological contribution
    • The patch-to-voxel idea has precedent in image processing but novel for tensor regression

    Verdict: INCREMENTAL — solid engineering advance combining known techniques (GP priors + tensor decomposition) with reasonable application, but lacks fundamental theoretical novelty.

Proof Techniques

This is primarily a methodological paper with standard Bayesian computation rather than theoretical proofs. The main technical contributions are:

  1. MCMC algorithm design: Embarrassingly parallel updates for voxel-specific GP terms $M_{n,v}$, exploiting independence across voxels $v \in \mathcal{V}$.

  2. GP predictive distributions: Uses standard kriging formula for out-of-sample prediction:

    \[Y_{\text{test}}(v) | Y_{\text{train}}(v) \sim \mathcal{N}\left(m_{\text{test}}(v) + \mathbf{K}_{\text{vo}}^T \mathbf{K}_{\text{vv}}^{-1}(Y_{\text{train}}(v) - m_{\text{train}}(v)), \mathbf{K}_{\text{oo}} - \mathbf{K}_{\text{vo}}^T \mathbf{K}_{\text{vv}}^{-1} \mathbf{K}_{\text{vo}}\right)\]

    where $\mathbf{K}_{\text{vv}} = \Theta(v)^2 \mathbf{K}_{v,TT} + \sigma_e^2 \mathbf{I}_{N_{\text{train}}}$.

  3. Tensor margin updates: Conjugate Gaussian updates for tensor margins $\boldsymbol{\gamma}_{d \cdot,r}, \boldsymbol{\theta}_{d \cdot,r}$ given current GP values.

  4. Hyperparameter updates: Metropolis-Hastings for lengthscale parameters using log-normal random walk proposals.

    The computational scalability stems from the factored structure enabling parallel computation across voxels, not from novel algorithmic insights.

Experiments & Validation

Datasets:

  • Synthetic 3D images: $32 \times 32 \times 32$ and $8 \times 8 \times 8$ tensors
  • ADNI dataset: 639 participants, T1-weighted MRI at baseline, 6-month, 12-month visits
  • Cortical thickness maps downsampled to $48 \times 48 \times 48$

Baselines: TOT (Lock 2018), RTOT, RPCA, DL-IIR (Ma et al. 2025) - many failed to scale to large 3D images

Key numbers:

  • Simulation: BTOT-VC achieves RPE ≈ 0.4-0.6 vs competitors ≈ 0.9-1.0
  • ADNI prediction: Significant improvements in voxel-wise correlations across 83 cortical ROIs
  • Brain age gap (BAG): F1 scores for predicting accelerated aging: MCI cohort shows substantial improvements
  • Computation: 87.6 min per 1000 MCMC iterations for $32^3$ images on 4 CPUs

Limitations: Most competing methods couldn’t scale to realistic image sizes, limiting fair comparison. Only DL-IIR was scalable but performed poorly.

Limitations & Open Problems

Limitations:

  1. RESTRICTIVE: Assumes independence of GP functions across voxels, ignoring spatial correlations in the nonlinear terms
  2. TECHNICAL: Requires manual selection of patch size $h$ and tensor rank $R$ via cross-validation/DIC
  3. RESTRICTIVE: Common brain mask assumption limits applicability to datasets with heterogeneous sparsity patterns
  4. TECHNICAL: Squared exponential kernel choice may be suboptimal for non-smooth image features
  5. NATURAL: Computational cost scales linearly with number of voxels and quadratically with sample size per voxel

    Open problems:

  6. Develop spatially correlated GP priors that preserve computational tractability while modeling spatial dependencies in nonlinear terms
  7. Extend to multi-task settings where multiple output modalities share common spatial structure but have distinct nonlinear mappings