Apr 12, 2026 Applied AI 5 papers

Applied AI Digest — Apr 12, 2026

ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning

Authors: Shifeng Liu, Zhengye Zhang, Sirui Zhao, Xinglong Mao et al. (10 authors) · Institution: University of Science and Technology of China · Category: cs.CV

ActFER reformulates facial expression recognition as agentic active visual evidence acquisition, using utility-calibrated reinforcement learning to determine when local facial region inspection improves emotion understanding.

Practical Takeaway: The key insight is reformulating perception tasks from passive single-pass reasoning to active evidence acquisition with utility-aware tool selection. The UC-GRPO algorithm’s query-conditional contrastive utility estimation could be adapted to other vision tasks where selective attention/zooming is beneficial. However, the approach requires careful reward engineering and may be overkill for simpler FER scenarios. Research engineers should consider this agentic paradigm for tasks involving fine-grained visual analysis where different samples benefit from different inspection strategies. The emotion-wise EMA calibration technique for stabilizing noisy utility estimates could have broader applications in tool-augmented RL.

Tags: facial-expression-recognition multimodal-llm reinforcement-learning agentic-ai tool-augmented-reasoning action-units affective-computing computer-vision


Task & Setting

ActFER addresses facial expression recognition (FER) through multimodal large language models (MLLMs). Traditional FER methods lack interpretable reasoning and rely on fixed preprocessing, limiting their ability to actively inspect facial regions when needed for fine-grained emotion understanding.

The task takes raw facial images as input and produces structured outputs containing:

predicted Action Units (AUs) based on the Facial Action Coding System (FACS), and

emotion labels from 8 categories (neutral, happiness, sadness, surprise, fear, disgust, anger, contempt). The model operates through a thought-action-observation loop with tool invocation capabilities.

The formal objective combines emotion accuracy and AU prediction quality:

\[R_{acc} = \begin{cases}
w_y + w_{au}F_{1}^{AU}(\hat{A}, A^*), & \hat{y} = y^* \\
r_{wrong} + \frac{1}{2}w_{au}F_{1}^{AU}(\hat{A}, A^*), & \hat{y} \neq y^*
\end{cases}\]

where $F_{1}^{AU}(\hat{A}, A^*) = \frac{2|\hat{A} \cap A^*|}{|\hat{A}| + |A^*|}$ measures AU set overlap.
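The piecewise reward above can be sketched in a few lines of Python. This is only an illustration: the weight values `w_y`, `w_au`, and `r_wrong` are placeholders (the digest does not report them), and treating two empty AU sets as a perfect match is an assumption.

```python
def au_f1(pred_aus, true_aus):
    """Set-overlap F1 between predicted and reference Action Units."""
    pred, true = set(pred_aus), set(true_aus)
    if not pred and not true:
        return 1.0  # assumption: two empty sets count as a perfect match
    return 2 * len(pred & true) / (len(pred) + len(true))


def accuracy_reward(pred_emotion, true_emotion, pred_aus, true_aus,
                    w_y=1.0, w_au=0.5, r_wrong=0.0):
    """Piecewise R_acc; the default weights are hypothetical placeholders."""
    f1 = au_f1(pred_aus, true_aus)
    if pred_emotion == true_emotion:
        return w_y + w_au * f1
    # A wrong emotion label still earns half of the AU credit.
    return r_wrong + 0.5 * w_au * f1


correct = accuracy_reward("happiness", "happiness", ["AU6", "AU12"], ["AU12", "AU25"])
wrong = accuracy_reward("surprise", "happiness", ["AU6", "AU12"], ["AU12", "AU25"])
```

With these placeholder weights, a wrong emotion label is always rewarded less than a correct one, while partial AU agreement still provides some learning signal.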

Success is measured by emotion recognition accuracy/F1 on FERBench (combining AffectNet, RAF-DB, FERPlus, SFEW2.0 test sets) and zero-shot AU detection F1 on DISFA using 8 standard AUs. The training data comprises 54.8K samples (48K SFT + 6.8K RL) synthesized from the four FERBench datasets.

Architecture & Method

Base Architecture: Qwen3VL-4B backbone with vision encoder frozen during training, only fine-tuning LLM linear layers and vision-language projector.
Agentic Pipeline: Interactive thought-action-observation loop where the model generates thoughts $r_t = \pi_{think}(o_t)$ and either executes tool actions $c_t = \pi_{act}(o_t, r_t)$ or terminates with structured prediction $z = (r_t, \hat{A}, \hat{y})$.
Tool Library: Two vision tools implemented via InsightFace - (a) Face Detection-Alignment for standardized facial views and ROI coordinates, (b) Zoom-In for magnifying specific facial regions (forehead-eyebrow, eye-periorbital, nose, mouth-chin).
FACS-Grounded Reasoning: Visual Chain-of-Thought that links local facial movements to Action Units, then reasons from AU combinations to final emotion labels.
Core Technical Contribution: Utility-calibrated policy learning that determines when local inspection is beneficial vs. harmful on a per-sample basis, moving beyond passive single-pass reasoning to active evidence acquisition.
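The thought-action-observation loop described above can be illustrated with a small control-flow sketch. Everything here is hypothetical scaffolding rather than the authors' code: the `Step` and `Prediction` containers, the tool names, the scripted policy, and the choice to force an answer on the final turn are all assumptions; only the turn budget of 4 comes from the training recipe.

```python
from dataclasses import dataclass, field

MAX_TURNS = 4  # interaction horizon T <= 4 reported for the RL stage


@dataclass
class Step:
    thought: str
    action: str             # hypothetical names: "zoom_in", "detect_align", "answer"
    argument: object = None  # region name for a tool, or (AUs, emotion) for "answer"


@dataclass
class Prediction:
    reasoning: str
    action_units: list
    emotion: str
    trace: list = field(default_factory=list)


def run_episode(policy, tools, image, max_turns=MAX_TURNS):
    """Run one episode: think, then either call a tool or emit the final answer."""
    observations = [image]
    trace = []
    for turn in range(max_turns):
        # Assumption: on the last allowed turn the policy is asked to answer.
        step = policy(observations, must_answer=(turn == max_turns - 1))
        trace.append(step.action)
        if step.action == "answer":
            aus, emotion = step.argument
            return Prediction(step.thought, aus, emotion, trace)
        # The tool output becomes the next observation.
        observations.append(tools[step.action](image, step.argument))
    raise RuntimeError("policy did not answer within the turn budget")


def scripted_policy(observations, must_answer):
    """Stand-in for the MLLM: zoom once, then answer."""
    if len(observations) == 1 and not must_answer:
        return Step("Mouth corners are unclear.", "zoom_in", "mouth-chin")
    return Step("Lip corners are pulled up.", "answer", (["AU12"], "happiness"))


tools = {"zoom_in": lambda image, region: f"crop({region})"}
result = run_episode(scripted_policy, tools, image="face.jpg")
```

The point of the sketch is the per-sample decision: the policy may answer immediately or spend turns on tools, which is what the utility-calibrated training is meant to regulate.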

Training Recipe

Stage 1 - Supervised Fine-Tuning:

Data: 48K synthetic multi-turn trajectories from AffectNet, FERPlus, RAF-DB, SFEW2.0
Optimizer: AdamW, LR 1×10^-5 (LLM) / 2×10^-5 (projector), weight decay 0.01
Batch size 16, max sequence length 8192, 2 epochs
Hardware: 4 × NVIDIA A800 80GB GPUs (wall-clock time not reported)

Stage 2 - UC-GRPO Reinforcement Learning:
Data: 6.8K single-turn prompts rebalanced toward harder low-resource emotions
Optimizer: AdamW, LR 2×10^-6, batch size 32, 1 epoch
GRPO with G=5 rollouts per query, max interaction horizon T≤4
KL regularization coefficient 0.1, entropy coefficient 0.01
Hardware: Same 4 × A800 GPUs (wall-clock time not reported)
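For readers unfamiliar with GRPO, the group-relative advantage that underlies Stage 2 can be sketched as follows. This shows only the standard GRPO normalization over the G=5 rollouts of one query; the utility calibration and emotion-wise EMA that distinguish UC-GRPO are not reproduced here, and the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (one query)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for G=5 rollouts of a single query.
rollout_rewards = [1.25, 0.125, 1.0, 0.0, 1.5]
advantages = group_relative_advantages(rollout_rewards)
```

Rollouts that beat the group mean get positive advantages and the rest get negative ones, so no separate value network is needed; if all rollouts earn the same reward, every advantage is zero and the query contributes no gradient.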

Novelty & Lineage

Prior Work:

ExpLLM (2024): Integrated Chain-of-Thought reasoning with AU information as interpretable cues for FER, achieving strong results but still passive.
UniFER (2024): Used RL with verifiable rewards for post-training FER improvement, but lacked active perception capabilities.
FEALLM (2025): Constructed AU-aligned datasets to strengthen local detail modeling, but remained in fixed visual evidence paradigm.

Delta: This paper adds:
agentic tool-augmented visual reasoning with dynamic face detection, alignment, and selective zoom-in capabilities
UC-GRPO algorithm with query-conditional contrastive utility estimation and emotion-wise EMA calibration.

Applied-Specific Assessment:
- Architectural novelty: The agentic formulation with utility-calibrated tool selection is novel for FER, though tool-augmented MLLMs exist in other domains.
- Benchmark gains: Substantial improvements - +12.34 Acc and +21.98 F1 over strongest general MLLM (Gemini-2.5-Flash), +5.05 Acc and +12.13 F1 over best FER baseline (UniFER).
- Fair comparisons: Uses same evaluation protocol (FERBench) with models of comparable or larger size (4B vs 7B+ baselines).
- Generalizability concerns: Gains may be partly attributable to synthetic data quality and specialized reward engineering rather than purely architectural advances.
Verdict: SIGNIFICANT — The agentic reformulation with utility-calibrated tool selection represents a clear advance in MLLM-based FER that addresses real limitations of passive approaches.

Benchmarks & Results

FERBench Emotion Recognition: 73.89% accuracy, 67.45% macro-F1 vs. previous best UniFER (68.84% acc, 55.32% F1) - improvement of +5.05 acc, +12.13 F1
RAF-DB F1: 82.72% vs. ExpLLM 84.80% (best baseline) - ActFER slightly lower by 2.08%
FERPlus F1: 59.92% vs. UniFER 58.55% (best baseline) - improvement of +1.37%
AffectNet F1: 57.66% vs. ExpLLM 46.86% (best baseline) - improvement of +10.80%
SFEW2.0 F1: 51.13% vs. ExpLLM 43.49% (best baseline) - improvement of +7.64%
DISFA Zero-shot AU Detection: 58.2% average F1 vs. InternVL3.5-4B 40.8% (best baseline) - improvement of +17.4%
Per-emotion Results: Strong gains on contempt (51.00% vs. 20.49% previous best), fear (60.00% vs. 54.79%), disgust (54.71% vs. 46.83%)

Results are strong overall but not uniform: ActFER leads on aggregate FERBench and on three of the four constituent datasets (AffectNet, SFEW2.0, FERPlus), while trailing ExpLLM on RAF-DB by 2.08 F1 points.

Compute & Efficiency

Model size: 4B parameters (Qwen3VL-4B backbone)
Training compute: 4 × NVIDIA A800 80GB GPUs for both SFT and RL stages (specific GPU hours not reported)
Inference speed/latency: Not reported, though multi-turn tool interaction likely increases latency vs. single-pass models
Memory footprint: Not explicitly reported, but 4B parameter model suggests moderate memory requirements
Deployment practicality: Moderate - requires tool integration (InsightFace) and multi-turn interaction capability, but uses relatively compact 4B model compared to larger baselines

Real-World Applicability

Raw image capability: Model processes raw in-the-wild images without requiring external preprocessing, demonstrated on standard FER benchmarks containing natural images
Tool integration: Uses InsightFace face analysis toolkit for detection and alignment, showing integration with existing computer vision tools
Cross-dataset generalization: Zero-shot AU detection on DISFA (video frames) using model trained only on static image datasets shows some domain transfer
No production deployment: No reported real-world deployment results, hardware experiments on actual systems, or extensive sim-to-real validation
Benchmark-focused evaluation: Testing primarily on curated academic datasets rather than streaming video or mobile deployment scenarios

Limitations & Failure Modes

ENGINEERING: Dependency on external face detection tools (InsightFace) creates additional failure points and deployment complexity
EVALUATION: Limited to academic datasets; lacks evaluation on streaming video, mobile devices, or adverse lighting conditions
FUNDAMENTAL: Multi-turn interaction increases inference latency and computational cost compared to single-pass models
ENGINEERING: Synthetic training data may not capture full diversity of real-world facial expressions and image conditions
EVALUATION: UC-GRPO algorithm complexity makes training sensitive to hyperparameters and requires careful tuning

Failure Modes:
- Face detection failures cause trajectory termination and fallback to degraded reasoning
- Over-aggressive zoom-in on low-quality images may reduce rather than improve evidence quality

Learning Vision-Language-Action World Models for Autonomous Driving

Authors: Guoqing Wang, Pin Tang, Xiangxuan Ren, Guodongfang Zhao et al. (6 authors) · Institution: Shanghai Jiao Tong University, Huawei · Category: cs.CV

VLA-World combines vision-language-action models with world model imagination, using predicted trajectories to generate future frames and then reasoning over these generated images to refine autonomous driving decisions.

Practical Takeaway: For research engineers in autonomous driving: The key insight is using short-term trajectory predictions to condition future frame generation, then reasoning over those generated images to refine trajectory planning. This creates a useful “imagination then reflection” loop that could be applied to other sequential decision-making problems. The three-stage training approach (pretraining→SFT→GRPO) provides a template for training complex multi-modal reasoning systems. However, the computational overhead of visual generation may limit real-time applicability, so consider whether the reasoning benefits justify the additional complexity for your specific use case.

Tags: autonomous_driving vision_language_action world_models trajectory_planning multimodal reinforcement_learning future_prediction visual_generation


Task & Setting

This paper addresses autonomous driving through Vision-Language-Action (VLA) world models. The practical need arises from limitations in current approaches: VLA models lack temporal dynamics modeling while world models struggle with reasoning about their generated futures. This creates safety and foresight limitations in autonomous driving systems.

The task involves multi-view visual perception and trajectory planning for autonomous driving. Given multi-view camera inputs $I^k_t \in \mathbb{R}^{H \times W \times 3}$ for $k \in \{1, \dots, K\}$ cameras and ego status $S_t$, the system must output future waypoint trajectories:

\[\tau_{t:t+H} = \{p_{t+1}, p_{t+2}, ..., p_{t+H}\}, \quad p_{t+h} \in \mathbb{R}^2\]

Success is measured by: 1) L2 displacement error between predicted and ground truth trajectories at 1s, 2s, 3s horizons, 2) Collision rate percentage, and 3) Fréchet Inception Distance (FID) for generated future frames.
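As an illustration of the first metric, the sketch below computes L2 displacement error at the 1s, 2s, and 3s horizons for waypoints sampled every 0.5s. The `average_up_to` switch reflects the two conventions commonly associated with the ST-P3 protocol (average over all waypoints up to the horizon) and the UniAD protocol (error at the horizon itself); treat that mapping, the function names, and the toy trajectories as assumptions rather than the paper's evaluation code.

```python
import math


def l2_errors(pred, gt):
    """Euclidean distance between predicted and reference (x, y) waypoints."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]


def l2_at_horizons(pred, gt, hz=2, horizons=(1, 2, 3), average_up_to=True):
    """L2 error per horizon in seconds, for waypoints sampled at `hz` per second."""
    errs = l2_errors(pred, gt)
    out = {}
    for h in horizons:
        k = h * hz  # number of waypoints covered by h seconds
        out[h] = sum(errs[:k]) / k if average_up_to else errs[k - 1]
    return out


# Toy 3s trajectories: the prediction drifts sideways by 0.1m per waypoint.
gt = [(0.0, float(i)) for i in range(1, 7)]
pred = [(0.1 * i, float(i)) for i in range(1, 7)]

averaged = l2_at_horizons(pred, gt, average_up_to=True)
at_horizon = l2_at_horizons(pred, gt, average_up_to=False)
```

The averaged variant yields smaller numbers for the same trajectory whenever error grows with time, which is worth keeping in mind when comparing figures reported under the two protocols.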

The paper introduces nuScenes-GR-20K, a 20K sample dataset derived from nuScenes specifically for generation and reasoning tasks.

Architecture & Method

Base architecture: Qwen2-VL-2B as the multimodal backbone for unified vision-language-action processing
Multi-stage pipeline with five key modules:
- Perception: Detects dynamic agents and estimates 3D positions from multi-view inputs
- Short-term prediction: Predicts next waypoint and driving direction at 0.5s intervals
- Generation: Produces future frame tokens conditioned on predicted trajectory using VQGAN tokenization
- Thinking: Reflective reasoning over generated future images to assess risks and validate trajectories
- Action planning: Outputs refined 3s trajectory based on reflective analysis
Core formulation unifies decision policy and world model imagination:
\[p(\tau_{t:t+H}, x_{t+1} | o_{1:t}, g) = p(\tau_{t:t+H} | o_{1:t}, g) \cdot p(x_{t+1} | o_{1:t}, \tau_{t+1})\]
Visual generation through autoregressive next-token prediction:
\[P(Q_{t+1}^k) = \prod_{i=1}^N P_\theta(q_i^k | q_{<i}^k, h_t, L)\]
The key innovation is the reflective reasoning loop: predict short-term trajectory → generate corresponding future image → reason over generated image → refine trajectory planning.

Training Recipe

Stage 1 - Visual Pretraining (30 epochs):
- Data: Large-scale image-instruction datasets (480k samples) for multi-view visual generation
- Optimizer: AdamW, learning rate 5×10⁻⁴, batch size 16 per device
- Hardware: 8×80GB GPUs
- Activates both visual understanding and generation capabilities

Stage 2 - Supervised Fine-Tuning (12 epochs):
- Data: nuScenes-GR-20K (20k samples) for perception, generation, reasoning, and planning
- Optimizer: AdamW, learning rate 1×10⁻⁴
- Multi-task training on perception, short-term prediction, generation, thinking, and action modules

Stage 3 - Reinforcement Learning (1 epoch):
- Algorithm: Group Relative Policy Optimization (GRPO)
- Learning rate: 1×10⁻⁶, global batch size 16
- Samples 8 candidate responses per prompt for policy gradient estimation
- Reward function combines format, prediction, visual, action, and trajectory rewards

Wall-clock time not reported. Training uses PyTorch framework.

Novelty & Lineage

Prior work:

FSDrive (NeurIPS 2025): Spatiotemporal Chain-of-Thought using Qwen2-VL, generates future frames as reasoning steps but only for front view
DriveDreamer (ECCV 2024): Diffusion-based framework for realistic future driving videos with action prediction
OccWorld (ECCV 2024): Uses 3D occupancy observations to generate future occupancy maps for multi-view consistency

Delta: VLA-World adds: 1) Multi-view consistent future generation (vs. FSDrive’s single view), 2) Explicit reflective reasoning over self-generated futures, 3) Closed-loop refinement from initial trajectory prediction through visual imagination to final trajectory adjustment, 4) Three-stage training with GRPO reinforcement learning

Assessment:
- Architectural idea: The perception→prediction→generation→reasoning→planning pipeline is a logical but incremental extension combining existing VLA and world model concepts
- Benchmark gains: Modest improvements (e.g., L2 error 0.28→0.26m average on ST-P3) within typical noise margins for autonomous driving
- Comparisons: Fair comparison on same nuScenes dataset with same evaluation protocols, though uses additional ego-state information marked with *
- Scale dependency: Gains likely depend on the multi-stage training and GRPO optimization, which require significant compute
Verdict: INCREMENTAL — Solid engineering combining VLA models with world model generation, but the core insight of reasoning over self-generated futures is a straightforward extension of existing approaches.

Benchmarks & Results

ST-P3 trajectory planning: L2 error 0.26m average (vs. FSDrive 0.28m), collision rate 0.08% (vs. FSDrive 0.10%)
UniAD trajectory planning: L2 error 0.42m average (vs. FSDrive 0.45m), collision rate 0.12% (vs. FSDrive 0.16%)
Future frame generation: FID score 9.8 (vs. FSDrive 10.1, DriveDreamer 52.6, Drive-WM 15.8)
Action prediction F1-scores: Forward 95.88%, Left 74.22%, Right 75.06% (vs. base Qwen2-VL-2B†: Forward 92.60%, Left 61.78%, Right 66.52%)

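Results show consistent but modest improvements in planning: relative to FSDrive, L2 error falls by 0.02-0.03m and collision rate by 0.02-0.04 percentage points. The gains are most pronounced in action prediction, where Left and Right F1 rise by 12.44 and 8.54 points over the base Qwen2-VL-2B, suggesting the reflective reasoning component provides meaningful benefits. Generation quality (FID 9.8) slightly exceeds the autoregressive FSDrive baseline (10.1) and is better than the reported DriveDreamer (52.6) and Drive-WM (15.8) scores.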

Compute & Efficiency

Model size: 2B parameters (Qwen2-VL-2B base)
Training compute: 8×80GB GPUs for three-stage training (pretraining 30 epochs, SFT 12 epochs, RL 1 epoch) - total wall-clock time not reported
Inference speed/latency: Not reported
Memory footprint: Not reported
Deployment practicality: Moderate - requires multi-stage training pipeline and GRPO optimization, but 2B parameter size is relatively lightweight compared to larger VLA models. Multi-view processing and visual token generation likely add computational overhead during inference.

Real-World Applicability

Dataset evaluation: Tested on nuScenes real-world driving dataset with actual multi-view camera data from autonomous vehicles
Evaluation split: Uses the nuScenes validation split (recorded real-world data, not a simulator) following established autonomous driving evaluation protocols
No deployment results: No evidence of testing on actual autonomous vehicles or hardware systems
No sim-to-real validation: Evaluation limited to offline trajectory prediction on recorded driving sequences
Real-world constraints: Multi-view processing and visual generation requirements may pose challenges for real-time deployment in production autonomous driving systems

Limitations & Failure Modes

ENGINEERING: Requires multi-stage training pipeline making optimization complex and potentially unstable
EVALUATION: Limited to offline trajectory prediction - no closed-loop driving simulation or real-world validation
ENGINEERING: Visual generation computational overhead may limit real-time performance for safety-critical applications
FUNDAMENTAL: Relies on short-term (0.5s) prediction horizon for conditioning future generation, potentially missing longer-term temporal dependencies
EVALUATION: No ablation on different prediction horizons or failure analysis under edge cases

Failure modes:
Visual generation quality degradation in complex/rare scenarios could lead to incorrect reasoning and unsafe trajectory planning
Error accumulation through the multi-stage pipeline (perception→prediction→generation→reasoning→planning) could amplify initial mistakes

CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation

Authors: Yushi Feng, Junye Du, Qifan Wang, Zizhan Ma et al. (8 authors) · Institution: University of Hong Kong · Category: cs.LG

CORA applies conformal risk control to mobile GUI agent safety, providing statistical guarantees on harmful action rates through selective execution with trainable risk estimation and semantic intervention.

Practical Takeaway: If building mobile GUI agents, CORA provides a practical post-policy safety framework that can wrap existing agents without retraining. The key insight is reformulating safety as selective execution with statistical guarantees via conformal risk control. The Goal-Lock mechanism against visual injection and trainable diagnostician for semantic interventions are worth implementing. However, you’ll need step-level harm annotations and calibration data. The biggest value is in the systematic approach to GUI safety rather than novel algorithms.

Tags: mobile_gui safety conformal_prediction vision_language_models autonomous_agents risk_control human_computer_interaction


Task & Setting

Mobile GUI automation agents powered by vision language models face severe safety risks including financial harm, privacy violations, and social damage when operating autonomously. Current safeguards lack formal guarantees and user control.

The task is selective action execution for mobile GUI control. Given a user goal $g$, environment observation $o_t = (x_t, u_t)$ containing screenshot and UI tree, and proposed action $\hat{a}_t$ from a base policy, the system must decide: EXECUTE autonomously or ABSTAIN and intervene. The objective is to bound the executed harm rate:

\[E[L(Z_{n+1}; \tau)] \leq \alpha\]

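where $L(Z_t; \tau) = \ell(Z_t) \cdot \mathbb{I}\{s_t \leq \tau\}$, with $\ell(Z_t) \in \{0,1\}$ indicating a harmful outcome and $s_t$ the risk score. Only executed actions ($s_t \leq \tau$) contribute to the loss, so abstaining on a harmful action costs nothing under this objective.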
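A minimal sketch of how such a threshold could be calibrated with conformal risk control follows. It uses the standard finite-sample bound $\frac{n}{n+1}\hat{R}_n(\tau) + \frac{B}{n+1} \leq \alpha$ with loss bound $B = 1$ and restricts candidate thresholds to observed calibration scores; the function name, the toy data, and the unweighted, exchangeable setting are assumptions, not the paper's implementation.

```python
def calibrate_threshold(scores, harms, alpha, loss_bound=1.0):
    """Largest calibration score tau whose conformal risk bound stays within alpha.

    An action is executed when score <= tau, so the executed-harm loss can only
    grow as tau grows; the search can therefore stop at the first violation.
    """
    n = len(scores)
    best = float("-inf")  # -inf means "abstain on every action"
    for tau in sorted(set(scores)):
        executed_harm = sum(h for s, h in zip(scores, harms) if s <= tau)
        # (n / (n + 1)) * empirical_risk + loss_bound / (n + 1), simplified
        bound = (executed_harm + loss_bound) / (n + 1)
        if bound > alpha:
            break
        best = tau
    return best


# Toy calibration set: risk scores and step-level harm labels (1 = harmful).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
cal_harms = [0, 0, 0, 0, 0, 1, 0, 1, 1]

tau_hat = calibrate_threshold(cal_scores, cal_harms, alpha=0.25)
tau_strict = calibrate_threshold(cal_scores, cal_harms, alpha=0.05)
```

In this toy set, a budget of 0.25 tolerates one executed harmful step among the calibration samples, while a budget of 0.05 cannot be met with only nine samples and the procedure falls back to abstaining on everything. That illustrates why the amount of calibration data limits how small a risk level can be certified.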

Success is measured via: Harm Rate (HR), Goal Achievement Rate (GAR), Intervention F1 (IF1), and Over-Intervention Rate (OIR). The paper introduces Phone-Harm benchmark with Harm-150 (harmful tasks with step-level labels) and Normal-150 (benign tasks) subsets, totaling 300 mobile tasks across 29 apps.

Architecture & Method

Guardian Model (System 1): Action-conditional risk estimator $R_\psi(g, o_t, h_t, \hat{a}_t)$ using VLM backbone with trainable risk head, outputting scalar risk score $s_t \in [0,1]$.
Conformal Risk Control: Calibrates execute/abstain threshold $\hat{\tau}(\alpha)$ using held-out calibration set to guarantee expected executed harm rate $\leq \alpha$.
Diagnostician Model (System 2): Generative VLM with LoRA adapters for rejected actions, outputting structured report with reasoning, harm type, and intervention recommendation (CONFIRM/REFLECT/ABORT).
Goal-Lock mechanism: Freezes user intent $g$ and treats on-screen text as untrusted to resist visual injection attacks.
Weighted calibration for distribution shift: $w(c) \propto p_{target}(c)/p_{cal}(c)$ to handle app/OS/device shifts.
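The reweighting idea in the last item can be illustrated as follows. The sketch estimates $p_{cal}(c)$ from the calibration samples, takes $p_{target}(c)$ as given, and uses the normalized weights in a weighted executed-harm estimate. It is not the full weighted conformal procedure, and the context labels, function names, and target distribution are hypothetical.

```python
from collections import Counter


def importance_weights(cal_contexts, target_probs):
    """Per-sample weights proportional to p_target(c) / p_cal(c), normalized to sum to 1."""
    counts = Counter(cal_contexts)
    n = len(cal_contexts)
    raw = [target_probs.get(c, 0.0) / (counts[c] / n) for c in cal_contexts]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_executed_harm(scores, harms, weights, tau):
    """Weighted share of calibration mass that is both executed and harmful."""
    return sum(w * h for s, h, w in zip(scores, harms, weights) if s <= tau)


# Calibration data over-represents banking apps relative to the deployment mix.
contexts = ["bank", "bank", "bank", "chat"]
target = {"bank": 0.5, "chat": 0.5}  # assumed deployment distribution
weights = importance_weights(contexts, target)

risk = weighted_executed_harm(
    scores=[0.1, 0.2, 0.9, 0.3], harms=[0, 1, 1, 1], weights=weights, tau=0.5
)
```

Here the single chat sample carries as much weight as the three banking samples combined, so a harmful chat action dominates the estimate. This also shows the practical requirement noted later: the target distribution has to be known or estimated.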

Core technical contribution is applying conformal risk control to GUI safety with action-conditional risk estimation and semantic intervention generation.

Training Recipe

Guardian training: Weighted binary cross-entropy loss on step-level harm labels from policy rollouts, with class imbalance weighting $\omega > 1$ for harmful actions.
Diagnostician training: Causal language modeling loss on annotated safety trajectories, computing loss only on diagnostic tokens (reasoning, harm type, intervention).
Data: Trajectory collection from fixed base policies in mobile environment with 42 apps, step-level harm annotation, subsampled training mixture to handle class imbalance.
Calibration: Held-out $D_{cal}$ for threshold selection via weighted CRC with trajectory-level splitting.
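The Guardian's class-weighted loss can be sketched in plain Python. The value of $\omega$ is not reported in the digest, so `harm_weight=3.0` is a placeholder, and the clamping constant is an implementation assumption.

```python
import math


def weighted_bce(probs, labels, harm_weight=3.0, eps=1e-7):
    """Mean binary cross-entropy with harmful steps (label 1) up-weighted."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp so the logarithms stay finite
        total += -(harm_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# The same 0.5 prediction costs more when the true label is harmful.
missed_harm = weighted_bce([0.5], [1])
false_alarm = weighted_bce([0.5], [0])
```

With a weight above 1, under-scoring a harmful step is penalized more than over-scoring a benign one, which pushes the risk head toward caution on a dataset where harmful steps are rare.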

Training details (optimizer, learning rate, hardware) not reported in detail.

Novelty & Lineage

Prior work: AppAgent (2024), Ferret-UI (2024) - screenshot-grounded mobile GUI agents focusing on capability. OS-Harm (2025) - desktop safety evaluation without formal guarantees. Conformal Risk Control (Angelopoulos et al., 2024) - statistical risk bounds for general ML tasks.

Delta: This paper applies conformal risk control specifically to mobile GUI safety with action-conditional risk estimation, introduces trainable diagnostician for semantic interventions, and Goal-Lock mechanism against visual injection.

Assessment: The architectural idea combines existing techniques (conformal prediction + VLM risk estimation) in a novel GUI safety context. Benchmark gains are meaningful on safety metrics with improved safety-helpfulness trade-offs. Comparisons appear fair using same base policies. However, the core technical novelty is primarily in application rather than fundamental algorithmic innovation.

The main value is in the systematic application of conformal methods to GUI safety with practical components like Goal-Lock and semantic diagnostician.

Verdict: INCREMENTAL — solid application of conformal risk control to mobile GUI safety with practical system design, but limited fundamental technical novelty.

Benchmarks & Results

Phone-Harm Harm-150: HR 4.37%, GAR 79.80%, IF1 85.29% vs best baseline AutoGLM+VLM-critic (HR 4.30%, GAR 46.46%, IF1 44.21%) - harm rate is essentially unchanged (0.07 points higher), while GAR improves by 33.34 points and IF1 by 41.08 points.
Phone-Harm merged (300 tasks): HR 2.19%, GAR 89.69%, IF1 85.29% vs UI-TARS-1.5 (HR 2.87%, GAR 77.28%, IF1 57.14%) - consistent improvements.
MobileRisk: Accuracy 61.27%, Precision 58.65%, Recall 76.47%, F1 66.38% vs VLM-as-critic baseline (Acc 86.92%, F1 19.05%) - much better recall and F1, at the cost of 25.65 points of accuracy.
AndroidWorld CORE20: Success rate 40.0% vs AutoGLM base 30.0% - capability improvement on benign tasks.

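Results show a better safety-helpfulness trade-off across benchmarks, driven mainly by higher goal achievement and intervention quality at a similar harm rate; on MobileRisk the F1 gain comes with lower raw accuracy.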

Compute & Efficiency

Model size: VLM backbone with lightweight risk head and LoRA adapters for diagnostician (exact parameters not reported).
Training compute: Not reported in detail.
Inference: Guardian provides fast scalar risk scoring, diagnostician generates structured reports only for rejected actions.
Memory footprint: Not reported.
Deployment practicality: Post-policy design allows wrapping existing GUI agents without retraining base models, making deployment practical.

Real-World Applicability

Mobile environment testing: Sandboxed mobile environment with 42 apps, realistic app states with authentication and content.
Step-level harm evaluation: Real mobile GUI actions (TAP, TYPE, SWIPE) with concrete consequences.
Distribution shift handling: Weighted calibration tested across apps, OS versions, UI themes.
Benchmark realism: Phone-Harm covers 29 commonly used apps with human-authored tasks and annotations.
Integration ready: Post-policy framework can wrap existing mobile GUI agents without modification.

Limitations & Failure Modes

FUNDAMENTAL: Requires held-out calibration data for threshold setting, limiting adaptation to new domains.
ENGINEERING: Guardian and diagnostician require training data with step-level harm annotations.
ENGINEERING: Weighted calibration assumes access to deployment distribution statistics for reweighting.
EVALUATION: Phone-Harm limited to 29 apps, may not capture full diversity of mobile interactions.
FUNDAMENTAL: Sequential dependence in trajectories not fully addressed despite blockwise splitting.

Failure modes:
Distribution shift beyond calibration coverage could violate risk guarantees.
Visual injection attacks more sophisticated than Goal-Lock defenses could bypass safety controls.

Multimodal Large Language Models for Multi-Subject In-Context Image Generation

Authors: Yucheng Zhou, Dubing Chen, Huan Zheng, Jianbing Shen · Institution: University of Macau · Category: cs.LG

MUSIC introduces the first MLLM designed for multi-subject in-context image generation, using vision Chain-of-Thought reasoning and automated data generation to significantly outperform existing methods in maintaining subject identity and spatial relationships.

Practical Takeaway: This work demonstrates that MLLMs can effectively handle multi-subject image generation through structured reasoning approaches. The key insight is using vision Chain-of-Thought to decompose complex spatial composition tasks into step-by-step reasoning. The automated data generation pipeline is particularly valuable for practitioners facing data scarcity in specialized generation tasks. Consider implementing the semantics-driven spatial layout planning approach for controllable generation tasks. However, be aware that the method requires significant computational resources and may not generalize well beyond synthetic training distributions. The test-time scaling technique (Pass@N) provides a useful quality-compute trade-off mechanism.

Tags: multimodal-llm text-to-image-generation in-context-learning multi-subject-generation vision-language-models chain-of-thought-reasoning spatial-layout-planning synthetic-data-generation


Task & Setting

Multi-subject text-to-image generation addresses the challenge of synthesizing images that contain multiple specific subjects while maintaining their identities and spatial relationships. Current methods suffer from subject missing and semantic drift as the number of reference subjects increases, making personalized applications like multi-person scene synthesis and product visualization difficult to achieve at scale.

The task takes as input:

a set of reference subject images $\{I_{subj1}, I_{subj2}, \dots, I_{subjS}\}$ where $S$ is the number of subjects, and
a text instruction describing the desired scene composition. The output is a synthesized image $I_{tgt}$ that contains all reference subjects arranged according to the text prompt while preserving their visual identities.

Success is measured using three automatic metrics: DINO score for image-level fidelity using self-supervised features, CLIP-I for subject identity preservation by comparing generated images to reference subjects, and CLIP-T for text-image alignment. Human evaluation assesses overall quality, subject identity, and adherence to instructions.

The paper introduces the MSIC benchmark specifically designed for multi-subject in-context generation, containing images with 1-12 subjects to test scalability. Evaluation is also conducted on DreamBench for single-subject scenarios.

Architecture & Method

Base architecture: SEED-X multimodal large language model (MLLM) framework for processing and generating both textual and visual data.
Automated data generation pipeline: Integrates LLM (Qwen-3), T2I model (FLUX-1.0-DEV), VLM (Qwen-2.5 VL), I2I model (UNO-FLUX), open-vocabulary detection (GroundingDINO), and segmentation model (SAM2) to generate training pairs automatically.
Vision Chain-of-Thought (CoT) mechanism: Guides step-by-step reasoning from subject images to semantic understanding and final generation, learning mapping:
\[f_1(T_{instr}, \{I_{subji}\}^S_{i=1}) \rightarrow (\hat{C}_{CoT}, \hat{L}_{spatial})\]
Semantics-driven spatial layout planning: Partitions target image into 8×8 grid, assigns semantic categories to patches using SAM2 segmentation and CLIP verification. Dynamic IoU threshold:
\[\tau = \lambda \cdot \frac{1}{K} \sum_{i=1}^K IoU(b_i, p_i)\]
Two-stage training: Stage 1 learns visual reasoning and spatial planning, Stage 2 learns image generation from plans:
\[f_2(C_{CoT}, L_{spatial}) \rightarrow \hat{I}_{tgt}\]
Overall training objective combines both stages:
\[L = w_1 L_1(T_{instr}, \{I_{subji}\}^S_{i=1}; C_{CoT}, L_{spatial}) + w_2 L_2(C_{CoT}, L_{spatial}; I_{tgt})\]
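The dynamic IoU threshold from the layout-planning step can be sketched as below. The digest does not define $b_i$ and $p_i$ precisely, so this example assumes both are axis-aligned boxes given as `(x1, y1, x2, y2)`, and the scaling factor `lam=0.5` is a placeholder for $\lambda$.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def dynamic_threshold(boxes, patches, lam=0.5):
    """tau = lam * mean IoU over the K matched (box, patch) pairs."""
    ious = [iou(b, p) for b, p in zip(boxes, patches)]
    return lam * sum(ious) / len(ious)


# Two matched pairs: one perfect overlap, one partial overlap.
boxes = [(0, 0, 2, 2), (0, 0, 2, 2)]
patches = [(0, 0, 2, 2), (1, 1, 3, 3)]
tau = dynamic_threshold(boxes, patches)
```

Because the threshold scales with the average overlap in the current image, scenes with loosely matched regions get a more permissive cutoff than scenes where boxes and patches align closely.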

Training Recipe

Data generation: 10,000 synthetic samples created through automated pipeline using foundation models, no manual annotation required. Complex subject augmentation by incorporating semantically similar classes.
Architecture: SEED-X base model with LoRA fine-tuning (rank 64, α=64) for parameter efficiency.
Training setup: 10 epochs on 8×A100 GPUs, learning rate 1×10^-4, loss weights w₁=w₂=0.5 for balanced training across both capabilities.
Data augmentation: Systematic subject reduction from generated scenes (removing smallest subjects iteratively) to create varying complexity levels at no extra cost.
Training stages: End-to-end optimization allowing gradients from generation loss L₂ to flow back through predicted CoT and layout representations.

Hardware and wall-clock time: not reported in detail beyond GPU configuration.

Novelty & Lineage

Prior work:

UNO (Wu et al. 2025) achieved multi-subject generation via vision in-context learning but struggles with scaling beyond few subjects.
MS-Diffusion (Wang et al. 2024) used layout guidance for multi-subject synthesis but limited by diffusion model constraints.
Subject Diffusion (Ma et al. 2024) addressed personalized generation but focused primarily on single subjects.

Delta: This paper adds:
First MLLM specifically designed for multi-subject in-context generation
Fully automated training data generation pipeline eliminating manual annotation
Vision Chain-of-Thought mechanism for step-by-step reasoning
Semantics-driven spatial layout planning with test-time scaling capability.

Applied-specific assessment: The architectural idea of using MLLMs for multi-subject generation is novel, leveraging reasoning capabilities that diffusion models lack. Benchmark gains are substantial (DINO: 0.622 vs 0.541 for UNO, CLIP-I: 0.812 vs 0.721) and hold across multiple metrics. Comparisons appear fair using same evaluation protocols. However, the method still shows performance degradation as subject count increases, and relies heavily on synthetic training data which may limit real-world generalization.

The automated data generation pipeline is genuinely innovative, addressing a key bottleneck in multi-subject generation research. The vision CoT mechanism represents a non-obvious application of reasoning to spatial composition.

Verdict: SIGNIFICANT — introduces novel MLLM-based approach with strong empirical gains and addresses real scalability challenges in multi-subject generation.

Benchmarks & Results

MSIC (multi-subject): DINO 0.622, CLIP-I 0.812, CLIP-T 0.322 vs UNO (previous best): DINO 0.541, CLIP-I 0.721, CLIP-T 0.296. Relative improvements: +15.0% DINO, +12.6% CLIP-I, +8.8% CLIP-T.
MSIC with test-time scaling (MUSIC*): DINO 0.631, CLIP-I 0.822, CLIP-T 0.330, showing consistent gains over base method.
DreamBench (single-subject): DINO 0.761, CLIP-I 0.837, CLIP-T 0.317 vs UNO: DINO 0.760, CLIP-I 0.835, CLIP-T 0.304. Competitive performance despite being designed for multi-subject generation.
Human evaluation: 69% preference over OmniGen (21% for OmniGen, 10% tie), 63% preference over UNO (31% for UNO, 6% tie).
Ablation studies confirm all components contribute: Vision CoT most critical, spatial planning second most important, complex case augmentation provides modest gains.

Results are consistently strong across metrics with no conspicuous benchmark omissions for the stated task.
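
The improvement percentages quoted for MSIC are relative gains over UNO, not absolute differences. A short check, using only the scores listed above (the `relative_gain` helper is an illustrative name):

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# (this method, UNO) pairs as reported for MSIC in this digest
msic_scores = {
    "DINO": (0.622, 0.541),
    "CLIP-I": (0.812, 0.721),
    "CLIP-T": (0.322, 0.296),
}

gains = {
    metric: round(relative_gain(new, old), 1)
    for metric, (new, old) in msic_scores.items()
}
print(gains)
```

The corresponding absolute differences are smaller (for example, 0.081 DINO), which is worth keeping in mind when comparing against papers that report absolute gains.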

Compute & Efficiency

Model size: Built on SEED-X framework with LoRA adaptation (rank 64), exact parameter count not reported but likely in 7B+ range based on base model.
Training compute: 8×A100 GPUs for 10 epochs on 10,000 synthetic samples, specific training time not reported.
Inference speed/latency: Test-time scaling with Pass@N increases computation linearly (N=16 candidates), base inference time not quantified.
Memory footprint: Uses LoRA for efficiency but specific memory requirements not reported.
Deployment practicality: Requires multiple foundation models in data generation pipeline, making deployment complex. Test-time scaling trades computation for quality, limiting low-latency applications.
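
The linear cost of test-time scaling follows from generating N independent candidates and keeping one. A minimal best-of-N sketch is given below. It assumes candidates are ranked by some scoring function. This digest does not say how the paper selects among its N=16 candidates, so `generate`, `score`, and `best_of_n` are hypothetical stand-ins, with toy lambdas in place of a real generator and scorer.

```python
import random
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(
    generate: Callable[[random.Random], T],
    score: Callable[[T], float],
    n: int = 16,
    seed: int = 0,
) -> Tuple[T, List[float]]:
    """Draw n candidates and return the highest-scoring one plus all scores.

    Cost is n calls to `generate` and n calls to `score`, i.e. linear in n.
    """
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    scores = [score(c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores


# Toy stand-ins: a "candidate" is a random number and its score is itself.
best, scores = best_of_n(lambda rng: rng.random(), lambda c: c, n=16)
print(len(scores), best == max(scores))
```

With N=16 the generation cost is sixteen times the base inference cost, which is the trade-off behind the latency concern noted above.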

Real-World Applicability

Training data: Uses entirely synthetic data generated from foundation models rather than real-world datasets, potentially limiting generalization to real user scenarios.
Evaluation: Conducted primarily on curated benchmarks (MSIC, DreamBench) with human evaluation on limited subsets.
No reported deployment results, hardware experiments in real applications, or production integration discussed.
No sim-to-real analysis or discussion of domain gap between synthetic training data and real user inputs.

The work shows strong performance on benchmarks but lacks validation on real-world deployment scenarios or analysis of how synthetic training translates to practical applications.

Limitations & Failure Modes

FUNDAMENTAL: Performance degrades as subject count increases despite improvements over baselines, which is inherent to the complexity of maintaining multiple identities and spatial relationships.
ENGINEERING: Test-time scaling via Pass@N increases inference time linearly, limiting applicability in latency-sensitive scenarios.
EVALUATION: Training entirely on synthetic data may not generalize well to real user inputs or to diverse visual styles not covered by the generation pipeline.
ENGINEERING: Deployment requires multiple foundation models (LLM, T2I, VLM, OVD, segmentation), making the system complex and resource-intensive.

Likely failure modes:
Identity confusion when subjects have similar visual features
Spatial arrangement errors in highly complex scenes with many overlapping objects.

VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

Authors: Wenyi Xiao, Xinchi Xu, Leilei Gan · Institution: Zhejiang University · Category: cs.CV

VL-Calibration decouples verbalized confidence into visual and reasoning components for LVLMs, using novel visual certainty estimation and token-level advantage reweighting to achieve superior calibration while improving accuracy.

Practical Takeaway: If you’re working on LVLM calibration, the key insight is that visual and reasoning confidence should be treated as separate quantities rather than conflated into a single score. The visual certainty estimation combining KL-divergence under perturbation with token entropy provides a practical supervision signal without ground-truth labels. The token-level advantage reweighting technique could be adapted to other RL training scenarios where you want to penalize high-uncertainty errors more heavily. However, consider the 11% computational overhead from the additional forward pass and whether your application can tolerate this cost.

Tags: LVLM confidence_calibration multimodal_reasoning reinforcement_learning hallucination_detection uncertainty_estimation visual_reasoning GRPO

arXiv · PDF

Task & Setting

Large Vision-Language Models (LVLMs) excel at multimodal reasoning but frequently produce incorrect answers with high confidence, creating risks in high-stakes applications like healthcare and law. The core issue is that existing confidence calibration methods, developed for text-only LLMs, use a single holistic confidence score based on binary correctness, which conflates two distinct error sources in LVLMs: perceptual failures and reasoning errors given correct perception.

The task is verbalized confidence calibration for LVLM reasoning. Given a multimodal input (image I, text query x), the model must generate a structured response with visual rationale, reasoning chain, answer, and separate confidence scores for visual and reasoning components. The visual confidence measures certainty about image perception, while reasoning confidence measures certainty about logical deduction.

Success is measured by calibration metrics (Expected Calibration Error, AUROC) and task performance (accuracy). The framework is evaluated on thirteen benchmarks covering mathematical reasoning (DynaMath, Geo3K, MathVerse, MathVision, MathVista, WeMath), logical reasoning (LogicVista), vision-dominant tasks (CLEVR, MathVerseV), and multi-disciplinary problems (A-OKVQA, MMK12, MMMU-Pro, ViRL-39K).
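
For readers less familiar with Expected Calibration Error, the standard binned estimator can be sketched as follows. The bin count of 10 and the equal-width binning are assumptions; this summary does not report which settings the paper uses.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        # Equal-width bins over [0, 1]; confidence 1.0 falls in the last bin.
        bins[min(int(c * n_bins), n_bins - 1)].append(i)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(correct[i] for i in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Four answers at 0.9 confidence, three of them correct: |0.75 - 0.9| = 0.15
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(round(ece, 3))
```

Lower is better: an ECE of 0 means stated confidence matches observed accuracy in every bin.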

Architecture & Method

Base architecture: Qwen3-VL-4B/8B/30B-Instruct and InternVL3.5-4B-MPO models
Decoupled confidence framework: Model generates structured output τ = (z_vis, c_vis, z_reas, c_reas, y) where z_vis is visual rationale, z_reas is reasoning chain, c_vis and c_reas are respective confidence scores, y is the answer
Holistic confidence aggregation via harmonic mean:
\[\Phi(\hat{c}_{vis}, \hat{c}_{reas}) = \frac{2 \cdot \hat{c}_{vis} \cdot \hat{c}_{reas}}{\hat{c}_{vis} + \hat{c}_{reas}}\]
Visual certainty estimation combining visual grounding (KL-divergence under image perturbation) and internal certainty (token entropy):
\[S_{vis} = \log(D_{KL} + \epsilon) - \log(H + \epsilon)\]
Token-level advantage reweighting for GRPO optimization, penalizing high-uncertainty tokens with negative advantage

The core contribution is explicitly decoupling visual perception confidence from reasoning confidence, enabling precise error localization in LVLMs.
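
The two formulas above translate directly into code. The sketch below assumes the KL-divergence and token entropy have already been computed elsewhere and are passed in as plain floats. The value of `eps` and the function names are illustrative choices, not taken from the paper.

```python
import math


def holistic_confidence(c_vis: float, c_reas: float) -> float:
    """Harmonic mean of the visual and reasoning confidence scores."""
    if c_vis + c_reas == 0.0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2.0 * c_vis * c_reas / (c_vis + c_reas)


def visual_certainty(kl_divergence: float, token_entropy: float,
                     eps: float = 1e-6) -> float:
    """S_vis = log(D_KL + eps) - log(H + eps).

    Higher when the output shifts strongly under image perturbation (large KL)
    and the model is internally certain (low entropy). `eps` is an assumed value.
    """
    return math.log(kl_divergence + eps) - math.log(token_entropy + eps)


balanced = holistic_confidence(0.9, 0.9)
lopsided = holistic_confidence(0.3, 0.9)
print(round(balanced, 2), round(lopsided, 2))
print(visual_certainty(2.0, 0.5) > visual_certainty(0.5, 2.0))
```

The harmonic mean is pulled toward the lower score: visual confidence 0.3 with reasoning confidence 0.9 gives 0.45, below the arithmetic mean of 0.6.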

Training Recipe

Training data: 12,000 samples from ViRL-39K dataset
GRPO (Group Relative Policy Optimization) training with multi-component reward:
\[R(\tau, y^*) = \lambda_{acc}R_{acc} + \lambda_{cal}R_{cal} + \lambda_{vis}R_{vis}\]
Hyperparameters: Learning rate 1e-6, batch size 256, 15 epochs, temperature 1.0, 8 rollouts per prompt, BF16 precision
Reward weights: λ_acc = 1.0, λ_cal = 2.0, λ_vis = 0.4
Hardware: 240 H200 GPU hours for 4B model, 450 hours for 8B model, 1900 hours for 30B model
Visual certainty supervision uses batch-wise z-score normalization followed by sigmoid mapping to [0,1]
Token-level advantage reweighting with λ_TAR = 0.1 applied only to visual rationale tokens with negative advantage
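
The reward combination and the visual-certainty normalization in this recipe can be sketched as below. The reward weights are the ones reported above. The individual reward terms are treated as given numbers because their definitions are not spelled out in this summary, and the use of the population standard deviation and the `eps` guard are assumptions.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def normalize_visual_certainty(scores: Sequence[float],
                               eps: float = 1e-8) -> List[float]:
    """Batch-wise z-score followed by a sigmoid, mapping each score into (0, 1)."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population std over the batch
    return [1.0 / (1.0 + math.exp(-(s - mu) / (sigma + eps))) for s in scores]


def total_reward(r_acc: float, r_cal: float, r_vis: float,
                 w_acc: float = 1.0, w_cal: float = 2.0,
                 w_vis: float = 0.4) -> float:
    """R = w_acc * R_acc + w_cal * R_cal + w_vis * R_vis, with the reported weights."""
    return w_acc * r_acc + w_cal * r_cal + w_vis * r_vis


targets = normalize_visual_certainty([-1.0, 0.0, 1.0])
reward = total_reward(r_acc=1.0, r_cal=0.5, r_vis=0.5)
print([round(t, 3) for t in targets], round(reward, 2))
```

With λ_cal = 2.0, the calibration term carries twice the weight of accuracy, so a rollout that is correct but poorly calibrated can score below one that is well calibrated.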

Novelty & Lineage

Prior work: RLCR (2025) uses holistic confidence calibration with Brier score loss for LLMs. SaySelf (2024) trains verbalized confidence via GPT-4 distillation then RL alignment. VL-Uncertainty (2024) estimates uncertainty via multiple perturbed inputs for LVLMs.

This paper adds:

Explicit decoupling of visual vs reasoning confidence in LVLMs
Novel visual certainty estimation combining KL-divergence under perturbation with token entropy
Token-level advantage reweighting based on visual uncertainty.

Applied-specific assessment: The architectural idea of decoupling confidence sources is intuitive but non-trivial for LVLMs. Benchmark gains are substantial (ECE from 0.421 to 0.098, +2-3% accuracy) and hold across model scales/architectures. However, comparisons use reimplemented baselines rather than original results. The visual certainty estimation requires additional forward passes (11% overhead). Gains likely depend on the specific perturbation strategy and may not generalize to all visual reasoning tasks.

Verdict: SIGNIFICANT — Clear advance in LVLM calibration with principled decoupling approach that addresses fundamental limitations of holistic confidence methods.

Benchmarks & Results

DynaMath: 0.753 accuracy vs 0.718 best baseline, 0.081 ECE vs 0.165 best baseline
Geo3K: 0.671 accuracy vs 0.616 best baseline, 0.073 ECE vs 0.159 best baseline
MathVerse: 0.807 accuracy vs 0.796 best baseline, 0.042 ECE vs 0.142 best baseline
MathVision: 0.483 accuracy vs 0.440 best baseline, 0.170 ECE vs 0.207 best baseline
MathVista: 0.730 accuracy vs 0.772 best baseline (below the best baseline), 0.107 ECE vs 0.132 best baseline
WeMath: 0.820 accuracy vs 0.771 best baseline, 0.048 ECE vs 0.164 best baseline
LogicVista: 0.570 accuracy vs 0.519 best baseline, 0.203 ECE vs 0.232 best baseline
CLEVR: 0.935 accuracy vs 0.935 best baseline (tied), 0.035 ECE vs 0.058 best baseline
MathVerseV: 0.781 accuracy vs 0.748 best baseline, 0.056 ECE vs 0.171 best baseline
A-OKVQA: 0.875 accuracy vs 0.861 best baseline, 0.017 ECE vs 0.112 best baseline
MMK12: 0.747 accuracy vs 0.741 best baseline, 0.083 ECE vs 0.182 best baseline
MMMU-Pro: 0.458 accuracy vs 0.436 best baseline, 0.335 ECE vs 0.340 best baseline
ViRL-39K: 0.816 accuracy vs 0.796 best baseline, 0.026 ECE vs 0.113 best baseline

Accuracy improves on 11 of 13 benchmarks (tied on CLEVR, lower on MathVista), while ECE improves on all 13, with the largest calibration gains on A-OKVQA, WeMath, and MathVerseV.

Compute & Efficiency

Model size: 4B, 8B, 30B parameters (Qwen3-VL), 4B parameters (InternVL3.5)
Training compute: 240 H200 GPU hours (4B), 450 hours (8B), 1900 hours (30B)
Inference overhead: about 11% additional time due to a second forward pass for KL-divergence computation (15 seconds added to a 140-second step for the 8B model, i.e. 15/140 ≈ 10.7%)
Memory footprint: Standard LVLM requirements plus storage for perturbed images during training
Deployment practicality: The method requires structured output generation with explicit confidence tokens but adds no model parameters. Visual certainty estimation adds computational cost compared to sampling-based alternatives but costs less than external annotators ($43,200 per training cycle for Gemini-based annotation)

Real-World Applicability

Evaluation focuses on academic benchmarks rather than real-world deployment scenarios
No reported production integration or actual deployment results in high-stakes domains mentioned as motivation (healthcare, law)
Limited sim-to-real analysis: the method is tested across different model architectures (Qwen3-VL, InternVL) but only within controlled academic settings
Computational overhead (11% inference time increase) may limit practical deployment in latency-sensitive applications
Framework designed for structured reasoning tasks with explicit rationales, which may not match all real-world LVLM use cases

Limitations & Failure Modes

FUNDAMENTAL: Method requires explicit visual-reasoning decomposition which may not suit all multimodal tasks
ENGINEERING: 11% computational overhead from additional forward pass for visual certainty estimation
ENGINEERING: Training requires careful hyperparameter tuning (reward weights λ_acc, λ_cal, λ_vis) and may be unstable with single metrics
EVALUATION: Limited to academic benchmarks, no real-world deployment validation
EVALUATION: Baselines are reimplemented rather than using original published results

Failure modes:
Visual certainty estimation may fail when perturbations don’t adequately disrupt relevant visual features
Harmonic mean aggregation may be overly conservative when one confidence component is legitimately low but the other is high