Mar 31, 2026 Applied AI 5 papers

Applied AI Digest — Mar 31, 2026

Today’s Digest at a Glance

Preliminary

Today’s papers explore multimodal reasoning through adversarial robustness, agent-based code generation, latent visual reasoning, video-based image synthesis, and theory of mind interventions.

Agentic Frameworks for Multimodal Tasks

Agentic frameworks represent a computational paradigm where autonomous agents coordinate multiple specialized subsystems to solve complex tasks. Unlike monolithic models that process all inputs through a single pathway, agentic systems decompose problems into subtasks handled by specialized components that communicate and validate each other’s outputs.

The core insight is that complex multimodal tasks often require different types of reasoning—visual perception, symbolic manipulation, logical verification—that benefit from specialized processing pathways. For example, when analyzing potentially misleading charts, visual analysis might detect design anomalies while numerical extraction verifies data consistency. Each agent operates semi-independently but coordinates through structured communication protocols.

Mathematically, an agentic system can be formalized as $A = \{a_1, a_2, \ldots, a_n\}$ where each agent $a_i$ has a specialized function $f_i: X_i \rightarrow Y_i$, and a coordination mechanism $C$ that aggregates outputs: $y = C(f_1(x_1), f_2(x_2), \ldots, f_n(x_n))$. The key advantage is that each $f_i$ can be optimized for its specific domain while $C$ handles integration and conflict resolution.

The intuition is like having a team of specialists (visual analyst, data validator, reasoning coordinator) work together rather than expecting one generalist to handle everything perfectly.
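The formalization above can be sketched in a few lines of Python. This is an illustration only, not any paper's implementation: the `Finding` record, the two agents `visual_agent` and `data_agent`, their hard-coded thresholds, and the confidence-ordering rule in `coordinate` are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Finding:
    """Output y_i of one agent: a claim plus a self-reported confidence."""
    agent: str
    claim: str
    confidence: float


# Each agent a_i is a specialized function f_i: X_i -> Y_i.
# Both agents below are hypothetical stand-ins with hard-coded logic.
def visual_agent(chart: Dict) -> Finding:
    # Flags a truncated y-axis, a common misleading design choice.
    truncated = chart["y_axis_min"] > 0
    claim = "axis_truncated" if truncated else "axis_ok"
    return Finding("visual", claim, 0.9 if truncated else 0.6)


def data_agent(chart: Dict) -> Finding:
    # Compares the raw values rather than the drawn bar heights.
    a, b = chart["values"]
    ratio = max(a, b) / min(a, b)
    claim = "small_difference" if ratio < 1.1 else "large_difference"
    return Finding("data", claim, 0.8)


def coordinate(findings: List[Finding]) -> str:
    """Coordination mechanism C: report every claim, most confident first.
    A real system would apply richer conflict resolution here."""
    ranked = sorted(findings, key=lambda f: f.confidence, reverse=True)
    return "; ".join(f"{f.agent}:{f.claim}" for f in ranked)


def run_system(agents: List[Callable[[Dict], Finding]], x: Dict) -> str:
    # y = C(f_1(x_1), ..., f_n(x_n)); every agent sees the same input here.
    return coordinate([f(x) for f in agents])


chart = {"y_axis_min": 95, "values": (100, 104)}
print(run_system([visual_agent, data_agent], chart))
```

Keeping `coordinate` separate from the agents mirrors the point made above: each $f_i$ can be swapped or tuned independently while $C$ stays responsible for integration.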

Scanning Positional Encoding (ScanPE)

Scanning Positional Encoding addresses the challenge of generating extremely long images with coherent spatial relationships by reformulating the task as sequential video generation with carefully engineered position encodings.

Traditional positional encodings for images assign fixed 2D coordinates to each spatial location, which works well for standard aspect ratios but breaks down for extreme panoramas where the model must maintain coherence across thousands of pixels. The naive approach of simply extending 2D encodings leads to repetitive patterns and spatial discontinuities because the model lacks proper inductive biases for sequential spatial reasoning.

ScanPE solves this by distributing global image coordinates across video frames using a scanning pattern. For a panoramic image of width $W$, the encoding divides it into $T$ overlapping frames, where each frame $t$ covers a spatial window. Each frame $t$ first receives a global anchor position:

\[\mathbf{O}_t = \sum_{k=1}^{t-1} \delta \cdot \mathbf{d}_k + \mathbf{P}_{init}\]

where $\delta$ is the stride between frames, $\mathbf{d}_k$ represents the displacement vector for frame $k$, and $\mathbf{P}_{init}$ is the initial anchor position. The global position of a pixel at local position $\mathbf{p}_{loc}$ within frame $t$ is then $\mathbf{P}_g(t, \mathbf{p}_{loc}) = \mathbf{p}_{loc} + \mathbf{O}_t$. This creates a consistent global coordinate system while allowing the video diffusion model to process manageable chunks sequentially.

The key insight is that video diffusion models already understand temporal coherence, so by mapping spatial coherence to temporal relationships, we can generate extremely wide images with natural flow and consistency.
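The coordinate bookkeeping described above can be sketched as follows. This assumes the global position is a per-frame anchor (an initial position plus accumulated stride-scaled displacements) added to the pixel's local position. The function names, the left-to-right scan, and the stride of 256 pixels are illustrative assumptions, not values from the paper.

```python
from typing import List, Tuple

Vec = Tuple[float, float]  # (row, column) in pixels


def frame_anchor(t: int, stride: float, displacements: List[Vec],
                 p_init: Vec = (0.0, 0.0)) -> Vec:
    """Anchor of frame t (1-indexed): p_init + sum_{k=1}^{t-1} stride * d_k."""
    row, col = p_init
    for d_row, d_col in displacements[: t - 1]:
        row += stride * d_row
        col += stride * d_col
    return (row, col)


def global_position(t: int, p_local: Vec, stride: float,
                    displacements: List[Vec]) -> Vec:
    """Global coordinate of a pixel: its local position plus the frame anchor."""
    a_row, a_col = frame_anchor(t, stride, displacements)
    return (p_local[0] + a_row, p_local[1] + a_col)


# Hypothetical left-to-right scan: every frame moves one stride to the right.
scan = [(0.0, 1.0)] * 7   # d_1 .. d_7, enough for 8 frames
stride = 256.0            # assumed stride in pixels

print(global_position(1, (10.0, 20.0), stride, scan))  # first frame: no offset
print(global_position(4, (10.0, 20.0), stride, scan))  # three strides to the right
```

Because the displacement list is explicit, the same helper would cover vertical or non-uniform scan paths by changing `scan`.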

Reading Guide

The chart reasoning and quantum code generation papers both demonstrate agentic approaches—ChartCynics uses dual visual/numerical agents for robustness, while the quantum work shows general-purpose LLMs with execution feedback outperform specialized models. LanteRn introduces latent visual reasoning tokens that complement the agentic theme by enabling richer multimodal integration. ScrollScape’s video-to-image reformulation provides a novel application of temporal models to spatial problems, while VisionToM uses attention interventions to enhance social reasoning capabilities in video understanding.

Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering

Authors: Yanjie Zhang, Yafei Li, Rui Sheng, Zixin Chen et al. (8 authors) · Institution: HKUST · Category: cs.CV

ChartCynics introduces a dual-path agentic framework that combines diagnostic visual analysis with OCR-driven data extraction to achieve robust performance against misleading chart visualizations through skeptical reasoning and adversarial alignment.

Practical Takeaway: Research engineers should consider the dual-path approach when building systems that need to be robust against adversarial or misleading visual content. The key insight is that rather than choosing between visual analysis and OCR extraction, combining them through structured conflict resolution yields superior results. The detective chain-of-thought framework could be adapted to other domains requiring skeptical reasoning. However, implementing this requires significant infrastructure (OCR pipeline, ROI detection, specialized training) and may not be worth the complexity unless robustness against visual deception is critical for your application.

Tags: vision-language-models chart-understanding misleading-visualizations multimodal-reasoning adversarial-robustness agentic-frameworks chain-of-thought reinforcement-learning

Task & Setting

Misleading chart question answering (MQA) addresses a critical vulnerability in automated data analysis systems. Current Vision-Language Models (VLMs) fail catastrophically when confronted with deceptive visualizations designed to manipulate perception through axis manipulation, cherry-picking, or disproportionate encoding, scoring below 50% accuracy on specialized benchmarks.

The task requires answering multiple-choice questions about chart images that contain visual deceptions. Given a chart image $I$ and natural language question $Q$ with candidate options $O = \{a_1, a_2, \ldots, a_n\}$, the system must predict the correct answer $a^*$. The challenge lies in resolving conflicts between visual trends $V_{trend}$ and actual numerical relationships $D_{rel}$, where misleading charts incorporate a manipulation function $f_{deceptive}(D) \rightarrow I$ that creates trap answers:

\[a_{trap} = \text{Inference}(V_{trend}) \neq a^* = \text{Inference}(D_{rel})\]

Success is measured through three metrics: overall Accuracy (Acc), Wrong due to Misleader (WM) tracking errors from visual traps, and Wrong due to Other factors (WO) capturing non-deception-related failures.
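A small sketch of how the three metrics could be computed. It assumes each question has exactly one labeled trap answer and that all three metrics are fractions of the full test set, so they sum to 1. The function name `mqa_metrics` and the toy answers are hypothetical.

```python
from typing import Dict, List


def mqa_metrics(predictions: List[str], correct: List[str],
                traps: List[str]) -> Dict[str, float]:
    """Return Acc, WM, and WO as fractions of all questions.

    A prediction counts toward WM when it equals the trap answer, and toward
    WO when it is wrong for any other reason.
    """
    if not predictions or not (len(predictions) == len(correct) == len(traps)):
        raise ValueError("inputs must be non-empty and of equal length")
    acc = wm = wo = 0
    for pred, gold, trap in zip(predictions, correct, traps):
        if pred == gold:
            acc += 1
        elif pred == trap:
            wm += 1
        else:
            wo += 1
    n = len(predictions)
    return {"Acc": acc / n, "WM": wm / n, "WO": wo / n}


# Toy example with four questions (made-up answers, not benchmark data).
print(mqa_metrics(
    predictions=["A", "B", "C", "D"],
    correct=["A", "C", "C", "A"],
    traps=["B", "B", "D", "C"],
))
```

Separating WM from WO is what lets a reader tell whether an accuracy gain comes from avoiding traps or from better chart reading in general.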

The evaluation uses Misleading ChartQA (305 test samples), Curated Deceptive Chart Collection (110 expert-validated samples), and Mixed Standard and Misleading Benchmark (244 balanced samples) to test both robustness against deception and absence of over-skepticism on benign data.

Architecture & Method

ChartCynics employs an agentic dual-path framework that decouples visual perception from numerical verification:

Diagnostic Vision Path: Uses strategic ROI cropping with nemotron-graphic-elements-v1 to extract high-resolution crops of critical chart components (title, legend, x-axis, y-axis). A Diagnostic Agent performs blind structural analysis without access to questions/options to identify anomalies, generating an Action Directive.
OCR-Driven Data Path: Utilizes LlamaParse (GPT-4o) to extract numerical literals and serialize charts into structured Markdown format, bypassing deceptive visual encodings to recover underlying data relationships.
Agentic Summarizer: Implements Detective Chain-of-Thought (D-CoT) reasoning through a 5-step process: perception audit, numerical anchoring, deception mapping, sufficiency check, and adversarial trap rejection.

The core technical contribution is the conflict arbitration mechanism that weighs evidence hierarchically rather than simply discarding one modality. The objective becomes:
\[a^* = \arg\max_{a \in O} P(a | P_v(I), P_d(I), Q, T)\]
where $P_v$ and $P_d$ represent visual and data path outputs, and $T$ is the misleading taxonomy used for expert guidance.

Training Recipe

Two-stage optimization pipeline:

Oracle-Informed SFT: Supervised fine-tuning on 5,238 reasoning chains distilled from a Qwen3-VL-32B teacher model using the Misleading ChartQA training set (2,619 samples). The teacher model has access to ground-truth CSV data but generates reasoning chains based only on visible chart elements, to bridge the epistemic gap between structured data and visual pixels.
Deception-Aware GRPO: Group Relative Policy Optimization with group size G=8. Multi-objective reward function with coefficients: factual grounding (w_fact=0.20), semantic contradiction (w_contra=0.25), logical consistency (w_logic=0.20), format enforcement (w_fmt=0.10). Asymmetric reward shaping applies -2.0 penalty for selecting trap answers and +1.0 reward for correct answers.
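The asymmetric shaping can be illustrated with a short sketch. The weights and the +1.0 / -2.0 values come from the summary above; everything else is an assumption made for illustration: that a wrong non-trap answer scores 0.0, that component scores lie in [0, 1], and that the outcome term and the weighted components are simply added.

```python
from typing import Dict

# Coefficients quoted in the summary above.
WEIGHTS = {"fact": 0.20, "contra": 0.25, "logic": 0.20, "fmt": 0.10}
CORRECT_REWARD = 1.0
TRAP_PENALTY = -2.0


def outcome_reward(answer: str, correct: str, trap: str) -> float:
    """Asymmetric outcome term: a trap costs twice what a correct answer earns.
    Assumption: a wrong answer that is not the trap scores 0.0."""
    if answer == correct:
        return CORRECT_REWARD
    if answer == trap:
        return TRAP_PENALTY
    return 0.0


def total_reward(answer: str, correct: str, trap: str,
                 scores: Dict[str, float]) -> float:
    """Assumption: outcome term plus a weighted sum of component scores in [0, 1]."""
    shaped = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return outcome_reward(answer, correct, trap) + shaped


perfect = {"fact": 1.0, "contra": 1.0, "logic": 1.0, "fmt": 1.0}
print(total_reward("B", correct="B", trap="A", scores=perfect))  # correct answer
print(total_reward("A", correct="B", trap="A", scores=perfect))  # trap answer
```

Under these assumptions even a perfectly formatted, well-grounded rollout ends up with a negative reward if it lands on the trap, which is the behavior the asymmetric penalty is meant to produce.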

Training conducted on 4× NVIDIA A800 (80GB) GPUs using Qwen3-VL-8B backbone. Optimizer details, learning rates, and wall-clock times not reported.

Novelty & Lineage

Prior work:

MATCHA (2023): Enhanced visual language pretraining with math reasoning and chart derendering for standard ChartQA
DEPLOT (2023): One-shot visual reasoning by plot-to-table translation using OCR linearization
Misleading ChartQA (2025): Benchmark showing VLMs score below 50% on deceptive visualizations

Delta: This paper introduces a dual-path architecture that combines diagnostic visual analysis with OCR extraction rather than choosing one modality over the other. The key innovations are:

a decoupled agentic workflow that prevents confirmation bias through blind structural analysis,
detective chain-of-thought reasoning for conflict resolution, and
deception-aware GRPO alignment specifically targeting visual trap avoidance.

Applied-specific assessment:
- Architectural idea is a reasonable engineering solution combining existing components (ROI extraction, OCR, structured reasoning)
- Benchmark gains are substantial (~29% absolute improvement over backbone) and validated across multiple datasets
- Comparisons appear fair, testing on same models with consistent evaluation protocols
- Gains likely depend on specialized training data and dual-path processing overhead
Verdict: SIGNIFICANT — The dual-path conflict resolution approach addresses a real problem with substantial empirical gains, though the core insight of combining visual structure detection with OCR verification is somewhat incremental.

Benchmarks & Results

Misleading ChartQA (MC): ChartCynics achieves 74.43% accuracy vs. 45.57% for the Qwen3-VL-8B baseline (+28.86 percentage points) and outperforms Gemini-3.1-Pro (70.49%).
Curated Deceptive Chart Collection (CDCC): ChartCynics reaches 64.55% accuracy vs. 35.45% for the Qwen3-VL-8B baseline (+29.10 percentage points).
Mixed Standard and Misleading Benchmark (MSMB): Training-free ChartCynics achieves 81.15% overall vs. ChartMoE 60.25%. On standard charts: 94.26% vs. ChartMoE 88.52%. On misleading charts: 68.03% vs. ChartMoE 31.97%.
Wrong due to Misleader (WM) reduction: ChartCynics reduces WM from 40.00% to 11.15% on MC dataset, demonstrating effective trap avoidance.

Results consistently show large improvements across different base models (GPT-o4-mini, Gemini-2.5-Flash) and datasets, indicating robust performance gains.

Compute & Efficiency

Model size: 8B parameters (Qwen3-VL-8B backbone)
Training compute: 4× NVIDIA A800 (80GB) GPUs for SFT and GRPO stages, specific GPU hours not reported
Inference speed/latency: Not reported, but dual-path processing with ROI extraction and OCR parsing likely adds significant latency overhead
Memory footprint: Not reported
Deployment practicality: Moderate - requires additional components (nemotron-graphic-elements-v1 for ROI extraction, LlamaParse for OCR) and specialized training data, but achieves strong performance with relatively small 8B model

Real-World Applicability

Real-world data evaluation: Uses expert-validated deceptive visualizations from established HCI studies in the Curated Deceptive Chart Collection, demonstrating effectiveness beyond synthetic benchmarks.
Deployment considerations: Framework tested on both proprietary (GPT-4o-mini, Gemini) and open-source models, showing broad applicability across different systems.
Production integration: No specific deployment results or production integration discussed.
Robustness validation: Mixed benchmark testing confirms the system doesn’t suffer from over-skepticism on benign data, actually improving standard chart comprehension (94.26% vs. 88.52% for ChartMoE).

Limitations & Failure Modes

ENGINEERING: Dependency on external components (ROI detector, OCR parser) that could introduce failure points or computational overhead
ENGINEERING: Requires specialized training data with expert-annotated deception types and reasoning chains, limiting scalability to new deception patterns
EVALUATION: Limited evaluation on chart types beyond standard statistical visualizations (bar, line, scatter plots)
ENGINEERING: Two-stage optimization pipeline adds training complexity compared to end-to-end approaches
FUNDAMENTAL: May struggle with novel deception types not covered in the misleading taxonomy used for training

Failure modes:
OCR parsing errors could misalign numerical data with visual elements
Complex multi-panel or interactive visualizations might overwhelm the ROI-based structural analysis

Revisiting Quantum Code Generation: Where Should Domain Knowledge Live?

Authors: Oscar Novo, Oscar Bastidas-Jossa, Alberto Calvo, Antonio Peris et al. (5 authors) · Institution: QCentroid · Category: cs.LG

Shows that modern general-purpose LLMs with execution-feedback agents achieve 85.4% pass@1 on Qiskit code generation, substantially outperforming domain-specific fine-tuning (46.5%) without requiring specialized training.

Practical Takeaway: For quantum code generation tasks, modern general-purpose LLMs with execution-feedback agents can outperform domain-specific fine-tuning without requiring specialized training. The key insight is that inference-time augmentation (especially iterative repair based on execution feedback) provides a more flexible and maintainable approach than fine-tuning, particularly for rapidly evolving frameworks. However, agent-based inference ran 2-10x slower than zero-shot in the reported experiments, and API costs, which the paper does not report, grow with every repair cycle. If building quantum programming assistants, consider starting with general-purpose models + execution feedback rather than domain fine-tuning, but budget for increased API costs and execution environment requirements.

Tags: quantum-computing code-generation LLM RAG agents benchmarking domain-specialization inference-time-augmentation

Task & Setting

Real-world context: Quantum software development relies on complex programming frameworks like Qiskit that expose intricate abstractions and evolve rapidly. As quantum computing matures, there’s increasing need for automated code generation assistants to help developers navigate these specialized APIs and quantum programming patterns.

Task definition: The paper studies quantum code generation using the Qiskit-HumanEval benchmark. Input: natural language task descriptions with function signatures. Output: executable Qiskit code that passes unit tests. The benchmark contains 151 programming tasks derived from HumanEval, adapted for quantum programming with Qiskit APIs. Tasks range from basic quantum circuit construction to advanced quantum algorithm implementations.

Evaluation criteria: Success measured using pass@1 metric - fraction of tasks where a single generated solution passes all unit tests. Tasks are executed in sandboxed Python environments with Qiskit installed.
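As a quick illustration of the metric, pass@1 with one sample per task reduces to a simple fraction. The count of 129 passing tasks below is a back-calculation chosen to land near the reported 85.4% on 151 tasks; it is not a figure taken from the paper.

```python
from typing import List


def pass_at_1(results: List[bool]) -> float:
    """Fraction of tasks whose single generated solution passed all unit tests."""
    if not results:
        raise ValueError("no results to score")
    return sum(results) / len(results)


# Illustrative only: 129 of 151 tasks passing gives roughly 85.4%.
outcomes = [True] * 129 + [False] * 22
print(f"{pass_at_1(outcomes):.1%}")
```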

Dataset: Qiskit-HumanEval benchmark with 151 tasks categorized as basic (majority), intermediate, and advanced difficulty levels. Tasks require generating functionally correct Qiskit code from natural language specifications.

Architecture & Method

Parameter-specialized baseline: Fine-tuned Granite-20B model trained on curated Qiskit corpora (from Dupuis et al. 2024)
General-purpose LLMs: OpenAI GPT models (4o, 4.1, 5, o3 variants), Anthropic Claude models (Opus 4.6, Sonnet variants, Haiku variants), Google Gemini models (Pro 3, Flash variants)
Retrieval-Augmented Generation (RAG): Dense retrieval using FAISS with text-embedding-3-large embeddings, retrieving from combined Qiskit documentation + source code corpus with k=4 chunks
Agent-based inference: Iterative generate-execute-repair loop where models receive Python error messages as feedback and generate revised solutions, bounded by 1-5 repair attempts
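The retrieval step can be sketched without external dependencies. The paper's setup uses FAISS with text-embedding-3-large embeddings; the version below swaps in a brute-force cosine-similarity search over toy 2-D vectors purely for illustration, and the helper names `retrieve` and `build_prompt` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             chunks: List[Tuple[str, Sequence[float]]],
             k: int = 4) -> List[str]:
    """Return the texts of the k chunks most similar to the query embedding."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(task: str, context: List[str]) -> str:
    """Prepend the retrieved chunks to the task description."""
    joined = "\n---\n".join(context)
    return f"Reference material:\n{joined}\n\nTask:\n{task}"


# Toy corpus with made-up 2-D embeddings (real embeddings are far larger).
corpus = [
    ("QuantumCircuit.h applies a Hadamard gate", [0.9, 0.1]),
    ("transpile maps a circuit to a backend", [0.1, 0.9]),
    ("QuantumCircuit.cx applies a CNOT gate", [0.8, 0.3]),
]
top = retrieve([1.0, 0.0], corpus, k=2)
print(build_prompt("Build a Bell-state circuit.", top))
```

A brute-force scan is fine for a handful of chunks; an approximate-nearest-neighbor index such as FAISS becomes worthwhile once the corpus covers full documentation and source code.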

Core contribution: Demonstrates that inference-time system-level specialization (RAG + agents) can match/exceed parameter-level fine-tuning performance without domain-specific training.
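The generate-execute-repair loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' harness: `generate` stands for any callable wrapping an LLM API, `fake_model` is a stub used so the example runs offline, and a plain subprocess is used in place of a real sandbox (it provides no isolation).

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable, Optional, Tuple


def run_candidate(code: str, timeout_s: float) -> Tuple[bool, str]:
    """Run code in a fresh Python process; return (passed, error text)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as fh:
        fh.write(code)
        path = fh.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True,
                              text=True, timeout=timeout_s)
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s} seconds"
    finally:
        os.remove(path)


def generate_with_repair(generate: Callable[[str, Optional[str]], str],
                         task: str, tests: str, max_repairs: int = 5,
                         timeout_s: float = 600.0) -> Optional[str]:
    """One initial attempt plus up to max_repairs attempts that see the error."""
    feedback: Optional[str] = None
    for _ in range(max_repairs + 1):
        code = generate(task, feedback)
        passed, error = run_candidate(code + "\n\n" + tests, timeout_s)
        if passed:
            return code
        feedback = error  # the Python error message becomes the next prompt's feedback
    return None


def fake_model(task: str, feedback: Optional[str]) -> str:
    """Stub standing in for an LLM call: wrong first, fixed once it sees an error."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


solution = generate_with_repair(fake_model, "Add two numbers.",
                                "assert add(2, 3) == 5",
                                max_repairs=2, timeout_s=30.0)
print(solution)
```

The per-attempt timeout matters because generated code may never terminate; without it a single bad candidate would stall the whole benchmark run.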

Training Recipe

Parameter-specialized baseline: Granite-20B fine-tuned on curated Qiskit corpora (training details from Dupuis et al. - not fully detailed in this paper)
General-purpose models: No additional training - evaluated out-of-the-box via commercial APIs
RAG setup: Built retrieval corpora from Qiskit 2.0.1 documentation and Qiskit 2.2.0 source code, indexed using text-embedding-3-large embeddings
Agent configuration: Up to 5 iterative repair attempts with 10-minute per-attempt timeout

Training compute, optimizer details, and other specifics: Not reported for the main models (commercial API access). Original Granite fine-tuning details referenced but not reproduced.

Novelty & Lineage

Prior work:

Dupuis et al. (2024) “Qiskit Code Assistant” showed that fine-tuning Granite-20B on Qiskit corpora achieved 46.53% pass@1 vs 20.79% for base model
General RAG and agent-based inference approaches (Lewis et al. 2021, Yao et al. 2023) established inference-time augmentation techniques
HumanEval (Chen et al. 2021) provided the base code generation benchmark framework

Delta: This paper applies modern general-purpose LLMs (GPT-5, Claude Opus 4.6, etc.) with RAG and execution-feedback agents to quantum code generation, achieving up to 85.4% pass@1 vs 46.5% baseline.

Applied-specific assessment:

Architectural novelty: INCREMENTAL - standard RAG + execution feedback, not novel architectures
Benchmark gains: Substantial (85.4% vs 46.5% baseline, >20% over zero-shot) but limited to one benchmark
Fair comparisons: Somewhat questionable - inference-time methods use much more compute than single-pass fine-tuned baseline
Scale dependence: Results rely on frontier commercial models with substantial compute

Verdict: INCREMENTAL — Solid engineering study showing inference-time augmentation can replace fine-tuning, but uses well-established techniques without architectural novelty.

Benchmarks & Results

Qiskit-HumanEval pass@1: Parameter-specialized baseline 46.5%, best general-purpose models 60-65% zero-shot, up to 85.4% (Claude Opus 4.6) with 5-step agents - improvement of 38.9 absolute percentage points over baseline
Execution time: Agent-based inference 2-10x slower than zero-shot, with Claude models showing better time scaling than OpenAI/Gemini models
Difficulty-tiered performance: Advanced tasks remain challenging across all models (~20-40% success rate vs >80% on basic tasks)
RAG effectiveness: Mixed results, modest gains for OpenAI models, neutral/negative for Claude and Gemini models
Agent iteration scaling: Performance generally improves from 1 to 5 repair attempts across all model families

Results show consistent improvements from inference-time augmentation, but with significant computational overhead trade-offs.

Compute & Efficiency

Model size: Parameter-specialized baseline ~20B parameters; parameter counts for the commercial models are not disclosed
Training compute: Not reported for commercial models, Granite fine-tuning details referenced but not reproduced
Inference speed: Zero-shot ~400-4000s total benchmark time, RAG similar, agent-based 2-10x slower (up to 12,000s for some configurations)
Memory footprint: Not reported
Deployment practicality: Commercial API dependency limits deployment flexibility; agent-based approaches require execution environments and substantial API cost increases due to multiple generation cycles

Real-World Applicability

Benchmark-only evaluation: All results on Qiskit-HumanEval synthetic benchmark tasks, no real quantum application deployments reported
Production considerations: Paper discusses maintenance advantages of inference-time approaches for rapidly evolving quantum SDKs, but no actual production deployments
Cost analysis: Execution time reported but not API costs - agent-based approaches likely 2-5x more expensive due to multiple model calls
Quantum hardware integration: Not addressed - purely code generation without quantum backend execution

Limited real-world validation beyond benchmark performance.

Limitations & Failure Modes

EVALUATION: Benchmark exposure risk - Qiskit-HumanEval is public and models may have seen similar code during pretraining
EVALUATION: Unfair compute comparison - agent methods use substantially more inference compute than fine-tuned baseline
EVALUATION: Single benchmark evaluation limits generalization claims across quantum frameworks
FUNDAMENTAL: Advanced tasks remain challenging (20-40% success) even with best methods
ENGINEERING: Commercial API dependency creates deployment constraints and cost unpredictability
ENGINEERING: Agent execution requires sandboxed environments and timeout handling for non-terminating code

Failure modes: Models can generate syntactically correct but semantically meaningless quantum code; iterative repair can lead to increasingly complex but incorrect solutions.

LanteRn: Latent Visual Structured Reasoning

Authors: André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins · Institution: Instituto de Telecomunicações, Instituto Superior Técnico · Category: cs.CV

LanteRn enables vision-language models to interleave text with continuous latent visual tokens during reasoning, achieving modest improvements on visual benchmarks through two-stage training with supervised grounding and reinforcement learning.

Practical Takeaway: LanteRn demonstrates that interleaving continuous latent visual tokens with text can provide modest improvements on perception-heavy benchmarks, particularly after reinforcement learning alignment. The two-stage training approach (SFT for grounding, RL for task optimization) could be valuable for practitioners working on visual reasoning tasks. However, the gains are incremental and the method requires careful engineering of control tokens, latent state replay, and multi-objective loss balancing. Consider this approach if you need better visual grounding than pure text-based reasoning but be prepared for implementation complexity and mixed results across different reasoning types.

Tags: multimodal visual-reasoning latent-representations vision-language reinforcement-learning structured-reasoning chain-of-thought

Task & Setting

LanteRn addresses visual reasoning limitations in large multimodal models (LMMs) that predominantly rely on verbalizing visual content into text, losing fine-grained spatial information crucial for perception-heavy tasks.

The task is visual question answering with structured reasoning. The input consists of an image $I$ and question $Q$, producing a hybrid reasoning trajectory $\tau = [s_1, s_2, \ldots, s_T]$ where each state $s_t$ is either a discrete token from vocabulary $V$ or a continuous latent vector $z_t \in \mathbb{R}^d$. The model generates interleaved text and $K$-token latent blocks using control tokens <lvr_start>, <lvr_sep>, and <lvr_end>.

Success is measured by final answer accuracy on three benchmarks: VisCoT (visual chain-of-thought reasoning), V* (visual search in real-world scenarios), and Blink (core visual perception where textual priors are insufficient). The evaluation covers object localization, spatial relationships, and fine-grained visual understanding.

The paper doesn’t introduce a new dataset but constructs synthetic training data from Visual-CoT by prompting Qwen3-VL-235B-Thinking to generate structured reasoning traces with bounding box supervision.

Architecture & Method

Base architecture: Qwen2.5-VL-3B-Instruct with extended vocabulary including control tokens <lvr_start>, <lvr_sep>, <lvr_end>
Hybrid reasoning mechanism: Model operates in two modes - text mode (standard autoregressive) and visual latent mode where hidden states bypass language head for K consecutive steps
Target latent representation extraction: Uses frozen vision encoder to extract features Fb for bounding box regions b, then applies average pooling:
\[F_b = \text{VisionEncoder}(I, b)\] \[Z_{\text{target}} = \text{Pool}(F_b) \in \mathbb{R}^{K \times d}\]
Multi-objective training loss combining text generation and latent alignment:
\[L_{\text{LanteRn}} = L_{\text{text}} + \gamma L_{\text{latent}}\]
Latent alignment loss using MSE between generated and target latent sequences:
\[L_{\text{latent}} = \frac{1}{K} \sum_{i=1}^{K} ||h_{gen}^{(i)} - z_{target}^{(i)}||_2^2\]
The core technical contribution is enabling continuous latent visual “thoughts” interleaved with discrete text tokens, allowing reasoning in visual feature space rather than forcing verbalization.
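The combined objective above can be checked numerically with a small sketch. It uses plain Python lists in place of tensors, takes the text loss as a given number, and uses the $\gamma = 0.1$ default reported in the training recipe. The function names and toy vectors are illustrative assumptions.

```python
from typing import Sequence


def latent_mse(h_gen: Sequence[Sequence[float]],
               z_target: Sequence[Sequence[float]]) -> float:
    """(1/K) * sum_i ||h_gen[i] - z_target[i]||_2^2 over K latent tokens."""
    if not h_gen or len(h_gen) != len(z_target):
        raise ValueError("need K > 0 generated and target latents of equal count")
    total = 0.0
    for h, z in zip(h_gen, z_target):
        total += sum((a - b) ** 2 for a, b in zip(h, z))
    return total / len(h_gen)


def lantern_loss(text_loss: float, h_gen: Sequence[Sequence[float]],
                 z_target: Sequence[Sequence[float]], gamma: float = 0.1) -> float:
    """L = L_text + gamma * L_latent."""
    return text_loss + gamma * latent_mse(h_gen, z_target)


# Toy case with K = 2 latent tokens of dimension d = 2.
h = [[1.0, 0.0], [0.0, 1.0]]
z = [[0.0, 0.0], [0.0, 3.0]]
print(latent_mse(h, z))          # squared distances 1 and 4, averaged over K
print(lantern_loss(2.0, h, z))   # 2.0 + 0.1 * 2.5
```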

Training Recipe

Stage 1 - Supervised Fine-Tuning: Custom dataset from Visual-CoT with bounding box supervision. AdamW optimizer, learning rate 1×10^-5, cosine schedule with 0.05 warmup ratio. Frozen vision encoder, latent loss weight γ=0.1. Variants trained with K∈{4,8,16,32} latent tokens.
Stage 2 - Reinforcement Learning: VIRL-39k dataset without bounding box supervision. Group Relative Policy Optimization (GRPO) with learning rate 5×10^-6, warmup ratio 0.03, KL regularization β=0.1. Sampling with temperature T=0.6, top-p=0.85, G=4 rollouts per prompt. Combined accuracy reward (weight 1.0) and format reward (weight 1.0). Latent state replay mechanism to stabilize training.
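The group-relative part of GRPO can be sketched as follows, using the reward weights and $G=4$ from the recipe above. It follows the commonly described formulation in which each rollout's reward is normalized by the mean and standard deviation of its group. The use of the population standard deviation and the `eps` term are assumptions, and the KL regularization and policy update are omitted.

```python
import statistics
from typing import List


def rollout_reward(correct: bool, well_formatted: bool,
                   w_acc: float = 1.0, w_fmt: float = 1.0) -> float:
    """Accuracy reward plus format reward, each weighted 1.0 as in the recipe."""
    return w_acc * float(correct) + w_fmt * float(well_formatted)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# G = 4 rollouts for a single prompt (made-up outcomes).
group = [
    rollout_reward(correct=True, well_formatted=True),
    rollout_reward(correct=False, well_formatted=True),
    rollout_reward(correct=True, well_formatted=False),
    rollout_reward(correct=False, well_formatted=False),
]
print(group_relative_advantages(group))
```

Because advantages are relative to the group, a prompt where all rollouts earn the same reward contributes no learning signal, which is one reason the format reward is useful early in training.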

Training time and specific hardware details not reported. Uses TRL library with custom GRPOTrainer extension for hybrid action spaces.

Novelty & Lineage

Prior work:

Latent Visual Reasoning (Li et al., 2025): Conditions final answers on latent visual tokens but primarily appends them for downstream decoding
Machine Mental Imagery (Yang et al., 2025b): Introduces latent image representations but lacks iterative refinement
Tool-based methods (Yang et al., 2023; Surís et al., 2023): Use external vision modules but require hand-designed tools

Delta: LanteRn formulates latent visual reasoning as an interleaved process alternating between text and latent tokens, enabling iterative refinement of internal visual representations rather than one-shot latent conditioning.

Applied-specific assessment:

Architectural novelty: The interleaved text-latent mechanism is a reasonable extension of prior latent reasoning work, not fundamentally novel
Benchmark gains: Mixed results across benchmarks, with some improvements (BlinkOL: 0.48→0.54) but also degradations on certain tasks
Fair comparisons: Limited by using 3B model vs. 7B baselines in prior work, though authors acknowledge this limitation
Scale dependence: Gains appear modest and may not hold without the two-stage training recipe

Verdict: INCREMENTAL — Solid engineering contribution that extends latent visual reasoning to interleaved sequences, but the core idea is an expected evolution of existing approaches with mixed empirical gains.

Benchmarks & Results

VisCoT: Accuracy improved from 0.66 (base) to 0.83 (LanteRn-RL-8), previous SOTA not reported, +0.17 improvement
V* Direct Attribution: 0.75 (base) to 0.76 (LanteRn-RL-8), minimal +0.01 improvement
V* Relative Position: 0.63 (base) to 0.67 (LanteRn-RL-8), +0.04 improvement
V* overall: 0.70 (base) to 0.71 (LanteRn-RL-8), +0.01 improvement
Blink Object Localization: 0.48 (base) to 0.54 (LanteRn-RL-8), +0.06 improvement
Blink Relative Position: 0.81 (base) to 0.81 (LanteRn-RL-8), no improvement

Results are mixed across benchmarks with modest improvements. Some tasks show degradation during SFT stage before RL recovery. Performance doesn’t scale monotonically with latent size K, suggesting optimization challenges.

Compute & Efficiency

Model size: 3B parameters (Qwen2.5-VL-3B backbone)
Training compute: Not reported (GPU hours, hardware unspecified)
Inference speed/latency: Not reported
Memory footprint: Additional K×d latent vectors per reasoning step, where d is hidden dimension
Deployment practicality: Reasonable given 3B scale, but requires specialized inference code for hybrid text-latent generation and control token handling. Latent state replay mechanism adds implementation complexity.

Real-World Applicability

Evaluation limited to curated benchmarks (VisCoT, V*, Blink) with no real-world deployment results reported
No hardware experiments, robotics applications, or production integration discussed
No sim-to-real transfer analysis provided
Training data derived from synthetic annotations using Qwen3-VL-235B-Thinking, raising questions about generalization to truly novel visual scenarios
Fixed latent size K may not adapt well to varying complexity of real-world visual reasoning tasks

Limitations & Failure Modes

FUNDAMENTAL: Fixed latent size K doesn’t adapt to task complexity, leading to performance degradation with larger K values
ENGINEERING: Training stability requires latent state replay mechanism and careful hyperparameter tuning of multi-objective loss
EVALUATION: Limited to curated benchmarks without real-world validation or analysis of latent representation quality
FUNDAMENTAL: Supervision strategy relies on frozen vision encoder features, potentially limiting latent representations to encoder’s capabilities
ENGINEERING: Two-stage training pipeline adds complexity compared to end-to-end approaches

Failure modes:
- Performance degradation on relational reasoning tasks during SFT stage
- Potential collapse to purely textual reasoning without format reward enforcement

ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors

Authors: Haodong Yu, Yabo Zhang, Donglin Di, Ruyi Zhang et al. (5 authors) · Institution: Harbin Institute of Technology, Li Auto · Category: cs.CV

ScrollScape reformulates extreme aspect ratio image synthesis as sequential video generation, using video diffusion priors to achieve coherent 32K resolution panoramas without object repetition.

Practical Takeaway: If working on high-resolution or extreme aspect ratio image generation, this paper demonstrates a powerful paradigm shift worth implementing: reformulating spatial generation as temporal video generation to leverage video models’ temporal consistency for spatial coherence. The ScanPE coordinate mapping technique could be adapted to other spatial generation tasks. For practitioners, the key insight is that video diffusion priors can serve as effective spatial regulators, potentially applicable beyond just EAR synthesis to other structured generation tasks requiring long-range consistency.

Tags: image_generation diffusion_models video_diffusion high_resolution extreme_aspect_ratios positional_encoding super_resolution panoramic_images

Task & Setting

The task addresses ultra-high-resolution image generation at extreme aspect ratios (EAR), particularly for panoramic images and traditional scroll paintings. Current text-to-image diffusion models trained on conventional dimensions suffer from catastrophic structural failures when generating images with extreme aspect ratios (e.g., 8:1) due to lack of robust spatial priors, resulting in object repetition and spatial fragmentation.

The task takes text prompts as input and generates ultra-high-resolution images with extreme aspect ratios up to 32K resolution. The approach reformulates EAR synthesis as a sequential video generation process, mapping spatial expansion of a massive canvas to temporal evolution of video frames. The objective is to minimize the Flow Matching loss:

\[\mathcal{L}_{\text{FM}} = \mathbb{E}_{z_0, z_1, \tau} \left[\left\|v_\theta (z_\tau , \tau , c, \mathcal{R})-(z_1 - z_0)\right\|_2^2\right]\]
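To make the objective concrete, here is a minimal per-sample sketch of the loss in plain Python. It assumes the usual linear interpolation path $z_\tau = (1-\tau) z_0 + \tau z_1$ with $z_0$ as noise and $z_1$ as the data latent, which is consistent with the $(z_1 - z_0)$ target but is not spelled out in this summary. The `velocity_fn` argument is a hypothetical stand-in for $v_\theta$; the text condition $c$ and the positional embedding $\mathcal{R}$ are treated as baked into it.

```python
import random

def flow_matching_loss(velocity_fn, z0, z1, tau):
    """Squared-error flow matching loss for a single sample.

    z0: noise sample, z1: data latent (lists of floats), tau in [0, 1].
    Assumption: linear path z_tau = (1 - tau) * z0 + tau * z1, whose
    velocity is the constant z1 - z0 used as the regression target.
    """
    z_tau = [(1.0 - tau) * a + tau * b for a, b in zip(z0, z1)]
    target = [b - a for a, b in zip(z0, z1)]
    pred = velocity_fn(z_tau, tau)  # stand-in for v_theta(z_tau, tau, c, R)
    return sum((p - t) ** 2 for p, t in zip(pred, target))

# Toy check with hypothetical predictors.
rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in range(4)]
z1 = [1.0, 2.0, 3.0, 4.0]

def oracle(z_tau, tau):
    # Predicts the exact target velocity, so the loss is zero.
    return [b - a for a, b in zip(z0, z1)]

def zero_velocity(z_tau, tau):
    # Predicts no motion, so the loss equals ||z1 - z0||^2.
    return [0.0] * len(z_tau)

loss_oracle = flow_matching_loss(oracle, z0, z1, 0.3)
loss_zero = flow_matching_loss(zero_velocity, z0, z1, 0.3)
print(loss_oracle, loss_zero)
```

In training, the expectation in the formula would be approximated by averaging this quantity over a batch of sampled $(z_0, z_1, \tau)$ triples.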

Success is measured using patch-based FID and KID scores, CLIP alignment, Intra Style Loss for visual continuity, and Global Structural Diversity (GSD) metrics using LPIPS and DINOv2 features to detect object repetition across panoramas.

The authors contribute a curated dataset of 3,000 high-resolution multi-ratio images including 2,000 natural landscapes (6:1+ ratios) and 1,000 traditional Chinese scroll paintings (6:1 ratio).

Architecture & Method

Base architecture: Wan2.1-T2V-1.3B video diffusion model used as foundation
Scanning Positional Encoding (ScanPE): Re-engineers 3D RoPE by distributing global coordinates across video frames, with global anchor position defined as:
\[\mathbf{O}_t = \sum_{k=1}^{t-1} \delta \cdot \mathbf{d}_k + \mathbf{P}_{init}\]
Global coordinate projection from frame-centric to unified system:
\[\mathbf{P}_g (t, \mathbf{p}_{loc}) = \mathbf{p}_{loc} + \mathbf{O}_t\]
Trajectory-aware rotational embedding:
\[\mathcal{R}(t, H_g, W_g) = \text{Concat} \left[ \mathbf{\Theta}(t \theta_j), \mathbf{\Theta}(H_g \theta_j), \mathbf{\Theta}(W_g \theta_j) \right]\]
Scrolling Super-Resolution (ScrollSR): Leverages video super-resolution diffusion priors (FlashVSR) for frame-by-frame enhancement
Trajectory Anchored Partitioning (TAP): Zero-shot spatial alignment strategy for seamless 3D VAE decoding
Median Consensus Selection (MCS) for frame fusion:
\[\bar{I}_t = \arg \min_i \left| f_{t,i} - \operatorname{Median}\{ f_{t,k} \}_{k=1}^N \right|\]
The core contribution is reformulating spatial EAR synthesis as temporal video generation to leverage video models’ inherent temporal consistency as global structural constraint.
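The coordinate bookkeeping behind ScanPE and the Median Consensus Selection rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: positions are 2D `(h, w)` tuples, `directions` holds one scan step $\mathbf{d}_k$ per frame transition, and `median_consensus` treats each candidate $f_{t,i}$ as a scalar, since the summary does not say how image-valued candidates are compared. All function names are hypothetical.

```python
from statistics import median

def anchor_positions(directions, delta, p_init):
    """Global anchors O_t for frames t = 1..T, given T-1 scan steps d_k.

    O_t = sum_{k=1}^{t-1} delta * d_k + P_init, so O_1 equals P_init.
    """
    anchors = [tuple(p_init)]
    for d in directions:
        prev = anchors[-1]
        anchors.append((prev[0] + delta * d[0], prev[1] + delta * d[1]))
    return anchors

def global_coords(p_loc, anchor):
    """P_g = p_loc + O_t: shift a frame-local (h, w) position onto the canvas."""
    return (p_loc[0] + anchor[0], p_loc[1] + anchor[1])

def median_consensus(candidates):
    """Return the candidate closest to the median of all candidates."""
    m = median(candidates)
    return min(candidates, key=lambda f: abs(f - m))

# Four frames scanning to the right with a stride of 8 positions.
anchors = anchor_positions([(0, 1)] * 3, delta=8, p_init=(0, 0))
print(anchors)                           # [(0, 0), (0, 8), (0, 16), (0, 24)]
print(global_coords((2, 3), anchors[2])) # (2, 19)
print(median_consensus([0.2, 0.9, 0.25]))  # 0.25
```

The global `(H_g, W_g)` values produced this way are what the trajectory-aware rotary embedding would consume in place of frame-local indices. Selecting an existing candidate, rather than averaging, keeps the fused value equal to one of the actual decoded candidates.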

Training Recipe

Base model: Initialize with pre-trained Wan2.1-T2V-1.3B video diffusion model
Fine-tuning stage: Train on curated 3K multi-ratio dataset using conditional Flow Matching objective
Hardware: 2x A100 GPUs with total batch size of 4
Optimizer: AdamW with learning rate 1×10^-5
Training duration: 10,000 iterations
Inference: Base frames generated at reduced resolution, then processed by modified FlashVSR on single A100 (80GB) for 32K upscaling
Data filtering: Curated dataset of 2,000 natural landscapes (6:1+ ratios) + 1,000 traditional Chinese scroll paintings (6:1 ratio)
No additional training reported for ScrollSR component - leverages pre-trained video super-resolution priors

Novelty & Lineage

Prior work:

SyncDiffusion (2023) - partitions target space into overlapping patches for EAR generation but suffers from fragmented compositions
ScaleCrafter & DyPE (2023-2024) - manipulate internal representations via dilated convolutions and position embedding interpolation for high-resolution synthesis
MultiDiffusion & Tiled Diffusion - use tiling-based approaches for large image generation

Delta: This paper reformulates EAR synthesis from static image generation to temporal video generation, introducing ScanPE for flexible coordinate distribution across video frames and ScrollSR for video super-resolution-based scaling.

Applied-specific assessment:
- The architectural idea of mapping spatial coordinates to temporal frames is genuinely novel and non-obvious
- Benchmark gains are substantial: FID drops from 241-334 across baselines to 215 (lower is better), with clear elimination of object repetition artifacts
- Comparisons appear fair, using the same evaluation protocols, though the baselines are built on different underlying models
- The approach requires pre-trained video diffusion models, making gains somewhat dependent on that foundation
- User study shows 74-92% preference over baselines across multiple dimensions
Verdict: SIGNIFICANT — The spatial-to-temporal reformulation is a genuinely clever architectural insight that effectively leverages video priors to solve a fundamental limitation of image diffusion models, with clear practical benefits demonstrated across multiple metrics.

Benchmarks & Results

FID (↓): ScrollScape 214.7 vs best baseline 241.2 (Tiled Diffusion), improvement of 26.5 points
CLIP Score (↑): ScrollScape 30.0 vs best baseline 29.7 (MultiDiffusion), marginal improvement
KID (↓): ScrollScape 2.0×10^-2 vs best baseline 3.0×10^-2 (Tiled Diffusion), 33% improvement
Style Loss (↓): ScrollScape 4.0×10^-3 vs best baseline 4.5×10^-3 (Tiled Diffusion), modest improvement
Global Structural Diversity LPIPS (↑): ScrollScape 0.674 vs best baseline 0.658 (MultiDiffusion), indicating better diversity
Global Structural Diversity DINOv2 (↓): ScrollScape 0.670 vs best baseline 0.682 (DyPE), indicating less repetition
User study: 74-92% preference over baselines across structural coherence, content richness, and image quality

Results improve on every reported metric. The largest margins are in overall image quality (FID/KID); the GSD metrics that track object repetition and the CLIP score move in the right direction, but only by small amounts. The method addresses fundamental failure modes of existing approaches.
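The size of these gaps is easier to compare once the lower-is-better metrics are put on a common relative scale. The snippet below only does arithmetic on the scores quoted above; the helper names are hypothetical.

```python
def absolute_drop(ours, baseline):
    """Absolute reduction for a lower-is-better metric."""
    return baseline - ours

def relative_drop_pct(ours, baseline):
    """Reduction relative to the baseline, in percent."""
    return 100.0 * (baseline - ours) / baseline

# (ScrollScape, best baseline) pairs as reported in this digest.
reported = {
    "FID": (214.7, 241.2),
    "KID": (2.0e-2, 3.0e-2),
    "Style Loss": (4.0e-3, 4.5e-3),
}

for name, (ours, base) in reported.items():
    print(f"{name}: {absolute_drop(ours, base):.4g} absolute, "
          f"{relative_drop_pct(ours, base):.1f}% relative")
```

On these numbers the relative reductions come out at roughly 11% for FID, 33% for KID, and 11% for Style Loss.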

Compute & Efficiency

Model size: Based on Wan2.1-T2V-1.3B (1.3 billion parameters)
Training compute: 2x A100 GPUs for 10,000 iterations, wall-clock time not reported
Inference speed/latency: Two-stage process - base generation then ScrollSR upsampling on single A100 (80GB), specific timing not reported
Memory footprint: Single A100 (80GB) required for 32K resolution generation via ScrollSR
Deployment practicality: Requires significant GPU resources (A100-class) for 32K generation, but approach is more memory-efficient than direct pixel-space generation due to latent-space processing. Sequential video generation may be slower than single-shot image generation but enables much higher resolution outputs.

Real-World Applicability

Evaluation on curated dataset only - no reported deployment in production systems
Qualitative results shown across diverse domains: natural landscapes, traditional scroll paintings, microscopic textures, ice crystals
No hardware experiments beyond standard GPU evaluation reported
No sim-to-real transfer discussed as method focuses on static image generation
Generated samples appear suitable for real-world applications like digital art, panoramic photography, and traditional scroll creation
Method designed for creative/artistic applications rather than safety-critical deployment scenarios

Limitations & Failure Modes

ENGINEERING - Requires pre-trained video diffusion models as foundation, limiting independence
ENGINEERING - Two-stage inference process (base generation + super-resolution) increases computational cost
ENGINEERING - Memory requirements still substantial (A100 80GB) for 32K generation
EVALUATION - Training dataset relatively small (3K images) may limit generalization
EVALUATION - Evaluation primarily on 8:1 aspect ratios, unclear performance on more extreme ratios
FUNDAMENTAL - Sequential generation approach may be inherently slower than single-shot methods

Failure modes:
- May struggle with aspect ratios significantly beyond training distribution (>8:1)
- Sequential approach could accumulate errors across long panoramic sequences leading to drift

Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models

Authors: Siqi Liu, Xinyang Li, Bochao Zou, Junbao Zhuo et al. (6 authors) · Institution: University of Science and Technology Beijing · Category: cs.CV

VisionToM improves multimodal language models’ Theory of Mind reasoning on egocentric videos through targeted attention head interventions that enhance visual focus and task-specific reasoning.

Practical Takeaway: If you’re working on video-based reasoning or social AI, this paper provides a practical framework for improving MLLM performance on Theory of Mind tasks through attention interventions. The key insight is that visual attention patterns are consistent across ToM tasks, enabling targeted enhancement. While the gains are modest, the approach is lightweight and doesn’t require model fine-tuning. Consider implementing this if you have access to calibration data and need to improve social reasoning capabilities in video understanding systems. The clustering-based encoder approach for handling diverse reasoning failures could be adapted to other intervention scenarios beyond ToM.

Tags: multimodal theory-of-mind video-understanding attention-intervention interpretability social-cognition vision-language-models egocentric-video

Task & Setting

Theory of Mind (ToM) evaluation for multimodal large language models (MLLMs) focuses on assessing their ability to infer mental states from visual information, addressing a critical gap in human-AI interaction capabilities. While most ToM evaluations use text-based inputs, real-world scenarios require understanding intentions, beliefs, and goals from visual cues alone. This is challenging because MLLMs suffer from hallucinations and over-reliance on linguistic priors when reasoning about mental states.

The task takes egocentric video sequences as input (24 frames) paired with multiple-choice questions about agents’ mental states. The formal objective is to maximize accuracy on three ToM reasoning subtasks:

\[P(\text{correct choice}|\text{video}, \text{question}) \rightarrow \max\]

Success is measured using Top-1 accuracy on three specific ToM tasks: goal inference (“What will C most likely do next?”), belief reasoning (inferring what agents know/believe), and action inference (predicting future behaviors).

The paper evaluates on EgoToM, a real-world egocentric video dataset with carefully curated question-answer pairs across diverse social interaction scenarios, providing ecological validity compared to simulated environments.

Architecture & Method

Base MLLMs: LLaVA-Next-Video-7B and Qwen2.5-VL-7B for video understanding and generation tasks
Internal representation extraction: Decompose visual ToM reasoning into visual attention and belief representation components by extracting activations from attention heads
\[T_{l+1} = T_l + \sum_{h=1}^{H} Attn^h_l(P_l^h T_l) \cdot W^o_l\]
Visual attention enhancement: Generate adversarial examples using ℓ∞-bounded PGD attack (ε=16/255, 300 iterations) to create positive/negative sample pairs, compute activation offset vectors:
\[\{\delta_{V,l}^h\} = \frac{1}{S}\sum_{i=1}^{S} (X_{V,i,l}^{pos,h} - X_{V,i,l}^{neg,h})\]
ToM reasoning guidance: Use correct answers as positive samples and incorrect answers as negatives, employ clustering-based approach with encoder networks to separate semantic spaces
Intervention: Apply the learned intervention vectors Δ, scaled by strength α, to the top-K sensitive attention heads during inference:
\[T_{l+1} = T_l + \sum_{h=1}^{H} (Attn^h_l(P_l^h T_l) + \alpha \times \Delta) \cdot W^o_l\]
The core technical contribution is the joint optimization of visual attention and ToM-specific reasoning through targeted attention head interventions.
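The two intervention steps, averaging positive-minus-negative activations into an offset and then adding a scaled offset to selected heads, can be sketched as follows. This is a simplified illustration, not the paper's code: activations are plain lists of floats, the output projection $W^o_l$ is omitted (in the formula the shift is added before that projection), and the function names and the `offsets` dictionary layout are assumptions.

```python
def mean_offset(pos, neg):
    """Average (positive - negative) activation difference over S sample pairs."""
    s = len(pos)
    dim = len(pos[0])
    return [sum(p[j] - n[j] for p, n in zip(pos, neg)) / s for j in range(dim)]

def intervene(head_outputs, offsets, selected_heads, alpha=1.0):
    """Add alpha * offset to the outputs of the selected heads only.

    head_outputs: list of per-head activation vectors for one layer.
    offsets: dict mapping head index -> offset vector (hypothetical layout).
    """
    out = []
    for h, act in enumerate(head_outputs):
        if h in selected_heads:
            act = [a + alpha * d for a, d in zip(act, offsets[h])]
        out.append(list(act))
    return out

# Toy example: two sample pairs for one head, two heads in the layer.
pos = [[1.0, 2.0], [3.0, 2.0]]
neg = [[0.0, 1.0], [1.0, 1.0]]
delta = mean_offset(pos, neg)
print(delta)  # [1.5, 1.0]

heads = [[0.0, 0.0], [1.0, 1.0]]
shifted = intervene(heads, {1: delta}, selected_heads={1}, alpha=1.0)
print(shifted)  # [[0.0, 0.0], [2.5, 2.0]]
```

Because the offsets are computed once from calibration data, inference only pays for a vector addition per selected head.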

Training Recipe

Probing stage: Train lightweight logistic regression classifiers on 30% calibration split using cross-entropy loss to identify task-sensitive attention heads, freeze MLLM backbone
Encoder training: Train separate encoder networks (two linear layers with GELU, 128→256→128 dimensions) for each attention head using Adam optimizer, learning rate 1×10^-3, optimize clustering-based separation loss
Intervention vector computation: Extract activation offsets from positive/negative sample pairs, compute intervention directions as combination of visual and reasoning components
Hardware: Not reported for training compute, calibration takes 0.2 hours for probing + 1 hour for encoder training
Data: EgoToM dataset with a 30%/70% calibration/test split; no additional data augmentation or filtering reported
Inference: Apply interventions at zero temperature, zero-shot setting with intervention strength α=1.0 and top-K=64 heads
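The probing stage can be illustrated with a small logistic regression probe per attention head, followed by a ranking of heads. This is a sketch under assumptions: the summary does not state how heads are scored, so the code ranks them by probe accuracy on the data used for fitting, and a real setup would likely score on held-out samples. The toy features, the head names, and the function names are all hypothetical.

```python
import math

def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic regression probe by full-batch gradient descent on cross-entropy."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of cross-entropy w.r.t. the logit
            for j in range(dim):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def probe_accuracy(w, b, features, labels):
    """Fraction of samples whose predicted class (logit > 0) matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)

def top_k_heads(head_features, labels, k):
    """Rank heads by probe accuracy and return the k best head ids."""
    scores = {}
    for head, feats in head_features.items():
        w, b = train_probe(feats, labels)
        scores[head] = probe_accuracy(w, b, feats, labels)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data: head "a" separates the labels, head "b" carries no signal.
labels = [0, 0, 1, 1]
head_features = {
    "a": [[-1.0], [-0.8], [0.9], [1.1]],
    "b": [[0.5], [-0.5], [0.5], [-0.5]],
}
selected = top_k_heads(head_features, labels, k=1)
print(selected)  # ['a']
```

The reported configuration selects K=64 heads; the toy example uses k=1 only to keep the data small.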

Novelty & Lineage

Prior work:

GridToM (2025): Introduced intervention methods for ToM in MLLMs using binary logistic regression classifiers to derive intervention directions, limited to simulated grid-world environments
Neural Theory-of-Mind (TOMI) dataset (2022): Evaluated LLM ToM abilities on text-based tasks, showed instability in mental state reasoning
MMToM-QA (2024): Created multimodal ToM benchmark but relied heavily on textual input alongside visual information

Delta: This paper extends intervention methods to real-world egocentric video scenarios, introduces clustering-based encoder networks for fine-grained reasoning failure correction, and operates purely on visual input without textual annotations.

Applied-specific assessment:
- Architectural novelty: The clustering-based encoder approach for handling heterogeneous negative samples is a reasonable extension, but the core attention intervention technique builds directly on GridToM
- Benchmark gains: Improvements are meaningful (+13.0 points on the Goal task for LLaVA-Next-Video), but baseline performance is quite poor (24.0% on Actions), which leaves substantial headroom
- Fair comparisons: Uses same evaluation protocol as EgoToM benchmark, though limited to two open-source models
- Generalizability concerns: Method requires calibration data, and vectors transferred to MMToM-QA underperform (60.5% with transfer vs 66.3% for the non-transferred comparison)
Verdict: INCREMENTAL — Solid application of known intervention techniques to a new visual ToM setting, but the core methodology builds incrementally on GridToM with reasonable engineering improvements rather than fundamental algorithmic advances.

Benchmarks & Results

EgoToM Goal Inference: LLaVA-Next-Video improves from 61.5% to 74.5% (+13.0 points), Qwen2.5-VL from 86.9% to 88.9% (+2.0 points) vs human baseline 88%
EgoToM Belief Reasoning: LLaVA-Next-Video improves from 38.9% to 45.3% (+6.4 points), Qwen2.5-VL from 35.6% to 42.0% (+6.4 points) vs human baseline 72%
EgoToM Action Inference: LLaVA-Next-Video improves from 24.0% to 29.7% (+5.7 points), Qwen2.5-VL from 31.1% to 37.6% (+6.5 points) vs human baseline 78%
MMToM-QA transfer: Qwen2.5-VL reaches 60.5% with intervention vectors transferred from EgoToM vs 66.3% for the non-transferred comparison (a 5.8-point gap)
Open-ended generation (TruthfulQA metrics): VisionToM improves truthfulness scores, e.g., LLaVA-Next-Video Goal task from 8.5% to 27.3% for True metric

Results show consistent but modest improvements, with particularly large gaps remaining on Belief and Action tasks. The method shows mixed generalization across models and datasets.

Compute & Efficiency

Model size: 7B parameters for main experiments (LLaVA-Next-Video, Qwen2.5-VL), also tested 72B parameter Qwen2.5-VL
Training compute: Calibration stage takes 0.2 hours for probing + 1 hour for encoder training, specific GPU hours and hardware not reported
Inference speed/latency: Not reported, but method adds minimal overhead as intervention vectors are precomputed and applied during forward pass
Memory footprint: Minimal additional memory required as method only stores learned intervention vectors, backbone remains frozen
Deployment practicality: High - lightweight approach requiring only one-time calibration, compatible with existing MLLM inference pipelines without fine-tuning

Real-World Applicability

Dataset evaluation: Uses EgoToM benchmark with real-world egocentric videos from natural social interactions, providing ecological validity compared to simulated environments
Video-only setting: Method operates solely on visual input without textual annotations, more realistic for embodied AI applications
Cross-dataset transfer: Shows limited but positive transfer from EgoToM to MMToM-QA (60.5% vs baseline 38.2%), indicating some generalization
Production considerations: Method requires calibration data and manual hyperparameter tuning (K=64 heads, α=1.0), potentially limiting deployment flexibility
Hardware deployment: Tested on standard MLLM architectures, should be compatible with existing video understanding systems

Limitations & Failure Modes

FUNDAMENTAL: Method requires calibration data from target task domain, limiting zero-shot applicability across different ToM scenarios
ENGINEERING: Performance gains are modest, especially on the challenging Belief (+6.4 points) and Action (+5.7 points for LLaVA-Next-Video) tasks, suggesting a need for more sophisticated intervention strategies
EVALUATION: Limited to two open-source models, missing comparison with latest closed-source models like GPT-4o beyond baseline reporting
FUNDAMENTAL: Cross-dataset transfer degrades performance (60.5% with vectors transferred from EgoToM vs 66.3% for the non-transferred comparison), indicating learned representations may be dataset-specific
ENGINEERING: Hyperparameter sensitivity not thoroughly analyzed - only reports single configuration (K=64, α=1.0)

Failure modes:
- Method fails when visual attention patterns differ significantly between calibration and test scenarios
- Clustering-based approach may struggle with novel types of reasoning failures not seen during calibration