Mar 24, 2026 Applied AI 5 papers

Applied AI Digest — Mar 24, 2026

Today’s Digest at a Glance

Today’s papers focus on efficient multimodal agents that combine vision, language, and action through novel planning paradigms, reinforcement learning optimization strategies, and parallel reasoning architectures.

Planning-Before-Perception

Traditional video understanding systems process entire video sequences upfront, leading to computational waste when only specific temporal segments contain relevant information. Planning-before-perception inverts this approach by having agents start with textual queries alone and iteratively decide what visual content to observe based on their current understanding and task requirements.

The core idea treats video analysis as a sequential decision problem where an agent maintains an internal state and uses a tool-calling interface to selectively sample video frames. At each step, the agent can specify temporal parameters (start_time, end_time, nframes) and spatial parameters (resize factors) to request precisely the visual information needed for the current reasoning step. This creates a feedback loop: query → plan → perceive → reason → update plan → perceive more.
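
A minimal sketch of what such a frame-selection call might compute is shown here. The `FrameRequest` class and `sample_timestamps` helper are hypothetical names for illustration; only the parameter names (start_time, end_time, nframes, resize) come from the description above, and sampling at sub-interval midpoints is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FrameRequest:
    """Hypothetical tool-call payload; field names follow the parameters named above."""
    start_time: float    # seconds
    end_time: float      # seconds
    nframes: int         # number of frames to sample in the window
    resize: float = 1.0  # spatial scale factor (1.0 = original resolution)


def sample_timestamps(req: FrameRequest) -> list[float]:
    """Return evenly spaced timestamps (midpoints of equal sub-intervals)."""
    if req.nframes <= 0 or req.end_time <= req.start_time:
        raise ValueError("need nframes > 0 and end_time > start_time")
    step = (req.end_time - req.start_time) / req.nframes
    return [req.start_time + (i + 0.5) * step for i in range(req.nframes)]


# A coarse, downsampled first pass over a 10-minute video,
# followed by a denser full-resolution look at one minute of it.
coarse = sample_timestamps(FrameRequest(0.0, 600.0, 4, resize=0.5))
fine = sample_timestamps(FrameRequest(120.0, 180.0, 6))
print(coarse)  # [75.0, 225.0, 375.0, 525.0]
```

The coarse-then-fine pattern is one plausible way an agent could spend few visual tokens early and more only where the query demands it.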

Intuitively, this mimics how humans watch videos—we don’t process every frame uniformly but instead focus attention on relevant moments based on what we’re looking for.

Token-Level Reinforcement Learning for Vision-Language Models

Standard policy optimization in vision-language models applies rewards at the sequence level, treating all generated tokens equally. However, in multimodal reasoning tasks, different tokens contribute differently to the final answer quality—some tokens represent crucial visual grounding while others are generic linguistic connectives.

Token-level policy optimization addresses this by computing individual advantage scores for each token position. The key challenge is defining meaningful token-level rewards without ground-truth token annotations. Recent approaches combine multiple signals: visual similarity measures how well each token aligns with relevant image regions (computed as cosine similarity between token hidden states and vision encoder outputs), while token entropy captures the model’s confidence in each prediction.

Schematically, a token-level advantage of this kind can be written as $A_t = \alpha \cdot VS_t + \beta \cdot H_t$, where $VS_t$ measures visual grounding, $H_t$ is the prediction entropy, and $\alpha, \beta$ are weighting coefficients. PEPO, covered later in this digest, instead fuses the two signals through a gate and uses the result to rescale the sequence-level advantage. Either way, the policy gradient receives fine-grained feedback about which tokens should be reinforced or suppressed.
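
The two signals described above can be sketched with NumPy on toy arrays. This is an illustrative, single-layer version under stated assumptions: the function names are hypothetical, and the hidden states and logits are random stand-ins rather than real model outputs.

```python
import numpy as np


def visual_similarity(token_states: np.ndarray, vision_states: np.ndarray) -> np.ndarray:
    """Mean cosine similarity of each response token (T, d) to all vision tokens (N, d)."""
    t = token_states / np.linalg.norm(token_states, axis=-1, keepdims=True)
    v = vision_states / np.linalg.norm(vision_states, axis=-1, keepdims=True)
    return (t @ v.T).mean(axis=-1)  # shape (T,)


def token_entropy(logits: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of the softmax distribution at each position (T, V)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)


rng = np.random.default_rng(0)
vs = visual_similarity(rng.normal(size=(5, 16)), rng.normal(size=(9, 16)))
# First row: uniform over 4 tokens (maximal entropy); second row: one dominant token.
ent = token_entropy(np.array([[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]))
```

A uniform distribution over four tokens gives entropy $\log 4$, while a sharply peaked one gives a value near zero, which is why entropy can serve as a per-token uncertainty signal.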

Parallel Chain-of-Thought via Learnable Query Tokens

Autoregressive chain-of-thought reasoning forces models to generate intermediate reasoning steps sequentially, creating computational bottlenecks especially for vision-language-action tasks requiring both spatial perception and logical planning. Parallel chain-of-thought eliminates this sequential constraint by performing visual and linguistic reasoning simultaneously.

The technique introduces learnable query tokens—special embeddings that act as “reasoning slots” in the input sequence. Visual CoT tokens $Q_{vis}$ are trained to extract spatial reasoning patterns (object relationships, geometric constraints), while linguistic CoT tokens $Q_{ling}$ capture logical reasoning chains (causal relationships, planning steps). These tokens are processed in parallel during the forward pass, allowing the model to jointly optimize both reasoning modalities.

The input sequence becomes $X = [V_{obs}, Q_{vis}, L_{instr}, Q_{ling}]$ where visual observations and language instructions are interspersed with the learnable reasoning tokens. Through training, these query tokens learn to encode implicit reasoning patterns that would normally require explicit sequential generation, achieving comparable reasoning quality with significantly faster inference.
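
The layout of $X$ can be illustrated with plain arrays. This is only a shape-level sketch: the token counts and hidden size are arbitrary assumptions, and the query tokens would be trainable parameters in a real model rather than random draws.

```python
import numpy as np

d_model = 8  # toy hidden size (assumption)
rng = np.random.default_rng(0)

v_obs = rng.normal(size=(6, d_model))    # embedded visual observation tokens
l_instr = rng.normal(size=(4, d_model))  # embedded instruction tokens

# Learnable query tokens ("reasoning slots"); random here purely for illustration.
q_vis = rng.normal(size=(2, d_model))
q_ling = rng.normal(size=(3, d_model))

# X = [V_obs, Q_vis, L_instr, Q_ling], concatenated along the sequence axis.
x = np.concatenate([v_obs, q_vis, l_instr, q_ling], axis=0)

# Index ranges of the slots, whose output states would feed downstream heads.
vis_start = len(v_obs)
ling_start = vis_start + len(q_vis) + len(l_instr)
vis_slots = slice(vis_start, vis_start + len(q_vis))
ling_slots = slice(ling_start, ling_start + len(q_ling))
```

Because all slots sit in one sequence, a single forward pass produces states for both kinds of query token, with no token-by-token decoding of reasoning text.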

Reading Guide

EVA and DualCoT-VLA both tackle efficiency in multimodal reasoning but from different angles—EVA optimizes what visual content to process while DualCoT-VLA optimizes how reasoning is performed. PEPO provides the optimization framework that could enhance both approaches through better token-level learning signals. CaP-X and SG-VLA focus on embodied applications, with CaP-X emphasizing systematic benchmarking of code generation while SG-VLA demonstrates auxiliary learning benefits for mobile manipulation.

EVA: Efficient Reinforcement Learning for End-to-End Video Agent

Authors: Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang et al. (9 authors) · Institution: SenseTime Research · Category: cs.CV

EVA introduces a planning-before-perception framework for video understanding that autonomously decides what, when, and how to watch videos, achieving significant efficiency gains through reinforcement learning-based training.

Practical Takeaway: EVA demonstrates that planning-before-perception can significantly improve video understanding efficiency. The key insight is starting with text-only reasoning to guide visual token allocation, rather than processing uniform frame samples upfront. The three-stage training approach (SFT→KTO→GRPO) provides a replicable framework for training video agents. Research engineers should consider: (1) implementing flexible frame selection tools with both temporal and spatial control, (2) designing reward functions that balance accuracy with efficiency, (3) using KTO to address common failure modes before online RL. The dramatic reduction in visual tokens (27x fewer) while maintaining accuracy suggests this approach could enable video understanding at much larger scales.

Tags: video-understanding multimodal-agents reinforcement-learning efficient-inference tool-use planning visual-reasoning long-video

Task & Setting

Video understanding with multimodal large language models (MLLMs) faces significant challenges due to long token sequences containing extensive temporal dependencies and redundant frames. Most existing approaches treat MLLMs as passive recognizers that process entire videos or uniformly sampled frames without adaptive reasoning. This leads to computational inefficiency, especially for long videos where only specific segments may be relevant to answering queries.

The task involves video question answering where the input consists of a user query and a video, with outputs being natural language answers. EVA operates through an iterative process where at each timestep t, the agent observes a belief state:

\[s_t = \{q, h_t, F_t\}\]

where $q$ is the user query, $h_t$ represents the interleaved text-frame history, and $F_t$ corresponds to visual evidence from tool calls. The agent’s policy is parameterized as $\pi_\theta(a_t \mid s_t)$.

Success is measured by accuracy on video understanding benchmarks including LSDBench, LongVideoBench, MLVU, VideoMME, LVBench, and Video-Holmes. For multiple-choice questions, Completeness Self-Verification (CSV) reward is used, while open-ended questions use ROUGE scores. The paper introduces three datasets: EVA-SFT (10k samples), EVA-KTO (11k labeled trajectories), and EVA-RL (9.6k open-ended QA pairs + 1.1k multiple choice).

Architecture & Method

Base model: Qwen2.5-VL-7B-Instruct with flexible frame selection tool supporting temporal (start_time, end_time, nframes) and spatial (resize) control parameters.
Planning-before-perception paradigm: Agent starts with only the textual query, no visual input initially, and iteratively performs summary-plan-action-reflection cycles.
Frame selection tool parameters: start_time and end_time specify temporal window, nframes controls sampling density, resize enables spatial downsampling for zoom operations.
Multi-round reasoning: Agent autonomously decides what to watch, when to watch, and how to watch through iterative tool calls and visual evidence accumulation.
Reward function for GRPO training:
\[R(\tau) = w_{acc} r_{acc} + w_{fmt} r_{fmt}\]
where accuracy reward is:
\[r_{acc} = \begin{cases} r_{csv} & \text{if multiple-choice} \\ r_{rouge} & \text{if open-ended} \end{cases}\]
ROUGE reward for open-ended tasks:
\[r_{rouge} = \frac{1}{3}(R_1 + R_2 + R_L) \in [0,1]\]
Core technical contribution: Planning-before-perception framework enabling autonomous, query-driven video understanding through flexible multi-dimensional frame selection rather than fixed uniform sampling.
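
The reward formulas above reduce to a few lines of arithmetic. In this sketch the weights `w_acc` and `w_fmt` are placeholder values (the summary does not state them), the ROUGE components are passed in as precomputed scores, and treating a verified multiple-choice answer as a CSV reward of 1.0 is an assumption.

```python
def rouge_reward(r1: float, r2: float, rl: float) -> float:
    """r_rouge = (R_1 + R_2 + R_L) / 3, each score assumed to lie in [0, 1]."""
    return (r1 + r2 + rl) / 3.0


def trajectory_reward(r_acc: float, r_fmt: float,
                      w_acc: float = 0.9, w_fmt: float = 0.1) -> float:
    """R(tau) = w_acc * r_acc + w_fmt * r_fmt. Default weights are placeholders."""
    return w_acc * r_acc + w_fmt * r_fmt


# Open-ended question: the accuracy reward is the averaged ROUGE score.
r_open = trajectory_reward(rouge_reward(0.6, 0.3, 0.45), r_fmt=1.0)

# Multiple-choice question: the accuracy reward is the CSV reward
# (assumed to be 1.0 for a verified correct answer).
r_mc = trajectory_reward(1.0, r_fmt=1.0)
```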

Training Recipe

Supervised Fine-Tuning (SFT): EVA-SFT dataset with 10k samples covering general and task-specific agent training data. Training for 2 epochs, batch size=8, learning rate=2e-6. Data generated using Qwen2.5-VL-72B teacher model with prompts following Summary+Planning+Action+Reflection format.
Kahneman-Tversky Optimization (KTO): EVA-KTO dataset with 11k labeled trajectories (63% correct, 37% incorrect). Learning rate=2e-6, beta=0.1. Addresses typical failure cases like insufficient visual evidence and poor frame selection strategies.
Group Relative Policy Optimization (GRPO): EVA-RL dataset with 90% open-ended QA and 10% multiple choice questions. Training for 1 epoch, batch size=64, 8 rollouts per sample, learning rate=1e-6 on 32 H100 GPUs. Introduces a Data-Enhanced GRPO pipeline that collects failure cases and generates new QA pairs using the teacher model.
Hardware: Training conducted on H100 GPUs, with wall-clock time not explicitly reported.
Data sources: llava-video (short video QA), cgbench (long video QA), HD-VILA (unseen videos for enhanced dataset generation).
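
For readers unfamiliar with GRPO, the group-relative step can be sketched as follows: rewards for the rollouts of one query are standardized within the group. The reward values are toy numbers and the function name is hypothetical; this shows the standard GRPO normalization, not EVA's exact training code.

```python
import numpy as np


def group_relative_advantages(rewards, eps: float = 1e-6) -> np.ndarray:
    """Standardize rewards within one group of rollouts for the same query."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Eight rollouts for one query, matching the rollout count in the recipe above
# (toy rewards, not taken from the paper).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)
```

Rollouts that beat the group mean receive positive advantages and the rest negative ones, so no separate value network is needed.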

Novelty & Lineage

Prior work:

Traditional video MLLMs like LLaVA-Video and Video-ChatGPT treat models as passive recognizers processing entire videos uniformly.
Recent agent methods like FrameThinker and VideoAgent introduce external tools but rely on fixed workflows and perception-first strategies.
Tool-integrated reasoning works like ToolLLM focus on API usage but not video-specific adaptive perception.

Delta: EVA introduces planning-before-perception paradigm where agent reasons solely from text query before any visual input, enabling autonomous decisions about what/when/how to watch. The three-stage training pipeline (SFT-KTO-GRPO) with specialized datasets is novel for video agent training.

Applied-specific assessment:
- Architectural novelty: Planning-before-perception is a meaningful paradigm shift from perception-first approaches, though individual components (GRPO, tool calling) are known techniques.
- Benchmark gains: 6-12% over general MLLM baselines and 1-3% over adaptive agents is substantial, especially with significantly fewer visual tokens (6.2K vs 166K+).
- Fair comparisons: Uses same base model (Qwen2.5-VL) as baseline, evaluates on standard benchmarks with consistent protocols.
- Scale dependency: Training requires substantial compute (32 H100s) and a large teacher model (Qwen2.5-VL-72B for the SFT data), but inference efficiency gains are significant.
The planning-before-perception paradigm represents a non-obvious insight that enables meaningful efficiency gains while maintaining accuracy.

Verdict: SIGNIFICANT — Planning-before-perception paradigm provides clear efficiency gains with maintained accuracy, addressing fundamental inefficiencies in video understanding.

Benchmarks & Results

LSDBench: EVA achieves 51.8% accuracy vs 49.2% baseline (Qwen2.5-VL), +2.6% improvement using only 6.2K visual tokens vs 21.0K
LongVideoBench: EVA-GRPO achieves 55.0% vs 52.9% for the FrameThinker baseline (estimated frames: 25.3 vs 21.1)
MLVU: EVA-GRPO achieves 68.3% vs 59.1% FrameThinker, significant improvement with 22.2 estimated frames vs 23.2
VideoMME-Long/Overall: EVA-GRPO achieves 48.4%/60.2% vs 47.6%/- FrameThinker, competitive performance
LVBench: EVA-GRPO achieves 43.3% vs 36.6% FrameThinker, substantial +6.7% improvement with comparable frame usage
Video-Holmes (zero-shot): EVA-GRPO achieves 37.2% overall vs 36.5% Video-R1, competitive performance without task-specific training

Results show consistent improvements across benchmarks, particularly strong gains on LVBench and MLVU. The efficiency gains (using 6.2K vs 166K+ visual tokens on LSDBench) are particularly impressive. Results appear fairly compared with consistent evaluation protocols.

Compute & Efficiency

Model size: 7B parameters (Qwen2.5-VL-7B-Instruct base)
Training compute: 32 H100 GPUs for GRPO stage, wall-clock time not reported for full training pipeline
Inference speed/latency: Not explicitly reported, but dominated by visual token processing rather than text reasoning rounds
Memory footprint: Significantly reduced visual token usage - 6.2K tokens vs 166K+ for baselines on LSDBench, representing ~27x reduction
Deployment practicality: Highly practical - maintains accuracy while dramatically reducing visual token requirements. Multi-round reasoning adds minimal text tokens compared to visual processing costs. Framework generalizable to other base models with tool-calling capabilities.
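
As a quick check on the reduction factor quoted above, using the two LSDBench token counts given in this section:

\[\frac{166\text{K}}{6.2\text{K}} \approx 26.8 \approx 27\times\]

Since the baseline figure is reported as "166K+", the ratio is a lower bound.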

Real-World Applicability

Benchmark evaluation only: All experiments conducted on curated video understanding benchmarks (LSDBench, LongVideoBench, MLVU, VideoMME, LVBench, Video-Holmes)
No deployment results: Paper does not report real-world deployment, production integration, or hardware experiments on actual robotic/autonomous systems
Video resolution: Evaluations conducted on 720p videos, representing realistic video quality
Zero-shot transfer: Shows some generalization capability on Video-Holmes benchmark without task-specific training
Scalability considerations: Framework designed for long videos (tested on videos over 6600 seconds), suggesting applicability to real-world scenarios requiring efficient processing of extended video content

Limitations & Failure Modes

Tool interface dependency (FUNDAMENTAL): Current reasoning loop relies on pre-defined tool interfaces and may struggle with unseen or noisy query distributions
Limited exploration space (ENGINEERING): Despite flexible parameters, action space is still constrained by designed tool schema rather than truly open-ended exploration
Training data requirements (ENGINEERING): Requires substantial compute resources (32 H100s) and high-quality teacher models for dataset construction
Evaluation scope (EVALUATION): Tested only on curated benchmarks, lacking real-world deployment validation or robustness testing
Multi-turn overhead (ENGINEERING): While efficient overall, multi-round reasoning may add latency in time-critical applications

Known failure modes:
- May generate answers without sufficient visual evidence when trained insufficiently (addressed by KTO stage)
- Can over-explore or under-explore depending on query complexity and available context

CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation

Authors: Max Fu, Justin Yu, Karim El-Refai, Ethan Kou et al. (15 authors) · Institution: NVIDIA, UC Berkeley, Stanford University, Carnegie Mellon University · Category: cs.RO

CaP-X introduces a systematic framework for benchmarking code-generating agents in robot manipulation, showing that test-time computation scaling can recover human-level performance even with low-level primitives.

Practical Takeaway: If you’re building robot control systems, this work provides valuable insights into the trade-offs between abstraction levels and a systematic framework for evaluating coding agents. The key practical takeaway is that test-time compute scaling through multi-turn interaction, visual differencing into text (rather than direct image input), and ensemble reasoning can significantly improve robustness even when operating over low-level primitives. Consider implementing the Visual Differencing Module approach for better visual grounding, and note that training-free agentic scaffolding may be more practical than collecting large-scale robot data. However, be aware that the computational overhead of multi-model ensembles may limit deployment feasibility, and contact-rich tasks may still require hybrid approaches combining programmatic control with learned policies.

Tags: robotics code-generation embodied-ai benchmark multimodal manipulation reinforcement-learning vision-language-models

Task & Setting

This paper addresses the challenge of creating autonomous robot controllers that can generate and execute complex manipulation tasks using executable code rather than relying on data-intensive training approaches. Robot control has traditionally required either explicit programming by human experts (creating scalability bottlenecks) or large-scale data collection for Vision-Language-Action (VLA) models.

The task is to develop agents that control robots by synthesizing and executing Python programs that compose perception and control primitives. The input is natural language task instructions and RGB-D visual observations; the output is executable Python code that orchestrates robot behaviors through API calls to perception modules (like SAM3 for segmentation) and control primitives (like motion planners and inverse kinematics). The formal objective is to maximize task success rate:

\[\max_{\pi} \mathbb{E}_{t \sim \mathcal{T}} [\mathbb{I}[\text{success}(t, \pi(t))]]\]

where $\pi$ is the coding policy and $\mathcal{T}$ is the task distribution.
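
In practice the expectation is estimated by running repeated trials and averaging the success indicator. The sketch below uses a hypothetical stand-in policy with made-up success probabilities; none of the names or numbers come from the paper.

```python
import random


def estimate_success_rate(policy, tasks, trials_per_task: int, seed: int = 0) -> float:
    """Monte Carlo estimate of E_t[ 1[success(t, policy(t))] ] over a task list."""
    rng = random.Random(seed)
    outcomes = [policy(task, rng) for task in tasks for _ in range(trials_per_task)]
    return sum(outcomes) / len(outcomes)


# Hypothetical stand-in: each task succeeds with a fixed probability.
TOY_SUCCESS_PROB = {"cube_lift": 0.9, "cube_stack": 0.5}


def toy_policy(task: str, rng: random.Random) -> bool:
    """Pretend to generate and execute code, returning whether the task succeeded."""
    return rng.random() < TOY_SUCCESS_PROB[task]


rate = estimate_success_rate(toy_policy, list(TOY_SUCCESS_PROB), trials_per_task=100)
```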

Success is measured by task completion rate across 7 core manipulation tasks (Cube Lift, Cube Stack, Spill Wipe, Peg Insertion, Cube Re-stack, Two-Arm Lift, Two-Arm Handover) and extended evaluation on LIBERO-PRO and BEHAVIOR benchmarks. The framework introduces CaP-Gym with 187 tasks integrated from RoboSuite, LIBERO-PRO, and BEHAVIOR simulators, with evaluation protocols comparing single-turn vs multi-turn interaction across different abstraction levels.

Architecture & Method

CaP-Gym: A hierarchical control framework built on Gymnasium interface that binds low-level physics simulators with a stateful Code Executor loop using a Read-Eval-Print Loop (REPL) paradigm.
Perception primitives: Modular services including SAM3 for language-conditioned segmentation, Molmo 2 for open-vocabulary pointing, and standard vision libraries (OpenCV, Open3D) that abstract raw sensor data into structured semantic objects.
Control primitives: Motion planners and inverse kinematics solvers (PyRoki) that handle collision checking, reachability constraints, and action-space transformations, allowing agents to reason in task-oriented Cartesian space.
CaP-Agent0 framework: Training-free agentic system with three key components:
- Multi-turn Visual Differencing Module (VDM) that converts visual observations into structured natural language
- Auto-synthesized persistent skill library that extracts and reuses successful code patterns
- Parallel reasoning with ensemble of models (Gemini-3-Pro, GPT-5.2, Claude Opus 4.5)
CaP-RL: Applies Group Relative Policy Optimization (GRPO) for reinforcement learning directly on the coding agent using verifiable environment rewards.
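
The stateful REPL-style executor mentioned above can be approximated in a few lines. This is a toy sketch and not CaP-Gym's implementation: `CodeExecutor` and `detect_cube` are hypothetical names, and a real system would sandbox the generated code rather than call `exec` directly.

```python
import contextlib
import io
import traceback


class CodeExecutor:
    """Toy stateful executor: variables persist across turns, like a REPL session."""

    def __init__(self, primitives: dict):
        # Perception/control primitives exposed to the agent's generated code.
        self.namespace = dict(primitives)

    def run(self, code: str) -> str:
        """Execute one snippet and return its stdout, or a traceback on error."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            buffer.write(traceback.format_exc())
        return buffer.getvalue()


# Hypothetical primitive standing in for a perception service.
def detect_cube():
    return {"name": "red cube", "position": (0.4, 0.0, 0.02)}


executor = CodeExecutor({"detect_cube": detect_cube})
turn1 = executor.run("cube = detect_cube()\nprint(cube['name'])")
turn2 = executor.run("print(cube['position'][0])")  # reuses state from turn 1
turn3 = executor.run("print(undefined_name)")       # errors come back as text feedback
```

Returning tracebacks as text is what lets a multi-turn agent read its own failures and revise the next snippet.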

Training Recipe

Base model evaluation: 12 frontier models evaluated in zero-shot Pass@1 protocol across single-turn (S1-S4) and multi-turn (M1-M4) tiers. Models include closed-source (Gemini-3-Pro, GPT o1/5.1/5.2, Claude Haiku/Opus 4.5) and open-source (GPT-OSS-20B/120B, Qwen3 235B, DeepSeek-V3.1-Terminus) models. No additional training for base evaluation.
CaP-Agent0: Training-free framework that augments base models with agentic scaffolding. Uses parallel sampling with varying temperatures across multiple models. No model parameter updates required.
CaP-RL training: Post-trains Qwen2.5-Coder-7B-Instruct using GRPO for 50 iterations per task on three tasks (Cube Lift, Cube Stack, Spill Wipe). Training uses privileged state-based APIs (tier S1) to avoid noisy reward signals. Optimizer and learning rate details not reported. Hardware requirements not specified.
Evaluation protocol: 100 trials per task for core benchmark, 25 trials for real-world deployment. Each trial allows only one continuous interaction episode without environment resets.

Novelty & Lineage

Prior work: Code-as-Policy pioneers like Liang et al. (2023) and Singh et al. (2023) validated LLM-generated code for robot control but relied heavily on high-level, human-designed primitives that encode significant task structure. SWE-Bench (Jimenez et al., 2024) demonstrated coding agent capabilities in software environments but without embodied constraints.

Delta: This paper systematically studies how coding agent performance degrades as human-designed abstractions are removed, introducing a structured benchmark across abstraction levels (high-level to low-level primitives). It demonstrates that test-time computation scaling through multi-turn interaction, visual differencing, and ensemble reasoning can recover performance even with low-level primitives.

Applied-specific assessment:

The architectural idea of systematic abstraction-level evaluation is novel and reveals important dependencies on designer scaffolding
Benchmark gains are substantial: CaP-Agent0 achieves human-level performance on 4/7 tasks and significantly outperforms VLA baselines on LIBERO-PRO
Comparisons appear fair - same primitives and evaluation protocols across methods
The gains likely hold as the approach is training-free and relies on general capabilities rather than task-specific data

However, the work is primarily a benchmark and framework contribution rather than a fundamental algorithmic breakthrough. The individual components (multi-turn interaction, visual differencing, ensemble methods) are not novel, though their combination and systematic evaluation in robotics is valuable.

Verdict: INCREMENTAL — solid systematic study that reveals important insights about coding agents in robotics but combines known techniques rather than introducing fundamental innovations.

Benchmarks & Results

CaP-Bench core tasks (average across 12 models): S4 (low-level, no examples): ~15-25% success rate; S3 (low-level with examples): ~20-30%; S2 (high-level with perception): ~35-45%; S1 (high-level with ground truth): ~50-60%. Human expert baseline: 73-100% across tasks.
CaP-Agent0 on CaP-Bench: Achieves 68% average success rate vs 24% for S3 baseline. Reaches human-level performance (about 90% or higher) on 4/7 tasks: Cube Lift (97%), Cube Stack (98%), Spill Wipe (100%), and Peg Insertion (89%).
LIBERO-PRO comparison: CaP-Agent0 achieves 0.22 (Pos)/0.18 (Task) on libero-object vs π0.5’s 0.17/0.01, and 0.26/0.17 on libero-goal vs π0.5’s 0.38/0.00. OpenVLA and π0 achieve 0.00 across all metrics.
BEHAVIOR mobile manipulation: CaP-Agent0 achieves 56% task success on radio pickup (vs 24% S3 baseline) and 72% on soda can pickup (vs 32% baseline).
CaP-RL simulation results: Qwen 2.5 Coder with RL achieves 80% on Cube Lift (vs 25% base), 44% on Cube Stack (vs 4% base), 93% on Spill Wipe (vs 30% base).
Real-world deployment: CaP-RL model maintains high sim-to-real transfer with 84% success on Cube Lift and 76% on Cube Stack on Franka robot.

Results are mixed across different abstraction levels and tasks, with consistent improvements from multi-turn interaction and visual grounding. Notable absence of comparison to other code-generation robotics methods beyond VLA baselines.

Compute & Efficiency

Model size: Evaluates models ranging from 7B parameters (Qwen2.5-Coder) to 235B parameters (Qwen3), with closed-source models of unknown size (Gemini-3-Pro, GPT-5.2, Claude Opus 4.5).
Training compute: CaP-RL training for 50 iterations per task on 7B model. Specific GPU hours and hardware details not reported. CaP-Agent0 is training-free.
Inference speed: Not explicitly reported. CaP-Agent0 uses parallel sampling with up to 9 queries per turn, suggesting significant inference overhead.
Memory footprint: Not reported, though framework supports both simulation and real robot deployment.
Deployment practicality: Successfully deployed on real robots (Franka Panda, AgiBot G1) with zero-shot transfer. Framework designed for compatibility between simulation and physical systems. However, multi-model ensemble approach likely requires significant computational resources for practical deployment.

Real-World Applicability

Real robot deployment: Successfully deployed on Franka Panda and AgiBot G1 robots performing manipulation tasks like cube lifting, stacking, and complex reasoning tasks (finding objects under cups, solving math problems with physical blocks).
Sim-to-real transfer: CaP-RL achieves minimal sim-to-real gap, maintaining 84% success on Cube Lift and 76% on Cube Stack when transferring from simulation to real Franka robot.
Cross-embodiment generalization: Framework transfers between single-arm and bimanual robots with minimal modifications (primarily primitive-level changes for different control interfaces).
Long-horizon mobile manipulation: Demonstrated on BEHAVIOR tasks requiring navigation, search, and manipulation with R1Pro humanoid robot in complex indoor environments.
Interactive correction: Supports human-in-the-loop correction where users can provide additional feedback between execution turns.
Production readiness: While demonstrated on real robots, the multi-model ensemble approach may be computationally expensive for continuous deployment. The framework shows promise for research and development applications but may need optimization for production robotics systems.

Limitations & Failure Modes

Contact-rich manipulation: FUNDAMENTAL - Programmatic control remains brittle for tasks requiring tight visual servoing and continuous feedback (insertion, pouring) compared to VLA approaches.
Perception noise sensitivity: ENGINEERING - Performance degrades significantly when moving from ground-truth state (S1) to noisy perception (S2), indicating sensitivity to visual estimation errors.
Cross-modal alignment gap: FUNDAMENTAL - Direct visual input (M2 tier) degrades performance compared to text-only feedback, suggesting foundation models struggle to jointly reason over code and physical images.
Computational overhead: ENGINEERING - CaP-Agent0 requires multiple model queries and ensemble reasoning, creating significant inference-time costs that may limit practical deployment.
Limited primitive expressivity: FUNDAMENTAL - Low-level primitives still constrain the action space compared to direct motor control, potentially limiting behaviors that require fine-grained control.
Scale dependence: EVALUATION - Most impressive results achieved with large closed-source models; open-source alternatives consistently underperform.

Failure modes:
Compounding errors in multi-step tasks where early perception or control failures cascade
Inability to recover from physical disturbances that move objects outside expected positions or orientations

Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

Authors: Yunheng Li, Hangyi Kuang, Hengrui Zhang, Jiangxia Cao et al. (7 authors) · Institution: Nankai University · Category: cs.CV

PEPO combines visual similarity and token entropy through a gating mechanism to provide fine-grained token-level advantages for vision-language model reinforcement learning, achieving modest but consistent improvements over sequence-level optimization.

Practical Takeaway: If you’re working with vision-language models and RLVR frameworks like GRPO, this method offers a straightforward way to get 1-4 point improvements on reasoning tasks. The key insight is reweighting policy gradients based on both visual similarity and token entropy rather than uniform sequence-level advantages. Implementation is relatively simple - compute cosine similarity between response and vision token hidden states, combine with entropy via a gating function, and use the result to modulate token-level advantages. Worth trying if you have the computational budget for layer-wise hidden state extraction during training, but don’t expect dramatic breakthroughs.

Tags: multimodal-reasoning vision-language-models reinforcement-learning chain-of-thought token-level-optimization visual-grounding policy-optimization RLHF

Task & Setting

Vision-language models (VLMs) need to construct coherent reasoning trajectories that combine visual perception with step-by-step inference, but existing reinforcement learning methods apply rewards uniformly across all tokens, ignoring how different tokens contribute to visual grounding versus reasoning exploration.

The task is multimodal chain-of-thought (CoT) reasoning across diverse settings: geometry reasoning (e.g., solving for angles/lengths in diagrams), visual grounding (localizing objects described in text), visual puzzle solving (pattern recognition in abstract images), and few-shot classification. Input consists of an image and text query, output is a structured reasoning chain followed by final answer. The objective optimizes:

\[\mathcal{J}_G(\theta) = \mathbb{E}\left[\min\left(r_t^{(i)} A^{(i)}, \text{clip}(r_t^{(i)}, 1-\epsilon, 1+\epsilon) A^{(i)}\right)\right]\]

where $A^{(i)}$ is sequence-level advantage from GRPO framework.
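
The clipped term inside the expectation can be evaluated directly. In this sketch $r_t^{(i)}$ is taken to be the usual probability ratio between the current and old policy, and `eps=0.2` is a commonly used default rather than a value reported here.

```python
import numpy as np


def clipped_surrogate(ratio, advantage: float, eps: float = 0.2) -> np.ndarray:
    """Per-token min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)


# Three tokens of one response sharing a positive sequence-level advantage.
pos = clipped_surrogate([0.5, 1.0, 1.5], advantage=2.0)   # [1.0, 2.0, 2.4]
# The same ratios with a negative advantage.
neg = clipped_surrogate([0.5, 1.0, 1.5], advantage=-2.0)  # [-1.6, -2.0, -3.0]
```

With a positive advantage the gain from raising a token's probability is capped at ratio $1+\epsilon$; with a negative advantage the gain from lowering it is capped at $1-\epsilon$.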

Success is measured by task-specific accuracy metrics: percentage correct for reasoning tasks, IoU@50 for grounding, classification accuracy for few-shot learning. The paper evaluates on Geometry3K, MathVista, MathVerse, LogicVista, RefCOCO, LISA-Grounding, PuzzleVQA, AlgoPuzzleVQA, FGVC Aircraft, and Flower102 datasets.
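
For the grounding metric, IoU@50 counts a prediction as correct when its box overlaps the target with intersection-over-union of at least 0.5. A minimal sketch follows, assuming boxes in (x1, y1, x2, y2) format and an inclusive threshold; the helper names are illustrative.

```python
def box_iou(a, b) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_at_50(predictions, targets) -> float:
    """Fraction of predictions whose IoU with the target is at least 0.5."""
    hits = [box_iou(p, t) >= 0.5 for p, t in zip(predictions, targets)]
    return sum(hits) / len(hits)


preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 8), (5, 5, 15, 15)]  # IoU 0.8 (hit) and about 0.14 (miss)
score = iou_at_50(preds, gts)          # 0.5
```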

No new dataset introduced - uses existing benchmarks for evaluation.

Architecture & Method

Base models: Qwen2.5-VL-3B-Instruct and InternVL3-2B-Instruct, vision-language models that pair a vision encoder with a decoder-only language model
Token-level visual similarity computation: For each response token, compute cosine similarity with all vision token hidden states across all layers:
\[VS_t = \frac{1}{L} \sum_{l=1}^L \frac{1}{N} \sum_{n=1}^N \frac{\langle h_{l,t}, v_{l,n} \rangle}{\|h_{l,t}\| \|v_{l,n}\|}\]
Token entropy calculation from output logits:
\[H_t^{(i)} = -\sum_{x \in V} p_\theta(x|s_t^{(i)}) \log p_\theta(x|s_t^{(i)})\]
Perception-exploration fusion via smooth gating mechanism:
\[\hat{g}_t^{(i)} = \hat{VS}_t^{(i)} + \hat{H}_t^{(i)} - \text{mean}_t(\hat{VS}^{(i)} + \hat{H}^{(i)})\]
\[w_t^{(i)} = T \cdot \text{Softmax}\left((1 + \alpha \tanh(\hat{g}_t^{(i)})) \cdot VS_t^{(i)}\right)\]
Token-level advantage weighting:
\[A_t^{(i)} = \left((1-\lambda) + \lambda w_t^{(i)}\right) A^{(i)}\]
Core contribution: Integrating visual similarity (perception) and entropy (exploration) through gated fusion to produce fine-grained token-level advantages, rather than uniform sequence-level supervision.
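
The gating and weighting formulas above can be traced on a toy response. Two details are assumptions made for this sketch because the summary does not specify them: the hatted quantities are taken to be min-max normalized over the response, and $T$ is taken to be the response length. The function names are hypothetical.

```python
import numpy as np


def minmax(x: np.ndarray) -> np.ndarray:
    """Assumed normalization for the hatted quantities (not specified above)."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)


def pepo_token_advantages(vs, ent, seq_adv: float,
                          alpha: float = 0.05, lam: float = 1.0) -> np.ndarray:
    """Apply the gate, softmax weighting, and advantage blend to one response."""
    vs, ent = np.asarray(vs, dtype=float), np.asarray(ent, dtype=float)
    fused = minmax(vs) + minmax(ent)
    g = fused - fused.mean()                     # centered gate input
    scores = (1.0 + alpha * np.tanh(g)) * vs     # gate modulates visual similarity
    soft = np.exp(scores - scores.max())
    soft /= soft.sum()
    w = len(vs) * soft                           # T * softmax, so weights average to 1
    return ((1.0 - lam) + lam * w) * seq_adv     # token-level advantages


vs = [0.10, 0.40, 0.20, 0.35]  # toy visual similarities
ent = [0.5, 2.0, 0.3, 1.5]     # toy entropies
adv = pepo_token_advantages(vs, ent, seq_adv=1.0)
uniform = pepo_token_advantages(vs, ent, seq_adv=1.0, lam=0.0)
```

Because the weights average to one, the mean token advantage equals the sequence-level advantage; the method redistributes credit across tokens rather than changing its total. Setting `lam=0.0` recovers uniform sequence-level weighting.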

Training Recipe

RLVR training stage: Uses Group Relative Policy Optimization (GRPO) or Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) frameworks
- Data: Task-specific datasets (Geometry3K, RefCOCO samples, etc.) with verifiable rewards
- Optimizer: AdamW with full-parameter fine-tuning
- Precision: bfloat16 with gradient checkpointing
- Sampling: 8 responses per query, temperature=1.0, top-p=1.0
- Hardware: 8 NVIDIA A40 GPUs with DeepSpeed ZeRO-2
- Schedule: λ parameter linearly increases from 0 to 1 over training steps
- Wall-clock time: Not reported
Hyperparameter tuning: α coefficient tuned per dataset (typically 0.02-0.05)
Implementation: Swift framework for distributed training, computational overhead <1% of total training time
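
The linear λ schedule in the recipe above, and its effect on the advantage multiplier $(1-\lambda) + \lambda w_t$, can be sketched as follows. The exact endpoints and step counting are assumptions; the summary only states that λ rises linearly from 0 to 1.

```python
def lambda_schedule(step: int, total_steps: int) -> float:
    """Linear ramp from 0 at the first step to 1 at the last step."""
    if total_steps <= 1:
        return 1.0
    return min(1.0, max(0.0, step / (total_steps - 1)))


def blend_weight(w_t: float, lam: float) -> float:
    """Multiplier applied to the sequence-level advantage: (1 - lam) + lam * w_t."""
    return (1.0 - lam) + lam * w_t


schedule = [lambda_schedule(s, 5) for s in range(5)]
# → [0.0, 0.25, 0.5, 0.75, 1.0]
```

Early in training (λ near 0) every token receives the plain sequence-level advantage, and token-level reweighting phases in as λ grows.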

Training uses existing verifiable rewards from each benchmark without additional supervision or auxiliary branches.

Novelty & Lineage

Prior work:

GRPO (Shao et al. 2024) - Group Relative Policy Optimization for sequence-level RL with verifiable rewards, widely used for LLM/VLM reasoning
High-Entropy RL (Wang et al. 2025) - Token-level entropy advantages to encourage exploration at uncertain reasoning steps, but text-only focus
PAPO (Wang et al. 2025) - Perception-Aware Policy Optimization using auxiliary masking branches and attention measures for visual grounding

Delta: This paper combines visual similarity (derived from hidden state correlations between response and vision tokens) with token entropy through a smooth gating mechanism, avoiding auxiliary branches while capturing both perceptual grounding and reasoning uncertainty.

Applied-specific assessment:
- Architectural idea: Incremental - combines known techniques (cosine similarity, entropy weighting) in a straightforward way
- Benchmark gains: Modest but consistent - most gains over GRPO fall between 1.5 and 5 points, with outliers ranging from +0.32 (RefCOCO) to +11.85 (FGVC Aircraft)
- Comparisons: Reasonably fair, uses same base models and training setups, though some baselines show instability
- Scale dependency: Method appears to work without large compute/data requirements, integrates into existing frameworks
The core insight about perception-exploration complementarity is reasonable but not particularly novel. The gating mechanism is a standard technique applied to a new domain.

Verdict: INCREMENTAL — solid engineering contribution combining existing techniques for modest but consistent improvements in multimodal reasoning.

Benchmarks & Results

Geometry3K (validation): PEPOG 22.80% vs GRPO 19.00%, +3.80 improvement
Geometry3K (test): PEPOG 27.27% vs GRPO 23.79%, +3.48 improvement
MathVista-mini: PEPOG 54.45% vs GRPO 51.56%, +2.89 improvement
MathVerse-mini: PEPOG 45.42% vs GRPO 40.54%, +4.88 improvement
LogicVista: PEPOG 34.45% vs GRPO 28.30%, +6.15 improvement
RefCOCO validation IoU@50: PEPOG 90.44% vs GRPO 90.12%, +0.32 improvement
LISA-Grounding IoU@50: PEPOG 65.26% vs GRPO 62.42%, +2.84 improvement
FGVC Aircraft (4-shot): PEPOG 75.79% vs GRPO 63.94%, +11.85 improvement
PuzzleVQA: PEPOG 45.00% vs GRPO 43.20%, +1.80 improvement
AlgoPuzzleVQA: PEPOG 26.94% vs GRPO 25.44%, +1.50 improvement

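Results show consistent but mostly modest improvements over GRPO on every benchmark listed. The largest gains are on few-shot classification (FGVC Aircraft) and logical reasoning (LogicVista); the smallest is on RefCOCO, where the baseline is already above 90%. The High-Entropy RL baseline was often unstable or collapsed during training.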
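To put the per-benchmark numbers in one place, the following sketch recomputes each gain from the scores listed above and summarizes them. The dictionary simply transcribes the figures from this digest; nothing here is taken from the paper beyond those values.

```python
import statistics

# (PEPOG score, GRPO score) as listed in the digest, in percent.
results = {
    "Geometry3K (validation)": (22.80, 19.00),
    "Geometry3K (test)": (27.27, 23.79),
    "MathVista-mini": (54.45, 51.56),
    "MathVerse-mini": (45.42, 40.54),
    "LogicVista": (34.45, 28.30),
    "RefCOCO val IoU@50": (90.44, 90.12),
    "LISA-Grounding IoU@50": (65.26, 62.42),
    "FGVC Aircraft (4-shot)": (75.79, 63.94),
    "PuzzleVQA": (45.00, 43.20),
    "AlgoPuzzleVQA": (26.94, 25.44),
}

# Absolute gain in percentage points for each benchmark.
gains = {name: round(ours - base, 2) for name, (ours, base) in results.items()}

mean_gain = statistics.fmean(list(gains.values()))
median_gain = statistics.median(list(gains.values()))
largest = max(gains, key=gains.get)
smallest = min(gains, key=gains.get)
```

The mean gain (about 3.95 points) sits above the median (about 3.2 points) because the FGVC Aircraft result pulls the average upward.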

Compute & Efficiency

Model size: 3B parameters (Qwen2.5-VL) and 2B parameters (InternVL3)
Training compute: 8 NVIDIA A40 GPUs, DeepSpeed ZeRO-2, wall-clock time not reported
Inference speed: Unchanged from baseline GRPO, since token weighting is applied only during training; the weight computation adds <1% to training time
Memory footprint: Uses gradient checkpointing and bfloat16 precision, no significant memory increase over baseline
Deployment practicality: Integrates with existing RLVR frameworks (GRPO/DAPO) without auxiliary branches or additional supervision; suitability for production is untested, since no deployment results are reported

Real-World Applicability

Evaluation limited to standard academic benchmarks - no deployment results reported
No hardware experiments with actual robots or autonomous systems mentioned
No production integration or real-world testing discussed
Sim-to-real transfer not addressed
Analysis focuses on curated datasets rather than noisy real-world data
Missing evaluation on challenging real-world scenarios like medical imaging, industrial inspection, or safety-critical applications

The work remains primarily academic benchmark-focused without demonstrating real-world robustness.

Limitations & Failure Modes

FUNDAMENTAL: Method requires layer-wise hidden state access during training, limiting compatibility with some deployment frameworks
FUNDAMENTAL: Visual similarity computation assumes meaningful correlation between hidden states and visual grounding, which may not hold across all model architectures
ENGINEERING: Hyperparameter α requires per-dataset tuning, reducing generalizability
ENGINEERING: Computational overhead increases with number of layers and vision tokens
EVALUATION: Limited to academic benchmarks, no real-world deployment validation
EVALUATION: Baseline High-Entropy RL frequently collapses, making comparisons potentially unfair

Failure modes:
- May amplify biases in visual token representations if the base model has poor visual grounding
- Could fail on tasks where reasoning requires abstract thinking disconnected from visual elements

SG-VLA: Learning Spatially-Grounded Vision-Language-Action Models for Mobile Manipulation

Authors: Ruisen Tu, Arth Shukla, Sohyun Yoo, Xuanlin Li et al. (8 authors) · Institution: UC San Diego · Category: cs.RO

SG-VLA improves vision-language-action models for mobile manipulation through auxiliary task co-training and multi-modal input enhancement, achieving 22% better success rates in household simulation tasks.

Practical Takeaway: Research engineers working on robot learning should consider auxiliary task co-training as a viable approach for improving VLA model representations, particularly the progressive training scheme to avoid gradient interference. The multi-view + depth input enhancement provides clear benefits and is straightforward to implement. However, focus on simulation-to-real transfer validation before deploying these techniques, as the auxiliary tasks rely heavily on simulation ground truth that may not be available on real robots. The mixed results with Flow Matching suggest task-adaptive action generation warrants further investigation.

Tags: mobile_manipulation vision_language_action auxiliary_learning multi_modal_input household_robotics imitation_learning progressive_training depth_perception

arXiv · PDF

Task & Setting

Mobile manipulation in household environments requires robots to coordinate navigation and manipulation while following natural language instructions. This is challenging because it involves controlling high-dimensional continuous action spaces (13 dimensions including base motion, arm articulation, and gripper), reasoning about global scene structure and fine-grained object geometry, and handling partial observability in unstructured home environments.

The task takes multi-modal inputs including multi-view RGB observations (head and hand cameras), depth information, and natural language commands, then outputs 13-dimensional continuous actions:

\[\Delta X \in \mathbb{R}^3 \text{ (base pose)}, \Delta z \in \mathbb{R} \text{ (torso height)}, \Delta q \in \mathbb{R}^7 \text{ (arm joints)}, \Delta G \in \mathbb{R}^2 \text{ (gripper)}\]
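One way to make the 13-dimensional action layout concrete is a small container that splits a flat vector into its four parts. This is only a sketch: the class and function names are hypothetical, and the component ordering is an assumption that follows the order of the equation above rather than anything the benchmark guarantees.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class MobileManipAction:
    base: List[float]     # delta base pose, 3 values
    torso: float          # delta torso height, 1 value
    arm: List[float]      # delta arm joints, 7 values
    gripper: List[float]  # delta gripper, 2 values

def split_action(a: Sequence[float]) -> MobileManipAction:
    """Split a flat 13-D action into base, torso, arm, and gripper parts.

    Assumed ordering: [base(3), torso(1), arm(7), gripper(2)].
    """
    if len(a) != 13:
        raise ValueError(f"expected 13 action dimensions, got {len(a)}")
    return MobileManipAction(
        base=list(a[0:3]),
        torso=float(a[3]),
        arm=list(a[4:11]),
        gripper=list(a[11:13]),
    )
```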

Success is measured by task completion rates across four fundamental subtasks: Pick, Place, Open, and Close operations in household scenarios like TidyHouse, PrepareGroceries, and SetTable.

The paper evaluates on ManiSkill-HAB benchmark containing 44K episodes with 1.4M transitions across three long-horizon household tasks, providing comprehensive coverage of household manipulation scenarios with varying difficulty levels.

Architecture & Method

VLM backbone: Prismatic architecture with dual visual encoder (DINOv2 + SigLIP) and Qwen2.5-0.5B LLM (1.3B total parameters)
Multi-modal input processing: Head/hand RGB cameras, a short temporal history (4 timesteps), and depth maps normalized as
\[p_{obs} = 1 - \tanh\left(\frac{\text{depth value}}{1000}\right)\]
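The depth normalization can be written as a one-line function. The sketch below assumes raw depth is given in millimeters (suggested by the divisor of 1000 but not stated in the digest); the function name is hypothetical.

```python
import math

def normalize_depth(depth_value, scale=1000.0):
    """Map a raw depth reading to (0, 1]: near points approach 1, far points approach 0.

    Assumption: depth_value is non-negative and in the same unit as scale
    (millimeters, if scale=1000 corresponds to one meter).
    """
    return 1.0 - math.tanh(depth_value / scale)
```

The tanh squashing compresses distant readings toward 0 while keeping most of the output range for nearby geometry, which is where manipulation happens.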
Auxiliary decoder suite operating on shared VLM features:
- Global position decoder (MLP): predicts 2D robot coordinates with
\[L_{pos} = \text{MSE}(\hat{p}, p)\]
- Grasp success decoder (MLP): binary classification with
\[L_{grasp} = \text{CrossEntropy}(\hat{y}, y)\]
- Object pose decoder (MLP): 7D pose prediction with
\[L_{obj} = ||\hat{t} - t||_2^2 + (1 - |\hat{q} \cdot q|)\]
- Joint pose decoder (Transformer): 12D joint configuration with
\[L_{qpos} = \text{MSE}(\hat{J}, J)\]
- Segmentation decoder (CNN): binary masks with
\[L_{seg} = \text{CrossEntropy}(\hat{M}, M)\]
Optional Flow Matching action expert (100M parameters) for continuous action generation
Combined loss function:
\[L_{auxiliary} = \sum_{task} \lambda_{task} L_{task}\]
with task-specific weights
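A minimal sketch of two of these pieces follows: the object pose loss, and the weighted sum over auxiliary tasks. Function names are hypothetical, quaternions are assumed to be unit-norm, and the default weights are the ones this digest lists in the training recipe (λ_grasp = 5.0, all others 1.0).

```python
def object_pose_loss(t_pred, t_true, q_pred, q_true):
    """L_obj = ||t_pred - t_true||^2 + (1 - |q_pred . q_true|).

    Assumption: both quaternions are unit-norm. The absolute value makes
    the rotation term identical for q and -q, which encode the same rotation.
    """
    translation = sum((a - b) ** 2 for a, b in zip(t_pred, t_true))
    dot = sum(a * b for a, b in zip(q_pred, q_true))
    return translation + (1.0 - abs(dot))

# Task weights as reported in the digest's training recipe.
LOSS_WEIGHTS = {"pos": 1.0, "grasp": 5.0, "qpos": 1.0, "obj": 1.0, "seg": 1.0}

def auxiliary_loss(task_losses, weights=LOSS_WEIGHTS):
    """L_auxiliary = sum over tasks of lambda_task * L_task."""
    return sum(weights[name] * value for name, value in task_losses.items())
```

Passing only a subset of tasks to `auxiliary_loss` is one simple way to reflect batches where a task has no labels, such as segmentation outside Pick/Place episodes.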

Training Recipe

Stage 1 (Decoder Adaptation): Freeze gradient flow from auxiliary decoders to VLM backbone, train only discrete action prediction path. Adam optimizer, 2e-5 learning rate, batch size 512, 3 epochs for SetTable data, 2 epochs for Pick/Place data.
Stage 2 (Joint Refinement): Enable full gradient flow, co-train all auxiliary decoders with VLM backbone. Same optimizer settings, 7 epochs for SetTable, 4 epochs for Pick/Place data.
Stage 3 (Action Head Training): Freeze VLM backbone completely, train Flow Matching action head in isolation with denoising loss. Action chunks of size 8, 10 denoising steps.

Training hardware: 8 NVIDIA A100 GPUs. Loss weights: λ_pos=1.0, λ_grasp=5.0, λ_qpos=1.0, λ_obj=1.0, λ_seg=1.0. Data filtering: segmentation tasks use only Pick/Place episodes.

Novelty & Lineage

Step 1 — Prior work: OpenVLA (Kim et al. 2024) established VLA models for tabletop manipulation using discrete action tokens. RT-2 (Brohan et al. 2023) demonstrated vision-language-action models but in constrained settings. π0 (Black et al. 2024) introduced Flow Matching for continuous robot control.

Step 2 — Delta: This paper adds:

systematic auxiliary task co-training with five complementary decoders
multi-stage progressive training to prevent gradient interference
multi-modal input enhancement (multi-view + depth), and
application to mobile manipulation (13D action space).

Step 3 — Applied-specific assessment:
- Architectural contribution is incremental: auxiliary decoders and multi-modal inputs are well-established techniques
- Benchmark gains are meaningful: a 22% relative improvement in average success rate (0.60→0.73) over the baseline, but the only comparison is against direct imitation learning on the same architecture
- Progressive training addresses a real engineering problem but is not conceptually novel
- Results likely depend on specific simulation environment and may not transfer to real-world settings
- Missing comparisons to other mobile manipulation approaches or modular systems
Verdict: INCREMENTAL — solid engineering contribution applying known auxiliary training techniques to mobile manipulation, but lacks architectural novelty or comprehensive baseline comparisons.

Benchmarks & Results

Pick All Objects: SG-VLA 0.13 vs baseline 0.16 (worse), with action head 0.27
Place All Objects: SG-VLA 0.70 vs baseline 0.56 (+25%), with action head 0.80
Open Fridge: SG-VLA 0.87 vs baseline 0.67 (+30%), with action head 0.76
Open Drawer: SG-VLA 0.77 vs baseline 0.36 (+114%), with action head 0.60
Close Fridge: SG-VLA 0.90 vs baseline 0.83 (+8%), with action head 0.76
Close Drawer: SG-VLA 1.00 vs baseline 1.00 (same), with action head 0.97

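Overall average: SG-VLA 0.73 vs baseline 0.60 (+22% relative). Results with the action head are mixed (0.69 average): it improves Pick and Place but lowers success on the Open and Close tasks. Drawer opening improves substantially, while gains elsewhere are inconsistent and Pick All Objects falls slightly below the baseline. The paper does not compare against other mobile manipulation methods and reports no real-world validation.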
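The averages and the relative gain can be re-derived from the six per-task success rates listed above. The snippet only transcribes this digest's numbers; variable names are illustrative.

```python
# (SG-VLA, baseline, SG-VLA with action head) success rates from the digest.
results = {
    "Pick All Objects": (0.13, 0.16, 0.27),
    "Place All Objects": (0.70, 0.56, 0.80),
    "Open Fridge": (0.87, 0.67, 0.76),
    "Open Drawer": (0.77, 0.36, 0.60),
    "Close Fridge": (0.90, 0.83, 0.76),
    "Close Drawer": (1.00, 1.00, 0.97),
}

def column_mean(index):
    """Unweighted mean of one column across the six tasks."""
    values = [row[index] for row in results.values()]
    return sum(values) / len(values)

sg_avg = column_mean(0)
base_avg = column_mean(1)
head_avg = column_mean(2)

# Relative improvement of SG-VLA over the baseline average.
relative_gain = (sg_avg - base_avg) / base_avg
```

The 22% figure is a relative gain on the unweighted task average; in absolute terms the average success rate rises by about 0.13.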

Compute & Efficiency

Model size: 1.3B parameters (VLM backbone) + 100M parameters (optional Flow Matching action head) = 1.4B total
Training compute: 8 NVIDIA A100 GPUs, training time not reported, 10-15 epochs depending on stage
Inference speed/latency: Not reported
Memory footprint: Not reported
Deployment practicality: Simulation-only evaluation, no real robot deployment or hardware considerations discussed. Model size suggests reasonable deployment potential but lacks validation.

Real-World Applicability

No real robot experiments reported - all evaluation conducted in ManiSkill-HAB simulation environment
No hardware deployment results or integration studies
No sim-to-real transfer analysis or discussion of domain gaps
No testing on real household environments or objects
Auxiliary tasks designed for simulation ground truth (perfect segmentation, exact joint poses) may not transfer to noisy real-world sensing

Limitations & Failure Modes

Simulation-only evaluation - FUNDAMENTAL limitation as real-world household environments have different dynamics, sensing noise, and object properties
Limited baseline comparisons - EVALUATION gap as paper only compares against direct imitation learning, missing comparisons to modular navigation+manipulation systems
Auxiliary task dependence on perfect ground truth - ENGINEERING issue as real robots lack perfect segmentation masks and joint angle feedback
Flow Matching action head shows inconsistent benefits - FUNDAMENTAL trade-off between continuous control precision and discrete action decisiveness
Progressive training adds complexity - ENGINEERING overhead requiring careful hyperparameter tuning

Failure modes:
Model likely fails when auxiliary ground truth is unavailable or noisy in real settings
Performance may degrade significantly in environments with different visual characteristics or object types than training data.

DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

Authors: Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan et al. (13 authors) · Institution: Hong Kong University of Science and Technology (Guangzhou), Huawei Foundation Model Department · Category: cs.CV

DualCoT-VLA introduces parallel visual-linguistic chain-of-thought reasoning via learnable query tokens, enabling VLA models to combine spatial perception and logical planning in a single forward pass while running roughly 38x faster than autoregressive CoT methods (83.2 ms vs 3178.5 ms per inference).

Practical Takeaway: Parallel implicit reasoning through learnable query tokens can combine spatial and logical reasoning in VLA models while avoiding the autoregressive inference bottleneck. The roughly 38x latency reduction over sequential CoT (83.2 ms vs 3178.5 ms) makes the architecture practical for real-time robotic control. It is worth implementing for manipulation tasks that require both precise spatial understanding and multi-step planning. The auxiliary teacher setup offers a clear recipe for distilling multimodal reasoning into a model that stays efficient at inference time.

Tags: robotics vision-language-action chain-of-thought manipulation multimodal-reasoning diffusion-transformer query-tokens implicit-reasoning

arXiv · PDF

Task & Setting

Vision-Language-Action (VLA) models address the challenge of enabling robots to perform complex manipulation tasks by directly mapping visual observations and language instructions to robotic actions. Standard VLA models struggle with multi-step tasks requiring logical planning and precise spatial perception due to their direct observation-to-action mapping approach.

The task involves processing visual observations (RGB images) and natural language instructions to generate continuous robotic actions. The formal objective combines three loss components:

\[\mathcal{L}_{total} = \lambda_{vis}\mathcal{L}_{vis} + \lambda_{lin}\mathcal{L}_{lin} + \lambda_{act}\mathcal{L}_{act}\]

Success is measured by task completion rates on robotic manipulation benchmarks. The paper evaluates on LIBERO (4 task suites with a 7-DoF robotic arm), RoboCasa GR1 (24 tabletop tasks with a 29-DoF dexterous hand), and real-world experiments on an AgileX Cobot dual-arm robot with 7-DoF arms. LIBERO uses 500 episodes per task suite, RoboCasa uses 50 episodes per task, and the real-world evaluation uses 25 trials per task on three tasks of increasing complexity.

Architecture & Method

VLM backbone uses Qwen3-VL-4B to process unified input sequence containing visual observations, language instructions, and learnable query tokens: $X_{input} = [V_{obs}, Q_{vis}, L_{instr}, Q_{lin}]$
Parallel implicit CoT mechanism with two sets of learnable query tokens: visual CoT tokens $Q_{vis} \in \mathbb{R}^{16 \times d_{VLM}}$ and linguistic CoT tokens $Q_{lin} \in \mathbb{R}^{4 \times d_{VLM}}$
Visual CoT stream aligns visual query hidden states with frozen Depth Anything 3 features using cross-attention projector and MSE loss:
\[\mathcal{L}_{vis} = MSE(\hat{F}_{DA3}, F_{DA3})\]
Linguistic CoT stream uses frozen Qwen3-0.6B to decode linguistic query states into explicit CoT text with cross-entropy loss:
\[\mathcal{L}_{lin} = -\sum_{i=1}^{L} \log p_\phi(y_i | \mathcal{P}_{lin}(H_{lin}), y_{<i})\]
Flow-Matching DiT action head predicts continuous actions with vector field matching objective:
\[\mathcal{L}_{act} = \mathbb{E}_{t,a_0,A}[\|v_\theta(a_t, t, H_{vlm}) - (A - a_0)\|_2^2]\]
Core contribution is parallel reasoning in continuous latent space that combines low-level spatial perception (visual CoT) with high-level logical planning (linguistic CoT) without autoregressive decoding.
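The flow-matching objective for the action head can be illustrated with a toy example. The sketch below assumes a linear interpolation path a_t = (1 - t) a_0 + t A, which is the path consistent with the target velocity (A - a_0) in the loss above; the digest does not state the path, the step count, or the sampler, so those are assumptions, and conditioning on H_vlm is omitted. All function names are hypothetical.

```python
def interpolate(a0, A, t):
    """Point on the assumed linear path from noise a0 (t=0) to action A (t=1)."""
    return [(1 - t) * n + t * a for n, a in zip(a0, A)]

def flow_matching_loss(v_pred, a0, A):
    """Squared error between predicted velocity and the target (A - a0)."""
    return sum((v - (a - n)) ** 2 for v, a, n in zip(v_pred, A, a0))

def euler_sample(velocity_fn, a0, steps=10):
    """Integrate da/dt = v(a, t) from t=0 to t=1 with fixed-step Euler.

    velocity_fn stands in for the trained action head; steps=10 is an
    illustrative default, not a value reported for this model.
    """
    a = list(a0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(a, t)
        a = [x + dt * vx for x, vx in zip(a, v)]
    return a
```

With an oracle velocity field that always returns A - a_0, the sampler lands on A regardless of step count, which is a useful sanity check before plugging in a learned network.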

Training Recipe

Pretraining stage: Joint training on robotic demonstration datasets with multi-task learning across LIBERO and RoboCasa benchmarks
Training details:
- LIBERO: learning rate 2.5e-5, action window 7 steps, global batch size 48
- RoboCasa GR1: learning rate 3e-5, action window 15 steps, global batch size 256
- Hardware: NVIDIA H100 GPUs
- Wall-clock time: not reported
Data sources:
- LIBERO: existing benchmark demonstrations
- RoboCasa GR1: generated CoT annotations using Qwen3-VL-32B
- Real-world: 100 human-teleoperated demonstrations per task
Optimizer and schedule details: not reported
Loss weighting: λ_vis = 0.1, λ_lin = 0.1, λ_act = 1.0

Novelty & Lineage

Prior work:

CoT-VLA (Zhao et al., 2025): Visual chain-of-thought with explicit sub-goal image generation, achieved 83.9% on LIBERO
ThinkAct (Huang et al., 2025): Linguistic-only CoT with autoregressive text generation, achieved 84.4% on LIBERO
Fast-ThinkAct (Huang et al., 2026): Implicit reasoning in latent space but still autoregressive, achieved 89.7% on LIBERO

Delta: This paper combines visual and linguistic CoT in a parallel mechanism rather than sequential/autoregressive. Uses learnable query tokens to extract both spatial (via DA3 alignment) and logical reasoning (via LLM supervision) simultaneously in single forward pass.

Assessment:
- Architectural idea: Moderately novel - parallel dual-modal CoT is logical extension but non-obvious combination of existing techniques
- Benchmark gains: Meaningful improvements (98.8% vs 97.9% best prior on LIBERO, 55.1% vs 48.8% on RoboCasa)
- Fair comparisons: Uses same evaluation protocols, though some baselines use different architectures
- Generalization: Improvements consistent across multiple benchmarks and real-world deployment
- Scalability concerns: Relies on auxiliary teacher models during training, gains may not transfer to different scale/data
Verdict: INCREMENTAL — Solid engineering contribution combining known techniques (visual/linguistic CoT, query tokens) in parallel architecture, with consistent but modest improvements over strong baselines.

Benchmarks & Results

LIBERO benchmark: 98.8% average success rate vs. 97.9% previous best (LaRA-VLA), +0.9 points
LIBERO Spatial suite: 99.4% vs. 98.8% previous best (π0.5), +0.6 points
LIBERO Object suite: 99.8% vs. 98.6% previous best (LaRA-VLA), +1.2 points
LIBERO Goal suite: 97.8% vs. 99.8% previous best (LaRA-VLA), -2.0 points
LIBERO Long suite: 98.2% vs. 96.6% previous best (LaRA-VLA), +1.6 points
RoboCasa GR1 average: 55.1% vs. 48.8% previous best (Qwen3OFT), +6.3 points
Real-world bread task: 64% vs. 48% (GR00T-N1.6), +16 points
Real-world blocks task: 56% vs. 32% (GR00T-N1.6), +24 points
Real-world fruits task: 48% vs. 20% (GR00T-N1.6), +28 points

Results show consistent improvements across benchmarks, with particularly strong gains on real-world tasks and spatially demanding scenarios.

Compute & Efficiency

Model size: Qwen3-VL-4B backbone plus auxiliary DA3 and Qwen3-0.6B teachers during training (total parameters not specified)
Training compute: Multiple NVIDIA H100 GPUs, specific GPU hours not reported
Inference speed: 83.2ms total latency (58.1ms VLM + 25.1ms action head) vs. 3178.5ms for autoregressive CoT baseline
Memory footprint: Not specified
Deployment practicality: High - achieves real-time control at ~12Hz, auxiliary models discarded at inference for efficiency, successfully deployed on physical robots
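The latency figures above determine both the control rate and the speedup; the short calculation below re-derives them from the reported numbers. Variable names are illustrative.

```python
# Reported latencies in milliseconds.
vlm_ms = 58.1
action_head_ms = 25.1
autoregressive_cot_ms = 3178.5

# Total per-inference latency of the parallel model.
total_ms = vlm_ms + action_head_ms

# Control frequency if one action chunk is produced per inference.
control_hz = 1000.0 / total_ms

# Speedup over the autoregressive CoT baseline.
speedup = autoregressive_cot_ms / total_ms
```

This gives a total of 83.2 ms, about 12 Hz, and a speedup of roughly 38x, which the summary rounds to 40x.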

Real-World Applicability

Real-world robot deployment on AgileX Cobot dual-arm system with 7-DoF arms and parallel grippers
Uses only onboard RGB cameras (front-facing and wrist-mounted) without additional sensors
Three manipulation tasks tested: bread placement (64% success), block placement (56% success), fruit gathering (48% success)
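Demonstrates real-world performance across three tasks of increasing complexity; the policy is trained on 100 human-teleoperated real-world demonstrations per task, so this is not a sim-to-real transfer result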
Maintains high-frequency control loop (~12Hz) suitable for responsive robotic control
Successfully handles unstructured environments with varied lighting and arbitrary object placements

Limitations & Failure Modes

ENGINEERING: Requires auxiliary teacher models (DA3, Qwen3-0.6B) during training, increasing training complexity
EVALUATION: Limited real-world evaluation (only 3 tasks, 25 trials each) compared to extensive simulation benchmarks
ENGINEERING: Performance still relies heavily on quality of demonstration data and CoT annotations
FUNDAMENTAL: Fixed number of query tokens (16 visual, 4 linguistic) may not scale to more complex reasoning requirements
EVALUATION: No comparison with explicit CoT methods on inference speed vs. accuracy tradeoffs

Failure modes:
Likely struggles with tasks requiring longer reasoning chains than current 4 linguistic tokens can encode
Visual reasoning limited to depth-based spatial understanding, may fail on tasks requiring other geometric priors (surface normals, object boundaries)