Applied AI 20 papers

Applied AI Digest — Mar 17, 2026

Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation

Authors: Jacob Levy, Tyler Westenbroek, Kevin Huang, Fernando Palafox et al. (9 authors) · Institution: Toyota Research · Category: cs.RO

SimDist introduces a sim-to-real framework that freezes learned representations, rewards, and values while adapting only dynamics, reducing real-world robot adaptation to supervised system identification and achieving 2× performance improvements over standard RL fine-tuning methods.

Practical Takeaway: If you’re working on sim-to-real transfer for robotics, SimDist offers a compelling alternative to end-to-end policy fine-tuning that avoids catastrophic forgetting. The key insight is architectural: decompose world models to freeze global task structure (representations, rewards, values) while adapting only environment-specific dynamics. This reduces real-world adaptation to a simple supervised learning problem rather than complex RL optimization. The framework integrates well with existing simulation pipelines and demonstrates practical success with minimal real-world data. Consider implementing this approach if you have access to high-fidelity simulators and can generate diverse training data with action perturbations during pretraining.

Tags: sim-to-real world-models model-based-rl robotics manipulation locomotion dynamics-adaptation planning

arXiv · PDF

Task & Setting

The sim-to-real transfer problem in robotics is the challenge of deploying policies trained in simulation on real-world systems, where dynamics mismatches often cause catastrophic failures. Reinforcement learning offers principled adaptation mechanisms, but existing methods struggle with exploration and long-horizon credit assignment when fine-tuning on the limited real-world data typical of robotics applications.

The task involves learning a policy that can perform precise manipulation (peg insertion, table leg assembly) and quadruped locomotion tasks in the real world after pretraining in simulation. Inputs are raw RGB images (224×224 for manipulation) or proprioceptive/terrain data (for quadruped), and outputs are 6-DOF end-effector poses or 12-joint position targets respectively. The formal objective is to maximize expected discounted return:

\[\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1})\right]\]
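
As a small worked illustration of the objective above, the sketch below evaluates the discounted sum for one finite reward sequence. Truncating the infinite sum to a finite list and the function name `discounted_return` are assumptions made for the example, not details from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite list of rewards."""
    total = 0.0
    # Accumulate backwards so each step computes G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

The expectation in the formula would then be estimated by averaging this quantity over many rollouts.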

Success is measured by task completion rates for manipulation (successful peg insertion or table leg assembly) and forward progress distance for quadruped locomotion tasks. The evaluation uses 20 trials for manipulation tasks across narrow (2cm×2cm) and wide (35cm×35cm) initial condition ranges, and 15 trials for quadruped tasks across 3 different commanded speeds.

Architecture & Method
  1. Train state-based expert policy π^e(s_t) and value function V^e(s_t) in simulation using PPO, saving intermediate checkpoints {π_k}.

  2. Generate diverse simulation dataset by alternating between expert and sub-optimal policies with contiguous Gaussian action perturbations over 1-5 timestep windows.

  3. Pretrain planning-oriented latent world model with components: latent encoder E_θ(o_t) → z_t, history encoder C_θ for temporal context, transformer-based dynamics model f_θ predicting future latent states, reward model R_θ, value model V_θ, and base policy π_θ.

  4. Train world model using multi-term loss at each timestep:

    \[L^{sim}_t(\theta) = \sum_{i=0}^T \left[\|\hat{z}_{t+i+1} - \mathrm{sg}(E_\theta(o_{t+i+1}))\|_2^2 + c_1(\hat{r}_{t+i} - r_{t+i})^2 + c_2(\hat{v}_{t+i+1} - v_{t+i+1})^2 + c_3 \mathbf{1}_e(a_{t+i})\|\hat{a}_{t+i} - a_{t+i}\|_2^2\right]\]

  5. Deploy to real world using Model Predictive Path Integral (MPPI) planning with the pretrained world model.

  6. Adapt via iterative supervised learning by freezing encoder E_θ, reward R_θ, and value V_θ models, fine-tuning only dynamics f_θ using simplified loss:

    \[L^{real}_t(\theta) = \sum_{i=0}^T \|\hat{z}_{t+i+1} - \mathrm{sg}(E_\theta(o_{t+i+1}))\|_2^2\]
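
To make the planning step (step 5) more concrete, here is a minimal sketch of a single MPPI-style update on a toy one-dimensional latent state. It is not the authors' planner. The toy dynamics, reward, and value functions, the zero nominal plan, the omission of discounting, and all hyperparameter values are assumptions chosen for illustration.

```python
import math
import random


def mppi_first_action(z0, dynamics, reward, value, horizon=5, samples=256,
                      sigma=0.5, temperature=0.1, seed=0):
    """One MPPI-style update from a zero nominal plan; returns the first action."""
    rng = random.Random(seed)
    # Sample candidate action sequences as Gaussian perturbations of a zero plan.
    plans = [[rng.gauss(0.0, sigma) for _ in range(horizon)]
             for _ in range(samples)]
    returns = []
    for plan in plans:
        z, total = z0, 0.0
        for a in plan:
            z_next = dynamics(z, a)           # stand-in for f_theta
            total += reward(z, a, z_next)     # stand-in for R_theta
            z = z_next
        returns.append(total + value(z))      # bootstrap with a terminal value
    # Exponentially weight plans by return (shifted by the max for stability).
    best = max(returns)
    weights = [math.exp((r - best) / temperature) for r in returns]
    norm = sum(weights)
    # The planned first action is the weighted average of sampled first actions.
    return sum(w * plan[0] for w, plan in zip(weights, plans)) / norm


# Toy 1-D latent problem (hypothetical): move the latent state toward GOAL.
GOAL = 1.0
toy_dynamics = lambda z, a: z + a
toy_reward = lambda z, a, z_next: -(z_next - GOAL) ** 2
toy_value = lambda z: -(z - GOAL) ** 2

first_action = mppi_first_action(0.0, toy_dynamics, toy_reward, toy_value)
print(first_action)  # positive: the planner steps toward GOAL
```

In the framework described above, only the dynamics callable would change during real-world adaptation. The reward and value callables would stay fixed.
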
Training Recipe
  1. Expert policy pretraining: Train state-based policies using PPO in simulation with privileged state access, saving policy checkpoints every 100 iterations up to 1000 checkpoints.

  2. Diverse data generation: Generate 100K trajectories for manipulation (100M for quadruped) using expert and sub-optimal policies with action perturbations, yielding ~36% expert actions for manipulation, ~55.7% for quadruped.

  3. World model pretraining: Train for 2 epochs on the simulation dataset using the Adam optimizer with the learning rate cosine-decayed from 2×10^-4 to 1×10^-4, batch size 256, and ~200K gradient updates for manipulation. Apply data augmentation, including Gaussian noise injection and visual augmentations.

  4. Real-world fine-tuning: Collect M on-policy rollouts using MPPI planner, add to real-world dataset D_real, then fine-tune dynamics model f_θ only while keeping other components frozen. Update after every 20 episodes for manipulation tasks.

    Hardware and timing details not comprehensively reported across all experiments.
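
The real-world fine-tuning loop in step 4 can be sketched on a toy problem as follows. This is a hedged illustration, not the paper's implementation. The scalar linear encoder and dynamics, the `real_step` system, the random actions standing in for MPPI rollouts, the one-step loss, and all numeric values are assumptions.

```python
import random

ENC_GAIN = 2.0  # frozen encoder E(o) = ENC_GAIN * o (hypothetical)


def encode(o):
    return ENC_GAIN * o


def real_step(o, a):
    # Stand-in for the real system; its coefficients are unknown to the learner.
    return 0.8 * o + 0.5 * a


def dynamics_loss(params, data):
    """Mean squared one-step latent prediction error (encoder held fixed)."""
    w, b = params
    return sum((w * encode(o) + b * a - encode(o2)) ** 2
               for o, a, o2 in data) / len(data)


def finetune_dynamics(params, data, lr=0.05, steps=200):
    """Gradient descent on the dynamics parameters only."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        gw = gb = 0.0
        for o, a, o2 in data:
            z, target = encode(o), encode(o2)
            err = w * z + b * a - target
            gw += 2 * err * z / n
            gb += 2 * err * a / n
        w, b = w - lr * gw, b - lr * gb
    return (w, b)


def collect_rollout(rng, length=20):
    # Random actions stand in for on-policy MPPI rollouts in this sketch.
    o, traj = rng.uniform(-1, 1), []
    for _ in range(length):
        a = rng.uniform(-1, 1)
        o2 = real_step(o, a)
        traj.append((o, a, o2))
        o = o2
    return traj


rng = random.Random(0)
sim_params = (1.0, 1.0)   # dynamics as pretrained in simulation (hypothetical)
params = sim_params
d_real = []
for round_idx in range(3):
    d_real += collect_rollout(rng)              # grow the real-world dataset
    params = finetune_dynamics(params, d_real)  # update f only; encoder frozen
print(params)
```

Because the encoder is never updated, the regression targets stay fixed, which is what makes the adaptation an ordinary supervised problem in this toy setting.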

Novelty & Lineage

This work builds on model-based RL approaches like Dreamer (2020) and TD-MPC (2022), and sim-to-real transfer methods. The key novel contribution is the insight that world models should be decomposed for transfer - freezing global task structure (representations, rewards, values) while adapting only local dynamics. This differs from end-to-end policy fine-tuning approaches like RLPD (2023), IQL (2022), and SGFT (2025) that suffer from catastrophic forgetting.

The specific technical delta includes:

  1. systematic framework for distilling simulator structure into transferable world model components
  2. chunked transformer dynamics prediction for parallel planning
  3. sequence-to-sequence reward/value modeling, and
  4. reduction of real-world adaptation to supervised system identification problem.

    Rating: SIGNIFICANT - provides a principled decomposition for sim-to-real transfer with strong empirical validation.

Benchmarks & Results
  1. Peg Insertion (Wide): Success rate 0.8 vs baselines 0.0-0.4, ~2× improvement
  2. Peg Insertion (Narrow): Success rate 0.8 vs baselines 0.2-0.4, ~2× improvement
  3. Table Leg (Wide): Success rate 0.85 vs baselines 0.0-0.4, ~2× improvement
  4. Table Leg (Narrow): Success rate 0.6 vs baselines 0.0-0.3, ~2× improvement
  5. Quadruped Slippery Slope: Forward progress 1.5m vs baselines 0.5-1.0m, 50%+ improvement
  6. Quadruped Foam: Forward progress 3.0m vs baselines 1.0-2.0m, 50%+ improvement

    SimDist outperforms the RLPD, IQL, SGFT-SAC, and Diffusion Policy baselines on every task listed. Its performance improves monotonically during real-world adaptation, whereas standard RL fine-tuning methods can collapse. On Table Leg, the margin over the best baseline is larger for the wide initial-condition range (0.85 vs 0.4) than for the narrow one (0.6 vs 0.3).

Compute & Efficiency
  1. Model size: World model parameters not explicitly reported, uses ResNet-18 encoders and transformer components with 64-dim latent space for manipulation
  2. Training compute: 100K-100M simulation trajectories generated, pretraining details partially reported (200K gradient updates for manipulation), specific GPU hours not provided
  3. Inference speed: Real-time planning at 5 Hz for manipulation, 50 Hz for quadruped locomotion, chunked prediction enables parallel GPU utilization
  4. Memory footprint: Not explicitly reported
  5. Deployment practicality: Demonstrated on real UR5e robot and Unitree Go2 quadruped, achieves effective adaptation with just 15-30 minutes of real-world data, making it practical for real-world deployment
Real-World Applicability
  1. Real robot deployment on UR5e manipulator performing precise assembly tasks (peg insertion, table leg assembly) with RGB camera observations and 6-DOF end-effector control

  2. Real quadruped experiments on Unitree Go2 robot traversing challenging terrains (slippery PTFE surfaces, compliant foam) not modeled in simulation

  3. Achieves autonomous improvement with minimal real-world interaction (15-30 minutes of data) across contact-rich manipulation and dynamic locomotion tasks

  4. Successfully transfers from high-fidelity physics simulation (IsaacLab, custom manipulation simulator) to real hardware with significant dynamics gaps

  5. Demonstrates zero-shot transfer of reward and value models from raw perception, enabling immediate planning capabilities in real-world deployment

Limitations & Failure Modes
  1. FUNDAMENTAL: Requires high-fidelity simulator with privileged state access during pretraining, limiting applicability to domains without good simulators

  2. FUNDAMENTAL: Assumes global task structure (representations, rewards, values) transfers well across sim-to-real gap, may not hold for tasks with significant perceptual or semantic differences

  3. ENGINEERING: Massive simulation data requirements (100K-100M trajectories) for effective pretraining, though this is parallelizable

  4. ENGINEERING: Limited to single-task settings, scaling to multi-task world models remains unexplored

  5. EVALUATION: Evaluation limited to specific manipulation and locomotion tasks, broader generalization across robot morphologies and task domains unclear

  6. EVALUATION: Real-world experiments conducted in relatively controlled lab environments, robustness to more varied real-world conditions unknown

    Known failure modes:

  1. Planning may exploit inaccuracies in transferred reward/value models when dynamics adaptation is insufficient
  2. Method may struggle when simulator fails to capture essential task structure that should transfer (e.g., contact dynamics, object properties).

AsgardBench - Evaluating Visually Grounded Interactive Planning Under Minimal Feedback

Authors: Andrea Tupini, Lars Liden, Reuben Tan, Yu Wang et al. (5 authors) · Institution: Figure AI · Category: cs.AI

AsgardBench isolates visually grounded interactive planning in embodied AI by removing navigation complexity and requiring agents to adapt action sequences based on visual observations during execution.

Practical Takeaway: If you’re building embodied AI systems, AsgardBench reveals that current vision-language models struggle significantly with visual state tracking and plan adaptation during execution - capabilities essential for real-world deployment. The benchmark’s systematic ablations show models rely heavily on auxiliary cues (hand overlays, detailed feedback, temporal context) and fail when forced to ground decisions purely in visual observations. Before deploying VLMs in embodied systems, test their ability to maintain coherent state across multi-step interactions and adapt plans when visual observations contradict expectations. The benchmark code is available and could serve as a diagnostic tool for your own agents.

Tags: embodied_ai interactive_planning visual_grounding benchmark vision_language_models multimodal_reasoning plan_adaptation state_tracking

arXiv · PDF

Task & Setting

Embodied AI agents often fail when real-world conditions differ from their initial assumptions, requiring them to adapt their plans based on visual observations during execution. Existing benchmarks conflate high-level reasoning with navigation and low-level control, making it difficult to isolate interactive planning failures.

AsgardBench evaluates visually grounded interactive planning where agents must:

  1. receive RGB images, action history, and minimal success/failure signals as input
  2. generate complete action sequences at each step (though only the first action executes), and
  3. adapt plans when visual observations contradict expectations.

    The objective is task completion where success depends on conditional branching based on visual state:

    \[\text{Success} = \mathbb{I}[\text{all task goals met} \land \text{world left in reasonable state}]\]

    Success is measured by binary task completion across 108 task instances. Tasks terminate when:

  1. goals are met
  2. 10 consecutive undoable actions occur
  3. the same action repeats >8 times, or
  4. step limits are exceeded (soft: 1.5x optimal steps, hard: 2x optimal steps).

    The benchmark contains 108 task instances spanning 12 task types (coffee consumption, cooking, cleaning, organization) across Kitchen, Living Room, and Bathroom scenes in AI2-THOR. Each task includes systematic variations in object state (clean/dirty), placement, and scene configuration that create conditional execution branches requiring different action sequences for the same instruction.
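
The termination rules in this section can be sketched as a single check. The `EpisodeState` fields and the returned labels are hypothetical names, not the benchmark's API. Only the hard step limit is modeled, since the summary does not say what the soft limit triggers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EpisodeState:
    goals_met: bool
    consecutive_undoable: int   # actions in a row that could not be performed
    same_action_repeats: int    # times the same action has been repeated
    steps_taken: int
    optimal_steps: int


def termination_reason(s: EpisodeState, hard_factor: float = 2.0) -> Optional[str]:
    """Return why the episode ends, or None if it should continue."""
    if s.goals_met:
        return "success"
    if s.consecutive_undoable >= 10:
        return "too_many_undoable_actions"
    if s.same_action_repeats > 8:
        return "action_loop"
    if s.steps_taken > hard_factor * s.optimal_steps:
        return "step_limit"
    return None


print(termination_reason(EpisodeState(False, 0, 9, 4, 6)))  # action_loop
```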

Architecture & Method
  1. Environment abstraction layer built on AI2-THOR that removes navigation and low-level manipulation, automatically positioning agents for object interactions

  2. Simplified action space with high-level commands (FIND, PICKUP, PUT, CLEAN, TOGGLE_ON/OFF) that abstract away motor control and spatial reasoning

  3. Visual observation system providing current and previous RGB images with optional translucent hand overlay to disambiguate held objects

  4. Minimal feedback mechanism providing only binary success/failure signals for each action attempt

  5. Task variation generator that systematically alters object states, placements, and scene configurations to create conditional execution branches

  6. Evaluation framework testing vision-language models (GPT-4o, Claude, Qwen-VL, Llama, etc.) in image-based vs text-only conditions with different feedback levels

    The core technical contribution is isolating interactive planning by removing confounding factors (navigation, detailed error messages, structured state) while requiring genuine visual grounding for plan adaptation during execution.
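
A minimal sketch of the interaction protocol described above: the agent proposes a full plan each turn, only the first action executes, and the only feedback is a success flag. `ToyEnv`, `scripted_agent`, and the action argument strings are hypothetical stand-ins. The real benchmark runs on AI2-THOR with RGB observations.

```python
class ToyEnv:
    """Hypothetical stand-in: a mug must be clean before it can be used."""

    def __init__(self, mug_dirty):
        self.mug_dirty = mug_dirty
        self.done = False

    def observe(self):
        # AsgardBench provides RGB images; a dict stands in for one here.
        return {"mug_dirty": self.mug_dirty}

    def step(self, action):
        """Execute one action and return only a binary success signal."""
        if action == "CLEAN Mug":
            self.mug_dirty = False
            return True
        if action == "PUT Mug CoffeeMachine":
            if self.mug_dirty:
                return False
            self.done = True
            return True
        return False


def scripted_agent(obs, history):
    """Re-plan from the current observation; branch on the observed mug state."""
    plan = []
    if obs["mug_dirty"]:
        plan.append("CLEAN Mug")
    plan.append("PUT Mug CoffeeMachine")
    return plan


def run_episode(env, agent, max_steps=10):
    history = []
    for _ in range(max_steps):
        plan = agent(env.observe(), history)
        if not plan:
            break
        action = plan[0]            # only the first planned action is executed
        ok = env.step(action)       # minimal feedback: success or failure
        history.append((action, ok))
        if env.done:
            break
    return history


print(run_episode(ToyEnv(mug_dirty=True), scripted_agent))
```

The same instruction yields different action sequences depending on the observed object state, which is the conditional branching the benchmark is designed to test.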

Training Recipe

Not applicable - this is a benchmark evaluation paper that tests existing pre-trained vision-language models rather than training new models. The paper evaluates models including:

  1. GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro (proprietary frontier models)
  2. Qwen-VL-Max, Mistral-Large-3 (open-weight models)
  3. Various model sizes and architectures

    Models were tested in their pre-trained state without additional fine-tuning on AsgardBench tasks. Training details for the underlying models are not reported as they come from external sources.

Novelty & Lineage

This work builds on prior embodied AI benchmarks including ALFRED (2020), ALFWorld (2021), BEHAVIOR-1K (2024), and EmbodiedBench (2025). The closest prior works are ET-PLAN-BENCH (2024) and LoTa-Bench (2024) which evaluate planning capabilities.

The specific delta is:

  1. isolating interactive planning by removing navigation and manipulation
  2. restricting feedback to minimal success/failure signals rather than detailed error messages
  3. requiring genuine visual grounding through systematic text-only ablations, and
  4. creating controlled task variations that force conditional branching during execution.

    Prior benchmarks either conflate reasoning with navigation/control or provide rich feedback that substitutes for perception. AsgardBench is the first to isolate visually grounded interactive planning as a standalone capability.

    Rating: SIGNIFICANT - addresses a clear gap in embodied AI evaluation with novel benchmark design principles.

Benchmarks & Results
  1. AsgardBench (108 tasks) - GPT-4o: 35% success, Kimi-K5.2: 34%, Claude-3.5-Sonnet: 28%, Qwen-VL-Max: 24%

  2. Text-only ablation - All models drop substantially: GPT-4o: 35% → 16%, Claude: 28% → 12%, confirming visual grounding necessity

  3. No feedback condition - Performance degrades: GPT-4o drops ~8 percentage points when success/failure signals removed

  4. Detailed feedback condition - Performance increases significantly, with some text-only models matching image-based baseline

  5. Hand overlay ablation - All models perform worse without visual hand cues for held object disambiguation

  6. Things to Remember ablation - Mixed results, stronger models benefit from memory scaffolding

  7. Current image only - Performance degrades when previous state image removed, confirming value of temporal context

    No comparison to other embodied benchmarks provided since AsgardBench targets a different capability (interactive planning vs. end-to-end execution).

Compute & Efficiency
  1. Model size - Tests existing models ranging from smaller open-weight to large proprietary models (exact parameters not specified)

  2. Training compute - Not applicable, uses pre-trained models

  3. Inference speed/latency - Not reported, though the authors note that some expensive models were not run in every condition, to control cost

  4. Memory footprint - Not reported

  5. Deployment practicality - Low barrier as benchmark runs on standard AI2-THOR simulation infrastructure, suitable for research evaluation but abstracted from real-world deployment requirements

Real-World Applicability
  1. Simulation-only evaluation in AI2-THOR with no real-world robot experiments

  2. No hardware deployment or physical embodiment testing reported

  3. No sim-to-real transfer analysis provided

  4. Authors acknowledge reduced ecological validity due to abstraction away from navigation and low-level control

  5. Designed as diagnostic tool for capabilities rather than end-to-end system evaluation

    The benchmark intentionally abstracts away real-world complexities to isolate reasoning capabilities, limiting direct real-world applicability but providing valuable capability assessment for future deployed systems.

Limitations & Failure Modes
  1. FUNDAMENTAL - Abstracts away navigation and low-level control, reducing ecological validity for real embodied systems

  2. FUNDAMENTAL - Limited to AI2-THOR environment with constrained object set and scene types, may not generalize to more diverse visual environments

  3. EVALUATION - Requires models to produce full action sequences each turn even though only first action executes, potentially favoring planning-oriented models over reactive policies

  4. ENGINEERING - Controlled lighting and scene conditions may not reflect performance in varied real-world environments

  5. FUNDAMENTAL - Minimal feedback design, while serving benchmark goals, may not reflect richer sensory inputs available in real systems

  6. EVALUATION - 108 task instances may be insufficient for robust statistical analysis of model capabilities

    Failure modes:

  1. Models frequently get stuck in repetitive action loops when plans fail
  2. Visual misinterpretations (confusing reflections for flames, objects for clutter) lead to cascading planning errors.

Evaluating Time Awareness and Cross-modal Active Perception of Large Models via 4D Escape Room Task

Authors: Yurui Dong, Ziyue Wang, Shuyun Lu, Dairu Liu et al. (8 authors) · Institution: Toyota Research · Category: cs.CV

This paper introduces EscapeCraft-4D, a 4D multimodal benchmark that evaluates time-aware reasoning and cross-modal active perception in Omni models through escape room tasks requiring coordination of vision, language, and audio under time constraints.

Practical Takeaway: If you’re building multimodal AI systems for real-world applications, this work reveals critical gaps in current models’ ability to handle time-sensitive, multi-modal decision making. The benchmark exposes that strong performance on static multimodal benchmarks doesn’t transfer to embodied scenarios requiring active perception and temporal reasoning. Key implementation insight: unified multimodal processing (like true Omni models) outperforms modular agentic approaches for cross-modal coordination. The time-aware evaluation paradigm and audio integration techniques could be adapted for robotics and autonomous systems where decisions have irreversible consequences under time constraints.

Tags: multimodal embodied_ai time_aware_reasoning cross_modal_perception audio_visual_fusion escape_room benchmark omni_models

arXiv · PDF

Task & Setting

This work addresses the gap between static multimodal benchmarks and real-world scenarios requiring time-aware reasoning and active cross-modal perception. Current multimodal large language models (MLLMs) excel at static vision-language tasks but struggle with dynamic environments where information is transient, multiple modalities provide complementary or interfering signals, and decisions must be made under irreversible time constraints.

The task is embodied multimodal reasoning in a 4D escape room environment that combines vision, language, and audio modalities. Models receive visual observations (RGB images), spatial audio cues (ambient sounds with distance-based attenuation), and triggered speech messages containing passwords or misleading information. The objective is to navigate a 3D room, collect time-sensitive clues, and escape within step limits (50-80 steps depending on difficulty). The environment incorporates perishable evidence that appears temporarily and disappears permanently if not found within time limits.

Success is measured by Escape Rate (ER), Audio Misguidance Rate (AMR), which measures susceptibility to distractors, and Time-Constrained Search Score (TCSS), calculated as:

\[\text{TCSS} = \begin{cases} 1 - \frac{t_{\text{found}}}{T_{\text{lim}}} & \text{if found within } T_{\text{lim}} \\ 0 & \text{otherwise} \end{cases}\]
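
The piecewise definition above translates directly into a few lines of code. Using `None` to represent evidence that was never found is an assumption of this sketch.

```python
def tcss(t_found, t_lim):
    """Time-Constrained Search Score: higher when evidence is found sooner."""
    if t_found is not None and t_found <= t_lim:
        return 1.0 - t_found / t_lim
    return 0.0


print(tcss(5, 20))     # 0.75: found a quarter of the way into the window
print(tcss(None, 20))  # 0.0: never found
```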

The MM-Escape4D benchmark contains 66 scenes across 6 difficulty levels, with 11 scenes each for basic difficulties (1-3), misleading variants (2-M, 3-M), and time-aware scenarios (2-T). Each scene contains 13-17 interactive objects on average.

Architecture & Method
  1. Environment Design: EscapeCraft-4D extends 3D escape rooms with temporal dimension through ambient audio (continuous spatial signals) and triggered audio (time-limited speech cues activated by proximity and orientation thresholds).

  2. Ambient Audio Integration: Spatially-grounded continuous audio sources with loudness as deterministic function of agent-to-source distance, providing gradient signals for navigation and exploration guidance.

  3. Trigger Audio System: Time-variant discrete audio cues bound to interactable objects, activated only when agent satisfies proximity/orientation conditions and issues explicit trigger requests.

  4. Time-Aware Task Design: Critical evidence appears temporarily after specific triggers and disappears permanently within 20-second windows, forcing irreversible time-constrained decision making.

  5. Misleading Modality Integration: Parallel distractor audio sources provide plausible but irrelevant information to test robustness against cross-modal interference.

  6. Multi-hop Reasoning Chain: Tasks require sequential interactions across visual props and audio cues, with difficulty scaling from 1-hop (direct exit) to 3-hop (audio + visual + physical items) reasoning paths.

    The core technical contribution is the systematic integration of temporally-dynamic audio cues with spatial grounding and time-variant evidence availability, creating a 4D evaluation paradigm that goes beyond static multimodal reasoning.
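
The two audio mechanisms can be illustrated with a short sketch. The summary only says that ambient loudness is a deterministic function of distance and that triggers depend on proximity and orientation thresholds. The inverse-distance law, the threshold values, and the function names below are assumptions.

```python
import math


def ambient_loudness(agent_xy, source_xy, ref_loudness=1.0, ref_dist=1.0):
    """Loudness that decays with distance (inverse-distance law, assumed)."""
    d = math.dist(agent_xy, source_xy)
    return ref_loudness * ref_dist / max(d, ref_dist)


def can_trigger(agent_xy, heading_deg, obj_xy, requested,
                max_dist=1.5, max_angle_deg=30.0):
    """Triggered audio plays only if the agent asks, is close, and faces the object."""
    if not requested:
        return False
    dx, dy = obj_xy[0] - agent_xy[0], obj_xy[1] - agent_xy[1]
    if math.hypot(dx, dy) > max_dist:
        return False
    bearing = math.degrees(math.atan2(dy, dx))
    # Smallest absolute angle between the heading and the direction to the object.
    diff = abs((bearing - heading_deg + 180.0) % 360.0 - 180.0)
    return diff <= max_angle_deg


print(ambient_loudness((0, 0), (2, 0)))        # 0.5
print(can_trigger((0, 0), 0.0, (1, 0), True))  # True
```

Because loudness increases as the agent approaches the source, comparing successive readings gives a gradient-like signal for navigation.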

Training Recipe

Not reported. This paper focuses on environment design and evaluation rather than model training. The work evaluates existing pre-trained models including GPT-4o, Gemini-3-Pro, Qwen3-Omni variants, and MGM-Omni without additional fine-tuning or training on the benchmark tasks.

Novelty & Lineage

This work builds on MM-Escape (2024) for 3D escape room tasks and extends it with audio modalities and temporal dynamics. Prior multimodal benchmarks like SEED-Bench (2023), MVBench (2024), and Video-MME (2024) focus on static vision-language tasks or silent video clips, lacking audio integration and time-aware reasoning.

The specific delta includes:

  1. systematic audio modality integration with spatial grounding
  2. time-variant evidence that disappears permanently
  3. misleading cross-modal cues requiring active perception, and
  4. 4D evaluation paradigm combining space and time.

    The closest work is EscapeCraft, which lacks audio and temporal constraints.

    Rating: SIGNIFICANT - introduces novel 4D evaluation paradigm with meaningful extensions to existing 3D environments, though building incrementally on established escape room framework.

Benchmarks & Results
  1. MM-Escape4D Difficulty-1: GPT-4o Omni-Agent achieves 90.91% ER, Gemini-3-Pro 100% ER, Qwen3-Omni-Thinking 63.64% ER

  2. MM-Escape4D Difficulty-2: GPT-4o Omni-Agent 81.82% ER, Gemini-3-Pro 59.25% ER, Qwen3-Omni-Thinking 27.27% ER

  3. MM-Escape4D Difficulty-3: GPT-4o Omni-Agent 45.45% ER, Gemini-3-Pro 27.27% ER, all open-source models 0% ER

  4. Time-Aware TCSS: GPT-4o 67.95%, Qwen3-Omni-Thinking 55.87%, Gemini-3-Pro 47.04%

  5. Audio Misguidance Rate: GPT-4o 100% (high susceptibility), Qwen3-Omni-Thinking 23.08% (more robust)

  6. Misleading Levels: GPT-4o maintains 72.73%/63.64% ER on D2-M/D3-M, while most open-source models drop to 0%

    Results show substantial performance gaps, with proprietary models significantly outperforming open-source alternatives, especially on complex multi-hop reasoning tasks. No prior SOTA scores available as this is a new benchmark.
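
As a reading aid for the Escape Rate figures, the sketch below converts a count of escaped scenes into a percentage. It assumes one run per scene and 11 scenes per difficulty level, as stated in Task & Setting. The function name is illustrative.

```python
def escape_rate(num_escaped, num_scenes=11):
    """Escape Rate in percent, rounded to two decimals."""
    return round(100.0 * num_escaped / num_scenes, 2)


for k in (10, 9, 7, 5, 3):
    print(k, escape_rate(k))
# 10 90.91
# 9 81.82
# 7 63.64
# 5 45.45
# 3 27.27
```

Under this assumption, 90.91% corresponds to 10 of 11 scenes and 27.27% to 3 of 11. Reported values that are not multiples of 1/11, such as 59.25%, presumably reflect averaging over repeated runs or a different denominator. The summary does not say which.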

Compute & Efficiency
  1. Model sizes range from 30B parameters (Qwen3-Omni) to 32B (MGM-Omni), with proprietary model sizes not reported

  2. Training compute: Not reported as models are evaluated without additional training

  3. Inference speed: Not reported, though step limits (50-80 actions) suggest real-time interaction requirements

  4. Memory footprint: Not reported

  5. Deployment practicality: Limited by requirement for unified vision-language-audio processing and 3D spatial reasoning capabilities, currently accessible mainly through API-based proprietary models or large open-source variants

Real-World Applicability
  1. Synthetic 3D environment evaluation only - no real-world deployment or hardware experiments reported

  2. No robot or autonomous vehicle integration demonstrated

  3. No sim-to-real transfer analysis provided

  4. Environment designed to simulate real escape room scenarios with realistic multimodal cues, but validation limited to simulation

  5. Task design inspired by real-world applications requiring time-aware multimodal reasoning (emergency response, interactive assistants), but no production integration shown

Limitations & Failure Modes
  1. EVALUATION: Limited to synthetic 3D environments without real-world validation or sim-to-real analysis

  2. FUNDAMENTAL: Current models show high susceptibility to misleading audio cues (100% AMR for GPT-4o), indicating poor cross-modal reasoning robustness

  3. ENGINEERING: Open-source models completely fail on complex multi-hop tasks (0% ER on Difficulty-3), suggesting insufficient training on embodied reasoning

  4. EVALUATION: Benchmark scale relatively small (66 scenes) compared to standard vision-language benchmarks with thousands of examples

  5. FUNDAMENTAL: Time-aware reasoning requires irreversible decision making under constraints, which current models struggle with even when given explicit time cues

    Failure modes:

  1. Models get trapped in exploration loops without strategic planning, particularly in misleading scenarios
  2. Time-sensitive evidence is missed due to poor temporal reasoning, leading to permanent information loss and task failure

Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective

Authors: Mohamed Aghzal, Gregory J. Stein, Ziyu Yao · Institution: Toyota Research · Category: cs.AI

A hierarchical evaluation framework reveals that LLM web agents fail primarily due to poor low-level execution and perceptual grounding rather than high-level planning deficiencies.

Practical Takeaway: If you’re building LLM web agents, focus your efforts on improving low-level execution rather than just high-level reasoning. The hierarchical evaluation framework reveals that structured PDDL representations help with planning, but perceptual grounding and adaptive control remain the critical bottlenecks. Consider implementing explicit uncertainty signals in action spaces rather than forcing concrete actions at every step. The framework itself is valuable for systematic diagnosis - implement process-based evaluation to identify where your agent fails rather than relying solely on end-to-end success metrics.

Tags: web-agents hierarchical-planning llm-evaluation pddl process-based-evaluation web-navigation automated-planning failure-analysis

arXiv · PDF

Task & Setting

Large language model (LLM) web agents are increasingly deployed for automating online tasks like e-commerce transactions and form completion, but they remain far from human-level reliability on realistic, long-horizon web navigation tasks. Current evaluation approaches focus primarily on end-to-end success metrics, providing limited insight into where and why failures occur during task execution.

The task is to systematically analyze LLM-based web agents across three fundamental capabilities:

  1. high-level planning - decomposing tasks into subgoals
  2. low-level execution - translating subgoals into concrete UI actions, and
  3. replanning - adapting when the environment doesn’t match expectations.

    The input consists of natural language instructions and live web page DOM representations. The output is a sequence of executable web actions (clicks, form fills, navigation) that accomplish the specified task.

    Success is measured through multiple metrics: Perfect Match rate (proportion of human-authored steps with corresponding LLM actions), subgoal completion rate, plan completion rate, task success rate, and action validity rates. The evaluation uses the Mind2Web-Live benchmark extended with 104 instances of expert-annotated key nodes representing human-aligned high-level subgoals, providing ground truth for hierarchical analysis.
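
The Perfect Match rate described above can be sketched as follows. How a human step is judged to "correspond" to an LLM action is not specified in this summary, so the exact-string default for `matches` is a placeholder assumption.

```python
def perfect_match_rate(human_steps, llm_steps, matches=lambda h, l: h == l):
    """Fraction of human-authored steps that have a corresponding LLM step."""
    if not human_steps:
        return 0.0
    matched = sum(any(matches(h, l) for l in llm_steps) for h in human_steps)
    return matched / len(human_steps)


human = ["search flights", "select date", "confirm booking"]
llm = ["search flights", "confirm booking", "open menu"]
print(perfect_match_rate(human, llm))  # 2 of 3 human steps are covered
```

Extra LLM steps, such as "open menu" here, do not lower the score, because the metric is computed over the human-authored steps.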

Architecture & Method
  1. Hierarchical planning framework with three distinct layers: high-level planner LLM generates abstract subgoals, low-level planner LLM converts subgoals to executable actions, and replanner LLM revises plans when failures occur.

  2. Two high-level planning representations compared: Natural Language (NL) plans as free-form text subgoals, and structured Planning Domain Definition Language (PDDL) plans with explicit preconditions and effects.

  3. Three action space configurations evaluated: Expanded actions (google_search, goto, click, fill_form), Action Object (full action-object pair prediction), and Action ID (selection from enumerated valid actions).

  4. LLM-as-judge postcondition checker evaluates whether subgoal effects are satisfied after execution using current state and action history.

  5. Core technical contribution is the process-based evaluation framework that isolates and measures each hierarchical capability independently, enabling systematic diagnosis of failure modes beyond end-to-end metrics.
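
The three-layer control flow above can be sketched with plain functions standing in for the LLM calls. The `Subgoal` structure borrows the precondition/effect idea from PDDL. `flaky_execute`, `effects_hold`, `retry_replan`, and the example subgoals are hypothetical stand-ins, not the paper's prompts or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subgoal:
    name: str
    preconditions: frozenset
    effects: frozenset


def run_hierarchical(subgoals, execute, check, replan, max_replans=2):
    """Run subgoals in order; on a failed postcondition, ask the replanner."""
    state, log, replans = set(), [], 0
    queue = list(subgoals)
    while queue:
        sg = queue.pop(0)
        if sg.preconditions <= state:
            state = execute(sg, state)        # low-level executor
        if check(sg, state):                  # postcondition checker
            log.append((sg.name, "ok"))
        elif replans < max_replans:
            replans += 1
            log.append((sg.name, "replan"))
            queue = replan(sg, state, queue)  # replanner revises remaining plan
        else:
            log.append((sg.name, "failed"))
            break
    return log


attempts = {}


def flaky_execute(sg, state):
    attempts[sg.name] = attempts.get(sg.name, 0) + 1
    if sg.name == "fill_form" and attempts[sg.name] == 1:
        return state                          # simulated grounding failure
    return state | sg.effects


def effects_hold(sg, state):
    return sg.effects <= state


def retry_replan(sg, state, queue):
    return [sg] + queue                       # simplest revision: retry the subgoal


plan = [
    Subgoal("open_site", frozenset(), frozenset({"site_open"})),
    Subgoal("fill_form", frozenset({"site_open"}), frozenset({"form_filled"})),
]
log = run_hierarchical(plan, flaky_execute, effects_hold, retry_replan)
print(log)
```

Logging the outcome per subgoal, rather than only the final result, is what allows failures to be attributed to planning, execution, or replanning.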

Training Recipe

Not applicable - this work uses pre-trained models (gpt-5-nano, claude-haiku-4.5, gemini-flash-2.5) without additional training. The evaluation framework relies on prompt engineering across different planning stages with carefully designed prompts for high-level planning, low-level execution, postcondition checking, and replanning. No model fine-tuning, supervised training, or reinforcement learning is performed.

Novelty & Lineage

This work builds on existing web agent benchmarks like Mind2Web (Deng et al. 2023) and WebArena (Zhou et al. 2023), but introduces the first systematic hierarchical evaluation framework for LLM web agents. Prior works like WebVoyager (He et al. 2024) and GPT-4V web agents (Zheng et al. 2024) focus on end-to-end performance. The specific delta is decomposing web agent evaluation into planning, execution, and replanning capabilities with structured PDDL representation for web tasks. This enables process-based rather than outcome-based analysis.

Rating: SIGNIFICANT - provides new evaluation methodology and structured planning approach for web agents.

Benchmarks & Results
  1. Mind2Web-Live benchmark: 104 web task instances, gpt-5-nano achieves 36.4% task success rate with human plans vs 32.1% with NL plans vs 35.6% with PDDL plans

  2. High-level plan alignment: PDDL achieves 67.7% perfect match vs 60.6% for NL plans against human annotations

  3. Subgoal completion rate: 69.7% with human plans vs 63.2% NL vs 70.3% PDDL

  4. Plan completion rate: 38.5% human vs 34.6% NL vs 37.5% PDDL

  5. claude-haiku-4.5 achieves 29.2% success rate, gemini-flash-2.5 achieves 17.3% success rate

  6. Replanning improves PDDL subgoal completion from 70.3% to 93.3%

    Results show PDDL plans are more structured but low-level execution remains the primary bottleneck across all models.

Compute & Efficiency
  1. Model size: Uses pre-trained models (gpt-5-nano, claude-haiku-4.5, gemini-flash-2.5) - exact parameter counts not specified

  2. Training compute: Not applicable - no training performed

  3. Inference speed/latency: Not reported

  4. Memory footprint: Not reported

  5. Deployment practicality: Framework operates on live websites through DOM text representation, making it practically deployable but dependent on API access to commercial LLMs. The hierarchical structure adds multiple LLM calls per task, increasing latency and cost.

Real-World Applicability
  1. Evaluation conducted on live websites through Mind2Web-Live benchmark with 104 real web tasks across different domains

  2. Tests performed on actual dynamic web environments rather than static snapshots

  3. Framework handles real-world web navigation challenges like changing page layouts and unpredictable content

  4. No specific deployment results or production integration reported

  5. The approach is designed for practical web automation but remains limited by current LLM capabilities in perceptual grounding and state tracking

Limitations & Failure Modes
  1. FUNDAMENTAL: Low-level execution bottleneck - even with perfect high-level plans, agents achieve only 38.5% plan completion due to poor perceptual grounding and state tracking

  2. ENGINEERING: High hallucination rates in action-object prediction (34.0%) and redundant actions (34.2%), indicating poor understanding of web page dynamics

  3. EVALUATION: Limited to single replanning round and relies on LLM-as-judge evaluation which may have systematic biases

  4. ENGINEERING: Requires expert-annotated reference plans for high-level evaluation, limiting scalability

  5. FUNDAMENTAL: Repetitive failure modes where agents get stuck executing same failed actions (10.4% of failures)

    Failure modes: agents frequently navigate to non-existent links (32% of goto actions fail) and produce redundant actions that don’t change page state, indicating fundamental issues with world model understanding.


HapticVLA: Contact-Rich Manipulation via Vision-Language-Action Model without Inference-Time Tactile Sensing

Authors: Konstantin Gubernatorov, Mikhail Sannikov, Ilya Mikhalchuk, Egor Kuznetsov et al. (10 authors) · Institution: Toyota Research · Category: cs.RO

HapticVLA enables tactile-aware manipulation without inference-time sensors by training with safety-aware reward-weighted flow matching and distilling tactile knowledge into vision-only policies.

Practical Takeaway: Research engineers working on contact-rich manipulation should consider this approach if tactile sensors are unavailable at deployment but can be used during training data collection. The key insight is that tactile-aware manipulation can be learned offline and transferred via distillation, reducing hardware dependencies. The safety-aware reward formulation provides a principled way to incorporate contact constraints into flow matching policies. However, the method requires careful calibration of reward thresholds and may need adaptation for significantly different manipulation domains. The combination of reward-weighted training and distillation could be applied beyond tactile sensing to other expensive sensor modalities.

Tags: robotics vision-language-action tactile-sensing contact-rich-manipulation knowledge-distillation flow-matching reinforcement-learning bimanual-manipulation

arXiv · PDF

Task & Setting

Contact-rich manipulation tasks like grasping fragile objects (eggs, soft packages) require precise force control that vision alone cannot provide, yet dedicated tactile sensors increase cost and reduce reproducibility across robotic platforms. The core challenge is learning tactile-aware manipulation policies that can be deployed without requiring haptic sensors at inference time.

The task takes visual observations I ∈ R^{H×W×3}, language instructions ℓ, and proprioceptive joint states q ∈ R^6 as input, and outputs action chunks a_{1:H} ∈ R^{H×6} for manipulation control over horizon H=50. During training, tactile maps M^{L,R}_t ∈ [0,1]^{10×10} from fingertip sensors provide additional supervision. The objective combines standard flow matching loss with safety-aware tactile rewards:

\[R_{episode} = R_{step} + R_{succ} \cdot \text{success} - R_{drop} \cdot \text{drop} - R_{damage} \cdot \text{damage} - R_{risk} \cdot \text{risk}\]
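As a rough illustration of the episode-level reward above, the sketch below adds the success bonus to the accumulated step reward and subtracts the three penalties. The weight values are placeholders chosen for the example (the digest does not report the ones used in the paper), and treating R_step as the sum of per-step rewards is an assumption.

```python
from typing import Sequence


def episode_reward(
    step_rewards: Sequence[float],
    success: bool,
    drop: bool,
    damage: bool,
    risk: bool,
    r_succ: float = 1.0,    # placeholder weight (assumption)
    r_drop: float = 1.0,    # placeholder weight (assumption)
    r_damage: float = 1.0,  # placeholder weight (assumption)
    r_risk: float = 0.5,    # placeholder weight (assumption)
) -> float:
    """R_step + R_succ*success - R_drop*drop - R_damage*damage - R_risk*risk."""
    r_step = sum(step_rewards)  # assumed: R_step accumulates per-step rewards
    return (
        r_step
        + r_succ * float(success)
        - r_drop * float(drop)
        - r_damage * float(damage)
        - r_risk * float(risk)
    )
```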

Success is measured by task completion without object damage across three contact-rich manipulation tasks: jar pick-and-place, waffle handling, and egg manipulation. The authors collected 310 real-world episodes (successful + failure modes) plus 1,000 simulated episodes in Isaac Sim, totaling 1,310 training episodes with tactile feedback annotations.

Architecture & Method
  1. Safety-Aware Reward-Weighted Flow Matching (SA-RWFM): Fine-tunes SmolVLA flow matching policy using tactile-based safety rewards that penalize excessive grasping force, under-force during holding, peak pressure violations, and slip detection

  2. Tactile reward computation: From 10×10 tactile maps, computes contact statistics (mean force f^s_t, peak pressure p^s_t, concentration c^s_t) and detects slip via center-of-pressure jumps and force drops

  3. Per-step safety reward: Sums several penalty terms and negates them, so force violations lower the reward:

    \[r_t = -\sum_{s \in A}[\lambda_{high}\text{ReLU}(f^s_t - f_{max})^2 + \lambda_{low}I[h_t=1]\text{ReLU}(f_{min} - f^s_t)^2 + ...]\]
  4. Reward-weighted flow matching: Weights training samples by normalized episode + chunk returns with exponentiated advantage scores and anchor regularization:

    \[L_{total} = L_{rwfm} + \lambda_{anchor}L_{anchor}\]
  5. Tactile Distillation (TD): Trains student VLA to predict teacher’s tactile-aware actions using only vision and proprioception, with blended targets:

    \[\tilde{a}_i = (1-\alpha)a^{GT}_i + \alpha \hat{a}^T_i\]

    The core contribution is learning tactile-aware manipulation offline and deploying without tactile sensors via knowledge distillation.
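The two penalty terms written out in the per-step reward can be sketched as below. This is an illustration under stated assumptions, not the paper's implementation: mean force is taken as the plain average of the 10×10 map, the thresholds and weights are hypothetical values, and the terms elided by "..." in the formula (peak pressure, slip) are deliberately left out.

```python
from typing import Dict, List

TactileMap = List[List[float]]  # 10x10 grid with values in [0, 1]


def mean_force(tactile_map: TactileMap) -> float:
    """Average taxel reading, used here as a stand-in for f^s_t (assumption)."""
    cells = [v for row in tactile_map for v in row]
    return sum(cells) / len(cells)


def relu(x: float) -> float:
    return max(0.0, x)


def step_safety_reward(
    maps: Dict[str, TactileMap],  # one map per active fingertip, e.g. {"L": ..., "R": ...}
    holding: bool,                # h_t = 1 while the object should be held
    f_max: float = 0.6,           # hypothetical over-force threshold
    f_min: float = 0.1,           # hypothetical under-force threshold
    lam_high: float = 1.0,        # hypothetical penalty weight
    lam_low: float = 1.0,         # hypothetical penalty weight
) -> float:
    """Negative sum of over-force and under-force penalties over fingertips."""
    penalty = 0.0
    for tactile_map in maps.values():
        f = mean_force(tactile_map)
        penalty += lam_high * relu(f - f_max) ** 2
        if holding:  # the under-force term only applies while holding
            penalty += lam_low * relu(f_min - f) ** 2
    return -penalty
```

Because both terms are squared hinges, the reward is exactly zero inside the [f_min, f_max] band and falls off smoothly outside it.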

Training Recipe
  1. Offline tactile reward calculation: Compute safety rewards from collected 1,310 episodes using tactile sensor data and manipulator states, including failure mode episodes with negative rewards

  2. SA-RWFM fine-tuning: Fine-tune SmolVLA (0.45B parameters) using reward-weighted flow matching with tactile conditions, anchor regularization λ_anchor, hyperparameters α=0.25, β=0.7, warm-up schedule for stability

  3. Tactile distillation: Initialize student VLA from teacher weights (excluding tactile encoder), train on blended targets with α=0.5 mixing ratio between ground truth and teacher predictions

  4. Training details: Training performed on NVIDIA Jetson Orin NX 16GB edge computer, specific optimizer, learning rates, and training duration not reported

  5. Data preprocessing: Tactile maps normalized using 99th percentile scaling, force thresholds calibrated from dataset quantiles for robustness across episodes
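The blended distillation target from step 3 is a per-element convex combination of the ground-truth and teacher action chunks. A minimal sketch, assuming both chunks are given as H×6 nested lists and using the α=0.5 mixing ratio quoted above as the default:

```python
from typing import List

Chunk = List[List[float]]  # action chunk: H steps x action dimensions


def blend_targets(a_gt: Chunk, a_teacher: Chunk, alpha: float = 0.5) -> Chunk:
    """Return (1 - alpha) * a_gt + alpha * a_teacher, element by element."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(a_gt) != len(a_teacher):
        raise ValueError("chunks must have the same horizon")
    return [
        [(1.0 - alpha) * g + alpha * t for g, t in zip(gt_step, teacher_step)]
        for gt_step, teacher_step in zip(a_gt, a_teacher)
    ]
```

With α=0 the student imitates the demonstrations only; with α=1 it imitates the tactile-aware teacher only.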

Novelty & Lineage

The closest prior works are FD-VLA (2026) which also performs force-aware manipulation without sensors at inference, and various tactile VLA approaches like Tactile-VLA (2025), VTLA (2025), OmniVTLA (2025) that require tactile sensors during deployment.

The specific delta is:

  1. incorporating safety-aware reward weighting into flow matching training rather than simple imitation
  2. using explicit tactile reward formulation based on contact statistics and slip detection, and
  3. tactile distillation that preserves haptic reasoning without hardware dependencies.

    This is most similar to FD-VLA’s force distillation approach but differs in using reward-weighted training and explicit tactile safety modeling rather than pure force alignment.

    SIGNIFICANT: Novel combination of reward-weighted flow matching with tactile distillation, though builds incrementally on existing tactile VLA and distillation techniques.

Benchmarks & Results
  1. Jar pick-and-place task: HapticVLA achieves 18/20 (90%) vs SmolVLA 11/20 (55%), a 35-percentage-point absolute improvement

  2. Waffles pick-and-place task: HapticVLA achieves 18/20 (90%) vs SmolVLA 10/20 (50%), a 40-percentage-point absolute improvement

  3. Egg pick-and-place task: HapticVLA achieves 19/20 (95%) vs SmolVLA 10/20 (50%), a 45-percentage-point absolute improvement

  4. Mean success rate: HapticVLA achieves 86.7% vs SmolVLA 51.7%, X-VLA 0%, VLA-0 0%

  5. Ablation study: SA-RWFM alone (without distillation) achieves 75% mean success rate, showing both components contribute

    Results show consistent and substantial improvements, though X-VLA and VLA-0 completely failing suggests potential evaluation issues or domain gaps. No comparison to other tactile-aware VLA methods that require sensors at inference.

Compute & Efficiency
  1. Model size: SmolVLA foundation model with 0.45B parameters, student model has identical size after distillation

  2. Training compute: Training performed on NVIDIA Jetson Orin NX 16GB edge device, specific GPU hours and wall-clock time not reported

  3. Inference speed: Asynchronous inference shows performance degradation in ablations, suggesting latency sensitivity, specific inference times not reported

  4. Memory footprint: Deployment feasible on edge hardware (Jetson Orin NX 16GB), indicating reasonable memory requirements

  5. Deployment practicality: High - eliminates need for expensive tactile sensors while maintaining tactile-aware manipulation capabilities, improving cost-effectiveness and cross-platform reproducibility

Real-World Applicability
  1. Real robot deployment: Evaluated on dual LeRobot SO-101 manipulator setup with Intel RealSense D435 camera and wrist-mounted IMX335 cameras at 640×480 resolution

  2. Physical hardware: Uses high-density tactile arrays with 100 taxels per fingertip (200 total) operating at 120Hz with 1-9N force detection range per point

  3. Real-world tasks: Three contact-rich manipulation tasks involving actual fragile objects (marmalade jars, waffles, eggs) in uncontrolled environments

  4. Digital twin validation: 1,000 additional episodes collected in NVIDIA Isaac Sim with identical observation/action spaces to real setup for data augmentation

  5. Edge deployment: All computation runs on NVIDIA Jetson Orin NX 16GB demonstrating practical deployment without high-end compute infrastructure

Limitations & Failure Modes
  1. FUNDAMENTAL: Method still requires tactile sensors during training data collection, limiting applicability to domains where such data is unavailable

  2. ENGINEERING: Asynchronous inference shows performance degradation, suggesting temporal alignment issues that could be addressed with better system integration

  3. EVALUATION: Only three contact-rich tasks evaluated, limited diversity in manipulation scenarios and object types

  4. ENGINEERING: Baseline models (X-VLA, VLA-0) achieve 0% success rate, suggesting potential domain gap or insufficient fine-tuning that could be addressed

  5. FUNDAMENTAL: Distillation approach may not capture all nuances of direct tactile feedback, potentially limiting performance on highly dynamic contact scenarios

    Failure modes:

  1. May struggle with novel objects with significantly different material properties than training data
  2. Performance likely degrades with temporal misalignment between visual observations and required tactile responses.

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

Authors: Zifan Xu, Ran Gong, Maria Vittoria Minniti, Ahmet Salih Gundogdu et al. (12 authors) · Institution: Toyota Research · Category: cs.RO

ExpertGen automatically learns robust manipulation expert policies by steering frozen diffusion models trained on imperfect demonstrations through massively parallel reinforcement learning, enabling scalable sim-to-real transfer without reward engineering.

Practical Takeaway: If you’re working on robotic manipulation, this provides a practical pipeline for scaling up expert policy learning from limited demonstrations. The key insight is that keeping the diffusion policy frozen during RL refinement preserves human-like motion characteristics while still enabling task optimization. The combination with massively parallel simulation makes this approach tractable for complex manipulation tasks. Consider implementing this if you have access to parallel simulation environments and want to avoid manual reward engineering while still achieving robust sim-to-real transfer.

Tags: robotics manipulation sim-to-real diffusion-models reinforcement-learning behavior-cloning policy-learning expert-demonstration

arXiv · PDF

Task & Setting

Robotics manipulation faces a fundamental bottleneck: acquiring large-scale, high-quality expert demonstrations for behavior cloning requires expensive human teleoperation at scale, while scripted policies lack behavioral diversity and robust failure recovery. This creates a scalability challenge for sim-to-real transfer of manipulation policies.

The task is to learn expert manipulation policies from small sets (200-1000) of imperfect demonstrations that may have incomplete state coverage, limited recovery behaviors, or dynamics mismatch. Input demonstrations are state-action trajectories in simulation; output is a robust expert policy that achieves high task success across diverse configurations. The formal objective is a constrained MDP:

\[\max_\pi \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^T \gamma^t r(s_t, a_t)\right] \text{ s.t. } (s_t, a_t) \in \mathcal{F}, \forall t\]

where $r(s_t, a_t) = I_{\text{succ}}(s_t)$ is sparse binary task success reward and $\mathcal{F}$ represents feasible state-action constraints for real-world deployment.
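To make the sparse objective concrete, the sketch below computes the discounted return of one trajectory when the only reward is the binary success indicator. The discount value is a placeholder, and the feasibility constraint is not modeled here.

```python
from typing import Sequence


def discounted_return(success_flags: Sequence[bool], gamma: float = 0.99) -> float:
    """Sum of gamma^t * r_t, where r_t = 1 if the success indicator fires at step t."""
    return sum((gamma ** t) * float(flag) for t, flag in enumerate(success_flags))
```

Under this reward, reaching success earlier yields a strictly higher return for any gamma below 1, which is the only shaping signal the policy receives.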

Success is measured by:

  1. task success rate on diverse initial configurations
  2. failure recovery under perturbations
  3. trajectory smoothness (jerk cost)
  4. similarity to human-like motions (DTW distance), and
  5. sim-to-real transfer success rates.

The paper evaluates on ANYTASK (8 tabletop manipulation tasks) and AutoMate (industrial assembly benchmark with high-precision peg insertion tasks).

Architecture & Method
  1. Behavior Prior Modeling: Train a diffusion policy on imperfect demonstrations using standard diffusion objective:

    \[L_{\text{diff}} = \mathbb{E}_{a_t^0, s_t, k, \epsilon \sim \mathcal{N}(0,I)}\left[\lVert\epsilon - \epsilon_\theta(a_t^k, s_t, k)\rVert^2\right]\]
  2. Diffusion Steering Reinforcement Learning (DSRL): Keep the pretrained diffusion policy frozen and learn a steering policy that optimizes only the initial noise of the diffusion process, constraining exploration to remain within the learned motion manifold.

  3. FastTD3 for Massively Parallel Training: Use FastTD3 (scalable variant of TD3 with large-batch training, distributional critics, mixed exploration noise) instead of SAC to enable efficient learning in massively parallel simulation with sparse rewards.

  4. Visuomotor Policy Distillation: Apply DAgger to distill the state-based expert policy into observation-based policies for real-world deployment, with aggressive visual domain randomization.

    The core technical contribution is combining diffusion steering with massively parallel RL training, preserving human-like motion manifolds while achieving high task success under sparse rewards without manual reward engineering.
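For readers less familiar with the behavior-prior objective in step 1, here is a single-sample sketch of the noise-prediction loss. It assumes the usual DDPM forward process, a_k = sqrt(ᾱ_k)·a_0 + sqrt(1 − ᾱ_k)·ε; the `alpha_bar` schedule and the `eps_model` callable are hypothetical placeholders, since the digest does not describe the paper's schedule or network.

```python
import math
import random
from typing import Callable, List, Sequence

# eps_model(noisy_action, state, k) -> predicted noise (hypothetical signature)
EpsModel = Callable[[List[float], Sequence[float], int], List[float]]


def noise_prediction_loss(
    a0: Sequence[float],         # clean action a_t^0 from a demonstration
    state: Sequence[float],      # conditioning state s_t
    k: int,                      # diffusion step index
    alpha_bar: Sequence[float],  # cumulative noise schedule (assumption)
    eps_model: EpsModel,
    rng: random.Random,
) -> float:
    """Squared error between the sampled noise and the model's prediction."""
    eps = [rng.gauss(0.0, 1.0) for _ in a0]
    signal = math.sqrt(alpha_bar[k])
    noise = math.sqrt(1.0 - alpha_bar[k])
    # Forward-noise the clean action to diffusion step k.
    a_k = [signal * a + noise * e for a, e in zip(a0, eps)]
    pred = eps_model(a_k, state, k)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))
```

In the steering stage this network stays frozen; only the choice of the initial noise fed to the sampler is learned.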

Training Recipe
  1. Behavior Prior Training: Train diffusion policy on 200-1000 imperfect demonstrations using standard diffusion loss, specific optimizer/schedule not reported.

  2. Expert Policy Acquisition: Apply FastTD3 with diffusion steering in massively parallel simulation (IsaacLab), keeping diffusion policy frozen while optimizing initial noise steering. Training uses sparse binary task success rewards only. Specific hyperparameters not detailed.

  3. Visuomotor Distillation: Use DAgger with visual domain randomization (textures, lighting, camera poses, object appearances, HSV jitter, ±10% object scaling). Train until convergence with teacher rollout ratio α linearly decayed from 1.0 to 0.2 between 50k-100k steps, continuing to 200k steps total.

    Hardware: Experiments run on massively parallel GPU simulation, specific compute resources not reported. Real-world deployment uses AMD Ryzen Threadripper PRO 5955WX CPU + NVIDIA RTX 6000 Ada GPU. Wall-clock training times not reported.
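The teacher rollout schedule from the distillation step is a piecewise-linear function of the training step. A minimal sketch using the values quoted above (1.0 until 50k steps, linear decay to 0.2 at 100k, then constant); the function and parameter names are illustrative:

```python
def teacher_rollout_ratio(
    step: int,
    decay_start: int = 50_000,
    decay_end: int = 100_000,
    high: float = 1.0,
    low: float = 0.2,
) -> float:
    """Fraction of DAgger rollouts driven by the teacher at a given step."""
    if step <= decay_start:
        return high
    if step >= decay_end:
        return low
    frac = (step - decay_start) / (decay_end - decay_start)
    return high + frac * (low - high)
```

For example, halfway through the decay window (75k steps) the teacher drives roughly 60% of rollouts, and from 100k to the final 200k steps the student drives 80%.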

Novelty & Lineage

This builds directly on Diffusion Steering Reinforcement Learning (DSRL, Wagenmaker et al. 2025) and FastTD3 (Seo et al. 2025). The specific delta is scaling DSRL to massively parallel simulation using FastTD3 instead of SAC, and demonstrating comprehensive sim-to-real transfer through DAgger distillation.

Related work includes: Residual RL methods (Silver et al. 2018), offline-to-online RL, synthetic data generation systems like MimicGen (Mandlekar et al. 2023), and DPPO (Ren et al. 2024) for diffusion policy fine-tuning. The approach is similar to concurrent work on automated data generation (ANYTASK, Gong et al. 2025).

Rating: INCREMENTAL - combines existing techniques (DSRL + FastTD3) in a new training regime with solid empirical validation, but limited algorithmic novelty.

Benchmarks & Results
  1. ANYTASK benchmark (8 tabletop tasks): ExpertGen achieves 85% overall success vs. Residual RL 53.4%, SMP 12.9%, base Diffusion Policy 40.3%, FastTD3 alone 0.3%

  2. AutoMate benchmark (industrial assembly): ExpertGen achieves 90.5% overall success vs. Diffusion Policy with x-y noise ~60%, without noise ~20%

  3. Failure Recovery (perturbations): Under gripper opening perturbations, only 0.5% average performance drop vs. Diffusion Policy’s severe degradation. Under external force perturbations, 28.6% drop vs. near-zero for Diffusion Policy

  4. Real-world transfer: Point-cloud policies achieve 75% (Lift Banana), 65% (Push Pear), 85% (Open Drawer) vs. ANYTASK baseline 73.3%, 16.7%, 42.5%. RGB policy achieves 80% vs. 0% for behavioral cloning baseline

  5. Motion quality: Comparable human-likeness and trajectory smoothness (DTW ~0.139, Jerk ~5.97) to base diffusion policy while achieving much higher success rates

Compute & Efficiency
  1. Model size: Diffusion policy architecture details and parameter count not specified; the policy is described only as lightweight

  2. Training compute: Massively parallel simulation training in IsaacLab, specific GPU hours and hardware scale not reported

  3. Inference speed: Real-world inference on NVIDIA RTX 6000 Ada GPU, specific latency not reported

  4. Memory footprint: Not reported

  5. Deployment practicality: Successfully deployed on real Franka robot with Robotiq gripper using 3 Intel RealSense cameras, demonstrating practical feasibility for common research platforms

Real-World Applicability
  1. Real robot deployment: Successfully deployed on Franka robot with Robotiq 2F-85 gripper in real lab environment, using 3 calibrated Intel RealSense D435i cameras for perception

  2. Hardware experiments: Tested on tabletop manipulation tasks (Lift Banana, Push Pear, Open Drawer) with both point-cloud and RGB-based policies achieving 65-85% success rates

  3. Sim-to-real gap analysis: Identifies critical bottlenecks in visual policy distillation, shows importance of DAgger over pure behavioral cloning for robust transfer

  4. Production considerations: Demonstrates zero-shot transfer without real-world fine-tuning, but requires extensive visual domain randomization during simulation training

  5. Environmental robustness: Shows robustness to visual distractors and cluttered workspaces in real-world testing

Limitations & Failure Modes
  1. FUNDAMENTAL: Expert policies remain constrained by coverage of provided demonstrations - cannot generate qualitatively new behaviors beyond the demonstration distribution

  2. FUNDAMENTAL: Method requires access to high-fidelity simulation environments that support massive parallelization

  3. ENGINEERING: Performance degrades on inherently difficult configurations (e.g., objects blocking drawer opening, collision-prone stacking scenarios)

  4. ENGINEERING: Requires extensive visual domain randomization and careful tuning of DAgger mixing ratio (α) for successful sim-to-real transfer

  5. EVALUATION: Limited evaluation on truly long-horizon tasks and complex multi-step manipulation sequences

    Failure modes:

  1. Tasks requiring behaviors absent from initial demonstrations fail completely
  2. Policies may struggle with novel object geometries or configurations far from training distribution

Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning

Authors: Heng Zhou, Li Kang, Yiran Qin, Xiufeng Song et al. (15 authors) · Institution: Together AI · Category: cs.RO

CoRL introduces a reinforcement learning framework with Cross-View Spatial Reward to enable vision-language models to perform collaborative spatial reasoning from distributed ego-centric robot observations.

Practical Takeaway: This work demonstrates that multi-view reasoning significantly outperforms single global views for embodied tasks, even when the global view has complete scene coverage. The Cross-View Spatial Reward design is the key innovation - if you’re working on multi-robot systems or VLM spatial reasoning, implement the CVSR components (grounding reward with Hungarian matching, overlap accuracy for entity resolution, and task-specific answer rewards). The two-stage SFT→RL pipeline with GRPO is essential - RL from scratch fails completely. For practitioners, this suggests distributed ego-centric sensing is more valuable than expensive global camera setups, but requires careful reward engineering to teach cross-view consistency.

Tags: multi-agent-robotics vision-language-models spatial-reasoning cross-view-fusion reinforcement-learning embodied-ai collaborative-perception robotic-manipulation

arXiv · PDF

Task & Setting

Multi-agent robotic systems in applications like cooperative service robots and autonomous driving require collaborative spatial reasoning from distributed ego-centric viewpoints, but current VLMs are limited to single-view scenarios. Each agent perceives only partial, occluded observations that must be fused into coherent world understanding.

The task takes multiple RGB images {I₁, I₂, …, Iₙ} from N distributed agents and a natural language query Q as input, producing either textual responses (for counting/reasoning) or 2D image coordinates (for grasping). The formal objective maximizes expected task reward:

\[\max_\theta \mathbb{E}_{(\{I_i\}, Q, y) \sim D}[R(\hat{y}, y)]\]

where $\hat{y} = \pi_\theta(\{I_i\}_{i=1}^N, Q)$ is the VLM policy output and $y$ is the ground truth.

Success is measured by exact match accuracy for QA tasks and normalized coordinate distance scores (0-100) for grasping tasks. The paper introduces the Ego-to-World (E2W) benchmark with 160k+ samples across 15k+ scenes, evaluating three tasks:

  1. global counting across views
  2. relational location reasoning, and
  3. action-oriented grasping with view-specific coordinate prediction.
Architecture & Method
  1. Base architecture uses Qwen2.5-VL-Instruct (3B/7B parameters) as the vision-language backbone
  2. Two-stage training pipeline: Chain-of-Thought supervised fine-tuning followed by reinforcement learning
  3. Group Relative Policy Optimization (GRPO) computes group-relative advantages: $A_j = \frac{R_j - \bar{R}}{\sigma_R}$
  4. Cross-View Spatial Reward (CVSR) combines three components: $R_{CVSR} = w_{ground}R_{ground} + w_{overlap}R_{overlap} + w_{ans}R_{ans}$
  5. Grounding reward uses the Hungarian algorithm for optimal bipartite matching: $R_{ground} = \frac{1}{\lvert \sigma \rvert}\sum_{i=1}^{\lvert \sigma \rvert}\text{IoU}(\hat{b}_i, b^*_{\sigma(i)})$
  6. Overlap reward incentivizes cross-view entity resolution: $R_{overlap} = \mathbb{I}[\hat{n}_{overlap} = n^*_{overlap}]$
  7. GRPO objective with clipping: $L_{CLIP}(\theta) = \mathbb{E}_u[\sum_{j=1}^G \min(r_j(\theta)A_j, \text{clip}(r_j(\theta), 1-\epsilon, 1+\epsilon)A_j)]$

    The core contribution is the CVSR reward design that explicitly shapes cross-view fusion, spatial consistency, and visual grounding for collaborative reasoning.
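The three CVSR components can be sketched as below. Two caveats: the matching here is a brute-force search over assignments, swapped in for the Hungarian algorithm so the example needs only the standard library (it finds the same optimum but is only practical for a handful of boxes), and the unit weights are placeholders rather than the paper's values.

```python
from itertools import permutations
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred: Sequence[Box], gt: Sequence[Box]) -> float:
    """Mean IoU over the best one-to-one matching between pred and gt boxes."""
    if not pred or not gt:
        return 0.0
    small, large = (pred, gt) if len(pred) <= len(gt) else (gt, pred)
    best = 0.0
    # Brute-force stand-in for Hungarian matching (assumption, small inputs only).
    for perm in permutations(range(len(large)), len(small)):
        total = sum(iou(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best / len(small)


def cvsr_reward(
    pred_boxes: Sequence[Box],
    gt_boxes: Sequence[Box],
    n_overlap_pred: int,
    n_overlap_gt: int,
    answer_correct: bool,
    w_ground: float = 1.0,   # placeholder weight (assumption)
    w_overlap: float = 1.0,  # placeholder weight (assumption)
    w_ans: float = 1.0,      # placeholder weight (assumption)
) -> float:
    """Weighted sum of grounding, overlap-count, and answer rewards."""
    r_ground = grounding_reward(pred_boxes, gt_boxes)
    r_overlap = 1.0 if n_overlap_pred == n_overlap_gt else 0.0
    r_ans = 1.0 if answer_correct else 0.0
    return w_ground * r_ground + w_overlap * r_overlap + w_ans * r_ans
```

The matching step makes the grounding reward invariant to the order in which the model lists its boxes.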

Training Recipe
  1. Supervised Fine-Tuning stage: 3 epochs on Chain-of-Thought annotated E2W data, cosine decay learning rate (peak 2×10⁻⁵), batch size 4 with gradient accumulation over 4 steps
  2. Reinforcement Learning stage: GRPO with G=8 candidate responses per input, clipping range ε=0.2, KL coefficient β=0.04, reward weights λ₁=0.1, λ₂=1.0
  3. RL training: 200 steps with learning rate 1×10⁻⁶, normalization radius dₘₐₓ=100 pixels for grasping reward
  4. Data: 160k+ samples (100k simulated + 60k real-world) from RoboFactory/ManiSkill3 with programmatic CoT annotation generation
  5. Hardware: 8×NVIDIA H200 GPUs
  6. Wall-clock time: not reported
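The GRPO quantities behind this recipe reduce to two small functions: a per-group advantage normalization over the G sampled responses, and the clipped surrogate term for one response. The sketch below uses the population standard deviation and adds a small constant to the denominator for numerical safety; both choices are assumptions, not details reported in the digest.

```python
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """A_j = (R_j - mean(R)) / std(R), computed within one group of responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumption)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for one sampled response."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)
```

If all G responses earn the same reward, every advantage is zero and the group contributes no policy gradient. That is consistent with the observation in this summary that RL from scratch fails, since an untrained policy rarely produces any rewarded response.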
Novelty & Lineage

The work is SIGNIFICANT in novelty. Prior works like COMBO (Zhang et al. 2024) assume shared global state, while ROCKET-2 (Cai et al. 2025) focuses on single-agent cross-view goal alignment. MindCube (Yin et al. 2025) studies mental reconstruction from static views but not embodied manipulation.

The specific delta is:

  1. first formalization of multi-agent collaborative spatial reasoning from distributed ego-centric observations
  2. novel Cross-View Spatial Reward design that explicitly incentivizes cross-view entity resolution and spatial grounding
  3. comprehensive benchmark spanning reasoning and perception-grounding tasks.

The closest work is MindCube (Yin et al. 2025), but that focuses on cognitive mapping from static views rather than embodied multi-robot coordination and action generation.

Benchmarks & Results
  1. E2W-1 (Counting): CoRL-7B achieves 61.0% vs GPT-5 42.5% (18.5 point improvement)
  2. E2W-2(S) (Location Reasoning-Sim): CoRL-7B achieves 97.0% vs GPT-5 48.5% (48.5 point improvement)
  3. E2W-2(R) (Location Reasoning-Real): CoRL-7B achieves 90.0% vs GPT-5 72.5% (17.5 point improvement)
  4. E2W-3(S) (Grasping-Sim): CoRL-7B achieves 95.69 vs GPT-5 50.43 (45.26 point improvement)
  5. E2W-3(R) (Grasping-Real): CoRL-7B achieves 44.32 vs GPT-5 12.02 (32.3 point improvement)
  6. Where2Place (external): CoRL-7B achieves 50.9 vs RoboPoint 46.8 (4.1 point improvement)
  7. Real-world robot tasks: 65% success on blue block grasping vs RoboPoint 0%

    Results are consistently strong across all benchmarks with substantial improvements over both proprietary and open-source baselines.

Compute & Efficiency
  1. Model size: 3B and 7B parameters (Qwen2.5-VL backbone)
  2. Training compute: 8×NVIDIA H200 GPUs, wall-clock time not reported
  3. Inference speed/latency: not reported
  4. Memory footprint: not reported
  5. Deployment practicality: demonstrates real-world deployment with multi-robot setup (2 Franka Research 3 arms + 1 Realman mobile base) but requires pre-calibrated camera rigs and depth lifting for 3D coordinates
Real-World Applicability
  1. Real-world robot experiments with 2 Franka Research 3 arms and 1 Realman mobile base, each with Intel RealSense D435 cameras
  2. Multi-camera rig requires pre-calibration to shared world frame for coordinate lifting
  3. Model processes only RGB images during inference, uses depth channel post-prediction for 3D coordinate conversion
  4. Achieves 65% success rate on complex multi-view grasping tasks vs 0% for baselines
  5. Sim-to-real gap noted as challenge: primary failure mode is imprecise coordinate prediction rather than incorrect spatial reasoning
  6. Dataset includes 60k real-world samples to support generalization
Limitations & Failure Modes
  1. ENGINEERING: Operates on static synchronized snapshots, cannot handle asynchronous sensing or temporal reasoning over video streams
  2. FUNDAMENTAL: Centralized architecture creates communication bottleneck and single point of failure, scalability to large agent fleets unclear
  3. ENGINEERING: Limited to 2D coordinate prediction in image space, lacks full 3D spatial grounding integration
  4. EVALUATION: Benchmark limited to 50+ object categories, long-tail real-world distribution not fully addressed
  5. ENGINEERING: Real-world deployment requires pre-calibrated camera rigs, limiting deployment flexibility

    Primary failure modes:

  1. Imprecise coordinate prediction at object boundaries despite correct spatial reasoning
  2. Cross-view entity resolution failures in highly cluttered or ambiguous scenes.

What Matters for Scalable and Robust Learning in End-to-End Driving Planners?

Authors: David Holtz, Niklas Hanselmann, Simon Doll, Marius Cordts et al. (5 authors) · Institution: Toyota Research · Category: cs.RO

BevAD shows that compressing high-resolution BEV features before planning prevents overfitting and enables robust closed-loop driving, while diffusion-based planners scale better with data than point estimators.

Practical Takeaway: If you’re building end-to-end driving systems, the key insight is that high-resolution BEV features can actually hurt closed-loop performance by enabling causal confusion. Implement spatial compression (masking + patchifying BEV features) before feeding to your planner - this simple change can provide massive closed-loop improvements (+15-35 DS points). Also consider diffusion-based planning over point estimators for better data scaling, and use disentangled path+speed representations rather than temporal waypoints. The paper’s scene tokenization approach is straightforward to implement and the gains are substantial.

Tags: end-to-end-driving autonomous-driving imitation-learning bird's-eye-view diffusion-models closed-loop-evaluation scene-tokenization multi-modal-planning

arXiv · PDF

Task & Setting

End-to-end autonomous driving (E2E-AD) promises to learn robust driving behavior from data by optimizing the entire perception-to-planning pipeline jointly, but architectural choices that excel in open-loop evaluation often fail catastrophically in closed-loop driving scenarios. This gap between open-loop and closed-loop performance represents a critical challenge for real-world deployment.

The task is to process multi-view camera images (6 cameras, typical automotive setup) and produce driving commands (steering angle, acceleration) that successfully navigate complex driving scenarios. The input consists of RGB images from surround-view cameras, and the output is either waypoint trajectories with timestamps or disentangled path+speed representations. The objective is to minimize imitation learning loss:

\[\mathcal{L} = \mathcal{L}_{\text{perception}} + \lambda \mathcal{L}_{\text{planning}}\]

where perception loss supervises 3D object detection and planning loss supervises trajectory/path prediction.

Success is measured by closed-loop driving performance on the Bench2Drive benchmark: Driving Score (DS) combining route completion and safety, Success Rate (SR) measuring episode completion without infractions, and infraction rates for static (IR_s) and dynamic (IR_d) violations.

The paper uses Bench2Drive benchmark with 220 test routes in CARLA simulator, each featuring single driving scenarios. Training data consists of expert demonstrations from rule-based agents on diverse route-scenario combinations, scaling from 1K to 16K training episodes.

Architecture & Method
  1. BEV Backbone: RADIO image encoder with LoRA adaptation processes 6-camera images into Bird’s Eye View features F_BEV with dimensions H×W×C

  2. Scene Tokenizer: Novel spatial compression module that applies masking (removes 20% leftmost/rightmost BEV cells) and patchifying (combines p×p BEV patches via pixel unshuffling) to create scene tokens F_Scene

  3. Planning Head: Transformer decoder with self-attention among planning queries Q_Plan and cross-attention to compressed scene tokens, using adaLN-Zero conditioning on driving commands and ego-state

  4. Diffusion-based Planning: Planning queries generated by adding Gaussian noise to ground truth trajectories, using DDIM diffusion schedule for iterative denoising at inference

  5. Disentangled Representation: Outputs separate path (spatial waypoints at fixed distances) and speed profile instead of entangled temporal trajectories

  6. PID Controllers: Convert path and speed outputs to steering and acceleration commands for vehicle control

    The core contribution is identifying that high-resolution BEV features cause causal confusion in planning, and that spatial compression via scene tokenization dramatically improves closed-loop robustness while diffusion modeling provides superior data scaling properties.
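
The scene tokenizer in step 2 can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the 20% mask is applied to each of the left and right sides, that the BEV grid is a nested list of shape H×W×C, and that each p×p patch is flattened cell by cell (a framework pixel-unshuffle may order channels differently). The names `mask_bev`, `patchify`, and `scene_tokenize` are hypothetical.

```python
def mask_bev(bev, mask_frac=0.2):
    """Drop the leftmost and rightmost `mask_frac` of BEV columns (assumed per side)."""
    width = len(bev[0])
    drop = int(round(width * mask_frac))
    return [row[drop:width - drop] for row in bev]


def patchify(bev, p):
    """Fold each p x p block of BEV cells into one token of length p * p * C."""
    height, width = len(bev), len(bev[0])
    assert height % p == 0 and width % p == 0, "grid must be divisible by p"
    tokens = []
    for i in range(0, height, p):
        token_row = []
        for j in range(0, width, p):
            token = []
            for di in range(p):
                for dj in range(p):
                    token.extend(bev[i + di][j + dj])
            token_row.append(token)
        tokens.append(token_row)
    return tokens


def scene_tokenize(bev, p=4, mask_frac=0.2):
    """Mask first, then patchify, mirroring the order described in the text."""
    return patchify(mask_bev(bev, mask_frac), p)


# Toy BEV grid with H=8, W=20, C=2; each cell stores its (row, column) index.
bev = [[[float(r), float(c)] for c in range(20)] for r in range(8)]
tokens = scene_tokenize(bev, p=4, mask_frac=0.2)
print(len(tokens), len(tokens[0]), len(tokens[0][0]))  # 2 3 32
```

On this toy grid the planner would attend to 6 scene tokens instead of 160 BEV cells, which is the kind of spatial compression the summary credits for the closed-loop gains.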

Training Recipe
  1. Warm-up Stage: 4 epochs with AdamW Schedule-free optimizer (lr=1e-4, weight decay=0.01), batch size 128 on 8×A100-80GB GPUs, mixed precision (bfloat16), perception supervision only to initialize BEV backbone

  2. Planning Stage: Additional training with planning supervision added, BEV backbone frozen for efficiency except in data scaling experiments, same optimizer settings

  3. Streaming Training: Process sequential 2-second clips (n=20 frames) with cached BEV features to avoid recurrent computation, enabling 35× training speedup over UniAD

  4. Data Scaling: Train on progressively larger datasets from 1K to 16K episodes, each model trained until convergence

    Training hardware: 8×A100-80GB GPUs, wall-clock time not reported. Data consists of expert demonstrations collected from CARLA using PDM-lite rule-based agent on custom-generated diverse route-scenario combinations.
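
For the planning stage, the diffusion head is trained on planning queries obtained by adding Gaussian noise to ground-truth trajectories. The sketch below shows the standard forward (noising) process that DDIM shares with DDPM. The linear beta schedule, the 1000-step horizon, and the names `alpha_bars` and `noise_trajectory` are assumptions made for illustration; the digest does not give the paper's schedule.

```python
import math
import random


def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        bars.append(prod)
    return bars


def noise_trajectory(x0, t, bars, rng):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    a = bars[t]
    eps = [[rng.gauss(0.0, 1.0) for _ in point] for point in x0]
    xt = [
        [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(point, noise)]
        for point, noise in zip(x0, eps)
    ]
    return xt, eps


# Toy ground-truth path of three (x, y) waypoints.
x0 = [[0.0, 0.0], [1.0, 0.5], [2.0, 1.5]]
bars = alpha_bars()
xt, eps = noise_trajectory(x0, t=500, bars=bars, rng=random.Random(0))
```

A denoiser trained to predict `eps` (or `x0`) from `xt` can then be run with a shortened DDIM step sequence at inference time.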

Novelty & Lineage

The work builds on UniAD (2023) and ParaDrive (2024) architectures but makes several significant contributions. Prior works like VAD (2023) and UniAD (2023) focused on open-loop performance with high-resolution BEV representations. Closed-loop methods like SimLingo (2025) and TF++ (2024) used simplified single-camera setups.

The key novelty is systematically identifying that high-resolution BEV features harm closed-loop driving due to causal confusion, and introducing spatial compression via scene tokenization as mitigation. The paper also demonstrates complementary benefits of disentangled planning representations and diffusion modeling, and shows diffusion planners scale better with data than point estimators.

Previous diffusion-based planners like DiffusionDrive (2025) were only evaluated open-loop. This is the first systematic closed-loop analysis of architectural patterns and their scaling properties.

Rating: SIGNIFICANT - addresses a fundamental gap between open-loop and closed-loop performance with novel architectural insights and systematic empirical analysis.

Benchmarks & Results
  1. Bench2Drive benchmark: BevAD-M achieves 88.11% Driving Score vs previous SOTA BridgeDrive 86.87% (+1.24 DS), and 72.73% Success Rate vs 72.27% SR (+0.46 SR)

  2. Bench2Drive benchmark: BevAD-S (smaller dataset) achieves 80.63% DS vs UniAD baseline 45.81% (+34.8 DS) and 55.30% SR vs 16.36% SR (+38.9 SR)

  3. NAVSIM real-world benchmark: BevAD achieves 86.6 PDMS vs ParaDrive 84.0 (+2.6) and UniAD 83.4 (+3.2), with 10× lower training compute (570 GPU-hours vs 5700)

  4. Data scaling analysis: Diffusion planner maintains linear improvement while point estimator saturates after 8K episodes

  5. Tokenization ablation: Scene tokenization with patch size p=4 achieves 82.62% DS vs 66.86% DS without compression (+15.76 DS)

    Results show consistent improvements across multiple benchmarks. The closed-loop gains are particularly large and are not reflected in open-loop L1 trajectory error.

Compute & Efficiency
  1. Model size: Not explicitly reported, but described as “lightweight” compared to baselines

  2. Training compute: 570 GPU-hours on A100-80GB (10× less than ParaDrive’s 5700 hours), batch size 128 across 8 GPUs

  3. Inference speed: Not reported; the paper reports only a 35× training-throughput speedup over UniAD-tiny, attributed to the streaming optimization

  4. Memory footprint: Enables end-to-end training with batch size 16 on single A100-80GB GPU vs UniAD requiring two-stage training due to memory constraints

  5. Deployment practicality: Highly practical - camera-only sensor setup, no LiDAR required, efficient architecture with spatial compression, demonstrated on real-world NAVSIM benchmark

Real-World Applicability
  1. Real-world validation: Evaluated on the NAVSIM benchmark, which replays real-world driving logs (nuPlan-derived OpenScene data), achieving 86.6 PDMS and outperforming baselines by 2.6-3.2 points

  2. Sensor configuration: Uses realistic 6-camera surround-view setup typical of production vehicles, no exotic sensors required

  3. Sim-to-real: Shows that architectural insights from CARLA simulator (scene tokenization, diffusion planning) transfer effectively to the real-world driving data used by NAVSIM

  4. Production relevance: Camera-only approach with efficient compute requirements (single GPU inference) and no auxiliary task dependencies makes deployment feasible

  5. Hardware requirements: Standard automotive camera setup, no specialized hardware beyond typical autonomous driving platforms

Limitations & Failure Modes
  1. FUNDAMENTAL: Scene tokenization may not extend to high-speed highway scenarios requiring long-range perception due to spatial compression

  2. ENGINEERING: Red light infractions occur in 19% of failures, suggesting persistent causal confusion despite tokenization

  3. ENGINEERING: Route deviations due to weak navigation command conditioning, could be improved with stronger target point guidance

  4. EVALUATION: Limited to CARLA simulation environments, though NAVSIM validation partially addresses this

  5. ENGINEERING: Masking strategy is CARLA-specific (removing left/right regions), needs generalization to other map layouts

    Failure modes include: 1) Running red lights after pedestrians cross, indicating learned spurious correlations, 2) Ignoring lane change commands on multi-lane roads due to insufficient conditioning signal strength.

VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Authors: Xuanyu Zhu, Yuhao Dong, Rundong Wang, Yang Shi et al. (12 authors) · Institution: Toyota Research · Category: cs.AI

VTC-Bench introduces a comprehensive benchmark for evaluating multimodal large language models on complex multi-step visual tool orchestration, revealing that even leading models achieve only 51% success due to poor tool selection and composition strategies.

Practical Takeaway: Research engineers should recognize that current MLLMs struggle significantly with multi-step visual tool orchestration, achieving only ~50% success rates on complex tasks. The benchmark reveals that simply providing more tools doesn’t improve performance - models tend to use narrow subsets of familiar operations inefficiently. If building visual agents, focus on improving tool selection strategies and intermediate result verification rather than expanding tool libraries. The dual-paradigm evaluation (code vs interface) shows interface-based approaches can match code-based performance, suggesting structured tool APIs may be preferable to free-form code generation for reliability.

Tags: multimodal_evaluation visual_agents tool_use benchmark opencv visual_reasoning mllm_evaluation compositional_reasoning

arXiv · PDF

Task & Setting

Visual agents must execute complex computer vision workflows by chaining multiple OpenCV-based operations to solve real-world visual tasks, yet existing benchmarks only test simple single-tool scenarios. Current multimodal large language models (MLLMs) need to dynamically adapt to diverse tool sets and formulate multi-step execution plans for advanced visual reasoning, which requires more than basic image understanding.

The task evaluates how effectively MLLMs can use external visual tools for complex reasoning. Input consists of images paired with questions requiring multi-step tool orchestration. Models must generate either Python code or interface calls to solve problems across 9 categories: Robust OCR, Attention Focusing, Perceptual Restoration, Chart analysis, Measurement, Counting, Math, Spatial Reasoning, and Color analysis. Success is measured by Average Pass Rate (APR), Tool Call Rate (TCR), Mean Absolute Error (MAE) between predicted and ground-truth toolchains, and Tool Usage Efficiency (Eff_tool).

\[\text{MAE} = \frac{1}{N} \sum |L_{G, i} - L_{T, i}|, \quad Eff_{\text{tool}} = \frac{\sum L_{e, i}}{\sum L_{T, i}}\]
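
The two formulas above reduce to a few lines of arithmetic. The digest does not define the symbols, so this sketch assumes that L_G is the ground-truth toolchain length, L_T is the total number of tool calls the model made, and L_e is the number of those calls that were effective. The function names and the sample numbers are hypothetical.

```python
def toolchain_mae(gt_lengths, model_lengths):
    """Mean absolute gap between ground-truth and model toolchain lengths."""
    assert gt_lengths and len(gt_lengths) == len(model_lengths)
    gaps = [abs(g - t) for g, t in zip(gt_lengths, model_lengths)]
    return sum(gaps) / len(gaps)


def tool_efficiency(effective_calls, total_calls):
    """Share of all tool calls that were effective, pooled over problems."""
    total = sum(total_calls)
    return sum(effective_calls) / total if total else 0.0


# Three made-up problems: reference lengths, model call counts, effective calls.
gt_lengths = [5, 4, 6]
model_lengths = [7, 4, 3]
effective_calls = [3, 4, 2]

mae = toolchain_mae(gt_lengths, model_lengths)            # (2 + 0 + 3) / 3
efficiency = tool_efficiency(effective_calls, model_lengths)  # 9 / 14
print(round(mae, 3), round(efficiency, 3))
```

Note that Eff_tool pools the sums over all problems before dividing, so a single problem with many redundant calls lowers the score more than a per-problem average would.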

VTC-Bench provides 680 curated problems with ground-truth execution trajectories, featuring 32 diverse OpenCV tools across four modules (Geometry, Enhancement, Feature Extraction, Drawing). Average toolchain length is 5.04 steps using 4.97 unique tools per problem.

Architecture & Method
  1. VTC-Bench is a benchmark framework, not a new model architecture - it evaluates existing MLLMs using structured visual tool evaluation

  2. Tool Library: 32 OpenCV-based operations organized into four functional modules: Geometry (spatial transformations), Enhancement (signal optimization), Feature Extraction (structural primitives), Drawing (verification and measurement)

  3. Cognitive Hierarchy: Three-tier progressive evaluation structure - Tier 1 (Visual Perception Enhancement), Tier 2 (Quantitative Visual Estimation), Tier 3 (Compositional Visual Reasoning)

  4. Dual Evaluation Paradigm: Models tested using both code-driven (Python execution) and interface-driven (atomic tool calls) approaches within sandbox environments

  5. Ground Truth Generation: Expert-annotated reference toolchains verified through multi-stage validation using Gemini-3.0-Pro and manual cross-verification

    The core technical contribution is the systematic design of complex multi-tool composition tasks requiring functional dependencies where prior tool outputs serve as mandatory inputs for subsequent operations.
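
The functional-dependency idea can be illustrated with a toy interface-driven toolchain in which every step consumes the previous step's output. The three tools below are plain-Python stand-ins operating on nested lists, not the benchmark's OpenCV tools, and `run_toolchain` and the tool names are hypothetical.

```python
def crop(image, top, left, height, width):
    """Keep a rectangular region of a row-major grayscale image."""
    return [row[left:left + width] for row in image[top:top + height]]


def threshold(image, cutoff):
    """Binarize: 1 where the pixel exceeds the cutoff, else 0."""
    return [[1 if px > cutoff else 0 for px in row] for row in image]


def count_nonzero(image):
    """Count foreground pixels in a binarized image."""
    return sum(sum(row) for row in image)


TOOLS = {"crop": crop, "threshold": threshold, "count_nonzero": count_nonzero}


def run_toolchain(image, steps):
    """Run tools in order; each tool's output is the next tool's mandatory input."""
    value, trace = image, []
    for name, kwargs in steps:
        value = TOOLS[name](value, **kwargs)
        trace.append(name)
    return value, trace


image = [
    [0, 10, 200, 220],
    [5, 180, 190, 0],
    [0, 0, 250, 30],
]
steps = [
    ("crop", {"top": 0, "left": 1, "height": 2, "width": 3}),
    ("threshold", {"cutoff": 128}),
    ("count_nonzero", {}),
]
result, trace = run_toolchain(image, steps)
print(result, trace)  # 4 ['crop', 'threshold', 'count_nonzero']
```

Skipping or reordering a step changes the answer (counting before thresholding would sum raw intensities), which is the kind of prerequisite error the benchmark is designed to expose.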

Training Recipe

This paper introduces a benchmark evaluation framework, not a trained model. The training details refer to the evaluated models:

  1. Evaluated models include pre-trained systems: proprietary models (GPT-o3, GPT-4o, Gemini-3.0-Pro/Flash, GPT-5.2) and open-source models (Qwen3-VL variants 8B-235B, DeepEyes, Thyme, V-Thinker)

  2. Training details for individual models not reported - paper focuses on evaluation methodology rather than model training

  3. Benchmark construction involved human annotation and expert verification phases, but no model training was performed

  4. Evaluation framework uses Qwen-Agent and Thyme frameworks for tool orchestration during testing

  5. Hardware and compute requirements not reported as this is an evaluation study

Novelty & Lineage

This work builds on prior visual tool benchmarks including VisualToolBench (2024), AgentVista (2026), and TIR-Bench (2025). The key advancement is introducing complex multi-tool composition with strict functional dependencies, where tool outputs must serve as inputs for subsequent operations.

Closest prior works: VisualToolBench had only 6 tools with basic combinations, AgentVista focused on 4 tools without functional dependencies, TIR-Bench used single-tool scenarios. VTC-Bench introduces 32 diverse tools requiring average 5.04-step chains with mandatory inter-tool dependencies.

The specific delta includes: 1) a significantly larger and more diverse tool inventory, 2) a systematic cognitive hierarchy mapping basic perception to complex reasoning, 3) strict functional dependency requirements, and 4) dual evaluation paradigms (code vs interface).

Rating: SIGNIFICANT - substantial advancement in benchmark complexity and systematic evaluation framework, though building incrementally on established tool-use evaluation concepts.

Benchmarks & Results
  1. VTC-Bench Overall Performance: Gemini-3.0-Pro achieves 51.18% APR (best performing model), showing significant room for improvement across all models

  2. Proprietary vs Open-source Gap: Leading proprietary models (Gemini-3.0-Pro, GPT-5.2) achieve 44-51% while best open-source models (Qwen3-VL-235B) reach only 38% maximum

  3. Tool Augmentation Benefits: Proprietary models show consistent improvements with tools (+6-9%), while open-source models show minimal or negative gains

  4. Task-specific Performance: Models perform better on Tier 1 (perception enhancement) tasks, struggle more with Tier 2 (quantitative estimation) requiring precise tool selection

  5. Code vs Interface Paradigms: Interface-based tool calling performs comparably to code-based approaches, contrary to common assumptions

  6. Tool Usage Efficiency: Even top models achieve low efficiency scores - GPT-5.2 only 16.78% efficiency, Gemini-3.0-Pro 36.51%, indicating significant redundancy in tool calling patterns

    Previous SOTA comparisons not directly available as this introduces a new benchmark, but the low overall scores indicate substantial challenges for current SOTA models.

Compute & Efficiency
  1. Model sizes vary across evaluated systems: 7B-8B parameters (open-source tool-use models) to 235B parameters (Qwen3-VL variants), with proprietary models having undisclosed sizes

  2. Training compute not reported as this is an evaluation study rather than model training work

  3. Inference efficiency measured through Tool Usage Efficiency metric - even the top models call tools inefficiently, at roughly 17-37% efficiency (16.78% for GPT-5.2, 36.51% for Gemini-3.0-Pro)

  4. Memory footprint not explicitly reported, but framework supports both code execution and interface-based tool calling paradigms

  5. Deployment practicality is moderate - requires sandbox environments for safe code execution and structured tool libraries, making it suitable for controlled evaluation but challenging for production deployment without proper safety measures

Real-World Applicability
  1. Limited real-world deployment evidence - this is primarily a benchmark evaluation study without production system testing

  2. Tool library based on OpenCV operations aligns with practical computer vision pipelines used in industry applications

  3. Task categories (OCR, measurement, counting, spatial reasoning) correspond to common real-world visual analysis needs

  4. No hardware experiments or robot deployment results reported

  5. Evaluation uses controlled benchmark conditions rather than messy real-world data, limiting direct applicability assessment

  6. Framework design enables potential real-world application through its systematic tool orchestration approach, but actual deployment validation is absent

Limitations & Failure Modes
  1. EVALUATION - Benchmark focuses on controlled scenarios with curated images rather than testing robustness to real-world visual complexity and noise

  2. FUNDAMENTAL - Current models show severe limitations in multi-tool composition, defaulting to narrow subsets of familiar tools rather than optimal selection

  3. ENGINEERING - Tool usage efficiency remains extremely low (roughly 17-37% for the top models), indicating models rely on trial-and-error rather than strategic planning

  4. EVALUATION - Limited analysis of why certain tool combinations fail and insufficient investigation of intermediate reasoning steps

  5. FUNDAMENTAL - Significant gap between proprietary and open-source model capabilities suggests architectural or training differences not addressed

    Failure modes: 1) Strategic misselection and operational errors - models choose inappropriate tools and execute them incorrectly, bypassing prerequisite steps, 2) Over-reliance on intermediate tool outputs - models superficially analyze tool results without cross-verification against original visual input, leading to propagated errors.

VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

Authors: Chaoyang Wang, Wenrui Bao, Sicheng Gao, Bingxin Xu et al. (8 authors) · Institution: Toyota Research · Category: cs.CV

VLA-Thinker introduces the first vision-language-action model that treats visual perception as a dynamically invocable reasoning action, enabling interleaved perception-reasoning-action processes that achieve 97.5% success on LIBERO and strong gains on long-horizon robotic manipulation tasks.

Practical Takeaway: This work demonstrates that treating perception as an active reasoning component rather than static context can significantly improve VLA model performance, especially on long-horizon tasks. Research engineers should consider implementing similar interleaved perception-reasoning frameworks in their own VLA systems. The two-stage training approach (SFT cold-start followed by trajectory-level RL) provides a practical template for training such systems. However, the approach is currently limited to simulation and a single visual tool, so engineers should focus on extending the visual tool repertoire and validating on real robots before deployment.

Tags: vision-language-action embodied-ai robotics chain-of-thought reinforcement-learning manipulation multimodal-reasoning active-perception

arXiv · PDF

Task & Setting

Vision-Language-Action (VLA) models need to control robotic manipulators by generating action commands from visual observations and natural language instructions. Existing VLA models treat visual perception as static context provided once, limiting their ability to actively revisit the environment and resolve ambiguities during long-horizon manipulation tasks. This paper addresses the problem of enabling VLA models to perform “thinking-with-image” reasoning, where perception becomes a dynamically invocable reasoning action.

The task is to develop a VLA model that can iteratively generate multimodal reasoning trajectories. Given an initial language instruction T_0 and visual observation set V_0, the model produces sequences of outputs:

\[A_k = f_{VLA}(\{T_i, C_i, V_i\}_{i=0}^k)\]

where T_k denotes textual reasoning steps, C_k denotes perception invocations (specifically zoom-in tool calls), V_k denotes returned visual evidence, and A_k denotes environment actions. The complete trajectory is:

\[\tau = \{T_1, C_1, V_1, T_2, C_2, V_2, ..., T_k, A_k\}\]
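
The trajectory structure above can be read as a simple control loop: reason, optionally call the zoom-in tool, append the returned crop as new evidence, and stop once an action is emitted. The sketch below assumes a row-major nested-list image and an (x0, y0, x1, y1) bounding box; `toy_policy` is a hard-coded stand-in for the VLA model, and all names are hypothetical.

```python
def zoom_in(image, bbox):
    """Crop the region (x0, y0, x1, y1) from a row-major image."""
    x0, y0, x1, y1 = bbox
    return [row[x0:x1] for row in image[y0:y1]]


def rollout(policy, instruction, image, max_steps=4):
    """Interleave thoughts (T), tool calls (C), and evidence (V) until an action (A)."""
    context = [("T", instruction), ("V", image)]  # T_0 and V_0
    for _ in range(max_steps):
        thought, decision = policy(context)
        context.append(("T", thought))
        if decision[0] == "act":
            context.append(("A", decision[1]))
            return context
        context.append(("C", decision[1]))
        context.append(("V", zoom_in(image, decision[1])))
    raise RuntimeError("policy never produced an action")


def toy_policy(context):
    """Stand-in model: zoom once, then act."""
    already_zoomed = any(kind == "C" for kind, _ in context)
    if not already_zoomed:
        return "inspect the region near the gripper", ("zoom", (1, 0, 3, 2))
    return "target located, move toward it", ("act", [0.1, 0.0, -0.05])


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
trajectory = rollout(toy_policy, "pick up the red block", image)
print([kind for kind, _ in trajectory])  # ['T', 'V', 'T', 'C', 'V', 'T', 'A']
```

The `max_steps` cap is an added safeguard for the sketch; the digest does not say how the paper bounds the number of perception calls.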

Success is measured by task completion success rate on manipulation benchmarks. The paper evaluates on LIBERO (5 suites with 10-90 tasks each, 50 test scenes per task) and RoboTwin 2.0 (12 representative dual-arm tasks with 100 test scenarios each, categorized by horizon length from 112-637 steps).

Architecture & Method
  1. Base architecture: OpenVLA-OFT built on OpenVLA with vision encoder and LLaMA2-7B backbone, using action chunking and parallel decoding design

  2. Core innovation: Models visual perception as an explicit, dynamically invocable reasoning action rather than static context - the model can actively request task-relevant visual information through tool invocation during reasoning

  3. Visual tool implementation: ZOOM-IN mechanism that inspects fine-grained details within specified image regions using bounding box coordinates

  4. Multimodal reasoning process: Iterative interleaved sequence where controller determines whether to generate next reasoning step and perception request or terminate and output action

  5. Two-stage training pipeline combining supervised fine-tuning on synthesized embodied Chain-of-Thought data followed by Group Relative Policy Optimization (GRPO) for trajectory-level alignment

  6. Reward function for reinforcement learning:

    \[R(\tau) = \alpha_s \cdot I_{success} + \alpha_f \cdot I_{format}\]

    where $I_{success}$ indicates task completion and $I_{format}$ ensures correct reasoning format

  7. GRPO objective with relative advantage computation:

    \[A_i = \frac{R(\tau_i) - \text{mean}(\{R(\tau_j)\}_{j=1}^M)}{\text{std}(\{R(\tau_j)\}_{j=1}^M)}\]

Training Recipe
  1. Data construction: Use Qwen3-VL-30B-A3B-Instruct to synthesize embodied Chain-of-Thought data on LIBERO (273,465 keyframes) and RoboTwin2.0 (215,784 keyframes) demonstrations

  2. Stage 1 - Supervised Fine-Tuning: 100k steps, batch size 64, learning rate 1×10^-5, AdamW optimizer, hybrid attention mask enabling autoregressive CoT supervision and bidirectional action supervision

  3. Stage 2 - Reinforcement Learning: Group Relative Policy Optimization (GRPO), batch size 128, learning rate 2×10^-6, AdamW optimizer, sparse task-success reward plus format regularization

  4. Hardware: 8 NVIDIA H100 GPUs, approximately 3 days total training time

  5. Input modalities: Single-view images, language instructions, robot proprioceptive states (except LIBERO which excludes proprioception)
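
The Stage 2 reward and the group-relative advantage from the method description can be sketched as follows. The weights `alpha_s` and `alpha_f`, the use of the population standard deviation, and the small `eps` added to the denominator are assumptions for illustration; the digest gives only the general form of both formulas.

```python
import statistics


def trajectory_reward(success, well_formatted, alpha_s=1.0, alpha_f=0.1):
    """R(tau) = alpha_s * I_success + alpha_f * I_format (weights are assumed)."""
    return alpha_s * float(success) + alpha_f * float(well_formatted)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One group of M=4 rollouts for the same task: (task success, format correct).
outcomes = [(True, True), (False, True), (False, False), (True, True)]
rewards = [trajectory_reward(s, f) for s, f in outcomes]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Because advantages are computed relative to the group, a group in which every rollout receives the same reward yields zero advantage for all of them, which is one reason a sparse success reward can give weak learning signal on very hard or very easy tasks.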

Novelty & Lineage

This work introduces the first VLA model capable of thinking-with-image reasoning, fundamentally changing how perception is integrated into embodied reasoning. Prior work includes static Chain-of-Thought approaches (CoT-VLA 2025, ECoT 2025) and reinforcement learning enhanced VLA models (Robot-R1 2025, VLA-RL 2025), but all treat visual inputs as static context.

The specific delta is modeling perception as a dynamically invocable reasoning action that can be called during intermediate reasoning steps, enabling active environment revisitation. The two-stage training combining SFT cold-start with GRPO trajectory-level optimization is also novel for multimodal reasoning-action sequences.

This represents a SIGNIFICANT advancement as it fundamentally changes the perception-reasoning-action paradigm from passive to active, though it builds incrementally on existing VLA architectures and CoT reasoning frameworks.

Benchmarks & Results
  1. LIBERO benchmark: 97.5% average success rate vs previous best of 95.5% (π0), representing +2.0% overall improvement; +6.5% improvement over OpenVLA-OFT baseline (91.0%)

  2. LIBERO-Spatial: 98.7% vs 96.8% (π0), +1.9% improvement

  3. LIBERO-Object: 99.0% vs 98.8% (π0), +0.2% improvement

  4. LIBERO-Goal: 95.2% vs 95.8% (π0), -0.6% decrease

  5. LIBERO-Long: 96.9% vs 94.0% (UnifiedVLA), +2.9% improvement

  6. RoboTwin2.0 short-horizon tasks (100-130 steps): 62.3% vs 55.0% (DeepThinkVLA), +7.3% improvement

  7. RoboTwin2.0 medium-horizon tasks (150-230 steps): 70.7% vs 65.3% (DeepThinkVLA), +5.4% improvement

  8. RoboTwin2.0 long/extra-long horizon tasks (280-650 steps): 64.6% vs 57.8% (DeepThinkVLA), +6.8% improvement

Compute & Efficiency
  1. Model size: Based on OpenVLA-OFT with LLaMA2-7B backbone (~7B parameters)

  2. Training compute: 8 NVIDIA H100 GPUs for approximately 3 days, on the order of 8 × 72 ≈ 576 GPU-hours if all GPUs were busy throughout (the paper does not report GPU-hours)

  3. Inference speed/latency: Not reported, though paper mentions “high efficiency in online reinforcement learning scenarios”

  4. Memory footprint: Not reported

  5. Deployment practicality: Uses only single-view images rather than multi-view setup of original OpenVLA, potentially more practical, but inference costs likely higher due to iterative reasoning and tool calls

Real-World Applicability
  1. All experiments conducted in simulation environments (LIBERO and RoboTwin2.0 simulators) with no real robot validation reported

  2. RoboTwin2.0 includes domain randomization (clutter, lighting, background, tabletop height, language instructions) intended to improve sim-to-real transfer

  3. No deployment results, hardware experiments, or production integration discussed

  4. No explicit sim-to-real analysis or validation on physical robots

  5. Paper focuses on validating the interleaved perception-reasoning paradigm in simulation before real-world deployment

Limitations & Failure Modes
  1. FUNDAMENTAL: Only implements ZOOM-IN tool, limiting types of visual queries possible compared to human active perception capabilities

  2. ENGINEERING: All evaluation conducted in simulation without real-world robot validation

  3. EVALUATION: Limited to single visual tool type, unclear how approach scales to more diverse perception actions

  4. ENGINEERING: Higher computational cost due to iterative reasoning and tool invocation compared to direct action prediction

  5. FUNDAMENTAL: Relies on sparse reward signals which may not provide sufficient guidance for complex reasoning chains

    Failure modes: 1) Model may invoke tools unnecessarily in simple scenarios, leading to inefficient reasoning, 2) In complex scenes, single zoom-in tool may be insufficient to resolve all ambiguities, potentially leading to suboptimal decisions.


WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning

Authors: Stefan Englmeier, Katharina Winter, Fabian B. Flohr · Institution: Figure AI, MIT · Category: cs.CV

WorldVLM introduces the first hybrid architecture that combines Vision-Language Model reasoning for high-level behavioral planning with World Model forecasting for precise ego-trajectory prediction in autonomous driving.

Practical Takeaway: This work demonstrates a promising direction for combining high-level reasoning with low-level prediction in autonomous driving, but current implementation has significant practical limitations. The key insight - using VLMs for behavioral reasoning rather than direct trajectory generation - could be valuable for building more interpretable driving systems. However, the 1-second VLM inference time makes real-time deployment impractical. Research engineers should consider this architectural pattern but focus on faster VLMs or hierarchical inference schemes. The behavior conditioning approach and evaluation methodology provide useful templates for future hybrid systems, though closed-loop validation remains essential.

Tags: autonomous_driving vision_language_models world_models trajectory_prediction multimodal_fusion behavioral_planning interpretable_ai nuScenes

arXiv · PDF

Task & Setting

Autonomous driving systems require both high-level contextual reasoning about complex traffic scenarios and accurate prediction of vehicle dynamics and trajectories. Vision-Language Models (VLMs) excel at scene understanding and reasoning but struggle with spatial comprehension and precise trajectory generation, while World Models (WMs) can predict realistic scene evolution but lack semantic reasoning capabilities.

The task is to develop a hybrid system that combines VLM reasoning with WM forecasting for autonomous driving. The input consists of front-view camera images (RGB) along with navigation instructions and current ego-speed. The VLM generates structured behavioral commands including justification text, action descriptions, and 2D steering-velocity vectors. These commands then condition a World Model to predict future ego-trajectories over 1-3 second horizons. The training objective combines text generation loss and behavioral prediction loss:

\[\ell = \ell_{behavior} + \ell_{text}\]

where

\[\ell_{behavior} = MSE(\hat{\alpha}, \alpha) + MSE(\hat{v}, v)\]

Success is measured using nuScenes metrics: L2 trajectory error (meters) at 1s/2s/3s horizons, collision rates (percentage), and text quality metrics (BLEU, ROUGE, BERTScore) for reasoning evaluation.

The authors extend nuScenes with 28,131 action-justification annotations combining doScenes scene-level instructions and DriveLM frame-level VQA data, processed through GPT-OSS-120B to generate structured reasoning traces.

Architecture & Method
  1. Vision-Language Model (VLM) component uses LLaVA-Qwen1.5-0.5B or LLaVA-Qwen2-1.5B to process front-view images and generate structured reasoning outputs with justification, action description, and behavioral commands

  2. Behavioral command generation via a 3-layer MLP with ReLU activations and 0.5 dropout probability that maps VLM hidden states to 2D steering-velocity vectors using either first 16 tokens or dedicated behavior tokens

  3. World Model component adopts LAW (Latent Action-aware World model) architecture with transformer-based spatial decoder, waypoint transformer decoder, and WM transformer decoder

  4. Behavior conditioning mechanism injects VLM-generated behavioral commands at two points: concatenated with learned waypoint queries and spatial view features before waypoint decoding, and concatenated with predicted waypoints before WM transformer processing

  5. Text generation loss uses standard next-token prediction:

    \[\ell_{text} = -\frac{1}{T-1}\sum_{t=1}^{T-1} \log p_\theta(y_{t+1} | y_{\leq t}, x)\]

    The core technical contribution is the first framework to condition trajectory-predictive world models with high-level VLM behavioral reasoning, enabling semantically-guided and interpretable autonomous driving.
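
The training objective defined in Task & Setting (behavior MSE plus next-token text loss) can be written out directly. This is a minimal sketch on plain Python lists: `token_probs` is assumed to hold the probability the model assigned to each of the T-1 correct next tokens, the batch values are made up, and the function names are hypothetical.

```python
import math


def mse(pred, target):
    """Mean squared error over a batch of scalar predictions."""
    assert pred and len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def behavior_loss(steer_pred, steer_true, speed_pred, speed_true):
    """l_behavior = MSE(steering) + MSE(velocity)."""
    return mse(steer_pred, steer_true) + mse(speed_pred, speed_true)


def text_loss(token_probs):
    """Mean negative log-likelihood of the correct next tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def total_loss(steer_pred, steer_true, speed_pred, speed_true, token_probs):
    """l = l_behavior + l_text, with equal weighting as in the stated objective."""
    behavior = behavior_loss(steer_pred, steer_true, speed_pred, speed_true)
    return behavior + text_loss(token_probs)


loss = total_loss(
    steer_pred=[0.1, -0.2], steer_true=[0.0, -0.1],
    speed_pred=[5.0, 6.0], speed_true=[5.5, 6.0],
    token_probs=[0.5, 0.25],
)
print(round(loss, 4))
```

Since steering and velocity live on different numeric scales, the unweighted sum lets the larger-magnitude term dominate unless the targets are normalized; the digest does not say whether the paper normalizes them.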

Training Recipe
  1. Stage 1 - VLM fine-tuning: Train on extended nuScenes dataset with justification-action annotations, learning rate 1e-6 with 100 warmup steps decaying to 1e-7, weight decay 0.06, batch size 3 distributed across 2 GPUs, trained for 4 epochs

  2. Stage 2 - World Model conditioning training: Train LAW model with behavior conditioning using published vision encoder checkpoint, adopt original LAW training configurations on 4 NVIDIA A40 GPUs with 46GB VRAM

  3. Data: 22,516 training samples (80%) and 5,615 validation samples (20%) from nuScenes train split with synthetic reasoning annotations generated via GPT-OSS-120B

  4. Hardware: Training performed on NVIDIA A40 GPUs for WM, inference evaluated on NVIDIA GeForce RTX 4090

  5. Wall-clock time: Not reported for training duration, VLM inference ~1s, LAW runs at ~12Hz

Novelty & Lineage

This work builds on established components but introduces a novel integration approach. Prior work includes LAW (2025) for latent world models, LMDrive (2024) and BEVDriver (2025) for VLM-based driving, and various autonomous driving world models like GAIA-2, Vista, and Orbis.

The specific delta is the first conceptual framework where a VLM generates high-level behavioral commands that condition a trajectory-predictive world model, rather than using VLMs for end-to-end trajectory planning or world models in isolation. The hybrid architecture enables interpretable reasoning while maintaining physically grounded trajectory prediction.

Closest prior works: LAW (2025) for the WM backbone, LMDrive (2024) for VLM driving applications, SimLingo (2025) for vision-only VLM driving.

Rating: INCREMENTAL - Novel architectural combination but built from existing components without fundamental algorithmic breakthroughs.

Benchmarks & Results
  1. nuScenes trajectory prediction: L2 error 0.31m/0.62m/1.03m at 1s/2s/3s (within 0.01m of the LAW baseline's 0.31m/0.61m/1.02m), collision rates 0.10%/0.14%/0.48% at 1s/2s/3s (baseline: 0.10%/0.14%/0.44%)

  2. Reasoning quality evaluation: BLEU-1 0.36 vs 0.04 zero-shot, ROUGE-1 0.47 vs 0.09 zero-shot, BERTScore F1 0.67 vs 0.54 zero-shot (24% improvement)

  3. Ablation studies show ground truth motion vector conditioning achieves best L2 performance (0.28m at 3s, 73% improvement over baseline)

    Results are mixed - trajectory accuracy maintained but collision rates slightly increase at long horizons, while reasoning quality substantially improves over zero-shot baselines. No comparison to other hybrid VLM-WM approaches as this appears to be the first such framework.
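
The two relative improvements quoted in the list can be reproduced from the listed scores. The 73% figure is assumed here to be measured against the LAW baseline's 1.02m L2 error at 3s, which the summary does not state explicitly:

\[\frac{0.67 - 0.54}{0.54} \approx 0.24, \qquad \frac{1.02 - 0.28}{1.02} \approx 0.73\]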

Compute & Efficiency
  1. Model size: LLaVA-Qwen1.5-0.5B (500M parameters) or LLaVA-Qwen2-1.5B (1.5B parameters) for VLM component, LAW model size not specified

  2. Training compute: 4 NVIDIA A40 GPUs with 46GB VRAM for WM training, 2 GPUs for VLM training, total GPU hours not reported

  3. Inference speed: VLM ~1s per forward pass due to long reasoning generation, LAW runs at ~12Hz on NVIDIA GeForce RTX 4090

  4. Memory footprint: Not explicitly reported, though uses compact models for feasibility demonstration

  5. Deployment practicality: Current inference speeds (1s VLM + 12Hz WM) problematic for real-time driving; authors propose decoupled inference rates with lower-frequency VLM updates for practical deployment
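The decoupled-rate idea in the last item can be illustrated with a toy scheduler. This is a sketch under stated assumptions, not the authors' implementation: the class `DecoupledPlanner`, the command strings, and the placeholder `slow_vlm` and `fast_wm` methods are all hypothetical, and a real system would likely run the VLM asynchronously rather than inline.

```python
from dataclasses import dataclass


@dataclass
class DecoupledPlanner:
    """Runs a fast world model every tick and refreshes a slow VLM command every `vlm_period` ticks."""

    vlm_period: int              # hypothetical: world-model ticks between VLM refreshes
    command: str = "keep_lane"   # hypothetical default behavioral command
    vlm_calls: int = 0

    def slow_vlm(self, tick: int) -> str:
        # Placeholder for the ~1 s VLM reasoning pass.
        self.vlm_calls += 1
        return "slow_down" if tick >= 24 else "keep_lane"

    def fast_wm(self, command: str, tick: int) -> tuple[str, int]:
        # Placeholder for the ~12 Hz world-model trajectory prediction.
        return (command, tick)

    def step(self, tick: int) -> tuple[str, int]:
        # Refresh the behavioral command only on VLM ticks; reuse it otherwise.
        if tick % self.vlm_period == 0:
            self.command = self.slow_vlm(tick)
        return self.fast_wm(self.command, tick)


planner = DecoupledPlanner(vlm_period=12)  # roughly 1 Hz VLM under a 12 Hz world-model loop
outputs = [planner.step(t) for t in range(36)]
print(planner.vlm_calls, outputs[23], outputs[24])
```

The design trade-off is staleness: between refreshes the world model is conditioned on a command that may be up to one VLM period old.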

Real-World Applicability
  1. Evaluation conducted on nuScenes real-world dataset with actual driving scenarios including urban environments, intersections, construction zones

  2. No deployment on actual vehicles reported - evaluation remains open-loop on recorded data

  3. No hardware experiments on physical robots or autonomous vehicles described

  4. Sim-to-real discussion acknowledges need for closed-loop evaluation and violation-sensitive safety metrics in future work

  5. Authors note current single-frame processing and narrow field of view as limitations for real-world deployment

    The work demonstrates feasibility on real driving data but has not been validated in live driving scenarios or deployed systems.

Limitations & Failure Modes
  1. FUNDAMENTAL: Single-frame processing limits temporal awareness, and the cropped field of view of LLaVA’s vision encoder limits spatial understanding

  2. FUNDAMENTAL: VLM inference speed (~1s) too slow for real-time autonomous driving applications requiring millisecond response times

  3. ENGINEERING: Reasoning dataset lacks human-annotated traces, relying on synthetic GPT-generated annotations that may not reflect human reasoning patterns

  4. EVALUATION: No closed-loop evaluation or safety-critical scenario testing - only open-loop trajectory prediction metrics

  5. ENGINEERING: Long-horizon collision rates increase compared to baseline, suggesting behavioral conditioning may be suboptimal for complex interactions

  6. EVALUATION: Limited comparison to other hybrid approaches since this is the first VLM-WM integration framework

    Failure modes: 1) VLM misclassifies traffic signals leading to incorrect behavioral commands that propagate to trajectory generation, 2) Overly conservative behavior in interactive scenarios may cause traffic flow disruption or deadlock situations.


VIP-Loco: A Visually Guided Infinite Horizon Planning Framework for Legged Locomotion

Authors: Aditya Shirwatkar, Satyam Gupta, Shishir Kolathaya · Institution: Figure AI · Category: cs.RO

VIP-Loco integrates vision-based scene understanding with infinite-horizon MPC through a learned internal model that provides imagination augmentation during RL training and serves as a dynamics oracle for structured planning at deployment.

Practical Takeaway: Research engineers working on legged robotics should pay attention to this framework’s approach of using a learned internal model for both training augmentation and deployment-time planning. The key insight is that the same model can serve dual purposes: providing imagination rollouts during RL training and acting as a dynamics oracle for MPC at deployment. The variational objective for visual representation learning outperforms consistency-based approaches on visually constrained tasks. However, the simulation-only results and GPU requirements mean this needs significant engineering work for real-world deployment.

Tags: legged_locomotion model_predictive_control reinforcement_learning visual_perception depth_sensing planning robotics quadruped

arXiv · PDF

Task & Setting

Legged robots must navigate complex, dynamic environments requiring anticipation of terrain changes like gaps, stairs, and obstacles. Traditional Model Predictive Control (MPC) provides interpretable planning but struggles with high-dimensional visual inputs, while Reinforcement Learning (RL) adapts well to visual scenarios but lacks structured planning capability.

The task is visual locomotion control where the input consists of proprioceptive observations (joint angles, velocities, gravity projection, velocity commands) and depth images from onboard cameras. The output is joint angle perturbations for PD controllers. The objective maximizes cumulative discounted reward:

\[\max_{\pi} \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]\]

where rewards include velocity tracking, stability maintenance, and collision avoidance.
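As a concrete reading of the objective, the discounted return of a finite reward sequence can be computed with a backward recursion. This is a generic sketch; the default discount 0.99 matches the value listed in the training recipe, while the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for a finite reward sequence."""
    total = 0.0
    for reward in reversed(rewards):
        # Backward recursion: G_t = r_t + gamma * G_{t+1}
        total = reward + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
print(discounted_return([1.0] * 1000))
```

With a constant reward of 1 the infinite-horizon sum is 1 / (1 - gamma), so a long finite rollout at gamma = 0.99 approaches 100.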

Success is measured by episodic return, success rate across terrain difficulty levels (0-8), and terrain mastery progression. Evaluation covers six terrain types: slopes, stairs, gaps, climbing, crawling, and tilting across three robot morphologies in Isaac Gym simulation with 4096 parallel environments.

Architecture & Method
  1. Internal Model with GRU cell operating at 10Hz for temporal memory and CNN-MLP encoder for depth processing
  2. Model state representation: y = [x, h, z] where x is kinodynamic center-of-mass state, h is recurrent hidden state, z is stochastic latent state
  3. Encoder components (p^z_φ, p^x_φ) estimate future states from observation embeddings and recurrent state
  4. Dynamics components (d^z_φ, d^x_φ) predict future states, with d^x_φ using rigid body dynamics equations for interpretability
  5. Expert Actor π_θ (MLP at 50Hz) receives proprioceptive observations and imagined rollouts with stop-gradient
  6. Combined supervised loss for internal model:

    \[L_{IM} = \sum_{t=0}^{L} \mathbb{E}_{p_\phi} \left[ -\ln r_\phi(r^{\text{sim}}_t \mid y_t, a^{\text{expert}}_t) - \ln V_\phi(V^{\text{target}}_t \mid y_t) + \beta\, \text{KL}(p^z_\phi \,\|\, d^z_\phi) - \ln \pi_\phi(a^{\text{expert}}_t \mid y_t) + \|x^{\text{sim}}_t - p^x_\phi\|^2_2 + \|x^{\text{sim}}_{t+1} - d^x_\phi\|^2_2 - \ln q_\phi(o_t \mid y_t) \right]\]
  7. At deployment, infinite-horizon MPC uses learned models with MPPI optimization and value function bootstrapping for terminal states
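The deployment step in the last item (MPPI with a value function closing off the horizon) can be sketched on a toy problem. Everything problem-specific here is an assumption for illustration: the 1-D dynamics, reward, and terminal value are hand-written stand-ins for the learned models, and the hyperparameters are not from the paper.

```python
import numpy as np


def mppi_plan(x0, dynamics, reward, value, horizon=10, samples=512,
              sigma=0.5, temperature=5.0, gamma=0.99, iters=3, seed=0):
    """Return a mean action sequence refined by MPPI with a terminal value bootstrap."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(horizon)
    for _ in range(iters):
        # Sample action sequences around the current mean plan.
        actions = mean + sigma * rng.standard_normal((samples, horizon))
        x = np.full(samples, x0, dtype=float)
        returns = np.zeros(samples)
        for t in range(horizon):
            returns += gamma**t * reward(x, actions[:, t])
            x = dynamics(x, actions[:, t])
        # Bootstrap the tail of the infinite horizon with the value estimate.
        returns += gamma**horizon * value(x)
        # Exponentially weight rollouts by return and average the sequences.
        weights = np.exp((returns - returns.max()) / temperature)
        weights /= weights.sum()
        mean = weights @ actions
    return mean


TARGET = 1.0  # hypothetical goal position for the toy system
plan = mppi_plan(
    x0=0.0,
    dynamics=lambda x, a: x + 0.1 * a,          # stand-in for the learned dynamics
    reward=lambda x, a: -(x - TARGET) ** 2,     # stand-in for the learned reward head
    value=lambda x: -10.0 * (x - TARGET) ** 2,  # stand-in for the learned value head
)
print("first planned action:", plan[0])
```

In a receding-horizon loop only the first action would be executed before replanning from the next state.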
Training Recipe
  1. On-policy training using PPO with clipping range 0.2, GAE factor 0.95, discount factor 0.99
  2. Data: 4096 parallel Isaac Gym environments with domain randomization and curriculum learning across terrain difficulty levels
  3. Optimizer: Adam with learning rate 0.001 for all networks
  4. Hardware: Intel Xeon Gold 5318Y (48 cores) @ 3.40 GHz, two NVIDIA RTX A6000 GPUs
  5. Internal model trained via supervised learning on expert trajectories with targets from simulation interaction
  6. Expert Actor and Privileged Critic trained jointly with PPO on collected trajectories
  7. Training duration and wall-clock time: not reported
  8. Memory optimization: NVIDIA WARP library to maintain VRAM usage below 25GB
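The PPO settings in the first item (clipping range 0.2, GAE factor 0.95, discount 0.99) map onto two standard computations. The sketch below shows generic GAE and the clipped surrogate with those defaults; it ignores episode termination flags for brevity and is not taken from the authors' code.

```python
import numpy as np


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout without terminations."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), last_value)
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then exponentially weighted accumulation.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clip_objective(ratio, advantage, clip=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - clip, 1 + clip) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip)
    return np.minimum(ratio * advantage, clipped * advantage)


print(gae_advantages([1.0, 1.0], [0.0, 0.0], last_value=0.0))
print(ppo_clip_objective([1.5, 0.5], [1.0, -1.0]))
```

The clip removes the incentive to push the probability ratio further than 20% from 1 in the direction that would increase the objective.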
Novelty & Lineage

The core novelty is integrating vision-based scene understanding with infinite-horizon MPC through a learned internal model that operates at deployment time.

Closest prior works are WMP (2024) for vision-based locomotion RL and PIP-Loco (2025) for proprioceptive MPC planning. The specific delta is combining visual perception with structured planning via a dual-purpose internal model that provides imagination augmentation during training and serves as dynamics oracle for MPC at deployment. The variational objective for representation learning (vs consistency-based) enables better performance on visually constrained tasks like crawling and tilting.

Rating: SIGNIFICANT - meaningfully advances the state of combining perception and planning for legged locomotion.

Benchmarks & Results
  1. Six terrain types (slopes, stairs, gaps, climbing, crawling, tilting) across difficulty levels 0-8, measured by success rate and episodic return
  2. Go1 quadruped: VIP-Loco achieves >90% success on stairs at maximum difficulty vs sharp degradation for proprioceptive methods
  3. Cassie biped: VIP-Loco shows consistent performance across terrains with some marginal regressions on gaps and crawl compared to no-planning variant
  4. TronA1-W wheeled-biped: Shows largest improvement from planning component, achieving 38.29±2.39 return on slopes vs 33.47±1.38 without planning
  5. Terrain mastery: VIP-Loco (Variational) sustains levels ≥6 while WMP asymptotically regresses to easier terrains
  6. Training comparison: VIP-Loco (Variational) achieves highest and most stable returns over 6000 training iterations
  7. All methods compared against HIM-Loco, PIP-Loco, WMP baselines across 5 random seeds
Compute & Efficiency
  1. Model size: not explicitly reported, but includes CNN encoder, GRU cell, MLPs for dynamics/value/reward heads
  2. Training compute: Two NVIDIA RTX A6000 GPUs with 4096 parallel environments, training duration not reported
  3. Inference speed: 40-50 Hz deployment with ~20-25ms total planner latency (depth encoding 2-3ms, GRU update <1ms, imagination rollout 3-5ms, MPPI sampling 10-15ms)
  4. Memory footprint: <25GB VRAM during training using NVIDIA WARP optimization
  5. Deployment practicality: Real-time feasible at standard control frequencies, though requires desktop GPU for full performance
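The latency figures in the inference-speed item can be checked against the control period. This is a small arithmetic sketch; the 0 to 1 ms range for the GRU update is an assumption, since the source only states "<1ms", and the stage names are illustrative labels.

```python
# Per-stage latency ranges in milliseconds, as (low, high).
stage_latency_ms = {
    "depth_encoding": (2.0, 3.0),
    "gru_update": (0.0, 1.0),  # reported only as "<1ms"; the lower bound is assumed
    "imagination_rollout": (3.0, 5.0),
    "mppi_sampling": (10.0, 15.0),
}

low = sum(lo for lo, _ in stage_latency_ms.values())
high = sum(hi for _, hi in stage_latency_ms.values())


def fits_worst_case(rate_hz: float) -> bool:
    """True if the worst-case itemized latency fits inside one control period."""
    return high <= 1000.0 / rate_hz


print(f"itemized stages: {low:.0f}-{high:.0f} ms")
for rate in (40, 50):
    print(f"{rate} Hz period = {1000.0 / rate:.0f} ms, worst case fits: {fits_worst_case(rate)}")
```

The itemized stages sum to less than the quoted ~20-25ms total, so the remainder is presumably overhead that the summary does not break down.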
Real-World Applicability
  1. Simulation-only evaluation in Isaac Gym with domain randomization for robustness
  2. No hardware experiments reported in this work
  3. Authors note that PIP-Loco (closest predecessor) has been validated on physical hardware, providing evidence for sim-to-real transfer potential
  4. Deployment runs at 40-50 Hz which is compatible with real-world control loops
  5. Framework incorporates domain randomization and depth sensor modeling for real-world preparation
  6. Authors explicitly state extending to hardware with depth sensor noise and onboard compute constraints is primary future work direction
Limitations & Failure Modes
  1. EVALUATION - Simulation-only results with no real-world validation or hardware experiments
  2. ENGINEERING - Requires desktop GPU for full 40-50 Hz performance, may need optimization for onboard compute
  3. ENGINEERING - Marginal performance regressions on some terrain types (gaps, crawl) for certain morphologies compared to no-planning baseline
  4. FUNDAMENTAL - Planning benefits are morphology and terrain-dependent, not universally applicable
  5. EVALUATION - Limited comparison with other state-of-the-art visual locomotion methods beyond WMP
  6. ENGINEERING - 10Hz internal model update rate vs 50Hz control may introduce temporal aliasing issues

    Likely failure modes:

  1. Performance degradation under significant domain shift between training and deployment environments
  2. Planning optimization may get stuck in local minima during challenging terrain transitions requiring rapid replanning.

Unlocking the Value of Text: Event-Driven Reasoning and Multi-Level Alignment for Time Series Forecasting

Authors: Siyuan Wang, Peng Chen, Yihang Wang, Wanghui Qiu et al. (7 authors) · Institution: Toyota Research · Category: cs.AI

VoT introduces event-driven reasoning with LLMs and multi-level alignment to effectively integrate textual information with numerical time series for improved forecasting across diverse domains.

Practical Takeaway: If you’re working on time series forecasting in domains with rich textual context (finance, economics, health policy), this paper provides a concrete framework for leveraging both numerical patterns and event-driven information. The Historical In-Context Learning approach is particularly clever for avoiding expensive LLM fine-tuning while still getting semantic reasoning benefits. The dual-branch architecture with frequency-domain fusion could be adapted to other multimodal forecasting scenarios. However, implement this only if you have access to relevant textual data - the method’s value is heavily dependent on text quality and relevance.

Tags: time-series-forecasting multimodal-learning large-language-models contrastive-learning frequency-domain-analysis in-context-learning cross-modal-alignment event-driven-prediction

arXiv · PDF

Task & Setting
  1. Real-world context: Time series forecasting traditionally relies on numerical data alone, but real-world series exhibit complex patterns driven by external events (financial crises, pandemics, policy changes) that are difficult to predict from historical numerical patterns alone. Incorporating textual information like news, policy documents, and contextual descriptions could provide crucial guidance for capturing sudden shifts and event-driven dynamics.

  2. Task definition: Given a multimodal time series input consisting of numerical observations X ∈ R^(L×N) (where L is lookback window, N is number of variables) and associated textual information (both exogenous text T_ex from external sources like news/policies and endogenous text T_en from time series statistics), predict future H time steps. The objective is:

    \[\hat{Y} = F(X, T_{ex}, T_{en}; \theta)\]

    where $\hat{Y} \in \mathbb{R}^{H \times N}$ represents predicted values.

  3. Evaluation criteria: Performance measured using Mean Squared Error (MSE) and Mean Absolute Error (MAE) across multiple forecasting horizons on real-world datasets from 10 domains.

  4. Dataset/benchmark: Evaluation on 10 real-world multimodal datasets from MM-TSFLib covering Agriculture, Climate, Economy, Energy, Health, Environment, Traffic, Security, Social Good, and a new Weather dataset, spanning diverse temporal frequencies and forecasting horizons.

Architecture & Method
  1. Dual-branch architecture with Event-Driven Prediction Branch processing exogenous text through a three-step generative pipeline (template generation, summarization, reasoning) using LLMs.

  2. Historical In-Context Learning (HIC) builds knowledge base K = {(Embed(S_i), C_i)} during training by correcting LLM reasoning with ground truth, then retrieves similar examples during inference for error-informed guidance.

  3. Numerical Prediction Branch employs Endogenous Text Alignment (ETA) with decomposed pattern extraction using dual-query attention to extract trend/seasonal components from endogenous text representations.

  4. Decomposed contrastive learning aligns temporal and textual representations at component level with loss:

    \[L_{align} = \frac{1}{2}\sum_{*\in\{tr,se\}} \left(-\log\frac{\exp(\text{sim}(\bar{H}^*_i, \bar{Z}^*_i))}{\sum_{j=1}^B \exp(\text{sim}(\bar{H}^*_i, \bar{Z}^*_j))} - \log\frac{\exp(\text{sim}(\bar{Z}^*_i, \bar{H}^*_i))}{\sum_{j=1}^B \exp(\text{sim}(\bar{Z}^*_i, \bar{H}^*_j))}\right)\]
  5. Adaptive Frequency Fusion (AFF) decomposes both branch predictions via FFT into frequency bands and learns adaptive weights for fusion:

    \[F_{fused} = \sum_* \sum_b w^*_b F^*_b\]
  6. Combined training objective: L_total = L_ts + L_align + L_final with MSE losses for temporal prediction and final fused output.
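The decomposed contrastive loss in item 4 is a symmetric InfoNCE-style objective applied separately to trend and seasonal components. The sketch below is one plausible NumPy reading of it, assuming cosine similarity for sim(·,·), no temperature (none appears in the formula), and averaging over the batch; the function names are illustrative.

```python
import numpy as np


def _directional_nll(sim):
    """Negative log-softmax of the matched (diagonal) pair for each row of `sim`."""
    shifted = sim - sim.max(axis=1, keepdims=True)  # stabilize the exponentials
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.diag(log_prob)


def symmetric_contrastive(h, z):
    """Per-sample loss for one component; h and z are (B, D) with row i of h matched to row i of z."""
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = h @ z.T  # cosine similarities, shape (B, B); cosine is an assumption
    return _directional_nll(sim) + _directional_nll(sim.T)


def align_loss(h_trend, z_trend, h_season, z_season):
    """0.5 * sum over {trend, seasonal} of the symmetric loss, averaged over the batch."""
    per_sample = symmetric_contrastive(h_trend, z_trend) + symmetric_contrastive(h_season, z_season)
    return 0.5 * per_sample.mean()


rng = np.random.default_rng(0)
h = rng.standard_normal((8, 16))
shuffled = np.roll(h, 1, axis=0)  # break the row-wise pairing
print("matched:", align_loss(h, h, h, h))
print("mismatched:", align_loss(h, shuffled, h, shuffled))
```

Correctly paired temporal and textual embeddings yield a lower loss than mismatched ones, which is the signal the alignment term provides during training.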

Training Recipe
  1. Event-driven branch training: Uses pre-trained LLMs for template generation, summarization, and reasoning without fine-tuning. Knowledge base constructed during training by correcting reasoning with ground truth.

  2. Numerical branch training: Standard supervised learning with composite loss function L_total = L_ts + L_align + L_final using MSE for prediction tasks and contrastive learning for cross-modal alignment.

  3. Optimizer details: Not fully specified, follows standard time series forecasting protocols with hyperparameter tuning on validation sets.

  4. Data preprocessing: Time-based splits (70% train, 10% validation, 20% test) following MM-TSFLib protocols with varying lookback windows (L=8,36,96) and forecasting horizons based on data frequency.

  5. Software environment: PyTorch 2.5.0+cu121 with CUDA 12.1; GPU hardware not reported.

  6. Training time: Wall-clock time not reported.
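The time-based split and lookback windows from the data preprocessing item can be sketched as follows. This is a generic illustration, assuming windows are built inside each split so that no window crosses a split boundary; the helper names and the horizon value are hypothetical and MM-TSFLib's own loaders may differ.

```python
def time_based_split(n, train=0.7, val=0.1):
    """Chronological index ranges for train/validation/test (no shuffling)."""
    train_end = round(n * train)
    val_end = train_end + round(n * val)
    return range(0, train_end), range(train_end, val_end), range(val_end, n)


def make_windows(series, lookback, horizon):
    """Slide a (lookback, horizon) window over one split of the series."""
    count = len(series) - lookback - horizon + 1
    return [
        (series[i:i + lookback], series[i + lookback:i + lookback + horizon])
        for i in range(max(count, 0))
    ]


series = list(range(100))
train_idx, val_idx, test_idx = time_based_split(len(series))
train_windows = make_windows([series[i] for i in train_idx], lookback=8, horizon=4)
print(len(train_idx), len(val_idx), len(test_idx), len(train_windows))
```

Keeping the split chronological means the test portion always lies after the training data in time, which matters for event-driven series.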

Novelty & Lineage

This work builds on multimodal time series forecasting methods like CMIN (2023), DualTime (2024), and MM-TSFLib (2024). Closest prior works include GPT4TS (2025), TimeLLM (2024), and TaTS (2026) which incorporate LLMs for time series but focus primarily on representation extraction.

Key novel contributions:

  1. Event-driven reasoning pipeline that uses LLMs for actual numerical prediction generation rather than just feature extraction
  2. Historical In-Context Learning that builds error-corrected knowledge base for retrieval-guided inference
  3. Multi-level alignment combining representation-level decomposed contrastive learning with prediction-level adaptive frequency fusion.

    The specific delta is moving beyond representation-level text integration to semantic reasoning and adaptive frequency-domain fusion of complementary modalities.

    Rating: SIGNIFICANT - meaningful advance in multimodal time series forecasting with novel reasoning and alignment mechanisms.

Benchmarks & Results
  1. Agriculture dataset: MSE 0.209, MAE 0.302; best MSE among all 11 baselines, though the strongest multimodal baseline, GPT4MTS (MSE 0.225, MAE 0.298), has a slightly lower MAE

  2. Climate dataset: MSE 1.078, MAE 0.840, beats previous best including multimodal methods like GPT4TS (MSE 1.184, MAE 0.891)

  3. Economy dataset: MSE 0.201, MAE 0.353, outperforms multimodal baselines like CALF (MSE 0.207, MAE 0.357)

  4. Energy dataset: MSE 0.222, MAE 0.343, about 11% lower MSE than time-series only PatchTST (MSE 0.250, MAE 0.363)

  5. Environment dataset: MSE 0.268, MAE 0.380, outperforms all baselines

  6. Health dataset: MSE 1.205, MAE 0.714, about 16% lower MSE than time-series only iTransformer (MSE 1.432, MAE 0.804)

  7. Security dataset: MSE 70.117, MAE 3.937, outperforms all methods

  8. Social Good dataset: MSE 0.804, MAE 0.389, beats time-series only baseline (MSE 0.944, MAE 0.475)

  9. Traffic dataset: MSE 0.169, MAE 0.232, outperforms baselines

  10. Weather dataset: MSE 0.968, MAE 0.706, best performance

    Achieves 1st place on 19/20 metrics against multimodal methods, 20/20 against time-series only methods.

Compute & Efficiency
  1. Model size: Not explicitly reported, but uses pre-trained LLMs without fine-tuning suggesting moderate parameter overhead beyond base time series models

  2. Training compute: GPU hours and specific hardware not reported, uses CUDA 12.1 environment

  3. Inference speed/latency: Not reported, but HIC retrieval mechanism may add latency compared to direct prediction

  4. Memory footprint: Not specified, dual-branch architecture likely increases memory requirements

  5. Deployment practicality: Moderate - relies on pre-trained LLMs and requires textual data alignment, but avoids expensive LLM fine-tuning through HIC approach

Real-World Applicability
  1. Tested on real-world datasets from 10 domains including financial (agriculture, energy), governmental (economy, health), and environmental (climate, weather) data sources

  2. Datasets sourced from authoritative organizations like USDA, NOAA, CDC, EPA, FEMA indicating real-world relevance

  3. No deployment results or production integration reported

  4. No hardware experiments beyond standard computational evaluation

  5. Method designed for scenarios with available textual context (news, reports, policies) which limits applicability to domains lacking rich textual information

Limitations & Failure Modes
  1. FUNDAMENTAL: Requires availability of relevant exogenous textual information - method cannot help domains lacking rich textual context

  2. FUNDAMENTAL: Relies on LLM reasoning quality - errors in LLM understanding of text-time series relationships could propagate to final predictions

  3. ENGINEERING: HIC knowledge base construction requires sufficient training examples with correctable reasoning patterns

  4. ENGINEERING: Adaptive frequency fusion weights need adequate training data to learn optimal frequency-specific combinations

  5. EVALUATION: Only evaluated on English textual data, multilingual capability not assessed

  6. EVALUATION: No analysis of computational overhead compared to simpler text integration approaches

    Failure modes:

  1. Performance likely degrades when exogenous text contains misleading or irrelevant information
  2. Method may fail when event-driven patterns are not well-represented in training data for HIC retrieval.

SAGE: Multi-Agent Self-Evolution for LLM Reasoning

Authors: Yulin Peng, Xinxin Zhu, Chenxing Wei, Nianbo Zeng et al. (7 authors) · Institution: Toyota Research · Category: cs.AI

SAGE enables LLMs to self-evolve reasoning capabilities through four specialized agents (Challenger, Planner, Solver, Critic) that co-train adversarially using minimal seed data and automatic verification.

Practical Takeaway: SAGE demonstrates that multi-agent specialization within a shared LLM backbone can enable effective self-evolution with minimal human supervision. The key insight is combining adversarial task generation (Challenger rewarded for difficulty, Solver for correctness) with explicit planning decomposition and quality filtering. Research engineers should consider this approach for domains with automatic verification capabilities, but be aware of the increased inference complexity and need for careful training monitoring to prevent over-specialization. The framework’s strong out-of-distribution generalization, particularly on LiveCodeBench, suggests value for developing more robust reasoning agents.

Tags: multi-agent-systems self-evolution reinforcement-learning mathematical-reasoning code-generation curriculum-learning self-play policy-gradients

arXiv · PDF

Task & Setting

Self-evolving reasoning agents for large language models face the fundamental challenge of improving without extensive human supervision. Current methods either rely on large-scale human-labeled datasets or use unstable self-play without explicit planning and quality control, limiting their effectiveness for complex multi-step reasoning tasks.

The task is to develop a closed-loop multi-agent framework that enables LLMs to co-evolve their reasoning capabilities in verifiable domains (mathematics and code generation) using only minimal seed data. The input is a small seed set of 500 examples with verifiers, and the output is an improved reasoning agent capable of generating, planning, solving, and evaluating tasks autonomously. The objective is to maximize reasoning performance while maintaining training stability through adversarial yet collaborative agent dynamics.

Success is measured by pass@1 accuracy on mathematical reasoning benchmarks (GSM8K, MATH, AIME, OlympiadBench, AMC) and code generation benchmarks (HumanEval+, MBPP+, LiveCodeBench). The evaluation spans both in-distribution and out-of-distribution generalization across multiple model scales.

The framework operates on existing benchmarks but generates its own expanded training curriculum, growing from 500 seed examples to over 20,000 generated questions during training.

Architecture & Method
  1. SAGE instantiates four specialized agents from a shared LLM backbone Mθ: Challenger (task generation), Planner (solution decomposition), Solver (answer execution), and Critic (quality assessment).

  2. The Challenger generates new problems and verifiers using reference examples from the seed set:

    \[(q, v) \sim \pi_c(\cdot | q_{ref}, v_{ref}; \theta)\]
  3. The Challenger receives a composite reward combining quality score, difficulty, and format compliance:

    \[r_c(q, v) = \frac{1}{3}s_q(q) + \frac{1}{3}r_d(q, v) + \frac{1}{3}r_f(o_c)\]

    where difficulty is computed as:

    \[r_d(q, v) = 1 - \frac{1}{N_s}\sum_{j=1}^{N_s} V_{gt}(q, a_j, v)\]
  4. The Planner generates structured multi-step plans that are gated by Critic evaluation (threshold β = 0.3) before being provided to the Solver.

  5. The Solver receives composite rewards mixing plan quality, verified correctness, and format compliance:

    \[r_s = w_p \tilde{s}_p + w_c s_{gt} + w_f r_f(o_s)\]
  6. All agents are jointly optimized using Task-Relative REINFORCE++ with per-role advantage normalization:

    \[A^{role}_{norm} = \frac{r - \mu^{role}}{\sigma^{role} + \epsilon}\]
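The reward and normalization formulas above translate directly into a few lines of code. The sketch below is illustrative only: the format reward is assumed to be binary, the Solver weights `w_p`, `w_c`, `w_f` are hypothetical placeholders (the summary does not give them), the gate is assumed to pass plans scoring at or above β, and population standard deviation is an assumption.

```python
import statistics


def difficulty_reward(verdicts):
    """r_d = 1 - mean verifier verdict over N_s solver attempts (1 = verified correct)."""
    return 1.0 - sum(verdicts) / len(verdicts)


def challenger_reward(quality, verdicts, format_ok):
    """Equal-weight mix of quality score, difficulty, and format compliance."""
    return (quality + difficulty_reward(verdicts) + float(format_ok)) / 3.0


def gate_plan(plan, critic_score, beta=0.3):
    """Pass the plan to the Solver only if the Critic score clears beta (>= is assumed)."""
    return plan if critic_score >= beta else None


def solver_reward(plan_score, correct, format_ok, w_p=0.2, w_c=0.6, w_f=0.2):
    """Weighted mix of plan quality, verified correctness, and format (weights hypothetical)."""
    return w_p * plan_score + w_c * float(correct) + w_f * float(format_ok)


def normalize_per_role(rewards_by_role, eps=1e-8):
    """Standardize rewards within each role: (r - mu_role) / (sigma_role + eps)."""
    normalized = {}
    for role, rewards in rewards_by_role.items():
        mu = statistics.fmean(rewards)
        sigma = statistics.pstdev(rewards)
        normalized[role] = [(r - mu) / (sigma + eps) for r in rewards]
    return normalized


print(challenger_reward(0.9, [1, 0, 0, 0], True))
print(normalize_per_role({"solver": [0.0, 1.0], "challenger": [0.5, 0.5]}))
```

Normalizing within each role keeps one agent's reward scale from dominating the shared-backbone update, which is the stated purpose of the per-role advantage.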
Training Recipe
  1. Initialization: All four agents (Challenger, Planner, Solver, Critic) are initialized from the same base LLM (Qwen2.5-3B/7B-Instruct or Qwen3-4B-Base) with LoRA adaptation (rank 128).

  2. Training setup: 200 training steps with batch size 128, learning rate 3×10⁻⁶, using Task-Relative REINFORCE++ without KL regularization.

  3. Data: Seed set of 500 examples sampled from MATH (156), GSM8K (148), HumanEval (87), and MBPP (109), with automatic verifiers for correctness evaluation.

  4. Three-phase training loop:

    • Challenger phase generates questions filtered by Critic quality threshold α=0.7
    • Plan-Solve phase with Planner gating threshold β=0.3
    • Joint parameter update with per-role advantage normalization.

  5. Hardware: Not explicitly reported, implemented using VeRL framework.

  6. Training dynamics: Peak validation accuracy around step 100-120, with question pool expanding from 1,136 to 20,532 examples by step 250.
Novelty & Lineage

The core novelty is the integration of four specialized agents (Challenger, Planner, Solver, Critic) within a single shared LLM backbone for self-evolution, with explicit planning decomposition and dual-role critic mechanisms for quality control.

Closest prior works include Absolute Zero (Zhao et al., 2025) for self-play reasoning, MARS (Yuan et al., 2025) for multi-agent RL, and SPIRAL (Liu et al., 2025) for adversarial self-play. The specific delta is combining adversarial task generation (Challenger vs Solver) with structured planning and comprehensive quality filtering in a unified framework.

Prior multi-agent systems like MetaGPT and CAMEL focus on task decomposition among separate models, while SAGE trains all agents jointly from a shared backbone. The planning component distinguishes it from pure self-play approaches that lack explicit reasoning decomposition.

Rating: SIGNIFICANT - meaningful architectural innovation combining established components in a novel way with strong empirical validation.

Benchmarks & Results
  1. HumanEval+: 68.9% (Qwen2.5-3B), 76.2% (Qwen2.5-7B), 75.6% (Qwen3-4B) - improvements vary by model
  2. MBPP+: 62.4% (3B), 64.0% (7B), 62.4% (4B) - mixed results vs baselines
  3. LiveCodeBench: 16.9% (3B), 26.4% (7B), 30.6% (4B) - consistent +8.9% to +9.1% improvements over base models
  4. GSM8K: 85.5% (3B), 92.2% (7B), 94.3% (4B) - modest improvements on in-distribution data
  5. MATH: 66.2% (3B), 74.7% (7B), 91.0% (4B) - competitive with base models
  6. AIME 2024: 6.7% (3B), 13.3% (7B), 16.7% (4B) - strong on competition math
  7. AIME 2025: 6.7% (3B), 13.3% (7B), 10.0% (4B) - mixed results
  8. AMC: 35.0% (3B), 52.5% (7B), 75.0% (4B) - competitive performance
  9. OlympiadBench: 29.8% (3B), 38.7% (7B), 47.9% (4B) - notable +10.7% improvement on 7B model

    Results show consistent improvements on out-of-distribution benchmarks, particularly LiveCodeBench, while maintaining competitive in-distribution performance. SAGE achieves best overall averages across model scales compared to AZR and MAE baselines.

Compute & Efficiency
  1. Model size: Tested on 3B, 7B parameters (Qwen2.5) and 4B parameters (Qwen3), using LoRA with rank 128 for parameter-efficient training

  2. Training compute: 200 training steps with batch size 128, learning rate 3×10⁻⁶ - specific GPU hours and hardware not reported

  3. Inference speed/latency: Not reported - framework requires multiple agent calls per inference which likely increases latency

  4. Memory footprint: LoRA adaptation reduces memory requirements compared to full fine-tuning, but specific numbers not provided

  5. Deployment practicality: Moderate - requires maintaining four specialized agent prompts and orchestrating multi-agent interactions, plus external verifiers for mathematical/code domains, making deployment more complex than single-model inference

Real-World Applicability
  1. Domain constraints: SAGE operates specifically in verifiable domains (mathematics and programming) where automatic correctness verification is possible through symbolic graders or test execution.

  2. Benchmark evaluation: All experiments conducted on standard academic benchmarks (GSM8K, MATH, HumanEval, etc.) rather than real-world deployment scenarios.

  3. Production considerations: Framework requires external verifiers and multi-agent orchestration, limiting immediate production deployment without additional engineering.

  4. Scalability: Method scales question generation from 500 seed examples to 20,000+ generated problems, demonstrating autonomous curriculum expansion capability.

  5. Generalization: Strong out-of-distribution performance on competition-level math problems and recent code benchmarks suggests potential for real-world mathematical and programming assistance applications.

Limitations & Failure Modes
  1. FUNDAMENTAL - Restricted to verifiable domains where automatic correctness evaluation is possible (math, programming), limiting applicability to open-ended tasks with subjective evaluation criteria.

  2. ENGINEERING - Still requires seed set of 500 examples to bootstrap self-evolution, though significantly smaller than typical supervised datasets.

  3. EVALUATION - Limited to mathematical reasoning and code generation benchmarks; generalization to other structured reasoning domains (logical reasoning, scientific problem solving) not demonstrated.

  4. ENGINEERING - Training dynamics show potential over-specialization after step 120, requiring careful monitoring and early stopping.

  5. FUNDAMENTAL - Multi-agent architecture increases inference complexity compared to single-model approaches, requiring orchestration of four specialized agents.

    Failure modes:

    • Over-specialization on self-generated curriculum leading to performance degradation beyond peak training point
    • Quality drift if Critic filtering fails, potentially degrading the expanded training dataset with low-quality generated problems

Adaptive Theory of Mind for LLM-based Multi-Agent Coordination

Authors: Chunjiang Mu, Ya Zeng, Qiaosheng Zhang, Kun Shao et al. (9 authors) · Institution: Toyota Research · Category: cs.AI

Proposes Adaptive Theory of Mind (A-ToM) agents that dynamically estimate and align with partners’ ToM reasoning depth to improve multi-agent coordination, treating ToM alignment as an online expert advice problem.

Practical Takeaway: If you’re building multi-agent systems with LLM-based agents, this paper provides a crucial insight: don’t just give agents Theory of Mind capabilities - ensure their ToM reasoning depths are aligned. The A-ToM framework offers a practical solution using online learning algorithms (FTL for stable partners, Hedge for adaptive ones) that can be implemented as a wrapper around existing LLM agents. The key engineering insight is transforming coordination from complex policy space to simpler ToM-order space. Consider implementing this if your agents need to coordinate with unknown partners, especially in cooperative settings where misaligned mental models cause coordination failures.

Tags: multi-agent-coordination theory-of-mind LLM-agents online-learning expert-advice zero-shot-coordination cooperative-AI game-theory

arXiv · PDF

Task & Setting

Multi-agent coordination in cooperative settings requires agents to align their actions without prior agreement or communication, a critical challenge in autonomous driving, robotics, and distributed systems. The problem arises when agents equipped with Theory of Mind (ToM) reasoning of different depths (orders) fail to coordinate effectively due to misaligned mental models of each other’s reasoning processes.

The task involves two LLM-based agents cooperating in fully cooperative Markovian environments M = ⟨S, A1, A2, T, R, γ⟩ where both agents share the same reward function R : S × A1 × A2 → ℝ. The agents must select joint actions to maximize their shared value function:

\[a^* = \arg\max_{a \in A_1 \times A_2} Q^{\pi}(s, a)\]

where

\[Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a\right]\]

Success is measured by task-specific metrics: accumulated points (0-75) in repeated matrix games, completion time (0-30 steps for grid worlds, 0-100 for Overcooked) with failed episodes assigned maximum time limits. The evaluation spans four environments: repeated matrix games with Memory-1 and Memory-N settings, two grid world navigation tasks, and an Overcooked cooking scenario, each designed to test different coordination challenges.

Architecture & Method
  1. ToM Agent Architecture: Each LLM-based agent uses LLaMA-3.3-70B-Instruct with four modules: state encoding (converts environment to natural language), ToM module (predicts partner actions), decision module (selects actions), and action controller (converts to executable actions).

  2. Fixed-Order ToM Agents: ToM-k agents recursively model partners as ToM-(k-1) agents, with policies defined as:

    \[π_i^{(k)}(s, b_i^{(k)}) := \arg\max_{a \in A_i} Q^π(s, a_{j}^{pred}, a)\]

    where $b_i^{(k)} := a_{j}^{pred}$ and $a_{j}^{pred} = π_j^{(k-1)}(s, b_j^{(k-1)})$

  3. Adaptive ToM (A-ToM) Agent: Maintains multiple hypothetical agents $\{π_j^{(k)}\}_{k \in \{0,1,2\}}$ and treats ToM order estimation as an expert advice problem.

  4. Online Learning Algorithms: Implements Follow-the-Leader (FTL) with O(log T) regret for stable partners, and the Hedge algorithm with O(√(T log N)) worst-case regret over N experts for non-stationary behavior.

  5. Weight Update Mechanism: Updates expert weights based on prediction accuracy, with FTL selecting the best-performing expert deterministically and Hedge maintaining soft probability distributions over experts.
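
The expert-advice view in items 3-5 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the 0/1 prediction loss, the learning rate `eta`, the tie-breaking rule, and the toy `predictions` dictionary are all assumptions made here. In the real system each expert's prediction would come from an LLM call for the corresponding ToM-order hypothesis.

```python
import math
import random

ORDERS = (0, 1, 2)  # candidate ToM orders for the partner


class FollowTheLeader:
    """Deterministically follow the expert with the lowest cumulative loss."""

    def __init__(self, experts=ORDERS):
        self.cum_loss = {k: 0.0 for k in experts}

    def select(self):
        # Ties are broken toward the lowest order (an arbitrary choice).
        return min(self.cum_loss, key=lambda k: (self.cum_loss[k], k))

    def update(self, predictions, observed_action):
        # 0/1 loss: penalise experts whose predicted partner action was wrong.
        for k, pred in predictions.items():
            self.cum_loss[k] += 0.0 if pred == observed_action else 1.0


class Hedge:
    """Keep a soft distribution over experts via multiplicative weights."""

    def __init__(self, experts=ORDERS, eta=0.5, rng=None):
        self.weights = {k: 1.0 for k in experts}
        self.eta = eta  # hypothetical learning rate
        self.rng = rng or random.Random(0)

    def distribution(self):
        total = sum(self.weights.values())
        return {k: w / total for k, w in self.weights.items()}

    def select(self):
        probs = self.distribution()
        orders = list(probs)
        return self.rng.choices(orders, weights=[probs[k] for k in orders])[0]

    def update(self, predictions, observed_action):
        for k, pred in predictions.items():
            loss = 0.0 if pred == observed_action else 1.0
            self.weights[k] *= math.exp(-self.eta * loss)


# Toy run: the partner always plays what the order-1 hypothesis predicts.
ftl, hedge = FollowTheLeader(), Hedge()
for _ in range(10):
    predictions = {0: "left", 1: "right", 2: "left"}  # made-up expert outputs
    observed = "right"
    ftl.update(predictions, observed)
    hedge.update(predictions, observed)
```

After the toy run, FTL commits to order 1, while Hedge concentrates most of its probability mass there but can still shift if the partner's behavior changes.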

Training Recipe

This work does not involve traditional model training as it uses pre-trained LLaMA-3.3-70B-Instruct as the backbone LLM. The method operates through:

  1. Inference-Only Approach: Uses the pre-trained LLM with temperature=0.1 for consistent decision-making, no additional training required.

  2. Online Learning: Expert weights are updated in real-time during task execution using prediction accuracy feedback.

  3. Implementation Details: All experiments use random seed=42, each configuration repeated 30 times independently for statistical significance.

    Training compute, hardware requirements, and optimization details are not reported as the method relies entirely on inference with existing models.

Novelty & Lineage

This work builds on established Theory of Mind research in multi-agent systems (Li et al. 2023, Agashe et al. 2023) and online learning theory (Cesa-Bianchi et al. 1997). The key novelty is identifying ToM order misalignment as a fundamental coordination problem and proposing the first adaptive mechanism to dynamically estimate and align with partner’s ToM order in real-time.

Previous work showed mixed results when equipping agents with higher-order ToM but attributed performance drops to LLM limitations or over-reasoning. This paper provides the deeper insight that misaligned ToM orders between agents cause coordination failures, and demonstrates that adaptive alignment can consistently improve performance.

The core technical contribution is formulating ToM alignment as an expert advice problem with theoretical guarantees, transforming coordination from policy space to ToM-order space.

Rating: SIGNIFICANT - Provides important theoretical insight with practical algorithmic solution.

Benchmarks & Results
  1. Repeated Matrix Game (Memory-1): A-ToM achieves 70-75 points vs 0 points for misaligned ToM pairs, matching aligned pairs’ performance (75 points).

  2. Repeated Matrix Game (Memory-N): A-ToM achieves 70-75 points vs 0-22 points for misaligned pairs, comparable to aligned pairs (75 points).

  3. Grid World Navigation Game 1: A-ToM completes in 6-8 steps vs 23-30 steps for misaligned pairs, matching aligned pairs (6 steps).

  4. Grid World Navigation Game 2: A-ToM completes in 7-11 steps vs 30 steps for misaligned pairs, slightly slower than aligned pairs (7-8 steps) but dramatically better than misaligned.

  5. Overcooked Task: A-ToM completes in 43-58 steps vs 83-100 steps for misaligned pairs, comparable to aligned pairs (44-51 steps).

  6. Cross-play with Non-LLM Agents: A-ToM outperforms fixed ToM agents when paired with Greedy agents (48-49 vs 55-65 steps) and matches or slightly improves on them with PBT agents (48-49 vs 49 steps).

    Results consistently demonstrate that ToM alignment is critical for coordination, with A-ToM successfully adapting to different partner types.

Compute & Efficiency
  1. Model Size: Uses LLaMA-3.3-70B-Instruct (70 billion parameters) as backbone, no additional learned parameters for A-ToM mechanism.

  2. Training Compute: Not applicable - inference-only approach using pre-trained models, no training required.

  3. Inference Speed/Latency: Not reported, but involves multiple LLM calls per decision (one per ToM order hypothesis).

  4. Memory Footprint: Not reported, but requires storing expert weights and prediction histories for online learning.

  5. Deployment Practicality: Moderate - requires access to large LLM API calls and maintains lightweight expert advice algorithms, but computational overhead from multiple ToM reasoning paths may limit real-time applications.

Real-World Applicability
  1. Simulation-Only Evaluation: All experiments conducted in controlled simulation environments (matrix games, grid worlds, Overcooked), no real-world deployment reported.

  2. Cross-Agent Generalization: Successfully coordinates with non-LLM agents (planning-based Greedy agent and MARL-trained PBT agent), suggesting broader applicability beyond LLM-to-LLM coordination.

  3. Task Diversity: Evaluation spans different coordination challenges from simple game-theoretic scenarios to complex multi-step planning tasks, demonstrating method generality.

  4. Implementation Availability: Authors provide code repository (https://github.com/ChunjiangMonkey/Adaptive-ToM) for reproducibility and potential real-world adaptation.

    Real-world deployment in autonomous driving, robotics, or human-AI interaction scenarios remains unexplored, representing a significant gap between theoretical contribution and practical application.

Limitations & Failure Modes
  1. FUNDAMENTAL: Method assumes cooperative settings with shared reward functions - may not generalize to competitive or mixed-motive scenarios where ToM alignment dynamics differ.

  2. FUNDAMENTAL: Limited to ToM orders 0-2 based on cognitive science literature, potentially missing more complex reasoning patterns in sophisticated AI agents.

  3. ENGINEERING: Computational overhead from maintaining multiple ToM hypotheses and recursive reasoning may limit scalability to real-time applications.

  4. ENGINEERING: Relies on LLM reasoning capabilities which may degrade for complex environments or long interaction sequences.

  5. EVALUATION: No evaluation on real-world coordination tasks or human-AI interaction scenarios where ToM alignment matters most.

  6. EVALUATION: Limited analysis of performance with more than two agents or in partially observable environments.

    Failure Modes:

    • A-ToM agents in self-play may fail to converge when both agents continuously adapt their ToM order estimates
    • Method may perform poorly when optimal action space is large or agents behave irrationally (as shown in 3-action game experiment)

SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration

Authors: Yu Pan, Wenlong Yu, Tiejun Wu, Xiaohu Ye et al. (7 authors) · Institution: Toyota Research · Category: cs.CR

SFCoT introduces real-time safety monitoring and intervention during Chain-of-Thought reasoning steps, reducing jailbreak attack success rates from 58.97% to 12.31% while preserving 91.2% of model utility.

Practical Takeaway: This work demonstrates that monitoring intermediate reasoning steps rather than just final outputs can significantly improve LLM safety (79% ASR reduction). The three-tier scoring approach and consistency verification are implementable techniques worth considering for production LLM systems. However, the framework requires model-specific integration and has only been validated on one model and dataset. Research engineers should consider implementing similar step-by-step monitoring for safety-critical applications, but should validate across diverse models and attack scenarios before deployment.

Tags: llm-safety jailbreak-defense chain-of-thought adversarial-robustness real-time-monitoring safety-alignment reasoning-safety

arXiv · PDF

Task & Setting

The paper addresses the critical safety vulnerability of large language models (LLMs) to jailbreak attacks that bypass safety alignment by exploiting multi-step reasoning processes. Existing defenses only filter final outputs, leaving intermediate Chain-of-Thought (CoT) reasoning steps unmonitored where adversarial manipulation can propagate undetected.

The task is to implement real-time safety monitoring and intervention during CoT reasoning. Input: user query x leading to reasoning chain T = {t1, t2, …, tn} and final output y. The system must ensure safety score S(x,T,y) ≥ τ with probability ≥ 1-ε, where:

\[S: \mathcal{X} \times \mathcal{T} \times \mathcal{Y} \rightarrow [0,1]\]

\[\mathbf{P}[S(x, \mathcal{T}, y) \geq \tau] \geq 1 - \epsilon\]

Success is measured by Attack Success Rate (ASR) - the proportion of jailbreak prompts yielding unsafe responses. Secondary metrics include Output Quality Score (1-5 rating) and Utility Preservation on standard benchmarks.

The evaluation uses JailBreakV_28K dataset with 20,000 jailbreak samples across 16 safety categories, with 195 representative samples for validation.

Architecture & Method
  1. CoT Parser extracts structured reasoning chain T and final answer y from model output stream (using tags for Qwen3 or explicit parsing for closed models like GPT-4)

  2. Three-tier Safety Scoring System evaluates each reasoning step t_i:

    • Lexical Level: rapid screening using a sensitive lexicon and regex rules
    • Semantic Level: lightweight deep learning model for implicit risk detection
    • Policy Level: contextual analysis within the broader CoT for advanced tactics

  3. Safety score fusion via weighted averaging:

    \[S(t_i) = \alpha_1 \cdot S_{lex}(t_i) + \alpha_2 \cdot S_{sem}(t_i) + \alpha_3 \cdot S_{policy}(t_i)\]

    with weights α₁=0.3, α₂=0.5, α₃=0.2

  4. Multi-perspective Consistency Verifier for gray-zone steps generates K semantic variants of the step, checks their scores for consistency, and uses the average as the step score:

    \[S(t_i) = \frac{1}{K}\sum_{k=1}^{K}S(t_i^{(k)})\]

  5. Dynamic Intervention Module: truncates unsafe steps, rewrites ambiguous steps, or proceeds if safe

    The core contribution is proactive step-by-step safety monitoring during reasoning rather than post-hoc filtering of final outputs.
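
The scoring and intervention steps above can be sketched as follows. Only the fusion weights (0.3, 0.5, 0.2) come from the summary. The two thresholds, the function names, and the convention that a higher score means safer are assumptions made for illustration, and the three per-level scorers are taken as given inputs rather than implemented.

```python
from statistics import mean

ALPHA = (0.3, 0.5, 0.2)  # fusion weights reported above

# Hypothetical thresholds (not given in the digest); higher score = safer.
UNSAFE_BELOW = 0.4
SAFE_ABOVE = 0.7


def fuse(s_lex, s_sem, s_policy, alpha=ALPHA):
    """Weighted average of the lexical, semantic and policy scores."""
    a1, a2, a3 = alpha
    return a1 * s_lex + a2 * s_sem + a3 * s_policy


def consistency_score(variant_scores):
    """Average the scores of K paraphrased variants of a step."""
    return mean(variant_scores)


def decide(score):
    """Map a step score to one of the three interventions."""
    if score < UNSAFE_BELOW:
        return "truncate"
    if score < SAFE_ABOVE:
        return "rewrite"
    return "proceed"


def monitor_step(s_lex, s_sem, s_policy, variant_scores=None):
    """Score one reasoning step; re-score gray-zone steps from variants."""
    score = fuse(s_lex, s_sem, s_policy)
    in_gray_zone = UNSAFE_BELOW <= score < SAFE_ABOVE
    if in_gray_zone and variant_scores:
        score = consistency_score(variant_scores)
    return score, decide(score)
```

For example, `monitor_step(0.2, 0.6, 0.9)` fuses to 0.54 and lands in the assumed gray zone. Passing `variant_scores=[0.8, 0.9, 0.7]` would lift the step to "proceed" under these made-up thresholds.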

Training Recipe

Not reported - the paper focuses on inference-time safety intervention rather than training new models. The framework operates on pre-trained LLMs (specifically Qwen3-8B) without requiring additional training phases. The safety scoring components appear to use existing lightweight models and rule-based systems, but specific training details for these components are not provided.

Novelty & Lineage

The work builds on Chain-of-Thought prompting (Wei et al. 2022) and existing safety defenses like post-hoc filtering and layer-wise editing. Prior work includes SelfDefend (Wang et al. 2024) and layer-specific editing approaches (Zhao et al. 2024).

The specific delta is moving from post-hoc safety filtering to real-time step-by-step monitoring during CoT reasoning, combined with multi-perspective consistency verification for ambiguous cases. This represents a shift from reactive to proactive safety intervention.

Rating: SIGNIFICANT - addresses a clear gap in existing approaches with a novel architectural solution, though builds incrementally on established CoT and safety concepts.

Benchmarks & Results
  1. JailBreakV_28K (Attack Success Rate): SFCoT 12.31%, Post-hoc filtering 45.13%, Baseline 58.97% - a 79.1% relative reduction in ASR versus the baseline
  2. MMLU (accuracy): SFCoT 69.84%, Baseline 76.89% - 90.8% utility preservation
  3. GSM8K (accuracy): SFCoT 82.67%, Baseline 89.84% - 92.0% utility preservation
  4. MBPP (accuracy): SFCoT 63.28%, Baseline 69.80% - 90.7% utility preservation
  5. Output Quality Score: SFCoT 4.6/5, truncation-only method 2.1/5

    Results show strong safety improvement with modest utility degradation. The paper lacks comparison to other recent safety methods beyond post-hoc filtering.
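
The headline percentages can be reproduced from the raw numbers listed above. The sketch below assumes the ASR figure is a relative reduction and that utility preservation is the ratio of SFCoT accuracy to baseline accuracy, which is what the reported values are consistent with. The helper names are illustrative.

```python
def relative_reduction(baseline, value):
    """Percentage drop of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline


def preservation(baseline, value):
    """Percentage of the baseline score that is retained."""
    return 100.0 * value / baseline


# Attack Success Rate: baseline 58.97% -> SFCoT 12.31%
asr_reduction = relative_reduction(58.97, 12.31)

# Utility benchmarks: (baseline accuracy, SFCoT accuracy)
utility = {
    "MMLU": preservation(76.89, 69.84),
    "GSM8K": preservation(89.84, 82.67),
    "MBPP": preservation(69.80, 63.28),
}
mean_utility = sum(utility.values()) / len(utility)

print(round(asr_reduction, 1))                       # 79.1
print({k: round(v, 1) for k, v in utility.items()})  # 90.8, 92.0, 90.7
print(round(mean_utility, 1))                        # 91.2
```

The mean of the three preservation ratios matches the 91.2% utility figure quoted in the paper summary.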

Compute & Efficiency
  1. Model size: Evaluated on Qwen3-8B (8 billion parameters), framework adds lightweight safety scoring components
  2. Training compute: Not reported - uses pre-trained models
  3. Inference speed/latency: Not reported, though claims “efficient” operation
  4. Memory footprint: Not reported beyond mentioning “lightweight deep learning model” for semantic scoring
  5. Deployment practicality: Moderate - requires parsing capabilities and real-time scoring, but designed for practical deployment without model retraining

Real-World Applicability
  1. Framework tested only on curated benchmark datasets (JailBreakV_28K, MMLU, GSM8K, MBPP)
  2. No production deployment results reported
  3. No real-world user study or field testing mentioned
  4. Implementation requires model-specific parsing (demonstrated on Qwen3, mentioned for GPT-4)
  5. No discussion of performance on diverse real-world prompts beyond benchmark evaluation

Limitations & Failure Modes
  1. ENGINEERING - Requires model-specific parsing implementation for different LLM architectures
  2. EVALUATION - Limited to single model (Qwen3-8B) and specific jailbreak dataset
  3. ENGINEERING - Safety scoring weights (α₁, α₂, α₃) appear manually tuned rather than learned
  4. EVALUATION - No comparison to other recent safety methods beyond basic post-hoc filtering
  5. FUNDAMENTAL - May struggle with novel attack patterns not covered in training data for safety scorers

    Failure modes:

    • Sophisticated attacks that maintain consistency across paraphrases could bypass the consistency verifier
    • Over-conservative safety filtering may truncate legitimate reasoning chains, degrading utility more than reported

Towards Generalizable Robotic Manipulation in Dynamic Environments

Authors: Heng Fang, Shangru Li, Shuhan Wang, Xuanyang Xi et al. (6 authors) · Institution: Toyota Research · Category: cs.CV

This paper introduces DOMINO, a large-scale dynamic manipulation benchmark, and PUMA, a VLA architecture that uses historical optical flow and auxiliary future prediction to achieve 6.3% improvement over baselines on tasks involving moving objects.

Practical Takeaway: This work identifies dynamic manipulation as a critical frontier where current VLA models fail dramatically, with even a strong baseline such as π0.5 dropping from 44.8% to 7.5% success rate when objects start moving. The key insight is that explicit motion cues (optical flow) outperform raw frame stacking, and auxiliary future prediction during training helps without inference overhead. Research engineers should consider: (1) incorporating historical optical flow in VLA architectures, (2) using object-centric future prediction as a training signal, and (3) testing their manipulation policies on moving targets. However, the low absolute performance (17.2% success rate for PUMA) suggests this remains a hard problem requiring significant architectural advances beyond current VLA foundations.

Tags: dynamic-manipulation vision-language-action robotic-manipulation temporal-modeling optical-flow dual-arm-robotics spatiotemporal-reasoning embodied-ai

arXiv · PDF

Task & Setting

Robotic manipulation in real-world environments requires adapting to moving objects and continuous environmental changes, such as operating on assembly lines or working alongside humans. Current Vision-Language-Action (VLA) models excel at static manipulation but struggle when targets move, primarily due to their reliance on single-frame observations and lack of spatiotemporal reasoning capabilities. This creates a critical gap for deploying robots in complex dynamic scenarios.

The task is defined as a Partially Observable Markov Decision Process where at time step t, the robot receives observation history o_{t-h:t} including RGB-D visual inputs I_t and proprioception s_t^r, along with language instructions l. The policy π_φ(a_t | o_{t-h:t}) must output continuous dual-arm control commands a_t ∈ A to minimize the expected finite-horizon cost:

\[J(π_φ) = \mathbb{E}\left[\sum_{k=0}^{H-1} γ^k ℓ(s_{t+k}, a_{t+k})\right]\]

where ℓ(·) penalizes spatial discrepancy between end-effectors and moving objects plus control effort.
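
To make the objective concrete, the sketch below evaluates the discounted finite-horizon cost for one rollout. The exact form of ℓ is not given in this summary, so the squared end-effector-to-object distance, the squared-action effort term, and the weight `w_effort` are illustrative assumptions only.

```python
def stage_cost(ee_pos, obj_pos, action, w_effort=0.01):
    """Hypothetical per-step cost: tracking error plus control effort."""
    tracking = sum((e - o) ** 2 for e, o in zip(ee_pos, obj_pos))
    effort = sum(u ** 2 for u in action)
    return tracking + w_effort * effort


def finite_horizon_cost(ee_traj, obj_traj, actions, gamma=0.99, w_effort=0.01):
    """Sum of gamma^k * stage cost over a horizon of H = len(actions) steps."""
    return sum(
        (gamma ** k) * stage_cost(ee_traj[k], obj_traj[k], actions[k], w_effort)
        for k in range(len(actions))
    )


# Toy rollout: the end-effector stays 1 unit from the object for 3 steps
# and applies zero control, so the cost is 1 + gamma + gamma^2.
ee = [[1.0, 0.0, 0.0]] * 3
obj = [[0.0, 0.0, 0.0]] * 3
acts = [[0.0, 0.0]] * 3
cost = finite_horizon_cost(ee, obj, acts, gamma=0.5)
```

Because the object moves, `obj_traj` changes at every step, which is what separates this setting from static manipulation, where the target position would be constant.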

Success is measured by Success Rate (SR) - percentage of episodes satisfying all task conditions within time budget - and Manipulation Score (MS), a continuous metric capturing execution quality through Route Completion adjusted by safety penalties. The benchmark DOMINO@α parameterizes maximum target speed using scalar coefficient α.

The paper introduces DOMINO, a large-scale dataset featuring 35 dynamic manipulation tasks across 5 robot embodiments with 110K+ expert trajectories. Tasks are organized into 3 hierarchical difficulty levels: Level 1 (constant velocity), Level 2 (polynomial trajectories), and Level 3 (stochastic with abrupt changes).

Architecture & Method
  1. Base architecture: PUMA (Predictive Unified Manipulation Architecture) built on Qwen3-VL backbone with dual-query mechanism for action prediction and future state forecasting.

  2. Scene-centric historical dynamics encoding: Process h historical third-person frames using optical flow computation instead of raw frame stacking, providing explicit dense motion cues through compressed flow maps of size [h, 64, 64].

  3. Object-centric future prediction: Extract target object features from N future frames using frozen GroundingDINO + SAM2 for segmentation, then DINO encoder for patch tokens. Object-centric future feature computed as:

    \[f_{t+i} = \mathcal{P}(\mathcal{E}(I_{t+i}), \mathcal{B}(I_{t+i}, p))\]

    where P denotes masked average pooling, E is DINO encoder, B is binary mask from grounding module.

  4. Dual supervision: Action policy trained with ℓ1 regression loss:

    \[\mathcal{L}_{action} = \frac{1}{K} \sum_{i=0}^{K-1} \|\hat{a}_{t+i} - a^*_{t+i}\|_1\]

  5. Auxiliary future predictor optimized via cosine similarity loss against ground-truth object features:

    \[\mathcal{L}_{world} = \frac{1}{N}\sum_{i=1}^{N}\left(1-\frac{z_{t+i}^⊤f_{t+i}}{\|z_{t+i}\|_2\,\|f_{t+i}\|_2}\right)\]

  6. Total objective: $\mathcal{L}_{total} = \mathcal{L}_{action} + λ\mathcal{L}_{world}$ where λ balances action prediction and dynamics modeling.
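
The pooling step and the two losses above can be written out directly. This is a plain-Python sketch on nested lists rather than the authors' training code. The value `lam=0.1` is a placeholder because λ is not reported, and the toy inputs stand in for the DINO patch tokens and segmentation masks.

```python
import math


def masked_average_pool(patch_tokens, mask):
    """Average the patch tokens whose binary mask entry is 1 (the P operator)."""
    selected = [tok for tok, m in zip(patch_tokens, mask) if m]
    if not selected:
        raise ValueError("mask selects no patches")
    dim = len(selected[0])
    return [sum(tok[d] for tok in selected) / len(selected) for d in range(dim)]


def action_loss(pred_actions, expert_actions):
    """Mean over the K-step action chunk of the L1 distance to expert actions."""
    k_steps = len(expert_actions)
    total = sum(
        sum(abs(p - a) for p, a in zip(pred, expert))
        for pred, expert in zip(pred_actions, expert_actions)
    )
    return total / k_steps


def world_loss(pred_feats, target_feats):
    """Mean over N future frames of (1 - cosine similarity)."""
    def cosine(z, f):
        dot = sum(a * b for a, b in zip(z, f))
        norm_z = math.sqrt(sum(a * a for a in z))
        norm_f = math.sqrt(sum(b * b for b in f))
        return dot / (norm_z * norm_f)

    n_frames = len(target_feats)
    return sum(1.0 - cosine(z, f) for z, f in zip(pred_feats, target_feats)) / n_frames


def total_loss(l_action, l_world, lam=0.1):
    """L_total = L_action + lambda * L_world (lam is a placeholder value)."""
    return l_action + lam * l_world
```

Since the world loss needs ground-truth future features, it can only be computed during training, which is consistent with the summary's note that the future predictor adds no inference overhead.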

Training Recipe
  1. Dataset construction: 110K+ expert trajectories generated using two-stage spatiotemporal synchronization in SAPIEN physics engine with RoboTwin 2.0 framework. Temporal dry-run phase records execution times, kinematic back-calculation phase determines initial object positions.

  2. Training data: Mixed static and dynamic manipulation tasks across 5 robot embodiments (Aloha-AgileX, ARX-X5, Franka, Piper, UR5-Wsg) with domain randomization for generalization.

  3. Model training: End-to-end supervised behavioral cloning on NVIDIA A100 GPUs. Training tuples include {o_{t-h:t}, l, f_{t+1:t+N}, a*_{t:t+K-1}} with historical observations, language instructions, future features, and expert actions.

  4. Hyperparameters: Balancing parameter λ for world model loss, prediction horizon N=4 frames, action chunk length K (not specified), learning rate and optimizer details not reported.

  5. Evaluation setup: All baselines fine-tuned on DOMINO dataset for fair comparison, with primary experiments using Aloha-AgileX robot under Level 1 dynamics with α=0.1 coefficient.

  6. Hardware requirements: Training on A100 GPUs, evaluation on RTX GPUs. Wall-clock training time not reported.

Novelty & Lineage

This work addresses a genuinely underexplored area - dynamic manipulation with moving targets - that existing VLA models like OpenVLA (2024), π0/π0.5 (2024-2025), and RDT (2024) have not tackled systematically. The closest prior work is static manipulation datasets like RoboTwin 2.0 and general VLA architectures, but none focus on continuous object motion.

Key novel contributions:

  1. DOMINO dataset - first large-scale dynamic manipulation benchmark with hierarchical complexity levels and comprehensive evaluation metrics
  2. Explicit optical flow integration for historical motion encoding rather than raw frame stacking
  3. Object-centric future prediction supervision using grounded visual features during training only

    The technical delta from prior VLAs is significant: while existing methods like DreamVLA attempt scene-level prediction, this work focuses on object-centric dynamics with explicit motion cues. The auxiliary future predictor that operates only during training is a clever architectural choice.

    Rating: SIGNIFICANT - addresses an important gap with novel dataset and targeted architectural innovations, though builds incrementally on existing VLA foundations.

Benchmarks & Results
  1. DOMINO@0.1 dynamic manipulation: PUMA achieves 17.20% success rate vs OpenVLA-OFT 10.86% (+6.34 percentage points), π0.5 9.63%, RDT-1B 5.34%, OpenVLA 1.54%.

  2. DOMINO@0.1 Manipulation Score: PUMA achieves 34.97 vs OpenVLA-OFT 30.49, π0.5 26.17, demonstrating higher quality interactions.

  3. Static-to-dynamic transfer: All methods show dramatic performance drops from static to dynamic environments. π0.5 drops from 44.8% to 7.5% success rate, OpenVLA-OFT from 17.5% to 6.7%.

  4. Dynamic complexity scaling: Performance degrades rapidly across hierarchical levels. ACT model shows 48.17% → 34.40% → 28.60% success rates from Level 1 → Level 2 → Level 3.

  5. Cross-domain generalization: Dynamic-trained models achieve comparable performance to static-trained models on some static tasks, suggesting dynamic data provides robust representations.

  6. Co-training benefits: Mixed static+dynamic training improves PUMA success rate by 4.91 percentage points (14.80% → 19.71%) compared to dynamic-only training.

    Results consistently demonstrate the challenge of dynamic manipulation and PUMA’s effectiveness, though absolute performance remains modest across all methods.

Compute & Efficiency
  1. Model size: Built on Qwen3-VL backbone, specific parameter count not reported but likely 7B+ parameters based on base model.

  2. Training compute: NVIDIA A100 GPUs used for training, specific GPU hours and total compute not reported. Dataset generation and evaluation performed on RTX GPUs.

  3. Inference speed/latency: Not reported. The auxiliary future predictor operates only during training, so inference overhead should be minimal compared to base VLA models.

  4. Memory footprint: Not specified, but historical optical flow processing ([h, 64, 64] compressed flow maps) and multi-frame inputs likely increase memory requirements over single-frame VLAs.

  5. Deployment practicality: Currently simulation-only (SAPIEN physics engine). No real robot experiments or sim-to-real validation provided. The method requires historical frame buffers and optical flow computation, adding complexity to real-time deployment.

Real-World Applicability
  1. Simulation-only evaluation: All experiments conducted in SAPIEN physics engine with RoboTwin 2.0 framework. No real robot validation provided.

  2. Hardware considerations: Tested across 5 simulated robot embodiments (Aloha-AgileX, ARX-X5, Franka, Piper, UR5-Wsg) but no physical hardware experiments.

  3. Sim-to-real gap: Paper acknowledges this as future work. The reliance on optical flow computation and multi-frame processing may introduce latency issues in real-time robotic control.

  4. Deployment challenges: Method requires maintaining historical frame buffers, computing optical flow in real-time, and coordinating with grounding modules (GroundingDINO + SAM2) which adds computational overhead.

  5. Domain randomization: Dataset includes domain-randomized settings to improve generalization, but effectiveness on real-world visual variations remains unproven.

    The work establishes important foundations but significant engineering effort would be required for real-world deployment.

Limitations & Failure Modes
  1. FUNDAMENTAL: Even PUMA achieves only 17.20% success rate on dynamic tasks, indicating substantial room for improvement in handling complex object dynamics and spatiotemporal reasoning.

  2. EVALUATION: Simulation-only validation with no real robot experiments creates uncertainty about sim-to-real transfer, especially given the reliance on perfect optical flow and object tracking.

  3. ENGINEERING: Dependence on frozen grounding modules (GroundingDINO + SAM2) during training creates potential brittleness and adds computational overhead that may not scale to real-time applications.

  4. FUNDAMENTAL: The approach still struggles with higher-order dynamics (Level 2/3), suggesting current architectures may need more sophisticated temporal modeling beyond optical flow and short-horizon prediction.

  5. ENGINEERING: Historical frame buffering and optical flow computation requirements may introduce significant latency in real-world deployment scenarios.

    Failure modes:

    • Performance degrades rapidly with unpredictable object motion (Level 3 dynamics), suggesting limited robustness to truly stochastic environments.
    • The method may fail when optical flow estimation is unreliable due to lighting changes, motion blur, or visual occlusions common in real environments.

Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

Authors: Yulin Luo, Hao Chen, Zhuangzhe Wu, Bowen Sui et al. (13 authors) · Institution: Toyota Research · Category: cs.CV

Proposes DeepVision-VLA, which addresses visual information degradation in deeper VLA layers by injecting multi-level vision foundation model features via a novel Vision-Language Mixture-of-Transformers architecture, achieving 9% and 7.5% improvements in simulation and real-world manipulation tasks.

Practical Takeaway: If you’re building VLA models, this paper reveals a critical but previously unrecognized issue: visual information degrades in deeper LLM layers during action prediction. The key insight is that you shouldn’t treat the LLM backbone as a black box. Instead, consider injecting visual features directly into deep layers via the proposed VL-MoT architecture, especially if your tasks require fine-grained visual understanding. The Action-Guided Visual Pruning strategy is particularly practical - using shallow-layer attention to guide which visual tokens to preserve is both computationally efficient and effective. While the dual-model architecture adds complexity, the consistent 7-20% improvements across diverse tasks suggest this approach is worth implementing for manipulation-heavy applications.

Tags: vision-language-action robotics manipulation foundation-models attention-mechanisms vision-language-models transformer-architecture visual-grounding

arXiv · PDF

Task & Setting

Robotic manipulation requires Vision-Language-Action (VLA) models to accurately interpret visual scenes and language instructions to generate precise robot actions. Current VLA models struggle because visual information becomes progressively less accessible in deeper layers of their LLM backbones, degrading action prediction quality especially for complex manipulation tasks requiring fine-grained visual understanding.

The task is to learn a policy π_θ that maps visual observations o_t and language instructions l to robot actions a_t. The formal objective is:

\[\theta^* = \arg \min_\theta \mathbb{E}_{(\tau,l) \sim \mathcal{D}} \left[ \sum_{t=1}^T \ell(\pi_\theta(o_{\leq t}, l), a_t) \right]\]

where τ = {(o_t, a_t)} represents demonstration trajectories, l is the language instruction, and ℓ(·,·) is the task-specific action supervision objective. Inputs are RGB images (256×256 for the VLA, 512×512 for the vision expert) and natural language instructions; the output is continuous robot actions (typically 7-DoF for manipulation).

Success is measured by task completion rates across manipulation benchmarks. The paper evaluates on RLBench (10 simulated tasks, 20 rollouts each) and real-world experiments (4 tasks on Franka Research 3 robot, 20 rollouts each). Performance is reported as success rate percentages.

Architecture & Method
  1. Vision Expert Integration: Uses DINOv3-H (0.8B parameters) as a dedicated vision foundation model alongside the main VLA visual encoder (SigLIP2-Large, 0.3B)

  2. Vision-Language Mixture-of-Transformers (VL-MoT): Novel architecture where Vision Expert QKV representations are directly integrated with deep VLA layers via shared attention mechanism:

    \[Q = [Q_E; Q_Z], \quad K = [K_E; K_Z], \quad V = [V_E; V_Z]\]

    \[A = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right), \quad H = AV\]

  3. Multi-level Feature Selection: Extracts features from the last n transformer layers of Vision Expert (empirically n=16) and injects them into corresponding deep layers of VLA backbone

  4. Action-Guided Visual Pruning (AGVP): Uses shallow-layer action-to-vision attention maps to identify task-relevant visual regions:

    \[m = \frac{1}{|L_s|} \sum_{\ell \in L_s} \frac{1}{N_a} \sum_{i=1}^{N_a} A^\ell_{i,:}\]
    Then applies TopK selection to retain only the most relevant visual tokens
    
  5. Baseline Architecture: Built on QwenVLA-OFT using Qwen3-VL (4B) backbone with parallel action prediction and ℓ1 regression loss
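
The shared-attention equations in item 2 amount to concatenating the two token streams before a standard scaled dot-product attention. The sketch below shows that for a single head in plain Python. Reading the E subscript as Vision Expert tokens and Z as VLA backbone tokens is an assumption, since the summary does not define them, and per-stream projections, multiple heads, and masking are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(q_e, k_e, v_e, q_z, k_z, v_z):
    """Single-head attention over concatenated expert (E) and VLA (Z) tokens.

    Each argument is a list of token vectors (lists of floats).
    Returns one output vector per token in [E; Z].
    """
    q = q_e + q_z  # concatenation along the token axis
    k = k_e + k_z
    v = v_e + v_z
    d_k = len(k[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        attn = softmax(scores)
        out.append([
            sum(w * vj[d] for w, vj in zip(attn, v))
            for d in range(len(v[0]))
        ])
    return out


# Toy check: zero queries give uniform attention, so every output is the
# mean of all value vectors across both streams.
h = joint_attention(
    q_e=[[0.0, 0.0]], k_e=[[1.0, 0.0]], v_e=[[1.0, 0.0]],
    q_z=[[0.0, 0.0]], k_z=[[0.0, 1.0]], v_z=[[3.0, 2.0]],
)
```

The point of the concatenation is that VLA tokens in deep layers can attend directly to Vision Expert keys and values, instead of relying on visual information carried up from earlier layers.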

Training Recipe
  1. Initialization: Pretrained weights from Qwen3-VL (4B) and DINOv3-H (0.8B), then end-to-end training of entire architecture

  2. Pretraining Data: 400K+ trajectories curated from Open X-Embodiment, DROID, and RoboMIND datasets with careful processing and filtering

  3. Pretraining: One epoch on the curated dataset using AdamW optimizer (specific learning rate not reported)

  4. Simulation Fine-tuning: 300 epochs on RLBench tasks using AdamW optimizer on 8 NVIDIA H20 GPUs (learning rate, batch size not reported)

  5. Real-world Fine-tuning: Same protocol as simulation for real-world tasks (specific details not reported)

  6. Computational Setup: Dual resolution processing (256×256 for VLA branch, 512×512 for Vision Expert), pruning ratio set to 0.5, integration of last 16 layers from both models
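
With the pruning ratio of 0.5 noted above, Action-Guided Visual Pruning can be sketched as an average over shallow-layer attention followed by TopK. The nested-list layout `attn_maps[layer][action_token][visual_token]` and the function name are assumptions for illustration. At a ratio of 0.5, keeping half and pruning half coincide, so the ambiguity in what "ratio" refers to does not matter here.

```python
def agvp_keep_indices(attn_maps, keep_ratio=0.5):
    """Return (indices of visual tokens to keep, relevance vector m).

    attn_maps[layer][action_token][visual_token] holds action-to-vision
    attention weights from the selected shallow layers.
    """
    n_layers = len(attn_maps)
    n_vis = len(attn_maps[0][0])
    m = [0.0] * n_vis
    for layer in attn_maps:
        n_actions = len(layer)
        for row in layer:
            for j, weight in enumerate(row):
                # Mean over action tokens, then mean over layers.
                m[j] += weight / (n_actions * n_layers)
    k = max(1, int(n_vis * keep_ratio))
    top = sorted(range(n_vis), key=lambda j: m[j], reverse=True)[:k]
    return sorted(top), m


# Toy input: 2 shallow layers, 2 action tokens, 4 visual tokens.
attn = [
    [[0.1, 0.4, 0.3, 0.2], [0.1, 0.5, 0.2, 0.2]],
    [[0.0, 0.6, 0.3, 0.1], [0.2, 0.3, 0.4, 0.1]],
]
keep, relevance = agvp_keep_indices(attn, keep_ratio=0.5)
```

In this toy case visual tokens 1 and 2 receive the most action attention and are kept. In the real model the retained tokens would be the ones passed on for deep-layer fusion with the Vision Expert.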

Novelty & Lineage

Closest Prior Work: OpenVLA (2024), π0 (2024), HybridVLA (2025) represent current SOTA in VLA models. Previous vision enhancement approaches include RT-Trajectory (2023), SpatialVLA (2024), CogACT (2024).

Core Delta: This is the first work to systematically analyze how visual information degrades in deeper LLM layers of VLA models and propose a targeted solution. The Vision-Language Mixture-of-Transformers framework with direct QKV-level integration and Action-Guided Visual Pruning are novel architectural contributions.

Key Innovation: Moving beyond treating LLM backbone as a “black box” to understanding and addressing layer-wise visual degradation through targeted multi-level feature injection.

Rating: SIGNIFICANT - Addresses a previously unrecognized fundamental limitation in VLA architectures with a principled solution backed by thorough analysis.

Benchmarks & Results
  1. RLBench Simulation: Mean success rate 83% vs. best baseline HybridVLA 74% (+9.0% improvement), QwenVLA-OFT 69% (+14% improvement)

  2. Real-world Single-arm Manipulation: Average success rate 91.7% vs. π0.5 84.2% (+7.5%), QwenVLA-OFT 74.2% (+17.5%), OpenVLA-OFT 71.7% (+20.0%)

  3. Individual RLBench Tasks: Perfect scores (100%) on Close box, Toilet seat down; 95% on Sweep to dustpan (vs. 15% baseline), 85% on Wine at rack (vs. 65% baseline)

  4. Real-world Task Breakdown: Stack coke cans 65%, Write letter ‘S’ 95%, Pick fruit (both steps) 95%, Pour coke (both steps) 100%

  5. Generalization Tests: Maintains high performance under novel backgrounds and lighting conditions with smaller degradation than baselines

    The results show consistent improvements across diverse manipulation scenarios, with particularly strong gains on visually challenging tasks.

Compute & Efficiency
  1. Model Size: Total ~4.8B parameters (Qwen3-VL 4B + DINOv3-H 0.8B + additional alignment layers)

  2. Training Compute: 8 NVIDIA H20 GPUs for 300 epochs fine-tuning (total GPU hours not reported), pretraining compute not specified

  3. Inference Speed: Not explicitly reported, but dual-resolution processing (256×256 and 512×512) and attention pruning suggest computational overhead vs. baseline

  4. Memory Footprint: Higher than baseline due to dual-model architecture, but AGVP strategy with TopK pruning (ratio 0.5) helps control memory usage

  5. Deployment Practicality: Successfully deployed on Franka Research 3 robot for real-world tasks, suggesting practical feasibility, though computational overhead limits scalability
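The TopK pruning mentioned in the memory-footprint item can be illustrated with a small Python sketch. It assumes per-token attention scores from a few shallow layers are already available and that they are aggregated by a plain average; the function name, the averaging rule, and the input layout are illustrative assumptions, not the paper's AGVP implementation. Only the keep ratio of 0.5 comes from the summary.

```python
from typing import List, Sequence


def topk_visual_tokens(attn_per_layer: Sequence[Sequence[float]],
                       keep_ratio: float = 0.5) -> List[int]:
    """Return indices of visual tokens kept after TopK pruning.

    attn_per_layer[l][i] is the attention mass that visual token i receives
    in shallow layer l (hypothetical input layout).
    """
    if not attn_per_layer or not attn_per_layer[0]:
        raise ValueError("need at least one layer and one token")
    n_layers = len(attn_per_layer)
    n_tokens = len(attn_per_layer[0])

    # Assumption: aggregate across shallow layers with a simple mean.
    aggregated = [
        sum(layer[i] for layer in attn_per_layer) / n_layers
        for i in range(n_tokens)
    ]

    # Keep the highest-scoring fraction of tokens (at least one).
    k = max(1, int(n_tokens * keep_ratio))
    ranked = sorted(range(n_tokens), key=lambda i: aggregated[i], reverse=True)

    # Restore the original token order so positional structure is preserved.
    return sorted(ranked[:k])


# Two shallow layers, four visual tokens.
scores = [[0.1, 0.4, 0.3, 0.2],
          [0.2, 0.5, 0.3, 0.1]]
kept = topk_visual_tokens(scores, keep_ratio=0.5)
```

With a keep ratio of 0.5, half of the visual tokens are dropped before the deeper layers, which is where the memory saving would come from.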

Real-World Applicability
  1. Real Hardware Deployment: Successfully tested on Franka Research 3 robot with Intel RealSense D455 camera across 4 complex manipulation tasks

  2. Physical Environment: Real laboratory setting with varying backgrounds, lighting conditions, and object arrangements

  3. Task Complexity: Multi-step manipulation including precise actions like writing letters, pouring liquids, and coordinated pick-and-place sequences

  4. Generalization Testing: Demonstrated robustness to unseen environmental conditions (novel backgrounds, lighting changes) in zero-shot settings

  5. Production Readiness: While successfully deployed in controlled lab settings, the dual-model architecture and computational requirements may limit immediate production deployment

  6. Sim-to-Real: No explicit sim-to-real transfer discussion, but real-world results suggest effective generalization from simulation training

Limitations & Failure Modes
  1. ENGINEERING: Computational overhead from dual-model architecture (4.8B total parameters) may limit deployment scalability

  2. ENGINEERING: Requires careful hyperparameter tuning for layer selection (n=16), pruning ratios (0.5), and attention aggregation across shallow layers

  3. EVALUATION: Limited to single-arm manipulation tasks; dual-arm or more complex embodiments not tested

  4. EVALUATION: Real-world evaluation limited to 4 tasks in controlled lab environment; broader real-world diversity not assessed

  5. FUNDAMENTAL: Dependence on quality of shallow-layer attention maps for AGVP guidance could fail if early layers don’t maintain good visual grounding

  6. ENGINEERING: Vision Expert choice (DINOv3) not systematically compared against other foundation models

    Failure Modes:

    • May struggle with tasks requiring global scene understanding if AGVP over-prunes important context
    • Could fail in environments with significant domain shift from training data despite demonstrated robustness to some variations

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Authors: Yibin Liu, Yaxing Lyu, Daqi Gao, Zhixuan Liang et al. (8 authors) · Institution: Toyota Research · Category: cs.RO

PRIMO R1 transforms video MLLMs from passive observers into active critics for robotic process supervision by using outcome-based reinforcement learning to elicit Chain-of-Thought reasoning for progress estimation.

Practical Takeaway: If you’re working on robotic manipulation or video understanding for robotics, the key insight is using outcome-based reinforcement learning to elicit explicit reasoning chains rather than direct regression for progress estimation. The temporal boundary anchoring technique (initial state + video + current state) is worth implementing as it significantly outperforms video-only approaches. However, be aware of the inference latency tradeoff from Chain-of-Thought generation and the sim-to-real performance gap. The GRPO training approach could be adapted to other multimodal reasoning tasks where you want structured outputs without expensive value function training.

Tags: robotics multimodal-llm video-understanding reinforcement-learning chain-of-thought manipulation progress-estimation process-supervision

arXiv · PDF

Task & Setting

Accurate process supervision for long-horizon robotic manipulation tasks remains challenging because existing video multimodal large language models (MLLMs) function as passive observers rather than active evaluators that can reason about task progress. Most current models excel at describing what is happening but struggle to quantitatively assess how well a task is proceeding relative to the goal, often assigning high progress scores to failed attempts based on superficial visual similarity.

The task is to estimate continuous task progress from multimodal inputs: an initial state image I_init, a process video sequence V_seq = {v1, v2, …, vT}, a current state image I_curr, and a language instruction I specifying the task goal. The objective is to learn a mapping function F that evaluates the visual tuple conditioned on the instruction, outputting a scalar progress indicator y ∈ [0, 100] where 0 denotes initial state and 100 signifies task completion.

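Success is measured using Mean Relative Accuracy (MRA) and Mean Absolute Error (MAE). MRA averages, over a set of thresholds $\mathcal{T}$, an indicator of whether the relative error of the prediction $\hat{y}$ falls below $1 - \tau$: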

\[\mathrm{MRA} = \frac{1}{|\mathcal{T}|} \sum_{\tau \in \mathcal{T}} \mathbb{I}\left(\frac{|\hat{y} - y|}{|y|} < 1 - \tau \right)\]
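A minimal Python sketch of both metrics follows. It assumes the threshold set is {0.50, 0.55, ..., 0.95}, a common choice for MRA that this summary does not state, and that per-sample scores are averaged over the evaluation set. The function names are illustrative.

```python
from typing import List, Sequence

# Assumed threshold set tau in {0.50, 0.55, ..., 0.95}; not given in the summary.
THRESHOLDS: List[float] = [0.50 + 0.05 * k for k in range(10)]


def relative_accuracy(y_hat: float, y: float,
                      thresholds: Sequence[float] = THRESHOLDS) -> float:
    """Fraction of thresholds tau with |y_hat - y| / |y| < 1 - tau."""
    if y == 0:
        # The relative error is undefined at y = 0; how such samples are
        # handled is not specified, so this sketch simply rejects them.
        raise ValueError("relative error is undefined for y == 0")
    rel_err = abs(y_hat - y) / abs(y)
    hits = sum(1 for tau in thresholds if rel_err < 1.0 - tau)
    return hits / len(thresholds)


def mean_relative_accuracy(preds: Sequence[float],
                           targets: Sequence[float]) -> float:
    """Average the per-sample relative accuracy over a dataset."""
    scores = [relative_accuracy(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


def mean_absolute_error(preds: Sequence[float],
                        targets: Sequence[float]) -> float:
    """Average absolute difference between predicted and true progress."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


# Progress values on the 0-100 scale used by the task definition.
preds, targets = [62.0, 50.0], [50.0, 50.0]
mra = mean_relative_accuracy(preds, targets)
mae = mean_absolute_error(preds, targets)
```

Under these assumptions a prediction of 62 against a target of 50 has relative error 0.24 and passes six of the ten thresholds. Multiplying the result by 100 gives a percent-style figure comparable to the MRA numbers quoted in the results.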

The paper introduces the PRIMO Dataset with 326k processed samples across simulation (BEHAVIOR-1k, RoboTwin) and real-world (AgiBot) environments, plus PRIMO Benchmark for systematic evaluation of in-domain vs out-of-domain generalization including real humanoid robot scenarios.

Architecture & Method
  1. Base architecture: Qwen2.5-VL-7B-Instruct as the foundation multimodal large language model

  2. Structured temporal input design: Explicitly anchor video sequences between initial state image (I_init), process video (V_seq), and current state image (I_curr) to provide clear visual boundary conditions

  3. Chain-of-Thought reasoning framework: Transform prediction from direct scalar regression into multi-step generative reasoning task with three modules - Planning (decompose high-level goal into execution steps), Observation (extract dynamic variables from video), and Reasoning (map visual primitives to planned topology)

  4. Group Relative Policy Optimization (GRPO): Replace expensive value function critic with group statistics baseline. Advantage estimation:

    \[A_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\}) + \epsilon}\]
  5. Composite reward function: R(o_i, y_gt) = r_fmt + r_acc, where the format reward enforces a reasoning-then-prediction output structure and the accuracy reward is:

    \[r_{\text{acc}} = \max\left(0, 1 - \frac{|\hat{y}_i - y_{gt}|}{R_{\text{max}}}\right)\]
  6. GRPO optimization objective:

    \[\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^G \left[\min(\rho_i A_i, \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon) A_i) - \beta \cdot \mathbb{D}_{\text{KL}}(\pi_\theta(o_i|x) || \pi_{\text{ref}}(o_i|x))\right]\]

    The core technical contribution is using outcome-based RL to incentivize explicit Chain-of-Thought generation for progress reasoning, transforming video MLLMs from passive observers into active critics.
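The reward and advantage formulas above can be sketched in plain Python. The `<think>`/`<answer>` tag names, the choice R_max = 100 (the width of the progress range), the population standard deviation, the clip range of 0.2, and the rule that an unparsable output earns zero reward are all assumptions made for illustration; the KL penalty term is omitted.

```python
import re
from statistics import mean, pstdev
from typing import List, Sequence

R_MAX = 100.0  # assumption: normaliser equal to the progress range [0, 100]

# Hypothetical output format: reasoning in <think>, prediction in <answer>.
FORMAT_RE = re.compile(
    r"<think>(.+?)</think>\s*<answer>\s*(-?\d+(?:\.\d+)?)\s*</answer>",
    re.DOTALL,
)


def composite_reward(output: str, y_gt: float) -> float:
    """r_fmt + r_acc for one sampled output."""
    match = FORMAT_RE.fullmatch(output.strip())
    if match is None:
        return 0.0  # assumption: no credit without a parsable answer
    y_hat = float(match.group(2))
    r_fmt = 1.0
    r_acc = max(0.0, 1.0 - abs(y_hat - y_gt) / R_MAX)
    return r_fmt + r_acc


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalise each reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the source does not say which
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one sample."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# One group of G = 4 sampled outputs for a ground-truth progress of 50.
group = [
    "<think>two of four steps done</think><answer>40</answer>",
    "<think>gripper has the part</think><answer>40</answer>",
    "progress is 40",
    "no idea",
]
rewards = [composite_reward(o, 50.0) for o in group]
advantages = group_advantages(rewards)
```

Because the baseline is the group mean, no separate value network is needed: outputs that beat their siblings get positive advantages and the rest get negative ones.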

Training Recipe
  1. Supervised Fine-Tuning (SFT): 116k samples (PRIMO-R1-CoT-116k) combining robotic datasets (AgiBot, BEHAVIOR, RoboTwin) with general video reasoning datasets (EgoPlan, RoboVQA, ShareRobot, STAR, NExT-QA, Perception Test). Training details not specified.

  2. Reinforcement Learning: 182k samples (PRIMO-R1-182k) using Group Relative Policy Optimization with a group of G sampled outputs per input. Format reward (+1 for correct structure) plus bounded linear-decay accuracy reward. A KL divergence penalty with coefficient β prevents language degeneration. Specific learning rates, batch sizes, and training duration not reported.

  3. Hardware specifications, wall-clock training time, and detailed hyperparameters are deferred to Appendix G and are not reported in the main text.

Novelty & Lineage

The closest prior works include VLAC (2025), ProgressLM (2026), and Robo-Dopamine (2025) for vision-based progress estimation, plus Video R1 (2025) and VLM-R1 (2025) for multimodal reasoning with RL. The specific delta is:

  1. explicit temporal boundary anchoring with initial/current state images alongside video sequences
  2. outcome-based RL to elicit Chain-of-Thought for progress reasoning rather than direct regression
  3. demonstrating that continuous progress reasoning intrinsically enables zero-shot discrete failure detection

    The approach builds incrementally on existing video MLLM architectures and GRPO training methodology.

    Rating: INCREMENTAL - solid engineering contribution combining existing components in a novel way for robotic process supervision.

Benchmarks & Results
  1. PRIMO Bench progress estimation: MRA 82.90, MAE 15.52 vs best prior ProgressLM-3B-RL MRA 38.29, MAE 32.61 (improvement: +44.61 MRA, -17.09 MAE)

  2. AgiBot environment: MRA 87.67, MAE 12.33 vs best baseline GPT-4o MRA 81.01, MAE 18.99 (improvement: +6.66 MRA, -6.66 MAE)

  3. BEHAVIOR environment: MRA 87.08, MAE 12.90 vs best baseline GPT-5 mini MRA 79.60, MAE 20.08 (improvement: +7.48 MRA, -7.18 MAE)

  4. RoboTwin environment: MRA 84.52, MAE 15.48 vs best baseline Claude-Haiku-4.5 MRA 81.73, MAE 18.27 (improvement: +2.79 MRA, -2.79 MAE)

  5. Real Humanoid environment: MRA 72.32, MAE 21.37 vs best baseline GPT-4o MRA 74.65, MAE 25.35 (mixed results: -2.33 MRA, -3.98 MAE, i.e. slightly lower relative accuracy but lower absolute error)

  6. RoboFail benchmark (failure detection): 67.0% accuracy vs OpenAI o1 61.0% (improvement: +6.0%)

    Results show strong performance on the AgiBot, BEHAVIOR, and RoboTwin environments but mixed results on real-world humanoid scenarios.

Compute & Efficiency
  1. Model size: 7B parameters (Qwen2.5-VL-7B-Instruct base model)

  2. Training compute: Not reported in main text, referenced to Appendix G

  3. Inference speed/latency: Referenced to Appendix C.2 for reasoning chain lengths and inference time analysis, specific numbers not provided in main text

  4. Memory footprint: Not reported, though GRPO chosen specifically to avoid expensive value function critic that would increase memory overhead

  5. Deployment practicality: Moderate - 7B parameters make it more deployable than 72B models, but real-time inference requirements for robotic manipulation may still be challenging given Chain-of-Thought generation overhead

Real-World Applicability
  1. Real-world training data: Incorporates AgiBot dataset from real-world teleoperation data for training corpus

  2. Hardware experiments: Evaluated on Kuavo 4 Pro full-size humanoid robot from LejuRobotics Technology Co., Ltd. via teleoperation

  3. Multi-scenario testing: Real humanoid evaluation covers hotel services, manufacturing factories, fast-moving consumer goods (FMCG) scenarios, and automotive assembly lines

  4. Sim-to-real transfer: Demonstrates generalization from simulation (BEHAVIOR, RoboTwin) to real humanoid environments, though with reduced performance (MRA drops to 72.32 vs 84-87 in simulation)

  5. Cross-environment evaluation: Tests on completely unseen physical robot and unstructured environments different from training data

Limitations & Failure Modes
  1. FUNDAMENTAL: Requires explicit Chain-of-Thought generation which increases inference latency, potentially problematic for real-time robotic control

  2. ENGINEERING: Performance degradation in sim-to-real transfer (MRA drops from 84-87 in simulation to 72.32 in real humanoid scenarios) - could be improved with more real-world training data

  3. ENGINEERING: Dependency on structured temporal input requiring both initial and current state images in addition to video sequence increases data collection complexity

  4. EVALUATION: Limited evaluation on true real-time deployment scenarios - teleoperation data may not reflect autonomous execution challenges

  5. FUNDAMENTAL: Chain-of-Thought reasoning may not scale to very long horizon tasks due to context length limitations

    Failure Modes:

    • May assign high progress scores to visually similar but incorrect manipulation sequences
    • Chain-of-Thought generation could hallucinate plausible-sounding but incorrect reasoning steps

Efficient Reasoning on the Edge

Authors: Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert et al. (18 authors) · Institution: Qualcomm AI · Category: cs.LG

The paper presents an end-to-end system for deploying reasoning-capable small LLMs on mobile devices using LoRA adapters, budget forcing, and dynamic switching to achieve 2-8x efficiency gains while maintaining accuracy.

Practical Takeaway: This paper provides a complete blueprint for deploying reasoning-capable LLMs on mobile devices. The key insight is that LoRA adapters can enable reasoning while preserving efficiency through dynamic switching and KV-cache sharing. The budget forcing approach with multiplicative rewards effectively reduces verbosity without sacrificing accuracy. For practitioners, the masked LoRA training technique and switcher module are immediately implementable improvements. The quantization pipeline and mobile deployment validation demonstrate this isn’t just theoretical - reasoning LLMs can actually run on phones today. This is particularly valuable for building privacy-preserving AI assistants that work offline.

Tags: edge-deployment mobile-ai reasoning LoRA budget-forcing quantization on-device-inference parameter-efficient-fine-tuning

arXiv · PDF

Task & Setting

Real-world context: Large language models with chain-of-thought reasoning achieve excellent performance on complex tasks, but their verbose reasoning traces and large memory requirements make them impractical for mobile devices. Edge deployment is attractive for privacy, latency, and offline availability, but faces strict constraints on memory, power consumption, and computational resources.

Task definition: The task is to enable efficient reasoning capabilities in small LLMs (3B-7B parameters) for deployment on edge devices while maintaining accuracy on complex reasoning tasks. Input consists of user queries requiring mathematical, scientific, or coding reasoning. Output should be accurate answers with concise reasoning traces that fit within device memory and latency budgets. The objective balances accuracy and efficiency through:

\[R(y, x) = R_{accuracy}(y, x) \times R_{budget}(L)\]

where $L$ is response length and $R_{budget}$ applies soft penalties for exceeding token budgets.

Evaluation criteria: Success is measured by accuracy on reasoning benchmarks (AIME, MATH500, GPQA, LiveCodeBench) while achieving significant reductions in average completion length (target 2-8x compression) and enabling real-time inference on mobile hardware.

The paper demonstrates the complete pipeline on Qwen2.5-3B and 7B models across mathematical reasoning, scientific problems, and coding tasks.

Architecture & Method
  1. Base Architecture: Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct models serve as frozen backbones for parameter-efficient adaptation.

  2. LoRA Adapters: Low-Rank Adaptation modules (rank 64-256) enable reasoning capabilities while keeping the base model frozen and reusable across different modes.

  3. Dynamic Switcher Module: Lightweight MLP classifier (hidden dimension 8, ReLU activation, dropout 0.2) processes averaged hidden states from the final transformer layer to decide whether reasoning is needed for each query.

  4. Masked LoRA Training: During fine-tuning, LoRA weights are disabled during prompt encoding but active during response generation, enabling KV-cache sharing between base and reasoning modes.

  5. Budget Forcing via RL: Reinforcement learning with GRPO optimizer applies multiplicative soft-barrier reward:

    \[R_{budget}(L) = \begin{cases} 1 & L \leq L_{low} \\ 1 - (1-p)\frac{L-L_{low}}{L_{high}-L_{low}} & L_{low} < L \leq L_{high} \\ p & L > L_{high} \end{cases}\]
  6. Parallel Test-Time Scaling: Generate multiple reasoning streams concurrently with neural verification using lightweight verifier heads trained on base model representations.

  7. Quantization Pipeline: INT4 weights, INT16 activations, INT8 LoRA weights, INT32 bias for deployment via Qualcomm GENIE SDK.
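The multiplicative soft-barrier reward from the budget-forcing step can be written out directly. The sketch below is a plain transcription of the piecewise formula; the numeric values for `l_low`, `l_high`, and `p` are illustrative placeholders, not settings reported by the paper.

```python
def budget_reward(length: int, l_low: int, l_high: int, p: float) -> float:
    """Soft-barrier length reward R_budget(L).

    Returns 1 up to l_low, decays linearly to p at l_high, and stays at p beyond.
    """
    if not l_low < l_high:
        raise ValueError("l_low must be smaller than l_high")
    if length <= l_low:
        return 1.0
    if length <= l_high:
        return 1.0 - (1.0 - p) * (length - l_low) / (l_high - l_low)
    return p


def total_reward(accuracy_reward: float, length: int,
                 l_low: int, l_high: int, p: float) -> float:
    """R(y, x) = R_accuracy(y, x) * R_budget(L)."""
    return accuracy_reward * budget_reward(length, l_low, l_high, p)


# Illustrative settings only: a 3K-4K token soft band with floor p = 0.5.
L_LOW, L_HIGH, P = 3000, 4000, 0.5
short_correct = total_reward(1.0, 2000, L_LOW, L_HIGH, P)
mid_correct = total_reward(1.0, 3500, L_LOW, L_HIGH, P)
long_correct = total_reward(1.0, 5000, L_LOW, L_HIGH, P)
short_wrong = total_reward(0.0, 2000, L_LOW, L_HIGH, P)
```

One consequence of the multiplicative form is that a wrong answer earns zero reward regardless of length, so brevity alone is never rewarded; an additive length bonus would not have this property.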

Training Recipe
  1. Supervised Fine-Tuning Stage: LoRA adapters (rank 128, α=256) trained on OpenThoughts3 (1.2M samples) and Mixture of Thoughts (350K samples) datasets for 5 epochs using AdamW optimizer (β₁=0.9, β₂=0.95), learning rate 2e-4, batch size 64, cosine schedule with 0.1 warmup, bfloat16 precision.

  2. Budget-Forced RL Stage: GRPO algorithm applied to LoRA parameters using DeepScaleR dataset, group size 8, KL penalty coefficient βₖₗ=1e-3, soft-barrier rewards with budgets {1K, 3K, 4K, 6K} tokens, multiplicative penalty structure.

  3. Switcher Training: Lightweight classifier trained on 2K samples (600 SQuAD2.0, 419 MMLU math, 500 S1K, 500 StrategyQA) to distinguish simple vs. complex queries requiring reasoning.

  4. Hardware Setup: Training conducted on a single node with 8x NVIDIA H100 (80GB) GPUs using DeepSpeed ZeRO stage 2 with CPU offloading.

  5. Deployment Pipeline: Models quantized and exported via ONNX for Qualcomm GenAI Inference Engine, compiled for on-device execution with static graph optimization.

    Wall-clock training time: not reported.
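To make the switcher concrete, here is a dependency-free Python sketch of its forward pass at inference time: mean-pool the final-layer hidden states, apply one hidden layer of width 8 with ReLU, and produce a probability that reasoning is needed. The single sigmoid output, the 0.5 decision threshold, the random initialisation, and all function names are assumptions; the summary only gives the hidden width, activation, and dropout rate.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mean_pool(hidden_states: Sequence[Sequence[float]]) -> Vector:
    """Average token vectors (T x D) into a single D-dimensional vector."""
    n = len(hidden_states)
    return [sum(col) / n for col in zip(*hidden_states)]


def linear(x: Sequence[float], weight: Matrix, bias: Sequence[float]) -> Vector:
    """Dense layer: weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_switcher(d_model: int, hidden_dim: int = 8,
                  seed: int = 0) -> Tuple[Matrix, Vector, Matrix, Vector]:
    """Random small weights, standing in for trained parameters."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]]
    return w1, [0.0] * hidden_dim, w2, [0.0]


def switcher_prob(hidden_states: Sequence[Sequence[float]],
                  w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> float:
    """Probability that the query needs the reasoning adapter."""
    pooled = mean_pool(hidden_states)
    hidden = [max(0.0, h) for h in linear(pooled, w1, b1)]  # ReLU
    # Dropout (rate 0.2) is a training-time regulariser and is a no-op here.
    logit = linear(hidden, w2, b2)[0]
    return 1.0 / (1.0 + math.exp(-logit))


def needs_reasoning(prob: float, threshold: float = 0.5) -> bool:
    """Assumed decision rule: enable the LoRA adapter above the threshold."""
    return prob >= threshold


w1, b1, w2, b2 = init_switcher(d_model=4)
states = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]  # two tokens, D = 4
prob = switcher_prob(states, w1, b1, w2, b2)
```

With a hidden width of 8, the classifier adds only about 8 × D weights on top of the backbone, which is why it can run on every query without noticeable cost.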

Novelty & Lineage

This work builds on established techniques: LoRA (Hu et al., 2021), chain-of-thought prompting (Wei et al., 2022), budget forcing (Snell et al., 2022), and parallel test-time scaling. The closest prior work is budget forcing methods and on-device LLM deployment systems.

Novel contributions include:

  1. Masked LoRA training strategy enabling KV-cache sharing between base and reasoning modes
  2. Dynamic adapter switching with lightweight classifier for selective reasoning activation
  3. Multiplicative soft-barrier reward formulation replacing additive penalties
  4. End-to-end pipeline integrating reasoning adaptation with mobile deployment constraints
  5. Comprehensive system demonstrating reasoning LLMs running on actual mobile devices.

    The core novelty is the holistic system design that makes reasoning LLMs practically deployable on edge devices rather than any single algorithmic breakthrough.

    Rating: INCREMENTAL - combines existing techniques in a well-engineered system with some novel implementation details.
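The masked LoRA idea listed first among the novel contributions can be illustrated on a single linear layer. In this sketch the low-rank update is skipped for prompt positions and applied for response positions, so prompt activations are identical to those of the frozen base model; that is the property that would let a KV cache be shared between modes. The per-position masking rule, function names, and toy matrices are illustrative assumptions, not the paper's code.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Sequence[float]) -> Vector:
    """Matrix-vector product for a row-major matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def lora_linear(x: Sequence[float], w: Matrix, a: Matrix, b: Matrix,
                alpha: float, rank: int, lora_active: bool) -> Vector:
    """y = W x, plus (alpha / rank) * B (A x) when the adapter is active."""
    base = matvec(w, x)
    if not lora_active:
        return base
    delta = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * d for y, d in zip(base, delta)]


def masked_forward(tokens: Sequence[Sequence[float]], prompt_len: int,
                   w: Matrix, a: Matrix, b: Matrix,
                   alpha: float, rank: int) -> List[Vector]:
    """Apply the layer per position; prompt positions use base weights only."""
    return [
        lora_linear(x, w, a, b, alpha, rank, lora_active=(i >= prompt_len))
        for i, x in enumerate(tokens)
    ]


# Toy layer: identity base weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.0]]      # shape (rank, in_dim)
B = [[1.0], [0.0]]    # shape (out_dim, rank)
sequence = [[1.0, 1.0], [1.0, 1.0]]  # one prompt token, one response token
outputs = masked_forward(sequence, prompt_len=1, w=W, a=A, b=B, alpha=2.0, rank=1)
```

The prompt position reproduces the base model's output exactly, while the response position carries the adapter's contribution.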

Benchmarks & Results
  1. AIME24: Qwen2.5-7B+OT3 LoRA achieves 0.56 vs. base model 0.10, comparing favorably to R1-Distill-Qwen-7B at 0.55

  2. AIME25: Qwen2.5-7B+OT3 LoRA achieves 0.38 vs. base model 0.17, close to R1-Distill-Qwen-7B at 0.40

  3. MATH500: Qwen2.5-7B+OT3 LoRA achieves 0.93 vs. base model 0.76, matching R1-Distill-Qwen-7B at 0.92

  4. GPQA Diamond: Qwen2.5-7B+OT3 LoRA achieves 0.43 vs. base model 0.37, below R1-Distill-Qwen-7B at 0.49

  5. LiveCodeBench: Qwen2.5-7B+OT3 LoRA achieves 0.54 vs. base model 0.36, below R1-Distill-Qwen-7B at 0.59

  6. HumanEval: Mixed results with some degradation from reasoning specialization (0.60 vs. 0.83 base model)

  7. Budget Forcing Results: Achieves 2.4x average completion length reduction (up to 8x) with minimal accuracy loss (88.3% vs. 82.7% baseline on MATH500)

    Results show the method successfully enables reasoning in small models while achieving significant efficiency gains.

Compute & Efficiency
  1. Model size: 3B and 7B parameter base models with 1.06-15.52% additional trainable parameters from LoRA adapters

  2. Training compute: Single node with 8x NVIDIA H100 (80GB) GPUs, specific GPU hours not reported

  3. Inference speed: Demonstrates real-time mobile inference with videos showing on-device execution, specific latency numbers not provided

  4. Memory footprint: Quantized to INT4 weights, INT16 activations, INT8 LoRA weights for edge deployment, enables KV-cache sharing between reasoning and non-reasoning modes

  5. Deployment practicality: Successfully deployed on actual mobile devices using Qualcomm GENIE SDK, complete end-to-end pipeline from training to mobile app integration demonstrated with project videos

Real-World Applicability
  1. Mobile Device Deployment: Successfully demonstrates reasoning LLMs running on actual mobile devices with videos available on project page showing real-time inference

  2. Hardware Integration: Complete pipeline using Qualcomm FastForward and GENIE SDK for practical deployment on Qualcomm-powered mobile devices

  3. Quantization Validation: Models exported and compiled for on-device execution with static graph optimization, tested on real hardware

  4. System Integration: Android application streaming generation from deployed models, demonstrating practical mobile AI assistant capabilities

  5. Edge Computing Validation: Addresses real constraints of DRAM capacity, power consumption, and NPU utilization on mobile devices rather than just simulated deployment

Limitations & Failure Modes
  1. FUNDAMENTAL: Trade-off between reasoning specialization and general capabilities - models show degradation on direct-answer coding tasks (HumanEval, MBPP) after reasoning fine-tuning

  2. FUNDAMENTAL: Smaller models (3B) show larger performance gaps with LoRA compared to dense fine-tuning, suggesting adapter capacity limitations at smaller scales

  3. ENGINEERING: Limited evaluation on diverse reasoning domains - focuses primarily on mathematical reasoning with less emphasis on commonsense or multi-modal reasoning

  4. ENGINEERING: Budget forcing optimization can be unstable with higher learning rates (5e-4 causes training divergence for 7B models)

  5. EVALUATION: No comparison with other edge deployment methods or alternative efficiency techniques beyond the specific pipeline proposed

    Failure Modes:

    • Reward hacking during budget forcing where models learn to circumvent length penalties by manipulating response structure
    • Switcher misclassification leading to unnecessary reasoning overhead on simple queries or missing complex reasoning needs