Applied AI Digest — Apr 4, 2026
Today’s Digest at a Glance
Preliminary
Today’s papers explore multimodal agent architectures that integrate tool planning, video-language reasoning for robotics, and pixel-space operations with test-time adaptation.
Interleaved Text-Image Generation
Interleaved text-image generation addresses the challenge of creating coherent documents that seamlessly blend textual content with contextually relevant images. Traditional approaches treat text and image generation as separate sequential steps, leading to inconsistent narratives and poor visual-textual alignment. The core insight is to plan the entire document structure first, then execute generation steps that can reference and build upon previous content.
The mathematical framework treats generation as a sequential planning problem: an agent produces a tool plan $T = [t_1, t_2, \dots, t_n]$, where each tool $t_i$ is either a text-generation or an image-synthesis call with specific parameters. For image tools, the plan includes anchor references via img_index parameters that link generated images to specific textual contexts, enabling coherent visual narratives. The agent must reason about document flow, determining when visual content enhances the narrative and which images would be most effective given the current context.
Intuitively, this is like having an intelligent document designer that decides “here I need an image of X to illustrate the concept I just explained” rather than randomly inserting images.
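A minimal sketch of how such a tool plan could be represented and sanity-checked is shown here. Only img_index comes from the text above; the PlanStep type, its other fields, and the validation rules are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class PlanStep:
    """One step t_i of a tool plan T (hypothetical layout, not from the paper)."""
    kind: Literal["text", "image"]
    content: str                      # text to write, or a description of the image
    img_index: Optional[int] = None   # anchor to a context image, if any


def validate_plan(plan: list[PlanStep], num_context_images: int) -> list[str]:
    """Return human-readable problems found in the plan (empty list if none)."""
    problems = []
    for i, step in enumerate(plan):
        if step.kind == "text" and step.img_index is not None:
            problems.append(f"step {i}: text step should not carry img_index")
        if step.img_index is not None and not 0 <= step.img_index < num_context_images:
            problems.append(f"step {i}: img_index {step.img_index} out of range")
    return problems


plan = [
    PlanStep("text", "Explain how to fold the dough."),
    PlanStep("image", "Photo of the folded dough", img_index=0),
    PlanStep("text", "Describe the baking step."),
    PlanStep("image", "Finished loaf", img_index=3),
]
print(validate_plan(plan, num_context_images=2))
# ['step 3: img_index 3 out of range']
```

Checking anchors before execution is one plausible way to catch plans that reference images the document set does not contain.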
Chain-of-Thought Progress Estimation
Chain-of-thought progress estimation extends traditional reasoning traces to include quantitative progress assessments for goal-directed tasks. Standard chain-of-thought prompting generates explanatory text but lacks mechanisms to measure task completion, making it unsuitable for reward signal generation in reinforcement learning contexts.
The technique structures model outputs as tuples $(\text{reasoning}, \text{progress})$, where $\text{reasoning}$ provides step-by-step analysis and $\text{progress} \in [0,1]$ quantifies the fraction of the task completed. For robotics applications, this becomes particularly powerful when applied to video sequences: the model processes temporal windows $\{I_0, I_k, \dots, I_t\}$ along with goal descriptions $G$, generating both explanatory reasoning about observed actions and numerical progress estimates. The progress signal can then serve directly as a dense reward $r_t = \text{progress}_t - \text{progress}_{t-1}$ for policy learning.
This transforms language models into both explainable critics and reward generators, providing interpretable dense feedback for complex manipulation tasks.
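The progress-difference reward can be sketched in a few lines. The function name and the example numbers are illustrative assumptions; only the formula $r_t = \text{progress}_t - \text{progress}_{t-1}$ and the $[0,1]$ range come from the text.

```python
def progress_to_rewards(progress: list[float]) -> list[float]:
    """Dense reward r_t = progress_t - progress_{t-1}, for t >= 1."""
    for p in progress:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"progress {p} outside [0, 1]")
    return [curr - prev for prev, curr in zip(progress, progress[1:])]


estimates = [0.0, 0.25, 0.25, 0.5, 1.0]   # made-up per-step progress estimates
rewards = progress_to_rewards(estimates)
print(rewards)  # [0.25, 0.0, 0.25, 0.5]
```

Because the differences telescope, the rewards over an episode sum to the final progress minus the initial progress, so a policy cannot accumulate more total reward than the overall progress it actually makes.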
Reading Guide
The ATP-Bench and Pixelis papers both explore tool-augmented multimodal agents but at different scales - ATP-Bench focuses on document-level planning while Pixelis operates on individual visual reasoning steps. SOLE-R1 and VLLR represent complementary approaches to robot reward learning, with SOLE-R1 providing dense video-language feedback and VLLR combining LLM task decomposition with VLM progress estimation. The KITScenes dataset addresses the evaluation challenge of semantic coherence between reasoning and actions that appears across all these agentic systems.
ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation
Authors: Yinuo Liu, Zi Qian, Heng Zhou, Jiahao Zhang et al. (10 authors) · Institution: Alibaba · Category: cs.AI
Introduces ATP-Bench, the first benchmark for evaluating MLLMs on agentic tool planning for interleaved text-image generation, revealing significant performance gaps with only top proprietary models achieving acceptable quality.
Practical Takeaway: This paper provides the first systematic benchmark for evaluating MLLMs on unified reference-and-generation tasks, revealing significant capability gaps. The key practical insight is that current models struggle with coherent tool planning, with only top-tier proprietary models achieving acceptable performance (Gemini 3 Pro at 79.88 final score). The Multi-Agent MLLM-as-a-Judge (MAM) framework offers a reusable evaluation methodology that could be adapted for other agentic tasks. For practitioners: (1) expect substantial infrastructure complexity when deploying tool-augmented interleaved generation, (2) current open-source models perform poorly and need significant improvement, (3) few-shot prompting substantially helps mid-tier models but has minimal impact on the weakest systems. The benchmark and evaluation framework provide concrete targets for future model development in this increasingly important capability area.
Tags: multimodal-generation interleaved-content tool-augmented-llms benchmark-evaluation visual-language-models agentic-ai mllm-judge image-text-alignment
Task & Setting
This paper addresses interleaved text-and-image generation for visual-critical queries in multimodal large language models (MLLMs). Current approaches treat image generation and retrieval as mutually exclusive paths, failing to unify factuality with creativity in a single response.
The task requires MLLMs to act as central controllers that autonomously determine when, where, and which tools to invoke for producing interleaved responses. Given a visual-critical query $q$ and document set $D = \{d_1, d_2, \dots, d_n\}$, the model produces an ordered sequence:
\[R = \{s_1, s_2, \dots, s_m\}\]
where $s_j \in S_{\text{text}} \cup S_{\text{tool}}$, combining natural language tokens with structured tool-calling instructions.
The toolkit comprises five specialized modules: Reference (cite context images), Diffusion (generate novel images), Search (retrieve real-world visuals), Code (create data visualizations), and Edit (modify existing images).
Success is measured through a Multi-Agent MLLM-as-a-Judge (MAM) system evaluating: Final Score (overall quality 0-100), Success Rate (tool-call precision), and Missed Images (recall of visual opportunities).
ATP-Bench contains 7,702 QA pairs (including 1,592 VQA pairs) across eight categories (Academic, Manual, Recipe, Fashion, Renovation, Product, Travel, Encyclopedia) and 25 visual-critical intents, with expert-verified queries and ground truths.
Architecture & Method
- Agentic Tool Planning Framework: MLLMs serve as central controllers consuming concatenated input (prompt $p$, query $q$, document set $D$) to generate interleaved tool plans
- Five-Tool Unified Toolkit: Reference (anchors to context images via img_index), Diffusion (synthesizes novel images from semantic prompts), Search (retrieves real-world visuals via external engines), Code (generates programmatic visualizations), Edit (modifies referenced images with edit prompts)
- Structured Tool Schema: Each tool invocation follows the <tool>{"tool_name": ..., "description": ..., "params": ...}</tool> format for consistent parsing and execution
- Multi-Agent MLLM-as-a-Judge (MAM): Three specialized agents - Precision Inspector (evaluates tool necessity, boundary compliance, parameter accuracy), Recall Inspector (identifies missed visual opportunities), Chief Judge (synthesizes holistic 0-100 scores across five performance tiers)
- Visual-Critical Query Classification: Systematic taxonomy across eight high-visual-demand categories with 25 fine-grained intents requiring visual augmentation, cognitive acceleration, or structural illustration
The core technical contribution is the unified paradigm dissolving boundaries between reference and generation, enabling dynamic orchestration of complementary visual capabilities within single responses.
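To make the structured tool schema concrete, here is a hedged sketch of how an interleaved response might be split into text and tool segments. The <tool>…</tool> wrapper and the tool_name, description, and params keys come from the summary above; the capitalized tool-name strings, the contents of params, and the parse_response helper are assumptions for illustration and not the paper's implementation.

```python
import json
import re

TOOL_PATTERN = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)
# Assumed spelling of the five tool names; the real identifiers may differ.
KNOWN_TOOLS = {"Reference", "Diffusion", "Search", "Code", "Edit"}


def parse_response(response: str):
    """Split a response into ("text", str) and ("tool", dict) segments, in order."""
    segments = []
    cursor = 0
    for match in TOOL_PATTERN.finditer(response):
        text = response[cursor:match.start()].strip()
        if text:
            segments.append(("text", text))
        call = json.loads(match.group(1))  # raises ValueError on malformed JSON
        if call.get("tool_name") not in KNOWN_TOOLS:
            raise ValueError(f"unknown tool: {call.get('tool_name')!r}")
        segments.append(("tool", call))
        cursor = match.end()
    tail = response[cursor:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


example = (
    "Fold the dough in thirds. "
    '<tool>{"tool_name": "Reference", "description": "folding step", '
    '"params": {"img_index": 2}}</tool> '
    "Then bake until golden."
)
for kind, payload in parse_response(example):
    print(kind, payload)
```

Rejecting unknown tool names at parse time is one simple way to surface the kind of malformed call that a precision-style check would penalize.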
Training Recipe
The paper does not describe training new models. Instead, it evaluates existing pre-trained MLLMs in a zero-shot setting using structured prompts for interleaved generation.
- Evaluation Protocol: Zero-shot inference on 10 state-of-the-art MLLMs including Claude Sonnet 4.5, Claude Sonnet 4, Gemini 3 Pro, Grok-4.1 Fast Reasoning, GPT-5, GPT-4o, Qwen3-VL-Plus, Qwen2.5-VL-72B, LLaMA-3.2-11B, InternVL3.5-14B
- Few-Shot Experiments: Additional 3-shot demonstrations tested on subset of models (Claude Sonnet 4.5, GPT-4o, Qwen2.5-VL-72B, LLaMA-3.2-11B)
- Ground Truth Generation: Three-stage pipeline using MLLMs - textual response generation with factual consistency requirements, image insertion via tool calls, fine-grained annotation and refinement by 15 human annotators
- MAM Judge Training: Uses Gemini 2.5 Pro as default judge with ablations on Claude Sonnet 4.5 and GPT-5, validated through human agreement study (84-88% agreement rates)
Data: 7,702 QA pairs from existing multimodal benchmarks (MRAMG-Bench, RAG-IGBench, OVEN) with expert verification
Optimizer/Hardware: Not reported - evaluation-only study
Wall-clock time: Not reported
Novelty & Lineage
Prior Work:
- OpenLEAF (An et al., 2023): Early interleaved generation benchmark with 30 samples, generation-only, no annotated ground truth
- InterleavedBench (Liu et al., 2024a): 815 samples supporting QA+VQA but generation-only, no reference capability
- RAG-IGBench (Zhang et al., 2025): 6,057 samples focused on retrieval-augmented generation with annotated ground truth but no generation capability
Delta: This paper introduces the first benchmark unifying both reference and generation capabilities in a single framework, with expert-annotated queries/ground truth across hybrid image sourcing. The Multi-Agent MLLM-as-a-Judge (MAM) system enables evaluation without ground truth or end-to-end execution.
Applied-Specific Assessment:
- Architectural Novelty: The unified tool planning paradigm is a reasonable extension of existing tool-augmented MLLMs (ViperGPT, MM-ReAct) to interleaved generation, not fundamentally novel
- Benchmark Gaps: The benchmark fills a genuine gap by being first to support hybrid sourcing with dual query types and expert annotations
- Evaluation Innovation: MAM system is a solid engineering contribution but follows established MLLM-as-a-judge patterns
- Comparisons: Fair comparison across 10 models using same prompts and evaluation protocol
- Scale Dependence: Results show clear correlation with model scale/capability, top performers are largest/most capable models
Verdict: INCREMENTAL — Solid benchmark contribution that systematically combines existing capabilities (tool use + interleaved generation) with proper evaluation methodology, but lacks fundamental algorithmic or architectural innovation.
Benchmarks & Results
- Final Score (0-100 scale): Gemini 3 Pro leads with 79.88, followed by Claude Sonnet 4.5 (69.34), Claude Sonnet 4 (69.15), Grok-4.1 (68.18), GPT-5 (67.18). Open-source models lag significantly with LLaMA-3.2-11B at 28.97
- Success Rate (tool-call precision): Gemini 3 Pro achieves 81.77%, Claude Sonnet 4.5 at 75.77%, Claude Sonnet 4 at 73.86%. Large gap to open-source: LLaMA-3.2-11B only 18.34%
- Missed Images (lower better): Gemini 3 Pro best at 0.49, Claude Sonnet 4 at 0.99, Claude Sonnet 4.5 at 1.19. Weaker models show much higher missed image counts with LLaMA-3.2-11B at 2.16
- Tool Set F1-Score vs Ground Truth: Gemini 3 Pro leads at 81.21%, Claude models around 80%, with high Spearman correlation (ρ=0.879) to MAM Final Score
- Category-wise Performance: Academic and Encyclopedia easiest (highest final scores), Travel and Renovation most challenging across all models. Travel particularly difficult with highest missed image counts
- Few-shot Improvements: GPT-4o improves from 60.35 to 73.19 Final Score with 3-shot examples, Qwen2.5-VL-72B from 53.88 to 72.86. LLaMA-3.2-11B shows minimal improvement
Results show clear capability tiers with proprietary models significantly outperforming open-source alternatives. Mixed performance across categories reveals specific challenges in tool-intensive domains.
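The Tool Set F1-Score listed above compares the tools a model invoked against those in the ground truth. The summary does not give the paper's exact definition, so the following is a sketch of the standard set-level F1 under that assumption; the function name and example tool sets are illustrative.

```python
def tool_set_f1(predicted: set[str], reference: set[str]) -> float:
    """Standard F1 between the set of tools used and the ground-truth set."""
    if not predicted and not reference:
        return 1.0  # convention: two empty sets agree perfectly
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# One extra tool (Code) on top of a fully covered reference set:
# precision = 2/3, recall = 1, F1 = 0.8
score = tool_set_f1({"Reference", "Search", "Code"}, {"Reference", "Search"})
print(round(score, 3))  # 0.8
```

A metric like this needs ground-truth annotations but no judge model, which is presumably why its correlation with the MAM Final Score is reported as a check on the judge.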
Compute & Efficiency
- Model Parameters: Evaluated models range from 11B (LLaMA-3.2-11B) to 72B (Qwen2.5-VL-72B) parameters, with proprietary models having undisclosed sizes
- Training Compute: Not applicable - evaluation-only study using pre-trained models. No training compute reported
- Inference Speed/Latency: Not reported. Paper focuses on planning quality rather than efficiency metrics
- Memory Footprint: Not reported. Evaluation conducted through API calls to proprietary models and standard inference for open-source models
- Deployment Practicality: Framework requires external tool execution (Google Image Search, diffusion models, code execution environments). MAM evaluation system needs additional MLLM calls (3 agents per response). Tool execution pipeline includes nano-banana for diffusion, SerpAPI for search, GPT-5 for code generation, making deployment complex and expensive. Real-world deployment would require significant infrastructure for tool orchestration and execution.
Real-World Applicability
- Tool Execution Infrastructure: Paper implements end-to-end execution using real tools - Google Image Search via SerpAPI, nano-banana for image generation, GPT-5 for code execution, Doubao Seedream 4.0 for image editing
- Human Evaluation Study: Conducted end-to-end human evaluation on 100 responses per model, measuring inappropriate images, missed images, and overall quality scores on 5-point scale
- Domain Coverage: Eight real-world categories tested - Academic research, Manual instructions, Recipe guidance, Fashion advice, Home renovation, Product information, Travel planning, Encyclopedia queries
- Expert Annotation Validation: 10 professional annotators verified query quality, 15 annotators reviewed ground truth with multi-perspective evaluation ensuring practical relevance
- API Integration: Demonstrates practical tool integration through structured JSON schemas compatible with existing API frameworks
- Limitation: Framework currently text-image only, doesn’t support audio/video modalities. Limited to five specific tools rather than broader agentic capabilities. Deployment requires significant infrastructure coordination across multiple external services.
Limitations & Failure Modes
- FUNDAMENTAL: Modality Limitation - Framework restricted to text-image interleaving, cannot handle audio, video, or other rich modalities that users increasingly expect
- FUNDAMENTAL: Tool Scope Constraint - Limited to five specific tools (Reference, Diffusion, Search, Code, Edit), doesn’t capture broader agentic capabilities or dynamic tool discovery
- ENGINEERING: Scale-Dependent Performance - Results show clear correlation with model size/capability, requiring large proprietary models for acceptable performance
- EVALUATION: Judge Model Bias - MAM evaluation depends on specific judge models (Gemini 2.5 Pro default), with calibration differences across judges affecting absolute scores
- ENGINEERING: Infrastructure Complexity - Real deployment requires orchestrating multiple external services (search APIs, diffusion models, code execution) with associated latency and reliability issues
- EVALUATION: Ground Truth Subjectivity - Multiple valid tool sequences possible for same query, making evaluation inherently subjective despite expert annotation
Failure Modes:
- Tool Boundary Confusion - Models frequently violate capability boundaries (e.g., using Search for generative content), particularly evident in mid-tier models
- Visual Redundancy - Weaker models generate decorative or semantically empty images providing little value, with LLaMA-3.2-11B scoring only 44.99% on necessity
SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Authors: Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen et al. (6 authors) · Institution: MIT, RAI Institute · Category: cs.RO
SOLE-R1 is a video-language reasoning model that generates per-timestep chain-of-thought explanations and progress estimates to serve as dense rewards for zero-shot robot learning, substantially outperforming existing VLM rewarders across 40 manipulation tasks.
Practical Takeaway: If you’re working on robot learning, SOLE-R1 demonstrates that video-native chain-of-thought reasoning can provide much more robust reward signals than standard VLMs for online RL. The key insights are training on authentic failure trajectories (not just expert demos) and using structured CoT outputs that ground reasoning in frame-to-frame changes. Consider implementing the hybrid SFT+RLVR training recipe and multi-frame temporal conditioning if you need reward models that resist exploitation. The approach could be adapted to other sequential decision-making domains where dense rewards are hard to specify.
Tags: robotics reinforcement_learning vision_language_models reward_learning manipulation zero_shot_learning chain_of_thought video_reasoning
Task & Setting
SOLE-R1 addresses the challenge of training robots to perform manipulation tasks without hand-engineered rewards or human demonstrations. Current vision-language models fail as reward functions due to perceptual errors and susceptibility to reward hacking when used in reinforcement learning. This makes it difficult to train robots on new tasks from scratch.
The task is video-language reasoning for dense reward prediction in robotic manipulation. Input: raw RGB video observations $\{o_t\}_{t=1}^{T}$ and natural language goal $g$. Output: per-timestep chain-of-thought reasoning $\{m_t\}_{t=1}^{T}$ and scalar progress estimates $\{p_t\}_{t=1}^{T}$, with $p_t \in [-100, 100]$, that serve as dense rewards for online RL. The model uses a sliding temporal window of $K$ frames and the previous prediction: $x_t = [g, o_0; o_{t-K+1:t}, p_{t-1}]$.
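The sliding-window input $x_t$ can be sketched as follows. The dictionary layout, the function name, and the use of 0-based frame indices (with frame 0 standing in for the initial frame $o_0$) are assumptions for illustration; only the components of $x_t$ come from the description above.

```python
def build_model_input(goal, frames, t, K, prev_progress):
    """Assemble x_t = [g, o_0; o_{t-K+1:t}, p_{t-1}] using 0-based frame indices."""
    start = max(0, t - K + 1)  # clamp the window at the start of the episode
    return {
        "goal": goal,
        "initial_frame": frames[0],
        "recent_frames": frames[start:t + 1],
        "prev_progress": prev_progress,
    }


frames = [f"frame_{i}" for i in range(10)]  # placeholders for RGB images
x = build_model_input("open the drawer", frames, t=6, K=4, prev_progress=12.0)
print(x["recent_frames"])
# ['frame_3', 'frame_4', 'frame_5', 'frame_6']
```

Keeping the initial frame alongside the recent window gives the model a fixed reference for how far the scene has changed, even when the window itself has moved well past the start of the episode.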
Success is measured by task completion rates in zero-shot online RL across 40 manipulation tasks spanning pick-and-place, articulated object manipulation, and button/lever interactions. Evaluation environments include RoboSuite, ManiSkill, Meta-World, LIBERO simulation, and real-world Franka arm experiments.
Architecture & Method
- Base architecture: Qwen3-VL-8B-Instruct vision-language model fine-tuned for video-native spatiotemporal reasoning
- Input conditioning: Multi-frame temporal window with goal, initial frame, recent K frames, and previous progress prediction
- Structured output generation: Autoregressive language tokens forming $[m_t, p_t]$, where $m_t$ is the chain-of-thought reasoning and $p_t$ is the progress estimate
- Dense reward conversion: $r_t = \psi \cdot \mathrm{clip}(p_t, -c, c)$ with scaling parameter $\psi$ and clipping bounds $c$
- Training data synthesis pipeline: Generates 1.2M spatiotemporal CoT traces from 41K videos by injecting random deviations into expert demonstrations and using geometric distance-based progress supervision in simulation, temporal order supervision for real-world videos
- Core technical contribution: Video-native temporal reasoning that explicitly integrates spatial and temporal structure through per-timestep CoT explanations grounded in frame-to-frame changes, combined with robust progress prediction resistant to reward hacking
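The dense reward conversion from the list above amounts to a clip followed by a scale. In this sketch the values of the scaling parameter and the clipping bound are made up for illustration, since the summary does not report them.

```python
def dense_reward(p_t: float, psi: float, c: float) -> float:
    """r_t = psi * clip(p_t, -c, c)."""
    if c < 0:
        raise ValueError("clipping bound c must be non-negative")
    clipped = max(-c, min(c, p_t))
    return psi * clipped


# psi=0.5 and c=10.0 are hypothetical values, not taken from the paper.
print(dense_reward(25.0, psi=0.5, c=10.0))   # 5.0  (clipped at +c)
print(dense_reward(-4.0, psi=0.5, c=10.0))   # -2.0 (inside the bounds)
```

Clipping bounds the magnitude of any single reward, which limits how much one overconfident progress estimate can influence the policy update.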
Training Recipe
- Data synthesis stage: Generate 41K diverse robot videos with varying expertise levels by injecting random deviations into expert demonstrations, create 1.2M CoT reasoning examples with continuous progress supervision
- Stage 1 - Supervised Fine-Tuning (SFT): Fine-tune on spatiotemporal reasoning mixture including general spatial reasoning (SSR-CoT, 1.2M examples), multi-frame temporal reasoning, and robot video progress data. Train for one epoch with balanced batching across data types. Loss function: $\mathcal{L}_{\text{SFT}}(\phi) = -\mathbb{E}_{(i,q,r,a)\sim D}\left[\sum_t \log p_\phi(y_t \mid i, q, y_{<t})\right]$
- Stage 2 - RL from Verifiable Rewards (RLVR): Use GRPO on progress dataset with rule-based rewards $r(o) = r_{\text{format}}(o) + r_{\text{acc}}(o)$, where $r_{\text{acc}}$ measures progress accuracy. KL penalty $\beta$ relative to SFT reference model.
Training details: Qwen3-VL-8B-Instruct backbone, total 10M images/video frames, 4M CoT reasoning traces. Hardware and wall-clock time not reported.
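To illustrate the shape of the RLVR stage, the sketch below pairs a rule-based reward of the form r(o) = r_format(o) + r_acc(o) with the group-relative advantage normalization that GRPO is known for. Everything specific here is an assumption: the <progress> tag format, the linear accuracy score, and the helper names are hypothetical stand-ins, because the summary does not describe the paper's actual reward rules.

```python
import re
import statistics


def rule_based_reward(output: str, target_progress: float) -> float:
    """r(o) = r_format(o) + r_acc(o); both terms are illustrative stand-ins."""
    match = re.search(r"<progress>(-?\d+(?:\.\d+)?)</progress>", output)
    if match is None:
        return 0.0  # unparseable output earns neither format nor accuracy credit
    r_format = 1.0
    error = abs(float(match.group(1)) - target_progress)
    r_acc = max(0.0, 1.0 - error / 100.0)  # hypothetical linear accuracy score
    return r_format + r_acc


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style normalization: (r_i - mean) / std within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]


group = [
    "<progress>40</progress>",
    "<progress>90</progress>",
    "no tag",
    "<progress>50</progress>",
]
rewards = [rule_based_reward(o, target_progress=50.0) for o in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this toy group the exact-match output receives the largest advantage and the unparseable one the smallest, which is the ordering a verifiable reward is meant to induce.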
Novelty & Lineage
Prior work: LIV (Ma et al. 2023) trains vision-language models for task progress prediction but uses only near-expert trajectories without intermediate reasoning. ReWiND (Zhang et al. 2025) and VLAC (Zhang et al. 2024) also predict progress from robot videos but lack chain-of-thought reasoning. RoboReward (Lee et al. 2026) provides general-purpose vision-language rewards but suffers from reward hacking.
Delta: SOLE-R1 adds (1) per-timestep spatiotemporal chain-of-thought reasoning explicitly grounded in frame-to-frame changes, (2) training on authentic non-expert/failure trajectories to improve robustness, (3) hybrid SFT+RLVR training framework, and (4) systematic integration of foundational spatial and temporal reasoning.
Applied-specific assessment: The architectural idea of video-native CoT reasoning for progress prediction is novel and non-obvious. Benchmark gains are large (24 vs <10 tasks solved by baselines) and consistent across diverse environments. Comparisons appear fair using same RL algorithm and evaluation protocol. However, SOLE-R1 uses proprietary training data synthesis and 8B parameter model vs API-based baselines, making direct compute comparisons difficult.
Verdict: SIGNIFICANT — Clear advance in reward modeling for robotics with strong empirical validation, though building on established VLM fine-tuning techniques.
Benchmarks & Results
- Zero-shot online RL across 40 tasks: SOLE-R1 achieves ≥50% success on 24 tasks vs GPT-5 (7 tasks), Gemini-3-Pro (5 tasks), ReWiND, VLAC, LIV (<10 tasks each)
- Real-world Franka manipulation: Success on pick, touch, insertion, push, drawer tasks (exact scores not tabulated)
- RoboSuite tasks: Success on Lift, Wipe, Door, PickPlaceCan with scores 50-100%
- ManiSkill tasks: Success on PickPanda, PushCube, PullCube, Mobile-OpenCabDrawer with scores 20-100%
- Meta-World tasks: Success on 23 tasks including faucet, window, drawer, button operations with scores 0-100%
- LIBERO tasks: Success on close/open drawer, stove, microwave tasks with scores 20-100%
- OpenX Embodiment Value-Order-Correlation: SOLE-R1 achieves higher VOC than GVL baseline across 50 OXE datasets
- General reasoning benchmarks: Improved performance on SpatialBench, SSRBench, CV-Bench vs SSR baseline
Results show consistent advantages across diverse tasks and environments, though some tasks still achieve 0% success.
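Value-Order Correlation is not defined in this summary. As it is commonly described in the GVL line of work, it is the rank correlation between a model's predicted values and the chronological order of the frames. The sketch below assumes that reading and uses Spearman's formula without tie handling; the function names are illustrative.

```python
def ranks(values):
    """Rank of each element (0 = smallest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def value_order_correlation(predicted_values):
    """Spearman correlation between predicted values and frame order 0..n-1."""
    n = len(predicted_values)
    if n < 2:
        raise ValueError("need at least two frames")
    value_ranks = ranks(predicted_values)
    d2 = sum((value_ranks[t] - t) ** 2 for t in range(n))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


print(value_order_correlation([0.0, 0.1, 0.3, 0.6, 0.9]))  # 1.0
print(value_order_correlation([0.9, 0.6, 0.3, 0.1, 0.0]))  # -1.0
```

On expert demonstrations, where progress should rise over time, a value near 1 indicates that the predicted values respect temporal order.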
Compute & Efficiency
- Model size: 8B parameters (Qwen3-VL-8B-Instruct backbone)
- Training compute: Not reported for data synthesis pipeline or fine-tuning stages
- Inference speed/latency: Can run at lower frequency than control frequency with linear interpolation for dense rewards
- Memory footprint: Not reported
- Deployment practicality: Requires multi-frame video input processing and autoregressive text generation, likely substantial compute overhead compared to lightweight reward models but enables zero-shot generalization without task-specific tuning
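Running the reward model below the control frequency and interpolating linearly could look roughly like the sketch below. The function name, the query schedule, and the example values are assumptions; the summary states only that linear interpolation is used to obtain dense rewards.

```python
def interpolate_progress(query_steps, query_values, num_steps):
    """Linearly interpolate sparse progress estimates onto every control step."""
    if len(query_steps) != len(query_values) or len(query_steps) < 2:
        raise ValueError("need at least two matching (step, value) pairs")
    dense = []
    seg = 0
    for t in range(num_steps):
        # Advance to the segment [query_steps[seg], query_steps[seg + 1]] containing t.
        while seg < len(query_steps) - 2 and t > query_steps[seg + 1]:
            seg += 1
        t0, t1 = query_steps[seg], query_steps[seg + 1]
        v0, v1 = query_values[seg], query_values[seg + 1]
        alpha = min(1.0, max(0.0, (t - t0) / (t1 - t0)))  # clamp outside the range
        dense.append(v0 + alpha * (v1 - v0))
    return dense


# The model is queried every 4th control step (hypothetical schedule and values).
dense = interpolate_progress([0, 4, 8], [0.0, 40.0, 60.0], num_steps=9)
print(dense)
# [0.0, 10.0, 20.0, 30.0, 40.0, 45.0, 50.0, 55.0, 60.0]
```

One consequence of this design is that the value at an intermediate step depends on the next query, so in an online setting the intermediate rewards can only be filled in after that query returns.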
Real-World Applicability
- Real-world deployment: Successfully tested on Franka Panda arm with modified gripper fingers and wrist camera angle not seen during training
- Hardware experiments: Tabletop manipulation tasks including picking strawberries, inserting pipes, manipulating cans and cubes
- Sim-to-real transfer: Generalizes from simulation training (RoboCasa) to real-world without additional fine-tuning
- Production considerations: Requires only RGB cameras and natural language goals, no privileged state information or task-specific sensors
- Embodiment generalization: Works across different robot morphologies including Franka, Sawyer, WidowX, and mobile manipulators not seen during training
Limitations & Failure Modes
- FUNDAMENTAL: Temporal under-detection of brief contact events and state transitions that occur between query steps or under occlusion
- FUNDAMENTAL: Ambiguous object state estimation under partial observability in cluttered scenes, leading to conservative progress estimates
- ENGINEERING: Occasional over-reliance on goal-consistent appearance cues rather than true task completion
- EVALUATION: Limited evaluation on long-horizon tasks and complex multi-step manipulation sequences
- ENGINEERING: Requires substantial training data synthesis pipeline and compute for fine-tuning
Failure modes:
- Signal-limited failures where reward is too flat/noisy to drive exploration despite recognizing non-success
- Conservative progress estimation in ambiguous states that prevents positive reinforcement for stepping-stone behaviors
Generalizable Dense Reward for Long-Horizon Robotic Tasks
Authors: Silong Yong, Stephen Sheng, Carl Qi, Xiaojie Wang et al. (9 authors) · Institution: Carnegie Mellon University, Amazon Robotics · Category: cs.RO
VLLR combines LLM task decomposition and VLM progress estimation for value initialization with policy self-certainty intrinsic rewards during PPO finetuning of robotic foundation models.
Practical Takeaway: If you’re working on long-horizon robotics RL, the two-stage approach is worth trying: use VLM-based progress signals to initialize your value function early in training, then switch to intrinsic rewards like self-certainty for the main PPO phase. This avoids the computational cost of continuous VLM querying while still leveraging semantic supervision. The self-certainty reward (the KL divergence from a uniform distribution over actions to the policy's action distribution) is simple to implement and may help with convergence speed. However, expect to tune the reward weights carefully and consider whether your tasks have clear hierarchical structure that LLMs can reliably decompose.
Tags: robotics reinforcement-learning foundation-models long-horizon-tasks reward-design vision-language-models mobile-manipulation
Task & Setting
Long-horizon robotic tasks require robots to compose multiple skills (navigation, manipulation, object search) before receiving meaningful supervision, leading to error accumulation and sparse reward problems. Existing foundation policies trained via imitation learning struggle with distribution shift over extended horizons.
The task involves mobile manipulation and navigation in household environments. Input: egocentric RGB observations (384×224 resolution) from two cameras plus natural language instructions. Output: discrete actions from a 20-action space including base movement, arm control, and task signals. The objective is to maximize task completion while minimizing episode length, formally measured by Success ($S \in \{0,1\}$) and Success weighted by Episode Length:
\[\text{SEL} = S \cdot \frac{t_{\min}}{\max(t_{\min}, t)}\]
Tasks are evaluated on the CHORES benchmark spanning 191,568 ProcThor houses with 6 task types: Fetch (locate and acquire objects), Pick-Up (manipulate visible objects), Object Navigation (navigate to object categories), RoomVisit (visit all rooms), plus two out-of-distribution tasks requiring affordance and relational reasoning.
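The SEL formula above is easy to compute per episode. This sketch reads $t$ as the episode length and $t_{\min}$ as a reference minimum length for the task; that reading is an interpretation, since the summary does not define either symbol, and the function name and example numbers are illustrative.

```python
def sel(success: bool, t: int, t_min: int) -> float:
    """Success weighted by Episode Length: S * t_min / max(t_min, t)."""
    if t <= 0 or t_min <= 0:
        raise ValueError("episode lengths must be positive")
    return float(success) * t_min / max(t_min, t)


print(sel(True, t=80, t_min=40))   # 0.5: succeeded, but took twice the reference length
print(sel(True, t=30, t_min=40))   # 1.0: the max() caps the score at 1
print(sel(False, t=40, t_min=40))  # 0.0: failures score zero regardless of length
```

Because of the max in the denominator, finishing faster than the reference length earns no extra credit, so the metric rewards efficiency only up to that reference.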
Architecture & Method
- Foundation policy: SPOC pretrained via imitation learning on discrete action space
- LLM-based task decomposition: Claude-3.7-Sonnet decomposes tasks into ordered subgoals using scene graph context
- VLM-based progress estimation: Amazon Nova Pro evaluates visual observations against subgoals, outputting progress $p_t \in [0,1]$
- Progress signal processing: Running maximum with temporal filtering to prevent hallucination-induced saturation
- VLM reward formulation:
\[R_{\text{VLM}}(s_t, a_t, s_{t+1}) = \hat{p}_{t+1} - \hat{p}_t\]
where $\hat{p}_{t+1} = \max(\hat{p}_t, p_{t+1})$
- Self-certainty intrinsic reward:
\[R_{SC} = -\frac{1}{|A|} \sum_{i=1}^{|A|} \log(|A| \cdot \pi_\theta(i))\]
- Combined reward model:
\[R_{\text{full}} = \alpha \mathbf{1}_{\{\text{Stage I}\}} R_{\text{VLM}} + \beta \mathbf{1}_{\{\text{Stage II}\}} R_{SC} + \phi \mathbf{1}_{\{\text{Stage II}\}} R_{\text{task}}\]
The core contribution is using VLM supervision only for value initialization (Stage I) while relying on policy self-certainty for dense per-step guidance during PPO finetuning (Stage II).
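The three reward terms above can be sketched directly from their formulas. The function names and the choice to initialize the running maximum with the first raw estimate are assumptions; the default weights α=1, β=0.1, and φ=10 are the values reported in the training recipe of this summary.

```python
import math


def vlm_progress_rewards(raw_progress):
    """R_VLM = p_hat_{t+1} - p_hat_t, with p_hat_{t+1} = max(p_hat_t, p_{t+1})."""
    rewards = []
    p_hat = raw_progress[0]  # assumption: running max starts at the first estimate
    for p_next in raw_progress[1:]:
        new_p_hat = max(p_hat, p_next)
        rewards.append(new_p_hat - p_hat)
        p_hat = new_p_hat
    return rewards


def self_certainty(action_probs):
    """R_SC = -(1/|A|) * sum_i log(|A| * pi(i)); requires all probabilities > 0."""
    n = len(action_probs)
    return -sum(math.log(n * p) for p in action_probs) / n


def combined_reward(stage, r_vlm, r_sc, r_task, alpha=1.0, beta=0.1, phi=10.0):
    """R_full: VLM term in Stage I, self-certainty plus task reward in Stage II."""
    if stage == 1:
        return alpha * r_vlm
    if stage == 2:
        return beta * r_sc + phi * r_task
    raise ValueError("stage must be 1 or 2")


# A dip in the raw estimate (0.125) yields zero reward rather than a penalty.
print(vlm_progress_rewards([0.0, 0.25, 0.125, 0.5]))  # [0.25, 0.0, 0.25]
print(self_certainty([0.97, 0.01, 0.01, 0.01]))       # positive for a peaked policy
print(combined_reward(2, r_vlm=0.0, r_sc=0.0, r_task=1.0))  # 10.0
```

The running maximum makes the VLM reward non-negative, so a later drop in the raw estimate cannot undo credit already given. The self-certainty term is zero for a uniform policy and grows as the action distribution becomes more peaked.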
Training Recipe
- Stage I - Value Function Initialization (200K steps):
  - Data: Environment rollouts with VLM progress evaluation
  - Reward: VLM-based progress signal (α=1)
  - Optimizer/schedule: Not specified
  - Hardware: Not reported
- Stage II - PPO Finetuning:
  - Data: 20M steps (ObjectNav, RoomVisit) or 50M steps (Fetch, Pick-up, OOD tasks)
  - Reward: Self-certainty (β=0.1) + sparse task success (φ=10)
  - Optimizer: PPO with FLaRe hyperparameters
  - Learning rate/batch size: Not reported
  - Hardware: Not reported
  - Wall-clock time: Not reported
The two-stage design avoids expensive VLM inference during full training while leveraging VLM knowledge for value initialization.
Novelty & Lineage
Prior work:
- FLaRe (2024): RL finetuning of foundation policies using sparse rewards, achieving SOTA on CHORES
- Eureka (2023): LLM-generated reward functions via code generation for robotics
- VLM-RM (2023): VLM-based similarity scores as rewards for repetitive skills
Delta: This paper combines LLM task decomposition with VLM progress evaluation for value initialization, plus policy self-certainty as intrinsic reward during PPO finetuning.
Applied-specific assessment:
- Architectural idea: Using VLM only for value initialization (not throughout training) is a reasonable engineering choice but not conceptually novel
- Benchmark gains: roughly 5 percentage points on average on in-distribution tasks and up to 10.4 points on OOD tasks (ObjNavAff) - meaningful but modest improvements
- Fair comparisons: Uses same foundation model (SPOC) and evaluation protocol as FLaRe baseline
- Scale dependency: Method relies on capable LLM/VLM (Claude-3.7, Nova Pro) which may not be accessible to all practitioners
The combination of existing techniques (LLM planning + VLM evaluation + self-certainty) is sensible but individually well-established. The two-stage training is a practical optimization rather than fundamental advance.
Verdict: INCREMENTAL — Solid engineering combining known techniques with modest but consistent improvements across tasks.
Benchmarks & Results
- ObjectNav: 87.6% success (VLLR) vs 85.0% (FLaRe), +2.6 points
- Fetch: 70.7% success (VLLR) vs 65.2% (FLaRe), +5.5 points
- PickUp: 97.0% success (VLLR) vs 91.8% (FLaRe), +5.2 points
- RoomVisit: 68.3% success (VLLR) vs 60.9% (FLaRe), +7.4 points
- ObjNavRel (OOD): 67.4% success (VLLR) vs 66.3% (FLaRe), +1.1 points
- ObjNavAff (OOD): 90.1% success (VLLR) vs 79.7% (FLaRe), +10.4 points
Results show consistent but modest improvements across tasks. The largest gains are on affordance navigation (+10.4 points) and RoomVisit (+7.4 points). Improvements on SEL (episode efficiency) are generally smaller, with FLaRe slightly ahead on RoomVisit efficiency despite its lower success rate.
Compute & Efficiency
- Model size: Foundation policy parameters not specified, uses SPOC architecture
- Training compute: Stage I uses 200K VLM inference steps, Stage II runs 20-50M PPO steps - hardware not reported
- Inference speed: VLM queries avoided during Stage II, only self-certainty computation required per step
- Memory footprint: Not reported
- Deployment practicality: Two-stage design reduces inference cost by avoiding continuous VLM queries, but still requires access to Claude-3.7 and Nova Pro for initial decomposition and value initialization
Real-World Applicability
- Evaluation limited to CHORES simulation benchmark on ProcThor houses
- No real robot experiments or hardware validation reported
- No sim-to-real transfer discussion provided
- Method relies on high-quality VLM (Nova Pro) which may have different performance characteristics on real robot camera feeds vs simulation renders
- Scene graph input for LLM decomposition would require additional perception pipeline in real environments
Limitations & Failure Modes
- FUNDAMENTAL: Relies on external LLM/VLM capabilities - performance bounded by foundation model quality and hallucination rates
- ENGINEERING: VLM progress estimation remains noisy requiring post-processing - could be improved with better VLM selection or prompt engineering
- EVALUATION: Limited to single simulation benchmark, no real-world validation or comparison to human-engineered dense rewards
- FUNDAMENTAL: Task decomposition requires structured scene graph representation which may not be available in unstructured real environments
- ENGINEERING: Self-certainty weighting (β=0.1) requires manual tuning and may not generalize across different foundation models
Failure modes:
- VLM hallucinations leading to spurious progress signals despite temporal filtering
- Self-certainty overconfidence causing policy to commit to suboptimal action sequences
Pixelis: Reasoning in Pixels, from Seeing to Acting
Authors: Yunpeng Zhou · Institution: University of Reading · Category: cs.CV
Pixelis introduces a pixel-space agent that uses executable visual tools with curiosity-coherence training and trajectory-level test-time adaptation, achieving consistent but modest improvements across vision-language benchmarks.
Practical Takeaway: Research engineers should consider the curiosity-coherence reward formulation for training tool-using agents - the combination of prediction-error curiosity with adjacent-step coherence appears to produce more structured exploration than either component alone. The trajectory-level voting mechanism for test-time adaptation is also worth implementing, as it provides behavioral consistency without requiring process critics. However, the three-phase training pipeline is complex and the gains are modest enough that simpler baselines (answer-level self-consistency, standard test-time adaptation) might be more practical starting points. The pixel tool interface design and RaPR/RaCPR process metrics offer useful frameworks for auditing agent behavior beyond just task accuracy.
Tags: vision-language-models tool-augmented-AI test-time-adaptation pixel-space-reasoning reinforcement-learning multimodal-agents curiosity-driven-learning visual-reasoning
Task & Setting
Vision-language models are typically passive observers that describe images but cannot act on them or adapt under distribution shift. This limits their practical application in dynamic environments where interaction and continuous learning are essential.
Pixelis addresses this by creating a pixel-space agent that performs executable tool operations directly on images and videos. The input includes images/videos and natural language queries, while the output consists of structured toolchains composed of 6 core operations: SEG (segmentation), ZOOM (crop/zoom), TRK (tracking), OCR (text reading), TEMP (temporal localization), and PROP (property detection). Each tool produces typed, serialized outputs (normalized coordinates, masks, text spans, tracklets) that serve as inputs to subsequent operations.
The objective is to maximize a composite reward function:
\[R(\tau) = w_1 R_{final} + w_2 R_{cur} + w_3 R_{coh} - w_4 R_{pen}\]
where $R_{final}$ is task accuracy, $R_{cur}$ rewards curiosity-driven exploration, $R_{coh}$ promotes adjacent-step coherence, and $R_{pen}$ penalizes invalid operations.
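As a minimal sketch of how the composite reward combines its four terms, the snippet below evaluates $R(\tau)$ for one trajectory. The weight values are hypothetical placeholders, since this digest does not report $w_1, \dots, w_4$, and the function and class names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical values: the digest does not report w_1..w_4.
    w_final: float = 1.0
    w_cur: float = 0.1
    w_coh: float = 0.1
    w_pen: float = 0.5


def composite_reward(r_final: float, r_cur: float, r_coh: float,
                     r_pen: float,
                     w: RewardWeights = RewardWeights()) -> float:
    """R(tau) = w1*R_final + w2*R_cur + w3*R_coh - w4*R_pen."""
    return (w.w_final * r_final
            + w.w_cur * r_cur
            + w.w_coh * r_coh
            - w.w_pen * r_pen)


# Correct answer, some exploration and coherence, one invalid tool call.
example = composite_reward(r_final=1.0, r_cur=0.5, r_coh=0.2, r_pen=1.0)
```

The penalty enters with a negative sign, so an invalid operation lowers the return even when the final answer is correct.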
Success is measured through task accuracy on benchmarks plus novel process metrics: Rate of Pixel Reasoning (RaPR) measuring valid visual-tool steps, and Rate of Composite Pixel Reasoning (RaCPR) detecting coherent multi-step chains. Tool fidelity is evaluated using IoU, ANLS, and HOTA scores.
The paper evaluates on 6 benchmarks (V*Bench, MMBench, MVBench, InfoVQA, Video-MMMU, VSI-Bench) with a training corpus of 76k Chain-of-Thought-Action trajectories averaging 5.8 steps.
Architecture & Method
- Base architecture: Qwen3-VL-8B-Instruct backbone with frozen visual encoder and added tool interface
- Tool interface: 6 executable operations (SEG, ZOOM, TRK, OCR, TEMP, PROP) with typed arguments and serialized outputs
- Step embeddings: Visual tokens from layer 3 of multimodal stack, combined with text features and previous action via 2-layer MLP:
\[E_t = \frac{g_\phi([v_t \| x_t \| \text{onehot}(a_{t-1})])}{\left\| g_\phi([v_t \| x_t \| \text{onehot}(a_{t-1})]) \right\|_2}\]
- Auxiliary tool heads: Single-layer MLPs for box regression, text classification, mask prediction, and temporal localization with loss:
\[L_{tool} = \frac{1}{|D|}\sum_{t \in D}\left[\lambda_{box}\,\text{SmoothL1}(\hat{b}_t, b_t) + \lambda_{text}\,\text{CE}(\hat{y}_t, y_t) + \lambda_{mask}\,(1-\text{Dice}(\hat{M}_t, M_t)) + \lambda_{temp}\,\text{BCE}(\hat{u}_t, u_t)\right]\]
- Curiosity mechanism: Tool-conditioned dynamics head predicts next visual state with uncertainty gating via MC dropout
- Coherence regularization: Adjacent-step cosine similarity on z-scored step embeddings to prevent tool-hopping
- Test-time adaptation: Hybrid retrieval using text+pixel keys, trajectory-level voting with behavioral similarity matching, KL-to-EMA safety control
The core technical contribution is the unified pixel-space acting framework with curiosity-coherence training and safe test-time adaptation through trajectory voting.
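The step-embedding and coherence items above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the ReLU activation, the layer sizes, and the choice to z-score each embedding over its feature dimension are all guesses, and `step_embedding` and `adjacent_coherence` are hypothetical names.

```python
import numpy as np


def l2_normalize(h, eps=1e-8):
    """Divide by the L2 norm along the last axis."""
    return h / (np.linalg.norm(h, axis=-1, keepdims=True) + eps)


def step_embedding(v_t, x_t, a_prev, n_tools, params):
    """E_t = g([v_t || x_t || onehot(a_{t-1})]) / ||g(...)||_2.

    g is a 2-layer MLP; the ReLU nonlinearity is an assumption.
    """
    onehot = np.zeros(n_tools)
    onehot[a_prev] = 1.0
    h = np.concatenate([v_t, x_t, onehot])
    W1, b1, W2, b2 = params
    hidden = np.maximum(0.0, W1 @ h + b1)
    return l2_normalize(W2 @ hidden + b2)


def adjacent_coherence(E, eps=1e-8):
    """Mean cosine similarity between adjacent z-scored step embeddings.

    E has shape (T, d). Z-scoring over the feature axis is an assumption.
    """
    Z = (E - E.mean(axis=-1, keepdims=True)) / (E.std(axis=-1, keepdims=True) + eps)
    Z = l2_normalize(Z)
    return float(np.mean(np.sum(Z[:-1] * Z[1:], axis=-1)))
```

A trajectory that keeps operating on related content scores near 1, while one that alternates between unrelated steps scores lower, which is the behavior a tool-hopping penalty needs.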
Training Recipe
- Supervised Fine-Tuning (SFT): 3 epochs on 76k Chain-of-Thought-Action traces with masked imitation loss upweighting action tokens (weight 2.0), auxiliary tool heads, and 5% feedback dropout. Curriculum uses medium:hard sampling 2:1 with SFT-predicted hardness. Label smoothing 0.05, gradient clipping 1.0, 2% early action dropout.
- Curiosity-Coherence Reward Fine-Tuning (CC-RFT): GRPO policy gradient with K=8 trajectories, composite reward balancing curiosity, coherence, and efficiency. Target token-KL ≈0.15 with PID controller. Intrinsic rewards are batch-wise z-scored. Stop gradients to backbone during this phase.
- Pixel Test-Time RL: Online adaptation using neighborhood retrieval (k=8), trajectory voting over N=8 rollouts, EMA anchor (ρ=0.99) for KL stabilization. Updates run for 8k steps with gradient clipping 1.0. Abstention based on entropy and margin thresholds.
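The test-time phase above combines three ingredients that are easy to sketch in isolation: an EMA anchor, a KL measurement against it, and a vote over rollouts. The code below is a rough illustration only. The digest does not specify the behavioral similarity measure, so matching tool-name sequences with `difflib.SequenceMatcher` is a stand-in, the KL direction (current policy to anchor) is assumed, and all function names are hypothetical.

```python
from difflib import SequenceMatcher

import numpy as np


def ema_update(anchor, current, rho=0.99):
    """Move the anchor slowly toward the current parameters."""
    return rho * anchor + (1.0 - rho) * current


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def vote_trajectory(rollouts):
    """Return the index of the rollout most similar to all the others.

    Each rollout is a list of tool names, e.g. ["SEG", "ZOOM", "OCR"].
    Ties resolve to the earliest rollout.
    """
    def support(i):
        return sum(
            SequenceMatcher(None, rollouts[i], rollouts[j]).ratio()
            for j in range(len(rollouts)) if j != i
        )
    return max(range(len(rollouts)), key=support)


rollouts = [["SEG", "ZOOM", "OCR"], ["SEG", "ZOOM", "OCR"], ["TRK", "TEMP"]]
winner = vote_trajectory(rollouts)  # a rollout from the agreeing majority
```

Voting on the tool sequence rather than only on the final answer is what distinguishes this from answer-level self-consistency: two rollouts that reach the same answer through unrelated toolchains do not reinforce each other.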
Data: 80k/8k/8k train/dev/test images, 28k/2k/2k videos. CoTA trajectories from Grok 4 VLM with tool-constrained generation.
Optimizer details not fully reported - appears to use standard AdamW with learning rate scheduling.
Hardware: 805.2 GPU hours total training cost for 8B model.
Validation uses IoU (SEG/TRK), ANLS (OCR), temporal consistency scores with trajectory acceptance threshold τ₀=0.65.
Novelty & Lineage
Prior Work:
- Visual ChatGPT/MM-REACT (2023): Tool-augmented VLMs with API orchestration but limited verification
- PixelLM (2024): Internal mask generation with codebook, removes external APIs but lacks executability
- RLHF for VLMs: Constitutional AI, self-consistency methods for chain supervision but require process critics
Delta: This paper adds:
- A compact executable pixel tool interface with replayable traces
- Curiosity-coherence RFT coupling prediction-error exploration with adjacent-step coherence without process critics
- Pixel TTRL using trajectory-level behavioral voting with KL-to-EMA safety for test-time adaptation
Applied-Specific Assessment:
- Architectural novelty: The curiosity-coherence coupling is non-obvious, avoiding the typical exploration-exploitation tradeoffs by using uncertainty gating and local coherence constraints
- Benchmark gains: Modest +4.08% average improvement, peaking at +6.03% on VSI-Bench. Gains are consistent across 6 benchmarks but not dramatically large
- Fair comparisons: Same 8B backbone, matched compute budgets, proper ablations. However, results for proprietary models (GPT-5, Gemini) are not directly comparable
- Generalizability: Gains appear to hold across diverse settings, but all experiments use the same Qwen3-VL backbone, so transfer to other VLMs is untested
The paper makes solid engineering contributions but the algorithmic novelty is incremental - combining known techniques (curiosity-driven RL, coherence regularization, test-time adaptation) in a new domain.
Verdict: INCREMENTAL — Well-executed combination of existing techniques with consistent but modest improvements across benchmarks.
Benchmarks & Results
- V*Bench: 90.1% vs 86.4% baseline (+3.7pp improvement)
- MMBench v1.1: 89.5% vs 85.0% baseline (+4.5pp improvement)
- MVBench: 73.8% vs 68.7% baseline (+5.1pp improvement, marked as tool-needed subset)
- InfoVQA: 87.9% vs 83.1% baseline (+4.8pp improvement, tool-needed subset)
- Video-MMMU: 69.8% vs 65.3% baseline (+4.5pp improvement, tool-needed subset)
- VSI-Bench: 64.4% vs 59.4% baseline (+5.0pp improvement, tool-needed subset)
Average relative gain: +4.08% computed as (ours-baseline)/baseline, with peak at +6.03% on VSI-Bench.
Results show consistent improvements across all six benchmarks, though the absolute gains are modest (3.7 to 5.1 percentage points). All comparisons use the same 8B backbone with matched compute budgets.
Process metrics also improve: RaPR increases from baseline levels while maintaining shorter toolchains (3.7 vs ~6 steps average). KL drift stays within [0.10, 0.20] corridor during test-time adaptation.
Tool fidelity scores improve jointly with task accuracy: IoU, ANLS, and HOTA scores increase, suggesting real pixel-grounded reasoning rather than text-only heuristics.
Notable benchmark absences: No evaluation on standard vision-language benchmarks like COCO-VQA, GQA, or TextVQA that might provide broader context.
Compute & Efficiency
- Model size: 8B parameters (Qwen3-VL backbone) plus ~1.2M additional parameters for tool heads and step embeddings
- Training compute: 805.2 GPU hours total across all three phases (SFT + CC-RFT + Pixel TTRL)
- Inference speed: End-to-end latency p50 = 5.8s, p90 = 8.1s, p95 = 10.2s. Tool operations dominate latency (~87%), with segmentation/tracking being most expensive (1.17s + 1.42s). Retrieval+voting adds ~0.35s median overhead.
- Memory footprint: Not explicitly reported, but 8B model suggests standard VLM memory requirements plus tool execution overhead
- Deployment practicality: Tool execution requires external verifiers (HOTA/CLEAR-MOT, ANLS evaluators) and is dominated by pixel operations rather than model inference. Average toolchain length of 3.7 decisive steps makes execution reasonably efficient. KL/EMA bookkeeping adds <0.1s overhead.
The system is practically deployable but inference latency is higher than standard VLMs due to multi-step tool execution. The pixel tools create significant computational overhead beyond the base model.
Real-World Applicability
- Dataset composition: Uses mix of real images (80k train) and videos (28k train) from public sources, but tool execution traces are generated synthetically via Grok 4 VLM with constrained generation
- Robustness testing: Plan re-execution tested under JPEG compression (q=10-30), ±2% resize, ±1 frame jitter shows accuracy improvement from 78.9% to 83.4%
- Domain shift evaluation: Tested under lighting and motion shifts with 8k online updates, showing improvement from 73.0% to 76.5% accuracy while maintaining KL corridor
- Tool fidelity validation: External verifiers (IoU≥0.5 for SEG, HOTA≥0.15 for TRK, ANLS≥0.85 for OCR) show 83-93% success rates across tool types
- De-duplication audit: Rigorous filtering shows <0.05% residual overlap between train/test splits and 0.00% overlap with retrieval index
- Cross-dataset transfer: Brief mention of COCO/MOT to LVIS-style transfer preserving fraction of in-domain performance
The work shows promising real-world applicability with proper robustness testing and tool validation, though most evaluation remains on curated benchmarks rather than fully uncontrolled environments.
Limitations & Failure Modes
- FUNDAMENTAL: Three-phase training with coupled statistics rather than end-to-end optimization - residual SFT bias can persist and affect downstream phases
- FUNDAMENTAL: Process signals can overfit to stale behaviors or chase high-entropy textures, limiting adaptation to truly novel scenarios
- FUNDAMENTAL: Coherence constraint is purely local (adjacent steps) - cannot enforce global plan consistency or long-term reasoning
- ENGINEERING: Non-differentiable tools (segmentation/OCR/tracking) are brittle on thin structures, stylized fonts, dense layouts, or motion blur
- ENGINEERING: Error propagation through toolchains - early mistakes compound in RaCPR computation and downstream tool calls
- EVALUATION: May under-adapt or oscillate under abrupt distribution shifts - tested mainly on gradual lighting/motion changes
- EVALUATION: All experiments use single backbone (Qwen3-VL-8B) - unclear if gains transfer to other architectures
Failure Modes:
- Tool cascade failures: Segmentation errors lead to incorrect tracking, which propagates through temporal localization
- Oscillatory exploration: Without sufficient KL constraints, curiosity can drive repetitive tool-hopping despite coherence penalties
The authors acknowledge these limitations and suggest mitigations including adaptive KL budgets, diversity-aware replay, and tool-noise simulation.
LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
Authors: Royden Wagner, Omer Sahin Tas, Jaime Villa, Felix Hauser et al. (21 authors) · Institution: Karlsruhe Institute of Technology (KIT) · Category: cs.CV
Introduces a dataset of long-tail driving scenarios with multi-view videos and expert reasoning traces, proposing Multi-Maneuver Score for evaluating VLMs on autonomous driving while revealing poor semantic coherence between model reasoning and actions.
Practical Takeaway: This dataset provides a valuable resource for evaluating VLMs on long-tail driving scenarios with multi-view video and expert reasoning traces. The key practical insight is that current open-source VLMs struggle significantly in zero-shot driving scenarios but improve substantially with few-shot prompting. However, the low semantic coherence (27-51%) between reasoning and actions reveals a fundamental issue - models either hallucinate explanations or generate poor trajectories. The Multi-Maneuver Score offers a more realistic evaluation than L2 errors by considering multiple viable trajectories. Researchers should focus on improving reasoning-action alignment and consider that converting CoT reasoning to trajectories via kinematic models outperforms direct trajectory prediction, suggesting value in explicit action reasoning.
Tags: autonomous_driving long_tail_scenarios vision_language_models end_to_end_driving multi_modal_evaluation reasoning_traces chain_of_thought trajectory_planning
Task & Setting
Real-world autonomous driving in long-tail scenarios (e.g., construction zones, adverse weather, accidents, overtaking) remains a fundamental challenge for end-to-end models. These rare events are critical for safety but underrepresented in training data, making models prone to failure when encountering unusual situations. Current evaluation methods focus on replicating single expert trajectories rather than considering the inherent multi-modality of driving decisions.
This paper introduces a benchmark for evaluating Vision-Language Models (VLMs) and Vision-Language-Action (VLA) models on end-to-end driving in long-tail scenarios. The input consists of multi-view video data (6 cameras, 360° FoV, 5746×512 px stitched frames), past 4s trajectory, and high-level textual instructions (e.g., “overtake truck driving on the right”). The output includes:
- planned 5s future trajectory as waypoints, and
- reasoning traces explaining driving actions.
The formal objective evaluates multiple viable trajectories rather than single expert paths, using the Multi-Maneuver Score (MMS):
\[\text{MMS} = \begin{cases} 0, & \text{if } \langle \mathbf{v}_{\text{plan}}^{(0)}, \mathbf{v}_{\text{ref}}^{(0)} \rangle \leq 0.5\,\|\mathbf{v}_{\text{ref}}^{(0)}\|, \\ \text{MMS}_{\text{ref}}, & \text{else if } \text{MMS}_{\text{ref}} \in \{0, 1\} \text{ and } s \geq 0.4, \\ s \cdot \text{MMS}_{\text{ref}}, & \text{else if } s \cdot \text{MMS}_{\text{ref}} \geq 3.5 - \text{CP}, \\ 3.5 - \text{CP}, & \text{otherwise} \end{cases}\]
Success is measured through:
- MMS scores (0-10 scale covering safety, comfort, instruction-following)
- L2 trajectory errors, and
- semantic coherence between reasoning traces and planned actions using Rocchio classification.
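The piecewise MMS definition given earlier in this section can be transcribed directly into code, which makes the order of the four cases explicit. This is a literal sketch of the formula as printed here, not the authors' evaluation code. The digest does not define $s$, $\text{CP}$, or $\text{MMS}_\text{ref}$; the comments below assume they are a trajectory similarity, a comfort penalty, and the score attached to the matched reference trajectory, and the function name is hypothetical.

```python
import numpy as np


def multi_maneuver_score(v_plan0, v_ref0, mms_ref, s, cp):
    """Evaluate the piecewise MMS definition case by case.

    v_plan0, v_ref0: initial velocity vectors of the planned and
        reference trajectories.
    mms_ref: score of the matched reference (assumed meaning).
    s: similarity to that reference (assumed meaning).
    cp: comfort penalty (assumed meaning).
    """
    v_plan0 = np.asarray(v_plan0, dtype=float)
    v_ref0 = np.asarray(v_ref0, dtype=float)

    # Case 1: the plan starts out inconsistent with the reference motion.
    if np.dot(v_plan0, v_ref0) <= 0.5 * np.linalg.norm(v_ref0):
        return 0.0
    # Case 2: close match to a reference scored 0 or 1.
    if mms_ref in (0, 1) and s >= 0.4:
        return float(mms_ref)
    # Case 3: similarity-scaled reference score, if above the floor.
    if s * mms_ref >= 3.5 - cp:
        return float(s * mms_ref)
    # Case 4: fall back to the floor value.
    return float(3.5 - cp)
```

Read this way, the first case acts as a gate on the initial motion, and the last two cases mean that a plan passing the gate and not matched to a 0- or 1-scored reference never scores below $3.5 - \text{CP}$.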
The dataset contains 1000 scenarios (9s each) split into train/test/validation (500/400/100), covering specifically selected challenging scenarios (19.8%), intersections (29.6%), overtaking/lane changes (22.7%), construction zones (9.4%), adverse weather (13.3%), and nighttime driving (5.1%). Expert reasoning traces are provided in English, Spanish, and Chinese from domain experts.
Architecture & Method
- Multi-view data collection: Six-camera rig capturing 360° FoV with frame-wise image stitching using gradual warping instead of single homography transformations to handle overlapping regions.
- Semantic coherence measurement: Uses EmbeddingGemma 0.3B to generate sentence embeddings of reasoning traces, then applies Rocchio classification to match described actions with trajectory-derived actions:
\[\hat{y} = \arg \max_{c \in \mathbf{C}} \cos(\mathbf{z}, \boldsymbol{\mu}_c)\]
where $\mathbf{z}$ is the reasoning trace embedding and $\boldsymbol{\mu}_c$ is the reference embedding for action class $c$.
- Multi-Maneuver Score (MMS): Core contribution that evaluates trajectories against 5 reference categories (expert-like, wrong speed, neglect instruction, off-road, crash) with comfort penalties based on jerk and tortuosity:
\[\text{average jerk} = \frac{1}{T} \sum_t \left\| \frac{\Delta^3 \mathbf{Y}_{t,:}}{\Delta t^3} \right\|\]
\[\text{tortuosity} = \frac{\sum_{t=2}^{T} \|\mathbf{Y}_{t,:} - \mathbf{Y}_{t-1,:}\|}{\|\mathbf{Y}_{T,:} - \mathbf{Y}_{1,:}\|}\]
- Few-shot Chain-of-Thought prompting: Provides 3 examples with expert reasoning traces covering highway overtaking, suburban left turn, and urban right turn scenarios.
- Kinematic trajectory generation: Maps discrete driving actions from reasoning traces to continuous trajectories using kinematic bicycle model with speed-dependent steering angles and acceleration values.
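The two comfort quantities in the MMS item above, average jerk and tortuosity, can be computed from a waypoint array with finite differences. The sketch below assumes `Y` holds $T$ waypoints of shape `(T, 2)` sampled at a fixed interval `dt`; it averages over the $T-3$ available third differences rather than dividing by $T$, which is a small deviation from the formula as printed. Function names are illustrative.

```python
import numpy as np


def average_jerk(Y, dt):
    """Mean norm of the third finite difference of positions."""
    Y = np.asarray(Y, dtype=float)
    if len(Y) < 4:
        raise ValueError("need at least 4 waypoints for a third difference")
    jerk = np.diff(Y, n=3, axis=0) / dt**3
    return float(np.mean(np.linalg.norm(jerk, axis=1)))


def tortuosity(Y):
    """Path length divided by straight-line start-to-end distance."""
    Y = np.asarray(Y, dtype=float)
    path = np.sum(np.linalg.norm(np.diff(Y, axis=0), axis=1))
    chord = np.linalg.norm(Y[-1] - Y[0])
    if chord == 0.0:
        return float("inf")  # closed loop or stationary plan
    return float(path / chord)
```

A straight, constant-velocity plan has zero jerk and tortuosity 1; weaving or abrupt speed changes push both values up, which is what a comfort penalty is meant to capture.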
The core technical contribution is the MMS metric that captures driving multi-modality and the semantic coherence evaluation between reasoning and actions, going beyond single-trajectory L2 error metrics.
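The semantic coherence evaluation rests on Rocchio (nearest-centroid) classification with cosine similarity. A minimal sketch follows, using toy 2-D vectors in place of real sentence embeddings; in the paper these would come from EmbeddingGemma 0.3B. Building the class references as mean embeddings is the standard Rocchio construction and is assumed here, as are the function names and action labels.

```python
import numpy as np


def fit_centroids(X, y):
    """Average the embeddings of each action class."""
    labels = np.array(y)
    classes = sorted(set(y))
    centroids = np.stack([X[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def rocchio_predict(z, classes, centroids, eps=1e-12):
    """Return the class whose centroid has the highest cosine with z."""
    z_unit = z / (np.linalg.norm(z) + eps)
    c_unit = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + eps)
    return classes[int(np.argmax(c_unit @ z_unit))]


def semantic_coherence(described, executed):
    """Fraction of samples where the described action matches the
    action derived from the planned trajectory."""
    matches = [d == e for d, e in zip(described, executed)]
    return sum(matches) / len(matches)


# Toy stand-ins for reference embeddings of two action classes.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = ["accelerate", "accelerate", "brake", "brake"]
classes, centroids = fit_centroids(X, y)
label = rocchio_predict(np.array([0.8, 0.2]), classes, centroids)
```

Because the classifier has no trainable parameters beyond the centroids, the coherence score stays cheap to compute and easy to reproduce, at the cost of depending heavily on the quality of the embedding model.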
Training Recipe
The paper evaluates pre-trained models without additional training:
- Open-source VLMs evaluated: Pixtral 12B, Gemma 3 12B, Qwen3-VL 8B - all instruction-tuned by model providers.
- Closed-source models evaluated: Gemini 3 Pro, Gemini Robotics ER 1.5, GPT-5 via API calls.
- Classical end-to-end models: UniAD and DMAD pre-trained on nuScenes dataset.
- Inference configurations:
  - Zero-shot: Direct prompting with current frame, past 4s trajectory, and instruction
  - Few-shot: 3 example scenarios added to prompt
  - Few-shot CoT: Expert reasoning traces added to few-shot examples
  - Few-shot CoT kinematic: Kinematic model converts reasoning actions to trajectories
- Prompt optimization: Templates optimized using Perplexity Pro for consistency across models.
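The "Few-shot CoT kinematic" configuration relies on a kinematic bicycle model to turn discrete actions into waypoints. The sketch below shows the standard model with forward-Euler integration. The wheelbase, time step, and the action table are hypothetical values chosen for illustration; the paper's speed-dependent steering angles and acceleration values are not given in this digest.

```python
import math

# Hypothetical mapping from an action label to (acceleration m/s^2,
# steering angle rad). The paper's actual values are not reported here.
ACTION_TABLE = {
    "keep_lane": (0.0, 0.0),
    "accelerate": (1.0, 0.0),
    "brake": (-2.0, 0.0),
    "turn_left": (0.0, 0.05),
}


def rollout_bicycle(x, y, yaw, v, accel, steer,
                    wheelbase=2.8, dt=0.1, horizon=5.0):
    """Integrate a kinematic bicycle model and return (x, y) waypoints.

    State update per step:
        x'   = v * cos(yaw)
        y'   = v * sin(yaw)
        yaw' = v / wheelbase * tan(steer)
        v'   = accel
    """
    waypoints = []
    for _ in range(round(horizon / dt)):
        x += v * math.cos(yaw) * dt
        y += v * math.sin(yaw) * dt
        yaw += v / wheelbase * math.tan(steer) * dt
        v = max(0.0, v + accel * dt)  # no reversing in this sketch
        waypoints.append((x, y))
    return waypoints


accel, steer = ACTION_TABLE["turn_left"]
left_turn = rollout_bicycle(0.0, 0.0, 0.0, 10.0, accel, steer)
```

Routing the model's stated action through a physical model guarantees that the resulting trajectory is kinematically feasible, which direct waypoint prediction by a VLM does not.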
No model training, fine-tuning, or reinforcement learning is performed. The evaluation focuses on the in-context learning capabilities of existing pre-trained models. To identify long-tail scenarios, data filtering applied the Pareto principle with nuScenes as the reference distribution (80% cumulative frequency threshold).
Hardware and computational details: Not reported for model evaluation. Dataset collection used high-performance computing resources from KIT’s HoreKa system.
Novelty & Lineage
Prior work:
- Waymo Open E2E (2023): Evaluates end-to-end driving on long-tail scenarios but provides only current timestep images, no video data or reasoning traces, and uses single expert trajectory evaluation.
- DriveLM-Data (2024): Extends nuScenes/CARLA with Q&A labels and graph-based reasoning but evaluates against single expert trajectories and uses ChatGPT-3.5 for semantic alignment measurement.
- CoVLA-Dataset (2025): Provides front-view videos with auto-generated behavior captions from VLMs but suffers from potential model collapse and single-trajectory evaluation.
Delta: This paper adds:
- Multi-view 360° video data with expert reasoning traces in multiple languages
- Multi-Maneuver Score evaluating multiple viable trajectories rather than single expert paths
- Semantic coherence metric between reasoning traces and planned actions using lightweight Rocchio classification
- Focus specifically on long-tail scenarios with expert curation.
Applied-specific assessment:
- Architectural novelty: The MMS metric is a straightforward extension of existing trajectory evaluation but addresses a real limitation of L2-based metrics. The semantic coherence measurement using Rocchio classification is a reasonable but not groundbreaking technical choice.
- Benchmark gains: Mixed results - closed-source models achieve MMS ~4.4-5.0, open-source models improve from 0.05-1.18 zero-shot to ~4.1-4.2 with few-shot prompting, but semantic coherence remains low (0.27-0.51), indicating fundamental issues.
- Fair comparisons: Evaluation protocol is reasonable, though different models receive different input modalities (images vs video). Classical end-to-end models (UniAD, DMAD) achieve competitive MMS ~3.6-3.9.
- Scale dependence: Gains appear dependent on model scale and proprietary training - open-source models struggle significantly in zero-shot setting.
Verdict: INCREMENTAL — Solid dataset contribution with reasonable evaluation metrics, but the core technical advances (MMS, semantic coherence) are straightforward extensions of existing methods without fundamental breakthroughs.
Benchmarks & Results
- Multi-Maneuver Score (MMS) on test set: Gemini 3 Pro achieves 4.99, Gemini Robotics ER 1.5 gets 4.35, GPT-5 scores 4.48. Classical models: UniAD 3.60, DMAD 3.85. Open-source zero-shot: Pixtral 12B 0.05, Qwen3-VL 8B 1.18, Gemma 3 12B 1.11. Open-source few-shot improves to ~4.1-4.2 range.
- L2 trajectory error (5s horizon): Closed-source models perform best: Gemini 3 Pro 3.19m, GPT-5 4.01m, Gemini Robotics ER 1.5 7.12m. Classical models competitive: DMAD 10.38m, UniAD 11.20m. Open-source models struggle: 22.98-40.69m zero-shot, improve to 4.12-8.52m few-shot.
- Semantic coherence between reasoning and actions: Low across all models - Qwen3-VL 0.51, Gemma 3 12B 0.30, Pixtral 12B 0.27. Acceleration prediction more coherent than steering.
- Scenario-specific performance: All models perform best on nighttime scenarios, worst on snow and specifically selected challenging scenarios. Intersection scenarios show poor instruction following (MMS ~4 suggests trajectory mismatch).
- Correlation analysis: MMS correlates better with Bench2Drive DrivingScore (r=0.59) compared to L2 errors (r=-0.45), validating the metric design.
Results show large gaps between closed-source and open-source models, with few-shot prompting providing substantial improvements for open-source VLMs but semantic coherence remaining problematic across all evaluated models.
Compute & Efficiency
- Model sizes: Evaluated models range from 8B parameters (Qwen3-VL) to 12B parameters (Pixtral, Gemma 3). Closed-source model sizes not disclosed.
- Training compute: No additional training performed - evaluation uses pre-trained models. Dataset collection and annotation involved domain experts over 2-year period starting late 2023.
- Inference speed/latency: Not reported. Evaluation conducted via API calls for closed-source models and local inference for open-source models.
- Memory footprint: Not specified for model inference. Dataset storage includes multi-view videos (3200×2200 raw, 5746×512 stitched) at 5 Hz for 4s clips across 1000 scenarios.
- Deployment practicality: High-resolution multi-view video processing and semantic coherence evaluation using EmbeddingGemma 0.3B presents computational overhead. MMS metric designed to be “lightweight and reproducible” compared to neural rendering approaches, but still requires trajectory similarity computation and comfort penalty calculations across multiple reference trajectories.
Real-World Applicability
- Real-world data collection: Videos recorded in actual driving environments across Karlsruhe, Heidelberg, Mannheim, and Black Forest regions in Germany over 2-year period with deliberate focus on construction zones, intersections, and adverse weather conditions.
- Hardware setup: Six-camera rig with 360° field of view mounted on research vehicle, capturing at 5 Hz with high resolution (3200×2200 raw images).
- Deployment considerations: No actual deployment results reported. Evaluation remains on recorded scenarios without closed-loop vehicle control.
- Environment diversity: Covers urban, suburban, and highway environments with specific attention to long-tail events (19.8% specifically selected challenging scenarios, construction zones, adverse weather, nighttime driving).
- Sim-to-real analysis: Limited comparison conducted using SimLingo/Bench2Drive simulation environment to validate MMS metric against closed-loop DrivingScore, showing reasonable correlation (r=0.59).
The work focuses on offline evaluation of recorded scenarios rather than online deployment, limiting direct real-world applicability assessment. The multi-view video requirement and computational overhead may present practical deployment challenges.
Limitations & Failure Modes
- Low semantic coherence - FUNDAMENTAL: All evaluated models show poor alignment (27-51%) between reasoning traces and planned actions, indicating either hallucinated reasoning or unreasonable trajectory planning.
- Domain gap in reasoning traces - FUNDAMENTAL: Chain-of-thought prompting often worsens performance compared to few-shot prompting, likely due to pretraining data focusing on math/coding rather than driving explanations (context-memory conflicts).
- Single geographic region - EVALUATION: Dataset limited to German driving environments, potentially limiting generalizability to other traffic patterns, road infrastructure, and driving cultures.
- Short planning horizon - EVALUATION: 5-second planning horizon may not capture longer-term decision making required for complex scenarios.
- Multi-view computational overhead - ENGINEERING: High-resolution 360° video processing requires significant computational resources, potentially limiting practical deployment.
- Reference trajectory annotation - EVALUATION: Manual annotation of reference trajectories for crash/off-road categories may introduce subjective bias and scaling challenges.
Failure modes:
- Instruction neglect: Models frequently produce trajectories that ignore high-level instructions, particularly in intersection scenarios where multiple maneuvers are viable.
- Inconsistent planning: Models generate trajectories inconsistent with past vehicle motion, receiving 0 MMS scores due to physics violations or abrupt direction changes.