Apr 10, 2026 Applied AI 5 papers

Applied AI Digest — Apr 10, 2026

Today’s Digest at a Glance

Today’s digest focuses on advanced evaluation methodologies, spatio-temporal reasoning systems, and specialized architectures for embodied AI applications.

Trajectory-Level Preference Modeling

Traditional preference learning evaluates individual responses, but many AI applications—especially autonomous agents—require reasoning about entire sequences of actions and their long-term consequences. The naive approach of applying pointwise preference models to each step fails because it cannot capture dependencies between actions or evaluate whether an agent’s overall strategy is coherent.

Trajectory-level preference modeling addresses this by treating entire execution traces as atomic units for preference comparison. Instead of scoring individual actions $a_t$, the model learns a preference function $P(\tau_1 \succ \tau_2)$ over complete trajectories $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)$. This requires training data consisting of trajectory pairs with human annotations about which complete sequence better achieves the intended goal.

The key insight is that coherent long-horizon behavior emerges from understanding how individual actions contribute to overall success, rather than optimizing each step in isolation. This enables evaluation of complex behaviors like tool use, multi-step reasoning, and planning consistency that cannot be assessed from individual responses alone.
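One common way to parameterize $P(\tau_1 \succ \tau_2)$ is a Bradley-Terry model over scalar trajectory scores. The sketch below assumes that parameterization purely for illustration; the digest does not say which form the papers use, and `score_trajectory` is a hypothetical toy scorer standing in for a learned model.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    states: List[str]   # s_0 ... s_T
    actions: List[str]  # a_0 ... a_{T-1}


def score_trajectory(traj: Trajectory) -> float:
    """Toy stand-in for a learned scorer r(tau); not taken from any paper.

    It reads the whole trace at once: reaching the goal state is rewarded and
    repeated actions (a crude proxy for an incoherent plan) are penalized.
    """
    reached_goal = 1.0 if traj.states and traj.states[-1] == "goal" else 0.0
    repeats = len(traj.actions) - len(set(traj.actions))
    return 2.0 * reached_goal - 0.5 * repeats


def preference_prob(tau_1: Trajectory, tau_2: Trajectory) -> float:
    """Bradley-Terry: P(tau_1 preferred over tau_2) = sigmoid(r(tau_1) - r(tau_2))."""
    diff = score_trajectory(tau_1) - score_trajectory(tau_2)
    return 1.0 / (1.0 + math.exp(-diff))


def pairwise_loss(chosen: Trajectory, rejected: Trajectory) -> float:
    """Negative log-likelihood of the human-preferred ordering."""
    return -math.log(preference_prob(chosen, rejected))


good = Trajectory(["start", "results", "goal"], ["search", "answer"])
bad = Trajectory(["start", "results", "results"], ["search", "search"])
print(preference_prob(good, bad), pairwise_loss(good, bad))
```

The point of the sketch is that the score is a function of the complete trace, so two trajectories that share an identical first step can still be ranked differently.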

Extended Spatial Regular Expressions (q-SpRE)

Spatio-temporal reasoning in video requires expressing complex relationships between objects across time, but standard approaches lack the expressiveness to capture quantified spatial relationships or temporal sequences efficiently. Traditional computer vision methods handle individual frames well but struggle with queries like “find all frames where every car is followed by a pedestrian within 5 meters.”

Extended Spatial Regular Expressions (q-SpRE) combine the pattern matching power of regular expressions with spatial logic and quantifiers. The syntax extends standard regex operators (*, +, ?) with spatial predicates and universal/existential quantifiers: $\forall x \in \text{cars}: \exists y \in \text{pedestrians}: \text{distance}(x,y) < 5$. This allows expressing complex spatio-temporal patterns like “$(\text{car} \cdot \forall \text{within}(5m, \text{pedestrian}))^*$” meaning “zero or more occurrences of cars where every car has a nearby pedestrian.”

\[\text{q-SpRE} := \text{regex-ops} \cup \{\forall x \in C: \phi(x), \exists x \in C: \phi(x)\} \cup \text{spatial-predicates}\]

The key insight is that spatial relationships can be treated as first-class citizens in pattern matching, enabling automatic generation of training data for complex video understanding tasks.
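To make the quantified predicate concrete, here is a minimal sketch that evaluates $\forall x \in \text{cars}: \exists y \in \text{pedestrians}: \text{distance}(x,y) < 5$ frame by frame. The frame layout (a dict from class name to 2D positions in metres) and the function names are assumptions for illustration, not the q-SpRE implementation.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Frame = Dict[str, List[Point]]  # object class -> (x, y) positions in metres


def forall_exists_within(frame: Frame, outer: str, inner: str, radius: float) -> bool:
    """True if every `outer` object has at least one `inner` object closer than `radius`."""
    return all(
        any(math.dist(x, y) < radius for y in frame.get(inner, []))
        for x in frame.get(outer, [])
    )


def matching_frames(frames: List[Frame], outer: str = "cars",
                    inner: str = "pedestrians", radius: float = 5.0) -> List[int]:
    """Indices of frames that contain `outer` objects and satisfy the quantified predicate.

    Frames with no `outer` objects are skipped so the universal quantifier
    is not satisfied vacuously (a design choice of this sketch).
    """
    return [
        i for i, frame in enumerate(frames)
        if frame.get(outer) and forall_exists_within(frame, outer, inner, radius)
    ]


frames = [
    {"cars": [(0.0, 0.0)], "pedestrians": [(3.0, 4.0)]},                # exactly 5 m: no match
    {"cars": [(0.0, 0.0), (10.0, 0.0)], "pedestrians": [(1.0, 0.0), (12.0, 0.0)]},
    {"cars": [(0.0, 0.0), (10.0, 0.0)], "pedestrians": [(1.0, 0.0)]},   # second car is alone
    {"pedestrians": [(0.0, 0.0)]},                                      # no cars at all
]
print(matching_frames(frames))
```

A regex-style operator such as `*` would then be applied over the resulting per-frame truth values rather than over raw pixels.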

Reading Guide

The trajectory evaluation work in Plan-RewardBench directly relates to the embodied AI foundation models in HY-Embodied-0.5, as both address the challenge of evaluating complex multi-step agent behaviors. The spatio-temporal reasoning capabilities of FESTS complement the long-horizon simulation focus of OccSim, with both tackling the temporal consistency problem in different domains. The perception-reasoning decoupling in PRCO provides a methodological foundation that could enhance the multi-modal capabilities demonstrated across the embodied AI and simulation papers.

Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

Authors: Jiaxuan Wang, Yulan Hu, Wenjin Yang, Zheng Pan et al. (6 authors) · Institution: Nanjing University, Alibaba Group · Category: cs.AI

Plan-RewardBench introduces trajectory-level preference evaluation for tool-augmented agents, revealing that current reward models struggle with long-horizon planning consistency and tool grounding across 1,171 carefully constructed preference pairs.

Practical Takeaway: If you’re building agentic systems with tool integration, this benchmark reveals critical evaluation gaps in current reward models. The systematic hard negative construction methodology (combining natural rollouts, rule-based perturbations, and minimal edits) provides a reusable recipe for generating training data for trajectory-level RMs. Most importantly, the results show that even strong LLM judges struggle with long-horizon planning consistency and fall below 70% accuracy on complex multi-turn scenarios, which highlights the need for specialized training rather than relying on general-purpose models as trajectory evaluators in production RL loops.

Tags: reward_modeling agent_evaluation tool_use trajectory_evaluation preference_learning RLHF planning benchmark


Task & Setting

Complex agentic systems increasingly rely on multi-step tool-integrated reasoning (TIR), where agents must plan, execute, and recover across long horizons. Traditional reward model evaluation focuses on response-level preferences, missing critical failures in planning consistency, error recovery, and tool grounding that emerge in trajectory-level interactions.

The task evaluates trajectory-level preference judgment in tool-augmented environments. Input consists of tool environment T (schemas and descriptions), multi-turn user interactions, and two candidate trajectories (τA, τB) containing interleaved assistant messages, tool calls, and tool responses. The objective is pairwise preference classification:

\[P(\text{prefer}(\tau_A, \tau_B) | T, \text{context})\]

Success is measured by pairwise accuracy against human-validated gold labels across four scenario families: Safety Refusal, Tool-Irrelevance/Unavailability, Complex Planning, and Robust Error Recovery.

Plan-RewardBench contains 1,171 preference pairs spanning trajectories with 2-64 turns (mean 10.6). Complex Planning dominates with 484 pairs across difficulty/horizon splits, while other families provide 275-361 pairs each.
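Because the scenario families differ in size, a macro-average weights each family equally instead of each pair. The sketch below shows that computation under the assumption that the macro-average is an unweighted mean of per-family pairwise accuracies; the record layout is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def macro_pairwise_accuracy(
    records: Iterable[Tuple[str, str, str]]
) -> Tuple[Dict[str, float], float]:
    """records: (family, gold_label, predicted_label), labels being 'A' or 'B'.

    Returns per-family pairwise accuracy and the unweighted mean across families.
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for family, gold, pred in records:
        totals[family] += 1
        hits[family] += int(gold == pred)
    per_family = {family: hits[family] / totals[family] for family in totals}
    macro = sum(per_family.values()) / len(per_family)
    return per_family, macro


records = [
    ("Safety Refusal", "A", "A"),
    ("Safety Refusal", "B", "A"),
    ("Complex Planning", "A", "A"),
    ("Complex Planning", "B", "B"),
    ("Complex Planning", "A", "A"),
    ("Complex Planning", "B", "A"),
]
per_family, macro = macro_pairwise_accuracy(records)
print(per_family, macro)
```

In this toy data the micro-average (4 of 6 correct) and the macro-average differ, which is why it matters which one a leaderboard reports.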

Architecture & Method

Multi-source trajectory generation: Natural rollouts from diverse agents (Qwen-Agent, OpenAI-Agent) using different base models, temperatures, and policies to capture realistic success/failure modes.
Hard negative construction: Three complementary approaches: natural negatives from multi-model rollouts (70%), rule-based perturbations (22%), and minimal-edit perturbations (8%) that preserve tool calls while degrading assistant reasoning.
Multi-LLM judge scoring: K=3 judge panel scores each trajectory (1-5 scale) with family-specific rubrics, aggregated by median score and majority-vote diagnostics.
Meta-review filtering: Separate meta-review pass when score ranges ≥2 or critical tags conflict, with ambiguous cases discarded.
Bias-controlled pairing: Difficulty control pairs strong trajectories (Chosen) with lower-ranked candidates, balanced across HardPair (gap=1) and EasyPair (gap≥2). Length/format stratification prevents superficial preference exploitation.

The core contribution is the systematic construction of confusable hard negatives that isolate semantic planning failures rather than surface-level cues.
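The scoring and pairing rules above can be sketched in a few lines. This is a simplified illustration: it assumes the HardPair/EasyPair gap is measured on the aggregated median scores, and it ignores the critical-tag conflict trigger, neither of which is spelled out here. Function names are hypothetical.

```python
from statistics import median
from typing import List, Optional, Tuple


def aggregate_scores(scores: List[int]) -> Tuple[float, bool]:
    """Median of the judge panel, plus whether the score range (>= 2) triggers meta-review."""
    needs_meta_review = max(scores) - min(scores) >= 2
    return median(scores), needs_meta_review


def pair_type(chosen_score: float, rejected_score: float) -> Optional[str]:
    """Classify a (chosen, rejected) pair by score gap; None if it is not a valid pair."""
    gap = chosen_score - rejected_score
    if gap < 1:
        return None  # chosen must outrank rejected by at least one point
    return "HardPair" if gap == 1 else "EasyPair"


print(aggregate_scores([4, 4, 5]))  # tight panel
print(aggregate_scores([2, 4, 5]))  # wide spread, flagged for meta-review
print(pair_type(4, 3), pair_type(5, 2))
```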

Training Recipe

Data collection:
- Source: Toucan dataset with MCP tool registries and executed responses
- Scale: 1,171 preference pairs across 4 scenario families
- Filtering: Lightweight sanity checks remove malformed traces and execution failures
Multi-LLM labeling:
- Judge panel: K=3 judges per trajectory using family-specific rubrics
- Aggregation: Median score, majority-vote diagnostics
- Meta-review: Triggered when score range ≥2 or critical tag conflicts
- Validation: Independent pairwise judge confirms preference direction
Human audit:
- Annotators: 2 independent human judges on stratified subset
- Agreement: Cohen’s κ ∈[0.71, 0.86] across families
- Resolution: Third senior annotator for disagreements
Quality control:
- Generator-judge disjoint sets prevent leakage
- Difficulty/bias controls in pair assembly
- Post-check filtering retains only consistent pairs

Training details for base trajectory generators and judge models not reported.
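For readers who want to reproduce the agreement statistic on their own annotations, Cohen’s κ compares observed agreement with the agreement expected by chance. The sketch below is the standard two-rater formula; the label lists are made-up toy data, not the benchmark’s annotations.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both raters must label the same non-empty set of items")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        counts_a[label] * counts_b[label] for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


rater_1 = ["A", "A", "B", "B", "A", "B", "A", "B", "A", "A"]
rater_2 = ["A", "A", "B", "B", "A", "B", "A", "A", "A", "A"]
print(round(cohens_kappa(rater_1, rater_2), 3))
```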

Novelty & Lineage

Prior work:

RewardBench/RewardBench2 (Lambert et al. 2025, Malik et al. 2025) - response-level RM evaluation across chat, reasoning, safety
FC-RewardBench (Agarwal et al. 2025) - tool-call correctness evaluation in single-turn settings
Agent-RewardBench (Men et al. 2025) - step-level multimodal agent evaluation

Delta: This paper extends evaluation from response/step-level to full trajectory-level preferences in text-only tool-augmented settings. The key additions are:
long-horizon planning consistency evaluation
systematic hard negative construction methodology
trajectory-level error recovery assessment.

Applied-specific assessment:
- Architectural idea: The multi-source hard negative construction (natural + rule-based + minimal-edit) is a solid engineering contribution but follows established preference data curation practices
- Benchmark gains: Not applicable; this is a benchmark paper rather than a method
- Comparisons: Fair evaluation protocol with bias controls and human validation (κ > 0.7)
- Scale dependency: The benchmark construction process would generalize to other tool environments
Verdict: INCREMENTAL — Solid extension of existing RM benchmarks to trajectory-level evaluation with good methodology, but represents expected evolution rather than fundamental innovation.

Benchmarks & Results

Overall macro-average: Best model Qwen-Plus achieves 69.96%, with competitive scalar RM Inf-ORM-Llama3.1-70B at 69.21%
Complex Planning Multi-turn Easy: Qwen3-4B-Instruct leads at 75.00%, Qwen3-30B-A3B-Instruct at 72.02%
Complex Planning Multi-turn Hard: Best performance Qwen3-4B-Instruct at 76.56%, DeepSeek-V3.2-Exp at 61.58%
Complex Planning Single-turn Easy: Multiple models cluster around 79-84%, with Qwen3-235B-A22B-Instruct-2507 at 84.55%
Complex Planning Single-turn Hard: DeepSeek-V3.2-Exp leads at 74.84%, Qwen-Plus at 74.68%
Robust Error Recovery: Qwen3-235B-A22B-Thinking-2507 at 78.92%, Gemini-3-Flash at 78.43%
Safety Refusal: GPT-5 dominates at 84.80%, wide performance variance (40.69-84.80%) among LLM judges
Tool Irrelevance: Gemini-3-Flash leads at 75.55%, most models cluster 60-70%

Results reveal no single model dominates all categories, with sharp performance degradation on long-horizon trajectories (>32k tokens).

Compute & Efficiency

Model sizes: Evaluated models range from 4B (Qwen3-4B-Instruct) to 235B parameters (Qwen3-235B variants), with most scalar RMs at 7B-70B scale
Training compute: Not reported for benchmark construction or evaluation
Inference speed: Not reported, though pairwise protocols inherently require 2x evaluation vs pointwise
Memory footprint: Not specified, but trajectory lengths up to 64 turns (max 29,622 tokens) indicate substantial context requirements
Deployment practicality: Benchmark designed for offline evaluation; requires no tool re-execution or external service access after construction, supporting practical deployment

Real-World Applicability

Tool environments: Built on Toucan dataset with realistic MCP (Model Context Protocol) tool registries representing real-world APIs and services
Trajectory validation: Multi-model natural rollouts (70% of data) capture authentic agent behaviors and failure modes in tool-integrated scenarios
Human validation: Substantial inter-annotator agreement (Cohen’s κ ∈[0.71, 0.86]) confirms alignment with human judgment across scenario families
Production relevance: Addresses critical deployment challenges including safety refusal quality, error recovery, and tool hallucination detection in multi-turn agentic systems
Limitation scope: Current release focuses on English text-only scenarios; multimodal and multi-agent extensions acknowledged as important future work

Limitations & Failure Modes

EVALUATION: Gold labels for complex planning contain inherent subjectivity despite high inter-annotator agreement
ENGINEERING: MCP-style tool registries may not cover proprietary APIs used in production systems
FUNDAMENTAL: Scenario distribution is intentionally non-uniform; Safety Refusal is smaller because high-quality refusal negatives are difficult to construct
ENGINEERING: Current release limited to English text-based tool traces, excluding multimodal scenarios
EVALUATION: Single-language evaluation may not generalize to multilingual agentic systems

Failure modes:
Length sensitivity collapse: All evaluator families show sharp performance degradation beyond 32k tokens, with some falling below random chance
Planning logic blindness: Evaluators struggle to distinguish tool-grounded fabrication from valid reasoning, often rewarding effort over correctness

Spatio-Temporal Grounding of Large Language Models from Perception Streams

Authors: Jacob Anderson, Bardh Hoxha, Georgios Fainekos, Hideki Okamoto et al. (5 authors) · Institution: Toyota Motor North America · Category: cs.RO

FESTS uses extended spatial regular expressions to automatically generate training data for fine-tuning LLMs on complex spatio-temporal video reasoning, achieving large F1 improvements on autonomous driving perception tasks.

Practical Takeaway: If you’re working on video understanding or autonomous driving perception, this paper demonstrates a clever approach to automatically generate training data for spatio-temporal reasoning without manual annotation. The key insight is using formal pattern matching (extended SpRE) to create unlimited (query, answer, explanation) tuples from any structured perception dataset. While the architectural contribution is modest, the data generation pipeline could be valuable for training video-language models on complex temporal reasoning tasks. However, you’ll need high-quality pre-labeled perception data and may want to test generalization across different model architectures before deployment.

Tags: spatio-temporal-reasoning video-understanding autonomous-driving formal-methods LLM-fine-tuning perception object-tracking spatial-logic


Task & Setting

Real-world context: Embodied AI systems for autonomous driving, robotics, and household assistance must understand how objects move and interact in 3D space over time. Current LLMs and VLMs struggle with fine-grained spatial relations, metric distances, and temporal orderings—critical failures for safety-critical applications like autonomous vehicles.

Task definition: Given structured video perception logs (object classes, bounding boxes, optional depth/IDs), the goal is to answer complex spatio-temporal queries like “find all frames where the same car and bus start >10m apart and come within 1m within 20 frames.” Input: perception stream with object annotations. Output: frame-level binary matches and natural language explanations. The objective is to maximize frame-level F1 score:

\[F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]

Evaluation criteria: Frame-level F1, exact match accuracy, and segment-level F1 across queries of varying temporal length (1-16 frames) and complexity (sequence, spatial, temporal, metric, existential).
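As a quick reference, frame-level F1 and exact match can be computed from binary per-frame labels as follows. This is a generic sketch; the convention of returning 1.0 when neither gold nor prediction contains a positive frame is an assumption, not something specified above.

```python
from typing import Sequence


def frame_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 over per-frame binary match labels (1 = frame matches the query)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same frames")
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if p and not g)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        # Assumed convention: an all-negative clip predicted as all-negative scores 1.0.
        return 1.0 if fp == 0 and fn == 0 else 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def exact_match(gold: Sequence[int], pred: Sequence[int]) -> bool:
    """True only if every frame label agrees."""
    return list(gold) == list(pred)


gold = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 1]
print(frame_f1(gold, pred), exact_match(gold, pred))
```

Exact match is much stricter than F1: a single wrong frame zeroes it out, which is one reason the two metrics can diverge sharply for the same model.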

Dataset: FESTS benchmark with 27K+ automatically generated (query, frames, match, explanation) tuples from Woven Perception dataset (180 scenes, 1.2K+ perception streams, 126 frames per stream).

Architecture & Method

Extended Spatial Regular Expression (q-SpRE) language: Combines regex syntax with S4u spatial logic, adding universal (∀) and existential (∃) quantifiers for object tracking across frames
Query synthesis pipeline: Templates generate diverse q-SpRE patterns covering 5 categories (sequence, spatial, temporal, metric, existential)
STREM matching framework: Deterministic Finite Automata-based pattern matcher that verifies q-SpRE queries against structured perception logs
Natural language explanation generator: Converts formal matches into readable explanations for training supervision
Fine-tuning architecture: Qwen2.5-3B-Instruct LLM with LoRA adaptation (rank=16, scaling=32, dropout=0.05) on attention and MLP layers
Two-stage training: Stage 1 uses supervised fine-tuning with cross-entropy loss. Stage 2 adds PPO reinforcement learning with hierarchical reward function combining structural validity, match accuracy (mAP IoU), and reasoning fidelity (sentence similarity).
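One plausible reading of the "hierarchical" reward is that structural validity acts as a gate before the other two terms are scored. The sketch below follows that reading. The gate value, the weights, the frame-set IoU used for match accuracy, and the `difflib` string ratio used in place of a sentence-similarity model are all stand-in assumptions, not the paper’s reward.

```python
from difflib import SequenceMatcher
from typing import Any, Sequence


def hierarchical_reward(
    output: Any,
    gold_frames: Sequence[int],
    gold_explanation: str,
    w_match: float = 0.5,   # hypothetical weight
    w_reason: float = 0.5,  # hypothetical weight
) -> float:
    """Gate on structure first, then blend match accuracy and reasoning fidelity."""
    # Level 1: structural validity. Malformed outputs get a fixed penalty.
    if not (isinstance(output, dict) and "frames" in output and "explanation" in output):
        return -1.0

    # Level 2: match accuracy, approximated here by IoU between frame-index sets.
    pred, gold = set(output["frames"]), set(gold_frames)
    union = pred | gold
    match = len(pred & gold) / len(union) if union else 1.0

    # Level 3: reasoning fidelity, approximated by a character-level similarity ratio.
    fidelity = SequenceMatcher(None, output["explanation"], gold_explanation).ratio()

    return w_match * match + w_reason * fidelity


gold_text = "The car approaches the bus between frames 2 and 4."
print(hierarchical_reward("not a dict", [2, 3, 4], gold_text))
print(hierarchical_reward({"frames": [1, 2, 3], "explanation": gold_text}, [2, 3, 4], gold_text))
```

Gating on structure first means a policy cannot earn partial credit for fluent explanations attached to unparseable answers.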

Training Recipe

Stage 1 - Supervised Fine-Tuning: 27K (query, frames, match, explanation) tuples from Woven Perception dataset. LoRA with AdamW 8-bit optimizer, learning rate 1×10^-5, cosine schedule, effective batch size 60, 5 epochs. Training time and hardware not reported.
Stage 2 - Reinforcement Learning: PPO on top of Stage 1 model. Custom hierarchical reward function evaluating structural validity, match accuracy, and reasoning fidelity. AdamW 8-bit optimizer, learning rate 1×10^-6, effective batch size 4, KL divergence coefficient 0.05, 1 PPO epoch with 4 optimization epochs per batch. Training time and hardware not reported.

Data source: Real perception data from Woven Perception (180 autonomous driving scenes). No synthetic perception data is reported; the query, match, and explanation supervision is generated automatically from these real logs.

Novelty & Lineage

Prior work:

SpatialVLM (Chen et al. 2024) and SpatialBot (Cai et al. 2024) address spatial reasoning gaps with geometric priors.
V-STaR (Li et al. 2025) shows purely textual fine-tuning improves temporal reasoning in video-LLMs.
NSVS-TL (Choi et al. 2024) uses temporal logic for long-term video reasoning.

Delta: This paper adds (1) extension of SpRE with universal/existential quantifiers, (2) automated generation of verifiable spatio-temporal supervision without human labels, (3) joint spatial-temporal reasoning training with explanations.

Applied-specific assessment: The architectural contribution (quantified SpRE) is a modest extension of existing formal methods. The key insight—using formal pattern matching to generate unlimited training data—is clever but incremental. Benchmark gains are substantial (+39 F1 points) but limited to a single 3B model on one dataset. The comparison to GPT-4.1 uses different training data and compute scales, making it potentially unfair. The approach requires high-quality pre-labeled perception data, limiting scalability.

Verdict: INCREMENTAL — solid engineering contribution that automatically generates training data for spatio-temporal reasoning, but builds incrementally on known formal methods and fine-tuning approaches.

Benchmarks & Results

Frame-level F1 (overall): Qwen2.5-3B baseline 48.5%, Q-SFT 80.4%, Q-SFT+RL 87.5%, GPT-4.1 84.8% (Q-SFT+RL is +39.0 points over baseline)
Exact Match (overall): Qwen2.5-3B baseline 25.0%, Q-SFT 56.6%, Q-SFT+RL 64.5%, GPT-4.1 35.0% (Q-SFT+RL is +39.5 points over baseline)
Sequence queries F1: Baseline 57.0%, Q-SFT+RL 96.2% (+39.2 points)
Spatial queries F1: Baseline 41.5%, Q-SFT+RL 81.9% (+40.4 points)
Temporal queries F1: Baseline 50.4%, Q-SFT+RL 86.6% (+36.2 points)
Metric queries F1: Baseline 45.7%, Q-SFT+RL 90.3% (+44.6 points)
Existential queries F1: Baseline 48.0%, Q-SFT+RL 82.8% (+34.8 points)

Results show consistent large improvements across all query types and frame lengths (1-16 frames). Performance gaps with GPT-4.1 remain on existential and spatial queries.

Compute & Efficiency

Model size: 3 billion parameters (Qwen2.5-3B-Instruct base model)
Training compute: Not reported (GPU hours, specific hardware unspecified)
Inference speed/latency: Not reported
Memory footprint: Uses LoRA fine-tuning to reduce memory requirements, specific values not reported
Deployment practicality: Claimed to be “two orders of magnitude smaller” than GPT-4.1, making it more deployable for real-time applications, but lacks concrete deployment metrics

Real-World Applicability

Real-world data testing: Uses Woven Perception dataset with authentic autonomous driving scenarios (180 scenes, 7 camera sensors per scene)
Domain focus: Autonomous vehicle perception with object detection and tracking requirements
Sim-to-real: No discussion of simulation-to-real transfer
Production integration: Framework described as enabling “unlimited training data” for any structured perception dataset, but no production deployment results reported
Hardware experiments: No robot or vehicle deployment experiments described beyond dataset evaluation

Limitations & Failure Modes

ENGINEERING: Requires pre-labeled perception data with object classes and bounding boxes, limiting scalability to unlabeled video
ENGINEERING: Manual curation required for query template creation, not fully automated
EVALUATION: Single model evaluation (only Qwen2.5-3B), generalization across model families unclear
FUNDAMENTAL: Missing automated translation from natural language queries to SpRE syntax
ENGINEERING: Accuracy depends entirely on quality of source perception labels

Failure modes:
Existential queries involving extended object tracking show degraded performance vs. GPT-4.1
Spatial reasoning queries lag behind simpler temporal patterns despite training supervision.

HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

Authors: Tencent Robotics X, HY Vision Team: Xumin Yu et al. (23 authors) · Institution: Tencent · Category: cs.CV

HY-Embodied-0.5 introduces Mixture-of-Transformers architecture and comprehensive embodied training to create foundation models optimized for real-world agent deployment, achieving strong performance across 22 benchmarks with an efficient 2B parameter variant.

Practical Takeaway: If you’re building embodied AI systems, this work provides a solid engineering template combining MoT architecture with comprehensive embodied training data. The key practical insights are: (1) modality-specific parameters and attention patterns can improve visual reasoning without major architectural changes, (2) systematic embodied data curation across perception, spatial reasoning, and planning tasks is crucial, (3) iterative post-training with task-aware rewards and on-policy distillation can effectively transfer capabilities from large to small models. The 2B edge model achieving competitive performance makes this approach practically relevant for resource-constrained deployment, though you’ll need substantial training data and compute to replicate the results.

Tags: embodied-ai vision-language-models robotics spatial-reasoning mixture-of-transformers visual-grounding trajectory-prediction affordance-learning


Task & Setting

Real-world embodied agents require sophisticated visual perception and reasoning capabilities to navigate and manipulate physical environments. However, existing Vision-Language Models (VLMs) excel primarily on static benchmarks but lack the fine-grained spatial perception and embodied reasoning necessary for real-world deployment.

The task is to develop foundation models that can process visual inputs (images, videos) and natural language instructions to produce responses for embodied tasks including spatial localization, affordance prediction, trajectory planning, and manipulation reasoning. Input consists of RGB images/video frames with resolution support up to native resolution, paired with natural language queries. Output formats include bounding boxes (normalized 0-1000 coordinates), point coordinates, trajectories (up to 15 waypoints), discrete tokens, and natural language responses.
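If you consume these outputs downstream, the normalized coordinates need to be mapped back to pixels. The sketch below assumes boxes are ordered `(x1, y1, x2, y2)` and that 1000 corresponds to the full image width or height; both are assumptions about the format rather than details confirmed here.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def denormalize_point(point: Point, width: int, height: int, scale: int = 1000) -> Point:
    """Map a point from the 0..scale grid to pixel coordinates."""
    x, y = point
    return (x * width / scale, y * height / scale)


def denormalize_box(box: Box, width: int, height: int, scale: int = 1000) -> Box:
    """Map an (x1, y1, x2, y2) box from the 0..scale grid to pixel coordinates."""
    x1, y1 = denormalize_point((box[0], box[1]), width, height, scale)
    x2, y2 = denormalize_point((box[2], box[3]), width, height, scale)
    return (x1, y1, x2, y2)


def denormalize_trajectory(points: List[Point], width: int, height: int,
                           max_waypoints: int = 15) -> List[Point]:
    """Convert a predicted trajectory, keeping at most `max_waypoints` waypoints."""
    return [denormalize_point(p, width, height) for p in points[:max_waypoints]]


print(denormalize_box((250, 500, 750, 1000), width=640, height=480))
```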

Success is measured across 22 benchmarks spanning visual perception (CV-Bench, DA-2K), spatial understanding (3DSRBench, ViewSpatial, etc.), and embodied understanding (ERQA, RoboBench, ShareRobot). Metrics include accuracy, IoU for grounding tasks, Dynamic Fréchet Distance for trajectories, and task-specific measures.

The evaluation covers existing public benchmarks with comprehensive coverage of embodied capabilities including perception, spatial reasoning, affordance recognition, and planning.

Architecture & Method

HY-ViT 2.0: 400M parameter native-resolution Vision Transformer with arbitrary resolution support, distilled from a larger internal model
Mixture-of-Transformers (MoT) architecture: Separate FFN and QKV parameters for vision and language tokens, enabling modality-specific computation
Visual attention mechanism: Full bidirectional attention for visual tokens, causal attention for text tokens
Visual latent tokens: Learnable tokens appended to each visual input, supervised by global features from teacher ViT
Three-loss training objective during pretraining:
\[L_{total} = L_{llm} + L_{vision} + L_{global}\]
where vision loss predicts next discrete visual codes:
\[L_{vision} = -\frac{1}{N_v}\sum_{i=1}^{N_v} \log p_i(z_i)\]
and global loss aligns latent tokens with teacher features:
\[L_{global} = -\frac{f_{latent}^\top f_{teacher}}{\|f_{latent}\| \, \|f_{teacher}\|}\]
Two model variants: MoT-2B (2B activated, 4B total parameters) and MoE-A32B (32B activated, 407B total parameters)
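The two auxiliary losses are simple enough to write out directly. The sketch below implements the formulas above on plain Python lists for readability; a real implementation would operate on batched tensors, and the function names are illustrative.

```python
import math
from typing import Sequence


def vision_loss(target_code_probs: Sequence[float]) -> float:
    """L_vision: mean negative log-probability assigned to the true visual codes z_i."""
    return -sum(math.log(p) for p in target_code_probs) / len(target_code_probs)


def global_loss(f_latent: Sequence[float], f_teacher: Sequence[float]) -> float:
    """L_global: negative cosine similarity between latent-token and teacher features."""
    dot = sum(a * b for a, b in zip(f_latent, f_teacher))
    norm_latent = math.sqrt(sum(a * a for a in f_latent))
    norm_teacher = math.sqrt(sum(b * b for b in f_teacher))
    return -dot / (norm_latent * norm_teacher)


def total_loss(l_llm: float, l_vision: float, l_global: float) -> float:
    """L_total = L_llm + L_vision + L_global (unweighted sum, as written above)."""
    return l_llm + l_vision + l_global


print(vision_loss([0.5, 0.5]))               # equals log(2)
print(global_loss([1.0, 0.0], [2.0, 0.0]))   # aligned features give the minimum, -1
```

Note that the global loss is bounded in [-1, 1] and is minimized when the latent tokens point in the same direction as the teacher features, regardless of magnitude.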

Training Recipe

Pre-training stage: 600B+ tokens (389B general + 236B embodied/perception data)
- Data: Web-scale general data, spatial/robotics data (43% of embodied), visual perception tasks
- Optimizer: AdamW, base LR 5e-5, ViT LR 5e-6, weight decay 1e-4, batch size 256
- Context length: 32k tokens with packing, ViT gradients updated every 5 steps
- Hardware: Not reported
Mid-training stage: 30M instances (general:embodied:spatial = 12:5:3)
- Data: Higher-quality embodied and spatial data with unified coordinate formats
- Optimizer: Same base LR with cosine decay, ViT parameters frozen
- Training: Short chains for MoE-A32B, mixed long/short chains for MoT-2B
Supervised Fine-tuning: ~100k cold-start Chain-of-Thought samples
- Data: Human-model collaborative CoT construction with LLM evaluation
- Training: No sequence packing, LR 5e-5, standard cross-entropy loss
Reinforcement Learning: GRPO with task-aware rewards, 50k samples per round
- Training: Group size 16, LR 8e-7, asymmetric clipping [0.8,1.35], 5 epochs
- Hardware: Gradient checkpointing, parameter offloading enabled
Iterative post-training: Alternating RL and rejection sampling fine-tuning cycles
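For intuition about the RL stage, the sketch below shows the two ingredients GRPO is usually described with: group-relative advantages and a clipped surrogate. It assumes the interval [0.8, 1.35] bounds the policy probability ratio and that advantages are standardized within each sampled group; the paper’s exact objective (for example any KL term) is not reproduced here.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of sampled responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(ratio: float, advantage: float,
                      low: float = 0.8, high: float = 1.35) -> float:
    """PPO-style surrogate with asymmetric clipping of the probability ratio."""
    clipped_ratio = min(max(ratio, low), high)
    # Taking the minimum keeps the pessimistic (lower) estimate of the improvement.
    return min(ratio * advantage, clipped_ratio * advantage)


rewards = [1.0, 0.0, 0.0, 1.0]          # e.g. task-aware rewards for a group of 4
print(group_advantages(rewards))
print(clipped_objective(1.5, 1.0))      # upside capped at the upper bound
print(clipped_objective(0.5, -1.0))     # downside capped at the lower bound
```

The wider upper bound lets the policy raise the probability of good responses more aggressively than it is allowed to lower it.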

Novelty & Lineage

Prior work:

Qwen3-VL (2025) - General VLM with thinking capabilities, achieves strong performance on standard benchmarks
RoboBrain2.5 (2026) - Specialized embodied VLM with 4B parameters targeting robot control tasks
MiMo-Embodied (2025) - 7B parameter embodied foundation model

Delta: This paper adds:
Mixture-of-Transformers architecture with modality-specific parameters and attention patterns
visual latent tokens with multi-loss supervision
comprehensive embodied pre-training data (100M+ samples)
iterative post-training combining RL with rejection sampling
on-policy distillation from large to small models.

Assessment:
- Architectural novelty: MoT for embodied tasks is a reasonable extension of existing ideas, not fundamentally novel
- Benchmark gains: Consistent improvements across 22 benchmarks, winning 16/22 for the 2B model is meaningful
- Fair comparisons: Uses same evaluation protocol, though baseline selection may favor their approach
- Scale dependence: Gains likely depend on proprietary training data and compute scale
The core contributions are solid engineering advances rather than breakthrough innovations. The systematic application of MoT, comprehensive embodied data curation, and multi-stage training represent competent incremental progress.

Verdict: INCREMENTAL — solid engineering combining known techniques with extensive embodied data and training, yielding consistent but expected improvements.

Benchmarks & Results

CV-Bench: 89.2% vs previous best 88.8% (MiMo-Embodied 7B), +0.4%
DA-2K: 92.3% vs 79.4% (RoboBrain 2.5 4B), +12.9%
ERQA: 54.5% vs 47.3% (Qwen3-VL 4B), +7.2%
EmbSpatial-Bench: 82.8% vs 80.7% (Qwen3-VL 4B), +2.1%
RoboBench-MCQ: 49.2% vs 45.8% (Qwen3-VL 4B), +3.4%
RoboBench-Planning: 54.2% vs 58.7% (MiMo-Embodied 7B), -4.5%
RoboSpatial-Home: 55.7% vs 63.2% (Qwen3-VL 4B), -7.5%
ShareRobot-Affordance: 26.8% vs 25.5% (tied), +1.3%
ShareRobot-Trajectory: 73.3% vs 81.4% (RoboBrain 2.5), -8.1%
Ego-Plan2: 45.5% vs 52.6% (RoboBrain 2.5), -7.1%
3DSRBench: 57.0% vs 44.8% (RoboBrain 2.5), +12.2%
All-Angles Bench: 55.1% vs 49.0% (MiMo-Embodied), +6.1%
MindCube: 66.3% vs 36.2% (MiMo-Embodied), +30.1%
MMSI-Bench: 33.2% vs 31.9% (MiMo-Embodied), +1.3%
RefSpatial-Bench: 45.8% vs 56.0% (RoboBrain 2.5), -10.2%
SAT: 76.7% vs 78.7% (MiMo-Embodied), -2.0%
SIBench-mini: 58.2% vs 53.1% (MiMo-Embodied), +5.1%
SITE-Bench-Image: 62.7% vs 61.0% (Qwen3-VL 4B), +1.7%
SITE-Bench-Video: 63.5% vs 58.9% (MiMo-Embodied), +4.6%
ViewSpatial: 53.1% vs 41.6% (Qwen3-VL 4B), +11.5%
VSIBench: 60.5% vs 55.2% (Qwen3-VL 4B), +5.3%
Where2Place: 68.0% vs 65.0% (RoboBrain 2.5), +3.0%

Overall: 58.0% average, outperforming Qwen3-VL-4B by 10.2% and RoboBrain2.5-4B by 8.6%. Mixed results with losses on 6 benchmarks, notably trajectory and planning tasks.
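The win/loss tally can be checked directly from the per-benchmark deltas listed above. The snippet below only recounts those numbers; it adds no new results.

```python
# Score deltas versus the best listed baseline, copied from the list above.
deltas = {
    "CV-Bench": 0.4, "DA-2K": 12.9, "ERQA": 7.2, "EmbSpatial-Bench": 2.1,
    "RoboBench-MCQ": 3.4, "RoboBench-Planning": -4.5, "RoboSpatial-Home": -7.5,
    "ShareRobot-Affordance": 1.3, "ShareRobot-Trajectory": -8.1, "Ego-Plan2": -7.1,
    "3DSRBench": 12.2, "All-Angles Bench": 6.1, "MindCube": 30.1, "MMSI-Bench": 1.3,
    "RefSpatial-Bench": -10.2, "SAT": -2.0, "SIBench-mini": 5.1,
    "SITE-Bench-Image": 1.7, "SITE-Bench-Video": 4.6, "ViewSpatial": 11.5,
    "VSIBench": 5.3, "Where2Place": 3.0,
}

wins = sorted(name for name, delta in deltas.items() if delta > 0)
losses = sorted(name for name, delta in deltas.items() if delta < 0)
largest_gain = max(deltas, key=deltas.get)
largest_drop = min(deltas, key=deltas.get)

print(f"{len(wins)} wins, {len(losses)} losses out of {len(deltas)} benchmarks")
print("largest gain:", largest_gain, "| largest drop:", largest_drop)
print("losses:", ", ".join(losses))
```

The losses concentrate on planning, trajectory, and referring-style spatial benchmarks, which matches the limitations noted further down.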

Compute & Efficiency

Model size: MoT-2B (2B activated, 4B total parameters), MoE-A32B (32B activated, 407B total parameters)
Training compute: Over 600B tokens pre-training, 30M mid-training samples, multiple RL/RFT cycles; specific GPU hours not reported
Inference speed: Designed for edge deployment (MoT-2B), real-time performance claimed but no specific latency numbers provided
Memory footprint: 400M parameter ViT encoder; modality-specific parameters add computational overhead that the authors describe as “negligible”
Deployment practicality: Edge-optimized 2B variant specifically designed for real-time deployment, demonstrated in real robot control experiments with 80%+ success rates on manipulation tasks

Real-World Applicability

Robot control experiments: Vision-Language-Action (VLA) model trained using HY-Embodied-0.5 as foundation, achieving 80-85% success rates on real manipulation tasks
Physical evaluation tasks: Packing, hanging, stacking operations with success rates of 85%, 80%, and 85% respectively
Hardware deployment: Edge-optimized 2B model specifically designed for real-time robotic applications
Environment testing: Real-world physical manipulation scenarios, not just simulation
Sim-to-real discussion: Limited - paper focuses on direct real-world training data rather than sim-to-real transfer
Production integration: Open-sourced code and models available, but no details on actual production deployments reported

Limitations & Failure Modes

FUNDAMENTAL: Performance degradation on trajectory prediction and planning tasks compared to specialized baselines (ShareRobot-Trajectory: -8.1%, Ego-Plan2: -7.1%)
ENGINEERING: Requires proprietary training data and substantial compute resources for full reproduction
EVALUATION: Comparison methodology potentially favors their approach by using thinking mode for their model but best-of-both-modes for baselines
FUNDAMENTAL: MoT architecture doubles parameter count while providing only modest gains over standard architectures
ENGINEERING: Dependence on teacher ViT models and multi-stage distillation pipeline increases training complexity
EVALUATION: Limited analysis of failure modes on real robot tasks - only reports success rates without detailed failure analysis

Failure modes:
Likely struggles with long-horizon planning tasks requiring complex temporal reasoning
May hallucinate spatial relationships in cluttered or ambiguous visual scenes based on benchmark performance variations

OccSim: Multi-kilometer Simulation with Long-horizon Occupancy World Models

Authors: Tianran Liu, Shengwen Zhao, Mozhgan Pourkeshavarz, Weican Li et al. (5 authors) · Institution: University of Toronto · Category: cs.CV

OccSim introduces W-DiT, a geometry-aware diffusion transformer that achieves >80x improvement in stable occupancy world model generation, enabling multi-kilometer autonomous driving simulation from single initial frames.

Practical Takeaway: If you’re building autonomous driving simulators, this work demonstrates that explicitly incorporating 3D geometric priors (rigid transformations) into diffusion models can dramatically improve long-horizon stability over naive video generation approaches. The W-DiT architecture with mask-injected conditioning and SNR-weighted perception loss is worth implementing for occupancy-based simulation tasks. The key insight is that occupancy sequences have inherent geometric structure that should be exploited rather than forcing networks to learn camera motions implicitly. For practitioners, the ability to generate multi-kilometer consistent environments from single frames could significantly reduce dependency on expensive HD map collection.

Tags: autonomous_driving simulation occupancy_prediction world_models diffusion_models long_horizon_generation 3D_perception multi_agent_systems


Task & Setting

Real-world context: Autonomous driving simulation has been limited by reliance on pre-recorded logs or HD maps, constraining generation capabilities to existing dataset scales. This creates a fundamental trade-off between sensor realism, diversity, and interactivity in driving simulators.

Task definition: Generate large-scale, consistent 3D occupancy-based driving simulations from a single initial frame and future ego-actions. Input is one static occupancy frame (200×200 voxels) plus trajectory waypoints. Output is continuous multi-kilometer occupancy sequences (3000+ frames) with populated dynamic agents. The objective is stable autoregressive generation:

\[P(z_{1:K}|z_0, J_{0:K+\sigma}) = \prod_{i=1}^K P_\theta(z_i|z_{i-1}, J_{i-1:i+\sigma-1})\]

where $z_t$ represents occupancy latents and $J$ is ego-trajectory.
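
The factorization above amounts to a loop in which each new latent is conditioned on the previous latent and a sliding window of ego-trajectory waypoints. The following is a minimal sketch under stated assumptions: `step_fn` is a hypothetical stand-in for one sampling call of the world model, and the window $J_{i-1:i+\sigma-1}$ is read as inclusive on both ends.

```python
from typing import Callable, List, Sequence, TypeVar

Latent = TypeVar("Latent")
Waypoint = TypeVar("Waypoint")


def autoregressive_rollout(
    z0: Latent,
    trajectory: Sequence[Waypoint],
    num_frames: int,
    sigma: int,
    step_fn: Callable[[Latent, Sequence[Waypoint]], Latent],
) -> List[Latent]:
    """Generate z_1..z_K, each conditioned on z_{i-1} and J_{i-1 : i+sigma-1}."""
    # The last step (i = K) reads the trajectory up to index K + sigma - 1.
    if len(trajectory) < num_frames + sigma:
        raise ValueError("trajectory too short for the requested rollout")
    latents = [z0]
    for i in range(1, num_frames + 1):
        # Python slices exclude the end index, so stop at i + sigma.
        window = trajectory[i - 1 : i + sigma]
        latents.append(step_fn(latents[-1], window))
    return latents[1:]  # drop the seed frame z_0
```

Because each step consumes only the previous latent, any error made at step $i$ is carried into every later step, which is why stability over thousands of frames is the central difficulty.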

Evaluation criteria: Static realism via 2D/3D FID, KID, MMD metrics comparing conditional fidelity and unconditional realism. Generation diversity measured by Vendi scores and semantic IoU diversity. Downstream utility assessed via zero-shot 4D semantic occupancy forecasting performance (mIoU metric).

Dataset: Trained on the publicly available Occ3D-nuScenes and UniOcc datasets.

Architecture & Method

W-DiT (Warp-DiT) backbone: Novel diffusion transformer explicitly incorporating rigid 3D transformations via mask-injected conditioning mechanism instead of naive temporal concatenation from video generation.
Geometric conditioning: Forward-warp previous latent $z_t$ using rigid transformation matrices $T_{t+1}^t$ derived from ego-trajectory, creating warped latent $\hat{z}_{t+1}$ with visibility mask $M_{vis}$ and random mask $M_{rand}$.
Dual conditioning injection: Token-wise scaling/shifting parameters from both global timestep $\tau$ and spatial condition features, enabling precise spatial control during generation.
Flow matching objective with SNR-weighted perception loss:
\[L_{total} = \mathbb{E}[\|v_\theta - (z_{t+1} - \epsilon)\|_2^2 + \lambda \tau^2 \frac{M_{mask}}{|M_{mask}|} CE(O_{t+1}, D(\hat{z}_{t+1}))]\]
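
To make the two terms of the objective concrete, here is a small sketch on flat Python lists. It is illustrative only: the function names are hypothetical, the decoder output $D(\hat{z}_{t+1})$ is assumed to already be per-voxel class probabilities, and $\tau$ is treated as a scalar in $[0, 1]$ where, consistent with the velocity target $z_{t+1} - \epsilon$, values near 1 are assumed to correspond to nearly clean latents.

```python
import math
from typing import Sequence


def flow_matching_term(
    v_pred: Sequence[float], z_next: Sequence[float], eps: Sequence[float]
) -> float:
    """Squared L2 distance between predicted velocity and the target z_next - eps."""
    return sum((v - (z - e)) ** 2 for v, z, e in zip(v_pred, z_next, eps))


def masked_cross_entropy(
    probs: Sequence[Sequence[float]], labels: Sequence[int], mask: Sequence[int]
) -> float:
    """Mean of -log p(true class) over the voxels selected by a binary mask."""
    selected = [-math.log(p[y]) for p, y, m in zip(probs, labels, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


def total_loss(v_pred, z_next, eps, probs, labels, mask, tau: float, lam: float = 2.0) -> float:
    """Flow-matching term plus the perception term weighted by lam * tau^2."""
    perception = masked_cross_entropy(probs, labels, mask)
    return flow_matching_term(v_pred, z_next, eps) + lam * tau ** 2 * perception
```

Under the assumed convention, the $\tau^2$ factor shrinks the cross-entropy term at noisy timesteps, where a decoded occupancy grid would carry little meaningful semantic signal.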

Layout Generator: Compact DiT-S model learning conditional distribution $P(H \mid z_t)$ for agent placement, mapping discrete agent positions to continuous 2D heatmaps via Gaussian kernels.
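
The heatmap encoding mentioned here can be sketched as follows. This is an assumption-laden illustration rather than the authors' implementation: the kernel width `sigma`, the grid size, and the choice to combine overlapping kernels with a maximum are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def agents_to_heatmap(
    positions: Sequence[Tuple[int, int]],
    height: int,
    width: int,
    sigma: float = 1.5,
) -> List[List[float]]:
    """Render discrete (row, col) agent positions as a continuous 2D heatmap."""
    heatmap = [[0.0] * width for _ in range(height)]
    for row, col in positions:
        for y in range(height):
            for x in range(width):
                dist_sq = (y - row) ** 2 + (x - col) ** 2
                value = math.exp(-dist_sq / (2.0 * sigma ** 2))
                # Keep the strongest response so each agent peaks at exactly 1.0.
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap
```

A continuous target of this kind gives the generator a smooth signal around each agent instead of a single isolated cell.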

Map fusion: Two-pass keyframe-based fusion with morphological refinement and graph-based lane topology extraction for multi-kilometer road networks.
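
The geometric conditioning step described in this section, forward-warping the previous frame with the ego-motion and recording which cells remain observed, can be illustrated on a small 2D grid. This is a deliberately simplified sketch: it uses a planar rotation plus shift instead of full 3D rigid transformation matrices, operates on raw cells rather than latents, rounds to the nearest cell, and omits the random mask $M_{rand}$. All names are hypothetical.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]
Mask = List[List[int]]


def forward_warp(
    grid: Grid, yaw: float, shift_rows: float, shift_cols: float
) -> Tuple[Grid, Mask]:
    """Scatter each source cell to its position after a planar rigid motion."""
    h, w = len(grid), len(grid[0])
    warped = [[0.0] * w for _ in range(h)]
    visible = [[0] * w for _ in range(h)]
    centre_y, centre_x = (h - 1) / 2.0, (w - 1) / 2.0
    cos_a, sin_a = math.cos(yaw), math.sin(yaw)
    for r in range(h):
        for c in range(w):
            # Rotate about the grid centre, then translate.
            y, x = r - centre_y, c - centre_x
            new_y = cos_a * y - sin_a * x + centre_y + shift_rows
            new_x = sin_a * y + cos_a * x + centre_x + shift_cols
            tr, tc = round(new_y), round(new_x)
            if 0 <= tr < h and 0 <= tc < w:
                warped[tr][tc] = grid[r][c]  # last write wins on collisions
                visible[tr][tc] = 1
    return warped, visible
```

Cells where `visible` is 0 are regions newly entering the field of view, which is where a generative model has to synthesize content rather than copy it.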

Training Recipe

W-DiT training: 200 epochs on 4 RTX Pro 6000 GPUs, batch size 32, ~100 GPU hours on nuScenes. AdamW optimizer with cosine annealing, learning rate 3.2e-5 to 3.2e-6. 20% condition dropout for classifier-free guidance. Random masking ratio 10-50%.
Layout Generator training: 500 epochs on 8 A100 80GB GPUs, ~160 GPU hours. Same optimizer settings. Extensive spatial augmentations including agent perturbations and rigid transformations (rotations, reflections) to prevent overfitting on 28K training samples.
Loss weighting: $\lambda = 2$ for perception loss component. Noise $\epsilon$ sampled from sigmoid(N(0,I)).
Data details: Publicly available occupancy datasets, no proprietary data used.
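
For readers reproducing the recipe, the schedule and the two sources of randomness listed above can be sketched as below. The learning-rate endpoints, the 20% dropout rate, and the 10-50% masking range come from the summary; the absence of a warmup phase, the per-step granularity, and the uniform sampling of the masking ratio are assumptions.

```python
import math
import random


def cosine_lr(step: int, total_steps: int, lr_max: float = 3.2e-5, lr_min: float = 3.2e-6) -> float:
    """Cosine annealing from lr_max at step 0 down to lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def drop_condition(rng: random.Random, p_drop: float = 0.2) -> bool:
    """Return True when the conditioning should be dropped for this sample."""
    return rng.random() < p_drop


def sample_mask_ratio(rng: random.Random, low: float = 0.1, high: float = 0.5) -> float:
    """Draw a random masking ratio; uniform sampling is an assumption."""
    return rng.uniform(low, high)
```

Training with a fraction of unconditioned samples is what later allows classifier-free guidance to interpolate between conditional and unconditional predictions at sampling time.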

Novelty & Lineage

Prior work:

COME (2024): ControlNet-based occupancy world model achieving ~35-40 frame stable rollouts
DOME (2024): Continuous VAE + diffusion for occupancy generation with similar short horizons
OccWorld (2024): Discrete VQ-VAE + autoregressive transformers for occupancy prediction

Delta: This paper introduces W-DiT architecture explicitly incorporating 3D geometric priors through rigid transformations, enabling >80x improvement in stable generation length (3000+ vs <50 frames). Key innovations are mask-injected conditioning, SNR-weighted perception loss, and geometry-aware feature injection replacing naive temporal concatenation.

Applied-specific assessment:
- Architectural novelty: W-DiT’s explicit geometric conditioning is non-obvious and addresses fundamental limitation of prior occupancy world models borrowing video generation architectures
- Benchmark gains: >80x improvement in stable rollout length is substantial and consistent across trajectories
- Fair comparisons: Uses same training data and evaluation protocols as baselines
- Generalization: Gains hold across different VAE latent spaces and trajectory types
The downstream utility demonstration (67% zero-shot performance on occupancy forecasting, 11% better than CARLA) provides meaningful validation beyond generation metrics.

Verdict: SIGNIFICANT — The >80x improvement in stable generation length with novel geometry-aware architecture represents a clear advance that enables practical multi-kilometer simulation capabilities.

Benchmarks & Results

3D occupancy realism (FID/KID): W-DiT maintains stable quality over 1000 frames while baselines degrade rapidly after 30-50 frames across UniScene and OccFM latent spaces.
2D realism on challenging trajectories: W-DiT stays below “chaos” threshold on straight/curved/closed-loop paths while COME exceeds chaos threshold by frame 35-40.
Generation diversity: Higher Vendi scores and pairwise mIoU diversity maintained over 100 timesteps compared to COME and DOME.
Downstream 4D semantic forecasting (zero-shot on nuScenes):
- OccWorld trained on OccSim data: 13.51% mIoU vs 11.79% on CARLA data
- OccFM trained on OccSim data: 15.93% mIoU vs 6.99% on CARLA data
- With 5x more OccSim data: OccWorld reaches 19.54% mIoU, OccFM reaches 25.32% mIoU
- Achieves 67-74% of upper bound performance (models trained on same domain)
Inference efficiency: 1.47 seconds per frame vs 2.33-5.48 seconds for baselines, using fewer parameters (181M vs 204-444M).
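
The downstream numbers are easier to compare as absolute mIoU differences. The snippet below only rearranges figures quoted in this section; the dictionary keys are labels chosen for illustration.

```python
# mIoU (%) for zero-shot 4D occupancy forecasting, as quoted above.
miou = {
    "OccWorld": {"carla": 11.79, "occsim": 13.51, "occsim_5x": 19.54},
    "OccFM": {"carla": 6.99, "occsim": 15.93, "occsim_5x": 25.32},
}


def gains(scores: dict) -> dict:
    """Absolute mIoU differences in percentage points."""
    return {
        "occsim_vs_carla": scores["occsim"] - scores["carla"],
        "5x_vs_1x": scores["occsim_5x"] - scores["occsim"],
    }


summary = {model: gains(scores) for model, scores in miou.items()}

for model, g in summary.items():
    print(f"{model}: {g['occsim_vs_carla']:+.2f} pts vs CARLA, {g['5x_vs_1x']:+.2f} pts from 5x data")
```

Seen this way, the benefit of OccSim data over CARLA data is much larger for OccFM than for OccWorld, while scaling the synthetic data by 5x helps both models by several points.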

Compute & Efficiency

Model size: W-DiT has 181.24M parameters (smaller than COME’s 204M and DOME’s 444M)
Training compute: W-DiT trained on 4 RTX Pro 6000 GPUs for ~100 GPU hours; Layout Generator on 8 A100 80GB for ~160 GPU hours
Inference speed: 1.47 seconds per frame (vs 2.33-5.48s for baselines) with 15,570 GFLOPs per frame
Memory footprint: Not explicitly reported, but uses 200×200 voxel resolution with compressed latent representations
Deployment practicality: Described as real-time-capable for simulation use cases, though the reported 1.47 seconds per frame implies offline rather than interactive generation; offers plug-and-play compatibility with trajectory forecasting modules
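
A quick back-of-the-envelope calculation puts these efficiency figures in context. The inputs are the numbers quoted in this digest; treating 3000 frames as a representative multi-kilometer rollout and dividing FLOPs by latency to get a sustained rate are simplifying assumptions.

```python
SECONDS_PER_FRAME = 1.47          # W-DiT inference time per frame
BASELINE_SECONDS = (2.33, 5.48)   # fastest and slowest baseline
GFLOPS_PER_FRAME = 15_570         # compute per generated frame
ROLLOUT_FRAMES = 3000             # assumed length of a multi-kilometer rollout


def rollout_minutes(frames: int, seconds_per_frame: float) -> float:
    """Wall-clock minutes needed to generate a rollout sequentially."""
    return frames * seconds_per_frame / 60.0


speedups = tuple(b / SECONDS_PER_FRAME for b in BASELINE_SECONDS)
sustained_tflops = GFLOPS_PER_FRAME / SECONDS_PER_FRAME / 1000.0
minutes = rollout_minutes(ROLLOUT_FRAMES, SECONDS_PER_FRAME)

print(f"speedup over baselines: {speedups[0]:.2f}x to {speedups[1]:.2f}x")
print(f"sustained compute: {sustained_tflops:.1f} TFLOP/s")
print(f"{ROLLOUT_FRAMES}-frame rollout: {minutes:.1f} minutes")
```

Under these assumptions a 3000-frame rollout takes a little over an hour of sequential generation, so the method is better suited to building simulation environments ahead of time than to generating them on the fly.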

Real-World Applicability

Simulation validation: Demonstrates closed-loop driving simulation over 4+ kilometer road networks generated from single initial frames
Multi-agent scenarios: Successfully populates generated environments with reactive agents using learned layout generator rather than heuristic rules
Hardware compatibility: Runs on standard GPU hardware (RTX Pro 6000, A100) without specialized equipment
Integration testing: Shows compatibility with existing trajectory forecasting algorithms and IDM-based agent control
Downstream task validation: Pre-trained occupancy forecasting models achieve meaningful zero-shot performance on real nuScenes validation data
No specific robot/vehicle deployment results reported, remains primarily in simulation domain

Limitations & Failure Modes

ENGINEERING: Heuristic-based map fusion and lane graph extraction rather than end-to-end learned approach
ENGINEERING: Limited to 2D planar motion assumption, cannot handle complex 3D maneuvers or elevation changes
FUNDAMENTAL: Still requires initial real occupancy frame as seed, cannot generate entirely novel environments from scratch
EVALUATION: Agent control relies on classical IDM rather than learned behaviors, potentially limiting realism
ENGINEERING: Layout generator trained on only 28K samples, may overfit despite augmentation strategies
EVALUATION: Downstream evaluation limited to occupancy forecasting task, broader autonomous driving capabilities not tested

Failure modes: Model may generate topologically invalid road networks requiring post-processing cleanup; long-term drift possible despite improved stability over baselines

Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

Authors: Ziqi Miao, Haonan Jia, Lijun Li, Chen Qian et al. (7 authors) · Institution: Shanghai Artificial Intelligence Laboratory · Category: cs.AI

PRCO decouples perception and reasoning in multimodal RL through a dual-role framework with separate learning signals, achieving consistent improvements on visual reasoning tasks.

Practical Takeaway: If you’re working on multimodal reasoning with RL, consider separating perception and reasoning optimization signals rather than using shared outcome rewards. The dual-role approach with utility-driven captioning shows promising results for reducing perception errors, which is a key bottleneck in current RLVR methods. The caption-first warmup strategy is crucial for implementation success. However, the method requires additional engineering overhead and may be most beneficial for tasks where visual perception errors are a primary limitation.

Tags: reinforcement_learning multimodal_reasoning visual_question_answering perception RLVR policy_optimization mathematical_reasoning vision_language_models


Task & Setting

Multimodal reasoning models struggle with accurate visual perception, limiting their reasoning capabilities despite advances in reinforcement learning with verifiable rewards (RLVR). Existing RLVR approaches use shared rewards for both perception and reasoning components, creating blurred credit assignment that improves reasoning patterns but fails to enhance visual evidence extraction reliably.

The task involves training multimodal large language models (MLLMs) on datasets with tuples $(I, q, a)$ where $I$ is an image, $q$ is a question, and $a$ is a ground-truth answer. The objective is to maximize accuracy on verifiable multimodal reasoning tasks through reinforcement learning.

\[\text{maximize} \quad \mathbb{E}_{(I,q,a) \sim \mathcal{D}} [V(\hat{a}, a)]\]

where $V(\hat{a}, a) \in \{0, 1\}$ is a rule-based verifier checking if predicted answer $\hat{a}$ matches ground truth $a$.
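
A rule-based verifier of this kind can be very small. The sketch below is a hypothetical example, not the paper's checker: the normalization rules (case folding, stripping `$` delimiters and a trailing period) and the numeric tolerance are assumptions chosen for illustration.

```python
def normalize(answer: str) -> str:
    """Canonicalize an answer string before comparison."""
    text = answer.strip().lower()
    text = text.strip("$").strip()      # drop inline-math delimiters
    return text.rstrip(".").strip()     # drop a trailing period


def verify(predicted: str, gold: str, tol: float = 1e-6) -> int:
    """Return 1 if the predicted answer matches the ground truth, else 0."""
    p, g = normalize(predicted), normalize(gold)
    try:
        # Compare numerically when both sides parse as numbers.
        return int(abs(float(p) - float(g)) <= tol)
    except ValueError:
        # Otherwise fall back to exact string match.
        return int(p == g)
```

For example, `verify(" 3.50 ", "3.5")` and `verify("$B$", "b")` both return 1, while `verify("7", "8")` returns 0. A binary signal like this is cheap and hard to game, but it restricts training to questions with short, checkable answers.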

Success is measured by accuracy on eight challenging multimodal reasoning benchmarks including MathVerse, MathVision, MathVista, WeMath, DynaMath, LogicVista, MMMU-Pro, and MMStar. The training dataset ViRL39K contains 39,000 verifiable multimodal reasoning questions across diverse visual formats including diagrams and charts.

Architecture & Method

Dual-role framework with shared policy $\pi_\theta$ alternating between Observer and Solver roles via role-specific prompting

Observer generates question-conditioned evidence caption $c \sim \pi_\theta(\cdot \mid I, q, r^O)$, extracting visual evidence relevant to the question

Solver produces final answer $\hat{a} \sim \pi_\theta(\cdot \mid I^S, q, c, r^S)$, where $I^S \in \{\emptyset, I\}$ depending on training phase

Observer utility reward with leakage suppression:
\[r^O_k = (1 - I_{\text{leak}}(q, c_k)) \cdot \mathbb{E}_{\hat{a} \sim \pi_\theta}[V(\hat{a}, a)]\]
Solver correctness reward:
\[r^S = \lambda r_{\text{acc}} + (1 - \lambda) r_{\text{format}}\]
Role-specific group relative advantages computed separately for Observer captions and Solver answers
Unified policy optimization combining both trajectory types:
\[L_{\text{dual}}(\theta) = L_{\text{GRPO}}(\theta; \hat{A}^S) + L_{\text{GRPO}}(\theta; \hat{A}^O)\]
The core technical contribution is decoupling perception and reasoning optimization through separate, reliable learning signals while maintaining a shared policy.
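
The reward and advantage definitions above can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the expectation in the Observer reward is approximated by averaging verifier outcomes over a handful of Solver rollouts, the advantage uses the mean-and-standard-deviation normalization commonly associated with GRPO, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def observer_reward(leaked: bool, solver_correct: Sequence[int]) -> float:
    """Caption utility: Solver accuracy given the caption, zeroed if the caption leaks the answer."""
    return (0.0 if leaked else 1.0) * mean(solver_correct)


def solver_reward(r_acc: float, r_format: float, lam: float) -> float:
    """Convex combination of accuracy and format rewards."""
    return lam * r_acc + (1.0 - lam) * r_format


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one rollout group (computed separately per role)."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


# Four Observer captions for one question; the third leaks the answer.
caption_rewards = [
    observer_reward(False, [1, 0, 1, 1]),
    observer_reward(False, [0, 0, 1, 0]),
    observer_reward(True, [1, 1, 1, 1]),
    observer_reward(False, [1, 1, 1, 1]),
]
caption_advantages = group_relative_advantages(caption_rewards)
```

In this toy group the leaking caption receives the lowest advantage even though the Solver always answered correctly with it, which is the intended effect of the leakage term: captions are rewarded for conveying evidence, not for giving away the answer.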

Training Recipe

Direct RL training on ViRL39K dataset (39K samples) without supervised fine-tuning stage
AdamW optimizer with learning rate $1 \times 10^{-6}$, rollout batch size 384, trained for 200 optimization steps
Observer rollout group size 4, Solver rollout group size 8
Caption-first warmup: first 40 steps train Solver without image inputs ($I^S = \emptyset$) to encourage caption conditioning, then restore full multimodal inputs ($I^S = I$)
Hardware: 8 NVIDIA H200 GPUs, wall-clock time not reported
Auxiliary components: Qwen3-VL-8B-Instruct for answer leakage checking, rule-based format checker
Clipping factors $\epsilon_l = 0.2$, $\epsilon_h = 0.28$, no KL divergence penalty ($\beta = 0$)
Maximum rollout length: 1024 tokens for Observer, 2048 tokens for Solver
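
Two items in this recipe are simple enough to express directly: the caption-first warmup and the asymmetric clipping range. The sketch below assumes zero-indexed optimization steps and uses the standard PPO-style clipped surrogate with separate lower and upper bounds; the function names are hypothetical.

```python
def solver_sees_image(step: int, warmup_steps: int = 40) -> bool:
    """During warmup the Solver gets only the question and caption (I^S is empty)."""
    return step >= warmup_steps


def clipped_surrogate(
    ratio: float, advantage: float, eps_low: float = 0.2, eps_high: float = 0.28
) -> float:
    """Per-token clipped objective with a wider upper bound than lower bound."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Taking the minimum keeps the objective pessimistic, as in PPO.
    return min(ratio * advantage, clipped_ratio * advantage)
```

Withholding the image for the first 40 steps forces the Solver to rely on the Observer's caption, so caption quality receives a meaningful learning signal before the Solver can fall back on its own perception. The wider upper clip bound allows somewhat larger probability increases for actions with positive advantage.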

Novelty & Lineage

Prior work:

GRPO: group relative policy optimization for LLMs with outcome-driven rewards
Vision-R1: applies RLVR to MLLMs with staged RL schedules
Perception-R1: introduces explicit perception rewards alongside outcome rewards
Delta: This paper proposes role-specific learning signals that completely decouple perception and reasoning optimization at the gradient level, rather than using shared outcome rewards or additional perception objectives.

Applied-specific assessment:
- Architectural idea: Novel application of dual-role framework with separate advantage computation, not just adding perception objectives
- Benchmark gains: Consistent 7+ point average improvements over the base model across model scales (per-benchmark margins over the previous best are smaller), with a 39.2% reduction in perception errors vs 7.6% for GRPO
- Fair comparisons: Uses same training data, compute budget, and evaluation protocols as baselines
- Gains without scale: Method works across 3B, 7B, and 8B models suggesting robustness beyond large scale
Verdict: SIGNIFICANT — The role-specific learning signal approach is non-obvious and shows clear improvements in perception error reduction, a key bottleneck that prior RLVR methods fail to address effectively.

Benchmarks & Results

MathVerse: PRCO-7B 49.49% vs previous best DAPO-7B 48.73%, improvement +0.76 points
MathVision: PRCO-7B 30.86% vs previous best VPPO-7B 30.52%, improvement +0.34 points
MathVista: PRCO-7B 77.10% vs previous best DAPO-7B 76.70%, improvement +0.40 points
WeMath: PRCO-7B 50.29% vs previous best MMR1-7B-RL 47.87%, improvement +2.42 points
DynaMath: PRCO-7B 29.74% vs previous best VPPO-7B 27.94%, improvement +1.80 points
LogicVista: PRCO-7B 49.66% vs previous best MMR1-7B-RL 49.44%, improvement +0.22 points
MMMU-Pro: PRCO-7B 42.08% vs previous best DAPO-7B 41.38%, improvement +0.70 points
MMStar: PRCO-7B 67.80% vs previous best VPPO-7B 67.20%, improvement +0.60 points

Overall average: PRCO-7B 49.63% vs base model 42.45%, improvement +7.18 points. The 3B model shows a similar pattern with a +7.65 point improvement. Gains over the previous best method are consistent but incremental on each benchmark.
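
The average and the per-benchmark margins quoted above can be re-derived from the listed scores. The snippet uses only numbers from this section; the variable names are illustrative.

```python
# Accuracy (%) per benchmark, as listed above.
prco_7b = {
    "MathVerse": 49.49, "MathVision": 30.86, "MathVista": 77.10, "WeMath": 50.29,
    "DynaMath": 29.74, "LogicVista": 49.66, "MMMU-Pro": 42.08, "MMStar": 67.80,
}
previous_best = {
    "MathVerse": 48.73, "MathVision": 30.52, "MathVista": 76.70, "WeMath": 47.87,
    "DynaMath": 27.94, "LogicVista": 49.44, "MMMU-Pro": 41.38, "MMStar": 67.20,
}
BASE_MODEL_AVERAGE = 42.45

average = sum(prco_7b.values()) / len(prco_7b)
gain_over_base = average - BASE_MODEL_AVERAGE
margins = {name: prco_7b[name] - previous_best[name] for name in prco_7b}
largest = max(margins, key=margins.get)
smallest = min(margins, key=margins.get)

print(f"average accuracy: {average:.4f}")
print(f"gain over base model: {gain_over_base:.4f} points")
print(f"largest margin: {largest}, smallest margin: {smallest}")
```

The comparison makes the two baselines explicit: the roughly 7-point gain is measured against the untrained base model, whereas the margin over the strongest competing RL method is largest on WeMath and smallest on LogicVista.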

Compute & Efficiency

Model size: 3B, 7B, and 8B parameter variants tested
Training compute: 8 NVIDIA H200 GPUs for 200 optimization steps, specific GPU-hours not reported
Inference speed/latency: Not reported, uses standard greedy decoding
Memory footprint: Observer max 1024 tokens, Solver max 2048 tokens rollout length
Deployment practicality: Requires dual-role prompting and auxiliary leakage checker model, adds complexity but maintains single shared policy for deployment

Real-World Applicability

Evaluated on curated academic benchmarks only, no real-world deployment results reported
No hardware experiments on physical robots or vehicles
No production integration or commercial application discussed
No sim-to-real transfer evaluation
Limited to problems with short, verifiable answers rather than open-ended generation tasks
Method focuses on mathematical and logical reasoning, which may have educational applications, but real-world impact is unclear

Limitations & Failure Modes

FUNDAMENTAL: Visual evidence representation as text captions is inherently lossy and cannot capture fine-grained spatial relations, global structure, or geometric details
FUNDAMENTAL: Limited to tasks with verifiable, short answers rather than open-ended generation
ENGINEERING: Requires auxiliary supervision for leakage detection and answer verification, adds computational overhead
EVALUATION: Only tested on academic benchmarks, needs evaluation on real-world deployment scenarios
ENGINEERING: Caption-first warmup phase requires careful scheduling; without it, the method degrades significantly

Failure modes:
Observer may still leak answers despite leakage checker
Important visual details lost in text compression may cause reasoning failures on complex geometric problems.