Applied AI Digest — Apr 10, 2026
Today’s Digest at a Glance
Today’s digest focuses on advanced evaluation methodologies, spatio-temporal reasoning systems, and specialized architectures for embodied AI applications.
Trajectory-Level Preference Modeling
Traditional preference learning evaluates individual responses, but many AI applications—especially autonomous agents—require reasoning about entire sequences of actions and their long-term consequences. The naive approach of applying pointwise preference models to each step fails because it cannot capture dependencies between actions or evaluate whether an agent’s overall strategy is coherent.
Trajectory-level preference modeling addresses this by treating entire execution traces as atomic units for preference comparison. Instead of scoring individual actions $a_t$, the model learns a preference function $P(\tau_1 \succ \tau_2)$ over complete trajectories $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)$. This requires training data consisting of trajectory pairs with human annotations about which complete sequence better achieves the intended goal.
The key insight is that coherent long-horizon behavior emerges from understanding how individual actions contribute to overall success, rather than optimizing each step in isolation. This enables evaluation of complex behaviors like tool use, multi-step reasoning, and planning consistency that cannot be assessed from individual responses alone.
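The digest does not say how $P(\tau_1 \succ \tau_2)$ is parameterized. One common choice, assumed here purely for illustration, is a Bradley-Terry model over a single scalar score per trajectory. In the sketch below, `toy_score` is a hypothetical stand-in for a learned trajectory scorer, and the trajectory layout is invented for the example.

```python
import math
from typing import Callable, List, Tuple

# A trajectory is a list of (state, action) steps; this layout is illustrative.
Trajectory = List[Tuple[str, str]]


def preference_probability(
    score: Callable[[Trajectory], float],
    tau_1: Trajectory,
    tau_2: Trajectory,
) -> float:
    """Bradley-Terry probability that tau_1 is preferred over tau_2."""
    # Each trajectory is scored once as a whole, not step by step.
    return 1.0 / (1.0 + math.exp(-(score(tau_1) - score(tau_2))))


def pairwise_loss(score, chosen: Trajectory, rejected: Trajectory) -> float:
    """Negative log-likelihood of the human label 'chosen beats rejected'."""
    return -math.log(preference_probability(score, chosen, rejected))


def toy_score(tau: Trajectory) -> float:
    # Hypothetical scorer: reward ending in the "done" state, penalize length.
    reached_goal = 1.0 if tau and tau[-1][0] == "done" else 0.0
    return 2.0 * reached_goal - 0.1 * len(tau)


good = [("start", "search"), ("found", "answer"), ("done", "stop")]
bad = [("start", "search"), ("start", "search"), ("start", "give_up")]
p = preference_probability(toy_score, good, bad)
```

Because the score is a function of the whole sequence, two trajectories with identical first steps can still receive very different preferences, which is the property a pointwise per-step model lacks.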
Extended Spatial Regular Expressions (q-SpRE)
Spatial-temporal reasoning in video requires expressing complex relationships between objects across time, but standard approaches lack the expressiveness to capture quantified spatial relationships or temporal sequences efficiently. Traditional computer vision methods handle individual frames well but struggle with queries like “find all frames where every car is followed by a pedestrian within 5 meters.”
Extended Spatial Regular Expressions (q-SpRE) combine the pattern matching power of regular expressions with spatial logic and quantifiers. The syntax extends standard regex operators (*, +, ?) with spatial predicates and universal/existential quantifiers: $\forall x \in \text{cars}: \exists y \in \text{pedestrians}: \text{distance}(x,y) < 5$. This allows expressing complex spatio-temporal patterns like “$(\text{car} \cdot \forall \text{within}(5m, \text{pedestrian}))^*$” meaning “zero or more occurrences of cars where every car has a nearby pedestrian.”
\[\text{q-SpRE} := \text{regex-ops} \cup \{\forall x \in C: \phi(x), \exists x \in C: \phi(x)\} \cup \text{spatial-predicates}\]
The key insight is that spatial relationships can be treated as first-class citizens in pattern matching, enabling automatic generation of training data for complex video understanding tasks.
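To make the quantified predicate concrete, here is a minimal sketch that evaluates $\forall x \in \text{cars}: \exists y \in \text{pedestrians}: \text{distance}(x,y) < 5$ frame by frame. It covers only the per-frame quantifier check, not the regex operators or any real q-SpRE tooling. The frame layout (class name mapped to 2D positions in meters) and the function names are assumptions made for the example.

```python
import math
from typing import Dict, List, Tuple

# A frame maps an object class to the (x, y) positions of its instances, in meters.
Frame = Dict[str, List[Tuple[float, float]]]


def forall_exists_within(frame: Frame, outer: str, inner: str, radius: float) -> bool:
    """True when every `outer` object has some `inner` object closer than `radius`."""
    return all(
        any(math.dist(x, y) < radius for y in frame.get(inner, []))
        for x in frame.get(outer, [])
    )


def matching_frames(frames: List[Frame], outer: str, inner: str, radius: float) -> List[int]:
    """Indices of the frames that satisfy the quantified predicate."""
    return [
        i for i, frame in enumerate(frames)
        if forall_exists_within(frame, outer, inner, radius)
    ]


frames = [
    # The only car has a pedestrian 3 m away.
    {"cars": [(0.0, 0.0)], "pedestrians": [(3.0, 0.0)]},
    # The second car has no pedestrian within 5 m.
    {"cars": [(0.0, 0.0), (20.0, 0.0)], "pedestrians": [(3.0, 0.0)]},
    # No cars at all: the universal quantifier holds vacuously.
    {"cars": [], "pedestrians": []},
]
hits = matching_frames(frames, "cars", "pedestrians", 5.0)  # [0, 2]
```

Note the third frame: a universally quantified query matches frames with no cars at all, so a practical query would often add an existence condition alongside it.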
Reading Guide
The trajectory evaluation work in Plan-RewardBench directly relates to the embodied AI foundation models in HY-Embodied-0.5, as both address the challenge of evaluating complex multi-step agent behaviors. The spatio-temporal reasoning capabilities of FESTS complement the long-horizon simulation focus of OccSim, with both tackling the temporal consistency problem in different domains. The perception-reasoning decoupling in PRCO provides a methodological foundation that could enhance the multi-modal capabilities demonstrated across the embodied AI and simulation papers.
Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
Authors: Jiaxuan Wang, Yulan Hu, Wenjin Yang, Zheng Pan et al. (6 authors) · Institution: Nanjing University, Alibaba Group · Category: cs.AI
Plan-RewardBench introduces trajectory-level preference evaluation for tool-augmented agents, revealing that current reward models struggle with long-horizon planning consistency and tool grounding across 1,171 carefully constructed preference pairs.
Practical Takeaway: If you’re building agentic systems with tool integration, this benchmark reveals critical evaluation gaps in current reward models. The systematic hard negative construction methodology (combining natural rollouts, rule-based perturbations, and minimal edits) provides a reusable recipe for generating training data for trajectory-level RMs. Most importantly, the results show that even strong LLM judges struggle with long-horizon planning consistency and fall below 70% accuracy on complex multi-turn scenarios - highlighting the need for specialized training rather than relying on general-purpose models as trajectory evaluators in production RL loops.
Tags: reward_modeling agent_evaluation tool_use trajectory_evaluation preference_learning RLHF planning benchmark
Task & Setting
Complex agentic systems increasingly rely on multi-step tool-integrated reasoning (TIR), where agents must plan, execute, and recover across long horizons. Traditional reward model evaluation focuses on response-level preferences, missing critical failures in planning consistency, error recovery, and tool grounding that emerge in trajectory-level interactions.
The task evaluates trajectory-level preference judgment in tool-augmented environments. Input consists of tool environment T (schemas and descriptions), multi-turn user interactions, and two candidate trajectories (τA, τB) containing interleaved assistant messages, tool calls, and tool responses. The objective is pairwise preference classification:
\[P(\text{prefer}(\tau_A, \tau_B) | T, \text{context})\]
Success is measured by pairwise accuracy against human-validated gold labels across four scenario families: Safety Refusal, Tool-Irrelevance/Unavailability, Complex Planning, and Robust Error Recovery.
Plan-RewardBench contains 1,171 preference pairs spanning trajectories with 2-64 turns (mean 10.6). Complex Planning dominates with 484 pairs across difficulty/horizon splits, while other families provide 275-361 pairs each.
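As a rough illustration of the metric, the sketch below computes pairwise accuracy per scenario family and then an unweighted macro-average. The records are toy data, not benchmark results, and the exact grouping used for the reported macro-average (families versus finer splits) is not stated in this digest, so the grouping key here is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each record: (scenario family, gold preferred side, predicted preferred side).
Record = Tuple[str, str, str]


def per_family_accuracy(records: List[Record]) -> Dict[str, float]:
    """Fraction of pairs per family where the predicted side matches the gold side."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for family, gold, pred in records:
        total[family] += 1
        correct[family] += int(gold == pred)
    return {family: correct[family] / total[family] for family in total}


def macro_average(accuracy: Dict[str, float]) -> float:
    """Unweighted mean over families, so small families count as much as large ones."""
    return sum(accuracy.values()) / len(accuracy)


# Toy records for illustration only.
records = [
    ("Safety Refusal", "A", "A"),
    ("Safety Refusal", "B", "A"),
    ("Complex Planning", "A", "A"),
    ("Complex Planning", "B", "B"),
    ("Complex Planning", "A", "A"),
    ("Complex Planning", "B", "A"),
]
acc = per_family_accuracy(records)
overall = macro_average(acc)
```

With uneven family sizes, as in this benchmark, the macro-average can differ noticeably from the accuracy pooled over all pairs.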
Architecture & Method
- Multi-source trajectory generation: Natural rollouts from diverse agents (Qwen-Agent, OpenAI-Agent) using different base models, temperatures, and policies to capture realistic success/failure modes.
- Hard negative construction: Three complementary approaches - natural negatives from multi-model rollouts (70%), rule-based perturbations (22%), and minimal-edit perturbations (8%) that preserve tool calls while degrading assistant reasoning.
- Multi-LLM judge scoring: K=3 judge panel scores each trajectory (1-5 scale) with family-specific rubrics, aggregated by median score and majority-vote diagnostics.
- Meta-review filtering: Separate meta-review pass when score ranges ≥2 or critical tags conflict, with ambiguous cases discarded.
- Bias-controlled pairing: Difficulty control pairs strong trajectories (Chosen) with lower-ranked candidates, balanced across HardPair (gap=1) and EasyPair (gap≥2). Length/format stratification prevents superficial preference exploitation.
The core contribution is the systematic construction of confusable hard negatives that isolate semantic planning failures rather than surface-level cues.
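The scoring and pairing rules described in this section can be sketched as follows. This is a simplified reading: it covers median aggregation, the score-range trigger for meta-review, and the gap-based pair labels, and it omits the critical-tag check, rubrics, and length/format stratification. Function names and the handling of zero-gap pairs are assumptions.

```python
from statistics import median
from typing import List, Optional


def aggregate_scores(scores: List[int]) -> dict:
    """Median-aggregate a judge panel's 1-5 scores and flag wide disagreement."""
    spread = max(scores) - min(scores)
    return {
        "score": median(scores),
        # A score range of 2 or more triggers a meta-review pass.
        "needs_meta_review": spread >= 2,
    }


def pair_difficulty(chosen_score: float, rejected_score: float) -> Optional[str]:
    """Label a pair by score gap: gap of 1 is a HardPair, gap of 2 or more an EasyPair."""
    gap = chosen_score - rejected_score
    if gap >= 2:
        return "EasyPair"
    if gap == 1:
        return "HardPair"
    # Assumption: pairs without a usable gap are simply not emitted.
    return None


a = aggregate_scores([5, 4, 5])   # median 5, range 1
b = aggregate_scores([4, 2, 4])   # median 4, range 2, sent to meta-review
c = aggregate_scores([3, 3, 4])   # median 3, range 1
hard = pair_difficulty(a["score"], b["score"])
easy = pair_difficulty(a["score"], c["score"])
```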
Training Recipe
- Data collection:
  - Source: Toucan dataset with MCP tool registries and executed responses
  - Scale: 1,171 preference pairs across 4 scenario families
  - Filtering: Lightweight sanity checks remove malformed traces and execution failures
- Multi-LLM labeling:
  - Judge panel: K=3 judges per trajectory using family-specific rubrics
  - Aggregation: Median score, majority-vote diagnostics
  - Meta-review: Triggered when score range ≥2 or critical tag conflicts
  - Validation: Independent pairwise judge confirms preference direction
- Human audit:
  - Annotators: 2 independent human judges on stratified subset
  - Agreement: Cohen’s κ ∈ [0.71, 0.86] across families
  - Resolution: Third senior annotator for disagreements
- Quality control:
  - Generator-judge disjoint sets prevent leakage
  - Difficulty/bias controls in pair assembly
  - Post-check filtering retains only consistent pairs
Training details for base trajectory generators and judge models not reported.
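For readers unfamiliar with the agreement statistic quoted in the human audit, the sketch below computes Cohen's κ for two annotators from its standard definition, $\kappa = (p_o - p_e)/(1 - p_e)$. The annotation lists are invented for illustration and have no connection to the benchmark's audit data.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # p_o: fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected if each annotator labeled independently
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented preference labels ("A" or "B" is the better trajectory).
annotator_1 = ["A", "A", "B", "B", "A", "B", "A", "B", "A", "A"]
annotator_2 = ["A", "A", "B", "B", "A", "B", "A", "A", "A", "B"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 8 of 10 items, chance agreement is 0.52, and κ comes out at about 0.58, which shows how raw agreement of 80% can correspond to a much lower κ.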
Novelty & Lineage
Prior work:
- RewardBench/RewardBench2 (Lambert et al. 2025, Malik et al. 2025) - response-level RM evaluation across chat, reasoning, safety
- FC-RewardBench (Agarwal et al. 2025) - tool-call correctness evaluation in single-turn settings
- Agent-RewardBench (Men et al. 2025) - step-level multimodal agent evaluation
Delta: This paper extends evaluation from response/step-level to full trajectory-level preferences in text-only tool-augmented settings. The key additions are:
- long-horizon planning consistency evaluation
- systematic hard negative construction methodology
- trajectory-level error recovery assessment.
Applied-specific assessment:
- Architectural idea: The multi-source hard negative construction (natural + rule-based + minimal-edit) is a solid engineering contribution but follows established preference data curation practices
- Benchmark gains: Not applicable - this is a benchmark paper rather than a method
- Comparisons: Fair evaluation protocol with bias controls and human validation (κ > 0.7)
- Scale dependency: The benchmark construction process would generalize to other tool environments
Verdict: INCREMENTAL — Solid extension of existing RM benchmarks to trajectory-level evaluation with good methodology, but represents expected evolution rather than fundamental innovation.
Benchmarks & Results
- Overall macro-average: Best model Qwen-Plus achieves 69.96%, with competitive scalar RM Inf-ORM-Llama3.1-70B at 69.21%
- Complex Planning Multi-turn Easy: Qwen3-4B-Instruct leads at 75.00%, Qwen3-30B-A3B-Instruct at 72.02%
- Complex Planning Multi-turn Hard: Best performance Qwen3-4B-Instruct at 76.56%, DeepSeek-V3.2-Exp at 61.58%
- Complex Planning Single-turn Easy: Multiple models cluster around 79-84%, with Qwen3-235B-A22B-Instruct-2507 at 84.55%
- Complex Planning Single-turn Hard: DeepSeek-V3.2-Exp leads at 74.84%, Qwen-Plus at 74.68%
- Robust Error Recovery: Gemini-3-Flash achieves 78.43%, Qwen3-235B-A22B-Thinking-2507 at 78.92%
- Safety Refusal: GPT-5 dominates at 84.80%, wide performance variance (40.69-84.80%) among LLM judges
- Tool Irrelevance: Gemini-3-Flash leads at 75.55%, most models cluster 60-70%
Results reveal no single model dominates all categories, with sharp performance degradation on long-horizon trajectories (>32k tokens).
Compute & Efficiency
- Model sizes: Evaluated models range from 4B (Qwen3-4B-Instruct) to 235B parameters (Qwen3-235B variants), with most scalar RMs at 7B-70B scale
- Training compute: Not reported for benchmark construction or evaluation
- Inference speed: Not reported, though pairwise protocols inherently require 2x evaluation vs pointwise
- Memory footprint: Not specified, but trajectory lengths up to 64 turns (max 29,622 tokens) indicate substantial context requirements
- Deployment practicality: Benchmark designed for offline evaluation; requires no tool re-execution or external service access after construction, supporting practical deployment
Real-World Applicability
- Tool environments: Built on Toucan dataset with realistic MCP (Model Context Protocol) tool registries representing real-world APIs and services
- Trajectory validation: Multi-model natural rollouts (70% of data) capture authentic agent behaviors and failure modes in tool-integrated scenarios
- Human validation: Substantial inter-annotator agreement (Cohen’s κ ∈ [0.71, 0.86]) confirms alignment with human judgment across scenario families
- Production relevance: Addresses critical deployment challenges including safety refusal quality, error recovery, and tool hallucination detection in multi-turn agentic systems
- Limitation scope: Current release focuses on English text-only scenarios; multimodal and multi-agent extensions acknowledged as important future work
Limitations & Failure Modes
- EVALUATION: Gold labels for complex planning contain inherent subjectivity despite high inter-annotator agreement
- ENGINEERING: MCP-style tool registries may not cover proprietary APIs used in production systems
- FUNDAMENTAL: Scenario distribution intentionally non-uniform - Safety Refusal smaller due to difficulty constructing high-quality refusal negatives
- ENGINEERING: Current release limited to English text-based tool traces, excluding multimodal scenarios
- EVALUATION: Single-language evaluation may not generalize to multilingual agentic systems
Failure modes:
- Length sensitivity collapse: All evaluator families show sharp performance degradation beyond 32k tokens, with some falling below random chance
- Planning logic blindness: Evaluators struggle to distinguish tool-grounded fabrication from valid reasoning, often rewarding effort over correctness
Spatio-Temporal Grounding of Large Language Models from Perception Streams
Authors: Jacob Anderson, Bardh Hoxha, Georgios Fainekos, Hideki Okamoto et al. (5 authors) · Institution: Toyota Motor North America · Category: cs.RO
FESTS uses extended spatial regular expressions to automatically generate training data for fine-tuning LLMs on complex spatio-temporal video reasoning, achieving large F1 improvements on autonomous driving perception tasks.
Practical Takeaway: If you’re working on video understanding or autonomous driving perception, this paper demonstrates a clever approach to automatically generate training data for spatio-temporal reasoning without manual annotation. The key insight is using formal pattern matching (extended SpRE) to create unlimited (query, answer, explanation) tuples from any structured perception dataset. While the architectural contribution is modest, the data generation pipeline could be valuable for training video-language models on complex temporal reasoning tasks. However, you’ll need high-quality pre-labeled perception data and may want to test generalization across different model architectures before deployment.
Tags: spatio-temporal-reasoning video-understanding autonomous-driving formal-methods LLM-fine-tuning perception object-tracking spatial-logic
Task & Setting
Real-world context: Embodied AI systems for autonomous driving, robotics, and household assistance must understand how objects move and interact in 3D space over time. Current LLMs and VLMs struggle with fine-grained spatial relations, metric distances, and temporal orderings—critical failures for safety-critical applications like autonomous vehicles.
Task definition: Given structured video perception logs (object classes, bounding boxes, optional depth/IDs), the goal is to answer complex spatio-temporal queries like “find all frames where the same car and bus start >10m apart and come within 1m within 20 frames.” Input: perception stream with object annotations. Output: frame-level binary matches and natural language explanations. The objective is to maximize frame-level F1 score:
\[F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]
Evaluation criteria: Frame-level F1, exact match accuracy, and segment-level F1 across queries of varying temporal length (1-16 frames) and complexity (sequence, spatial, temporal, metric, existential).
Dataset: FESTS benchmark with 27K+ automatically generated (query, frames, match, explanation) tuples from Woven Perception dataset (180 scenes, 1.2K+ perception streams, 126 frames per stream).
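A small worked example of the frame-level metric may help. The sketch below applies the F1 formula to per-frame binary match labels and also checks exact match. The labels are invented, and returning 0.0 when there are no true positives is a convention chosen here, not something the digest specifies.

```python
from typing import List


def frame_f1(gold: List[int], pred: List[int]) -> float:
    """F1 over per-frame binary labels (1 means the frame matches the query)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # chosen convention; also avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def exact_match(gold: List[int], pred: List[int]) -> bool:
    """True only when every frame label agrees."""
    return gold == pred


# Invented labels for an 8-frame clip.
gold = [0, 1, 1, 1, 0, 0, 1, 0]
pred = [0, 1, 1, 0, 0, 1, 1, 0]
score = frame_f1(gold, pred)      # 3 TP, 1 FP, 1 FN
is_exact = exact_match(gold, pred)
```

With 3 true positives, 1 false positive, and 1 false negative, precision and recall are both 0.75, so F1 is 0.75 while exact match fails. This gap is why the two metrics can diverge sharply in the results.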
Architecture & Method
- Extended Spatial Regular Expression (q-SpRE) language: Combines regex syntax with S4u spatial logic, adding universal (∀) and existential (∃) quantifiers for object tracking across frames
- Query synthesis pipeline: Templates generate diverse q-SpRE patterns covering 5 categories (sequence, spatial, temporal, metric, existential)
- STREM matching framework: Deterministic Finite Automata-based pattern matcher that verifies q-SpRE queries against structured perception logs
- Natural language explanation generator: Converts formal matches into readable explanations for training supervision
- Fine-tuning architecture: Qwen2.5-3B-Instruct LLM with LoRA adaptation (rank=16, scaling=32, dropout=0.05) on attention and MLP layers
- Two-stage training: Stage 1 uses supervised fine-tuning with cross-entropy loss. Stage 2 adds PPO reinforcement learning with hierarchical reward function combining structural validity, match accuracy (mAP IoU), and reasoning fidelity (sentence similarity).
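The hierarchical reward in Stage 2 could be organized along the following lines. This is a loose sketch, not the paper's reward: the JSON output format, the weights, the -1.0 penalty, plain frame-set IoU in place of mAP IoU, and word overlap in place of a sentence-similarity model are all assumptions introduced for illustration.

```python
import json
from typing import List, Set


def frame_iou(pred: Set[int], gold: Set[int]) -> float:
    """Intersection over union of predicted and gold matching-frame sets."""
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)


def token_overlap(a: str, b: str) -> float:
    """Jaccard word overlap, a crude stand-in for a sentence-similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def hierarchical_reward(raw_output: str, gold_frames: List[int], gold_explanation: str,
                        w_match: float = 0.7, w_reason: float = 0.3) -> float:
    # Level 1: structural validity gates everything else (hypothetical JSON schema).
    try:
        parsed = json.loads(raw_output)
        frames = set(parsed["frames"])
        explanation = str(parsed["explanation"])
    except (ValueError, KeyError, TypeError):
        return -1.0
    # Level 2: match accuracy, simplified here to frame-set IoU.
    match = frame_iou(frames, set(gold_frames))
    # Level 3: reasoning fidelity against the reference explanation.
    reason = token_overlap(explanation, gold_explanation)
    return w_match * match + w_reason * reason


gold_text = "the car approaches the bus"
reward_ok = hierarchical_reward(
    '{"frames": [3, 4, 5], "explanation": "the car approaches the bus"}',
    [3, 4, 5, 6], gold_text)
reward_bad = hierarchical_reward("frames 3 to 5", [3, 4, 5, 6], gold_text)
```

Gating on structure first means a malformed answer never earns partial credit for a plausible-sounding explanation, which is one reasonable reading of "hierarchical" here.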
Training Recipe
- Stage 1 - Supervised Fine-Tuning: 27K (query, frames, match, explanation) tuples from Woven Perception dataset. LoRA with AdamW 8-bit optimizer, learning rate 1×10^-5, cosine schedule, effective batch size 60, 5 epochs. Training time and hardware not reported.
- Stage 2 - Reinforcement Learning: PPO on top of Stage 1 model. Custom hierarchical reward function evaluating structural validity, match accuracy, and reasoning fidelity. AdamW 8-bit optimizer, learning rate 1×10^-6, effective batch size 4, KL divergence coefficient 0.05, 1 PPO epoch with 4 optimization epochs per batch. Training time and hardware not reported.
Data source: Real perception data from Woven Perception (180 autonomous driving scenes). No synthetic perception data is reported; the query, match, and explanation tuples are generated automatically from these real logs.
Novelty & Lineage
Prior work:
- SpatialVLM (Chen et al. 2024) and SpatialBot (Cai et al. 2024) address spatial reasoning gaps with geometric priors.
- V-STaR (Li et al. 2025) shows purely textual fine-tuning improves temporal reasoning in video-LLMs.
- NSVS-TL (Choi et al. 2024) uses temporal logic for long-term video reasoning.
Delta: This paper adds (1) extension of SpRE with universal/existential quantifiers, (2) automated generation of verifiable spatio-temporal supervision without human labels, (3) joint spatial-temporal reasoning training with explanations.
Applied-specific assessment: The architectural contribution (quantified SpRE) is a modest extension of existing formal methods. The key insight—using formal pattern matching to generate unlimited training data—is clever but incremental. Benchmark gains are substantial (+39 F1 points) but limited to a single 3B model on one dataset. The comparison to GPT-4.1 uses different training data and compute scales, making it potentially unfair. The approach requires high-quality pre-labeled perception data, limiting scalability.
Verdict: INCREMENTAL — solid engineering contribution that automatically generates training data for spatio-temporal reasoning, but builds incrementally on known formal methods and fine-tuning approaches.
Benchmarks & Results
- Frame-level F1 (overall): Qwen2.5-3B baseline 48.5%, Q-SFT 80.4%, Q-SFT+RL 87.5%, GPT-4.1 84.8% (Q-SFT+RL gains 39.0 points over baseline)
- Exact Match (overall): Qwen2.5-3B baseline 25.0%, Q-SFT 56.6%, Q-SFT+RL 64.5%, GPT-4.1 35.0% (Q-SFT+RL gains 39.5 points over baseline)
- Sequence queries F1: Baseline 57.0%, Q-SFT+RL 96.2% (+39.2 points)
- Spatial queries F1: Baseline 41.5%, Q-SFT+RL 81.9% (+40.4 points)
- Temporal queries F1: Baseline 50.4%, Q-SFT+RL 86.6% (+36.2 points)
- Metric queries F1: Baseline 45.7%, Q-SFT+RL 90.3% (+44.6 points)
- Existential queries F1: Baseline 48.0%, Q-SFT+RL 82.8% (+34.8 points)
Results show consistent large improvements across all query types and frame lengths (1-16 frames). Performance gaps with GPT-4.1 remain on existential and spatial queries.
Compute & Efficiency
- Model size: 3 billion parameters (Qwen2.5-3B-Instruct base model)
- Training compute: Not reported (GPU hours, specific hardware unspecified)
- Inference speed/latency: Not reported
- Memory footprint: Uses LoRA fine-tuning to reduce memory requirements, specific values not reported
- Deployment practicality: Claimed to be “two orders of magnitude smaller” than GPT-4.1, making it more deployable for real-time applications, but lacks concrete deployment metrics
Real-World Applicability
- Real-world data testing: Uses Woven Perception dataset with authentic autonomous driving scenarios (180 scenes, 7 camera sensors per scene)
- Domain focus: Autonomous vehicle perception with object detection and tracking requirements
- Sim-to-real: No discussion of simulation-to-real transfer
- Production integration: Framework described as enabling “unlimited training data” for any structured perception dataset, but no production deployment results reported
- Hardware experiments: No robot or vehicle deployment experiments described beyond dataset evaluation
Limitations & Failure Modes
- ENGINEERING: Requires pre-labeled perception data with object classes and bounding boxes, limiting scalability to unlabeled video
- ENGINEERING: Manual curation required for query template creation, not fully automated
- EVALUATION: Single model evaluation (only Qwen2.5-3B), generalization across model families unclear
- FUNDAMENTAL: Missing automated translation from natural language queries to SpRE syntax
- ENGINEERING: Accuracy depends entirely on quality of source perception labels
Failure modes:
- Existential queries involving extended object tracking show degraded performance vs. GPT-4.1
- Spatial reasoning queries lag behind simpler temporal patterns despite training supervision.
HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents
Authors: Tencent Robotics X, HY Vision Team, Xumin Yu et al. (23 authors) · Institution: Tencent · Category: cs.CV
HY-Embodied-0.5 introduces Mixture-of-Transformers architecture and comprehensive embodied training to create foundation models optimized for real-world agent deployment, achieving strong performance across 22 benchmarks with an efficient 2B parameter variant.
Practical Takeaway: If you’re building embodied AI systems, this work provides a solid engineering template combining MoT architecture with comprehensive embodied training data. The key practical insights are: (1) modality-specific parameters and attention patterns can improve visual reasoning without major architectural changes, (2) systematic embodied data curation across perception, spatial reasoning, and planning tasks is crucial, (3) iterative post-training with task-aware rewards and on-policy distillation can effectively transfer capabilities from large to small models. The 2B edge model achieving competitive performance makes this approach practically relevant for resource-constrained deployment, though you’ll need substantial training data and compute to replicate the results.
Tags: embodied-ai vision-language-models robotics spatial-reasoning mixture-of-transformers visual-grounding trajectory-prediction affordance-learning
Task & Setting
Real-world embodied agents require sophisticated visual perception and reasoning capabilities to navigate and manipulate physical environments. However, existing Vision-Language Models (VLMs) excel primarily on static benchmarks but lack the fine-grained spatial perception and embodied reasoning necessary for real-world deployment.
The task is to develop foundation models that can process visual inputs (images, videos) and natural language instructions to produce responses for embodied tasks including spatial localization, affordance prediction, trajectory planning, and manipulation reasoning. Input consists of RGB images/video frames with resolution support up to native resolution, paired with natural language queries. Output formats include bounding boxes (normalized 0-1000 coordinates), point coordinates, trajectories (up to 15 waypoints), discrete tokens, and natural language responses.
Success is measured across 22 benchmarks spanning visual perception (CV-Bench, DA-2K), spatial understanding (3DSRBench, ViewSpatial, etc.), and embodied understanding (ERQA, RoboBench, ShareRobot). Metrics include accuracy, IoU for grounding tasks, Dynamic Fréchet Distance for trajectories, and task-specific measures.
The evaluation covers existing public benchmarks with comprehensive coverage of embodied capabilities including perception, spatial reasoning, affordance recognition, and planning.
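The normalized 0-1000 coordinate format mentioned for bounding boxes can be illustrated with a small conversion helper. This is a sketch under assumptions: the (x_min, y_min, x_max, y_max) ordering, rounding to the nearest integer, and the function names are choices made here, since the digest only states the 0-1000 range.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def to_norm1000(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Map a pixel-space box to integer coordinates on a 0-1000 grid."""
    x0, y0, x1, y1 = box
    return (
        round(1000 * x0 / width),
        round(1000 * y0 / height),
        round(1000 * x1 / width),
        round(1000 * y1 / height),
    )


def from_norm1000(box: Tuple[int, int, int, int], width: int, height: int) -> Box:
    """Map a 0-1000 box back to pixel coordinates for a given image size."""
    x0, y0, x1, y1 = box
    return (x0 * width / 1000, y0 * height / 1000, x1 * width / 1000, y1 * height / 1000)


norm = to_norm1000((320.0, 120.0, 960.0, 600.0), width=1280, height=720)
back = from_norm1000(norm, width=1280, height=720)
```

A resolution-independent grid like this lets the same output vocabulary serve images of any size, at the cost of sub-pixel rounding error on the round trip.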
Architecture & Method
- HY-ViT 2.0: 400M parameter native-resolution Vision Transformer with arbitrary resolution support, distilled from a larger internal model
- Mixture-of-Transformers (MoT) architecture: Separate FFN and QKV parameters for vision and language tokens, enabling modality-specific computation
- Visual attention mechanism: Full bidirectional attention for visual tokens, causal attention for text tokens
- Visual latent tokens: Learnable tokens appended to each visual input, supervised by global features from teacher ViT
- Three-loss training objective during pretraining:
\[L_{total} = L_{llm} + L_{vision} + L_{global}\]
where vision loss predicts next discrete visual codes:
\[L_{vision} = -\frac{1}{N_v}\sum_{i=1}^{N_v} \log p_i(z_i)\]
and global loss aligns latent tokens with teacher features:
\[L_{global} = -\frac{f_{latent}^T f_{teacher}}{\|f_{latent}\| \|f_{teacher}\|}\]
- Two model variants: MoT-2B (2B activated, 4B total parameters) and MoE-A32B (32B activated, 407B total parameters)
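The mixed attention pattern (bidirectional for visual tokens, causal for text) can be pictured as a boolean mask. The sketch below is one plausible reading, not the model's actual implementation: it assumes that visual tokens attend bidirectionally only within their own contiguous image block and that every other pair follows ordinary causal order.

```python
from typing import List


def build_attention_mask(modalities: List[str]) -> List[List[bool]]:
    """mask[i][j] is True when query token i may attend to key token j.

    `modalities` holds "v" (visual) or "t" (text) for each token position.
    """
    n = len(modalities)
    # Give each contiguous run of visual tokens its own block id (-1 for text).
    block = [-1] * n
    current = -1
    for i, modality in enumerate(modalities):
        if modality == "v":
            if i == 0 or modalities[i - 1] != "v":
                current += 1
            block[i] = current
    # Start from a causal mask: each token sees itself and earlier tokens.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    # Open up full attention inside each visual block (assumed scope).
    for i in range(n):
        for j in range(n):
            if block[i] != -1 and block[i] == block[j]:
                mask[i][j] = True
    return mask


# One text token, a three-token image block, then two text tokens.
mask = build_attention_mask(["t", "v", "v", "v", "t", "t"])
```

In this layout the first image token can see the last one, while text tokens still cannot look ahead, so autoregressive text generation is unaffected.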
Training Recipe
- Pre-training stage: 600B+ tokens (389B general + 236B embodied/perception data)
  - Data: Web-scale general data, spatial/robotics data (43% of embodied), visual perception tasks
  - Optimizer: AdamW, base LR 5e-5, ViT LR 5e-6, weight decay 1e-4, batch size 256
  - Context length: 32k tokens with packing, ViT gradients updated every 5 steps
  - Hardware: Not reported
- Mid-training stage: 30M instances (general:embodied:spatial = 12:5:3)
  - Data: Higher-quality embodied and spatial data with unified coordinate formats
  - Optimizer: Same base LR with cosine decay, ViT parameters frozen
  - Training: Short chains for MoE-A32B, mixed long/short chains for MoT-2B
- Supervised Fine-tuning: ~100k cold-start Chain-of-Thought samples
  - Data: Human-model collaborative CoT construction with LLM evaluation
  - Training: No sequence packing, LR 5e-5, standard cross-entropy loss
- Reinforcement Learning: GRPO with task-aware rewards, 50k samples per round
  - Training: Group size 16, LR 8e-7, asymmetric clipping [0.8,1.35], 5 epochs
  - Hardware: Gradient checkpointing, parameter offloading enabled
- Iterative post-training: Alternating RL and rejection sampling fine-tuning cycles
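The two GRPO ingredients named in the recipe, group-relative advantages and asymmetric clipping, can be sketched with the commonly used formulation. Reading [0.8, 1.35] as lower and upper bounds on the policy probability ratio is an assumption, as are the function names. The example uses a group of 4 rather than 16 for brevity.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float,
                 low: float = 0.8, high: float = 1.35) -> float:
    """PPO-style clipped surrogate with asymmetric ratio bounds [low, high]."""
    clipped_ratio = min(max(ratio, low), high)
    # Taking the minimum keeps the objective pessimistic about large policy moves.
    return min(ratio * advantage, clipped_ratio * advantage)


rewards = [1.0, 0.0, 0.0, 1.0]           # toy task rewards for one group
adv = group_relative_advantages(rewards)

up_capped = clipped_term(1.5, 1.0)       # gain from a good sample is capped at 1.35
down_capped = clipped_term(0.5, -1.0)    # ratio floor at 0.8 applies
unclipped = clipped_term(1.5, -1.0)      # pessimistic branch keeps the raw term
```

A wider upper bound than lower bound lets the policy increase the probability of good samples more freely than it can decrease others, which is the usual motivation for asymmetric clipping.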
Novelty & Lineage
Prior work:
- Qwen3-VL (2025) - General VLM with thinking capabilities, achieves strong performance on standard benchmarks
- RoboBrain2.5 (2026) - Specialized embodied VLM with 4B parameters targeting robot control tasks
- MiMo-Embodied (2025) - 7B parameter embodied foundation model
Delta: This paper adds:
- Mixture-of-Transformers architecture with modality-specific parameters and attention patterns
- visual latent tokens with multi-loss supervision
- comprehensive embodied pre-training data (100M+ samples)
- iterative post-training combining RL with rejection sampling
- on-policy distillation from large to small models.
Assessment:
- Architectural novelty: MoT for embodied tasks is a reasonable extension of existing ideas, not fundamentally novel
- Benchmark gains: Consistent improvements across 22 benchmarks, winning 16/22 for the 2B model is meaningful
- Fair comparisons: Uses same evaluation protocol, though baseline selection may favor their approach
- Scale dependence: Gains likely depend on proprietary training data and compute scale
The core contributions are solid engineering advances rather than breakthrough innovations. The systematic application of MoT, comprehensive embodied data curation, and multi-stage training represent competent incremental progress.
Verdict: INCREMENTAL — solid engineering combining known techniques with extensive embodied data and training, yielding consistent but expected improvements.
Benchmarks & Results
- CV-Bench: 89.2% vs previous best 88.8% (MiMo-Embodied 7B), +0.4%
- DA-2K: 92.3% vs 79.4% (RoboBrain 2.5 4B), +12.9%
- ERQA: 54.5% vs 47.3% (Qwen3-VL 4B), +7.2%
- EmbSpatial-Bench: 82.8% vs 80.7% (Qwen3-VL 4B), +2.1%
- RoboBench-MCQ: 49.2% vs 45.8% (Qwen3-VL 4B), +3.4%
- RoboBench-Planning: 54.2% vs 58.7% (MiMo-Embodied 7B), -4.5%
- RoboSpatial-Home: 55.7% vs 63.2% (Qwen3-VL 4B), -7.5%
- ShareRobot-Affordance: 26.8% vs 25.5% (tied), +1.3%
- ShareRobot-Trajectory: 73.3% vs 81.4% (RoboBrain 2.5), -8.1%
- Ego-Plan2: 45.5% vs 52.6% (RoboBrain 2.5), -7.1%
- 3DSRBench: 57.0% vs 44.8% (RoboBrain 2.5), +12.2%
- All-Angles Bench: 55.1% vs 49.0% (MiMo-Embodied), +6.1%
- MindCube: 66.3% vs 36.2% (MiMo-Embodied), +30.1%
- MMSI-Bench: 33.2% vs 31.9% (MiMo-Embodied), +1.3%
- RefSpatial-Bench: 45.8% vs 56.0% (RoboBrain 2.5), -10.2%
- SAT: 76.7% vs 78.7% (MiMo-Embodied), -2.0%
- SIBench-mini: 58.2% vs 53.1% (MiMo-Embodied), +5.1%
- SITE-Bench-Image: 62.7% vs 61.0% (Qwen3-VL 4B), +1.7%
- SITE-Bench-Video: 63.5% vs 58.9% (MiMo-Embodied), +4.6%
- ViewSpatial: 53.1% vs 41.6% (Qwen3-VL 4B), +11.5%
- VSIBench: 60.5% vs 55.2% (Qwen3-VL 4B), +5.3%
- Where2Place: 68.0% vs 65.0% (RoboBrain 2.5), +3.0%
Overall: 58.0% average, outperforming Qwen3-VL-4B by 10.2 points and RoboBrain2.5-4B by 8.6 points. Results are mixed, with losses on 6 of the 22 benchmarks, notably trajectory and planning tasks.
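The win/loss summary can be checked directly against the per-benchmark deltas listed above. The short tally below copies those deltas (in percentage points) and counts the signs. It only re-derives the counts and does not recompute the reported averages.

```python
# Deltas in percentage points, copied from the per-benchmark list in this digest
# (HY-Embodied-0.5 2B model minus the comparison baseline named on each line).
deltas = {
    "CV-Bench": 0.4,
    "DA-2K": 12.9,
    "ERQA": 7.2,
    "EmbSpatial-Bench": 2.1,
    "RoboBench-MCQ": 3.4,
    "RoboBench-Planning": -4.5,
    "RoboSpatial-Home": -7.5,
    "ShareRobot-Affordance": 1.3,
    "ShareRobot-Trajectory": -8.1,
    "Ego-Plan2": -7.1,
    "3DSRBench": 12.2,
    "All-Angles Bench": 6.1,
    "MindCube": 30.1,
    "MMSI-Bench": 1.3,
    "RefSpatial-Bench": -10.2,
    "SAT": -2.0,
    "SIBench-mini": 5.1,
    "SITE-Bench-Image": 1.7,
    "SITE-Bench-Video": 4.6,
    "ViewSpatial": 11.5,
    "VSIBench": 5.3,
    "Where2Place": 3.0,
}

wins = sorted(name for name, d in deltas.items() if d > 0)
# Losses ordered from the largest deficit to the smallest.
losses = sorted((name for name, d in deltas.items() if d < 0), key=deltas.get)
largest_gain = max(deltas, key=deltas.get)
```

The tally gives 16 wins and 6 losses, matching the 16/22 figure, with the largest deficit on RefSpatial-Bench and the largest gain on MindCube.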
Compute & Efficiency
- Model size: MoT-2B (2B activated, 4B total parameters), MoE-A32B (32B activated, 407B total parameters)
- Training compute: Over 600B tokens pre-training, 30M mid-training samples, multiple RL/RFT cycles - specific GPU hours not reported
- Inference speed: Designed for edge deployment (MoT-2B), real-time performance claimed but no specific latency numbers provided
- Memory footprint: 400M parameter ViT encoder, modality-specific parameters add computational overhead but “negligible” according to authors
- Deployment practicality: Edge-optimized 2B variant specifically designed for real-time deployment, demonstrated in real robot control experiments with 80%+ success rates on manipulation tasks
Real-World Applicability
- Robot control experiments: Vision-Language-Action (VLA) model trained using HY-Embodied-0.5 as foundation, achieving 80-85% success rates on real manipulation tasks
- Physical evaluation tasks: Packing, hanging, stacking operations with success rates of 85%, 80%, and 85% respectively
- Hardware deployment: Edge-optimized 2B model specifically designed for real-time robotic applications
- Environment testing: Real-world physical manipulation scenarios, not just simulation
- Sim-to-real discussion: Limited - paper focuses on direct real-world training data rather than sim-to-real transfer
- Production integration: Open-sourced code and models available, but no details on actual production deployments reported
Limitations & Failure Modes
- FUNDAMENTAL: Performance degradation on trajectory prediction and planning tasks compared to specialized baselines (ShareRobot-Trajectory: -8.1%, Ego-Plan2: -7.1%)
- ENGINEERING: Requires proprietary training data and substantial compute resources for full reproduction
- EVALUATION: Comparison methodology potentially favors their approach by using thinking mode for their model but best-of-both-modes for baselines
- FUNDAMENTAL: MoT architecture doubles parameter count while providing only modest gains over standard architectures
- ENGINEERING: Dependence on teacher ViT models and multi-stage distillation pipeline increases training complexity
- EVALUATION: Limited analysis of failure modes on real robot tasks - only reports success rates without detailed failure analysis
Failure modes:
- Likely struggles with long-horizon planning tasks requiring complex temporal reasoning
- May hallucinate spatial relationships in cluttered or ambiguous visual scenes based on benchmark performance variations
OccSim: Multi-kilometer Simulation with Long-horizon Occupancy World Models
Authors: Tianran Liu, Shengwen Zhao, Mozhgan Pourkeshavarz, Weican Li et al. (5 authors) · Institution: University of Toronto · Category: cs.CV
OccSim introduces W-DiT, a geometry-aware diffusion transformer that achieves >80x improvement in stable occupancy world model generation, enabling multi-kilometer autonomous driving simulation from single initial frames.
Practical Takeaway: If you’re building autonomous driving simulators, this work demonstrates that explicitly incorporating 3D geometric priors (rigid transformations) into diffusion models can dramatically improve long-horizon stability over naive video generation approaches. The W-DiT architecture with mask-injected conditioning and SNR-weighted perception loss is worth implementing for occupancy-based simulation tasks. The key insight is that occupancy sequences have inherent geometric structure that should be exploited rather than forcing networks to learn camera motions implicitly. For practitioners, the ability to generate multi-kilometer consistent environments from single frames could significantly reduce dependency on expensive HD map collection.
Tags: autonomous_driving simulation occupancy_prediction world_models diffusion_models long_horizon_generation 3D_perception multi_agent_systems
Task & Setting
Real-world context: Autonomous driving simulation has been limited by reliance on pre-recorded logs or HD maps, constraining generation capabilities to existing dataset scales. This creates a fundamental trade-off between sensor realism, diversity, and interactivity in driving simulators.
Task definition: Generate large-scale, consistent 3D occupancy-based driving simulations from a single initial frame and future ego-actions. Input is one static occupancy frame (200×200 voxels) plus trajectory waypoints. Output is continuous multi-kilometer occupancy sequences (3000+ frames) with populated dynamic agents. The objective is stable autoregressive generation:
\[P(z_{1:K} \mid z_0, J_{0:K+\sigma}) = \prod_{i=1}^K P_\theta(z_i \mid z_{i-1}, J_{i-1:i+\sigma-1})\]
where $z_i$ represents the occupancy latent at step $i$ and $J$ is the ego-trajectory.
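The factorization above can be read as a plain rollout loop. The sketch below is a minimal illustration under stated assumptions, not the authors' code: `rollout` and `toy_step` are hypothetical names, `toy_step` stands in for sampling from $P_\theta$, and the trajectory window is assumed to cover indices $i-1$ through $i+\sigma-1$ inclusive.

```python
def rollout(z0, trajectory, sigma, step_fn, num_steps):
    """Autoregressive rollout: step i sees only z_{i-1} and a window of the ego-trajectory."""
    if len(trajectory) < num_steps + sigma:
        raise ValueError("trajectory must provide waypoints J_0 .. J_{K+sigma-1}")
    latents = [z0]
    for i in range(1, num_steps + 1):
        # Assumed inclusive window J_{i-1}, ..., J_{i+sigma-1} (sigma + 1 waypoints).
        window = trajectory[i - 1 : i + sigma]
        latents.append(step_fn(latents[-1], window))
    return latents


def toy_step(z_prev, window):
    """Hypothetical stand-in for sampling z_i ~ P_theta(. | z_{i-1}, window)."""
    return z_prev + window[0]


if __name__ == "__main__":
    waypoints = [1.0] * 8  # hypothetical per-step ego displacements
    print(rollout(0.0, waypoints, sigma=3, step_fn=toy_step, num_steps=5))
```

Because each step conditions only on the previous latent and a short look-ahead window, errors can compound over thousands of steps, which is why long-horizon stability is the central concern here.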
Evaluation criteria: Static realism via 2D/3D FID, KID, MMD metrics comparing conditional fidelity and unconditional realism. Generation diversity measured by Vendi scores and semantic IoU diversity. Downstream utility assessed via zero-shot 4D semantic occupancy forecasting performance (mIoU metric).
Dataset: Trained on the publicly available Occ3D-nuScenes and UniOcc datasets.
Architecture & Method
- W-DiT (Warp-DiT) backbone: Novel diffusion transformer explicitly incorporating rigid 3D transformations via mask-injected conditioning mechanism instead of naive temporal concatenation from video generation.
- Geometric conditioning: Forward-warp previous latent $z_t$ using rigid transformation matrices $T_{t+1}^t$ derived from ego-trajectory, creating warped latent $\hat{z}_{t+1}$ with visibility mask $M_{vis}$ and random mask $M_{rand}$.
- Dual conditioning injection: Token-wise scaling/shifting parameters from both global timestep $\tau$ and spatial condition features, enabling precise spatial control during generation.
- Flow matching objective with SNR-weighted perception loss:
\[L_{total} = \mathbb{E}\left[\|v_\theta - (z_{t+1} - \epsilon)\|_2^2 + \lambda \tau^2 \frac{M_{mask}}{|M_{mask}|} \text{CE}(O_{t+1}, D(\hat{z}_{t+1}))\right]\]
- Layout Generator: Compact DiT-S model learning conditional distribution $P(H \mid z_t)$ for agent placement, mapping discrete agent positions to continuous 2D heatmaps via Gaussian kernels.
- Map fusion: Two-pass keyframe-based fusion with morphological refinement and graph-based lane topology extraction for multi-kilometer road networks.
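The geometric conditioning step can be pictured with a toy forward warp. The following is a rough sketch only, assuming a small 2D grid in place of the occupancy latent, nearest-cell splatting, and a rigid transform given as a yaw angle plus a translation in cells; `forward_warp` is a hypothetical helper, and the paper's actual warp operates on latents with full transformation matrices.

```python
import math


def forward_warp(grid, yaw, tx, ty):
    """Forward-warp a square 2D grid by a rigid transform.

    Rotation is about the grid centre, followed by a translation in cells.
    Returns the warped grid and a visibility mask (1 where a source cell landed).
    """
    n = len(grid)
    centre = (n - 1) / 2.0
    warped = [[0.0] * n for _ in range(n)]
    visible = [[0] * n for _ in range(n)]
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    for row in range(n):
        for col in range(n):
            x, y = col - centre, row - centre
            x_new = cos_y * x - sin_y * y + tx
            y_new = sin_y * x + cos_y * y + ty
            c_new, r_new = round(x_new + centre), round(y_new + centre)
            if 0 <= r_new < n and 0 <= c_new < n:
                warped[r_new][c_new] = grid[row][col]
                visible[r_new][c_new] = 1
    return warped, visible


if __name__ == "__main__":
    grid = [[0.0] * 4 for _ in range(4)]
    grid[1][1] = 5.0
    warped, visible = forward_warp(grid, yaw=0.0, tx=1, ty=0)
    print(warped[1][2], [row[0] for row in visible])
```

Cells where the mask is 0 are regions the previous frame cannot explain, such as newly revealed road ahead, which is where the generative model has to fill in content.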
Training Recipe
- W-DiT training: 200 epochs on 4 RTX Pro 6000 GPUs, batch size 32, ~100 GPU hours on nuScenes. AdamW optimizer with cosine annealing, learning rate decayed from 3.2e-5 to 3.2e-6. 20% condition dropout for classifier-free guidance. Random masking ratio 10-50%.
- Layout Generator training: 500 epochs on 8 A100 80GB GPUs, ~160 GPU hours. Same optimizer settings. Extensive spatial augmentations including agent perturbations and rigid transformations (rotations, reflections) to prevent overfitting on 28K training samples.
- Loss weighting: $\lambda = 2$ for perception loss component. Noise $\epsilon$ sampled from sigmoid(N(0,I)).
- Data details: Publicly available occupancy datasets, no proprietary data used.
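To make the loss weighting concrete, here is a single-sample sketch of the combined objective with $\lambda = 2$. It assumes the mask term denotes a masked mean of per-voxel cross-entropy and takes the $\tau^2$ factor exactly as written in the formula; `total_loss` and `cross_entropy` are hypothetical helper names, and flat Python lists stand in for tensors.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one voxel, computed stably via log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def total_loss(v_pred, z_next, eps, tau, mask, logits, labels, lam=2.0):
    """Flow-matching term plus tau^2-weighted, masked perception term (one sample)."""
    # Squared L2 distance between predicted velocity and the target (z_next - eps).
    fm = sum((v - (z - e)) ** 2 for v, z, e in zip(v_pred, z_next, eps))
    # Assumed reading of M / |M|: average the cross-entropy over masked voxels only.
    n_masked = max(sum(mask), 1)
    ce = sum(m * cross_entropy(l, y) for m, l, y in zip(mask, logits, labels))
    return fm + lam * tau ** 2 * ce / n_masked


if __name__ == "__main__":
    loss = total_loss(
        v_pred=[1.0, -1.0], z_next=[2.0, 0.0], eps=[1.0, 1.0], tau=0.5,
        mask=[1, 0, 1], logits=[[0.0] * 4] * 3, labels=[0, 1, 2],
    )
    print(loss)
```

With the $\tau^2$ weight, the perception term contributes little when $\tau$ is small and most when $\tau$ approaches 1, so decoded-occupancy supervision is concentrated on one end of the noise schedule.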
Novelty & Lineage
Prior work:
- COME (2024): ControlNet-based occupancy world model achieving ~35-40 frame stable rollouts
- DOME (2024): Continuous VAE + diffusion for occupancy generation with similar short horizons
- OccWorld (2024): Discrete VQ-VAE + autoregressive transformers for occupancy prediction
Delta: This paper introduces the W-DiT architecture, which explicitly incorporates 3D geometric priors through rigid transformations, enabling a >80x improvement in stable generation length (3000+ frames vs fewer than 50 for prior models, e.g. 35-40 for COME). Key innovations are mask-injected conditioning, SNR-weighted perception loss, and geometry-aware feature injection replacing naive temporal concatenation.
Applied-specific assessment:
- Architectural novelty: W-DiT’s explicit geometric conditioning is non-obvious and addresses a fundamental limitation of prior occupancy world models that borrow video generation architectures
- Benchmark gains: >80x improvement in stable rollout length is substantial and consistent across trajectories
- Fair comparisons: Uses same training data and evaluation protocols as baselines
- Generalization: Gains hold across different VAE latent spaces and trajectory types
The downstream utility demonstration (zero-shot occupancy forecasting at 67% of in-domain upper-bound performance, 11% better than CARLA) provides meaningful validation beyond generation metrics.
Verdict: SIGNIFICANT — The >80x improvement in stable generation length with novel geometry-aware architecture represents a clear advance that enables practical multi-kilometer simulation capabilities.
Benchmarks & Results
- 3D occupancy realism (FID/KID): W-DiT maintains stable quality over 1000 frames while baselines degrade rapidly after 30-50 frames across UniScene and OccFM latent spaces.
- 2D realism on challenging trajectories: W-DiT stays below “chaos” threshold on straight/curved/closed-loop paths while COME exceeds chaos threshold by frame 35-40.
- Generation diversity: Higher Vendi scores and pairwise mIoU diversity maintained over 100 timesteps compared to COME and DOME.
- Downstream 4D semantic forecasting (zero-shot on nuScenes):
  - OccWorld trained on OccSim data: 13.51% mIoU vs 11.79% on CARLA data
  - OccFM trained on OccSim data: 15.93% mIoU vs 6.99% on CARLA data
  - With 5x more OccSim data: OccWorld reaches 19.54% mIoU, OccFM reaches 25.32% mIoU
  - Achieves 67-74% of upper bound performance (models trained on same domain)
- Inference efficiency: 1.47 seconds per frame vs 2.33-5.48 seconds for baselines, using fewer parameters (181M vs 204-444M).
Compute & Efficiency
- Model size: W-DiT has 181.24M parameters (smaller than COME’s 204M and DOME’s 444M)
- Training compute: W-DiT trained on 4 RTX Pro 6000 GPUs for ~100 GPU hours; Layout Generator on 8 A100 80GB for ~160 GPU hours
- Inference speed: 1.47 seconds per frame (vs 2.33-5.48s for baselines) with 15,570 GFLOPs per frame
- Memory footprint: Not explicitly reported, but uses 200×200 voxel resolution with compressed latent representations
- Deployment practicality: At 1.47 seconds per frame, generation is faster than the baselines but suited to offline simulation rather than strictly real-time use; offers plug-and-play compatibility with trajectory forecasting modules
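A back-of-envelope estimate helps put the reported inference numbers in context. The snippet below assumes frames are generated strictly one after another on a single stream and that "GFLOPs per frame" means total operations per frame in units of 10^9; the helper names are illustrative.

```python
SECONDS_PER_FRAME = 1.47    # reported W-DiT inference time per frame
GFLOPS_PER_FRAME = 15_570   # reported compute per frame
ROLLOUT_FRAMES = 3000       # rollout length quoted for multi-kilometer sequences


def rollout_minutes(frames, seconds_per_frame):
    """Wall-clock minutes to generate `frames` frames sequentially."""
    return frames * seconds_per_frame / 60.0


def effective_tflops(gflops_per_frame, seconds_per_frame):
    """Sustained throughput implied by per-frame compute and latency."""
    return gflops_per_frame / seconds_per_frame / 1000.0


if __name__ == "__main__":
    print(f"{rollout_minutes(ROLLOUT_FRAMES, SECONDS_PER_FRAME):.1f} min per 3000-frame rollout")
    print(f"{effective_tflops(GFLOPS_PER_FRAME, SECONDS_PER_FRAME):.1f} TFLOP/s sustained")
```

Under these assumptions a 3000-frame rollout takes about 73.5 minutes, which is workable for offline data generation.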
Real-World Applicability
- Simulation validation: Demonstrates closed-loop driving simulation over 4+ kilometer road networks generated from single initial frames
- Multi-agent scenarios: Successfully populates generated environments with reactive agents using a learned layout generator rather than heuristic rules
- Hardware compatibility: Runs on standard GPU hardware (RTX Pro 6000, A100) without specialized equipment
- Integration testing: Shows compatibility with existing trajectory forecasting algorithms and IDM-based agent control
- Downstream task validation: Pre-trained occupancy forecasting models achieve meaningful zero-shot performance on real nuScenes validation data
- No robot or vehicle deployment results are reported; the work remains primarily in the simulation domain
Limitations & Failure Modes
- ENGINEERING: Heuristic-based map fusion and lane graph extraction rather than end-to-end learned approach
- ENGINEERING: Limited to 2D planar motion assumption, cannot handle complex 3D maneuvers or elevation changes
- FUNDAMENTAL: Still requires initial real occupancy frame as seed, cannot generate entirely novel environments from scratch
- EVALUATION: Agent control relies on classical IDM rather than learned behaviors, potentially limiting realism
- ENGINEERING: Layout generator trained on only 28K samples, may overfit despite augmentation strategies
- EVALUATION: Downstream evaluation limited to occupancy forecasting task, broader autonomous driving capabilities not tested
Failure modes: Model may generate topologically invalid road networks requiring post-processing cleanup; long-term drift possible despite improved stability over baselines
Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
Authors: Ziqi Miao, Haonan Jia, Lijun Li, Chen Qian et al. (7 authors) · Institution: Shanghai Artificial Intelligence Laboratory · Category: cs.AI
PRCO decouples perception and reasoning in multimodal RL through a dual-role framework with separate learning signals, achieving consistent improvements in visual reasoning tasks.
Practical Takeaway: If you’re working on multimodal reasoning with RL, consider separating perception and reasoning optimization signals rather than using shared outcome rewards. The dual-role approach with utility-driven captioning shows promising results for reducing perception errors, which is a key bottleneck in current RLVR methods. The caption-first warmup strategy is crucial for implementation success. However, the method requires additional engineering overhead and may be most beneficial for tasks where visual perception errors are a primary limitation.
Tags: reinforcement_learning multimodal_reasoning visual_question_answering perception RLVR policy_optimization mathematical_reasoning vision_language_models
Task & Setting
Multimodal reasoning models struggle with accurate visual perception, limiting their reasoning capabilities despite advances in reinforcement learning with verifiable rewards (RLVR). Existing RLVR approaches use shared rewards for both perception and reasoning components, creating blurred credit assignment that improves reasoning patterns but fails to enhance visual evidence extraction reliably.
The task involves training multimodal large language models (MLLMs) on datasets with tuples $(I, q, a)$ where $I$ is an image, $q$ is a question, and $a$ is a ground-truth answer. The objective is to maximize accuracy on verifiable multimodal reasoning tasks through reinforcement learning.
\[\text{maximize} \quad \mathbb{E}_{(I,q,a) \sim \mathcal{D}} [V(\hat{a}, a)]\]
where $V(\hat{a}, a) \in \{0, 1\}$ is a rule-based verifier checking if predicted answer $\hat{a}$ matches ground truth $a$.
Success is measured by accuracy on eight challenging multimodal reasoning benchmarks including MathVerse, MathVision, MathVista, WeMath, DynaMath, LogicVista, MMMU-Pro, and MMStar. The training dataset ViRL39K contains 39,000 verifiable multimodal reasoning questions across diverse visual formats including diagrams and charts.
Architecture & Method
- Dual-role framework with shared policy $\pi_\theta$ alternating between Observer and Solver roles via role-specific prompting
- Observer generates question-conditioned evidence caption $c \sim \pi_\theta(\cdot \mid I, q, r^O)$ extracting visual evidence relevant to question
- Solver produces final answer $\hat{a} \sim \pi_\theta(\cdot \mid I^S, q, c, r^S)$ where $I^S \in \{\emptyset, I\}$ depending on training phase
- Observer utility reward with leakage suppression:
\[r^O_k = (1 - I_{\text{leak}}(q, c_k)) \cdot \mathbb{E}_{\hat{a} \sim \pi_\theta}[V(\hat{a}, a)]\]
- Solver correctness reward:
\[r^S = \lambda r_{\text{acc}} + (1 - \lambda) r_{\text{format}}\]
- Role-specific group relative advantages computed separately for Observer captions and Solver answers
- Unified policy optimization combining both trajectory types:
\[L_{\text{dual}}(\theta) = L_{\text{GRPO}}(\theta; \hat{A}^S) + L_{\text{GRPO}}(\theta; \hat{A}^O)\]
The core technical contribution is decoupling perception and reasoning optimization through separate, reliable learning signals while maintaining a shared policy.
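The two reward signals and their separate advantage computation can be sketched in a few lines. This is an illustrative approximation rather than the paper's implementation: the function names are hypothetical, the expectation in the Observer reward is estimated by averaging over sampled Solver rollouts, $\lambda$ is left as a required argument because its value is not given here, and the advantage uses the common GRPO-style mean/standard-deviation normalization as an assumption.

```python
import statistics


def observer_reward(leaked, solver_correct):
    """Utility of one caption: share of Solver rollouts that are correct, zeroed on answer leakage."""
    utility = sum(solver_correct) / len(solver_correct)
    return (0.0 if leaked else 1.0) * utility


def solver_reward(correct, well_formatted, lam):
    """Weighted mix of answer accuracy and format compliance."""
    return lam * float(correct) + (1.0 - lam) * float(well_formatted)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group (assumed GRPO-style normalization)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four captions for one question; the third leaks the answer.
    caption_rewards = [
        observer_reward(False, [1, 0, 1, 1]),
        observer_reward(False, [0, 0, 1, 0]),
        observer_reward(True, [1, 1, 1, 1]),
        observer_reward(False, [1, 1, 1, 1]),
    ]
    print(group_relative_advantages(caption_rewards))
```

Keeping the Observer and Solver groups separate means a caption is only compared against other captions for the same question, so its credit is not diluted by variance in the Solver's answers.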
Training Recipe
- Direct RL training on ViRL39K dataset (39K samples) without supervised fine-tuning stage
- AdamW optimizer with learning rate $1 \times 10^{-6}$, rollout batch size 384, trained for 200 optimization steps
- Observer rollout group size 4, Solver rollout group size 8
- Caption-first warmup: first 40 steps train Solver without image inputs ($I^S = \emptyset$) to encourage caption conditioning, then restore full multimodal inputs ($I^S = I$)
- Hardware: 8 NVIDIA H200 GPUs, wall-clock time not reported
- Auxiliary components: Qwen3-VL-8B-Instruct for answer leakage checking, rule-based format checker
- Clipping factors $\epsilon_l = 0.2$, $\epsilon_h = 0.28$, no KL divergence penalty ($\beta = 0$)
- Maximum rollout length: 1024 tokens for Observer, 2048 tokens for Solver
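Two of the recipe items above lend themselves to a short sketch: the caption-first warmup gate and the asymmetric clipping bounds. The code assumes zero-indexed optimization steps and the standard PPO/GRPO clipped-ratio surrogate; both function names are hypothetical, and the exact objective used in the paper may differ in detail.

```python
def solver_image_enabled(step, warmup_steps=40):
    """Caption-first warmup: the Solver gets no image for the first `warmup_steps` steps."""
    return step >= warmup_steps


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clipped policy-gradient term with a wider upper bound than lower bound."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Taking the minimum keeps the pessimistic (lower) estimate of the objective.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    for step in (0, 39, 40, 199):
        print(step, solver_image_enabled(step))
    print(clipped_surrogate(1.5, 1.0), clipped_surrogate(0.5, -1.0))
```

The wider upper bound lets probability ratios grow to 1.28 before clipping while shrinkage is cut off at 0.8, which gives positively rewarded tokens slightly more room to increase in probability.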
Novelty & Lineage
Prior work:
- GRPO: group relative policy optimization for LLMs with outcome-driven rewards
- Vision-R1: applies RLVR to MLLMs with staged RL schedules
- Perception-R1: introduces explicit perception rewards alongside outcome rewards
Delta: This paper proposes role-specific learning signals that completely decouple perception and reasoning optimization at the gradient level, rather than using shared outcome rewards or additional perception objectives.
Applied-specific assessment:
- Architectural idea: Novel application of dual-role framework with separate advantage computation, not just adding perception objectives
- Benchmark gains: Consistent 7+ point average improvements over the base model across model scales and diverse benchmarks, with 39.2% reduction in perception errors vs 7.6% for GRPO
- Fair comparisons: Uses same training data, compute budget, and evaluation protocols as baselines
- Gains without scale: Method works across 3B, 7B, and 8B models suggesting robustness beyond large scale
Verdict: SIGNIFICANT — The role-specific learning signal approach is non-obvious and shows clear improvements in perception error reduction, a key bottleneck that prior RLVR methods fail to address effectively.
Benchmarks & Results
- MathVerse: PRCO-7B 49.49% vs previous best DAPO-7B 48.73% (+0.76 points)
- MathVision: PRCO-7B 30.86% vs previous best VPPO-7B 30.52% (+0.34 points)
- MathVista: PRCO-7B 77.10% vs previous best DAPO-7B 76.70% (+0.40 points)
- WeMath: PRCO-7B 50.29% vs previous best MMR1-7B-RL 47.87% (+2.42 points)
- DynaMath: PRCO-7B 29.74% vs previous best VPPO-7B 27.94% (+1.80 points)
- LogicVista: PRCO-7B 49.66% vs previous best MMR1-7B-RL 49.44% (+0.22 points)
- MMMU-Pro: PRCO-7B 42.08% vs previous best DAPO-7B 41.38% (+0.70 points)
- MMStar: PRCO-7B 67.80% vs previous best VPPO-7B 67.20% (+0.60 points)
Overall average: PRCO-7B 49.63% vs base model 42.45% (+7.18 points). The 3B model shows a similar pattern, with a +7.65 point improvement. Gains over the base model are consistent, but margins over the previous best on each benchmark are incremental (+0.22 to +2.42 points).
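The per-benchmark margins and the overall average can be recomputed from the figures listed above as a quick consistency check. The numbers are copied from this summary, not from the paper's tables, and the helper names are illustrative.

```python
# (PRCO-7B score, previous best score) per benchmark, as listed in this summary.
SCORES = {
    "MathVerse": (49.49, 48.73),
    "MathVision": (30.86, 30.52),
    "MathVista": (77.10, 76.70),
    "WeMath": (50.29, 47.87),
    "DynaMath": (29.74, 27.94),
    "LogicVista": (49.66, 49.44),
    "MMMU-Pro": (42.08, 41.38),
    "MMStar": (67.80, 67.20),
}
BASE_MODEL_AVERAGE = 42.45


def margins(table):
    """Point difference between PRCO-7B and the previous best on each benchmark."""
    return {name: ours - prev for name, (ours, prev) in table.items()}


def mean_score(table):
    """Unweighted mean of the PRCO-7B scores."""
    return sum(ours for ours, _ in table.values()) / len(table)


if __name__ == "__main__":
    for name, delta in margins(SCORES).items():
        print(f"{name}: {delta:+.2f} points")
    print(f"gain over base: {mean_score(SCORES) - BASE_MODEL_AVERAGE:+.2f} points")
```

The contrast between the two computations is the useful part: the large headline gain is measured against the base model, while the margins over the strongest prior RL method on each benchmark are much smaller.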
Compute & Efficiency
- Model size: 3B, 7B, and 8B parameter variants tested
- Training compute: 8 NVIDIA H200 GPUs for 200 optimization steps, specific GPU-hours not reported
- Inference speed/latency: Not reported, uses standard greedy decoding
- Memory footprint: Not reported; rollouts are capped at 1024 tokens for the Observer and 2048 tokens for the Solver
- Deployment practicality: Requires dual-role prompting and auxiliary leakage checker model, adds complexity but maintains single shared policy for deployment
Real-World Applicability
- Evaluated on curated academic benchmarks only, no real-world deployment results reported
- No hardware experiments on physical robots or vehicles
- No production integration or commercial application discussed
- No sim-to-real transfer evaluation
- Limited to problems with short, verifiable answers rather than open-ended generation tasks
- Method focuses on mathematical and logical reasoning which may have educational applications but real-world impact unclear
Limitations & Failure Modes
- FUNDAMENTAL: Visual evidence representation as text captions is inherently lossy, cannot capture fine-grained spatial relations, global structure, or geometric details
- FUNDAMENTAL: Limited to tasks with verifiable, short answers rather than open-ended generation
- ENGINEERING: Requires auxiliary supervision for leakage detection and answer verification, adds computational overhead
- EVALUATION: Only tested on academic benchmarks, needs evaluation on real-world deployment scenarios
- ENGINEERING: Caption-first warmup phase requires careful scheduling; without it, the method degrades significantly
Failure modes:
- Observer may still leak answers despite leakage checker
- Important visual details lost in text compression may cause reasoning failures on complex geometric problems.