Applied AI Digest — Mar 20, 2026
Today’s Digest at a Glance
Today’s papers span three major frontiers in applied AI: autonomous systems that perceive and navigate the physical world, intelligent agents that can plan and reason through complex tasks, and multimodal AI systems that seamlessly integrate vision, language, and other sensory modalities.
Computer Vision for Autonomous Systems
Autonomous vehicles and robots need to understand 3D space from camera images, a challenge known as spatial perception. Traditional approaches convert multi-camera feeds into Bird’s-Eye View (BEV) representations—imagine looking down at a scene from above like a satellite image. However, these methods often struggle because they don’t explicitly model the 3D geometry of the world.
3D Gaussian Splatting has emerged as a powerful technique for 3D scene reconstruction. Unlike traditional mesh-based representations, it models scenes as collections of 3D Gaussian functions $\mathcal{G}(x) = A e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$, where each Gaussian has a center $\mu$, covariance matrix $\Sigma$, and amplitude $A$. This representation is both differentiable and efficient to render, making it ideal for learning-based systems. By incorporating explicit 3D reconstruction into BEV perception, we can achieve much more accurate spatial understanding.
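To make the Gaussian formula above concrete, here is a minimal sketch that evaluates a single Gaussian at a 3D point. It assumes a diagonal covariance so that no matrix inverse is needed; the function name `gaussian_value` and the numbers are illustrative only and do not come from any of the papers.

```python
import math

def gaussian_value(x, mu, variances, amplitude=1.0):
    """Evaluate A * exp(-0.5 * (x - mu)^T Sigma^{-1} (x - mu)).

    Assumption: Sigma is diagonal, given as a tuple of per-axis variances.
    """
    # For a diagonal Sigma, the quadratic form is a sum of squared offsets
    # divided by the variance along each axis.
    mahalanobis_sq = sum(
        (xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, variances)
    )
    return amplitude * math.exp(-0.5 * mahalanobis_sq)

center = (0.0, 0.0, 0.0)
variances = (1.0, 4.0, 0.25)

# At the center the exponent is zero, so the value equals the amplitude.
peak = gaussian_value(center, center, variances, amplitude=0.8)
# One unit away along the first axis (variance 1) the value is A * exp(-0.5).
off_center = gaussian_value((1.0, 0.0, 0.0), center, variances, amplitude=0.8)
```

A full covariance matrix would replace the per-axis division with a product against the inverse of $\Sigma$.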
The challenge extends beyond just seeing—robots must also navigate and manipulate objects in these understood spaces. Object goal navigation requires agents to find specific items (like “find the red mug”) in unknown environments. This combines computer vision, semantic understanding, and path planning in a unified framework that must work reliably in real-world settings.
Agent Planning and Reasoning
Modern AI agents face the fundamental challenge of moving beyond reactive behavior to true anticipatory planning. While large language models (LLMs) excel at next-token prediction, they often struggle with multi-step reasoning that requires looking ahead and considering consequences. This is particularly critical in domains like technical support, robotics, and scientific reasoning where mistakes are costly.
Reinforcement Learning (RL) provides a mathematical framework for learning optimal decision-making policies. An agent learns a policy $\pi(a\lvert s)$ that maps states to actions by maximizing expected cumulative reward $\mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_t]$, where $\gamma$ is the discount factor. Recent work combines RL with LLMs through techniques like Policy Optimization, where the language model’s parameters are updated to maximize task-specific rewards rather than just likelihood of human text.
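As a small worked example of the discounted return inside that expectation, the sketch below sums a finite reward sequence with discount factor $\gamma$. The helper name `discounted_return` and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite list of rewards."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma
    # to everything that comes after it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# 1.0 + 0.9 * 0.0 + 0.9**2 * 2.0 = 2.62
episode_return = discounted_return([1.0, 0.0, 2.0], gamma=0.9)
```

The expectation in the objective would then be estimated by averaging this quantity over many sampled episodes.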
Scene graphs represent spatial relationships between objects as structured graphs $G = (V, E)$, where vertices $V$ represent objects and edges $E$ encode relationships like “on,” “next to,” or “inside.” This structured representation allows agents to reason about spatial configurations and resolve ambiguities through targeted queries—for instance, asking “which red cup?” when multiple red cups are present.
The integration of physics-based reasoning into AI systems represents another frontier. Neuro-symbolic approaches combine the pattern recognition capabilities of neural networks with the interpretability and guarantees of symbolic reasoning. This is particularly important in scientific domains where predictions must be physically plausible and interpretable.
Multimodal AI Integration
The convergence of vision, language, and action represents one of the most exciting developments in AI. Vision-Language-Action (VLA) models extend the success of large language models to embodied AI, where agents must understand visual scenes, interpret natural language instructions, and execute physical actions in the world.
Flow Matching and Diffusion Models have revolutionized generative AI by learning to transform noise into structured outputs through learned denoising processes. In flow matching, we learn a vector field $v_t(x)$ that defines trajectories $\frac{dx}{dt} = v_t(x)$ from a simple source distribution to the target data distribution. These models can generate images, but controlling their outputs for specific spatial arrangements remains challenging.
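The trajectory equation $\frac{dx}{dt} = v_t(x)$ can be sketched with simple Euler integration. In practice $v_t$ is a trained network; here a hand-written one-dimensional field `toy_field`, which pushes every point in a straight line toward a fixed target, stands in for it. Both function names and the target value are assumptions for illustration.

```python
def integrate_flow(x0, vector_field, num_steps=10):
    """Euler-integrate dx/dt = v_t(x) from t = 0 to t = 1."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * vector_field(t, x)
    return x

def toy_field(t, x, target=3.0):
    """Stand-in for a learned field: straight-line motion that reaches target at t = 1."""
    return (target - x) / (1.0 - t)

# A "noise" sample at -1.0 is transported to the target value.
sample = integrate_flow(x0=-1.0, vector_field=toy_field)
```

Because the toy field moves at constant velocity along a straight line, Euler steps land on the target up to rounding; a learned field generally needs more steps or a higher-order solver.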
Multimodal reward models face the challenge of evaluating outputs across different modalities. Traditional approaches rely on human preferences, but scalable solutions require automatic evaluation mechanisms. This involves learning reward functions $R(x, y)$ that can assess whether a multimodal output $y$ appropriately responds to input $x$, requiring careful rubric design and validation.
Memory management in long-horizon agents presents unique challenges. Unlike traditional databases, agent memory must handle contradictory information, temporal decay, and utility-based retrieval. This requires governance policies that actively manage the information lifecycle, similar to how biological memory systems forget irrelevant information while retaining important experiences.
Reading Guide
For autonomous systems researchers: Start with Paper 1 (Splat2BEV) to understand modern 3D perception, then Papers 10-11 for navigation and control approaches. Paper 9 provides a concrete industrial application.
For agent reasoning enthusiasts: Begin with Paper 6 (TraceR1) for anticipatory planning foundations, then Papers 2-4 for domain-specific applications in technical support, robotics, and scientific reasoning. Paper 12 addresses the critical memory management challenge.
For multimodal AI practitioners: Papers 7-8 showcase visual reasoning and generation, while Papers 13-15 demonstrate practical applications in software engineering and robotics with audio integration. Paper 14 provides essential evaluation methodology.
Cross-cutting themes: Papers 3, 5, and 15 all address ambiguity resolution and multimodal integration in robotics, while Papers 4 and 11 both incorporate physics-based priors into learning systems.
Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting
Authors: Yiren Lu, Xin Ye, Burhaneddin Yaman, Jingru Luo et al. (7 authors) · Institution: Bosch Research North America, Case Western Reserve University · Category: cs.CV
Splat2BEV improves BEV perception for autonomous driving by incorporating explicit 3D scene reconstruction via Gaussian Splatting, achieving significant performance gains through geometry-aligned feature learning.
Practical Takeaway: If you’re working on BEV perception for autonomous driving, this paper demonstrates that explicit 3D reconstruction can significantly improve performance over end-to-end approaches, particularly for structured elements like lanes (+21.4% relative IoU over PointBEV). The three-stage training paradigm and foundation model distillation (DINO, Metric3D) are the key technical contributions worth implementing. However, weigh the added complexity and computational overhead before deployment. The approach is most valuable when geometric accuracy is critical and you have sufficient compute budget for the multi-stage training process.
Tags: autonomous_driving bev_perception 3d_gaussian_splatting multi_view_reconstruction semantic_segmentation foundation_models computer_vision
Task & Setting
This work addresses Bird’s-Eye-View (BEV) perception for autonomous driving, a critical capability that fuses multi-camera surround-view images into a unified top-down representation for downstream tasks like 3D object detection, semantic segmentation, and motion prediction. The challenge lies in accurately transforming perspective camera views into geometrically consistent BEV features without explicit 3D understanding, leading to suboptimal performance in safety-critical applications.
The task takes as input multi-view perspective images from vehicle-mounted cameras (typically 6 views) at resolutions of 224×480 or 448×800 pixels. The output is a BEV feature map of resolution 200×200 covering a 100m×100m spatial area around the vehicle. The objective combines reconstruction loss:
\[\mathcal{L}_{\text{render}} = \frac{1}{k} \sum_{i=1}^{k} \| \widehat{\mathbf{C}}_i - \mathbf{C}_i^{\text{gt}} \|_2^2\]
with a downstream segmentation loss that uses focal loss, centerness L2 loss, and offset L1 loss.
Success is measured using Intersection-over-Union (IoU) for semantic segmentation tasks including vehicle, pedestrian, and lane segmentation on the BEV plane.
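For readers less familiar with the metric, here is a minimal sketch of IoU on binary BEV occupancy grids. The grids are tiny made-up examples rather than 200×200 maps, and returning 1.0 when both grids are empty is a convention chosen here, not something the paper specifies.

```python
def bev_iou(pred, target):
    """Intersection-over-Union between two binary grids of equal shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += int(bool(p) and bool(t))
            union += int(bool(p) or bool(t))
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0

pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

# Two cells overlap and four cells are occupied in at least one grid.
iou = bev_iou(pred, target)
```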
Experiments are conducted on the nuScenes dataset (1000 driving scenes, 700 train/150 val/150 test) and the Argoverse1 dataset (113 driving logs), which cover diverse weather and lighting conditions across multiple cities.
Architecture & Method
- Gaussian Generator: A feed-forward network with a multi-view branch (based on UniMatch) and a per-view branch (ViT-S backbone) that predicts 3D Gaussian parameters (position μ, covariance Σ, opacity σ, spherical harmonics SH, and feature vector f) from multi-view inputs.
- 3D Gaussian Splatting Representation: The scene is represented as anisotropic Gaussians $\{G_i\}_{i=1}^{N}$, where each Gaussian $G_i = (\mu_i, \Sigma_i, \sigma_i, \text{SH}_i, f_i)$ enables differentiable rendering via α-blending:
\[\mathbf{C}(p) = \sum_{i=1}^{N} T_i \, \sigma_i \, \mathbf{c}_i, \text{ where } T_i = \prod_{j < i} (1 - \sigma_j)\]
- Foundation Model Distillation: A DINO encoder extracts dense semantic features, supervised with a cosine similarity loss:
\[\mathcal{L}_{\text{feat}} = \frac{1}{k} \sum_{i=1}^{k} \Big( 1 - \cos \big( \widehat{\mathbf{F}}_i, \, \mathbf{F}^{\text{DINO}}_i \big) \Big)\]
- Depth Supervision: Metric3Dv2 provides reference depth, with L1 and SILog losses for geometric consistency.
- BEV Projection: Orthogonal projection of the reconstructed 3D Gaussians onto the BEV plane using differentiable rasterization.
- Task Head: BEV encoder + segmentation head operating on the projected BEV features.
The core contribution is incorporating explicit 3D reconstruction via Gaussian Splatting into BEV perception, contrasting with existing end-to-end implicit approaches.
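The α-blending formula in the method list can be sketched for a single pixel as follows. The sketch assumes the Gaussians are already sorted front to back and uses $\sigma_i$ directly as the per-Gaussian blending weight, exactly as the formula is written here; the colors and opacities are invented values.

```python
def alpha_blend(colors, opacities):
    """Composite C = sum_i T_i * sigma_i * c_i with T_i = prod_{j<i} (1 - sigma_j).

    colors:    RGB tuples, ordered front to back along the ray (assumed pre-sorted)
    opacities: matching sigma values in [0, 1]
    """
    transmittance = 1.0          # T_1 = 1: nothing occludes the first Gaussian
    pixel = [0.0, 0.0, 0.0]
    for color, sigma in zip(colors, opacities):
        weight = transmittance * sigma
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        # Light that passes this Gaussian is attenuated for the ones behind it.
        transmittance *= 1.0 - sigma
    return pixel

# A half-opaque red Gaussian in front of a half-opaque green one.
pixel = alpha_blend([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.5, 0.5])
```

The front Gaussian contributes half of its color, and the one behind it contributes only a quarter because half of the light has already been absorbed.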
Training Recipe
- Stage 1 - Gaussian Generator Pre-training:
  - Data: nuScenes/Argoverse1 multi-view images with Metric3Dv2 depth supervision and DINO features
  - Learning rates: multi-view branch 2×10⁻⁴, monocular branch 2×10⁻⁶ (initialized from Depth Anything V2)
  - Epochs: 10
  - Hardware: not reported
- Stage 2 - Task Head Training:
  - Data: Same datasets, Gaussian generator frozen
  - Learning rate: BEV encoder and segmentation head 2×10⁻⁴
  - Epochs: 10
  - Hardware: not reported
- Stage 3 - Joint Fine-tuning:
  - Data: End-to-end optimization of all components
  - Learning rates: monocular branch 2×10⁻⁶, all other components 2×10⁻⁶
  - Epochs: 20
  - Hardware: not reported
Loss coefficients: λ₁ = 0.2, λ₂ = 0.8, λ₃ = 0.1 for reconstruction; λ₄ = 2.0, λ₅ = 0.1 for BEV tasks.
Training compute, wall-clock time, and specific hardware details are not reported.
Novelty & Lineage
This work builds on 3D Gaussian Splatting (Kerbl et al. 2023) and BEV perception methods such as LSS (Philion & Fidler 2020) and BEVFormer (Li et al. 2022). The closest related works are GaussianLSS (Lu et al. 2025) and GaussianBEV (Chabot et al. 2025), which also use Gaussian Splatting for BEV but treat Gaussians merely as feature carriers without explicit 3D reconstruction.
The key delta is incorporating explicit 3D scene reconstruction as an intermediate step in BEV perception, using foundation model distillation (DINO, Metric3D) to create geometry-aligned features, and a three-stage training paradigm that separates reconstruction learning from task-specific optimization.
Rating: SIGNIFICANT - Introduces a new paradigm for BEV perception by bridging explicit 3D reconstruction with downstream tasks, showing consistent improvements across multiple benchmarks.
Benchmarks & Results
- nuScenes Vehicle Segmentation (224×480): IoU 39.6% vs previous best PointBEV 38.7%, +2.3% relative improvement
- nuScenes Vehicle Segmentation (448×800): IoU 42.7% vs PointBEV 42.1%, +1.4% relative improvement
- nuScenes Vehicle Segmentation (224×480, visibility filtered): IoU 44.6% vs PointBEV 44.0%, +1.4% relative improvement
- nuScenes Vehicle Segmentation (448×800, visibility filtered): IoU 48.2% vs PointBEV 47.6%, +1.3% relative improvement
- nuScenes Pedestrian Segmentation: IoU 20.1% vs PointBEV 18.5%, +8.6% relative improvement
- nuScenes Lane Segmentation: IoU 60.2% vs PointBEV 49.6%, +21.4% relative improvement
- nuScenes Multi-class Segmentation: Mean IoU 34.4% vs DiffBEV 29.6%, +16.2% relative improvement
- Argoverse1 Multi-class Segmentation: Mean IoU 24.4% vs TaDe 22.3%, +9.4% relative improvement
Results show consistent improvements across all tasks, with particularly strong gains on lane segmentation (+21.4%) and pedestrian segmentation (+8.6%).
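The percentages quoted in this section are relative gains over the baseline IoU, not absolute differences in IoU points. A quick check of that arithmetic, using the lane and pedestrian numbers listed above (the helper name is illustrative):

```python
def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Lane segmentation: 60.2 vs 49.6 IoU is a 10.6-point absolute gain.
lane_relative = relative_gain_percent(60.2, 49.6)
lane_absolute = 60.2 - 49.6

# Pedestrian segmentation: 20.1 vs 18.5 IoU is a 1.6-point absolute gain.
pedestrian_relative = relative_gain_percent(20.1, 18.5)

print(f"lane: +{lane_relative:.1f}% relative, +{lane_absolute:.1f} points absolute")
print(f"pedestrian: +{pedestrian_relative:.1f}% relative")
```

Keeping the two readings apart matters most for the lane result, where the relative figure is roughly twice the absolute one.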
Compute & Efficiency
- Model size: Not explicitly reported, but uses ViT-S backbone and multi-view transformer components
- Training compute: Not reported - missing GPU hours, hardware specifications
- Inference speed: Not reported - no latency measurements provided
- Memory footprint: Not reported - no memory usage analysis
- Deployment practicality: Limited assessment - three-stage training adds complexity, requires foundation model dependencies (DINO, Metric3D), but achieves better performance. Real-time constraints not addressed.
Real-World Applicability
- Real-world datasets: Evaluated on nuScenes and Argoverse1 datasets collected from actual autonomous vehicles in diverse urban environments (Boston, Singapore, Pittsburgh, Miami)
- Multi-weather conditions: Testing includes various weather and lighting conditions present in the datasets
- Production readiness: No discussion of deployment to actual autonomous vehicles or real-time constraints
- Hardware requirements: No analysis of computational requirements for deployment on automotive hardware
- Sim-to-real: No discussion of simulation to real-world transfer, though datasets are real-world collected
Limitations & Failure Modes
- ENGINEERING: Three-stage training paradigm adds complexity compared to end-to-end approaches, potentially harder to optimize and deploy
- ENGINEERING: Dependency on foundation models (DINO, Metric3D) increases computational overhead and model complexity
- EVALUATION: No real-time performance analysis or deployment constraints considered for autonomous driving applications
- EVALUATION: Limited analysis of failure cases or robustness to domain shifts between training and test environments
- FUNDAMENTAL: Orthogonal BEV projection may lose important 3D geometric information compared to perspective-aware approaches
Likely failure modes:
- Performance degradation in scenarios with limited multi-view overlap where 3D reconstruction quality suffers
- Sensitivity to camera calibration errors that could affect the geometric consistency of projected BEV features.
Lightweight Adaptation for LLM-based Technical Service Agent: Latent Logic Augmentation and Robust Noise Reduction
Authors: Yi Yu, Junzhuo Ma, Chenghuang Shen, Xingyan Liu et al. (18 authors) · Institution: Fudan University, Alibaba Group · Category: cs.LG
A lightweight adaptation framework for LLMs in technical service domains that combines latent logic augmentation, multi-ground-truth training, and hybrid reward mechanisms to overcome myopic imitation and response diversity challenges.
Practical Takeaway: This work provides a blueprint for adapting LLMs to complex technical domains where traditional single-reference training fails. The key insight is that technical service tasks require both explicit reasoning augmentation and multi-reference training to handle response diversity. Research engineers should consider implementing the Latent Logic Augmentation approach (combining backward reasoning and forward planning) when fine-tuning for complex decision-making tasks. The Hybrid Reward Mechanism offers a practical answer to the computational bottleneck of LLM-as-a-Judge systems, achieving comparable performance with 30% less reward computation time. The Multi-GT construction pipeline could be adapted to other domains where valid response diversity exists, such as creative writing, code generation, or medical diagnosis.
Tags: LLM reinforcement-learning technical-support multi-reference-training planning reasoning customer-service reward-modeling
Task & Setting
This work addresses the challenge of adapting Large Language Models (LLMs) to complex technical service domains, where human expert demonstrations often lack explicit reasoning chains and valid responses exhibit inherent diversity. Traditional training paradigms struggle with “myopic imitation” that degrades performance in complex tasks requiring multi-step reasoning and planning.
The task involves fine-tuning LLMs for technical customer service scenarios where agents must provide accurate responses to customer queries in domains like cloud services. The input consists of customer service tickets with historical context, and the output is agent responses that should be both accurate and follow business logic. The training objective combines supervised fine-tuning with reinforcement learning using multiple valid ground truths rather than single references.
Success is measured using the Ensemble-Consistency Score (ECS) defined as:
\[\text{S}_{\text{ECS}}(x, y) = \max_{y^* \in Y^*(x)} J_{\text{con}}(x, y, y^*)\]
where $J_{\text{con}}$ is the mean score across an ensemble of consistency judges, and $Y^*(x)$ represents the set of valid responses for query $x$.
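The ECS definition reads as "average the judges for each reference, then keep the best-matching reference". A minimal sketch of that structure follows; the two judges are crude string heuristics invented for illustration and stand in for the LLM-based consistency judges used in the paper.

```python
def ensemble_consistency_score(query, response, references, judges):
    """Max over valid references of the mean judge score (the ECS formula)."""
    def mean_judge_score(reference):
        scores = [judge(query, response, reference) for judge in judges]
        return sum(scores) / len(scores)
    return max(mean_judge_score(reference) for reference in references)

# Hypothetical stand-in judges, each returning a score in [0, 1].
def overlap_judge(query, response, reference):
    response_tokens = set(response.split())
    reference_tokens = set(reference.split())
    return len(response_tokens & reference_tokens) / len(reference_tokens)

def exact_judge(query, response, reference):
    return 1.0 if response == reference else 0.0

judges = [overlap_judge, exact_judge]
references = ["restart the instance", "reboot the server"]

exact_hit = ensemble_consistency_score("vm is frozen", "restart the instance", references, judges)
partial_hit = ensemble_consistency_score("vm is frozen", "reboot the instance", references, judges)
```

With a single reference in the list this reduces to a single-reference score, which is one way to see why the Multi-ECS and Single-ECS numbers reported later can diverge.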
The paper introduces a Multi-Ground Truth (Multi-GT) dataset expanded from a proprietary technical service dataset of 10k queries each for decision and planning SFT, plus 5,120/1k/1k queries for RL training/validation/test. The Multi-GT expansion roughly doubles the number of references (e.g., test: 1,000 → 1,975).
Architecture & Method
The framework comprises three key components operating on Qwen3-4B as the base model:
- Latent Logic Augmentation: Two methods to inject explicit reasoning into training data:
  - Decision Reasoning Augmentation (DRA) generates backward chain-of-thought rationales $c_t$ for each action $a_t$, with loss:
\[L_{\text{Decision}} = -\mathbb{E}_{(q_t,c_t,a_t) \sim D_{\text{DRA}}}[\log p_\theta(c_t, a_t \mid q_t)]\]
  - Planning-Aware Trajectory Modeling (PATM) constructs forward-looking 3-step sequences $(a_t, q_{t+1}, a_{t+1})$, with loss:
\[L_{\text{PATM}} = -\mathbb{E}_{(y^{\text{PATM}}_t) \sim D_{\text{PATM}}}[\log p_\theta(\tilde{a}_t, \tilde{q}_{t+1}, \tilde{a}_{t+1} \mid q_t)]\]
- Robust Noise Reduction: Constructs Multi-GT datasets using dual-filtering with a Consistency Judge (92% human alignment) and a Utility Judge (83% human alignment) to validate diverse valid responses.
- Lightweight Adaptation: The Hybrid Reward Mechanism (HRM) combines a lightweight Qwen3-4B reranker with a Qwen3-32B judge using a cascade strategy:
\[R_\theta(S_R, S_J) = \begin{cases} w_1 S_R + (1-w_1) S_J & \text{if } S_R < \tau_a \\ S_R & \text{if } \tau_a \leq S_R \leq \tau_b \\ w_2 S_R + (1-w_2) S_J & \text{if } S_R > \tau_b \end{cases}\]
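The piecewise reward above can be sketched as a function that only calls the expensive judge when the reranker score falls outside the interval $[\tau_a, \tau_b]$. The default thresholds and mixing weights are the values this digest lists in the Training Recipe; the function name and the `judge_fn` callback are assumptions made for the sketch.

```python
def hybrid_reward(reranker_score, judge_fn, tau_a=0.68, tau_b=0.98, w1=0.05, w2=0.72):
    """Cascade reward: reranker alone inside [tau_a, tau_b], blended with the judge outside.

    judge_fn is a zero-argument callable so the judge model runs only when needed.
    """
    if tau_a <= reranker_score <= tau_b:
        return reranker_score                    # fast path, no judge call
    weight = w1 if reranker_score < tau_a else w2
    return weight * reranker_score + (1.0 - weight) * judge_fn()

judge_calls = []

def fake_judge():
    """Hypothetical judge that records each call and returns a fixed score."""
    judge_calls.append(1)
    return 1.0

fast = hybrid_reward(0.80, fake_judge)   # inside the interval: judge skipped
low = hybrid_reward(0.50, fake_judge)    # below tau_a: 0.05 * 0.50 + 0.95 * 1.0
high = hybrid_reward(0.99, fake_judge)   # above tau_b: 0.72 * 0.99 + 0.28 * 1.0
```

Passing the judge as a callable is what yields the compute saving: scores in the fast interval never trigger a call to the larger model.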
Training Recipe
- Data Construction: Multi-GT expansion via offline exploration (Qwen3-4B, T=1.2) and online adaptation (RL rollouts), filtered through the dual-judge system.
- SFT Stage:
  - Data: 10k queries each for decision and planning, expanded to Multi-GT format
  - Optimizer: AdamW with ZeRO-3, learning rate 2×10⁻⁶, cosine schedule
  - Training: 1 epoch, batch size 256, max sequence length 20,000
  - Hardware: Not specified
- RL Stage using DAPO:
  - Data: Multi-GT dataset (5,120 training queries)
  - Optimizer: Actor learning rate 5×10⁻⁶, constant schedule
  - Training: 20 episodes, batch size 128, 16 samples per prompt
  - Reward: Hybrid mechanism with fast interval [0.68, 0.98], mixing weights w₁=0.05, w₂=0.72
  - Hardware: Not reported
  - Wall-clock time: 30% reduction in reward computation time compared to Judge-only baseline
Novelty & Lineage
This work builds on established techniques but makes significant novel contributions in their combination and application to technical service domains. The closest prior works include RLHF/PPO methods for alignment, LLM-as-a-Judge paradigms, and planning-aware language modeling approaches.
Key novelties:
- Latent Logic Augmentation: Novel combination of backward reasoning (DRA) and forward planning (PATM) to address “myopic imitation” in technical domains
- Multi-GT Construction: Automated dual-filtering approach using specialized Consistency and Utility judges to systematically capture semantic diversity
- Hybrid Reward Mechanism: Cascade strategy that balances computational efficiency with reward fidelity, reducing reward computation time by 30% while maintaining performance
The framework addresses a fundamental gap in applying LLMs to domains where single-reference training is inadequate and explicit reasoning is absent from demonstrations.
Rating: SIGNIFICANT - The integrated approach of data augmentation, multi-reference training, and efficient reward mechanisms represents a substantial advance for practical LLM deployment in technical domains.
Benchmarks & Results
- Multi-ECS (primary metric): This paper: 0.441 (full framework), baseline SFT: 0.337, improvement: +30.9%
- Single-ECS: This paper: 0.347 (full framework), baseline SFT: 0.242, improvement: +43.4%
- Call Tool Accuracy: This paper: 0.279 (SFT-Mix w/ DRA), baseline: 0.082, improvement: +240%
Component ablation results:
- SFT without reasoning augmentation shows degradation (Multi-ECS: 0.293 vs 0.299 original)
- PATM + DRA combination achieves best SFT performance (Multi-ECS: 0.337)
- Hybrid Reward vs alternatives: 0.429 vs 0.389 (Reranker-only) vs 0.413 (Soft Judge)
- Multi-GT vs Single-GT: 0.441 vs 0.429 Multi-ECS
Mixed results: Single-ECS slightly drops when upgrading to Multi-GT (0.357 → 0.347), highlighting the limitation of single-reference evaluation when models learn diverse valid responses.
Notable absence: No comparison against other multi-reference RL methods or standard technical service benchmarks beyond their proprietary dataset.
Compute & Efficiency
- Model size: Qwen3-4B base model (4 billion parameters)
- Training compute: Not specified for total GPU hours or hardware details
- Inference speed: Hybrid Reward Mechanism achieves 30% reduction in reward computation time compared to LLM-as-a-Judge baseline through cascade strategy
- Memory footprint: Not reported, though uses DeepSpeed ZeRO-3 optimization for SFT stage
- Deployment practicality: High - framework designed specifically for computational efficiency while maintaining performance. The lightweight reranker (Qwen3-4B) handles the majority of reward computations, with the expensive judge (Qwen3-32B) only used for ambiguous cases. The cascade strategy with optimized thresholds [0.68, 0.98] makes the approach practical for production deployment.
Real-World Applicability
- Real-world deployment: Evaluated on proprietary cloud service customer support dataset with 10k+ real queries, demonstrating practical applicability beyond synthetic benchmarks.
- Production integration: Framework designed with computational efficiency in mind (30% reward computation reduction), suggesting readiness for production deployment in technical service domains.
- Domain specificity: Successfully handles complex technical service scenarios including policy explanations, operation guidance, and information collection - core requirements for real customer service applications.
- Scalability validation: Multi-GT construction pipeline automates the expansion of training data, making it feasible to apply to large-scale real-world datasets without manual annotation.
- Business logic compliance: Consistency Judge achieves 92% alignment with human expert judgments, and Utility Judge achieves 83% alignment, indicating reliable quality control for production use.
Limitations & Failure Modes
- FUNDAMENTAL: Dependency on powerful teacher models (Qwen3-32B, DeepSeek models) for data augmentation and judgment, limiting accessibility and introducing potential bias propagation.
- FUNDAMENTAL: Fixed cascade thresholds in Hybrid Reward Mechanism may become suboptimal as policy distribution shifts during RL training, requiring manual recalibration.
- EVALUATION: Evaluation limited to proprietary dataset, preventing direct reproducibility and comparison with other technical service domains or public benchmarks.
- ENGINEERING: Multi-GT construction quality depends on the capabilities of judge models, which may degrade with domain shift or when applied to different technical service areas.
- ENGINEERING: The framework requires careful hyperparameter tuning (cascade thresholds, mixing weights) that may not transfer across different base models or domains.
Likely failure modes:
- Judge model disagreement: When Consistency and Utility judges provide conflicting assessments, the system may include invalid responses in Multi-GT training data
- Reward hacking: Despite the hybrid mechanism, the policy may still exploit imperfections in the lightweight reranker, leading to responses that score well but lack true utility
SG-CoT: An Ambiguity-Aware Robotic Planning Framework using Scene Graph Representations
Authors: Akshat Rana, Peeyush Agarwal, K. P. S. Rana, Amarjit Malhotra · Institution: Netaji Subhas University of Technology · Category: cs.RO
SG-CoT enables LLM-based robots to detect and resolve ambiguities by iteratively querying scene graph representations, improving success rates through grounded reasoning and targeted clarification questions.
Practical Takeaway: If you’re building LLM-based robotic planners, this framework offers a concrete approach to handle ambiguous instructions by iteratively querying structured scene representations. The key insight is preventing hallucination through explicit grounding - instead of letting the LLM reason from memory, force it to query actual environmental state. The retrieval functions (retrieve_node, retrieve_edge) provide a template for tool-augmented reasoning that could be adapted to your domain. However, invest in robust scene graph construction first, as errors here propagate through the reasoning chain. Consider this approach if your robots operate in environments where asking clarifying questions is safer than guessing, though be prepared for increased computational costs.
Tags: robotics llm-planning scene-graphs ambiguity-resolution multi-agent vision-language human-robot-interaction chain-of-thought
Task & Setting
Robotic planners using large language models (LLMs) face critical challenges when encountering ambiguous situations, whether from underspecified user instructions (“pick up the cup” when multiple cups exist) or environment-induced uncertainty (target objects being absent or multiple). This poses safety risks as robots may execute suboptimal or dangerous actions without seeking clarification.
The task involves generating appropriate action sequences or clarification questions from natural language instructions I and visual observations O. The system must handle three ambiguity types:
- multiplicity ($|X_{\text{match}}| > 1$, where $X_{\text{match}}$ is the set of objects satisfying the instruction's properties),
- absence ($X_{\text{match}} = \emptyset$), and
- underspecification (vague instruction terms).

In multi-agent settings, partial observability creates additional ambiguities, since robot $R_i$ sees only a subset $O_{R_i} \subseteq O$.
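The three ambiguity types can be expressed as a small decision rule over the set of matching objects. This is only a sketch of the definitions: in SG-CoT the decision comes from LLM reasoning over scene graph queries, and the `instruction_is_vague` flag and the order of the checks are assumptions made here.

```python
def classify_ambiguity(matches, instruction_is_vague=False):
    """Map the set of objects matching an instruction to an ambiguity tag."""
    if instruction_is_vague:
        return "underspecified"   # vague wording, regardless of the scene
    if len(matches) == 0:
        return "absence"          # X_match is empty
    if len(matches) > 1:
        return "multiplicity"     # |X_match| > 1
    return "clear"                # exactly one match: safe to act

two_cups = classify_ambiguity({"cup_1", "cup_2"})
no_cup = classify_ambiguity(set())
one_cup = classify_ambiguity({"cup_1"})
vague = classify_ambiguity({"cup_1"}, instruction_is_vague=True)
```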
Success is measured by Success Rate (SR) - correct action execution for unambiguous cases or appropriate clarification for ambiguous ones - and Correct Question Rate (CQR) - whether the generated clarification question matches the ground truth ambiguity type. Experiments use 400 trials in simulated tabletop environments (100 per ambiguity category plus 100 clear scenarios) and multi-agent tasks from LEMMA benchmark.
Architecture & Method
- Scene graph construction: Uses a vision-language model (VLM) like Grounding DINO or Qwen3-VL-2B-Instruct to detect objects and generate a structured graph G = (V, E), where nodes V represent objects with attributes and edges E capture spatial/semantic relationships
- Iterative reasoning framework: LLM equipped with two retrieval functions - retrieve_node(attr_key, attr_val) to query objects by attributes and retrieve_edge(source, target, relation) to query relationships between objects
- Chain-of-thought with grounding: At each turn t, the LLM generates reasoning trace T_t and function call f_t, retrieves results R_t from the scene graph, and updates conversation history H_{t+1} = H_t ⊕ (T_t ⊕ f_t ⊕ R_t)
- Ambiguity detection and clarification: The process continues until the LLM has sufficient context for a definitive action or identifies the ambiguity source and generates a targeted clarification question with the appropriate tag (multiplicity, absence, underspecified)
- Multi-agent communication: Extended with an ask_robot action enabling agents to query each other’s local observations in partially observable environments
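The two retrieval functions and the history update can be sketched with a plain in-memory graph. The function names and argument order follow the description above, but the storage layout, the return types, and the example scene are assumptions; the paper's actual implementation may differ.

```python
class SceneGraph:
    """Toy scene graph: attribute dictionaries for nodes, triples for edges."""

    def __init__(self, nodes, edges):
        self.nodes = nodes   # {node_id: {attr_key: attr_val}}
        self.edges = edges   # list of (source, relation, target) triples

    def retrieve_node(self, attr_key, attr_val):
        """Return ids of all objects whose attribute matches the given value."""
        return [
            node_id for node_id, attrs in self.nodes.items()
            if attrs.get(attr_key) == attr_val
        ]

    def retrieve_edge(self, source, target, relation):
        """Return True if the given relation holds between source and target."""
        return (source, relation, target) in self.edges

graph = SceneGraph(
    nodes={
        "cup_1": {"type": "cup", "color": "red"},
        "cup_2": {"type": "cup", "color": "red"},
        "table_1": {"type": "table", "color": "brown"},
    },
    edges=[("cup_1", "on", "table_1"), ("cup_2", "on", "table_1")],
)

# One reasoning turn: append (trace, call, result) to the history,
# mirroring H_{t+1} = H_t + (T_t, f_t, R_t).
history = []
trace = "The instruction mentions a red cup; check how many exist."
call = ("retrieve_node", "color", "red")
result = graph.retrieve_node("color", "red")
history.append((trace, call, result))
```

Because the result lists two red cups, the next turn would have grounds to emit a multiplicity clarification question rather than pick one arbitrarily.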
Training Recipe
No explicit model training reported - this is a prompting and reasoning framework that uses pre-trained models:
- Vision-language models: Uses existing Qwen3-VL-2B-Instruct and Gemini-2.5-Flash models without additional training
- Scene graph generation: Prompting-based approach using VLM to generate JSON-structured graphs in single pass, avoiding O(n²) complexity of pairwise object analysis
- LLM reasoning: Uses structured prompts with conversation history maintenance, no fine-tuning or reinforcement learning reported
Training details, optimization parameters, and hardware specifications are not reported, as this work focuses on an inference-time reasoning framework rather than model training.
Novelty & Lineage
This work builds on SayPlan (scene graph + LLM planning), CLARA (ambiguity classification), and KnowNo (uncertainty-based clarification). The key novelty is integrating iterative scene graph querying with chain-of-thought reasoning specifically for ambiguity detection and resolution.
Prior works like CLARA focus mainly on instruction-level ambiguity, while SayPlan assumes pre-constructed graphs without clarification capabilities. This paper extends to environment-induced ambiguities and multi-agent partial observability scenarios not addressed by previous approaches.
The core delta is the iterative grounding mechanism that prevents hallucination accumulation and enables pinpointing ambiguity sources for targeted clarification.
Rating: INCREMENTAL - Combines existing techniques (scene graphs, CoT, VLMs) in a novel way but doesn’t introduce fundamentally new architectures or training methods.
Benchmarks & Results
- Single-agent tabletop (SayCan-derived): SG-CoT achieves 72%/66% (SR/CQR) with Qwen vs the best baseline, Inner Monologue, at 65%/55%, an overall gain of 7/11 percentage points
- Multiplicity scenarios: SG-CoT 71%/52% (Qwen) vs Inner Monologue 65%/53%, a 6-point SR gain but a 1-point CQR decrease
- Absence scenarios: SG-CoT 90%/80% (Qwen) vs Inner Monologue 67%/52%, gains of 23/28 percentage points
- Underspecified scenarios: SG-CoT 77%/67% (Qwen) vs Inner Monologue 75%/62%, gains of 2/5 percentage points
- Multi-agent LEMMA benchmark: SG-CoT achieves 59% SR (Qwen) vs ProgPrompt 44%, a 15-point gain in partially observable environments
- Gemini-2.5-Flash consistently outperforms Qwen3-VL-2B across all metrics, with SG-CoT reaching 80%/78% overall SR/CQR vs 72%/66% for Qwen
Compute & Efficiency
- Model size: Uses existing Qwen3-VL-2B-Instruct (2B parameters) and Gemini-2.5-Flash (size not specified) - no custom model training
- Training compute: Not applicable - framework uses pre-trained models with prompting approach
- Inference speed: SG-CoT requires 4.24 LLM calls per episode vs 1.00 for ProgPrompt, with average 1141.72 input tokens and 191.28 output tokens per call, leading to higher latency
- Memory footprint: Higher token usage due to conversation history concatenation and scene graph information storage, but scales linearly with environment complexity
- Deployment practicality: Computationally more expensive than baselines due to iterative reasoning, but complexity doesn’t scale proportionally with environment size as retrieval depth depends on instruction complexity rather than total objects
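To put the inference-cost figures in perspective, the per-call averages above can be multiplied out into an approximate per-episode token budget for SG-CoT. This is back-of-the-envelope arithmetic on the reported averages only; ProgPrompt's per-call token counts are not given here, so no comparison is attempted.

```python
# Averages reported for SG-CoT in the list above.
calls_per_episode = 4.24
input_tokens_per_call = 1141.72
output_tokens_per_call = 191.28

tokens_per_call = input_tokens_per_call + output_tokens_per_call
tokens_per_episode = calls_per_episode * tokens_per_call

print(f"about {tokens_per_call:.0f} tokens per call")
print(f"about {tokens_per_episode:.0f} tokens per episode")
```

That works out to roughly 1,333 tokens per call and about 5,650 tokens per episode, most of which are input tokens from the accumulated conversation history.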
Real-World Applicability
- Evaluation limited to simulation: All experiments conducted in PyBullet-based tabletop environments and LEMMA benchmark simulations - no real robot deployments reported
- Hardware experiments: None reported - uses simulated UR5e robot with 2-finger gripper in controlled tabletop scenarios
- Sim-to-real gap: Authors acknowledge limitation that VLMs may hallucinate with real-world objects and introduce incorrect scene graph edges, potentially hampering reasoning
- Production integration: No deployment results or production use cases discussed
- Real-world challenges: Paper identifies that current VLM-based scene graph generation may not handle complex real-world spatial relationships reliably, limiting immediate practical deployment
Limitations & Failure Modes
- VLM hallucination in scene graph generation - FUNDAMENTAL: relies on single-pass VLM inference for spatial/semantic relationships which can introduce incorrect edges
- Computational overhead - ENGINEERING: requires roughly 4x as many LLM calls as the ProgPrompt baseline (4.24 vs 1.00 per episode) with higher token usage, limiting scalability for complex instructions
- Simulation-only evaluation - EVALUATION: no real-world robot experiments to validate sim-to-real transfer
- Multiplicity detection challenges - FUNDAMENTAL: model may arbitrarily pick one valid option instead of recognizing ambiguity when multiple correct choices exist
- Token usage scaling - ENGINEERING: latency increases with number of ambiguities present in scene, potentially limiting complex scenarios
Failure modes:
- Incorrect scene graph edges leading to wrong reasoning chains
- Over-confidence in multiplicity scenarios causing arbitrary action selection instead of clarification requests
OMNIFLOW: A Physics-Grounded Multimodal Agent for Generalized Scientific Reasoning
Authors: Hao Wu, Yongheng Zhang, Yuan Gao, Fan Xu et al. (10 authors) · Institution: Tsinghua University, Tencent · Category: cs.LG
OMNIFLOW introduces a training-free neuro-symbolic architecture that grounds frozen multimodal LLMs in physical laws for interpretable scientific reasoning across fluid dynamics applications.
Practical Takeaway: Research engineers working on scientific AI should consider this training-free paradigm for domains requiring physical consistency and interpretability. The key insight is decoupling numerical simulation from cognitive reasoning rather than end-to-end training. The Semantic-Symbolic Alignment mechanism and Physics-Guided Chain-of-Thought could be adapted to other scientific domains beyond fluid dynamics. However, be aware of the inference latency trade-offs and ensure your application can benefit from the interpretability gains. The approach is most valuable when physical consistency and explainability matter more than raw speed.
Tags: physics-informed-ai scientific-computing fluid-dynamics weather-forecasting multimodal-llms neuro-symbolic interpretable-ai pde-solving
Task & Setting
This work addresses the challenge of applying Large Language Models (LLMs) to physical systems governed by Partial Differential Equations (PDEs), where traditional approaches either produce non-physical hallucinations or require expensive domain-specific fine-tuning that limits cross-domain generalization.
The task involves multimodal scientific reasoning for fluid dynamics forecasting. Inputs include heterogeneous data streams (satellite imagery, buoy readings) containing high-dimensional flow tensors. The system must produce:
- accurate spatiotemporal forecasts of physical states, and
- interpretable analysis reports with physical grounding and decision logic.
The formal objective combines prediction accuracy with physical consistency:
\[\mathcal{L} = \mathcal{L}_{pred}(x_{t+\tau}, \hat{x}_{t+\tau}) + \lambda \mathcal{L}_{physics}(\hat{x}_{t+\tau})\]
where $\mathcal{L}_{physics}$ enforces conservation laws such as mass conservation, $\nabla \cdot v = 0$.
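To make the two-term objective concrete, here is a minimal NumPy sketch for a 2D velocity field. The specific forms are assumptions, since the summary does not give them: mean squared error stands in for $\mathcal{L}_{pred}$, the mean squared finite-difference divergence stands in for $\mathcal{L}_{physics}$, and `lam=0.1` is an arbitrary illustrative weight.

```python
import numpy as np


def divergence(u, v, dx=1.0, dy=1.0):
    """Finite-difference divergence of a 2D field; arrays are indexed [y, x]."""
    du_dx = np.gradient(u, dx, axis=1)
    dv_dy = np.gradient(v, dy, axis=0)
    return du_dx + dv_dy


def total_loss(pred, target, lam=0.1):
    """pred, target: arrays of shape (2, H, W) holding the (u, v) components.

    Returns (total, l_pred, l_phys). MSE and squared divergence are
    assumed stand-ins for the two loss terms.
    """
    l_pred = float(np.mean((pred - target) ** 2))
    l_phys = float(np.mean(divergence(pred[0], pred[1]) ** 2))
    return l_pred + lam * l_phys, l_pred, l_phys


# Two toy fields on an 8x8 grid.
y, x = np.mgrid[0:8, 0:8].astype(float)
rotation = np.stack([-y, x])  # rigid rotation: divergence-free
source = np.stack([x, y])     # radial outflow: divergence 2 everywhere

print(total_loss(rotation, rotation))  # physics term vanishes
print(total_loss(source, source))      # physics term penalizes the outflow
```

Both fields match their targets exactly, so only the physics term separates them: the rotation incurs no penalty, while the radial field is penalized even though its prediction error is zero.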
Success is measured through dual evaluation:
- Physical accuracy via RMSE, SSIM, PSNR on predicted tensor fields, and
- Interpretive quality via Mechanism F1 score measuring physical grounding accuracy of generated reports.
The paper evaluates on three benchmarks: 2D Turbulence (128×128, 100 timesteps, microscopic flow), SEVIR (384×384 regional weather), and ERA5 (180×360 global climate, 200-day forecasts with 21 variables).
Architecture & Method
- Physics Perception Loop: Neural Earth Simulator (NES) based on improved Diffusion Transformer (DiT) generates ensemble forecasts via latent space perturbation with Gaussian noise injection: $z^{(k)}_{init} = E(x_{init}) + \lambda \cdot \xi^{(k)}$ where $\xi^{(k)} \sim \mathcal{N}(0, I)$
- Visual Symbolic Projector: Cross-attention mechanism extracts topological features from visual encodings using learnable query embeddings $Q \in \mathbb{R}^{N \times d}$: $H_{vis} = \text{Softmax}(Q(vW_K)^T/\sqrt{d})(vW_V)$
- Semantic-Symbolic Alignment: Maximizes mutual information between visual tokens and textual descriptions via contrastive loss: $\mathcal{L}_{align} = -\sum_{i=1}^N \log \frac{\exp(\text{sim}(h_i, t_{pos})/\tau)}{\sum_j \exp(\text{sim}(h_i, t_j)/\tau)}$
- Agentic Reasoning Core: Gemini 3 Flash executes Physics-Guided Chain-of-Thought (PG-CoT) with ReAct strategy, selecting actions $a_t \sim \pi(a_t\lvert M_t, H_{vis}, I)$ from Retrieve/Simulate/Reason
- Physics Consistency Constraint: Critic function validates trajectories against conservation laws, forcing backtracking when violations detected
- Counterfactual Feedback Loop: Active probing mechanism triggers alternative scenario simulation when uncertainty exceeds threshold: $P_{counter} = \text{NES}(x'_{init}\lvert \text{do}(\text{condition}))$
- Hierarchical Knowledge Retrieval: RAG system accesses stratified vector database with physical laws ($K_{phy}$), protocols ($K_{prot}$), and historical reports ($K_{hist}$)
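The projector and alignment formulas above can be sketched in a few lines of NumPy. This is an illustration, not the authors' implementation: the dimensions and random weights are arbitrary, cosine similarity is assumed for $\text{sim}$, and the positive text embedding for $h_i$ is assumed to be the one at the same index. Only $\tau = 0.07$ comes from the paper summary.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def visual_symbolic_projector(Q, v, W_K, W_V):
    """H_vis = Softmax(Q (v W_K)^T / sqrt(d)) (v W_V).

    Q: (N, d) learnable queries; v: (M, d_v) visual tokens.
    """
    d = Q.shape[1]
    keys = v @ W_K                                    # (M, d)
    values = v @ W_V                                  # (M, d)
    attn = softmax(Q @ keys.T / np.sqrt(d), axis=-1)  # (N, M)
    return attn @ values, attn


def alignment_loss(H, T, tau=0.07):
    """Contrastive loss summed over i; the positive for h_i is assumed to be t_i."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    logits = Hn @ Tn.T / tau                          # cosine similarity / tau
    m = logits.max(axis=1, keepdims=True)
    log_z = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return float(-np.trace(logits - log_z))


rng = np.random.default_rng(0)
N, M, d_v, d = 4, 16, 32, 8                           # illustrative sizes
Q = rng.normal(size=(N, d))
v = rng.normal(size=(M, d_v))
W_K = rng.normal(size=(d_v, d))
W_V = rng.normal(size=(d_v, d))

H_vis, attn = visual_symbolic_projector(Q, v, W_K, W_V)
loss_matched = alignment_loss(H_vis, H_vis.copy())
loss_shuffled = alignment_loss(H_vis, np.roll(H_vis, 1, axis=0))
print(H_vis.shape, loss_matched < loss_shuffled)
```

The fixed number of queries $N$ is what turns a variable-length set of visual tokens into a fixed-size set of descriptors, and the loss is lower when each descriptor is paired with its own text embedding than when the pairing is shuffled.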
Training Recipe
- Neural Earth Simulator (NES) Training: Pre-trained Diffusion Transformer on fluid dynamics data, specific training details not reported for optimizer, learning rate, or hardware requirements
- Visual Symbolic Projector Training: Trained via contrastive learning to align visual features with text embeddings using temperature parameter τ = 0.07, other training specifics not reported
- Core LLM (Gemini 3 Flash): Frozen foundation model, no domain-specific parameter updates or fine-tuning performed
- Knowledge Base Construction: Hierarchical vector database populated with domain literature, operational protocols, and historical reports, construction methodology not detailed
- System Integration: Training-free neuro-symbolic framework that orchestrates pre-trained components without end-to-end parameter optimization
Training data scale, wall-clock time, and hardware requirements not reported for most components.
Novelty & Lineage
The core novelty lies in the training-free neuro-symbolic architecture that grounds frozen multimodal LLMs in physical laws without parameter updates. This contrasts with prior approaches that either:
- use specialized deep learning surrogates (FNO 2020, GraphCast 2023) lacking interpretability, or
- fine-tune LLMs on scientific data (costly and prone to catastrophic forgetting).
Key innovations include: Semantic-Symbolic Alignment mechanism projecting flow tensors to linguistic descriptors, Physics-Guided Chain-of-Thought with reflexive consistency checking, and Counterfactual Feedback Loop enabling active causal probing.
The approach builds on recent work in multimodal LLMs (LLaVA 2024), physics-informed neural networks (PINNs 2019), and agentic reasoning with tool use (ReAct 2022), but uniquely combines these into a training-free physics-grounded agent.
Rating: SIGNIFICANT - Novel architectural paradigm with clear technical contributions, though builds incrementally on established components.
Benchmarks & Results
- 2D Turbulence: RMSE 0.582±0.008 (vs EarthFarseer 0.654), SSIM 0.715±0.006 (vs EarthFarseer 0.642), PSNR 28.66±0.10 (vs EarthFarseer 27.45) - significant improvements
- ERA5 Global Weather: RMSE 0.552±0.005 (vs EarthFarseer 0.615), SSIM 0.931±0.002 (vs EarthFarseer 0.895), PSNR 32.11±0.06 (vs EarthFarseer 30.22) - consistent improvements
- SEVIR Regional Weather: RMSE 0.405±0.004 (vs EarthFarseer 0.437), SSIM 0.882±0.003 (vs EarthFarseer 0.842), PSNR 31.50±0.08 (vs EarthFarseer 30.15) - modest improvements
- Zero-shot comparison vs foundation models: Dramatically outperforms ChatGPT-Images, Seedream 4.5, Banana Pro on all benchmarks with 30-40% RMSE reduction and 90-128% SSIM improvement
- Reasoning Quality: Mechanism F1 score 83.2% vs Qwen3-VL series, demonstrating superior physical grounding in generated reports
Results consistently favor OMNIFLOW across all scales from microscopic to global dynamics.
Compute & Efficiency
- Model size: Not explicitly reported, but uses frozen Gemini 3 Flash plus lightweight DiT-based simulator
- Training compute: Minimal due to training-free architecture, specific GPU hours not reported
- Inference speed: Higher latency than end-to-end models due to iterative reflexive loops and counterfactual probing, specific latency numbers not provided
- Memory footprint: Not quantified, but includes frozen LLM, visual encoder, and knowledge retrieval database
- Deployment practicality: Limited by inference latency from multi-step reasoning workflow, though offers interpretability benefits for scientific decision-making applications
Real-World Applicability
- Marine Heatwave Case Study: Demonstrates end-to-end application on January 2021 global ocean data with actionable outputs including fishery alerts and shipping route optimization
- Multi-scale Validation: Tested across real-world datasets from microscopic turbulence to global weather patterns, not just synthetic benchmarks
- Operational Integration: Generates structured reports compatible with emergency protocols and decision support systems
- Physical Consistency: Enforces conservation laws and retrieves operational standards, making outputs suitable for scientific and regulatory applications
- However, no actual deployment results in operational weather forecasting systems or real-time decision-making environments reported
Limitations & Failure Modes
- ENGINEERING: Increased inference latency due to iterative reflexive loops and counterfactual probing limits real-time deployment applications
- FUNDAMENTAL: Reasoning accuracy remains coupled to underlying neural simulator fidelity - biases in DiT propagate through reasoning chain
- ENGINEERING: Representing fine-grained sub-grid dynamics through linguistic descriptors remains challenging, may lose critical physical details
- EVALUATION: Limited evaluation of failure modes when physics constraints conflict or when operating far outside training distribution
- ENGINEERING: Scalability concerns for larger ensemble sizes or higher-resolution simulations not addressed
Failure modes:
- System may produce overconfident predictions when simulator uncertainty is underestimated
- Knowledge retrieval may surface irrelevant or conflicting physical principles leading to reasoning inconsistencies.
V-Dreamer: Automating Robotic Simulation and Trajectory Synthesis via Video Generation Priors
Authors: Songjia He, Zixuan Chen, Hongyu Ding, Dian Shao et al. (8 authors) · Institution: Nanjing University · Category: cs.RO
V-Dreamer automatically generates diverse robot manipulation datasets from natural language by combining LLM scene planning, video generation for motion priors, and visual-kinematic alignment for executable trajectories.
Practical Takeaway: V-Dreamer represents a breakthrough in automated robotic data generation by combining multiple foundation models into a practical pipeline. Research engineers should pay attention to the video-prior approach for trajectory synthesis and the robust visual-kinematic alignment techniques. The 600 trajectories/hour throughput on consumer hardware makes this potentially game-changing for robotics labs lacking large-scale real-world data collection capabilities. However, current limitations to rigid-body tabletop tasks and need for careful domain alignment suggest this is still early-stage technology requiring further development for broader applications.
Tags: robotic_manipulation simulation_synthesis video_generation sim_to_real imitation_learning foundation_models automated_data_generation zero_shot_transfer
Task & Setting
Training generalist robots requires large-scale, diverse manipulation datasets, but real-world data collection is expensive and existing simulators are limited by fixed asset libraries. The core challenge is automatically generating both physically realistic 3D environments and executable robot trajectories from natural language instructions.
The task takes natural language instructions (e.g., “Put the bowl onto the plate in a simple living room”) as input and produces:
- a physics-validated 3D simulation scene with collision-free object layouts, and
- executable robot end-effector trajectories $\tau = \{(p_t, q_t)\}_{t=0}^{T}$, where $p_t$ denotes continuous poses and $q_t$ represents discrete gripper states.
The method must handle open-vocabulary objects and environments without manual asset curation.
Success is measured by:
- simulation policy success rate on unseen objects,
- zero-shot sim-to-real transfer success rate on physical hardware, and
- trajectory generation throughput (600 trajectories/hour reported).
Evaluation uses tabletop pick-and-place tasks with 40 target objects, 40 receptacles, and 20 environments for training, tested on 10 held-out novel mugs.
Architecture & Method
- Semantic-to-Physics Scene Synthesis: Uses Qwen-Max LLM for semantic parsing into JSON asset manifests, then Flux diffusion model generates 2D assets with SAM3 segmentation for background removal.
- Memory-Efficient 3D Reconstruction: Applies SAM3D with dynamic GPU/CPU offloading to lift 2D images to 3D meshes, optimizing VRAM usage for consumer hardware.
- Physics-Grounded Layout: Combines LLM spatial reasoning with Genesis physics engine for collision detection and gravity alignment, ensuring stable object placement via AABB checking.
- Video-Prior Trajectory Generation: Captures stabilized scene as initial frame $I_0$, applies style refinement via Qwen-Image-Edit, then feeds $I'_0$ to Wan2.2-I2V-Flash video generation with negative prompts to suppress non-physical artifacts.
- Sim-to-Gen Alignment: Uses VGGT for depth estimation, CoTracker3 for dense 2D point tracking within SAM3-generated object masks, TAPIP3D for 3D motion lifting, and Graspgen for grasp pose generation with inverse kinematics mapping.
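The AABB checking mentioned in the layout step reduces to an interval-overlap test on each axis. The sketch below is a standalone illustration in plain Python; it does not call the Genesis API, and the class name, function names, and box coordinates are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class AABB:
    """Axis-aligned bounding box given by its min and max corners."""
    min_corner: Vec3
    max_corner: Vec3


def overlaps(a: AABB, b: AABB) -> bool:
    """True if the boxes intersect with positive extent on all three axes.

    Strict inequalities mean boxes that only touch are not counted as colliding.
    """
    return all(
        a.min_corner[i] < b.max_corner[i] and b.min_corner[i] < a.max_corner[i]
        for i in range(3)
    )


def layout_is_collision_free(boxes) -> bool:
    """Check every unordered pair of boxes in the proposed layout."""
    return not any(overlaps(a, b) for a, b in combinations(boxes, 2))


# Hypothetical tabletop layout, in metres.
plate = AABB((0.00, 0.00, 0.00), (0.20, 0.20, 0.02))
bowl = AABB((0.30, 0.00, 0.00), (0.42, 0.12, 0.06))
mug = AABB((0.15, 0.15, 0.00), (0.23, 0.23, 0.10))

print(layout_is_collision_free([plate, bowl]))       # True
print(layout_is_collision_free([plate, bowl, mug]))  # False: mug overlaps plate
```

A check like this only rejects interpenetrating placements; per the description above, stability under gravity is handled by the physics engine rather than by bounding boxes.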
Training Recipe
- Scene Generation: No training required - uses pre-trained foundation models (Qwen-Max, Flux, SAM3, SAM3D) with procedural assembly and physics validation.
- Trajectory Synthesis: Uses pre-trained Wan2.2-I2V-Flash video model with targeted negative prompting, CoTracker3 for tracking, VGGT for depth, no additional training.
- Policy Training: ACT (Action Chunking with Transformers) trained on synthesized demonstrations using standard imitation learning, specific training details not reported.
- Hardware: 8×RTX 4090 workstation for generation pipeline, achieves 600 LeRobot-formatted trajectories per hour throughput.
- Real-world deployment uses zero-shot transfer without target-domain fine-tuning, relying solely on synthetic data.
Novelty & Lineage
This is the first fully automated end-to-end pipeline combining open-vocabulary scene synthesis, video-prior trajectory generation, and real-world deployment. Prior work: GenSim (2024) and RoboGen (2024) use LLMs for task generation but rely on fixed assets; Holodeck (2024) does 3D scene composition but lacks behavior synthesis; TraceGen and video-based planners like UniPi (2022) work in 2D without 3D grounding. The key delta is the complete automation from language to executable trajectories using video generation as motion priors with robust visual-kinematic alignment. This represents a SIGNIFICANT contribution by unifying previously separate capabilities into a practical system.
Benchmarks & Results
- Simulation Zero-shot Generalization: Success rate on 10 novel mugs with 2,500 synthetic demonstrations achieves 36.96% (baseline with 500 demos: 3.46%).
- Real-world One-shot Transfer: 50% success with visual distractors, 20% with novel objects (apple, mango, pear, bottle), 15% with spatial perturbations, 0% with sensor occlusion.
- Data Scaling: Performance scales from 3.46% (500 demos) to 25.90% (1,000 demos) to peak 36.96% (2,500 demos).
No comparison to prior automated synthesis methods provided due to lack of comparable end-to-end systems. Results focus on demonstrating system functionality rather than beating specific benchmarks.
Compute & Efficiency
- Model size: Uses multiple pre-trained foundation models (Qwen-Max, Flux, SAM3, Wan2.2-I2V-Flash) - individual parameter counts not reported.
- Training compute: 8×RTX 4090 workstation, generating 600 trajectories per hour.
- Inference speed: Real-time policy execution on 6-DoF Piper arm with RGB-D camera input.
- Memory footprint: Dynamic GPU/CPU offloading implemented for 3D reconstruction to handle VRAM constraints on consumer hardware.
- Deployment practicality: Demonstrated on consumer-grade hardware (RTX 4090s) with successful real-world transfer, indicating practical accessibility.
Real-World Applicability
- Hardware deployment: Successfully tested on 6-DoF Piper robotic arm (left arm of Cobot-Magic platform) with parallel-jaw gripper and dual Orbbec DaBai cameras.
- Sim-to-real protocol: Zero-shot transfer using photo-conditioned scene alignment, black tape on robot links to match simulation appearance, no real-world fine-tuning required.
- Real-world robustness: Tested under visual distractors, novel objects (fruits, bottles), spatial perturbations, and sensor occlusion conditions.
- One-shot learning: Policy trained on single synthetic demonstration achieves meaningful real-world manipulation, demonstrating extreme data efficiency.
- Practical integration: Uses standard robotics formats (LeRobot) and established frameworks (ACT) for easy adoption.
Limitations & Failure Modes
- FUNDAMENTAL: Limited to rigid-body tabletop manipulation, cannot handle articulated or deformable objects.
- ENGINEERING: Lacks automated physics-aware trajectory filtering, leading to potential low-quality generations requiring manual intervention.
- FUNDAMENTAL: Video generation models may produce physically inconsistent motions despite negative prompting constraints.
- EVALUATION: No comparison to other automated data synthesis methods due to lack of comparable end-to-end baselines.
- ENGINEERING: Requires careful camera calibration and visual domain alignment (black tape) for sim-to-real transfer.
Failure modes:
- Complete failure under sensor occlusion (0% success) indicating heavy reliance on visual feedback
- Performance degradation with spatial perturbations suggesting limited robustness to layout variations.
Anticipatory Planning for Multimodal AI Agents
Authors: Yongyuan Liang, Shijie Zhou, Yu Gu, Hao Tan et al. (9 authors) · Institution: University of Maryland, Adobe Research · Category: cs.AI
TraceR1 introduces a two-stage RL framework that trains multimodal agents to forecast multi-step trajectories before execution, improving planning coherence and achieving modest gains over reactive baselines on GUI and tool-use benchmarks.
Practical Takeaway: The key insight is that training multimodal agents with trajectory-level rewards before step-level execution refinement can improve planning coherence for long-horizon tasks. The two-stage approach (trajectory alignment followed by grounded fine-tuning) provides a general recipe that could be applied to other agent training scenarios. However, the improvements are modest relative to the added complexity, so practitioners should weigh whether the 3-6 percentage point gains justify the more complex training pipeline. The approach is most promising for tasks requiring multi-step reasoning where global consistency matters more than immediate action accuracy.
Tags: multimodal_agents GUI_automation reinforcement_learning anticipatory_planning trajectory_optimization tool_use computer_vision grounded_execution
Task & Setting
Building multimodal AI agents that can effectively plan and execute long-horizon tasks across GUI environments and tool-use scenarios remains challenging because most existing systems are reactive, making decisions based only on current observations without anticipating future states or long-term consequences. This reactive approach leads to planning incoherence and prevents reliable completion of multi-step tasks that require coordinated sequences of actions.
The task involves training agents to take multimodal inputs (screenshots, user instructions, interaction history) and generate both immediate actions and multi-step trajectory forecasts. Input consists of current observation $s_t$, user instruction $u$, and K-step interaction history $\tau_{1:t-1}$. Output includes predicted action $\hat{a}_t$, step instruction $\hat{g}_t$, and future trajectory $\hat{\tau}_{t:T}$ containing action sequences. The training objective combines trajectory-level alignment with discounted rewards:
\[R(\hat{\tau}, \tau^*) = \sum_{t=1}^T \gamma^{t-1} r_t\]
where $r_t = \lambda_{align} \text{sim}(\hat{a}_t, a_t^*) - \lambda_{rep} \text{rep}(\hat{a}_{1:t})$.
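A small sketch of how such a discounted trajectory reward could be computed. The summary does not define $\text{sim}$ or $\text{rep}$, so this assumes exact-match similarity and a simple "this action already appeared earlier in the prediction" indicator. The weights `lam_align` and `lam_rep` are placeholders; only $\gamma = 0.8$ is a reported value.

```python
def trajectory_reward(pred, ref, gamma=0.8, lam_align=1.0, lam_rep=0.5):
    """Discounted sum of per-step rewards over a predicted action sequence.

    Assumptions (not from the paper): sim is exact match, rep flags an
    action that already occurred earlier in the predicted sequence.
    Steps beyond the shorter of the two sequences are ignored.
    """
    total = 0.0
    for t, (a_hat, a_star) in enumerate(zip(pred, ref)):  # t = 0 is step 1
        sim = 1.0 if a_hat == a_star else 0.0
        rep = 1.0 if a_hat in pred[:t] else 0.0
        total += gamma ** t * (lam_align * sim - lam_rep * rep)
    return total


reference = ["open_app", "tap_search", "type_query"]
faithful = ["open_app", "tap_search", "type_query"]
looping = ["open_app", "open_app", "open_app"]

print(trajectory_reward(faithful, reference))  # about 2.44 = 1 + 0.8 + 0.64
print(trajectory_reward(looping, reference))   # about 0.28 = 1 - 0.4 - 0.32
```

With $\gamma < 1$, errors in the first few predicted steps cost more than errors further out, which is consistent with the near-term part of a forecast being the part that is actually executed.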
Success is measured by task completion rates on GUI benchmarks (AndroidWorld, OSWorld-Verified) and answer accuracy on tool-use benchmarks (GAIA, GTA). Step-level precision is evaluated using coordinate matching for GUI actions and answer correctness for tool calls.
The paper evaluates across 7 benchmarks spanning online GUI environments (AndroidWorld: 116 tasks across 20 apps, OSWorld-Verified: long-horizon desktop operations), offline GUI benchmarks (AndroidControl-High, GUI-Odyssey: 203 tasks across 6 apps, Multimodal-Mind2Web), and multimodal reasoning tasks (GAIA: 446 tasks with document understanding, GTA: 229 visual reasoning tasks).
Architecture & Method
- Base model: Qwen3-VL-8B-Thinking, a vision-language model extended for multimodal agent tasks
- Two-stage reinforcement learning framework using Group-Relative Policy Optimization (GRPO)
- Stage 1 - Anticipatory Trajectory Optimization: Model predicts multi-step future trajectories and is trained with trajectory-level rewards measuring global consistency between predicted and reference action sequences
- Trajectory alignment reward with temporal discounting: $R(\hat{\tau}, \tau^*) = \sum_{t=1}^T \gamma^{t-1} r_t$ where $r_t = \lambda_{align} \text{sim}(\hat{a}_t, a_t^*) - \lambda_{rep} \text{rep}(\hat{a}_{1:t})$
- GRPO policy update: $\nabla_\theta J(\theta) = E_{\hat{\tau}}[\hat{A}(\hat{\tau}, \tau^*) \nabla_\theta \log \pi_\theta(\hat{\tau}\lvert u, s_t, \tau_{1:t-1})]$
- Stage 2 - Grounded Reinforcement Fine-tuning: Uses execution feedback from frozen tool agents to refine step-level accuracy
- Grounded reward function: $r_t^G = \mathbf{1}[\text{coord match}]$ for GUI steps, $\mathbf{1}[\text{answer match}]$ for tool-calling steps
- Inference operates in plan-act loop: predict multi-step trajectory, execute only first action, receive environment feedback, re-plan
The core technical contribution is explicit training of anticipatory reasoning through trajectory-level RL combined with grounded execution refinement, moving beyond reactive decision-making to enable foresight and global planning coherence.
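The plan-act inference loop can be sketched as a generic receding-horizon controller. Everything here is a toy stand-in: `planner` and `env_step` are hypothetical callables, and the counter environment exists only to make the example runnable; none of it reflects the TraceR1 code or its interfaces.

```python
def plan_act_loop(planner, env_step, observation, instruction, max_steps=10):
    """Forecast a multi-step plan, execute only its first action, then re-plan."""
    history = []
    for _ in range(max_steps):
        plan = planner(observation, instruction, history)  # predicted action list
        if not plan:
            break
        action = plan[0]                                   # discard the rest
        observation, done = env_step(action)
        history.append(action)
        if done:
            break
    return history


def make_counter_env(target):
    """Toy environment: a counter that must reach `target`."""
    state = {"value": 0}

    def env_step(action):
        if action == "inc":
            state["value"] += 1
        return state["value"], state["value"] >= target

    return env_step


def toy_planner(observation, instruction, history):
    """Toy planner: forecast all remaining increments up to the goal."""
    return ["inc"] * max(instruction - observation, 0)


executed = plan_act_loop(toy_planner, make_counter_env(3), observation=0, instruction=3)
print(executed)  # ['inc', 'inc', 'inc']
```

Each iteration forecasts the full remaining trajectory but commits to a single step, so the forecast shapes the next action while later steps can be revised after new environment feedback.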
Training Recipe
- Initialization: Start from Qwen3-VL-8B-Thinking pretrained model
- Stage 1 Trajectory-Level RL: Train on large-scale agent trajectory datasets (AgentNet, AndroidControl, GUI-Odyssey, Multimodal-Mind2Web, AgentTrek) using GRPO with trajectory alignment rewards, AdamW optimizer at 1e-6 learning rate, global batch size 128, 143 training steps, temporal discount γ=0.8
- Stage 2 Grounded Fine-tuning: Use same trajectory datasets but execute first predicted action through frozen tool agents (UI-TARS-7B, UI-TARS-1.5-7B, Qwen3-VL-32B-Thinking), GRPO optimization with grounded rewards based on coordinate/answer matching, same hyperparameters as Stage 1
- Tool-use training: Stage 1 uses tool-use trajectory dataset from T3-Agent toolbox, Stage 2 applies grounded RFT with executable tools from T3-Agent framework
- Training framework: EasyR1 framework for reinforcement learning implementation
- Hardware and wall-clock time: not reported
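Both stages use GRPO, whose core step is normalizing each sampled trajectory's reward against the other samples drawn for the same prompt. The sketch below shows that group-relative advantage as it is commonly formulated. The population standard deviation and the `eps` term are our choices, and the summary does not say whether TraceR1 modifies this step.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt.

    Uses the population standard deviation; eps guards against a
    zero-variance group (an implementation choice, not from the paper).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Hypothetical rewards for four trajectories sampled for one instruction.
group_rewards = [2.44, 0.28, 1.0, 1.0]
advantages = group_relative_advantages(group_rewards)
print(np.round(advantages, 3))
```

Trajectories scoring above the group mean get positive advantages and are reinforced; a group where every sample scores the same yields zero advantages and no learning signal.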
Novelty & Lineage
The work builds on recent GUI agent frameworks (Agent S/S2 2024, UI-TARS 2025, GTA1 2025) and R1-style reasoning approaches (GUI-R1 2025, InfiGUI-R1 2025). Prior work focuses on reactive planning with proprietary models or single-stage RL with step-level rewards.
The specific delta is the two-stage RL framework that explicitly trains anticipatory reasoning through trajectory-level rewards before grounding with execution feedback. Unlike existing methods that optimize step-level actions in isolation, this approach trains models to forecast multi-step sequences and optimize for global trajectory coherence.
The novelty is combining trajectory-level RL (Stage 1) with grounded execution refinement (Stage 2), bridging high-level anticipatory planning with low-level precision. This differs from purely reactive approaches or single-stage RL methods.
Rating: SIGNIFICANT - introduces a new training paradigm for multimodal agents with consistent but modest empirical improvements, though builds incrementally on existing RL and GUI agent foundations.
Benchmarks & Results
- AndroidWorld: 64.8% success rate vs best open-source 61.4% (Qwen3-VL-32B), improvement of +3.4 percentage points
- OSWorld-Verified: 41.2% success rate vs best open-source 38.1% (Qwen3-VL-235B), improvement of +3.1 percentage points
- AndroidControl-High: 75.3% step success rate vs best open-source 74.7% (UI-TARS-32B), improvement of +0.6 percentage points
- GUI-Odyssey: 88.2% step success rate vs best open-source 88.6% (UI-TARS-32B), a slight decrease of 0.4 percentage points
- Multimodal-Mind2Web: 65.3% step success rate vs best open-source 64.7% (UI-TARS-32B), improvement of +0.6 percentage points
- GAIA: 40.2% answer accuracy vs GPT-4o 33.4%, improvement of +6.8 percentage points over proprietary baseline
- GTA: 56.7% answer accuracy with 65.7% tool accuracy vs best open-source T3-Agent 53.8% answer accuracy
Results show consistent but modest improvements on most benchmarks, with stronger gains on reasoning-heavy tasks (GAIA) and online GUI environments compared to offline step-level tasks.
Compute & Efficiency
- Model size: 8B parameters for TraceR1 planner, works with various executor models (7B to 32B parameters)
- Training compute: not reported (GPU hours, hardware specifications not provided)
- Inference speed/latency: not reported, operates in plan-act loop requiring multiple forward passes
- Memory footprint: not reported, though uses frozen tool agents suggesting memory overhead from multiple models
- Deployment practicality: moderate - requires coordination between planner and executor models, two-stage training pipeline adds complexity but achieves performance comparable to proprietary systems with open-source components
Real-World Applicability
- Evaluated on AndroidWorld using live Android emulator with 116 real-world mobile application tasks across 20 apps
- OSWorld-Verified tests on actual desktop environments with dynamic interactions and real GUI interfaces
- Tool-use evaluation uses executable tools from T3-Agent toolbox with real file processing (PPTX, PDF, XLSX documents)
- No production deployment results or sim-to-real analysis reported
- GUI experiments involve actual screenshot-to-action execution rather than purely synthetic environments
- Method designed for real GUI environments but evaluation limited to controlled benchmark settings
Limitations & Failure Modes
- FUNDAMENTAL: Short-horizon trajectory updates provide only local corrections and cannot reshape agent’s understanding of long-term feasibility or task structure
- ENGINEERING: Requires two-stage training pipeline increasing complexity compared to single-stage approaches
- ENGINEERING: Performance improvements are modest (typically 3-6 percentage points) despite added training complexity
- EVALUATION: Limited analysis of failure modes or breakdown of where anticipatory planning helps vs. hurts performance
- ENGINEERING: Relies on frozen tool agents for grounding, creating dependency on separate executor models
- FUNDAMENTAL: Trajectory forecasting becomes noisy for very long horizons (T>10), limiting look-ahead capability
Likely failure modes:
- Over-optimistic planning when predicted trajectories don’t account for environment stochasticity
- Trajectory-level rewards may not align well with actual task success in complex multi-step scenarios.
Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning
Authors: Haokun Zhao, Wanshi Xu, Haidong Yuan, Songjun Cao et al. (6 authors) · Institution: Fudan University, Peking University, Tencent Youtu Lab · Category: cs.AI
This paper introduces A2PO, a reinforcement learning framework that teaches multimodal models strategic visual construction for geometric reasoning through adaptive reward shaping that considers both timing and quality of auxiliary constructions.
Practical Takeaway: If you’re working on multimodal reasoning for mathematics or geometry, this paper demonstrates that strategic visual construction significantly improves problem-solving performance, but current approaches are limited by visual generation capabilities. The key insight is that interleaved visual-textual reasoning outperforms single-modality approaches, and perplexity can serve as a quality signal for geometric constructions. Consider implementing the tri-partition sampling strategy and adaptive reward shaping for geometry-focused RL training, but be aware that practical deployment currently requires high-quality pre-existing visual aids rather than autonomous visual generation.
Tags: geometric_reasoning multimodal_reasoning reinforcement_learning visual_chain_of_thought policy_optimization auxiliary_construction mathematics visual_text_interleaving
Task & Setting
- Real-world context: Geometric problem solving requires “thinking with constructions,” the dynamic manipulation of visual aids to bridge gaps between problem conditions and solutions. However, existing multimodal large language models (MLLMs) are confined to passive inference with static diagrams, lacking strategic knowledge of when and how to construct effective visual aids like auxiliary lines.
- Task definition: The input consists of geometry problems with textual descriptions and initial diagrams (512×512 resolution). The task requires generating step-by-step solutions that strategically incorporate auxiliary visual constructions. The objective is to maximize problem-solving accuracy through interleaved visual-textual reasoning:
\[\max_{\pi} \mathbb{E}_{q \sim D} [R(\text{solution}_{\pi}(q, I_{\text{orig}}))]\]
where $q$ is a geometry problem, $I_{\text{orig}}$ is the initial diagram, and $R$ measures solution correctness.
- Evaluation criteria: Success is measured by accuracy on geometry problem benchmarks. The paper uses exact match accuracy as the primary metric, with additional analysis of reasoning perplexity (PPL) as an indicator of solution quality.
- The paper introduces GeoAux-Bench comprising 4,334 geometry problems with 8,470 diagrams. Each problem explicitly aligns textual construction steps ($T_{aux}$) with corresponding visual updates ($I_{aux}$), creating precise interleaved mappings for training and evaluation.
Architecture & Method
- The method builds on Group Relative Policy Optimization (GRPO) using Qwen2.5-VL as the backbone multimodal model with frozen vision tower during training.
- Tri-Partition Sampling creates three trajectory subsets: Mandatory ($O^+$) with enforced auxiliary constructions via prefix forcing, Prohibited ($O^-$) with masked auxiliary tokens, and Natural ($O$) with autonomous sampling.
- Visual Re-prompting mechanism detects completed auxiliary commands and injects ground-truth visual aids when constructions match expected patterns, simulating interleaved visual feedback.
- Adaptive Reward Shaping uses a composite reward function:
\[R(o) = w_1 r_{acc} + w_2 r_{fmt} + w_3 r_{time} + w_4 r_{qual}\]
- Timing Reward regulates construction necessity:
\[r_{time}(o) = I_{aux}(o) \cdot \begin{cases} 1 & \text{if } \Delta > \tau \\ -1 & \text{if } \Delta < -\tau \\ 0 & \text{otherwise} \end{cases}\]
- Quality Reward favors low-perplexity constructions:
\[r_{qual}(o) = I_{aux}(o) \cdot r_{acc}(o) \cdot \mathbf{1}[\text{PPL}(o) < \bar{P} + \delta]\]
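The reward terms above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the weights, the threshold `tau`, and the perplexity margin are placeholder values, `r_acc` and `r_fmt` are assumed to be binary, and `delta` is passed in as a plain number because this digest does not define how $\Delta$ is computed.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    correct: bool         # r_acc: final answer matches the reference (assumed binary)
    well_formatted: bool  # r_fmt: output follows the expected format (assumed binary)
    used_aux: bool        # I_aux(o): the rollout contains an auxiliary construction
    ppl: float            # PPL(o): perplexity of the reasoning trace


def timing_reward(used_aux: bool, delta: float, tau: float) -> float:
    """r_time: +1 / -1 / 0 depending on delta, gated by the construction indicator."""
    if not used_aux:
        return 0.0
    if delta > tau:
        return 1.0
    if delta < -tau:
        return -1.0
    return 0.0


def quality_reward(rollout: Rollout, mean_ppl: float, margin: float) -> float:
    """r_qual: 1 only for correct rollouts with a construction and low enough PPL."""
    low_ppl = rollout.ppl < mean_ppl + margin
    return 1.0 if (rollout.used_aux and rollout.correct and low_ppl) else 0.0


def composite_reward(rollout: Rollout, delta: float, mean_ppl: float,
                     weights=(1.0, 0.1, 0.5, 0.5),  # hypothetical w_1..w_4
                     tau: float = 0.1,              # hypothetical threshold
                     margin: float = 0.0) -> float:
    w1, w2, w3, w4 = weights
    r_acc = 1.0 if rollout.correct else 0.0
    r_fmt = 1.0 if rollout.well_formatted else 0.0
    r_time = timing_reward(rollout.used_aux, delta, tau)
    r_qual = quality_reward(rollout, mean_ppl, margin)
    return w1 * r_acc + w2 * r_fmt + w3 * r_time + w4 * r_qual


example = Rollout(correct=True, well_formatted=True, used_aux=True, ppl=2.0)
example_reward = composite_reward(example, delta=0.3, mean_ppl=2.5)
```

Because both shaped terms are multiplied by the construction indicator, a rollout that never draws an auxiliary line is scored on accuracy and format alone.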
Training Recipe
- Supervised Fine-Tuning (SFT) warm-up: 5 epochs on 1,600 samples from GeoAux-Bench, GeomVerse, and Geometry3k training splits using AdamW optimizer with 5e-5 learning rate, cosine schedule, batch size 32.
- A2PO reinforcement learning: 650 steps with 1e-6 constant learning rate, batch size 24, rollout batch size 72, 8 generations per prompt, KL coefficient β=0.01.
- Training data uses a mixed-prompt strategy with standard and prohibited prompt variants, filtered for marginal solvability (mixed correct/incorrect outcomes).
- Hardware details are not explicitly reported; training uses bfloat16 precision throughout.
- Wall-clock training time is not reported.
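To make the "8 generations per prompt" and "marginal solvability" settings concrete, here is a small sketch of the standard GRPO group-relative advantage (each reward normalized by its group's mean and standard deviation) together with a filter that keeps only prompts whose sampled outcomes are mixed. The function names and the `eps` constant are illustrative, and A2PO's tri-partition baselines are not modeled here.

```python
import numpy as np


def group_relative_advantages(rewards, eps: float = 1e-6) -> np.ndarray:
    """rewards: (n_prompts, n_generations). Normalize each row by its own mean and std."""
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


def is_marginally_solvable(correct_flags) -> bool:
    """Keep a prompt only if its rollouts contain both correct and incorrect outcomes."""
    flags = np.asarray(correct_flags, dtype=bool)
    return bool(flags.any() and not flags.all())


# One prompt with 8 sampled generations, 3 of them correct.
group = [[1, 0, 0, 1, 1, 0, 0, 0]]
advantages = group_relative_advantages(group)
```

A prompt that is always solved or never solved yields zero advantages for every rollout, so it contributes no gradient signal; that is one plausible reason for filtering to mixed-outcome prompts.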
Novelty & Lineage
The core novelty is the systematic integration of visual-textual interleaved reasoning for geometry through Adaptive Reward Shaping that explicitly models construction timing and quality. Prior work includes GeometryZero (2025) for geometry RL optimization, GRPO (2024) for group-based policy optimization, and MathCanvas (2025) for visual chain-of-thought. The specific delta is the tri-partition sampling with counterfactual baselines and perplexity-based quality rewards for strategic visual construction. The contribution is SIGNIFICANT because it addresses a fundamental limitation in geometric reasoning MLLMs through principled reward design, though it builds incrementally on existing RL optimization frameworks.
Benchmarks & Results
- GeoAux-Bench (new): Accuracy 42.97% (A2PO) vs 40.18% (GeometryZero baseline), +2.79 percentage points
- Geomverse: Accuracy 70.70% (A2PO) vs 68.30% (GeometryZero), +2.40 percentage points
- Geometry3k: Accuracy 53.61% (A2PO) vs 53.72% (GeometryZero), a marginal 0.11-point deficit
- Overall average: 55.76% vs 54.07% baseline, +1.69 percentage points
- GeoAux-Bench evaluation on SOTA models: Gemini-2.5-Pro achieves 83.16%, GPT-5 at 80.62%, with large gaps to open-source models
- The paper shows modest improvements on two of the three benchmarks and parity on Geometry3k, with the strongest gains on the newly introduced GeoAux-Bench dataset
Compute & Efficiency
- Model size: 7B parameters (Qwen2.5-VL-7B-Instruct backbone)
- Training compute: Not explicitly reported, uses frozen vision tower to reduce computational overhead
- Inference speed: Not reported, but visual re-prompting likely adds latency overhead
- Memory footprint: 8,192 max sequence length, 512×512 image resolution, bfloat16 precision
- Deployment practicality: Limited by reliance on ground-truth visual injection during re-prompting, making current approach more suitable for research than production deployment
Real-World Applicability
- The method currently relies on retrieval-based visual injection using ground-truth auxiliary diagrams rather than native visual generation, limiting real-world deployment
- No production deployment results or hardware experiments reported
- The approach requires pre-existing high-quality geometric diagrams for the re-prompting mechanism
- Authors acknowledge the limitation that current unified MLLMs lack fine-grained visual actuation for precise geometric editing
- Real-world applicability is constrained until advances in multimodal pre-training enable reliable geometric diagram generation
Limitations & Failure Modes
- FUNDAMENTAL: Current implementation requires ground-truth auxiliary diagrams rather than model-generated visuals, limiting autonomous reasoning capability
- ENGINEERING: Visual re-prompting adds inference latency and complexity that could be optimized with better visual generation models
- FUNDAMENTAL: The method cannot generate precise geometric constructions natively, relying on retrieval-based injection
- EVALUATION: Potential data contamination observed in SOTA model performance on Olympiad problems suggests memorization rather than reasoning
- ENGINEERING: The tri-partition sampling strategy increases computational overhead during training
Failure modes:
- Visual hallucinations in native unified models lead to geometric reasoning errors
- Models often resort to “analytic shortcuts” using coordinate systems instead of pure geometric reasoning
Agentic Flow Steering and Parallel Rollout Search for Spatially Grounded Text-to-Image Generation
Authors: Ping Chen, Daoxuan Zhang, Xiangming Wang, Yungeng Liu et al. (6 authors) · Institution: Harbin Institute of Technology · Category: cs.AI
AFS-Search introduces a training-free closed-loop framework that uses VLM feedback and parallel rollout search to dynamically steer Flow Matching trajectories for precise spatially grounded text-to-image generation.
Practical Takeaway: Research engineers should consider this closed-loop paradigm for applications requiring precise spatial control in image generation. The key insight is moving from external re-prompting to internal trajectory steering using energy-based velocity field modulation. The training-free nature makes it immediately applicable to existing FLUX.1-dev deployments. However, the 3-6x computational overhead requires careful consideration for production use. The parallel rollout search and VLM-guided correction mechanisms could be adapted to other flow-based generation models beyond T2I, potentially benefiting video generation or 3D synthesis tasks.
Tags: text-to-image diffusion-models flow-matching vision-language-models agentic-ai compositional-generation spatial-grounding closed-loop-control
Task & Setting
T2I generation models face fundamental challenges with spatial reasoning and compositional accuracy despite recent advances. Traditional text encoders struggle with complex relational semantics, and standard open-loop sampling propagates initial ambiguities throughout generation, leading to spatial constraint violations and attribute misalignment.
The task is spatially grounded text-to-image generation: given a complex natural language prompt describing spatial relationships and multiple objects with specific attributes (colors, positions, counts), generate a 1024×1024 image that precisely follows all constraints. The objective function can be formulated as:
\[\max_{\mathbf{x}} P(\mathbf{x}|\mathbf{y}) \cdot \text{Spatial}(\mathbf{x}, \mathbf{y}) \cdot \text{Attribute}(\mathbf{x}, \mathbf{y})\]
where $\mathbf{x}$ is the generated image and $\mathbf{y}$ is the text prompt.
Evaluation criteria include compositional accuracy measured on T2I-CompBench (attribute binding, object relationships, complex scenes), object-focused evaluation on GenEval (object presence, counting, positioning), and reasoning capability assessment on R2I-Bench across five dimensions (causal, logical, commonsense, compositional, mathematical reasoning). The paper evaluates against 15+ baseline models including FLUX.1-dev, SDXL, and recent agentic frameworks.
Architecture & Method
- Base Architecture: Built upon FLUX.1-dev, a 12B parameter rectified flow transformer using Flow Matching paradigm with ODE trajectory:
\[\frac{d\mathbf{x}_t}{dt} = v_\theta(\mathbf{x}_t, t, \mathbf{y})\]
- VLM Integration: Uses Qwen-VL-MAX (Pro version) or Qwen2.5-VL-7B (Fast version) as semantic critic for diagnosis and scoring.
- Linear Trajectory Projection: Exploits constant-velocity property of rectified flows to project noisy latents to clean preview:
\[\hat{\mathbf{z}}_0 = \mathbf{z}_t - t \cdot \mathbf{v}_t\]
- Contrastive Energy Function: Defines energy over CLIP embeddings:
\[\mathcal{E}(\hat{\mathbf{z}}_0) = \cos(\text{CLIP}(\hat{\mathbf{x}}_0), \mathbf{e}_{neg}) - \cos(\text{CLIP}(\hat{\mathbf{x}}_0), \mathbf{e}_{pos})\]
- Time-Scaled Velocity Modulation: Applies spatially-masked corrections to velocity field:
\[\mathbf{v}_t^{corrected} = \mathbf{v}_t + \eta t \cdot \nabla_{\hat{\mathbf{z}}_0} \mathcal{E} \odot \mathbf{M}\]
- Parallel Rollout Search: At critical timestep (t=0.6), explores three branches: baseline continuation, corrective steering via AFS, and stochastic exploration with Gaussian noise injection.
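The projection, energy, and velocity-modulation formulas above compose into a single steering step, sketched below with NumPy. To keep the example self-contained, the decoder and CLIP encoder are replaced by the identity map (so the "embedding" of a latent is the latent itself), and the vectors, mask, and `eta` are made-up toy values; only the algebra follows the formulas in this section.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def cosine_grad(z: np.ndarray, e: np.ndarray) -> np.ndarray:
    """Gradient of cos(z, e) with respect to z."""
    nz, ne = np.linalg.norm(z), np.linalg.norm(e)
    return e / (nz * ne) - cosine(z, e) * z / nz**2


def energy(z0: np.ndarray, e_pos: np.ndarray, e_neg: np.ndarray) -> float:
    # E = cos(embed, e_neg) - cos(embed, e_pos); identity embedding is an assumption.
    return cosine(z0, e_neg) - cosine(z0, e_pos)


def energy_grad(z0, e_pos, e_neg) -> np.ndarray:
    return cosine_grad(z0, e_neg) - cosine_grad(z0, e_pos)


def project_to_clean(z_t, v_t, t: float) -> np.ndarray:
    """Linear trajectory projection: z0_hat = z_t - t * v_t."""
    return z_t - t * v_t


def steer_velocity(z_t, v_t, t, e_pos, e_neg, mask, eta) -> np.ndarray:
    """v_corrected = v_t + eta * t * grad_E(z0_hat) * mask."""
    z0_hat = project_to_clean(z_t, v_t, t)
    return v_t + eta * t * energy_grad(z0_hat, e_pos, e_neg) * mask


# Toy values (all hypothetical).
z_t = np.array([1.0, 0.5, -0.5, 2.0])
v_t = np.array([0.2, -0.1, 0.3, 0.1])
e_pos = np.array([1.0, 0.0, 0.0, 0.0])
e_neg = np.array([0.0, 1.0, 0.0, 0.0])
mask = np.array([1.0, 1.0, 0.0, 0.0])  # only the first two entries may be corrected
t, eta = 0.6, 1.0

energy_before = energy(project_to_clean(z_t, v_t, t), e_pos, e_neg)
v_new = steer_velocity(z_t, v_t, t, e_pos, e_neg, mask, eta)
energy_after = energy(project_to_clean(z_t, v_new, t), e_pos, e_neg)
```

Substituting the corrected velocity into the projection gives $\hat{\mathbf{z}}_0' = \hat{\mathbf{z}}_0 - \eta t^2 \, \nabla\mathcal{E} \odot \mathbf{M}$, which is a masked gradient-descent step on the energy. The $t^2$ factor also shows why the correction fades as $t \to 0$.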
Training Recipe
This is a training-free framework that does not modify the base FLUX.1-dev model parameters.
- Base Model: Uses pre-trained FLUX.1-dev (12B parameters, trained on internet-scale data)
- VLM Components: Leverages pre-trained Qwen-VL-MAX or Qwen2.5-VL-7B for semantic supervision
- SAM3 Integration: Uses pre-trained SAM3 for object segmentation and spatial grounding
- CLIP Encoder: Employs pre-trained CLIP for contrastive energy computation
No additional training is performed. The framework operates purely through test-time computation and search, preserving the open-world generalization capabilities of the foundation models while adding closed-loop feedback control.
Novelty & Lineage
Prior Work: Builds on agentic T2I frameworks like RPG (2024), AgentComp (2024), and SILMM, which use VLMs for external prompt refinement but operate in open-loop fashion. Also relates to attention manipulation methods like Attend-and-Excite and Prompt-to-Prompt.
Key Delta: Introduces internal closed-loop control within the ODE trajectory rather than external re-prompting. The core innovations are:
- Agentic Flow Steering that directly modulates the velocity field using energy gradients
- Parallel Rollout Search with lookahead simulation at critical timesteps
- Time-scaled intervention that naturally decays correction strength as generation progresses
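The parallel rollout idea listed above (branch at a critical timestep, look ahead along each branch, keep the best one) can be sketched as follows. Everything here is a stand-in: `critic_fn` plays the role of the VLM scorer, the velocity fields are toy rectified-flow fields of the form $(\mathbf{z} - \mathbf{x})/t$, and the selection rule (highest critic score wins) is an assumption rather than a detail confirmed by this digest.

```python
import numpy as np


def euler_rollout(z, t_start, velocity_fn, n_steps=6):
    """Integrate dz/dt = v(z, t) from t_start down to t = 0 with Euler steps."""
    dt = t_start / n_steps
    t = t_start
    for _ in range(n_steps):
        z = z - dt * velocity_fn(z, t)
        t -= dt
    return z


def parallel_rollout_search(z_t, t_crit, velocity_fn, steered_velocity_fn,
                            critic_fn, noise_scale=0.1, seed=0):
    """Explore three branches from z_t and return the best-scoring preview."""
    rng = np.random.default_rng(seed)
    noisy = z_t + noise_scale * rng.standard_normal(z_t.shape)
    branches = {
        "baseline": (z_t, velocity_fn),          # continue unchanged
        "steered": (z_t, steered_velocity_fn),   # corrective steering
        "stochastic": (noisy, velocity_fn),      # Gaussian noise injection
    }
    previews = {name: euler_rollout(z, t_crit, fn) for name, (z, fn) in branches.items()}
    scores = {name: critic_fn(p) for name, p in previews.items()}
    best = max(scores, key=scores.get)
    return best, previews[best], scores


# Toy setup: the base field heads to a wrong image, the steered field to the right one.
target = np.array([1.0, -1.0, 0.5])
wrong = np.array([0.0, 0.0, 0.0])
base_field = lambda z, t: (z - wrong) / t
steered_field = lambda z, t: (z - target) / t
critic = lambda z: -float(np.linalg.norm(z - target))  # hypothetical scorer

z_start = np.array([0.3, 0.2, -0.4])
best_name, best_preview, branch_scores = parallel_rollout_search(
    z_start, 0.6, base_field, steered_field, critic)
```

In a real pipeline the three rollouts would run as one batched forward pass, which is where the "parallel" in the name would come from; the loop above is sequential only for readability.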
Rating: SIGNIFICANT - Represents a paradigm shift from external feedback loops to internal trajectory steering, with solid theoretical grounding in optimal transport and practical improvements across multiple benchmarks.
Benchmarks & Results
- T2I-CompBench: AFS-Search-Pro achieves 0.6748 average (vs 0.4770 FLUX baseline), +0.0786 absolute (7.86 points) over the best prior method AgentComp (0.5962)
- GenEval: Shows consistent improvements across object counting, positioning, and color accuracy (visual results provided, quantitative scores not detailed)
- R2I-Bench: AFS-Search achieves 0.48 average vs 0.33 FLUX baseline, outperforming 9 baseline models across causal (0.45), logical (0.58), commonsense (0.51), compositional (0.66), and mathematical (0.20) reasoning
- Inference Speed: AFS-Search-Pro: 62.3s vs FLUX 11.7s, AFS-Search-Fast: 32.5s (faster than other agentic methods like RPG 104.2s, EvoGen 125.3s)
Results show consistent improvements across all benchmarks, with particularly strong gains in spatial relationships and complex compositional tasks.
Compute & Efficiency
- Model Size: 12B parameters (FLUX.1-dev base) + VLM components (Qwen-VL-MAX or Qwen2.5-VL-7B)
- Training Compute: Zero additional training required (training-free approach)
- Inference Speed: AFS-Search-Pro 62.3s, AFS-Search-Fast 32.5s per 1024×1024 image (vs 11.7s FLUX baseline)
- Memory Footprint: Not explicitly reported, but requires simultaneous loading of FLUX.1-dev, VLM, SAM3, and CLIP models
- Deployment Assessment: Computationally intensive due to parallel rollout search and multiple model inference, but still faster than competing agentic frameworks. The slowdown over baseline (about 2.8x for Fast and 5.3x for Pro from the timings above, summarized as 3-6x) may limit real-time applications but is acceptable for high-quality generation tasks.
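The slowdown figures follow directly from the per-image latencies quoted in this section. The short check below recomputes them; it assumes all timings were measured on comparable hardware, which this digest does not state.

```python
baseline_s = 11.7  # FLUX.1-dev latency per image, as quoted above

latencies_s = {
    "AFS-Search-Pro": 62.3,
    "AFS-Search-Fast": 32.5,
    "RPG": 104.2,
    "EvoGen": 125.3,
}

# Ratio of each method's latency to the FLUX.1-dev baseline.
slowdown = {name: secs / baseline_s for name, secs in latencies_s.items()}

for name, ratio in slowdown.items():
    print(f"{name}: {ratio:.2f}x the FLUX.1-dev latency")
```

The two AFS-Search variants land at roughly 2.8x and 5.3x, while the other agentic baselines are closer to 9x and 11x.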
Real-World Applicability
- Dataset Testing: Evaluated on standard T2I benchmarks (T2I-CompBench, GenEval, R2I-Bench) with complex real-world prompts including spatial relationships, object counting, and compositional scenes
- Prompt Complexity: Handles complex natural language descriptions like “four seasons in one picture” and professional scenarios with multiple objects and precise spatial constraints
- Production Integration: Framework is modular and could integrate with existing FLUX.1-dev deployments, though computational overhead may require optimization for production use
- Cross-domain Generalization: Leverages pre-trained foundation models without fine-tuning, preserving open-world capabilities for diverse domains and artistic styles
- User Study: No human evaluation or user studies reported to validate real-world preference alignment
Limitations & Failure Modes
- FUNDAMENTAL: Relies on VLM’s visual understanding capabilities - errors in VLM diagnosis propagate to final generation quality
- ENGINEERING: Computational overhead (3-6x slower than baseline) limits real-time applications and scalability
- ENGINEERING: SAM3 segmentation accuracy affects spatial grounding precision - incorrect masks can misguide correction
- EVALUATION: Limited to standard benchmarks without human preference evaluation or real deployment studies
- ENGINEERING: Global retry mechanism (up to 2 retries) adds further computational cost and unpredictable latency
Failure Modes:
- VLM Misdiagnosis: Incorrect intermediate assessment leads to wrong corrective actions
- Energy Function Limitations: CLIP-based contrastive energy may not capture fine-grained spatial relationships, potentially causing over-correction or under-correction of defects
Robotic Agentic Platform for Intelligent Electric Vehicle Disassembly
Authors: Zachary Allen, Max Conway, Lyle Antieau, Allen Ponraj et al. (5 authors) · Institution: University of Colorado Boulder · Category: cs.RO
RAPID presents a gantry-mounted robotic platform that combines open-vocabulary vision, analytical motion planning, and agentic AI to achieve 97% success in fastener removal for full-scale EV battery disassembly.
Practical Takeaway: This work demonstrates that combining open-vocabulary vision with agentic AI can create flexible robotic systems for complex industrial tasks like battery recycling. The key insight is that explicit tool interfaces vastly outperform automatic service discovery for LLM-robot integration (100% vs 57% success). Research engineers should consider: (1) analytical IK solvers for redundant systems to avoid planning failures, (2) structured tool APIs rather than generic service discovery for reliable LLM control, and (3) the importance of force control and visual servoing for precision manipulation tasks. The open-source platform provides a solid foundation for exploring human-robot collaboration in industrial settings.
Tags: robotics recycling electric_vehicles manipulation computer_vision agentic_ai industrial_automation sustainability
Task & Setting
Electric vehicle (EV) adoption creates urgent demand for scalable battery recycling, but EV battery pack disassembly remains largely manual due to high design variability, safety hazards from 400-800V systems, and economic pressures from lower-value LFP batteries. The task is to develop an autonomous robotic system for disassembling full-scale EV battery packs through fastener removal operations.
Input: Full-scale EV battery pack (2.12m × 1.22m, 450kg, 800V Hyundai Ioniq5), RGB-D images from Intel RealSense D435i. Output: Successful removal of fasteners (screws, nuts) with pose estimation and manipulation commands. The objective is to maximize fastener removal success rate while minimizing disassembly time:
\[\max_{s} \text{Success Rate}(s) - \lambda \cdot \text{Time}(s)\]
where $s$ represents the fastener removal strategy.
Success is measured by:
- fastener removal success rate (%)
- total disassembly time (minutes)
- object detection mAP@0.5
- agentic AI task completion rate (%)
The system is evaluated on 204 fastener removal trials across three strategies: taught-in poses, one-shot vision execution, and visual servoing. A complete manual disassembly baseline is established using 12 subtasks totaling 6778 man-seconds.
Architecture & Method
- Robotic Hardware: Universal Robot UR16e (6-DOF, 16kg payload) mounted on Parker-Hannifin 2.1m linear gantry with custom nut-runner tool (AlloyPower ARW801) and Intel RealSense D435i RGB-D camera.
- Vision Pipeline: YoloWorld open-vocabulary object detection model achieving 0.9757 mAP@0.5, trained on 563 annotated images with 7 component classes (bolts, bus bars, screws, nuts, etc.). 3D point clouds generated from RGB-D data and stored in kD-Tree for efficient retrieval.
- Motion Planning: Custom analytical inverse kinematics solver for 7-DOF system (6-DOF arm + 1-DOF gantry) using multi-objective cost function:
\[\text{Cost} = d_J(q_c, q_k) + \frac{1}{d_{ee}} + \frac{1}{A_\triangle}\]
where $d_J$ is joint distance, $d_{ee}$ is wrist-gantry distance, and $A_\triangle$ is arm triangle area for singularity avoidance.
- Fastener Removal: Three strategies implemented - (1) taught-in poses with 10N downward force control, (2) vision-guided execution using centroids, and (3) visual servoing with interaction matrix $L$ for alignment correction.
- Agentic AI: SmolAgents framework connecting LLMs (GPT-4o-mini, Qwen 3.5 9B/4B) to robot capabilities through structured tool calls and ROS services, enabling natural language task specification and execution.
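The multi-objective IK cost above lends itself to a short sketch: score every candidate joint solution and keep the cheapest. The weighted L1 form of $d_J$ (using the gantry and arm weights of 0.1 and 1.0 quoted in the training recipe), the joint ordering with the gantry first, and the choice of triangle vertices are assumptions made for illustration only.

```python
import numpy as np

# Gantry axis first, then the six arm joints (ordering is an assumption).
JOINT_WEIGHTS = np.array([0.1, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])


def triangle_area(p_a, p_b, p_c) -> float:
    """Area of the triangle spanned by three 3D points (e.g. shoulder, elbow, wrist)."""
    return 0.5 * float(np.linalg.norm(np.cross(p_b - p_a, p_c - p_a)))


def ik_cost(q_current, q_candidate, d_ee, area, weights=JOINT_WEIGHTS, eps=1e-9) -> float:
    """Cost = d_J(q_c, q_k) + 1/d_ee + 1/A_triangle."""
    d_joint = float(np.sum(weights * np.abs(q_candidate - q_current)))
    return d_joint + 1.0 / max(d_ee, eps) + 1.0 / max(area, eps)


def select_solution(q_current, candidates):
    """candidates: list of dicts with keys 'q', 'd_ee', 'area'. Returns (best index, costs)."""
    costs = [ik_cost(q_current, c["q"], c["d_ee"], c["area"]) for c in candidates]
    return int(np.argmin(costs)), costs


q_now = np.zeros(7)
candidates = [
    {"q": np.array([1.0, 0, 0, 0, 0, 0, 0]), "d_ee": 1.0, "area": 1.0},   # move the gantry
    {"q": np.array([0, 1.0, 0, 0, 0, 0, 0]), "d_ee": 1.0, "area": 1.0},   # move an arm joint
    {"q": np.zeros(7), "d_ee": 1.0, "area": 0.01},                        # near-singular arm
]
best_index, all_costs = select_solution(q_now, candidates)
unit_area = triangle_area(np.zeros(3), np.array([1.0, 0, 0]), np.array([0, 1.0, 0]))
```

With these toy numbers the solver prefers moving the gantry over an equal arm-joint motion, and the reciprocal area term makes the collapsed (near-singular) arm configuration very expensive even though it requires no motion at all.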
Training Recipe
- Vision Model Training: YoloWorld finetuned on custom dataset of 563 RGB images with 4,614 labeled instances across 7 battery component classes. Training details (optimizer, learning rate, hardware) not reported.
- Motion Planning: Analytical solver implemented with empirically tuned cost function weights (gantry w=0.1, arm joints w=1.0) through ablation study. No learning-based training involved.
- Agentic AI: Pre-trained LLMs used without additional training - GPT-4o-mini via API, Qwen 3.5 (4B/9B parameters) deployed locally on NVIDIA Jetson Thor. System prompts and tool configurations provided but training methodology not applicable.
- System Integration: Manual calibration and parameter tuning for force control (10N downward force), visual servoing gains, and TSP-based task sequencing using Google OR-Tools solver.
Hardware and training time details not reported for most components.
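The visual servoing strategy and its tuned gains are mentioned above without detail. As a point of reference, the textbook image-based visual servoing law for a single point feature is sketched below: the camera twist is $-\lambda L^{+} e$, where $L$ is the interaction matrix and $e$ the feature error. Whether RAPID uses point features, and which gain it uses, is not stated in this digest, so the feature type, depth, and `gain` here are all assumptions.

```python
import numpy as np


def point_interaction_matrix(x: float, y: float, depth: float) -> np.ndarray:
    """Classic 2x6 interaction matrix for a normalized image point (x, y) at a given depth."""
    return np.array([
        [-1.0 / depth, 0.0, x / depth, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / depth, y / depth, 1.0 + y * y, -x * y, -x],
    ])


def ibvs_velocity(feature, target, depth: float, gain: float = 0.5) -> np.ndarray:
    """Camera twist (vx, vy, vz, wx, wy, wz) that drives the feature toward the target."""
    error = np.asarray(feature, dtype=float) - np.asarray(target, dtype=float)
    L = point_interaction_matrix(feature[0], feature[1], depth)
    return -gain * np.linalg.pinv(L) @ error


# Fastener centroid slightly off the image center, assumed 0.4 m away (toy values).
feature = (0.10, -0.05)
target = (0.0, 0.0)
twist = ibvs_velocity(feature, target, depth=0.4)
```

Depth enters the matrix only through the translational columns, so a wrong depth estimate rescales the commanded translation; this is consistent with the overshoot and undershoot the paper reports for poor depth estimates.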
Novelty & Lineage
This work extends prior battery disassembly research focused on smaller hybrid battery packs (Qu et al. 2024, Al Assadi et al. 2024) to full-scale EV batteries requiring gantry-mounted systems. Key innovations include:
- an analytical IK solver for 7-DOF gantry systems addressing redundancy challenges
- integration of open-vocabulary detection (YoloWorld) with agentic AI for flexible task specification
- a systematic comparison of fastener removal strategies on realistic EV hardware
The closest prior work is Choux et al. 2021 on task planning for EV battery disassembly, but focused on smaller batteries without physical implementation. The agentic AI integration builds on recent LLM-robotics work (Lynch et al. 2023, Rachwał et al. 2025) but applies it specifically to industrial disassembly.
Rating: INCREMENTAL - combines existing techniques in a novel application domain with solid engineering contributions.
Benchmarks & Results
- Object Detection Performance: YoloWorld achieves 0.9757 mAP@0.5 vs YoloV11L baseline 0.9708 mAP@0.5, demonstrating comparable performance with open-vocabulary benefits.
- Fastener Removal Success: Taught-in poses achieve 97.06% success rate (24.09 min), one-shot vision 57.35% success (28.70 min), visual servoing 82.84% success (36.29 min) across n=204 trials.
- Inverse Kinematics: Custom analytical solver achieves 99.3% success rate vs KDL 25.7% (1 seed) and 73.0% (50 seeds), with 0.31s vs 10.46s planning time.
- Agentic AI Task Completion: Tool-based interface achieves 100% success vs MCP-based 56.7% success (43.3% failure rate) across simple motion and complex reasoning tasks.
- Manual Disassembly Baseline: Complete manual disassembly requires 6778 man-seconds across 12 subtasks, providing comparative baseline for automation assessment.
No comparison to other robotic disassembly systems on equivalent full-scale batteries due to lack of existing benchmarks.
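As a sanity check on the fastener-removal percentages above, the snippet below converts each reported rate back into a whole number of successes and recomputes the percentage. It assumes the n=204 figure applies to each strategy separately; the implied success counts are an inference from the arithmetic, not numbers stated in this digest.

```python
trials = 204  # assumed to be the per-strategy trial count

reported_pct = {
    "taught-in poses": 97.06,
    "one-shot vision": 57.35,
    "visual servoing": 82.84,
}

implied = {}
for name, pct in reported_pct.items():
    successes = round(pct / 100 * trials)             # nearest whole number of successes
    recomputed = round(100 * successes / trials, 2)   # percentage implied by that count
    implied[name] = (successes, recomputed)
    print(f"{name}: {successes}/{trials} = {recomputed}%")
```

All three rates round-trip exactly under this assumption. Splitting the 204 trials evenly into 68 per strategy would not reproduce the 82.84% figure, since 56/68 and 57/68 give 82.35% and 83.82%.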
Compute & Efficiency
- Model Size: YoloWorld parameters not specified. Qwen 3.5 models: 4B and 9B parameters deployed locally.
- Training Compute: Not reported for vision model training. No training required for analytical motion planning components.
- Inference Speed: Vision processing and motion planning times not explicitly reported. Task execution ranges 24-36 minutes for fastener removal operations.
- Memory Footprint: Qwen models deployed on NVIDIA Jetson Thor edge hardware, demonstrating feasibility of local deployment.
- Deployment Practicality: System demonstrates real-world deployment on full-scale EV battery with $80,000 hardware cost. Open-source software stack enables reproducibility and adoption.
Real-World Applicability
- Physical Hardware Testing: Extensive evaluation on actual Hyundai Ioniq5 EV battery (450kg, 800V) in laboratory setting with full-scale gantry system.
- Industrial Environment: System designed for integration into existing human disassembly stations with collaborative safety considerations and predictable motion planning.
- Safety Implementation: Battery discharged to 0V using Webasto AV-900 cycler prior to disassembly. Force-controlled interactions and collision avoidance implemented.
- Economic Viability: Techno-economic analysis shows potential >40% cost savings compared to manual disassembly, with $80,000 total system cost.
- Production Readiness: Open-source platform released to enable systematic investigation and further development toward scalable deployment.
No deployment in actual recycling facilities reported, remaining at research demonstration level.
Limitations & Failure Modes
- FUNDAMENTAL: Vision system struggles with reflective surfaces (black sheet metal, plastic components) due to active infrared interference from RealSense sensor.
- ENGINEERING: Robot operates significantly slower than human workers (22 min vs 17 min for comparable tasks), though running well below maximum speed.
- ENGINEERING: Visual servoing exhibits convergence issues due to poor depth estimates causing overshoot/undershoot in control signals.
- EVALUATION: Only evaluated on single battery type (Hyundai Ioniq5), limiting generalizability claims across diverse EV designs.
- ENGINEERING: End-effector slack prevents 100% success rate even with taught-in poses (97% achieved).
Failure Modes:
- False positive detections leading to attempted removal of non-existent fasteners
- Agentic AI systems marking tasks complete without actual execution, particularly with smaller language models in complex scenarios.
GoalVLM: VLM-driven Object Goal Navigation for Multi-Agent System
Authors: MoniJesu James, Amir Atef Habel, Aleksey Fedoseev, Dzmitry Tsetserukou · Institution: Skoltech · Category: cs.RO
GoalVLM presents a zero-shot multi-agent framework that integrates Vision-Language Models into frontier-based exploration for open-vocabulary object navigation, achieving competitive success rates without task-specific training.
Practical Takeaway: If you’re building multi-robot navigation systems, GoalVLM demonstrates that zero-shot VLM-guided exploration can achieve competitive object-finding performance without expensive training. The key insight is integrating VLMs directly into frontier selection rather than just perception. The depth-projected goal localization (GoalProjector) and corrected camera intrinsics for non-uniform rescaling are immediately implementable improvements. However, expect significant path inefficiency compared to trained methods - consider this for exploration scenarios where training data is unavailable rather than production deployments requiring optimal paths.
Tags: multi-agent-systems object-goal-navigation vision-language-models zero-shot-learning semantic-mapping frontier-exploration robotics embodied-ai
Task & Setting
- Real-world context: Object-goal navigation requires robots to autonomously explore unfamiliar environments to locate specific objects based on natural language descriptions. Existing multi-agent approaches are limited to closed-set vocabularies and require retraining for new object categories, precluding deployment in dynamic real-world scenarios where novel objects may appear.
- Task definition: Given N agents in an unknown environment E, the task is to locate object instances belonging to open-vocabulary categories g ∈ G (379+ categories) specified in natural language. Each agent receives RGB-D observations and pose at each timestep. The formal objective is a decentralized partially observable Markov decision process (Dec-POMDP) where agents must navigate to within 1.0m Euclidean distance of target objects in sequential multi-subtask episodes.
- Evaluation criteria: Success is measured by Subtask Success Rate (SR) - fraction of subtasks completed within distance threshold, Success weighted by Path Length (SPL) using geodesic distance, and Distance to Goal (DTG) at episode termination.
- The paper evaluates on GOAT-Bench val_unseen split comprising 360 multi-subtask episodes across HM3D scenes, with each episode containing 5-7 sequential open-vocabulary object-goal subtasks (1032 total subtasks).
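For readers unfamiliar with the metrics named above, the sketch below computes SR and SPL using the standard embodied-navigation definition, $\text{SPL} = \frac{1}{N}\sum_i S_i \, \ell_i / \max(p_i, \ell_i)$, where $\ell_i$ is the shortest geodesic path length and $p_i$ the path the agent actually took. It assumes GOAT-Bench follows this standard definition; the sample numbers are invented.

```python
import numpy as np


def success_rate(successes) -> float:
    """Fraction of subtasks that ended within the distance threshold."""
    return float(np.mean(np.asarray(successes, dtype=float)))


def spl(successes, shortest_lengths, path_lengths) -> float:
    """Success weighted by (normalized inverse) Path Length."""
    s = np.asarray(successes, dtype=float)
    shortest = np.asarray(shortest_lengths, dtype=float)
    taken = np.asarray(path_lengths, dtype=float)
    return float(np.mean(s * shortest / np.maximum(taken, shortest)))


# Three toy subtasks: an optimal success, a success with a 2x detour, and a failure.
successes = [1, 1, 0]
shortest_m = [5.0, 4.0, 6.0]
taken_m = [5.0, 8.0, 3.0]

sr_value = success_rate(successes)
spl_value = spl(successes, shortest_m, taken_m)
```

The detour example shows how an agent can keep a high SR while its SPL drops: every extra meter of exploration on a successful subtask lowers that subtask's contribution.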
Architecture & Method
- Ego-centric semantic mapping: Each agent constructs BEV occupancy grids from RGB-D observations using depth-projected voxel splatting with corrected camera intrinsics for non-uniform rescaling
- Zero-shot perception: SAM3 provides text-prompted object detection and segmentation, while GoalProjector back-projects detected masks through calibrated depth into BEV coordinates
- VLM spatial reasoning: SpaceOM Vision-Language Model estimates frontier probabilities via structured prompt chains (scene captioning, room-type classification, perception gating)
- Constraint-guided frontier selection: Combined scoring function
\[U(f_i) = (1-w) \cdot s^{vlm}_i + w \cdot \hat{v}_i\]
where VLM scores are blended with Bayesian value map estimates
- Bayesian value map: Maintains probabilistic beliefs updated via
\[V'(x,y) = \frac{\sigma^2_{obs} \cdot V(x,y) + \Sigma(x,y) \cdot c_t \cdot m(x,y)}{\Sigma(x,y) + \sigma^2_{obs}}\]
- Multi-agent coordination: Decentralized architecture with max-pooling map fusion
\[M^{global}(x,y,c) = \max_i M^i_t(x,y,c)\]
and sequential greedy frontier allocation
- Navigation: Fast Marching Method solves Eikonal equation
\[|\nabla T(x)| = \frac{1}{v(x)}\]
for geodesic path planning
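Three of the formulas above (the value-map update, the blended frontier utility, and max-pooling map fusion) are simple enough to sketch directly in NumPy, together with one plausible reading of "sequential greedy frontier allocation". The interpretation of $c_t$ as a detection confidence and $m(x,y)$ as an observation mask, and the rule that agents pick in index order and may not share a frontier, are assumptions; the digest does not spell these out.

```python
import numpy as np


def update_value_map(value, variance, confidence, obs_mask, obs_var):
    """V' = (sigma_obs^2 * V + Sigma * c_t * m) / (Sigma + sigma_obs^2), applied per cell."""
    return (obs_var * value + variance * confidence * obs_mask) / (variance + obs_var)


def frontier_utility(vlm_scores, map_values, w):
    """U(f_i) = (1 - w) * s_vlm_i + w * v_hat_i."""
    return (1.0 - w) * np.asarray(vlm_scores, float) + w * np.asarray(map_values, float)


def fuse_maps(agent_maps):
    """Element-wise max over the agents' semantic maps of shape (H, W, C)."""
    return np.max(np.stack(agent_maps, axis=0), axis=0)


def greedy_allocation(utilities):
    """utilities: (n_agents, n_frontiers). Each agent, in order, takes its best free frontier."""
    taken, assignment = set(), []
    for agent_scores in np.asarray(utilities, dtype=float):
        ranked = np.argsort(-agent_scores)
        choice = next((int(f) for f in ranked if int(f) not in taken), None)
        assignment.append(choice)
        if choice is not None:
            taken.add(choice)
    return assignment


# Toy values (all hypothetical).
new_value = update_value_map(value=np.array([0.2]), variance=np.array([1.0]),
                             confidence=0.8, obs_mask=np.array([1.0]), obs_var=1.0)
utilities = frontier_utility([0.2, 0.9], [0.6, 0.1], w=0.5)
fused = fuse_maps([np.array([[[0.1, 0.7]]]), np.array([[[0.4, 0.2]]])])
assignment = greedy_allocation([[0.9, 0.5, 0.1], [0.8, 0.7, 0.2]])
```

In the allocation example both agents prefer frontier 0, so the second agent falls back to its next-best option; that is the behavior that keeps agents from duplicating each other's exploration.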
Training Recipe
This is a zero-shot, training-free approach that requires no task-specific training. The method leverages:
- Pre-trained components: SAM3 for detection/segmentation, SpaceOM VLM for spatial reasoning
- No fine-tuning or adaptation required
- Immediately deployable in novel environments without training data
- No optimization phases - uses engineered pipeline components
Novelty & Lineage
The core novelty is integrating Vision-Language Models directly into multi-agent frontier-based exploration for open-vocabulary object navigation. Prior works like VLFM (2024) and L3MVN (2023) used VLMs for single-agent navigation, while MCoCoNav (2023) addressed multi-agent coordination with closed vocabularies. The specific delta is:
- zero-shot multi-agent coordination without precomputed graphs
- depth-projected goal localization via GoalProjector
- constraint-guided VLM reasoning for frontier selection
This represents SIGNIFICANT progress by combining multiple existing techniques in a novel architecture that achieves competitive results without training.
Benchmarks & Results
- GOAT-Bench val_unseen: Subtask SR 55.8% vs Modular GOAT baseline 29.4% (26.4 point improvement), SPL 18.3% vs 17.5% (0.8 point improvement)
- Comparison to AstraNav-Memory: SR 55.8% vs 62.7% (6.9 point gap), SPL 18.3% vs 56.9% (38.6 point gap)
- Multi-agent benefit: N=2 vs N=1 agents shows 56.2% vs 38.6% SR (17.6 point improvement)
- Results are mixed - competitive SR but significant SPL gap vs trained methods
Compute & Efficiency
- Model size: Uses pre-trained SAM3 and SpaceOM VLM (exact parameters not reported)
- Training compute: Zero - no training required
- Inference speed: Real-time capable with 20 Hz state updates, 10 Hz camera processing on Orange Pi 5
- Memory footprint: BEV semantic maps with voxel grids, specific memory usage not reported
- Deployment practicality: High - zero-shot approach deployable without adaptation, demonstrated on real multi-drone hardware
Real-World Applicability
- Preliminary real-world validation on custom quadrotor platform with Intel RealSense D435 cameras and Vicon motion capture
- Hardware setup: 2 drones with Orange Pi 5 onboard compute, ArduPilot flight controllers, ZMQ communication over WiFi
- Environment: Indoor lab (10×5m, max 2m altitude) with successful object detection and BEV mapping
- Sim-to-real transfer: Perception pipeline transfers effectively from simulation to real RGB-D data
- Future work planned for full autonomous navigation experiments
Limitations & Failure Modes
- FUNDAMENTAL: 2D BEV mapping cannot represent multi-floor environments or stacked objects
- ENGINEERING: SAM3 detection failures on transparent, reflective, and small objects (mirrors achieve only 47.6% SR)
- ENGINEERING: Substantial path efficiency gap (18.3% vs 56.9% SPL) due to frontier exploration overhead
- EVALUATION: Limited real-world validation - only perception tested, not full autonomous navigation
- ENGINEERING: GoalProjector depth projection errors with reflective surfaces causing goal mislocalization
Failure modes:
- Reflected scene confusion in mirrors causing goal projection behind walls
- Small/occluded objects failing multi-view confirmation threshold leading to exploration failures
Articulated-Body Dynamics Network: Dynamics-Grounded Prior for Robot Learning
Authors: Sangwoo Shin, Kunzhao Ren, Xiaobin Xiong, Josiah Hanna · Institution: University of Wisconsin–Madison · Category: cs.RO
ABD-NET embeds the computational structure of forward dynamics (inertia propagation) directly into graph neural network policy architectures, achieving superior sample efficiency and generalization for articulated robot control.
Practical Takeaway: If you’re working on RL for articulated robots, ABD-NET offers a principled way to embed physics structure into policy networks that goes beyond simple connectivity graphs. The key insight—using forward dynamics computational structure as architectural prior—could be valuable for any articulated system control problem. The method is production-ready with real hardware validation, though the sequential computation overhead may require JAX implementation for efficient training. Consider this approach especially for dynamic locomotion tasks where physics-aware representations matter most.
Tags: robotics reinforcement_learning graph_neural_networks articulated_bodies locomotion physics_informed_ml sim_to_real policy_learning
Task & Setting
Reinforcement learning for articulated robot control faces the challenge of learning efficient policies that can generalize across different dynamics and morphologies. Traditional approaches use spatial connectivity but ignore how forces and motion propagate through rigid body systems according to physics laws.
The task is to learn control policies for articulated robots (humanoids, quadrupeds) that map proprioceptive observations (joint positions/velocities, IMU data, velocity commands) to joint control actions (target positions or torques). The objective is to maximize expected cumulative reward:
\[\max_\pi E_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]\]
Success is measured by:
- sample efficiency during training
- final task performance (locomotion velocity tracking, balancing)
- robustness to dynamics shifts (mass changes)
- sim-to-real transfer capability
Environments include Genesis and SAPIEN simulators with humanoid/quadruped robots, plus real hardware validation on Unitree G1 and Go2 robots.
Architecture & Method
- Observation Encoding: Link-wise projection layers $\{\phi_i\}_{i=0}^{K-1}$ transform global observation $s$ into per-link embeddings $z_i = \phi_i(s)$
- Dynamics-Informed Message Passing: Inspired by Articulated Body Algorithm, features propagate from child to parent links following:
  - Message aggregation: $m_i = \sum_{j \in CH(i)} v_j^a$
  - Link representation: $v_i = \text{softplus}(z_i + B_i) + m_i$
  - Child contribution: $v_j^a = v_j - v_j \odot (W_j W_j^T v_j)$
- Orthogonality Constraint: Auxiliary loss encourages structured projections:
\[L_{orth} = \frac{1}{K} \sum_{i=0}^{K-1} \|W_i^T \text{diag}(v_i) W_i - I\|_F^2\]
- Action Decoding: Per-joint action heads $\{\psi_j\}_{j=1}^{K-1}$ output actions from parent representations: $a_j = \psi_j(v_{PA(j)})$
Core contribution: Embedding forward dynamics computational structure (inertia propagation from Articulated Body Algorithm) directly into policy architecture as learnable message passing.
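The child-to-parent propagation described in this section can be sketched in plain Python. This is an illustrative reading of the three message-passing equations only, not the authors' implementation: the function names, the link numbering convention (`parent[i] < i`), and the toy chain are assumptions, and the learned encoders $\phi_i$ and action heads $\psi_j$ are omitted.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def child_contribution(v, W):
    """v^a = v - v * (W W^T v) elementwise; W is a list of d rows with r columns."""
    d, r = len(v), len(W[0])
    wt_v = [sum(W[k][c] * v[k] for k in range(d)) for c in range(r)]       # W^T v
    w_wt_v = [sum(W[k][c] * wt_v[c] for c in range(r)) for k in range(d)]  # W (W^T v)
    return [v[k] - v[k] * w_wt_v[k] for k in range(d)]

def propagate(parent, z, B, W):
    """Leaf-to-root pass over a kinematic tree.

    parent[i] is the parent link of i (-1 for the root). Links are assumed
    to be numbered so that parent[i] < i, so reverse order visits children first.
    """
    K, d = len(parent), len(z[0])
    m = [[0.0] * d for _ in range(K)]  # aggregated child messages m_i
    v = [None] * K                     # link representations v_i
    for i in reversed(range(K)):
        v[i] = [softplus(z[i][k] + B[i][k]) + m[i][k] for k in range(d)]
        p = parent[i]
        if p >= 0:
            va = child_contribution(v[i], W[i])
            m[p] = [m[p][k] + va[k] for k in range(d)]
    return v

# Hypothetical 3-link chain (root -> link 1 -> link 2) with d = 2, r = 1.
parent = [-1, 0, 1]
z = [[0.0, 0.0] for _ in parent]
B = [[0.0, 0.0] for _ in parent]
W = [[[0.0], [0.0]] for _ in parent]  # zero projections pass v_j through unchanged
v = propagate(parent, z, B, W)
print(v[0])
```

The loop is inherently sequential along the tree depth, which is consistent with the wall-clock overhead the summary reports relative to parallel architectures.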
Training Recipe
- Model-free RL Training: PPO algorithm with identical MLP value networks across all methods
  - Data: Parallel simulation environments (Genesis/SAPIEN simulators)
  - Optimizer: Not specified, standard PPO hyperparameters
  - Hardware: NVIDIA RTX 4090 for training
  - Wall-clock time: Not reported
- Domain Randomization: Applied for sim-to-real transfer
  - Randomized: foot friction, encoder bias, base center-of-mass offset, external force perturbations
  - Scale: Not reported
- Real Robot Deployment: Policies run at 50Hz on onboard NVIDIA Jetson Orin NX
  - PD control at 200Hz for low-level joint control
  - Inference time: <5ms worst-case for real-time control
Novelty & Lineage
Builds on structural priors for robot RL: NerveNet (2018) used basic GNNs, BOT (2024) and SWAT (2022) used transformers with morphological attention masks. Closest work is Rodrigues Network (2025) which embeds forward kinematics structure, but only for motion prediction/imitation learning.
Specific delta: First to embed the computational structure of forward dynamics (specifically Articulated Body Algorithm’s inertia propagation) directly into policy architecture for model-free RL. Replaces generic message passing with physics-inspired feature propagation that mirrors how inertial quantities accumulate in rigid body dynamics.
Rating: SIGNIFICANT - Novel architectural insight with strong empirical validation, though incremental over existing morphology-aware methods.
Benchmarks & Results
- Genesis locomotion tasks (T1, G1, Go1, Go2): IQM 0.85 vs SWAT 0.79 (+7.6% improvement)
- SAPIEN tasks (Humanoid Walk/Stand/Run, Hopper Hop/Stand): IQM 0.97 vs SWAT 0.71 (+36.6% improvement)
- Mass generalization (1.1-2.0x mass increase): ABD-NET achieves 23.9% higher retention rate than SWAT baseline
- Computational efficiency: 3x lower FLOPs than transformer baselines while maintaining faster inference
- Real robot validation: Successful sim-to-real transfer on Unitree G1 (humanoid) and Go2 (quadruped) with diverse locomotion behaviors
Results show consistent improvements, with larger gains on complex morphologies and dynamic tasks. Minor gains on simpler quasi-static tasks (Go1/Go2 tracking, Hopper Stand).
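For readers unfamiliar with the IQM metric, the sketch below shows a simplified interquartile mean and checks the relative gains quoted in the list above. The per-run scores are hypothetical, and the trimming rule (drop the lowest and highest quarter of runs) is an assumption about how the statistic is computed rather than a detail taken from the paper.

```python
def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    s = sorted(scores)
    cut = len(s) // 4                 # number of runs trimmed from each end
    middle = s[cut:len(s) - cut]
    return sum(middle) / len(middle)

def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0

runs = [0.1, 0.7, 0.8, 0.9, 0.95, 0.85, 0.9, 1.0]  # hypothetical normalized scores
print(iqm(runs))
print(round(relative_gain(0.85, 0.79), 1))  # Genesis: ABD-NET vs SWAT
print(round(relative_gain(0.97, 0.71), 1))  # SAPIEN: ABD-NET vs SWAT
```

Trimming both tails makes the aggregate less sensitive to a single failed or unusually lucky seed than a plain mean.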
Compute & Efficiency
- Model size: ~91-95K parameters (comparable across all methods)
- Training compute: NVIDIA RTX 4090, parallel simulation, wall-clock time not reported
- Inference speed: 4-6x slower wall-clock than MLP in PyTorch (reduced to 2x in JAX), but 3x lower FLOPs than transformers
- Memory footprint: Not reported
- Deployment practicality: Real-time capable with <5ms inference on Jetson Orin NX, suitable for 50Hz control loops
Real-World Applicability
- Hardware validation: Successful deployment on Unitree G1 humanoid and Go2 quadruped robots with onboard inference
- Real environments: Tested across diverse terrains (asphalt, packed dirt, grass, indoor tile) for Go2
- Dynamic behaviors: G1 demonstrates forward/lateral walking and complex dance motion sequences
- Sim-to-real transfer: Standard domain randomization pipeline with real-time 50Hz policy control
- Production readiness: Compatible with existing RL-to-hardware deployment workflows, sub-20ms control budget
Limitations & Failure Modes
- ENGINEERING: Sequential leaf-to-root computation causes 2-6x higher training wall-clock time than parallel methods
- FUNDAMENTAL: Current formulation limited to proprioceptive observations, cannot handle high-dimensional sensory inputs like images
- EVALUATION: Limited to locomotion tasks, manipulation and contact-rich interactions unexplored
- ENGINEERING: Orthogonality constraint approximation may not hold under all conditions
- EVALUATION: Mass generalization only tested with 1.1-2.0x increases, extreme dynamics shifts unexplored
Failure modes:
- Policy may fail when robot morphology significantly deviates from tree structure assumptions
- Performance degradation likely with very high-dimensional action spaces or complex contact scenarios.
MemArchitect: A Policy Driven Memory Governance Layer
Authors: Lingavasan Suresh Kumar, Yang Ba, Rong Pan · Institution: Arizona State University · Category: cs.AI
MemArchitect introduces a policy-driven governance layer for LLM agent memory that actively manages information lifecycle, resolving contradictions and preventing context pollution through biological decay models and utility tracking.
Practical Takeaway: MemArchitect demonstrates that explicit governance policies can significantly improve memory quality for long-running agents, particularly for complex reasoning tasks. The key insight is treating memory as an active resource requiring lifecycle management rather than passive storage. Research engineers working on persistent agents should consider implementing similar policy frameworks, especially the FSRS-based decay and utility tracking components which show strong empirical benefits. However, the aggressive pruning approach requires careful calibration to avoid losing important facts, and several critical policies remain unimplemented.
Tags: memory-management long-term-agents RAG policy-driven-systems LLM-governance context-management forgetting-mechanisms agent-safety
Task & Setting
Long-running Large Language Model (LLM) agents accumulate memories across extended conversations, creating critical governance challenges. Standard Retrieval-Augmented Generation (RAG) systems treat memory as passive storage, leading to context pollution from outdated information (“zombie memories”), contradictory facts, and privacy violations. This becomes particularly problematic for autonomous agents that must maintain coherent, safe operations over extended periods.
The task involves implementing a policy-driven memory governance layer that manages the complete lifecycle of agent memories. Input consists of conversational history, user queries, and retrieved memory candidates. Output is a governed context window containing only relevant, consistent, and up-to-date information for LLM generation. The system must enforce policies across four domains: lifecycle management, consistency resolution, adaptive retrieval, and safety compliance.
Success is measured by accuracy on long-horizon conversational benchmarks, specifically multi-hop reasoning, temporal consistency, and factual recall tasks. The evaluation focuses on preventing hallucinations while maintaining useful information retention.
The paper evaluates on LoCoMo-10 benchmark, containing 1,986 question-answer pairs across multi-session dialogues, categorized into Single-Hop, Multi-Hop, Temporal Reasoning, and Open Domain tasks.
Architecture & Method
- MemArchitect implements a middleware governance layer that sits between user queries and the LLM agent, operating through three execution paths: Read, Reflect, and Background processing.
- FSRS Decay Engine: Replaces exponential decay with Free Spaced Repetition Scheduler (FSRS v4) for biological forgetting curves. Retrievability is computed as:
\[R(t) = \left(1 + \frac{19}{9} \cdot \frac{t}{S}\right)^{-1}\]
- Kalman Utility Filter: Tracks memory utility using Kalman filtering to reinforce useful memories and downweight hallucinations:
\[U_k = U_{k-1} + K_k \cdot (z_k - U_{k-1})\]
- Adaptive Scoring System: Dynamically weights retrieval based on query type (Fact vs. Reasoning) and memory characteristics:
\[\text{Score} = \text{Sim} \times R^\lambda \times (1 + \beta \cdot U)\]
- Hebbian Graph Expansion: Implements associative memory linking where frequently co-occurring memories are retrieved together when $P(B \mid A) > 0.7$.
- Cross-Encoder Discriminator: Acts as a “veto gate” filtering semantically irrelevant candidates even after vector similarity matching.
- Entropy-Triggered Consolidation: Monitors information density and triggers cleanup when compression ratio < 0.4, consolidating fading episodic memories into semantic facts.
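The decay, utility, and scoring formulas in this section compose as follows. This is a minimal sketch using the constants exactly as written above; the scalar Kalman gain computation, the variance bookkeeping, and the default values of `lam`, `beta`, and `meas_var` are assumptions, since the summary gives only the update equation.

```python
def retrievability(t, stability):
    """R(t) = (1 + 19/9 * t / S)^-1, with t elapsed time and S the stability term."""
    return 1.0 / (1.0 + (19.0 / 9.0) * t / stability)

def kalman_update(u_prev, p_prev, z, meas_var=0.25, process_var=0.0):
    """One scalar Kalman step: U_k = U_{k-1} + K_k * (z_k - U_{k-1}).

    p_prev is the variance of the utility estimate (assumed bookkeeping).
    """
    p_pred = p_prev + process_var
    gain = p_pred / (p_pred + meas_var)   # K_k
    u = u_prev + gain * (z - u_prev)
    p = (1.0 - gain) * p_pred
    return u, p

def score(sim, r, u, lam=1.0, beta=0.5):
    """Score = Sim * R^lambda * (1 + beta * U)."""
    return sim * (r ** lam) * (1.0 + beta * u)

# Hypothetical memory: similarity 0.8, stability 19 days, last used 9 days ago.
r = retrievability(t=9.0, stability=19.0)
u, p = kalman_update(u_prev=0.0, p_prev=1.0, z=1.0, meas_var=1.0)
print(r, u, score(0.8, r, u))
```

Raising `lam` makes stale memories drop out of the ranking faster, while `beta` controls how much a history of useful retrievals can compensate for decay.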
Training Recipe
The paper does not describe model training as MemArchitect operates as a governance middleware layer over existing pre-trained models. The system uses:
- Pre-trained LLMs: Qwen2.5-3B-Instruct and Meta-Llama-3.1-8B-Instruct as backbone models
- Pre-trained embeddings: BGE-M3 and nomic-embed-text-v1.5 for dense retrieval
- Pre-trained Cross-Encoder: Used for relevance discrimination filtering
No fine-tuning, optimization details, or training procedures are reported as the contribution is purely architectural - a policy engine that manages memory without modifying model weights.
Novelty & Lineage
The paper builds on established memory management approaches including MemGPT (2024), MemOS (2025), SimpleMem (2025), and Generative Agents (2023). The specific delta is introducing explicit policy-driven governance that actively manages memory lifecycle, resolves contradictions, and enforces compliance - moving from passive storage to active adjudication.
Key novelty includes:
- FSRS-based biological decay modeling for LLM memory
- Kalman filtering for utility tracking
- policy-based “auction” system for context window allocation, and
- unified governance across lifecycle, consistency, retrieval, and safety domains.
The approach is distinguished from prior work by treating memory as a governed resource requiring active management rather than simple append-only storage.
Rating: INCREMENTAL - The individual components (FSRS, Kalman filters, cross-encoders) are established techniques, but their integration into a unified memory governance framework represents a meaningful architectural contribution.
Benchmarks & Results
- LoCoMo-10 Single-Hop (Qwen-3B): 95.0% vs 79.4% SimpleMem baseline (+15.6 points)
- LoCoMo-10 Multi-Hop (Qwen-3B): 95.8% vs 66.7% SimpleMem baseline (+29.1 points)
- LoCoMo-10 Temporal Reasoning (Qwen-3B): 95.3% vs 56.1% SimpleMem baseline (+39.2 points)
- LoCoMo-10 Open Domain (Qwen-3B): 93.1% vs 67.1% SimpleMem baseline (+26.0 points)
- LoCoMo-10 Single-Hop (Llama-3.1-8B): 53.2% vs 74.0% MemOS baseline (-20.8 points)
- LoCoMo-10 Multi-Hop (Llama-3.1-8B): 40.1% vs 36.0% MemOS baseline (+4.1 points)
- LoCoMo-10 Temporal Reasoning (Llama-3.1-8B): 27.1% vs 72.0% MemOS baseline (-44.9 points)
- LoCoMo-10 Open Domain (Llama-3.1-8B): 56.3% vs 45.0% MemOS baseline (+11.3 points)
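Results are mixed: MemArchitect is strong against compression-based SimpleMem in all four categories, but against maximalist MemOS it wins on Multi-Hop and Open Domain while losing substantial recall on Single-Hop and Temporal Reasoning due to aggressive pruning policies.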
Compute & Efficiency
- Model size: Uses existing models (Qwen2.5-3B, Llama-3.1-8B) without modification - no additional parameters for the governance layer
- Training compute: Not applicable - no training required, pure inference-time policy engine
- Inference speed/latency: Not reported - overhead from policy evaluation, cross-encoder filtering, and Kalman updates not quantified
- Memory footprint: Not reported - storage requirements for utility tracking, co-occurrence graphs, and policy state not specified
- Deployment practicality: High - model-agnostic middleware that can be integrated with existing RAG systems without retraining
Real-World Applicability
- No deployment results on real-world production systems reported
- No hardware experiments or robotic integration demonstrated
- No production integration case studies provided
- Evaluation limited to academic benchmarks (LoCoMo-10) with synthetic conversational data
- Authors acknowledge need for validation on additional benchmarks including LongMemEval, PreFEval, and PersonaMem for real-world applicability assessment
Limitations & Failure Modes
- FUNDAMENTAL: Aggressive decay policies sacrifice raw recall for coherence, leading to significant performance drops on recall-heavy tasks (-44.9 points on temporal reasoning vs MemOS)
- ENGINEERING: Incomplete implementation - several key policies (conflict resolution, toxic memory filter, GDPR compliance) are marked as “planned” future work
- EVALUATION: Limited to single benchmark (LoCoMo-10) with only two baseline comparisons, lacking evaluation on established long-term memory benchmarks
- ENGINEERING: No analysis of computational overhead or latency impact from policy evaluation during inference
- FUNDAMENTAL: Policy calibration challenges - FSRS parameters optimized for human flashcard learning may not transfer to LLM memory management
Failure modes:
- Over-pruning leading to loss of important but infrequently accessed facts
- Policy conflicts where lifecycle management contradicts consistency requirements
FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair
Authors: Ruize Ma, Yilei Jiang, Shilin Zhang, Zheng Ma et al. (10 authors) · Institution: Nanjing University, SenseTime, Chinese University of Hong Kong, University of Texas at Dallas · Category: cs.SE
FailureMem improves multimodal program repair by learning from historical failures through a hierarchical memory bank and using active visual perception tools for region-level screenshot analysis.
Practical Takeaway: If you’re building automated debugging systems, FailureMem demonstrates three key principles worth implementing: (1) contrast failed attempts with successful fixes to extract negative constraints, (2) use active visual tools rather than processing full screenshots passively, and (3) combine structured workflows with flexible agentic reasoning rather than purely one or the other. The failure memory bank concept is particularly valuable - systematically learning from mistakes rather than treating each repair attempt independently. Consider implementing similar failure-aware mechanisms in your own agent systems, especially for domains where visual context matters.
Tags: multimodal program_repair software_engineering LLM_agents visual_reasoning failure_learning automated_debugging GUI_repair
Task & Setting
Multimodal Automated Program Repair (MAPR) addresses real-world software debugging where developers must analyze code alongside visual artifacts like GUI screenshots and UI mockups. Traditional program repair systems only process code and text, missing crucial visual context that explains many software defects. The task requires jointly reasoning over three modalities to generate correct patches.
Given a software repository with codebase $C$, textual issue description $D$, and visual screenshots $I$, MAPR aims to generate a code patch $\Delta$ such that the patched codebase satisfies:
\[F(C') \models T_{spec}\]
where $C' = C \oplus \Delta$ is the patched codebase, $F(\cdot)$ denotes program execution, and $T_{spec}$ includes fail-to-pass tests (verifying the issue is fixed) and pass-to-pass tests (ensuring no regression). Success is measured by resolved rate - the percentage of issues correctly fixed on the SWE-bench Multimodal benchmark.
The benchmark comprises 617 real-world GitHub issues from 17 JavaScript repositories, where visual information is necessary for resolving over 83% of tasks.
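The resolution criterion above amounts to a simple check over test outcomes. The sketch below is illustrative only: the test names, the dictionary layout of `test_results`, and both function names are hypothetical, and it assumes a missing test result counts as a failure.

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """A patch resolves an issue only if every required test passes after patching."""
    required = list(fail_to_pass) + list(pass_to_pass)
    # Tests with no recorded outcome are treated as failures (assumption).
    return all(test_results.get(name, False) for name in required)

def resolved_rate(outcomes):
    """Percentage of issues whose patch satisfied the full test specification."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Hypothetical outcome of running the suite on the patched codebase C'.
results = {"renders_tooltip": True, "keeps_layout": True, "old_api_works": False}
print(is_resolved(results, ["renders_tooltip"], ["keeps_layout"]))
print(is_resolved(results, ["renders_tooltip"], ["old_api_works"]))
print(resolved_rate([True, False, True, True]))
```

The second call shows why pass-to-pass tests matter: a patch that fixes the reported bug but breaks existing behavior does not count as resolved.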
Architecture & Method
- FailureMem introduces a hybrid workflow-agent architecture that combines structured localization with flexible reasoning, using deterministic workflows for file/element identification and agentic loops only for patch generation.
- A hierarchical Failure Memory Bank stores historical repair trajectories in three layers: Contextual (issue summaries, visual analyses), Cognitive (diagnosis, negative constraints, golden principles), and Code (failed vs. golden patch summaries).
- Memory construction uses offline distillation with Gemini 3 Pro to contrast failed patches against ground-truth fixes, extracting reusable repair patterns from 84 failed trajectories.
- Active perception tools enable region-level visual grounding: Crop tool extracts sub-regions from screenshots for detailed inspection, Grounding tool overlays bounding boxes to highlight bug locations.
- Interactive Bash environment allows agents to explore repository structure and verify assumptions before code modification.
- Memory retrieval uses a Selector Agent to identify top-k relevant cases based on semantic similarity between current issues and historical Contextual Layers.
- Three-phase repair process: Phase 1 performs file localization with memory guidance, Phase 2 identifies key elements using skeleton compression, Phase 3 generates patches using agentic reasoning with tool access.
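The three-layer memory entry and top-k retrieval described in this list could be represented roughly as below. This is a sketch under stated assumptions: the class and field names are hypothetical, and a simple token-overlap (Jaccard) score stands in for the LLM-based Selector Agent, which the summary does not specify in detail.

```python
from dataclasses import dataclass

@dataclass
class FailureCase:
    contextual: str  # issue summary and visual analysis
    cognitive: str   # diagnosis, negative constraints, golden principles
    code: str        # failed vs. golden patch summary

def jaccard(a, b):
    """Token-overlap similarity; a stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def retrieve(bank, issue_text, k=2):
    """Return the k cases whose Contextual layer best matches the new issue."""
    ranked = sorted(bank, key=lambda c: jaccard(c.contextual, issue_text), reverse=True)
    return ranked[:k]

# Hypothetical memory bank entries.
bank = [
    FailureCase("tooltip overlaps button after resize",
                "do not patch global z-index", "failed: z-index; golden: offset"),
    FailureCase("chart legend missing colors",
                "check theme tokens first", "failed: inline style; golden: token"),
    FailureCase("modal dialog not centered",
                "avoid fixed pixel margins", "failed: margin; golden: flex"),
]
top = retrieve(bank, "tooltip overlaps button in dark mode", k=1)
print(top[0].cognitive)
```

Only the Contextual layer is used for matching here; the Cognitive and Code layers are what would be injected into the repair prompt once a case is selected.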
Training Recipe
- No model training is performed - FailureMem is a framework that uses existing LLMs (GPT-5.1, GPT-4.1, Claude 4.5) as backbone models.
- Memory bank construction uses offline distillation with Gemini 3 Pro to process 84 failed repair trajectories from SWE-bench Multimodal development set.
- All experiments use sampling temperature of 0 for deterministic generation following Pass@1 evaluation protocol.
- Hardware: experiments conducted on four NVIDIA A100 (80GB) GPUs.
- Wall-clock time: not reported.
- No fine-tuning, pretraining, or RLHF stages involved - purely prompt-based approach with structured memory injection.
Novelty & Lineage
The closest prior work is GUIRepair (Huang et al., 2025c), which first introduced multimodal program repair but uses rigid workflows and passive visual processing. FailureMem’s specific deltas are:
- hybrid workflow-agent architecture balancing structure with flexibility
- active perception tools for region-level visual grounding rather than full-page processing, and
- failure-aware memory bank that transforms repair failures into reusable guidance.
The memory bank design contrasting failed vs. successful patches is novel. Prior APR systems like SWE-agent (Yang et al., 2024b) and Agentless (Ma et al., 2024) lack multimodal capabilities and treat each repair independently without learning from failures.
Rating: SIGNIFICANT - introduces multiple novel components (failure memory, active perception, hybrid architecture) that address real limitations of existing multimodal repair systems.
Benchmarks & Results
- SWE-bench Multimodal: Resolved Rate metric, FailureMem (GPT-5.1) achieves 33.1%, GUIRepair baseline 29.4%, improvement +3.7 points
- SWE-bench Multimodal: FailureMem (GPT-4.1) achieves 31.1%, GUIRepair 28.8%, improvement +2.3 points
- SWE-bench Multimodal: FailureMem (Claude 4.5) achieves 33.8%, GUIRepair 31.5%, improvement +2.3 points
- Outperforms other baselines: SWE-agent Multimodal (12.4%), Computer-Use Agents (20.1%), Agentless Lite (28.4%), Zencoder (27.5%)
- Repository-level breakdown shows consistent improvements across multiple JavaScript projects (next, bpmn-js, carbon, etc.)
Results are consistently positive across all tested models and repositories, with no conspicuously absent benchmarks for this specialized multimodal repair domain.
Compute & Efficiency
- Model size: Uses existing LLMs as backbones (GPT-5.1, GPT-4.1, Claude 4.5) - parameter counts not specified but these are large commercial models
- Training compute: No training required - uses pre-trained models with prompt-based approach on four NVIDIA A100 GPUs
- Inference speed/latency: Not reported, but acknowledged increased cost due to iterative agentic loops and memory contexts
- Memory footprint: Not reported, but memory bank stores 84 structured entries from failed trajectories
- Deployment practicality: Inference cost increases 13% to $0.33 per issue vs. GUIRepair’s $0.29, considered acceptable given performance gains. Commercial model dependency limits deployment flexibility.
Real-World Applicability
- Evaluated on 617 real-world GitHub issues from production software repositories, not synthetic benchmarks
- Issues span 17 popular JavaScript repositories including production systems (next.js, eslint, lighthouse, prettier)
- No deployment results in production environments reported
- No hardware experiments beyond GPU inference infrastructure
- Framework designed as decision-support tool requiring human review before production deployment
- Memory bank constructed from actual developer-verified patches, ensuring real-world grounding
The work operates on authentic software defects but lacks production deployment validation.
Limitations & Failure Modes
- ENGINEERING: Increased inference cost (13% higher than baseline) due to iterative agentic loops and detailed memory contexts
- FUNDAMENTAL: Dependency on diversity of offline memory bank - rare or unprecedented failure modes without historical analogues bypass the contrastive distillation process
- ENGINEERING: Reliance on commercial LLM APIs limits deployment control and introduces external dependencies
- EVALUATION: Memory bank constructed only from 84 failed trajectories from development set, potentially limiting generalization
- ENGINEERING: Skeleton compression strategy may lose important implementation details in very large codebases
Failure modes:
- System reverts to standard agentic behavior when encountering novel failure patterns not captured in memory bank
- Active perception tools may fail on non-standard UI layouts or unconventional visual artifacts.
Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models
Authors: Weijie Qiu, Dai Guan, Junxin Wang, Zhihang Li et al. (9 authors) · Institution: Alibaba · Category: cs.CV
Proxy-GRM trains vision-language reward models with transferable rubrics by using dedicated proxy agents to verify rubric quality during reinforcement learning, achieving state-of-the-art results with 4× less training data.
Practical Takeaway: Research engineers should consider implementing proxy-guided training for reward models, especially when interpretability and transferability matter. The key insight is training dedicated evaluator models to verify intermediate reasoning (rubrics) rather than only final outputs. Use SFT-based proxy agents over RL-based ones for more reliable verification signals. The 4× data efficiency improvement makes this approach practically attractive. However, ensure proxy agents are sufficiently capable (avoid small models like 3B) and test transferability across diverse model families before deployment.
Tags: vision-language-models reward-modeling reinforcement-learning multimodal-evaluation preference-learning rubric-generation transferability proxy-agents
Task & Setting
Generative reward models (GRMs) for vision-language models evaluate outputs through a three-stage pipeline: rubric generation, criterion-based scoring, and final verdict. However, existing methods only optimize the final answer while leaving intermediate rubrics unsupervised, leading to post-hoc rationalization rather than principled evaluation. This undermines the transferability of rubrics to independent evaluators.
The task is to train multimodal generative reward models that produce structured critiques for vision-language model outputs. Given a multimodal query $q$ (text + image $I$) and candidate response pair $(r_1, r_2)$, the model generates:
\[y = \pi_\theta(I, q, r_1, r_2) = \langle\text{rubric}\rangle R \langle/\text{rubric}\rangle \langle\text{eval}\rangle E \langle/\text{eval}\rangle \langle\text{answer}\rangle A \langle/\text{answer}\rangle\]
where $R$ is evaluation criteria, $E$ is criterion-by-criterion assessment, and $A \in \{1,2\}$ is the preference verdict. The objective is to maximize rubric transferability:
\[\text{Transferability}(R) = \mathbf{1}[\phi(q, I, r_1, r_2, R) = A^*]\]
where $\phi$ is an independent proxy evaluator.
Success is measured by accuracy on preference prediction benchmarks: VL-RewardBench (1,247 pairs), Multimodal Reward Bench (5,000 samples), and MM-RLHF-Reward Bench. Key metrics include overall accuracy and transferability to unseen evaluator models.
The paper uses 60k curated preference samples from LLaVA-Critic-113k, RLAIF-V, RLHF-V, and MMIF-23k datasets.
Architecture & Method
- Base architecture: Qwen2.5-VL-7B-Instruct for both policy model (Proxy-GRM) and proxy agents
- Teacher model: Qwen3-VL-235B-A22B for data distillation and structured critique generation
- Proxy agent training: Two variants trained to consume rubrics and predict preferences
  - Proxy-SFT: Supervised fine-tuning on 5k samples with cross-entropy loss
  - Proxy-RL: Additional RL training on 10k samples with binary accuracy reward
- Policy model training uses composite reward function:
\[r = r_{\text{acc}} + r_{\text{proxy}} + 0.5 \cdot r_{\text{format}}\]
where $r_{\text{acc}} = +1$ if final verdict correct, $r_{\text{proxy}} = +1$ if proxy agrees with verdict using generated rubric, $r_{\text{format}} = +1$ for proper XML format
- Reinforcement learning via GRPO (Group Relative Policy Optimization) with frozen proxy agents
- Core contribution: Closed-loop rubric quality verification through independent proxy agents that measure transferability during training, unlike prior work that only optimizes final answers
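The composite reward in this section can be sketched as a small function over one generated critique. Several details here are assumptions rather than facts from the paper: the regular expression used as the format check, the zero reward for each failing component, and the reading of $r_{\text{proxy}}$ as the transferability indicator (the proxy's prediction from the rubric matches the reference label $A^*$). The function names are hypothetical.

```python
import re

# Expected layout: <rubric>...</rubric><eval>...</eval><answer>1 or 2</answer>
CRITIQUE = re.compile(
    r"\s*<rubric>.+?</rubric>\s*<eval>.+?</eval>\s*<answer>\s*([12])\s*</answer>\s*",
    re.DOTALL,
)

def parse_verdict(text):
    """Return the verdict (1 or 2) if the critique is well-formed, else None."""
    match = CRITIQUE.fullmatch(text)
    return int(match.group(1)) if match else None

def composite_reward(output_text, gold, proxy_prediction):
    """r = r_acc + r_proxy + 0.5 * r_format, each component in {0, 1}."""
    verdict = parse_verdict(output_text)
    r_format = 1.0 if verdict is not None else 0.0
    r_acc = 1.0 if verdict == gold else 0.0
    r_proxy = 1.0 if proxy_prediction == gold else 0.0  # assumed reading of r_proxy
    return r_acc + r_proxy + 0.5 * r_format

critique = "<rubric>Check object counts</rubric><eval>r2 counts correctly</eval><answer>2</answer>"
print(composite_reward(critique, gold=2, proxy_prediction=2))
print(composite_reward("unstructured text", gold=1, proxy_prediction=2))
```

Because the proxy term is scored separately from the policy's own verdict, a rubric that only rationalizes the final answer without helping an independent reader earns less than one that transfers.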
Training Recipe
- Data distillation: Qwen3-VL-235B-A22B generates structured critiques for 60k samples, yielding 25k correct samples
- Proxy-SFT training: 5k samples, learning rate 1×10⁻⁵, cosine scheduling, 1 epoch, ms-swift framework
- Proxy-RL training: 10k samples, learning rate 5×10⁻⁶, GRPO with group size 7, verl framework
- Policy cold-start SFT: 10k samples, learning rate 1×10⁻⁵, cosine scheduling, 1 epoch
- Policy RL training: 45k samples (10k correct + 35k hard negatives), learning rate 5×10⁻⁶, GRPO, batch size 256, mini-batch 128
- Hardware and wall-clock time: not reported
- Data filtering: automatic quality, difficulty, and similarity filters applied to source datasets
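GRPO, used in the RL stages above, scores each sampled output relative to the other outputs drawn for the same prompt instead of relying on a learned value network. The sketch below shows that normalization for one group of 7 samples; the reward values are hypothetical, and the use of the population standard deviation and the `eps` term are assumptions (implementations differ on both).

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical composite rewards for 7 critiques sampled for one prompt.
group = [2.5, 1.5, 0.0, 2.5, 1.0, 0.5, 2.5]
advantages = group_relative_advantages(group)
print([round(a, 2) for a in advantages])
```

When every sample in a group receives the same reward the advantages are all zero, so such prompts contribute no gradient signal; this is one reason curated hard negatives can matter for the policy RL stage.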
Novelty & Lineage
Prior work includes R1-Reward (2025), Unified-Reward (2025), and Auto-Rubric (2025). Closest work is Auto-Rubric which extracts rubrics from annotations, and Rubrics-as-Rewards (2025) which uses rubrics as structured rewards in text-only settings.
The specific delta is introducing proxy-guided rubric verification into the RL training loop for multimodal settings. Unlike LLM-as-judge approaches that provide non-differentiable external feedback, this method trains dedicated proxy agents to evaluate rubric transferability and integrates this as a reward signal during policy training.
Key finding: SFT-based proxy agents surprisingly outperform RL-based ones for rubric evaluation, revealing tension between outcome-level RL and process-level evaluation fidelity.
Rating: SIGNIFICANT - addresses fundamental limitation of existing GRMs with novel closed-loop verification approach.
Benchmarks & Results
- VL-RewardBench: 75.22% overall accuracy vs 73.8% (Unified-Reward-Think), +1.42 points improvement, 73.93% macro accuracy
- Multimodal Reward Bench: 85.62% accuracy vs 82.2% (R1-Reward), +3.42 points improvement
- MM-RLHF-Reward Bench: 82.94% accuracy vs 80.59% (R1-Reward), +2.35 points improvement, 56.52% on Acc+ metric
- Rubric transferability: Proxy-GRM rubrics improve external evaluator accuracy by 3-10 points when transferred to Qwen2.5-VL-7B/32B-Instruct and Unified-Reward-SFT
- Data efficiency: achieves SOTA with ~50k training samples vs >200k for comparable methods (4× less data)
Results are consistently positive across all benchmarks. Performance improvements are substantial and the method demonstrates strong transferability to unseen evaluators.
Compute & Efficiency
- Model size: 7B parameters for both policy and proxy models (Qwen2.5-VL-7B base)
- Training compute: not reported (GPU hours, hardware specifications not provided)
- Inference speed/latency: not reported
- Memory footprint: not reported
- Deployment practicality: Good - uses standard 7B model size, requires additional proxy agent at training time but not at inference. Data efficiency (4× less training data) suggests practical training costs
Real-World Applicability
- No deployment results or production integration reported
- No hardware experiments or real-world evaluation environments described
- Evaluation limited to curated benchmark datasets (VL-RewardBench, Multimodal Reward Bench, MM-RLHF-Reward Bench)
- Rubric transferability tested only on academic models, not production systems
- No discussion of sim-to-real gaps or domain adaptation challenges
The work is primarily evaluated on academic benchmarks without real-world deployment validation.
Limitations & Failure Modes
- EVALUATION: Limited to curated benchmark datasets, no real-world deployment testing
- ENGINEERING: Proxy agent training requires additional computational overhead and careful model selection
- FUNDAMENTAL: Method relies on proxy agent quality - insufficient capability (e.g., 3B model) introduces harmful noise
- ENGINEERING: No analysis of failure modes when policy and proxy disagree at inference time
- EVALUATION: Transferability only tested on similar model families (Qwen variants), not diverse architectures
- FUNDAMENTAL: Composite reward design may not generalize to other reward formulations or task domains
Failure modes:
- Proxy agents may provide misleading feedback if trained on biased or insufficient data
- Policy model may learn to game the proxy reward rather than generate genuinely transferable rubrics
Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation
Authors: Chang Nie, Tianchen Deng, Guangming Wang, Zhe Liu et al. (5 authors) · Institution: Shanghai Jiao Tong University · Category: cs.RO
HEAR introduces a streaming audio memory architecture to solve the Blind Execution Interval problem in Vision-Language-Action models, enabling robots to react to transient environmental sounds during continuous manipulation.
Practical Takeaway: If you’re building robot policies that need to react to environmental sounds, this work identifies a critical blindspot in current VLA architectures: the Blind Execution Interval where transient audio is lost during action chunking. The streaming audio memory approach is worth implementing, especially the Historizer module that maintains causal context across execution gaps. The HEAR-Bench evaluation framework with strict timing constraints is also valuable for properly testing audio-reactive policies. However, the significant sim-to-real performance gap (81% to 54%) suggests you’ll need substantial real-world validation and potentially domain adaptation techniques.
Tags: robotics multimodal-learning audio-processing vision-language-action continuous-control action-chunking streaming-audio robot-manipulation
Task & Setting
Real-world context: Robots operating in dynamic environments must process multiple sensory modalities simultaneously, but current Vision-Language-Action (VLA) models treat sound as static pre-execution prompts, missing critical real-time acoustic feedback during task execution. This creates a significant gap for tasks requiring immediate response to transient environmental sounds like microwave beeps or collision clicks, which provide essential state verification that vision alone cannot capture.
Task definition: The paper formalizes Vision-Sound-Language-Action (VSLA) as continuous control conditioned on multi-view RGB images $I^{1:V}_t$, streaming audio $A_t$, language instructions $l$, and proprioception $s_t$ under delayed decision loops. The core challenge is the Blind Execution Interval (BEI) where acoustic events occurring during open-loop action chunking are lost. Success requires causal timing constraints:
\[\text{Success}(\xi) = \mathbb{I}[t_{snd} \leq t_{goal} \leq T]\]
where $t_{snd}$ is when the required acoustic cue occurs and $t_{goal}$ is when the task goal is reached.
Evaluation criteria: Success measured by timing-sensitive criteria that reject premature actions completed before required acoustic triggers, even if visually correct. Performance evaluated on task completion rates under strict causal timing rules.
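The timing rule above can be sketched as a small check. This is a minimal illustration rather than the benchmark's actual evaluator: the function name, the example episodes, and the reading of $T$ as an episode time limit are assumptions made here.

```python
from typing import Optional


def causal_success(t_snd: float, t_goal: Optional[float], horizon: float) -> bool:
    """Indicator I[t_snd <= t_goal <= T] from the success criterion.

    t_goal is None when the goal was never reached during the episode.
    horizon plays the role of T (assumed here to be the episode time limit).
    """
    if t_goal is None:
        return False
    return t_snd <= t_goal <= horizon


# Hypothetical episodes: (cue time, goal time, horizon), all in seconds.
episodes = {
    "waited_for_beep": (12.0, 15.5, 30.0),  # goal reached after the cue, in time
    "premature": (12.0, 9.0, 30.0),         # goal reached before the cue
    "timed_out": (12.0, 31.0, 30.0),        # goal reached after the horizon
    "never_finished": (12.0, None, 30.0),   # goal never reached
}

results = {name: causal_success(*ep) for name, ep in episodes.items()}
success_rate = sum(results.values()) / len(results)
```

Only the first episode counts as a success. The "premature" case is the one that separates this metric from a purely visual check: the final state may look correct, but the action finished before the sound that should have triggered it.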
Dataset/Benchmark: Introduces OpenX-Sound (100 skills, 120k episodes, 680 objects with synthesized audio) for pretraining and HEAR-Bench (7 sound tasks, 30 objects) for evaluation with real-time audio-physics co-simulation enforcing causal timing rules.
Architecture & Method
- Historizer: Streaming Stateful Transformer maintaining causal audio memory $h_{t_k}$ across execution gaps via packet-wise updates:
\[h_{t_k}^{(j)} = \text{Hist}_\phi(h_{t_k}^{(j-1)}, \phi_{audio}(P_{t_k}^{(j)}))\]
- Envisioner: Hierarchical design with high-level omni-modal model (Qwen3-Omni) producing semantic latent $z_{t_k}$ and KV cache, plus low-level LLM (Qwen3-0.6B) generating control features $u_{t_k}$ from cached context.
- Advancer: Audio world model predicting near-future audio codes via decoder-only transformer with loss:
\[\mathcal{L}_{adv} = -\mathbb{E}_D[\log p_\eta(z^a_{t_k \rightarrow t_{k+1}} | z_{t_k})]\]
- Realizer: Conditional Flow Matching policy generating smooth action chunks with vector field regression:
\[\mathcal{L}_{flow} = \mathbb{E}[\|v_\xi(x_\lambda, \lambda, u_{t_k}) - (x_1 - x_0)\|_2^2]\]
Core contribution: Decoupling high-frequency sensory memory from low-frequency decision-making to solve the Blind Execution Interval problem through persistent causal audio context.
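To make the decoupling concrete, here is a toy sketch of why packet-wise memory updates matter under action chunking. It is not the paper's Historizer: the learned transformer is replaced by a hypothetical decaying max-hold state, and the scalar packet energies, class, and function names are all assumptions for illustration.

```python
from typing import List


class DecayingAudioMemory:
    """Toy stateful memory: h_j = max(decay * h_{j-1}, e_j).

    A hypothetical stand-in for a learned streaming memory, used only to
    illustrate packet-wise updates between decision steps.
    """

    def __init__(self, decay: float = 0.9) -> None:
        self.decay = decay
        self.state = 0.0

    def update(self, packet_energy: float) -> float:
        self.state = max(self.decay * self.state, packet_energy)
        return self.state


def snapshot_inputs(packets: List[float], chunk_len: int) -> List[float]:
    """Audio seen by a policy that samples only at decision steps."""
    return [packets[j] for j in range(0, len(packets), chunk_len)]


def streaming_inputs(
    packets: List[float], chunk_len: int, decay: float = 0.9
) -> List[float]:
    """Memory state read at decision steps after updating on every packet."""
    memory = DecayingAudioMemory(decay)
    seen = []
    for j, energy in enumerate(packets):
        state = memory.update(energy)  # runs at packet rate, even mid-chunk
        if j % chunk_len == 0:         # the policy reads only at decision rate
            seen.append(state)
    return seen


# A one-packet "beep" at index 3 falls between decisions at indices 0, 4, 8.
packets = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
snapshot = snapshot_inputs(packets, chunk_len=4)
streamed = streaming_inputs(packets, chunk_len=4)
```

With a chunk length of 4, the snapshot policy never observes the beep, while the streamed memory still carries a decayed trace of it at the next two decision steps. The decay constant here is arbitrary; a learned memory would decide what to retain rather than following a fixed schedule.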
Training Recipe
- Pretraining stage: Trained on OpenX-Sound dataset (120k episodes with synthesized audio tracks generated via video-to-audio models)
  - Data: 100 manipulation skills, 680 objects, audio generated using advanced video-to-audio generation models
  - Optimizer: Not reported
  - Learning rate, schedule, batch size: Not reported
  - Hardware and wall-clock time: Not reported
- Fine-tuning stage: Task-specific fine-tuning on HEAR-Bench simulation data
  - Data: 7 sound-centric tasks with strict causal timing rules
  - Training details: Not reported
- Multi-objective training: Combines imitation learning loss with auxiliary objectives:
  - Stage description loss $\mathcal{L}_{text}$ for JSON stage predictions
  - Audio prediction loss $\mathcal{L}_{adv}$ for temporal grounding
  - Flow matching loss $\mathcal{L}_{flow}$ for action generation
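The flow matching term listed above can be sketched for a single sample. The summary does not define $x_0$, $x_1$, or $x_\lambda$, so this sketch assumes the common linear-path formulation: $x_0$ is a noise sample, $x_1$ is the expert action chunk, and $x_\lambda = (1-\lambda)x_0 + \lambda x_1$, whose time derivative is the regression target $x_1 - x_0$. The function names and toy vector fields are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
# A vector field takes (x_lambda, lambda, conditioning u) and returns a velocity.
VectorField = Callable[[Vector, float, Vector], Vector]


def interpolate(x0: Vector, x1: Vector, lam: float) -> Vector:
    """Point on the assumed linear path between x0 and x1 at time lam."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(x0, x1)]


def flow_matching_loss(
    v_field: VectorField, x0: Vector, x1: Vector, lam: float, u: Vector
) -> float:
    """Single-sample squared error ||v(x_lam, lam, u) - (x1 - x0)||^2."""
    x_lam = interpolate(x0, x1, lam)
    target = [b - a for a, b in zip(x0, x1)]
    pred = v_field(x_lam, lam, u)
    return sum((p - t) ** 2 for p, t in zip(pred, target))


# Toy 2-D "action chunk" with a placeholder conditioning vector.
x0 = [0.0, 0.0]  # noise sample (assumed)
x1 = [1.0, 2.0]  # expert action (assumed)
u = [0.3]        # stand-in for the control features u_{t_k}


def zero_field(x: Vector, lam: float, u: Vector) -> Vector:
    return [0.0 for _ in x]


def straight_line_field(x: Vector, lam: float, u: Vector) -> Vector:
    return [1.0, 2.0]  # equals x1 - x0 for this particular pair


loss_zero = flow_matching_loss(zero_field, x0, x1, 0.5, u)
loss_exact = flow_matching_loss(straight_line_field, x0, x1, 0.5, u)
```

In training, the expectation in $\mathcal{L}_{flow}$ would be approximated by averaging this quantity over sampled $\lambda$, noise, and demonstrations. How the three losses are weighted against each other is not reported in the summary.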
Novelty & Lineage
Prior work: Builds on VLA models like Octo (2024), OpenVLA (2024), and π0.5 (2025), plus recent audio-VLA work like OmniVLA and Audio-VLA (2025). Flow matching extends π0.5’s approach.
Novel contributions:
- First formalization of the Blind Execution Interval problem in action chunking
- Streaming stateful audio memory architecture bridging execution gaps
- Audio world model for temporal grounding in quasi-static visual scenes
- Real-time audio-physics co-simulation with causally strict evaluation rules
Assessment: SIGNIFICANT - addresses a fundamental architectural limitation in current VLA systems with a novel streaming audio memory and introduces important evaluation infrastructure, though it builds incrementally on existing transformer and flow matching foundations.
Benchmarks & Results
- HEAR-Bench simulation: HEAR achieves 81% success rate vs. waveform rendering (61%) and ASR baseline (35%)
- Real-world robot evaluation: HEAR achieves 54% success rate across 4 sound-centric tasks on a physical Franka Panda robot
- Component ablation: Shows importance of each module (Historizer, Envisioner, Advancer, Realizer), with degraded performance when components are removed
- Timing analysis: Demonstrates successful capture of transient audio events during Blind Execution Intervals that other methods miss
Note: Limited comparison to other multimodal robot learning methods beyond basic ASR and waveform baselines. Missing comparisons to other recent audio-VLA works like OmniVLA in identical settings.
Compute & Efficiency
- Model size: Multi-component architecture with Qwen3-Omni (large) + Qwen3-0.6B (small) + specialized audio/flow modules, total parameters not reported
- Training compute: GPU hours and hardware not reported
- Inference speed: Real-time deployment on physical robot, specific latency numbers not provided but designed for low-rate decision loops with action chunking
- Memory footprint: Maintains compact causal audio memory, specific memory usage not quantified
- Deployment practicality: Successfully deployed on real Franka Panda robot, demonstrates practical feasibility but compute requirements not fully characterized
Real-World Applicability
- Physical robot deployment: Successfully tested on Franka Panda robot with external microphone across 4 sound-centric manipulation tasks
- Hardware setup: Uses standard robot arm with external microphone for audio input, no specialized acoustic hardware required
- Environment testing: Real-world indoor manipulation environment with ambient noise and contact sounds
- Performance validation: 54% real-world success rate demonstrates sim-to-real transfer, though with a clear gap relative to the 81% simulation result
- Practical constraints: Addresses real system latency and asynchronous sensor updates, designed for actual deployment constraints rather than idealized settings
Limitations & Failure Modes
- FUNDAMENTAL: Relies on discrete audio tokenization which may lose fine-grained acoustic details important for subtle contact interactions
- FUNDAMENTAL: Performance degrades significantly from simulation (81%) to real-world (54%), indicating substantial sim-to-real gap
- ENGINEERING: Audio memory horizon limited by transformer context length, may struggle with very long monitoring tasks
- ENGINEERING: Requires careful tuning of audio packet sizes and memory update rates for different acoustic environments
- EVALUATION: Limited comparison baselines - missing evaluation against other recent multimodal robot learning methods
- EVALUATION: HEAR-Bench uses synthesized audio which may not capture all real-world acoustic complexity
Failure modes:
- May fail when critical acoustic cues are masked by robot ego-noise during motion
- Performance likely degrades in acoustically complex environments with multiple overlapping sound sources