Applied AI Digest — May 4, 2026
Today’s Digest at a Glance
Preliminary
Today’s papers explore multimodal robotics with novel control architectures, specialized visual recognition systems, and high-throughput simulation frameworks for embodied AI.
Model Predictive Path Integral (MPPI) Control
MPPI addresses the challenge of real-time optimal control under uncertainty by reformulating stochastic control as a sampling problem. Traditional model predictive control (MPC) requires solving computationally expensive optimization problems at each timestep, making it impractical for high-frequency robotic control with nonlinear dynamics and constraints.
The core insight of MPPI is to approximate the optimal control distribution using importance sampling. Given a dynamics model $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t) + \mathbf{w}_t$ where $\mathbf{w}_t$ is process noise, MPPI samples $K$ candidate control sequences $\{\mathbf{u}^{(k)}\}_{k=1}^K$ from a noise distribution and evaluates their costs $S^{(k)}$. The optimal control is computed as a weighted average:
\[\mathbf{u}_t^* = \sum_{k=1}^K w^{(k)} \mathbf{u}_t^{(k)}\]
where the weights are $w^{(k)} = \frac{\exp(-\beta S^{(k)})}{\sum_{j=1}^K \exp(-\beta S^{(j)})}$ and $\beta$ is an inverse-temperature parameter: a larger $\beta$ concentrates the weight on the lowest-cost samples.
Essentially, MPPI turns optimal control into a weighted Monte Carlo sampling problem where better trajectories get higher weights. The method naturally handles constraints and nonlinear dynamics by encoding them in the cost function, making it particularly suitable for contact-rich manipulation tasks where analytical solutions are intractable.
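The weighted-average update above can be sketched in a few lines of NumPy. This is a minimal illustration, not any paper's implementation: the function name `mppi_step`, the Gaussian perturbation scale `sigma`, and the toy 1D dynamics and cost are all assumptions made for the example.

```python
import numpy as np

def mppi_step(x0, u_nominal, dynamics, cost, K=256, sigma=0.5, beta=1.0, rng=None):
    """One MPPI update: perturb, roll out, weight by exp(-beta * cost), average."""
    rng = np.random.default_rng() if rng is None else rng
    H, m = u_nominal.shape
    # Sample K perturbed control sequences around the nominal plan.
    noise = rng.normal(0.0, sigma, size=(K, H, m))
    candidates = u_nominal[None] + noise
    # Roll each candidate through the model and accumulate its cost S^(k).
    costs = np.zeros(K)
    for k in range(K):
        x = x0
        for t in range(H):
            costs[k] += cost(x, candidates[k, t])
            x = dynamics(x, candidates[k, t])
    # Softmax weights; subtracting the minimum cost avoids underflow
    # and leaves the normalized weights unchanged.
    w = np.exp(-beta * (costs - costs.min()))
    w /= w.sum()
    # Weighted average over samples gives the updated control sequence.
    return np.tensordot(w, candidates, axes=1), w

# Toy problem (hypothetical): drive a 1D point toward the origin.
rng = np.random.default_rng(0)
dyn = lambda x, u: x + 0.1 * u
cst = lambda x, u: float(x[0] ** 2 + 0.01 * u[0] ** 2)
u_new, w = mppi_step(np.array([1.0]), np.zeros((10, 1)), dyn, cst, rng=rng)
```

In a receding-horizon loop, only the first row of `u_new` would be executed before re-sampling around the shifted plan. Constraints would enter through the `cost` callable, for example as large penalties.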
Mixed Complementarity Problems (MCP) for Contact Dynamics
Contact dynamics in robotics simulation requires handling the fundamental challenge that contact forces are unknown until contacts occur, but contacts depend on the forces. Traditional approaches either use penalty methods (which can be stiff and unstable) or pure constraint-based methods (which can be computationally expensive for many contacts).
MCP formulates contact dynamics as a system where complementarity conditions capture the either-or nature of contacts. For each potential contact point, we have three states: separated ($d > 0, f_n = 0$), sliding ($d = 0, |f_t| = \mu f_n$), or sticking ($d = 0, |f_t| < \mu f_n$). These can be expressed as complementarity constraints: $0 \leq d \perp f_n \geq 0$ (gap distance and normal force cannot both be positive).
The complete system becomes:
\[\mathbf{M}\ddot{\mathbf{q}} = \mathbf{h} + \mathbf{J}^T \boldsymbol{\lambda}\]
subject to complementarity constraints $\mathbf{0} \leq \boldsymbol{\phi}(\mathbf{q}) \perp \boldsymbol{\lambda} \geq \mathbf{0}$, where $\mathbf{M}$ is the mass matrix, $\mathbf{h}$ contains Coriolis and external forces, $\mathbf{J}$ is the contact Jacobian, $\boldsymbol{\lambda}$ are contact forces, and $\boldsymbol{\phi}$ are gap functions. The Projected Gauss-Seidel method solves this iteratively by projecting force updates onto feasible regions.
Intuitively, MCP lets the simulation “discover” contact forces by solving a mathematical puzzle where forces and gaps cannot both be active simultaneously.
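To make the projection step concrete, here is a minimal sketch of Projected Gauss-Seidel on a small linear complementarity problem. It assumes a simplified frictionless case, $0 \leq \boldsymbol{\lambda} \perp A\boldsymbol{\lambda} + b \geq 0$, where `A` stands in for a matrix of the form $\mathbf{J}\mathbf{M}^{-1}\mathbf{J}^T$ and `b` for the contact-space quantity before forces are applied. The 2×2 numbers are invented for illustration.

```python
import numpy as np

def projected_gauss_seidel(A, b, iters=100):
    """Solve the LCP  0 <= lam  (perp)  A @ lam + b >= 0  by projected Gauss-Seidel."""
    lam = np.zeros(len(b))
    for _ in range(iters):
        for i in range(len(b)):
            # Residual of row i using the most recent values of lam.
            residual = A[i] @ lam + b[i]
            # Unconstrained Gauss-Seidel update, then projection onto lam_i >= 0.
            lam[i] = max(0.0, lam[i] - residual / A[i, i])
    return lam

A = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 1.0])
lam = projected_gauss_seidel(A, b)
gap = A @ lam + b
print(lam, gap)  # lam = [0.5, 0.0], gap = [0.0, 1.5]
```

The first contact ends up active (positive force, zero gap) and the second separated (zero force, positive gap), which is the either-or structure described above. Friction would add further projections onto the friction cone and is omitted here.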
2D Rotary Positional Embedding (2D RoPE)
2D RoPE extends the rotary positional encoding mechanism from 1D sequences to 2D spatial grids, addressing the limitation that standard positional encodings cannot effectively capture spatial relationships in images with varying resolutions and aspect ratios.
Standard RoPE applies rotations to query-key pairs based on relative positions: for 1D position difference $m$, it rotates feature dimensions by $m\theta_i$ where $\theta_i = 10000^{-2i/d}$. For 2D images, this becomes problematic because a single scalar position cannot capture both horizontal and vertical relationships.
2D RoPE decomposes the feature dimension $d$ into two parts: $d/2$ dimensions for horizontal encoding and $d/2$ for vertical encoding. For a spatial position $(x, y)$, the rotation angles become:
\[\theta_{i,x} = x \cdot 10000^{-2i/d}, \quad \theta_{i,y} = y \cdot 10000^{-2(i+d/2)/d}\]
The attention computation then applies separate rotations for each spatial dimension, allowing the model to understand both horizontal and vertical spatial relationships independently.
This allows vision transformers to maintain spatial understanding across different image resolutions and aspect ratios without requiring resolution-specific training.
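A small NumPy sketch of the split-and-rotate idea follows. It takes the angle formulas above at face value, with $i$ indexing rotation pairs inside each half. The function name `rope_2d` and the pairing of adjacent dimensions are assumptions for this illustration, and published 2D RoPE variants differ in how they assign frequencies.

```python
import numpy as np

def rope_2d(vec, x, y, base=10000.0):
    """Rotate a 1-D, d-dim feature vector by angles derived from grid position (x, y)."""
    d = vec.shape[-1]
    assert d % 4 == 0, "need d/2 dims per axis, grouped into rotation pairs"
    half = d // 2
    i = np.arange(half // 2)                       # pair index within each half
    ang_x = x * base ** (-2.0 * i / d)             # horizontal angles
    ang_y = y * base ** (-2.0 * (i + half) / d)    # vertical angles
    out = np.empty_like(vec, dtype=float)
    for start, ang in ((0, ang_x), (half, ang_y)):
        a = vec[start:start + half:2]              # first element of each pair
        b = vec[start + 1:start + half:2]          # second element of each pair
        out[start:start + half:2] = a * np.cos(ang) - b * np.sin(ang)
        out[start + 1:start + half:2] = a * np.sin(ang) + b * np.cos(ang)
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(16), rng.standard_normal(16)
score_a = rope_2d(q, 3, 5) @ rope_2d(k, 1, 2)
score_b = rope_2d(q, 13, 25) @ rope_2d(k, 11, 22)   # same offset (2, 3), shifted
```

Because each pair is rotated by an angle linear in position, `score_a` and `score_b` agree up to floating-point error: the attention logit depends only on the relative offset between query and key.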
Reading Guide
Action Agent demonstrates how LLM orchestration can coordinate multiple specialized models (video generation, flow control, evaluation) for complex robotic navigation tasks. DiagramNet tackles a complementary visual reasoning challenge by decomposing circuit diagram understanding into specialized detection and reasoning agents. CoRAL bridges these approaches by using LLMs to design cost functions for MPPI-based manipulation control, while GS-Playground provides the simulation infrastructure needed to train such multimodal systems at scale using efficient 3DGS rendering.
Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion
Authors: Jeffrin Sam, Nguyen Khang, Yara Mahmoud, Miguel Altamirano Cabrera et al. (5 authors) · Institution: Skoltech · Category: cs.RO
Action Agent decouples robot navigation into agentic video generation followed by flow-constrained diffusion control, achieving competitive navigation performance with 161× fewer parameters than foundation VLA models.
Practical Takeaway: If you’re working on robot navigation, the key insight is the explicit decoupling of trajectory imagination (video generation) from execution (velocity control) through a Visual Intermediate Representation. The agentic optimization loop for video validation is practically valuable, improving success from 35% to 86%. The FlowDiT architecture demonstrates that combining optical flow with semantic features can achieve competitive navigation performance with 161× fewer parameters than foundation VLAs. However, be cautious about the video-to-metric scale ambiguity in real deployments - closed-loop execution will likely be necessary for robust performance. The approach is most suitable for structured indoor environments where the computational overhead of Stage I video generation can be amortized across multiple similar navigation tasks.
Tags: robotics navigation diffusion_models video_generation vision_language embodied_ai multimodal_control optical_flow
Task & Setting
The work addresses language-guided robot navigation in indoor environments, where robots must understand natural language instructions and translate them into safe physical motion. This is challenging because it requires grounding abstract language semantics into concrete visual perception and motor control while handling diverse robot embodiments and unpredictable real-world scenarios.
The task takes as input:
- a natural language instruction $L$ (e.g., “walk toward the table”)
- an initial first-person RGB observation $I_0$, and optionally
- a goal image $I_g$.
The system outputs continuous velocity commands $a_t = (v_x, v_y, \omega)$ at 40-47 Hz for robot control. The objective can be formalized as:
\[\text{Stage I: } (L, I_0) \rightarrow V_{\text{goal}}\] \[\text{Stage II: } (V_{\text{goal}}, [I_t], L) \rightarrow a_t\]
Success is measured by:
- navigation success rate (reaching goal without collision)
- mean Absolute Trajectory Error (ATE)
- Direction Accuracy (DA), and
- task completion rate in real-world trials.
The paper evaluates on 50 first-person navigation tasks across indoor environments (warehouses, hospital corridors) with three robot embodiments: Unitree G1 humanoid, quadrotor drone, and wheeled mobile robot.
Architecture & Method
- Stage I - Agentic Video Generation: An LLM orchestration agent coordinates three components: (i) Qwen3-VL vision-language model for prompt construction, (ii) WAN 2.2 or LTX-Video diffusion generators for first-person navigation video synthesis, and (iii) Cosmos-Reason1 evaluator that scores videos on Prompt Adherence (PA), Physical Plausibility (PP), and Visual Quality (VQ).
- Iterative Optimization: The system optimizes prompt parameters $p$ to maximize a multi-objective reward:
\[p^* = \arg\max_p [\lambda_1 PA(V_p) + \lambda_2 PP(V_p) + \lambda_3 VQ(V_p)]\]
subject to mean(PA, PP, VQ) ≥ 80, using bottleneck-first refinement over a maximum of 5 iterations.
- Stage II - FlowDiT Architecture: A 43M-parameter Flow-Constrained Diffusion Transformer with 8 transformer blocks, 512 hidden dimensions, and 8 attention heads. It conditions on a 2304-dimensional vector combining DINOv2 visual features (768D), learned optical flow embeddings (256D), optional live observation (768D), and CLIP language embeddings (512D).
- Diffusion Policy Formulation: Models the action distribution using denoising diffusion with forward process $a_t = \sqrt{\bar{\alpha}_t}\, a_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$, optimizing the noise prediction loss:
\[L = \mathbb{E}_{a_0,t,\epsilon}\left[\|\epsilon - \epsilon_\theta(a_t, t, c)\|^2\right]\]
- Receding Horizon Control: Predicts 8-step velocity sequences, executes the first action, and re-plans at the next timestep, giving model-predictive control at 40-47 Hz.
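The forward process and noise-prediction loss in the Diffusion Policy Formulation item can be sketched as follows. This is an illustrative NumPy version rather than the FlowDiT code: `eps_model` is a placeholder callable standing in for the denoiser, `cond` stands in for the conditioning vector, and the schedule values are the ones listed under Training Recipe.

```python
import numpy as np

T = 100                                   # training diffusion steps
betas = np.linspace(1e-4, 2e-2, T)        # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)       # cumulative product, alpha-bar_t

def q_sample(a0, t, eps):
    """Forward process: a_t = sqrt(alpha_bar_t) * a0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * a0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def noise_prediction_loss(eps_model, a0, cond, rng):
    """Single-sample estimate of E[ || eps - eps_model(a_t, t, cond) ||^2 ]."""
    t = int(rng.integers(T))              # random diffusion step
    eps = rng.standard_normal(a0.shape)   # target noise
    a_t = q_sample(a0, t, eps)
    return float(np.sum((eps - eps_model(a_t, t, cond)) ** 2))

rng = np.random.default_rng(0)
a0 = rng.uniform(-1.0, 1.0, size=(8, 3))  # 8-step horizon of (v_x, v_y, omega) in [-1, 1]
zero_model = lambda a_t, t, cond: np.zeros_like(a_t)
loss = noise_prediction_loss(zero_model, a0, None, rng)
```

Here `zero_model` only exercises the code path. In training, the loss would be averaged over a batch and backpropagated through the denoiser.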
Training Recipe
- Pretraining Stage: FlowDiT pretrained on RECON outdoor navigation dataset (11,830 episodes) from Open X-Embodiment collection using Clearpath Jackal wheeled robot data for general visual navigation priors.
- Fine-tuning Stage: 203 Unitree G1 humanoid episodes (162 train / 41 val) collected in Isaac Sim indoor environments (warehouse and hospital corridors) to calibrate velocity dynamics for humanoid embodiment.
- Training Configuration: AdamW optimizer with learning rate 1e-4, batch size 8 with FP16 precision, linear noise schedule $\beta_1 = 10^{-4}$ to $\beta_T = 2\times10^{-2}$, 100 diffusion steps for training / 10 DDIM steps for inference.
- Hardware: NVIDIA RTX 5090 (32 GB), training time not reported.
- Data Processing: 224×224 RGB frames, action horizon H=8, normalized velocity actions to [-1,1]³, uses frozen DINOv2 and CLIP encoders with only 43M parameters trainable.
Novelty & Lineage
Prior Work:
- NoMaD (2023) - goal-conditioned diffusion policies for navigation using ~100M parameters, achieving 68% success on RECON
- ViNT (2023) - vision transformer navigation with ~25M parameters, 61% success on RECON
- OpenVLA (2024) - foundation VLA model with ~7B parameters, 65% success rate
Delta: This paper introduces:
- explicit two-stage decomposition separating trajectory imagination (video generation) from execution (velocity control)
- agentic optimization loop for video validation using LLM orchestration
- flow-constrained diffusion combining optical flow with semantic features for ego-motion representation.
Applied-Specific Assessment:
- Architecture: The two-stage decomposition is a reasonable engineering choice but not architecturally novel - separating planning from control is classical robotics
- Benchmark Gains: 73.2% vs 68% (NoMaD) represents modest 5.2 percentage point improvement, with different datasets making direct comparison questionable
- Fair Comparisons: Baseline comparisons acknowledge different datasets/robots (“for architectural context rather than direct benchmarking”), undermining claims of superiority
- Scale Dependence: The 161× parameter reduction vs OpenVLA is impressive, but success likely depends heavily on the sophisticated video generation models in Stage I
The agentic video generation loop (35% to 86% success improvement) shows clear value, but the overall approach combines known techniques (diffusion policies, optical flow, two-stage planning) rather than introducing fundamental innovations.
Verdict: INCREMENTAL — Solid engineering combining established techniques with meaningful parameter efficiency gains, but lacks fundamental algorithmic breakthroughs.
Benchmarks & Results
- Stage I Video Generation Success: Action Agent achieves 86% overall success rate vs 35% single-shot baseline across 50 navigation tasks, with Unitree G1 (92%), Mobile Robot (83%), Drone (77%)
- FlowDiT Navigation Success (Simulation): 73.2% success rate on G1 validation split (41 episodes) vs ViNT 61%, NoMaD 68%, OpenVLA 65% - though different datasets limit direct comparison
- Real Hardware Performance: 64.7% task completion rate (11/17 trials) on physical Unitree G1 in unseen indoor environments under open-loop execution
- Parameter Efficiency: 43M trainable parameters represents 161× reduction compared to OpenVLA’s ~7B parameters while maintaining competitive performance
- Inference Speed: 40-47 Hz execution frequency (~20 ms per step) on RTX 5090, generating 121 velocity waypoints in 3.68s average
- Ablation Results: Full model (73.4% SR) vs Vision-only (58.5%), No Flow (73.2%), No Language (65.9%) - showing complementary contributions of multimodal conditioning
Results are mixed with simulation numbers not directly comparable to baselines due to different evaluation protocols and datasets.
Compute & Efficiency
- Model Size: 43M trainable parameters (FlowDiT core), uses frozen DINOv2 (86.6M) and CLIP encoders, 161× smaller than OpenVLA
- Training Compute: NVIDIA RTX 5090 (32 GB) for fine-tuning, pretraining compute not reported, wall-clock training time not specified
- Inference Speed: 40-47 Hz control frequency (~20 ms per control step), 3.68s average for generating 121 velocity waypoints from reference video
- Memory Footprint: 32 GB GPU memory during training with FP16 precision, inference memory requirements not specified
- Deployment Practicality: Compact 43M parameter execution module enables edge deployment on consumer hardware, but Stage I video generation requires access to large foundation models (WAN 2.2, LTX-Video) which may require cloud inference or high-end hardware
Real-World Applicability
- Real Hardware Deployment: Successfully tested on physical Unitree G1 humanoid robot in unseen indoor lab/office environments with 64.7% task completion rate (11/17 trials)
- Open-Loop Execution: System operates without live camera feedback during motion execution, relying solely on pre-generated reference video and language instructions
- Embodiment Transfer: Same pipeline deployed across three different robot types (humanoid, drone, wheeled) without retraining, demonstrating cross-embodiment generalization
- Sim-to-Real Gap: ~8.5 percentage point drop from simulation (73.2%) to real hardware (64.7%), indicating reasonable but not perfect transfer
- Environmental Constraints: Real-world testing limited to indoor structured environments (labs, offices), with failures primarily from video-to-metric scale ambiguity and trajectory drift accumulation in open-loop mode
Limitations & Failure Modes
- Video-to-metric scale ambiguity - FUNDAMENTAL: Generated videos imply motion magnitudes mismatched with actual robot displacement, inherent to pixel-to-physical mapping
- Trajectory drift in open-loop - ENGINEERING: Small heading errors compound without correction, addressable with closed-loop visual feedback
- Limited planning horizon - ENGINEERING: Video generation architectures constrain clips to 5-15 seconds, limiting single-stage planning scope
- Constrained space navigation - FUNDAMENTAL: Elevated failure rates in highly constrained spaces like narrow doorframes requiring centimeter-level precision
- Evaluation limitations - EVALUATION: Baseline comparisons use different datasets, robots, and success criteria, making performance claims questionable
Failure Modes:
- Obstacle collision from accumulated drift: Open-loop execution without visual correction leads to collision with obstacles not anticipated in reference video
- Semantic stopping failures: Model follows visual trajectory correctly but fails to execute proper stopping behavior at intended goal locations
DiagramNet: An End-to-End Recognition Framework and Dataset for Non-Standard System-Level Diagrams
Authors: Jincheng Lou, Ruohan Xu, Jiapeng Li, Junyin Pi et al. (9 authors) · Institution: Peking University · Category: cs.AI
DiagramNet introduces the first dataset and multi-agent framework for system-level circuit diagram recognition, achieving competitive performance through decomposed visual reasoning that separates component detection from topological connection prediction.
Practical Takeaway: This work demonstrates that system-level circuit diagram understanding benefits more from decomposed multi-agent workflows than end-to-end approaches. The key insight is separating visual grounding (YOLO detection) from topological reasoning (VLM connection prediction), which provides 128× improvement for Gemini-2.5-Pro on structured tasks. Research engineers should consider this decomposition pattern for complex visual reasoning tasks involving implicit relationships. The component-wise connection prediction strategy (predicting outputs per source component) is more stable than full-graph prediction and could apply to other graph extraction problems. However, the approach requires domain-specific dataset construction and multi-stage training, making it most suitable for applications with sufficient data and engineering resources.
Tags: circuit_diagrams multimodal_learning electronic_design_automation vision_language_models multi_agent_systems graph_understanding topology_reasoning reinforcement_learning
Task & Setting
System-level diagrams serve as architectural blueprints for chip design, encoding module functions, dataflows, and interface protocols. These diagrams help architects plan system functionality before implementation but pose unique recognition challenges compared to standardized analog-mixed-signal (AMS) schematics. Unlike AMS circuits with fixed component libraries, system-level diagrams use non-standardized symbols that vary across organizations and have implicit connectivity that lacks explicit input/output port markings.
The task decomposes system-level diagram understanding into four subtasks:
(1) Listing extracts component names from visual entities:
\[f_{list} : I \rightarrow C = \{c_1, ..., c_n\}\]
(2) Localization provides spatial coordinates:
\[f_{loc} : (I, c_i) \rightarrow b_i \in [0,1]^4\]
(3) Connection predicts output targets:
\[f_{conn} : (I, c_i, C) \rightarrow T_i \subseteq C\]
(4) Circuit QA answers domain questions:
\[f_{qa} : (I, q) \rightarrow (r, a)\]
Success is measured by F1 scores on component detection (S1), output count prediction (S2), connection identification (S3), and Circuit QA accuracy (Task 2). The overall score combines:
\[\text{Score}_{\text{Task1}} = 0.4\,S_1 + 0.2\,S_2 + 0.4\,S_3, \quad \text{Score}_{\text{overall}} = 0.6\,\text{Score}_{\text{Task1}} + 0.4\,\text{Score}_{\text{Task2}}\]
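As a quick arithmetic check, the scoring formula can be applied to the DiagramNet-3B sub-scores listed later under Benchmarks & Results (S1 = 0.988, S2 = 0.828, S3 = 0.735, Task 2 = 0.395). The function names are illustrative only.

```python
def task1_score(s1, s2, s3):
    """Weighted Task 1 score: detection, output count, connection identification."""
    return 0.4 * s1 + 0.2 * s2 + 0.4 * s3

def overall_score(task1, task2):
    """Overall score combining Task 1 and Circuit QA (Task 2)."""
    return 0.6 * task1 + 0.4 * task2

diagramnet_t1 = task1_score(0.988, 0.828, 0.735)
diagramnet_overall = overall_score(diagramnet_t1, 0.395)
print(round(diagramnet_t1, 3), round(diagramnet_overall, 3))  # 0.855 0.671
```

Both values match the Task 1 and overall figures reported for DiagramNet-3B. Note that the 0.4 weight on S3 means connection errors move the Task 1 score as much as detection errors do.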
DiagramNet provides 1,000 system-level diagrams with 10,977 connection annotations and 15,515 chain-of-thought QA pairs across four tasks, extracted from major chip design venues.
Architecture & Method
- Multi-agent workflow with three specialized agents: Perception Agent uses YOLOv11-nano for single-class component detection and applies row-major ordering to establish spatial structure for downstream reasoning.
- Reasoning Agent uses Qwen2.5-VL-3B as the vision-language backbone to predict component-wise connections iteratively, processing each source component to predict its output targets rather than the full graph simultaneously.
- Knowledge Agent loads task-specific LoRA weights into the VLM backbone based on query type, enabling efficient adaptation to Circuit QA tasks without retraining the full model.
- The core technical contribution is the decoupled architecture that separates visual grounding from topological reasoning, combined with component-wise connection prediction that reduces output space complexity compared to end-to-end graph prediction.
- Row-major spatial ordering (left-to-right, top-to-bottom) provides consistent positional priors and removes ordering ambiguity for the reasoning agent.
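One plausible way to impose a row-major order on detector output is sketched below. The row-grouping tolerance `row_tol` and the use of box centers are assumptions made for this example. The digest does not specify how the paper groups boxes into rows.

```python
def row_major_order(boxes, row_tol=0.05):
    """Sort normalized (x_min, y_min, x_max, y_max) boxes top-to-bottom, then left-to-right."""
    def center(b):
        return ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)

    by_y = sorted(boxes, key=lambda b: center(b)[1])
    rows, current = [], []
    for b in by_y:
        # Start a new row when the vertical center moves past the tolerance.
        if current and center(b)[1] - center(current[0])[1] > row_tol:
            rows.append(current)
            current = []
        current.append(b)
    if current:
        rows.append(current)
    # Within each row, order left-to-right by horizontal center.
    return [b for row in rows for b in sorted(row, key=lambda b: center(b)[0])]

box_a = (0.6, 0.10, 0.8, 0.20)   # top row, right
box_b = (0.1, 0.12, 0.3, 0.22)   # top row, left (slightly lower than box_a)
box_c = (0.4, 0.50, 0.6, 0.60)   # second row
ordered = row_major_order([box_a, box_c, box_b])
print(ordered == [box_b, box_a, box_c])  # True
```

The tolerance keeps two boxes at slightly different heights in the same row, which a plain sort on `(y, x)` would not do.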
Training Recipe
- Phase 1 - Multi-task Supervised Fine-Tuning: Base model initialized from Qwen2.5-VL-3B pre-trained on AI2D-RST for topological understanding. Cross-entropy loss applied:
\[L_{SFT} = -\sum_{i=1}^L \log P(y_i|X_v, X_t, y_{<i}; \Theta)\]
All four subtasks mixed in each batch. Data: DiagramNet training set. Optimizer details not reported. Hardware: NVIDIA A40 GPUs.
- Phase 2 - Hard-Sample Reinforcement Learning: RL applied to difficult samples selected by inference instability, visual ambiguity, and high-density connectivity. Compound reward combining format validity and accuracy:
\[R_{total} = \sum_{t} (\lambda_{f,t}R_{fmt}^{(t)} + \lambda_{a,t}R_{acc}^{(t)})\]
Connection reward uses F1 + length penalty. Hardware: NVIDIA A100-SXM4 (80GB) GPUs. Training framework: Verl with DeepSpeed ZeRO-3.
- Phase 3 - LoRA Fine-tuning: Low-rank adaptation for task-specific refinement:
\[h = W_0 x + \frac{\alpha}{r} BAx\]
where the rank satisfies $r \ll \min(d,k)$. Frozen base weights with trainable low-rank matrices. Hardware: RTX 4090 GPUs. Wall-clock time not reported.
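The LoRA update in Phase 3 can be written out directly. The sketch below uses NumPy with made-up sizes (`d`, `k`, `r`) and a common initialization convention (small random `A`, zero `B`). None of these values come from the paper.

```python
import numpy as np

def lora_forward(x, W0, A, B, alpha):
    """h = W0 x + (alpha / r) * B A x, with W0 frozen and only A, B trainable."""
    r = A.shape[0]
    return W0 @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d, k, r = 64, 32, 4                       # illustrative sizes only
W0 = rng.standard_normal((d, k))          # frozen base weight
A = rng.standard_normal((r, k)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                      # trainable up-projection, zero at init
x = rng.standard_normal(k)

h = lora_forward(x, W0, A, B, alpha=8.0)
trainable = A.size + B.size               # r * (d + k) = 384
frozen = W0.size                          # d * k = 2048
```

With `B` initialized to zero the adapter starts as a no-op, so `h` equals `W0 @ x` before training. Swapping task-specific `(A, B)` pairs at inference time is what makes per-query adapter loading cheap.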
Novelty & Lineage
Prior work: AMSBench (Shi et al. 2025) provides multimodal understanding for analog-mixed-signal schematics with 6,000 connection annotations. Netlistify (Huang et al. 2025) creates 40k synthetic AMS schematic dataset for schematic-to-netlist conversion. Image2Net (Xu et al. 2025) offers 2,914 connections for AMS circuits.
Delta: This paper extends from standardized AMS schematics to non-standardized system-level diagrams. Key additions:
- First dataset for system-level diagrams vs. transistor-level AMS circuits
- Multi-agent workflow that decouples perception from reasoning vs. end-to-end approaches
- Component-wise connection prediction vs. full-graph prediction.
Applied-specific assessment:
- Architectural idea: The multi-agent decomposition is a reasonable engineering approach but not fundamentally novel - separating detection from reasoning follows established computer vision patterns
- Benchmark gains: 21.8× improvement on S2 and 52.5× on S3 over base model, but comparisons use the same underlying VLM backbone, making gains less surprising
- SOTA comparisons: Fair comparisons within domain, though commercial models (GPT-5, Gemini-2.5-Pro) tested end-to-end while proposed method uses specialized workflow
- Scale dependence: Method requires custom YOLO training and domain-specific dataset, limiting generalizability without similar data investment
Verdict: INCREMENTAL — Solid engineering extension of AMS circuit understanding to system-level diagrams with reasonable performance gains, but core techniques are standard multi-agent decomposition applied to a new domain.
Benchmarks & Results
- DiagramNet S1 (component detection): DiagramNet-3B 0.988, EDA Elite Winner 0.984, Netlistify 0.986, +0.004 over EDA winner
- DiagramNet S2 (output count): DiagramNet-3B 0.828, EDA Elite Winner 0.787, Qwen2.5-VL-3B 0.038, +0.041 over EDA winner, 21.8× the base model
- DiagramNet S3 (connection identification): DiagramNet-3B 0.735, EDA Elite Winner 0.777, Netlistify 0.150, trails EDA winner by 0.042 but reaches 4.9× Netlistify
- DiagramNet Task 1 (overall): DiagramNet-3B 0.855, EDA Elite Winner 0.862, Claude-Sonnet-4 0.364, trails EDA winner by 0.007
- DiagramNet Task 2 (Circuit QA): DiagramNet-3B 0.395, GPT-5 0.730, EDA Elite Winner 0.370, +0.025 over EDA winner but trails GPT-5 by 0.335
- AMSBench Connection Identification: DiagramNet-3B 0.500, Netlistify 0.433, GPT-5 0.530, +0.067 over Netlistify
- AMSBench Connection Judgement: DiagramNet-3B 0.667, tied with Netlistify and Hint-GRPO
Results are mixed - strong gains over base models but inconsistent performance against domain competitors.
Compute & Efficiency
- Model size: 3B parameters for Qwen2.5-VL-3B backbone, YOLOv11-nano for detection (size not specified)
- Training compute: NVIDIA A100-SXM4 (80GB) for RL phase, A40 GPUs for SFT, RTX 4090 for LoRA. Specific GPU hours not reported.
- Inference speed/latency: Not reported, though multi-agent workflow likely adds overhead compared to end-to-end approaches
- Memory footprint: Not reported, but 3B parameter VLM plus YOLO detector should be manageable on consumer hardware
- Deployment practicality: Requires custom YOLO detector training (60 images minimum for adaptation), multi-stage inference pipeline, and domain-specific LoRA weights. Moderately complex deployment compared to single-model approaches.
Real-World Applicability
- Dataset source: Real conference/journal figures from chip design venues, indicating genuine real-world data rather than synthetic benchmarks
- Cross-domain transfer: Demonstrates zero-shot transfer to AMSBench with only 60 images for YOLO detector retraining, achieving competitive performance with GPT-5 and Claude-Sonnet-4
- Competition performance: Surpasses 2025 EDA Elite Challenge winner on overall benchmark score (0.671 vs 0.665)
- Industry integration: No production deployment results reported, though paper targets EDA workflows where manual diagram recognition currently requires domain expertise
- Hardware constraints: Method successfully runs on consumer RTX 4090 GPUs for LoRA adaptation, suggesting reasonable deployment requirements for engineering teams
Limitations & Failure Modes
- Symbol heterogeneity challenges - FUNDAMENTAL: Non-standardized symbols require continuous dataset expansion as new diagram styles emerge
- Visual ambiguity handling - ENGINEERING: Weak directional cues, crossing ambiguities, and multi-fan-out structures cause systematic S3 errors that could be addressed with better visual reasoning
- Domain knowledge gaps - ENGINEERING: Task 2 performance (0.395) significantly trails commercial models like GPT-5 (0.730), indicating insufficient circuit domain knowledge that could be improved with more comprehensive training data
- Dataset coverage limitations - EVALUATION: Only 1,000 diagrams may not capture the full diversity of system-level diagram styles across different organizations and design domains
- Multi-stage complexity - ENGINEERING: Requires coordinated training of YOLO detector, VLM backbone, and LoRA adapters, increasing training complexity compared to end-to-end approaches
Failure modes:
- Dense diagrams with many crossing wires cause connection prediction errors
- Non-standard diagram conventions from different organizations may not generalize without domain-specific retraining.
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
Authors: Berk Çiçek, Mert K. Er, Özgür S. Öğüz · Institution: Bilkent University · Category: cs.RO
CoRAL uses LLMs to design cost functions and contact strategies for MPPI-based manipulation control, enabling zero-shot contact-rich tasks through online parameter adaptation and semantic reasoning.
Practical Takeaway: This paper demonstrates a viable engineering approach for combining foundation models with motion planning, particularly valuable for contact-rich tasks where pure VLA methods struggle. The key insight is using LLMs as cost function designers rather than direct controllers, maintaining the benefits of semantic reasoning while ensuring real-time reactive control. The online adaptation mechanism is particularly noteworthy for handling sim-to-real gaps. However, the approach requires significant system integration effort and depends on high-quality pose tracking, making it more suitable for research applications than production deployment. Research engineers should consider this modular architecture for scenarios where zero-shot capability is more important than peak performance.
Tags: robotics manipulation contact-rich LLM VLM motion-planning MPPI zero-shot
Task & Setting
Contact-rich robotic manipulation requires precise force regulation and strategic interaction with objects, but current approaches either rely on expensive demonstrations (VLA models) or lack semantic understanding (traditional planners). These tasks demand both high-level reasoning about object physics and low-level reactive control to manage complex contact dynamics.
The task takes RGB-D images $I$, 3D object models $M$, and natural language instructions $T$ as input, producing continuous 6-DoF end-effector control actions $u_t$. The system must estimate object poses, infer physical properties (mass, friction), and execute manipulation strategies involving purposeful contact forces. The objective is formulated as a stochastic optimal control problem:
\[U^* = \arg\min_U \mathbb{E}\left[\phi(x_H) + \sum_{t=0}^{H-1} q(x_t, u_t)\right]\]
where $q(x_t, u_t)$ is an LLM-generated running cost encoding task-specific objectives and $\phi(x_H)$ is the terminal cost at the end of the horizon $H$.
Success is measured by task completion rates across 10 randomized trials per task. The evaluation suite includes 6 manipulation scenarios: multi-stage push-and-pick, pick-and-place, constant force regulation, dynamic flipping, and wall-assisted manipulation. Real-world validation uses a Franka Emika Panda robot with force/torque sensing and motion capture for pose tracking.
Architecture & Method
- Vision pipeline using FoundationPose for continuous 6-DoF object tracking and VLM (GPT-4o) for semantic physics parameter estimation (mass, friction coefficients)
- LLM-based task formulation module that generates: (a) structured MPPI cost functions $J_0$ as executable Python code, and (b) contact strategy $C_0$ defining semantic regions of interest as ellipsoids around predicted contact points
- Model Predictive Path Integral (MPPI) controller running at 10Hz with K=256 trajectory samples over H=32 step horizon, using adaptive temperature selection via effective sample size (ESS) targeting
- Hierarchical control architecture: 1kHz impedance control for safety, 10Hz MPPI trajectory planning, ~1Hz LLM reasoning with reactive augmentation:
\[\nu_t = u_t + K_f \cdot (x_{des} - x_{measured})\]
- Online adaptation loop where LLM analyzes execution episodes $E_t$ to refine world model parameters $\theta$ and cost function structure based on interaction feedback
- Retrieval-Augmented Generation (RAG) memory unit storing successful (strategy, parameters) pairs for experience reuse across similar tasks
The core contribution is using LLMs as cost function designers rather than direct controllers, enabling zero-shot planning while maintaining real-time reactive execution through the decoupled architecture.
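The ESS-targeted temperature selection mentioned for the MPPI controller can be illustrated with a small sketch. The digest gives no details of CoRAL's procedure, so the bisection search, the function names, and the target value below are assumptions. Only the definition $\mathrm{ESS} = 1/\sum_k (w^{(k)})^2$ and the sample count K=256 are standard or taken from the summary.

```python
import numpy as np

def softmax_weights(costs, beta):
    """MPPI weights w_k proportional to exp(-beta * S_k), stabilized by the min cost."""
    w = np.exp(-beta * (costs - costs.min()))
    return w / w.sum()

def effective_sample_size(w):
    """ESS = 1 / sum(w^2): equals K for uniform weights, 1 when one sample dominates."""
    return 1.0 / np.sum(w ** 2)

def beta_for_target_ess(costs, target_ess, lo=1e-6, hi=1e6, iters=60):
    """Bisection in log space on beta so that ESS(beta) is close to target_ess."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if effective_sample_size(softmax_weights(costs, mid)) > target_ess:
            lo = mid   # weights too flat: sharpen by raising beta
        else:
            hi = mid   # weights too peaked: soften by lowering beta
    return np.sqrt(lo * hi)

costs = np.linspace(0.0, 10.0, 256)       # stand-in rollout costs for K=256 samples
beta = beta_for_target_ess(costs, target_ess=64.0)
ess = effective_sample_size(softmax_weights(costs, beta))
```

Bisection works here because ESS is non-increasing in `beta`. Fixing the ESS rather than `beta` keeps the update from collapsing onto a single rollout when the cost scale changes, which matters when the cost function itself is regenerated by an LLM.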
Training Recipe
- No model training required - leverages pre-trained foundation models (GPT-4o for VLM/LLM, FoundationPose for tracking)
- MPPI controller uses analytical dynamics model in MuJoCo simulation for trajectory rollouts, no learning involved
- System operates zero-shot as follows:
  - VLM provides initial semantic physics priors from visual inspection
  - LLM generates task-specific cost functions and contact strategies from language descriptions
  - Online system identification refines physical parameters during execution
- Memory unit populated incrementally by storing successful episodes during operation, indexed by task semantics and environmental parameters
- Real-world deployment uses identical software stack with motion capture replacing FoundationPose for pose estimation
- Hardware setup: Intel i9-13900K CPU, 64GB RAM, RTX 4060 Ti GPU for simulation and LLM inference
Training data: Not applicable - system designed for zero-shot operation without demonstration data or policy learning.
Novelty & Lineage
Prior work:
- Language-to-Rewards (L2R, 2023) uses LLMs to generate reward functions for MPC but lacks adaptation mechanisms and contact strategy reasoning.
- VoxPoser and VLMPC use VLMs to generate static cost maps for motion planning.
- OpenVLA and other VLA models learn end-to-end policies from demonstration data.
Delta: This paper adds several components:
- LLM generates both cost functions AND semantic contact strategies (not just rewards)
- online adaptation loop where LLM refines world model parameters mid-execution based on interaction feedback
- separation of VLM (perception) from LLM (strategy) roles
- RAG-based experience memory for strategy reuse.
Applied-specific assessment:
- Architectural idea: The neuro-symbolic decoupling is a reasonable engineering choice but not fundamentally novel - separating high-level reasoning from low-level control is well-established
- Benchmark gains: 50%+ improvement over baselines sounds significant, but baselines include pre-trained models not specifically designed for contact-rich tasks
- Fair comparisons: Comparison methodology appears sound, though some baselines (OpenVLA, π0.5) may be disadvantaged on contact-heavy tasks they weren’t specifically trained for
- Generalization: The zero-shot capability is valuable, but the approach still requires carefully engineered prompts and system integration
The online adaptation mechanism is the most interesting contribution, but the overall approach combines existing techniques rather than introducing fundamentally new capabilities.
Verdict: INCREMENTAL — solid engineering that combines LLM reasoning with motion planning in a principled way, but lacks breakthrough conceptual innovations.
Benchmarks & Results
- Push and Pick Cutting Board: CoRAL 5/10, OpenVLA-OFT 0/10, π0.5 0/10, L2R 0/10, Expert FSM 8/10
- Pick Box: CoRAL 10/10, OpenVLA-OFT 10/10, π0.5 10/10, L2R 10/10, Expert FSM 10/10
- Pick and Place in Clutter: CoRAL 10/10, OpenVLA-OFT 9/10, π0.5 8/10, L2R 9/10, Expert FSM 10/10
- Push with Constant Force: CoRAL 9/10, OpenVLA-OFT 0/10, π0.5 0/10, L2R 5/10, Expert FSM 10/10
- Flip Box: CoRAL 9/10, OpenVLA-OFT 1/10, π0.5 3/10, L2R 4/10, Expert FSM 10/10
- Flip with Wall: CoRAL 7/10, OpenVLA-OFT 0/10, π0.5 0/10, L2R 1/10, Expert FSM 9/10
CoRAL clearly outperforms the VLA and L2R baselines on the contact-rich tasks (T1, T4, T5, T6, numbering the tasks in the order listed above), but gaps remain relative to the expert-designed finite state machine, most visibly on T1 (5/10 vs 8/10). Standard pick-and-place tasks (T2, T3) show little advantage over existing VLA methods. Real-world validation shows reasonable sim-to-real transfer with success rates of 4-10/10 across tasks.
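To make the comparison concrete, the per-task counts can be pooled into success rates for the two task groups. This is only a sketch: the counts are copied from the list above, while numbering the tasks T1 to T6 in list order and the grouping into contact-rich and pick-and-place tasks are assumptions based on the summary sentence.

```python
# Successes out of 10 trials per task, copied from the benchmark list.
# Task order T1..T6 is assumed to follow the order of that list.
RESULTS = {
    "CoRAL":       [5, 10, 10, 9, 9, 7],
    "OpenVLA-OFT": [0, 10, 9, 0, 1, 0],
    "pi0.5":       [0, 10, 8, 0, 3, 0],
    "L2R":         [0, 10, 9, 5, 4, 1],
    "Expert FSM":  [8, 10, 10, 10, 10, 9],
}
CONTACT_RICH = [0, 3, 4, 5]  # zero-based indices of T1, T4, T5, T6
PICK_PLACE = [1, 2]          # zero-based indices of T2, T3
TRIALS = 10


def success_rate(counts, task_ids):
    """Pooled success rate over the selected tasks."""
    return sum(counts[i] for i in task_ids) / (TRIALS * len(task_ids))


def summarize():
    """Map each method to (contact-rich rate, pick-and-place rate)."""
    return {
        method: (success_rate(c, CONTACT_RICH), success_rate(c, PICK_PLACE))
        for method, c in RESULTS.items()
    }


if __name__ == "__main__":
    for method, (contact, pick) in summarize().items():
        print(f"{method:12s} contact-rich {contact:.3f}  pick-and-place {pick:.3f}")
```

Under this grouping CoRAL reaches 75% on the contact-rich tasks, against 25% for L2R (the strongest automated baseline) and 92.5% for the expert FSM.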
Compute & Efficiency
- Model size: No learned parameters - uses pre-trained GPT-4o API and FoundationPose
- Training compute: Zero - no training required, pure inference system
- Inference speed: LLM calls ~1Hz, MPPI planning 10Hz, impedance control 1kHz
- Memory footprint: MPPI requires 256 trajectory rollouts with 32-step horizon in MuJoCo simulation
- Deployment practicality: Requires GPU for simulation rollouts, API access for GPT-4o, and motion capture or robust pose estimation - moderately complex but feasible deployment
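For a sense of what one planning step with 256 rollouts and a 32-step horizon involves, here is a minimal NumPy sketch of the MPPI update described in the preliminaries. It is not the paper's controller: the double-integrator dynamics, the quadratic cost, the noise scale `sigma`, the temperature `beta`, and all function names are illustrative assumptions standing in for the MuJoCo model and the LLM-generated cost.

```python
import numpy as np


def mppi_step(x0, u_nominal, dynamics, cost, num_samples=256,
              sigma=0.5, beta=1.0, rng=None):
    """One MPPI update: perturb the nominal controls, roll out, reweight."""
    rng = np.random.default_rng() if rng is None else rng
    horizon, u_dim = u_nominal.shape
    noise = rng.normal(0.0, sigma, size=(num_samples, horizon, u_dim))
    controls = u_nominal[None] + noise            # shape (K, T, u_dim)
    total_cost = np.zeros(num_samples)
    for k in range(num_samples):
        x = x0.copy()
        for t in range(horizon):
            x = dynamics(x, controls[k, t])
            total_cost[k] += cost(x, controls[k, t])
    # Softmax weights w_k proportional to exp(-beta * S_k);
    # subtracting the minimum cost only improves numerical stability.
    weights = np.exp(-beta * (total_cost - total_cost.min()))
    weights /= weights.sum()
    u_new = np.einsum("k,ktu->tu", weights, controls)
    return u_new, weights


def double_integrator(x, u, dt=0.1):
    """Toy stand-in dynamics: state is (position, velocity)."""
    pos, vel = x
    return np.array([pos + dt * vel, vel + dt * u[0]])


def reach_cost(x, u, target=1.0):
    """Toy stand-in cost: reach the target position with small effort."""
    return (x[0] - target) ** 2 + 0.1 * x[1] ** 2 + 0.01 * u[0] ** 2


if __name__ == "__main__":
    u_plan, w = mppi_step(np.zeros(2), np.zeros((32, 1)),
                          double_integrator, reach_cost,
                          rng=np.random.default_rng(0))
    print(u_plan.shape, w.sum())
```

In a receding-horizon loop only the first control of `u_plan` would be applied before replanning, which is where the 10Hz planning rate matters.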
Real-World Applicability
- Real-world validation on Franka Emika Panda robot across all 6 tasks with 4-10/10 success rates
- Motion capture system (6-camera Vicon setup) provides ground-truth pose tracking, replacing FoundationPose from simulation
- Force/torque regulation demonstrated with robot’s built-in sensors, maintaining ~5N contact forces within target bounds
- Sim-to-real gap handled through online parameter adaptation - system successfully diagnoses and corrects friction/mass estimates during execution
- No production deployment details provided - remains research prototype requiring motion capture infrastructure
Limitations & Failure Modes
- FUNDAMENTAL: System performance bounded by VLM’s semantic physics estimation accuracy and LLM’s reasoning capabilities
- ENGINEERING: Requires high-fidelity pose tracking (motion capture in real-world), limiting deployment scenarios
- ENGINEERING: Latency constraints from GPT-4o API calls (~1Hz) limit adaptation speed
- EVALUATION: Limited to relatively simple objects and controlled environments, no evaluation on deformable materials or complex multi-object scenes
- ENGINEERING: Dependence on proprietary foundation models (GPT-4o) creates deployment and cost concerns
Failure modes:
- VLM hallucinations for physics parameters can lead to completely wrong world models
- LLM-generated contact strategies may specify geometrically invalid or unsafe contact points requiring extensive safety checking.
GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning
Authors: Yufei Jia, Heng Zhang, Ziheng Zhang, Junzhe Wu et al. (42 authors) · Institution: THU · Category: cs.RO
A simulation framework integrating custom parallel physics with memory-efficient batch 3D Gaussian Splatting to achieve 10^4 FPS photorealistic rendering for large-scale vision-based robot learning.
Practical Takeaway: If you’re working on vision-based robot learning, GS-Playground offers a compelling alternative to Isaac Lab/Sim for scenarios requiring both high visual fidelity and massive parallel throughput. The 10^4 FPS rendering performance and automated Real2Sim pipeline could significantly accelerate vision-based policy development. However, be aware of the lighting limitations - you’ll need consistent lighting between training and deployment. The framework appears most suitable for rigid-body manipulation and locomotion tasks rather than soft-body interactions. The cross-platform development workflow (prototype locally, train on GPU clusters) could streamline development cycles.
Tags: simulation 3DGS gaussian-splatting parallel-physics robotics sim-to-real locomotion manipulation
Task & Setting
Vision-based robot learning requires massive-scale parallel simulation to train policies for dynamic tasks like locomotion and contact-rich manipulation. Existing simulators struggle with the computational overhead of high-fidelity rendering, forcing a compromise between visual realism and simulation throughput. Additionally, creating simulation-ready 3D assets remains laborious and time-consuming.
The task is to develop a simulation framework that integrates:
- Input: Multi-modal sensor data (RGB images at 640×480, depth maps, LiDAR point clouds, contact forces/torques)
- Output: High-throughput parallel physics simulation with photorealistic 3D Gaussian Splatting (3DGS) rendering
The core objective is to maximize parallel simulation throughput while maintaining visual fidelity:
\[\text{Maximize: } \frac{\text{FPS} \times \text{Visual Quality}}{\text{Memory Usage}}\]
Success is measured by:
- Rendering throughput (FPS at 640×480 resolution)
- Physics simulation stability (contact force accuracy, momentum conservation)
- Sim-to-real transfer success rates across locomotion, navigation, and manipulation tasks.
The framework introduces an automated “Image-to-Physics” pipeline that converts single RGB images into simulation-ready digital twins with both 3DGS representations and collision meshes.
Architecture & Method
- Physics Engine: Custom cross-platform (Windows/Linux/macOS) velocity-impulse formulation with Mixed Complementarity Problem (MCP) solver using Projected Gauss-Seidel method for contact resolution
- Batch 3DGS Renderer: Memory-efficient Gaussian Splatting pipeline with 90%+ point pruning maintaining <0.05 PSNR drop, achieving 10^4 FPS throughput at 640×480 resolution
- Rigid-Link Gaussian Kinematics (RLGK): Zero-overhead coupling between physics rigid bodies and 3DGS clusters for temporal consistency
- Constraint Island Parallelization: Dynamic constraint dependency graph partitioning for multi-core CPU parallel solving
- Real2Sim Pipeline: Automated workflow using Grounding-DINO + SAM for segmentation, LaMa inpainting, AnySplat background reconstruction, and SAM-3D object modeling
The core contribution is harmonizing high-performance parallel physics with memory-efficient batch 3DGS rendering through specialized point-pruning and rigid-body kinematics synchronization.
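As a rough illustration of the Projected Gauss-Seidel idea, the sketch below solves a small linear complementarity problem: find impulses λ ≥ 0 such that w = Aλ + b ≥ 0 and λᵀw = 0. It is a simplified stand-in, not the paper's solver: it assumes frictionless normal contacts only, a matrix `A` with positive diagonal (in a velocity-impulse formulation this would be the contact-space inverse inertia), and the function name is hypothetical. The optional `lam0` argument mimics warm-starting from the previous timestep.

```python
import numpy as np


def projected_gauss_seidel(A, b, iterations=100, lam0=None):
    """Solve the LCP  lam >= 0,  A @ lam + b >= 0,  lam . (A @ lam + b) = 0."""
    lam = np.zeros(len(b)) if lam0 is None else np.asarray(lam0, dtype=float).copy()
    for _ in range(iterations):
        for i in range(len(b)):
            # Residual of row i using the most recent values of lam.
            residual = A[i] @ lam + b[i]
            # Gauss-Seidel update followed by projection onto lam_i >= 0.
            lam[i] = max(0.0, lam[i] - residual / A[i, i])
    return lam


if __name__ == "__main__":
    A = np.array([[2.0, 1.0], [1.0, 2.0]])
    b = np.array([-1.0, 1.0])  # contact 0 is approaching, contact 1 separating
    lam = projected_gauss_seidel(A, b)
    print(lam, A @ lam + b)
```

In this two-contact example only the approaching contact receives an impulse (λ = [0.5, 0]); the separating contact keeps a positive relative velocity and zero impulse, which is exactly the complementarity condition.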
Training Recipe
- Physics Simulation:
  - Data: Massively parallel environments (up to 4096 simultaneous scenes)
  - Solver: Custom velocity-impulse with warm-starting from temporal coherence
  - Time steps: Up to 10ms supported with high stability
  - Hardware: Both CPU and GPU backends supported
- Batch Rendering:
  - Data: 3DGS scenes with 90% point pruning via state-of-the-art efficient pruning
  - Resolution: 640×480 standard, up to 1280×720 supported
  - Batch size: Up to 2048 scenes simultaneously
  - Hardware: NVIDIA RTX 4090/6000 Ada/A100 tested
- Policy Training:
  - Algorithm: PPO for all tasks (locomotion, navigation, manipulation)
  - Environments: 1024-2048 parallel instances
  - Training time: 10 minutes (quadruped), 6 hours (humanoid), not reported (manipulation)
  - Data: Mix of proprioceptive states and high-resolution RGB observations
- Real2Sim Asset Generation:
  - Processing time: 5 minutes end-to-end per scene (excluding model loading)
  - Pipeline: 25s segmentation/inpainting, 8s AnySplat, 10s per object SAM-3D
  - Hardware: NVIDIA RTX 3090 tested
Novelty & Lineage
Prior Work:
- GaussGym (2024): First to apply 3DGS to RL but limited to small-scale, non-contact scenarios
- Isaac Lab (2024): State-of-the-art parallel physics with ray-tracing rendering, but memory-intensive and lower throughput
- Genesis (2024): GPU-based parallel physics with Madrona rendering, lacks photorealistic fidelity
Delta: This work adds:
- Custom physics engine optimized for 3DGS integration with MCP contact solver
- 90%+ 3DGS point pruning achieving 10^4 FPS batch rendering
- Automated Real2Sim pipeline for asset generation
- Rigid-Link Gaussian Kinematics for artifact-free dynamic rendering.
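The coupling between a rigid body and its Gaussian cluster can be pictured as applying the body's pose to every Gaussian attached to it. The sketch below shows the standard rigid transform of a Gaussian (mean μ' = Rμ + t, covariance Σ' = RΣRᵀ); whether RLGK is implemented exactly this way is not stated in the summary, and the function name and array layout are assumptions.

```python
import numpy as np


def transform_gaussians(means, covariances, rotation, translation):
    """Apply one rigid-body pose to a cluster of 3D Gaussians.

    means:        (N, 3) centers in the link's local frame
    covariances:  (N, 3, 3) covariance matrices in the local frame
    rotation:     (3, 3) rotation matrix of the link in the world frame
    translation:  (3,) position of the link in the world frame
    """
    world_means = means @ rotation.T + translation
    # Broadcasting applies R @ Sigma @ R.T to each of the N covariances.
    world_covs = rotation @ covariances @ rotation.T
    return world_means, world_covs


if __name__ == "__main__":
    # 90 degree rotation about the z axis, then a shift along x.
    R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    t = np.array([1.0, 0.0, 0.0])
    mu = np.array([[1.0, 0.0, 0.0]])
    sigma = np.array([np.diag([4.0, 1.0, 1.0])])
    print(transform_gaussians(mu, sigma, R, t))
```

Because the per-link pose already comes out of the physics step, this update needs no extra optimization or re-fitting per frame, which is consistent with the "zero-overhead" description.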
Applied-Specific Assessment:
- Architectural novelty: The RLGK coupling and specialized point-pruning for rigid-body scenes is novel, though builds on established 3DGS techniques
- Benchmark gains: 32× speedup over MuJoCo, 600× over MjWarp in complex scenes is substantial but heavily dependent on their custom physics engine
- Fair comparisons: Physics comparisons use equivalent scenarios, but rendering comparisons against Isaac Sim use different asset generation methods
- Scale dependency: The gains appear to require their specific 3DGS optimizations and wouldn’t necessarily transfer to other visual rendering approaches
Verdict: SIGNIFICANT — The integration of high-throughput physics with memory-efficient 3DGS rendering addresses a real bottleneck in vision-based robot learning, with convincing performance gains and demonstrated sim-to-real transfer.
Benchmarks & Results
- Physics Stability: Newton’s Cradle momentum conservation - better preservation than MuJoCo; Boston Dynamics Spot stability at 10ms timestep - reduced drift vs MuJoCo
- Rendering Throughput: 10,000 FPS at 640×480 (batch size 2048) vs Isaac Sim’s ~2,000 FPS; maintains advantage across RTX 4090/6000 Ada/A100 GPUs
- Physics Scaling: 1,015 FPS at N=50 humanoids vs MuJoCo 32 FPS (32× speedup) and MjWarp 1.71 FPS (600× speedup)
- Visual Quality: PSNR 26.87 vs 27.15 (original 3DGS) with 70% fewer Gaussians; SSIM 0.802 vs 0.830
- Sim-to-Real Transfer: Quadruped locomotion (10 min training), humanoid control (6 hours training), manipulation grasping (90% success rate), navigation (successful cone following)
- Memory Efficiency: 90%+ Gaussian reduction with <0.05 PSNR drop
Results show consistent advantages in throughput and stability, though some comparisons use different asset generation methods.
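The quoted speedups are rounded; a quick check against the raw FPS figures reported above shows how they are obtained. The numbers are taken from the list, and the variable names are only illustrative.

```python
# Throughput figures as reported in the benchmark list (frames per second).
PHYSICS_FPS = {"GS-Playground": 1015.0, "MuJoCo": 32.0, "MjWarp": 1.71}
RENDER_FPS = {"GS-Playground": 10_000.0, "Isaac Sim": 2_000.0}


def speedups(table, reference="GS-Playground"):
    """Ratio of the reference throughput to every other entry."""
    ref = table[reference]
    return {name: ref / fps for name, fps in table.items() if name != reference}


if __name__ == "__main__":
    for name, ratio in speedups(PHYSICS_FPS).items():
        print(f"physics vs {name}: {ratio:.1f}x")
    for name, ratio in speedups(RENDER_FPS).items():
        print(f"rendering vs {name}: {ratio:.1f}x")
```

The physics ratios come out at roughly 31.7× and 594×, which the summary rounds to 32× and 600×; the rendering advantage over Isaac Sim is about 5×.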
Compute & Efficiency
- Model Size: Up to 4096 parallel environments; 3DGS scenes with 90% point pruning (70% fewer Gaussians than standard)
- Training Compute: Tested on NVIDIA RTX 4090/6000 Ada/A100; physics supports both CPU (AMD 9950x) and GPU backends
- Inference Speed: 10^4 FPS rendering throughput at 640×480, physics scaling to 1,015 FPS with 50 humanoids
- Memory Footprint: Significantly reduced through 90% 3DGS point pruning; avoids OOM failures that plague Isaac Sim at high batch sizes
- Deployment Practicality: Cross-platform development (Windows/Linux/macOS), successful real-world deployment on Unitree Go2/G1 and Airbot Play arm
Real-World Applicability
- Quadruped Locomotion: Successfully deployed state-based policies on Unitree Go2 for velocity tracking, trained in 10 minutes
- Humanoid Control: 23-DoF balancing and walking on Unitree G1, trained in 6 hours with 2048 parallel environments
- Visual Manipulation: End-to-end RGB-based grasping on Airbot Play arm with 90% success rate in uncontrolled real-world scenes
- Visual Navigation: Real-time cone following on Unitree Go2 using hierarchical RL with high-level visual policy and low-level locomotion controller
All deployments demonstrated zero-shot transfer from simulation without additional real-world fine-tuning.
Limitations & Failure Modes
- Lighting Dependency (FUNDAMENTAL): 3DGS struggles with randomized lighting/shadows unlike ray-tracing; asset generation depends on source image lighting conditions
- Rigid-Body Assumption (FUNDAMENTAL): RLGK only supports rigid bodies; cannot handle deformable objects, cloth, or fluids
- Limited Relighting (ENGINEERING): No algorithmic relighting capability to decouple object appearance from environmental lighting
- Single-View Reconstruction (EVALUATION): Real2Sim pipeline uses single RGB images, may miss occluded geometry compared to multi-view approaches
- 3DGS Memory Scaling (ENGINEERING): Despite 90% pruning, memory usage still grows with scene complexity
Failure Modes:
- Dynamic lighting changes during deployment may degrade visual policy performance
- Contact-rich manipulation of soft/deformable objects not supported by current rigid-body physics
Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset
Authors: Wenhui Huang, Songyan Zhang, Collister Chua, Yang Liang et al. (7 authors) · Institution: Nanyang Technological University · Category: cs.CV
Introduces LTD, the first open-ended traffic VQA dataset using roadside cameras, and UniVLT, a VLM trained via curriculum transfer to unify autonomous driving and city-scale traffic analysis.
Practical Takeaway: If you’re working on intelligent transportation systems, this paper demonstrates a reasonable approach for extending VLMs from autonomous driving to city-scale traffic analysis. The curriculum training strategy (general → AD → traffic) is worth implementing if you have multi-domain data. However, the core architecture is standard, and the main value is in the dataset and training procedure rather than novel modeling. The multi-image reasoning capability could be useful for other surveillance applications beyond traffic.
Tags: vision-language-models intelligent-transportation traffic-analysis multi-image-reasoning roadside-cameras curriculum-learning autonomous-driving dataset
Task & Setting
- Real-world context: Urban transportation systems face growing safety challenges from complex interactions between diverse road users (pedestrians, cyclists, motorcycles). While foundation models have shown promise for autonomous driving scenarios, city-scale traffic analysis from roadside infrastructure remains underexplored. Traditional traffic monitoring methods prove inadequate for the complexity of modern urban mobility.
- Task definition: The paper addresses three complementary tasks using roadside camera imagery: (a) Fine-grained multi-object grounding - detecting motorcycles and pedestrians with bounding box coordinates, (b) Multi-image camera selection - identifying which of 3 uncorrelated camera views show potential risks, and (c) Multi-image risk analysis - open-ended reasoning about hazardous objects, contributing factors, and risky road directions across 3 minimally correlated camera views. Input: RGB images from heterogeneous roadside cameras. Output: Textual responses for VQA tasks and normalized bounding box coordinates for grounding.
- Evaluation criteria: GPT-Score for multi-image risk analysis, accuracy for camera ID selection, F1 score for multi-object grounding. Standard NLP metrics (CIDEr, METEOR, BLEU) for autonomous driving benchmarks. LingoJudge for LingoQA evaluation.
- Dataset scale: Land Transportation Dataset (LTD) contains 11.6K high-quality VQA pairs from roadside cameras across Singapore, spanning diverse road geometries, traffic participants, illumination conditions, and weather.
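To illustrate how an F1 score for multi-object grounding is typically computed from normalized boxes, here is a small sketch. The matching protocol is an assumption: the summary does not say how predictions are matched to ground truth, so this uses greedy one-to-one matching at an IoU threshold of 0.5, and the function names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) in normalized coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def grounding_f1(predicted, ground_truth, iou_threshold=0.5):
    """F1 with greedy one-to-one matching (assumed protocol)."""
    unmatched = list(ground_truth)
    true_pos = 0
    for box in predicted:
        # Pick the best still-unmatched ground-truth box for this prediction.
        best = max(unmatched, key=lambda g: iou(box, g), default=None)
        if best is not None and iou(box, best) >= iou_threshold:
            unmatched.remove(best)
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gt = [(0.1, 0.1, 0.3, 0.3), (0.6, 0.6, 0.8, 0.8)]
    pred = [(0.1, 0.1, 0.3, 0.3), (0.0, 0.5, 0.1, 0.6)]
    print(grounding_f1(pred, gt))
```

With small objects such as pedestrians and motorcycles, a few pixels of localization error can push IoU below the threshold, which is one reason grounding scores on roadside imagery tend to be sensitive.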
Architecture & Method
- Vision encoder: Redesigned Vision Transformer (ViT) handling dynamic input resolutions up to 225,792 pixels, with 2D Rotary Positional Embedding (RoPE) and RMSNorm normalization.
- Language model backbone: Qwen2.5-VL 7B with multimodal 1D RoPE (MRoPE), Grouped Query Attention (GQA), and RMSNorm.
- Multi-image processing: Visual tokens from all images concatenated in temporal/camera order to enable cross-image reasoning over long-range dependencies.
- Loss function: Standard autoregressive language modeling loss, i.e. the negative log-likelihood of the answer tokens, where the likelihood factorizes as:
\[p(X_a|X_v, X_q) = \prod_{l=1}^{L} p(x_l|X_v, X_q, X_{a,<l})\]
where $X_v$ are visual tokens, $X_q$ are instruction tokens, $X_a$ are target answer tokens, and $L$ is the number of answer tokens.
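In practice this objective is usually implemented as a cross-entropy that is averaged only over the answer positions, so that visual and instruction tokens condition the prediction without contributing to the loss. The NumPy sketch below shows that masking under stated assumptions: logits are already aligned with their target tokens, the mean over answer tokens is used as the reduction, and the function name is hypothetical.

```python
import numpy as np


def answer_nll(logits, token_ids, answer_mask):
    """Mean negative log-likelihood over answer tokens only.

    logits:       (L, V) scores where logits[l] predicts token_ids[l]
    token_ids:    (L,) integer target tokens
    answer_mask:  (L,) boolean, True where the target belongs to the answer
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_log_probs = log_probs[np.arange(len(token_ids)), token_ids]
    return -token_log_probs[answer_mask].mean()


if __name__ == "__main__":
    logits = np.zeros((4, 5))            # uniform prediction over 5 tokens
    tokens = np.array([0, 1, 2, 3])
    mask = np.array([False, False, True, True])
    print(answer_nll(logits, tokens, mask))  # equals log(5) for uniform logits
```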
Core technical contribution: Curriculum-based knowledge transfer strategy unifying microscopic autonomous driving reasoning with macroscopic traffic analysis, enabling joint reasoning over minimally correlated multi-view roadside camera observations.
Training Recipe
- Pre-training stage: Initialize with pre-trained Qwen2.5-VL 7B weights (no additional training)
- Fine-tuning Stage 1 (AD domain adaptation):
  - Data: 727.1K QA pairs from LingoQA (413.8K), OmniDrive (287.9K), CODA-LM (25.4K)
  - Method: LoRA fine-tuning
  - Hardware and wall-clock time: Not reported
  - Learning rate, schedule, batch size: Not reported
- Fine-tuning Stage 2 (Traffic domain expansion):
  - Data: 11.6K LTD samples + 3K samples each from AD datasets (experience replay)
  - Method: LoRA fine-tuning
  - Hardware and wall-clock time: Not reported
  - Learning rate, schedule, batch size: Not reported
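The two-stage data budget can be tallied with a few lines. The counts are the rounded figures from the recipe above; the assumption that "AD datasets" means the three Stage 1 sources, and the helper names, are illustrative.

```python
# Stage 1 sample counts, rounded to the nearest hundred as reported.
STAGE1 = {"LingoQA": 413_800, "OmniDrive": 287_900, "CODA-LM": 25_400}
LTD_SAMPLES = 11_600
REPLAY_PER_DATASET = 3_000  # experience replay drawn from each AD dataset


def stage2_mixture():
    """Stage 2 mix: all LTD samples plus a fixed replay slice per AD dataset."""
    mix = {"LTD": LTD_SAMPLES}
    mix.update({name: REPLAY_PER_DATASET for name in STAGE1})
    return mix


def replay_fraction(mix):
    """Share of the stage 2 mix that is replayed AD data."""
    total = sum(mix.values())
    return (total - mix["LTD"]) / total


if __name__ == "__main__":
    mix = stage2_mixture()
    print("stage 1 total:", sum(STAGE1.values()))
    print("stage 2 total:", sum(mix.values()))
    print(f"replay share: {replay_fraction(mix):.1%}")
```

Under these assumptions Stage 2 sees about 20.6K samples, of which roughly 44% are replayed driving data, a sizeable share intended to limit forgetting of the AD domain while the much smaller LTD set is learned.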
Novelty & Lineage
Step 1 — Prior work:
- SUTD-TrafficQA (2021): 62.5K traffic QA pairs in multiple-choice format
- TUMTraffic-VideoQA (2025): 87.3K video QA pairs with template-based annotations
- LingoQA (2024): 413.8K open-ended driving VQA pairs for autonomous vehicles
Step 2 — Delta: This paper adds (1) first open-ended traffic VQA dataset using roadside cameras, (2) multi-image reasoning over uncorrelated camera views, (3) curriculum transfer from general → AD → traffic domains.
Step 3 — Applied-specific assessment:
- Architectural idea: Standard VLM architecture with multi-image concatenation - not novel architecturally
- Benchmark gains: Large margins on their own dataset (0.66 vs 0.46 GPT-score), modest gains on established benchmarks (69.0% vs 67.8% on LingoQA)
- Fair comparisons: Reasonable baselines but evaluation primarily on their own dataset where advantage is expected
- Generalizability: Would likely require similar roadside camera infrastructure and multi-stage training
Verdict: INCREMENTAL — Solid dataset contribution and engineering of curriculum training, but core VLM architecture is standard and gains are primarily on their own benchmark.
Benchmarks & Results
- LTD Multi-Image Risk Analysis: UniVLT 0.66 GPT-Score vs Qwen2.5-VL 0.46 (43% relative improvement)
- LTD Camera ID Selection: UniVLT 0.66 accuracy vs Qwen2.5-VL 0.48 (38% relative improvement)
- LTD Multi-Object Grounding: UniVLT 0.64 F1 vs Qwen3-VL 0.62 (marginal improvement)
- LingoQA: UniVLT 69.0% LingoJudge vs ReCogDrive 67.8% (+1.2 percentage points)
- OmniDrive Object Recognition: UniVLT 0.89 GPT-Score vs InternVL2.5 0.87 (marginal)
- OmniDrive Driving Suggestion: UniVLT 0.87 GPT-Score vs ReCogDrive 0.84 (modest)
- CODA-LM General Perception: UniVLT 5.18 vs RoboTron-Drive 5.15 (marginal)
- CODA-LM Region Perception: RoboTron-Drive 7.66 vs UniVLT 7.25 (UniVLT second best)
Results show strong performance on their own LTD dataset but modest improvements on established benchmarks.
Compute & Efficiency
- Model size: 7B parameters (Qwen2.5-VL backbone)
- Training compute: Not reported
- Inference speed/latency: Not reported
- Memory footprint: Not reported
- Deployment practicality: Reasonable for deployment given 7B parameter size, but requires multi-image processing capability and roadside camera infrastructure
Real-World Applicability
- Dataset collected from real roadside cameras across Singapore transportation network
- Covers diverse road geometries, traffic participants, illumination conditions, and adverse weather
- Focuses on safety-critical scenarios including vulnerable road users (pedestrians, motorcycles)
- No reported real-world deployment or production integration
- No sim-to-real evaluation discussed
- Limited to Singapore road infrastructure and traffic patterns
Limitations & Failure Modes
- Dataset limited to Singapore roadside cameras - FUNDAMENTAL (geographical and infrastructure specific)
- Requires multi-stage training with AD data - ENGINEERING (complex training pipeline)
- Multi-image inputs must be minimally correlated - FUNDAMENTAL (specific to roadside camera setup)
- Grounding task sensitive to small object detection - FUNDAMENTAL (vulnerable road users occupy small image regions)
- Open-ended evaluation prone to subjective scoring - EVALUATION (GPT-based metrics)
- Limited cross-city generalization demonstrated - EVALUATION (single city dataset)
Failure modes:
- Performance likely degrades on road infrastructure significantly different from Singapore
- Multi-image reasoning may fail when camera views have unexpected correlations or occlusions.