May 4, 2026 Applied AI 5 papers

Applied AI Digest — May 4, 2026

Today’s Digest at a Glance

Preliminary

Today’s papers explore multimodal robotics with novel control architectures, specialized visual recognition systems, and high-throughput simulation frameworks for embodied AI.

Model Predictive Path Integral (MPPI) Control

MPPI addresses the challenge of real-time optimal control under uncertainty by reformulating stochastic control as a sampling problem. Traditional model predictive control (MPC) requires solving computationally expensive optimization problems at each timestep, making it impractical for high-frequency robotic control with nonlinear dynamics and constraints.

The core insight of MPPI is to approximate the optimal control distribution using importance sampling. Given a dynamics model $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t) + \mathbf{w}_t$ where $\mathbf{w}_t$ is process noise, MPPI samples $K$ candidate control sequences $\{\mathbf{u}^{(k)}\}_{k=1}^K$ from a noise distribution and evaluates their costs $S^{(k)}$. The optimal control is computed as a weighted average:

\[\mathbf{u}_t^* = \sum_{k=1}^K w^{(k)} \mathbf{u}_t^{(k)}\]

where the weights are $w^{(k)} = \frac{\exp(-\beta S^{(k)})}{\sum_{j=1}^K \exp(-\beta S^{(j)})}$ and $\beta$ is an inverse-temperature parameter: a larger $\beta$ concentrates the weight on the lowest-cost samples.

Essentially, MPPI turns optimal control into a weighted Monte Carlo sampling problem where lower-cost trajectories get higher weights. Nonlinear dynamics are handled by forward rollouts of the model, and constraints are encoded as penalty terms in the cost function, which makes the method particularly suitable for contact-rich manipulation tasks where analytical solutions are intractable.
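The weighted-average update can be sketched in a few lines. This is a minimal illustration on a toy 1D system, not any paper's implementation: the function `mppi_step`, the toy `dynamics` and `cost`, and the values of `K`, `sigma`, and `beta` are all assumptions chosen for readability.

```python
import math
import random

def mppi_step(x0, u_nominal, dynamics, cost, K=256, sigma=0.5, beta=1.0, rng=None):
    """One MPPI update: perturb the nominal plan K times, weight by exp(-beta * S)."""
    rng = rng or random.Random(0)
    H = len(u_nominal)
    samples, costs = [], []
    for _ in range(K):
        # Candidate control sequence = nominal plan + Gaussian noise (assumed noise model)
        u_seq = [u + rng.gauss(0.0, sigma) for u in u_nominal]
        x, S = x0, 0.0
        for u in u_seq:
            x = dynamics(x, u)   # forward rollout through the model
            S += cost(x, u)      # accumulate trajectory cost S^(k)
        samples.append(u_seq)
        costs.append(S)
    s_min = min(costs)  # subtracting the minimum leaves the weights unchanged but avoids underflow
    w = [math.exp(-beta * (S - s_min)) for S in costs]
    z = sum(w)
    w = [wi / z for wi in w]
    # Weighted average of the sampled controls at every timestep
    u_new = [sum(w[k] * samples[k][t] for k in range(K)) for t in range(H)]
    return u_new, w, costs

# Toy problem (hypothetical): drive a 1D point toward x = 1
def dynamics(x, u):
    return x + 0.1 * u

def cost(x, u):
    return (x - 1.0) ** 2 + 0.01 * u * u

u_plan, weights, costs = mppi_step(0.0, [0.0] * 10, dynamics, cost)
```

In a receding-horizon loop, only the first element of `u_plan` would be executed before the plan is shifted and resampled.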

Mixed Complementarity Problems (MCP) for Contact Dynamics

Contact dynamics in robotics simulation requires handling the fundamental challenge that contact forces are unknown until contacts occur, but contacts depend on the forces. Traditional approaches either use penalty methods (which can be stiff and unstable) or pure constraint-based methods (which can be computationally expensive for many contacts).

MCP formulates contact dynamics as a system where complementarity conditions capture the either-or nature of contacts. For each potential contact point, we have three states: separated ($d > 0, f_n = 0$), sliding ($d = 0, |f_t| = \mu f_n$), or sticking ($d = 0, |f_t| < \mu f_n$). These can be expressed as complementarity constraints: $0 \leq d \perp f_n \geq 0$ (gap distance and normal force cannot both be positive).

The complete system becomes:

\[\mathbf{M}\ddot{\mathbf{q}} = \mathbf{h} + \mathbf{J}^T \boldsymbol{\lambda}\]

subject to complementarity constraints $\mathbf{0} \leq \boldsymbol{\phi}(\mathbf{q}) \perp \boldsymbol{\lambda} \geq \mathbf{0}$ where $\mathbf{M}$ is the mass matrix, $\mathbf{h}$ contains Coriolis and external forces, $\mathbf{J}$ is the contact Jacobian, $\boldsymbol{\lambda}$ are contact forces, and $\boldsymbol{\phi}$ are gap functions. The Projected Gauss-Seidel method solves this iteratively by projecting force updates onto feasible regions.

Intuitively, MCP lets the simulation “discover” contact forces by solving a mathematical puzzle where forces and gaps cannot both be active simultaneously.
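To make the projection step concrete, here is a minimal Projected Gauss-Seidel sketch for a generic linear complementarity problem: find $\boldsymbol{\lambda} \geq 0$ such that $\mathbf{w} = \mathbf{A}\boldsymbol{\lambda} + \mathbf{b} \geq 0$ and $\lambda_i w_i = 0$. How $\mathbf{A}$ and $\mathbf{b}$ are assembled from $\mathbf{M}$, $\mathbf{J}$, and $\mathbf{h}$ depends on the time discretization, which is not specified here, so the small matrix below is purely illustrative and friction is omitted.

```python
def projected_gauss_seidel(A, b, iters=50):
    """Solve 0 <= lam  perp  (A @ lam + b) >= 0 by coordinate-wise projected updates."""
    n = len(b)
    lam = [0.0] * n
    for _ in range(iters):
        for i in range(n):
            # Residual of row i using the most recent values of lam (Gauss-Seidel sweep)
            residual = b[i] + sum(A[i][j] * lam[j] for j in range(n))
            # Unconstrained update, then projection onto the feasible set lam_i >= 0
            lam[i] = max(0.0, lam[i] - residual / A[i][i])
    return lam

# Illustrative 2-contact problem (values are assumptions, not from any paper)
A = [[2.0, 1.0],
     [1.0, 2.0]]
b = [-1.0, 1.0]
lam = projected_gauss_seidel(A, b)
w = [b[i] + sum(A[i][j] * lam[j] for j in range(2)) for i in range(2)]
```

Here the first contact ends up active (positive force, zero residual) and the second inactive (zero force, positive residual), which is exactly the either-or structure the complementarity condition encodes.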

2D Rotary Positional Embedding (2D RoPE)

2D RoPE extends the rotary positional encoding mechanism from 1D sequences to 2D spatial grids, addressing the limitation that standard positional encodings cannot effectively capture spatial relationships in images with varying resolutions and aspect ratios.

Standard RoPE applies rotations to query-key pairs based on relative positions: for 1D position difference $m$, it rotates feature dimensions by $m\theta_i$ where $\theta_i = 10000^{-2i/d}$. For 2D images, this becomes problematic because a single scalar position cannot capture both horizontal and vertical relationships.

2D RoPE decomposes the feature dimension $d$ into two parts: $d/2$ dimensions for horizontal encoding and $d/2$ for vertical encoding. For a spatial position $(x, y)$, the rotation angles become:

\[\theta_{i,x} = x \cdot 10000^{-2i/d}, \quad \theta_{i,y} = y \cdot 10000^{-2(i+d/2)/d}\]

The attention computation then applies separate rotations for each spatial dimension, allowing the model to understand both horizontal and vertical spatial relationships independently.

This allows vision transformers to maintain spatial understanding across different image resolutions and aspect ratios without requiring resolution-specific training.
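The split-and-rotate idea can be sketched as follows. This is a simplified axial variant written for illustration: it reuses the same 1D frequency set within each half of the feature vector, which is an assumption and differs slightly from the frequency indexing in the formula above. The helper names `rope_1d` and `rope_2d` are hypothetical.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of vec by the angle pos * base^(-2i/d)."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        a, b = vec[2 * i], vec[2 * i + 1]
        out.append(a * math.cos(theta) - b * math.sin(theta))
        out.append(a * math.sin(theta) + b * math.cos(theta))
    return out

def rope_2d(vec, x, y):
    """First half of the features encodes x, second half encodes y (length divisible by 4)."""
    half = len(vec) // 2
    return rope_1d(vec[:half], x) + rope_1d(vec[half:], y)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.8, 0.4, 0.6, -0.3, 0.1, 0.9]

# Same relative offset (dx = 2, dy = -5) at two different absolute locations
score_a = dot(rope_2d(q, 5, 2), rope_2d(k, 3, 7))
score_b = dot(rope_2d(q, 15, 12), rope_2d(k, 13, 17))
```

The two attention scores agree up to floating-point error, which illustrates that the rotated dot product depends only on the relative 2D offset between query and key.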

Reading Guide

Action Agent demonstrates how LLM orchestration can coordinate multiple specialized models (video generation, flow control, evaluation) for complex robotic navigation tasks. DiagramNet tackles a complementary visual reasoning challenge by decomposing circuit diagram understanding into specialized detection and reasoning agents. CoRAL bridges these approaches by using LLMs to design cost functions for MPPI-based manipulation control, while GS-Playground provides the simulation infrastructure needed to train such multimodal systems at scale using efficient 3DGS rendering.

Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion

Authors: Jeffrin Sam, Nguyen Khang, Yara Mahmoud, Miguel Altamirano Cabrera et al. (5 authors) · Institution: Skoltech · Category: cs.RO

Action Agent decouples robot navigation into agentic video generation followed by flow-constrained diffusion control, achieving competitive navigation performance with 161× fewer parameters than foundation VLA models.

Practical Takeaway: If you’re working on robot navigation, the key insight is the explicit decoupling of trajectory imagination (video generation) from execution (velocity control) through a Visual Intermediate Representation. The agentic optimization loop for video validation is practically valuable, improving success from 35% to 86%. The FlowDiT architecture demonstrates that combining optical flow with semantic features can achieve competitive navigation performance with 161× fewer parameters than foundation VLAs. However, be cautious about the video-to-metric scale ambiguity in real deployments - closed-loop execution will likely be necessary for robust performance. The approach is most suitable for structured indoor environments where the computational overhead of Stage I video generation can be amortized across multiple similar navigation tasks.

Tags: robotics navigation diffusion_models video_generation vision_language embodied_ai multimodal_control optical_flow


Task & Setting

The work addresses language-guided robot navigation in indoor environments, where robots must understand natural language instructions and translate them into safe physical motion. This is challenging because it requires grounding abstract language semantics into concrete visual perception and motor control while handling diverse robot embodiments and unpredictable real-world scenarios.

The task takes as input a natural language instruction $L$ (e.g., “walk toward the table”), an initial first-person RGB observation $I_0$, and optionally a goal image $I_g$. The system outputs continuous velocity commands $a_t = (v_x, v_y, \omega)$ at 40-47 Hz for robot control. The objective can be formalized as:

\[\text{Stage I: } (L, I_0) \rightarrow V_{\text{goal}}\]

\[\text{Stage II: } (V_{\text{goal}}, [I_t], L) \rightarrow a_t\]

Success is measured by navigation success rate (reaching the goal without collision), mean Absolute Trajectory Error (ATE), Direction Accuracy (DA), and task completion rate in real-world trials.

The paper evaluates on 50 first-person navigation tasks across indoor environments (warehouses, hospital corridors) with three robot embodiments: Unitree G1 humanoid, quadrotor drone, and wheeled mobile robot.

Architecture & Method

Stage I - Agentic Video Generation: An LLM orchestration agent coordinates three components: (i) Qwen3-VL vision-language model for prompt construction, (ii) WAN 2.2 or LTX-Video diffusion generators for first-person navigation video synthesis, and (iii) Cosmos-Reason1 evaluator that scores videos on Prompt Adherence (PA), Physical Plausibility (PP), and Visual Quality (VQ).
Iterative Optimization: The system optimizes prompt parameters p to maximize a multi-objective reward:
\[p^* = \arg\max_p [\lambda_1 PA(V_p) + \lambda_2 PP(V_p) + \lambda_3 VQ(V_p)]\]
subject to mean(PA, PP, VQ) ≥ 80, using bottleneck-first refinement over a maximum of 5 iterations.
Stage II - FlowDiT Architecture: A 43M-parameter Flow-Constrained Diffusion Transformer with 8 transformer blocks, 512 hidden dimensions, and 8 attention heads. It conditions on a 2304-dimensional vector combining DINOv2 visual features (768D), learned optical flow embeddings (256D), optional live observation (768D), and CLIP language embeddings (512D).
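Diffusion Policy Formulation: Models the action distribution using denoising diffusion with forward process $a_t = \sqrt{\bar{\alpha}_t}\, a_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$, optimizing the noise prediction loss:
\[L = \mathbb{E}_{a_0, t, \epsilon}\left[\|\epsilon - \epsilon_\theta(a_t, t, c)\|^2\right]\]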
Receding Horizon Control: Predicts 8-step velocity sequences, executes first action, re-evaluates at next timestep for model-predictive control at 40-47 Hz frequency.
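The Stage I acceptance loop can be sketched as below. This is a guess at the control flow, not the authors' code: it assumes "bottleneck-first" means refining the prompt against whichever of PA, PP, VQ currently scores lowest, and `generate`, `evaluate`, and `refine` are hypothetical callables standing in for the video generator, the evaluator, and the LLM prompt editor.

```python
def refine_until_accepted(prompt, generate, evaluate, refine, threshold=80.0, max_iters=5):
    """Regenerate until mean(PA, PP, VQ) >= threshold or the iteration budget runs out."""
    best = None
    for iteration in range(1, max_iters + 1):
        video = generate(prompt)
        scores = evaluate(video)  # expected shape: {"PA": ..., "PP": ..., "VQ": ...}
        mean_score = sum(scores.values()) / len(scores)
        if best is None or mean_score > best[0]:
            best = (mean_score, prompt, video, scores)
        if mean_score >= threshold:
            return prompt, video, scores, iteration
        # Assumed meaning of "bottleneck-first": fix the weakest criterion first
        bottleneck = min(scores, key=scores.get)
        prompt = refine(prompt, bottleneck)
    # Budget exhausted: fall back to the best candidate seen so far
    return best[1], best[2], best[3], max_iters

# Toy stand-ins so the loop runs end to end (purely illustrative)
generate = lambda p: dict(p)                           # the "video" just carries its scores
evaluate = lambda v: v
refine = lambda p, key: {**p, key: p[key] + 15}        # each refinement lifts one criterion

start = {"PA": 70, "PP": 60, "VQ": 90}
final_prompt, final_video, final_scores, used_iters = refine_until_accepted(
    start, generate, evaluate, refine)
```

With these toy stand-ins the loop first lifts PP, then PA, and accepts on the third candidate.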

Training Recipe

Pretraining Stage: FlowDiT pretrained on RECON outdoor navigation dataset (11,830 episodes) from Open X-Embodiment collection using Clearpath Jackal wheeled robot data for general visual navigation priors.
Fine-tuning Stage: 203 Unitree G1 humanoid episodes (162 train / 41 val) collected in Isaac Sim indoor environments (warehouse and hospital corridors) to calibrate velocity dynamics for humanoid embodiment.
Training Configuration: AdamW optimizer with learning rate 1e-4, batch size 8 with FP16 precision, linear noise schedule $\beta_1 = 10^{-4}$ to $\beta_T = 2 \times 10^{-2}$, 100 diffusion steps for training / 10 DDIM steps for inference.
Hardware: NVIDIA RTX 5090 (32 GB), training time not reported.
Data Processing: 224×224 RGB frames, action horizon H=8, normalized velocity actions to [-1,1]³, uses frozen DINOv2 and CLIP encoders with only 43M parameters trainable.
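For readers who want to see what the reported noise schedule implies, the sketch below builds a linear schedule between the stated endpoints over 100 steps and applies the standard forward noising formula to a normalized velocity action. The inclusive endpoint handling, zero-based step index, and the helper name `noise_action` are assumptions.

```python
import math
import random

T = 100                      # training diffusion steps, as reported
BETA_START, BETA_END = 1e-4, 2e-2

# Linear schedule with both endpoints included (assumed convention)
betas = [BETA_START + (BETA_END - BETA_START) * t / (T - 1) for t in range(T)]

# Cumulative product alpha_bar_t = prod_{s <= t} (1 - beta_s)
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def noise_action(a0, t, rng):
    """Forward process: a_t = sqrt(alpha_bar_t) * a0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in a0]
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    return [signal * a + noise * e for a, e in zip(a0, eps)], eps

rng = random.Random(0)
a0 = [0.4, 0.0, -0.2]        # hypothetical normalized (vx, vy, omega) in [-1, 1]
a_t, eps = noise_action(a0, 50, rng)
```

During training, the network would be asked to recover `eps` from `a_t`, the step index, and the conditioning vector.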

Novelty & Lineage

Prior Work:

NoMaD (2023) - goal-conditioned diffusion policies for navigation using ~100M parameters, achieving 68% success on RECON
ViNT (2023) - vision transformer navigation with ~25M parameters, 61% success on RECON
OpenVLA (2024) - foundation VLA model with ~7B parameters, 65% success rate

Delta: This paper introduces:
explicit two-stage decomposition separating trajectory imagination (video generation) from execution (velocity control)
agentic optimization loop for video validation using LLM orchestration
flow-constrained diffusion combining optical flow with semantic features for ego-motion representation.

Applied-Specific Assessment:
- Architecture: The two-stage decomposition is a reasonable engineering choice but not architecturally novel - separating planning from control is classical robotics
- Benchmark Gains: 73.2% vs 68% (NoMaD) is a modest 5.2 percentage point improvement, and the different datasets make direct comparison questionable
- Fair Comparisons: Baseline comparisons acknowledge different datasets/robots (“for architectural context rather than direct benchmarking”), undermining claims of superiority
- Scale Dependence: The 161× parameter reduction vs OpenVLA is impressive, but success likely depends heavily on the sophisticated video generation models in Stage I
The agentic video generation loop (35% to 86% success improvement) shows clear value, but the overall approach combines known techniques (diffusion policies, optical flow, two-stage planning) rather than introducing fundamental innovations.

Verdict: INCREMENTAL — Solid engineering combining established techniques with meaningful parameter efficiency gains, but lacks fundamental algorithmic breakthroughs.

Benchmarks & Results

Stage I Video Generation Success: Action Agent achieves 86% overall success rate vs 35% single-shot baseline across 50 navigation tasks, with Unitree G1 (92%), Mobile Robot (83%), Drone (77%)
FlowDiT Navigation Success (Simulation): 73.2% success rate on G1 validation split (41 episodes) vs ViNT 61%, NoMaD 68%, OpenVLA 65% - though different datasets limit direct comparison
Real Hardware Performance: 64.7% task completion rate (11/17 trials) on physical Unitree G1 in unseen indoor environments under open-loop execution
Parameter Efficiency: 43M trainable parameters represents 161× reduction compared to OpenVLA’s ~7B parameters while maintaining competitive performance
Inference Speed: 40-47 Hz execution frequency (~20 ms per step) on RTX 5090, generating 121 velocity waypoints in 3.68s average
Ablation Results: Full model (73.4% SR) vs Vision-only (58.5%), No Flow (73.2%), No Language (65.9%) - showing complementary contributions of multimodal conditioning

Results are mixed with simulation numbers not directly comparable to baselines due to different evaluation protocols and datasets.

Compute & Efficiency

Model Size: 43M trainable parameters (FlowDiT core), uses frozen DINOv2 (86.6M) and CLIP encoders, 161× smaller than OpenVLA
Training Compute: NVIDIA RTX 5090 (32 GB) for fine-tuning, pretraining compute not reported, wall-clock training time not specified
Inference Speed: 40-47 Hz control frequency (~20 ms per control step), 3.68s average for generating 121 velocity waypoints from reference video
Memory Footprint: 32 GB GPU memory during training with FP16 precision, inference memory requirements not specified
Deployment Practicality: Compact 43M parameter execution module enables edge deployment on consumer hardware, but Stage I video generation requires access to large foundation models (WAN 2.2, LTX-Video) which may require cloud inference or high-end hardware

Real-World Applicability

Real Hardware Deployment: Successfully tested on physical Unitree G1 humanoid robot in unseen indoor lab/office environments with 64.7% task completion rate (11/17 trials)
Open-Loop Execution: System operates without live camera feedback during motion execution, relying solely on pre-generated reference video and language instructions
Embodiment Transfer: Same pipeline deployed across three different robot types (humanoid, drone, wheeled) without retraining, demonstrating cross-embodiment generalization
Sim-to-Real Gap: ~8.5 percentage point drop from simulation (73.2%) to real hardware (64.7%), indicating reasonable but not perfect transfer
Environmental Constraints: Real-world testing limited to indoor structured environments (labs, offices), with failures primarily from video-to-metric scale ambiguity and trajectory drift accumulation in open-loop mode

Limitations & Failure Modes

Video-to-metric scale ambiguity - FUNDAMENTAL: Generated videos imply motion magnitudes mismatched with actual robot displacement, inherent to pixel-to-physical mapping
Trajectory drift in open-loop - ENGINEERING: Small heading errors compound without correction, addressable with closed-loop visual feedback
Limited planning horizon - ENGINEERING: Video generation architectures constrain clips to 5-15 seconds, limiting single-stage planning scope
Constrained space navigation - FUNDAMENTAL: Elevated failure rates in highly constrained spaces like narrow doorframes requiring centimeter-level precision
Evaluation limitations - EVALUATION: Baseline comparisons use different datasets, robots, and success criteria, making performance claims questionable

Failure Modes:
Obstacle collision from accumulated drift: Open-loop execution without visual correction leads to collision with obstacles not anticipated in reference video
Semantic stopping failures: Model follows visual trajectory correctly but fails to execute proper stopping behavior at intended goal locations

DiagramNet: An End-to-End Recognition Framework and Dataset for Non-Standard System-Level Diagrams

Authors: Jincheng Lou, Ruohan Xu, Jiapeng Li, Junyin Pi et al. (9 authors) · Institution: Peking University · Category: cs.AI

DiagramNet introduces the first dataset and multi-agent framework for system-level circuit diagram recognition, achieving competitive performance through decomposed visual reasoning that separates component detection from topological connection prediction.

Practical Takeaway: This work demonstrates that system-level circuit diagram understanding benefits more from decomposed multi-agent workflows than end-to-end approaches. The key insight is separating visual grounding (YOLO detection) from topological reasoning (VLM connection prediction), which provides 128× improvement for Gemini-2.5-Pro on structured tasks. Research engineers should consider this decomposition pattern for complex visual reasoning tasks involving implicit relationships. The component-wise connection prediction strategy (predicting outputs per source component) is more stable than full-graph prediction and could apply to other graph extraction problems. However, the approach requires domain-specific dataset construction and multi-stage training, making it most suitable for applications with sufficient data and engineering resources.

Tags: circuit_diagrams multimodal_learning electronic_design_automation vision_language_models multi_agent_systems graph_understanding topology_reasoning reinforcement_learning


Task & Setting

System-level diagrams serve as architectural blueprints for chip design, encoding module functions, dataflows, and interface protocols. These diagrams help architects plan system functionality before implementation but pose unique recognition challenges compared to standardized analog-mixed-signal (AMS) schematics. Unlike AMS circuits with fixed component libraries, system-level diagrams use non-standardized symbols that vary across organizations and have implicit connectivity that lacks explicit input/output port markings.

The task decomposes system-level diagram understanding into four subtasks: (1) Listing extracts component names from visual entities:

\[f_{\text{list}} : I \to C = \{c_1, \ldots, c_n\}\]

(2) Localization provides spatial coordinates:

\[f_{\text{loc}} : (I, c_i) \to b_i \in [0,1]^4\]

(3) Connection predicts output targets:

\[f_{\text{conn}} : (I, c_i, C) \to T_i \subseteq C\]

(4) Circuit QA answers domain questions:

\[f_{\text{qa}} : (I, q) \to (r, a)\]

Success is measured by F1 scores on component detection (S1), output count prediction (S2), connection identification (S3), and Circuit QA accuracy (Task 2). The overall score combines $\text{Score}_{\text{Task1}} = 0.4 \times \text{S1} + 0.2 \times \text{S2} + 0.4 \times \text{S3}$ with $\text{Score}_{\text{overall}} = 0.6 \times \text{Score}_{\text{Task1}} + 0.4 \times \text{Score}_{\text{Task2}}$.
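As a quick sanity check of the scoring formula, the sketch below plugs in the DiagramNet-3B sub-scores reported in this digest (S1 = 0.988, S2 = 0.828, S3 = 0.735, Task 2 = 0.395). The function names are illustrative.

```python
def task1_score(s1, s2, s3):
    """Weighted combination of detection, output-count, and connection F1 scores."""
    return 0.4 * s1 + 0.2 * s2 + 0.4 * s3

def overall_score(task1, task2):
    """Blend of the structured-recognition score and Circuit QA accuracy."""
    return 0.6 * task1 + 0.4 * task2

t1 = task1_score(0.988, 0.828, 0.735)
overall = overall_score(t1, 0.395)
print(round(t1, 3), round(overall, 3))
```

The results round to 0.855 for Task 1 and 0.671 overall, matching the figures quoted for DiagramNet-3B.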

DiagramNet provides 1,000 system-level diagrams with 10,977 connection annotations and 15,515 chain-of-thought QA pairs across four tasks, extracted from major chip design venues.

Architecture & Method

Multi-agent workflow with three specialized agents: Perception Agent uses YOLOv11-nano for single-class component detection and applies row-major ordering to establish spatial structure for downstream reasoning.
Reasoning Agent uses Qwen2.5-VL-3B as the vision-language backbone to predict component-wise connections iteratively, processing each source component to predict its output targets rather than the full graph simultaneously.
Knowledge Agent loads task-specific LoRA weights into the VLM backbone based on query type, enabling efficient adaptation to Circuit QA tasks without retraining the full model.
The core technical contribution is the decoupled architecture that separates visual grounding from topological reasoning, combined with component-wise connection prediction that reduces output space complexity compared to end-to-end graph prediction.
Row-major spatial ordering (left-to-right, top-to-bottom) provides consistent positional priors and removes ordering ambiguity for the reasoning agent.
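One way to realize row-major ordering on detector output is sketched below. The grouping rule (boxes whose vertical centers fall within `row_tol` of a row's first box share that row) and the example component names are assumptions; the paper's exact procedure is not described here.

```python
def row_major_order(boxes, row_tol=0.05):
    """Order (name, (x_min, y_min, x_max, y_max)) boxes top-to-bottom, then left-to-right.

    Coordinates are assumed normalized to [0, 1] with y increasing downward.
    """
    def center(box):
        x0, y0, x1, y1 = box[1]
        return (x0 + x1) / 2.0, (y0 + y1) / 2.0

    rows = []
    for box in sorted(boxes, key=lambda b: center(b)[1]):
        cy = center(box)[1]
        # Join the current row if the vertical center is close to that row's anchor
        if rows and cy - rows[-1]["y"] <= row_tol:
            rows[-1]["items"].append(box)
        else:
            rows.append({"y": cy, "items": [box]})

    ordered = []
    for row in rows:
        ordered.extend(b[0] for b in sorted(row["items"], key=lambda b: center(b)[0]))
    return ordered

# Hypothetical detections from a small block diagram
detections = [
    ("DMA",  (0.60, 0.10, 0.80, 0.20)),
    ("CPU",  (0.10, 0.12, 0.30, 0.22)),
    ("SRAM", (0.10, 0.50, 0.30, 0.60)),
    ("NoC",  (0.60, 0.52, 0.80, 0.62)),
]
order = row_major_order(detections)
```

The tolerance keeps slightly misaligned boxes in the same row, so the reasoning agent sees a stable component order regardless of detection order.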

Training Recipe

Phase 1 - Multi-task Supervised Fine-Tuning: Base model initialized from Qwen2.5-VL-3B pre-trained on AI2D-RST for topological understanding. Cross-entropy loss applied:
\[L_{SFT} = -\sum_{i=1}^L \log P(y_i|X_v, X_t, y_{<i}; \Theta)\]
All four subtasks mixed in each batch. Data: DiagramNet training set. Optimizer details not reported. Hardware: NVIDIA A40 GPUs.
Phase 2 - Hard-Sample Reinforcement Learning: RL applied to difficult samples selected by inference instability, visual ambiguity, and high-density connectivity. Compound reward combining format validity and accuracy:
\[R_{total} = \sum_{t} (\lambda_{f,t}R_{fmt}^{(t)} + \lambda_{a,t}R_{acc}^{(t)})\]
Connection reward uses F1 + length penalty. Hardware: NVIDIA A100-SXM4 (80GB) GPUs. Training framework: Verl with DeepSpeed ZeRO-3.
Phase 3 - LoRA Fine-tuning: Low-rank adaptation for task-specific refinement:
\[h = W_0x + \alpha BAx/r\]
where $r \ll \min(d,k)$. Frozen base weights with trainable low-rank matrices. Hardware: RTX 4090 GPUs. Wall-clock time not reported.
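The LoRA forward pass above can be written out directly. The sketch uses plain Python lists and tiny made-up matrices (with $B$ of shape $d \times r$ and $A$ of shape $r \times k$) so the arithmetic is easy to follow; real implementations operate on tensors.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W0, A, B, x, alpha):
    """h = W0 x + (alpha / r) * B (A x), with W0 frozen and only A, B trainable."""
    r = len(A)                        # rank = number of rows of A
    base = matvec(W0, x)              # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # low-rank path
    return [b + alpha * u / r for b, u in zip(base, update)]

# Illustrative values only: d = 2, k = 3, r = 1
W0 = [[1.0, 0.0, 2.0],
      [0.0, 1.0, -1.0]]
A = [[0.5, 0.0, 0.5]]
x = [1.0, 2.0, 3.0]

h_init = lora_forward(W0, A, [[0.0], [0.0]], x, alpha=2.0)    # B = 0: adapter is a no-op
h_tuned = lora_forward(W0, A, [[1.0], [-2.0]], x, alpha=2.0)  # after a hypothetical update
```

Initializing $B$ to zero, as is common practice for LoRA, means the adapted model starts out identical to the base model, and swapping task-specific adapters only requires replacing the small $A$ and $B$ matrices.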

Novelty & Lineage

Prior work: AMSBench (Shi et al. 2025) provides multimodal understanding for analog-mixed-signal schematics with 6,000 connection annotations. Netlistify (Huang et al. 2025) creates 40k synthetic AMS schematic dataset for schematic-to-netlist conversion. Image2Net (Xu et al. 2025) offers 2,914 connections for AMS circuits.

Delta: This paper extends from standardized AMS schematics to non-standardized system-level diagrams. Key additions:

First dataset for system-level diagrams vs. transistor-level AMS circuits
Multi-agent workflow that decouples perception from reasoning vs. end-to-end approaches
Component-wise connection prediction vs. full-graph prediction.

Applied-specific assessment:
- Architectural idea: The multi-agent decomposition is a reasonable engineering approach but not fundamentally novel - separating detection from reasoning follows established computer vision patterns
- Benchmark gains: 21.8× improvement on S2 and 52.5× on S3 over base model, but comparisons use the same underlying VLM backbone, making gains less surprising
- SOTA comparisons: Fair comparisons within domain, though commercial models (GPT-5, Gemini-2.5-Pro) tested end-to-end while proposed method uses specialized workflow
- Scale dependence: Method requires custom YOLO training and domain-specific dataset, limiting generalizability without similar data investment
Verdict: INCREMENTAL — Solid engineering extension of AMS circuit understanding to system-level diagrams with reasonable performance gains, but core techniques are standard multi-agent decomposition applied to a new domain.

Benchmarks & Results

DiagramNet S1 (component detection): DiagramNet-3B 0.988, EDA Elite Winner 0.984, Netlistify 0.986, +0.4 percentage points over the EDA winner
DiagramNet S2 (output count): DiagramNet-3B 0.828, EDA Elite Winner 0.787, Qwen2.5-VL-3B 0.038, +4.1 percentage points over the EDA winner and 21.8× the base model
DiagramNet S3 (connection identification): DiagramNet-3B 0.735, EDA Elite Winner 0.777, Netlistify 0.150, trails the EDA winner by 4.2 percentage points but scores 4.9× Netlistify
DiagramNet Task 1 (overall): DiagramNet-3B 0.855, EDA Elite Winner 0.862, Claude-Sonnet-4 0.364, trails the EDA winner by 0.7 percentage points
DiagramNet Task 2 (Circuit QA): DiagramNet-3B 0.395, GPT-5 0.730, EDA Elite Winner 0.370, +2.5 percentage points over the EDA winner but 33.5 percentage points behind GPT-5
AMSBench Connection Identification: DiagramNet-3B 0.500, Netlistify 0.433, GPT-5 0.530, +6.7 percentage points over Netlistify
AMSBench Connection Judgement: DiagramNet-3B 0.667, tied with Netlistify and Hint-GRPO

Results are mixed - strong gains over base models but inconsistent performance against domain competitors.

Compute & Efficiency

Model size: 3B parameters for Qwen2.5-VL-3B backbone, YOLOv11-nano for detection (size not specified)
Training compute: NVIDIA A100-SXM4 (80GB) for RL phase, A40 GPUs for SFT, RTX 4090 for LoRA. Specific GPU hours not reported.
Inference speed/latency: Not reported, though multi-agent workflow likely adds overhead compared to end-to-end approaches
Memory footprint: Not reported, but 3B parameter VLM plus YOLO detector should be manageable on consumer hardware
Deployment practicality: Requires custom YOLO detector training (60 images minimum for adaptation), multi-stage inference pipeline, and domain-specific LoRA weights. Moderately complex deployment compared to single-model approaches.

Real-World Applicability

Dataset source: Real conference/journal figures from chip design venues, indicating genuine real-world data rather than synthetic benchmarks
Cross-domain transfer: Demonstrates zero-shot transfer to AMSBench with only 60 images for YOLO detector retraining, achieving competitive performance with GPT-5 and Claude-Sonnet-4
Competition performance: Surpasses 2025 EDA Elite Challenge winner on overall benchmark score (0.671 vs 0.665)
Industry integration: No production deployment results reported, though paper targets EDA workflows where manual diagram recognition currently requires domain expertise
Hardware constraints: Method successfully runs on consumer RTX 4090 GPUs for LoRA adaptation, suggesting reasonable deployment requirements for engineering teams

Limitations & Failure Modes

Symbol heterogeneity challenges - FUNDAMENTAL: Non-standardized symbols require continuous dataset expansion as new diagram styles emerge
Visual ambiguity handling - ENGINEERING: Weak directional cues, crossing ambiguities, and multi-fan-out structures cause systematic S3 errors that could be addressed with better visual reasoning
Domain knowledge gaps - ENGINEERING: Task 2 performance (0.395) significantly trails commercial models like GPT-5 (0.730), indicating insufficient circuit domain knowledge that could be improved with more comprehensive training data
Dataset coverage limitations - EVALUATION: Only 1,000 diagrams may not capture the full diversity of system-level diagram styles across different organizations and design domains
Multi-stage complexity - ENGINEERING: Requires coordinated training of YOLO detector, VLM backbone, and LoRA adapters, increasing training complexity compared to end-to-end approaches

Failure modes:
Dense diagrams with many crossing wires cause connection prediction errors
Non-standard diagram conventions from different organizations may not generalize without domain-specific retraining.

CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

Authors: Berk Çiçek, Mert K. Er, Özgür S. Öğüz · Institution: Bilkent University · Category: cs.RO

CoRAL uses LLMs to design cost functions and contact strategies for MPPI-based manipulation control, enabling zero-shot contact-rich tasks through online parameter adaptation and semantic reasoning.

Practical Takeaway: This paper demonstrates a viable engineering approach for combining foundation models with motion planning, particularly valuable for contact-rich tasks where pure VLA methods struggle. The key insight is using LLMs as cost function designers rather than direct controllers, maintaining the benefits of semantic reasoning while ensuring real-time reactive control. The online adaptation mechanism is particularly noteworthy for handling sim-to-real gaps. However, the approach requires significant system integration effort and depends on high-quality pose tracking, making it more suitable for research applications than production deployment. Research engineers should consider this modular architecture for scenarios where zero-shot capability is more important than peak performance.

Tags: robotics manipulation contact-rich LLM VLM motion-planning MPPI zero-shot


Task & Setting

Contact-rich robotic manipulation requires precise force regulation and strategic interaction with objects, but current approaches either rely on expensive demonstrations (VLA models) or lack semantic understanding (traditional planners). These tasks demand both high-level reasoning about object physics and low-level reactive control to manage complex contact dynamics.

The task takes RGB-D images $I$, 3D object models $M$, and natural language instructions $T$ as input, producing continuous 6-DoF end-effector control actions $u_t$. The system must estimate object poses, infer physical properties (mass, friction), and execute manipulation strategies involving purposeful contact forces. The objective is formulated as a stochastic optimal control problem:

\[U^* = \arg\min_U \mathbb{E}\left[\phi(x_H) + \sum_{t=0}^{H-1} q(x_t, u_t)\right]\]

where $q(x_t, u_t)$ is an LLM-generated cost function encoding task-specific objectives.

Success is measured by task completion rates across 10 randomized trials per task. The evaluation suite includes 6 manipulation scenarios: multi-stage push-and-pick, pick-and-place, constant force regulation, dynamic flipping, and wall-assisted manipulation. Real-world validation uses a Franka Emika Panda robot with force/torque sensing and motion capture for pose tracking.

Architecture & Method

Vision pipeline using FoundationPose for continuous 6-DoF object tracking and VLM (GPT-4o) for semantic physics parameter estimation (mass, friction coefficients)
LLM-based task formulation module that generates: (a) structured MPPI cost functions $J_0$ as executable Python code, and (b) contact strategy $C_0$ defining semantic regions of interest as ellipsoids around predicted contact points
Model Predictive Path Integral (MPPI) controller running at 10Hz with K=256 trajectory samples over H=32 step horizon, using adaptive temperature selection via effective sample size (ESS) targeting
Hierarchical control architecture: 1kHz impedance control for safety, 10Hz MPPI trajectory planning, ~1Hz LLM reasoning with reactive augmentation:
\[\nu_t = u_t + K_f \cdot (x_{des} - x_{measured})\]
Online adaptation loop where the LLM analyzes execution episodes $E_t$ to refine world model parameters $\theta$ and cost function structure based on interaction feedback
Retrieval-Augmented Generation (RAG) memory unit storing successful (strategy, parameters) pairs for experience reuse across similar tasks

The core contribution is using LLMs as cost function designers rather than direct controllers, enabling zero-shot planning while maintaining real-time reactive execution through the decoupled architecture.
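The ESS-targeted temperature selection mentioned above can be illustrated as follows. The digest does not describe the paper's procedure, so this sketch assumes a simple bisection on the inverse temperature $\beta$ in weights proportional to $\exp(-\beta S^{(k)})$, using the standard effective sample size $1 / \sum_k (w^{(k)})^2$. The toy cost values and the target of 32 are arbitrary.

```python
import math

def weights_and_ess(costs, beta):
    """Normalized MPPI weights and their effective sample size for a given beta."""
    s_min = min(costs)
    w = [math.exp(-beta * (s - s_min)) for s in costs]
    z = sum(w)
    w = [wi / z for wi in w]
    ess = 1.0 / sum(wi * wi for wi in w)
    return w, ess

def beta_for_target_ess(costs, target_ess, lo=0.0, hi=1e4, iters=60):
    """Bisection on beta: ESS shrinks from K (beta = 0) toward 1 as beta grows."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        _, ess = weights_and_ess(costs, mid)
        if ess > target_ess:
            lo = mid   # weights too flat: sharpen by increasing beta
        else:
            hi = mid   # weights too peaked: soften by decreasing beta
    return 0.5 * (lo + hi)

# Toy costs for K = 256 rollouts (illustrative, not from the paper)
costs = [0.1 * k for k in range(256)]
beta = beta_for_target_ess(costs, target_ess=32.0)
w, ess = weights_and_ess(costs, beta)
```

Targeting a fixed ESS keeps a predictable number of rollouts contributing to each update, regardless of how the LLM-generated cost function happens to be scaled.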

Training Recipe

No model training required - leverages pre-trained foundation models (GPT-4o for VLM/LLM, FoundationPose for tracking)
MPPI controller uses analytical dynamics model in MuJoCo simulation for trajectory rollouts, no learning involved
System operates zero-shot in three steps:
- VLM provides initial semantic physics priors from visual inspection
- LLM generates task-specific cost functions and contact strategies from language descriptions
- Online system identification refines physical parameters during execution
Memory unit populated incrementally by storing successful episodes during operation, indexed by task semantics and environmental parameters
Real-world deployment uses identical software stack with motion capture replacing FoundationPose for pose estimation
Hardware setup: Intel i9-13900K CPU, 64GB RAM, RTX 4060 Ti GPU for simulation and LLM inference

Training data: Not applicable - system designed for zero-shot operation without demonstration data or policy learning.

Novelty & Lineage

Prior work:

Language-to-Rewards (L2R, 2023) uses LLMs to generate reward functions for MPC but lacks adaptation mechanisms and contact strategy reasoning.
VoxPoser and VLMPC use VLMs to generate static cost maps for motion planning.
OpenVLA and other VLA models learn end-to-end policies from demonstration data.

Delta: This paper adds several components:
LLM generates both cost functions AND semantic contact strategies (not just rewards)
online adaptation loop where LLM refines world model parameters mid-execution based on interaction feedback
separation of VLM (perception) from LLM (strategy) roles
RAG-based experience memory for strategy reuse.

Applied-specific assessment:
- Architectural idea: The neuro-symbolic decoupling is a reasonable engineering choice but not fundamentally novel - separating high-level reasoning from low-level control is well-established
- Benchmark gains: 50%+ improvement over baselines sounds significant, but baselines include pre-trained models not specifically designed for contact-rich tasks
- Fair comparisons: Comparison methodology appears sound, though some baselines (OpenVLA-OFT, π0.5) may be disadvantaged on contact-heavy tasks they weren’t specifically trained for
- Generalization: The zero-shot capability is valuable, but the approach still requires carefully engineered prompts and system integration
The online adaptation mechanism is the most interesting contribution, but the overall approach combines existing techniques rather than introducing fundamentally new capabilities.

Verdict: INCREMENTAL — solid engineering that combines LLM reasoning with motion planning in a principled way, but lacks breakthrough conceptual innovations.

Benchmarks & Results

Push and Pick Cutting Board: CoRAL 5/10, OpenVLA-OFT 0/10, π0.5 0/10, L2R 0/10, Expert FSM 8/10
Pick Box: CoRAL 10/10, OpenVLA-OFT 10/10, π0.5 10/10, L2R 10/10, Expert FSM 10/10
Pick and Place in Clutter: CoRAL 10/10, OpenVLA-OFT 9/10, π0.5 8/10, L2R 9/10, Expert FSM 10/10
Push with Constant Force: CoRAL 9/10, OpenVLA-OFT 0/10, π0.5 0/10, L2R 5/10, Expert FSM 10/10
Flip Box: CoRAL 9/10, OpenVLA-OFT 1/10, π0.5 3/10, L2R 4/10, Expert FSM 10/10
Flip with Wall: CoRAL 7/10, OpenVLA-OFT 0/10, π0.5 0/10, L2R 1/10, Expert FSM 9/10

Results show strong performance on the contact-rich tasks (T1, T4, T5, T6, numbering the tasks in the order listed above), but gaps remain compared to expert-designed finite state machines. Standard pick-and-place tasks (T2, T3) show little advantage over existing VLA methods. Real-world validation shows reasonable sim-to-real transfer, with success rates between 4/10 and 10/10 across tasks.
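The aggregate picture can be checked directly from the per-task counts. The sketch below tallies success rates over the four contact-rich tasks, assuming T1 to T6 follow the listing order above (so the contact-rich set is Push and Pick Cutting Board, Push with Constant Force, Flip Box, and Flip with Wall). The unweighted pooling across tasks is an illustrative choice, not a metric reported in the paper, and `pi0.5` stands in for π0.5.

```python
# Successes out of 10 trials per task, copied from the benchmark list above.
TRIALS = 10
METHODS = ["CoRAL", "OpenVLA-OFT", "pi0.5", "L2R", "Expert FSM"]
SUCCESSES = {
    "Push and Pick Cutting Board": [5, 0, 0, 0, 8],
    "Pick Box": [10, 10, 10, 10, 10],
    "Pick and Place in Clutter": [10, 9, 8, 9, 10],
    "Push with Constant Force": [9, 0, 0, 5, 10],
    "Flip Box": [9, 1, 3, 4, 10],
    "Flip with Wall": [7, 0, 0, 1, 9],
}
# Assumption: these four are the tasks the summary calls T1, T4, T5, T6.
CONTACT_RICH = [
    "Push and Pick Cutting Board",
    "Push with Constant Force",
    "Flip Box",
    "Flip with Wall",
]


def success_rate(method, tasks):
    """Pooled success rate of one method over the given tasks."""
    col = METHODS.index(method)
    return sum(SUCCESSES[t][col] for t in tasks) / (TRIALS * len(tasks))


for method in METHODS:
    print(f"{method}: {success_rate(method, CONTACT_RICH):.1%}")
```

Pooled this way, CoRAL succeeds on 30 of 40 contact-rich trials, against 10 of 40 for the strongest non-expert baseline (L2R) and 37 of 40 for the expert FSM.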

Compute & Efficiency

Model size: No learned parameters - uses pre-trained GPT-4o API and FoundationPose
Training compute: Zero - no training required, pure inference system
Inference speed: LLM calls ~1Hz, MPPI planning 10Hz, impedance control 1kHz
Memory footprint: MPPI requires 256 trajectory rollouts with 32-step horizon in MuJoCo simulation
Deployment practicality: Requires GPU for simulation rollouts, API access for GPT-4o, and motion capture or robust pose estimation - moderately complex but feasible deployment
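
To make the planning budget concrete, here is a minimal sketch of one MPPI update with 256 sampled rollouts over a 32-step horizon, using the exponential cost weighting from the Preliminary section. Everything beyond those two numbers is an assumption for illustration: the toy double-integrator dynamics stand in for MuJoCo, and `reach_cost`, `noise_std`, and `beta` are hypothetical values, not the paper's LLM-generated costs or tuned parameters.

```python
import numpy as np

DT = 0.1  # hypothetical integration step for the toy model


def rollout_cost(x0, controls, dynamics, cost):
    """Simulate one control sequence and return its accumulated cost S."""
    x = np.array(x0, dtype=float)
    total = 0.0
    for u in controls:
        x = dynamics(x, u)
        total += cost(x, u)
    return total


def mppi_update(x0, u_nominal, dynamics, cost, num_samples=256,
                noise_std=0.5, beta=1.0, rng=None):
    """One MPPI iteration: perturb, roll out, and average with softmin weights."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.normal(0.0, noise_std, size=(num_samples,) + u_nominal.shape)
    candidates = u_nominal[None, :, :] + noise
    costs = np.array([rollout_cost(x0, c, dynamics, cost) for c in candidates])
    # Subtracting the minimum cost leaves the normalized weights unchanged
    # but avoids underflow in the exponential.
    weights = np.exp(-beta * (costs - costs.min()))
    weights /= weights.sum()
    u_new = np.tensordot(weights, candidates, axes=1)
    return u_new, weights, costs


def double_integrator(x, u):
    """Toy stand-in for the simulator: state is (position, velocity)."""
    pos, vel = x
    return np.array([pos + vel * DT, vel + u[0] * DT])


def reach_cost(x, u):
    """Hypothetical cost: reach position 1.0 with small velocity and effort."""
    return (x[0] - 1.0) ** 2 + 0.1 * x[1] ** 2 + 0.01 * u[0] ** 2


x0 = np.zeros(2)
u_nominal = np.zeros((32, 1))  # 32-step horizon, 1-D control
u_new, weights, costs = mppi_update(x0, u_nominal, double_integrator, reach_cost)
print(u_new.shape, weights.sum())
```

In a receding-horizon loop, only the first control of `u_new` would be applied before shifting the sequence and resampling, which is what makes the 10Hz planning rate the relevant budget.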

Real-World Applicability

Real-world validation on Franka Emika Panda robot across all 6 tasks with 4-10/10 success rates
Motion capture system (6-camera Vicon setup) provides ground-truth pose tracking, replacing FoundationPose from simulation
Force/torque regulation demonstrated with robot’s built-in sensors, maintaining ~5N contact forces within target bounds
Sim-to-real gap handled through online parameter adaptation - system successfully diagnoses and corrects friction/mass estimates during execution
No production deployment details provided - remains research prototype requiring motion capture infrastructure

Limitations & Failure Modes

FUNDAMENTAL: System performance bounded by VLM’s semantic physics estimation accuracy and LLM’s reasoning capabilities
ENGINEERING: Requires high-fidelity pose tracking (motion capture in real-world), limiting deployment scenarios
ENGINEERING: Latency constraints from GPT-4o API calls (~1Hz) limit adaptation speed
EVALUATION: Limited to relatively simple objects and controlled environments, no evaluation on deformable materials or complex multi-object scenes
ENGINEERING: Dependence on proprietary foundation models (GPT-4o) creates deployment and cost concerns

Failure modes:
VLM hallucinations about physics parameters can lead to completely wrong world models.
LLM-generated contact strategies may specify geometrically invalid or unsafe contact points, so extensive safety checking is required.

GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning

Authors: Yufei Jia, Heng Zhang, Ziheng Zhang, Junzhe Wu et al. (42 authors) · Institution: THU · Category: cs.RO

A simulation framework integrating custom parallel physics with memory-efficient batch 3D Gaussian Splatting to achieve 10^4 FPS photorealistic rendering for large-scale vision-based robot learning.

Practical Takeaway: If you’re working on vision-based robot learning, GS-Playground offers a compelling alternative to Isaac Lab/Sim for scenarios requiring both high visual fidelity and massive parallel throughput. The 10^4 FPS rendering performance and automated Real2Sim pipeline could significantly accelerate vision-based policy development. However, be aware of the lighting limitations - you’ll need consistent lighting between training and deployment. The framework appears most suitable for rigid-body manipulation and locomotion tasks rather than soft-body interactions. The cross-platform development workflow (prototype locally, train on GPU clusters) could streamline development cycles.

Tags: simulation 3DGS gaussian-splatting parallel-physics robotics sim-to-real locomotion manipulation


Task & Setting

Vision-based robot learning requires massive-scale parallel simulation to train policies for dynamic tasks like locomotion and contact-rich manipulation. Existing simulators struggle with the computational overhead of high-fidelity rendering, forcing a compromise between visual realism and simulation throughput. Additionally, creating simulation-ready 3D assets remains laborious and time-consuming.

The task is to develop a simulation framework with the following inputs and outputs:

Input: Multi-modal sensor data (RGB images at 640×480, depth maps, LiDAR point clouds, contact forces/torques)
Output: High-throughput parallel physics simulation with photorealistic 3D Gaussian Splatting (3DGS) rendering

The core objective is to maximize parallel simulation throughput while maintaining visual fidelity:

\[\text{Maximize: } \frac{\text{FPS} \times \text{Visual Quality}}{\text{Memory Usage}}\]

Success is measured by:

Rendering throughput (FPS at 640×480 resolution)
Physics simulation stability (contact force accuracy, momentum conservation)
Sim-to-real transfer success rates across locomotion, navigation, and manipulation tasks.

The framework introduces an automated “Image-to-Physics” pipeline that converts single RGB images into simulation-ready digital twins with both 3DGS representations and collision meshes.

Architecture & Method

Physics Engine: Custom cross-platform (Windows/Linux/macOS) velocity-impulse formulation with Mixed Complementarity Problem (MCP) solver using Projected Gauss-Seidel method for contact resolution
Batch 3DGS Renderer: Memory-efficient Gaussian Splatting pipeline with 90%+ point pruning maintaining <0.05 PSNR drop, achieving 10^4 FPS throughput at 640×480 resolution
Rigid-Link Gaussian Kinematics (RLGK): Zero-overhead coupling between physics rigid bodies and 3DGS clusters for temporal consistency
Constraint Island Parallelization: Dynamic constraint dependency graph partitioning for multi-core CPU parallel solving
Real2Sim Pipeline: Automated workflow using Grounding-DINO + SAM for segmentation, LaMa inpainting, AnySplat background reconstruction, and SAM-3D object modeling

The core contribution is harmonizing high-performance parallel physics with memory-efficient batch 3DGS rendering through specialized point-pruning and rigid-body kinematics synchronization.
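
For readers unfamiliar with Projected Gauss-Seidel, the following is a minimal textbook sketch on a tiny complementarity problem with box bounds: find `lam` within `[lo, hi]` such that `w = A @ lam + b` is complementary to it. It is not the GS-Playground solver; the matrix, the bounds, and the `lam0` warm-start argument are illustrative assumptions.

```python
import numpy as np


def projected_gauss_seidel(A, b, lo, hi, iters=100, lam0=None):
    """Solve a box-constrained complementarity problem by PGS sweeps.

    Each sweep updates one unknown at a time from its row residual and
    then projects it back onto [lo_i, hi_i].
    """
    n = len(b)
    lam = np.zeros(n) if lam0 is None else np.array(lam0, dtype=float)
    for _ in range(iters):
        for i in range(n):
            residual = A[i] @ lam + b[i]
            lam[i] = np.clip(lam[i] - residual / A[i, i], lo[i], hi[i])
    return lam


# Two coupled unilateral contacts: impulses must be non-negative.
A = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 1.0])
lo = np.zeros(2)
hi = np.full(2, np.inf)

lam = projected_gauss_seidel(A, b, lo, hi)
w = A @ lam + b
print(lam, w)
```

Passing the previous timestep's solution as `lam0` is the usual way to warm-start such a solver when consecutive frames have similar contact configurations.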

Training Recipe

Physics Simulation:
- Data: Massively parallel environments (up to 4096 simultaneous scenes)
- Solver: Custom velocity-impulse with warm-starting from temporal coherence
- Time steps: Up to 10ms supported with high stability
- Hardware: Both CPU and GPU backends supported
Batch Rendering:
- Data: 3DGS scenes with 90% of points removed by an efficient pruning method
- Resolution: 640×480 standard, up to 1280×720 supported
- Batch size: Up to 2048 scenes simultaneously
- Hardware: NVIDIA RTX 4090/6000 Ada/A100 tested
Policy Training:
- Algorithm: PPO for all tasks (locomotion, navigation, manipulation)
- Environments: 1024-2048 parallel instances
- Training time: 10 minutes (quadruped), 6 hours (humanoid), not reported (manipulation)
- Data: Mix of proprioceptive states and high-resolution RGB observations
Real2Sim Asset Generation:
- Processing time: 5 minutes end-to-end per scene (excluding model loading)
- Pipeline: 25s segmentation/inpainting, 8s AnySplat, 10s per object SAM-3D
- Hardware: NVIDIA RTX 3090 tested

Novelty & Lineage

Prior Work:

GaussGym (2024): First to apply 3DGS to RL but limited to small-scale, non-contact scenarios
Isaac Lab (2024): State-of-the-art parallel physics with ray-tracing rendering, but memory-intensive and lower throughput
Genesis (2024): GPU-based parallel physics with Madrona rendering, lacks photorealistic fidelity

Delta: This work adds:
Custom physics engine optimized for 3DGS integration with MCP contact solver
90%+ 3DGS point pruning achieving 10^4 FPS batch rendering
Automated Real2Sim pipeline for asset generation
Rigid-Link Gaussian Kinematics for artifact-free dynamic rendering.
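
The idea of binding Gaussians to rigid links can be sketched with the standard rigid transform of Gaussian parameters: means move as points, and covariances rotate as `R Σ Rᵀ`. This is a generic illustration under that assumption, not the paper's RLGK implementation; `transform_gaussians` is a hypothetical helper, and view-dependent appearance terms are ignored.

```python
import numpy as np


def transform_gaussians(means_local, covs_local, R, t):
    """Move link-attached Gaussians into the world frame.

    means_local: (N, 3) centers in the link frame
    covs_local:  (N, 3, 3) covariances in the link frame
    R, t:        link rotation (3, 3) and translation (3,) from the physics state
    """
    means_world = means_local @ R.T + t
    covs_world = R @ covs_local @ R.T  # broadcasts over the N Gaussians
    return means_world, covs_world


# One elongated Gaussian on a link rotated 90 degrees about z and lifted 0.5.
R_z90 = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
t = np.array([0.0, 0.0, 0.5])
means = np.array([[1.0, 0.0, 0.0]])
covs = np.array([np.diag([4.0, 1.0, 1.0])])

means_w, covs_w = transform_gaussians(means, covs, R_z90, t)
print(means_w)
print(covs_w[0])
```

Because the update is a single batched matrix product per link, it adds little cost on top of the physics step, which fits the low-overhead coupling described above.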

Applied-Specific Assessment:
- Architectural novelty: The RLGK coupling and specialized point-pruning for rigid-body scenes is novel, though builds on established 3DGS techniques
- Benchmark gains: 32× speedup over MuJoCo, 600× over MjWarp in complex scenes is substantial but heavily dependent on their custom physics engine
- Fair comparisons: Physics comparisons use equivalent scenarios, but rendering comparisons against Isaac Sim use different asset generation methods
- Scale dependency: The gains appear to require their specific 3DGS optimizations and wouldn’t necessarily transfer to other visual rendering approaches
Verdict: SIGNIFICANT — The integration of high-throughput physics with memory-efficient 3DGS rendering addresses a real bottleneck in vision-based robot learning, with convincing performance gains and demonstrated sim-to-real transfer.

Benchmarks & Results

Physics Stability: Newton’s Cradle momentum conservation - better preservation than MuJoCo; Boston Dynamics Spot stability at 10ms timestep - reduced drift vs MuJoCo
Rendering Throughput: 10,000 FPS at 640×480 (batch size 2048) vs Isaac Sim’s ~2,000 FPS; maintains advantage across RTX 4090/6000 Ada/A100 GPUs
Physics Scaling: 1,015 FPS at N=50 humanoids vs MuJoCo 32 FPS (32× speedup) and MjWarp 1.71 FPS (600× speedup)
Visual Quality: PSNR 26.87 vs 27.15 (original 3DGS) with 70% fewer Gaussians; SSIM 0.802 vs 0.830
Sim-to-Real Transfer: Quadruped locomotion (10 min training), humanoid control (6 hours training), manipulation grasping (90% success rate), navigation (successful cone following)
Memory Efficiency: 90%+ Gaussian reduction with <0.05 PSNR drop

Results show consistent advantages in throughput and stability, though some comparisons use different asset generation methods.
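
The headline ratios follow directly from the reported frame rates. The short check below only re-derives them from the numbers listed above.

```python
# Reported physics throughput with 50 humanoids (frames per second).
gs_playground_fps = 1015.0
mujoco_fps = 32.0
mjwarp_fps = 1.71

speedup_vs_mujoco = gs_playground_fps / mujoco_fps  # roughly 32x
speedup_vs_mjwarp = gs_playground_fps / mjwarp_fps  # a little under 600x

# Reported rendering throughput at 640x480.
render_ratio_vs_isaac = 10_000 / 2_000

print(f"vs MuJoCo: {speedup_vs_mujoco:.1f}x")
print(f"vs MjWarp: {speedup_vs_mjwarp:.1f}x")
print(f"rendering vs Isaac Sim: {render_ratio_vs_isaac:.0f}x")
```

The MjWarp figure is therefore a rounded-up value, and the rendering advantage over Isaac Sim is about 5× at the quoted batch size.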

Compute & Efficiency

Model Size: Up to 4096 parallel environments; 3DGS scenes with 90% point pruning (70% fewer Gaussians than standard)
Training Compute: Tested on NVIDIA RTX 4090/6000 Ada/A100; physics supports both CPU (AMD 9950x) and GPU backends
Inference Speed: 10^4 FPS rendering throughput at 640×480, physics scaling to 1,015 FPS with 50 humanoids
Memory Footprint: Significantly reduced through 90% 3DGS point pruning; avoids OOM failures that plague Isaac Sim at high batch sizes
Deployment Practicality: Cross-platform development (Windows/Linux/macOS), successful real-world deployment on Unitree Go2/G1 and Airbot Play arm

Real-World Applicability

Quadruped Locomotion: Successfully deployed state-based policies on Unitree Go2 for velocity tracking, trained in 10 minutes
Humanoid Control: 23-DoF balancing and walking on Unitree G1, trained in 6 hours with 2048 parallel environments
Visual Manipulation: End-to-end RGB-based grasping on Airbot Play arm with 90% success rate in uncontrolled real-world scenes
Visual Navigation: Real-time cone following on Unitree Go2 using hierarchical RL with high-level visual policy and low-level locomotion controller

All deployments demonstrated zero-shot transfer from simulation without additional real-world fine-tuning.

Limitations & Failure Modes

Lighting Dependency (FUNDAMENTAL): 3DGS struggles with randomized lighting/shadows unlike ray-tracing; asset generation depends on source image lighting conditions
Rigid-Body Assumption (FUNDAMENTAL): RLGK only supports rigid bodies; cannot handle deformable objects, cloth, or fluids
Limited Relighting (ENGINEERING): No algorithmic relighting capability to decouple object appearance from environmental lighting
Single-View Reconstruction (EVALUATION): Real2Sim pipeline uses single RGB images, may miss occluded geometry compared to multi-view approaches
3DGS Memory Scaling (ENGINEERING): Despite 90% pruning, memory usage still grows with scene complexity

Failure Modes:
- Dynamic lighting changes during deployment may degrade visual policy performance
- Contact-rich manipulation of soft/deformable objects not supported by current rigid-body physics

Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

Authors: Wenhui Huang, Songyan Zhang, Collister Chua, Yang Liang et al. (7 authors) · Institution: Nanyang Technological University · Category: cs.CV

Introduces LTD, the first open-ended traffic VQA dataset using roadside cameras, and UniVLT, a VLM trained via curriculum transfer to unify autonomous driving and city-scale traffic analysis.

Practical Takeaway: If you’re working on intelligent transportation systems, this paper demonstrates a reasonable approach for extending VLMs from autonomous driving to city-scale traffic analysis. The curriculum training strategy (general → AD → traffic) is worth implementing if you have multi-domain data. However, the core architecture is standard, and the main value is in the dataset and training procedure rather than novel modeling. The multi-image reasoning capability could be useful for other surveillance applications beyond traffic.

Tags: vision-language-models intelligent-transportation traffic-analysis multi-image-reasoning roadside-cameras curriculum-learning autonomous-driving dataset


Task & Setting

Real-world context: Urban transportation systems face growing safety challenges from complex interactions between diverse road users (pedestrians, cyclists, motorcycles). While foundation models have shown promise for autonomous driving scenarios, city-scale traffic analysis from roadside infrastructure remains underexplored. Traditional traffic monitoring methods prove inadequate for the complexity of modern urban mobility.
Task definition: The paper addresses three complementary tasks using roadside camera imagery: (a) Fine-grained multi-object grounding - detecting motorcycles and pedestrians with bounding box coordinates, (b) Multi-image camera selection - identifying which of 3 uncorrelated camera views show potential risks, and (c) Multi-image risk analysis - open-ended reasoning about hazardous objects, contributing factors, and risky road directions across 3 minimally correlated camera views. Input: RGB images from heterogeneous roadside cameras. Output: Textual responses for VQA tasks and normalized bounding box coordinates for grounding.
Evaluation criteria: GPT-Score for multi-image risk analysis, accuracy for camera ID selection, F1 score for multi-object grounding. Standard NLP metrics (CIDEr, METEOR, BLEU) for autonomous driving benchmarks. LingoJudge for LingoQA evaluation.
Dataset scale: Land Transportation Dataset (LTD) contains 11.6K high-quality VQA pairs from roadside cameras across Singapore, spanning diverse road geometries, traffic participants, illumination conditions, and weather.
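
As a rough illustration of how an F1 score for multi-object grounding can be computed, the sketch below greedily matches predicted boxes to ground-truth boxes by IoU. The matching rule and the 0.5 IoU threshold are assumptions for illustration; the summary does not state the paper's exact protocol. Boxes are normalized `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def grounding_f1(pred_boxes, gt_boxes, iou_threshold=0.5):
    """F1 with greedy one-to-one matching (threshold is an assumption)."""
    unmatched = list(gt_boxes)
    true_pos = 0
    for p in pred_boxes:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= iou_threshold:
            true_pos += 1
            unmatched.remove(best)
    false_pos = len(pred_boxes) - true_pos
    false_neg = len(gt_boxes) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 1.0


gt = [(0.1, 0.1, 0.3, 0.3), (0.6, 0.6, 0.8, 0.8)]
pred = [(0.1, 0.1, 0.3, 0.3), (0.0, 0.5, 0.1, 0.6)]
score = grounding_f1(pred, gt)
print(score)
```

Here one prediction matches exactly and one misses, giving one true positive, one false positive, and one false negative.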

Architecture & Method

Vision encoder: Redesigned Vision Transformer (ViT) handling dynamic input resolutions up to 225,792 pixels, with 2D Rotary Positional Embedding (RoPE) and RMSNorm normalization.
Language model backbone: Qwen2.5-VL 7B with multimodal 1D RoPE (MRoPE), Grouped Query Attention (GQA), and RMSNorm.
Multi-image processing: Visual tokens from all images concatenated in temporal/camera order to enable cross-image reasoning over long-range dependencies.
Training objective: Standard autoregressive language modeling, which minimizes the negative log of the answer likelihood:
\[p(X_a|X_v, X_q) = \prod_{l=1}^{L} p(x_l|X_v, X_q, X_{a,<l})\]
where $X_v$ are visual tokens, $X_q$ are instruction tokens, and $X_a$ are target answer tokens.
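
In practice this factorization is trained as a token-level negative log-likelihood restricted to the answer positions. The NumPy sketch below shows that computation under two assumptions that the summary does not confirm: the logits are already aligned so that row `l` predicts token `l`, and the loss is averaged over answer tokens.

```python
import numpy as np


def answer_nll(logits, token_ids, answer_mask):
    """Mean negative log-likelihood over answer tokens only.

    logits:      (L, V) next-token scores, row l predicting token_ids[l]
    token_ids:   (L,) target token indices
    answer_mask: (L,) 1.0 for answer tokens X_a, 0.0 for visual/instruction tokens
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_log_probs = log_probs[np.arange(len(token_ids)), token_ids]
    return -(token_log_probs * answer_mask).sum() / answer_mask.sum()


# Toy sequence: two context tokens followed by two answer tokens, vocab of 5.
token_ids = np.array([1, 2, 3, 0])
answer_mask = np.array([0.0, 0.0, 1.0, 1.0])

uniform_logits = np.zeros((4, 5))
loss_uniform = answer_nll(uniform_logits, token_ids, answer_mask)

confident_logits = np.zeros((4, 5))
confident_logits[2, 3] = 20.0
confident_logits[3, 0] = 20.0
loss_confident = answer_nll(confident_logits, token_ids, answer_mask)

print(loss_uniform, loss_confident)
```

A uniform predictor yields a loss of log V, and the loss approaches zero as probability mass concentrates on the correct answer tokens.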

Core technical contribution: Curriculum-based knowledge transfer strategy unifying microscopic autonomous driving reasoning with macroscopic traffic analysis, enabling joint reasoning over minimally correlated multi-view roadside camera observations.

Training Recipe

Pre-training stage: Initialize with pre-trained Qwen2.5-VL 7B weights (no additional training)
Fine-tuning Stage 1 (AD domain adaptation):
- Data: 727.1K QA pairs from LingoQA (413.8K), OmniDrive (287.9K), CODA-LM (25.4K)
- Fine-tuning method: LoRA
- Hardware and wall-clock time: Not reported
- Learning rate, schedule, batch size: Not reported
Fine-tuning Stage 2 (Traffic domain expansion):
- Data: 11.6K LTD samples + 3K samples each from AD datasets (experience replay)
- Fine-tuning method: LoRA
- Hardware and wall-clock time: Not reported
- Learning rate, schedule, batch size: Not reported
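
The Stage 2 data mix can be sketched as all LTD samples plus a fixed-size random replay subset from each AD dataset. The helper below is a hypothetical illustration: the function name, uniform sampling, and seed are assumptions, and the lists are small stand-ins for the real datasets (with LTD taken as roughly 11,600 items).

```python
import random


def build_stage2_mixture(ltd_samples, ad_datasets, replay_per_dataset=3000, seed=0):
    """Combine all LTD samples with a replay subset drawn from each AD dataset."""
    rng = random.Random(seed)
    mixture = list(ltd_samples)
    for samples in ad_datasets.values():
        k = min(replay_per_dataset, len(samples))
        mixture.extend(rng.sample(samples, k))
    rng.shuffle(mixture)
    return mixture


# Stand-in sample IDs; the real AD datasets are far larger than 5,000 items.
ltd = [("LTD", i) for i in range(11_600)]
ad = {
    name: [(name, i) for i in range(5_000)]
    for name in ("LingoQA", "OmniDrive", "CODA-LM")
}

stage2 = build_stage2_mixture(ltd, ad)
print(len(stage2))
```

With three AD datasets this gives about 20.6K training samples, of which replay makes up roughly 44%, a reminder that the replay share is large relative to the new-domain data.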

Novelty & Lineage

Prior work:

SUTD-TrafficQA (2021): 62.5K traffic QA pairs in multiple-choice format
TUMTraffic-VideoQA (2025): 87.3K video QA pairs with template-based annotations
LingoQA (2024): 413.8K open-ended driving VQA pairs for autonomous vehicles

Delta: This paper adds (1) first open-ended traffic VQA dataset using roadside cameras, (2) multi-image reasoning over uncorrelated camera views, (3) curriculum transfer from general → AD → traffic domains.

Applied-specific assessment:

Architectural idea: Standard VLM architecture with multi-image concatenation - not novel architecturally
Benchmark gains: Large margins on their own dataset (0.66 vs 0.46 GPT-score), modest gains on established benchmarks (69.0% vs 67.8% on LingoQA)
Fair comparisons: Reasonable baselines but evaluation primarily on their own dataset where advantage is expected
Generalizability: Would likely require similar roadside camera infrastructure and multi-stage training

Verdict: INCREMENTAL — Solid dataset contribution and engineering of curriculum training, but core VLM architecture is standard and gains are primarily on their own benchmark.

Benchmarks & Results

LTD Multi-Image Risk Analysis: UniVLT 0.66 GPT-Score vs Qwen2.5-VL 0.46 (43% relative improvement)
LTD Camera ID Selection: UniVLT 0.66 accuracy vs Qwen2.5-VL 0.48 (38% relative improvement)
LTD Multi-Object Grounding: UniVLT 0.64 F1 vs Qwen3-VL 0.62 (marginal improvement)
LingoQA: UniVLT 69.0% LingoJudge vs ReCogDrive 67.8% (1.2 percentage-point improvement)
OmniDrive Object Recognition: UniVLT 0.89 GPT-Score vs InternVL2.5 0.87 (marginal)
OmniDrive Driving Suggestion: UniVLT 0.87 GPT-Score vs ReCogDrive 0.84 (modest)
CODA-LM General Perception: UniVLT 5.18 vs RoboTron-Drive 5.15 (marginal)
CODA-LM Region Perception: RoboTron-Drive 7.66 vs UniVLT 7.25 (UniVLT second best)

Results show strong performance on their own LTD dataset but modest improvements on established benchmarks.
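
The improvement figures mix two conventions: the LTD numbers are relative gains, while the LingoQA number is an absolute difference in percentage points. The check below re-derives both from the scores listed above.

```python
def relative_gain(new, base):
    """Fractional improvement of `new` over `base`."""
    return (new - base) / base


risk_gain = relative_gain(0.66, 0.46)     # LTD risk analysis, about 43%
camera_gain = relative_gain(0.66, 0.48)   # LTD camera selection, 37.5%
lingo_points = 69.0 - 67.8                # LingoQA, in percentage points
lingo_relative = relative_gain(69.0, 67.8)

print(f"risk analysis: {risk_gain:.1%} relative")
print(f"camera selection: {camera_gain:.1%} relative")
print(f"LingoQA: {lingo_points:.1f} points, {lingo_relative:.1%} relative")
```

On a relative basis the LingoQA gain is under 2%, which is why the gains on established benchmarks read as modest next to those on LTD.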

Compute & Efficiency

Model size: 7B parameters (Qwen2.5-VL backbone)
Training compute: Not reported
Inference speed/latency: Not reported
Memory footprint: Not reported
Deployment practicality: Reasonable for deployment given 7B parameter size, but requires multi-image processing capability and roadside camera infrastructure

Real-World Applicability

Dataset collected from real roadside cameras across Singapore transportation network
Covers diverse road geometries, traffic participants, illumination conditions, and adverse weather
Focuses on safety-critical scenarios including vulnerable road users (pedestrians, motorcycles)
No reported real-world deployment or production integration
No sim-to-real evaluation discussed
Limited to Singapore road infrastructure and traffic patterns

Limitations & Failure Modes

Dataset limited to Singapore roadside cameras - FUNDAMENTAL (geographical and infrastructure specific)
Requires multi-stage training with AD data - ENGINEERING (complex training pipeline)
Multi-image inputs must be minimally correlated - FUNDAMENTAL (specific to roadside camera setup)
Grounding task sensitive to small object detection - FUNDAMENTAL (vulnerable road users occupy small image regions)
Open-ended evaluation prone to subjective scoring - EVALUATION (GPT-based metrics)
Limited cross-city generalization demonstrated - EVALUATION (single city dataset)

Failure modes:
Performance likely degrades on road infrastructure significantly different from Singapore
Multi-image reasoning may fail when camera views have unexpected correlations or occlusions.