
Applied AI Digest — Apr 2, 2026

Today’s Digest at a Glance

Preliminary

Today’s digest focuses on multimodal architectures for autonomous driving, with papers exploring hierarchical planning frameworks, specialized expert routing, world models, and continual learning approaches.

Mixture-of-Transformers (MoT)

Mixture-of-Transformers addresses the fundamental conflict between spatial perception tasks (requiring precise geometric reasoning) and semantic understanding tasks (requiring high-level abstraction) in vision-language-action models. Traditional approaches force a single transformer to optimize for both objectives simultaneously, leading to suboptimal performance on each.

The core insight is architectural specialization: instead of using standard transformer blocks, MoT replaces each block with multiple specialized transformer variants, each optimized for different cognitive functions. For autonomous driving, this typically involves three experts: understanding transformers $T_{\text{und}}$ for semantic reasoning, perception transformers $T_{\text{per}}$ for spatial-geometric tasks, and action transformers $T_{\text{act}}$ for motion planning. A learned routing mechanism determines which expert(s) to activate based on the input and task context.

Mathematically, each MoT block computes:

\[\text{MoT}(x) = \sum_{i} g_i(x) \cdot T_i(x)\]

where $g_i(x)$ are learned gating weights and $T_i$ are the specialized transformer experts. The key is that each $T_i$ can have different architectures, attention patterns, and training objectives tailored to its cognitive function.

Intuitively, MoT is like having different specialists (a navigator, a visual analyst, and a motor controller) work on the same driving scenario, with a coordinator deciding who takes the lead.
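The gated sum above can be sketched in a few lines of NumPy. This is only an illustration of the formula, not any paper's implementation: the three "experts" are plain linear maps standing in for $T_{\text{und}}$, $T_{\text{per}}$, and $T_{\text{act}}$, and the linear softmax router (`gate_weights`) is a hypothetical choice.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def mot_block(x, experts, gate_weights):
    """Compute sum_i g_i(x) * T_i(x) for a single token vector x."""
    gates = softmax(gate_weights @ x)             # g_i(x), non-negative, sums to 1
    outputs = np.stack([T(x) for T in experts])   # shape (n_experts, d)
    return gates @ outputs, gates

rng = np.random.default_rng(0)
d = 4
# Toy stand-ins for the three experts: linear maps, not real transformers.
mats = [rng.normal(size=(d, d)) for _ in range(3)]
experts = [lambda x, W=W: W @ x for W in mats]
gate_weights = rng.normal(size=(3, d))   # hypothetical linear router
x = rng.normal(size=d)

y, gates = mot_block(x, experts, gate_weights)
print("gates:", gates, "output:", y)
```

Because the gates form a convex combination, a router that saturates on one expert recovers hard routing as a special case.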

Null-Space Projection for Continual Learning

Null-space projection mitigates catastrophic forgetting in continual learning by constraining new-task updates so that, to first order, they do not interfere with previously acquired knowledge. The naive approach of simply fine-tuning on new tasks overwrites previous knowledge, while storing all previous data is impractical.

The technique leverages linear algebra: for each previous task, it stores the gradient directions that were important for that task and restricts later updates to their orthogonal complement. When learning a new task, parameter updates are projected onto this subspace, ensuring they don’t modify the directions critical for previous tasks. Formally, if the columns of $G_\text{prev}$ are gradients from previous tasks, the projector is $P = I - G_\text{prev}G_\text{prev}^+$, where $G_\text{prev}^+$ is the Moore-Penrose pseudoinverse; $P$ projects onto the null space of $G_\text{prev}^\top$.

During new task learning, gradients $g_\text{new}$ are modified as:

\[g_\text{projected} = P \cdot g_\text{new}\]

This ensures updates lie in the orthogonal complement of previous task gradients, preventing interference while still allowing learning in the remaining parameter space.

Essentially, null-space projection creates “protected zones” in parameter space for old knowledge while finding unused dimensions for new learning.
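A minimal NumPy sketch of the projector, assuming the stored gradients are stacked as the columns of a $d \times k$ matrix (the sizes and random data below are purely illustrative):

```python
import numpy as np

def null_space_projector(G_prev):
    """P = I - G G^+, with the previous-task gradients as columns of G_prev (d x k)."""
    d = G_prev.shape[0]
    return np.eye(d) - G_prev @ np.linalg.pinv(G_prev)

rng = np.random.default_rng(0)
d, k = 6, 2                           # toy sizes: 6 parameters, 2 stored gradients
G_prev = rng.normal(size=(d, k))
P = null_space_projector(G_prev)

g_new = rng.normal(size=d)
g_projected = P @ g_new

# The projected update has no component along any stored gradient direction.
print("overlap with old gradients:", G_prev.T @ g_projected)
print("remaining update norm:", np.linalg.norm(g_projected))
```

The projected gradient is orthogonal to every stored direction but generally non-zero, which is what leaves room for new learning as long as $k < d$.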

Reading Guide

Papers 1-3 represent different approaches to autonomous driving architecture: hierarchical LLM-based planning (Paper 1), specialized expert routing via MoT (Paper 2), and unified generative world models (Paper 3). Paper 4 introduces continual learning techniques that could enable these driving systems to adapt to new domains without forgetting previous capabilities. Paper 5 focuses on improving text conditioning for existing VLA models through structured causal reasoning, offering a complementary approach to architectural modifications.


Bridging Large-Model Reasoning and Real-Time Control via Agentic Fast-Slow Planning

Authors: Jiayi Chen, Shuai Wang, Guangxu Zhu, Chengzhong Xu · Institution: Chinese University of Hong Kong (Shenzhen), Chinese Academy of Sciences, University of Macau · Category: cs.RO

A hierarchical framework that bridges LLM reasoning with real-time control by using VLM-compressed topologies for cloud decision-making and semantic-guided A* planning with agentic hyperparameter tuning.

Practical Takeaway: If you’re building LLM-enhanced autonomous systems, this paper provides a clean architectural template for separating reasoning, planning, and control across appropriate timescales. The key insight is using soft semantic costs in classical planners rather than hard constraints or direct trajectory generation. The VLM-to-topology compression is practical for bandwidth-constrained edge-cloud systems. However, the approach requires significant infrastructure (cloud LLM access, memory storage) and lacks real-world validation. Consider implementing the semantic-guided A* concept in your planning stack, but budget for extensive real-world testing and robust fallback mechanisms.

Tags: autonomous_driving LLM_planning vision_language_models model_predictive_control path_planning semantic_guidance edge_cloud_computing CARLA_simulation


Task & Setting

This paper addresses the challenge of bridging high-level semantic reasoning from large language models (LLMs) with real-time control for autonomous driving. Current approaches either have LLMs generate trajectories directly (brittle, latency-prone) or adapt Model Predictive Control (MPC) objectives online (mixing slow deliberation with fast control). This creates a fundamental mismatch between reasoning timescales and control requirements.

The task involves mapping front camera images and navigation goals to real-time control actions (throttle, steering). The input is RGB imagery I from vehicle sensors plus high-level navigation objectives. The system must produce control commands u = (throttle, steering) at 10 Hz while maintaining safety constraints. The objective is to minimize trajectory tracking error while satisfying kinematic constraints:

\[\min_u \sum_{h=0}^H ||s_{t+h} - s^*_{t+h}||^2 + \lambda \sum_{h=0}^{H-1} ||u_{t+h}||^2\]

subject to dynamics and obstacle avoidance constraints.
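As a rough illustration, the objective can be evaluated for a candidate rollout as follows. The state layout, the toy numbers, and the value of $\lambda$ are assumptions made for the example; the dynamics and obstacle constraints are not modeled here.

```python
def mpc_cost(states, refs, controls, lam):
    """Tracking error over h = 0..H plus lam-weighted control effort over h = 0..H-1."""
    assert len(states) == len(refs) == len(controls) + 1
    tracking = sum(
        sum((s - r) ** 2 for s, r in zip(state, ref))
        for state, ref in zip(states, refs)
    )
    effort = sum(sum(u ** 2 for u in control) for control in controls)
    return tracking + lam * effort

# Toy rollout with horizon H = 2: states are (x, y), controls are (throttle, steering).
states = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
refs = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
controls = [(0.5, 0.1), (0.5, -0.1)]

cost = mpc_cost(states, refs, controls, lam=0.1)
print(cost)
```

An MPC solver would minimize this quantity over the control sequence at every control step and apply only the first control.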

Success is measured by finish time, trajectory length, average lateral deviation, speed variation, and maximum lateral deviation. The evaluation uses the CARLA simulator with perturbed maps to test robustness, comparing against pure MPC and A*-guided MPC baselines across three scenarios with varying obstacle layouts and map shifts.

Architecture & Method
  1. Perception2Decision Bridge: On-vehicle VLM-based topology detector (2B parameter InternVL) extracts ego-centric topology graphs G = {z_j} where z_j = (b_j, d_j, φ_j, t_j) contains bounding box, distance, orientation, and semantic class. Cloud-based LLM decision maker (DeepSeek-V3) processes serialized graphs to generate symbolic driving directives.

  2. VLM Training: Two-stage fine-tuning with next-token prediction loss:

    \[L(\theta) = -\sum_{t=1}^T \log p_\theta(y_t \mid I, y_{<t})\]

    Stage 1 freezes language backbone, Stage 2 unfreezes all parameters.

  3. Decision2Trajectory Bridge: Semantic-Guided A* extends classical search with soft semantic costs. State representation includes previous move m_prev and directive progress index k. Cost function augments geometric cost with semantic penalty:

    \[g(s') = g(s) + c_{geom}(s,s') + c_{sem}(m_{prev}, m_{curr}, a_k)\]
  4. Agentic Refinement Module: LLM-driven hyperparameter tuning using structured prompting, feedback analysis, and cloud memory for warm-start parameter retrieval.

  5. Cloud-Guided MPC: Switching controller selecting between local lightweight reference τ_local or cloud-provided global reference τ_cloud with binary indicator z_t.
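The Semantic-Guided A* idea in item 3 can be sketched on a toy grid. Everything specific here is an assumption for illustration rather than the paper's design: the 4-connected grid, the unit geometric cost, the left/right directive vocabulary, the `turn` classifier standing in for the translation logic, and the penalty weight `w_sem`. The search state carries the previous move and a directive progress index $k$, and a turn that disagrees with the current directive $a_k$ is penalized, not forbidden.

```python
import heapq

MOVES = {"E": (1, 0), "N": (0, 1), "W": (-1, 0), "S": (0, -1)}

def turn(prev, curr):
    """Classify the transition between two grid moves."""
    (px, py), (cx, cy) = MOVES[prev], MOVES[curr]
    cross = px * cy - py * cx
    if cross > 0:
        return "left"
    if cross < 0:
        return "right"
    return "straight" if (px, py) == (cx, cy) else "back"

def semantic_astar(start, goal, blocked, size, directives, heading="E", w_sem=5.0):
    """A* over states (cell, prev_move, k) with a soft semantic turn penalty."""
    def h(cell):  # Manhattan distance; admissible because c_sem >= 0
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    start_state = (start, heading, 0)
    frontier = [(h(start), 0.0, start_state, [start])]
    best = {start_state: 0.0}
    while frontier:
        _, g, (cell, prev, k), path = heapq.heappop(frontier)
        if cell == goal:
            return path, g
        if g > best.get((cell, prev, k), float("inf")):
            continue  # stale queue entry
        for name, (dx, dy) in MOVES.items():
            nxt = (cell[0] + dx, cell[1] + dy)
            inside = 0 <= nxt[0] < size[0] and 0 <= nxt[1] < size[1]
            if not inside or nxt in blocked:
                continue
            c_sem, k_next = 0.0, k
            t = turn(prev, name)
            if t in ("left", "right") and k < len(directives):
                if t == directives[k]:
                    k_next = k + 1   # directive satisfied: advance progress index
                else:
                    c_sem = w_sem    # soft penalty: the move stays feasible
            g_next = g + 1.0 + c_sem  # c_geom = 1 per grid step
            state = (nxt, name, k_next)
            if g_next < best.get(state, float("inf")):
                best[state] = g_next
                heapq.heappush(frontier, (g_next + h(nxt), g_next, state, path + [nxt]))
    return None, float("inf")

# A 5x3 grid with one obstacle directly ahead; both detours are geometrically equal.
blocked = {(2, 1)}
left_path, left_cost = semantic_astar((0, 1), (4, 1), blocked, (5, 3), ["left"])
right_path, right_cost = semantic_astar((0, 1), (4, 1), blocked, (5, 3), ["right"])
print(left_path, left_cost)
print(right_path, right_cost)
```

With the same geometry, the directive alone decides which side the path passes the obstacle on. Because the penalty is soft, a path still exists if the preferred side is blocked; it just costs more.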

Training Recipe
  1. VLM Training: Two-stage fine-tuning on 4×A100 GPUs for 12 epochs, learning rate 8×10^-5. Stage 1: freeze language backbone, adapt vision encoder. Stage 2: unfreeze all parameters for joint alignment. Data: 50,000 CARLA frames from Towns 2-7 and 10HD with bounding box annotations and topology graphs.

  2. LLM Decision Making: Uses pre-trained DeepSeek-V3 via API with structured prompting. No additional training reported - relies on in-context learning with role description, examples, and chain-of-thought reasoning.

  3. Semantic-Guided A*: Classical search algorithm with hand-crafted semantic cost weights. No learning required - uses rule-based translation logic Φ to map move transitions to alignment categories.

  4. Agentic Refinement: Uses pre-trained DeepSeek-V3 with LangChain for iterative hyperparameter adjustment. Cloud memory stores successful (scene, guidance, parameters) triplets for warm-start retrieval.

  5. MPC Controller: Fixed hyperparameters (path-following weight w_s = 0.37, speed-following weight w_u = 0.2), prediction horizon H = 15, time step Δt = 0.2s, control frequency 10 Hz. No training reported.

Novelty & Lineage

Prior work:

  1. LanguageMPC maps language commands to MPC objectives but lacks principled LLM-to-planning interfaces.
  2. DriveVLM uses VLMs for spatial reasoning but generates trajectories directly without classical planning integration.
  3. DiLu adds rule-based mechanisms for LLM driving but doesn’t address real-time control bridging.

    Delta: This paper adds:

  1. Hierarchical separation of perception/reasoning/planning/control across timescales
  2. Semantic-Guided A* that embeds language-derived soft costs into classical search
  3. Agentic refinement for automated hyperparameter tuning.

    Applied-specific assessment:

    • The architectural idea of soft semantic costs in A* is novel and non-obvious, providing a principled way to inject LLM guidance into classical planning
    • Benchmark gains are modest: relative to Pure MPC on CARLA scenarios, a 12% finish-time reduction, a 43% average lateral deviation reduction, and a 45% maximum lateral deviation reduction
    • Comparisons appear fair with same MPC controller and evaluation protocol across methods
    • Gains may not hold without cloud computing infrastructure for LLM reasoning and parameter storage

    Evaluation gaps: Limited to CARLA simulation, no real-world validation, only compared against basic baselines (pure MPC, vanilla A*), missing comparisons to other LLM-planning hybrids.

    Verdict: INCREMENTAL — solid engineering contribution that combines existing techniques (VLM detection, LLM reasoning, A* search, MPC) in a principled way, but core components are well-known with modest performance gains.

Benchmarks & Results
  1. VLM Topology Detection: BBox IoU 93.0%, Category error 0.04%, Distance/orientation errors 0.41/0.10, Distance threshold error 1.31%. Comparison shows two-stage fine-tuning outperforms single-stage approaches.

  2. LLM Decision Consistency: Average similarity score 0.73 between VLM and LLM decisions on 48 scenarios. LLM achieves 4.13s average latency vs 10.24s for VLM direct processing.

  3. CARLA Driving Performance (averaged over 3 scenarios; improvements are relative to Pure MPC):
    • Finish Time: AFSP 14.87s, A-MPC 16.84s, Pure MPC 16.91s (12% improvement)
    • Average Lateral Deviation: AFSP 1.08m, A-MPC 1.53m, Pure MPC 1.89m (43% improvement)
    • Maximum Lateral Deviation: AFSP 3.46m, A-MPC 5.44m, Pure MPC 6.27m (45% improvement)
    • Speed Variation: AFSP 1.45 m/s, A-MPC 2.05 m/s, Pure MPC 2.84 m/s

  4. Robustness under map perturbations: Semantic-Guided A* maintains intent consistency under spatial shifts while vanilla A* fails to preserve prescribed maneuver sequences.

    Results show consistent but modest improvements. Missing comparisons to other LLM-driving systems or end-to-end learned approaches.
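The improvement percentages quoted for the CARLA results can be reproduced from the reported numbers. The short script below is a bookkeeping aid, not part of the paper; the dictionary keys are made-up labels.

```python
# Reported CARLA averages (AFSP, A-MPC, Pure MPC); keys are illustrative labels.
results = {
    "finish_time_s": {"AFSP": 14.87, "A-MPC": 16.84, "Pure MPC": 16.91},
    "avg_lateral_dev_m": {"AFSP": 1.08, "A-MPC": 1.53, "Pure MPC": 1.89},
    "max_lateral_dev_m": {"AFSP": 3.46, "A-MPC": 5.44, "Pure MPC": 6.27},
    "speed_variation_mps": {"AFSP": 1.45, "A-MPC": 2.05, "Pure MPC": 2.84},
}

def reduction_pct(ours, baseline):
    """Relative reduction of a lower-is-better metric, in percent."""
    return 100.0 * (baseline - ours) / baseline

for metric, r in results.items():
    vs_mpc = reduction_pct(r["AFSP"], r["Pure MPC"])
    vs_ampc = reduction_pct(r["AFSP"], r["A-MPC"])
    print(f"{metric}: {vs_mpc:.1f}% vs Pure MPC, {vs_ampc:.1f}% vs A-MPC")
```

The quoted 12%, 43%, and 45% figures match the Pure MPC baseline. Against the stronger A-MPC baseline the lateral-deviation gains are smaller.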

Compute & Efficiency
  1. Model size: VLM topology detector uses 2B parameter InternVL model. LLM decision maker uses DeepSeek-V3 (parameter count not stated in the paper; the public release is a 671B-parameter mixture-of-experts model with 37B parameters active per token).

  2. Training compute: VLM fine-tuning on 4×A100 GPUs for 12 epochs. Wall-clock time not reported. LLM uses pre-trained model via API calls.

  3. Inference speed: LLM decision making averages 4.13s latency vs 10.24s for direct VLM processing. MPC control runs at 10 Hz. Overall system timescales: seconds for reasoning, deciseconds for planning, and a 100 ms cycle (10 Hz) for control.

  4. Memory footprint: Topology graphs provide significant compression over raw images for cloud transmission. Cloud memory stores (scene, guidance, parameters) triplets for hyperparameter warm-start. Specific memory requirements not quantified.

  5. Deployment practicality: Requires cloud connectivity for LLM reasoning and parameter storage. On-vehicle components (VLM detection, MPC control) run locally. Practical deployment would need robust cloud-edge communication and fallback strategies for connectivity loss.

Real-World Applicability
  1. Real-world validation: None reported. All experiments conducted in CARLA simulator only.

  2. Hardware experiments: No physical robot or vehicle testing described.

  3. Production integration: No deployment results or production system integration reported.

  4. Sim-to-real discussion: Paper lacks discussion of sim-to-real transfer challenges, domain gap between CARLA and real driving scenarios, or validation on real sensor data.

  5. Robustness testing: Limited to map perturbations in simulation (spatial shifts, enlarged obstacle radii). No evaluation on weather variations, lighting changes, or sensor noise that would occur in real deployment.

    The work remains purely simulation-based without addressing practical deployment challenges or real-world validation.

Limitations & Failure Modes
  1. Cloud dependency (FUNDAMENTAL): System requires reliable cloud connectivity for LLM reasoning and memory access, creating single point of failure.

  2. Simulation-only validation (EVALUATION): No real-world testing limits confidence in practical applicability and robustness.

  3. Limited baseline comparisons (EVALUATION): Missing comparisons to other LLM-planning hybrids or end-to-end learned approaches.

  4. Hyperparameter sensitivity (ENGINEERING): Despite agentic refinement, A* search remains fundamentally sensitive to cost weight tuning.

  5. Scale constraints (ENGINEERING): VLM processing and cloud communication may not scale to complex urban environments with many objects.

  6. Safety verification gaps (FUNDAMENTAL): Semantic guidance uses soft costs without hard safety guarantees, potentially allowing unsafe trajectories.

    Failure modes:

  1. Connectivity loss to cloud would disable high-level reasoning, forcing fallback to pure MPC.
  2. LLM hallucination or inconsistent directives could bias A* toward suboptimal or conflicting paths.

UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving

Authors: Yongkang Li, Lijun Zhou, Sixu Yan, Bencheng Liao et al. (14 authors) · Institution: Xiaomi EV, Huazhong University of Science and Technology · Category: cs.CV

UniDriveVLA uses Mixture-of-Transformers to decouple understanding, perception, and action experts in autonomous driving, achieving competitive performance while addressing the spatial perception vs semantic reasoning conflict in VLA models.

Practical Takeaway: Research engineers working on autonomous driving VLA models should consider the core insight about representation interference between spatial and semantic features in shared parameter architectures. The expert decoupling approach via Mixture-of-Transformers provides a principled way to address this conflict, though the implementation requires careful three-stage progressive training. The sparse perception paradigm extracting spatial priors from 2D VLM features is a more practical alternative to dense 3D representations. However, be cautious about the computational overhead and complexity - the approach may be better suited for research exploration than immediate production deployment. The masked joint attention mechanism for controlling cross-expert communication could be valuable for other multimodal applications beyond driving.

Tags: autonomous-driving vision-language-models mixture-of-experts multimodal-learning trajectory-planning 3d-perception end-to-end-driving spatial-reasoning


Task & Setting

UniDriveVLA addresses autonomous driving planning by combining vision-language understanding with spatial perception in a single model. Current Vision-Language-Action (VLA) models for autonomous driving face a fundamental trade-off: directly adopting 2D Vision-Language Models provides strong semantic reasoning but limited spatial perception, while enhancing them with 3D spatial representations improves spatial perception at the expense of native semantic reasoning capabilities.

The task takes multi-view camera observations $I_{cam} \in \mathbb{R}^{K \times V \times H \times W \times 3}$, historical trajectory $I_{hist} \in \mathbb{R}^{T_{hist} \times 2}$, and navigation command $L_{nav}$ as inputs to predict future trajectory:

\[T_{traj} = \Phi(I_{cam}, I_{hist}, L_{nav})\]

where $T_{traj} = \{(x_t, y_t)\}_{t=1}^T$ denotes the predicted future trajectory.

Success is measured through open-loop evaluation on nuScenes (L2 displacement error, collision rate) and closed-loop evaluation on Bench2Drive (driving score, success rate, efficiency, comfort). The model also performs 3D detection (mAP, NDS), online mapping ($\text{AP}_{\text{ped}}$, $\text{AP}_{\text{divider}}$, $\text{AP}_{\text{boundary}}$), motion forecasting (minADE, minFDE), and driving-oriented VQA.

The paper evaluates on nuScenes (1,000 driving sequences from Boston/Singapore) and Bench2Drive (CARLA-based simulator with 6-view camera inputs at 900×1600 resolution).

Architecture & Method
  1. Vision-Language Backbone: Uses Qwen3-VL with SigLIP-2 vision encoder, MLP-based vision-language merger, and Qwen3 language model, processing 960×544 driving frames

  2. Mixture-of-Transformers Architecture: Three specialized experts for understanding ($T_{\text{und}}$), perception ($T_{\text{per}}$), and action ($T_{\text{act}}$), with expert-specific projections:

    \[Q_g = T_g W_g^Q, K_g = T_g W_g^K, V_g = T_g W_g^V\]

    where $g \in \{\text{und}, \text{per}, \text{act}\}$

  3. Masked Joint Attention: Concatenates expert representations and applies masked attention:

    \[Z = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V\]

    where understanding tokens use causal masking, perception tokens attend to understanding tokens, and action tokens aggregate both semantic and spatial information

  4. Sparse Spatial Perception: Extracts spatial priors from 2D VLM features using task-specific sparse queries for detection, mapping, ego-status, motion forecasting, and occupancy prediction

  5. Unified Optimization: Joint training with combined loss:

    \[L_{total} = \lambda_1 L_{ar} + \lambda_2 L_{per} + \lambda_3 L_{act}\]

    The core contribution is decoupling understanding, perception, and action into separate expert pathways while maintaining controlled cross-expert communication to resolve the spatial perception vs semantic reasoning conflict.
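The masked joint attention in item 3 can be illustrated with a small NumPy sketch. The visibility pattern follows the description above, with two assumptions that are not confirmed by the digest: perception tokens also attend to each other, and action tokens also attend to each other. Token counts and random inputs are toy values.

```python
import numpy as np

def build_mask(n_und, n_per, n_act):
    """Additive mask M: 0 where attention is allowed, -inf where it is blocked."""
    n = n_und + n_per + n_act
    u, p = n_und, n_und + n_per
    allowed = np.zeros((n, n), dtype=bool)
    allowed[:u, :u] = np.tril(np.ones((n_und, n_und), dtype=bool))  # und: causal
    allowed[u:p, :p] = True   # per: sees und tokens (and per tokens, by assumption)
    allowed[p:, :] = True     # act: sees und and per (and act tokens, by assumption)
    return np.where(allowed, 0.0, -np.inf)

def masked_attention(Q, K, V, M):
    """Z = softmax(Q K^T / sqrt(d_k) + M) V, also returning the attention weights."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) + M
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
n_und, n_per, n_act, d_k = 3, 2, 2, 8
n = n_und + n_per + n_act
# Stand-ins for the concatenated expert-specific projections Q_g, K_g, V_g.
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

M = build_mask(n_und, n_per, n_act)
Z, weights = masked_attention(Q, K, V, M)
print(np.round(weights, 2))
```

In the printed weight matrix, the understanding rows carry no weight on perception or action columns, which is what keeps the semantic pathway isolated from the spatial and action tokens.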

Training Recipe
  1. Stage 1 - Semantic Anchoring: Full fine-tuning of base VLM for 3 epochs with learning rate 4×10^-5 on driving-specific VQA and general multimodal data (3:7 ratio), using FineVision for general domain data
    • Data: Mixture of driving-specific and general vision-language data
    • Hardware/time: not reported
  2. Stage 2 - Joint Optimization: Joint training for 30 epochs using AdamW optimizer with base learning rate 2×10^-4, VLM backbone uses 0.5× multiplier (effective 1×10^-4)
    • Data: Combined autoregressive language modeling, spatial perception tasks (3D detection, mapping, occupancy), flow-matching trajectory generation
    • LoRA applied to language model, EMA during training
    • Hardware/time: not reported
  3. Stage 3 - Expert Specialization: Freeze vision-language model, fine-tune Perception and Action Experts for 15 epochs with learning rate 1×10^-4
    • Data: Adds motion forecasting objective for dynamic priors
    • EMA maintained during training
    • Hardware/time: not reported

    All stages use AdamW optimizer. Specific hardware details and wall-clock training times are not reported.

Novelty & Lineage

Prior Work:

  1. Orion (ICCV 2025): Vision-language instructed action generation using LLaVA-7B, achieves 0.34 L2 error and 77.74 driving score on benchmarks
  2. AutoVLA (NeurIPS 2025): VLA model using Qwen2.5-VL-3B, achieves 0.48 L2 error but limited spatial perception
  3. OpenDriveVLA: 3D-enhanced VLA with BEV encoders and 3D Q-Formers, achieves 0.33 L2 error but degrades semantic reasoning

    Delta: This paper introduces Mixture-of-Transformers architecture that decouples understanding, perception, and action into separate expert pathways with masked joint attention, combined with sparse perception from 2D VLM features.

    Applied-Specific Assessment:

    • Architectural novelty: The expert decoupling approach is conceptually similar to π0 (robotics) and general MoT architectures, but the specific application to resolve spatial perception vs semantic reasoning conflict in autonomous driving is reasonably novel
    • Benchmark gains: Achieves SOTA 78.37 driving score on Bench2Drive and competitive L2 errors on nuScenes, but improvements are modest (e.g., 78.37 vs 77.74 from Orion)
    • Fair comparisons: Uses same evaluation protocols but different base models (Qwen3-VL vs others’ LLaVA/Qwen2.5-VL), making direct comparisons somewhat limited
    • Scale dependency: The three-stage progressive training and expert architecture likely require substantial compute, though specific requirements not reported

    The core insight about representation interference between spatial and semantic features is valid, but the solution largely applies existing MoT patterns to driving. The sparse perception approach is incremental over dense 3D representations.

    Verdict: INCREMENTAL — Solid engineering work applying MoT to resolve a real problem in driving VLA, but the architectural ideas are well-established and performance gains are modest.

Benchmarks & Results
  1. Bench2Drive closed-loop: Driving Score 78.37 (previous SOTA Orion 77.74), Success Rate 51.82% (vs Orion 54.62%), Efficiency 198.86 (vs Orion 151.48), L2 error 0.72

  2. nuScenes ST-P3 protocol (with ego): L2 error 0.43m average (vs OpenDriveVLA 0.33m), Collision rate 0.10% (vs OpenDriveVLA 0.10%)

  3. nuScenes ST-P3 protocol (without ego): L2 error 0.51m average (best reported), Collision rate 0.11% (vs SparseDrive 0.08%)

  4. nuScenes UniAD protocol (with ego): L2 error 0.77m average (vs OpenDriveVLA 0.67m), Collision rate 0.23% (vs OpenDriveVLA 0.30%)

  5. nuScenes perception: Detection mAP 0.407, NDS 0.460, Map mAP 0.535 (competitive but not SOTA)

  6. DriveBench VQA: 51.97% average (vs GPT-4o 51.96%, ReCogDrive† 56.71%)

  7. Bench2Drive multi-ability: Best on Merging (38.75%), Overtaking (80.00%), competitive mean 51.53%

  8. General VQA benchmarks: Maintains reasonable performance (e.g., 49.9% RealWorldQA, 76.3% ChartQA) but below general-purpose VLMs

    Mixed results - achieves some SOTA performance on specific metrics but not consistently across all benchmarks. Performance gains are often marginal.

Compute & Efficiency
  1. Model size: Base version uses Qwen3-VL-2B; Large version uses Qwen3-VL-8B

  2. Training compute: Not reported - missing GPU hours, hardware specifications, and wall-clock training times across the three-stage training process

  3. Inference speed/latency: Not reported - no inference timing or FPS measurements provided

  4. Memory footprint: Not reported - no memory usage statistics during training or inference

  5. Deployment practicality: Limited assessment - the three-stage progressive training strategy and MoT architecture likely require substantial compute resources, but specific requirements unclear. The 8B parameter Large model may be challenging for real-time autonomous driving deployment without optimization

Real-World Applicability
  1. Real-world data evaluation: Uses nuScenes dataset collected from real driving scenarios in Boston and Singapore, demonstrating performance on authentic multi-view camera inputs

  2. Simulation validation: Extensive closed-loop evaluation on Bench2Drive CARLA simulator with 6-view cameras at 900×1600 resolution

  3. Production deployment: No evidence of actual vehicle deployment, hardware integration, or real-world road testing reported

  4. Hardware experiments: No specific hardware validation or edge deployment experiments described

  5. Sim-to-real analysis: Paper does not discuss sim-to-real transfer capabilities or domain gap analysis between CARLA simulation and real-world driving scenarios

    The work remains primarily academic with evaluation limited to datasets and simulation environments. No clear path to production deployment demonstrated.

Limitations & Failure Modes
  1. ENGINEERING: Motion prediction performance lags behind specialized baselines (minADE 1.264m vs better specialized models) - fixable with more focused motion modeling

  2. EVALUATION: Missing critical deployment metrics including inference latency, memory usage, and compute requirements - gaps in how the method was tested for real-time driving

  3. FUNDAMENTAL: General VQA performance drops significantly compared to general-purpose VLMs (e.g., 43.3% MMStar vs 63.0% Qwen3-VL) - inherent to specialized driving adaptation

  4. ENGINEERING: Three-stage progressive training strategy adds complexity and likely substantial compute overhead - could be simplified with better initialization

  5. EVALUATION: Limited comparison with concurrent VLA methods due to different base models and evaluation protocols

    Failure modes:

  1. Long-tail scenarios: Despite VLM reasoning capabilities, may still struggle with rare driving situations not well-represented in training data
  2. Real-time constraints: The MoT architecture and sparse perception likely introduce computational overhead that could impact real-time performance requirements in actual vehicles

DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning

Authors: Yang Zhou, Xiaofeng Wang, Hao Shao, Letian Wang et al. (11 authors) · Institution: GigaAI, University of Toronto · Category: cs.CV

DriveDreamer-Policy unifies depth generation, video prediction, and action planning in autonomous driving through structured causal attention, achieving modest improvements over existing world-action models.

Practical Takeaway: The key insight is using explicit depth generation as geometric scaffolding for world-action models, structured through causal attention (depth→video→action). The modular query-based interface between LLM backbone and generative experts is elegant for controlling compute vs. capability trade-offs. However, the approach still requires careful hyperparameter tuning (loss weights λ_d, λ_v, λ_a) and depends on external depth models for supervision. Consider this architecture if you need interpretable geometric reasoning in world models, but be prepared for the additional complexity of multi-objective training and the computational overhead of generating multiple modalities.

Tags: autonomous_driving world_models vision_language_action depth_estimation video_generation motion_planning diffusion_models multimodal_learning


Task & Setting

Real-world context: Autonomous driving requires both understanding the current scene and predicting how it will evolve under different actions - a key limitation of existing VLA planners that optimize actions without explicitly modeling future world states. Pure world models can simulate futures but need external action signals. Recent world-action models bridge this gap but often model only 2D appearance without geometric grounding, limiting their utility for 3D physical reasoning essential in driving.

Task definition: Given multi-view RGB camera images (3 views at 144×256 resolution), natural language instructions, and current action context, the model must jointly produce: 1) current-scene depth maps, 2) action-conditioned future video sequences (9 frames), and 3) future trajectory waypoints parameterized as (x, y, cos θ, sin θ). The formal objective combines three loss terms:

\[\mathcal{L} = \lambda_d \mathcal{L}_d + \lambda_v \mathcal{L}_v + \lambda_a \mathcal{L}_a\]

where each component uses flow matching for continuous targets.
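The $(x, y, \cos\theta, \sin\theta)$ waypoint parameterization from the task definition can be sketched as below. A plausible motivation, stated here as an assumption and not as the authors' reasoning, is that it avoids the wrap-around discontinuity of a raw heading angle at $\pm\pi$, which matters for continuous regression targets.

```python
import math

def encode_waypoint(x, y, theta):
    """Map a pose (x, y, heading in radians) to (x, y, cos θ, sin θ)."""
    return (x, y, math.cos(theta), math.sin(theta))

def decode_heading(cos_t, sin_t):
    """Recover the heading in (-π, π] from its cosine/sine encoding."""
    return math.atan2(sin_t, cos_t)

# Two headings that are physically close but sit on opposite sides of the ±π wrap.
a = encode_waypoint(1.0, 2.0, 3.1)
b = encode_waypoint(1.0, 2.0, -3.1)

raw_gap = abs(3.1 - (-3.1))                # large gap in raw angle space
encoded_gap = math.dist(a[2:], b[2:])      # small gap in (cos, sin) space
print(raw_gap, encoded_gap, decode_heading(a[2], a[3]))
```

A predicted (cos, sin) pair need not have unit norm, but `atan2` is scale-invariant, so decoding still yields a valid heading.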

Evaluation criteria: Planning performance measured via PDMS (Predictive Driver Model Score) on Navsim v1 and EPDMS (Extended PDMS) on v2, aggregating safety metrics (no-collision, drivable area compliance, time-to-collision), progress (ego advancement), and comfort. World generation evaluated using FVD, LPIPS, PSNR for video quality and AbsRel, δ-accuracy for depth quality.

Dataset: Navsim benchmark with 100k training and 12k test samples sampled at 2Hz from real-world driving logs, providing synchronized surround-view camera inputs and trajectory labels.

Architecture & Method
  1. LLM backbone: Qwen3-VL-2B processes tokenized text instructions, multi-view visual patches (via vision encoder), and action embeddings, along with learned depth, video, and action query tokens arranged in causal order.

  2. Query interface: Fixed-size bottleneck with 64 depth queries, 64 video queries, and 8 action queries that serve as compact world/action embeddings, enabling modular expert conditioning.

  3. Depth generator: Pixel-space diffusion transformer initialized from PPD, takes concatenated noisy depth + RGB image as input, cross-attends to depth query embeddings, trained with flow matching objective:

    \[\mathcal{L}_{FM} = \mathbb{E}_{x_0,x_1,t}[\|v_\theta(x_t, t|c) - (x_1 - x_0)\|_2^2]\]
  4. Video generator: Latent-space diffusion transformer adapted from Wan-2.1-T2V-1.3B, processes VAE-encoded current frames and noisy future latents, conditioned on video query embeddings + CLIP visual features.

  5. Action generator: Diffusion transformer mapping noise trajectories to feasible actions, conditioned on action query embeddings that aggregate upstream geometric and temporal cues.

  6. Structured attention: Causal masking enforces depth → video → action information flow within the LLM, allowing video queries to attend to depth context and action queries to attend to both.
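The flow matching objective used by the generators can be estimated on a batch as follows. This sketch assumes the linear interpolation path $x_t = (1 - t)x_0 + t x_1$, which is the path whose velocity is the regression target $x_1 - x_0$; the callables passed as `v_theta` are toy stand-ins, not the paper's diffusion transformers.

```python
import numpy as np

def flow_matching_loss(v_theta, x0, x1, t, c=None):
    """Batch estimate of E || v_theta(x_t, t | c) - (x1 - x0) ||^2."""
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))   # broadcast t over feature dims
    x_t = (1.0 - t) * x0 + t * x1               # linear path (assumption)
    target = x1 - x0                            # constant velocity along that path
    pred = v_theta(x_t, t, c)
    per_sample = np.sum((pred - target) ** 2, axis=tuple(range(1, x0.ndim)))
    return float(np.mean(per_sample))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(16, 4))          # noise samples
x1 = rng.normal(size=(16, 4)) + 2.0    # stand-in for data samples
t = rng.uniform(size=16)

# Two toy predictors: an oracle that returns the true velocity, and an all-zero model.
oracle = lambda x_t, t, c: x1 - x0
zero = lambda x_t, t, c: np.zeros_like(x_t)

loss_oracle = flow_matching_loss(oracle, x0, x1, t)
loss_zero = flow_matching_loss(zero, x0, x1, t)
print(loss_oracle, loss_zero)
```

In a real model, `c` would carry the conditioning signal (for example the query embeddings), and sampling would integrate the learned velocity field from noise to data.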

Training Recipe
  1. Single-stage joint training: All components trained simultaneously for 100k steps with multi-task loss (λ_d=0.1, λ_v=λ_a=1.0)

  2. Data preprocessing: Depth labels from Depth Anything 3 (DA3), log-normalized to [-0.5, 0.5] range. Video training horizon: 9 frames at 144×256 resolution

  3. Optimization: AdamW optimizer, learning rate 1×10^-5, batch size 32 across 8 NVIDIA H20 GPUs

  4. Model initialization: Qwen3-VL-2B for LLM, PPD for depth generator, Wan-2.1-T2V-1.3B adapted for video generator

  5. Training data: Only Navsim navtrain split (100k samples), no additional datasets or extra pretraining beyond initialized backbones

  6. Hardware/timing: Not reported for wall-clock training time
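The depth preprocessing in item 2 (log-normalization to the [-0.5, 0.5] range) could look roughly like this. The digest does not give the exact formula, so the min-max scaling in log space and the depth bounds `d_min` and `d_max` are hypothetical choices for illustration.

```python
import math

def log_normalize_depth(d, d_min=0.5, d_max=80.0):
    """Map metric depth to [-0.5, 0.5] via min-max scaling in log space.

    d_min and d_max are hypothetical clipping bounds, not values from the paper.
    """
    d = min(max(d, d_min), d_max)
    span = math.log(d_max) - math.log(d_min)
    return (math.log(d) - math.log(d_min)) / span - 0.5

def denormalize_depth(z, d_min=0.5, d_max=80.0):
    """Inverse of log_normalize_depth for values inside the clipping range."""
    span = math.log(d_max) - math.log(d_min)
    return math.exp((z + 0.5) * span + math.log(d_min))

for depth in (0.5, 5.0, 20.0, 80.0):
    z = log_normalize_depth(depth)
    print(depth, round(z, 3), round(denormalize_depth(z), 3))
```

Working in log space spends more of the target range on nearby depths, where small absolute errors matter most for driving.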

Novelty & Lineage

Prior work: 1) PWM (NeurIPS’25) treats unified transformer as Policy World Model performing action-free forecasting and collaborative state-action prediction. 2) DriveVLA-W0 (ICLR’26) adds future-image world modeling to VLA with lightweight MoE action expert. 3) Epona (ICCV’25) uses autoregressive diffusion world model decoupling causal temporal latents from per-step generation.

Delta: This paper explicitly generates pixel-space depth as geometric scaffold alongside video, using structured causal attention (depth→video→action) to ground future prediction in 3D geometry. Unlike prior work modeling only 2D appearance or latent representations, incorporates explicit 3D geometric reasoning.

Applied-specific assessment:

  • Architectural idea is incremental - adds depth generation to existing world-action model paradigm using standard diffusion components
  • Benchmark gains are modest: +0.8 PDMS over DriveVLA-W0, +1.1 over PWM on Navsim v1. Video FVD improvement (-32.36 vs PWM) is larger but only one comparison point
  • Comparisons appear fair on same dataset/protocol, though missing comparisons to other recent world-action methods
  • Benefits likely depend on depth foundation model quality (DA3) and structured attention design

Verdict: INCREMENTAL — Solid engineering combining existing techniques (depth generation, world-action modeling) with clear but expected improvements over baselines.

Benchmarks & Results
  1. Navsim v1 PDMS: 89.2 vs previous best DriveVLA-W0 88.4 (+0.8), PWM 88.1 (+1.1)

  2. Navsim v2 EPDMS: 88.7 vs previous best DriveVLA-W0 86.1 (+2.6)

  3. Video generation FVD: 53.59 vs PWM 85.95 (-32.36), showing substantial video quality improvement

  4. Video generation LPIPS: 0.20 vs PWM 0.23 (-0.03, lower is better); PSNR 21.05 vs 21.57 (-0.52, a slight regression since higher PSNR is better)

  5. Depth estimation AbsRel: 8.1 vs fine-tuned PPD 9.3 (-1.2), δ1 accuracy 92.8 vs 91.4 (+1.4)

  6. Component ablations: Full model (89.2 PDMS) vs action-only (88.0), depth+action (88.5), video+action (88.9)

    Planning results show consistent but modest improvements across safety/progress metrics. Video generation shows larger gains but limited baseline comparison. Missing comparisons to other recent world-action methods like DriveLaW, UniPGT.

Compute & Efficiency
  1. Model size: Qwen3-VL-2B backbone (~2B parameters), plus lightweight diffusion transformers for depth/video/action generation (specific parameter counts not reported)

  2. Training compute: 8 NVIDIA H20 GPUs for 100k steps, batch size 32 (wall-clock time not reported)

  3. Inference speed/latency: Not reported, though paper mentions “controllable latency” through modular design allowing planning-only mode

  4. Memory footprint: Training at reduced 144×256 resolution “to reduce computational and memory costs” - full resolution requirements not specified

  5. Deployment practicality: Modular architecture enables flexible operating modes (planning-only, imagination-enabled planning, full generation), but no concrete deployment metrics or hardware requirements provided for real-time operation

Real-World Applicability
  1. Real-world data: Trained and evaluated on Navsim benchmark derived from real-world driving logs with synchronized multi-view camera inputs

  2. Deployment results: No reported deployment on actual vehicles or hardware systems

  3. Production integration: No discussion of integration with existing autonomous driving stacks or real-time constraints

  4. Sim-to-real: No explicit sim-to-real analysis, though Navsim provides realistic driving scenarios from logged data

  5. Hardware experiments: No experiments on actual autonomous driving platforms or edge devices - evaluation limited to offline benchmark assessment

Limitations & Failure Modes
  1. Limited resolution: Training at 144×256 to reduce compute costs may limit real-world applicability where high-resolution sensing is critical (ENGINEERING)

  2. Depth dependency: Reliance on external depth foundation model (DA3) for training labels rather than learning from true geometric supervision (FUNDAMENTAL)

  3. Single dataset: Only evaluated on Navsim without cross-dataset generalization or diverse driving conditions (EVALUATION)

  4. Missing real-time analysis: No inference speed or memory analysis for practical deployment constraints (EVALUATION)

  5. Limited baseline comparisons: Video generation compared against only PWM, missing other recent world-action methods (EVALUATION)

    Failure modes:

    • Depth generation errors could cascade to video/action prediction through causal attention structure
    • Short 9-frame horizon may be insufficient for complex multi-step maneuvers requiring longer-term planning

Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning

Authors: Mohammad R. Abu Ayyash · Institution: Brains Build Research · Category: cs.CL

Brainstacks introduces a modular architecture for continual LLM fine-tuning using frozen MoE-LoRA stacks with null-space projection and outcome-based routing, discovering that domain adapters encode transferable cognitive primitives rather than domain-specific knowledge.

Practical Takeaway: If you’re building domain-specific LLM systems, Brainstacks offers a genuinely modular alternative to monolithic fine-tuning. The core insight, that domain adapters may learn transferable cognitive primitives rather than domain knowledge, deserves investigation in your own applications. The engineering is solid: MoE-LoRA building blocks, frozen stacking with null-space projection, and outcome-based routing. However, implement carefully: the 2× inference overhead and the sensitivity to dataset contamination both need attention before production use. Most valuable for organizations needing to incrementally add domains without full retraining, though the cognitive primitives claim needs validation on your specific domains before assuming medical→math transferability holds generally.

Tags: continual-learning parameter-efficient-fine-tuning mixture-of-experts LoRA domain-adaptation catastrophic-forgetting modular-architectures meta-learning

Task & Setting

The paper addresses continual multi-domain fine-tuning of large language models without catastrophic forgetting, a critical need for deploying specialized LLMs that must handle diverse domains while maintaining modular capabilities. Current approaches couple all domain knowledge into shared parameters, requiring full retraining for new domains and lacking selective activation mechanisms.

The task involves training domain-specific adapters that can be composed at inference. Input is multi-domain text data across domains like chat, code, math, medical, and reasoning. Output is a modular system where frozen MoE-LoRA “stacks” can be selectively activated per prompt. The training objective combines the standard cross-entropy task loss with an auxiliary MoE load-balancing term:

\[L = L_{\text{task}} + \lambda_{\text{aux}} \cdot L_{\text{aux}}\]

where $L_{\text{aux}} = N \cdot \sum_e P(e) \cdot f(e)$ is the MoE load balancing term.
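
A small sketch of the load-balancing term may help. The digest does not define $P(e)$ and $f(e)$; the code below assumes the common convention in which $P(e)$ is the mean router probability assigned to expert $e$ and $f(e)$ is the fraction of routing slots actually dispatched to it. The function and argument names are illustrative.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Compute N * sum_e P(e) * f(e) for one batch.

    router_probs: per-token list of softmax probabilities over experts.
    assignments: per-token list of the expert indices selected (top-K).
    """
    num_tokens = len(router_probs)
    total_slots = sum(len(chosen) for chosen in assignments)
    loss = 0.0
    for e in range(num_experts):
        mean_prob = sum(p[e] for p in router_probs) / num_tokens       # P(e), assumed meaning
        frac = sum(chosen.count(e) for chosen in assignments) / total_slots  # f(e), assumed meaning
        loss += mean_prob * frac
    return num_experts * loss


# Perfectly balanced routing over N=4 experts with top-K=2.
balanced = load_balance_loss([[0.25] * 4, [0.25] * 4], [[0, 1], [2, 3]], 4)
# Collapsed routing: every token picks experts 0 and 1.
collapsed = load_balance_loss([[0.7, 0.3, 0.0, 0.0]] * 2, [[0, 1], [0, 1]], 4)
print(balanced, collapsed)
```

Under these assumptions the term is smallest when traffic and probability mass are spread evenly, and it grows as routing collapses onto a few experts.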

Success is measured through domain-specific validation losses and zero-shot benchmark performance (HellaSwag, ARC, MMLU, GSM8K, etc.). The headline finding is that optimal routing often excludes the nominal domain: medical prompts route to chat+math 97% of the time.

The paper validates on TinyLlama-1.1B (4 domains, 9 stacks) and Gemma 3 12B IT (5 domains, 10 stacks) using curated datasets: Alpaca for chat, GSM8K for math, MedQA for medical, OpenThoughts for reasoning.

Architecture & Method
  1. MoE-LoRA Building Block: Each transformer projection replaced with Mixture-of-Experts LoRA using N=4 experts, top-K=2 routing, rank r=16 with rsLoRA scaling $s = \alpha/\sqrt{r}$. Applies Shazeer-style noisy routing with learned noise: $\ell_{\text{noisy}} = W_r \cdot x + \text{softplus}(W_n \cdot x) \odot \epsilon$ where $\epsilon \sim \mathcal{N}(0,I)$.

  2. Frozen Stack Architecture: Stacked MoE-LoRA layers compose additively: $\text{output} = W_{\text{frozen}}(x) + \sum_j \text{frozen\_stack}_j(x) + \text{active\_stack}(x)$. Each stack frozen after training and offloaded to CPU.

  3. Inner Loop Residual Boosting: Sequential stacks within domains learn residual corrections. Stack 2 trains on shifted loss landscape where Stack 1’s frozen contribution already corrects part of the output.

  4. Null-Space Projection: Before training domain d+1, compute SVD of frozen stacks’ outputs on validation data, extract top-K=64 principal directions, form projection matrix $P = V \cdot V^T$, constrain active stack: $\delta_{\text{projected}} = \delta - \delta \cdot P$.

  5. Outcome-Based Meta-Router: Lightweight neural network (2M params) with sigmoid outputs per domain. Trained on empirically discovered routing targets by testing all domain combinations and selecting those reducing loss > 0.01 threshold. Architecture uses cross-attention with learnable domain queries.
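
The projection in item 4, $\delta_{\text{projected}} = \delta - \delta \cdot P$ with $P = V \cdot V^T$, can be sketched in a few lines. This is an illustration only: the paper is described as using randomized SVD to obtain $V$, whereas the sketch swaps in plain Gram-Schmidt to build an orthonormal basis so that it stays dependency-free. The toy vectors are made up.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(directions, tol=1e-10):
    """Gram-Schmidt stand-in for the SVD step: orthonormalize the protected directions."""
    basis = []
    for d in directions:
        residual = list(d)
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * x for r, x in zip(residual, b)]
        norm = dot(residual, residual) ** 0.5
        if norm > tol:  # skip directions already spanned by the basis
            basis.append([r / norm for r in residual])
    return basis


def project_to_null_space(delta, basis):
    """Return delta - delta @ (V V^T), removing one orthonormal direction at a time."""
    out = list(delta)
    for b in basis:
        coeff = dot(delta, b)
        out = [o - coeff * x for o, x in zip(out, b)]
    return out


# Directions occupied by earlier (frozen) stacks in a toy 3-dim space.
protected = orthonormal_basis([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
projected = project_to_null_space([0.3, -0.2, 0.5], protected)
print(projected)
```

After projection the update has no component along any protected direction, so to first order it cannot disturb what the frozen stacks already produce there.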

Training Recipe
  1. Pretraining: Uses existing pretrained models (TinyLlama-1.1B, Gemma 3 12B IT) under 4-bit NF4 quantization as frozen base.

  2. Domain SFT (Outer Loop): Sequential training across domains in curriculum order: Chat → Code → Math → Medical → Reasoning. Each domain trains for 400-600 steps, batch size 16, learning rate 2×10^-4 to 1×10^-4, sequence length 512, gradient checkpointing enabled.

  3. Residual Boosting (Inner Loop): Within each domain, up to 3 sequential stacks with early stopping when validation loss improvement < 0.002. BestStackCallback prevents overfitting by restoring best weights if validation spikes > 0.1.

  4. Meta-Router Training: Post-SFT training on outcome targets discovered via exhaustive domain combination testing. BCE loss with confidence margin penalty, 8 epochs, cosine scheduling. Training data: router-specific subsets to avoid format contamination.

  5. Hardware: TinyLlama on consumer GPU (9.5-202 minutes per domain), Gemma 3 12B on Colab G4 96GB. Null-space projection uses randomized SVD with 400 validation samples. Wall-clock time not comprehensively reported.
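
The outcome-based target discovery in item 4 can be sketched as a search over stack combinations. This is one possible reading, not the paper's code: it tries every non-empty subset of domains (the digest does not say exactly which subsets are tested) and keeps the lowest-loss subset among those that beat the base model by more than the 0.01 threshold. `loss_fn` and the toy gains are hypothetical.

```python
from itertools import combinations

DOMAINS = ["chat", "code", "math", "medical", "reasoning"]
MIN_GAIN = 0.01  # threshold quoted in the digest


def discover_target(prompt, loss_fn, domains=DOMAINS, min_gain=MIN_GAIN):
    """Return a multi-hot routing target for one prompt.

    loss_fn(prompt, active_domains) is a hypothetical callback returning the
    loss with only the given stacks enabled.
    """
    base_loss = loss_fn(prompt, ())
    best_combo, best_loss = (), base_loss
    for size in range(1, len(domains) + 1):
        for combo in combinations(domains, size):
            loss = loss_fn(prompt, combo)
            if base_loss - loss > min_gain and loss < best_loss:
                best_combo, best_loss = combo, loss
    return [1.0 if d in best_combo else 0.0 for d in domains]


def toy_loss(prompt, active):
    # Made-up per-stack effects: chat and math help, the others slightly hurt.
    gains = {"chat": 0.3, "code": -0.02, "math": 0.2, "medical": -0.05, "reasoning": -0.01}
    return 2.0 - sum(gains[d] for d in active)


target = discover_target("example medical prompt", toy_loss)
print(dict(zip(DOMAINS, target)))
```

With five domains this reading evaluates 31 subsets per prompt, which is why the search cost is a concern as domains are added. The resulting multi-hot vectors are the kind of target a sigmoid router could be trained against with BCE loss.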

Novelty & Lineage

Prior Work: MoLoRA (Zadouri et al., 2023) and MixLoRA (Li et al., 2024) apply MoE to LoRA but limit to FFN layers or use single LoRA on attention. InfLoRA (Liang & Li, CVPR 2024) uses null-space projection but for standard LoRA. C-LoRA and Online-LoRA freeze adapters per task but select one at inference, no additive composition.

Delta: This paper uniquely combines:

  1. Shazeer noisy routing on ALL 7 transformer projections including attention (q,k,v,o)
  2. rsLoRA scaling with MoE-LoRA
  3. residual boosting through frozen MoE-LoRA stacking
  4. null-space projection via randomized SVD
  5. outcome-based sigmoid routing discovering empirical domain combinations.
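
The first ingredient in the list above, Shazeer-style noisy top-K routing, follows the formula $\ell_{\text{noisy}} = W_r \cdot x + \text{softplus}(W_n \cdot x) \odot \epsilon$. A minimal pure-Python sketch, with made-up weights and dimensions, and with the softmax restricted to the selected experts as in the original noisy-gating formulation:

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def noisy_top_k_gate(x, W_r, W_n, k=2, rng=random):
    """Return per-expert gate weights; only the top-k experts are non-zero."""
    clean = matvec(W_r, x)
    noise_scale = [softplus(z) for z in matvec(W_n, x)]
    noisy = [c + s * rng.gauss(0.0, 1.0) for c, s in zip(clean, noise_scale)]
    top = sorted(range(len(noisy)), key=lambda e: noisy[e], reverse=True)[:k]
    peak = max(noisy[e] for e in top)
    exp = {e: math.exp(noisy[e] - peak) for e in top}  # softmax over selected experts only
    total = sum(exp.values())
    return [exp[e] / total if e in exp else 0.0 for e in range(len(noisy))]


# N=4 experts, 2-dim input; weights are illustrative values.
W_r = [[0.5, -0.1], [0.2, 0.4], [-0.3, 0.1], [0.0, 0.6]]
W_n = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [-0.1, 0.0]]
gates = noisy_top_k_gate([1.0, -0.5], W_r, W_n, k=2, rng=random.Random(0))
print(gates)
```

The learned noise scale lets the router explore alternative experts during training; at evaluation time the noise term would typically be dropped.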

    Assessment:

    • The architectural combination is genuinely novel - no prior work applies noisy routing to all 7 projections or residually stacks MoE-LoRA modules
    • The key finding about cognitive primitives vs domain knowledge is significant if validated more rigorously
    • However, individual components are incremental improvements on known techniques
    • Benchmark gains are mixed/marginal (±0.02-0.03 on 200 samples falls within noise)
    • The TinyLlama and Gemma results show engineering value but limited scale

    Verdict: INCREMENTAL — Solid engineering contribution combining multiple known techniques in a novel architecture, but the cognitive primitives claim needs stronger evidence and benchmark improvements are within noise margins.

Benchmarks & Results
  1. HellaSwag: Base 0.670, Routed 0.650 (-0.020)
  2. ARC-Easy: Base 0.510, Routed 0.515 (+0.005)
  3. ARC-Challenge: Base 0.525, Routed 0.495 (-0.030)
  4. TruthfulQA: Base 0.350, Routed 0.370 (+0.020)
  5. MMLU: Base 0.450, Routed 0.435 (-0.015)
  6. GSM8K: Base 0.665, Routed 0.665 (0.000)
  7. MedQA: Base 0.385, Routed 0.350 (-0.035)
  8. MedMCQA: Base 0.330, Routed 0.360 (+0.030)

    Results are mixed with small improvements on 3/8 benchmarks, degradation on 4/8, identical on 1/8. All differences are within sampling noise at n=200 samples. The paper acknowledges 95% CI ~±0.07 at this sample size. No comparison to other continual learning or MoE-LoRA methods on the same benchmarks. Missing: more comprehensive evaluation with larger sample sizes, comparison to recent continual learning baselines.
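
    The noise claim can be checked with a rough calculation. The sketch below uses the normal-approximation (Wald) interval for a single proportion at n=200; it is a sanity check under that assumption, not the paper's analysis.

```python
import math

N_SAMPLES = 200

# (base, routed) accuracies copied from the list above.
RESULTS = {
    "HellaSwag": (0.670, 0.650),
    "ARC-Easy": (0.510, 0.515),
    "ARC-Challenge": (0.525, 0.495),
    "TruthfulQA": (0.350, 0.370),
    "MMLU": (0.450, 0.435),
    "GSM8K": (0.665, 0.665),
    "MedQA": (0.385, 0.350),
    "MedMCQA": (0.330, 0.360),
}


def half_width_95(p, n):
    """Normal-approximation 95% half-width for an accuracy p measured on n samples."""
    return 1.96 * math.sqrt(p * (1.0 - p) / n)


worst_case = half_width_95(0.5, N_SAMPLES)
print(f"worst-case half-width at n={N_SAMPLES}: {worst_case:.3f}")
for name, (base, routed) in RESULTS.items():
    gap = routed - base
    inside = abs(gap) < half_width_95(base, N_SAMPLES)
    print(f"{name:<14} gap {gap:+.3f} within noise: {inside}")
```

    The worst-case half-width comes out near 0.07, matching the figure the paper acknowledges, and every listed gap is smaller than the interval around its base score.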

Compute & Efficiency
  1. Model Size: TinyLlama base 1.1B + 9 stacks × 53.6M = ~1.6B total parameters. Gemma 3 12B + 10 stacks × ~200M = ~14B total parameters.

  2. Training Compute: TinyLlama total 485 minutes on consumer GPU. Gemma 3 12B on Colab G4 96GB, specific wall-clock times not fully reported. MoE-LoRA 2× slower than single LoRA (20.2 vs 9.5 min) but converges 2.5× faster per step.

  3. Inference Speed: Meta-router adds one forward pass through 2M parameter network per prompt. Disk-offloaded stack loading enables constant GPU memory regardless of total stack count (“Superposition LLM principle”).

  4. Memory Footprint: Only active stack occupies GPU memory in full precision. Frozen stacks in half-precision on CPU, shuttled to GPU only during forward pass. GPU memory remains constant as stacks accumulate.

  5. Deployment: Modular architecture allows selective stack loading per prompt from disk. Individual domains can be added/removed without retraining. Production practicality limited by 2× inference overhead from MoE routing and CPU-GPU shuttling latency.

Real-World Applicability
  1. Benchmark-Only Evaluation: All experiments conducted on standard benchmarks (HellaSwag, ARC, MMLU, GSM8K, MedQA) with curated datasets. No real-world deployment results reported.

  2. Production Integration: The modular architecture supports hot-swapping domain stacks without model reloading, enabling production systems to add/remove capabilities dynamically. However, no actual production deployment described.

  3. Hardware Constraints: Demonstrated on consumer hardware (TinyLlama) and cloud instances (Gemma 3 12B on Colab G4), suggesting reasonable accessibility. Disk-offloaded inference system addresses memory scalability concerns.

  4. Domain Generalization: The cognitive primitives finding (medical prompts routing to chat+math 97% of time) suggests potential for zero-shot domain transfer, but requires validation on truly unseen domains and real-world medical applications.

  5. Sim-to-Real Gap: No discussion of performance differences between curated training datasets and real-world domain-specific data. The PSN pretraining experiment provides some evidence of capability transfer but uses synthetic TinyStories data.

Limitations & Failure Modes
  1. EVALUATION: Limited to 200 samples per benchmark where differences fall within sampling noise (±0.07 CI). No comparison to other continual learning methods or recent MoE-LoRA baselines.

  2. FUNDAMENTAL: The cognitive primitives claim relies on a single finding (medical→chat+math routing) that could be dataset artifact. Medical flashcards may not represent real clinical reasoning complexity.

  3. ENGINEERING: Meta-router training requires exhaustive domain combination testing, scaling quadratically with domain count. Dataset contamination issues required manual decontamination subsystem.

  4. ENGINEERING: 2× inference overhead from MoE routing plus CPU-GPU shuttling latency for frozen stacks may limit real-time applications.

  5. FUNDAMENTAL: Null-space projection assumes domains occupy orthogonal subspaces, but with only 64 directions per domain in 3840-dim space, capacity limits unclear beyond 50+ domains.

    Failure Modes:

    • Router training data contamination causes domain conflation (reasoning confused with code due to OpenThoughts formatting)
    • Ungated stack accumulation causes catastrophic interference where all stacks fire simultaneously, drowning coherent output with magnitude accumulation

Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving

Authors: Yun Li, Yidu Zhang, Simon Thompson, Ehsan Javanmardi et al. (5 authors) · Institution: University of Tokyo · Category: cs.RO

CSN restructures VLA text inputs through intent-constraint causal alignment using explicit linguistic connectives, achieving +31.1% driving performance improvement in CARLA simulation without model retraining.

Practical Takeaway: If you’re working on VLA models for autonomous driving, the key insight is that explicitly linking navigation intent to environmental constraints through causal connectives (BUT, YIELD_BEFORE, BECAUSE) can provide substantial performance gains without model retraining. The 39.1% contribution from causal structure alone suggests that text organization matters beyond information quantity. However, the approach currently requires privileged environmental data, so focus on developing robust perception pipelines that can extract the structured information CSN needs. The component interaction findings (CSN+Safety degradation) also highlight that safety interventions must be carefully designed to avoid interfering with improved decision-making from better text conditioning.

Tags: autonomous-driving vision-language-action causal-reasoning text-conditioning safety-supervision carla-simulation preference-learning runtime-monitoring

Task & Setting
  1. Real-world context: Vision-Language-Action (VLA) models for autonomous driving must integrate diverse textual inputs including navigation commands, hazard warnings, and traffic state descriptions. Current systems present these as disconnected fragments, forcing models to discover which environmental constraints are relevant to the current maneuver. This creates a reasoning bottleneck where the model must internally recover missing causal dependencies.

  2. Task definition: The input consists of RGB images and LiDAR point clouds from a vehicle-mounted sensor suite, combined with natural language descriptions of navigation intent and environmental conditions. The output is a sequence of predicted waypoints in the ego-vehicle frame. The formal objective minimizes waypoint prediction loss:

    \[\mathcal{L} = \sum_{i=1}^{N} \lVert w_i - w_i^* \rVert_2\]

    where $w_i$ are predicted waypoints and $w_i^*$ are ground truth waypoints.

  3. Evaluation criteria: Success is measured using CARLA Leaderboard metrics - Driving Score (DS) as primary metric (route completion weighted by infraction penalty), Route Completion (RC) percentage, and Infraction Score (IS) as cumulative penalty multiplier.

  4. Dataset: Multi-town closed-loop CARLA evaluation on 16 routes across 8 towns including night, rain, and fog conditions. Training uses 51,124 Plackett-Luce preference samples from CARLA Town01 with ranked driving actions.
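
To make the Driving Score definition in item 3 concrete, the sketch below assumes the usual CARLA Leaderboard convention: each route's score is its route completion multiplied by its infraction score, and the reported DS is the mean over routes. The per-route numbers are invented for illustration.

```python
def driving_score(routes):
    """routes: list of (route_completion_percent, infraction_score) pairs, one per route."""
    return sum(rc * inf for rc, inf in routes) / len(routes)


# Made-up per-route results, not values from the paper.
routes = [(100.0, 0.6), (40.0, 1.0), (30.0, 0.8)]

ds = driving_score(routes)
mean_rc = sum(rc for rc, _ in routes) / len(routes)
mean_is = sum(inf for _, inf in routes) / len(routes)

print(f"DS = {ds:.2f}")
print(f"mean RC x mean IS = {mean_rc * mean_is:.2f}")
```

Because DS averages per-route products, it generally differs from the product of the averaged RC and IS, so the three headline metrics cannot be derived from one another.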

Architecture & Method
  1. Base VLA architecture: LMDrive with LLaMA-7B backbone and ResNet-50 vision encoder for processing camera images and LiDAR point clouds

  2. Causal Scene Narration (CSN): Three-stage text restructuring pipeline that runs on CPU at inference time:
    - Quantitative grounding: Replace vague terms with metric values (distances, speeds, timing)
    - Structured separation: Organize information into a four-part sequence covering ego-state, road topology, traffic signals, and causal reasoning
    - Intent-constraint alignment: Link navigation intent to relevant constraints using causal connectives

  3. Text generation uses conflict-side analysis and connective selection rule:

    \[\gamma(I, c_k) = \begin{cases} \text{BUT} & \text{if } \tau_k = \text{blocking} \\ \text{YIELD\_BEFORE} & \text{if } \tau_k = \text{temporal} \\ \text{BECAUSE} & \text{if } \tau_k = \text{explanatory} \end{cases}\]

  4. Runtime safety supervisor: Simplex architecture with semantic safety envelope monitoring direction consistency and stuck detection

  5. Training enhancement: PL-DPO-NLL combining Plackett-Luce preference optimization with NLL regularization:

    \[\mathcal{L}_{\text{PL-DPO-NLL}} = \mathcal{L}_{\text{PL-DPO}} + \lambda \cdot \mathcal{L}_{\text{NLL}}\]
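
The connective selection rule $\gamma(I, c_k)$ from item 3 is a simple lookup on the constraint type $\tau_k$. A minimal sketch follows; the mapping comes from the rule above, while the `narrate` helper, the sentence template, and the example strings are hypothetical and only show where the connective would sit in the text.

```python
from enum import Enum


class ConstraintType(Enum):
    BLOCKING = "blocking"
    TEMPORAL = "temporal"
    EXPLANATORY = "explanatory"


# gamma(I, c_k): connective chosen from the constraint type tau_k.
CONNECTIVE = {
    ConstraintType.BLOCKING: "BUT",
    ConstraintType.TEMPORAL: "YIELD_BEFORE",
    ConstraintType.EXPLANATORY: "BECAUSE",
}


def narrate(intent, constraints):
    """Join an intent with its constraints (hypothetical template).

    constraints: list of (ConstraintType, description) pairs.
    """
    clauses = [f"{CONNECTIVE[kind]} {text}" for kind, text in constraints]
    return " ".join([intent] + clauses)


print(narrate(
    "Turn left at the next intersection",
    [(ConstraintType.TEMPORAL, "the oncoming vehicle 18 m ahead passes")],
))
```

Since the rule is a table lookup plus string formatting, it is consistent with the digest's claim that CSN adds negligible CPU overhead.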
Training Recipe
  1. Base model: Pre-trained LLaMA-7B with Q-Former alignment for vision-language integration from LMDrive

  2. Preference data collection: 51,124 Plackett-Luce preference samples from CARLA Town01 across 67 route configurations with expert/rejected action rankings

  3. PL-DPO-NLL training:
    - Optimizer: AdamW-8bit with learning rate 10^-5
    - Batch: 4 per device, 8 gradient accumulation steps (effective batch 32)
    - Hardware: 3× NVIDIA RTX 6000 Ada GPUs
    - Duration: 3 epochs with warmup ratio 0.03, BF16 mixed precision

  4. Architecture adaptation: LoRA adapters (r=32, α=32) applied to all attention and MLP projections

  5. CSN pipeline: Zero additional training - operates purely at inference time on CPU with <1ms overhead

Novelty & Lineage

Prior work:

  1. LMDrive (Shao et al., 2024) established VLA driving with LLaMA backbone but used template text with disconnected instruction/notice fragments.
  2. TLS-Assist (Schmidt et al., 2025) and GraphPilot (Schmidt et al., 2026) showed +14.1% and +15.6% improvements respectively through structured text enrichment without retraining.
  3. DriveVLM (Tian et al., 2024) used Chain-of-Thought but required full model retraining.

    Delta: This paper introduces intent-constraint causal alignment via explicit linguistic connectives (BUT, YIELD_BEFORE, BECAUSE) at zero GPU cost, plus semantic safety supervision and controlled ablation separating information content from causal structure.

    Applied-specific assessment: The architectural idea (causal connectives) is a straightforward application of natural language structure to VLA inputs. Benchmark gains are substantial (+31.1% DS) but achieved on a single architecture in simulation. The controlled ablation showing 39.1% of gain attributable to causal structure is methodologically sound. However, the approach relies on privileged CARLA API data and comparison is limited to one baseline method.

    The gains would likely not hold without structured environmental data extraction, limiting real-world applicability. The preference training component shows no multi-town generalization, revealing distribution shift issues.

    Verdict: INCREMENTAL — solid engineering contribution applying known linguistic structures to improve VLA text conditioning, but the core insight about causal connectives is not architecturally novel.

Benchmarks & Results
  1. CARLA multi-town evaluation (16 routes, 8 towns): CSN achieves DS 42.67±2.74 vs baseline 32.54±3.00 (+31.1% improvement)

  2. Route Completion: CSN achieves 56.5±1.7% vs baseline 48.3±2.6% (+8.2 percentage points)

  3. Infraction Score: CSN achieves 0.787±0.028 vs baseline 0.729±0.034 (higher is better, indicating fewer safety violations)

  4. PL-DPO-NLL variant: CSN achieves DS 40.45±3.79 vs 32.49±3.34 baseline (+24.5% improvement)

  5. Ablation results: Flat text (same info, no causal structure) achieves DS 38.71±1.44, showing causal structure contributes 39.1% of total gain on original LMDrive

  6. Safety supervision: Semantic monitor improves IS on both configurations, while time-to-collision (TTC) monitoring degrades performance (DS 22.02±4.07)

  7. Perception noise robustness: CSN maintains performance under ±5m distance error, ±30% speed noise, and 20% actor miss rates

    Results are mixed - strong improvements on driving score but evaluation limited to CARLA simulation on single architecture.
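
    The headline percentages follow directly from the mean driving scores listed above. A short check, using only the reported means and ignoring the ± spreads:

```python
# Mean driving scores reported above for the original LMDrive configuration.
baseline, flat_text, csn = 32.54, 38.71, 42.67

total_gain = csn - baseline        # full CSN vs baseline
structure_gain = csn - flat_text   # CSN vs flat text with the same information

relative_improvement = total_gain / baseline
structure_share = structure_gain / total_gain

print(f"relative DS improvement: {100 * relative_improvement:.1f}%")  # 31.1%
print(f"share from causal structure: {100 * structure_share:.1f}%")   # 39.1%
```

    The remaining share of the gain (flat text vs baseline) is attributable to the added information content rather than to the causal connectives.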

Compute & Efficiency
  1. Model size: LLaMA-7B backbone (~7 billion parameters) with LoRA adapters (r=32)

  2. Training compute: 3× NVIDIA RTX 6000 Ada GPUs for preference training, single RTX 3090 Ti for evaluation

  3. Inference speed: CSN pipeline adds <1ms per frame on CPU, runtime safety supervisor <0.1ms per step

  4. Memory footprint: Zero additional VRAM - CSN operates entirely on CPU, safety supervisor uses negligible memory for map queries

  5. Deployment practicality: High - CSN requires no GPU memory and can be applied to any text-input VLA without retraining, though currently requires privileged simulation data for environmental state extraction

Real-World Applicability
  1. Simulation-only evaluation: All experiments conducted in CARLA 0.9.10 simulator with no real vehicle testing

  2. Privileged information dependency: CSN currently uses CARLA’s Python API for precise environmental data extraction (actor positions, traffic light states)

  3. Perception noise robustness tested: System maintains performance under realistic sensing errors (±5m distance, ±30% speed noise, 20% actor miss rate)

  4. Real-world deployment requirements: Would need integration with actual perception pipeline to replace privileged API calls with vision-based detection

  5. No hardware experiments, production integration, or physical vehicle validation reported

  6. Sim-to-real gap not addressed beyond noise injection experiments

Limitations & Failure Modes
  1. FUNDAMENTAL: Requires structured environmental data extraction which may not transfer to real-world perception systems without significant engineering

  2. FUNDAMENTAL: Single architecture evaluation (LMDrive) - generalization to other VLA models unproven

  3. ENGINEERING: Currently depends on privileged simulation API - needs integration with actual perception stack

  4. ENGINEERING: Preference training shows distribution shift issues, performing poorly on unseen towns

  5. EVALUATION: Safety supervisor evaluation limited to semantic properties, no validation on physical safety metrics

  6. EVALUATION: Multi-town benchmark still simulation-based with perfect localization and mapping

    Failure modes:

    • CSN+Safety combination degrades performance due to control clamping that truncates evasive maneuvers
    • Over-reliance on causal connectives may fail when environmental context is ambiguous or incomplete