Applied AI 5 papers

Applied AI Digest — Apr 28, 2026

Today’s Digest at a Glance

Today’s papers span physical reasoning benchmarks for robotics, memory optimization for long-context language models, closed-loop mobile manipulation systems, multimodal image generation, and thermal imagery analysis for wildfire monitoring.

Task and Motion Planning (TAMP)

Task and Motion Planning addresses the fundamental challenge of bridging symbolic reasoning with continuous control in robotics. Traditional approaches either operate purely in discrete symbolic spaces (losing geometric constraints) or purely in continuous spaces (lacking high-level reasoning). TAMP solves this by decomposing robot problems into two coupled layers: a high-level task planner that reasons about symbolic predicates and logical relationships, and a low-level motion planner that finds feasible trajectories respecting geometric and kinematic constraints.

The core mathematical framework typically involves a hierarchical search where the task planner generates abstract action sequences using languages like PDDL (Planning Domain Definition Language), while the motion planner attempts to instantiate each abstract action with concrete parameters. When motion planning fails, the task planner backtracks and explores alternative symbolic sequences. Modern TAMP systems use parameterized skills—reusable motion primitives with symbolic preconditions and effects—to bridge the abstraction gap.

TAMP essentially treats robot planning as a search through a product space of symbolic states and continuous configurations, where feasibility depends on both logical consistency and physical realizability. This allows robots to reason about complex, long-horizon tasks while respecting real-world constraints.
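The backtracking loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a real TAMP system: the skill names, the `samplers` table, and the `feasible` check are all hypothetical, the candidate skeletons are listed by hand rather than produced by a PDDL planner, and each action's parameter is sampled independently instead of threading geometric state from one action to the next.

```python
import random

def refine(skeleton, samplers, feasible, rng, tries=20):
    """Motion level: bind one continuous parameter to each abstract action."""
    bound = []
    for action in skeleton:
        for _ in range(tries):
            param = samplers[action](rng)      # sample a candidate parameter
            if feasible(action, param):        # stand-in for a geometric check
                bound.append((action, param))
                break
        else:
            return None                        # no feasible sample: refinement fails
    return bound

def plan(skeletons, samplers, feasible, seed=0):
    """Task level: try symbolic skeletons in order, backtracking on failure."""
    rng = random.Random(seed)
    for skeleton in skeletons:
        bound = refine(skeleton, samplers, feasible, rng)
        if bound is not None:
            return bound
    return None

# Toy domain (hypothetical): reaching over the obstacle is never feasible;
# other skills succeed when the sampled parameter lies in [0.2, 0.8].
samplers = {name: (lambda rng: rng.uniform(0.0, 1.0))
            for name in ("reach_over", "go_around", "grasp")}

def feasible(action, param):
    return action != "reach_over" and 0.2 <= param <= 0.8

skeletons = [["reach_over", "grasp"], ["go_around", "grasp"]]
result = plan(skeletons, samplers, feasible)
```

Here the first skeleton is logically valid but cannot be refined, so the planner falls back to the second one. That is the sense in which feasibility depends on both the symbolic and the continuous level.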

H2O Token Importance Scoring

H2O (Heavy Hitter Oracle) token importance scoring tackles the memory explosion problem in long-context language model inference. As context length grows, the key-value (KV) cache—which stores attention keys and values for all previous tokens—consumes enormous GPU memory, often becoming the bottleneck rather than computation. Naive approaches like uniform pruning or simple recency-based eviction fail because they ignore which tokens actually contribute to model predictions.

H2O computes cumulative attention scores for each token by summing how much attention it receives from all subsequent tokens: $s_j^{(l)} = \sum_{i=j+1}^N \alpha_{i,j}^{(l)}$ where $\alpha_{i,j}^{(l)}$ is the attention weight from token $i$ to token $j$ at layer $l$. Tokens with high cumulative attention are deemed “heavy hitters” and retained in the cache, while low-scoring tokens are evicted. This preserves the most influential historical context while dramatically reducing memory usage.

The key insight is that attention patterns naturally identify which past tokens remain relevant for future predictions, making attention weights themselves an effective proxy for token importance. Some variants also incorporate value magnitudes to account for the actual information content being retrieved, not just attention frequency.
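A minimal NumPy sketch of the scoring rule above, for a single layer. It assumes the attention matrix has already been averaged over heads and that `attn[i, j]` holds the weight from token i to token j; the function names are illustrative and do not come from any H2O codebase. The value-aware variant simply rescales each score by the norm of the token's value vector.

```python
import numpy as np

def h2o_scores(attn):
    """Cumulative attention each token j receives from later tokens i > j.

    attn[i, j] is the attention weight from token i to token j for one layer,
    assumed already averaged over heads (a simplification).
    """
    attn = np.asarray(attn, dtype=float)
    return np.tril(attn, k=-1).sum(axis=0)   # strictly lower triangle keeps i > j

def value_aware_scores(attn, values, p=2):
    """Variant that scales each score by the p-norm of the token's value vector."""
    norms = np.linalg.norm(np.asarray(values, dtype=float), ord=p, axis=1)
    return norms * h2o_scores(attn)

def keep_heavy_hitters(scores, budget):
    """Indices of the `budget` highest-scoring tokens, in original order."""
    top = np.argsort(-np.asarray(scores), kind="stable")[:budget]
    return np.sort(top)

attn = [[1.0, 0.0, 0.0, 0.0],
        [0.5, 0.5, 0.0, 0.0],
        [0.6, 0.1, 0.3, 0.0],
        [0.7, 0.1, 0.1, 0.1]]
scores = h2o_scores(attn)              # [1.8, 0.2, 0.1, 0.0]
kept = keep_heavy_hitters(scores, 2)   # tokens 0 and 1
```

Note that the most recent tokens have few or no later tokens attending to them, so they score low under this formula by construction. A practical eviction policy would likely need to protect a recency window as well, which this sketch omits.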

Reading Guide

KinDER provides systematic evaluation infrastructure that exposes limitations in current TAMP approaches, while ANCHOR demonstrates how physically grounded symbolic planning can achieve robust performance through continuous state re-anchoring. DepthKV’s layer-dependent KV cache allocation reveals that different transformer layers have varying sensitivity to memory constraints, complementing MMCORE’s approach of decoupling reasoning from generation. WildFireVQA highlights the ongoing challenge of multimodal reasoning in specialized domains, particularly when incorporating temperature-critical thermal imagery alongside traditional RGB data.


KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

Authors: Yixuan Huang, Bowen Li, Vaibhav Saxena, Yichao Liang et al. (12 authors) · Institution: Princeton University · Category: cs.RO

KinDER introduces a comprehensive benchmark for robot physical reasoning with 25 environments systematically isolating five core challenges, revealing substantial gaps in current TAMP, RL, IL, and foundation model approaches.

Practical Takeaway: For research engineers, KinDER provides a standardized evaluation framework for physical reasoning that was previously lacking. The benchmark reveals significant gaps in current approaches: even the strongest methods achieve <60% success on many tasks. The object-centric state representation is practically useful for environments with a variable number of objects. If you work on robot learning, consider using KinDER to evaluate your approach systematically across different physical reasoning challenges. The mixed results suggest that combining the complementary strengths of different methods (planning robustness plus learning adaptability) could be promising.

Tags: robotics physical_reasoning manipulation benchmarking task_and_motion_planning imitation_learning reinforcement_learning foundation_models

Task & Setting

Real-world Context: Robots operating in physical environments face complex reasoning challenges combining spatial understanding, tool use, manipulation, and physics. Current benchmarks mix physical reasoning with perception and language understanding, making it hard to isolate and advance core physical reasoning capabilities. This is a critical gap as physical reasoning underlies many manipulation and navigation tasks.

Task Definition: KinDER introduces 25 procedurally generated environments across four categories (Kinematic2D, Dynamic2D, Kinematic3D, Dynamic3D). Each environment provides:

  • Object-centric states mapping object names to feature vectors (poses, velocities, geometry)
  • Actions as configuration changes constrained by physics
  • Sparse rewards: $r_t = -1$ until goal achievement, then termination

Evaluation Criteria: Success measured by:

  1. Success Rate (SR): binary task completion
  2. Cumulative Rewards (Rwd): efficiency measure (only for successful episodes)
  3. Inference Time: wall-clock planning/execution time per episode
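The state format, sparse reward, and the first two metrics can be illustrated with a short sketch. The feature layout in `state` and the `Episode` record are hypothetical stand-ins, not KinDER's actual schema or API, and the sketch assumes every step taken costs -1, including the step that reaches the goal.

```python
from dataclasses import dataclass

# Object-centric state: object name -> feature vector. The feature layout
# (x, y, theta) is a hypothetical example, not KinDER's actual schema.
state = {
    "robot":  [0.0, 0.0, 0.0],
    "block0": [1.5, 0.2, 0.0],
    "button": [3.0, 1.0, 0.0],
}

@dataclass
class Episode:
    steps: int      # actions taken before termination or timeout
    success: bool   # whether the goal was reached

def episode_return(ep: Episode) -> int:
    # Assumes every step taken costs -1, including the one that reaches the goal.
    return -ep.steps

def summarize(episodes):
    """Return (success rate, mean return over successful episodes only)."""
    wins = [episode_return(e) for e in episodes if e.success]
    success_rate = len(wins) / len(episodes)
    mean_return = sum(wins) / len(wins) if wins else None
    return success_rate, mean_return

episodes = [Episode(10, True), Episode(30, True), Episode(100, False)]
sr, rwd = summarize(episodes)   # sr = 2/3, rwd = -20.0
```

Because the state is a mapping keyed by object name, adding or removing objects changes the number of entries rather than the shape of a fixed vector, which is what makes variable object counts straightforward.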

Dataset: 25 environments with unlimited procedural generation, plus ≥100 demonstrations for 10 environments. The benchmark also includes teleoperation interfaces for data collection and pre-implemented parameterized skills and concepts.

Architecture & Method

KinDER is a benchmark suite rather than a single architecture, evaluating diverse approaches:

  1. Task and Motion Planning: Bilevel Planning (BP) using search-then-sample TAMP with parameterized skills and PDDL operators

  2. Foundation Models: LLM/VLM planning (GPT-5.2) with object-centric states or RGB inputs, with/without in-context examples

  3. Model-based Methods: MPC with ground-truth dynamics, MBRL with learned neural transition models

  4. Reinforcement Learning: PPO (on-policy) and SAC (off-policy) with sparse rewards

  5. Imitation Learning: Diffusion Policy with RGB images ± object states, trained on 100 demonstrations per environment

  6. Vision-Language-Action: Fine-tuned π0.5 VLA on demonstration data

  7. Generative Planning: Generative Skill Chaining (GSC) using diffusion models for skill sequencing

    The core technical contribution is the systematic isolation of five physical reasoning challenges (spatial relations, nonprehensile manipulation, tool use, geometric constraints, dynamics) across environments spanning 2D/3D and kinematic/dynamic complexity.

Training Recipe

Training details vary by baseline method:

  1. Planning Methods (BP, LLM/VLM variants): No training required, direct inference

  2. RL Methods (PPO, SAC):
      • Data: Environment interaction with sparse rewards (-1 per step until success)
      • Optimizer details: Not reported
      • Hardware/time: Not reported

  3. Imitation Learning (DP, DPES):
      • Data: 100 demonstrations per environment (mix of teleoperation and planning-generated)
      • Training details: Not reported
      • Hardware/time: Not reported

  4. VLA Fine-tuning:
      • Data: Same 100 demonstrations as DP/DPES
      • Base model: π0.5 VLA (pretrained)
      • Fine-tuning details: Not reported

  5. MBRL:
      • Data: Uses demonstrations to train neural transition models
      • Model details: Not reported

  6. GSC:
      • Data: Demonstrations with skill labels
      • Diffusion model training: Not reported

    Most implementation details are not reported beyond high-level method descriptions.

Novelty & Lineage

Prior Work: Closest benchmarks include (1) LIBERO (2023) - lifelong robot learning with tabletop manipulation, focused on long horizons and task diversity; (2) BEHAVIOR-1k (2024) - large-scale household tasks benchmark emphasizing application complexity; (3) Virtual Tools (2020) - 2D physical reasoning but limited to tool use scenarios.

Delta: KinDER adds:

  • First benchmark systematically isolating five core physical reasoning challenges
  • Both 2D and 3D environments with procedural generation
  • Direct comparison framework across TAMP, RL, IL, and foundation models
  • Object-centric state representations enabling variable object numbers
  • Standardized evaluation with 13 implemented baselines

Applied-Specific Assessment:

  • Architecture: Not novel architectures but novel evaluation framework - useful but incremental
  • Benchmark gains: Mixed results across methods, with gaps showing room for improvement rather than breakthrough performance
  • Comparisons: Fair within-benchmark but limited scale (50 episodes × 5 seeds)
  • Generalization: Results show fundamental limitations across all current approaches

Verdict: INCREMENTAL — Solid benchmarking contribution that fills an important gap, but represents expected extension of existing benchmark paradigms rather than breakthrough methodology.

Benchmarks & Results

Results across 8 representative environments (means over 250 episodes):

  1. Motion2D: BP (1.00), LLMCon/VLMCon (1.00), MPC (0.92) - simple navigation mostly solved
  2. StickButton2D: BP (0.99), MPC (0.68), VLA (0.53) - tool use challenging for learning methods
  3. DynObstruction2D: VLA (0.50), MPC (0.41) - dynamic reasoning helps VLA, planning struggles
  4. DynPushPullHook2D: VLA (0.43), others ≈0.0 - only VLA handles complex tool manipulation
  5. BaseMotion3D: BP/LLMCon/VLMCon/LLMPlan/VLMPlan (1.00) - 3D navigation solved by planning
  6. Transport3D: BP (0.46), LLMCon (0.36), others ≈0.0 - 3D manipulation very challenging
  7. Shelf3D: BP (1.00), LLMCon (0.55), others <0.15 - geometric constraints difficult
  8. SweepIntoDrawer3D: DP (0.14), others ≈0.0 - long-horizon dynamic tasks nearly unsolved

    Overall Success Rates: BP (0.57), LLMCon (0.43), VLMCon (0.43), VLA (0.32), others <0.30

    Mixed results show substantial room for improvement across all method families. Many challenging environments remain largely unsolved.

Compute & Efficiency
  1. Model Size: Not reported for learning baselines; planning methods use GPT-5.2 API calls

  2. Training Compute: Not reported for RL/IL methods; VLA fine-tuning details not provided

  3. Inference Speed: Wide variation across methods:
      • RL methods: ~0.002-0.02 sec/step (fastest)
      • Planning methods: 0.01-73.8 sec/episode
      • Learning methods: 0.23-571.92 sec/episode
      • BP planning time scales poorly: 1.51s (1 button) → 38.39s (10 buttons)

  4. Memory Footprint: Not reported

  5. Deployment Assessment: Simulation-only evaluation limits real-world deployment insights. Real-to-sim-to-real demo shows promise but limited scope. Engineering cost varies dramatically: RL/IL require minimal setup vs. TAMP requiring skills/concepts engineering.

Real-World Applicability
  1. Real-to-sim-to-real Validation: Single demonstration on TidyBot++ mobile manipulator for Shelf3D task using overhead camera localization and object pose estimation

  2. Simulation Fidelity: Uses established physics backends (MuJoCo, PyBullet, Pymunk) but simplified compared to real-world complexity

  3. Hardware Requirements: Environments designed for standard mobile manipulator (TidyBot++ with 7DOF Kinova arm), but mainly simulation-focused

  4. Domain Transfer: Limited validation of sim-to-real transfer. Object-centric representations may help generalization but not thoroughly tested

  5. Production Integration: No production deployment results. Benchmark designed primarily for research evaluation rather than deployment readiness

Limitations & Failure Modes
  1. FUNDAMENTAL: Simulation-based evaluation cannot capture all real-world physics complexity and interaction nuances

  2. ENGINEERING: Limited training details reported make reproduction difficult; baseline implementations may not be optimally tuned

  3. EVALUATION: Small evaluation scale (50 episodes × 5 seeds) may not fully capture method reliability; limited real-world validation

  4. FUNDAMENTAL: Excludes important factors like partial observability, stochasticity, multi-robot coordination, diverse embodiments

  5. ENGINEERING: High engineering cost for planning methods (requiring skills/concepts) limits accessibility

    Failure Modes:

    • Planning methods fail on dynamic environments requiring real-time adaptation
    • Learning methods fail on long-horizon tasks with sparse rewards and limited demonstrations

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

Authors: Zahra Dehghanighobadi, Asja Fischer · Institution: Ruhr University Bochum · Category: cs.CL

DepthKV improves KV cache pruning by allocating memory budgets across transformer layers based on their sensitivity to pruning rather than using uniform allocation.

Practical Takeaway: If you’re working on long-context LLM inference, the key insight is that transformer layers have different sensitivity to KV pruning - don’t assume uniform importance. The InfoNCE metric provides a reasonable proxy for layer robustness. While performance gains are modest (2-3 ROUGE points), the method is practical since it works with existing models and requires minimal overhead. Consider implementing layer-dependent allocation if you’re already using attention-based KV pruning, especially for memory-constrained deployments where every percentage point of memory savings matters.

Tags: KV-cache long-context LLM-inference attention-mechanism memory-optimization transformer-layers pruning efficiency

Task & Setting

Long-context language model inference faces a critical memory bottleneck due to the key-value (KV) cache, whose memory footprint grows linearly with sequence length. As context windows scale toward millions of tokens, the cache becomes prohibitively large and inference becomes expensive in both compute and memory.

The task is to prune KV cache entries during inference while maintaining performance. Given an input sequence of length N and a transformer with L layers, each layer produces key-value tensors K^(l), V^(l) ∈ R^(N×d). The goal is to select token subsets S^(l) ⊆ {1,…,N} to retain at each layer under a global budget constraint:

\[\sum_{l=1}^L B^{(l)} = B_{total}\]
where B^(l) = |S^(l)| is the budget allocated to layer l, i.e. the number of tokens retained at that layer.

Success is measured by task performance under fixed memory constraints using: ROUGE scores (1, 2, L) and SBERT similarity for summarization; exact match accuracy and F1 scores for QA; YapScore for output verbosity analysis.

Evaluation covers 7 datasets: arXiv, PubMed, GovReport, LegalCase (summarization); Qasper, HotpotQA (QA); GSM-∞ (reasoning). Input lengths range from 934 to 5,797 words, and all experiments use a 60% global KV pruning ratio.

Architecture & Method
  1. Base approach uses H2O token importance scoring during prefill stage, computing cumulative attention for each token j at layer l:

    \[s_j^{(l)} = \sum_{i=j+1}^N \alpha_{i,j}^{(l)}\]
  2. Optional value-aware variant weights by value magnitude:

    \[s_j^{(l)} = \|V_j^{(l)}\|_p \sum_{i=j+1}^N \alpha_{i,j}^{(l)}\]
  3. DepthKV allocates KV budgets non-uniformly across layers using three strategies:
      • Middle-Layer Protection (MLP): preserves layers ⌊L/2⌋ and ⌊L/2⌋+1
      • Metric-Guided Allocation (MGA): uses InfoNCE scores to rank layer importance
      • Mixed strategies (MLMA): combines position-based and metric-guided allocation

  4. InfoNCE metric measures representation robustness to perturbations as proxy for layer importance
  5. Maintains constraint that average pruning ratio across layers equals global target ρ
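To make the budget constraint concrete, the sketch below splits a global KV budget across layers in proportion to a per-layer importance score while keeping the average pruning ratio at the global target. The proportional rule, the leftover handling, and the `importance` values are assumptions made for illustration. The paper ranks layers with InfoNCE and defines its own allocation strategies, which this does not reproduce.

```python
import numpy as np

def allocate_budgets(importance, n_tokens, prune_ratio):
    """Per-layer token budgets B^(l) that sum to the global budget B_total."""
    w = np.asarray(importance, dtype=float)
    w = w / w.sum()
    n_layers = len(w)
    total = round((1 - prune_ratio) * n_tokens * n_layers)   # B_total
    # Proportional share, capped because a layer cannot keep more than N tokens.
    budgets = np.floor(np.minimum(w * total, n_tokens)).astype(int)
    # Hand out any leftover tokens one at a time, most important layer first.
    order = np.argsort(-w, kind="stable")
    leftover = total - int(budgets.sum())
    i = 0
    while leftover > 0:
        layer = order[i % n_layers]
        if budgets[layer] < n_tokens:
            budgets[layer] += 1
            leftover -= 1
        i += 1
    return budgets

# Four layers, 100 prompt tokens, 60% global pruning (hypothetical scores).
budgets = allocate_budgets([1, 3, 3, 1], n_tokens=100, prune_ratio=0.6)
# budgets -> [20, 60, 60, 20]; mean keep ratio = 160 / 400 = 0.4
```

A uniform baseline corresponds to equal importance scores, which would give every layer 40 tokens in this example. Protecting specific layers, as in Middle-Layer Protection, could be approximated by assigning them a large score.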
Training Recipe
  1. No training required - purely inference-time method
  2. Uses pretrained models: Gemma-7B-IT, LLaMA-3.1-8B-Instruct, Qwen2.5-7B-Instruct
  3. Pruning applied during chunked prefill with 1024 token chunks
  4. KV cache remains fixed during decoding phase
  5. Deterministic greedy decoding with 500 token max generation length
  6. Hardware: 8× NVIDIA H200 GPUs
  7. Global KV cache reduction ratio: 60% across all experiments

Novelty & Lineage

Prior work: H2O (Zhang et al. 2023) introduced attention-based KV pruning with uniform allocation across layers. StreamingLLM (Xiao et al. 2023) used sliding windows with attention sinks. SnapKV (Li et al. 2024) selected tokens based on attention patterns.

Delta: This paper challenges the uniform allocation assumption by showing layers have significantly different sensitivity to pruning through statistical analysis (permutation test p<0.05). Introduces InfoNCE as a predictor of layer importance and develops allocation strategies that redistribute KV budgets based on layer sensitivity.

Applied-specific assessment:

  • Architectural idea: Non-obvious insight that layer sensitivity varies significantly under KV pruning, contradicting uniform allocation assumptions
  • Benchmark gains: Consistent but modest improvements (e.g., ROUGE-1: 26.75→29.75 on arXiv)
  • Fair comparisons: Same global budget constraint ensures fair comparison to uniform baselines
  • Generalization: Improvements modest and may not hold with different pruning ratios or models

The core insight about layer heterogeneity is valuable but gains are incremental. Method builds naturally on established attention-based importance scoring.

Verdict: INCREMENTAL — solid extension of existing KV pruning with useful layer-wise insights but modest performance gains.

Benchmarks & Results
  1. arXiv summarization: ROUGE-1 improves from 26.75 (uniform) to 29.75 (MGA), SBERT from 55.09 to 61.98
  2. GovReport summarization: ROUGE-1 improves from 26.76 to 28.43 (MGA), mixed results across other metrics
  3. PubMed and LegalCase: Results not fully reported in main tables
  4. HotpotQA QA: Exact match varies by model - Gemma 12%→23% (MLP), LLaMA 47%→67% (MGA)
  5. Qasper QA: Exact match improvements - Gemma 6%→40% (MLMA-6L), LLaMA 54%→64% (MLMA-6L)
  6. GSM-∞ reasoning: All DepthKV variants outperform uniform baselines, with improvements ranging 5-25 percentage points
  7. LLM-as-a-judge scores: MGA consistently highest across correctness, completeness, conciseness dimensions

    Results show consistent but modest improvements across diverse tasks. Performance gains are task and model dependent.

Compute & Efficiency
  1. Model size: 7-8B parameters (Gemma-7B, LLaMA-3.1-8B, Qwen2.5-7B)
  2. Training compute: Not applicable (inference-only method)
  3. Inference speed: Not reported, method adds computational overhead for importance scoring
  4. Memory footprint: 60% KV cache reduction maintained across all methods for fair comparison
  5. Deployment practicality: High - method works with existing pretrained models without retraining, requires only attention score computation during prefill

Real-World Applicability
  1. No deployment results or production integration reported
  2. No hardware experiments beyond single-node evaluation
  3. Method designed for practical deployment with existing pretrained models
  4. Evaluation limited to curated benchmarks, no real-world data testing
  5. Chunked prefill implementation suggests practical considerations for long sequences
  6. 60% memory reduction could enable deployment in memory-constrained environments

Limitations & Failure Modes
  1. FUNDAMENTAL: Non-query-aware approach may miss tokens important for specific decoding queries
  2. FUNDAMENTAL: Head-specific behaviors ignored by aggregating attention across heads
  3. ENGINEERING: Limited to 60% pruning ratio - behavior at other ratios unclear
  4. ENGINEERING: InfoNCE computation adds overhead during inference
  5. EVALUATION: Evaluation limited to 7B-8B models, scalability to larger models unproven
  6. EVALUATION: No comparison to query-aware methods or recent learned approaches

    Failure modes:

    • Aggressive pruning of critical layers could cause severe performance degradation
    • Method may fail when layer importance patterns differ significantly from training-time behavior

ANCHOR: A Physically Grounded Closed-Loop Framework for Robust Home-Service Mobile Manipulation

Authors: Jinhao Jiang, Shengyu Fang, Sibo Zuo, Yujie Tang et al. (5 authors) · Institution: Beijing Institute of Technology · Category: cs.RO

ANCHOR is a closed-loop mobile manipulation framework that physically grounds symbolic planning with continuous state re-anchoring and operability-aware base alignment, achieving 71.7% task success vs 53.3% baseline through structured hierarchical recovery.

Practical Takeaway: For research engineers, the key takeaway is that explicit physical grounding and structured failure recovery substantially improve robustness in mobile manipulation, at least relative to the open-loop OK-Robot baseline evaluated here. The operability-aware base alignment using dual-ellipsoid reachability constraints is a practical technique worth implementing: it addresses the common “arrived but inoperable” failure mode where navigation succeeds but manipulation fails due to kinematic infeasibility. The hierarchical recovery mechanism (L1 local retries, L2 task replanning) provides a principled alternative to global replanning on every failure. The physically anchored planning approach of continuously re-validating symbolic predicates against sensor observations is implementable and could improve robustness in other symbolic reasoning systems. Watch for extensions to more complex manipulation tasks and outdoor environments.

Tags: mobile_manipulation open_vocabulary closed_loop_control task_planning robotic_systems hierarchical_recovery base_alignment physical_grounding

Task & Setting

The paper addresses the challenge of robust long-horizon mobile manipulation in domestic environments, where robots must navigate, locate objects, and physically interact with them under open-vocabulary natural language instructions. Domestic settings are particularly challenging due to evolving layouts, open-set objects, frequent occlusions, and human activity that continually perturbs the scene.

The task is defined as follows: given a natural language instruction (e.g., “Put the orange in the bowl”), the robot must 1) explore to locate the target object using RGB-D observations, 2) navigate to an operationally feasible base pose, 3) grasp the object, 4) navigate to the receptacle, and 5) place the object. The input modalities are RGB-D sensor data and natural language commands. The robot operates in previously unseen indoor environments with no pre-built maps.

The objective is to maximize task completion rate while handling disturbances and environmental changes. The formal objective can be stated as:

\[\max P(\text{success}) = P(\text{find} \cap \text{align} \cap \text{grasp} \cap \text{place})\]

Success is measured by task completion rate (percentage of trials where the full pick-and-place sequence is completed without human intervention), step-wise success rates for each phase, and recovery rate (fraction of detected execution anomalies successfully resolved).

The evaluation involves 60 real-robot trials across three difficulty levels: Level 1 (target in initial field of view), Level 2 (requires navigation), and Level 3 (includes mid-task perturbations like object displacement).

Architecture & Method

ANCHOR is a closed-loop framework integrating three coupled mechanisms:

  1. Physically Anchored Task Planning (PATP): Enforces state consistency by grounding symbolic predicates in observable geometric evidence. An LLM (GPT-4) generates the PDDL problem skeleton, the Fast Downward classical planner produces the action sequence, and execution proceeds in receding-horizon fashion: only the first action is dispatched per cycle.

  2. Operability-Aware Base Alignment: Reformulates base pose selection as optimization problem using offline-learned dual-ellipsoid reachability shell surrogate. The optimization objective is:

    \[J(x) = w_a J_{align} + w_s J_{shell} + w_c J_{chassis}\]

    where the shell risk term penalizes obstacle intrusion:

    \[J_{shell} = \frac{1}{|P|} \sum_{p \in P} \sigma(\alpha(1-d_{out}(p))) \cdot \sigma(\alpha(d_{in}(p)-1))\]
  3. Minimum-Responsible-Layer Hierarchical Recovery: Two-tier recovery system with L1 (manipulation-level local retries) and L2 (task-level replanning) to localize failures and prevent cascading errors.

    The core contribution is the tight coupling of symbolic reasoning with continuously verified physical state, avoiding stale semantic maps and ensuring navigation endpoints are kinematically feasible for manipulation.
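The shell risk term J_shell above can be evaluated numerically as follows. This is a sketch under assumptions the summary does not spell out: P is taken to be a set of obstacle points expressed in the arm's frame, and d_in and d_out are taken to be normalized distances to the inner and outer ellipsoids (below 1 inside, above 1 outside). The radii and the sharpness `alpha` are made-up values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ellipsoid_distance(points, center, radii):
    """Normalized distance: <1 inside the ellipsoid, 1 on its surface, >1 outside."""
    points = np.asarray(points, dtype=float)
    return np.linalg.norm((points - center) / radii, axis=1)

def shell_risk(obstacles, center, inner_radii, outer_radii, alpha=10.0):
    """Soft fraction of obstacle points lying between the two ellipsoids."""
    d_in = ellipsoid_distance(obstacles, center, inner_radii)
    d_out = ellipsoid_distance(obstacles, center, outer_radii)
    inside_outer = sigmoid(alpha * (1.0 - d_out))    # close to 1 when d_out < 1
    outside_inner = sigmoid(alpha * (d_in - 1.0))    # close to 1 when d_in > 1
    return float(np.mean(inside_outer * outside_inner))

center = np.zeros(3)
inner = np.array([0.2, 0.2, 0.2])   # hypothetical inner ellipsoid radii (m)
outer = np.array([0.6, 0.6, 0.6])   # hypothetical outer ellipsoid radii (m)

risk_in_shell = shell_risk([[0.4, 0.0, 0.0]], center, inner, outer)
risk_far_away = shell_risk([[2.0, 0.0, 0.0]], center, inner, outer)
```

The product of the two sigmoids acts as a smooth indicator of the shell region, so the term stays differentiable in the base pose. Under these assumptions, an obstacle point inside the shell contributes a value near 1 and a distant point contributes almost nothing.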

Training Recipe

The system uses pre-trained foundation models and does not require task-specific training:

  1. Foundation Model Integration: - Data: Uses pre-trained GPT-4 for PDDL generation, Grounding DINO for object detection, Segment Anything for segmentation, AnyGrasp for grasp pose generation - No fine-tuning reported

  2. Reachability Shell Learning: - Data: Offline sampling of end-effector poses to compute manipulability scores based on collision-free IK solution ratios - Method: Dual-ellipsoid fitting to approximate high-manipulability workspace region - Hardware: Not reported for this offline phase

  3. Runtime Operation: - Hardware: NVIDIA RTX 3090 onboard for real-time inference - Execution: Real-time sense-plan-act loop at approximately 1 Hz based on execution times - No specific optimizer, learning rates, or batch sizes as this is primarily a systems integration approach

Novelty & Lineage

Prior Work: The closest systems are OK-Robot (2024) which integrates VLMs with modular navigation/manipulation but uses open-loop execution, MoTo (2025) which addresses navigation-manipulation coupling but lacks structured recovery, and HomeRobot (2023) which tackles long-horizon tasks but relies on pre-scanned semantic maps.

Delta: ANCHOR adds three specific mechanisms:

  1. physically anchored planning that continuously re-validates symbolic predicates against sensor observations rather than relying on stale maps
  2. operability-aware base alignment that explicitly optimizes navigation endpoints for manipulation feasibility using dual-ellipsoid reachability constraints, and
  3. hierarchical recovery that localizes failures to prevent cascading errors.
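One way to picture the two-tier recovery in item 3, combined with dispatching one action per cycle, is the loop below. It is a sketch of the general escalation pattern only: `execute` and `replan` are hypothetical callbacks, the retry limits are invented, and the paper's rule for deciding which layer is responsible for a failure is not modeled.

```python
def run_with_recovery(plan, execute, replan, max_local_retries=2, max_replans=3):
    """Dispatch one action per cycle; retry locally (L1) before replanning (L2)."""
    queue = list(plan)
    replans = 0
    log = []
    while queue:
        action = queue[0]                      # only the first action is dispatched
        for _ in range(1 + max_local_retries):
            if execute(action):
                log.append(("ok", action))
                queue.pop(0)
                break
            log.append(("fail", action))       # L1: local retry of the same action
        else:
            if replans == max_replans:
                return False, log              # give up: goal treated as unreachable
            replans += 1
            log.append(("L2_replan", action))
            queue = list(replan(action))       # L2: new plan from the current state
    return True, log

# Scripted outcomes (hypothetical): the first grasp attempt fails, the retry works.
outcomes = {"navigate": [True], "grasp": [False, True], "place": [True]}

def execute(action):
    return outcomes[action].pop(0)

def replan(failed_action):
    return [failed_action]

ok, log = run_with_recovery(["navigate", "grasp", "place"], execute, replan)
```

In this run the grasp failure is absorbed by a local retry, so the task-level plan is never regenerated. Replanning happens only after the local retries are exhausted.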

    Assessment:

    • The architectural idea of physically anchored state consistency is novel in application - while closed-loop control is well-known, the specific formulation of grounding symbolic predicates in geometric anchors for mobile manipulation is non-obvious
    • Benchmark gains are meaningful: 18.4pp improvement in overall success rate (53.3% → 71.7%) with particularly strong gains in disturbed conditions (25% → 55%)
    • Comparisons appear fair - same perception stack, same hardware, controlled baseline adaptation of OK-Robot
    • The gains likely depend on the modular architecture and explicit geometric reasoning, not just scale

    Verdict: SIGNIFICANT — The explicit physical grounding framework and operability-aware alignment address well-known failure modes in mobile manipulation with clear engineering impact, though the core ideas build incrementally on established closed-loop control principles.

Benchmarks & Results
  1. Overall Task Success Rate: ANCHOR achieves 71.7% vs OK-Robot* baseline 53.3%, improvement of 18.4pp
  2. Level 1 (Direct, target in FoV): ANCHOR 85.0% vs baseline 80.0%, improvement of 5.0pp
  3. Level 2 (Navigation required): ANCHOR 75.0% vs baseline 55.0%, improvement of 20.0pp
  4. Level 3 (With disturbances): ANCHOR 55.0% vs baseline 25.0%, improvement of 30.0pp
  5. Recovery Rate: ANCHOR achieves 71.4% (20/28 anomalies recovered) vs baseline 0.0%
  6. Step-wise Success - Find: Both methods 70.0%
  7. Step-wise Success - Grasp: ANCHOR 80.0% vs baseline 69.2%, improvement of 10.8pp
  8. Step-wise Success - Place: ANCHOR 100.0% vs baseline 88.9%, improvement of 11.1pp

    Results show consistent improvements across all metrics, with particularly strong gains under disturbances. The comparison to original OK-Robot (58.5%) suggests the adapted baseline is reasonable. Notable absence of comparison to other recent systems like MoTo or HomeRobot on standardized benchmarks.
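As a quick consistency check, the per-level rates can be recombined into the overall figures. The check assumes the 60 trials are split evenly, 20 per difficulty level. The summary does not state the split, but the reported percentages are whole numbers of trials under that assumption.

```python
trials_per_level = 20   # assumption: 60 trials split evenly over three levels

anchor_rates = [0.85, 0.75, 0.55]     # Levels 1-3, ANCHOR
baseline_rates = [0.80, 0.55, 0.25]   # Levels 1-3, OK-Robot* baseline

def overall(rates, n=trials_per_level):
    """Total successes and overall success percentage across levels."""
    successes = sum(round(r * n) for r in rates)
    return successes, 100.0 * successes / (n * len(rates))

anchor_wins, anchor_pct = overall(anchor_rates)
base_wins, base_pct = overall(baseline_rates)
print(anchor_wins, round(anchor_pct, 1))   # 43 71.7
print(base_wins, round(base_pct, 1))       # 32 53.3
```

Under this assumption the per-level numbers reproduce the reported overall rates of 71.7% and 53.3%. The unrounded gap is 11 trials out of 60, about 18.3pp, so the quoted 18.4pp appears to come from subtracting the rounded rates.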

Compute & Efficiency
  1. Model Size: Uses foundation models (GPT-4, Grounding DINO, SAM, AnyGrasp) - specific parameter counts not reported for integrated system
  2. Training Compute: No task-specific training required; offline reachability shell learning compute not specified
  3. Inference Speed: Approximately 1 Hz sense-plan-act cycle based on average execution times of 2.4-3.5 minutes for 6-8 step tasks
  4. Memory Footprint: Maintains 2D scene graphs plus grid/octree occupancy maps - specific memory usage not quantified
  5. Deployment Practicality: Demonstrated on a mobile platform with an onboard NVIDIA RTX 3090, which shows the system can run onboard but ties deployment to high-end edge hardware

Real-World Applicability
  1. Hardware Deployment: Extensive real-robot validation on Unitree Go2 quadruped + ARX X5 manipulator with onboard computation across 60 trials in laboratory and home office environments

  2. Environment Diversity: Testing in previously unseen indoor environments with varying object categories and layouts, though limited to structured indoor settings

  3. Robustness Testing: Systematic evaluation under three difficulty levels including unannounced perturbations (object displacement, occlusion during approach)

  4. Production Integration: No reported integration with existing robotic systems or commercial platforms

  5. Sim-to-Real: No simulation component reported - all validation conducted directly on real hardware, which is both a strength (no sim-to-real gap) and limitation (reduced scalability of evaluation)

Limitations & Failure Modes
  1. Perceptual Ambiguity - FUNDAMENTAL: VLM occasionally misidentifies objects with similar textures in cluttered scenes, leading to task-level errors inherent to current vision-language models

  2. Sensing Noise - ENGINEERING: Depth inaccuracies from RGB-D camera on reflective surfaces provide erroneous inputs to base alignment, could be addressed with better sensors or multi-modal fusion

  3. Environment Scope - EVALUATION: Testing limited to structured indoor environments; unclear how system handles outdoor settings, varying lighting, or highly dynamic scenes

  4. Scalability - ENGINEERING: Requires high-end onboard GPU (RTX 3090) limiting deployment to well-resourced platforms

  5. Foundation Model Dependencies - FUNDAMENTAL: Performance bounded by capabilities of underlying VLMs and grasp planning models

    Failure Modes:

    • System fails when depth estimation errors are extreme enough that iterative base alignment cannot converge to feasible poses
    • Task termination occurs when classical planner determines goal unreachable after object displacement beyond workspace limits

MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

Authors: Zijie Li, Yichun Shi, Jingxiang Sun, Ye Wang et al. (11 authors) · Institution: ByteDance · Category: cs.CV

MMCORE achieves efficient multimodal image generation by decoupling MLLM semantic reasoning from diffusion synthesis through explicit visual token alignment and two-stage training.

Practical Takeaway: This work demonstrates that decoupled training stages can be more efficient than end-to-end unified model training, achieving competitive results with 30% less compute. The key insight is using explicit semantic supervision (ViT feature regression) during MLLM adaptation rather than relying solely on diffusion loss. For practitioners, the dual-pathway conditioning approach (visual queries + full text embeddings) may be worth implementing if you’re building multimodal generation systems. However, be aware that full MLLM fine-tuning introduces understanding-generation trade-offs that require careful curriculum design to mitigate.

Tags: multimodal diffusion-models vision-language image-generation image-editing unified-models query-mechanisms representation-learning

arXiv · PDF

Task & Setting

MMCORE targets unified multimodal image generation and editing using vision-language model (VLM) reasoning capabilities. Current approaches face a fundamental trade-off: autoregressive models excel at semantic reasoning but produce lower-quality images, while diffusion models generate high-fidelity images but lack deep multimodal understanding. Existing unified models like Transfusion require expensive joint training from scratch, while approaches like MetaQueries suffer from fixed query budgets and weak supervision signals.

The task is defined as: given multimodal input (text instructions + optional reference images), generate or edit images that maintain semantic consistency and visual fidelity. Input modalities include variable-length text prompts and, in the multi-reference setting, 10 or more reference images. Outputs are high-resolution RGB images (resolution not specified). The core objective combines diffusion denoising loss with semantic alignment:

\[L = \lambda_1 L_{diffusion} + \lambda_2 L_{vis}\]

where $L_{vis}$ aligns learned query tokens with frozen ViT features via cosine similarity.
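
As a minimal sketch of this objective, the NumPy snippet below computes the alignment term as the mean of one minus cosine similarity between projected query features and frozen ViT features, then forms the weighted sum. The function names, the `eps` guard, and the default weights are illustrative assumptions; the paper's actual weights and projection are not reported in this digest.

```python
import numpy as np

def visual_alignment_loss(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) over N query tokens.

    pred:   (N, D) projected query features (hypothetical projection output)
    target: (N, D) frozen ViT features
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    dot = np.sum(pred * target, axis=1)
    norms = np.linalg.norm(pred, axis=1) * np.linalg.norm(target, axis=1)
    cos = dot / np.maximum(norms, eps)  # eps avoids division by zero
    return float(np.mean(1.0 - cos))

def total_loss(l_diffusion, l_vis, lambda_1=1.0, lambda_2=1.0):
    """Weighted sum of the two terms; the default weights are placeholders."""
    return lambda_1 * l_diffusion + lambda_2 * l_vis
```

The alignment term is 0 when every query feature points in the same direction as its target, 1 when they are orthogonal, and 2 when they are opposite, so it penalizes direction rather than magnitude.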

Evaluation uses DreamBench with automated metrics for prompt-image alignment, plus human evaluation across prompt alignment (English/Chinese), structural fidelity, and editing consistency. The paper also introduces multi-reference editing scenarios with 10+ input images.

Architecture & Method
  1. Multimodal LLM backbone: Fine-tuned MLLM (architecture unspecified) with learnable query tokens Q ∈ R^(N×D), where N=64 is the query budget and D is the hidden dimension

  2. Semantic visual alignment: Query tokens regress toward frozen ViT/SigLIP features via cosine similarity loss:

    \[L_{vis} = \frac{1}{N} \sum_{i=1}^N \left(1 - \frac{F_\theta(Q_i)^\top v_i}{\|F_\theta(Q_i)\| \, \|v_i\|}\right)\]
  3. Dual-pathway conditioning: Combines learned visual query tokens with original text embeddings to avoid information bottleneck

  4. Diffusion head: Pre-trained Text-to-Image MMDiT adapted with block-causal attention mask for interleaved generation

  5. Independent embedding dropout: Higher dropout on text conditioning early in training to force reliance on visual embeddings, then annealed

    The core contribution is the two-stage training with explicit semantic supervision rather than relying solely on diffusion loss, plus full MLLM fine-tuning instead of frozen adapters.
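
The block-causal attention mask mentioned for the diffusion head can be illustrated with a short sketch. It uses the common definition, where a token attends to every token in its own block and in all earlier blocks. The digest does not describe the paper's exact mask layout, so the block structure and the `block_causal_mask` helper here are assumptions.

```python
import numpy as np

def block_causal_mask(block_lengths):
    """Boolean mask M where M[i, j] is True if token i may attend to token j.

    block_lengths: number of tokens in each block, in sequence order
    (for example, one block per text segment or image in an interleaved sample).
    """
    # Assign each token the index of the block it belongs to.
    block_ids = np.repeat(np.arange(len(block_lengths)), block_lengths)
    # Attention is allowed within a block and toward earlier blocks only.
    return block_ids[:, None] >= block_ids[None, :]

# Two blocks of 2 and 3 tokens give a 5x5 mask.
mask = block_causal_mask([2, 3])
```

Unlike a token-level causal mask, tokens inside one block see each other bidirectionally, which suits image latents that have no natural left-to-right order.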

Training Recipe
  1. Stage 1 - MLLM alignment: Fine-tune MLLM backbone with visual alignment loss $L_{vis}$ only, no diffusion loss. Data scale/source not reported. Optimizer/schedule not reported.

  2. Stage 2 - Diffusion head training: Train MMDiT conditioned on frozen MLLM visual tokens using mixture of T2I and interleaved datasets. Apply independent embedding dropout with higher rates on text initially. Training details not reported.

  3. Supervised Fine-Tuning: 2K steps on curated multimodal instruction dataset significantly improves alignment scores from 0.82 to 0.86.

  4. RLHF: Mentioned but no details provided.

    Hardware: Uses internal high-performance training pipeline. Wall-clock time: not reported. Specific optimizers, learning rates, batch sizes: not reported except for ablation showing 5× batch size scaling improves performance.
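
The independent embedding dropout in Stage 2 could look roughly like the sketch below. Because the digest reports no training details, everything specific here is an assumption: the linear annealing schedule, the start and end rates, and the choice to drop a whole conditioning pathway by zeroing it.

```python
import numpy as np

def text_dropout_rate(step, total_steps, p_start=0.5, p_end=0.1):
    """Linearly anneal the text dropout rate (rates are hypothetical)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)

def apply_independent_dropout(text_emb, vis_emb, p_text, p_vis, rng):
    """Drop each conditioning pathway independently with its own probability.

    A dropped pathway is replaced by zeros (an assumed null conditioning).
    """
    drop_text = rng.random() < p_text
    drop_vis = rng.random() < p_vis
    text_out = np.zeros_like(text_emb) if drop_text else text_emb
    vis_out = np.zeros_like(vis_emb) if drop_vis else vis_emb
    return text_out, vis_out

rng = np.random.default_rng(0)
p_text = text_dropout_rate(step=0, total_steps=1000)  # high early in training
text, vis = apply_independent_dropout(
    np.ones((8, 4)), np.ones((64, 4)), p_text=p_text, p_vis=0.1, rng=rng
)
```

Sampling the two drops separately means the diffusion head regularly sees visual tokens without text, which is the mechanism the summary credits for forcing reliance on the visual embeddings.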

Novelty & Lineage

Prior work:

  1. MetaQueries (2025): Uses learnable query tokens with frozen MLLM + lightweight connector for diffusion conditioning
  2. Transfusion (2024): Joint autoregressive-diffusion training from scratch for unified multimodal generation
  3. BAGEL (2025): Separate branches for understanding/generation with massive pretraining requirements

    Delta: This paper adds:

  1. Full MLLM fine-tuning instead of frozen backbone
  2. Explicit semantic supervision via ViT feature regression rather than diffusion-only training
  3. Dual-pathway conditioning combining visual queries + full text embeddings
  4. Two-stage decoupled training instead of end-to-end optimization

    Assessment: The architectural ideas are incremental applications of known techniques. Full fine-tuning vs frozen adapters is a standard choice, not novel. Semantic distillation losses are common in multimodal learning. The core insight about decoupling training stages for efficiency is reasonable but not groundbreaking. Benchmark improvements over Seedream 4.0 appear modest (about 1.7 to 6.2 points on the reported DreamBench metrics), and comparisons may not control for compute/data differences. The approach requires significant infrastructure (“internal high-performance training pipeline”) that limits reproducibility. While the engineering is solid, this represents expected progress rather than fundamental advances.

    Verdict: INCREMENTAL — Solid engineering combining known techniques with modest improvements, but lacks non-obvious architectural insights or breakthrough capabilities.

Benchmarks & Results
  1. DreamBench Text-to-Image: MMCORE 84.42% vs GPT-Image-1 80.69%, Seedream 4.0 78.2% (+6.22 points over Seedream 4.0)

  2. DreamBench Image Editing Alignment: MMCORE 81.2% vs GPT-Image-1 79.88%, Seedream 4.0 79.55% (+1.65 points over Seedream 4.0)

  3. DreamBench Editing Consistency: MMCORE 70.62% vs Seedream 4.0 68.89%, GPT-Image-1 42.39% (+1.73 points over Seedream 4.0)

  4. Human Evaluation: Shows MMCORE leads across prompt alignment, visual fidelity, and editing consistency but specific numbers not provided in main results

  5. GPT-4o Automated Judge: Ablation shows 0.8585 vs baseline methods around 0.68-0.78 range

    Results are mixed: strong on text-to-image alignment but more modest gains on editing tasks. Standard benchmarks such as COCO, ImageNet classification, and common VQA datasets are notably absent. Improvements over Seedream 4.0 are consistent but relatively small (roughly 1.7 to 6.2 points).
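
The margins can be recomputed directly from the three DreamBench rows above. The short script below is only an arithmetic check on the reported scores; the dictionary layout and the `margins` helper are illustrative.

```python
# Reported DreamBench scores (percent), copied from the list above.
scores = {
    "Text-to-Image": {"MMCORE": 84.42, "GPT-Image-1": 80.69, "Seedream 4.0": 78.2},
    "Editing Alignment": {"MMCORE": 81.2, "GPT-Image-1": 79.88, "Seedream 4.0": 79.55},
    "Editing Consistency": {"MMCORE": 70.62, "GPT-Image-1": 42.39, "Seedream 4.0": 68.89},
}

def margins(table, model="MMCORE"):
    """Percentage-point lead of `model` over every other system, per task."""
    return {
        task: {
            other: round(row[model] - value, 2)
            for other, value in row.items()
            if other != model
        }
        for task, row in table.items()
    }

for task, lead in margins(scores).items():
    print(task, lead)
```

Reading the differences as percentage points rather than relative percentages makes the pattern explicit: the lead over Seedream 4.0 is 6.22 points on text-to-image but under 2 points on both editing metrics, while the 28.23-point lead over GPT-Image-1 on editing consistency is the largest gap in the table.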

Compute & Efficiency
  1. Model size: Not specified for final MMCORE model, only mentions “lightweight diffusion head” for ablations

  2. Training compute: Uses “internal high-performance training pipeline” but no specific GPU hours or hardware details provided

  3. Inference speed: Not reported

  4. Memory footprint: Not reported

  5. Deployment practicality: Claims 30% computational savings vs training unified models from scratch like Transfusion/BAGEL, but no absolute numbers provided. Two-stage training enables more efficient scaling, but still requires full MLLM fine-tuning which is computationally expensive.

Real-World Applicability
  1. Dataset composition: Evaluates on DreamBench (internal benchmark) with both synthetic prompts and real editing scenarios

  2. Multi-reference editing: Demonstrates handling of 10+ input images for complex composition tasks

  3. Production readiness: No specific deployment results reported, though the authors' ByteDance Seed affiliation suggests potential internal usage

  4. Robustness testing: Shows qualitative results on challenging prompts involving spatial reasoning, counterfactual scenarios, and fine-grained attribute binding

    The work appears focused on curated benchmarks rather than wild deployment scenarios. Limited evidence of real-world stress testing or production integration details.

Limitations & Failure Modes
  1. Understanding-generation trade-off (FUNDAMENTAL): Fine-tuning MLLM for generation degrades original VQA/reasoning capabilities, acknowledged as persistent challenge

  2. Performance gap with SOTA (ENGINEERING): Authors admit gap with Nano-Banana-pro/GPT Image 1.5, attributed to weaker base MLLM

  3. Visual token redundancy (FUNDAMENTAL): Learned visual tokens supplement rather than replace text conditioning, indicating incomplete semantic transfer

  4. VAE dependency (ENGINEERING): Still requires separate VAE encoder for historical context in interleaved generation

  5. Scalability concerns (EVALUATION): Full MLLM fine-tuning may not scale efficiently compared to parameter-efficient approaches

    Failure modes:

    • Severe image artifacts when fusing heterogeneous visual features (VAE + ViT embeddings)
    • Tendency to trivially copy reference inputs in complex editing scenarios

WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring

Authors: Mobin Habibpour, Niloufar Alipour Talemi, John Spodnik, Camren J. Khoury et al. (5 authors) · Institution: Clemson University · Category: cs.CV

Introduces WildFireVQA, the first large-scale VQA benchmark combining RGB and radiometric thermal UAV imagery for wildfire monitoring, revealing that current multimodal models struggle significantly with temperature-grounded reasoning despite thermal data being critical for fire analysis.

Practical Takeaway: If you’re working on emergency response or remote sensing VQA, this benchmark reveals a critical limitation: current multimodal models are surprisingly poor at thermal reasoning despite thermal data being crucial for fire monitoring. The RGB bias is strong - even when given thermal imagery, models default to visual appearance rather than temperature information. The deterministic labeling approach using physical thresholds is worth implementing for fire detection pipelines. The benchmark itself provides a valuable testbed for developing thermal-aware vision-language models, but don’t expect current MLLMs to handle temperature-grounded reasoning without significant additional development.

Tags: wildfire_monitoring thermal_imaging visual_question_answering multimodal_reasoning UAV_imagery emergency_response remote_sensing benchmark_dataset

arXiv · PDF

Task & Setting

Wildfire monitoring from aerial platforms requires real-time decision making for emergency response, but current visual question answering systems lack evaluation on wildfire-specific scenarios that combine visual and thermal data. The high-stakes environment demands accurate interpretation of fire behavior, smoke patterns, and flight safety considerations.

The task involves visual question answering on synchronized RGB and radiometric thermal UAV imagery collected during prescribed burns. Each sample contains: an RGB image, color-mapped thermal visualization, and radiometric thermal TIFF with per-pixel temperature values. The system must answer 34 multiple-choice questions per frame spanning 6 categories: presence/detection, classification, distribution/segmentation, localization/direction, cross-modal reasoning, and flight planning.

Success is measured by accuracy across question categories. The evaluation protocol tests models under RGB-only, thermal-only, and retrieval-augmented settings where radiometric thermal statistics (min/max temperature, pixels above 200°C/400°C thresholds) are provided as text context.
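
To make the retrieval-augmented setting concrete, the sketch below derives the kind of statistics described (min/max temperature and pixel counts at the 200°C and 400°C thresholds) from a per-pixel temperature array and renders them as a text line. The function names, the use of `>=` for the threshold test, and the sentence template are assumptions; the benchmark's actual prompt format is not given in this digest.

```python
import numpy as np

def thermal_stats(temps_c, thresholds_c=(200.0, 400.0)):
    """Summarize a radiometric frame given as per-pixel temperatures in Celsius."""
    temps = np.asarray(temps_c, dtype=np.float64)
    return {
        "min_c": float(temps.min()),
        "max_c": float(temps.max()),
        # Pixel counts at or above each threshold (>= is an assumed convention).
        "counts": {t: int(np.count_nonzero(temps >= t)) for t in thresholds_c},
    }

def stats_to_context(stats):
    """Render the statistics as one line of text context (hypothetical template)."""
    parts = [
        f"min temperature {stats['min_c']:.1f} C",
        f"max temperature {stats['max_c']:.1f} C",
    ]
    for threshold, count in stats["counts"].items():
        parts.append(f"{count} pixels at or above {threshold:.0f} C")
    return "Thermal statistics: " + "; ".join(parts) + "."

# Toy 2x2 frame standing in for a radiometric TIFF already loaded as an array.
frame = [[25.0, 210.0], [405.0, 199.9]]
context = stats_to_context(thermal_stats(frame))
```

Passing these numbers as text means the model never has to read temperatures off the imagery itself, which is the evaluation caveat raised under limitations.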

The benchmark contains 6,097 RGB-thermal samples with 34 questions each, yielding 207,298 total multiple-choice questions across three prescribed fire events (Sycan Marsh, Willamette, Shoetank).

Architecture & Method
  1. Dataset construction using FLAME 3 synchronized RGB and radiometric thermal UAV imagery from prescribed burns

  2. Question generation via MLLM prompting followed by manual curation, yielding 34 questions per frame across 6 operational categories

  3. Answer generation using Gemini 2.5 Pro with RGB images, color-mapped thermal visualizations, and retrieved radiometric temperature statistics

  4. Sensor-driven deterministic labeling for specific questions using mathematical formulations:
    - UAV height estimation from GPS/EXIF metadata and SRTM ground elevation
    - Thermal hotspot detection using 200°C threshold:

    \[H(x,y) = \begin{cases} 1 & T(x,y) \geq 200\,^{\circ}\mathrm{C} \\ 0 & \text{otherwise} \end{cases}\]
    - Spatial distribution analysis via PCA on hotspot centroids with linearity score:

    \[L = \lambda_1/(\lambda_1 + \lambda_2)\]
  5. Quality control through intra-frame consistency checks and inter-frame ORB feature matching for near-duplicate detection

  6. Evaluation of four MLLMs (LLaVA-v1.6-7B, Qwen3-VL-8B, InternVL2-8B, MiniCPM-V2) under controlled modality and retrieval settings

    The core contribution is the first radiometric thermal VQA benchmark for wildfire monitoring with temperature-grounded reasoning capabilities.
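
The deterministic labeling rules in step 4 are simple enough to sketch in NumPy. The snippet assumes that the two eigenvalues are ordered so the larger one is in the numerator, that hotspot centroids are already available as (x, y) pairs, and that GPS altitude and SRTM elevation share a vertical reference. None of these details are confirmed by the digest, and the function names are illustrative.

```python
import numpy as np

def hotspot_mask(temps_c, threshold_c=200.0):
    """Binary mask: 1 where the temperature is at or above the threshold."""
    return (np.asarray(temps_c, dtype=np.float64) >= threshold_c).astype(np.uint8)

def linearity_score(centroids):
    """PCA linearity of 2D points: largest eigenvalue over the eigenvalue sum."""
    pts = np.asarray(centroids, dtype=np.float64)
    if pts.ndim != 2 or pts.shape[0] < 2 or pts.shape[1] != 2:
        raise ValueError("need at least two (x, y) centroids")
    cov = np.cov(pts, rowvar=False)       # 2x2 covariance of the centroids
    lam2, lam1 = np.linalg.eigvalsh(cov)  # eigvalsh returns ascending order
    total = lam1 + lam2
    if total <= 0:
        raise ValueError("centroids are all identical")
    return float(lam1 / total)

def height_above_ground(gps_altitude_m, ground_elevation_m):
    """UAV height as GPS altitude minus terrain elevation (same datum assumed)."""
    return gps_altitude_m - ground_elevation_m

mask = hotspot_mask([[150.0, 200.0], [320.0, 80.0]])
score = linearity_score([(0, 0), (1, 1), (2, 2)])  # collinear hotspots
```

With this ordering the score ranges from 0.5 for an isotropic scatter to 1.0 for hotspots lying on a line, so a single cutoff can separate a linear fire front from scattered spot fires.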

Training Recipe
  1. No model training involved - this is a benchmark evaluation paper

  2. Models evaluated are pre-trained versions: LLaVA-v1.6-Mistral-7B, Qwen3-VL-8B-Instruct, InternVL2-8B, MiniCPM-V2

  3. Answer generation pipeline uses Gemini 2.5 Pro in zero-shot setting with multimodal prompts containing RGB image, thermal visualization, and radiometric statistics

  4. Evaluation conducted in zero-shot setting across different input modalities (RGB, Thermal) and retrieval-augmented settings

    Training details: not applicable - benchmark paper

Novelty & Lineage

Prior work:

  1. FLAME dataset series (2021-2024) provided UAV wildfire imagery with fire detection/segmentation labels
  2. Remote sensing VQA benchmarks like RSVQA, RSIVQA, and HRVQA focused on general aerial imagery understanding
  3. Disaster response VQA like FloodNet and RescueNet-VQA addressed post-disaster assessment but not active fire monitoring

    Delta: This paper adds (1) first VQA benchmark specifically for wildfire monitoring; (2) integration of radiometric thermal data with per-pixel temperature measurements, not just color-mapped visualizations; (3) operationally relevant question categories for flight planning and tactical decisions; (4) sensor-driven deterministic labeling using physical temperature thresholds rather than purely visual annotation.

    Applied-specific assessment: The architectural contribution is modest: it is primarily dataset construction methodology rather than novel algorithms. The benchmark results show RGB inputs consistently outperforming thermal inputs, with retrieved thermal statistics helping stronger models. Comparisons are fair within scope but limited to 4 models. The temperature-grounded approach is valuable, but evaluation reveals that current MLLMs struggle with thermal reasoning.

    The work addresses a legitimate gap but the technical novelty is primarily in careful dataset curation rather than algorithmic innovation. The deterministic labeling approach using physical thresholds is sensible engineering rather than breakthrough methodology.

    Verdict: INCREMENTAL — solid benchmark contribution for specialized wildfire domain but limited technical novelty beyond careful dataset construction.

Benchmarks & Results
  1. Presence and Detection: Qwen3-VL achieves 76.39% (RGB), 52.70% (Thermal), a 23.69-point gap favoring RGB

  2. Classification: Qwen3-VL best at 47.67% (RGB+RAG), 31.17% (Thermal), indicating difficulty in fire behavior categorization

  3. Distribution and Segmentation: Qwen3-VL reaches 47.20% (RGB+RAG), 39.17% (Thermal+RAG), showing thermal context helps spatial reasoning

  4. Localization and Direction: Highest scoring category with 61.50% (LLaVA RGB), but thermal performs poorly at ~17-40% across models

  5. Cross-Modal Reasoning: Benefits most from RAG - Qwen3-VL improves from 57.45% to 65.77% with retrieved thermal statistics

  6. Flight Planning: Moderate performance around 19-51% across settings, indicating operational reasoning challenges

    Overall accuracy ranking: Qwen3-VL (54.76% RGB+RAG) > LLaVA-v1.6 (52.68% RGB) > MiniCPM-V2 (49.51% RGB) > InternVL2 (47.66% RGB+RAG)

    Notable pattern: RGB consistently outperforms thermal across all models, with RAG helping stronger models but degrading weaker ones. Random baseline ranges from 15.48% to 44.79% depending on answer choices per category.
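
The dependence of the random baseline on the number of answer choices follows from a uniform guesser being right with probability 1/k on a k-option question. The sketch below averages that over a category. The option counts in the example are made up for illustration, since the digest does not list per-question choices.

```python
def random_baseline(num_choices):
    """Expected accuracy (percent) of uniform guessing over a set of questions.

    num_choices: number of answer options for each question.
    """
    if not num_choices:
        raise ValueError("need at least one question")
    if any(k < 1 for k in num_choices):
        raise ValueError("each question needs at least one option")
    return 100.0 * sum(1.0 / k for k in num_choices) / len(num_choices)

# Hypothetical category mixing binary and four-way questions.
baseline = random_baseline([2, 2, 4, 4])  # 37.5
```

Comparing each category's accuracy against its own baseline, rather than one global chance level, is what makes scores in the low end of a range such as 19-51% look close to guessing.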

Compute & Efficiency
  1. Model sizes: LLaVA-v1.6-Mistral-7B (7B parameters), Qwen3-VL-8B-Instruct (8B parameters), InternVL2-8B (8B parameters), MiniCPM-V2 (not reported)

  2. Training compute: Not applicable - benchmark evaluation only using pre-trained models

  3. Inference speed/latency: Not reported in evaluation

  4. Memory footprint: Not reported for inference setup

  5. Deployment practicality: Models evaluated are standard open-source MLLMs suitable for research deployment, though thermal processing pipeline requires radiometric TIFF handling and statistical computation which adds overhead to standard RGB-only VQA systems

Real-World Applicability
  1. Built on FLAME 3 dataset collected from actual prescribed burns in rural environments (Sycan Marsh, Willamette, Shoetank locations)

  2. UAV platform using DJI M30T with synchronized RGB and radiometric thermal cameras during active fire operations

  3. Operationally relevant question categories designed for tactical wildfire response including flight safety and asset identification

  4. Temperature thresholds (200°C, 400°C) grounded in actual fire physics for hotspot detection

  5. Dataset addresses real observability challenges like smoke occlusion and thermal signature interpretation

    However, evaluation remains on curated benchmark data rather than deployed systems. No discussion of real-time processing constraints or integration with existing wildfire response workflows.

Limitations & Failure Modes
  1. FUNDAMENTAL: Current MLLMs show poor thermal reasoning - RGB consistently outperforms thermal modalities by large margins (gaps of 20-30 percentage points), indicating models lack understanding of temperature-based fire signatures

  2. ENGINEERING: Retrieval-augmented setting provides thermal statistics as text rather than requiring models to extract temperature information directly from thermal imagery, limiting evaluation of end-to-end thermal processing

  3. EVALUATION: Limited to 4 models and zero-shot evaluation only - no fine-tuning experiments to assess whether thermal reasoning can be learned

  4. FUNDAMENTAL: Models struggle with operational flight planning questions (19-51% accuracy), suggesting limitations in safety-critical decision making

  5. ENGINEERING: Dataset limited to prescribed burns which may not capture full wildfire behavior variability compared to actual wildfires

    Failure modes:

    • Models often misinterpret thermal signatures as visual artifacts rather than temperature information
    • Cross-modal reasoning fails when RGB and thermal provide conflicting visual cues, defaulting to RGB-based responses