Applied AI 5 papers

Applied AI Digest — Apr 14, 2026

Today’s Digest at a Glance

Today’s papers explore physically deployable adversarial attacks, temporal reasoning in video models, multi-camera surveillance systems, synthetic data generation, and sim-to-real transfer for dexterous manipulation.

Triangular Light Parameterization addresses the challenge of creating physically realizable adversarial attacks against vision-language models in real-world scenarios. Traditional adversarial perturbations work well digitally but fail when physically deployed due to lighting constraints, camera noise, and geometric distortions. The core insight is to parameterize adversarial illumination using triangular light patterns with a 9-dimensional parameter space: center coordinates $(x_c, y_c)$, radius $r$ constrained by $10 ≤ r ≤ γ·\min(H,W)$ where $γ=0.2$, RGB color values $(R,G,B)$, and three polar angles $φ_1, φ_2, φ_3 ∈ [0,360°]$ that define the triangular illumination geometry. This parameterization ensures the attack patterns can be physically projected using standard lighting equipment while maintaining adversarial effectiveness. The triangular geometry provides sufficient degrees of freedom to craft targeted perturbations while remaining feasible for physical deployment through projector systems.

Layer-Selective Merging tackles the problem that video-language models often lose temporal reasoning capabilities when adapted from their text-only backbones, despite gaining visual perception. The naive approach of fine-tuning the entire model on video data can degrade the original temporal reasoning encoded in the text backbone’s self-attention layers. The technique uses evolutionary search to discover optimal layer-wise merging recipes parameterized by a gating vector $g ∈ [0,1]^L$ where $L$ is the number of self-attention layers. For each layer $\ell$, the merged parameters are computed as a weighted combination of the video-adapted model and the original text backbone: $W_{merged}^{(\ell)} = (1-g_\ell) \cdot W_{video}^{(\ell)} + g_\ell \cdot W_{text}^{(\ell)}$. The evolutionary algorithm optimizes this gating vector to maximize temporal reasoning performance while preserving visual capabilities. This allows strategic restoration of temporal reasoning in specific layers while maintaining the visual adaptations in others.

Spatio-Temporal Topology Graph (STTG) enables systematic modeling of multi-camera surveillance networks for person tracking across complex environments. Traditional person search methods treat cameras independently, losing crucial spatial and temporal relationships that human witnesses naturally understand. The STTG represents the camera network as a directed weighted graph $T = (V,E)$ where nodes correspond to cameras with semantic zone labels, and edges encode three relationship types: $\lambda_{ij} ∈ \{OVERLAP, SOFT\_ADJ, TRAVEL\}$ representing overlapping fields of view, spatial adjacency, and person movement patterns respectively. Each edge carries transition-time statistics $(t_{min}, t_{med}, t_{max}, n)$ derived from observed person movements between camera pairs. This encoding captures both the physical layout constraints and empirical movement patterns, enabling reasoning about where a person might appear next given their current location and movement history.

Reading guide: The adversarial attack and layer-selective merging papers both address robustness issues in vision-language models—one from a security perspective and the other from a capability preservation angle. The surveillance and synthetic data papers tackle different aspects of video understanding, with ARGOS focusing on multi-camera reasoning and the unified pipeline addressing training data generation. The robotics paper complements these by showing how visual understanding translates to physical manipulation tasks.


Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

Authors: Yingying Zhao, Chengyin Hu, Qike Zhang, Xin Li et al. (10 authors) · Institution: Not specified · Category: cs.CV

First physically deployable adversarial attack framework against Vision-Language Models using optimized triangular illumination patterns that significantly degrade performance across classification, captioning, and VQA tasks.

Practical Takeaway: If you’re deploying VLMs in physical environments, this work demonstrates a significant and previously unrecognized vulnerability. The attacks are surprisingly effective and practically realizable with simple equipment. As a research engineer, you should: (1) Consider illumination robustness when evaluating VLM security, especially for safety-critical applications, (2) Implement input preprocessing or adversarial training that accounts for localized lighting variations, (3) Be aware that current VLM robustness benchmarks may not capture realistic physical threats. The genetic algorithm approach for physical adversarial optimization could be adapted to other physical perturbation domains beyond lighting.

Tags: adversarial_attacks vision_language_models physical_world_attacks multimodal_security CLIP VLM_robustness illumination_attacks genetic_algorithms


Task & Setting

This work addresses the security vulnerabilities of Vision-Language Models (VLMs) in physical deployment environments. While adversarial robustness has been extensively studied in digital settings, physical-world attacks against VLMs remain largely unexplored despite VLMs being increasingly deployed in safety-critical applications like autonomous driving and security surveillance.

The task is to generate physically deployable adversarial illumination patterns that disrupt multimodal semantic understanding in VLMs. Input consists of images from the COCO dataset (300 images across 30 categories). The attack generates a triangular light pattern parameterized by 9 variables: center coordinates (x_rel, y_rel), radius r, RGB color values, and three polar angles φ₁, φ₂, φ₃. The objective is to minimize the confidence of the ground-truth label while maximizing prediction entropy:

\[L(Θ) = \log(p(\bar{t}|I_{adv}) + ε) - H(p(·|I_{adv}))\]

where Θ denotes the light parameters, $\bar{t}$ is the ground-truth label, ε is a small constant that keeps the logarithm finite, and H(·) is the Shannon entropy.
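
As a rough illustration of the objective above, the sketch below evaluates L(Θ) from a vector of class probabilities. The names `shannon_entropy` and `attack_loss`, the default `eps`, the natural-log base, and the example probability vectors are assumptions made for illustration; they are not taken from the paper.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def attack_loss(probs, true_idx, eps=1e-8):
    """L = log(p(true label) + eps) - H(p); the attacker wants this low."""
    return math.log(probs[true_idx] + eps) - shannon_entropy(probs)

# Hypothetical model outputs over four classes (class 0 is the true label).
clean = [0.97, 0.01, 0.01, 0.01]      # confident, correct prediction
attacked = [0.25, 0.25, 0.25, 0.25]   # maximally confused prediction

loss_clean = attack_loss(clean, true_idx=0)
loss_attacked = attack_loss(attacked, true_idx=0)
```

Both terms pull in the same direction here: the confused prediction has lower ground-truth confidence and higher entropy, so its loss is lower than that of the clean prediction.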

Success is measured by:

  1. Attack Success Rate (ASR) and top-1 accuracy degradation for zero-shot classification
  2. Caption consistency scores evaluated by GPT-4 for image captioning
  3. Answer correctness scores for Visual Question Answering (VQA)

Physical experiments measure frame-level ASR across video sequences.

The evaluation uses the COCO dataset with 80 categories, testing on 4 CLIP variants and 6 generative VLMs including LLaVA, BLIP-2, and InstructBLIP.

Architecture & Method
  1. Triangular Light Parameterization: Uses 9-dimensional parameter space to model triangular light patterns with center coordinates, radius (constrained to 10 ≤ r ≤ γ·min(H,W) where γ=0.2), RGB color values, and three polar angles φ₁, φ₂, φ₃ ∈ [0,360°].

  2. Circle-based Geometric Modeling: Generates triangles by placing three vertices on circumference of a circle rather than optimizing free vertices, ensuring valid triangular shapes through coordinate constraints:

    \[x_i = x + r × \sin(\phi_i × π/180), y_i = y + r × \cos(\phi_i × π/180)\]
  3. Multi-objective Fitness Function: Combines ground-truth confidence suppression with entropy maximization to encourage decision confusion:

    \[L(Θ) = \log(p(\bar{t}|I_{adv}) + ε) - H(p(·|I_{adv}))\]
  4. Genetic Algorithm Optimization: Uses population-based search (population=50, generations=200) with tournament selection, crossover (rate=0.8), and mutation (rate=0.1) to handle non-convex optimization landscape across heterogeneous parameter types.

  5. Physical Deployment Setup: Implements attacks using flashlight, colored transparent plastic sheets, and paper cutouts to project triangular patterns onto target scenes.

    The core technical contribution is the first systematic framework for physically deployable adversarial attacks against VLMs using parameterized localized illumination patterns.
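
The circle-based construction in steps 1 and 2 can be sketched as follows. This is a minimal sketch assuming pixel coordinates and a hypothetical 224×224 image; the helper names `clamp_radius` and `triangle_vertices` are illustrative and not from the paper.

```python
import math

def clamp_radius(r, height, width, gamma=0.2, r_min=10.0):
    """Clamp r to the stated range [r_min, gamma * min(H, W)]."""
    return max(r_min, min(r, gamma * min(height, width)))

def triangle_vertices(cx, cy, r, angles_deg):
    """Place one vertex per polar angle on the circle centred at (cx, cy).

    Follows x_i = x + r*sin(phi_i), y_i = y + r*cos(phi_i), with phi_i in degrees.
    """
    return [
        (cx + r * math.sin(math.radians(a)), cy + r * math.cos(math.radians(a)))
        for a in angles_deg
    ]

# Hypothetical example: a requested radius of 80 px is capped at 0.2 * 224 = 44.8 px.
r = clamp_radius(80.0, height=224, width=224)
verts = triangle_vertices(112.0, 112.0, r, (0.0, 120.0, 240.0))
```

Because every vertex lies on the same circle, three distinct angles always yield a valid triangle, which is the benefit the digest attributes to this parameterization over free vertex optimization.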

Training Recipe
  1. No Model Training: This work does not train models but generates adversarial examples for existing pre-trained VLMs.

  2. Attack Generation Process:
     • Data: 300 COCO images (10 per category across 30 categories)
     • Optimization: Genetic algorithm with population size 50, max 200 generations
     • Hardware: NVIDIA RTX 3090 (24GB) GPU
     • Parameters: Crossover rate 0.8, mutation rate 0.1, transparency α=0.5, radius scaling γ=0.2

  3. Physical Implementation:
     • Equipment: Flashlight, colored transparent plastic sheets, paper cutout templates, protractor for angle measurement
     • Recording: Mobile phone on tripod capturing 5-7 second video sequences
     • Color sampling: Physical color measurement under target camera conditions rather than random RGB generation

  4. Evaluation Protocol:
     • Digital attacks: Direct parameter optimization on extracted frames
     • Physical attacks: Parameter optimization on clean frames, then physical deployment with corresponding materials
     • Wall-clock time: Not reported for optimization process
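
To make the optimization loop concrete, here is a minimal genetic-algorithm sketch using the stated hyperparameters (population 50, up to 200 generations, crossover 0.8, mutation 0.1, tournament selection). Everything else is an assumption for illustration: the tournament size, uniform crossover, resampling mutation, elitism, the parameter bounds, and above all `toy_fitness`, which stands in for the real loss computed from a VLM's output.

```python
import random

def tournament(pop, fitness, rng, k=3):
    """Return the lowest-fitness individual among k random contestants."""
    return min(rng.sample(pop, k), key=fitness)

def genetic_search(fitness, bounds, pop_size=50, generations=200,
                   crossover_rate=0.8, mutation_rate=0.1, seed=0):
    """Minimise `fitness` over box-bounded real vectors."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    best = min(pop, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism (assumed): keep the best candidate
        while len(children) < pop_size:
            a = tournament(pop, fitness, rng)
            b = tournament(pop, fitness, rng)
            if rng.random() < crossover_rate:
                # uniform crossover: each gene comes from either parent
                child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            else:
                child = a[:]
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    child[i] = rng.uniform(lo, hi)  # resample gene in range
            children.append(child)
        pop = children
        best = min(pop, key=fitness)
    return best

# Hypothetical bounds for (x_rel, y_rel, r, R, G, B, phi1, phi2, phi3).
BOUNDS = [(0, 1), (0, 1), (10, 44.8)] + [(0, 255)] * 3 + [(0, 360)] * 3

def toy_fitness(theta):
    """Placeholder objective: normalised squared distance to the box centre."""
    return sum(((t - (lo + hi) / 2) / (hi - lo)) ** 2
               for t, (lo, hi) in zip(theta, BOUNDS))

best = genetic_search(toy_fitness, BOUNDS, generations=30)
```

A gradient-free search of this kind suits the mixed parameter types (positions, colors, angles) and needs only black-box access to model outputs, which matches the digest's description of a non-convex landscape over heterogeneous parameters.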

Novelty & Lineage

Prior Work:

  1. ITA (Illumination Transformation Attack, 2025): Current SOTA digital illumination attack against VLMs, limited to digital domain without physical deployment.
  2. Shadow Attack (Zhong et al. 2022): Physical adversarial attacks using shadows, focused on CNN-based models rather than VLMs.
  3. OPAD/SPAA: Projector-based physical attacks targeting traditional computer vision models, not multimodal systems.

Delta: This work adds:

  1. First physically deployable attack framework specifically designed for VLMs
  2. Systematic evaluation across both discriminative (CLIP) and generative VLMs (LLaVA, BLIP)
  3. Parameterized triangular light modeling that is more flexible than prior geometric constraints.

Applied-specific Assessment:

    • Architectural novelty: The triangular parameterization is a reasonable engineering choice but not architecturally novel - it’s a straightforward geometric modeling approach.
    • Benchmark gains: The attack degrades accuracy far more than the baseline (an 82 pp drop vs. 42 pp for ITA on OpenAI CLIP ViT-L/14), but the baselines are relatively weak. The physical deployment is more convincing evidence.
    • Fair comparisons: Comparisons appear fair using identical evaluation protocols as ITA. However, the genetic algorithm vs gradient-based optimization makes direct comparison somewhat limited.
    • Generalizability concerns: Results heavily dependent on simple geometric shapes and specific hardware setup. Unclear if gains would hold with more sophisticated defenses or constrained deployment scenarios.

    Verdict: INCREMENTAL — Solid engineering application of known techniques (genetic algorithms, geometric parameterization) to a new domain (physical VLM attacks), with clear practical value but no fundamental algorithmic or architectural innovation.

Benchmarks & Results
  1. Zero-shot Classification (COCO dataset, 4 CLIP variants):
     • OpenCLIP ViT-B/16: 29% accuracy (↓68 pp vs clean 97%), previous SOTA (ITA): 46%
     • Meta-CLIP ViT-L/14: 23% accuracy (↓75 pp vs clean 98%), previous SOTA (ITA): 64%
     • EVA-CLIP ViT-G/14: 36% accuracy (↓62 pp vs clean 98%), previous SOTA (ITA): 84%
     • OpenAI CLIP ViT-L/14: 11% accuracy (↓82 pp vs clean 93%), previous SOTA (ITA): 51%

  2. Image Captioning (6 VLMs, GPT-4 consistency evaluation):
     • LLaVA-1.5: 56.23% consistency (↓22.37 pp vs clean 78.60%), ITA: 63.73%
     • OpenFlamingo: 41.73% consistency (↓28.47 pp vs clean 70.20%), ITA: 53.93%
     • BLIP-2 FlanT5-XL: 53.88% consistency (↓21.22 pp vs clean 75.10%), ITA: 60.93%

  3. Visual Question Answering (6 VLMs, GPT-4 correctness evaluation):
     • LLaVA-1.5: 24% correctness (↓44 pp vs clean 68%), ITA: 48%
     • OpenFlamingo: 3% correctness (↓42 pp vs clean 45%), ITA: 19%
     • BLIP-2 FlanT5-XL: 6% correctness (↓57 pp vs clean 63%), ITA: 38%

  4. Physical Domain Performance: Frame-level attack success rates are demonstrated across video sequences, but specific quantitative results are not systematically reported.

Results Analysis: MSLA (the proposed attack) consistently outperforms all baselines across all tasks and models. VQA shows the largest vulnerability, with correctness drops of 42-57 pp on the models listed above. Smaller models (OpenFlamingo, BLIP-2) exhibit greater vulnerability than larger ones (LLaVA-1.6).

Compute & Efficiency
  1. Model Size: Attacks target existing pre-trained models ranging from 3B parameters (OpenFlamingo) to 7B parameters (LLaVA variants). No additional model parameters introduced.

  2. Training Compute: Attack generation uses genetic algorithm optimization on NVIDIA RTX 3090 (24GB). Population size 50, max 200 generations. Specific GPU hours not reported.

  3. Inference Speed/Latency: Not systematically evaluated. Physical deployment requires manual setup of optical equipment (flashlight, colored sheets, templates) which adds significant overhead compared to digital attacks.

  4. Memory Footprint: Minimal additional memory requirements beyond target VLM inference. Attack parameter optimization operates on single images with 9-dimensional parameter space.

  5. Deployment Practicality:
     • Digital: High practicality with standard GPU hardware
     • Physical: Moderate practicality requiring simple optical equipment (flashlight, colored transparent sheets, paper cutouts, protractor)
     • Limitations: Physical setup requires manual preparation and positioning, not easily scalable or automated
     • Environment constraints: Requires controlled lighting conditions and precise positioning of optical equipment

Real-World Applicability
  1. Physical Hardware Experiments: Demonstrates attacks using commodity hardware - flashlight, colored transparent plastic sheets, paper cutout templates, and protractor for angle measurement. Videos captured with mobile phone on tripod.

  2. Real-world Object Testing: Physical experiments conducted on actual COCO-category objects in laboratory environment rather than just projected/printed images.

  3. Environmental Robustness: Tests show some robustness across different lighting conditions through 5-7 second video sequences, though a systematic analysis of environmental variation is not provided.

  4. Deployment Constraints: Requires attacker to: (a) physically access scene with optical equipment, (b) precisely position light source and templates, (c) maintain specific distance/angle relationships, (d) operate in controlled lighting environment.

  5. Real-world Defense Challenges: Physical nature makes the attacks harder to detect automatically compared to digital perturbations, as triangular light patterns can appear natural in many scenarios.

  6. Scalability Limitations: Manual setup process limits practical deployment at scale. Each attack requires custom template cutting and precise positioning based on optimized parameters.

Limitations & Failure Modes
  1. FUNDAMENTAL: Attack requires physical access to scene and ability to control lighting conditions, severely limiting real-world applicability in secured or monitored environments.

  2. ENGINEERING: Limited to simple triangular shapes - more complex geometric patterns might be more effective but would require more sophisticated optical equipment and optimization.

  3. ENGINEERING: Genetic algorithm optimization is computationally expensive and, as a population-based heuristic search, may miss global optima.

  4. EVALUATION: Physical experiments conducted only in controlled laboratory settings without systematic evaluation across varying environmental conditions (outdoor lighting, weather, multiple cameras).

  5. EVALUATION: Limited evaluation of detection/defense mechanisms - unclear how robust these attacks are against adversarial training or input preprocessing.

  6. FUNDAMENTAL: Attack effectiveness depends heavily on VLM architecture and scale - larger models show increased robustness, suggesting this vulnerability may diminish as models improve.

  7. ENGINEERING: Physical implementation requires precise manual setup and calibration, making reproducibility and consistency across different attack scenarios challenging.

    Likely Failure Modes:

    • Lighting Competition: Attack likely fails in bright outdoor environments or scenes with strong competing light sources
    • Attention Robustness: May fail against VLMs trained with attention mechanisms specifically designed to ignore localized perturbations

Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging

Authors: Zihang Fu, Haonan Wang, Jian Kang, Kenji Kawaguchi et al. (5 authors) · Institution: National University of Singapore · Category: cs.CV

MERIT uses evolutionary search to find layer-selective merging recipes that restore temporal reasoning in video-language models by strategically blending self-attention parameters with the original text backbone while preserving visual perception.

Practical Takeaway: If you’re working with video-language models that show degraded temporal reasoning after multimodal adaptation, consider layer-selective merging with the original text backbone rather than uniform full-model merging. The key insight is that reasoning capabilities may be localized to specific layers, so targeted intervention can recover temporal reasoning while preserving visual capabilities. However, this requires access to both the VLM and its paired text backbone, plus computational resources for evolutionary search. The approach is most practical when you have multiple temporal reasoning tasks and can afford the search cost upfront.

Tags: video-language-models temporal-reasoning model-merging multimodal-adaptation layer-selective-intervention training-free-methods evolutionary-search parameter-interpolation


Task & Setting

Video-Language Models (VLMs) combine visual encoders with Large Language Models to understand videos, but multimodal adaptation often weakens temporal reasoning abilities inherited from language-only pretraining. This creates a critical gap where models can perceive visual content but fail to reason about temporal-causal relationships between events.

The task is restoring temporal reasoning (TR) in VLMs while preserving temporal perception (TP). Input consists of video sequences and text questions requiring temporal understanding. The method searches over layer-wise self-attention parameter merging between a VLM and its text-only backbone. The objective function is:

\[F(g) = Acc_{TR}(g) - \lambda \cdot D_{TP}(g)\]

where $D_{TP}(g) = \max(0, Acc_{TP}^{base} - Acc_{TP}(g))$ penalizes temporal perception degradation.

Success is measured on temporal reasoning accuracy while maintaining temporal perception performance. The method is evaluated on Video-MME (search set: 55 TP + 177 TR examples), then tested on LongVideoBench, LVBench, MMBench-Video, and Video-Holmes benchmarks.
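
The objective can be read as "reward reasoning accuracy, penalize only perception losses". A minimal sketch, assuming accuracies are given as fractions and using a hypothetical default `lam=1.0` (the digest does not state the value of λ):

```python
def merit_objective(acc_tr, acc_tp, acc_tp_base, lam=1.0):
    """F(g) = Acc_TR(g) - lam * max(0, Acc_TP_base - Acc_TP(g))."""
    tp_drop = max(0.0, acc_tp_base - acc_tp)  # zero when perception is preserved
    return acc_tr - lam * tp_drop

# Hypothetical accuracies for three candidate recipes.
f_same_tp = merit_objective(acc_tr=0.50, acc_tp=0.70, acc_tp_base=0.70)
f_better_tp = merit_objective(acc_tr=0.50, acc_tp=0.75, acc_tp_base=0.70)
f_worse_tp = merit_objective(acc_tr=0.50, acc_tp=0.60, acc_tp_base=0.70)
```

Note the asymmetry: improving perception beyond the base model earns no bonus, while any drop below it is subtracted from the score.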

Architecture & Method
  1. Start with a VLM (LongVA-7B, InternVL3-8B, or Qwen3-VL-4B) paired with its text-only LLM backbone
  2. Define layer-wise merging recipes parameterized by continuous gating vector g ∈ [0,1]^L where L is number of self-attention layers
  3. Convert continuous gates to discrete via thresholding: ĝ_ℓ = 1 if g_ℓ ≥ 0.5, else 0
  4. Apply directional interpolation at each layer ℓ:

    \[\theta_\ell = \begin{cases} \alpha \theta_\ell^M + (1-\alpha) \theta_\ell^N & \text{if } \hat{g}_\ell = 1 \\ (1-\alpha) \theta_\ell^M + \alpha \theta_\ell^N & \text{if } \hat{g}_\ell = 0 \end{cases}\]

    where θ^M and θ^N are self-attention parameters from VLM and text backbone respectively

  5. Use CMA-ES evolutionary search to optimize the objective function F(g)
  6. Run a separate search for each interpolation weight α ∈ {0.5, 0.6, 0.7, 0.8, 0.9, 1.0}
  7. Select recipe with highest objective score across all runs

    Core contribution: Task-driven layer-selective merging that targets specific reasoning-critical layers rather than uniform full-model merging.
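
Steps 2 to 4 above (continuous gates, thresholding, directional interpolation) can be sketched on plain Python lists. This is an illustrative sketch only: real self-attention parameters are tensors, and the function names and toy values below are assumptions.

```python
def discretize(gates, threshold=0.5):
    """Convert continuous gates in [0, 1] to hard 0/1 decisions."""
    return [1 if g >= threshold else 0 for g in gates]

def merge_layer(theta_vlm, theta_text, gate, alpha):
    """Directional interpolation for one layer.

    gate == 1: alpha * VLM + (1 - alpha) * text
    gate == 0: (1 - alpha) * VLM + alpha * text
    """
    w_vlm = alpha if gate == 1 else 1.0 - alpha
    return [w_vlm * m + (1.0 - w_vlm) * n for m, n in zip(theta_vlm, theta_text)]

def merge_model(vlm_layers, text_layers, gates, alpha):
    """Apply the per-layer rule using thresholded gates."""
    hard = discretize(gates)
    return [merge_layer(m, n, g, alpha)
            for m, n, g in zip(vlm_layers, text_layers, hard)]

# Toy two-layer example: VLM weights are all 1.0, text-backbone weights all 0.0.
vlm = [[1.0, 1.0], [1.0, 1.0]]
text = [[0.0, 0.0], [0.0, 0.0]]
merged = merge_model(vlm, text, gates=[0.9, 0.2], alpha=0.8)
```

With α = 0.8, the first layer (gate above 0.5) stays mostly VLM, while the second layer (gate below 0.5) moves mostly toward the text backbone. At α = 0.5 the two branches coincide, which is why the gate only matters for α above 0.5.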

Training Recipe
  1. No training required - this is a training-free parameter merging approach
  2. Search phase uses CMA-ES with population size λ_pop = 4 + ⌊3 ln L⌋ where L is number of layers
  3. Search capped at 1,600 evaluations of objective function F(g)
  4. Evaluation performed on Video-MME subset (55 TP + 177 TR examples)
  5. Visual processor outputs cached for efficiency (2-3× speedup)
  6. Discrete recipe configurations cached to avoid redundant evaluations
  7. Search conducted independently for each interpolation weight α
  8. Hardware and wall-clock time: not reported
  9. Optimizer: CMA-ES evolutionary strategy for black-box optimization
  10. No additional data or supervision required beyond base models
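
For a sense of scale, the population-size rule and evaluation cap translate into a generation budget as sketched below. The layer count of 32 is a hypothetical example, not a figure reported for any of the tested models, and the helper names are illustrative.

```python
import math

def cma_population_size(num_layers):
    """lambda_pop = 4 + floor(3 * ln(L)), as stated in the recipe."""
    return 4 + math.floor(3 * math.log(num_layers))

def max_generations(budget, num_layers):
    """Full generations that fit inside an evaluation budget."""
    return budget // cma_population_size(num_layers)

pop = cma_population_size(32)       # hypothetical L = 32
gens = max_generations(1600, 32)    # generations under the 1,600-evaluation cap
```

This rule matches the commonly used CMA-ES default for an n-dimensional problem, with the number of gated layers as the dimension. Caching discrete recipes, as noted in the list above, would stretch the budget further, since repeated configurations cost no new evaluation.
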

Novelty & Lineage

Step 1 — Prior work: Chen et al. (2025) “Bring reason to vision” showed uniform full-model merging between reasoning-specialized LLMs and vision-language models can recover some logical deduction capabilities. Li et al. (2024b) used Direct Preference Optimization to improve reasoning but required additional supervision and computation.

Step 2 — Delta: This paper introduces layer-selective merging with explicit temporal perception constraints, moving from coarse full-model averaging to targeted layer-level intervention. Key additions:

  1. Task-driven objective that balances reasoning recovery with perception preservation
  2. Evolutionary search over layer-wise recipes rather than uniform merging
  3. Focus on video temporal reasoning rather than static image reasoning.

    Step 3 — Applied-specific assessment:

    • Architectural idea is a natural extension of existing model merging to layer-selective setting
    • Benchmark gains are meaningful: +23.9% relative TR improvement on LongVA-7B (40.1% → 49.7%), +10.8% on InternVL3-8B (52.0% → 57.6%)
    • Comparisons appear fair with proper baselines (uniform merging, random layer selection)
    • Gains likely depend on access to both VLM and paired text backbone
    • Interventional analysis provides good evidence that selected layers matter

    Verdict: INCREMENTAL — Solid application of layer-selective merging to an important problem, but the core technique is a straightforward extension of existing model merging approaches.

Benchmarks & Results
  1. Video-MME Temporal Reasoning (relative gains in parentheses, here and below): LongVA-7B 49.7% vs 40.1% baseline (+23.9%), InternVL3-8B 57.6% vs 52.0% (+10.8%), Qwen3-VL-4B 46.9% vs 45.2% (+3.8%)
  2. LongVideoBench Relation: LongVA-7B 48.6% vs 48.0% (+1.3%), InternVL3-8B 55.2% vs 54.8% (+0.7%), Qwen3-VL-4B 56.9% vs 56.0% (+1.6%)
  3. LVBench Reasoning: LongVA-7B 46.8% vs 45.3% (+3.3%), InternVL3-8B 53.2% vs 51.2% (+3.9%), Qwen3-VL-4B 46.3% vs 39.3% (+17.8%)
  4. MMBench-Video Temporal Reasoning: LongVA-7B 1.09 vs 1.05 (+3.8%), InternVL3-8B 1.46 vs 1.44 (+1.4%), Qwen3-VL-4B 1.38 vs 1.08 (+27.8%)
  5. Video-Holmes Overall: LongVA-7B 18.9% vs 15.2% (+24.3%), InternVL3-8B 37.2% vs 36.5% (+1.9%), Qwen3-VL-4B 32.9% vs 31.5% (+4.4%)

Results show improved temporal reasoning on every model and benchmark listed while preserving perception. Gains transfer beyond the search set but vary widely, from +0.7% (InternVL3-8B on LongVideoBench) to +27.8% (Qwen3-VL-4B on MMBench-Video).
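
The bracketed gains in the list above match relative improvement over the baseline rather than percentage-point differences. A short check, with illustrative helper names:

```python
def relative_gain_pct(new, base):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (new - base) / base

def absolute_gain_pp(new, base):
    """Absolute improvement, in percentage points."""
    return new - base

# LongVA-7B on Video-MME temporal reasoning: 49.7% vs 40.1% baseline.
rel = round(relative_gain_pct(49.7, 40.1), 1)   # 23.9
abs_pp = round(absolute_gain_pp(49.7, 40.1), 1) # 9.6
```

The same LongVA-7B result is therefore a 9.6 pp absolute gain, which is the number to compare against pp-style figures reported elsewhere.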

Compute & Efficiency
  1. Model size: Same as base VLMs (LongVA-7B, InternVL3-8B, Qwen3-VL-4B) - no additional parameters
  2. Training compute: None required (training-free approach)
  3. Search compute: CMA-ES with up to 1,600 evaluations per interpolation weight, 6 weights tested per model
  4. Inference speed: Same as base models (only parameter values changed, not architecture)
  5. Memory footprint: Same as base models during inference
  6. Deployment practicality: High - requires access to both VLM and paired text backbone for merging, but resulting merged model can be deployed independently

Real-World Applicability
  1. Method evaluated only on video understanding benchmarks, not real-world deployment scenarios
  2. No hardware experiments on actual robots or autonomous systems reported
  3. No production integration or user studies mentioned
  4. Limited to scenarios where both VLM and paired text backbone are available
  5. Sim-to-real transfer not discussed - purely benchmark-based evaluation
  6. Practical applicability depends on access to base model checkpoints for merging

Limitations & Failure Modes
  1. FUNDAMENTAL: Method requires access to both VLM and its specific paired text backbone, limiting applicability to cases where both models are available
  2. ENGINEERING: Recipes are model-specific and don’t transfer across architectures - requires separate search for each VLM
  3. EVALUATION: Relies on benchmark-defined temporal perception/reasoning categories which may not capture all aspects of temporal understanding
  4. ENGINEERING: Search process requires substantial computation (up to 9,600 evaluations per model)
  5. FUNDAMENTAL: Layer selection depends on how temporal reasoning vs perception are operationalized in the search dataset

Likely failure modes:

  1. Performance may degrade on temporal tasks not well-represented in Video-MME search set
  2. Method may not work well for VLMs trained with different adaptation strategies than the tested models.

Authors: Myungchul Kim, Kwanyong Park, Junmo Kim, In So Kweon · Institution: KAIST · Category: cs.CV

ARGOS introduces the first benchmark for interactive multi-camera person search that combines witness dialogue with spatio-temporal reasoning over physically validated camera network topology.

Practical Takeaway: If you’re working on multi-modal reasoning or surveillance applications, ARGOS demonstrates the importance of tool-augmented LLM agents over direct inference. The key insight is that spatial and temporal constraints provide powerful elimination mechanisms when combined with interactive dialogue. The benchmark’s structure (three progressive tracks) offers a principled way to evaluate different reasoning capabilities. However, the deterministic witness simulator and limited environment coverage suggest this is primarily a research testbed rather than a production-ready solution. The substantial performance gaps (best TWS only 0.383-0.590) indicate significant room for improvement in agentic reasoning systems.

Tags: surveillance person-identification multi-camera spatial-reasoning temporal-reasoning interactive-dialogue benchmark multi-modal


Task & Setting

Multi-camera person search systems face a significant gap between idealized query scenarios and real-world ambiguity. Existing person re-identification methods assume clear visual queries or detailed appearance descriptions, but real witnesses provide vague statements mixed with spatial and temporal information (e.g., “I saw them in the warehouse, then near the lobby a few minutes later”).

The task is interactive multi-camera person search. Input: a vague witness statement about a target person in a gallery G, plus access to a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and transition times. The agent conducts a multi-turn dialogue with a witness simulator, selecting actions:

  1. visual attribute questions
  2. spatial location queries
  3. temporal reasoning checks

Output: identification of the unique target person within a limited turn budget.

Success is measured by Turn-Weighted Success (TWS):

\[TWS = \frac{1}{N} \sum_{i=1}^{N} s_i \cdot \frac{\tau_i^*}{\max(\tau_i, \tau_i^*)}\]

where $s_i \in \{0,1\}$ indicates correctness, $\tau_i$ is the agent’s turn count, and $\tau_i^*$ is the oracle-optimal count.

The benchmark comprises 2,691 tasks across 14 real-world scenarios (factory and campus environments) with 1,273 persons, 16 synchronized cameras per environment, organized in three tracks: Track 1 (Who): semantic perception (989 tasks), Track 2 (Where): spatial reasoning (550 tasks), Track 3 (When): temporal reasoning (1,152 tasks).
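
The TWS metric can be computed as in the sketch below. The tuple layout `(success, turns_used, oracle_turns)` and the three example episodes are assumptions for illustration.

```python
def turn_weighted_success(results):
    """Mean over tasks of s_i * tau_i_star / max(tau_i, tau_i_star).

    `results` is an iterable of (success, turns_used, oracle_turns) tuples.
    """
    results = list(results)
    total = sum(
        (1 if ok else 0) * oracle / max(turns, oracle)
        for ok, turns, oracle in results
    )
    return total / len(results)

# Hypothetical episodes: optimal success, slow success, and a failure.
episodes = [(True, 3, 3), (True, 6, 3), (False, 2, 3)]
tws = turn_weighted_success(episodes)   # (1.0 + 0.5 + 0.0) / 3
```

A correct answer reached in twice the oracle's turns earns half credit, a wrong answer earns nothing regardless of speed, and the `max` in the denominator caps each task's credit at 1 even if the agent finishes in fewer turns than the oracle count.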

Architecture & Method
  1. Spatio-Temporal Topology Graph (STTG): directed weighted graph $T = (V,E)$ where nodes represent cameras with zone labels, edges carry types $\lambda_{ij} \in \{OVERLAP, SOFT\_ADJ, TRAVEL\}$ and transition-time statistics $(t_{min}, t_{med}, t_{max}, n)$ from observed movements.

  2. Four-module agent pipeline:
    1. Analyst queries gallery and computes attribute elimination power over current candidate set
    2. Planner decides next action using information gain
    3. Interviewer executes action via appropriate tool
    4. Interpreter parses witness response and applies filters.
  3. Eight tools: gallery queries (T1-T2), zone structure retrieval (T3), witness interaction (T4), temporal feasibility checking via STTG (T5), filtering/prediction actions (T6-T8).

  4. Information-theoretic clue selection for Track 1 using penalized entropy:

    \[IG(a) = H_{value}(a | C_t) \times (1 - \alpha \cdot p_{uncertain}(a))\]
  5. Zone-based spatial disambiguation for Track 2 using pre-defined disambiguation trees partitioning cameras into sub-areas.

  6. STTG-based temporal feasibility classification for Track 3: candidates eliminated if transition violates empirical time constraints with 2.0x margin (too fast: $< t_{min}/2.0$, too slow: $> t_{max} \times 2.0$).
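
Two of the rules above, the penalized-entropy clue score (step 4) and the temporal feasibility check (step 6), can be sketched together. Several details are assumptions: that $H_{value}$ is the Shannon entropy (in bits) of an attribute's values over the current candidate set, the default `alpha=0.5`, the label strings, and all example numbers.

```python
import math
from collections import Counter

def penalized_info_gain(values, p_uncertain, alpha=0.5):
    """IG(a) = H_value(a | C_t) * (1 - alpha * p_uncertain(a)).

    `values` holds attribute a's value for each remaining candidate.
    """
    n = len(values)
    counts = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy * (1.0 - alpha * p_uncertain)

def transition_feasibility(elapsed, t_min, t_max, margin=2.0):
    """Classify an observed camera-to-camera transition time (same units)."""
    if elapsed < t_min / margin:
        return "too_fast"
    if elapsed > t_max * margin:
        return "too_slow"
    return "feasible"

# Hypothetical candidate set split evenly on shirt colour, 20% uncertain answers.
ig = penalized_info_gain(["red", "red", "blue", "blue"], p_uncertain=0.2)

# Hypothetical edge with t_min = 30 s and t_max = 90 s.
labels = [transition_feasibility(t, 30, 90) for t in (10, 50, 200)]
```

An attribute that splits the candidates evenly scores highest, discounted by how often the witness is expected to be unsure. The 2.0x margin widens the feasible window to [15 s, 180 s] in this example, so only clearly implausible transitions eliminate a candidate.
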

Training Recipe

Not applicable - this is a benchmark paper introducing evaluation tasks, not training a model. The ARGOS agent uses pre-trained LLMs (GPT-4o, GPT-5.2, GPT-5-mini, Claude Sonnet 4) as backbone reasoning engines with temperature 0.0. No additional training or fine-tuning is performed.

Novelty & Lineage

Step 1 — Prior work: Traditional person re-identification (Wei et al. 2018, Zheng et al. 2015) relies on clear visual queries. Text-based person retrieval (Yang et al. 2023, Tan et al. 2024) uses appearance descriptions alone. Interactive methods (Das et al. 2017, Levy et al. 2023, Lu et al. 2025) focus on visual dialogue without spatial-temporal reasoning. Spatial reasoning benchmarks (Chen et al. 2024, Cheng et al. 2024) remain limited to single-image settings.

Step 2 — Delta: ARGOS adds three key components:

  1. interactive multi-turn dialogue combining appearance, spatial, and temporal queries
  2. grounding in physically validated STTG encoding camera connectivity and transition times
  3. evaluation protocol requiring strategic planning under information asymmetry.

    Step 3 — Applied-specific assessment:

    • Architectural idea: Novel combination of multi-modal interaction with spatio-temporal graph reasoning, but individual components (dialogue systems, spatial graphs, temporal constraints) are well-established
    • Benchmark gains: Substantial drops when removing domain-specific tools (up to 49.6 pp), but this validates necessity rather than demonstrating breakthrough capability
    • Comparisons: Fair comparison across LLM backbones, but limited baseline diversity beyond ablations
    • Generalizability: Limited to two environments; broader validation needed

    Verdict: INCREMENTAL — solid benchmark contribution that systematically combines known techniques for an important but narrow application domain.

Benchmarks & Results
  1. Track 1 (Who): Top-1 Accuracy - LLM ToolCall 81.1%, LLM Direct 73.3%, improvement +7.8 pp

  2. Track 2 (Where): Turn-Weighted Success - Claude Sonnet 4: 0.383 (best), GPT-4o: 0.323, GPT-5-mini: 0.319, GPT-5.2: 0.338. Top-1 Accuracy range: 73.1-76.0%

  3. Track 2 (Where): Success Rate at 5 turns - Claude Sonnet 4: 38.4% (best), others: 31.6-33.6%

  4. Track 3 (When): Turn-Weighted Success - GPT-5.2: 0.590 (best), Claude Sonnet 4: 0.548, GPT-4o: 0.567, GPT-5-mini: 0.556. Top-1 Accuracy range: 80.6-88.2%

  5. Track 3 (When): Success Rate at 5 turns - GPT-5.2: 65.8% (best), others: 59.5-60.8%

    Results show benchmark is far from solved with best TWS only 0.383 (Track 2) and 0.590 (Track 3). No single backbone dominates both tracks, indicating different capability requirements.

Compute & Efficiency
  1. Model size: Uses pre-trained LLMs as backbones (GPT-4o, GPT-5.2, GPT-5-mini, Claude Sonnet 4) - parameter counts not specified

  2. Training compute: Not applicable - no model training performed

  3. Inference speed/latency: Not reported, but involves multi-turn dialogue with up to 20-turn budget

  4. Memory footprint: Not specified

  5. Deployment practicality: Requires access to commercial LLM APIs and structured camera network with STTG. Limited to environments with pre-computed spatio-temporal graphs and synchronized multi-camera systems.

Real-World Applicability
  1. Data source: Built on MTMMC dataset with 16 synchronized cameras across two real environments (factory and university campus)

  2. Ground truth validation: Uses empirically validated transition times from observed human movements, not simulated data

  3. Environment coverage: Limited to 2 environments (14 scenarios total), requires expansion for broader validation

  4. Deployment considerations: Deterministic witness simulator limits real-world applicability - actual witnesses exhibit memory errors and inconsistencies not captured

  5. System requirements: Needs pre-existing multi-camera surveillance infrastructure with established STTG topology

Limitations & Failure Modes
  1. FUNDAMENTAL: Deterministic witness simulator doesn’t capture real-world witness inconsistencies, memory errors, or adversarial behavior

  2. ENGINEERING: Limited to 2 environments - broader validation across diverse camera layouts and densities needed

  3. FUNDAMENTAL: STTG construction requires extensive manual curation and domain expertise for each new environment

  4. EVALUATION: No evaluation on truly adversarial witnesses or scenarios with intentionally misleading information

  5. ENGINEERING: Currently uses only direct STTG edges - multi-hop reasoning over indirect camera paths not supported

    Failure modes:

    • Premature prediction commitment leading to wrong answers despite correct reasoning capability
    • Timeout from excessive evidence gathering when disambiguation requires strategic efficiency

All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding

Authors: Tanzila Rahman, Renjie Liao, Leonid Sigal · Institution: University of British Columbia · Category: cs.CV

A unified pipeline generates synthetic multimodal video data from single images and shows VQA-based training outperforms caption-based approaches for video understanding tasks.

Practical Takeaway: If you’re working on multimodal video understanding with limited annotated data, this paper demonstrates that synthetic data generation can provide modest performance gains. The key insight is using VQA-based fine-tuning instead of caption-based training for better reasoning. However, the approach requires significant computational resources (multiple large models) for data generation. Before implementing, consider whether the modest gains (2-3 point improvements) justify the complexity and compute cost compared to other data augmentation strategies.

Tags: synthetic_data video_understanding multimodal_learning VQA object_counting video_segmentation data_generation MLLM


Task & Setting

Multimodal video understanding models require large-scale annotated data across diverse tasks like object counting, visual question answering (VQA), and segmentation. Real-world video annotation is expensive, time-consuming, and limited in diversity.

This work addresses synthetic data generation for training multimodal large language models (MLLMs) on video understanding tasks. The input is a single static image, from which the system generates:

  1. textual captions describing plausible future scenarios
  2. temporally coherent video sequences
  3. object counting annotations
  4. VQA pairs, and
  5. segmentation masks across video frames.

    Success is measured on three downstream tasks:

  1. Video object counting: Mean Absolute Error (MAE) and Mean Squared Error (MSE) between predicted and ground-truth object counts
  2. Video-based VQA: CLIP-Score and Word Understanding Precision (WUP) for semantic alignment
  3. Video object segmentation: mean Intersection over Union (mIoU) for pixel-level accuracy

    The synthetic dataset contains ~5K training videos and 1K validation videos generated from MS-COCO images, with automatic annotations for multiple supervision signals.
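The counting and segmentation metrics named above can be sketched in a few lines. This is a minimal illustration rather than the paper's evaluation code: the function names are hypothetical, masks are plain nested lists of 0/1, and averaging IoU over mask pairs is an assumption about how mIoU is aggregated.

```python
def mae(pred, true):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mse(pred, true):
    """Mean squared error between predicted and ground-truth counts."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def iou(pred_mask, true_mask):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; treating that as 1.0 is a convention.
    return inter / union if union else 1.0


def mean_iou(pred_masks, true_masks):
    """Average IoU over mask pairs (assumed aggregation)."""
    scores = [iou(p, t) for p, t in zip(pred_masks, true_masks)]
    return sum(scores) / len(scores)


counts_pred, counts_true = [3, 5, 2], [4, 5, 0]
print(mae(counts_pred, counts_true))            # 1.0
print(iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]))  # 0.5
```

MSE penalizes large count errors more heavily than MAE, which is why the two can move by very different amounts on the same benchmark.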

Architecture & Method
  1. Caption Generation: ChatGPT generates future-plausible textual descriptions from input images, providing semantic guidance for video synthesis.

  2. Video Generation: Wan 2.2 (14B parameters) generates temporally coherent video sequences conditioned on both the input image and generated caption, using diffusion transformer architecture.

  3. Segmentation Mask Generation: SAM2 extracts object masks from input images, then MUG-VOS propagates these masks across generated video frames for temporal consistency.

  4. LLM-based Annotation: ChatGPT generates three types of supervision: object counting labels (aggregated from MS-COCO annotations), visual question-answer pairs (3 per image), and descriptive captions.

  5. VQA-based Training Strategy: Instead of conventional caption-based fine-tuning, models are trained on structured question-answering tasks to encourage deeper visual grounding and reasoning.

    The core technical contribution is the unified pipeline that transforms a single image into multi-modal data (text, video, masks) with consistent cross-modal conditioning.
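The sequential conditioning described above (image → caption → video → masks, plus annotations) can be sketched as a chain of stages. Everything here is hypothetical scaffolding: `SyntheticSample`, `build_sample`, and the `stub_*` functions are placeholders standing in for ChatGPT, Wan 2.2, and SAM2 + MUG-VOS, and none of them call the real models.

```python
from dataclasses import dataclass


@dataclass
class SyntheticSample:
    image: str
    caption: str
    video: list
    masks: list
    vqa_pairs: list
    object_count: int


def build_sample(image, coco_count, caption_fn, video_fn, mask_fn, vqa_fn):
    """Run the stages in order; each stage is conditioned on earlier outputs."""
    caption = caption_fn(image)          # image -> future-plausible caption
    video = video_fn(image, caption)     # image + caption -> frames
    masks = mask_fn(image, video)        # image masks propagated over frames
    vqa_pairs = vqa_fn(image, caption)   # question-answer supervision
    return SyntheticSample(image, caption, video, masks, vqa_pairs, coco_count)


# Placeholder stages (assumptions for illustration only).
def stub_caption(image):
    return f"a plausible next moment for {image}"


def stub_video(image, caption, num_frames=4):
    return [f"{image}#frame{i}" for i in range(num_frames)]


def stub_masks(image, video):
    return [f"mask({frame})" for frame in video]


def stub_vqa(image, caption):
    # The summary reports 3 question-answer pairs per image.
    return [(f"question {i}", f"answer {i}") for i in range(1, 4)]


sample = build_sample("coco_000001.jpg", 2,
                      stub_caption, stub_video, stub_masks, stub_vqa)
print(len(sample.video), len(sample.masks), len(sample.vqa_pairs))  # 4 4 3
```

Passing the stages in as callables makes the dependency order explicit: a poor caption feeds directly into video generation, and a poor video feeds into mask propagation.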

Training Recipe
  1. Synthetic Data Generation Stage:
    • Data: ~5K videos generated from MS-COCO images using Wan 2.2 (14B parameters)
    • Caption generation via ChatGPT API calls
    • Segmentation masks via SAM2 + MUG-VOS tracking
    • Hardware: Not reported for generation stage
  2. Model Fine-tuning:
    • Base model: InternVideo2.5
    • Data: 5K synthetic videos with multi-modal annotations (captions, VQA pairs, object counts, segmentation masks)
    • Training configurations tested: (a) caption-only supervision, (b) captions + VQA pairs, (c) VQA-only supervision
    • Optimizer, learning rate, batch size: Not reported
    • Hardware and training time: Not reported
  3. Evaluation:
    • Validation: 1K synthetic videos
    • Real-world benchmarks: MS-COCO val, LV-VIS 2023, YouTube-VIS datasets

Novelty & Lineage

Prior work:

  1. Synthetic data for multimodal learning has been explored in simulation-based training and generative model applications, but typically focuses on single modalities or independent generation
  2. Vision-language models like CLIP and recent MLLMs rely heavily on real-world datasets with manual annotations (captions, instructions, image-text pairs)
  3. Recent work on compositional reasoning in VLMs addresses spatial relations and object counting but depends on expensive annotations or synthetic negatives

    Delta: This paper introduces a unified pipeline that sequentially generates multiple modalities (text→video→masks) from a single image with cross-modal conditioning, plus VQA-based fine-tuning instead of caption-based training.

    Applied-specific assessment:

    • Architectural idea: The sequential cross-modal conditioning (image→caption→video→masks) is a reasonable engineering approach but not particularly novel - it’s a straightforward application of existing generative models in sequence
    • Benchmark gains: Modest improvements on video counting and VQA tasks, but margins are not large (e.g., MAE reduction from 5.46→3.18 on MS-COCO counting)
    • Comparison fairness: Limited baselines - mainly compares different training strategies on the same model rather than comparing to other synthetic data approaches or state-of-the-art methods
    • Scale dependency: Results are demonstrated with relatively small-scale synthetic data (5K videos), and it is unclear whether gains would hold without access to high-quality generative models like Wan 2.2

    Verdict: INCREMENTAL — solid engineering of existing generative models into a unified pipeline, but lacks significant novelty or compelling evidence of major performance gains.

Benchmarks & Results
  1. Video Object Counting (MAE/MSE):
    • MS-COCO: Baseline 5.46/75.38 → VQA fine-tuning 3.18/27.21 (improvement: -2.28 MAE, -48.17 MSE)
    • LV-VIS: Baseline 0.97/36.18 → VQA fine-tuning 0.76/4.80 (improvement: -0.21 MAE, -31.38 MSE)
    • YouTube-VIS: Mixed results, some degradation with fine-tuning approaches

  2. Video-based VQA (CLIP-Score/WUP):
    • MS-COCO: Baseline 87.37/0.1761 → captions+VQA 90.89/0.84 (improvement: +3.52 CLIP-Score, +0.66 WUP)
    • LV-VIS: Baseline 75.16/0.45 → captions+VQA 77.91/0.54 (improvement: +2.75 CLIP-Score, +0.09 WUP)
    • YouTube-VIS: Baseline 82.37/0.40 → captions+VQA 83.12/0.46 (improvement: +0.75 CLIP-Score, +0.06 WUP)

  3. Video Object Segmentation (mIoU):
    • MS-COCO: Baseline 0.4711 → 5K synthetic videos 0.5239 (improvement: +0.0528 mIoU)
    • Scaling effect: 2K videos 0.4694 → 5K videos 0.5239 (+0.0545 improvement)

    Results show consistent but modest improvements. Notable absence: No comparison to other synthetic data generation methods or recent video understanding benchmarks.
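To put the absolute counting deltas in proportion, the reported baseline and VQA fine-tuning numbers can be converted to relative reductions. This is plain arithmetic on the figures listed above; the `relative_reduction` helper is a hypothetical name.

```python
# (baseline, VQA fine-tuning) pairs copied from the counting results above.
counting_results = {
    "MS-COCO MAE": (5.46, 3.18),
    "MS-COCO MSE": (75.38, 27.21),
    "LV-VIS MAE": (0.97, 0.76),
    "LV-VIS MSE": (36.18, 4.80),
}


def relative_reduction(baseline, finetuned):
    """Percentage by which the error dropped relative to the baseline."""
    return 100.0 * (baseline - finetuned) / baseline


for name, (baseline, finetuned) in counting_results.items():
    pct = relative_reduction(baseline, finetuned)
    print(f"{name}: {baseline} -> {finetuned} ({pct:.1f}% lower)")
```

On these numbers the MS-COCO MAE drops by roughly 42% and the LV-VIS MAE by roughly 22%, while the MSE reductions are larger (about 64% and 87%), which suggests that a sizable part of the gain comes from fewer large counting errors.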

Compute & Efficiency
  1. Model size: Base model InternVideo2.5 (parameters not specified), Wan 2.2 video generation (14B parameters)

  2. Training compute: Not reported for fine-tuning stage. Video generation uses Wan 2.2 14B model but GPU hours not specified

  3. Inference speed/latency: Not reported

  4. Memory footprint: Not reported

  5. Deployment practicality assessment: Pipeline requires multiple large models (ChatGPT API, Wan 2.2 14B, SAM2, MUG-VOS) making it compute-intensive for data generation. Fine-tuned model deployment requirements not discussed.

Real-World Applicability
  1. Evaluation on real-world benchmarks: Models trained on synthetic data are evaluated on MS-COCO validation set, LV-VIS 2023, and YouTube-VIS datasets, showing transfer to real video data.

  2. No deployment results: No discussion of actual deployment in production systems or real-world applications.

  3. No hardware experiments: No testing on specific robotic platforms or autonomous systems.

  4. Sim-to-real analysis: Limited discussion of domain gap between synthetic and real videos, though results suggest reasonable transfer performance.

  5. Dataset diversity limitation: Synthetic videos generated only from MS-COCO images may limit diversity compared to real-world video distributions.

Limitations & Failure Modes
  1. ENGINEERING: Synthetic video quality depends on underlying generative models (Wan 2.2) which may contain artifacts or unrealistic motion patterns
  2. FUNDAMENTAL: Sequential conditioning pipeline means errors can propagate (bad caption → bad video → bad masks)
  3. EVALUATION: Limited scale evaluation (only 5K synthetic videos) and narrow comparison baselines
  4. ENGINEERING: Requires multiple large models making data generation computationally expensive
  5. EVALUATION: No comparison to other synthetic data generation approaches or analysis of what specific aspects of synthetic data drive improvements
  6. FUNDAMENTAL: VQA training strategy improvement over captions may be task-specific and not generalize broadly

    Failure modes:

    • Pipeline may generate temporally inconsistent videos when caption-image alignment is poor
    • Object tracking for mask propagation likely fails during occlusions or rapid motion changes

ViserDex: Visual Sim-to-Real for Robust Dexterous In-hand Reorientation

Authors: Arjun Bhardwaj, Maximum Wilder-Smith, Mayank Mittal, Vaishakh Patil et al. (5 authors) · Institution: ETH Zurich, NVIDIA · Category: cs.RO

Integrates 3D Gaussian Splatting with novel pre-rasterization augmentations to enable efficient sim-to-real transfer for monocular RGB-based dexterous in-hand object reorientation.

Practical Takeaway: As a research engineer, the key takeaway is that 3D Gaussian Splatting can be practically integrated into RL training pipelines for vision-based manipulation. The pre-rasterization augmentation technique is worth implementing - it provides a computationally efficient alternative to expensive ray-traced domain randomization. The structured clustering approach (spatial, color-based, global) for perturbing Gaussian coefficients is a clever way to generate realistic visual diversity. However, the approach still requires object-specific training and high-quality 3D reconstructions, so consider this for applications where you can control the object set rather than expecting generalization to arbitrary objects.

Tags: robotics dexterous_manipulation sim_to_real 3d_gaussian_splatting computer_vision reinforcement_learning pose_estimation domain_randomization


Task & Setting

The paper addresses monocular RGB-based in-hand object reorientation using dexterous robotic hands. Current solutions require multi-camera setups, expensive ray tracing, or complex tactile systems, limiting their practical deployment. The challenge is bridging the visual sim-to-real gap while maintaining computational efficiency.

The task is defined as goal-conditioned in-hand reorientation where the robot must rotate objects to target orientations using only monocular RGB vision. Input: RGB images from wrist-mounted camera, current object pose, target pose. Output: 16-DOF joint position commands for Allegro Hand. The objective is formulated as a POMDP with reward function combining alignment with goal orientation, success bonuses, and action smoothness penalties.

Success is measured by consecutive successes (CS): the number of goals reached before the object is dropped, with a success threshold of 0.4 radians of orientation error. Pose estimation is evaluated using Average Distance of Model Points (ADD) and strict accuracy (< 10mm translation error and < 10° rotation error).
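The evaluation quantities in this paragraph can be sketched as follows. This is a minimal illustration under assumptions, not the paper's evaluation code: rotations are 3×3 nested lists, translations are in millimetres, and all function names are hypothetical.

```python
import math


def rot_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def transform(R, t, p):
    """Apply the rigid transform (R, t) to a 3D point p."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def add_metric(R_gt, t_gt, R_pred, t_pred, model_points):
    """ADD: mean distance between model points under the two poses."""
    total = sum(
        math.dist(transform(R_gt, t_gt, p), transform(R_pred, t_pred, p))
        for p in model_points
    )
    return total / len(model_points)


def rotation_error(R_gt, R_pred):
    """Geodesic angle in radians of the relative rotation R_gt^T R_pred."""
    trace = sum(R_gt[k][i] * R_pred[k][i] for i in range(3) for k in range(3))
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))


def is_strictly_accurate(R_gt, t_gt, R_pred, t_pred,
                         max_trans_mm=10.0, max_rot_deg=10.0):
    """Strict accuracy: both translation and rotation error under threshold."""
    rot_deg = math.degrees(rotation_error(R_gt, R_pred))
    return math.dist(t_gt, t_pred) < max_trans_mm and rot_deg < max_rot_deg


def goal_reached(R_goal, R_obj, threshold_rad=0.4):
    """Success test for one reorientation goal."""
    return rotation_error(R_goal, R_obj) < threshold_rad


R_gt, t_gt = rot_z(0.0), [0.0, 0.0, 0.0]
R_pred, t_pred = rot_z(math.radians(5.0)), [3.0, 0.0, 0.0]
print(is_strictly_accurate(R_gt, t_gt, R_pred, t_pred))  # True
print(goal_reached(R_gt, rot_z(0.5)))                    # False
```

Consecutive successes would then be a count of how many goals in a row pass `goal_reached` before a drop ends the episode.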

The paper evaluates on 5 diverse objects (Cube, Globe, 3D Printed Toy, Tablet Bottle, Rubber Duck) under both nominal and adversarial lighting conditions to test visual robustness.

Architecture & Method

The system has three main components:

  1. Teacher Policy Training: PPO-based RL policy $\pi_\theta(a_t|o_t, g_t)$ with privileged state access trains on full object dynamics, contact forces, and ground-truth pose

  2. Student-Teacher Distillation: Recurrent belief encoder-decoder architecture where encoder $f_\phi$ updates belief state $z = f_\phi(o^{noisy}_{prop}, o^{noisy}_{exte})$, decoder reconstructs privileged observations, and control head outputs actions. Training uses composite loss $L = L_{BC} + \lambda L_{recon}$

  3. 3D Gaussian Splatting Integration: Objects represented as 3D Gaussians with spherical harmonic coefficients. Color computation:

    \[c(d) = \text{Sigmoid}\left(\sum_{\ell=0}^{L} \sum_{m=-\ell}^{\ell} k^m_\ell Y^m_\ell(d)\right)\]

  4. Pre-rasterization Augmentations: Novel domain randomization by perturbing Gaussian SH coefficients in structured clusters (spatial, color, random noise, global shift) before rendering

  5. Pose Estimator: ResNet-34 backbone predicts 9 keypoints as normalized 2.5D coordinates, resolved to 6D pose via Procrustes algorithm

    Core technical contribution is the pre-rasterization augmentation strategy that enables physically consistent visual domain randomization without expensive ray tracing.
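The color formula and the idea of perturbing SH coefficients before rendering can be illustrated with a small sketch. It is not the paper's implementation and makes several assumptions: the expansion is truncated at degree 1, the sign convention of the degree-1 basis is one common choice among several, cluster IDs are given rather than computed, and only the degree-0 (DC) coefficient receives a Gaussian-distributed offset. `sh_color` and `augment_by_cluster` are hypothetical names.

```python
import math
import random

C0 = 0.5 / math.sqrt(math.pi)           # value of the constant basis Y_0^0
C1 = math.sqrt(3.0 / (4.0 * math.pi))   # scale of the degree-1 real SH basis


def sh_basis_deg1(d):
    """Real SH basis up to degree 1 for a unit view direction d = (x, y, z)."""
    x, y, z = d
    return [C0, C1 * y, C1 * z, C1 * x]  # order: (0,0), (1,-1), (1,0), (1,1)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def sh_color(coeffs, d):
    """coeffs: 3 channels x 4 coefficients k. Returns RGB values in (0, 1)."""
    basis = sh_basis_deg1(d)
    return [sigmoid(sum(k * y for k, y in zip(ch, basis))) for ch in coeffs]


def augment_by_cluster(gaussians, cluster_ids, sigma=0.5, seed=0):
    """Add one shared random offset per cluster to each Gaussian's DC term."""
    rng = random.Random(seed)
    offsets = {}
    augmented = []
    for coeffs, cid in zip(gaussians, cluster_ids):
        if cid not in offsets:
            offsets[cid] = [rng.gauss(0.0, sigma) for _ in range(3)]
        augmented.append(
            [[ch[0] + offsets[cid][c]] + list(ch[1:])
             for c, ch in enumerate(coeffs)]
        )
    return augmented


# Three Gaussians with identical coefficients; the first two share a cluster.
gaussians = [[[0.2, 0.0, 0.1, 0.0] for _ in range(3)] for _ in range(3)]
augmented = augment_by_cluster(gaussians, cluster_ids=[0, 0, 1])
view_dir = (0.0, 0.0, 1.0)
for coeffs in augmented:
    print([round(v, 3) for v in sh_color(coeffs, view_dir)])
```

Because the offset is applied to the coefficients rather than to rendered pixels, Gaussians in the same cluster shift together and the change stays consistent across viewpoints, which is the property the summary attributes to pre-rasterization augmentation.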

Training Recipe
  1. Teacher RL Training: PPO with 24,576 parallel environments, performance-based curriculum (action latency, penalty scaling, time window), 26 hours on RTX 4090 for simple objects, 90 hours on dual-GPU for complex objects

  2. Student Distillation: Online DAgger variant with 4,096 environments, composite loss combining behavior cloning and belief state reconstruction, 16 hours on RTX 4090

  3. Pose Estimator Training: ResNet-34 with ImageNet pretraining, synthetic data from 3DGS rendering with pre-rasterization augmentations, trained on rollouts from teacher policy

    Hardware: Consumer-grade GPUs (RTX 4090/6000 Ada), significantly more efficient than prior work requiring 8x A40 GPUs. Specific optimizer, learning rates, and batch sizes not reported.
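The composite distillation objective used in the student stage, $L = L_{BC} + \lambda L_{recon}$, can be written out as a short sketch. Using mean squared error for both terms and the default value of `lam` are assumptions, since the summary does not report them; `distillation_loss` is a hypothetical name.

```python
def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_action, teacher_action,
                      decoded_privileged, true_privileged, lam=0.5):
    """L = L_BC + lam * L_recon, with both terms as MSE (assumed)."""
    l_bc = mean_squared_error(student_action, teacher_action)
    l_recon = mean_squared_error(decoded_privileged, true_privileged)
    return l_bc + lam * l_recon


loss = distillation_loss(
    student_action=[0.1, 0.2], teacher_action=[0.0, 0.2],
    decoded_privileged=[1.0], true_privileged=[0.0],
)
print(round(loss, 3))  # 0.505
```

The reconstruction term pushes the belief state to retain information about the privileged observations, while the behavior cloning term ties the student's actions to the teacher's.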

Novelty & Lineage

Prior Work:

  • DeXtreme (Handa et al. 2023): Multi-camera RGB setup with expensive ADR requiring massive compute clusters
  • OpenAI (Andrychowicz et al. 2020): Proprioceptive-based in-hand manipulation with simple objects
  • SplatSim (Qureshi et al. 2024): Basic 3DGS integration for manipulation but limited domain randomization

Delta: This paper introduces pre-rasterization augmentation - structured perturbations of 3D Gaussian spherical harmonic coefficients before rendering. Unlike post-processing augmentations or expensive ray-traced domain randomization, this approach generates physically consistent visual diversity by clustering Gaussians spatially and photometrically.

Applied Assessment:

  • Architectural idea: The pre-rasterization augmentation concept is novel and non-obvious, providing structured control over scene appearance without ray tracing overhead
  • Benchmark gains: Meaningful improvements over baselines (65.4% vs 55.6% accuracy), with larger gains on complex objects
  • Fair comparisons: Same network architecture and training protocols across baselines, though limited to their own objects
  • Generalizability: Results likely depend on quality of Gaussian reconstructions and may not scale to arbitrary objects without additional 3D capture

Verdict: SIGNIFICANT — The pre-rasterization augmentation approach is a clever engineering insight that makes high-fidelity visual RL training practical on consumer hardware, with demonstrated real-world transfer across diverse objects.

Benchmarks & Results
  1. Pose Estimation (Nominal): Ours 65.4% accuracy vs DR Tiled 55.6%, Standard Tiled 53.3%, ADD error 10.2mm vs 12.2mm, 12.1mm respectively

  2. Pose Estimation (Adversarial): Ours 56.3% accuracy vs DR Tiled 47.2%, Standard Tiled 40.8%, ADD error 12.9mm vs 14.0mm, 18.3mm respectively

  3. Hardware Deployment (Nominal): Mean 37.6 consecutive successes vs DeXtreme 27.8 on Cube object, ranging from 12.6 (Tablet Bottle) to 87.6 (Globe)

  4. Hardware Deployment (Adversarial): Mean 25.4 consecutive successes under severe lighting conditions

  5. Computational Efficiency: 1.6× faster rendering than Isaac Lab, 12GB vs 34GB VRAM for 1024 environments

    Mixed results show strong performance on geometric primitives but challenges with objects having unmodeled surface properties (Tablet Bottle). Missing comparisons to other recent vision-based manipulation methods beyond DeXtreme.

Compute & Efficiency
  1. Model Size: Not explicitly reported for pose estimator (ResNet-34 backbone), teacher/student policy parameters not specified

  2. Training Compute: Teacher training takes 26 hours on an RTX 4090 for simple objects and 90 hours on a dual-GPU setup for complex objects; student distillation takes 16 hours on an RTX 4090; rendering is 1.6× faster than Isaac Lab

  3. Inference Speed: Pose estimator runs at ~18 Hz, policy inference at 30 Hz, significantly faster than FoundationPose (4 Hz)

  4. Memory Footprint: 12GB VRAM for 1024 parallel environments vs 34GB for standard rendering, major improvement in memory efficiency

  5. Deployment Practicality: Runs entirely on consumer-grade hardware (RTX 4090/6000 Ada), eliminates need for compute clusters, making approach accessible for research labs

Real-World Applicability
  1. Hardware Validation: Deployed on 16-DOF Allegro Hand with wrist-mounted Intel RealSense D435i camera

  2. Real Objects: Tested on 5 diverse objects from simple geometries (Cube, Globe) to complex non-convex shapes (3D Printed Toy, Rubber Duck, Tablet Bottle)

  3. Environmental Robustness: Demonstrated functionality under both nominal white lighting and severe adversarial conditions (low illumination, dynamic color shifts, specular highlights)

  4. Performance Metrics: Averaged 37.6 consecutive successful reorientations under nominal lighting and 25.4 under adversarial lighting, with some trials exceeding 200 consecutive successes

  5. System Integration: Uses SAM2 for real-time object segmentation, 30 Hz policy control with 300 Hz low-level joint control

    Notable limitation is sim-to-real gap for objects with unmodeled surface properties (e.g., low-friction Tablet Bottle label), suggesting approach works best when simulation physics reasonably matches reality.

Limitations & Failure Modes
  1. Unmodeled Surface Properties (FUNDAMENTAL): Performance degrades significantly for objects with unmodeled friction effects, particularly Tablet Bottle with low-friction label surface

  2. Object-Specific Training (ENGINEERING): Requires separate pose estimator and policy training for each object, limiting generalization to novel objects

  3. 3D Gaussian Quality Dependence (ENGINEERING): Method relies on high-quality Gaussian reconstruction from tools like Polycam, may fail with poor initial 3D captures

  4. Occlusion Limitations (FUNDAMENTAL): Inherent constraint of monocular RGB - cannot observe occluded object regions, limiting pose estimation during heavy finger occlusions

  5. Lighting Entanglement (FUNDAMENTAL): 3DGS inherently entangles geometry and illumination, limiting independent control of material properties despite augmentation strategies

    Failure Modes:

    • Catastrophic pose estimation failures during rapid object motion or severe occlusions
    • Object dropping when surface friction differs significantly from simulation assumptions