Mar 29, 2026 Applied AI 5 papers

Applied AI Digest — Mar 29, 2026

Today’s Digest at a Glance

Today’s papers explore video generation memory optimization, few-shot learning in vision-language models, robotic photography agents, software engineering reinforcement learning, and medical AI adversarial robustness.

KV Cache Partitioning for Long Sequences

Long sequence generation in autoregressive models faces a fundamental memory bottleneck: the key-value (KV) cache grows linearly with sequence length, quickly exhausting GPU memory. Naive approaches either truncate context (losing information) or use uniform compression (degrading quality). KV cache partitioning addresses this by recognizing that different parts of the sequence have different importance patterns and compression requirements.

The core idea partitions the cache into three regions with distinct treatment strategies. Sink tokens (typically early sequence positions) are kept at full resolution since they often contain critical context that later tokens reference heavily. Mid tokens undergo aggressive compression since they’re less frequently accessed but still needed for coherence. Recent tokens are maintained at high fidelity as they’re most relevant for immediate generation. Mathematically, if the full cache has size $T \times d$, partitioning creates regions $[1, s]$, $[s+1, T-r]$, and $[T-r+1, T]$ with compression ratios $1$, $c$, and $1$ respectively, reducing memory from $O(T \cdot d)$ to $O(s \cdot d + \frac{(T-s-r) \cdot d}{c} + r \cdot d)$.

This is like managing a library where you keep the most important reference books in full (sink), compress middle sections into summaries (mid), and maintain recent acquisitions in detail (recent).
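
As a quick numeric check of the memory formula above, the sketch below counts effective cache entries for a partitioned cache. The function name and the example sizes are illustrative assumptions, not values from any paper in this digest.

```python
def partitioned_cache_tokens(T: int, s: int, r: int, c: float) -> float:
    """Effective token count when the middle region is compressed by a factor c.

    T: total sequence length, s: sink tokens, r: recent tokens.
    Sink and recent regions are kept at full resolution (ratio 1).
    """
    if s + r > T:
        raise ValueError("sink and recent regions must fit inside the sequence")
    mid = T - s - r
    return s + mid / c + r


# Hypothetical sizes chosen only to make the arithmetic easy to follow.
full = 100_000
kept = partitioned_cache_tokens(T=full, s=1_000, r=3_000, c=32)
print(f"kept {kept:.0f} of {full} entries, {full / kept:.1f}x smaller")
```

Because the sink and recent regions are stored uncompressed, the overall saving is always smaller than the mid-region ratio $c$.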

Attention Head Selection and Ensembling

Large vision-language models contain hundreds of attention heads, but not all heads contribute equally to downstream tasks - some may even hurt performance by attending to irrelevant features. Traditional approaches either use all heads uniformly or apply coarse-grained selection. Attention head selection addresses this by identifying and ensembling only the most discriminative heads for each specific task.

The technique ranks heads by a per-head score of their contribution to the task. One option is a gradient-based saliency score, $\|\nabla_{\mathbf{a}_h} \mathcal{L}\|_2$, where $\mathbf{a}_h$ represents the attention weights of head $h$ and $\mathcal{L}$ is the task loss. The paper covered in this digest instead uses a training-free score from Gaussian Discriminant Analysis (GDA) fit on a small labeled support set. The top-$k$ heads are then selected and their outputs ensembled, either by simple averaging or with weights $w_i$, so that the final representation is $\sum_{i=1}^k w_i \mathbf{h}_i$, where $\mathbf{h}_i$ are the selected head outputs.

This is like assembling an expert committee where you only invite the specialists most relevant to your specific question, rather than polling everyone.
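
The select-then-ensemble step can be sketched in a few lines of NumPy. The per-head scores are taken as given here (any ranking criterion could produce them), and the function name, shapes, and uniform default weights are assumptions made for illustration.

```python
import numpy as np


def select_and_ensemble(head_outputs, scores, k, weights=None):
    """Pick the k highest-scoring heads and combine their outputs.

    head_outputs: (H, d) array, one output vector per head.
    scores: (H,) array, higher means more useful for the task.
    weights: optional (k,) mixing weights; defaults to a uniform average.
    Returns (indices of selected heads, ensembled vector of shape (d,)).
    """
    head_outputs = np.asarray(head_outputs, dtype=float)
    scores = np.asarray(scores, dtype=float)
    top = np.argsort(-scores, kind="stable")[:k]  # best score first
    if weights is None:
        weights = np.full(k, 1.0 / k)
    weights = np.asarray(weights, dtype=float)
    return top, weights @ head_outputs[top]


outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
top, fused = select_and_ensemble(outputs, scores=[0.1, 0.9, 0.5, 0.2], k=2)
print(top, fused)
```

Passing explicit `weights` gives the weighted form $\sum_i w_i \mathbf{h}_i$; leaving them out reduces to a plain average of the selected heads.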

Monte Carlo Tree Search for Adversarial Generation

Monte Carlo Tree Search (MCTS) has been covered previously in game-playing contexts, but its application to adversarial prompt generation represents a distinct algorithmic approach. Flow matching (covered previously) models generative paths as ODEs from noise to data.

Reading Guide

PackForcing and the attention head selection paper both tackle efficiency in large models - one through memory optimization for long sequences, the other through selective computation. PhotoAgent demonstrates how analytical reasoning can complement learned representations in robotics, while Composer 2 shows the continued importance of domain-specific training even for foundation models. The medical VLM robustness paper serves as a crucial reminder that apparent model capabilities may mask fundamental vulnerabilities to simple input perturbations.

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Authors: Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng et al. (7 authors) · Institution: Shanda AI Research Tokyo, Fudan University · Category: cs.CV

PackForcing enables 2-minute autoregressive video generation on a single GPU by partitioning the KV cache into sink/compressed/recent tokens, with dual-branch spatiotemporal compression achieving 32× token reduction.

Practical Takeaway: The key insight is hierarchical KV cache management for long sequence generation. The three-partition design (sink/compressed-mid/recent) with learned compression could be valuable beyond video - any autoregressive generation task with long contexts and redundant intermediate states. The dual-branch compression (structural + semantic) is a practical technique for extreme token reduction while preserving attention-relevant information. Research engineers working on long-context generation should consider this partitioned caching approach, especially the insight that different parts of the history serve different roles and can be treated with different compression strategies.

Tags: video-generation autoregressive-models memory-optimization diffusion-models KV-cache long-video attention-mechanisms temporal-consistency

Task & Setting

This paper addresses the challenge of generating high-quality long videos (up to 2 minutes) with autoregressive video diffusion models. In practice, existing methods are bottlenecked by linear KV-cache growth that makes minute-scale generation memory-prohibitive, and by error accumulation that degrades quality over time.

The task is autoregressive video generation: given a text prompt and initial frames, generate a sequence of video blocks where each new block conditions on previously generated content via cached key-value pairs. Input is text conditioning and noise; output is 832×480 video at 16 FPS for durations up to 120 seconds.

The core technical challenge is the memory bottleneck: for a 2-minute video, the KV cache grows to ~749K tokens requiring ~138 GB memory across transformer layers, well beyond single GPU capacity. The objective is to generate temporally coherent long videos while maintaining bounded memory usage.
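
A back-of-the-envelope calculation shows how a cache of this size arises. The layer count, hidden size, and 16-bit storage below are assumptions about the backbone that are not stated in this digest; with those values the total lands close to the ~138 GB quoted above.

```python
def kv_cache_bytes(tokens: int, layers: int, hidden: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold one key and one value vector per token per layer."""
    return tokens * layers * 2 * hidden * bytes_per_value


# Assumed backbone shape (not given in the digest): 30 layers, hidden size 1536,
# keys and values stored in 16-bit precision.
total = kv_cache_bytes(tokens=749_000, layers=30, hidden=1536)
print(f"{total / 1e9:.1f} GB for the full, uncompressed cache")
```

The cost is linear in the token count, which is why bounding the number of cached tokens bounds memory regardless of video length.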

Success is measured on VBench metrics including Dynamic Degree, Motion Smoothness, Overall Consistency, Subject Consistency, and temporal stability via CLIP scores at 10-second intervals throughout generation.

The paper evaluates on 128 prompts from MovieGen, generating 60s and 120s videos, demonstrating 24× temporal extrapolation from 5-second training clips.

Architecture & Method

Base architecture: Flow matching framework with Wan2.1-T2V-1.3B backbone, using UMT5-XXL text encoder
Three-partition KV cache design:
- Sink tokens: First 8 frames at full resolution, never compressed or evicted
- Mid tokens: Compressed via dual-branch network with ~32× token reduction
- Recent tokens: Most recent frames at full resolution with dual-resolution shifting
Dual-branch compression module:
- HR branch: Progressive 3D convolutions with 128× volume compression (2×8×8)
- LR branch: Pixel-space pooling followed by VAE re-encoding
- Fusion via element-wise addition:
\[\tilde{h} = h_{HR} + h_{LR}\]
Dynamic context selection: Query-key affinity scoring to select top-K most informative mid tokens, computed as:
\[s_m = \sum_{j=1}^{L_k} \sum_{i \in S_q} \left(\frac{1}{B \cdot N_{opt}} \sum_{b=1}^B \sum_{h=1}^{N_{opt}} \frac{Q_{b,h,i}K_{m,b,h,j}^T}{\sqrt{d_h}}\right)\]
Incremental RoPE adjustment for position continuity when tokens are evicted:
\[k'_{sink} = k_{sink} \odot e^{i \theta_t(\delta), 1_h, 1_w}\]
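
The query-key affinity score for dynamic context selection can be sketched as below. This is a minimal NumPy reading of the formula, assuming queries are already subsampled to the set $S_q$ and that mid tokens are grouped into $M$ blocks. The tensor layouts, the function names, and the choice to return selected blocks in temporal order are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def mid_block_affinity(Q, K_mid):
    """Affinity score s_m for each compressed mid block.

    Q: (B, H, Lq, d) sampled queries (the set S_q).
    K_mid: (M, B, H, Lk, d) keys of M mid blocks.
    Returns an (M,) array: scaled dot products averaged over batch and heads,
    then summed over query and key positions.
    """
    d = Q.shape[-1]
    logits = np.einsum("bhid,mbhjd->mbhij", Q, K_mid) / np.sqrt(d)
    return logits.mean(axis=(1, 2)).sum(axis=(1, 2))


def select_top_k(scores, k):
    """Indices of the k highest-scoring blocks, returned in temporal order."""
    top = np.argsort(-np.asarray(scores), kind="stable")[:k]
    return np.sort(top)


rng = np.random.default_rng(0)
Q = rng.standard_normal((1, 4, 8, 16))
K_mid = rng.standard_normal((6, 1, 4, 32, 16))
print(select_top_k(mid_block_affinity(Q, K_mid), k=3))
```

Note that the score uses raw scaled dot products rather than softmax-normalized attention, matching the formula as written above.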

Training Recipe

Initialization: Causal student model initialized from pretrained bidirectional Wan2.1-T2V-1.3B via ODE trajectory alignment
Training stage: Score distillation against frozen bidirectional teacher
- Data: 5-second video clips (20 latent frames), prompts from VidProM with LLM augmentation
- Optimizer: AdamW with β₁=0, β₂=0.999
- Learning rates: 2.0×10⁻⁶ for generator, 1.0×10⁻⁶ for fake score estimator
- Update ratio: 1:5 between generator and score estimator
- Batch size: 8
- Training iterations: 3,000
- Hardware: Not specified
Joint optimization: Dual-branch compression module trained end-to-end with the main model to ensure compressed tokens preserve essential semantic/structural cues
Inference settings: 4 denoising steps, classifier-free guidance scale 3.0, timestep shift 5.0

Novelty & Lineage

Prior work:

DeepForcing (Yi et al. 2025) - Introduced attention sinks and participative compression but ultimately relies on aggressive buffer truncation, causing irreversible loss of intermediate memory
Self-Forcing (Huang et al. 2025) - Autoregressive video generation with self-generated frame conditioning but suffers severe error accumulation beyond training horizon
CausVid, Rolling Forcing, LongLive - Various autoregressive approaches but all lack explicit KV cache compression mechanisms

Delta: This paper’s key contribution is the learned spatiotemporal compression of KV cache via the dual-branch network, achieving 128× volume compression while preserving attention-relevant information. Previous methods either truncate history (losing information) or keep full resolution (memory explosion).

Applied-specific assessment:
- Architectural novelty: The three-partition cache with dual-branch compression is a clear architectural advance over prior token selection/eviction approaches
- Benchmark gains: Substantial improvements in Dynamic Degree (56.25 vs 53.67 best baseline) and temporal consistency (CLIP score decline of only 1.14 vs 6.77 for Self-Forcing)
- Fair comparisons: Uses same backbone (Wan2.1) and evaluation protocol as baselines
- Generalization: 24× temporal extrapolation (5s→120s) demonstrates the approach works without matched training data scale
However, the core insight of hierarchical cache management is somewhat incremental - it’s a principled engineering solution rather than a fundamental algorithmic breakthrough.

Verdict: SIGNIFICANT — The dual-branch compression and three-partition design represent a clear non-obvious advance in memory-efficient long video generation that most engineers working on autoregressive video models should understand.

Benchmarks & Results

Dynamic Degree (60s): 56.25 (ours) vs 53.67 (DeepForcing, previous best), +2.58 improvement
Dynamic Degree (120s): 54.12 (ours) vs 52.84 (DeepForcing), +1.28 improvement
Overall Consistency (60s): 26.07 (ours) vs 25.73 (LongLive, previous best), +0.34 improvement
Overall Consistency (120s): 26.05 (ours) vs 25.95 (LongLive), +0.10 improvement
Subject Consistency (60s): 90.49 (ours) vs 92.55 (DeepForcing, best), -2.06 decrease
CLIP Score temporal stability: 1.14-point decline (ours) vs 6.77-point decline (Self-Forcing)
Memory usage: 4GB bounded KV cache vs unbounded growth in baselines
Temporal extrapolation: 24× (5s training → 120s generation)

Results are mixed - the method excels in motion synthesis and temporal consistency but trails slightly in subject consistency. Some key baselines like Sora or recent commercial models are conspicuously absent from comparisons.

Compute & Efficiency

Model size: 1.3 billion parameters (Wan2.1-T2V-1.3B backbone)
Training compute: Not reported (only mentions 3,000 iterations with batch size 8)
Inference speed: Generates 2-minute 832×480 videos at 16 FPS on a single H200 GPU, with ~0.1% overhead for RoPE adjustment
Memory footprint: Bounded 4GB KV cache regardless of video length vs ~138GB for full attention
Deployment practicality: High - runs on a single H200 GPU, achieves 32× token reduction in mid partition, enables streaming decode with incremental display

The method demonstrates strong practical deployment characteristics with strict memory bounds and single-GPU operation.

Real-World Applicability

Streaming capability: Supports streaming VAE decoding with progressive frame display, reducing time-to-first-frame
Hardware requirements: Demonstrated on a single H200 GPU, so no multi-GPU setup is needed for inference
Resolution and frame rate: Generates 832×480 at 16 FPS, which is reasonable for many applications but not cutting-edge resolution
Zero-shot capability: Can operate zero-shot without training on long videos, using only the compression mechanism

No specific deployment results, production integration details, or real-world user studies are provided. The work focuses on benchmark evaluation rather than production deployment.

Limitations & Failure Modes

ENGINEERING: Fixed compression ratio (32×) could be made adaptive to scene complexity rather than using uniform compression
FUNDAMENTAL: Trade-off between motion richness and subject consistency - achieves high Dynamic Degree (56.25) but lower Subject Consistency (90.49) compared to some baselines (92.55)
EVALUATION: Limited to 832×480 resolution; unclear how the approach scales to higher resolutions like 1920×1080
ENGINEERING: Attention-based importance scoring may not capture all aspects of visual saliency; learned importance predictors could help
FUNDAMENTAL: Still relies on query-key affinity for token selection, which may miss semantically important but low-attention content

Failure modes:
Progressive semantic drift: Without sufficient sink tokens, the model loses global semantics over time
Position discontinuity artifacts: If RoPE adjustment fails, temporal position gaps cause severe frame reset artifacts

Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection

Authors: Adhemar de Senneville, Xavier Bou, Jérémy Anger, Rafael Grompone et al. (5 authors) · Institution: Université Paris-Saclay · Category: cs.CV

Improves LVLM classification by selecting and ensembling discriminative attention heads using GDA ranking combined with hierarchical prompt conditioning.

Practical Takeaway: If you’re working with LVLMs for classification tasks, consider that the internal attention heads may contain better discriminative features than the final output. The key insight is using prompt conditioning (especially domain-specific prompts) combined with Gaussian Discriminant Analysis to rank and ensemble the most discriminative heads. However, the method requires access to internal representations and shows diminishing returns on well-performing benchmarks. Most practical for scenarios where you can modify the LVLM inference pipeline and need modest improvements over existing LVLM classification performance.

Tags: few-shot learning zero-shot classification vision-language models LVLM CLIP attention mechanisms prompt engineering head selection

Task & Setting

LVLMs excel at generative tasks like captioning and VQA but surprisingly underperform at image classification compared to CLIP-based methods, despite often using CLIP-pretrained vision encoders. This gap stems from CLIP’s independent vision-text encoders that bias classification toward class-name matching rather than joint visual-text reasoning.

The task is few-shot and zero-shot image classification. Input: query images and support set of labeled images (K shots per class, N classes). Output: class predictions. The method extracts features from LVLM attention heads and uses Gaussian Discriminant Analysis ranking:

\[s_m^{(v)} = \frac{1}{KN}\sum_{i=1}^{KN} p_{i,m,y_i}^{(v)}\]

where class probabilities follow:

\[p_{i,m,c}^{(v)} = \frac{\exp(\ell_{i,m,c}/\tau)}{\sum_{j=1}^{N}\exp(\ell_{i,m,j}/\tau)}\]
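
The ranking score above is the mean softmax probability that a head assigns to the correct class over the $KN$ support samples. A minimal sketch, assuming the per-head class logits have already been computed (the function name and array shapes are illustrative):

```python
import numpy as np


def head_score(logits, labels, tau=10.0):
    """Mean probability of the true class for one head.

    logits: (n, N) class logits for n = K*N support samples.
    labels: (n,) integer class ids.
    tau: softmax temperature.
    """
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    labels = np.asarray(labels)
    return float(p[np.arange(len(labels)), labels].mean())


uninformative = head_score(np.zeros((4, 2)), [0, 1, 0, 1])
confident = head_score([[100.0, 0.0], [0.0, 100.0]], [0, 1])
print(uninformative, confident)
```

A head whose logits carry no class information scores $1/N$, so higher scores indicate more discriminative heads.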

Success is measured by classification accuracy on 12 datasets including EuroSAT, UCF101, DTD, Caltech101, SUN397, OxfordPets, StanfordCars, Flowers102, Food101, FGVC Aircraft, CUB-200, and Traffic-Signs across zero-shot, few-shot, and vision-text-few-shot settings.

Architecture & Method

Use LVLM (Qwen2-VL or LLaVA-OV) with vision encoder + LLM decoder processing concatenated vision and text tokens
Apply prompt conditioning at three levels:
- Task conditioning: “What is the object in the image?”
- Domain conditioning: “What breed is that dog?”
- Class conditioning: “Between: boxer, yorkshire, beagle or havanese”
Extract attention vectors from all heads using:
\[\mathbf{h}_m = \text{softmax}\left(\frac{\mathbf{q}_m\mathbf{K}_m^\top}{\sqrt{D}}\right)\mathbf{V}_m\]
Rank vision-heads using Gaussian Discriminant Analysis on support set with class-conditional distributions:
\[(\mathbf{H}_m^{(v)}|Y=c) \sim \mathcal{N}(\boldsymbol{\mu}_{m,c}, \boldsymbol{\Sigma}_m)\]
Rank text-heads by zero-shot accuracy on support set using dot products
Create three Head Ensemble Classifiers (HEC):
- HEC-V: averages top vision-heads for few-shot
- HEC-T: averages top text-heads for zero-shot
- HEC-VT: combines HEC-V and HEC-T predictions
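
For class-conditional Gaussians with a shared covariance, as in the distribution above, the classifier is linear in the features: the logit for class $c$ is $\boldsymbol{\mu}_c^\top \boldsymbol{\Sigma}^{-1}\mathbf{x} - \tfrac{1}{2}\boldsymbol{\mu}_c^\top \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c$ plus a log prior. The sketch below fits that per head and averages logits over the selected heads. The ridge term, the uniform class prior, and averaging logits (rather than probabilities) are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np


def fit_gda(X, y, n_classes, ridge=1e-3):
    """Fit shared-covariance Gaussian Discriminant Analysis on one head.

    X: (n, d) head features, y: (n,) integer labels.
    Returns (W, b) such that logits = X @ W.T + b.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mu = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])  # (N, d)
    centered = X - mu[y]
    # Pooled covariance; the ridge keeps it invertible with few shots (assumption).
    sigma = centered.T @ centered / len(X) + ridge * np.eye(X.shape[1])
    W = np.linalg.solve(sigma, mu.T).T   # row c is Sigma^{-1} mu_c
    b = -0.5 * np.sum(W * mu, axis=1)    # uniform class prior assumed
    return W, b


def ensemble_predict(head_features, models):
    """Average per-head logits over the selected heads, then take the argmax."""
    logits = [np.asarray(X, dtype=float) @ W.T + b
              for X, (W, b) in zip(head_features, models)]
    return np.mean(logits, axis=0).argmax(axis=1)


X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
model = fit_gda(X, y, n_classes=2)
print(ensemble_predict([X, X], [model, model]))
```

In practice each selected head would supply its own feature matrix; the same matrix is reused twice here only to keep the example small.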

Training Recipe

Uses frozen pretrained LVLMs (Qwen2-VL 7B, LLaVA-OV 7B) - no additional training required
Training-free method: only requires inference through the LVLM to extract attention head features
Head selection uses support set at test time with fixed hyperparameters: τ=10 temperature, top k=20 heads selected
For HEC-VT combination, α parameter swept from 0.1 to 10 during evaluation
Text-heads can be selected once on ImageNet and transferred across datasets
Hardware: single NVIDIA V100 GPU for all experiments

Novelty & Lineage

Prior work:

SAVs (2025): Selected task vectors from sparse attention heads for LVLM few-shot learning using nearest centroid classifier
VLM2Vec (2025): Used instruction conditioning for LVLM embeddings with contrastive training
Training-free CLIP methods: TipAdapter, GDA, ProKeR combined zero-shot and few-shot classifiers for CLIP

Delta: This paper adds:
- Gaussian Discriminant Analysis-based head ranking instead of nearest centroid
- a systematic prompt conditioning hierarchy (task/domain/class)
- separate vision-head and text-head selection strategies

Assessment:
- Architectural idea: Known technique applied to new setting - GDA ranking and prompt conditioning are established, applied to LVLM head selection
- Benchmark gains: Modest but consistent - 2-3% average improvements over SAVs, competitive with but not consistently beating CLIP baselines
- Fair comparisons: Reasonable - uses same prompts and evaluation protocols, though different backbone strengths complicate comparison
- Generalization: Limited evidence - gains appear stronger on less saturated benchmarks, suggesting method may not scale to all domains
Verdict: INCREMENTAL — Solid application of known techniques (GDA ranking, prompt conditioning) to LVLM head selection with modest but consistent gains over recent baselines.

Benchmarks & Results

Vision-Few-Shot (4-shot): HEC-V achieves 82.4% average vs SAVs 80.0% (+2.4 percentage points)
Text-Zero-Shot (10-way): HEC-T achieves 86.2% on Qwen2-VL vs baseline 76.1% (+10.1 percentage points)
Vision-Text-Few-Shot (4-shot): HEC-VT achieves 83.0% vs best CLIP baseline GDA 81.5% (+1.5 percentage points)
Individual dataset performance: HEC methods win on 9/12 datasets in few-shot, but CLIP still wins on saturated benchmarks (>90% accuracy)
Cross-LVLM validation: Tested on both Qwen2-VL and LLaVA-OV with consistent gains
Missing evaluations: Limited to 12 classification datasets, no evaluation on larger-scale datasets like ImageNet-1K full evaluation

Results are mixed - consistent modest improvements over LVLM baselines but doesn’t consistently beat CLIP-based methods across all settings.

Compute & Efficiency

Model size: Uses 7B parameter LVLMs (Qwen2-VL-7B, LLaVA-OV-7B)
Training compute: Zero additional training - only inference compute for head selection on support set
Inference speed: Not reported, but requires extracting features from all attention heads (784 heads across 28 layers) which adds overhead
Memory footprint: Not reported, but must store attention vectors for all heads during ranking
Deployment practicality: Limited by need for intermediate attention head representations, making API-based deployment difficult. More computationally intensive than CLIP due to LVLM inference requirements.

Real-World Applicability

Evaluation data: Uses standard academic benchmarks with clean, curated images
Deployment constraints: Authors acknowledge API-based usage limitation due to need for intermediate representations
Domain transfer: Shows some evidence of text-heads transferring across domains when selected on ImageNet
Scale limitations: Class conditioning limited to small number of classes due to context window constraints
Production considerations: No discussion of computational costs, latency, or real-world deployment scenarios beyond academic benchmarks

Limited evidence of real-world applicability beyond controlled benchmark settings.

Limitations & Failure Modes

Need for intermediate representations (FUNDAMENTAL) - requires access to attention heads, limiting API-based usage
Class conditioning scale limitations (FUNDAMENTAL) - limited to small number of classes due to context window constraints
LVLM inference overhead (ENGINEERING) - more computationally intensive than CLIP models
Support set requirement for text-heads (EVALUATION) - zero-shot text-head ranking actually requires labeled support set
Limited evaluation scope (EVALUATION) - only 12 academic datasets, no large-scale real-world evaluation

Failure modes:
- Performance saturates on benchmarks with >90% accuracy where CLIP methods still win
- Method appears less effective on object-centric vs. scene/texture classification tasks

PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding

Authors: Lirong Che, Zhenfeng Gan, Yanbo Chen, Junbo Tan et al. (5 authors) · Institution: Tsinghua University · Category: cs.CV

PhotoAgent combines LMM reasoning with analytical geometric solving and 3DGS simulation to translate photography instructions into camera poses, achieving superior aesthetic quality through “mental simulation” rather than physical trial-and-error.

Practical Takeaway: If you’re working on embodied AI or robotic manipulation, the key insight here is using analytical geometric constraint decomposition instead of direct pose regression when bridging language and spatial control. The anchor-point hypothesis for simplifying complex scenes is practical, and the combination of LMM reasoning with 3DGS simulation could be adapted to other spatial reasoning tasks. However, the static scene requirement limits immediate applicability - consider this approach for scenarios with controlled environments rather than dynamic real-world deployment.

Tags: robotics computer_vision multimodal_learning 3d_gaussian_splatting embodied_ai photography spatial_reasoning human_robot_interaction

Task & Setting

Robotic photography systems need to bridge the semantic gap between high-level language commands (e.g., “take a dramatic photo”) and low-level geometric control (6-DoF camera poses). This is challenging because aesthetic goals are subjective and must be translated into precise spatial positioning while avoiding costly physical trial-and-error.

Input: Natural language photography instructions and RGB observations from a robot-mounted camera. Output: Optimized camera poses that produce aesthetically pleasing photographs following the instructions. The task involves multi-modal reasoning over visual scenes, geometric constraint solving, and iterative pose refinement.

Success is measured by:

- spatial reasoning capability on graduated difficulty tasks (centering objects, multi-object composition)
- aesthetic quality via human evaluation using Mean Opinion Score (MOS) on a 5-point Likert scale
- instruction adherence measured by human preference in paired comparisons

The paper introduces evaluation on 8 real and simulated photography scenarios with 100 human evaluators rating aesthetic quality and instruction-following capability.

Architecture & Method

Intention Parsing Module: Uses Large Multimodal Model (GPT-4.1) with “Anchor-Point Hypothesis” to select single principal subject and decompose aesthetic goals into geometric constraints: target image coordinates $(u^*, v^*)$, scale ratio $s$, azimuth $\theta$, elevation $\phi$.
Geometric Solver: Analytically converts constraint vector $g = (u^*, v^*, s, \theta, \phi)$ to 6-DoF pose via:
\[\rho = \frac{\rho_0}{s}\]
\[p_c = \begin{bmatrix} \rho \cos \phi \sin \theta \\ \rho \sin \phi \\ \rho \cos \phi \cos \theta \end{bmatrix}\]
Visual Servoing Refinement: Projects subject center and corrects pixel error:
\[\begin{bmatrix} \Delta\theta_{yaw} \\ \Delta\phi_{pitch} \end{bmatrix} = -\lambda J^{-1} e\]
3D Gaussian Splatting World Model: Uses AnySplat for real-time photorealistic rendering from 5-7 views.
Reflective Reasoning Loop: Iteratively samples candidate poses in spherical coordinates, renders via 3DGS, scores with LMM critic, and refines until convergence.
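
The analytic part of the geometric solver amounts to a spherical-to-Cartesian conversion. The sketch below covers only the position component, assuming angles in radians, a y-up frame centered on the subject, and $\rho_0$ as the reference distance; these conventions and the function name are assumptions, and orientation and visual-servoing refinement are omitted.

```python
import math


def camera_position(rho0: float, s: float, theta: float, phi: float):
    """Camera position relative to the subject.

    rho0: reference distance, s: desired scale ratio (larger s moves closer),
    theta: azimuth, phi: elevation, both in radians.
    """
    if s <= 0:
        raise ValueError("scale ratio must be positive")
    rho = rho0 / s
    return (
        rho * math.cos(phi) * math.sin(theta),
        rho * math.sin(phi),
        rho * math.cos(phi) * math.cos(theta),
    )


# Doubling the subject's apparent size halves the distance.
print(camera_position(rho0=2.0, s=2.0, theta=0.0, phi=0.0))
print(camera_position(rho0=2.0, s=1.0, theta=math.pi / 4, phi=math.pi / 6))
```

Whatever the angles, the returned point lies at distance $\rho$ from the subject, so scale and viewing direction are controlled independently.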

Core contribution: First system to combine LMM reasoning with analytical geometric solving and 3DGS-based “mental simulation” for photography.

Training Recipe

No Model Training: Uses pre-trained GPT-4.1 multimodal model without fine-tuning.
3DGS Scene Reconstruction: Captures 5-7 views around subject, reconstructs scene via single AnySplat forward pass (seconds-level reconstruction time).
Prompt Engineering: Structured chain-of-thought prompts for intention parsing and reflective reasoning, no gradient-based optimization.
System Integration: GroundingDINO for object detection, MediaPipe FaceMesh for portrait landmarks, VINS-Fusion for odometry.

Hardware: NVIDIA RTX 4090D for simulation, RTX 4070 Laptop GPU for real robot deployment.

Wall-clock time: Reflective loop converges in 2-3 iterations with millisecond-level 3DGS rendering.

Novelty & Lineage

Prior Work:

PhotoBot (2024): Retrieves reference photos from gallery and uses PnP-style pose solving to mimic composition - limited to template matching.
AutoPhoto (AlZayer et al., 2021): Uses reinforcement learning for viewpoint optimization but requires environment-specific interaction data.
ReAct/Reflexion (Yao et al., 2023): Language agent frameworks with thought-action loops, but limited to symbolic domains.

Delta: This paper adds:
- analytical geometric constraint solving instead of direct 6-DoF regression
- 3DGS-based visual simulation for closed-loop refinement
- anchor-point hypothesis for scene simplification

Applied-Specific Assessment:
- Architectural novelty: Moderate - combines existing components (LMMs + 3DGS) but the geometric constraint decomposition is non-obvious
- Benchmark gains: Limited evaluation - only 8 scenarios with 100 human raters, gains are substantial (+1.01 MOS) but scale is small
- Fair comparisons: Baselines are weak (direct 6-DoF prediction), no comparison to PhotoBot or AutoPhoto on same tasks
- Generalization concerns: Method requires 3DGS reconstruction per scene, may not scale to dynamic environments
Verdict: INCREMENTAL - Solid engineering combining known techniques (LMM reasoning + 3DGS) with reasonable geometric constraint formulation, but limited novelty and evaluation scale.

Benchmarks & Results

Spatial Reasoning Tasks: PhotoAgent achieves 100% success rate on Easy (3/3) and Medium (3/3) tasks, 67% on Hard (2/3). Direct-6-DoF baselines achieve 67% on Easy, 0% on Medium/Hard tasks.
Aesthetic Quality (MOS): Baseline 2.87 → PhotoAgent 3.88 (+1.01 improvement). Portraits: 2.86 → 3.82 (+0.96). Still-life: 2.88 → 3.94 (+1.07).
Good-or-Better Rate: 26.8% → 69.9% (+43.1 percentage points improvement).
Instruction Adherence Win Rate: 92.9% overall preference over baseline. Object-centric scenes: 96.2%, portraits: 89.5%.
Per-scene IAWR: Ranges 79-100% across 8 scenarios, all statistically significant after Bonferroni correction.

Missing benchmarks: No comparison to PhotoBot, AutoPhoto, or other robotic photography systems on standardized tasks. Limited to 8 hand-curated scenarios.

Compute & Efficiency

Model size: Uses pre-trained GPT-4.1 (parameters not disclosed by OpenAI), no additional trainable parameters.
Training compute: No training required - uses off-the-shelf models with prompt engineering.
Inference speed: 3DGS rendering at millisecond-level, reflective loop converges in 2-3 iterations. Total time not reported but appears near real-time.
Memory footprint: 3DGS scene reconstruction from 5-7 views, AnySplat uses single forward pass. Specific memory usage not reported.
Deployment practicality: Successfully deployed on mobile robot with RTX 4070 Laptop GPU. Requires scene reconstruction per environment, limiting applicability to static scenes.

Real-World Applicability

Real robot deployment: Tested on Agilex RangeMini2 mobile base with TechRobots TB6-R3 6-DoF arm and Intel RealSense D435i camera.
Real environment testing: 4 out of 8 evaluation scenarios conducted in real laboratory/library settings with human subjects.
Hardware constraints: Successfully runs on laptop-grade GPU (RTX 4070), suggesting practical deployment feasibility.
Scene limitations: Requires static scenes for 3DGS reconstruction (5-7 views, seconds-level build time), not suitable for dynamic environments.
No production deployment: No evidence of long-term deployment or production use cases reported.

Limitations & Failure Modes

Static scene requirement (FUNDAMENTAL): 3DGS reconstruction needs 5-7 static views, cannot handle dynamic scenes or moving subjects.
Scale dependency (ENGINEERING): Only evaluated on 8 scenarios with 100 raters, unclear if results generalize to diverse photography contexts.
Anchor-point brittleness (FUNDAMENTAL): Single-subject assumption may fail in complex multi-subject scenes or when no clear primary subject exists.
3DGS reconstruction overhead (ENGINEERING): Requires scene reconstruction per environment, adding setup time and limiting spontaneous photography.
Limited baseline comparison (EVALUATION): No comparison to existing robotic photography systems like PhotoBot or AutoPhoto.

Failure modes: (1) Complex scenes without clear primary subject where anchor-point hypothesis breaks down, (2) Dynamic scenes where 3DGS reconstruction is impossible.

Composer 2 Technical Report

Authors: Cursor Research, Aaron Chan, Ahmed Shalaby et al. (56 authors) · Institution: Cursor · Category: cs.SE

Composer 2 achieves frontier-level software engineering agent performance through large-scale distributed reinforcement learning on realistic coding tasks, demonstrating that specialized models can outperform general-purpose alternatives when trained with careful domain matching.

Practical Takeaway: This work demonstrates that specialized coding models can achieve frontier performance through careful domain-focused training, even starting from smaller base models. The key insights are: (1) train in environments that exactly match deployment, (2) use realistic evaluation benchmarks derived from actual usage, (3) implement sophisticated distributed RL infrastructure for stability at scale. For engineers, the CursorBench evaluation approach is worth adopting - real-world tasks reveal performance gaps invisible in public benchmarks. The nonlinear length penalty technique could be useful for balancing efficiency vs. capability in other agent settings.

Tags: reinforcement-learning code-generation software-engineering mixture-of-experts distributed-training agent-systems benchmark-evaluation low-precision-training

Task & Setting

Composer 2 addresses the challenge of building specialized AI models capable of autonomous software engineering tasks. Real-world software engineering requires complex multi-step reasoning, long-horizon planning, and the ability to navigate large codebases - capabilities that general-purpose models struggle with. The task involves training an agentic model that can read and edit files, run shell commands, search codebases, and execute long sequences of tool calls to solve software engineering problems.

The input consists of a codebase environment (files and execution container) plus a natural language task prompt. The model must produce a rollout of actions $a_1, \dots, a_T$ where each action makes tool calls and receives responses. Success is measured by the correctness of the final environment state compared to the intended solution.

Performance is reported as accuracy on CursorBench (Cursor's internal benchmark derived from real software engineering sessions), plus public benchmarks such as SWE-bench Multilingual and Terminal-Bench. Additional metrics include token efficiency, inference cost, and code quality.

CursorBench consists of real software engineering problems from Cursor’s engineering team, with tasks requiring median changes of 181 lines across multiple files, significantly more complex than public benchmarks which typically require 7-10 line changes.

Architecture & Method

Base model: Kimi K2.5, a 1.04T parameter / 32B active parameter Mixture-of-Experts model selected via internal evaluations
Continued pretraining: Three phases - bulk training at 32k tokens, long-context extension to 256k tokens, then supervised fine-tuning on coding tasks
Multi-Token Prediction (MTP) layers trained with self-distillation for faster inference via speculative decoding
Asynchronous reinforcement learning: Policy gradient algorithm with multiple samples per group, trained on diverse coding tasks (feature development, debugging, refactoring, etc.)
Self-summarization technique: Chains multiple generations together via summaries to handle long horizons within limited context windows
Behavioral rewards: Nonlinear length penalty encouraging quick solutions on easy tasks while allowing longer reasoning on hard tasks:
\[C^{\text{length}}_{k,q}(x) = \frac{(1 + kx)^{1-q} - 1}{k(1-q)}\]
Tool integration: Access to file editing, shell commands, semantic search, and web search in production-equivalent harness
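
To make the shape of the nonlinear length penalty concrete, here is a minimal Python sketch that evaluates the formula above. The function name `length_penalty`, the sample values of `k` and `q`, and the reading of `x` as a generic length measure are illustrative assumptions; the summary does not state the units of `x` or the values used for Composer 2. The `q = 1` branch uses the analytic limit $\ln(1 + kx)/k$, since the closed form divides by zero there.

```python
import math


def length_penalty(x: float, k: float, q: float) -> float:
    """Evaluate ((1 + k*x)**(1 - q) - 1) / (k * (1 - q)) for x >= 0 and k > 0."""
    if x < 0 or k <= 0:
        raise ValueError("expects x >= 0 and k > 0")
    if math.isclose(q, 1.0):
        # Analytic limit of the formula as q -> 1.
        return math.log1p(k * x) / k
    return ((1.0 + k * x) ** (1.0 - q) - 1.0) / (k * (1.0 - q))


if __name__ == "__main__":
    # Hypothetical settings, chosen only to show how q bends the curve.
    for q in (0.0, 0.5, 1.0, 2.0):
        row = [round(length_penalty(x, k=0.5, q=q), 3) for x in (1, 10, 100)]
        print(f"q={q}: {row}")
```

With `q = 0` the penalty is linear in `x`. For `q > 0` its slope is $(1 + kx)^{-q}$, so each additional unit of length costs less than the one before it.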

Training Recipe

Continued pretraining: Three phases on code-dominated data mix using MXFP8 on NVIDIA B300s with AdamW optimizer. Bulk compute at 32k sequence length, long-context extension to 256k, then SFT phase. Training details and exact scale not fully reported.
Reinforcement learning: Large-scale asynchronous training with independent rollout and training workers. Policy gradient with multiple samples per prompt, single-epoch regime (no prompt reuse), full parameter updates with Adam optimizer. Training data reflects real task distribution with iterative heuristics to upsample harder examples.
Infrastructure: Training across 3 regions for GPU compute, 4 regions for CPU. Fault-tolerant design with rollout-level and group-level checkpointing beyond standard model checkpoints.
Data scale: Not explicitly reported, but training involves hundreds of thousands of environments running simultaneously across multiple Anyrun clusters.

Hardware: NVIDIA B300s for training, geographically distributed inference via Fireworks AI. Wall-clock time not reported.
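
The recipe above mentions policy gradient with multiple samples per prompt but does not spell out how per-sample advantages are computed. One common choice, shown here purely as an assumption in the spirit of group-baseline methods, is to subtract the group's mean reward from each sample's reward. The function name `group_advantages` and the example rewards are hypothetical.

```python
from statistics import mean


def group_advantages(rewards: list[float]) -> list[float]:
    """Center each rollout's reward on the mean reward of its prompt group.

    Assumption: a plain mean baseline with no variance normalization.
    The paper's exact estimator is not given in this summary.
    """
    if not rewards:
        raise ValueError("a group needs at least one rollout")
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


if __name__ == "__main__":
    # Four hypothetical rollouts for one prompt: two solved it, two did not.
    print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```

Rollouts that beat their group's average get a positive advantage and the rest get a negative one, so the update depends only on relative quality within the group.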

Novelty & Lineage

Prior Work:

Composer 1.5 (2024): Previous version achieving 44.2% on CursorBench with similar RL approach but less sophisticated infrastructure
SWE-bench and Terminal-Bench evaluation papers: Established coding agent benchmarks but focus on narrower task distributions
Policy gradient methods for code generation: GRPO and related techniques for RL training on coding tasks

Delta: This paper adds larger-scale asynchronous RL infrastructure spanning multiple regions, novel nonlinear length penalties for adaptive behavior, CursorBench (a realistic evaluation benchmark), sophisticated low-precision training kernels (NVFP4/MXFP8), and router replay for MoE numerical stability.

Applied-specific Assessment:
- Architectural novelty: Limited - primarily scaling known techniques with engineering improvements
- Benchmark gains: Meaningful relative improvement of about 39% (44.2% → 61.3% on CursorBench), competitive with frontier models
- Fair comparisons: Evaluations use consistent harness, though some API model details missing
- Scale dependence: Gains likely tied to large-scale distributed infrastructure and compute resources
Verdict: INCREMENTAL — Solid engineering achievement scaling known RL techniques to coding with sophisticated infrastructure, but core algorithmic contributions are modest refinements rather than fundamental advances.

Benchmarks & Results

CursorBench-3: Composer 2 achieves 61.3%, vs Composer 1.5 at 44.2% (about 39% relative improvement), GPT-5.4 at 63.9%, Opus 4.6 High at 58.2%
SWE-bench Multilingual: Composer 2 scores 73.7%, vs Composer 1.5 at 65.9%, GPT-5.4 at 76.8%, Opus 4.6 High at 75.8%
Terminal-Bench: Composer 2 achieves 61.7%, vs Composer 1.5 at 47.9%, GPT-5.4 at 66.5%, GPT-5.3 Codex at 64.8%

Results are mixed: Composer 2 improves substantially over Composer 1.5 but trails GPT-5.4 on all three benchmarks, by 2.6 points on CursorBench, 3.1 on SWE-bench Multilingual, and 4.8 on Terminal-Bench. The comparisons report both harness-run and self-reported scores where available, which reveals some evaluation inconsistencies.

Compute & Efficiency

Model size: 1.04T total parameters, 32B active parameters (Mixture-of-Experts)
Training compute: NVIDIA B300s across 3 regions, specific GPU hours not reported
Inference speed: Uses Multi-Token Prediction for speculative decoding, significantly faster than baseline but exact latency not quantified
Memory footprint: MXFP8/NVFP4 quantization reduces memory requirements, specific numbers not provided
Deployment practicality: Achieves better cost-efficiency than frontier models while maintaining competitive accuracy - described as “Pareto-optimal trade-off” but positioned as specialized rather than general-purpose replacement
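
Since no memory numbers are reported, a back-of-the-envelope weights-only estimate can help put the parameter counts in context. The sketch below is an assumption-laden illustration, not a figure from the paper: it treats every weight as stored at a single bit-width and ignores block scale factors, activations, optimizer state, and KV cache. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Weights-only storage in gigabytes (1 GB = 1e9 bytes), ignoring all overheads."""
    if num_params < 0 or bits_per_param <= 0:
        raise ValueError("expects num_params >= 0 and bits_per_param > 0")
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    total_params = 1.04e12  # total parameter count quoted above
    for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
        print(f"{label}: ~{weight_memory_gb(total_params, bits):.0f} GB")
```

Real formats that store per-block scale factors need somewhat more than these figures, so they are better read as rough floors than as deployment numbers.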

Real-World Applicability

Production deployment: Model actively used in Cursor’s production environment with same harness and tools used in training
Real codebase testing: CursorBench derived from actual engineering team sessions, not synthetic tasks
Infrastructure validation: Training environments match production with identical tool libraries, backend deployment, and execution containers
Scale validation: Handles hundreds of thousands of concurrent environments across multiple cloud regions during training
Cost analysis: Demonstrates superior cost-per-task compared to API models while achieving competitive accuracy on real developer workflows

Limitations & Failure Modes

Scale dependency: Performance gains likely require large-scale distributed infrastructure - FUNDAMENTAL to approach
Domain specialization: Optimized specifically for coding, may not generalize to other domains - FUNDAMENTAL limitation
Evaluation scope: CursorBench is internal and may not capture all coding scenarios - EVALUATION gap
Benchmark contamination risk: While CursorBench avoids this, public benchmark performance may reflect some overfitting - EVALUATION concern
Infrastructure complexity: Requires sophisticated fault-tolerant distributed systems - ENGINEERING barrier

Failure modes: (1) Long-horizon tasks may still suffer from context limitations despite self-summarization, (2) Model may collapse to inefficient behaviors without careful reward shaping as noted during training

When Minor Edits Matter: LLM-Driven Prompt Attack for Medical VLM Robustness in Ultrasound

Authors: Yasamin Medghalchi, Milad Yazdani, Amirhossein Dabiriaghdam, Moein Heidari et al. (10 authors) · Institution: University of British Columbia · Category: cs.CV

Demonstrates that medical VLMs are vulnerable to LLM-generated minimal prompt edits that preserve meaning but flip ultrasound diagnostic predictions.

Practical Takeaway: This work demonstrates that medical VLMs are highly vulnerable to realistic prompt variations that could occur naturally in clinical settings. For deployment, practitioners should implement prompt robustness testing using similar LLM-driven evaluation frameworks. The successful adversarial examples could be used for data augmentation during training to improve robustness. Priority should be given to low-confidence predictions, which are the most susceptible to attacks. However, the evaluation is limited to multiple-choice tasks, so broader assessment across clinical workflows is needed before deployment.

Tags: adversarial_robustness medical_AI vision_language_models ultrasound prompt_attacks clinical_safety MCTS model_evaluation

arXiv · PDF

Task & Setting

Medical ultrasound interpretation requires consistent analysis by trained radiologists, but suffers from high operator dependence and inter-observer variability. Vision-language models (VLMs) have shown promise for automated ultrasound analysis but remain vulnerable to text-based adversarial attacks through their natural language interfaces.

The task is multiple-choice question answering on ultrasound images. Input consists of ultrasound images paired with natural language questions asking for diagnostic classifications (e.g., benign vs malignant). The objective is to evaluate robustness of Med-VLMs to minimal prompt perturbations that preserve semantic meaning but may flip model predictions.

Success is measured by: (1) attack success rate, the percentage of originally correct predictions flipped by adversarial prompts; (2) post-attack accuracy degradation; and (3) naturalness of the generated adversarial prompts, via perplexity and semantic similarity scores.

The evaluation uses U2-Bench disease-diagnosis subset containing 1,305 ultrasound image-question pairs spanning breast, thyroid, uterus, lung, pancreas, and musculoskeletal imaging.
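
The first two metrics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `attack_metrics` and its inputs are hypothetical, and it assumes only initially correct samples are attacked, so a sample that was wrong before the attack is recorded as wrong afterwards.

```python
def attack_metrics(pre_correct: list[bool], post_correct: list[bool]) -> dict[str, float]:
    """Compute accuracy before/after the attack and the attack success rate.

    pre_correct[i]  -- was sample i answered correctly with the original prompt?
    post_correct[i] -- is sample i still answered correctly after the attack?
    """
    if not pre_correct or len(pre_correct) != len(post_correct):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pre_correct)
    # Only initially correct samples are eligible to be "flipped".
    attacked = [post for pre, post in zip(pre_correct, post_correct) if pre]
    flipped = sum(1 for post in attacked if not post)
    pre_acc = sum(pre_correct) / n
    post_acc = sum(post_correct) / n
    return {
        "pre_accuracy": pre_acc,
        "post_accuracy": post_acc,
        "accuracy_drop": pre_acc - post_acc,
        "attack_success_rate": flipped / len(attacked) if attacked else 0.0,
    }


if __name__ == "__main__":
    # Toy data: 3 of 4 samples start correct, 2 of those 3 are flipped.
    print(attack_metrics([True, True, True, False], [True, False, False, False]))
```

Note that attack success rate is normalized by the initially correct samples, while the accuracy drop is normalized by the full evaluation set, so the two numbers are not directly interchangeable.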

Architecture & Method

LLM-driven adversarial prompt generation: An attacker LLM (Qwen-7B, Qwen-30B, or GPT-4.1 mini) generates minimal edits to original questions using synonym substitution, punctuation changes, and word reordering while preserving semantic meaning.
Monte Carlo Tree Search (MCTS) controller: Frames iterative editing as tree search where nodes represent candidate questions and edges represent edit transitions. Uses Upper Confidence Bound for Trees (UCT) selection criterion:
\[UCT(p, i) = \frac{V_i}{N_i} + c\sqrt{\frac{\ln(N_p + 1)}{N_i}}\]
Target Med-VLM evaluation: Queries target models (MedGemma-4B-IT, LLaVA-Med-7B, QoQ-Med-7B) with adversarial prompts and computes logit margins between ground truth and top competing answers.
Attack success detection: An attack succeeds when the model's prediction flips from the correct answer to an incorrect one. The search terminates on a successful attack or after a maximum of 80 iterations.

The core contribution is demonstrating that realistic, minimal prompt variations can systematically fool medical VLMs without requiring model internals access.
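
The UCT selection rule above can be sketched as follows, where $V_i$ is a child's accumulated value, $N_i$ its visit count, and $N_p$ the parent's visit count. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and treating unvisited children as having infinite score (so they are tried first) is an assumption made to avoid division by zero.

```python
import math


def uct_score(value_sum: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """UCT(p, i) = V_i / N_i + c * sqrt(ln(N_p + 1) / N_i)."""
    if visits == 0:
        # Assumption: unvisited candidates are always explored first.
        return math.inf
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits + 1) / visits)
    return exploit + explore


def select_child(children: list[tuple[float, int]], parent_visits: int, c: float = 1.4) -> int:
    """Return the index of the (value_sum, visits) pair with the highest UCT score."""
    if not children:
        raise ValueError("no children to select from")
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i][0], children[i][1], parent_visits, c),
    )


if __name__ == "__main__":
    # Hypothetical statistics for two candidate edited questions.
    candidates = [(3.0, 4), (0.5, 1)]
    print(select_child(candidates, parent_visits=5))          # exploration bonus included
    print(select_child(candidates, parent_visits=5, c=0.0))   # pure exploitation
```

The exploration constant trades off re-trying edits that already look promising against trying edits that have been sampled only a few times.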

Training Recipe

No model training involved - this is purely an evaluation study using pre-trained models
Target models used as-is: MedGemma-4B-IT, LLaVA-Med-7B, QoQ-Med-7B
Attacker LLMs used as-is: Qwen-7B, Qwen-30B, GPT-4.1 mini
MCTS hyperparameters: maximum 80 iterations, maximum depth 8, exploration constant c=1.4
Evaluation protocol: restrict to samples initially answered correctly, apply perplexity filtering (PPL < 15) using Gemma3-4B-PT as evaluator
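
The perplexity filter in the evaluation protocol can be illustrated with a short Python sketch. It assumes per-token natural-log probabilities for the edited question have already been obtained from the evaluator language model; how those are produced is outside this sketch, and the function names are hypothetical.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def passes_ppl_filter(token_logprobs: list[float], threshold: float = 15.0) -> bool:
    """Keep an edited prompt only if its perplexity is below the threshold."""
    return perplexity(token_logprobs) < threshold


if __name__ == "__main__":
    # Toy log-probabilities: a fluent prompt versus an unnatural one.
    fluent = [math.log(0.5)] * 4
    awkward = [math.log(0.05)] * 3
    print(passes_ppl_filter(fluent), passes_ppl_filter(awkward))
```

Lower perplexity means the evaluator model finds the text more predictable, which the study uses as a proxy for naturalness.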

Novelty & Lineage

Prior work:

General adversarial robustness studies on VLMs focus on explicit jailbreak prompts rather than realistic clinical communication variations.
Med-VLM safety evaluations typically use hand-crafted harmful prompts that are easily recognizable as malicious.
LLM-driven adversarial prompt generation exists in general NLP but hasn’t been systematically applied to medical VLMs.

Delta: This paper applies LLM-driven minimal prompt editing specifically to medical ultrasound VLMs using MCTS-guided search. The key addition is evaluating realistic clinical communication variations rather than explicitly malicious prompts.

Applied assessment: The architectural approach (LLM + MCTS for prompt editing) is a straightforward application of existing techniques to a new domain. The measured effects are substantial (accuracy drops of up to 26 points), but this is expected given that prompt sensitivity in VLMs is well established. The medical domain application provides value, but the core technical contribution is incremental. The evaluation is thorough, but comparisons are primarily across attacker LLM sizes rather than against competing attack methods.

Verdict: INCREMENTAL — solid application of known adversarial techniques to medical VLMs with thorough evaluation, but limited technical novelty.

Benchmarks & Results

U2-Bench disease-diagnosis subset: Pre-attack accuracy ranges from 34.79% (LLaVA-Med) to 42.22% (MedGemma). Post-attack accuracy drops to 11.57%-27.82% depending on the attacker LLM, an absolute degradation of 14.89-26.08 percentage points.
Attack success rate analysis: Qwen-7B most effective attacker, GPT-4.1 mini least effective. Most successful attacks require only 2-3 MCTS edit iterations.
Perplexity-filtered results: After removing low-quality edits (PPL ≥ 15), accuracy degradation remains substantial at 13.41%-28.20% post-attack.
Semantic similarity preservation: Attacked prompts maintain 0.974-0.996 cosine similarity with originals, confirming minimal semantic changes.

Accuracy degrades consistently across all target models and attacker configurations, indicating a broad vulnerability pattern.
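
The semantic similarity scores reported above are cosine similarities between embeddings of the original and edited prompts. As a minimal sketch, the function below computes that similarity for two vectors; the embedding model is not named in this summary, so the vectors in the example are made-up placeholders and `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Placeholder embeddings for an original and a lightly edited question.
    original = [0.20, 0.70, 0.10]
    edited = [0.22, 0.68, 0.12]
    print(round(cosine_similarity(original, edited), 4))
```

Values close to 1 indicate that the edited prompt points in nearly the same direction as the original in embedding space.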

Compute & Efficiency

Model sizes: Target Med-VLMs range 4B-7B parameters. Attacker LLMs: Qwen-7B (7B), Qwen-30B (30B), GPT-4.1 mini (undisclosed)
Training compute: Not applicable - uses pre-trained models only
Inference speed: MCTS requires up to 80 iterations per attack, with each iteration querying both attacker LLM and target Med-VLM. Most attacks succeed within 2-3 iterations.
Memory footprint: Not reported, but requires loading both attacker LLM and target Med-VLM simultaneously
Deployment assessment: Attack framework is computationally expensive for real-time use but demonstrates realistic vulnerability that could occur with simple text variations in clinical practice

Real-World Applicability

Uses clinical ultrasound images from U2-Bench spanning multiple anatomies (breast, thyroid, uterus, lung, pancreas, musculoskeletal)
Generates clinically plausible prompt variations mimicking realistic communication patterns in clinical settings (typos, shorthand, informal phrasing)
No actual deployment experiments or clinical validation reported
Study focuses on multiple-choice QA format which may not reflect typical clinical ultrasound interpretation workflows
Vulnerability assessment relevant for point-of-care ultrasound (POCUS) deployment in decentralized settings where prompts are informal

Limitations & Failure Modes

EVALUATION: Limited to multiple-choice QA format rather than free-form medical report generation used in clinical practice
EVALUATION: Restricted to ultrasound modality and disease diagnosis tasks, not tested on other medical imaging or task types
ENGINEERING: Occasional language confusion in LLM-generated edits (producing non-English text despite English prompts)
FUNDAMENTAL: Perplexity filtering removes many successful attacks, indicating tension between attack effectiveness and naturalness
EVALUATION: No comparison to other adversarial attack methods or defenses

Failure modes:
Models most vulnerable on low-confidence predictions near decision boundaries
Attacks requiring extensive edits become less natural and clinically plausible