May 5, 2026 · Applied AI · 5 papers

Applied AI Digest — May 5, 2026

Today’s Digest at a Glance

Preliminary

Today’s papers focus on improving multimodal reasoning through structured prompting mechanisms, reactive control policies, and failure-aware agent coordination.

Focus-CoT Reasoning

Traditional multimodal reasoning often struggles with connecting visual evidence to logical steps, leading to hallucinations or unsupported claims. The naive approach of simply prompting models to “look at the image” provides insufficient grounding between what the model sees and what it reasons about.

Focus-CoT addresses this by introducing explicit <focus> tags that force the model to cite specific visual evidence for each reasoning step. The mechanism operates through two sub-actions: <ocr> tags for extracting textual content from specific image regions, and <box> tags for localizing and describing visual elements with bounding box coordinates. Mathematically, this creates a structured reasoning trace where each logical step $s_i$ is paired with visual evidence $v_i$, forming tuples $(s_i, v_i)$ that can be validated independently.

Intuitively, Focus-CoT works like a student showing their work on a math problem—every conclusion must be backed by pointing to specific parts of the visual input.
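As a rough illustration of how such a trace could be checked, the sketch below parses a reasoning trace into $(s_i, v_i)$ pairs and flags steps that cite no evidence. The trace layout (one step per line, a JSON-like list inside `<box>`) and the helper names `parse_trace` and `ungrounded_steps` are assumptions made for this example; the digest names the tags but not their exact serialization.

```python
import re

# Hypothetical trace format: the tags are named in the text, the layout is assumed.
TRACE = (
    "Step 1: The legend maps the blue line to 'Revenue'. "
    "<focus><ocr>Revenue</ocr><box>[40, 12, 118, 30]</box></focus>\n"
    "Step 2: Revenue peaks in 2021. "
    "<focus><box>[310, 55, 342, 90]</box></focus>\n"
    "Step 3: So the answer is 2021."
)

FOCUS_RE = re.compile(r"<focus>(.*?)</focus>", re.DOTALL)
OCR_RE = re.compile(r"<ocr>(.*?)</ocr>", re.DOTALL)
BOX_RE = re.compile(r"<box>\s*\[(.*?)\]\s*</box>", re.DOTALL)


def parse_trace(trace):
    """Split a trace into (step_text, evidence) pairs, one per line."""
    pairs = []
    for line in trace.splitlines():
        match = FOCUS_RE.search(line)
        step = FOCUS_RE.sub("", line).strip()
        if match is None:
            # No <focus> block: the step has no visual evidence attached.
            pairs.append((step, None))
            continue
        body = match.group(1)
        evidence = {
            "ocr": OCR_RE.findall(body),
            "boxes": [
                [int(v) for v in coords.split(",")]
                for coords in BOX_RE.findall(body)
            ],
        }
        pairs.append((step, evidence))
    return pairs


def ungrounded_steps(pairs):
    """Return the steps that cite no visual evidence."""
    return [step for step, evidence in pairs if evidence is None]


pairs = parse_trace(TRACE)
```

Here `ungrounded_steps(pairs)` returns only the final step, which is the kind of step an independent validator could reject or send back for grounding.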

Tube-Based Feedback Control

Standard diffusion policies generate fixed action sequences that cannot adapt to unexpected contact dynamics or environmental changes during execution. This rigidity fails in contact-rich manipulation where small perturbations can derail the entire plan.

Tube Diffusion Policy (TDP) solves this by generating “action tubes”—distributions over possible actions at each timestep rather than point estimates. The key innovation is a dual-time formulation that separates diffusion time $t_1 \in [0, T_{\text{diff}}]$ for denoising from trajectory time $t_2 \in [0, 1]$ for real-time execution. During execution, the policy continuously samples from the tube distribution conditioned on current observations, allowing reactive corrections while staying within the learned manifold.

\[a_t \sim \mathcal{N}(\mu_\theta(o_t, t_2), \Sigma_\theta(o_t, t_2))\]

where $\mu_\theta$ and $\Sigma_\theta$ define the action tube at trajectory time $t_2$.

The tube acts like guardrails on a highway—providing structured flexibility that keeps the agent on track while allowing reactive adjustments.
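A toy sketch of the sampling step may make the tube idea concrete. It assumes a diagonal covariance and replaces the learned $\mu_\theta$ and $\Sigma_\theta$ with hand-written stand-ins (`tube_mean`, `tube_std`); neither function comes from the paper, and the closed loop is deliberately simplistic.

```python
import random


def tube_mean(obs, t2):
    """Stand-in for mu_theta: move linearly toward a goal, nudged by the observation."""
    goal = [1.0, 0.5]
    return [t2 * g + 0.1 * o for g, o in zip(goal, obs)]


def tube_std(obs, t2):
    """Stand-in for the diagonal of Sigma_theta: the tube narrows as t2 approaches 1."""
    width = 0.2 * (1.0 - t2) + 0.01
    return [width for _ in obs]


def sample_action(obs, t2, rng):
    """Draw a_t ~ N(mu(o_t, t2), diag(std(o_t, t2)^2))."""
    mean = tube_mean(obs, t2)
    std = tube_std(obs, t2)
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


rng = random.Random(0)
obs = [0.0, 0.0]
actions = []
for step in range(5):
    t2 = step / 4  # trajectory time in [0, 1]
    action = sample_action(obs, t2, rng)
    actions.append(action)
    obs = action  # toy closed loop: the next observation is the last action
```

Because each action is re-sampled from the current observation, a perturbation at one step changes the mean at the next step instead of being ignored until the chunk ends.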

Chain-of-Question Decomposition

Complex visual questions often require multiple reasoning steps and external knowledge that single-shot VQA models cannot handle effectively. Standard approaches either hallucinate missing information or fail to break down multi-faceted queries systematically.

Chain-of-Visual Question Decomposition (CoVQD) combines Chain-of-Thought reasoning (covered previously) with structured question breakdown to guide multi-stage retrieval. The method first decomposes complex questions into sub-questions, then uses each sub-question to retrieve relevant knowledge before synthesizing the final answer. This creates a reasoning chain: $Q \rightarrow \{q_1, q_2, \ldots, q_n\} \rightarrow \{k_1, k_2, \ldots, k_n\} \rightarrow A$, where $Q$ is the original question, $q_i$ are sub-questions, $k_i$ are retrieved knowledge pieces, and $A$ is the synthesized answer.

This approach works like a research process—breaking big questions into manageable pieces, gathering evidence for each piece, then combining insights.
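The chain $Q \rightarrow q_i \rightarrow k_i \rightarrow A$ can be sketched with stand-in components. Everything below is hypothetical scaffolding: `decompose`, `retrieve`, and `synthesize` replace the fine-tuned decomposition model, the retriever, and the answering MLLM, and the two-entry knowledge base is made up for illustration.

```python
# Toy knowledge base; a real system would query a retriever over documents.
KNOWLEDGE = {
    "landmark": "The tower in the image is the Eiffel Tower.",
    "architect": "The Eiffel Tower was designed by Gustave Eiffel's company.",
}


def decompose(question):
    """Stand-in for the decomposition model: Q -> [q_1, ..., q_n]."""
    return [
        "Which landmark is shown in the image?",
        "Who was the architect of that structure?",
    ]


def retrieve(sub_question):
    """Stand-in for retrieval: q_i -> k_i via keyword lookup."""
    for keyword, fact in KNOWLEDGE.items():
        if keyword in sub_question.lower():
            return fact
    return ""


def synthesize(question, sub_questions, facts):
    """Stand-in for the answering model: build the prompt it would receive."""
    lines = [f"Question: {question}"]
    for q, k in zip(sub_questions, facts):
        lines.append(f"- {q} -> {k}")
    return "\n".join(lines)


question = "Who designed the structure in this photo?"
sub_questions = decompose(question)
facts = [retrieve(q) for q in sub_questions]
prompt = synthesize(question, sub_questions, facts)
```

The point of the structure is that each retrieval call is driven by a narrow sub-question instead of the full query, so a failed lookup is visible per sub-question.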

Reading guide: The Chart-FR1 and VQA papers both explore structured visual reasoning but at different scales—Chart-FR1 focuses on grounding individual reasoning steps in visual evidence, while the VQA work decomposes entire questions into sub-problems. The TDP paper addresses a different challenge of reactive control in robotics, showing how diffusion-based policies can be made more adaptive. The FAMA framework tackles failure recovery in tool-using agents, complementing the structured reasoning approaches by handling when those structured processes break down.

Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts

Authors: Hongkun Pan, Yuwei Wu, Wanyi Hong, Shenghui Hu et al. (11 authors) · Institution: Zhejiang University · Category: cs.CV

Chart-FR1 introduces Focus-CoT reasoning that explicitly links visual cues to reasoning steps and trains with reinforcement learning to improve chart understanding, achieving mixed results across benchmarks.

Practical Takeaway: The Focus-CoT mechanism of explicitly linking reasoning steps to visual regions is a useful pattern for dense visual reasoning tasks beyond charts. The key insight is that standard CoT lacks sufficient visual grounding, so augmenting reasoning with <focus> tags containing OCR and bounding boxes can improve performance. However, the gains come at significant training cost (two-stage pipeline, 36k samples) and the improvements are inconsistent across benchmarks. Engineers should consider this approach if they have dense visual reasoning tasks and resources for extensive fine-tuning, but simpler visual grounding methods might be worth trying first.

Tags: chart-understanding visual-reasoning multimodal-rl focus-mechanisms visual-grounding benchmark


Task & Setting

Chart-FR1 addresses the problem of multimodal large language models (MLLMs) struggling with high information density (HID) charts—charts with multiple subplots, dense legends, and complex annotations. This is a significant practical challenge because real-world data visualization increasingly relies on information-dense charts to convey complex insights. Processing HID charts requires fine-grained perception to identify key visual cues and deep compositional reasoning to integrate information across different chart elements.

The task is defined as visual question answering on charts where the input consists of a chart image (various resolutions, up to 2487×1716 pixels) and a natural language question. The output is a text answer that could be numerical, categorical, or descriptive. The model must extract information from visual elements including text labels, legends, data points, axes, and spatial relationships between elements.

Success is measured using standard VQA accuracy metrics, with a relaxed accuracy reward for numerical answers defined as:

\[R_{relaxed\ acc} = \begin{cases} 1.0 & \text{if correctness}(\hat{y}, y) \\ 0 & \text{otherwise} \end{cases}\]

where

\[\text{correctness}(\hat{y}, y) = \begin{cases} \frac{|\hat{y}-y|}{\max(|y|,\mu)} \leq 0.05 & \text{if numerical} \\ \hat{y} = y & \text{otherwise} \end{cases}\]
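The two cases above translate directly into a small function. This is a minimal sketch, not the paper's implementation: the value of `mu` is not given in the digest, so the default below is an assumption, and stripping whitespace before the exact-match comparison is a convenience added here.

```python
def is_correct(pred, target, mu=1e-8, tol=0.05):
    """Relative-error check for numbers, exact match otherwise.

    mu floors the denominator so that targets near zero do not divide by zero;
    its value here is an assumption, not taken from the paper.
    """
    try:
        p, y = float(pred), float(target)
    except (TypeError, ValueError):
        # Non-numerical answers fall back to exact string match.
        return str(pred).strip() == str(target).strip()
    return abs(p - y) / max(abs(y), mu) <= tol


def relaxed_accuracy_reward(pred, target):
    """Return 1.0 when the prediction counts as correct, else 0.0."""
    return 1.0 if is_correct(pred, target) else 0.0
```

With a target of 100, a prediction of 104 is rewarded (4% relative error) while 106 is not (6%).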

The paper introduces HID-Chart, a benchmark with 734 charts and 1561 QA pairs, featuring an average information density of 3.94 across 10 chart types and 8 domains.

Architecture & Method

Base architecture: Qwen2.5-VL-7B multimodal language model
Focus-CoT mechanism: Introduces <focus> tags that explicitly link reasoning steps to visual evidence through two sub-actions:
- OCR text extraction using <ocr> tags
- Local image localization using <box> tags with bounding box coordinates
Automated data synthesis pipeline for Focus-CoT generation:
- Generate initial CoT paths using base model
- Filter by format and evaluate correctness using pass@k metric
- Conditional CoT reconstruction using GPT-5 teacher model to insert focus tags
- Quality filtering with rule-based and LLM-based filters
Focus-GRPO reinforcement learning algorithm with three key components:
- Information-efficiency reward:
\[R_{efficiency} = \exp(-\alpha \cdot P_{redundancy})\]
- Adaptive KL penalty:
\[\beta' = \beta \cdot \frac{1}{1 + \log(1 + N_{info})}\]
- Relaxed-accuracy reward for numerical tolerance
Multi-dimensional reward function:
\[R = R_{relaxed\ acc} + w_1 \cdot R_{format} + w_2 \cdot R_{efficiency}\]
The core technical contribution is the explicit association of reasoning steps with fine-grained visual cues through the Focus-CoT mechanism, combined with reinforcement learning that optimizes visual focusing efficiency while adapting reasoning depth based on visual complexity.
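The three reward-side formulas in this section are simple enough to write out. The sketch below uses the Stage 2 hyperparameters listed in the Training Recipe as defaults ($\alpha=2$, $\beta=10^{-2}$, $w_1=w_2=0.1$). How $P_{redundancy}$ and $N_{info}$ are measured is not described in the digest, so they are treated as plain inputs, and the logarithm is assumed to be natural.

```python
import math


def efficiency_reward(p_redundancy, alpha=2.0):
    """R_efficiency = exp(-alpha * P_redundancy)."""
    return math.exp(-alpha * p_redundancy)


def adaptive_kl_coeff(n_info, beta=1e-2):
    """beta' = beta / (1 + log(1 + N_info)); natural log is an assumption."""
    return beta / (1.0 + math.log(1.0 + n_info))


def total_reward(r_acc, r_format, r_efficiency, w1=0.1, w2=0.1):
    """R = R_relaxed_acc + w1 * R_format + w2 * R_efficiency."""
    return r_acc + w1 * r_format + w2 * r_efficiency
```

Two properties follow from the formulas: the KL coefficient shrinks as more visual cues are cited, loosening the constraint on dense charts, and the accuracy term dominates the total because the format and efficiency terms are each weighted by 0.1.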

Training Recipe

Stage 1 (Cold-start): Supervised fine-tuning on Focus-CoT data
- Data: 6.4k samples from NovaChart, EvoChart, ChartQA processed through automated pipeline
- Optimizer: BAdam, learning rate 2×10⁻⁶, warmup ratio 0.1
- Batch size: 256 global, 1 epoch full parameter fine-tuning
- Hardware: 8 NVIDIA H100 GPUs
Stage 2 (Focus-GRPO): Reinforcement learning optimization
- Data: 30k training samples, 8 rollouts per sample
- Optimizer: AdamW, learning rate 1×10⁻⁶, weight decay 1×10⁻²
- Batch size: 512 global, 3 epochs
- Hyperparameters: β=1×10⁻², α=2, τ=0.9, w₁=w₂=0.1
- Max prompt/response length: 2048 tokens each
- Temperature: 1.0, top-p: 1.0

Wall-clock time not reported. Uses vLLM for inference and VeRL framework for RL training.

Novelty & Lineage

Prior work:

ChartReasoner (2024): Converts charts to ECharts code for reasoning
R1-VL (2024): Uses GRPO for multimodal reasoning with key-step matching
OpenVLThinker (2025): Applies standard GRPO to cross-modal CoT datasets

Delta: This paper adds three specific components:
- Focus-CoT mechanism that explicitly anchors reasoning to visual regions/OCR
- Information-efficiency reward to penalize redundant visual information
- Adaptive KL penalty that adjusts based on number of visual cues
Assessment:
- Architectural novelty: Focus-CoT is a logical extension of CoT with visual grounding—not fundamentally novel but practically useful
- Benchmark gains: The 6.1% average improvement over the base model and 3.1% over GPT-4o are meaningful, but they were achieved through significant engineering
- Fair comparisons: Comparisons appear fair within same model families, though method requires substantial additional training data and compute
- Generalization: Results show the approach works across different base models (3B, 7B, 8B variants)
The adaptive KL penalty and information-efficiency rewards address real limitations in applying RL to dense visual reasoning, but these are engineering solutions rather than fundamental insights.

Verdict: INCREMENTAL — Solid engineering work that combines existing techniques (visual grounding + RL) with domain-specific rewards, achieving meaningful but expected improvements on a challenging problem.

Benchmarks & Results

ChartQA: 91.0% vs previous SOTA 88.3% (InternVL2.5-78B), +2.7% improvement
CharXiv: 46.6% vs previous best 47.1% (GPT-4o), -0.5% (slightly worse)
EvoChart: 59.2% vs previous best 63.9% (GPT-4o), -4.7% (worse)
ChartBench: 75.6% vs previous best 72.3% (GPT-4o), +3.3% improvement
PlotQA: 62.9% vs previous best 62.0% (InternVL2.5-78B), +0.9% improvement
HID-Chart (new): 53.0% vs 51.2% (GPT-4o), +1.8% improvement

Results are mixed: the method shows clear gains on ChartQA, ChartBench, and the new HID-Chart benchmark, but performs worse than GPT-4o on CharXiv and EvoChart. The paper does not adequately address why performance degrades on some benchmarks. The reported 3.1% average improvement over GPT-4o masks significant variance, with the GPT-4o comparisons listed above ranging from -4.7% on EvoChart to +3.3% on ChartBench.

Compute & Efficiency

Model size: 7B parameters (also tested on 3B and 8B variants)
Training compute: 8 NVIDIA H100 GPUs, wall-clock time not reported
Inference speed: Not reported, uses vLLM for serving
Memory footprint: Not specified
Deployment practicality: Moderate - requires two-stage training pipeline with 36.4k total training samples and specialized reward functions. The Focus-CoT format adds overhead to both training and inference. Method demonstrated on consumer-grade model sizes but needs significant additional training compared to base model.

Real-World Applicability

Testing limited to benchmark datasets - no deployment results reported
No hardware experiments or production integration discussed
Charts sourced from scientific publications and websites (2023-2025) suggest some real-world relevance
HID-Chart benchmark constructed from real scientific and social science publications
Method requires charts to be preprocessed for bounding box and OCR extraction - unclear how this scales in practice
No discussion of robustness to chart types not seen during training or varying image qualities/resolutions

Limitations & Failure Modes

Mixed benchmark results: the model trails GPT-4o on CharXiv (-0.5%) and EvoChart (-4.7%) [EVALUATION]
Requires substantial additional training data (36.4k samples) and two-stage training pipeline [ENGINEERING]
Method dependent on quality of bounding box and OCR extraction preprocessing [ENGINEERING]
Information density metric relies on GPT-5 scoring, making evaluation dependent on proprietary model [EVALUATION]
Focus-CoT format increases inference complexity and latency [ENGINEERING]
Limited evaluation on chart types beyond the 10 categories in training data [EVALUATION]

Failure modes:
- Likely to struggle with charts requiring spatial reasoning beyond local regions (e.g., comparing distant data points)
- May fail when OCR extraction is poor or bounding boxes are inaccurate

Large Language Models are Universal Reasoners for Visual Generation

Authors: Sucheng Ren, Chen Chen, Zhenbang Wang, Liangchen Song et al. (8 authors) · Institution: Johns Hopkins University, Apple · Category: cs.CV

UniReasoner improves text-to-image generation by using an LLM to first draft visual plans as discrete tokens, self-critique for prompt violations, then guide diffusion synthesis with explicit corrective signals.

Practical Takeaway: If you’re working on text-to-image generation and struggling with complex prompt adherence, consider implementing the draft-evaluate-diffuse pattern. The key insight is leveraging LLMs’ stronger verification abilities compared to direct generation. The approach is modular - you can keep your existing diffusion backbone and add the reasoning layer. Start with simpler visual token representations before attempting the full SigLIP-based discretization. The method’s effectiveness likely depends on having a capable LLM backend, so prioritize that component.

Tags: text-to-image diffusion-models large-language-models multimodal-reasoning compositional-generation visual-understanding prompt-adherence


Task & Setting

Text-to-image generation aims to synthesize high-quality images from textual descriptions, a task critical for creative applications, content generation, and AI assistants. While modern diffusion models produce photorealistic outputs, they struggle with complex prompts involving multiple constraints (object counts, spatial relations, attributes), even when the same models can accurately evaluate whether an image satisfies those constraints post-hoc.

Given a text prompt p, the task is to generate an image I that is both perceptually high-quality and semantically faithful to the prompt. The input is natural language text of arbitrary complexity, and the output is a rendered image (typically 1024×1024 resolution). The formal objective is:

\[I \sim p(I | p)\]

where the generated image I should maximize both visual quality and prompt adherence.

Success is measured using compositional evaluation benchmarks: GenEval (overall score, single/two objects, counting, colors, position, attribute binding) and DPG-Bench (global, entity, attribute, relation categories). These metrics test fine-grained constraint satisfaction beyond simple image quality.

No new dataset is introduced; the work uses existing text-image pairs for training with synthetic draft-evaluation triplets.

Architecture & Method

LLM Backbone: Qwen serves as the universal reasoner for both drafting and evaluation stages.
Visual Draft Generation: The LLM generates discrete vision tokens d representing a coarse visual plan. SigLIP-2 features are vector-quantized into a codebook of K discrete indices, creating tokens ⟨v₁⟩…⟨vₙ⟩ that encode semantic primitives:
\[d \sim p_\phi(d | p)\]
Self-Critique Evaluation: The same LLM evaluates prompt-draft consistency, producing grounded textual evaluation e that identifies specific mismatches:
\[e = \text{Eval}_\phi(p, d)\]
Joint Diffusion Conditioning: A frozen SANA diffusion model generates the final image conditioned on the triplet (prompt, draft, evaluation):
\[\epsilon_\theta(z_t, t; c(\text{Concat}(p, d, e)))\]
The key technical contribution is converting LLM verification strength into explicit generation-time guidance through the draft-evaluate-diffuse pipeline, rather than treating language as static conditioning.
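The discretization in the Visual Draft Generation step can be illustrated with plain nearest-neighbor vector quantization. This is a generic VQ sketch under toy assumptions (2-D features, a 3-entry codebook, a made-up `<v…>` token spelling); it is not the paper's tokenizer, and real SigLIP-2 features have far more dimensions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def to_tokens(indices):
    """Render indices as placeholder vision tokens (spelling is hypothetical)."""
    return [f"<v{idx}>" for idx in indices]


# Toy 2-D codebook with K = 3 entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
features = [[0.9, 0.1], [0.1, 0.2], [0.2, 0.8]]
indices = quantize(features, codebook)  # [1, 0, 2]
tokens = to_tokens(indices)
```

Once features are reduced to indices, the LLM can emit a draft the same way it emits text: as a sequence over a finite vocabulary.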

Training Recipe

Stage I - Pretraining (60,000 iterations): Uses existing text-image datasets with reconstructed drafts. Images are degraded via pretrained tokenizer to create draft supervision. Target is original high-fidelity image. Grounded evaluation generated by VLM (Qwen-VL) comparing prompt to degraded image.
Stage II - Finetuning (20,000 iterations): Uses hard-negative candidates where FLUX generates poorly-aligned images as drafts, with real images as targets. VLM scores alignment and selects challenging examples.
Optimization: AdamW optimizer, learning rate 5×10⁻⁵ with 1,000-step linear warmup, decay to 1×10⁻⁵. Only LLM and cross-modal connector trained; diffusion backbone frozen.
Hardware/compute: Not reported.
Data filtering: VLM-based alignment scoring to select hard negatives for stage II.
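The learning-rate schedule in the Optimization item can be sketched as a function of the step index. The digest gives the peak (5×10⁻⁵), the warmup length (1,000 steps), and the final value (1×10⁻⁵), but not the decay shape or whether the schedule restarts per stage; the linear decay and the 60,000-step horizon below are assumptions.

```python
def learning_rate(step, total_steps=60_000, warmup_steps=1_000,
                  peak_lr=5e-5, final_lr=1e-5):
    """Linear warmup to peak_lr, then linear decay to final_lr (decay shape assumed)."""
    if step < warmup_steps:
        # Ramp up from peak_lr / warmup_steps to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold final_lr past the horizon
    return peak_lr + (final_lr - peak_lr) * progress
```

A cosine decay would fit the same description; swapping it in only changes the last line.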

Novelty & Lineage

Prior work:

BAGEL (Deng et al., 2025): Unified LLM for both visual understanding and generation, but suffers from understanding-generation gap
LLM-grounded diffusion (Lian et al., 2023): Uses LLMs for prompt rewriting/layout generation in text space only
Reflect-DiT (Li et al., 2025): Iterative VLM critique and regeneration for refinement

Delta: This paper introduces the draft-evaluate-diffuse paradigm, in which:
- the LLM generates visual drafts as discrete tokens rather than text/coordinates
- the same LLM performs self-critique to identify specific prompt violations
- diffusion is conditioned on the joint (prompt, draft, evaluation) in a single pass rather than through iterative refinement

Applied-specific assessment:
- The architectural insight of using discrete visual tokens as intermediate representation is somewhat novel, but vector quantization for vision is well-established
- Benchmark gains are meaningful (+0.09 GenEval, +1.80 DPG-Bench) and hold across categories, though based on single backbone (SANA)
- Comparisons appear fair - same frozen diffusion model isolates the contribution of reasoning framework
- Gains likely depend on capable LLM backbone but approach seems generalizable
Verdict: INCREMENTAL — Solid engineering combining known techniques (VQ, LLM reasoning, diffusion conditioning) in a sensible way, with clear but expected improvements over direct generation.

Benchmarks & Results

GenEval Overall: UniReasoner 0.88 vs previous best GPT-4o 0.84 (+0.04), SANA baseline 0.79 (+0.09)
GenEval Counting: 0.90 vs GPT-4o 0.85 (+0.05), SANA 0.78 (+0.12)
GenEval Position: 0.83 vs GPT-4o 0.75 (+0.08), SANA 0.62 (+0.21)
GenEval Attribute Binding: 0.72 vs GPT-4o 0.61 (+0.11), SANA 0.57 (+0.15)
DPG-Bench Overall: 86.30 vs previous best DALL·E 3 83.50 (+2.80), SANA 84.50 (+1.80)
DPG-Bench Global: 92.46 vs SANA 77.55 (+14.91)

Results show consistent improvements across compositional reasoning categories. Gains are most pronounced on complex constraints (counting, positioning, global scene understanding) where text-only conditioning struggles. No benchmarks are conspicuously absent for the task scope.

Compute & Efficiency

Model size: Qwen LLM backbone (parameters not specified) + frozen SANA diffusion model
Training compute: Not reported (GPU hours, hardware unspecified)
Inference speed/latency: Not reported, but adds LLM draft generation and evaluation steps before diffusion
Memory footprint: Not reported
Deployment practicality: Moderate - requires running LLM for draft+evaluation plus diffusion model, increasing computational overhead compared to direct text-to-image generation

Real-World Applicability

Evaluation limited to curated benchmark datasets (GenEval, DPG-Bench) with no real-world deployment reported
No hardware experiments or production integration discussed
No sim-to-real analysis as this is not a robotics/embodied AI paper
Method appears designed for general text-to-image generation rather than specific real-world applications
Practical applicability depends on computational resources for running both LLM reasoning and diffusion synthesis

Limitations & Failure Modes

ENGINEERING: Requires running LLM twice (draft + evaluation) plus diffusion, increasing computational cost significantly
ENGINEERING: Training requires synthetic data construction with VLM-generated evaluations, adding complexity
EVALUATION: Only evaluated on two benchmarks with same frozen diffusion backbone (SANA)
ENGINEERING: Approach tied to discrete token representations, may not generalize to other visual encodings
FUNDAMENTAL: Still inherits base diffusion model limitations for fine-grained visual details

Failure modes:
- Draft generation errors could propagate through evaluation to final image
- Method may struggle when LLM’s understanding itself fails (rare edge cases, novel concepts)

Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation

Authors: Quanxing Xu, Ling Zhou, Xian Zhong, Xiaohua Huang et al. (6 authors) · Institution: Macau University of Science and Technology, Wuhan University of Technology, National Tsing Hua University · Category: cs.CV

Combines Chain-of-Thought reasoning with Visual Question Decomposition to guide multi-stage retrieval for knowledge-based VQA, achieving incremental improvements through structured prompting and preference optimization.

Practical Takeaway: Research engineers should consider this work’s structured approach to combining question decomposition with retrieval for knowledge-intensive VQA tasks. The CoVQD technique of fusing Chain-of-Thought with Visual Question Decomposition provides a principled way to guide retrieval, and the three-stage pipeline (DCG-EKR-CPC) offers a modular framework that can be adapted to different MLLMs. However, the gains are incremental and the multi-stage approach adds complexity. The liDPO preference optimization technique for improving question decomposition quality could be valuable for other structured reasoning tasks. Most practical impact likely comes from the systematic integration methodology rather than breakthrough performance gains.

Tags: visual question answering knowledge-based VQA retrieval augmented generation multimodal LLM chain of thought question decomposition preference optimization vision-language models


Task & Setting

Knowledge-based Visual Question Answering (KBVQA) addresses the challenge of answering visual questions that require external knowledge beyond what is visible in images. This is essential for real-world multimodal AI systems that must integrate visual understanding with factual knowledge to provide accurate responses about objects, events, or concepts.

The task involves taking an image I and natural language question Q as input, and producing a natural language answer A. For explanatory VQA, the model must additionally generate an explanation E describing the reasoning process. The evaluation objective seeks to maximize answer accuracy while maintaining explanation quality.

Success is measured by VQA accuracy (exact match between predicted and ground truth answers) on benchmarks like OK-VQA, E-VQA, and INFOSEEK. For explanatory variants, automatic metrics include BLEU-4, METEOR, ROUGE-L, CIDEr, SPICE for explanation quality, plus Grounding metric for visual reasoning accuracy.

The paper evaluates on three main datasets: OK-VQA (9K train, 5K test), E-VQA (~221K question-answer pairs with ~1M samples), and INFOSEEK (934K train, 73K validation pairs with Unseen Entity/Question splits).

Architecture & Method

Chain-of-Visual Question Decomposition (CoVQD): Fuses Chain-of-Thought reasoning with Visual Question Decomposition to break complex questions into structured sub-questions
Dissecting Chain Generation (DCG): Uses fine-tuned MLLM with SelectiveVQD loss combining Next Token Prediction and Binary Cross-Entropy:
\[L_{SelectiveVQD} = \sum_{i=1}^{N} (\lambda L_{NTP,i} + \beta L_{BCE,i})\]
Logical implication Direct Preference Optimization (liDPO): Enhances question decomposition with preference learning:
\[L_{liDPO} = L_{DPO} + \gamma L_{AncPO}\]
where DPO maximizes the reward gap between preferred and rejected sub-question sequences
Elaborate Knowledge Retrieval (EKR): Three-stage retrieval using original image, multimodal features (VinVL + BLIP-2), and CoVQD-guided search with EVA-CLIP-8B encoder
Comprehensive Prompt Construction (CPC): Integrates instruction head, refined caption, image patches, logical knowledge from QAE triples, and original question

Core contribution is using structured question decomposition chains to guide fine-grained multimodal retrieval for MLLM reasoning.
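To make the liDPO combination concrete, the sketch below computes a per-pair loss, assuming $L_{DPO}$ is the standard DPO objective. The digest does not define $L_{AncPO}$, so it is passed in as an already-computed value; the defaults $\beta=0.5$ and $\gamma=1$ are the settings listed in the Training Recipe.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.5):
    """Standard DPO loss for one preferred/rejected pair of sequences.

    Inputs are summed token log-probabilities under the policy and under
    the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


def lidpo_loss(l_dpo, l_ancpo, gamma=1.0):
    """L_liDPO = L_DPO + gamma * L_AncPO; L_AncPO is supplied by the caller."""
    return l_dpo + gamma * l_ancpo
```

When the policy matches the reference on both sequences the margin is zero and the DPO term equals $\log 2$; the loss falls as the policy raises the preferred decomposition relative to the rejected one.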

Training Recipe

MLLM fine-tuning stage: Models fine-tuned on DecoVQD+ dataset using SelectiveVQD loss combining NTP and BCE objectives, with hyperparameters λ and β for balancing generative and decomposition tasks
BERT pretraining for logical relations: BERT pretrained on SNLI for 5 epochs using bert-base-uncased initialization, batch size 16, weight decay 0.01, AdamW optimizer with learning rate 2×10⁻⁵, then fine-tuned on 2,000 INTROSPECT proposition pairs
liDPO optimization stage: β=0.5, learning rate 1×10⁻⁷, cosine scheduler with 0.03 warmup ratio, γ=1, trained for 1 epoch to optimize preference alignment

Hardware details: Not reported
Wall-clock time: Not reported
Data filtering: Uses VinVL and BLIP-2 for refined image patch generation and caption filtering under question supervision

Novelty & Lineage

Prior work:

Wiki-LLaVA (2024): Hierarchical retrieval-augmented generation for multimodal LLMs using textual features only
VLM-PRF (2025): Multimodal processing, retrieval and filtering with visual+textual features, achieving 40.1% on E-VQA
MMKB-RAG (2025): Multi-modal knowledge-based RAG achieving 39.7% on E-VQA single-hop

Delta: This paper adds Chain-of-Visual Question Decomposition (CoVQD) that fuses CoT reasoning with structured question decomposition to guide retrieval, plus logical implication DPO (liDPO) for preference optimization.

Applied-specific assessment:
- Architectural idea: Combining CoT with VQD for retrieval guidance is a reasonable but incremental extension of existing question decomposition and RAG techniques
- Benchmark gains: Modest improvements (40.4% vs 40.1% on E-VQA, 43.0% vs 42.5% on INFOSEEK-All) are within typical noise margins for VQA tasks
- Fair comparisons: Uses same compute scale and evaluation protocols, though relies on similar backbone models as baselines
- Scale dependence: Gains appear to depend on fine-tuning with specialized datasets and multi-stage retrieval, which may not transfer to resource-constrained settings
Verdict: INCREMENTAL — solid engineering combining known techniques (CoT + VQD + RAG) with modest empirical gains that don’t demonstrate clear breakthrough capabilities.

Benchmarks & Results

E-VQA Single-Hop: 40.4% (this work) vs 40.1% (VLM-PRF w/ RL), +0.3% improvement
E-VQA All: 39.5% (this work) vs 39.2% (VLM-PRF w/ RL), +0.3% improvement
INFOSEEK Unseen-Q: 43.5% (this work) vs 43.5% (VLM-PRF w/ RL), tied for best
INFOSEEK Unseen-E: 42.0% (this work) vs 42.1% (VLM-PRF w/ RL), -0.1% slightly worse
INFOSEEK All: 43.0% (this work) vs 42.5% (VLM-PRF w/ RL), +0.5% improvement
OK-VQA: 77.8% (this work) vs 77.2% (KU-RAG), +0.6% improvement
GQA-REX explanation metrics: BLEU-4 61.4%, METEOR 44.1%, ROUGE-L 83.8%, CIDEr 590.2, SPICE 57.9%, Grounding 82.3% - all best among compared methods
GQA-REX answer accuracy: 84.3% validation, 64.4% test - best results
GQA-OOD robustness: 78.0% validation, 57.5% test - best results

Results show consistent but marginal improvements across benchmarks, with strongest gains in explanation quality rather than answer accuracy.

Compute & Efficiency

Model size: Uses various backbone MLLMs (Qwen2-VL-7B, Qwen2.5-VL-7B, LLaVA-NeXT-7B/8B, InternVL3-8B) with EVA-CLIP-8B retriever
Training compute: Not reported for full pipeline training
Inference speed/latency: Not reported, but involves three-stage retrieval process (original image, multimodal, CoVQD-guided) which likely increases latency significantly
Memory footprint: Not reported, but requires maintaining multiple encoders (visual, textual) plus MLLM backbone
Deployment practicality: Framework is model-agnostic and can integrate with different MLLMs, but multi-stage retrieval and knowledge base requirements may limit real-world deployment scalability

Real-World Applicability

Evaluation limited to standard academic benchmarks (E-VQA, INFOSEEK, OK-VQA, GQA-REX) with no reported real-world deployment
No hardware experiments, robot/vehicle integration, or production system results mentioned
No sim-to-real analysis or discussion of deployment challenges
Framework depends on pre-existing knowledge bases (Wikipedia-derived for E-VQA) which may not cover domain-specific applications
Multi-stage retrieval process likely adds significant computational overhead for practical deployment
Qualitative analysis shows improved performance across diverse knowledge domains (geography, architecture, history, art) suggesting potential for educational or reference applications

Limitations & Failure Modes

FUNDAMENTAL: Approach relies on availability of comprehensive external knowledge bases, limiting applicability to domains without well-structured factual resources
ENGINEERING: Multi-stage retrieval process increases computational complexity and inference latency compared to single-stage methods
EVALUATION: Limited evaluation on out-of-domain scenarios beyond academic benchmarks
ENGINEERING: Question decomposition quality depends on fine-tuning with specialized datasets, requiring domain-specific annotation effort
FUNDAMENTAL: Framework assumes questions can be meaningfully decomposed, which may not hold for certain types of visual reasoning tasks

Failure modes:
- May produce hallucinated sub-questions that don’t align with visual content when question decomposition model fails
- Retrieved knowledge may be irrelevant or contradictory when retrieval guidance from CoVQD is poor, leading to confused final answers

Tube Diffusion Policy: Reactive Visual-Tactile Policy Learning for Contact-rich Manipulation

Authors: Teng Xue, Alberto Rigo, Bingjian Huang, Jiayi Shen et al. (7 authors) · Institution: Meta Reality Labs Research · Category: cs.RO

TDP bridges diffusion-based imitation learning with tube-based feedback control, enabling reactive visual-tactile policies that generate action “tubes” allowing step-wise corrections during execution rather than rigid action chunks.

Practical Takeaway: If you’re working on contact-rich manipulation with action chunking policies, TDP’s tube-based formulation offers a principled way to add reactivity without sacrificing the benefits of generative models. The key insight is separating action generation (diffusion) from execution (streaming flow), enabling 3-4x faster inference while improving success rates. The dual-time architecture is relatively straightforward to implement and the theoretical grounding in tube MPC provides confidence in the approach. Most importantly, it demonstrates that you don’t need perfect action generation - coarse diffusion initialization followed by step-wise corrections can be more effective than precise open-loop chunks. Worth implementing if you’re seeing failures due to contact uncertainty or external disturbances in chunk-based policies.

Tags: contact-rich manipulation diffusion policies reactive control visual-tactile sensing dexterous manipulation imitation learning tube-based control action chunking


Task & Setting

Contact-rich manipulation is central to human daily activities and requires continuous adaptation to contact uncertainty and external disturbances through multi-modal perception, particularly vision and tactile feedback. Existing imitation learning approaches rely on action chunking, which fundamentally limits their ability to react to unforeseen observations during execution, especially in contact-rich scenarios where physical uncertainty and high-frequency tactile feedback demand rapid, reactive control.

The task involves learning control policies for contact-rich manipulation tasks including stable grasping, on-table reorientation, jar opening, and dish cleaning. The input consists of visual observations from RGB cameras (wrist-mounted and global views), tactile feedback from dense arrays (768 sensing elements providing 3-axis forces, resulting in 2304-dimensional tactile observations), and proprioceptive robot state. The output is continuous control actions consisting of end-effector position, orientation (6D rotation representation), and hand joint commands.
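The 6D rotation representation mentioned above is commonly decoded into a rotation matrix by Gram-Schmidt orthonormalization of two 3-vectors. The sketch below shows that standard construction; the function names and the convention of storing the first two matrix columns are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def rot6d_to_matrix(r6: np.ndarray) -> np.ndarray:
    """Decode a 6D rotation vector into a 3x3 rotation matrix (Gram-Schmidt)."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)              # first column: normalize a1
    a2_orth = a2 - np.dot(b1, a2) * b1        # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)    # second column
    b3 = np.cross(b1, b2)                     # third column completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)

def matrix_to_rot6d(R: np.ndarray) -> np.ndarray:
    """Encode a rotation matrix as its first two columns (assumed convention)."""
    return np.concatenate([R[:, 0], R[:, 1]])

# An arbitrary, non-orthonormal 6D vector still decodes to a valid rotation.
R = rot6d_to_matrix(np.array([2.0, 0.1, 0.0, 0.3, 1.5, 0.2]))
print(np.round(R.T @ R, 6))
```

Because any pair of non-parallel 3-vectors decodes to a valid rotation, a network can regress the six numbers directly without an explicit orthogonality constraint.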

The training objective combines diffusion loss and streaming loss:

\[L = L_{\text{diff}} + L_{\text{stream}}\]

where

\[L_{\text{diff}} = \mathbb{E}_{u_0, h_{t_1}, \epsilon, t_1}\left[\|\epsilon - v_\theta(u_{t_1}, t_1, t_2 = 0 \mid h_{t_1})\|^2\right]\]

\[L_{\text{stream}} = \mathbb{E}_{u, h, t_2}\left[\|v_\theta(u, t_1 = 0, t_2 \mid h_{t_2}) - v_\zeta(u, t_2)\|^2\right]\]

Success is measured by task completion rates, steps to success, and task-specific metrics (e.g., percentage of dirty region cleaned in dish cleaning). The authors collect 150-200 demonstrations per task via teleoperation (VR headsets and motion capture) and evaluate each policy over 50 episodes.
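To make the combined objective concrete, here is a minimal NumPy sketch of how the two terms could be evaluated on one batch. It assumes standard DDPM forward noising and a linear beta schedule. `zero_model` is a hypothetical stand-in for the shared network $v_\theta$, and the argument layout is an illustration rather than the authors' implementation.

```python
import numpy as np

def forward_noise(u0, alpha_bar_t, eps):
    """DDPM forward noising: sqrt(abar) * u0 + sqrt(1 - abar) * eps."""
    return np.sqrt(alpha_bar_t) * u0 + np.sqrt(1.0 - alpha_bar_t) * eps

def tdp_loss(v_theta, u0, eps, t1, alpha_bar, u_stream, t2, v_target, h):
    """Return (L, L_diff, L_stream) for one batch; arrays are (batch, dim)."""
    # Diffusion branch: predict the injected noise at trajectory time t2 = 0.
    u_t1 = forward_noise(u0, alpha_bar[t1], eps)
    l_diff = np.mean(np.sum((eps - v_theta(u_t1, t1, 0.0, h)) ** 2, axis=-1))
    # Streaming branch: match the target velocity field at diffusion time t1 = 0.
    l_stream = np.mean(np.sum((v_theta(u_stream, 0, t2, h) - v_target) ** 2, axis=-1))
    return l_diff + l_stream, l_diff, l_stream

def zero_model(u, t1, t2, h):
    """Hypothetical stand-in network that always predicts zero (not trained)."""
    return np.zeros_like(u)

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 100))  # assumed schedule
u0 = np.zeros((4, 2))
eps = np.ones((4, 2))
v_target = 2.0 * np.ones((4, 2))
total, l_diff, l_stream = tdp_loss(
    zero_model, u0, eps, 10, alpha_bar, u0, 0.5, v_target, h=None
)
print(total, l_diff, l_stream)
```

Both branches call the same function with different time arguments, which mirrors the shared-backbone design: setting $t_2 = 0$ selects noise prediction, and setting $t_1 = 0$ selects velocity prediction.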

Architecture & Method

Dual-time formulation with two distinct time variables: diffusion time $t_1 \in [0, T_{\text{diff}}]$ for multi-step denoising, and trajectory time $t_2 \in [0, 1]$ for streaming control evolution in real time
Shared conditional 1D U-Net backbone with separate sinusoidal embeddings for $t_1$ and $t_2$, using Feature-wise Linear Modulation (FiLM) for observation conditioning
Diffusion phase follows standard DDPM with ε-prediction, generating initial actions via forward noising:
\[u_{t_1} = \sqrt{\bar{\alpha}_{t_1}} u_0 + \sqrt{1 - \bar{\alpha}_{t_1}} \epsilon\]
Streaming phase learns observation-conditioned feedback flow around nominal trajectories, with target velocity field:
\[v_\zeta(u, t_2) = \frac{d\zeta_u}{dt_2} - \lambda(u - \zeta_u(t_2))\]
Action tube formulation enables step-wise closed-loop feedback during execution via integration:
\[u \leftarrow u + v_\theta(u, t_1 = 0, t_2 | h_{t_2}) \Delta t_2\]
Visual encoding using ResNet-18 backbones, tactile processing via MLPs (simulation) or multiple ResNet-18s (real-world Digit360), with 6D rotation representation for orientation

The core technical contribution is bridging diffusion-based imitation learning with tube-based feedback control theory, enabling reactive control at every timestep while preserving the expressiveness of generative models.
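The stabilizing effect of the $-\lambda(u - \zeta_u(t_2))$ term can be illustrated numerically. The sketch below Euler-integrates the target velocity field directly, whereas in TDP a learned, observation-conditioned $v_\theta$ would take its place. The straight-line nominal trajectory, the value of $\lambda$, and the step count are hypothetical choices for illustration.

```python
import numpy as np

def nominal(t2, goal):
    """Hypothetical nominal trajectory zeta_u(t2): a straight line from 0 to goal."""
    return t2 * goal

def target_velocity(u, t2, goal, lam):
    """v_zeta(u, t2) = d(zeta_u)/dt2 - lam * (u - zeta_u(t2))."""
    return goal - lam * (u - nominal(t2, goal))

def stream(u_init, goal, lam=5.0, n_steps=100):
    """Euler integration: u <- u + v * dt2 over trajectory time [0, 1]."""
    dt = 1.0 / n_steps
    u = u_init.copy()
    for k in range(n_steps):
        u = u + target_velocity(u, k * dt, goal, lam) * dt
    return u

goal = np.array([1.0, -0.5])
offset = np.array([0.2, 0.1])          # initial deviation from the nominal path
u_final = stream(nominal(0.0, goal) + offset, goal)
print(np.linalg.norm(offset), np.linalg.norm(u_final - goal))
```

For this linear nominal path the deviation shrinks by a factor of $(1 - \lambda \Delta t_2)$ per step, so an action that starts off the nominal trajectory is pulled back toward it over the course of execution.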

Training Recipe

Data collection: 150-200 human demonstrations per task collected via teleoperation using VR headsets (Meta Quest 3) in simulation and OptiTrack motion capture system in real-world experiments
Training stages:
- Single-stage end-to-end training combining diffusion and streaming losses with equal weighting
- Dual-time formulation trains shared U-Net backbone on both diffusion noise prediction and streaming velocity field prediction
Optimizer and hyperparameters: Not explicitly reported for optimizer choice, learning rate, or batch size
Hardware setup: Lenovo P8 desktop with an NVIDIA RTX 4080 GPU for training and policy inference; Lenovo P620 with an RTX 3080 for robot control
Wall-clock training time: Not reported
Inference configuration: Uses 2-3 DDIM steps during denoising phase, enabling ~75-150 Hz control frequency compared to ~25 Hz for standard Diffusion Policy
Data preprocessing: Visual observations from dual RGB cameras, tactile arrays providing 2304-dimensional force measurements, proprioceptive state concatenated into unified observation representation

Novelty & Lineage

Prior Work:

Diffusion Policy (Chi et al., 2023) - established diffusion models for imitation learning with action chunking, achieving strong visuomotor manipulation performance but limited to open-loop chunk execution
Reactive Diffusion Policy (Xue et al., 2025) - introduced slow-fast architecture combining diffusion planning with tactile refinement, but primarily validated on parallel grippers with periodic jerky motions
Streaming Flow Policy (Jiang et al., 2025) - formulated policy learning as continuous-time dynamical systems in action space, but operated in open-loop manner without closed-loop feedback

Delta: This paper adds tube-based feedback control theory to diffusion policies, creating “action tubes” that enable step-wise reactive corrections during execution while preserving diffusion-based action generation.

Applied-specific Assessment:
- Architectural novelty: The dual-time formulation separating diffusion and streaming phases is novel, but builds incrementally on existing diffusion and flow-based policies
- Benchmark gains: Consistent improvements across multiple tasks (96.1% vs 93.2% on Push-T, 96% vs 88% grasp success), with significantly reduced latency (3-4x faster inference)
- Fair comparisons: Uses same network architectures, training data, and evaluation protocols as baselines; comparisons appear fair
- Scale dependence: Gains appear to hold with minimal compute (2-3 DDIM steps vs 10+ for baselines), suggesting robustness without requiring massive scale
Verdict: SIGNIFICANT — The tube-based reactive control formulation addresses a fundamental limitation of chunk-based policies with clear theoretical grounding and consistent empirical improvements across diverse contact-rich tasks.

Benchmarks & Results

Push-T (state): TDP achieves 96.1%/96.9% avg/max scores vs DP 93.2%/93.8%, with 106.2/105.1 avg/min steps vs DP 125.9/128.5
Push-T (image): TDP achieves 82.0%/85.6% avg/max scores vs DP 76.0%/77.5%, with 103.5/91.3 avg/min steps vs DP 114.7/107.9
Stable Grasping: TDP achieves 96% grasp success and 88% stable grasp vs DP 88% and 60% respectively, with comparable 27.8±3.06 steps to success
On-table Reorientation: TDP achieves 90% success rate vs DP 82%, with 40.0±15.3 steps vs DP 41.7±8.51 steps
Dish Cleaning: TDP achieves 98.0% cleaning score with 13.3ms latency vs DP 98.4% score with 72.9ms latency
Real-world On-table Reorientation: TDP achieves 96% success vs DP 60%, with 0.008s denoising latency vs DP 0.037s
Real-world Jar Opening: TDP achieves 96% success vs DP 84%, maintaining high performance with 3 DDIM steps

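Results show higher success rates than DP on every task except dish cleaning, where the cleaning score is essentially tied (98.0% vs 98.4%). Where latency is reported for both methods, TDP is roughly 4.6x to 5.5x faster (0.008s vs 0.037s in real-world reorientation; 13.3ms vs 72.9ms in dish cleaning). No conspicuous benchmark absences noted.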
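The latency ratios can be checked directly from the figures listed above. The snippet below only restates those reported numbers; the dictionary keys are labels chosen here for readability.

```python
# (DP latency, TDP latency) in seconds, as reported in the benchmark list.
latencies_s = {
    "Dish Cleaning (sim)": (0.0729, 0.0133),
    "On-table Reorientation (real)": (0.037, 0.008),
}

# Speedup is the ratio of baseline latency to TDP latency.
speedups = {task: dp / tdp for task, (dp, tdp) in latencies_s.items()}
for task, s in speedups.items():
    print(f"{task}: {s:.1f}x")
```

This prints about 5.5x for dish cleaning and 4.6x for real-world reorientation.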

Compute & Efficiency

Model size: Conditional 1D U-Net backbone with ResNet-18 encoders for visual processing - exact parameter count not reported
Training compute: Training performed on NVIDIA RTX 4080 GPU, wall-clock training time not reported
Inference speed: Denoising phase at ~75 Hz (0.008-0.013s), streaming phase at ~150 Hz (0.003s), compared to standard Diffusion Policy at ~25 Hz (0.037s)
Memory footprint: Not explicitly reported, but uses lightweight architecture with tactile MLPs and visual ResNet-18 encoders
Deployment practicality: High - demonstrated successful real-world deployment on Franka Research 3 + Allegro V5 hand system with real-time visual-tactile processing across dual workstation setup, achieving 3-4x faster inference than baselines while maintaining high success rates

Real-World Applicability

Real-world robot experiments conducted on Franka Research 3 robotic arm equipped with Allegro V5 dexterous hand in contact-rich manipulation tasks
Hardware integration includes Intel RealSense D405/D435F cameras for visual sensing and Digit360 tactile sensors on fingertips providing image-based tactile feedback
Two real-world tasks validated: jar opening and on-table reorientation, both involving contact uncertainty and external disturbances
Teleoperation system using OptiTrack motion capture and instrumented gloves for demonstration collection, with fingertip retargeting strategy for hand control
Robustness testing under intentional perturbations: thumb finger bending during manipulation and jar position shifting, demonstrating reactive recovery capabilities
Real-time deployment achieving >100 Hz control frequency on dual workstation setup (RTX 3080 for control, RTX 4080 for inference), significantly outperforming baseline methods in success rates (96% vs 60-84%)

Limitations & Failure Modes

ENGINEERING - Simple concatenation of visual, tactile, and proprioceptive observations neglects inherent sparsity and asynchronous nature of tactile sensing, could benefit from structured tactile representations
ENGINEERING - Control latency could be further reduced using recent diffusion acceleration techniques like one-step or distillation-based denoising methods instead of 2-3 DDIM steps
EVALUATION - Tube-based formulation currently only applied to diffusion policies, but could extend to other chunk-based methods including Vision-Language-Action models
FUNDAMENTAL - Relies on local linearization assumption in quasi-dynamic regime, which may break down under highly dynamic or discontinuous contact transitions
ENGINEERING - Requires dual workstation setup for real-world deployment, limiting practical accessibility compared to single-GPU solutions

Failure Modes:
Performance degrades when local linearization assumption fails during highly dynamic contact transitions or large disturbances that exceed the action tube bounds
Streaming-only variant (without diffusion correction) shows significant performance drops, indicating accumulated drift without periodic global corrections

FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments

Authors: Amir Saeidi, Venkatesh Mishra, Souradeep Mukhopadhyay, Gaowen Liu et al. (7 authors) · Institution: Arizona State University, Cisco Research · Category: cs.CL

FAMA improves open-source LLM tool-calling agents by first analyzing failure patterns then dynamically selecting minimal specialized helper agents rather than using all available agents statically.

Practical Takeaway: If you’re deploying open-source LLMs for tool-calling tasks, FAMA provides a principled way to diagnose and address systematic failures through selective agent composition. The key insight is that naively using all available helper agents can hurt performance - instead, analyze your specific failure patterns and activate only relevant specialized agents. The framework is modular and could be adapted to other domains by defining appropriate error categories and helper agents. However, consider the orchestration overhead and whether simpler approaches might suffice for your use case.

Tags: tool-calling multi-agent-systems failure-analysis open-source-llms conversational-ai meta-learning dynamic-composition

Task & Setting

Multi-turn conversational AI agents with tool-calling capabilities struggle with cascading errors in realistic customer service scenarios, particularly when using smaller open-source language models. These agents must maintain extended conversations while invoking external APIs and following domain-specific policies across interactions spanning multiple turns. The core challenge is that errors accumulate over long trajectories due to context window limitations, weaker reasoning capabilities, and inadequate failure recovery mechanisms.

The task involves multi-turn conversational tool use where an AI agent acts as a customer service assistant. Input consists of user queries in natural language within specific domains (airline, retail, telecom, etc.). The agent must generate appropriate responses and invoke external API tools while adhering to domain constraints. Output includes both conversational responses and structured API calls. Episodes can span 5-15 turns with complex state dependencies.

Evaluation uses task completion success rate (pass@k) where k represents multiple execution attempts. Success is measured by whether the agent successfully completes the user’s request while following all domain policies and constraints. Benchmarks include τ-bench (airline, retail domains), τ-trait (telehealth, telecom), and ACEBench (food delivery, telecom) with tasks formulated as Partially Observable Markov Decision Processes.
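The digest does not state which pass@k estimator the authors use. As one common choice, the sketch below implements the unbiased pass@k estimator popularized in code-generation evaluation, where `n` attempts are run per task and `c` of them succeed. Treat it as an assumption about the metric, not as the paper's evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn from n with c successes, passes."""
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 attempts on a task, 2 of which succeeded.
print(pass_at_k(5, 2, 1), pass_at_k(5, 2, 3))
```

With 2 successes out of 5 attempts, this gives 0.4 for k = 1 and 0.9 for k = 3.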

Architecture & Method

Baseline agent selection: Choose from ReAct, Function Calling (FC), or IRMA as the base tool-calling agent
Failure analysis stage: Deploy 4 independent error analysis agents, each specialized for one error category: Domain Policy Violation (DCV), Wrong Retrieval from Complex Tool Outputs (WRCO), Contextual Misinterpretation (CM), Incomplete Fulfillment (IFU)
Orchestrator agent: Takes concatenated outputs from all error analysis agents plus full interaction trajectory, produces final failure attribution identifying dominant error categories
Mitigation agent: Given identified error categories and functional definitions of available agents A, determines minimal subset A* ⊆ A to address failures
Helper agent pool: Contains Domain Constraints Extractor (DCE), Tool Suggestion Agent (TSA), Tool Output Reformulator (TOR), Planner Agent, Decision Verifier Agent, Memory module
Dynamic composition: Re-execute baseline agent using only selected helper agents A* injecting targeted context before decision-making steps

The core technical contribution is the two-stage meta-agentic orchestration that first diagnoses failure patterns, then dynamically selects minimal specialized agents rather than using all available agents statically.
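The selection step can be pictured as a covering problem: pick a small set of helpers whose capabilities cover the diagnosed error categories. In FAMA this decision is made by an LLM-based mitigation agent. The sketch below substitutes a greedy set-cover heuristic, and the `COVERAGE` table mapping helpers to error categories is entirely hypothetical.

```python
# Hypothetical mapping from helper agent to the error categories it addresses.
COVERAGE = {
    "DCE": {"DCV"},
    "TSA": {"CM"},
    "TOR": {"WRCO"},
    "Planner": {"IFU", "CM"},
    "Verifier": {"DCV", "IFU"},
    "Memory": {"CM"},
}

def select_helpers(dominant_errors, coverage=COVERAGE):
    """Greedy set cover: repeatedly add the helper covering the most uncovered errors."""
    uncovered = set(dominant_errors)
    selected = []
    while uncovered:
        # Iterate in sorted order so ties are broken deterministically.
        best = max(sorted(coverage), key=lambda a: len(coverage[a] & uncovered))
        gained = coverage[best] & uncovered
        if not gained:
            break  # remaining errors are outside the helper pool
        selected.append(best)
        uncovered -= gained
    return selected

print(select_helpers({"DCV", "IFU"}))
print(select_helpers({"WRCO"}))
```

Greedy set cover yields a small subset rather than a guaranteed minimum, and it stops early when a diagnosed error has no matching helper, which corresponds to the predefined-agent-pool limitation noted later in this summary.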

Training Recipe

No training required - FAMA is a training-free framework
Stage 1: Execute baseline agent on task set T to collect failure trajectories
Stage 2: Run failure analysis using GPT-4o or GPT-4.1-mini as judgment models for error categorization and agent orchestration
Hardware: not reported for training (training-free)
Data: Uses existing benchmark datasets (τ-bench, τ-trait, ACEBench) without additional data collection
Inference deployment: Uses Qwen family models (4B-72B parameters) as base agents, with specialized helper agents activated dynamically based on failure patterns

Novelty & Lineage

Prior work:

IRMA (Mishra et al. 2025) proposes multi-agent frameworks with specialized helpers for tool-calling but uses all agents statically.
ReAct (Yao et al. 2023) combines reasoning and acting for tool use but lacks error-aware adaptation.
Reflexion (Shinn et al. 2023) uses verbal reinforcement learning for self-correction but doesn’t target specific failure modes.

Delta: FAMA adds failure-aware dynamic agent composition - first analyzing dominant failure patterns in baseline trajectories, then selectively activating only necessary helper agents rather than using all available agents. This targets specific weaknesses of open-source models under resource constraints.

Applied-specific assessment:
- Architectural novelty: The two-stage meta-agentic approach (failure analysis → selective composition) is a reasonable extension of existing multi-agent patterns, not fundamentally novel
- Benchmark gains: Improvements of up to 27% over baselines are substantial, but evaluation is limited to tool-calling tasks on relatively small models
- Fair comparisons: Uses same base models and evaluation protocols; improvements hold across multiple model sizes and domains
- Scale sensitivity: The approach specifically targets resource-constrained open-source models, unclear if benefits persist with larger models or proprietary data
Verdict: INCREMENTAL — solid engineering contribution that improves existing multi-agent frameworks through targeted failure mitigation, but the core ideas are straightforward extensions of established techniques.

Benchmarks & Results

τ-bench (Airline domain): FAMA achieves 37.6% vs ReAct 32.0% (+5.6 percentage points) with Qwen3-4B-Instruct
τ-bench (Retail domain): FAMA achieves 34.6% vs ReAct 17.22% (+17.4 percentage points) with Qwen3-4B-Instruct
τ-bench (Airline, Qwen2.5-72B): FAMA 29.2% vs ReAct 24.4% vs IRMA 26.4% (+4.8 and +2.8 percentage points respectively)
τ-bench (Retail, Qwen2.5-72B): FAMA 44.17% vs ReAct 43.47% vs IRMA 38.78% (+0.7 and +5.4 percentage points respectively)
τ-trait and ACEBench: Claims up to 27% and 24% improvements respectively but detailed results relegated to appendix
Results are mixed - larger improvements on smaller models, more modest gains on 72B model
Conspicuously absent: No comparison to recent SOTA tool-calling methods beyond IRMA, ReAct, and basic function calling

Compute & Efficiency

Model size: Evaluated on 4B-72B parameter Qwen models as base agents, plus GPT-4o/4.1-mini for orchestration
Training compute: Not applicable (training-free method)
Inference speed: FAMA shows 91.1s latency vs IRMA 149.8s vs ReAct-thinking 221.4s (improvement over multi-agent baselines)
Memory footprint: 30% token overhead vs 50-58% for IRMA, but still substantial compared to base ReAct
Deployment practicality: More efficient than static multi-agent approaches but requires orchestrator model calls, adding latency and complexity compared to single-agent baselines

Real-World Applicability

Evaluation limited to simulated conversational benchmarks (τ-bench, τ-trait, ACEBench) with LLM-based user simulators
No deployment results on actual customer service systems or real user interactions reported
No hardware experiments or production integration discussed
Limited sim-to-real considerations - benchmarks designed to simulate realistic scenarios but still synthetic
Framework designed for privacy-sensitive applications requiring local deployment but no evidence of actual deployment testing

Limitations & Failure Modes

FUNDAMENTAL: Effectiveness bounded by predefined agent pool - cannot address failure modes not covered by existing specialized agents
FUNDAMENTAL: Limited to structured conversational environments, unclear how to extend to more diverse interactive settings
ENGINEERING: Requires GPT-4 class model for orchestration, reducing cost benefits of using smaller base models
EVALUATION: Benchmarks focus on tool-calling tasks, unclear generalization to other agent capabilities
EVALUATION: No evaluation on frontier models where the approach may be less beneficial

Failure modes:
- Framework may select suboptimal agent combinations when failure analysis is incorrect
- Performance may degrade if orchestrator model fails to properly categorize errors or if mitigation agent recommendations are poor