Applied AI 5 papers

Applied AI Digest — May 2, 2026

Today’s Digest at a Glance

Today’s papers explore educational frameworks for strategic AI agents, multimodal medical AI systems, and enhanced retrieval-augmented generation approaches.

Shannon’s Game-Playing Taxonomy

Classical game theory classifies games by information and payoff structure: perfect vs. imperfect information (whether players can observe all game state), complete vs. incomplete information (whether players know opponents' payoffs and strategies), and zero-sum vs. non-zero-sum outcomes. These dimensions describe a game but say little about how to program an agent to play it. Shannon’s 1950 taxonomy, as used in Paper 1 below, instead groups game-playing approaches into four categories: dictionary-based (stored state-action mappings), mathematically rigorous (closed-form optimal strategies), heuristic-based (search guided by evaluation rules), and learning-based (strategies improved through experience).

For example, chess is a perfect-information game, while poker involves imperfect information because players cannot observe hidden cards. Modern educational applications build on Shannon’s categories by having students implement AI agents for each one, learning how information constraints affect optimal strategies.

Intuitively, Shannon’s taxonomy provides a systematic roadmap for understanding when and why different strategic reasoning approaches succeed or fail.

Invoke-and-Reason Loops

Traditional AI systems either make decisions independently or follow rigid tool-calling pipelines, but complex multimodal tasks require adaptive reasoning that can revise conclusions based on new evidence. Invoke-and-reason loops address this limitation by treating external tool outputs as revisable hypotheses rather than final answers, enabling iterative refinement through multiple reasoning cycles.

The core mechanism alternates between two phases: invocation (calling specialized tools or detectors) and reasoning (interpreting results in context and deciding next actions). During each cycle, the system maintains a working hypothesis that can be updated, contradicted, or refined based on new tool outputs. The loop continues until confidence thresholds are met or reasoning chains converge to stable conclusions.

\[\text{State}_{t+1} = \text{Reason}(\text{State}_t, \text{Invoke}(\text{Tools}, \text{Query}_t))\]

This creates a flexible framework where AI agents can adaptively decide which tools to use and how to integrate their outputs. For medical imaging, this enables systems to call lesion detectors, evaluate their outputs against clinical context, and invoke additional specialized tools when initial results are uncertain or contradictory.

Intuitively, invoke-and-reason loops mirror human expert decision-making, where specialists gather evidence, form hypotheses, and seek additional information when needed.
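The state update above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not any paper's implementation: the `State` fields, the `Tool` interface, the confidence arithmetic in `reason`, and the fixed tool order are all hypothetical stand-ins for what would be learned model behavior.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class State:
    hypothesis: str
    confidence: float
    evidence: List[str] = field(default_factory=list)


# Hypothetical tool interface: query -> (finding, tool confidence).
Tool = Callable[[str], Tuple[str, float]]


def reason(state: State, finding: str, tool_conf: float) -> State:
    """Toy reasoning step that treats a tool output as revisable evidence."""
    evidence = state.evidence + [finding]
    if finding == state.hypothesis:
        # Agreement raises confidence, capped at 1.0.
        boosted = min(1.0, state.confidence + 0.5 * tool_conf)
        return State(state.hypothesis, boosted, evidence)
    if tool_conf > state.confidence:
        # A more confident contradiction replaces the working hypothesis.
        return State(finding, tool_conf, evidence)
    # A weak contradiction only lowers confidence.
    return State(state.hypothesis, state.confidence * 0.8, evidence)


def invoke_and_reason(initial: State, tools: Dict[str, Tool], order: List[str],
                      threshold: float = 0.9, max_rounds: int = 3) -> State:
    """Alternate invocation and reasoning until the confidence threshold is met."""
    state = initial
    for name in order[:max_rounds]:
        if state.confidence >= threshold:
            break
        finding, tool_conf = tools[name](state.hypothesis)  # Invoke
        state = reason(state, finding, tool_conf)           # Reason
    return state


# Two made-up detectors that both report a cyst.
tools: Dict[str, Tool] = {
    "detector_a": lambda query: ("cyst", 0.7),
    "detector_b": lambda query: ("cyst", 0.8),
}
final = invoke_and_reason(State("stone", 0.4), tools, ["detector_a", "detector_b"])
print(final.hypothesis, final.confidence)
```

In this run the initial hypothesis is overturned by the first tool and reinforced by the second. A real agent would also choose which tool to call next, which the fixed `order` list deliberately omits.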

Two-Way Late-Interaction Scoring

Standard retrieval systems compute similarity using simple dot products or cosine similarity between query and document embeddings, but this approach fails to capture fine-grained semantic relationships between different parts of multimodal content. Two-way late-interaction scoring addresses this limitation by computing bidirectional maximum similarities between query tokens and document patches, enabling more nuanced matching.

The method represents queries as sequences of token embeddings $\{q_1, q_2, \dots, q_m\}$ and documents as patch embeddings $\{p_1, p_2, \dots, p_n\}$. Instead of aggregating these into single vectors, it computes two complementary scores: query-to-document matching finds the most similar document patch for each query token, while document-to-query matching finds the most similar query token for each document patch.

\[S_{q \to d} = \frac{1}{m}\sum_{i=1}^m \max_j (q_i \cdot p_j)\]

\[S_{d \to q} = \frac{1}{n}\sum_{j=1}^n \max_i (q_i \cdot p_j)\]

\[S(q,d) = \alpha S_{q \to d} + (1-\alpha) S_{d \to q}\]

This bidirectional approach ensures that both query coverage (how well the document matches all query aspects) and document coverage (how much of the document content is query-relevant) contribute to the final score. The weighting parameter $\alpha$ balances these complementary perspectives based on task requirements.

Intuitively, two-way late-interaction scoring ensures that good matches must satisfy the query completely while avoiding irrelevant document content that might dominate simple averaging approaches.
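As a concrete reading of the three formulas above, here is a minimal pure-Python sketch. The function name `two_way_score` and the tiny 2-dimensional vectors are illustrative assumptions; a real retriever would use batched tensor operations over learned embeddings.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def two_way_score(query: Sequence[Vector], patches: Sequence[Vector],
                  alpha: float = 0.5) -> float:
    """Weighted sum of query-to-document and document-to-query MaxSim scores."""
    if not query or not patches:
        raise ValueError("query and patches must be non-empty")
    # sims[i][j] = q_i . p_j
    sims = [[dot(q, p) for p in patches] for q in query]
    # S_{q->d}: best patch for each query token, averaged over m tokens.
    s_q2d = sum(max(row) for row in sims) / len(query)
    # S_{d->q}: best query token for each patch, averaged over n patches.
    s_d2q = sum(
        max(sims[i][j] for i in range(len(query)))
        for j in range(len(patches))
    ) / len(patches)
    return alpha * s_q2d + (1.0 - alpha) * s_d2q


# Two query tokens, three patches; the third patch matches nothing.
query = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
score = two_way_score(query, patches)  # 0.5 * 1.0 + 0.5 * (2/3), about 0.833
print(score)
```

The unmatched third patch leaves the query-to-document term at 1.0 but pulls the document-to-query term down to 2/3, which is the document-coverage penalty the text describes.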

Reading Guide

Paper 1 (Nemobot Games) demonstrates educational applications of Shannon’s taxonomy, while Papers 2 and 4 (Echo-α and SAKE) both employ invoke-and-reason loops for medical and social media understanding respectively. Paper 3 (MEDVRAG) introduces two-way late-interaction scoring for medical document retrieval, and Paper 5 (RIME) explores alternatives to traditional chain-of-thought reasoning in multimodal embeddings.


Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

Authors: Chee Wei Tan, Yuchen Wang, Shangxin Guo · Institution: Nanyang Technological University · Category: cs.AI

Nemobot extends Shannon’s 1950 game-playing taxonomy with LLMs to create an educational programming framework where students build strategic game agents through modular prompt engineering and crowdsourced feedback.

Practical Takeaway: If you’re building educational AI tools or game programming environments, this paper offers a structured approach to organizing LLM capabilities within classical AI frameworks. The key insight is treating LLM functions as modular, inspectable components rather than black boxes. The tiered curriculum (dictionary → mathematical → heuristic → learning-based games) provides a reasonable progression for teaching AI concepts. However, don’t expect breakthrough game AI performance - this is primarily an educational framework. The crowdsourcing approach for prompt engineering could be valuable for other interactive AI applications, but requires significant effort to ensure data quality. Consider this if you need a systematic way to teach AI programming concepts, but look elsewhere for cutting-edge game AI research.

Tags: educational-ai game-ai llm-applications prompt-engineering shannon-taxonomy interactive-learning crowdsourcing strategic-games


Task & Setting

This paper addresses the need for educational frameworks to help developers understand and program AI game agents. Traditional game AI development either requires exhaustive rule programming or is limited to black-box LLM interactions that are non-reproducible and opaque. This makes it difficult to learn fundamental AI concepts or create systematic game AI programming.

The task involves creating a programming framework where users design, customize, and deploy LLM-powered game agents across four categories based on Shannon’s taxonomy:

  1. Dictionary-based games (like tic-tac-toe with state-action mappings)
  2. Mathematically rigorous games (like Nim with optimal formulas)
  3. Heuristic-based games (like Mancala using minimax + crowdsourced data)
  4. Learning-based games (using reinforcement learning with human feedback)

Input: Game rules, initial heuristics, user interactions. Output: AI game agents that can play strategically and explain their reasoning. The system operates through a web interface with a coding pad, chat playground, and analysis portal.

Evaluation criteria: Educational effectiveness measured through classroom deployment (251 students at City University Hong Kong, 80+ at NTU, 30 from Princeton), game performance on leaderboards, and ability to generate human-readable strategy explanations.

The paper introduces Nemobot as both a programming environment and a collection of implemented games across Shannon’s four categories.

Architecture & Method
  1. System Architecture: Three-layer design with model-agnostic AI integration (bring-your-own-key, or BYOK, principle), local state management for low latency, and cloud-based data persistence for crowdsourcing.

  2. LLM Function Integration: Instead of direct LLM calls, users define “LLM functions” within the platform that act as modular components for strategy generation, move explanation, and reasoning.

  3. Shannon’s Taxonomy Extension:
    - Dictionary games: LLM compresses state-action mappings into generalized models rather than exhaustive storage
    - Mathematical games: LLM provides natural language explanations of optimal strategies (e.g., Nim-sum calculations)
    - Heuristic games: LLM synthesizes minimax algorithm insights with crowdsourced data
    - Learning games: LLM implements reinforcement learning with human feedback using trial-and-error

  4. Interactive Training Loop: Algorithm 1 shows iterative heuristic refinement where H_k gets updated based on rewards R from dataset D_k until performance threshold τ is met.

  5. Programmable Prompts: Users specify high-level logic in natural language that gets converted to executable strategies, enabling “neuralized memoization” inspired by Michie’s memo functions.

  6. Crowdsourcing Integration: Platform aggregates human feedback and gameplay data to refine LLM prompts and strategies collaboratively.

The core technical contribution is treating LLM capabilities as programmable modules within Shannon’s classical framework rather than as black-box players.
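The interactive training loop in item 4 can be sketched as follows. This is only a schematic of the description given here (update H_k from rewards R in dataset D_k until a threshold τ is met), not the paper's Algorithm 1: the list-of-weights heuristic, the `collect` and `update` callbacks, and the toy reward function are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Heuristic = List[float]                    # hypothetical: a list of feature weights
Dataset = List[Tuple[Heuristic, float]]    # D_k: (candidate heuristic, reward R)


def refine_heuristic(h0: Heuristic,
                     collect: Callable[[Heuristic], Dataset],
                     update: Callable[[Heuristic, Dataset], Heuristic],
                     tau: float,
                     max_iters: int = 50) -> Tuple[Heuristic, int, float]:
    """Iterate H_k -> H_{k+1} until mean reward on D_k reaches tau."""
    h = list(h0)
    performance = float("-inf")
    for k in range(max_iters):
        dataset = collect(h)  # gather D_k by playing with H_k
        performance = sum(r for _, r in dataset) / len(dataset)
        if performance >= tau:
            return h, k, performance
        h = update(h, dataset)  # H_{k+1} from rewards in D_k
    return h, max_iters, performance


def collect(h: Heuristic) -> Dataset:
    # Stand-in for gameplay: reward peaks when the single weight equals 1.0.
    candidates = [h[0] - 0.25, h[0], h[0] + 0.25]
    return [([c], 1.0 - abs(c - 1.0)) for c in candidates]


def update(h: Heuristic, dataset: Dataset) -> Heuristic:
    # Keep the candidate with the highest observed reward.
    return max(dataset, key=lambda item: item[1])[0]


h, k, perf = refine_heuristic([0.0], collect, update, tau=0.8)
print(h, k, perf)
```

In the platform described above, `collect` would correspond to chat-playground or crowdsourced games and `update` to a student or LLM revising the heuristic; both are reduced to toy functions here.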

Training Recipe
  1. Platform Setup: Web-based framework using NodeJS templates, deployed on Facebook Messenger and Telegram with API connections to various LLMs (GPT-4, GPT-3.5) via BYOK model.

  2. Data Collection: Multi-stage process including crowdsourced human gameplay data, automated self-play simulations, and trial-and-error learning using Michie’s Boxes algorithm. Scale: 251 students at City University Hong Kong (2020-2021), 80+ at NTU (2022-2023), 30 from Princeton (2021-2022).

  3. Iterative Refinement: Students program initial heuristics H_0, test through chat playground, collect reward data R, and update heuristics based on performance analysis. No specific optimizer, learning rate, or batch size reported.

  4. LLM Fine-tuning: Prompt engineering using collected gameplay data, with different prompts categorized by use case (decision-making, strategy explanation, error correction). Students initially used OpenAI Playground and Codex, transitioning to ChatGPT and OpenAI API in late 2022.

  5. Hardware and Timeline: Web-based deployment, 10-week course duration. Wall-clock time and specific compute requirements not reported.

  6. Evaluation Method: Performance measured through classroom deployment, leaderboard competitions (Mancala), and educational effectiveness assessment. Specific training metrics and convergence criteria not detailed.

Novelty & Lineage

Prior Work:

  1. ChessGPT (2023): Bridges policy learning with language modeling for chess, specialized to single domain
  2. Voyager (2023): Open-ended embodied agent using GPT-4 in Minecraft, fully autonomous with no human programming interface
  3. Suspicion-Agent (2023): GPT-4 with theory-of-mind reasoning for imperfect-information games

Delta: This paper adds a programmable educational framework that exposes LLM functions as transparent, modular components within Shannon’s taxonomy. Users design, test, and refine strategies through collaborative prompt engineering rather than consuming black-box AI gameplay.

Applied-specific assessment:

• Architectural idea: Extending Shannon’s 1950 taxonomy with LLMs is a reasonable but not groundbreaking combination. The “neuralized memoization” concept connecting to Michie’s memo functions is interesting but not deeply developed.
• Benchmark gains: No quantitative performance comparisons to SOTA game-playing systems. Evaluation focuses on educational effectiveness rather than game-playing strength.
• Fair comparisons: Missing. There are no head-to-head comparisons with other game AI frameworks or educational tools.
• Scale dependency: The approach fundamentally depends on LLM capabilities and would not work without large-scale pretrained models.

The contribution is primarily in creating an educational programming environment rather than advancing game AI state-of-the-art. While pedagogically valuable, the technical novelty is limited to systematically organizing existing LLM capabilities within a classical framework.

Verdict: INCREMENTAL. Solid educational framework combining known techniques (Shannon’s taxonomy + LLM prompting) without significant technical or performance breakthroughs.

Benchmarks & Results
  1. Educational Deployment: Deployed across 3 institutions with at least 361 students in total (251 at City University Hong Kong 2020-2021, 80+ at NTU 2022-2023, 30 Princeton remote 2021-2022). No quantitative learning outcome metrics reported.

  2. Game Implementation: Successfully implemented 8+ games across Shannon’s four categories including tic-tac-toe, Nim, Mancala, physics quiz, math RPG, audio recommender, code mentor, and role-playing games. No performance benchmarks provided.

  3. Nim Game Learning: Table II reports learning convergence. With pile size N=5 and maximum take K=3, training requires L=5 rounds; with N=21 and K=6, it requires L=90 rounds. No comparison to baseline methods.

  4. Mancala Leaderboard: Crowdsourced human vs AI competition with easy/medium/hard difficulty levels using minimax algorithm. Specific win/loss ratios not reported.

  5. System Performance: Claims low-latency gameplay through local state management, but no quantitative latency measurements provided.

Notable Absences: No head-to-head comparisons with other game AI systems, no standardized game benchmarks (like computer chess ratings), no quantitative measures of educational effectiveness, no performance metrics against SOTA game-playing agents. Results are primarily qualitative and deployment-focused rather than performance-oriented.
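For context on the Nim parameters in item 3, the single-pile game with pile size N and maximum take K has a well-known closed-form strategy, which is the kind of result the "mathematically rigorous" category refers to. The sketch below assumes the normal-play convention (whoever takes the last stone wins); the digest does not state which convention the paper uses, and the function names are illustrative.

```python
from functools import lru_cache
from typing import Optional


def optimal_take(n: int, k: int) -> Optional[int]:
    """Winning move for single-pile Nim (take 1..k, last stone wins), or None.

    Leaving the opponent a multiple of (k + 1) stones is winning, so the
    player to move takes n mod (k + 1) when that remainder is non-zero.
    """
    remainder = n % (k + 1)
    return remainder if remainder else None


@lru_cache(maxsize=None)
def wins(n: int, k: int) -> bool:
    """Brute-force check: True if the player to move can force a win."""
    return any(not wins(n - t, k) for t in range(1, min(k, n) + 1))


print(optimal_take(5, 3))   # 1: take one stone and leave 4
print(optimal_take(21, 6))  # None: 21 is a multiple of 7, so every move loses
```

Under this convention the N=21, K=6 configuration is a losing position for the player who moves first, whereas N=5, K=3 is a win. The brute-force `wins` function is included only to cross-check the formula on small cases.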

Compute & Efficiency
  1. Model Size: Not specified - uses external LLM APIs (GPT-4, GPT-3.5) via BYOK model, so depends on chosen provider. No local model parameters reported.

  2. Training Compute: Not reported. Uses web-based platform with API calls to external LLM services, so compute handled by providers (OpenAI, etc.).

  3. Inference Speed/Latency: Claims “low latency” through local state management where game logic runs locally and LLM only queried for strategy functions. No quantitative measurements provided.

  4. Memory Footprint: Game states stored in-memory during active play, with performance metrics batched and synchronized to cloud database asynchronously. Specific memory usage not reported.

  5. Deployment Practicality: High - web-based platform deployable on Facebook Messenger and Telegram. Model-agnostic architecture allows switching between LLM providers based on performance/cost needs. However, requires API access to commercial LLM services, creating ongoing operational costs and external dependencies.

Real-World Applicability
  1. Educational Deployment: Used in undergraduate courses across 3 universities (City University Hong Kong, NTU, Princeton) with at least 361 students in total over the 2020-2023 period.

  2. Platform Integration: Games deployed on real messaging platforms (Facebook Messenger, Telegram) showing practical integration with existing communication infrastructure.

  3. Web Accessibility: Live demonstration available at nemobot-neue-experiment.vercel.app with public access to source code and tutorials for reproducibility.

  4. Remote Learning: Tested in remote internship program with Princeton students (2021-2022) demonstrating effectiveness in distributed educational settings.

  5. Crowdsourcing Validation: Mancala leaderboard shows real human-AI interaction data collection, proving the framework works with actual user engagement.

However, the work focuses primarily on educational environments rather than production gaming systems. No evidence of commercial game deployment, industry adoption, or integration with professional game development pipelines.

Limitations & Failure Modes

Limitations:

  1. Long-term Strategy Planning (FUNDAMENTAL): LLMs excel at short-term decisions but struggle with long-term strategic thinking required for complex games like chess.

  2. Computational Scalability (ENGINEERING): LLM inference is resource-intensive, creating cost and latency challenges for real-time interactive games at scale.

  3. Quality Control of Crowdsourced Feedback (ENGINEERING): Ensuring consistency and quality of human feedback for training remains a bottleneck.

  4. Dependency on External APIs (ENGINEERING): BYOK model creates ongoing costs and external dependencies on commercial LLM providers.

  5. Limited Performance Evaluation (EVALUATION): No quantitative benchmarks against SOTA game AI systems or educational effectiveness metrics.

  6. Generalization vs Specialization Trade-off (FUNDAMENTAL): LLMs trained for generalization require extensive fine-tuning for specific game domains.

Failure Modes:

  1. Adversarial Exploitation: Like other game AI systems, could be vulnerable to adversarial strategies that exploit LLM weaknesses rather than developing robust gameplay.

  2. Non-deterministic Behavior: LLM-based agents may produce inconsistent strategies across similar game states, reducing reliability for educational or competitive use.


Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation

Authors: Jing Zhang, Wentao Jiang, Tao Huang, Zhiwei Wang et al. (11 authors) · Institution: Wuhan University · Category: cs.CV

Echo-α trains MLLMs to coordinate ultrasound lesion detectors through an invoke-and-reason loop, treating detector outputs as revisable evidence rather than final predictions.

Practical Takeaway: The key insight is treating specialized detector outputs as revisable evidence rather than final predictions within an MLLM reasoning loop. The staged RL approach (first optimize for grounding, then for diagnosis) is a practical pattern for multi-objective medical AI. The detector-agnostic experiments showing consistent improvements across different backbones suggest the approach could be broadly applicable. However, the computational overhead of the two-stage inference and modest performance gains suggest this is most valuable in high-stakes clinical scenarios where interpretability and evidence grounding outweigh efficiency concerns.

Tags: medical_imaging ultrasound multimodal_reasoning tool_augmentation agentic_AI reinforcement_learning lesion_detection clinical_diagnosis


Task & Setting

Real-world context: Ultrasound interpretation requires combining precise lesion localization with holistic clinical reasoning, but is highly operator-dependent and requires extensive expertise. Existing methods either excel at localization (specialized detectors) or reasoning (MLLMs) but not both, limiting clinical adoption.

Task definition: Input consists of ultrasound images paired with optional clinical context (patient history, chief complaint, etc.). The model must produce:

  1. grounded lesion localization with bounding boxes and coordinates, and
  2. clinical diagnosis with category labels.

For renal ultrasound, there are 6 lesion categories (Angiomyolipoma, Hydronephrosis, Renal Stone, Renal Cyst, Diffuse Renal Parenchymal Disease, Renal Malignant Tumor). For breast ultrasound, there are 6 BI-RADS categories (2-5, with 4a/4b/4c subdivisions).

Evaluation criteria: Grounding measured by instance-level F1 and image-level accuracy at IoU thresholds 0.25/0.5/0.75. Diagnosis measured by overall accuracy across all classes including negative cases. Multi-center protocol: validation uses same-center data, testing uses cross-center data.

The paper evaluates on renal and breast ultrasound benchmarks with COCO-format annotations, using a multi-center evaluation protocol to assess cross-institutional generalization.

Architecture & Method
  1. Base architecture: Qwen3-VL multimodal large language model as the central reasoning module within an invoke-and-reason framework.

  2. Detector tools: Organ-specific lesion detectors built on LW-DETR architecture with improvements, exposing unified function-calling interfaces for renal (6 categories) and breast (6 BI-RADS categories) detection.

  3. Agentic loop: Model first performs visual reasoning to form initial hypothesis, then invokes specialized detector via structured function call, receives rendered visualization with overlaid detections plus structured metadata (coordinates, confidence, labels).

  4. Evidence integration: Model compares detector feedback against its own perception and produces grounded diagnostic output that goes beyond detector-only inference.

  5. Training stages: (a) 9-task supervised curriculum covering foundational grounding (REC/REG), diagnostic reasoning, tool collaboration, and interaction loop; (b) Group Relative Policy Optimization (GRPO) reinforcement learning with different reward weightings.

  6. Reward function: Localization reward (smoothed IoU), classification reward (gated on sufficient localization + correct category), shape reward (Distance-IoU), and tool cost penalty.

Core contribution: Unifying detector precision with MLLM reasoning through learnable tool coordination, moving beyond treating detector outputs as final predictions to using them as revisable evidence.
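To make the reward components in item 6 concrete, here is a hedged sketch of how such a composite reward could be assembled. The weights, the IoU gate of 0.5, and the per-call tool cost are invented for illustration, and plain IoU stands in for the paper's "smoothed IoU", whose exact form is not given here. The Distance-IoU term uses its standard definition (IoU minus squared center distance over squared enclosing-box diagonal).

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def diou(a: Box, b: Box) -> float:
    """Distance-IoU: IoU penalized by normalized center distance."""
    d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
        + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both a and b.
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
        + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou(a, b) - (d2 / c2 if c2 > 0 else 0.0)


def reward(pred_box: Box, gt_box: Box, pred_label: str, gt_label: str,
           tool_calls: int, iou_gate: float = 0.5, w_loc: float = 1.0,
           w_cls: float = 1.0, w_shape: float = 0.5,
           tool_cost: float = 0.05) -> float:
    """Hypothetical weighting of the four reward terms named in the text."""
    loc = iou(pred_box, gt_box)
    # Classification reward is gated on sufficient localization.
    cls = 1.0 if (loc >= iou_gate and pred_label == gt_label) else 0.0
    shape = diou(pred_box, gt_box)
    return w_loc * loc + w_cls * cls + w_shape * shape - tool_cost * tool_calls


print(reward((0, 0, 2, 2), (0, 0, 2, 2), "Renal Cyst", "Renal Cyst", tool_calls=1))
```

Shifting weight between `w_loc`/`w_shape` and `w_cls` is one plausible way to obtain grounding-oriented and diagnosis-oriented variants, though the actual weightings are not reported in this digest.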

Training Recipe
  1. Supervised Fine-tuning stage: 9-task curriculum including foundational grounding (REC/REG), diagnostic reasoning (direct diagnosis, attribute explanations, multi-step analysis), tool collaboration (box refinement, error correction, joint assessment), and interaction loop (tool invocation, feedback interpretation, grounded synthesis). Training data constructed by prompting teacher model with ground-truth annotations and detector predictions. Specific optimizer, learning rate, batch size, hardware details not reported.

  2. Reinforcement Learning stage: Group Relative Policy Optimization (GRPO) starting from shared SFT initialization. Two variants with different reward weightings: Echo-α-Grounding (emphasizes localization/shape rewards) and Echo-α-Diagnosis (emphasizes classification rewards while preserving localization). Geometric augmentation applied during RL (random flipping, resizing, cropping, square resizing). Wall-clock time and hardware specifications not reported.

  3. Data scale: Multi-center renal and breast ultrasound datasets with COCO-format annotations, but specific dataset sizes not reported.
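The defining step of GRPO is that advantages are computed relative to a group of responses sampled for the same prompt, rather than from a learned value function. A minimal sketch of that normalization is shown below; the use of the population standard deviation and the `eps` constant are implementation assumptions, and the reward values are made up.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for several sampled responses to one ultrasound prompt (made-up values).
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Responses scoring above the group mean receive positive advantages and are reinforced; if all responses in a group score the same, every advantage is zero and that group contributes no gradient signal.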

Novelty & Lineage

Step 1 (Prior work):

  1. Toolformer established language models learning to invoke external tools during generation.
  2. ChatCAD/ChatCAD+ (2023-2024) demonstrated interactive medical diagnosis coupling language models with specialized medical modules.
  3. Recent medical MLLMs like LLaVA-Med and BiomedCLIP advanced medical visual-language understanding.

Step 2 (Delta): This paper specifically addresses ultrasound interpretation by training MLLMs to coordinate detector outputs rather than treating them as final predictions. The key addition is the invoke-and-reason loop where detector feedback becomes revisable evidence integrated with global visual context, optimized through staged RL with different reward trade-offs.

Step 3 (Applied-specific assessment): The architectural idea of treating detector outputs as revisable evidence rather than final predictions is a reasonable extension of existing tool-use paradigms to medical imaging. Benchmark gains are modest but consistent on the cross-center test sets: F1@0.5 improves by 6.62 points for renal and 6.53 for breast grounding over the direct MLLM+tool baseline, and by a smaller 4.10 and 1.77 points over the specialized detector. The comparison appears fair, using the same backbone (Qwen3-VL) across all MLLM variants. However, the evidence is limited to two organ-specific benchmarks whose sizes are not reported.

The detector-agnostic experiments showing consistent improvements across different detector backbones (+2.15 to +22.95 points) provide stronger evidence that the approach learns transferable reasoning rather than detector-specific fitting.

Verdict: INCREMENTAL. Solid application of established agentic/tool-use paradigms to ultrasound interpretation with modest but consistent cross-center improvements.

Benchmarks & Results
  1. Renal Ultrasound Grounding (Val): F1@0.5 = 70.78% vs 69.70% specialized detector, 56.56% direct MLLM+tool, +1.08/+14.22 point improvements

  2. Renal Ultrasound Grounding (Test): F1@0.5 = 56.73% vs 52.63% specialized detector, 50.11% direct MLLM+tool, +4.10/+6.62 point improvements

  3. Breast Ultrasound Grounding (Val): F1@0.5 = 50.37% vs 46.68% specialized detector, 36.61% direct MLLM+tool, +3.69/+13.76 point improvements

  4. Breast Ultrasound Grounding (Test): F1@0.5 = 43.78% vs 42.01% specialized detector, 37.25% direct MLLM+tool, +1.77/+6.53 point improvements

  5. Renal Ultrasound Diagnosis (Val): 77.43% vs 74.53% specialized detector, 63.99% direct MLLM+tool, +2.90/+13.44 point improvements

  6. Renal Ultrasound Diagnosis (Test): 74.90% vs 69.13% specialized detector, 66.99% direct MLLM+tool, +5.77/+7.91 point improvements

  7. Breast Ultrasound Diagnosis (Val): 48.75% vs 51.41% specialized detector, 37.71% direct MLLM+tool, -2.66/+11.04 point changes

  8. Breast Ultrasound Diagnosis (Test): 49.20% vs 46.96% specialized detector, 44.75% direct MLLM+tool, +2.24/+4.45 point improvements

Results are mixed: the model improves on the direct MLLM+tool baseline in every setting and on the specialized detector in seven of eight, but falls below the specialized detector on breast diagnosis validation (-2.66 points).

Compute & Efficiency
  1. Model size: Based on Qwen3-VL backbone (exact parameter count not specified, likely 4B+ parameters based on typical VL model scales)

  2. Training compute: GPU hours and specific hardware not reported

  3. Inference speed/latency: Not reported

  4. Memory footprint: Not reported

  5. Deployment practicality assessment: The two-stage approach (detector invocation + MLLM reasoning) likely increases inference cost significantly compared to detector-only approaches. Multi-center evaluation suggests reasonable transferability, but computational overhead of the agentic loop may limit real-time clinical deployment.

Real-World Applicability
  1. Multi-center evaluation protocol: Validation on same-center data, testing on cross-center data for both renal and breast ultrasound benchmarks, demonstrating cross-institutional generalization.

  2. Clinical context integration: Model processes real clinical information including patient history, chief complaints, and clinical indications alongside ultrasound images.

  3. Case studies: Two representative clinical cases (renal mass with dizziness/fatigue symptoms, breast lesion with nipple discharge history) show integration of detector evidence with clinical context.

  4. No deployment results: No reported integration into clinical workflows, PACS systems, or real-time clinical environments.

  5. No hardware constraints: No discussion of deployment on clinical workstations or mobile ultrasound devices.

Limitations & Failure Modes
  1. ENGINEERING: Computational overhead of two-stage inference (detector + MLLM) may limit real-time clinical deployment compared to detector-only approaches.

  2. EVALUATION: Limited to two organ systems (renal, breast) - generalization to other ultrasound applications unclear.

  3. ENGINEERING: Dependence on high-quality specialized detectors as tools - poor detector performance significantly impacts overall system capability.

  4. FUNDAMENTAL: Multi-center evaluation shows performance drops on cross-center data, indicating domain shift sensitivity despite agentic approach.

  5. EVALUATION: No comparison to radiologist performance or inter-rater agreement studies to establish clinical relevance of reported improvements.

Failure modes:

  1. May hallucinate clinical interpretations when detector provides weak or conflicting evidence.
  2. Tool invocation strategy may be suboptimal when detector fails completely or provides highly noisy outputs.

Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering

Authors: Xupeng Chen, Binbin Shi, Chenqian Le, Jiaqi Zhang et al. (8 authors) · Institution: New York University · Category: cs.AI

MEDVRAG combines page-image retrieval with iterative VLM reasoning for medical QA, achieving 78.6% average accuracy with modest gains over text-only approaches.

Practical Takeaway: If you’re building medical QA systems, consider that page-image retrieval can preserve important tabular and visual information lost in text-only approaches, though the gains over stronger text baselines remain unclear. The iterative reasoning with memory bank shows promise for multi-hop questions, but the substantial compute cost (~48s, 4×A100) limits practical deployment. The modular design (embedding → filtering → reasoning) is sensible for scaling, and the audit trail through memory bank addresses some interpretability needs for medical applications. However, focus on stronger text baselines and clinical validation before investing heavily in the multimodal approach.

Tags: medical-qa multimodal-rag document-retrieval vision-language-models iterative-reasoning biomedical-nlp page-image-retrieval memory-augmented-models


Task & Setting

Medical question answering from biomedical literature faces challenges in accessing rich visual content like tables, figures, and structured layouts present in original documents. Current medical RAG systems extract only text from documents, discarding visual elements that often convey critical information more effectively than linearized text. This is particularly problematic for complex multi-hop questions requiring synthesis across multiple sources.

The task is medical multiple-choice question answering. Input: medical questions with multiple-choice options from standardized exams. Output: selected option letter (A-D or yes/no/maybe). The system retrieves relevant evidence from a corpus of ~350K PMC document page images rather than text chunks. Success is measured by exact-match accuracy on four benchmarks: MedQA (1,273 USMLE questions), MedMCQA (4,183 questions), PubMedQA (500 expert-labeled), and MMLU-Medical (1,089 questions across six clinical subdomains).

The evaluation corpus contains ~350K pages from ~18K PMC Open Access articles (2000-2024), rendered at 300 DPI with near-duplicates removed (cosine similarity threshold 0.97). Text overlap with benchmark source articles is excluded by DOI/PMID matching.
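The near-duplicate removal step can be illustrated with a simple greedy filter at the stated 0.97 cosine threshold. Which page representation the authors compare is not specified in this digest, so the single-vector-per-page input below is an assumption, as are the function names.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def drop_near_duplicates(vectors: List[Sequence[float]],
                         threshold: float = 0.97) -> List[int]:
    """Return indices of pages kept after greedy near-duplicate removal."""
    kept: List[int] = []
    for idx, vec in enumerate(vectors):
        # Keep a page only if it is below the threshold against every kept page.
        if all(cosine(vec, vectors[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


pages = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
print(drop_near_duplicates(pages))  # [0, 2]: the second page duplicates the first
```

This pairwise loop is quadratic in the number of pages, so a corpus of roughly 350K pages would realistically need an approximate nearest-neighbor index instead.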

Architecture & Method
  1. Page-level embedding using ColQwen2.5 multi-vector retriever encoding each page into n patch-level vectors, PCA-reduced from 768 to 128 dimensions for efficiency

  2. Two-stage retrieval pipeline: Stage-1 embedding similarity using two-way late-interaction scoring:

    \[S(q, p) = \frac{1}{m}\sum_i \max_j q_i^T v_j + \frac{1}{n}\sum_j \max_i q_i^T v_j\]

    returning top N₁=2,000 pages via coarse-to-fine indexing (C=8 centroids per page, ANN over centroids, exact scoring on shortlist)

  3. Stage-2 LLM filtering using sharded MapReduce over Qwen3-30B-A3B: 8 parallel map calls each scoring 256 page summaries, emitting top-25; reduce call ranks surviving 200 into final N₂=100

  4. Iterative VLM reasoning with Qwen2.5-VL-32B-Instruct processing question + top-10 page images + top-20 summaries + memory bank across ≤3 rounds

  5. Structured memory bank accumulating iteration number, key findings, and reasoning history, enabling query refinement and evidence accumulation across rounds

The core contribution is combining multimodal page-image retrieval with iterative reasoning and memory, avoiding information loss from text linearization while enabling multi-hop question decomposition.
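The Stage-2 filter in item 3 follows a sharded map-reduce shape that can be sketched without any model calls. In this sketch the `score` callback is a hypothetical stand-in for an LLM relevance judgment, the reduce step simply reuses map scores instead of issuing a separate ranking call, and shards are processed sequentially rather than in parallel. The stated capacity of 8 × 256 summaries slightly exceeds N₁ = 2,000, so the sketch just splits the shortlist evenly.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]  # (question, page summary) -> relevance


def map_reduce_filter(question: str,
                      summaries: List[Tuple[int, str]],
                      score: Scorer,
                      n_shards: int = 8,
                      top_per_shard: int = 25,
                      final_k: int = 100) -> List[int]:
    """Return page ids that survive sharded scoring and a final ranking."""
    shard_size = -(-len(summaries) // n_shards)  # ceiling division
    survivors: List[Tuple[float, int]] = []
    for s in range(n_shards):
        shard = summaries[s * shard_size:(s + 1) * shard_size]
        # Map: score every summary in the shard and keep its local top-k.
        scored = sorted(((score(question, text), pid) for pid, text in shard),
                        reverse=True)
        survivors.extend(scored[:top_per_shard])
    # Reduce: rank the pooled survivors and keep the global top-k.
    survivors.sort(reverse=True)
    return [pid for _, pid in survivors[:final_k]]


# Toy corpus: 2,000 summaries whose relevance equals their page id.
corpus = [(i, f"page {i}") for i in range(2000)]
top = map_reduce_filter("toy question", corpus,
                        score=lambda q, text: float(text.split()[1]))
print(len(top), top[0])
```

One consequence of the per-shard cap is visible in the toy run: a page can rank in the global top 100 and still be dropped if more than 25 better pages share its shard.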

Training Recipe
  1. No model training reported - uses pre-trained components: ColQwen2.5 for embeddings, Qwen2.5-VL-7B for offline page summarization, Qwen3-30B-A3B for filtering, Qwen2.5-VL-32B-Instruct for reasoning

  2. Offline indexing: Pages embedded with ColQwen2.5, PCA reduction to d=128, k-means clustering to C=8 centroids per page for coarse-to-fine retrieval

  3. Page summaries generated offline by Qwen2.5-VL-7B (~120 tokens each) at temperature T=0

  4. All inference uses low-temperature decoding with a fixed random seed (42): VLM reasoning T=0.1 max_tokens=2048, LLM filter T=0 max_tokens=1024. Only the T=0 stages are strictly deterministic

  5. Hardware: 4×NVIDIA A100 80GB

  6. No fine-tuning or adaptation of base models - zero-shot evaluation throughout
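
The offline index in step 2 (per-page k-means centroids used for a coarse shortlist) can be illustrated as follows. This is a sketch under stated assumptions: the k-means is a plain Lloyd's loop, a brute-force scan stands in for the ANN search, and all names (`kmeans`, `coarse_shortlist`, the toy pages) are hypothetical rather than taken from the paper.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], c: int, iters: int = 10) -> List[Vector]:
    """Plain Lloyd's algorithm; initialised from the first c points for determinism."""
    centroids = [list(p) for p in points[:c]]
    for _ in range(iters):
        clusters: List[List[Vector]] = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[k]
            for k, cl in enumerate(clusters)
        ]
    return centroids


def coarse_shortlist(query: List[Vector], index: Dict[str, List[Vector]], top_n: int) -> List[str]:
    """Rank pages by a centroid-level score; brute force stands in for the ANN search."""
    def coarse(cents: List[Vector]) -> float:
        return sum(max(dot(q, c) for c in cents) for q in query)
    return sorted(index, key=lambda pid: coarse(index[pid]), reverse=True)[:top_n]


# Toy corpus with 2-d "patch vectors" and C=2 centroids per page (the paper uses d=128, C=8).
pages = {
    "page_a": [[1.0, 0.0], [0.9, 0.1], [1.0, 0.2]],
    "page_b": [[0.0, 1.0], [0.1, 0.9], [0.2, 1.0]],
}
index = {pid: kmeans(vecs, c=2) for pid, vecs in pages.items()}
shortlist = coarse_shortlist([[1.0, 0.0]], index, top_n=1)
```

In the described pipeline, exact late-interaction scoring would then run only on the shortlisted pages, which is what keeps Stage-1 latency low.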

Novelty & Lineage

Prior work:

  1. MedRAG (Xiong et al., 2024) - benchmarked RAG on medical corpora using text-chunk retrieval, achieved 82.8% MedQA with GPT-4
  2. ColPali (Faysse et al., 2024) - introduced multi-vector page-image retrieval for document VQA, but not applied to medical domains or iterative reasoning
  3. i-MedRAG (Xiong et al., 2025) - added iterative follow-up queries to medical RAG, but operated on text chunks

    Delta: This paper combines page-image retrieval (from ColPali lineage) with iterative reasoning and memory bank for medical QA, targeting document pages rather than text chunks.

    Applied-specific assessment:

    • Architectural idea is incremental: applies existing ColPali page-image retrieval + standard iterative prompting to medical domain
    • Benchmark gains are modest: +1.0 from page vs text, +1.5 from iteration, and +1.0 from memory bank (+3.5 combined); the full system is +5.8 over the no-RAG baseline
    • Comparison issues: text baseline uses only BGE-large on 512-token chunks; stronger text retrievers (E5-Mistral, cross-encoders, layout-aware parsing) not evaluated
    • Single-seed results without statistical validation; cross-paper comparison to MedRAG uses different prompts/settings
    • Gains likely depend on curated PMC corpus and substantial compute (4×A100, ~47.8s per 3-iteration query)

    Verdict: INCREMENTAL — solid engineering combining existing techniques but lacks architectural novelty or compelling evidence that page-image retrieval substantially outperforms stronger text baselines.

Benchmarks & Results
  1. MedQA: 79.4% accuracy (previous SOTA: MedRAG + GPT-4 82.8%, Med-PaLM 2 86.5%)

  2. MedMCQA: 69.2% accuracy (previous SOTA: MedRAG + GPT-4 66.7%, improvement +2.5%)

  3. PubMedQA: 77.2% accuracy (previous SOTA: MedRAG + GPT-4 70.6%, Med-PaLM 2 81.8%)

  4. MMLU-Medical: 88.6% accuracy (previous SOTA: MedRAG + GPT-4 87.2%, improvement +1.4%)

  5. Average across four benchmarks: 78.6% (vs MedRAG + GPT-4 76.8%, +1.8% improvement with caveats about cross-paper comparison)

    Results are mixed. Against the figures listed above, the system trails both MedRAG + GPT-4 and Med-PaLM 2 on MedQA and trails Med-PaLM 2 on PubMedQA, while exceeding MedRAG + GPT-4 on MedMCQA (+2.5 points), PubMedQA (+6.6 points), and MMLU-Medical (+1.4 points). The largest gain over MedRAG + GPT-4 is on PubMedQA. Single-seed evaluation without confidence intervals limits statistical interpretation.
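
The per-benchmark gaps to MedRAG + GPT-4 and the two averages can be recomputed from the accuracies listed above. The snippet below is only a bookkeeping sketch; the dictionary names are hypothetical, and the average is assumed to be an unweighted mean over the four benchmarks.

```python
# Accuracies (%) as listed in the benchmark items above.
ours = {"MedQA": 79.4, "MedMCQA": 69.2, "PubMedQA": 77.2, "MMLU-Medical": 88.6}
medrag_gpt4 = {"MedQA": 82.8, "MedMCQA": 66.7, "PubMedQA": 70.6, "MMLU-Medical": 87.2}

# Difference in percentage points per benchmark (positive means this system is higher).
deltas = {name: round(ours[name] - medrag_gpt4[name], 1) for name in ours}

# Unweighted mean across the four benchmarks (an assumption about how the average is formed).
avg_ours = round(sum(ours.values()) / len(ours), 1)
avg_medrag = round(sum(medrag_gpt4.values()) / len(medrag_gpt4), 1)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} points")
print(f"average: {avg_ours} vs {avg_medrag}")
```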

Compute & Efficiency
  1. Model size: Qwen2.5-VL-32B backbone (32 billion parameters) plus additional components (ColQwen2.5 embedder, Qwen3-30B filter)

  2. Training compute: Not reported (uses pre-trained models)

  3. Inference speed: Single iteration ~15.9s, full three-iteration pipeline ~47.8s on 4×A100, dominated by VLM reasoning (~12.1s per iteration), Stage-1 retrieval <30ms via coarse-to-fine indexing

  4. Memory footprint: ~350K pages with 128-dim embeddings (~35.9M patch vectors), substantial GPU memory for 32B VLM

  5. Deployment practicality: Poor for real-time applications due to ~48s latency for full pipeline, requires 4×A100 GPUs, but retrieval component is efficient at <30ms. Early stopping after Round 2 could reduce latency to ~32s while capturing 75% of recoverable errors.
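
The latency figures above are close to simple multiples of the per-iteration cost. A small sketch of that arithmetic, assuming each round costs the same ~15.9s (the constant and function names are hypothetical):

```python
PER_ITERATION_S = 15.9   # reported single-iteration latency
VLM_REASONING_S = 12.1   # reported VLM reasoning time per iteration


def pipeline_latency(rounds: int) -> float:
    """Estimated end-to-end latency, assuming every round costs the same."""
    return round(rounds * PER_ITERATION_S, 1)


full_pipeline = pipeline_latency(3)   # close to the reported ~47.8s
early_stop = pipeline_latency(2)      # matches the ~32s early-stopping estimate
vlm_share = round(VLM_REASONING_S / PER_ITERATION_S, 2)  # fraction of each round spent in the VLM
```

Under this assumption the VLM accounts for roughly three quarters of each round, so a faster retriever alone would not change the deployment picture much.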

Real-World Applicability
  1. No deployment results or production integration reported

  2. No hardware experiments outside of controlled benchmark evaluation

  3. Evaluation limited to multiple-choice questions from standardized exams, not real clinical scenarios

  4. No clinician-in-the-loop validation or clinical decision support evaluation

  5. System provides audit trail through memory bank and preserves original page images for inspection, which are necessary but not sufficient for clinical deployment

  6. Authors explicitly state the system is “not clinically validated” and requires “free-text generation, calibrated uncertainty, harmful-answer analysis, and clinician-in-the-loop review” before deployment

    The work remains purely academic with no demonstrated real-world applicability beyond benchmark performance.

Limitations & Failure Modes
  1. EVALUATION: Single-seed results without statistical validation or confidence intervals for reported gains

  2. EVALUATION: Text baseline uses only BGE-large embeddings; stronger retrievers not evaluated

  3. FUNDAMENTAL: Limited to PMC Open Access corpus, may not generalize to other medical literature or clinical guidelines

  4. ENGINEERING: High inference latency (~47.8s for full pipeline) limits practical deployment

  5. EVALUATION: Multiple-choice evaluation only, no free-text generation or clinical scenario testing

  6. FUNDAMENTAL: No clinical validation or safety analysis for medical decision support

  7. ENGINEERING: Requires substantial compute resources (4×A100 GPUs)

    Failure modes:

    • Retrieval miss (35.9% of errors): relevant pages not in top-100 retrieved
    • Iteration drift (14.9% of errors): refined queries move off-topic from original question

SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

Authors: Jielong Tang, Xujie Yuan, Jiayang Liu, Jianxing Yu et al. (9 authors) · Institution: Sun Yat-sen University · Category: cs.IR

SAKE introduces self-aware reasoning that adaptively balances internal knowledge exploitation and external knowledge exploration for grounded multimodal named entity recognition, achieving SOTA results while reducing search API usage.

Practical Takeaway: SAKE demonstrates that teaching models self-awareness about their knowledge boundaries enables more efficient and accurate multimodal reasoning. The key insight is using difficulty-aware sampling to create explicit uncertainty signals, then training with a hybrid reward that penalizes unnecessary search. For practitioners, this suggests: (1) cold-start SFT with explicit uncertainty modeling before RL, (2) hybrid rewards that balance performance and efficiency, and (3) structured reasoning formats that force models to express their confidence. SAKE's 68.8% average search rate, versus 100% for MAKAR at lower F1, suggests this approach could meaningfully reduce API costs in production systems.

Tags: multimodal named_entity_recognition reinforcement_learning self_awareness tool_use social_media knowledge_retrieval chain_of_thought

Task & Setting

Problem Definition: Grounded Multimodal Named Entity Recognition (GMNER) addresses the challenge of extracting named entities from text and simultaneously localizing them to visual regions in paired images. In open-world social media platforms, GMNER faces unique difficulties due to long-tailed entity distributions, rapidly evolving entities, and a high prevalence of unseen entities (≈54% in Twitter benchmarks).

Task Formulation: Given textual sequence T and corresponding image I, produce multimodal triplets Y = {(s_j, t_j, r_j)} where s_j is the token span, t_j is the entity type, and r_j ∈ R^4 is the visual bounding box (or None if no visual grounding exists).

Key Challenge: Current methods either rely on external knowledge exploration through heuristic retrieval (introducing noise for known entities) or internal knowledge exploitation via iterative refinement (limited by knowledge boundaries and prone to hallucinations).

Evaluation: Performance measured using F1 scores on two subtasks: Multimodal Named Entity Recognition (MNER) and Entity Extraction and Grounding (EEG). Evaluated on Twitter-GMNER and Twitter-FMNERG benchmarks with standard GMNER protocol.

Architecture & Method
  1. Base Architecture: Built on Qwen2.5-VL-7B-Instruct as the foundation multimodal large language model
  2. Agentic Framework: Multi-turn reasoning with adaptive tool invocation supporting text search, image search, and direct answer actions
  3. Difficulty-aware Search Tag Generation: Quantifies model uncertainty through N-sampling to generate explicit knowledge gap signals:

    \[H_{text}^{(j)} = \sum_{k=1}^N \mathbb{I}[(\hat{s}_j^{(k)}, \hat{t}_j^{(k)}) = (s_j, t_j)]\]

    \[H_{region}^{(j)} = \sum_{k=1}^N \mathbb{I}[(\hat{s}_j^{(k)} = s_j) \land (IoU(\hat{r}_j^{(k)}, r_j) > 0.5)]\]
  4. Action Space: Three structured actions (text search, image search, and direct answer), each emitted with explicit reasoning tags
  5. Self-aware Reasoning: Model explicitly expresses uncertainty and decides when external retrieval is necessary versus relying on internal knowledge
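
The hit counts H_text and H_region from step 3 can be sketched as below. This is an illustration, not the authors' code: boxes are assumed to be (x1, y1, x2, y2) tuples, a missing box is assumed to never count as a region hit, and how the counts are thresholded into search tags is not specified here. All names and sample values are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]        # (x1, y1, x2, y2), an assumed convention
Prediction = Tuple[str, str, Optional[Box]]    # (span, entity type, box or None)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def hit_counts(samples: List[Prediction], gold: Prediction) -> Tuple[int, int]:
    """Return (H_text, H_region) for one gold entity over N sampled predictions."""
    span, etype, box = gold
    h_text = sum(1 for s, t, _ in samples if (s, t) == (span, etype))
    h_region = sum(
        1 for s, _, r in samples
        if s == span and r is not None and box is not None and iou(r, box) > 0.5
    )
    return h_text, h_region


gold = ("Messi", "PER", (0.0, 0.0, 10.0, 10.0))
samples = [                                   # N=4 made-up forward samples
    ("Messi", "PER", (0.0, 0.0, 10.0, 10.0)),   # text hit, region hit
    ("Messi", "PER", (5.0, 5.0, 15.0, 15.0)),   # text hit, IoU too low
    ("Messi", "ORG", (0.0, 0.0, 10.0, 9.0)),    # wrong type, region hit
    ("Lionel", "PER", None),                    # wrong span
]
h_text, h_region = hit_counts(samples, gold)
```

Low counts indicate entities the model rarely gets right on its own, which is the signal the method uses to mark where external search is worth invoking.
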
Training Recipe
  1. Stage 1 - Cold Start with SFT: Train on SAKE-SeCoT dataset (2,764 high-quality Chain-of-Thought trajectories) with search tags based on difficulty-aware generation. SFT objective:

    \[L_{SFT} = -\mathbb{E}_{T \sim D_{SeCoT}} \sum_{t=1}^{|T|} \log \pi_\theta(a_t | o_{\leq t})\]
  2. Stage 2 - Agentic Reinforcement Learning: Group Relative Policy Optimization (GRPO) with hybrid reward function:

    \[R(T) = \lambda_{F1} R_{F1} + \lambda_{fmt} R_{fmt} - \lambda_{search} \cdot \mathbb{I}[R_{F1} \geq \gamma] \cdot N_{search}\]
  3. Hardware: 8 NVIDIA H100 GPUs with 80GB memory
  4. Implementation: ms-swift framework for SFT, veRL for RL stage
  5. Data: Search tags generated with N=4 forward samplings, teacher model supervision from Qwen3-VL-Plus
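
The hybrid reward in Stage 2 can be written out directly. The sketch below follows the formula term by term, but the weights and threshold (`lam_f1`, `lam_fmt`, `lam_search`, `gamma`) are placeholder values chosen for illustration; the digest does not report the actual settings.

```python
def hybrid_reward(
    f1: float,
    format_ok: bool,
    n_search: int,
    lam_f1: float = 1.0,      # hypothetical weight on the F1 reward
    lam_fmt: float = 0.5,     # hypothetical weight on the format reward
    lam_search: float = 0.1,  # hypothetical per-search penalty
    gamma: float = 0.8,       # hypothetical F1 threshold for applying the penalty
) -> float:
    """R = lam_f1 * R_F1 + lam_fmt * R_fmt - lam_search * 1[R_F1 >= gamma] * N_search."""
    r_fmt = 1.0 if format_ok else 0.0
    # Searches are only penalised when the answer is already good enough.
    penalty = lam_search * n_search if f1 >= gamma else 0.0
    return lam_f1 * f1 + lam_fmt * r_fmt - penalty


good_with_search = hybrid_reward(f1=0.9, format_ok=True, n_search=2)
good_without_search = hybrid_reward(f1=0.9, format_ok=True, n_search=0)
poor_with_search = hybrid_reward(f1=0.5, format_ok=True, n_search=2)
```

Because the indicator gates the penalty on R_F1 ≥ γ, a trajectory that searches and still answers poorly is not punished further for searching, while one that answers well is nudged toward doing so with fewer calls.
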
Novelty & Lineage

Prior Work:

  1. MoRE (2022): Multi-modal retrieval for all entities via heuristic search - achieved external knowledge but introduced noise
  2. MAKAR (2025): Multi-agent framework with knowledge exploration - best previous SOTA but with 100% search rate
  3. Internal exploitation methods like UnCo, RiVEG: Iterative refinement limited by knowledge boundaries

    Delta: SAKE introduces self-aware reasoning to adaptively decide when to exploit internal knowledge vs. explore external sources. Key innovations:

    • Difficulty-aware search tag generation to create explicit uncertainty signals
    • Two-stage training with cold-start SFT followed by agentic RL
    • Hybrid reward function that penalizes unnecessary retrieval

    Assessment:

    • Architectural novelty: The self-awareness mechanism via difficulty-aware tagging is non-obvious and addresses a real limitation
    • Benchmark gains: +3.75 and +2.91 GMNER F1 points over MAKAR on Twitter-GMNER and Twitter-FMNERG respectively, while reducing search rate from 100% to 68.8%
    • Fair comparisons: Uses same base model (Qwen2.5-VL-7B) as recent baselines with consistent evaluation protocols
    • Scalability: Gains likely dependent on quality of search tools and two-stage training paradigm

    Verdict: SIGNIFICANT — The self-aware reasoning paradigm that dynamically balances internal vs external knowledge represents a clear advance over rigid exploration/exploitation strategies, with substantial empirical gains.

Benchmarks & Results
  1. Twitter-GMNER: SAKE achieves 75.63% F1 (GMNER), 86.95% (MNER), 78.76% (EEG) vs. previous SOTA MAKAR at 71.88%, 86.38%, 74.64% respectively - improvements of 3.75%, 0.57%, 4.12%
  2. Twitter-FMNERG: SAKE achieves 63.45% F1 (GMNER), 73.37% (MNER), 77.12% (EEG) vs. MAKAR at 60.54%, 71.24%, 75.66% - improvements of 2.91%, 2.13%, 1.46%
  3. Search Efficiency: SAKE maintains 68.8% average search rate compared to MAKAR’s 100%, showing significant efficiency gains
  4. Unseen Entity Performance: Substantial improvements on unseen entities (66.07% vs 58.48% without search on Twitter-GMNER)
  5. Consistent across settings: Strong performance on both seen and unseen entity categories
Compute & Efficiency
  1. Model size: 7B parameters (Qwen2.5-VL-7B base)
  2. Training compute: 8 NVIDIA H100 GPUs (80GB each), wall-clock time not reported
  3. Inference speed: Not explicitly reported, but reduced search rate (68.8% vs 100%) indicates improved efficiency
  4. Memory footprint: Not reported beyond GPU requirements
  5. Deployment practicality: End-to-end framework with external search API dependencies; more practical than multi-agent systems due to reduced search overhead and single-model architecture
Real-World Applicability
  1. Social Media Data: Evaluated on real Twitter datasets with authentic long-tail entity distributions and multimodal content
  2. Unseen Entity Performance: Specifically addresses open-world scenarios with 54% unseen entities in benchmark data
  3. Search Tool Integration: Uses standard web search APIs (text and image search) that can be readily deployed
  4. No Sim-to-Real Gap: Works directly on real social media images and text without synthetic training data
  5. Production Considerations: Requires external search API access and multi-turn inference, but significantly more efficient than previous multi-agent approaches
Limitations & Failure Modes
  1. External Dependency: FUNDAMENTAL - Requires access to external search APIs which may have availability, latency, or cost constraints
  2. Search Quality Bottleneck: FUNDAMENTAL - Performance limited by quality and relevance of search tool results
  3. Text-Dominant Reasoning: EVALUATION - Current analysis shows text search more effective than image search, suggesting underutilized visual reasoning
  4. Difficulty Level Sensitivity: ENGINEERING - Performance varies with N-sampling parameter for search tag generation
  5. Two-Stage Training Complexity: ENGINEERING - Requires careful balance between SFT and RL stages

    Failure Modes:

    • May over-rely on search for ambiguous but known entities when uncertainty estimation is imprecise
    • Could fail on domains where search engines lack relevant multimodal content

Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

Authors: Peixi Wu, Ke Mei, Feipeng Ma, Bosong Chai et al. (18 authors) · Institution: Tencent · Category: cs.CV

RIME replaces Chain-of-Thought reasoning with structured rewriting for generative multimodal embeddings, achieving modest but consistent improvements across retrieval benchmarks while reducing inference overhead by ~50%.

Practical Takeaway: If you’re working on multimodal retrieval systems, RIME offers a practical alternative to CoT-based generative embeddings that reduces inference overhead while maintaining performance gains. The cross-mode alignment mechanism is particularly interesting for deployment scenarios - you can pre-encode your corpus with generative embeddings offline for richer semantics, then use fast discriminative embeddings for real-time queries. The structured rewriting approach could be adapted to other MLLM backbones beyond Qwen2-VL. However, the fundamental inference latency limitation of generative embeddings remains, so consider whether the 2-4 point improvements justify the computational overhead for your specific use case.

Tags: multimodal_embeddings generative_models retrieval vision_language_models reinforcement_learning contrastive_learning query_rewriting

Task & Setting

Real-world context: Multimodal Large Language Models (MLLMs) have emerged as foundations for universal multimodal embeddings, replacing traditional dual-encoder architectures like CLIP that struggle with complex inputs such as videos, documents, and mixed-modal content. Current generative embedding approaches rely on Chain-of-Thought (CoT) reasoning, which generates redundant steps and introduces semantic ambiguity in broader retrieval scenarios.

Task definition: Given multimodal inputs (text, images, videos, visual documents), generate embeddings that enable effective retrieval across diverse scenarios. The method processes queries and targets through structured rewriting rather than CoT reasoning, jointly optimizing:

\[L_{Joint} = \lambda \cdot L_{Rewrite} + L_{CM\_InfoNCE}\]

\[L_{CM\_InfoNCE} = L_{disc} + L_{gen} + L_{intra}\]

where discriminative and generative embeddings are aligned through cross-mode contrastive learning.

Evaluation criteria: Success is measured using Hit@1 on MMEB-V2 (78 datasets across image, video, visual document tasks), nDCG@10 on MRMR (11 expert-level reasoning tasks), and Recall@1 on UVRB (16 video retrieval subtasks).

Dataset/benchmark: MMEB-V2 covers 78 datasets spanning classification, QA, retrieval, grounding, and moment retrieval across three modalities. Training uses ~1.5M samples from VLM2Vec datasets, LLaVA-Hound, ViDoRe, and VisRAG.

Architecture & Method
  1. Base architecture: Qwen2-VL-2B/7B-Instruct as backbone MLLM with vision and language understanding capabilities

  2. Rewrite-driven generation: Replace CoT reasoning with structured rewriting that performs image assessment, recaption, and text explanation without forced summarization

  3. Dual embedding extraction: Extract both discriminative embeddings (from a designated embedding token) and generative embeddings (from a designated embedding token emitted after rewrite generation)

  4. Cross-mode alignment (CMA): Align discriminative and generative embedding spaces via inter-mode and intra-mode InfoNCE losses enabling mutual retrieval

  5. Joint optimization: Unified training with autoregressive language modeling loss for rewrite generation and contrastive learning for embedding alignment

  6. Refine reinforcement learning: Use discriminative embeddings as stable semantic anchors to guide rewrite policy optimization with rewards:

    \[R(o) = R_{Format}(o) + R_{Gap}(o) + R_{Process}(o)\]

    where process reward encourages generative embeddings to exceed discriminative embedding similarity gaps.
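
The contrastive terms in the cross-mode alignment are InfoNCE losses. The following is a minimal sketch of a single InfoNCE term with in-batch negatives, using the τ=0.02 temperature reported in the training recipe. The exact pairing behind L_disc, L_gen, and L_intra is not spelled out in this digest, so the discriminative-query versus generative-target pairing shown here is an assumption, as are all names and toy vectors.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(queries: List[Vector], targets: List[Vector], tau: float = 0.02) -> float:
    """Mean InfoNCE loss; targets[i] is the positive for queries[i], the rest are in-batch negatives."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, t) / tau for t in targets]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Hypothetical cross-mode pairing: discriminative query embeddings vs. generative target embeddings.
disc_queries = [[1.0, 0.0], [0.0, 1.0]]
gen_targets = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = info_nce(disc_queries, gen_targets)          # positives on the diagonal
shuffled_loss = info_nce(disc_queries, gen_targets[::-1])   # positives deliberately mismatched
```

With such a small temperature, even modest cosine gaps translate into large logit gaps, so the loss is near zero when pairs are aligned and grows sharply when they are not.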

Training Recipe
  1. Rewrite SFT stage: Train on ~1.5M samples from VLM2Vec datasets, LLaVA-Hound, ViDoRe, VisRAG with learning rate 5×10⁻⁵, batch size 512 via gradient accumulation, temperature τ=0.02, λ=1.0 for 1 epoch (~2500 steps)

  2. Refine-RL stage: Train on ~15K uniformly sampled multimodal samples using GRPO with group size K=8, clipping parameter ε=0.2, KL coefficient β=0.04, learning rate 5×10⁻⁶ for 1 epoch

  3. Hardware: 32×A800 GPUs for all experiments

  4. Data filtering: Initial sampling followed by CoT-guided filtering achieving ~80% retention rate across modalities

  5. Optimizer and schedule: Not explicitly reported beyond learning rates
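
The Refine-RL stage uses GRPO with group size K=8 and clipping parameter ε=0.2. The sketch below shows the two ingredients those hyperparameters control, in the form GRPO is commonly described: rewards normalised within a group, and a PPO-style clipped surrogate. It omits the KL term (β=0.04), uses the population standard deviation (implementations differ on this), and all names and reward values are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise each rollout's reward by the mean and std of its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one token or rollout (KL penalty not included)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


rewards = [0.2, 0.4, 0.4, 0.5, 0.6, 0.6, 0.7, 0.8]  # made-up rewards for K=8 rollouts
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, no separate value network is needed; rollouts are only compared against siblings generated from the same input.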

Novelty & Lineage

Prior work: VLM2Vec (2024) introduced MLLM-based discriminative embeddings with instruction tuning and contrastive learning. UME-R1 (2025) and Think-Then-Embed (2025) adopted CoT reasoning for generative embeddings but suffered from parameter redundancy and reasoning overhead.

Delta: This paper replaces CoT with structured rewriting that performs modality-specific recaption/explanation without forced summarization. It introduces cross-mode alignment, enabling mutual retrieval between discriminative and generative embeddings, and proposes Refine-RL, which uses discriminative embeddings as semantic anchors.

Applied-specific assessment:

  • Architectural novelty: The rewrite paradigm is a reasonable but not groundbreaking alternative to CoT - essentially trading step-by-step reasoning for structured description
  • Benchmark gains: Consistent but mostly modest improvements (roughly 2-4 points on most benchmarks, with a larger +8.5 on MMEB-V2 visual document tasks), within meaningful range but not transformative
  • Fair comparisons: Uses same backbone (Qwen2-VL) and comparable training data as baselines
  • Generalizability: Gains appear consistent across modalities and likely transferable to other MLLMs

Verdict: INCREMENTAL — solid engineering contribution that systematically improves upon CoT-based generative embeddings through a more retrieval-appropriate rewriting paradigm, but represents evolutionary rather than revolutionary progress.

Benchmarks & Results
  1. MMEB-V2 benchmark: RIME-7B achieves 68.6 vs UME-R1-7B 64.5 (+4.1), VLM2Vec-7B 52.3 (+16.3), Think-Then-Embed 68.6 (matches)

  2. MMEB-V2 image tasks: RIME-7B 73.4 vs UME-R1-7B 71.3 (+2.1)

  3. MMEB-V2 video tasks: RIME-7B 49.4 vs UME-R1-7B 47.5 (+1.9)

  4. MMEB-V2 visual document tasks: RIME-7B 75.6 vs UME-R1-7B 67.1 (+8.5)

  5. MRMR benchmark (nDCG@10): RIME-7B 50.2 vs UME-R1-7B 48.0 (+2.2), Ops-MM-Embed 48.1 (+2.1)

  6. UVRB benchmark (Recall@1): RIME-7B 55.6 vs Unite-7B 53.8 (+1.8), GME-7B 53.0 (+2.6)

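    Results show consistent gains over UME-R1-7B across the listed benchmarks, mostly in the 2-4 point range, with the largest gain on visual document tasks (+8.5). On overall MMEB-V2, RIME-7B matches Think-Then-Embed rather than exceeding it. No major weaknesses identified, but gains are evolutionary rather than breakthrough level.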

Compute & Efficiency
  1. Model size: Qwen2-VL-2B (~2B parameters) and Qwen2-VL-7B (~7B parameters)

  2. Training compute: 32×A800 GPUs, SFT stage ~2500 steps, RL stage 1 epoch, wall-clock time not reported

  3. Inference speed: Achieves ~50% token reduction (232 vs 475 average tokens) compared to CoT methods, but still requires autoregressive generation for rewrite process

  4. Memory footprint: Not explicitly reported, likely similar to base Qwen2-VL models

  5. Deployment practicality: Authors acknowledge inference latency remains a fundamental limitation of generative embeddings. Cross-mode alignment enables flexible deployment where discriminative embeddings can be used for low-latency scenarios while generative embeddings provide higher accuracy when needed.

Real-World Applicability
  1. No deployment results or production integration reported

  2. No hardware experiments with specific robots, vehicles, or real-world environments

  3. Evaluation limited to academic benchmarks (MMEB-V2, MRMR, UVRB) without real-world data assessment

  4. Authors acknowledge fundamental inference latency issues that hinder practical large-scale retrieval deployment

  5. Cross-mode alignment provides some practical flexibility by enabling offline corpus encoding with generative embeddings while using discriminative embeddings for real-time queries

    Real-world applicability appears limited by inference cost and lack of demonstrated deployment scenarios.

Limitations & Failure Modes
  1. Inference latency remains substantial despite 50% token reduction - FUNDAMENTAL limitation of generative embedding paradigm

  2. Evaluation limited to academic benchmarks without real-world deployment validation - EVALUATION gap

  3. Scalability concerns for large-scale retrieval systems due to autoregressive generation overhead - FUNDAMENTAL constraint

  4. Dependency on high-quality rewrite annotations from teacher models - ENGINEERING dependency

  5. Limited analysis of failure cases or edge scenarios where rewriting may introduce semantic drift - EVALUATION limitation

    Failure modes:

    • Structured rewriting may still generate irrelevant or misleading descriptions for complex multimodal content
    • Cross-mode alignment may fail when discriminative and generative embeddings capture fundamentally different semantic aspects