Apr 26, 2026 Applied AI 5 papers

Applied AI Digest — Apr 26, 2026

Today’s Digest at a Glance

Preliminary

Today’s papers explore multimodal reasoning architectures, trajectory prediction frameworks, and capability evaluation metrics, introducing several new techniques for improving cross-modal consistency and reasoning assessment.

Fine-grained Multimodal Reasoning (FiMR)

Traditional text-to-image generation relies on global verification that checks entire images against complete prompts, but this approach struggles with compositional scenes where individual elements may be correct while their relationships are wrong. Fine-grained Multimodal Reasoning addresses this by decomposing complex prompts into semantic units and performing localized verification.

The core idea uses Visual Question Answering (VQA) to break down verification: given a prompt “a red car next to a blue house,” FiMR generates separate questions like “Is there a red car?” and “Is the car next to the house?” Each question targets specific image regions, enabling precise error localization. The framework then applies targeted corrections only to regions that fail verification, preserving correct portions of the image.

Mathematically, if $P$ is the original prompt and $I$ the generated image, traditional approaches compute a global consistency score $S(P, I)$. FiMR instead decomposes $P = \{p_1, p_2, \ldots, p_k\}$ and computes localized scores $S(p_i, R_i)$, where $R_i$ is the image region corresponding to $p_i$, enabling targeted corrections. In effect, several specialized inspectors each check one aspect of a complex scene.
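The localized scoring can be illustrated with a toy sketch. Everything here is an assumption for illustration: `toy_score` stands in for a real VQA-based $S(p_i, R_i)$, regions are modeled as sets of detected labels rather than image crops, and the threshold is arbitrary.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical stand-in: a "region" R_i is a set of detected labels,
# not an actual image crop inspected by a VQA model.
Region = Set[str]


def toy_score(unit: str, region: Region) -> float:
    """Toy S(p_i, R_i): fraction of the unit's words found in the region."""
    words = unit.split()
    return sum(w in region for w in words) / len(words)


def failed_units(
    units: List[str],
    regions: List[Region],
    score_fn: Callable[[str, Region], float],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return the semantic units whose localized score falls below threshold."""
    scores = [score_fn(p, r) for p, r in zip(units, regions)]
    return [(p, s) for p, s in zip(units, scores) if s < threshold]


units = ["red car", "blue house"]
regions = [{"red", "car"}, {"green", "house"}]  # the house came out green
failures = failed_units(units, regions, toy_score, threshold=0.75)
print(failures)
```

Only the unit that fails is returned, which is what lets a corrector leave the correctly rendered car untouched.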

von Mises-Fisher (vMF) Spherical Geometry for RL

Standard reinforcement learning operates in Euclidean spaces, but many reasoning tasks naturally live on high-dimensional spheres where directions matter more than magnitudes, as in embedding spaces where semantic similarity is measured by cosine similarity. The von Mises-Fisher distribution extends RL to spherical manifolds by defining probability densities that concentrate around mean directions.

The vMF distribution on the unit sphere $S^{d-1} \subset \mathbb{R}^d$ has density $f(x; \mu, \kappa) = C_d(\kappa) \exp(\kappa \mu^\top x)$, where $\mu$ is the unit-norm mean direction, $\kappa \geq 0$ controls concentration, and $C_d(\kappa)$ is a normalization constant. For RL, this enables policy optimization where actions are unit vectors representing reasoning directions in embedding space.

The key insight is that spherical geometry naturally handles the constraint that reasoning states should have consistent magnitude while varying in direction. Instead of learning arbitrary vectors that might have meaningless magnitudes, the policy learns to navigate on the unit sphere where all valid states have equal “intensity” but differ in semantic direction.
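To make the density concrete, here is a minimal sketch restricted to the three-dimensional case, where the normalizer has the closed form $C_3(\kappa) = \kappa / (4\pi \sinh \kappa)$. The restriction to $d = 3$ and the function name are choices made for this illustration; the general case needs modified Bessel functions.

```python
import math
from typing import Sequence


def vmf_log_density_3d(x: Sequence[float], mu: Sequence[float], kappa: float) -> float:
    """Log of the vMF density on the unit sphere in R^3, for kappa > 0.

    Both x and mu are assumed to be unit vectors; this is not checked.
    """
    dot = sum(a * b for a, b in zip(mu, x))
    # log C_3(kappa) with C_3 = kappa / (4*pi*sinh(kappa)), rewritten as
    # kappa / (2*pi*exp(kappa)*(1 - exp(-2*kappa))) to avoid overflow.
    log_c = (
        math.log(kappa)
        - math.log(2.0 * math.pi)
        - kappa
        - math.log1p(-math.exp(-2.0 * kappa))
    )
    return log_c + kappa * dot


mu = (0.0, 0.0, 1.0)
aligned = vmf_log_density_3d(mu, mu, kappa=5.0)
opposite = vmf_log_density_3d((0.0, 0.0, -1.0), mu, kappa=5.0)
print(aligned - opposite)  # log-density gap between mu and its antipode
```

Larger $\kappa$ widens the gap between the mean direction and its antipode, which is the sense in which $\kappa$ controls concentration.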

PASS@(k,T) Capability Evaluation

Standard evaluation metrics like accuracy fail to distinguish whether improvements come from genuine capability expansion or merely better sampling efficiency. PASS@(k,T) addresses this by measuring success rates across both multiple attempts (k) and extended reasoning time (T).

The metric computes $\text{PASS@}(k,T) = \mathbb{P}[\text{success within } k \text{ attempts and } T \text{ time steps}]$. Genuine capability expansion manifests as improvements that persist even with unlimited attempts and time: if a model can solve a problem at all, it should eventually succeed given enough tries. Efficiency improvements only help at low $k$ and $T$ values.

Mathematically, if $p$ is the per-attempt success probability and attempts are independent, then $\text{PASS@}(k,\infty) = 1 - (1-p)^k$. Models with expanded capabilities show increased $p$, while efficiency improvements only reduce the variance around the same $p$. This metric reveals whether RL training genuinely teaches new reasoning patterns or just makes existing capabilities more reliable.
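A short numerical sketch of the closed form, assuming independent attempts with a fixed per-attempt success probability (the function name and the example probabilities are illustrative, not taken from the paper):

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k


# Any p > 0 approaches 1 as k grows, while p = 0 stays at 0 for every k.
for p in (0.0, 0.05, 0.10):
    print(p, [round(pass_at_k(p, k), 3) for k in (1, 10, 100)])
```

Under this model, a task the model cannot solve at all ($p = 0$) is never rescued by more samples, which is why large-$k$ behavior is informative about capability rather than sampling efficiency.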

Gated Context Projectors

Hierarchical reasoning systems face the challenge of maintaining consistency across multiple reasoning stages while avoiding information bottlenecks. Naive approaches either lose important context when transitioning between stages or become computationally prohibitive by passing all information forward.

Gated context projectors solve this using learnable gates that selectively compress and forward contextual information. Given context representations $C_t$ at stage $t$, the projector computes $G_t = \sigma(W_g C_t + b_g)$ where $G_t$ are gate values, then produces compressed context $\tilde{C}_t = G_t \odot \text{Proj}(C_t)$ where $\text{Proj}$ is a learned projection and $\odot$ denotes element-wise multiplication.

This acts like a smart summarization system that learns which aspects of previous reasoning stages matter most for future decisions, maintaining consistency without overwhelming later stages with irrelevant details.
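The gate-then-project computation can be sketched in plain Python. This is a minimal illustration that assumes $\text{Proj}$ is a single linear map and that the gate and projection share an output dimension; the weights below are arbitrary placeholders.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Matrix-vector product for row-major nested lists."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_projection(c: Vector, w_g: Matrix, b_g: Vector, w_p: Matrix) -> Vector:
    """Compute G * Proj(c) with G = sigmoid(W_g c + b_g), element-wise."""
    gates = [sigmoid(z + b) for z, b in zip(matvec(w_g, c), b_g)]
    projected = matvec(w_p, c)  # assumption: Proj is linear
    return [g * p for g, p in zip(gates, projected)]


context = [1.0, 2.0]
compressed = gated_projection(
    context,
    w_g=[[0.0, 0.0]],  # zero gate weights, so the gate is sigmoid(b_g)
    b_g=[0.0],
    w_p=[[1.0, 1.0]],
)
print(compressed)
```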

Reading Guide

Papers 1, 2, and 4 all tackle consistency in multimodal reasoning but at different scales: FiMR focuses on fine-grained image generation consistency, the driving VQA work addresses cross-stage reasoning consistency, and HyLaR handles consistency in hybrid discrete-continuous reasoning spaces. Paper 3 demonstrates how frozen language models can be adapted for specialized reasoning tasks through architectural modifications rather than retraining. Paper 5 provides the evaluation framework needed to properly assess whether these improvements represent genuine capability advances.

Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning

Authors: Yongjin Kim, Yoonjin Oh, Yerin Kim, Hyomin Kim et al. (8 authors) · Institution: Korea University · Category: cs.CV

FiMR improves text-to-image generation through decomposed VQA that breaks prompts into semantic units for fine-grained verification and localized image correction, achieving modest but consistent improvements over global reasoning approaches.

Practical Takeaway: If working on T2I generation systems, the key insight is using decomposed verification rather than holistic judgment for identifying misalignments. The decomposed VQA approach for breaking prompts into semantic units is implementable and shows consistent modest improvements. However, consider the significant engineering overhead: you’ll need to construct specialized reasoning datasets, implement multi-step inference pipelines, and accept 2-3x latency increases. The approach is most valuable for applications requiring high compositional accuracy where the extra computational cost is justified, particularly for counting and spatial relationship tasks.

Tags: text-to-image multimodal-reasoning visual-question-answering iterative-refinement compositional-generation multimodal-large-language-models image-editing prompt-alignment

Task & Setting

This paper addresses improving text-to-image generation quality through fine-grained multimodal reasoning. Existing unified multimodal large language models (MLLMs) can generate images but struggle with precise text-image alignment, particularly for compositional prompts with multiple objects, attributes, and spatial relationships.

The task is text-to-image generation with iterative refinement. Input: text prompt P describing desired image content. Output: generated image I that accurately reflects all semantic elements in P. The method performs iterative correction through three steps: (1) initial generation, (2) fine-grained feedback generation via decomposed VQA, and (3) localized image correction. The process continues until alignment is achieved or a maximum number of iterations is reached.

Success is measured on compositional benchmarks: GenEval (550 prompts across 6 categories: single object, two objects, counting, colors, position, color attributes), T2I-CompBench (6K prompts across color, shape, texture, spatial, non-spatial, complex), and DPGBench (1,065 long dense prompts). Metrics include accuracy scores for each category and overall performance.

The paper constructs a training dataset of 200K image-text pairs: 140K from FocusDiff for editing data and 60K synthetic composition-specific samples generated using Qwen-Image models.

Architecture & Method

The method uses Fine-grained Multimodal Reasoning (FiMR) built on Janus-Pro-7B unified MLLM architecture. The framework consists of three iterative steps:

Initial Text-to-Image Generation: Standard autoregressive image generation from text prompt P using the MLLM’s generation capability.
Fine-grained Feedback Generation: Four sub-processes:
- Prompt summarization: condense P into $P_{\text{summary}}$, removing subjective details
- Decomposition: break $P_{\text{summary}}$ into semantic tuples $T = \{T_i\}_{i=1}^{N}$ using Davidsonian Scene Graph format (entities, attributes, relationships, counting, text)
- Verification: perform VQA on each tuple $T_i$ against generated image I, producing a rationale and binary judgment $V_i$
- Feedback synthesis: consolidate failed rationales into explicit corrective instructions $F(I_{\text{misalign}})$ or a termination signal $F(I_{\text{align}})$
Localized Image Correction: Apply feedback F to perform targeted edits on misaligned regions while preserving correct areas.

The core technical contribution is decomposed VQA evaluation of minimal semantic units rather than holistic image-text alignment judgment, enabling precise localization and correction of specific misalignments.
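The three-step loop can be sketched as plain control flow. The callables `generate`, `verify`, and `correct` are hypothetical placeholders for the MLLM's generation, decomposed VQA, and editing stages, and the default cap of three rounds is an assumption for this sketch.

```python
from typing import Callable, List, Set, Tuple

Image = Set[str]  # toy stand-in: an "image" is the set of units it depicts


def refine(
    prompt: str,
    generate: Callable[[str], Image],
    verify: Callable[[str, Image], List[str]],
    correct: Callable[[Image, List[str]], Image],
    max_iters: int = 3,
) -> Tuple[Image, int]:
    """Generate once, then verify and correct until aligned or out of budget.

    Returns the final image and the number of correction rounds applied.
    """
    image = generate(prompt)
    for rounds_done in range(max_iters):
        failures = verify(prompt, image)
        if not failures:  # termination signal: every unit verified
            return image, rounds_done
        image = correct(image, failures)
    return image, max_iters


required = ["red car", "blue house"]
image, rounds = refine(
    "a red car next to a blue house",
    generate=lambda p: {"red car"},
    verify=lambda p, img: [u for u in required if u not in img],
    correct=lambda img, fails: img | {fails[0]},  # fixes one failure per round
)
print(sorted(image), rounds)
```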

Training loss for each step s:
\[L^{(s)}_{CE} = -\frac{1}{T} \sum_{t=1}^{T} \log p(y_t | \cdot)\]
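As a small numerical illustration of this loss, the sketch below averages the negative log-probability assigned to each target token. The probabilities are made up for the example.

```python
import math
from typing import Sequence


def sequence_cross_entropy(target_token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood over the T target tokens of one sequence."""
    if not target_token_probs:
        raise ValueError("need at least one token")
    total = sum(math.log(p) for p in target_token_probs)
    return -total / len(target_token_probs)


# Hypothetical probabilities the model assigned to each correct token.
loss = sequence_cross_entropy([0.5, 0.5, 0.5])
print(loss)
```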

Training Recipe

Supervised fine-tuning on the Janus-Pro-7B base model using a reasoning dataset of 200K samples:
- Data: 140K FocusDiff edit pairs + 60K synthetic composition samples from Qwen-Image
- Reasoning path construction: $\{P, I_{\text{misalign}}, \Phi(I_{\text{misalign}}), I_{\text{align}}, \Phi(I_{\text{align}})\}$ format
- Hardware: 8x NVIDIA A100 GPUs with DeepSpeed ZeRO-2
- Optimizer: AdamW (β1=0.9, β2=0.95, ε=1e-6)
- Learning rate: 2e-5 with cosine scheduler
- Training steps: 14K with global batch size 128
- Weight decay: 0.05, gradient clipping: 1.0
- Total compute: ~768 GPU hours
Additional 1B model training:
- Hardware: 8x NVIDIA H100 GPUs with DDP
- Training steps: 20K
- Total compute: ~120 GPU hours
- Same hyperparameters as 7B model
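As a rough sanity check on the 7B recipe, assuming every step consumes the full global batch and that the GPU-hour figure is summed over all 8 devices (the summary states neither explicitly):

\[14{,}000 \times 128 = 1{,}792{,}000 \text{ samples} \approx 8.96 \times 200\text{K}, \qquad \frac{768 \text{ GPU hours}}{8 \text{ GPUs}} = 96 \text{ hours}\]

Under those assumptions the run makes roughly nine passes over the reasoning dataset in about four days of wall-clock time.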

Data construction uses Qwen3-Next-80B-A3B for text tasks and Qwen-VL-32B for VQA verification.

Novelty & Lineage

Prior work:

Janus-Pro-R1 (Pan et al., 2025b) performs global image-text alignment judgment and regenerates the full image when misalignment is detected.
T2I-R1 (Jiang et al., 2025) uses Chain-of-Thought prompt augmentation but lacks post-generation correction.
ReasonGen-R1 (Zhang et al., 2025) enriches prompts through reasoning but performs no iterative refinement.

Delta: This paper adds decomposed VQA that breaks prompts into minimal semantic units (entities, attributes, relations) and verifies each unit independently. Instead of holistic judgment leading to full regeneration, it provides explicit feedback for localized corrections of only misaligned regions.

Applied-specific assessment:
- Architectural idea: Decomposed VQA for T2I is a reasonable extension of existing VQA techniques to generation tasks, not fundamentally novel
- Benchmark gains: Modest improvements (GenEval: 0.80→0.86, T2I-CompBench: marginal gains) within expected range for iterative refinement
- Comparisons: Fair same-backbone comparison with Janus-Pro-R1, but relies on significant additional training data (200K specialized reasoning samples)
- Scale dependency: Gains likely dependent on large model capacity and substantial fine-tuning data, questionable if approach works without this scale
The core insight of fine-grained verification vs. global judgment is sensible but not groundbreaking. Improvements are consistent but modest.

Verdict: INCREMENTAL — solid application of decomposed reasoning to T2I generation with expected modest gains.

Benchmarks & Results

GenEval Overall: FiMR 0.86 (3rd iteration) vs Janus-Pro-R1 0.83 (3rd), improvement +0.03
GenEval Counting: FiMR 0.70 vs Janus-Pro-R1 0.51, improvement +0.19
GenEval Position: FiMR 0.84 vs Janus-Pro-R1 0.88, decline -0.04
GenEval Color Attributes: FiMR 0.76 vs Janus-Pro-R1 0.72, improvement +0.04
T2I-CompBench Color: FiMR 84.74 vs Janus-Pro-R1 83.53, improvement +1.21
T2I-CompBench Spatial: FiMR 43.05 vs Janus-Pro-R1 37.35, improvement +5.70
T2I-CompBench Non-Spatial: FiMR 31.61 vs Janus-Pro-R1 31.58, marginal +0.03
DPGBench Overall: FiMR 85.36 vs Janus-Pro-R1 84.13, improvement +1.23

Results show consistent but modest improvements across most categories, with the strongest gains in counting, where fine-grained verification helps most. Some categories show minimal improvement (T2I-CompBench Non-Spatial, +0.03) and GenEval Position regresses (-0.04), indicating limitations of the approach.

Compute & Efficiency

Model size: 7B parameters (Janus-Pro-7B backbone), also tested 1B variant
Training compute: 768 GPU hours on 8x A100 (7B), 120 GPU hours on 8x H100 (1B)
Inference speed: Not reported, but iterative refinement (up to 3 rounds) significantly increases latency vs single-shot generation
Memory footprint: Not reported, but decomposed VQA requires multiple forward passes per image
Deployment practicality: Limited due to multi-step inference process requiring 2-3x longer generation time, specialized reasoning dataset construction, and requirement for large model capacity to achieve reported gains. Framework adds substantial complexity over baseline generation.

Real-World Applicability

No deployment results or production integration reported
No hardware experiments beyond standard GPU training setups
No sim-to-real evaluation or real-world data testing described
Method evaluated only on curated benchmark datasets (GenEval, T2I-CompBench, DPGBench)
No discussion of robustness to out-of-distribution prompts or real user inputs
Framework requires significant computational overhead (multiple VQA rounds per image) that may limit practical deployment

Limitations & Failure Modes

ENGINEERING: Requires substantial specialized training data (200K reasoning samples) and large model capacity to achieve gains
ENGINEERING: Multi-step inference increases latency 2-3x compared to single-shot generation, limiting real-time applications
FUNDAMENTAL: Decomposed VQA may miss holistic semantic relationships that require understanding image as complete scene
EVALUATION: Limited evaluation on real-world diverse prompts beyond curated benchmarks
ENGINEERING: Framework complexity makes it difficult to integrate into existing T2I pipelines
FUNDAMENTAL: Relies on VQA model accuracy - errors in verification stage propagate to feedback generation

Failure modes:
VQA verification errors leading to incorrect feedback and misguided corrections
Over-correction where localized edits introduce new misalignments in previously correct regions.

Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors

Authors: Gautam Kumar Jain, Carsten Markgraf, Julian Stähler · Institution: Technische Hochschule Augsburg · Category: cs.CV

Introduces gated context projectors and prompt-based methods to improve cross-stage consistency in hierarchical driving VQA, achieving significant NLI contradiction reduction while revealing the importance of domain-specific pretraining.

Practical Takeaway: If you’re working on multi-stage VLM reasoning for safety-critical applications, this paper provides a useful engineering pattern: gated context projectors with sequential LoRA training can improve semantic consistency between reasoning stages while updating minimal parameters. The explicit prompt-based approaches offer strong zero-training baselines for cross-stage coherence. However, be aware that domain-specific pretraining appears essential for maintaining surface-level consistency alongside semantic improvements. The NLI-based evaluation framework for cross-stage consistency could be valuable for similar hierarchical reasoning tasks beyond driving.

Tags: autonomous_driving vision_language_models hierarchical_reasoning cross_stage_consistency context_passing QLoRA safety_critical_AI multi_modal_reasoning

Task & Setting

Graph Visual Question Answering (GVQA) for autonomous driving addresses the challenge of ensuring logical consistency across multi-stage reasoning in safety-critical applications. Current VLMs can generate locally fluent but globally contradictory outputs when processing perception, prediction, and planning decisions independently. Cross-stage coherence is essential for safe autonomous driving systems where planning decisions must align with a model’s own perception outputs.

The task takes as input multi-camera observations v ∈ V (six synchronized RGB frames from nuScenes, stitched into 1344×896 images) and three stage-specific questions covering Perception (identifying objects and scene elements), Prediction (forecasting object behaviors), and Planning (determining safe ego vehicle actions). The objective is to minimize cross-stage contradictions while maintaining high language quality:

\[\min_\theta \mathbb{E}_{(v,q)} \left[ \text{Contradiction}(\hat{a}_{\text{Perc}}, \hat{a}_{\text{Pred}}, \hat{a}_{\text{Plan}}) \right]\]

Success is measured through language quality metrics (BLEU-1, ROUGE-L, CIDEr) and cross-stage consistency metrics: lexical overlap between adjacent stages, rule-based structural consistency for driving-domain attributes, and NLI-based contradiction scores using multilingual classifiers.
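One way to turn NLI labels into a cross-stage contradiction score is sketched below. The exact scoring used in the paper is not given here, so this assumes hard labels over adjacent stage pairs; `toy_nli` is a keyword rule standing in for a real NLI classifier.

```python
from typing import Callable, List, Tuple

Scene = Tuple[str, str, str]  # (perception, prediction, planning) answers


def contradiction_rate(scenes: List[Scene], nli: Callable[[str, str], str]) -> float:
    """Fraction of adjacent stage pairs that the NLI function labels contradictory."""
    pairs = [(scene[i], scene[i + 1]) for scene in scenes for i in range(2)]
    flagged = sum(nli(premise, hypothesis) == "contradiction" for premise, hypothesis in pairs)
    return flagged / len(pairs)


def toy_nli(premise: str, hypothesis: str) -> str:
    """Hypothetical stand-in: flags 'maintain speed' after a pedestrian mention."""
    if "pedestrian" in premise and "maintain speed" in hypothesis:
        return "contradiction"
    return "neutral"


scenes = [
    ("pedestrian crossing ahead", "pedestrian will keep crossing", "maintain speed"),
    ("clear road", "no agents nearby", "maintain speed"),
]
rate = contradiction_rate(scenes, toy_nli)
print(rate)
```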

The study uses DriveLM-nuScenes dataset (v1.1, ~3,200 keyframes, 80/20 split) with 796 validation scenes for consistency analysis.

Architecture & Method

Two complementary approaches for cross-stage context passing in hierarchical driving VQA:
Explicit variant: Three prompt-based conditioning strategies on Mini-InternVL2-4B-DA-DriveLM without additional training:
- Flat baseline: independent processing per stage
- History-chain: multi-turn conversation with autoregressive memory
- Injection-chain: structured text prefixes from prior stage answers
Implicit variant: Learned gated context projectors on InternVL3-8B-Instruct with QLoRA adaptation:
- Context extraction: $h_k = H_k[\tau_k]$ (hidden state at final prompt token)
- Gated projection:
\[\tilde{h}_k = \sigma(g_k) \cdot W_k \frac{h_k}{\|h_k\|_2 + \epsilon}\]
- Single-position injection into the next stage’s input embeddings: $E_{k+1}[\tau_{k+1}] \leftarrow E_{k+1}[\tau_{k+1}] + \tilde{h}_k$
Sequential training protocol with frozen upstream adapters and trainable downstream adapters:
- Phase 1: Perc → Pred (freeze Perception, train Prediction + W_1, g_1)
- Phase 2: Perc → Pred* → Plan (freeze both upstream, train Planning + W_2, g_2)
Stage-specific QLoRA adapters (rank r=16, α=32) with 4-bit NF4 quantization, updating ~0.5% of parameters
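The gated projection and single-position injection can be sketched with plain lists. This is an illustrative reading of the formulas above rather than the authors' code: dimensions are tiny, the weights are placeholders, and `eps` is an assumed value.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_context(h: Vector, w: Matrix, g: float, eps: float = 1e-6) -> Vector:
    """sigmoid(g) * W * h / (||h||_2 + eps), with a scalar gate parameter g."""
    norm = math.sqrt(sum(v * v for v in h))
    unit = [v / (norm + eps) for v in h]
    projected = [sum(w_ij * u_j for w_ij, u_j in zip(row, unit)) for row in w]
    gate = sigmoid(g)
    return [gate * p for p in projected]


def inject(embeddings: Matrix, position: int, context: Vector) -> Matrix:
    """Add the context vector to one token position, leaving the input intact."""
    out = [row[:] for row in embeddings]
    out[position] = [e + c for e, c in zip(out[position], context)]
    return out


h_k = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
ctx = gated_context(h_k, identity, g=0.0)  # gate = 0.5
next_stage = inject([[0.0, 0.0], [1.0, 1.0]], position=1, context=ctx)
print(next_stage)
```

With a strongly negative gate parameter the injected vector is close to zero, so the downstream stage starts out nearly unchanged and the gate has to learn to open.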

Training Recipe

Explicit variant: No training - inference-only evaluation on pre-trained Mini-InternVL2-4B-DA-DriveLM
Implicit variant training stages:
- Stage 1: Independent LoRA adapters per reasoning stage (Perc/Pred/Plan) with frozen 8B backbone
- Stage 2: Sequential training Perc → Pred with frozen Perception adapter, trainable Prediction adapter + context projector W_1, gate g_1
- Stage 3: Sequential training Perc → Pred* → Plan with frozen upstream adapters, trainable Planning adapter + projector W_2, gate g_2
Training details:
- Data: DriveLM-nuScenes (~3,200 keyframes, 80/20 split), driving-domain QA pairs
- Optimizer: AdamW, base LR 1.5×10^-5, weight decay 0.05, cosine schedule with 10% warmup
- Context projectors: 3.0× base LR, gates: 25.0× base LR
- Hardware: 4-bit NF4 double quantization, BF16 precision
- Initialization: projector weights scale 0.01, gates g_k = -3.5 (σ(g_k) ≈ 0.029)
Wall-clock time and compute details: not reported
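The learning-rate setup can be made concrete with a short sketch. The group names, the linear warmup shape, decay to zero, and the step count are assumptions; only the base rate, the multipliers, and the 10% warmup fraction come from the recipe above.

```python
import math
from typing import Dict

BASE_LR = 1.5e-5
# Multipliers from the recipe; the group names are hypothetical.
LR_MULTIPLIERS = {"lora_adapter": 1.0, "context_projector": 3.0, "gate": 25.0}


def peak_lrs(base_lr: float = BASE_LR) -> Dict[str, float]:
    """Peak learning rate per parameter group."""
    return {name: base_lr * mult for name, mult in LR_MULTIPLIERS.items()}


def lr_at(step: int, total_steps: int, peak_lr: float, warmup_frac: float = 0.10) -> float:
    """Linear warmup followed by cosine decay to zero (assumed schedule shape)."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


peaks = peak_lrs()
print(peaks)
print(lr_at(500, 1000, peaks["gate"]))  # 1000 total steps is a placeholder
```

The much larger gate rate is consistent with the gates starting nearly closed: a small rate would leave them near 0.029 for most of training.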

Novelty & Lineage

Prior work:

DriveLM (Sima et al. 2023) introduced GVQA for autonomous driving with three reasoning stages but no cross-stage coherence mechanism.
Chain-of-thought methods (Ishaq et al. 2025, Wang et al. 2024) rely on autoregressive context windows rather than trainable routing.
End-to-end architectures (Jia et al. 2025, Tang et al. 2025) pass structured feature tokens between tasks, not compressed semantic states from natural language answers.

Delta: This paper adds (1) trainable gated context projectors that route compressed hidden states between discrete VQA stages, (2) modular sequential training with frozen upstream adapters, and (3) NLI-based cross-stage consistency evaluation for hierarchical driving VQA.

Applied-specific assessment:
- Architectural novelty: The gated context projector mechanism is a straightforward application of learned linear projections with normalization and gating - technically sound but not architecturally novel
- Benchmark gains: NLI contradiction reduction of 34-42.6% is meaningful for safety-critical applications, but gains are task-specific and limited to specialized driving VQA
- Fair comparisons: The two variants use different base models (4B domain-adapted vs 8B general-purpose), making direct comparison impossible and weakening the evaluation
- Scalability concerns: The approach requires driving-domain pretraining for surface-level consistency, limiting generalization beyond the specific nuScenes/DriveLM setting
Verdict: INCREMENTAL — solid engineering contribution applying known techniques (LoRA, hidden state routing, gating) to driving VQA with meaningful safety-relevant improvements, but lacks architectural novelty and suffers from non-comparable experimental design.

Benchmarks & Results

Language Quality (InternVL3-8B+QLoRA):
- BLEU-1: Sequential PPP skip+transfer 34.1 vs flat 30.1 (+13.3%)
- ROUGE-L: Sequential PPP skip+transfer 25.3 vs flat 21.6 (+17.1%)
- CIDEr: Sequential PPP skip+transfer 67.1 vs flat 51.5 (+30.3%)
Cross-stage NLI Consistency (primary metric):
- Explicit (4B-DA): History-chain reduces contradiction 0.461→0.264 (-42.6%)
- Implicit (8B): Planning-stage contradiction 0.340→0.223 (-34.4%, p<0.05)
- Cross-stage entailment increases by 50% (0.042→0.063)
Mixed results on surface metrics:
- Lexical overlap decreases significantly in implicit variant (0.102→0.054-0.067)
- Structural consistency degrades (0.684→0.188-0.359) due to mixed-language outputs
Statistical significance confirmed via bootstrap 95% CIs on 796 validation scenes
Notable benchmark absent: No comparison to state-of-the-art driving VQA methods beyond flat baselines
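For readers unfamiliar with bootstrap intervals, the sketch below shows a percentile bootstrap over per-scene scores. Whether the paper resamples at the scene level or uses the percentile method is an assumption, and the data here is synthetic.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    values: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - lo_idx - 1
    return means[lo_idx], means[hi_idx]


# Synthetic toy data: 1.0 marks a scene flagged as contradictory.
flags = [1.0] * 30 + [0.0] * 70
low, high = bootstrap_ci(flags)
print(low, high)
```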

Compute & Efficiency

Model size: 4B parameters (explicit variant), 8B parameters (implicit variant)
Training compute: Not reported - missing GPU hours, wall-clock time, hardware specifications
Inference speed/latency: Not reported
Memory footprint: 4-bit NF4 quantization with double quantization for 8B model, ~0.5% trainable parameters via QLoRA
Deployment practicality: Limited by requirement for driving-domain pretraining and specialized multi-camera sensor setup (nuScenes configuration), making real-world deployment challenging without significant adaptation costs

Real-World Applicability

No deployment results on actual autonomous vehicles reported
No hardware experiments beyond standard GPU training infrastructure
No production integration or real-world driving scenario testing
No sim-to-real transfer evaluation or discussion
Limited to curated nuScenes dataset evaluation - unclear how methods generalize to diverse real-world driving conditions, weather, lighting, or geographical variations
Mixed-language outputs in implicit variant would require additional post-processing for production deployment

Limitations & Failure Modes

FUNDAMENTAL: Requires driving-domain pretraining for surface-level consistency - implicit variant shows semantic gains but degrades lexical/structural metrics without domain adaptation
FUNDAMENTAL: Different base models (4B vs 8B) prevent direct comparison between explicit and implicit approaches, limiting scientific rigor
ENGINEERING: Mixed-language outputs (38% Chinese) in implicit variant due to lack of domain-specific training data
EVALUATION: Rule-based structural consistency checker can be gamed by hedging with contradictory actions, potentially missing coherent but decisive plans
ENGINEERING: Context projector initialization requires careful tuning (gate values, learning rate multipliers) for stable training

Failure modes:
Pedestrian crossing → maintain speed bias persists across both variants (dominant contradiction pattern)
Planning stage may suppress safety-relevant context through learned projectors, leading to overconfident but unsafe decisions

Frozen LLMs as Map-Aware Spatio-Temporal Reasoners for Vehicle Trajectory Prediction

Authors: Yanjiao Liu, Jiawei Liu, Xun Gong, Zifei Nie · Institution: Jilin University · Category: cs.CV

A framework that uses frozen large language models with reprogramming adapters to integrate vehicle trajectories and HD map data for trajectory prediction, demonstrating modest but consistent improvements from map information across multiple LLM architectures.

Practical Takeaway: This framework provides a systematic approach to evaluate how well different frozen LLMs can handle spatio-temporal reasoning for trajectory prediction. The key practical insight is that map information consistently improves prediction accuracy across multiple LLM architectures, with gains becoming more pronounced at longer prediction horizons. However, the computational overhead of hosting 7-8B parameter frozen models may limit real-world deployment. Research engineers working on trajectory prediction should consider this as a benchmarking tool for LLM evaluation rather than a production-ready system, and investigate whether similar gains can be achieved with smaller, more deployable models.

Tags: trajectory_prediction autonomous_driving large_language_models spatio_temporal_reasoning multi_modal_fusion HD_maps frozen_models reprogramming_adapters

Task & Setting

This paper addresses trajectory prediction for autonomous driving (AD), a critical safety component where vehicles must accurately forecast future movements of surrounding traffic participants to enable safe navigation decisions. The challenge lies in understanding complex spatio-temporal interactions between dynamic agents (vehicles, pedestrians) and static road infrastructure (lanes, intersections) in real-time.

The task takes as input (1) observed trajectories of the ego vehicle and surrounding agents over 2 seconds at 2 Hz sampling (4 timesteps), (2) local HD map patches containing lane geometry and topology, and (3) task description prompts. The output is predicted future trajectories for the next 6 seconds (12 timesteps). The formal objective minimizes trajectory prediction error:
\[\min_\theta \mathbb{E}_{(\tau_{obs}, \tau_{gt}, M)} \left[ \| \hat{\tau}_{1:N} - \tau^{gt}_{1:N} \|_2^2 \right]\]
where $\hat{\tau}_{1:N}$ are predicted trajectories and $\tau^{gt}_{1:N}$ are ground truth future positions.

Success is measured using: Average Displacement Error (ADE), Final Displacement Error (FDE), Missing Rate (MR) for predictions >2m from ground truth, and Inference Efficiency (IE) in seconds.
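The displacement metrics can be sketched as follows. ADE and FDE follow their standard definitions; treating a miss as a final displacement above the threshold is an assumption, since the summary only says predictions more than 2m from ground truth count as misses.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]


def ade(pred: Trajectory, truth: Trajectory) -> float:
    """Average Euclidean displacement over all predicted timesteps."""
    errors = [math.dist(p, g) for p, g in zip(pred, truth)]
    return sum(errors) / len(errors)


def fde(pred: Trajectory, truth: Trajectory) -> float:
    """Euclidean displacement at the final predicted timestep."""
    return math.dist(pred[-1], truth[-1])


def miss_rate(preds: List[Trajectory], truths: List[Trajectory], threshold: float = 2.0) -> float:
    """Fraction of predictions whose final displacement exceeds the threshold."""
    misses = sum(fde(p, g) > threshold for p, g in zip(preds, truths))
    return misses / len(preds)


pred = [(0.0, 0.0), (3.0, 4.0)]
truth = [(0.0, 0.0), (0.0, 0.0)]
print(ade(pred, truth), fde(pred, truth), miss_rate([pred], [truth]))
```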

The framework is evaluated on nuScenes dataset with annotated trajectories, surrounding agents, and HD maps, providing a standardized benchmark for autonomous driving trajectory prediction.

Architecture & Method

Scene Encoder: Processes ego and neighbor vehicle trajectories using vectorized states $\vec{x}_t^i = [x_t^i - x_{t-1}^i; x_t^i - x_t^0]$ combining displacement and relative position. Cross-attention mechanism encodes agent interactions:
\[h'_t = \sum_{i=1}^I \text{Softmax}\left(\frac{q_t^0 (k_t^i)^T}{\sqrt{d_{scene}}}\right) v_t^i\]
Map Encoder: Lightweight CNN processes local HD map patches $M_{local}$ to extract road topology features $h_{map} = f_{CNN}(M_{local})$ encoding lane geometry and connectivity.
Reprogramming Adapter: Maps scene features to LLM token space via learned transformation $b_t^{scene} = \text{Reprogram}(h_t)$ where $\text{Reprogram}: \mathbb{R}^{d_{scene}} \rightarrow \mathbb{R}^{d_{llm}}$.
Feature Fusion: Cross-attention between trajectory and map features: $\tilde{h}_t^{map} = \text{MHA}(h_t^{traj}, h_{map}, h_{map})$ followed by concatenation and linear fusion.
Frozen LLM Backbone: Off-the-shelf models (LLaMA2/3, Qwen2.5, Mistral, etc.) process fused scene+map tokens with task prompts without parameter updates.
Linear Decoder: Lightweight matrix transformation converts LLM outputs to trajectories: $\hat{\tau}_{1:N}^0 = \text{LinearDecoder}(\{e_t^{scene}\}_{t=1}^T)$.
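Two of the pieces above, the vectorized agent state and the scaled dot-product attention of the scene encoder, can be sketched in plain Python. This assumes 2D positions and omits the learned query, key, and value projections, so the inputs to `attend` are placeholders.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def vectorize(agent_xy: Sequence[Point], ego_xy: Sequence[Point], t: int) -> List[float]:
    """State at step t >= 1: [displacement since t-1 ; offset from ego at t]."""
    (x, y), (px, py), (ex, ey) = agent_xy[t], agent_xy[t - 1], ego_xy[t]
    return [x - px, y - py, x - ex, y - ey]


def attend(query: List[float], keys: List[List[float]], values: List[List[float]]) -> List[float]:
    """Softmax(q . k_i / sqrt(d)) weighted sum of the value vectors."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


state = vectorize([(0.0, 0.0), (1.0, 2.0)], [(0.0, 0.0), (0.0, 1.0)], t=1)
mixed = attend([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]])
print(state, mixed)
```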

The core contribution is seamless integration of frozen LLMs for trajectory prediction through reprogramming adapters, enabling systematic evaluation of spatio-temporal reasoning without model retraining.

Training Recipe

Scene Encoder Training: Trained end-to-end on nuScenes trajectory data using Adam optimizer. Specific learning rate, batch size, and training duration not reported.
Map Encoder Training: CNN trained jointly with scene encoder on HD map patches. Architecture details and training hyperparameters not reported.
Reprogramming Adapter Training: Learned mapping trained to project scene features into LLM embedding space. Training details not reported.
Frozen LLM Integration: Pre-trained LLMs (LLaMA2-7B, LLaMA3-8B, Qwen2.5, Mistral, Vicuna, WizardLM) used without parameter updates - completely frozen during trajectory prediction.
Linear Decoder Training: Simple matrix transformation trained jointly with other components. No separate training stage.

Data: nuScenes dataset with trajectories sampled at 2 Hz, 2-second observation windows predicting 6-second futures. Data filtering and preprocessing details not reported.

Hardware/Compute: Training hardware, wall-clock time, and computational requirements not reported.

Optimization Details: Specific optimizers, learning rates, batch sizes, and training schedules not reported for most components.

Novelty & Lineage

Prior Work:

Time-LLM (2024) - reprogrammed time-series features as tokens for frozen LLMs in forecasting
ST-LLM (2024) - represented spatial-temporal data as discrete tokens for traffic forecasting
Liu et al. (2025) - evaluated LLM extrapolation ability for vehicle trajectory prediction without map integration

Delta: This paper adds (1) systematic integration of HD map semantics with trajectory data through a CNN encoder and cross-attention fusion, (2) a comprehensive evaluation framework across multiple LLM architectures, and (3) quantitative analysis of the impact of map information on trajectory prediction accuracy.

Applied-Specific Assessment:
- Architecture: The reprogramming adapter approach is well-established (Time-LLM). Map integration via CNN+cross-attention is standard. Core novelty is combining these for trajectory prediction evaluation.
- Benchmark Gains: Modest improvements - adding neighbors reduces ADE by 10.26% at 2s, adding maps provides additional 1.99% reduction. Gains are incremental rather than transformative.
- Fair Comparisons: Limited baselines - primarily ablation studies. No comparison to dedicated trajectory prediction models or state-of-the-art AD systems. Missing comparisons to Transformer-based trajectory predictors.
- Scale Dependency: Framework requires frozen pre-trained LLMs - gains likely depend on large-scale pretraining that may not be reproducible.
Verdict: INCREMENTAL - Solid engineering combining existing techniques (reprogramming adapters, CNN map encoding) for a new application, but architectural contributions are straightforward extensions of known methods with modest empirical gains.

Benchmarks & Results

nuScenes ADE (2s): This paper 0.789±1.038, previous best not reported, improvement margin unknown
nuScenes ADE (4s): This paper 1.704±2.385, previous best not reported, improvement margin unknown
nuScenes ADE (6s): This paper 2.920±4.202, previous best not reported, improvement margin unknown
nuScenes FDE (6s): This paper 6.563±6.802, previous best not reported, improvement margin unknown
nuScenes Missing Rate: This paper 65.35%, previous best not reported, improvement margin unknown

Results Analysis: Mixed performance across LLM backbones. LLaMA3 achieves the best results while WizardLM performs poorly (2s ADE of 3.151 for WizardLM vs 0.789 for LLaMA3). Map information provides consistent but modest improvements (1-10% reduction in ADE/FDE). Inference efficiency ranges 0.034-0.037 seconds across models.

Missing Benchmarks: No comparison to dedicated trajectory prediction baselines (Social-GAN, Trajectron++, MultiPath), no Argoverse results, no comparison to state-of-the-art autonomous driving prediction models, no cross-dataset evaluation.

Compute & Efficiency

Model Size: Frozen LLM backbones range from 7B (LLaMA2) to 8B+ parameters (LLaMA3, others). Additional trainable components (scene encoder, map CNN, adapters, decoder) size not reported.
Training Compute: GPU hours, training hardware, and computational requirements not reported. Only mentions lightweight CNN and linear components.
Inference Speed: 0.034-0.037 seconds per prediction across different LLM backbones. LLaMA3 inference time: 0.037s, Qwen2.5: 0.034s.
Memory Footprint: Memory requirements for frozen LLMs and additional components not reported. Map processing adds minimal overhead via lightweight CNN.
Deployment Practicality: Limited. The 0.037s inference time is small relative to the 6s prediction horizon, but the framework requires hosting large frozen LLMs (7-8B parameters), which may be challenging for embedded automotive systems. Quantization and deployment optimization are not discussed.

Real-World Applicability

Dataset Limitations: Evaluation limited to nuScenes benchmark data - no real-world deployment testing or live vehicle integration reported.
Simulation Testing: No sim-to-real analysis or discussion of domain gap between nuScenes annotations and real-world trajectory prediction.
Hardware Integration: No experiments on actual autonomous vehicles or embedded automotive compute platforms.
Production Considerations: Framework requires 7-8B parameter frozen LLMs which may be impractical for real-time automotive deployment due to memory and compute constraints.
Environmental Robustness: No evaluation across diverse weather conditions, lighting scenarios, or geographic regions beyond nuScenes dataset scope.

The work remains primarily benchmarking-focused without demonstrated real-world deployment or practical integration into autonomous driving systems.

Limitations & Failure Modes

Computational Requirements (FUNDAMENTAL) - Framework requires hosting large frozen LLMs (7-8B parameters) which may be prohibitive for embedded automotive systems
Limited Baseline Comparisons (EVALUATION) - No comparison to state-of-the-art dedicated trajectory prediction models, making it difficult to assess true performance gains
Dataset Scope (EVALUATION) - Evaluation limited to single dataset (nuScenes) without cross-dataset validation or real-world testing
Map Dependency (FUNDAMENTAL) - Requires high-definition maps which may not be available in all driving environments or geographic regions
Inference Latency (ENGINEERING) - 0.037s per prediction may still be too slow for safety-critical real-time trajectory prediction in high-speed scenarios
Frozen Model Constraints (FUNDAMENTAL) - Cannot adapt LLM parameters for domain-specific trajectory prediction improvements, potentially limiting performance ceiling

Failure Modes:
- Novel Traffic Scenarios: May fail on traffic patterns not seen in nuScenes training data due to frozen LLM constraints
- Map Quality Degradation: Performance likely degrades significantly with low-quality or outdated HD maps, as map integration is core to the approach

Hybrid Latent Reasoning with Decoupled Policy Optimization

Authors: Tao Cheng, Shi-Zhe Chen, Hao Zhang, Yixin Qin et al. (6 authors) · Institution: Tencent · Category: cs.CV

HyLaR introduces DePO, a reinforcement learning algorithm that uses von Mises-Fisher spherical geometry and decoupled trust regions to effectively optimize hybrid discrete-continuous reasoning in multimodal language models.

Practical Takeaway: The key insight about hyperspherical geometry in normalized LLM representations is broadly applicable. If you’re working on continuous RL in latent spaces, consider von Mises-Fisher distributions instead of Gaussian assumptions - the closed-form KL divergence using cosine similarity is much more stable than sample-based estimation. The decoupled clipping approach (different ε values for discrete vs continuous actions) should be adopted for any hybrid discrete-continuous RL setting. For practitioners implementing latent reasoning in VLMs, the two-stage approach (SFT with canvas alignment, then DePO) provides a clear recipe that avoids the complexity of prior multi-stage frameworks.

Tags: multimodal-reasoning visual-language-models reinforcement-learning latent-space-optimization chain-of-thought high-resolution-perception policy-optimization hyperspherical-geometry

Task & Setting

Multimodal Large Language Models (MLLMs) struggle with complex visual reasoning due to “early semantic collapse” - they discretize continuous visual signals into text tokens too early, losing fine-grained spatial details needed for precise visual analysis. While Chain-of-Thought reasoning helps text-only tasks, adapting it to vision forces high-bandwidth visual information through a narrow textual bottleneck.

Task Definition: Given an image and question, the model must generate a hybrid response interleaving discrete text tokens with continuous visual latent representations. During inference, the model transitions into “canvas mode” using special tokens <|canvas_start|> and <|canvas_end|>, where hidden states recur as continuous embeddings rather than being mapped to discrete vocabulary. The objective combines cross-entropy loss for text generation with MSE alignment to ground-truth visual embeddings:

\[\mathcal{L}_{\text{SFT}} = \mathcal{L}_{\text{CE}} + \lambda \mathcal{L}_{\text{Canvas}}\]

where

\[\mathcal{L}_{\text{Canvas}} = \frac{1}{|\mathcal{S}|} \sum_{t \in \mathcal{S}} \|\mathbf{h}_t - \mathbf{e}_{t+1}\|_2^2\]
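
The combined objective above can be illustrated with a small pure-Python sketch. It assumes the cross-entropy term is already computed elsewhere, that `hidden[i]` and `targets[i]` are the hidden state and the next ground-truth canvas embedding for the i-th canvas position in the set S, and that `lam` is a placeholder weight (the digest does not report the value of λ). All names are hypothetical.

```python
def canvas_loss(hidden, targets):
    """Mean squared L2 distance between hidden states and target canvas embeddings.

    hidden[i]  : hidden state h_t at the i-th canvas position
    targets[i] : ground-truth embedding e_{t+1} for that position (already shifted)
    """
    if len(hidden) != len(targets) or not hidden:
        raise ValueError("need one target per canvas position")
    total = 0.0
    for h, e in zip(hidden, targets):
        total += sum((hi - ei) ** 2 for hi, ei in zip(h, e))
    return total / len(hidden)

def sft_loss(cross_entropy, hidden, targets, lam=1.0):
    """L_SFT = L_CE + lambda * L_Canvas; `lam` is an assumed placeholder value."""
    return cross_entropy + lam * canvas_loss(hidden, targets)

# Toy example with two canvas positions of dimension 2.
hidden = [[1.0, 0.0], [0.0, 1.0]]
targets = [[0.0, 0.0], [0.0, 0.0]]
print(sft_loss(0.5, hidden, targets, lam=0.1))
```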

Evaluation: Success measured on high-resolution perception benchmarks (V*, HRBench-4K/8K) and general VQA tasks (MMStar, MMVP, SeedBench2Plus, BLINK, HallusionBench) using accuracy metrics.

Dataset: Training uses Zebra-CoT for supervised fine-tuning (scientific problems, 2D/3D visual reasoning, visual logic games) and DeepEyes/Thyme/CodeDance for RL training.

Architecture & Method

Base Architecture: Standard MLLM with vision encoder, text encoder, and projector, extended with hybrid discrete-continuous action space capability.
Canvas Compression Module: Frozen SigLIP encoder processes ground-truth intermediate images into P=729 patch tokens, then learnable cross-attention compressor (L=2 layers, N=16 queries) aggregates patches into compact canvas embeddings.
Hybrid Generation Mechanism: Model alternates between discrete text generation and continuous latent recursion. Special control tokens <|canvas_start|> and <|canvas_end|> bound visual reasoning phases where hidden states from previous layer feed back as input embeddings, bypassing discrete vocabulary.
DePO (Decoupled Policy Optimization): Core technical contribution addressing hybrid discrete-continuous RL optimization. Key innovations:
- von Mises-Fisher (vMF) modeling for continuous latent policy on hypersphere rather than Euclidean Gaussian
- Decoupled trust-region clipping with separate constraints for text (ε_tok=0.2/0.28) and latent positions (ε_lat=0.05)
- Closed-form vMF KL regularization using exact cosine distance rather than sample-based estimation

The unified policy log-probability is:
\[\log \pi_\theta(a_t \mid s_t) = \begin{cases} \log \pi_\theta(a_t \mid s_t), & a_t \in \mathcal{V} \\ \log C_D(\kappa) + \kappa (\boldsymbol{\mu}_t^\theta)^\top \mathbf{\tilde{z}}_t, & a_t \in \mathbb{S}^{D-1} \end{cases}\]
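
To see why the vMF KL term reduces to a cosine distance, here is a short derivation sketch. It assumes the current and reference policies share the same fixed concentration $\kappa$ (the digest notes κ is fixed) and uses the standard vMF normalizer and mean; the symbols $\boldsymbol{\mu}_t^{\text{ref}}$ and $A_D$ are notation introduced here for illustration, not taken from the paper.

\[C_D(\kappa) = \frac{\kappa^{D/2-1}}{(2\pi)^{D/2} I_{D/2-1}(\kappa)}, \qquad \mathbb{E}_{\mathbf{\tilde{z}} \sim \text{vMF}(\boldsymbol{\mu}, \kappa)}[\mathbf{\tilde{z}}] = A_D(\kappa)\, \boldsymbol{\mu}, \qquad A_D(\kappa) = \frac{I_{D/2}(\kappa)}{I_{D/2-1}(\kappa)}\]

\[\text{KL}\big(\text{vMF}(\boldsymbol{\mu}_t^\theta, \kappa) \,\|\, \text{vMF}(\boldsymbol{\mu}_t^{\text{ref}}, \kappa)\big) = \kappa\, (\boldsymbol{\mu}_t^\theta - \boldsymbol{\mu}_t^{\text{ref}})^\top \mathbb{E}[\mathbf{\tilde{z}}_t] = \kappa\, A_D(\kappa) \big(1 - (\boldsymbol{\mu}_t^\theta)^\top \boldsymbol{\mu}_t^{\text{ref}}\big)\]

The $\log C_D(\kappa)$ terms cancel because both distributions use the same $\kappa$, so the penalty depends only on the cosine similarity between the two unit mean directions and needs no sampling.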

Training Recipe

Stage 1 - Cold-start SFT: Joint optimization with cross-entropy loss for text and MSE loss for canvas alignment on Zebra-CoT dataset. Learning rate 10^-5, per-GPU batch size 1, gradient accumulation 16 steps. Single epoch to avoid overfitting. Canvas compressor gradients flow end-to-end into LLM backbone.
Stage 2 - DePO Reinforcement Learning: From SFT checkpoint, apply decoupled policy optimization on curated samples from DeepEyes/Thyme/CodeDance. Learning rate reduced to 10^-6. Decoupled surrogate objective:
\[\mathcal{L}_{\text{PPO}}(\theta) = \mathcal{L}_{\text{tok}}(\theta \mid \mathcal{Z}; \epsilon_l^{\text{tok}}, \epsilon_h^{\text{tok}}) + \alpha \mathcal{L}_{\text{lat}}(\theta \mid \mathcal{S}; \epsilon_l^{\text{lat}}, \epsilon_h^{\text{lat}})\]
Combined with position-specific KL penalties (β_tok=0.01, β_lat=0.005).
Optimizer/Schedule: Not explicitly reported for RL stage.
Hardware: Not reported.
Data filtering: Ground-truth canvases compressed via frozen SigLIP + cross-attention compressor to avoid high-resolution computational overhead.
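
The decoupled surrogate in Stage 2 can be sketched as a standard clipped policy-gradient term evaluated with different clip ranges for text and latent positions. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, the latent range is assumed symmetric (ε_l = ε_h = 0.05), `alpha` is a placeholder weight, and the value is written as an objective to maximize (negate it to obtain a loss).

```python
def clipped_term(ratio, advantage, eps_low, eps_high):
    """PPO-style clipped surrogate for one position (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)

def decoupled_surrogate(steps, alpha=1.0, eps_tok=(0.2, 0.28), eps_lat=(0.05, 0.05)):
    """Average clipped terms separately over text and latent positions.

    steps : list of (kind, ratio, advantage) with kind in {"tok", "lat"}
    alpha : weight on the latent term (assumed placeholder value)
    """
    tok = [clipped_term(r, a, *eps_tok) for kind, r, a in steps if kind == "tok"]
    lat = [clipped_term(r, a, *eps_lat) for kind, r, a in steps if kind == "lat"]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(tok) + alpha * mean(lat)

# The same importance ratio is clipped far more tightly at a latent position.
steps = [("tok", 1.5, 1.0), ("lat", 1.5, 1.0)]
print(decoupled_surrogate(steps))
```

The tighter latent range reflects the variance mismatch described in this summary: importance ratios for continuous vectors can move much faster than those for discrete tokens, so a shared clip range would either over-constrain text or under-constrain latents.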

Novelty & Lineage

Prior Work:

LVR (2025): Aligns generated latent embeddings with auxiliary cropped images, uses supervised fine-tuning primarily
Monet (2025): Complex three-stage SFT pipeline with subsequent RL, but intricate design prone to training bias
SkiLa (2025): Straightforward SFT approach without RL optimization

Delta: This paper addresses a fundamental limitation in existing latent reasoning: optimizing hybrid discrete-continuous action spaces. Key innovations:
Geometric insight: Recognizes that normalized LLM representations live on hypersphere, requiring vMF distribution rather than Euclidean Gaussian
Variance mismatch solution: Decoupled clipping addresses different importance ratio behaviors between discrete tokens and continuous vectors
Exact KL regularization: Closed-form vMF KL divergence eliminates high-variance sampling

Assessment:
- Architectural novelty: The vMF spherical modeling is a non-obvious insight that elegantly matches LLM geometry
- Benchmark gains: Meaningful improvements (+7.33pp on V*, +7.00pp on HRBench-4K) that hold across multiple datasets
- Fair comparisons: Uses same base model (Qwen2.5-VL-7B), evaluates against both reproduced and original scores
- Scale dependence: Method appears to work without requiring massive compute or proprietary data
Limitations of Assessment: Some baselines show discrepancies between reproduced vs. original scores, suggesting evaluation protocol differences may affect comparisons.

Verdict: SIGNIFICANT — The geometric insight about hyperspherical LLM representations and the resulting vMF formulation provides a principled solution to hybrid RL optimization that should be relevant to the broader latent reasoning community.

Benchmarks & Results

V*: 83.77% (ours) vs 76.44% (Qwen2.5-VL-7B), +7.33pp. Previous SOTA latent methods: Monet 80.1%, SkiLa 78.53%
HRBench-4K Overall: 75.00% (ours) vs 68.00% (Qwen2.5-VL-7B), +7.00pp. Outperforms Monet (67.37%) and SkiLa (72.12%)
HRBench-8K Overall: 70.50% (ours) vs 63.75% (Qwen2.5-VL-7B), +6.75pp. Outperforms Monet (64.37%) and SkiLa (66.50%)
MMVP: 73.67% (ours) vs 65.67% (Qwen2.5-VL-7B), +8.00pp
MMStar: 62.00% (ours) vs 59.70% (Qwen2.5-VL-7B), +2.30pp
SeedBench2Plus: 70.32% (ours) vs 65.31% (Qwen2.5-VL-7B), +5.01pp
BLINK: 57.14% (ours) vs 53.60% (Qwen2.5-VL-7B), +3.54pp
HallusionBench: 63.68% (ours) vs 56.57% (Qwen2.5-VL-7B), +7.11pp

Results show consistent improvements across both high-resolution perception tasks and general VQA benchmarks. Notably strong performance on challenging visual search tasks (V*, HRBench) where fine-grained spatial reasoning is critical.

Compute & Efficiency

Model size: 7B parameters (based on Qwen2.5-VL-7B backbone)
Training compute: Not reported (GPU hours, hardware specifications not provided)
Inference speed/latency: Not explicitly measured, but framework designed to avoid pixel-level image generation overhead. Canvas mode uses continuous latent recursion rather than external tool calls to reduce latency compared to “Think-with-Images” approaches
Memory footprint: Not reported
Deployment practicality: High - auxiliary canvas compression module discarded post-training for inference efficiency. Framework avoids external tool dependencies that create deployment bottlenecks in agent-based approaches

Real-World Applicability

Evaluation on real-world data: Tests on high-resolution benchmarks (4K, 8K images) with realistic visual search scenarios where target regions occupy only 100-200 pixels in large cluttered images
Production considerations: Framework eliminates external tool dependencies and pixel-generation overhead, making it more suitable for deployment than agent-based alternatives
Hardware experiments: Not reported - no specific robot/vehicle deployment
Sim-to-real discussion: Not applicable - focuses on visual reasoning rather than embodied control

Note: While benchmarks use realistic high-resolution imagery, no actual production deployment results or real-world system integration are demonstrated.

Limitations & Failure Modes

FUNDAMENTAL: Method still requires ground-truth intermediate visual supervision during SFT stage, limiting scalability compared to pure self-supervised approaches
FUNDAMENTAL: Canvas compression introduces information bottleneck - compressing 729 patch tokens to 16 embeddings may lose fine-grained spatial details critical for some tasks
ENGINEERING: Fixed concentration parameter κ in vMF distribution - adaptive concentration could potentially improve optimization dynamics
ENGINEERING: Limited ablation on canvas length budget and maximum reasoning steps - optimal allocation unclear
EVALUATION: Some baseline reproductions show significant discrepancies from original reported scores, questioning evaluation protocol consistency
EVALUATION: Missing comparisons to some recent latent reasoning methods and agent frameworks

Failure Modes:
“Over-thinking” degradation: SFT models perform worse when inference steps significantly exceed training steps, though RL mitigates this
Canvas boundary artifacts: Model must learn when to exit canvas mode appropriately, potential for premature or delayed transitions

Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

Authors: Zhiyuan Zhai, Wenjing Yan, Xiaodan Shao, Xin Wang · Institution: Fudan University · Category: cs.LG

Introduces PASS@(k,T) metric showing that reinforcement learning genuinely expands LLM agent capabilities on compositional tool-use tasks, contradicting static-reasoning conclusions about RL only improving efficiency.

Practical Takeaway: If you’re building tool-using LLM agents for compositional tasks requiring multi-step information gathering, use reinforcement learning with task reward rather than supervised fine-tuning on expert trajectories - this paper shows SFT actually regresses capability boundaries on sequential retrieval tasks while RL expands them. The PASS@(k,T) evaluation framework is immediately useful for diagnosing whether training improvements come from genuine capability expansion versus efficiency gains. Focus RL training on compositional tasks where base models have sparse but existing solution strategies that need reweighting rather than simple tasks where expert demonstrations suffice.

Tags: reinforcement_learning llm_agents tool_use capability_evaluation pass@k compositional_reasoning retrieval_augmented_generation grpo

Task & Setting

Real-world LLM agent applications require multi-step tool use, where agents must interact with environments through retrieved information to solve complex problems requiring sequential reasoning. However, it remains unclear whether reinforcement learning genuinely expands agent capabilities beyond the base model’s latent abilities or simply improves reliability.

Task definition: The paper evaluates LLM agents on three categories of problems with different interaction complexities. Input is a question requiring factual retrieval, and output is an exact answer after up to T rounds of search(query) tool interactions returning BM25 paragraphs. The central metric is the probability that at least one of k sampled trajectories is correct under an interaction budget of T:

\[\text{PASS@}(k,T) = \Pr_{\tau_1,\ldots,\tau_k \sim \pi(\cdot|q;T)}[\exists i \in [k] : \tau_i \text{ is correct}]\]

Evaluation uses three metrics:

PASS@(k,T) across sampling budgets k ∈ {1,4,16,64} and interaction depths T ∈ {0,1,2,3,5}
capability boundary expansion B_T(π_RL) \ B_T(π_base), and
exact match accuracy.
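
One way to estimate PASS@(k,T) from a fixed pool of rollouts is the standard unbiased pass@k estimator, applied separately at each interaction depth T. The sketch below assumes n rollouts per (problem, T) pair with c of them correct; whether the paper uses this exact estimator is an assumption, and the function names are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k samples is correct), given c correct out of n."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect rollouts exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_k_T(correct_counts, n, k):
    """Average pass@k over problems at one fixed interaction depth T.

    correct_counts[i] : number of correct rollouts (out of n) for problem i at depth T
    """
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Toy example: 64 rollouts per problem; one problem never solved, one always solved.
print(pass_at_k_T([0, 64], n=64, k=4))
```

Running this for each T in the grid gives one pass curve per depth, which is what makes it possible to compare how the curves for different policies behave as k grows.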

The evaluation dataset contains 300 problems: 100 MATH-500 (no tools), 100 HotPotQA comparison questions (independent retrievals), and 100 HotPotQA bridge questions (sequential retrievals requiring compositional reasoning).

Architecture & Method

Base model: Qwen2.5-7B-Instruct with search tool instructions in system prompt
Three training variants with matched 200-problem training data:
- π_base: Unmodified base model
- π_SFT: LoRA fine-tuned on 200 expert trajectories with observation tokens masked
- π_RL: GRPO-trained with binary exact-match reward and per-token k3 KL estimator
Agent follows ReAct loop: Thought → Search(query) → Observation → Answer pattern with BM25 retrieval over 10-paragraph HotPotQA corpus
Key training loss for GRPO:
\[L = \mathbb{E}[\log \pi_\theta(a_t|s_t) \cdot (R - b) - \beta \cdot \text{KL}(\pi_\theta || \pi_\text{ref})]\]
Core technical contribution: PASS@(k,T) metric that separates capability expansion (new problems solved at any k) from efficiency improvement (higher reliability on solvable problems) by jointly varying sampling budget k and interaction depth T
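
The two ingredients of the GRPO loss above, the group baseline b and the per-token k3 KL estimator, can be sketched as follows. This is a simplified illustration, not the paper's training code: the function names are hypothetical, the baseline is taken to be the group mean reward, and the optional standard-deviation normalization is the common GRPO variant rather than something the digest confirms.

```python
import math

def group_advantages(rewards, normalize=False):
    """Advantage R - b for each rollout in a group, with b the group mean reward."""
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    if normalize:
        std = math.sqrt(sum(a * a for a in advantages) / len(advantages))
        if std > 0:
            advantages = [a / std for a in advantages]
    return advantages

def k3_kl(logp_theta, logp_ref):
    """k3 estimator of KL(pi_theta || pi_ref) for one token sampled from pi_theta.

    With r = pi_ref(a) / pi_theta(a), k3 = (r - 1) - log r, which is always >= 0.
    """
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - 1.0 - log_ratio

# Group of G=8 rollouts with binary exact-match rewards, one success.
rewards = [1, 0, 0, 0, 0, 0, 0, 0]
print(group_advantages(rewards))
print(k3_kl(-0.5, -2.0))
```

With a binary reward, a group in which every rollout fails (or every rollout succeeds) yields all-zero advantages and therefore no gradient signal, which is one reason sparse-but-nonzero base success rates matter for this kind of training.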

Training Recipe

SFT stage: LoRA fine-tuning on 200 expert trajectories from HotPotQA gold supporting facts, observation tokens masked from loss; optimizer details not reported
RL stage: GRPO training for 10 epochs (1000 steps) on same 200 problems
- Data: 200 HotPotQA problems (100 comparison + 100 bridge questions)
- Optimizer: 8-bit Adam with LoRA rank 16 (~40M trainable parameters)
- Reward: Binary exact-match (1 if correct, 0 otherwise)
- Hyperparams: Group size G=8, temperature 0.7, per-token k3 KL estimator
- Hardware: Single 48GB GPU, 55 hours wall-clock time
- Reference policy: LoRA-disabled base model
Exploration variant: π_RL+explore adds λ=0.1 novelty bonus for unseen paragraph title sets per batch (training incomplete due to GPU issues)

Novelty & Lineage

Prior work:

Yue et al. 2025 “Does reinforcement learning really incentivize reasoning capacity” found RL only redistributes probability mass within base model capabilities on static math reasoning, with pass@k curves converging at large k.
Agent-R1 (Cheng et al. 2025), ReTool (Feng et al. 2025) apply end-to-end RL to tool-using agents but report only aggregate accuracy without capability decomposition.
Static pass@k analysis (Chen et al. 2021, Wang et al. 2023) focuses on sampling efficiency but cannot address multi-step interaction.

Delta: This paper introduces PASS@(k,T), the first two-dimensional evaluation metric that varies both sampling budget k and interaction depth T, and applies it to a matched-data comparison of base, SFT, and RL agents.

Applied-specific assessment:
- The architectural idea (two-dimensional evaluation) is genuinely novel for agentic settings - static pass@k fundamentally cannot capture interaction-dependent capabilities
- Benchmark gains are meaningful: +4pp capability boundary expansion on sequential retrieval tasks, with clean 5:1 (RL) vs 3:7 (SFT) gained-to-lost problem ratios relative to the base model
- Comparisons are fair: identical training data across SFT/RL, same base model, same evaluation protocol
- The gains appear to require the compositional task structure - simple retrieval shows minimal RL advantage
Verdict: SIGNIFICANT — provides the first rigorous framework to separate capability expansion from efficiency in agentic RL, with clear empirical demonstration that contradicts pessimistic static-reasoning conclusions.

Benchmarks & Results

Category A (MATH-500, no tools): π_RL 84.0% vs π_base 84.0% at PASS@(64,0) - no improvement, confirms orthogonality to parametric reasoning
Category B (HotPotQA comparison): π_RL 86.0% vs π_SFT 85.0% vs π_base 82.0% at PASS@(64,5) - modest RL advantage (+4pp over base)
Category C (HotPotQA bridge): π_RL 81.0% vs π_base 77.0% vs π_SFT 73.0% at PASS@(64,5) - substantial RL expansion (+4pp over base, +8pp over SFT)

Capability boundary analysis:

|B_RL \ B_base| = 5 newly solvable bridge problems vs |B_base \ B_RL| = 1 regression (5:1 ratio), while SFT shows a 3:7 ratio (net regression)
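
The boundary comparison above is a pair of set differences. A minimal sketch, assuming a problem counts as inside a policy's boundary when at least one of its rollouts at the largest k is correct; the problem IDs and counts below are made up for illustration and are not the paper's data.

```python
def boundary(correct_counts):
    """Set of problem IDs solved by at least one rollout."""
    return {pid for pid, count in correct_counts.items() if count > 0}

def trade(b_new, b_old):
    """Return (gained, lost): problems newly solved and problems no longer solved."""
    return b_new - b_old, b_old - b_new

# Hypothetical per-problem correct-rollout counts (out of 64) for two policies.
base_counts = {"q1": 3, "q2": 0, "q3": 10, "q4": 0}
rl_counts = {"q1": 20, "q2": 2, "q3": 0, "q4": 5}

gained, lost = trade(boundary(rl_counts), boundary(base_counts))
print(sorted(gained), sorted(lost))
```

The net change in PASS@(k,T) at the largest k equals the number gained minus the number lost, which is why a 5:1 trade corresponds to +4pp and a 3:7 trade to -4pp on a 100-problem category.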

Pass-curve divergence: On Category C, π_RL and π_base cross at k≈4, with widening gap reaching maximum at k=64 (opposite of static reasoning convergence)

Compute & Efficiency

Model size: Qwen2.5-7B base + LoRA rank 16 (~40M trainable parameters)
Training compute: Single 48GB GPU, 55 hours wall-clock for 1000 GRPO steps
Inference speed: 64 rollouts per (problem, T) combination using vLLM, temperature 0.7 - specific latency not reported
Memory footprint: Fits single 48GB GPU during training with 8-bit Adam optimization
Deployment practicality: Moderate - requires maintained search corpus and multi-turn interaction capability, but uses standard 7B model size with efficient LoRA adaptation

Real-World Applicability

Uses realistic BM25 retrieval over HotPotQA corpus (10 paragraphs per query) rather than simulated environment
Evaluates on real multi-hop question answering requiring compositional reasoning chains
ReAct interaction protocol is widely used in production agent systems
No hardware deployment results reported - evaluation is simulation-based on existing benchmarks
Limited to single search tool and fixed 10-paragraph corpus, not web-scale retrieval systems used in practice

Limitations & Failure Modes

Scale limitations: Single 7B model, 200 training problems, BM25 over 10-paragraph corpus vs web-scale retrieval (ENGINEERING)
Single tool restriction: Only search(query) available, no diverse tool ecosystems (ENGINEERING)
Evaluation scope: HotPotQA-based tasks may not generalize to other compositional reasoning domains (EVALUATION)
Training compute: 1000 GRPO steps may be insufficient for full convergence (ENGINEERING)
Temperature/sampling: Fixed 0.7 temperature, no systematic decoding parameter sweep (EVALUATION)

Likely failure modes:
Performance degradation on longer reasoning chains requiring T > 5 interactions
Brittleness when search corpus quality drops or contains misleading information