Apr 15, 2026 Applied AI 5 papers

Applied AI Digest — Apr 15, 2026

Today’s Digest at a Glance

Today’s papers explore efficient architectures and training systems for multimodal reasoning, spanning document understanding, robotic manipulation, spatial reasoning analysis, and distributed reinforcement learning infrastructure.

OCR-Free Document Understanding

OCR-Free Visual Question Answering addresses the limitations of traditional OCR-based document processing pipelines, which suffer from error propagation when text extraction fails on complex layouts, handwritten content, or degraded images. The core insight is to treat documents as purely visual entities, allowing vision-language models to directly interpret text, tables, diagrams, and spatial relationships without intermediate text extraction steps.

The approach typically combines global document structure understanding with fine-grained content analysis. Models first generate thumbnail overviews to understand document organization, then perform targeted analysis of relevant regions. This eliminates OCR bottlenecks while preserving the ability to reason about textual content through visual understanding. The key advantage is robustness: when OCR fails on stylized fonts or complex layouts, visual understanding can still succeed by leveraging learned text recognition capabilities.

Hybrid Mamba-Transformer Architectures

Hybrid Mamba-Transformer models combine the linear scaling properties of state-space models with the representational power of attention mechanisms. Traditional transformers scale quadratically with sequence length due to attention computation, while pure Mamba models struggle with certain reasoning tasks that benefit from global context modeling.

The hybrid approach alternates between Mamba blocks (which handle sequential dependencies efficiently via selective state-space modeling) and transformer layers (which capture complex relational patterns through attention). Mathematically, Mamba layers compute hidden states via $h_t = \text{SSM}(x_t, h_{t-1})$ with selective mechanisms, while transformer layers apply $\text{Attention}(Q, K, V) = \text{softmax}(QK^T/\sqrt{d})V$ for global reasoning. This creates a complementary architecture where Mamba handles long-range sequential modeling efficiently, and transformers provide sophisticated reasoning capabilities where needed.
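
The alternation can be illustrated with a toy stack. This is a minimal sketch under loud assumptions, not any published model's layers: `ssm_scan` uses a fixed scalar decay and gain in place of Mamba's input-dependent (selective) parameters, `self_attention` sets Q = K = V = x with d = 1, and the layer pattern in `hybrid_forward` is hypothetical.

```python
import math


def ssm_scan(xs, decay=0.9, gain=0.1):
    """Toy recurrence h_t = decay * h_{t-1} + gain * x_t: one pass, O(T)."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + gain * x
        out.append(h)
    return out


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs):
    """Scalar self-attention with Q = K = V = x and d = 1: all pairs, O(T^2)."""
    out = []
    for q in xs:
        weights = softmax([q * k for k in xs])
        out.append(sum(w * v for w, v in zip(weights, xs)))
    return out


def hybrid_forward(xs, pattern=("mamba", "mamba", "attention")):
    """Apply the layers in order, each wrapped in a residual connection."""
    for kind in pattern:
        mixed = ssm_scan(xs) if kind == "mamba" else self_attention(xs)
        xs = [x + m for x, m in zip(xs, mixed)]
    return xs


if __name__ == "__main__":
    print(hybrid_forward([1.0, 2.0, 3.0]))
```

The recurrence touches each position once, while the attention loop visits every pair of positions, which is the linear-versus-quadratic contrast described above.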

LatentMoE Design

LatentMoE (Latent-Space Mixture-of-Experts) addresses the computational overhead of traditional MoE routing by performing expert selection in compressed latent spaces rather than full token dimensions. Standard MoE models route each token through gating networks operating on high-dimensional embeddings, creating significant computational costs even when most parameters remain inactive.

LatentMoE first projects input tokens to lower-dimensional latent representations via learned projections $z = W_\text{proj}x$, then performs expert routing in this compressed space: $\text{gate}(z) = \text{softmax}(W_g z)$. Selected experts operate in the latent space before projecting back to full dimensions. This reduces routing computation while maintaining expert specialization quality. The intuition is that expert selection decisions can be made effectively using compressed representations, avoiding the full computational cost of high-dimensional routing.
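
The routing path described above can be sketched in a few lines. This is an illustrative toy, not NVIDIA's implementation: the matrix names `W_proj`, `W_gate`, `W_out`, the top-k renormalization of gate weights, and the two hand-written experts are all assumptions made for the example.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def latent_moe(x, W_proj, W_gate, experts, W_out, top_k=1):
    z = matvec(W_proj, x)                # compress: z = W_proj x
    gate = softmax(matvec(W_gate, z))    # route in latent space: softmax(W_g z)
    chosen = sorted(range(len(gate)), key=lambda i: gate[i], reverse=True)[:top_k]
    norm = sum(gate[i] for i in chosen)  # assumption: renormalize over selected experts
    mixed = [0.0] * len(z)
    for i in chosen:
        y = experts[i](z)                # experts operate on the latent vector
        mixed = [m + gate[i] / norm * yi for m, yi in zip(mixed, y)]
    return matvec(W_out, mixed), chosen  # project back to the full dimension


# Hypothetical 4-dim tokens, 2-dim latent space, two toy experts.
W_proj = [[1, 0, 0, 0], [0, 1, 0, 0]]
W_gate = [[1, 0], [0, 1]]
W_out = [[1, 0], [0, 1], [0, 0], [0, 0]]
experts = [lambda z: [2 * v for v in z], lambda z: [-v for v in z]]

if __name__ == "__main__":
    print(latent_moe([3, 1, 5, 7], W_proj, W_gate, experts, W_out))
```

The gate here multiplies a 2-dimensional vector instead of a 4-dimensional one; at realistic widths that difference is where the routing savings would come from.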

Asynchronous Reinforcement Learning Systems

Service-Oriented RL Training decouples reinforcement learning components (actors, critics, rollout workers, reward models) into independent services that communicate asynchronously through message queues rather than synchronous parameter sharing. Traditional distributed RL suffers from synchronization bottlenecks where slow workers block entire training iterations, and component failures can crash entire training runs.

The service-oriented approach runs each RL component as an independent service with its own fault tolerance and scaling policies. Components communicate through persistent message queues that buffer data exchanges, allowing fast components to continue working even when others are temporarily unavailable. Mathematical updates like policy gradients $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a \mid s)\, A(s,a)]$ can proceed asynchronously as data becomes available, rather than waiting for synchronous batches. This enables elastic scaling where components can be added or removed without affecting others.
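
The decoupling can be illustrated with an in-process producer/consumer sketch. It is only a stand-in: Python's `queue.Queue` replaces the persistent message queue, threads replace independent services, and the `rollout_worker` and `learner` functions and the constant advantage value are hypothetical.

```python
import queue
import threading


def rollout_worker(worker_id, n_items, out_queue):
    """Produce transitions at the worker's own pace and push them to the queue."""
    for step in range(n_items):
        out_queue.put({"worker": worker_id, "step": step, "advantage": 1.0})


def learner(in_queue, n_expected):
    """Consume whatever arrives next; never waits for a specific worker."""
    consumed, advantage_sum = 0, 0.0
    while consumed < n_expected:
        item = in_queue.get()          # blocks only until some worker has data
        advantage_sum += item["advantage"]
        consumed += 1
    return consumed, advantage_sum


def run(n_workers=3, n_items=4):
    q = queue.Queue()
    threads = [
        threading.Thread(target=rollout_worker, args=(i, n_items, q))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    result = learner(q, n_workers * n_items)
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    print(run())
```

Because the learner pulls from a shared buffer, a slow worker delays only its own items rather than an entire synchronous batch.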

Reading Guide

Doc-V* demonstrates OCR-free document understanding with coarse-to-fine visual reasoning, while Nemotron 3 Super showcases hybrid Mamba-Transformer architectures with LatentMoE for efficient inference. The MLLM orientation study reveals that spatial reasoning failures occur in language processing rather than vision encoding, suggesting that hybrid architectures may need specialized components for spatial reasoning. Relax provides the distributed infrastructure needed to train such complex multimodal systems at scale, while VLAJS shows how to efficiently incorporate pretrained models into RL training pipelines.

Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

Authors: Yuanlei Zheng, Pei Fu, Hang Li, Ziyang Wang et al. (12 authors) · Institution: Huazhong University of Science and Technology, Xiaomi Inc., Fudan University · Category: cs.CL

Doc-V* introduces an OCR-free agentic framework that combines thumbnail overviews with sequential evidence aggregation for multi-page document VQA, achieving competitive results through GRPO-optimized active perception.

Practical Takeaway: If you’re working on long document understanding, this paper demonstrates that sequential evidence aggregation with working memory can outperform static retrieval approaches. The key insight is that thumbnail overviews provide valuable navigation cues at low computational cost. However, the method’s complexity (requiring external retrievers, GRPO training, and teacher model distillation) may not justify the incremental gains over simpler RAG baselines for most practical applications. Consider implementing the thumbnail overview approach as a lightweight enhancement to existing RAG systems before building the full agentic framework.

Tags: document-understanding visual-question-answering multimodal-agents retrieval-augmented-generation reinforcement-learning multi-page-documents OCR-free evidence-aggregation

Task & Setting

Multi-page Document Visual Question Answering (DocVQA) addresses the practical need to extract information from lengthy, visually rich documents such as academic papers, financial reports, and industrial manuals. This task is challenging because documents convey information through complex interactions of textual semantics, spatial layouts, and visual elements like tables and figures, requiring reasoning across multiple pages while handling quadratic attention costs and context length limits.

The task takes as input a document D = {p₁, …, pₙ} with N pages and a question Q. The output is a textual answer y. The method formulates this as a sequential decision process where an OCR-free MLLM-based agent πθ interacts with the document environment for up to T steps, receiving observations Oₜ, performing reasoning, and selecting actions aₜ ∈ A. The objective is to maximize answer accuracy while minimizing evidence-seeking cost.

Success is measured using benchmark-specific metrics: ANLS (Average Normalized Levenshtein Similarity) for DUDE and MP-DocVQA, F1 score for SlideVQA, and Accuracy for MMLongBench-Doc and LongDocURL. The method is evaluated on five benchmarks spanning diverse document types and reasoning challenges.
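
For readers unfamiliar with ANLS, a minimal sketch of the metric follows. The 0.5 cutoff and the lowercase/strip normalization are the commonly used defaults and are assumptions here; the benchmarks' official scripts may differ in detail.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average over questions of the best similarity against any reference answer."""
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            score = 1.0 - nl if nl < threshold else 0.0  # far-off answers score zero
            best = max(best, score)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    print(anls(["2024", "Paris"], [["2025"], ["paris", "Paris, France"]]))
```

Near-miss answers earn partial credit, while answers whose normalized edit distance reaches the cutoff earn none.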

Architecture & Method

Base architecture: Qwen-2.5-VL-7B-Instruct with visual encoder V (ViT architecture), MLP projection module M, and LLM backbone L
Global Thumbnail Overview: Document pages are resized to 256×256 thumbnails, arranged in grid images with page annotations, providing structural navigation cues at ~10-12× visual token compression
Action Space: Three atomic actions - retrieval_page for semantic search using external multimodal retriever (ColQwen), fetch_page for direct page access by indices, and answer for termination
Structured Visual Reasoning: ReAct-style protocol with a tag-delimited output format (including <think> and <summary> blocks) enforcing explicit decision processes
Working Memory: Concatenated per-turn summaries Wₜ = Concat(S₀, …, Sₜ₋₁) to prevent forgetting during multi-turn interaction
Environment Design: Pre-computed visual tokens vᵢ = M(V(pᵢ)) cached at high resolution (1024×768), dynamically requested rather than fed all at once

The core technical contribution is casting multi-page DocVQA as sequential evidence aggregation with active perception, combining coarse-grained thumbnail navigation with fine-grained targeted page retrieval.
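
The interaction loop implied by the action space and working memory above can be sketched as follows. Everything except the three action names is hypothetical: the scripted policy stands in for the MLLM agent, the keyword retriever stands in for ColQwen, and pages are plain strings rather than cached visual tokens.

```python
def keyword_retriever(query, pages, top_k=1):
    """Toy stand-in for a multimodal retriever: rank pages by keyword count."""
    ranked = sorted(range(len(pages)), key=lambda i: -pages[i].lower().count(query.lower()))
    return ranked[:top_k]


def scripted_policy(question, observation, memory):
    """Toy stand-in for the agent: returns (action, argument, per-turn summary)."""
    if not memory:
        return "retrieval_page", "revenue", "Saw the overview; searching for revenue."
    if isinstance(observation, list) and observation and "revenue" in observation[0].lower():
        return "answer", observation[0], "Found the revenue page."
    return "fetch_page", [0], "Evidence missing; checking the first page."


def run_agent(policy, pages, question, retriever, max_steps=8):
    memory = []  # working memory: summaries of all previous turns
    observation = "thumbnail overview of %d pages" % len(pages)
    for _ in range(max_steps):
        action, arg, summary = policy(question, observation, memory)
        memory.append(summary)
        if action == "answer":
            return arg, memory
        if action == "retrieval_page":
            indices = retriever(arg, pages)   # semantic search over pages
        elif action == "fetch_page":
            indices = arg                     # direct access by page index
        else:
            raise ValueError("unknown action: %s" % action)
        observation = [pages[i] for i in indices]
    return None, memory  # step budget exhausted without an answer


if __name__ == "__main__":
    pages = ["Cover page", "Revenue grew 12% in 2025", "Appendix"]
    print(run_agent(scripted_policy, pages, "How did revenue change?", keyword_retriever))
```

Only the requested pages ever enter the observation, which is the mechanism that keeps context cost below feeding all pages at once.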

Training Recipe

Supervised Fine-Tuning (SFT): Distilled 9,019 high-quality interaction trajectories from GPT-4o teacher in closed-loop environment. Data filtered by format validity, answer correctness (ANLS ≥ 0.7), and evidence page sanity. Sources: MP-DocVQA (5,969 samples) and DUDE (3,050 samples). Cross-entropy loss applied only to agent-generated tokens.
Group Relative Policy Optimization (GRPO): 2,048 non-overlapping training examples stratified into easy/medium/hard buckets. Composite reward function: answer correctness (ωₐₙₛ = 0.6), evidence recall (ωₑᵥᵢ = 0.3), structural validity (ωₛₜᵣᵤcₜ = 0.1). Maximum interaction horizon T = 8 steps.
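
The composite reward and the group-relative normalization used by GRPO can be sketched as below. The weights come from the summary above; how each component is scored (a 0 to 1 answer score, page-level recall, a boolean format check) and the mean/standard-deviation normalization are assumptions based on the commonly described GRPO formulation, not details confirmed for this paper.

```python
import statistics


def evidence_recall(fetched_pages, gold_pages):
    """Fraction of gold evidence pages that the agent actually visited."""
    gold = set(gold_pages)
    return len(gold & set(fetched_pages)) / len(gold) if gold else 1.0


def composite_reward(answer_score, recall, format_ok,
                     w_ans=0.6, w_evi=0.3, w_struct=0.1):
    """Weighted sum of answer correctness, evidence recall and structural validity."""
    return w_ans * answer_score + w_evi * recall + w_struct * (1.0 if format_ok else 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the other rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rollouts = [
        composite_reward(1.0, evidence_recall([2, 5], [2, 5]), True),
        composite_reward(0.0, evidence_recall([1], [2, 5]), True),
        composite_reward(0.8, evidence_recall([2], [2, 5]), False),
    ]
    print(rollouts, group_relative_advantages(rollouts))
```

Because advantages are computed relative to the group, no separate value network is needed to provide a baseline.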

Specific training hyperparameters, optimizer details, learning rates, and hardware specifications are referenced but not fully detailed in the main text.

Novelty & Lineage

Prior Work:

M3DocRAG (2024): Multimodal retrieval-based approach for multi-page document understanding, achieving 84.4 ANLS on MP-DocVQA
InternVL3 (2025): End-to-end model processing all pages jointly, reaching 80.8 ANLS on MP-DocVQA but scaling poorly with document length
URaG (2026): Unified retrieval-augmented generation achieving 88.2 ANLS on MP-DocVQA through static top-k page selection

Delta: This paper adds:
Active perception paradigm with coarse-to-fine navigation from thumbnail overview to targeted page fetching
Sequential evidence aggregation with working memory across multiple interaction steps
GRPO training optimizing composite reward balancing accuracy and efficiency

Assessment:
- Architectural idea: The combination of thumbnail overview + sequential evidence aggregation is a reasonable extension of existing retrieval-augmented approaches, not fundamentally novel
- Benchmark gains: Meaningful improvements on out-of-domain tasks (47.9% over RAG baseline on some benchmarks), but mixed results on in-domain tasks where it doesn’t always achieve SOTA
- Fair comparisons: Uses same backbone (Qwen-2.5-VL-7B) as direct baselines, though some compared methods use different architectures
- Scale dependence: Method relies on high-quality external retriever and teacher model distillation, limiting reproducibility
Verdict: INCREMENTAL — Solid engineering combining existing techniques (RAG, agent frameworks, GRPO) with reasonable performance gains, but lacks fundamental novelty in approach or breakthrough capabilities.

Benchmarks & Results

DUDE (ANLS): Doc-V* 64.5 vs previous best URaG 57.6 (+6.9 points)
MP-DocVQA (ANLS): Doc-V* 86.2 vs previous best URaG 88.2 (2.0 points behind)
SlideVQA (F1): Doc-V* 77.2 vs previous best Claude-3.7-Sonnet 76.3 (+0.9 points)
MMLongBench-Doc (Accuracy): Doc-V* 42.1 vs previous best GPT-4.1 45.6 (3.5 points behind)
LongDocURL (Accuracy): Doc-V* 56.3 vs previous best GPT-4o 64.5 (8.2 points behind)

Results are mixed: Doc-V* leads on DUDE and narrowly on SlideVQA but trails the best reported systems on the other three benchmarks. The most significant gains are over RAG baselines rather than end-to-end methods, and closed-source models (GPT-4.1, GPT-4o) still lead on MMLongBench-Doc and LongDocURL.

Compute & Efficiency

Model size: 7B parameters (Qwen-2.5-VL backbone)
Training compute: GPU hours not reported, uses GPT-4o for trajectory distillation which adds cost
Inference speed: 17.9s average latency per sample vs 5.8s for RAG and 19.0s for All Pages method
Memory footprint: 31.7GB peak GPU memory vs 20.2GB for RAG and 65.3GB for All Pages
Deployment practicality: Requires external multimodal retriever (ColQwen) and maintains working memory across T=8 interaction steps, making deployment more complex than static methods

Real-World Applicability

Benchmarks only: All evaluation conducted on curated academic benchmarks (DUDE, MP-DocVQA, SlideVQA, MMLongBench-Doc, LongDocURL) without real deployment validation
No production integration: No evidence of deployment in actual document processing systems or user studies
No hardware experiments: Method is purely computational without robotics or specialized hardware components
Limited real-world discussion: Paper focuses on benchmark performance without addressing practical deployment challenges like latency requirements or user interaction patterns

Limitations & Failure Modes

Single backbone dependency (FUNDAMENTAL) - Only evaluated on Qwen-2.5-VL, effectiveness across other vision-language backbones unknown
Single-document limitation (FUNDAMENTAL) - Framework only handles individual documents, not multi-document scenarios requiring evidence aggregation across heterogeneous sources
External retriever dependency (ENGINEERING) - Requires high-quality multimodal retriever like ColQwen, performance degrades with weaker retrievers
Teacher model distillation requirement (ENGINEERING) - Training depends on GPT-4o for trajectory generation, limiting reproducibility and cost-effectiveness
Limited interaction budget (EVALUATION) - Fixed T=8 step limit may be insufficient for complex reasoning chains

Failure modes: 1) May repeatedly fetch irrelevant pages when initial thumbnail navigation fails, 2) Working memory can become cluttered with irrelevant information degrading later decisions

Jump-Start Reinforcement Learning with Vision-Language-Action Regularization

Authors: Angelo Moroncelli, Roberto Zanetti, Marco Maccarini, Loris Roveda · Institution: University of Applied Science and Arts of Southern Switzerland, Politecnico di Milano · Category: cs.LG

VLAJS accelerates robotic manipulation RL by using pretrained VLA models as sparse, transient directional guidance through a cosine similarity loss that is adaptively annealed based on learning progress.

Practical Takeaway: If you’re working on robotic manipulation RL with sparse rewards or long horizons, VLAJS offers a practical way to leverage pretrained VLA models for faster learning without getting trapped by the teacher’s limitations. The key insight is using directional cosine loss instead of MSE and implementing reward-based guidance annealing. The method is particularly valuable if you already have access to VLA models and want to train deployable high-frequency controllers. However, be prepared for additional systems complexity from serving large VLA models during training, and consider whether the compute overhead justifies the sample efficiency gains for your specific application.

Tags: reinforcement_learning robotics vision_language_action manipulation sample_efficiency exploration distillation sim_to_real

Task & Setting

The paper addresses robotic manipulation tasks that require long-horizon reasoning or have sparse/imperfect reward signals, where standard reinforcement learning struggles with exploration and credit assignment. Vision-Language-Action (VLA) models provide task-level reasoning but are too slow for precise control.

Task definition: The input consists of RGB observations + language instructions for the VLA teacher, and proprioceptive + privileged simulator state for the RL policy. Actions are continuous delta end-effector controls (translation, rotation, gripper) executed at high frequency. The objective is to maximize task success rate while minimizing sample complexity:

\[\max_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\]

where $R(\tau)$ is the cumulative task reward.

Evaluation criteria: Success rate at budget $t^*$ (SR$_{t^*}$) measures the fraction of episodes that complete the task within the interaction budget. Area Under the Success Curve (AUC) integrates the success rate over the full training horizon, capturing both learning speed and final performance.
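
One plausible reading of these two metrics is sketched below. The curve format (a list of `(env_steps, success_rate)` checkpoints), the choice to read SR at the last checkpoint within the budget, and the normalization of AUC by the horizon length are assumptions made for illustration, not the paper's exact definitions.

```python
def success_rate_at_budget(curve, budget):
    """Success rate at the last checkpoint that fits within the step budget."""
    eligible = [sr for steps, sr in curve if steps <= budget]
    return eligible[-1] if eligible else 0.0


def normalized_auc(curve):
    """Trapezoidal area under the success curve, divided by the horizon length."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area / (curve[-1][0] - curve[0][0])


if __name__ == "__main__":
    # Hypothetical learning curve: (environment steps, success rate).
    curve = [(0, 0.0), (10, 0.5), (20, 1.0)]
    print(success_rate_at_budget(curve, 15), normalized_auc(curve))
```

Two runs that reach the same final success rate can still differ in AUC, since the faster learner accumulates more area early in training.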

The paper evaluates on six ManiSkill manipulation tasks: PickCube, PickPlaceCube, LiftPegUpright, PegInsertionSide, PokeCube, PushCube, with modified reward functions to induce sparse rewards or extended horizons (10x longer episodes).

Architecture & Method

Base architecture: Proximal Policy Optimization (PPO) with state-based policy $\pi_\theta(a \mid s)$ and value function $V_\phi(s)$ operating at high control frequency
VLA teacher queries: Sparse calls to pretrained OpenVLA or Octo models that map RGB + language to delta actions, queried at most 20% of episode steps
Temporal discretization: Each VLA delta action is linearly interpolated over $D$ control steps to provide guidance targets $\tilde{a}^{VLA}_t$
Directional action-consistency loss: Core technical contribution using cosine similarity instead of MSE:
\[L_{dir} = \mathbb{E}_t\left[\mathbf{1}[\text{valid}_t] \sum_{c \in \{\text{pos},\text{rot}\}} \ell_{dir}(\mu^c_\theta(s_t), \tilde{a}^c_t)\right]\]
\[\ell_{dir}(x,y) = 1 - \frac{\langle x,y \rangle}{\|x\|\|y\| + \epsilon}\]
Reward-based jump-starting: Adaptive query rate with exponential decay based on reward improvement, permanent deactivation when mean reward exceeds threshold
Final objective: $L(\theta) = L_{PPO}(\theta) + \lambda_t L_{dir}(\theta)$ where $\lambda_t$ is annealed over time
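
The directional loss and the interpolation of teacher actions can be sketched as follows. Splitting a delta action evenly across control steps is one simple reading of "linearly interpolated" and is an assumption, as are the function names and the example vectors.

```python
import math


def interpolate_delta(delta, num_steps):
    """Spread one teacher delta action evenly over num_steps control steps."""
    return [[d / num_steps for d in delta] for _ in range(num_steps)]


def directional_loss(pred, target, eps=1e-8):
    """1 - cosine similarity: penalizes direction mismatch, ignores magnitude."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm + eps)


def guidance_loss(mu_pos, mu_rot, tgt_pos, tgt_rot, valid):
    """Sum the directional loss over position and rotation when a target exists."""
    if not valid:
        return 0.0
    return directional_loss(mu_pos, tgt_pos) + directional_loss(mu_rot, tgt_rot)


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    teacher = interpolate_delta([0.10, 0.0, 0.0], num_steps=5)[0]
    policy_mean = [0.5, 0.0, 0.0]  # same direction, much larger step
    print(directional_loss(policy_mean, teacher), mse(policy_mean, teacher))
```

In the usage example the policy moves in the teacher's direction but with a larger step: the cosine loss is near zero while MSE is not, which is why a directional loss does not tie the policy to the teacher's action scale.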

Training Recipe

Pretraining stage: Uses pretrained VLA models (OpenVLA-best, Octo) - no additional pretraining reported
Main training: PPO with sparse VLA guidance
- Data: ManiSkill simulation environments; privileged state for the RL policy, RGB for the VLA teacher
- Optimizer: PPO with GAE; learning rate and batch size not reported
- VLA guidance: at most 20% of timesteps queried, exponentially decayed based on reward improvement
- Discretization length D: value not reported
- Hardware: GPU training; wall-clock time not reported
Guidance annealing: VLA queries reduced via exponential schedule $N_{calls} \leftarrow \max(N_{min}, \lfloor N_{max} \exp(-\kappa \cdot \Delta\bar{r}) \rfloor)$, permanently deactivated when reward improvement exceeds threshold of 3
Real-world deployment: Zero-shot transfer to Franka Panda robot using YOLO object detection for state estimation
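
The guidance-annealing rule in the recipe above can be written out directly. The formula and the deactivation threshold of 3 follow the summary; the concrete values of `n_max`, `n_min` and `kappa` are hypothetical, and the `GuidanceSchedule` class is only an illustrative wrapper.

```python
import math


def next_num_calls(n_max, n_min, kappa, reward_improvement):
    """N_calls = max(N_min, floor(N_max * exp(-kappa * reward_improvement)))."""
    return max(n_min, math.floor(n_max * math.exp(-kappa * reward_improvement)))


class GuidanceSchedule:
    """Tracks the teacher query budget and switches guidance off for good."""

    def __init__(self, n_max, n_min, kappa, off_threshold=3.0):
        self.n_max, self.n_min, self.kappa = n_max, n_min, kappa
        self.off_threshold = off_threshold
        self.active = True

    def update(self, reward_improvement):
        if not self.active:
            return 0  # once deactivated, guidance never resumes
        if reward_improvement > self.off_threshold:
            self.active = False
            return 0
        return next_num_calls(self.n_max, self.n_min, self.kappa, reward_improvement)


if __name__ == "__main__":
    schedule = GuidanceSchedule(n_max=20, n_min=2, kappa=0.5)  # hypothetical values
    for improvement in [0.0, 1.0, 2.0, 3.5, 0.0]:
        print(improvement, schedule.update(improvement))
```

The last call in the example returns 0 even though the improvement has dropped back to zero, illustrating the permanent deactivation.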

Most training details marked as “not reported” - paper focuses on method rather than detailed hyperparameter specification.

Novelty & Lineage

Prior work:

“Refined Policy Distillation” (Jülg et al. 2025): Persistent MSE action-matching loss between PPO and VLA teacher at every timestep
“Jump-start Reinforcement Learning” (Uchendu et al. 2023): Transient behavioral guidance where expert policy directly controls early episodes
Policy distillation approaches: Auxiliary losses like $L_{RL} + ||a_\pi - a_{teacher}||^2$ for continuous teacher supervision

Delta: This paper introduces transient auxiliary guidance - using VLA as sparse directional hints rather than persistent imitation targets. Key differences:
directional cosine loss vs MSE
reward-based adaptive deactivation
sparse temporal queries vs dense supervision.

Applied-specific assessment:
- Architectural novelty: The directional loss is a straightforward modification of standard distillation - using cosine similarity instead of L2 norm. The sparse querying and reward-based annealing are reasonable engineering choices but not fundamentally novel.
- Benchmark gains: Substantial improvements shown (50%+ sample efficiency gains), but experiments are primarily in simulation with limited real-world validation. Comparisons seem fair within the experimental setup.
- SOTA comparison: Limited - mainly compares against PPO and their own RPD variant. Missing comparisons to other state-of-the-art robotic RL methods or recent VLA fine-tuning approaches.
- Generalization concerns: Results may not hold without access to strong pretrained VLA models. Real-world experiments are limited in scope (simple pick-and-place tasks).
Verdict: INCREMENTAL — Solid engineering contribution combining existing techniques (sparse distillation, cosine similarity losses, reward-based scheduling) but the core ideas are straightforward applications of known principles rather than fundamental innovations.

Benchmarks & Results

PickCube-v2: VLAJS achieves 95.1% success rate vs 0.0% PPO, 1.1% VLAJS(RPD) at 9.9M steps
PickPlaceCube-v2: VLAJS achieves 65.9% success rate vs 0.0% PPO and VLAJS(RPD) at 37.7M steps
LiftPegUpright-v2: VLAJS achieves 91.5% success rate vs 16.9% PPO, 0.3% VLAJS(RPD) at 7.3M steps
LiftPegUpright-v3: VLAJS achieves 63.3% success rate vs 13.2% PPO, 19.2% VLAJS(RPD) at 17.1M steps
PokeCube-v2: VLAJS achieves 81.4% success rate vs 9.9% PPO, 75.1% VLAJS(RPD) at 8.8M steps
PushCube-v2 (OOD): VLAJS achieves 94.4% success rate vs 49.1% PPO, 75.6% VLAJS(RPD) at 1.9M steps
Real robot deployment: VLAJS achieves 70% lift success vs 47% OpenVLA-best, 80% pick-and-place vs 40% OpenVLA-best
Long-horizon macro average: Sparse RPD achieves 40.3% success vs 0.0% PPO, 37.6% AUC vs 7.2% PPO

Results show consistent large improvements over PPO baseline, with mixed results vs VLAJS(RPD). Notable absence of comparisons to other state-of-the-art robotic RL methods or recent VLA approaches beyond RPD.

Compute & Efficiency

Model size: PPO policy and value networks (size not reported), uses pretrained OpenVLA (~7B parameters) and Octo as teachers
Training compute: GPU resources utilized, specific hardware and training time not reported. VLA teacher queries limited to 20% of timesteps to reduce computational overhead
Inference speed: Policies operate at high control frequency for real-world deployment, VLA teacher operates at low frequency. Specific latency numbers not provided
Memory footprint: Must maintain both PPO networks and large VLA model in memory during training, specific memory requirements not quantified
Deployment practicality: Demonstrates zero-shot sim-to-real transfer on Franka Panda robot. State-based RL policy enables high-frequency control (advantage over VLA-only), but requires object detection for state estimation in real world

Real-World Applicability

Real robot experiments: Deployed on Franka Panda robot for pick-and-place, lifting, and peg reorientation tasks using YOLO-based object detection for state estimation
Zero-shot transfer: Policies trained in simulation transfer directly to real robot without additional training or domain adaptation
Robustness testing: Demonstrated stable performance under visual perturbations (human hand entering scene), clutter, and external disturbances where VLA-only policies fail
Environment diversity: Real-world testing includes randomized objects (grapes, peppers, tomatoes, pot lids) and background conditions
Limitations in scope: Real-world experiments limited to tabletop manipulation tasks with relatively simple geometries. No testing on more complex manipulation scenarios, mobile robots, or multi-robot settings

Limitations & Failure Modes

FUNDAMENTAL: Requires access to pretrained VLA model that provides minimally reliable directional cues - method fails if teacher is completely random or adversarial
ENGINEERING: VLA teacher inference adds computational overhead and systems complexity (GPU memory, serving infrastructure) that may outweigh sample efficiency gains
ENGINEERING: Reward-based deactivation heuristic may be brittle in highly stochastic environments where reward signals are noisy
EVALUATION: Real-world experiments limited to tabletop manipulation - unclear how method scales to more complex scenarios like mobile manipulation or force-sensitive tasks
EVALUATION: Missing comparisons to other state-of-the-art robotic RL methods and recent VLA fine-tuning approaches beyond basic baselines

Failure modes:
Method likely fails when VLA teacher provides consistently misleading directional guidance, as the cosine loss would bias exploration in wrong directions
In environments where rewards are extremely noisy or non-stationary, the reward-based jump-start schedule may deactivate guidance too early or too late

Why MLLMs Struggle to Determine Object Orientations

Authors: Anju Gopinath, Nikhil Krishnaswamy, Bruce Draper · Institution: Colorado State University · Category: cs.CV

Shows that CLIP-style vision encoders accurately preserve orientation information (recoverable within 3° by linear regressors), contradicting the prevailing hypothesis that encoder limitations cause MLLM spatial reasoning failures.

Practical Takeaway: If you’re working on MLLM spatial reasoning, don’t blame the vision encoder. This work definitively shows that CLIP, SigLIP, and ViT embeddings contain accurate orientation information recoverable with simple linear models. The bottleneck lies elsewhere in the pipeline - likely in how attention mechanisms process the highly distributed orientation signals across hundreds of thousands of features. Focus optimization efforts on better attention patterns or intermediate representations rather than replacing encoders. The diffuse nature of orientation information suggests that explicit geometric reasoning modules might be more effective than relying on emergent spatial understanding.

Tags: multimodal-llm vision-encoder spatial-reasoning orientation-estimation clip siglip vit llava

Task & Setting

Multi-modal Large Language Models (MLLMs) consistently fail at tasks requiring 2D object orientation reasoning, with state-of-the-art models achieving only 34-46% accuracy on orientation benchmarks. This represents a fundamental limitation in spatial reasoning capabilities that affects real-world deployment of vision-language systems.

The task is to determine whether rotation/orientation information can be recovered from visual encoder embeddings. Input consists of either pairs of images where one is rotated relative to the other, or single images with rotated foreground objects on static backgrounds. Output is the predicted rotation angle in degrees. The formal objective is to minimize:

\[L = \frac{1}{N} \sum_{i=1}^{N} |\theta_i - \hat{\theta}_i|\]

where $\theta_i$ is the true rotation angle and $\hat{\theta}_i$ is the predicted angle.

Success is measured by Mean Absolute Error (MAE) between predicted and ground truth rotation angles. The authors test whether linear regressors can predict orientations from CLIP, SigLIP, and ViT embeddings across rotations from 0-360 degrees in 1-degree increments.

The study uses controlled synthetic datasets: 6 whole image sets (180 samples each) and 3 object-rotation sets for multi-image models, plus 3 foreground/background pairs at 3 scales (360 orientations each) for single-image models.

Architecture & Method

Extract visual embeddings from pre-trained encoders without modification: SigLIP (LLaVA-OneVision), ViT (Qwen2.5-VL-7B-Instruct), and CLIP (LLaVA-1.5/1.6)
Train separate Ridge regressors to predict sine and cosine of rotation angles from flattened embedding vectors
Combine predictions using:
\[\hat{\theta} = \operatorname{atan2}\left(\widehat{\sin\theta}, \widehat{\cos\theta}\right)\]
Apply K-fold cross-validation to select optimal L2 regularization parameter $\alpha = 0.005$
Evaluate prediction accuracy using Mean Absolute Error (MAE) on held-out test sets
Perform feature substitution analysis by incrementally replacing embedding values between anchor (9° rotation) and target orientations to determine information distribution

The core technical contribution is demonstrating that orientation information is preserved in contrastive vision encoders despite their training objective being semantic alignment rather than geometric reasoning. This directly contradicts the prevailing hypothesis that MLLM orientation failures stem from encoder limitations.
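
The probing recipe (regress sine and cosine separately, then recombine with atan2) can be reproduced on a toy problem. This is a sketch under strong assumptions: `toy_embedding` is a made-up 2-dimensional linear code of the angle standing in for real encoder embeddings, the even/odd degree split replaces the 80:20 split, and the closed-form solver only handles two features.

```python
import math


def ridge_fit_2d(X, y, alpha):
    """Closed-form ridge for two features: w = (X^T X + alpha I)^-1 X^T y."""
    a = sum(r[0] * r[0] for r in X) + alpha
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + alpha
    g0 = sum(r[0] * t for r, t in zip(X, y))
    g1 = sum(r[1] * t for r, t in zip(X, y))
    det = a * d - b * b
    return [(d * g0 - b * g1) / det, (a * g1 - b * g0) / det]


def toy_embedding(deg):
    """Hypothetical embedding: an invertible linear mix of cos and sin."""
    t = math.radians(deg)
    return [2.0 * math.cos(t) + 0.5 * math.sin(t), -1.0 * math.cos(t) + 3.0 * math.sin(t)]


def angular_error(a, b):
    """Absolute angle difference in degrees, respecting the 0/360 wrap-around."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


train = list(range(0, 360, 2))
test = list(range(1, 360, 2))
X = [toy_embedding(d) for d in train]
w_sin = ridge_fit_2d(X, [math.sin(math.radians(d)) for d in train], alpha=0.005)
w_cos = ridge_fit_2d(X, [math.cos(math.radians(d)) for d in train], alpha=0.005)


def predict(deg):
    e = toy_embedding(deg)
    s = w_sin[0] * e[0] + w_sin[1] * e[1]
    c = w_cos[0] * e[0] + w_cos[1] * e[1]
    return math.degrees(math.atan2(s, c)) % 360.0


mae = sum(angular_error(predict(d), d) for d in test) / len(test)

if __name__ == "__main__":
    print("held-out MAE in degrees:", mae)
```

Regressing sine and cosine instead of the raw angle avoids the discontinuity at 0°/360°, where a direct regressor would treat 359° and 1° as far apart.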

Training Recipe

No model training or fine-tuning performed - uses pre-trained encoders as-is
Linear Ridge regressor training only:
- Data: Synthetic rotation datasets with 80:20 train/test split
- Optimizer: Standard Ridge regression with L2 regularization
- Learning rate: Not applicable (closed-form solution)
- Cross-validation: K-fold to select α = 0.005
- Hardware: Not reported
- Wall-clock time: Not reported
Base models used:
- LLaVA-OneVision (SigLIP encoder)
- Qwen2.5-VL-7B-Instruct (ViT encoder)
- LLaVA-1.5-13B and LLaVA-1.6-13B (CLIP encoders)

Training details for the base models are not reported because this work uses pre-trained checkpoints without modification.

Novelty & Lineage

Prior work:

Tong et al. (2024) and Nichols et al. (2025) hypothesized that MLLM orientation failures originate from CLIP-style encoders trained for semantic alignment rather than geometric reasoning
Yang et al. (2025) documented systematic MLLM failures on spatial reasoning benchmarks with near-chance performance (39-46% accuracy)
Multiple studies attributed poor spatial reasoning to insufficient geometric information in visual encoders

Delta: This work empirically tests and rejects the encoder hypothesis by showing that linear regressors can predict orientations from encoder embeddings with <3° MAE. It provides the first direct evidence that orientation information is preserved in CLIP/SigLIP/ViT representations, contradicting accepted explanations.

Applied-specific assessment:

The experimental design is straightforward but the finding genuinely challenges established assumptions in the field
Benchmark gains are not applicable - this is a diagnostic study revealing that the problem lies elsewhere in the MLLM pipeline
Comparisons are fair using identical encoders from actual deployed models
The orientation prediction accuracy is robust across multiple encoder architectures and would likely generalize

Verdict: SIGNIFICANT — This clearly refutes a widely-accepted hypothesis about MLLM limitations and redirects research focus from encoder deficiencies to other pipeline components.

Benchmarks & Results

LLaVA-OneVision orientation prediction: MAE 0.36-2.31°, previous assumption was complete failure, demonstrates orientation information is preserved
Qwen2.5-VL-7B-Instruct orientation prediction: MAE 0.90-2.85°, same baseline assumption, shows ViT also preserves orientation
LLaVA-1.5 orientation prediction: MAE 0.67-2.08°, contradicts CLIP encoder limitation hypothesis
LLaVA-1.6 orientation prediction: MAE 0.42-0.87°, best performance across all scales
Kolmogorov-Smirnov normality tests: p-values >0.05 across all conditions, so normality of the error distributions is not rejected (consistent with Gaussian noise rather than systematic bias)
Feature substitution analysis: Requires 128K-540K features to fool predictor, demonstrates information is highly distributed

Results are consistent across all tested encoders and image types. No traditional benchmarks used since this is a controlled diagnostic study. The key finding is that all encoders preserve orientation information contrary to prevailing assumptions.

Compute & Efficiency

Model sizes: LLaVA-OneVision (not specified), Qwen2.5-VL-7B-Instruct (7B parameters), LLaVA-1.5-13B (13B), LLaVA-1.6-13B (13B)
Training compute: Not applicable - no model training performed, only linear regression on extracted features
Inference speed/latency: Not reported for embedding extraction or regression prediction
Memory footprint: Embedding sizes range from 576×1024 to 4×729×1152 values depending on model and image size, i.e. roughly 590K to 3.4M features per input
Deployment practicality: The linear regression approach is highly practical for orientation prediction, but the finding that information is spread across hundreds of thousands of features suggests why MLLMs struggle to exploit this information during normal inference

Real-World Applicability

Uses controlled synthetic datasets with artificial image rotations - not tested on natural orientation variations in real-world scenarios
No deployment experiments or hardware validation reported
No production system integration demonstrated
Foreground/background experiments use unnatural circular patches superimposed on natural backgrounds to control artifacts
Limited to familiar object categories from ImageNet - unclear if findings generalize to novel objects or complex scenes
No sim-to-real analysis as this is purely a diagnostic study of existing model representations rather than a practical system

Limitations & Failure Modes

FUNDAMENTAL: Orientation information is spread across hundreds of thousands of features (128K-540K in the substitution analysis), making it potentially inaccessible to attention mechanisms in transformer architectures
EVALUATION: Only tested on controlled synthetic rotations, not natural orientation variations in real images
ENGINEERING: Feature substitution experiments require access to internal representations, limiting practical applicability
FUNDAMENTAL: Foreground orientation estimates become unreliable when background is significantly rotated from canonical position
EVALUATION: Limited to 3-4 encoder architectures and specific MLLM implementations

Failure modes:
Linear regression approach fails when background orientation deviates significantly from standard orientations
Orientation prediction degrades for very small foreground patches or unclear object boundaries

Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Authors: NVIDIA, Aakshita Chandiramani, Aaron Blakeman et al. (547 authors) · Institution: NVIDIA · Category: cs.LG

Nemotron 3 Super combines hybrid Mamba-Transformer architecture with LatentMoE design to achieve 2.2-7.5× inference throughput improvements over comparably sized open models (GPT-OSS-120B, Qwen3.5-122B) while maintaining comparable accuracy.

Practical Takeaway: If you’re building inference systems on NVIDIA hardware, the LatentMoE architecture and MTP approach offer clear throughput improvements worth implementing. The key insight about projecting to latent space for expert routing is broadly applicable to MoE designs. However, be cautious about the hardware dependencies and consider whether the engineering complexity is justified for your use case. The NVFP4 training results suggest low-precision training at scale is feasible but requires careful monitoring of gradient dynamics. For research engineers, this represents solid systems work rather than algorithmic breakthroughs.

Tags: mixture-of-experts hybrid-architectures mamba efficient-inference agentic-reasoning multi-token-prediction low-precision-training speculative-decoding

arXiv · PDF

Task & Setting

Large language models with mixture-of-experts (MoE) architectures have shown promise for scaling efficiently, but existing designs face memory bandwidth and communication bottlenecks in real-world deployment. Hybrid architectures combining Transformers with linear-complexity sequence models like Mamba can improve inference throughput by replacing most attention layers, which shrinks the KV cache (it grows linearly with sequence length) and avoids much of attention’s quadratic compute.

The task is training a large-scale hybrid Mamba-Transformer MoE model for general language understanding and agentic reasoning. The model takes text input sequences up to 1M tokens and produces text outputs. The objective combines standard next-token prediction with multi-token prediction (MTP):

\[L = L_{\text{next-token}} + 0.3 \cdot L_{\text{MTP}}\]
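As a small worked example of this objective, the sketch below combines two toy cross-entropy terms with the 0.3 weight. The vocabulary, probabilities, and function names are hypothetical; treating the MTP term as a single scalar (rather than, say, an average over several prediction offsets) is an assumption.

```python
import math

MTP_WEIGHT = 0.3  # coefficient from the objective above

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target token under a predicted distribution."""
    return -math.log(probs[target_index])

def combined_loss(next_token_loss, mtp_loss, mtp_weight=MTP_WEIGHT):
    """L = L_next-token + mtp_weight * L_MTP."""
    return next_token_loss + mtp_weight * mtp_loss

# Toy 4-token vocabulary: one head predicts token t+1, an MTP head predicts token t+2.
next_token_loss = cross_entropy([0.7, 0.1, 0.1, 0.1], target_index=0)
mtp_loss = cross_entropy([0.2, 0.5, 0.2, 0.1], target_index=1)
total = combined_loss(next_token_loss, mtp_loss)
print(f"next-token={next_token_loss:.3f} mtp={mtp_loss:.3f} total={total:.3f}")
```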

Evaluation uses standard language modeling benchmarks including MMLU, GSM8K, HumanEval, and agentic tasks like SWE-Bench and tool use benchmarks. Success is measured by accuracy on these benchmarks and inference throughput (tokens/second/GPU).

The paper introduces Nemotron 3 Super, a 120B total parameter model with 12B active parameters, trained on 25T tokens.

Architecture & Method

Hybrid Mamba-Transformer MoE Architecture: 88-layer model alternating between Mamba-2 blocks and attention layers, with LatentMoE layers for sparse scaling (120B total, 12B active parameters)
LatentMoE Design: Novel MoE architecture that projects tokens to lower-dimensional latent space (1024D) for routing and expert computation, then projects back to full dimension (4096D). Uses 512 total experts with top-22 routing
Multi-Token Prediction (MTP): Shared-weight prediction heads that predict multiple future tokens simultaneously, enabling native speculative decoding during inference
Attention Configuration: Grouped-query attention with 32 query heads, 2 KV heads, head dimension 128, strategically placed as “global anchors” between Mamba blocks
Expert Load Balancing: Auxiliary-loss-free load balancing strategy with update rate 10^-3, plus standard load balancing loss with coefficient 10^-4

The core contribution is the LatentMoE architecture that reduces memory bandwidth and communication costs by factor d/ℓ while maintaining model quality through increased expert count and top-K routing.
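The data flow described for LatentMoE (project down, route and run experts in the latent space, project back up) can be sketched with toy dimensions. This is not NVIDIA's implementation: the experts here are single linear maps instead of MLPs, the softmax over the selected experts is an assumed gating choice, and the sizes (16 and 4) are picked only to mirror the 4096/1024 ratio.

```python
import math
import random

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def latent_moe_forward(x, down, up, router, experts, top_k):
    """Toy LatentMoE-style layer for one token vector x."""
    z = matvec(down, x)                      # model dim -> latent dim
    scores = matvec(router, z)               # one routing score per expert
    chosen = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:top_k]
    peak = max(scores[e] for e in chosen)
    gates = [math.exp(scores[e] - peak) for e in chosen]
    norm = sum(gates)
    mixed = [0.0] * len(z)
    for gate, e in zip(gates, chosen):
        out = matvec(experts[e], z)          # expert computation stays in latent space
        mixed = [m + (gate / norm) * o for m, o in zip(mixed, out)]
    return matvec(up, mixed), chosen         # latent dim -> model dim

rng = random.Random(0)
d_model, latent_dim, num_experts, top_k = 16, 4, 8, 2
down = random_matrix(latent_dim, d_model, rng)
up = random_matrix(d_model, latent_dim, rng)
router = random_matrix(num_experts, latent_dim, rng)
experts = [random_matrix(latent_dim, latent_dim, rng) for _ in range(num_experts)]

token = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
output, chosen = latent_moe_forward(token, down, up, router, experts, top_k)
compression = d_model / latent_dim           # same ratio as 4096 / 1024
print(f"experts used: {chosen}, per-token expert traffic shrinks by {compression:.0f}x")
```

Because each expert only ever sees the latent vector, the activations sent to and from experts are d/ℓ times smaller than in a standard MoE layer, which is where the stated bandwidth and communication saving comes from.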

Training Recipe

Pretraining Phase 1 (20T of the 25T total pretraining tokens): WSD schedule, AdamW optimizer, peak LR 4.5×10^-4, batch size 3072×8192, sequence length 8192. Trained in NVFP4 precision with selective BF16 layers
Pretraining Phase 2 (5T tokens): Shifted data mixture toward high-quality sources, LR decay to 4.5×10^-6 using minus-sqrt schedule
Long-context Extension (51B tokens): Continuous pretraining on 1M context length using constant LR 4.5×10^-6, batch size 16, with 64-way context parallelism
Supervised Fine-tuning (80B tokens, 7M samples): Two-stage approach with token-level then sample-level loss normalization, LR 1×10^-5, includes agentic and coding datasets
Reinforcement Learning (43B tokens total): Three-stage RL including RLVR across 21+ environments, SWE-RL (20B tokens), and RLHF (18B tokens) using large-scale asynchronous training
MTP Healing: Final phase to restore multi-token prediction capabilities after RL training

Training used GB200 GPUs with various parallelism strategies. Total training time not reported.
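The learning-rate trajectory implied by the two pretraining phases can be sketched as a warmup-stable-decay curve. The peak and final values come from the recipe above; the linear warmup, the step counts, and the exact "1 minus square root of progress" form of the decay are assumptions, since the digest does not spell them out.

```python
import math

PEAK_LR = 4.5e-4    # Phase 1 peak learning rate
FINAL_LR = 4.5e-6   # Phase 2 decay target

def wsd_learning_rate(step, warmup_steps, stable_steps, decay_steps):
    """Warmup-stable-decay schedule with an assumed minus-sqrt decay shape."""
    if step < warmup_steps:
        # Assumed linear warmup from near zero up to the peak.
        return PEAK_LR * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        return PEAK_LR
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return FINAL_LR + (PEAK_LR - FINAL_LR) * (1.0 - math.sqrt(progress))

# Hypothetical step counts chosen only to show the shape of the curve.
for step in (0, 50, 500, 900, 925, 950, 1000):
    print(step, f"{wsd_learning_rate(step, 100, 800, 100):.2e}")
```

The square-root shape drops the rate quickly at the start of the decay phase and then flattens out as it approaches the final value.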

Novelty & Lineage

Prior Work:

Nemotron 3 Nano (2025): Previous 30B hybrid Mamba-Transformer model with standard MoE
DeepSeek-V3 (2025) and Qwen3.5 (2025): Large MoE models with traditional expert designs
Various MTP works like Gloeckle et al. (2024): Multi-token prediction for improved modeling

Delta: This paper adds:

LatentMoE architecture that moves routing and expert computation into a compressed latent space
Shared-weight MTP heads for more robust speculative decoding
NVFP4 training at 120B scale
Expanded agentic training with 21+ RL environments

Applied Assessment:
- Architecture: LatentMoE is a reasonable engineering optimization but follows obvious principles (reduce communication bottleneck). The insight about memory bandwidth vs accuracy per parameter is valuable but not groundbreaking
- Benchmark Gains: Results show comparable accuracy to existing 120B models but with 2.2-7.5× throughput improvements - meaningful practical gains
- Comparisons: Limited SOTA comparisons, mainly against GPT-OSS-120B and Qwen3.5-122B. Missing comparisons to other efficient architectures
- Scale Dependency: Gains likely depend heavily on the specific hardware (B200 GPUs) and optimized inference stack
Verdict: INCREMENTAL — Solid engineering work that combines known techniques (hybrid architectures, MoE, MTP) with reasonable optimizations (LatentMoE), but lacks fundamental novelty. The throughput improvements are valuable but expected from the architectural choices.

Benchmarks & Results

MMLU (5-shot): 86.01% vs 81.0% (Ling-flash), 81.0% (GLM-4.5) - moderate improvement
GSM8K (8-shot): 90.67% vs 90.75% (Ling-flash), 82.6% (GLM-4.5) - comparable to best
HumanEval (pass@1): 79.4% vs 70.1% (Ling-flash), 76.3% (GLM-4.5) - solid improvement
MATH Level 5 (4-shot): 70.0% vs 39.8% (Ling-flash), 26.3% (GLM-4.5) - large improvement
SWE-Bench: 60.5% (exact numbers for baselines not clearly reported)
RULER 1M (long context): 71.0% vs not reported for baselines
Inference Throughput: 2.2× vs GPT-OSS-120B, 7.5× vs Qwen3.5-122B on 8k input/64k output
MTP Acceptance Length: 3.45 tokens accepted on average vs 3.33 (Qwen3-Next), 2.70 (DeepSeek-R1)

Results are mixed - strong on math and coding, comparable on general knowledge. The throughput improvements are the standout result. Notably missing comparisons on many agentic benchmarks despite the focus on agentic capabilities.

Compute & Efficiency

Model Size: 120.6B total parameters, 12.7B active parameters (roughly 9.5× total-to-active ratio)
Training Compute: 25T tokens pretraining + 43B tokens post-training on GB200 GPUs, specific GPU hours not reported
Inference Speed: Up to 7.5× throughput improvement over the compared baselines (GPT-OSS-120B, Qwen3.5-122B), measured on B200 GPUs with vLLM/TRT-LLM
Memory Footprint: Reduced by factor d/ℓ (4×) for expert routing due to LatentMoE compression to 1024D latent space
Deployment: Models released in NVFP4, FP8, and BF16 formats. Native speculative decoding through MTP reduces latency. Optimized for NVIDIA hardware stack but deployment on other hardware unclear.
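The grouped-query attention configuration reported for this model (32 query heads, 2 KV heads, head dimension 128) also bears on memory, and its KV-cache cost is easy to work out. The sketch below is back-of-the-envelope arithmetic only: the number of attention layers is not stated in this digest, so it is left as a parameter, and 2 bytes per value (BF16) and a round 1,000,000-token context are assumptions.

```python
def kv_values_per_token(num_kv_heads, head_dim):
    """Keys plus values stored per token in one attention layer."""
    return 2 * num_kv_heads * head_dim

def kv_cache_bytes(tokens, attention_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Total KV-cache size; attention_layers is a free parameter here."""
    return tokens * attention_layers * kv_values_per_token(num_kv_heads, head_dim) * bytes_per_value

gqa_values = kv_values_per_token(num_kv_heads=2, head_dim=128)
mha_values = kv_values_per_token(num_kv_heads=32, head_dim=128)  # if every query head had its own KV head

print(f"GQA: {gqa_values} values/token/layer, full multi-head: {mha_values}")
print(f"reduction from sharing KV heads: {mha_values // gqa_values}x")
print(f"1M-token cache per attention layer: {kv_cache_bytes(1_000_000, 1, 2, 128) / 1e9:.2f} GB")
```

Only attention layers carry this cost, so a hybrid stack with few attention layers multiplies the per-layer figure by a small number.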

Real-World Applicability

Limited Real-World Testing: No deployment results or production integration details reported beyond synthetic benchmarks
Hardware Dependency: Optimizations appear highly specific to NVIDIA B200/GB200 hardware and TRT-LLM inference stack
Agentic Environments: Training used 21+ diverse RL environments including SWE-Gym, but evaluation mostly on curated benchmarks rather than live systems
Open Source Release: Models and datasets released on HuggingFace, but practical deployment guidance limited
Sim-to-Real Gap: No discussion of how benchmark performance translates to real agentic applications or user studies

Limitations & Failure Modes

Hardware Lock-in (FUNDAMENTAL): Performance gains appear tightly coupled to NVIDIA hardware and software stack, limiting broader applicability
NVFP4 Training Issues (ENGINEERING): Observed 7% zero-valued gradients and channel magnitude patterns, indicating potential training instability at scale
Evaluation Gaps (EVALUATION): Limited comparison to other efficient architectures, missing evaluation on many claimed agentic capabilities
Long Context Degradation (ENGINEERING): Required separate alternating training phase to maintain math performance after long-context extension
MTP Brittleness (ENGINEERING): Required healing phase after RL training to restore multi-token prediction capabilities

Failure Modes:
- Model likely degrades on hardware without optimized low-precision kernels
- Performance may not transfer to domains requiring different expert specialization patterns

Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

Authors: Liujie Zhang, Benzhe Ning, Rui Yang, Xiaoyan Yu et al. (11 authors) · Institution: Xiaohongshu Inc · Category: cs.CL

Relax is a service-oriented RL training engine that provides fault-tolerant, asynchronous execution with native omni-modal support, achieving 1.2×-2.0× speedups over existing systems.

Practical Takeaway: If you’re doing RL post-training at scale, Relax offers a mature production-ready alternative to veRL with better fault tolerance and modest throughput improvements. The unified staleness parameter is genuinely useful for navigating on-policy vs off-policy tradeoffs without code changes. The omni-modal support is the most comprehensive in open source, making it worth considering if you need image/audio/video RL training. However, the service architecture complexity may not be justified for smaller teams or short experimental runs - evaluate whether the operational benefits outweigh the setup overhead for your use case.

Tags: reinforcement-learning distributed-systems multimodal language-models training-infrastructure agentic-ai fault-tolerance

arXiv · PDF

Task & Setting

Real-world context: Modern large language models are increasingly deployed with reinforcement learning post-training to unlock reasoning and tool-use capabilities. These systems are evolving toward omni-modal inputs (text, images, audio, video) and agentic multi-turn workflows involving external environments. Traditional RL training systems face bottlenecks from heterogeneous data flows, operational failures at scale, and the staleness-throughput tradeoff.

Task definition: Design and implement a distributed RL training engine for post-training large language models across omni-modal inputs. The system must handle variable-length multi-turn trajectories with mixed media inputs (text tokens, image patches of varying resolution, video frame sequences, audio waveforms). The core training objective follows standard policy gradient methods like PPO or GRPO:

\[L(\theta) = \mathbb{E}[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)]\]

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ is the probability ratio and $\hat{A}_t$ is the advantage estimate.
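A minimal sketch of this clipped objective over a handful of samples follows. It works from log-probabilities, as is common in practice; the function name is hypothetical and ε = 0.2 is an assumed default, since the digest does not report the value used.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples."""
    terms = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)                           # r_t(theta)
        clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)

# Sample 1: ratio 1.5 with positive advantage, so the gain is capped at 1.2 * A.
# Sample 2: ratio 0.5 with negative advantage, so the pessimistic clipped term (0.8 * A) is kept.
value = clipped_surrogate(
    logp_new=[math.log(1.5), math.log(0.5)],
    logp_old=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
print(f"surrogate objective: {value:.3f}")
```

The objective is maximized, so a training loop would typically minimize its negative.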

Evaluation criteria: System throughput (steps/hour), end-to-end training speedup vs. baselines, reward convergence across modalities, fault tolerance and recovery time.

Dataset: Uses proprietary Echo Ink (image+text+audio), NextQA video subset, DAPO-MATH-17k text, and Deepeyes agentic tasks. Scale ranges from 4K to 17K samples.

Architecture & Method

Service-oriented architecture: Each RL role (Actor, Critic, Rollout, Reward) runs as independent Ray Serve deployment with fault isolation and elastic scaling
TransferQueue asynchronous data bus: Field-based storage system mediates all inter-role data exchange, supporting streaming micro-batch delivery and staleness control via single parameter max_staleness
Omni-native pipeline: Unified preprocessing for text, image, audio, video with modality-aware parallel strategies (ViT tensor parallelism, encoder-aware pipeline placement)
Distributed Checkpoint Service (DCS): Dedicated service for weight synchronization between training and inference engines with NCCL and TCP backends
Staleness-unified training modes: Single codebase supports on-policy (staleness=0), near-on-policy, and fully asynchronous (staleness>0) execution through staleness parameter s = v_t - v_r where v_t is training weight version and v_r is rollout weight version
Streaming data flow: Micro-batch level pipelining replaces global-batch synchronization, eliminating long-tail blocking from variable-length responses
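The staleness rule s = v_t - v_r can be illustrated with a small admission check. This is not Relax's API: the class and function names are hypothetical, and filtering already-generated samples is an assumption made for brevity (a real system might instead pause rollout workers until fresh weights arrive).

```python
from dataclasses import dataclass

@dataclass
class Sample:
    rollout_version: int   # v_r: weight version that generated this trajectory
    payload: str

def staleness(train_version, rollout_version):
    """s = v_t - v_r, in units of weight updates."""
    return train_version - rollout_version

def admissible(samples, train_version, max_staleness):
    """Keep only samples whose staleness does not exceed the bound."""
    return [s for s in samples
            if staleness(train_version, s.rollout_version) <= max_staleness]

buffer = [Sample(10, "fresh"), Sample(9, "one step old"), Sample(7, "three steps old")]

on_policy = admissible(buffer, train_version=10, max_staleness=0)
relaxed = admissible(buffer, train_version=10, max_staleness=2)
print([s.payload for s in on_policy])
print([s.payload for s in relaxed])
```

With a bound of 0 only trajectories from the current weights are used, which corresponds to the on-policy mode; raising the bound admits older trajectories and moves toward asynchronous, off-policy training.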

Training Recipe

Pretraining: Uses existing pretrained models (Qwen3-4B, Qwen3-Omni-30B, Qwen3-30B-A3B MoE)
RL post-training: GRPO or DAPO algorithms depending on task
- Data: DAPO-MATH-17k (17K samples), Echo Ink (4.4K audio+image), NextQA video subset, Deepeyes agentic
- Optimizer: Not explicitly reported, standard policy gradient
- Hardware: 16×H800 or 16×H20 GPUs across experiments
- Wall-clock: Training runs up to 2,000+ steps, specific durations not reported
Checkpoint conversion: Bidirectional HuggingFace ↔ Megatron format via extended Megatron Bridge
Two deployment modes: Colocate (shared GPUs) vs. fully async (separate inference/training clusters)

Novelty & Lineage

Step 1 — Prior work: veRL/HybridFlow (2024) provides hybrid SPMD/RPC dataflow with 3D-HybridEngine resharding achieving 1.53×-20.57× speedups. OpenRLHF (2024) offers Ray-based RLHF with DeepSpeed integration achieving 1.22×-1.68× speedups. AReaL (2025) implements fully asynchronous RL with staleness-enhanced PPO achieving up to 2.77× speedups.

Step 2 — Delta: Relax adds (1) co-designed service architecture where each RL role runs as independent fault-isolated service, (2) unified staleness parameter controlling on-policy to off-policy spectrum in single codebase, (3) omni-native pipeline handling text/image/audio/video with modality-aware parallelism, (4) field-based data bus enabling streaming micro-batch delivery.

Step 3 — Applied-specific assessment:

Architectural idea: Service isolation + asynchronous data bus is engineering-heavy but not fundamentally novel
Benchmark gains: 1.20× over veRL on text, 1.76×-2.00× over colocate mode - meaningful but not breakthrough-level
Fair comparisons: Uses same tech stack (Megatron+SGLang) as veRL, fair evaluation
Scale dependence: Gains appear to increase with model size (1.76× at 4B → 2.00× at 30B), suggesting architecture scales well

Verdict: INCREMENTAL — solid engineering system that combines known techniques (service architecture, async training, staleness control) with comprehensive omni-modal support, but lacks fundamental algorithmic novelty.

Benchmarks & Results

End-to-end throughput (Qwen3-4B vs veRL): 28.7 vs 23.9 steps/hour, 1.20× speedup
Colocate vs async modes (Qwen3-4B): Colocate 15.9 steps/hr, async on-policy 17.9 steps/hr (1.12×), async off-policy 28.0 steps/hr (1.76×)
Omni-modal speedup (Qwen3-Omni-30B): Colocate 13.5 steps/hr vs async 26.9 steps/hr, 2.00× speedup
R3 routing overhead: Relax +1.9% vs veRL +32% when enabling routing replay for MoE models
Convergence validation: Echo Ink reward 0.72→0.93, NextQA 0.75→0.93 over 2,000 steps, all training modes converge to same final reward levels
Notable absence: No comparison on standard RLHF benchmarks like Anthropic’s HH or larger-scale evaluations

Compute & Efficiency

Model size: 4B (Qwen3-4B), 30B (Qwen3-Omni-30B), 30B total/3B active (Qwen3-30B-A3B MoE)
Training compute: 16×H800 or 16×H20 GPU configurations, wall-clock times not fully reported
Inference speed: Rollout generation 119.8s overlapped with training in async mode, 0s effective cost vs 38.2s sequential in veRL
Memory footprint: Not explicitly reported, mentions GPU memory optimization through service separation
Deployment practicality: High. The service architecture enables independent scaling and fault recovery, but requires setting up and operating Ray and TransferQueue infrastructure

Real-World Applicability

Production integration: Service-oriented architecture designed for long-running production deployments with fault tolerance and elastic scaling
Multi-modal validation: Demonstrates training on real image+audio+video data (Echo Ink, NextQA), not just curated text benchmarks
Agentic workflows: Shows multi-turn reasoning and tool-calling support (Deepeyes dataset), relevant for real agent applications
Hardware requirements: Requires substantial GPU clusters (16+ GPUs) and distributed infrastructure (Ray, TransferQueue), limiting accessibility
Open source availability: Full system released at github.com/redai-infra/Relax for community adoption

Limitations & Failure Modes

FUNDAMENTAL: Limited to omni-modal understanding tasks, does not support generative modalities (text-to-image/video RL training)
ENGINEERING: Service architecture adds deployment complexity vs monolithic scripts, requiring Ray Serve + TransferQueue + DCS setup
ENGINEERING: TransferQueue introduces serialization overhead for colocate mode where shared memory would suffice
EVALUATION: Convergence experiments use proprietary datasets (Echo Ink, Deepeyes), limiting external reproducibility
ENGINEERING: Ultra-large scale validation (397B+ parameters) still ongoing, not mature

Failure modes:
TransferQueue disk/network bottlenecks under high throughput could degrade async benefits
Service orchestration failures could cascade despite fault isolation if Ray cluster itself becomes unstable
