CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird's-Eye-View Semantic Segmentation

CycleBEV is a training-only regularization framework that enhances Bird's-Eye-View semantic segmentation by introducing an inverse view transformation network to enforce cycle consistency between perspective and BEV spaces, thereby improving geometric and semantic feature learning without increasing inference complexity.
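To illustrate the general idea (not CycleBEV's actual networks, which are learned and nonlinear), a cycle-consistency regularizer penalizes the round-trip error of mapping features to BEV space and back. This toy sketch uses invertible linear maps as hypothetical stand-ins for the forward and inverse view transformations:

```python
import numpy as np

# Toy stand-ins for the two view-transformation networks. Linear maps are a
# hypothetical simplification; the paper's networks are learned modules.
rng = np.random.default_rng(0)
W_fwd = rng.normal(size=(8, 8))   # perspective features -> BEV features
W_inv = np.linalg.inv(W_fwd)      # BEV -> perspective (idealized inverse)

def cycle_consistency_loss(x):
    """Mean squared error between a feature and its round-trip reconstruction."""
    bev = W_fwd @ x               # forward view transformation
    recon = W_inv @ bev           # inverse view transformation
    return float(np.mean((x - recon) ** 2))

x = rng.normal(size=8)
loss = cycle_consistency_loss(x)  # near zero when the inverse is exact
```

During training, minimizing this term alongside the segmentation loss pushes the forward transform to preserve information; at inference the inverse branch is simply dropped, which is why the regularization is training-only.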

Jeongbin Hong, Dooseop Choi, Taeg-Hyun An + 2 more · 2026-03-02 · cs.AI

Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning

This paper introduces HDFLIM, a framework that achieves efficient image captioning by aligning frozen vision and language models through hyperdimensional computing operations like binding and bundling, thereby eliminating the need for computationally intensive multimodal fine-tuning while maintaining performance comparable to end-to-end training methods.
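Binding and bundling are the two standard hyperdimensional-computing operations the summary refers to. As a generic illustration (not HDFLIM's alignment procedure), for bipolar hypervectors binding is an elementwise product, which is its own inverse, and bundling is an elementwise majority vote that stays similar to each input:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 10_000  # hypervectors are typically very high-dimensional

def random_hv():
    """Random bipolar {-1, +1} hypervector."""
    return rng.choice([-1, 1], size=D)

def bind(a, b):
    """Binding: elementwise product; result is dissimilar to both inputs."""
    return a * b

def bundle(*vs):
    """Bundling: sign of the elementwise sum; similar to each input."""
    return np.sign(np.sum(vs, axis=0))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

img, txt = random_hv(), random_hv()
pair = bind(img, txt)           # associate an image code with a text code
recovered = bind(pair, img)     # unbinding with img recovers txt exactly
```

Because these operations are fixed algebra rather than learned layers, aligning frozen encoders through them avoids multimodal fine-tuning, which is the efficiency argument the paper makes.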

Abhishek Dalvi, Vasant Honavar · 2026-03-02 · cs.AI

Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering

This paper introduces Semantically Decoupled Latent Steering (SDLS), a training-free inference-time framework that utilizes LLM-driven semantic decomposition and QR-based orthogonalization to generate intervention vectors that specifically suppress prior-comparison hallucinations in radiology report generation while preserving clinical accuracy.
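QR-based orthogonalization, as named in the summary, can be sketched generically: factor the directions to be preserved into an orthonormal basis Q, then subtract the projection onto Q from the suppression direction so the resulting steering vector leaves the preserved subspace untouched. The direction names below are hypothetical placeholders, not SDLS's actual decomposition:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64  # hidden-state dimension (hypothetical)

# Hypothetical latent directions: one to suppress (prior-comparison phrasing)
# and one to preserve (clinical-finding semantics).
v_suppress = rng.normal(size=d)
v_preserve = rng.normal(size=d)

# QR factorization of the preserve direction(s) yields an orthonormal basis Q.
Q, _ = np.linalg.qr(v_preserve.reshape(d, 1))

# Remove the component of the suppression direction lying in span(Q).
steer = v_suppress - Q @ (Q.T @ v_suppress)
steer /= np.linalg.norm(steer)
```

Adding a scaled `steer` to the model's hidden states at inference then dampens the unwanted behavior without moving activations along the preserved clinical directions, matching the training-free framing of the paper.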

Ao Li, Rui Liu, Mingjie Li + 6 more · 2026-03-02 · cs

HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit

HiDrop is a novel framework that significantly accelerates Multimodal Large Language Models (MLLMs) by aligning token pruning with hierarchical layer functions through Late Injection, Concave Pyramid Pruning, and Early Exit mechanisms, achieving a 90% reduction in visual tokens with a 1.72x training speedup while maintaining original performance.
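The "concave pyramid" schedule can be read as keeping most visual tokens in early layers and pruning aggressively in late ones. The sketch below is a guess at that shape under a simple quadratic schedule, paired with generic top-k score-based pruning; neither function is HiDrop's actual mechanism:

```python
import numpy as np

def concave_keep_schedule(num_layers, final_keep=0.1):
    """Concave-decreasing per-layer keep ratios: gentle early, steep late.
    A hypothetical quadratic shape, not the paper's exact schedule."""
    t = np.arange(num_layers) / max(num_layers - 1, 1)
    return 1.0 - (1.0 - final_keep) * t**2

def prune_tokens(scores, keep_ratio):
    """Keep indices of the top-k visual tokens by importance score."""
    k = max(1, int(round(len(scores) * keep_ratio)))
    idx = np.argsort(scores)[::-1][:k]
    return np.sort(idx)

sched = concave_keep_schedule(32)          # keep ratio per transformer layer
kept = prune_tokens(np.arange(10.0), 0.5)  # highest-scoring half survives
```

Ending the schedule near a 10% keep ratio mirrors the reported 90% token reduction; Late Injection and Early Exit then bound which layers see visual tokens at all.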

Hao Wu, Yingqi Fan, Jinyang Dai + 3 more · 2026-03-02 · cs.CL

Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities?

This paper introduces VGUBench to demonstrate that while Unified Multimodal Large Language Models exhibit strong textual reasoning and visual rendering capabilities individually, they fail to maintain semantic equivalence when required to generate visual answers, revealing a critical breakdown in cross-modal semantic alignment rather than a lack of generation fidelity.

Hongbo Jiang, Jie Li, Yunhang Shen + 4 more · 2026-03-02 · cs

StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation

StemVLA is an open-source Vision-Language-Action model that enhances robot manipulation performance on long-horizon tasks by explicitly integrating predicted future 3D spatial geometry and aggregated 4D historical spatiotemporal representations to improve spatial reasoning and decision-making in dynamic environments.

Jiasong Xiao, Yutao She, Kai Li + 3 more · 2026-03-02 · cs