cs.CV papers | Gist.Science

JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation

JanusVLN is a novel Vision-Language Navigation framework that addresses the limitations of explicit semantic memory by introducing a dual implicit neural memory to decouple spatial-geometric and visual-semantic representations, thereby achieving state-of-the-art performance through efficient, compact, and fixed-size neural modeling.

Shuang Zeng, Dekang Qi, Xinyuan Chang + 7 more | 2026-02-26 | 💻 cs

Uncovering Grounding IDs: How External Cues Shape Multimodal Binding

This paper introduces "Grounding IDs," a latent symbolic mechanism induced by external visual cues that aligns image-text representations, strengthens cross-modal attention, and reduces hallucinations, which explains how structured annotations enhance reasoning in large vision-language models.

Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari + 4 more | 2026-02-26 | 🤖 cs.AI

Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy

This study demonstrates that applying discrete semantic entropy to filter out high-uncertainty questions significantly improves the diagnostic accuracy of black-box vision-language models in radiology by effectively detecting and rejecting hallucinations.

Patrick Wienholt, Sophie Caselitz, Robert Siepmann + 6 more | 2026-02-26 | 💻 cs
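To make the filtering idea in the summary above concrete, here is a minimal sketch of entropy-based rejection. It assumes the sampled answers for a question have already been grouped into meaning clusters by some separate step; the cluster labels, the `keep_question` helper, and the threshold value are all illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def discrete_semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    cluster_ids holds one label per sampled answer; answers judged to mean
    the same thing share a label. How that judgment is made is assumed to
    happen elsewhere and is not part of this sketch.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def keep_question(cluster_ids, threshold):
    """Keep a question only if its answers are semantically consistent."""
    return discrete_semantic_entropy(cluster_ids) <= threshold


# Hypothetical cluster labels for two questions, four sampled answers each.
consistent = ["c0", "c0", "c0", "c0"]   # every sample agrees
scattered = ["c0", "c1", "c2", "c3"]    # every sample disagrees

for name, ids in [("consistent", consistent), ("scattered", scattered)]:
    print(name, keep_question(ids, threshold=0.5))
```

Under this sketch, a question whose samples all land in one cluster has zero entropy and is kept, while one whose samples scatter across clusters is rejected as likely to produce a hallucinated answer.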

ImpMIA: Leveraging Implicit Bias for Membership Inference Attack

ImpMIA is a novel white-box membership inference attack that leverages the implicit bias of neural networks and KKT optimality conditions to identify training samples without requiring auxiliary reference models or assumptions about the target model's training procedure, thereby achieving state-of-the-art performance in realistic settings where only model weights and a superset of data are available.

Yuval Golbari, Navve Wasserman, Gal Vardi + 1 more | 2026-02-26 | 🤖 cs.LG

Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark

This paper introduces Uni-MMMU, a comprehensive benchmark designed to evaluate the bidirectional synergy between visual understanding and generation across eight reasoning-centric domains by utilizing coupled tasks with verifiable intermediate steps to reveal performance disparities and guide the advancement of unified multimodal models.

Kai Zou, Ziqi Huang, Yuhao Dong + 7 more | 2026-02-26 | 💻 cs

Caption-Driven Explainability: Probing CNNs for Bias via CLIP

This paper proposes a caption-driven explainable AI method that integrates a target CNN with the CLIP model via network surgery to identify dominant predictive concepts, thereby mitigating the risk of spurious feature reliance and enhancing model robustness.

Patrick Koller, Amil V. Dravid, Guido M. Schuster + 1 more | 2026-02-26 | ⚡ eess

World Simulation with Video Foundation Models for Physical AI

The paper introduces Cosmos-Predict2.5 and Cosmos-Transfer2.5, a unified family of world foundation models that leverage flow-based architectures and reinforcement learning to deliver high-fidelity, instruction-aligned video generation and world translation for advancing physical AI, robotics, and embodied intelligence.

NVIDIA, Arslan Ali + 87 more | 2026-02-26 | 🤖 cs.AI

Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding

This paper introduces CoMa, a novel pre-training paradigm that decouples semantic compression from contrastive matching to efficiently transform multimodal large language models into state-of-the-art embedding models with minimal data.

Da Li, Yuxiao Luo, Keping Bi + 7 more | 2026-02-26 | 💻 cs

Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models

This paper introduces QTSplus, a lightweight query-aware token selection module that dynamically filters visual tokens based on text query complexity to significantly reduce computational costs and latency in long-video multimodal language models while preserving or enhancing performance on temporal understanding tasks.

Siyou Li, Huanan Wu, Juexi Shao + 10 more | 2026-02-26 | 💻 cs

RobustGait: Robustness Analysis for Appearance Based Gait Recognition

This paper introduces RobustGait, a comprehensive framework that systematically evaluates the robustness of appearance-based gait recognition systems against diverse real-world corruptions and silhouette extraction biases across multiple datasets, revealing critical vulnerabilities and proposing effective strategies to enhance deployment readiness.

Reeshoon Sayera, Akash Kumar, Sirshapan Mitra + 2 more | 2026-02-26 | 💻 cs

NTK-Guided Implicit Neural Teaching

This paper proposes NTK-Guided Implicit Neural Teaching (NINT), a method that accelerates Implicit Neural Representation training by dynamically selecting coordinates based on Neural Tangent Kernel scores to maximize global functional updates, thereby significantly reducing training time while maintaining or improving representation quality.

Chen Zhang, Wei Zuo, Bingyang Cheng + 4 more | 2026-02-26 | 🤖 cs.LG

MIRA: Multimodal Iterative Reasoning Agent for Image Editing

MIRA is a lightweight, plug-and-play multimodal agent that enhances instruction-guided image editing by employing an iterative perception-reasoning-action loop to decompose complex requests into atomic steps, thereby significantly improving semantic consistency and perceptual quality when paired with existing open-source editing models.

Ziyun Zeng, Hang Hua, Jiebo Luo | 2026-02-26 | 💻 cs

Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

This paper presents a unified framework for Aerial Vision-and-Language Navigation that enables lightweight UAVs to navigate complex urban environments using only monocular RGB observations and natural language instructions by formulating navigation as a next-token prediction problem with specialized strategies for keyframe selection and multi-task co-training.

Huilin Xu, Zhuoyang Liu, Yixiang Luomei + 1 more | 2026-02-26 | 🤖 cs.AI

KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification

This paper proposes KD-OCT, an efficient knowledge distillation framework that compresses a high-performance ConvNeXtV2-Large teacher model into a lightweight EfficientNet-B2 student to achieve clinical-grade retinal OCT classification with significantly reduced computational costs while maintaining near-teacher diagnostic accuracy.

Erfan Nourbakhsh, Nasrin Sanjari, Ali Nourbakhsh | 2026-02-26 | 🤖 cs.AI
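For readers unfamiliar with teacher-student compression, the sketch below shows the generic soft-target distillation loss (temperature-softened KL divergence blended with cross-entropy on the true label). It is a standard textbook formulation offered as an illustration only; the summary does not state which loss KD-OCT uses, and the function names, temperature, and `alpha` weighting here are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Blend of soft-target KL (teacher -> student) and hard-label cross-entropy.

    temperature and alpha are illustrative defaults, not values from the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Cross-entropy of the student's unsoftened prediction against the label.
    ce = -math.log(softmax(student_logits)[label])
    # The temperature**2 factor keeps the soft-target gradient scale comparable.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


teacher = [2.0, 0.0, 0.0]
print(distillation_loss([2.0, 0.0, 0.0], teacher, label=0))  # student matches teacher
print(distillation_loss([0.0, 2.0, 0.0], teacher, label=0))  # student disagrees
```

A student that reproduces the teacher's logits pays only the cross-entropy term, while a student that disagrees is penalized by both terms, which is the pressure that pulls a small model toward the large model's behavior.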

VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding

This paper introduces VULCA-Bench, a multicultural vision-language benchmark featuring 7,410 bilingual image-critique pairs across eight cultural traditions and a five-layer evaluation framework to assess and reveal the limitations of current models in higher-order cultural understanding beyond basic visual perception.

Haorui Yu, Diji Yang, Hang He + 2 more | 2026-02-26 | 💬 cs.CL

FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures

This paper introduces FigEx2, a visual-conditioned framework that localizes and captions individual panels in scientific compound figures using a noise-aware gated fusion module and a staged optimization strategy with reinforcement learning, achieving superior performance and zero-shot transferability across diverse scientific domains.

Jifeng Song, Arun Das, Pan Wang + 3 more | 2026-02-26 | 💬 cs.CL

Pay Attention to Where You Looked

This paper addresses the suboptimal performance of existing few-shot novel view synthesis methods by introducing a camera-weighting mechanism that dynamically adjusts the importance of source views based on their geometric or learned relevance to the target, thereby significantly enhancing synthesis accuracy and realism.

Alex Berian, JhihYang Wu, Daniel Brignac + 2 more | 2026-02-26 | 💻 cs

DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment

This paper introduces DenseGRPO, a novel framework that enhances flow matching model alignment by addressing the sparse reward problem through step-wise reward prediction and a reward-aware exploration scheme that adaptively adjusts stochasticity injection for effective fine-grained training.

Haoyou Deng, Keyu Yan, Chaojie Mao + 4 more | 2026-02-26 | 💻 cs

Rectifying Geometry-Induced Similarity Distortions for Real-World Aerial-Ground Person Re-Identification

This paper proposes a novel framework for aerial-ground person re-identification that addresses the failure of standard similarity metrics under extreme viewpoint variations by introducing a lightweight Geometry-Induced Query-Key Transformation (GIQT) module to explicitly rectify geometric distortions in the similarity space, complemented by geometry-conditioned prompt generation for robust cross-view matching.

Kailash A. Hambarde, Hugo Proença | 2026-02-26 | 💻 cs

TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs

The paper introduces TimeBlind, a diagnostic benchmark utilizing a minimal-pairs paradigm to reveal that even state-of-the-art Multimodal Large Language Models struggle with fine-grained compositional spatio-temporal reasoning, often relying on static visual shortcuts rather than genuine temporal logic.

Baiqi Li, Kangyi Zhao, Ce Zhang + 3 more | 2026-02-26 | 🤖 cs.AI
