Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos
This paper proposes STE-VLN, a novel approach that enhances Vision-Language Navigation (VLN) in unseen environments. STE-VLN constructs YE-KG, a large-scale multimodal spatiotemporal knowledge graph derived from real-world indoor tour videos, and integrates it through a Coarse-to-Fine Hierarchical Retrieval mechanism to improve long-horizon reasoning and to handle coarse-grained instructions.
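To make the retrieval idea concrete, the following is a minimal sketch of a generic coarse-to-fine hierarchical retrieval over a graph of video-derived subgraphs. It is illustrative only, not the paper's implementation: the YE-KG schema and scoring functions are not specified here, and names such as `scene_embedding`, `node_embeddings`, `top_k`, and `top_n` are hypothetical. It assumes each subgraph carries a coarse scene-level embedding plus fine-grained per-node embeddings, with cosine similarity as the matching score at both stages.

```python
# Hypothetical coarse-to-fine retrieval sketch; not the paper's actual method.
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class Subgraph:
    def __init__(self, scene_embedding, node_embeddings):
        self.scene_embedding = scene_embedding    # coarse, scene-level vector
        self.node_embeddings = node_embeddings    # fine, per-node vectors

def coarse_to_fine_retrieve(query_vec, subgraphs, top_k=5, top_n=10):
    # Coarse stage: rank whole subgraphs by scene-level similarity
    # and keep only the top_k candidates.
    coarse = sorted(subgraphs,
                    key=lambda g: cosine(query_vec, g.scene_embedding),
                    reverse=True)[:top_k]
    # Fine stage: within the surviving subgraphs, score individual
    # nodes against the query and return the top_n matches.
    scored = [(cosine(query_vec, v), g, i)
              for g in coarse
              for i, v in enumerate(g.node_embeddings)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_n]

# Usage example with random embeddings standing in for learned ones.
rng = np.random.default_rng(0)
graphs = [Subgraph(rng.normal(size=64), rng.normal(size=(8, 64)))
          for _ in range(100)]
matches = coarse_to_fine_retrieve(rng.normal(size=64), graphs)
```

The two-stage design is the point of the sketch: the cheap coarse pass prunes most of the knowledge graph before the more expensive fine-grained node matching runs, which is what makes hierarchical retrieval tractable at the scale of a large video-derived graph.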