cs.CV papers | Gist.Science

Altitude-Aware Visual Place Recognition in Top-Down View

This paper proposes a hardware-free, vision-only approach for aerial visual place recognition that estimates relative altitude through ground feature density analysis to generate canonical images, significantly improving localization accuracy and robustness across diverse terrains and large altitude variations compared to traditional sensor-dependent or depth estimation methods.

Xingyu Shao, Mengfan He, Chunyu Li + 2 more | 2026-03-02 | 💻 cs

DACESR: Degradation-Aware Conditional Embedding for Real-World Image Super-Resolution

This paper proposes DACESR, a real-world image super-resolution framework that enhances degraded image recognition via a Real Embedding Extractor (REE) and integrates these high-level features into a Mamba-based network using a Conditional Feature Modulator (CFM) to achieve superior fidelity and perceptual quality.

Xiaoyan Lei, Wenlong Zhang, Biao Luo + 3 more | 2026-03-02 | 💻 cs

SelfOccFlow: Towards end-to-end self-supervised 3D Occupancy Flow prediction

The paper proposes SelfOccFlow, a self-supervised method for end-to-end 3D occupancy flow prediction that eliminates the need for human annotations or external flow supervision by disentangling static and dynamic scenes and leveraging temporal aggregation with a cosine similarity-based flow cue.

Xavier Timoneda, Markus Herb, Fabian Duerr + 1 more | 2026-03-02 | 💻 cs

Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

This paper introduces Ref-Adv, a challenging benchmark for Referring Expression Comprehension designed to eliminate shortcut solutions and expose significant gaps in visual reasoning and grounding capabilities of current multimodal LLMs.

Qihua Dong, Kuo Yang, Lin Ju + 6 more | 2026-03-02 | 💬 cs.CL

Experience-Guided Self-Adaptive Cascaded Agents for Breast Cancer Screening and Diagnosis with Reduced Biopsy Referrals

The paper proposes BUSD-Agent, an experience-guided self-adaptive cascaded multi-agent framework for breast ultrasound screening and diagnosis that leverages a memory bank of historical decision trajectories to dynamically adjust escalation thresholds, significantly reducing unnecessary biopsy referrals and improving specificity without requiring model parameter updates.

Pramit Saha, Mohammad Alsharid, Joshua Strong + 1 more | 2026-03-02 | 🤖 cs.AI

ABPolicy: Asynchronous B-Spline Flow Policy for Real-Time and Smooth Robotic Manipulation

ABPolicy is an asynchronous flow-matching framework that utilizes B-spline control points and bidirectional prediction with refitting to generate smooth, continuous, and real-time robotic manipulation trajectories, effectively eliminating the jitter and discontinuities common in synchronous action-space policies.

Fan Yang, Peiguang Jing, Kaihua Qu + 2 more | 2026-03-02 | 💻 cs
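
To illustrate why B-spline control points yield smooth trajectories, here is a minimal sketch of evaluating a uniform cubic B-spline. It is a generic textbook construction, not ABPolicy's implementation: the summary does not state the spline order, knot layout, or refitting procedure, so the cubic degree, the 2D points, and the function names `cubic_bspline_point` and `sample_trajectory` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def cubic_bspline_point(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate one uniform cubic B-spline segment at t in [0, 1].

    The four basis weights are non-negative and sum to 1, so the result
    is a convex combination of the four control points.
    """
    b0 = (1 - t) ** 3 / 6.0
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6.0
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6.0
    b3 = t**3 / 6.0
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d for a, b, c, d in zip(p0, p1, p2, p3))


def sample_trajectory(control_points: Sequence[Point], samples_per_segment: int = 10) -> List[Point]:
    """Sample a trajectory from a sliding window of four control points."""
    if len(control_points) < 4:
        raise ValueError("a cubic B-spline needs at least 4 control points")
    trajectory: List[Point] = []
    for i in range(len(control_points) - 3):
        segment = control_points[i:i + 4]
        for k in range(samples_per_segment):
            trajectory.append(cubic_bspline_point(*segment, k / samples_per_segment))
    # Close the curve with the end point of the last segment.
    trajectory.append(cubic_bspline_point(*control_points[-4:], 1.0))
    return trajectory


# Hypothetical control points, e.g. as a policy might predict them.
control_points = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0), (3.0, 2.0), (4.0, 0.0)]
trajectory = sample_trajectory(control_points, samples_per_segment=5)
```

Adjacent segments share three control points, so each segment ends exactly where the next one begins, and the curve is continuous up to its second derivative. That property is what makes a control-point representation attractive when the goal is to avoid jitter between successive predictions.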

SegMate: Asymmetric Attention-Based Lightweight Architecture for Efficient Multi-Organ Segmentation

SegMate is an efficient, open-source 2.5D framework that leverages asymmetric attention and multi-scale fusion to achieve state-of-the-art multi-organ segmentation accuracy while significantly reducing computational costs and memory usage across diverse medical datasets.

Andrei-Alexandru Bunea, Dan-Matei Popovici, Radu Tudor Ionescu | 2026-03-02 | 🤖 cs.LG

Half-Truths Break Similarity-Based Retrieval

This paper identifies and addresses the "half-truth" vulnerability in CLIP-style models, where adding plausible but incorrect details to a description erroneously increases similarity scores, by proposing CS-CLIP, a component-supervised training approach that decomposes captions into entities and relations to enforce finer-grained grounding and significantly improve compositional understanding.

Bora Kargi, Arnas Uselis, Seong Joon Oh | 2026-03-02 | 💻 cs

The Geometry of Transfer: Unlocking Medical Vision Manifolds for Training-Free Model Ranking

This paper proposes a novel training-free Topology-Driven Transferability Estimation framework that leverages global and local topological metrics to accurately rank medical foundation models for segmentation tasks, significantly outperforming existing classification-based methods on the OpenMind benchmark.

Jiaqi Tang, Shaoyang Zhang, Xiaoqi Wang + 3 more | 2026-03-02 | 🤖 cs.AI

Leveraging Geometric Prior Uncertainty and Complementary Constraints for High-Fidelity Neural Indoor Surface Reconstruction

The paper proposes GPU-SDF, a neural implicit framework for high-fidelity indoor surface reconstruction that explicitly estimates geometric prior uncertainty to modulate prior influence and incorporates complementary edge and multi-view constraints to recover fine details and complex geometries.

Qiyu Feng, Jiwei Shan, Shing Shin Cheng + 1 more | 2026-03-02 | 💻 cs

Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos

This paper proposes STE-VLN, a novel approach that enhances Vision-Language Navigation in unseen environments by constructing the YE-KG, a large-scale multimodal spatiotemporal knowledge graph derived from real-world indoor videos, and integrating it via a Coarse-to-Fine Hierarchical Retrieval mechanism to improve long-horizon reasoning and handle coarse-grained instructions.

Haoxuan Xu, Tianfu Li, Wenbo Chen + 4 more | 2026-03-02 | 💻 cs

PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning

This paper introduces PointCoT, a novel framework and large-scale benchmark (Point-Reason-Instruct) that enhances Multimodal Large Language Models' 3D point cloud understanding by enforcing an explicit "Look, Think, then Answer" Chain-of-Thought reasoning paradigm to mitigate geometric hallucinations and achieve state-of-the-art performance.

Dongxu Zhang, Yiding Sun, Pengcheng Li + 12 more | 2026-03-02 | 🤖 cs.AI

Micro-expression Recognition Based on Dual-branch Feature Extraction and Fusion

This paper proposes a dual-branch micro-expression recognition network integrating residual and Inception architectures with parallel attention and adaptive feature fusion, achieving 74.67% accuracy on the CASME II dataset and significantly outperforming existing methods.

Mingjie Zhang, Bo Li, Wanting Liu + 5 more | 2026-03-02 | 🤖 cs.AI

CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering

The paper proposes CC-VQA, a training-free method that mitigates knowledge conflicts in Knowledge-Based Visual Question Answering by integrating vision-centric conflict reasoning with correlation-guided encoding and decoding to achieve state-of-the-art performance on multiple benchmarks.

Yuyang Hong, Jiaqi Gu, Yujin Lou + 7 more | 2026-03-02 | 💻 cs

GDA-YOLO11: Amodal Instance Segmentation for Occlusion-Robust Robotic Fruit Harvesting

This paper introduces GDA-YOLO11, a novel amodal instance segmentation framework that significantly enhances occlusion-robust robotic fruit harvesting by inferring complete fruit shapes and accurately estimating picking points, achieving superior performance metrics and higher success rates under varying occlusion levels compared to existing models.

Caner Beldek, Emre Sariyildiz, Son Lam Phung + 1 more | 2026-03-02 | 💻 cs

SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls

SwitchCraft is a training-free framework that enhances multi-event video generation by introducing Event-Aligned Query Steering to align prompts with specific frames and an Auto-Balance Strength Solver to maintain temporal consistency, thereby preventing scene collapse in complex narratives.

Qianxun Xu, Chenxi Song, Yujun Cai + 1 more | 2026-03-02 | 💻 cs

Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought

This paper proposes Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables multimodal large language models to perform precise region-grounded reasoning by generating continuous numerical coordinates as actions, thereby overcoming the limitations of discrete text-based or fixed-patch approaches while improving localization accuracy and training efficiency.

Kesen Zhao, Beier Zhu, Junbao Zhou + 3 more | 2026-03-02 | 💻 cs

Clinically-aligned ischemic stroke segmentation and ASPECTS scoring on NCCT imaging using a slice-gated loss on foundation representations

This paper proposes a clinically aligned framework that integrates a frozen DINOv3 backbone with a novel Territory-Aware Gated Loss to enforce basal ganglia and supraganglionic consistency, achieving state-of-the-art performance in ischemic stroke segmentation and ASPECTS scoring on NCCT imaging.

Hiba Azeem, Behraj Khan, Tahir Qasim Syed | 2026-03-02 | ⚡ eess

Extending 2D foundational DINOv3 representations to 3D segmentation of neonatal brain MR images

This paper proposes a structured window-based strategy that extends frozen 2D DINOv3 foundation representations to 3D neonatal brain MRI segmentation by decomposing volumes into sub-cubes for parallel decoding and reassembling them, achieving a Dice score of 0.65 on the ALBERT dataset while maintaining a constant memory footprint.

Annayah Usman, Behraj Khan, Tahir Qasim Syed | 2026-03-02 | ⚡ eess
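
For readers unfamiliar with the Dice score quoted above, a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), follows. It represents each mask as a set of voxel coordinates, which is an assumption chosen for readability. The paper's actual evaluation protocol on ALBERT (for example, per-structure averaging) is not described in the summary, and the function name `dice_score` is hypothetical.

```python
from typing import Iterable, Tuple

Voxel = Tuple[int, int, int]


def dice_score(pred: Iterable[Voxel], target: Iterable[Voxel]) -> float:
    """Dice overlap between two binary masks given as sets of foreground voxels."""
    pred_set, target_set = set(pred), set(target)
    denominator = len(pred_set) + len(target_set)
    if denominator == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0
    return 2.0 * len(pred_set & target_set) / denominator


# Toy example: 3 predicted voxels, 3 reference voxels, 2 in common.
predicted = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
reference = {(0, 0, 1), (0, 1, 0), (1, 1, 1)}
score = dice_score(predicted, reference)  # 2 * 2 / (3 + 3)
```

A score of 1.0 means the two masks coincide exactly and 0.0 means they share no voxels, so a value of 0.65 indicates substantial but incomplete overlap with the reference segmentation.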

SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking

SpikeTrack is a novel, energy-efficient spike-driven framework for RGB visual tracking that employs an asymmetric design with unidirectional information flow and a memory-retrieval module to achieve state-of-the-art accuracy among SNN-based trackers while significantly outperforming advanced ANN counterparts in energy efficiency.

Qiuyang Zhang, Jiujun Cheng, Qichao Mao + 5 more | 2026-03-02 | 💻 cs
