cs.CV papers | Gist.Science

Revisiting Integration of Image and Metadata for DICOM Series Classification: Cross-Attention and Dictionary Learning

This paper proposes a robust end-to-end multimodal framework for DICOM series classification that leverages bi-directional cross-attention and a sparse, missingness-aware dictionary learning encoder to effectively handle heterogeneous image content, variable series lengths, and incomplete metadata without requiring imputation, thereby outperforming existing baselines in both in-domain and out-of-domain settings.

Tuan Truong, Melanie Dohmen, Sara Lorio + 1 more2026-03-02⚡ eess

Polarization Uncertainty-Guided Diffusion Model for Color Polarization Image Demosaicking

This paper proposes a Polarization Uncertainty-Guided Diffusion Model that leverages image diffusion priors and explicitly models polarization uncertainty to accurately reconstruct high-fidelity color polarization images, effectively overcoming the limitations of existing network-based methods in recovering polarization characteristics due to data scarcity.

Chenggong Li, Yidong Luo, Junchao Zhang + 1 more2026-03-02⚡ eess

NAU-QMUL: Utilizing BERT and CLIP for Multi-modal AI-Generated Image Detection

The NAU-QMUL team proposed a multi-modal multi-task model leveraging pre-trained BERT and CLIP encoders with cross-modal fusion and pseudo-labeling data augmentation to achieve fifth place in both detection and source identification tasks of the CT2 AI-Generated Image Detection competition.

Xiaoyu Guo, Arkaitz Zubiaga2026-03-02💬 cs.CL

Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition

This paper introduces ReSeg-CLIP, a training-free open-vocabulary semantic segmentation method for remote sensing that achieves state-of-the-art performance by combining hierarchical attention masking with SAM-generated masks and a novel model composition strategy that averages multiple RS-specific CLIP variants.

Mohammadreza Heidarianbaei, Mareike Dorozynski, Hubert Kanyamahanga + 2 more2026-03-02💻 cs

Bandwidth-adaptive Cloud-Assisted 360-Degree 3D Perception for Autonomous Vehicles

This paper proposes a bandwidth-adaptive, cloud-assisted framework for autonomous vehicles that dynamically splits transformer-based 360-degree 3D perception tasks between the vehicle and the cloud using feature compression and quantization, achieving a 72% latency reduction and up to 20% accuracy improvement over static methods under fluctuating network conditions.

Faisal Hawladera, Rui Meireles, Gamal Elghazaly + 2 more2026-03-02🤖 cs.LG

Altitude-Aware Visual Place Recognition in Top-Down View

This paper proposes a hardware-free, vision-only approach for aerial visual place recognition that estimates relative altitude through ground feature density analysis to generate canonical images, significantly improving localization accuracy and robustness across diverse terrains and large altitude variations compared to traditional sensor-dependent or depth estimation methods.

Xingyu Shao, Mengfan He, Chunyu Li + 2 more2026-03-02💻 cs

DACESR: Degradation-Aware Conditional Embedding for Real-World Image Super-Resolution

This paper proposes DACESR, a real-world image super-resolution framework that enhances degraded image recognition via a Real Embedding Extractor (REE) and integrates these high-level features into a Mamba-based network using a Conditional Feature Modulator (CFM) to achieve superior fidelity and perceptual quality.

Xiaoyan Lei, Wenlong Zhang, Biao Luo + 3 more2026-03-02💻 cs

SelfOccFlow: Towards end-to-end self-supervised 3D Occupancy Flow prediction

The paper proposes SelfOccFlow, a self-supervised method for end-to-end 3D occupancy flow prediction that eliminates the need for human annotations or external flow supervision by disentangling static and dynamic scenes and leveraging temporal aggregation with a cosine similarity-based flow cue.

Xavier Timoneda, Markus Herb, Fabian Duerr + 1 more2026-03-02💻 cs

Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

This paper introduces Ref-Adv, a challenging benchmark for Referring Expression Comprehension designed to eliminate shortcut solutions and expose significant gaps in visual reasoning and grounding capabilities of current multimodal LLMs.

Qihua Dong, Kuo Yang, Lin Ju + 6 more2026-03-02💬 cs.CL

Experience-Guided Self-Adaptive Cascaded Agents for Breast Cancer Screening and Diagnosis with Reduced Biopsy Referrals

The paper proposes BUSD-Agent, an experience-guided self-adaptive cascaded multi-agent framework for breast ultrasound screening and diagnosis that leverages a memory bank of historical decision trajectories to dynamically adjust escalation thresholds, significantly reducing unnecessary biopsy referrals and improving specificity without requiring model parameter updates.

Pramit Saha, Mohammad Alsharid, Joshua Strong + 1 more2026-03-02🤖 cs.AI

ABPolicy: Asynchronous B-Spline Flow Policy for Real-Time and Smooth Robotic Manipulation

ABPolicy is an asynchronous flow-matching framework that utilizes B-spline control points and bidirectional prediction with refitting to generate smooth, continuous, and real-time robotic manipulation trajectories, effectively eliminating the jitter and discontinuities common in synchronous action-space policies.

Fan Yang, Peiguang Jing, Kaihua Qu + 2 more2026-03-02💻 cs

SegMate: Asymmetric Attention-Based Lightweight Architecture for Efficient Multi-Organ Segmentation

SegMate is an efficient, open-source 2.5D framework that leverages asymmetric attention and multi-scale fusion to achieve state-of-the-art multi-organ segmentation accuracy while significantly reducing computational costs and memory usage across diverse medical datasets.

Andrei-Alexandru Bunea, Dan-Matei Popovici, Radu Tudor Ionescu2026-03-02🤖 cs.LG

Half-Truths Break Similarity-Based Retrieval

This paper identifies and addresses the "half-truth" vulnerability in CLIP-style models, where adding plausible but incorrect details to a description erroneously increases similarity scores, by proposing CS-CLIP, a component-supervised training approach that decomposes captions into entities and relations to enforce finer-grained grounding and significantly improve compositional understanding.

Bora Kargi, Arnas Uselis, Seong Joon Oh2026-03-02💻 cs

The Geometry of Transfer: Unlocking Medical Vision Manifolds for Training-Free Model Ranking

This paper proposes a novel training-free Topology-Driven Transferability Estimation framework that leverages global and local topological metrics to accurately rank medical foundation models for segmentation tasks, significantly outperforming existing classification-based methods on the OpenMind benchmark.

Jiaqi Tang, Shaoyang Zhang, Xiaoqi Wang + 3 more2026-03-02🤖 cs.AI

Leveraging Geometric Prior Uncertainty and Complementary Constraints for High-Fidelity Neural Indoor Surface Reconstruction

The paper proposes GPU-SDF, a neural implicit framework for high-fidelity indoor surface reconstruction that explicitly estimates geometric prior uncertainty to modulate prior influence and incorporates complementary edge and multi-view constraints to recover fine details and complex geometries.

Qiyu Feng, Jiwei Shan, Shing Shin Cheng + 1 more2026-03-02💻 cs

Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos

This paper proposes STE-VLN, a novel approach that enhances Vision-Language Navigation in unseen environments by constructing the YE-KG, a large-scale multimodal spatiotemporal knowledge graph derived from real-world indoor videos, and integrating it via a Coarse-to-Fine Hierarchical Retrieval mechanism to improve long-horizon reasoning and handle coarse-grained instructions.

Haoxuan Xu, Tianfu Li, Wenbo Chen + 4 more2026-03-02💻 cs

PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning

This paper introduces PointCoT, a novel framework and large-scale benchmark (Point-Reason-Instruct) that enhances Multimodal Large Language Models' 3D point cloud understanding by enforcing an explicit "Look, Think, then Answer" Chain-of-Thought reasoning paradigm to mitigate geometric hallucinations and achieve state-of-the-art performance.

Dongxu Zhang, Yiding Sun, Pengcheng Li + 12 more2026-03-02🤖 cs.AI

Micro-expression Recognition Based on Dual-branch Feature Extraction and Fusion

This paper proposes a dual-branch micro-expression recognition network integrating residual and Inception architectures with parallel attention and adaptive feature fusion, achieving a 74.67% accuracy on the CASME II dataset that significantly outperforms existing methods.

Mingjie Zhang, Bo Li, Wanting Liu + 5 more2026-03-02🤖 cs.AI

CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering

The paper proposes CC-VQA, a training-free method that mitigates knowledge conflicts in Knowledge-Based Visual Question Answering by integrating vision-centric conflict reasoning with correlation-guided encoding and decoding to achieve state-of-the-art performance on multiple benchmarks.

Yuyang Hong, Jiaqi Gu, Yujin Lou + 7 more2026-03-02💻 cs

GDA-YOLO11: Amodal Instance Segmentation for Occlusion-Robust Robotic Fruit Harvesting

This paper introduces GDA-YOLO11, a novel amodal instance segmentation framework that significantly enhances occlusion-robust robotic fruit harvesting by inferring complete fruit shapes and accurately estimating picking points, achieving superior performance metrics and higher success rates under varying occlusion levels compared to existing models.

Caner Beldek, Emre Sariyildiz, Son Lam Phung + 1 more2026-03-02💻 cs

← Previous Next →