Scaling Dense Event-Stream Pretraining from Visual Foundation Models
This paper proposes a self-supervised pretraining method that distills structure-aware representations from visual foundation models into an event-stream encoder. By sidestepping the annotation bottleneck and avoiding semantic collapse, the approach enables scalable learning of versatile, fine-grained representations from dense event streams.
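The core mechanism named above, aligning event-stream features to those of a frozen visual foundation model, can be sketched as a simple per-patch distillation objective. This is an illustrative sketch under stated assumptions, not the paper's implementation: the feature shapes, the cosine-distance loss form, and all names here are hypothetical.

```python
import numpy as np

def distillation_loss(student_feats, teacher_feats, eps=1e-8):
    """Mean cosine distance between per-patch student and teacher features.

    student_feats: (N, D) features from the trainable event-stream encoder.
    teacher_feats: (N, D) features from a frozen visual foundation model
                   applied to paired frames (treated as stop-gradient targets).
    The exact loss form is an assumption for illustration only.
    """
    s = student_feats / (np.linalg.norm(student_feats, axis=1, keepdims=True) + eps)
    t = teacher_feats / (np.linalg.norm(teacher_feats, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

# Toy example: 4 patch tokens with 8-dimensional features.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 8))
student = teacher + 0.1 * rng.standard_normal((4, 8))  # nearly aligned student
print(distillation_loss(student, teacher))  # small positive value
print(distillation_loss(teacher, teacher))  # zero when features coincide
```

Minimizing such a loss pulls the student's patch-level features toward the teacher's, transferring dense semantics without any manual labels.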