cs.CV papers | Gist.Science

Structural Action Transformer for 3D Dexterous Manipulation

This paper proposes the Structural Action Transformer (SAT), a novel 3D dexterous manipulation policy that reframes actions as variable-length, unordered joint trajectories and utilizes an Embodied Joint Codebook to achieve superior sample efficiency and cross-embodiment skill transfer from heterogeneous datasets.

Xiaohan Lei, Min Wang, Bohong Weng + 2 more | 2026-03-05 | 💻 cs

ProFound: A moderate-sized vision foundation model for multi-task prostate imaging

The paper introduces ProFound, a domain-specialized vision foundation model pre-trained on over 22,000 prostate MRI volumes via self-supervised learning, which demonstrates superior or competitive performance across 11 diverse clinical tasks compared to state-of-the-art specialized and foundation models.

Yipei Wang, Yinsong Xu, Weixi Yi + 11 more | 2026-03-05 | 💻 cs

BLOCK: An Open-Source Bi-Stage MLLM Character-to-Skin Pipeline for Minecraft

This paper introduces BLOCK, an open-source bi-stage pipeline that leverages a large multimodal model to generate consistent 3D character previews and a fine-tuned FLUX.2 model with a novel EvolveLoRA curriculum to decode these previews into pixel-perfect Minecraft skins.

Hengquan Guo | 2026-03-05 | 🤖 cs.AI

UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization

This paper proposes UniRain, a unified image deraining framework that combines a RAG-based dataset distillation pipeline for selecting high-quality training samples and a multi-objective reweighted optimization strategy within an asymmetric MoE architecture to effectively restore images degraded by diverse rain streaks and raindrops across both daytime and nighttime conditions.

Qianfeng Yang, Qiyuan Guan, Xiang Chen + 3 more | 2026-03-05 | 💻 cs

Scaling Dense Event-Stream Pretraining from Visual Foundation Models

This paper proposes a novel self-supervised pretraining method that leverages structure-aware distillation from visual foundation models to overcome annotation bottlenecks and semantic collapse, enabling scalable learning of versatile, fine-grained representations from dense event streams.

Zhiwen Chen, Junhui Hou, Zhiyu Zhu + 2 more | 2026-03-05 | 💻 cs

Dual-Solver: A Generalized ODE Solver for Diffusion Models with Dual Prediction

Dual-Solver is a generalized ODE solver for diffusion models that employs learnable parameters to dynamically interpolate prediction types, select integration domains, and adjust residuals, thereby significantly improving image quality and CLIP scores across various backbones when only a few function evaluations are available.

Soochul Park, Yeon Ju Lee | 2026-03-05 | 🤖 cs.LG

Phi-4-reasoning-vision-15B Technical Report

This technical report introduces Phi-4-reasoning-vision-15B, a compact open-weight multimodal model that achieves competitive performance in scientific, mathematical, and UI reasoning through strategic architecture choices, rigorous data curation, and a hybrid training approach, demonstrating that smaller models can excel with significantly less compute.

Jyoti Aneja, Michael Harrison, Neel Joshi + 3 more | 2026-03-05 | 🤖 cs.AI

GeoSeg: Training-Free Reasoning-Driven Segmentation in Remote Sensing Imagery

GeoSeg is a training-free, zero-shot framework that leverages MLLM reasoning combined with bias-aware coordinate refinement and dual-route prompting to achieve instruction-grounded segmentation in remote sensing imagery, addressing the lack of generalizable solutions and data scarcity in the domain.

Lifan Jiang, Yuhang Pei, oxi Wu + 5 more | 2026-03-05 | 🤖 cs.AI

RIVER: A Real-Time Interaction Benchmark for Video LLMs

This paper introduces RIVER, a novel benchmark and framework designed to evaluate and improve the real-time interactive capabilities of video large language models by addressing their current limitations in online processing, long-term memory, and proactive anticipation through a three-task system of Retrospective Memory, Live-Perception, and Proactive Anticipation.

Yansong Shi, Qingsong Zhao, Tianxiang Jiang + 3 more | 2026-03-05 | 💻 cs

When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models

This paper introduces a diagnostic framework using face pareidolia to reveal that vision models' behavior under visual ambiguity is primarily governed by their representational architecture, with vision-language models exhibiting semantic overactivation, pure vision models adopting uncertainty-based abstention, and detection models relying on conservative priors to suppress false positives.

Qianpu Chen, Derya Soydaner, Rob Saunders | 2026-03-05 | 🤖 cs.AI

Weakly Supervised Patch Annotation for Improved Screening of Diabetic Retinopathy

This paper introduces SAFE, a two-stage framework that leverages weak supervision, contrastive learning, and feature-space ensemble methods to systematically expand sparse expert annotations of diabetic retinopathy lesions, thereby significantly improving both patch-level detection accuracy and downstream disease classification performance.

Shramana Dey, Abhirup Banerjee, B. Uma Shankar + 2 more | 2026-03-05 | 💻 cs

Discriminative Perception via Anchored Description for Reasoning Segmentation

The paper proposes DPAD, a method that enhances reasoning segmentation by introducing a discriminative perception mechanism through anchored object descriptions, which effectively guides Multimodal Large Language Models to generate more focused and efficient reasoning chains while significantly improving localization accuracy and reducing verbosity.

Tao Yang, Qing Zhou, Yanliang Li + 1 more | 2026-03-05 | 🤖 cs.AI

Rethinking the Efficiency and Effectiveness of Reinforcement Learning for Radiology Report Generation

This paper proposes a novel framework for radiology report generation that enhances reinforcement learning efficiency through a diagnostic diversity-based data sampling strategy and a Diagnostic Token-weighted Policy Optimization (DiTPO) method, achieving state-of-the-art clinical accuracy with significantly fewer training samples by prioritizing diagnostically critical content.

Zilin Lu, Ruifeng Yuan, Weiwei Cao + 6 more | 2026-03-05 | 💻 cs

Volumetric Directional Diffusion: Anchoring Uncertainty Quantification in Anatomical Consensus for Ambiguous Medical Image Segmentation

The paper proposes Volumetric Directional Diffusion (VDD), a novel framework that anchors generative trajectories to a deterministic consensus prior and predicts 3D boundary residuals relative to it, thereby achieving state-of-the-art, anatomically coherent uncertainty quantification for ambiguous medical image segmentation while avoiding the topological fractures common in standard diffusion models.

Chao Wu, Kangxian Xie, Mingchen Gao | 2026-03-05 | 🤖 cs.AI

DQE-CIR: Distinctive Query Embeddings through Learnable Attribute Weights and Target Relative Negative Sampling in Composed Image Retrieval

The paper proposes DQE-CIR, a novel composed image retrieval method that enhances query discriminativeness and fine-grained retrieval accuracy by integrating learnable attribute weights for precise vision-language alignment and a target relative negative sampling strategy to mitigate relevance suppression and semantic confusion.

Geon Park, Ji-Hoon Park, Seong-Whan Lee | 2026-03-05 | 🤖 cs.AI

Long-Term Visual Localization in Dynamic Benthic Environments: A Dataset, Footprint-Based Ground Truth, and Visual Place Recognition Benchmark

This paper addresses the lack of benchmarks for long-term visual localization in dynamic benthic environments by introducing a curated multi-year underwater dataset, a novel footprint-based ground-truthing method that outperforms traditional distance-threshold approaches, and a benchmark evaluation demonstrating that state-of-the-art visual place recognition methods struggle significantly in these challenging underwater settings.

Martin Kvisvik Larsen, Oscar Pizarro | 2026-03-05 | 💻 cs

Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models

This paper introduces MELT, a lightweight backdoor attack framework for multi-encoder diffusion models like Stable Diffusion 3, demonstrating that tuning fewer than 0.2% of parameters via low-rank adapters is sufficient to achieve effective attacks while identifying the minimal encoder subsets required for different objectives.

Ziyuan Chen, Yujin Jeong, Tobias Braun + 1 more | 2026-03-05 | 🤖 cs.LG

Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints -- Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers

This study demonstrates that for cell-level histopathological image analysis under extreme spatial constraints, task-specific architectures trained on sufficient data outperform foundation models in both accuracy and efficiency, while offering comparable robustness to blur perturbations.

Hiroki Kagiyama, Toru Nagasaka, Yukari Adachi + 5 more | 2026-03-05 | 💻 cs

EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR

EgoPoseFormer v2 is a transformer-based framework that significantly advances egocentric human motion estimation for AR/VR by combining a novel architecture with an uncertainty-aware auto-labeling system to achieve state-of-the-art accuracy and temporal consistency on large-scale unlabeled datasets.

Zhenyu Li, Sai Kumar Dwivedi, Filip Maric + 11 more | 2026-03-05 | 💻 cs

CLIP-Guided Multi-Task Regression for Multi-View Plant Phenotyping

This paper proposes a CLIP-guided multi-task regression framework that leverages level-aware vision-language embeddings to robustly predict plant age and leaf count from multi-view imagery, achieving significant accuracy improvements on the GroMo25 benchmark while simplifying the pipeline and handling incomplete inputs.

Simon Warmers, Muhammad Zawish, Fayaz Ali Dharejo + 2 more | 2026-03-05 | 💻 cs
