cs.CV papers | Gist.Science

3DMedAgent: Unified Perception-to-Understanding for 3D Medical Analysis

The paper introduces 3DMedAgent, a unified agent that leverages a flexible MLLM and long-term structured memory to coordinate heterogeneous tools for decomposing complex 3D CT analysis into tractable 2D-based subtasks, thereby enabling general-purpose 3D medical understanding without 3D-specific fine-tuning.

Ziyue Wang, Linghan Cai, Chang Han Low, Haofeng Liu, Junde Wu, Jingyu Wang, Rui Wang, Lei Song, Jiang Bian, Jingjing Fu, Yueming Jin2026-03-10💻 cs

Latent Equivariant Operators for Robust Object Recognition: Promise and Challenges

This paper demonstrates that neural networks learning equivariant operators in a latent space can effectively generalize to out-of-distribution symmetric transformations on simple datasets like rotated MNIST, while also highlighting the significant challenges involved in scaling this approach to more complex data.

Minh Dinh, Stéphane Deny2026-03-10🤖 cs.LG

OVerSeeC: Open-Vocabulary Costmap Generation from Satellite Images and Natural Language

OVerSeeC is a zero-shot modular framework that leverages large language models and open-vocabulary segmentation to generate executable global costmaps from satellite imagery and natural language instructions, enabling autonomous navigation to adapt to novel entities and dynamic mission constraints without requiring fixed ontologies.

Rwik Rana, Jesse Quattrociocchi, Dongmyeong Lee, Christian Ellis, Amanda Adkins, Adam Uccello, Garrett Warnell, Joydeep Biswas2026-03-10💻 cs

Open-Vocabulary Domain Generalization in Urban-Scene Segmentation

This paper introduces Open-Vocabulary Domain Generalization in Semantic Segmentation (OVDG-SS), a new setting and benchmark for autonomous driving that addresses both unseen domains and categories, and proposes S2-Corr, a state-space-driven mechanism to refine text-image correlations in Vision-Language Models to achieve robust performance across diverse urban environments.

Dong Zhao, Qi Zang, Nan Pu, Wenjing Li, Nicu Sebe, Zhun Zhong2026-03-10💻 cs

Universal 3D Shape Matching via Coarse-to-Fine Language Guidance

UniMatch is a novel coarse-to-fine framework that establishes dense semantic correspondences between strongly non-isometric, cross-category 3D shapes by leveraging class-agnostic segmentation, multimodal language models for part identification, and a rank-based contrastive learning scheme to overcome the limitations of prior isometry-dependent methods.

Qinfeng Xiao, Guofeng Mei, Bo Yang, Liying Zhang, Jian Zhang, Kit-lun Yick2026-03-10💻 cs

InfScene-SR: Arbitrary-Size Image Super-Resolution via Iterative Joint-Denoising

InfScene-VF proposes a diffusion-based framework for arbitrary-size image super-resolution that eliminates boundary artifacts and enables memory-efficient, parallelized inference on gigapixel imagery by introducing Variance-Corrected Fusion and Spatially-Decoupled Variance Correction to achieve spatially continuous joint-denoising.

Shoukun Sun, Zhe Wang, Xiang Que, Jiyin Zhang, Xiaogang Ma2026-03-10💻 cs

Object-Scene-Camera Decomposition and Recomposition for Data-Efficient Monocular 3D Object Detection

This paper proposes an online data manipulation scheme that decomposes training images into independent object, scene, and camera components and recomposes them with perturbed poses to generate diverse training data, thereby improving the data efficiency and performance of monocular 3D object detection models across both fully and sparsely supervised settings.

Zhaonian Kuang, Rui Ding, Meng Yang + 2 more2026-03-10💻 cs

Cycle-Consistent Tuning for Layered Image Decomposition

This paper presents a cycle-consistent tuning framework that leverages lightweight LoRA adaptation of pretrained diffusion models to achieve robust, high-fidelity layered image decomposition, specifically for challenging logo-object separation, by enforcing bidirectional reconstruction consistency and iteratively refining performance through a progressive self-improving process.

Zheng Gu, Min Lu, Zhida Sun, Dani Lischinski, Daniel Cohen-Or, Hui Huang2026-03-10💻 cs

See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs

This paper proposes "See It, Say It, Sorted," a lightweight, training-free, and plug-and-play framework that mitigates visual hallucination in large vision-language models by iteratively supervising each reasoning step with dynamically extracted visual evidence, thereby significantly improving reasoning accuracy without requiring additional model training.

Yongchang Zhang, Oliver Ma, Tianyi Liu, Guangquan Zhou, Yang Chen2026-03-10💻 cs

Tokenizing Semantic Segmentation with RLE

This paper introduces a unified language modeling approach for semantic and panoptic segmentation in images and videos that discretizes masks into run-length encoded tokens, employing novel compression strategies to enable autoregressive generation despite computational constraints.

Abhineet Singh, Justin Rozeboom, Nilanjan Ray2026-03-10💻 cs

WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

WISER is a training-free framework for Zero-Shot Composed Image Retrieval that unifies Text-to-Image and Image-to-Image paradigms through a "retrieve-verify-refine" pipeline, leveraging wider search, adaptive fusion, and self-reflection to significantly outperform existing methods across diverse benchmarks.

Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu, Haiyun Guo, Jinqiao Wang2026-03-10💻 cs

PackUV: Packed Gaussian UV Maps for 4D Volumetric Video

The paper introduces PackUV, a novel 4D Gaussian representation and fitting method that maps volumetric video attributes into structured UV atlases for efficient, codec-compatible storage and streaming, while demonstrating superior temporal consistency and rendering fidelity on the newly proposed large-scale PackUV-2B dataset.

Aashish Rai, Angela Xing, Anushka Agarwal, Xiaoyan Cong, Zekun Li, Tao Lu, Aayush Prakash, Srinath Sridhar2026-03-10💻 cs

Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning

This paper proposes HART, an annotation-free framework that leverages a novel Advantage Preference Group Relative Policy Optimization (AP-GRPO) algorithm to enable Large Multimodal Models to autonomously identify and verify key high-resolution image regions, thereby improving reasoning performance without requiring costly human grounding labels.

Jiacheng Yang, Anqi Chen, Yunkai Dang, Qi Fan, Cong Wang, Wenbin Li, Feng Miao, Yang Gao2026-03-10💻 cs

Self-Attention And Beyond the Infinite: Towards Linear Transformers with Infinite Self-Attention

This paper introduces Infinite Self-Attention (InfSA) and its linear-time variant, Linear-InfSA, a spectral reformulation of self-attention as a diffusion process on token graphs that achieves state-of-the-art ImageNet accuracy and enables efficient, memory-free inference at ultra-high resolutions (up to 9216×9216) by replacing the quadratic softmax cost with a Neumann series approximation.

Giorgio Roffo, Luke Palmer2026-03-10💻 cs

WildActor: Unconstrained Identity-Preserving Video Generation

This paper introduces WildActor, a framework for unconstrained identity-preserving human video generation that leverages the large-scale Actor-18M dataset and novel attention mechanisms to overcome existing limitations in maintaining consistent full-body identities across dynamic shots, viewpoints, and motions.

Qin Guo, Tianyu Yang, Xuanhua He, Fei Shen, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Dan Xu2026-03-10💻 cs

Position: Evaluation of Visual Processing Should Be Human-Centered, Not Metric-Centered

This position paper argues that the evaluation of modern visual processing systems must shift from a reliance on single-metric benchmarks toward a human-centered, context-aware paradigm to better align with human perception and foster genuine innovation.

Jinfan Hu, Fanghua Yu, Zhiyuan You, Xiang Yin, Hongyu An, Xinqi Lin, Chao Dong, Jinjin Gu2026-03-10💻 cs

DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles

The paper proposes DeAR, a fine-grained adaptation framework for Vision-Language Models that decomposes attention heads into functional roles (Attribute, Generalization, and Mixed) using a Concept Entropy metric to selectively isolate task-specific learning from generalization capabilities, thereby achieving superior performance across diverse tasks while preserving zero-shot robustness.

Yiming Ma, Hongkun Yang, Lionel Z. Wang, Bin Chen, Weizhi Xian, Jianzhi Teng2026-03-10💻 cs

MSP-ReID: Hairstyle-Robust Cloth-Changing Person Re-Identification

The paper proposes the MSP framework, which mitigates hairstyle distraction and preserves structural information through Hairstyle-Oriented Augmentation, Cloth-Preserved Random Erasing, and Region-based Parsing Attention to achieve state-of-the-art performance in cloth-changing person re-identification.

Xiangyang He, Lin Wan2026-03-10💻 cs

A Detection-Gated Pipeline for Robust Glottal Area Waveform Extraction and Clinical Pathology Assessment

This paper presents a computationally efficient, detection-gated deep learning pipeline that achieves state-of-the-art robustness and cross-dataset generalization in glottal segmentation from high-speed videoendoscopy, enabling reliable extraction of clinical biomarkers for distinguishing healthy from pathological vocal function.

Harikrishnan Unnikrishnan2026-03-10🤖 cs.LG

Leveraging Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta

This paper proposes a robust framework combining the hybrid CoAtNet architecture with model soups ensembling to effectively classify Intangible Cultural Heritage images from the Mekong Delta, achieving state-of-the-art performance on the ICH-17 dataset by reducing variance and enhancing generalization in data-scarce, high-similarity settings.

Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham2026-03-10🤖 cs.LG

← Previous Next →