cs.CV papers | Gist.Science

Angular Gradient Sign Method: Uncovering Vulnerabilities in Hyperbolic Networks

This paper introduces the Angular Gradient Sign Method, a novel adversarial attack for hyperbolic networks that leverages the geometric decomposition of gradients to apply perturbations solely along angular (semantic) directions, thereby achieving higher fooling rates and revealing unique vulnerabilities in hierarchical embeddings compared to conventional Euclidean-based methods.

Minsoo Jo, Dongyoon Yang, Taesup Kim (2026-03-10, cs.LG)

Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

The paper proposes Video2Layout, a two-stage framework that reconstructs metric-grounded spatial layouts using continuous object boundary coordinates instead of discretized grids, thereby enhancing fine-grained spatial reasoning in Multimodal Large Language Models and achieving superior performance on spatial benchmarks.

Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, Conghui Zhu, Tiejun Zhao (2026-03-10, cs)

Multi-Order Matching Network for Alignment-Free Depth Super-Resolution

This paper proposes the Multi-Order Matching Network (MOMNet), an alignment-free framework that achieves state-of-the-art depth super-resolution by adaptively retrieving and integrating misaligned RGB information through a novel multi-order matching and aggregation mechanism.

Zhengxue Wang, Zhiqiang Yan, Yuan Wu, Guangwei Gao, Xiang Li, Jian Yang (2026-03-10, cs)

Learning to Think Fast and Slow for Visual Language Models

This paper introduces DualMindVLM, a visual language model that leverages a dual-mode thinking mechanism to dynamically select between fast, intuitive responses and slow, deliberate reasoning based on problem complexity, thereby achieving state-of-the-art performance with significantly improved token efficiency.

Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, Kaiyang Zhou (2026-03-10, cs)

Radiative-Structured Neural Operator for Continuous and Extrapolative Spectral Super-Resolution

The paper proposes the Radiative-Structured Neural Operator (RSNO), a novel deep learning framework that reconstructs hyperspectral images from multispectral observations by learning continuous spectral mappings under radiative priors and employing angular-consistent projection to ensure physical consistency and eliminate color distortion.

Ziye Zhang, Bin Pan, Zhenwei Shi (2026-03-10, cs)

UnfoldLDM: Deep Unfolding-based Blind Image Restoration with Latent Diffusion Priors

The paper proposes UnfoldLDM, a deep unfolding framework for blind image restoration that combines a multi-granularity degradation-aware module for robust degradation estimation with a degradation-resistant latent diffusion model and an over-smoothing correction transformer, which together overcome degradation-specific dependencies and suppress over-smoothing bias.

Chunming He, Rihan Zhang, Zheng Chen, Bowen Yang, Chengyu Fang, Yunlong Lin, Yulun Zhang, Fengyang Xiao, Sina Farsiu (2026-03-10, cs)

Stable Multi-Drone GNSS Tracking System for Marine Robots

This paper presents a stable, real-time multi-drone GNSS tracking system for marine robots that integrates visual detection, multi-object tracking, triangulation, and a confidence-weighted Extended Kalman Filter with cross-drone ID alignment, overcoming both the loss of GNSS signals underwater and the limitations of traditional alternatives.

Shuo Wen, Edwin Meriaux, Mariana Sosa Guzmán, Zhizun Wang, Junming Shi, Gregory Dudek (2026-03-10, cs)
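The confidence weighting can be illustrated with a one-axis measurement update. When the measurement is the target's position itself (for example a triangulated fix), the EKF measurement step reduces to the linear Kalman update shown here. Dividing the measurement noise by the detector's confidence is an assumed weighting scheme for illustration, and the paper's exact formulation may differ. `AxisState`, `predict` and `update` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class AxisState:
    mean: float  # estimated position along one axis (e.g. metres east)
    var: float   # variance of that estimate


def predict(s: AxisState, q: float) -> AxisState:
    """Constant-position motion model: uncertainty grows by process noise q."""
    return AxisState(s.mean, s.var + q)


def update(s: AxisState, z: float, r: float, confidence: float) -> AxisState:
    """Kalman update with measurement noise inflated for low-confidence detections."""
    conf = min(max(confidence, 1e-3), 1.0)  # clamp to avoid division by zero
    r_eff = r / conf
    k = s.var / (s.var + r_eff)  # Kalman gain
    return AxisState(s.mean + k * (z - s.mean), (1.0 - k) * s.var)


# Two drones report the same target; the less confident one moves the estimate less.
state = predict(AxisState(mean=0.0, var=4.0), q=1.0)
state = update(state, z=10.0, r=1.0, confidence=0.9)  # drone A
state = update(state, z=12.0, r=1.0, confidence=0.2)  # drone B
print(state)
```

Applying the updates from several drones one after another is one simple way to fuse them. The sketch assumes cross-drone ID alignment has already run, meaning both detections have been matched to the same robot.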

Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion

This paper introduces Yo'City, an agentic framework that leverages large models for hierarchical planning and a self-critic expansion loop to generate personalized, boundless, and spatially coherent 3D realistic city scenes, outperforming existing state-of-the-art methods across multiple evaluation metrics.

Keyang Lu, Sifan Zhou, Hongbin Xu, Gang Xu, Zhifei Yang, Yikai Wang, Zhen Xiao, Jieyi Long, Ming Li (2026-03-10, cs)

Shortcut Invariance: Targeted Jacobian Regularization in Disentangled Latent Space

This paper proposes "Shortcut Invariance," a targeted Jacobian regularization method that improves out-of-distribution generalization by injecting anisotropic noise into a disentangled latent space to flatten decision boundaries along shortcut-aligned axes, thereby eliminating the need for explicit shortcut labels or conflicting training samples.

Shivam Pal, Sakshi Varshney, Piyush Rai (2026-03-10, cs.LG)
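A rough sketch of the two ingredients named above follows: noise injected only along selected latent axes, and the targeted Jacobian penalty that this noise approximates. Consider a squared-error loss with small noise, where training uses z + e and e_i ~ N(0, σ_i²). This adds approximately Σ_i σ_i² (∂f/∂z_i)² to the loss (ignoring a curvature term). A large σ_i on an axis therefore flattens the model along that axis. The choice of which axes are shortcut-aligned, and the σ values, are assumptions here; the paper identifies them its own way. The function names are hypothetical.

```python
import random


def anisotropic_noise(z, sigmas, rng):
    """Add Gaussian noise with a per-axis std; axes with sigma 0 are untouched."""
    return [zi + rng.gauss(0.0, s) if s > 0 else zi for zi, s in zip(z, sigmas)]


def targeted_jacobian_penalty(f, z, sigmas, h=1e-4):
    """Finite-difference estimate of sum_i sigma_i^2 * (df/dz_i)^2."""
    total = 0.0
    for i, s in enumerate(sigmas):
        if s == 0:
            continue
        zp, zm = list(z), list(z)
        zp[i] += h
        zm[i] -= h
        d = (f(zp) - f(zm)) / (2 * h)  # central difference
        total += s * s * d * d
    return total


# Hypothetical 3-D latent where axis 1 is assumed to encode a shortcut.
rng = random.Random(0)
sigmas = [0.0, 0.5, 0.0]
z = [1.0, 2.0, 3.0]
print(anisotropic_noise(z, sigmas, rng))
print(targeted_jacobian_penalty(lambda v: 3 * v[0] + 2 * v[1], z, sigmas))
```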

ForamDeepSlice: A High-Accuracy Deep Learning Framework for Foraminifera Species Classification from 2D Micro-CT Slices

This study introduces ForamDeepSlice, a high-accuracy deep learning framework that combines an ensemble of ConvNeXt-Large and EfficientNetV2-Small models with a rigorous specimen-level split dataset to achieve 95.64% accuracy in classifying foraminifera species from 2D micro-CT slices, while also providing an interactive dashboard for real-time identification and 3D matching.

Abdelghafour Halimi, Ali Alibrahim, Didier Barradas-Bautista, Ronell Sicat, Abdulkader M. Afifi (2026-03-10, cs.LG)
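Two ideas in this summary are easy to sketch. The first is a specimen-level split, where every slice of one physical specimen lands on the same side of the train/test boundary, so near-duplicate slices cannot leak between sets. The second is an ensemble that averages the class probabilities of several models. Soft voting (probability averaging) is an assumption, since the summary does not say how the ConvNeXt-Large and EfficientNetV2-Small outputs are combined. The tuple layout and function names are hypothetical.

```python
import random


def specimen_level_split(slices, test_fraction, seed=0):
    """Split (specimen_id, slice_path, label) tuples so no specimen spans both sets."""
    ids = sorted({s[0] for s in slices})
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    test_ids = set(ids[:n_test])
    train = [s for s in slices if s[0] not in test_ids]
    test = [s for s in slices if s[0] in test_ids]
    return train, test


def soft_vote(prob_lists, weights=None):
    """Weighted average of per-model class-probability vectors."""
    weights = weights or [1.0] * len(prob_lists)
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * p[c] for w, p in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]


# Hypothetical data: 10 specimens with 5 slices each.
data = [(f"spec{i}", f"spec{i}_slice{j}.png", i % 3) for i in range(10) for j in range(5)]
train, test = specimen_level_split(data, test_fraction=0.2)
print(len(train), len(test))

# Probabilities from two models for one slice (values are made up).
print(soft_vote([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]))
```

Splitting by specimen rather than by slice is what makes a held-out accuracy meaningful here. Adjacent slices of one specimen look nearly identical, so a slice-level split would let the model see test specimens during training.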

S2AM3D: Scale-controllable Part Segmentation of 3D Point Cloud

The paper proposes S2AM3D, a novel framework that integrates 2D segmentation priors with 3D consistent supervision and a scale-aware prompt decoder to achieve robust, generalizable, and real-time controllable part segmentation for 3D point clouds, supported by a newly introduced large-scale dataset.

Han Su, Tianyu Huang, Zichen Wan, Xiaohe Wu, Wangmeng Zuo (2026-03-10, cs)

HiconAgent: History Context-aware Policy Optimization for GUI Agents

HiconAgent introduces History Context-aware Policy Optimization (HCPO), featuring Dynamic Context Sampling and Anchor-guided History Compression, to enable a compact 3B-parameter GUI agent to outperform larger models in navigation accuracy while significantly reducing computational costs.

Xurui Zhou, Gongwei Chen, Yuquan Xie, Zaijing Li, Kaiwen Zhou, Shuai Wang, Shuo Yang, Zhuotao Tian, Rui Shao (2026-03-10, cs)

MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

MAViD is a novel multimodal framework that employs a Conductor-Creator architecture, combining autoregressive audio and diffusion-based video generation with a specialized fusion module, to overcome existing limitations and achieve seamless, long-duration, and contextually coherent audio-visual dialogue understanding and generation.

Youxin Pang, Jiajun Liu, Lingfeng Tan, Yong Zhang, Feng Gao, Xiang Deng, Zhuoliang Kang, Xiaoming Wei, Yebin Liu (2026-03-10, cs)

When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs

This paper reveals that visual token information in Vision Large Language Models progressively vanishes at a depth-dependent "information horizon," beyond which existing pruning methods underperform random selection, leading to a novel strategy that integrates random pruning to achieve state-of-the-art efficiency without sacrificing accuracy.

Yahong Wang, Juncheng Wu, Zhangkai Ni, Longzhen Yang, Yihang Liu, Chengmei Yang, Ying Wen, Lianghua He, Xianfeng Tang, Hui Liu, Yuyin Zhou (2026-03-10, cs)
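One way to integrate random pruning is to spend part of the token budget on the highest-scoring tokens and fill the rest with a uniform random sample. The sketch below does that. The `score_fraction` parameter and the `hybrid_prune` name are illustrative assumptions, not the paper's published rule.

```python
import random


def hybrid_prune(scores, keep, score_fraction, rng):
    """Keep `keep` token indices: part by score, the rest uniformly at random."""
    n = len(scores)
    keep = min(keep, n)
    score_fraction = min(max(score_fraction, 0.0), 1.0)
    n_scored = int(round(keep * score_fraction))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:n_scored]  # highest importance scores
    chosen += rng.sample(ranked[n_scored:], keep - n_scored)  # random fill
    return sorted(chosen)  # preserve original token order


# Eight visual tokens with made-up importance scores, keeping half of them.
rng = random.Random(0)
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]
print(hybrid_prune(scores, keep=4, score_fraction=0.5, rng=rng))
```

With `score_fraction=1.0` this reduces to plain score-based pruning, and with `0.0` it is purely random. A depth-dependent schedule could therefore lower the fraction for layers past the information horizon, though that schedule is a hypothetical design here.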

Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction

This paper addresses the challenges of extracting road networks in off-road environments by introducing the WildRoad dataset and MaGRoad, a novel path-centric framework that overcomes the limitations of existing node-centric models to achieve state-of-the-art performance and faster inference in wild terrain.

Wenfei Guan, Jilin Mei, Tong Shen, Xumin Wu, Shuo Wang, Chen Min, Yu Hu (2026-03-10, cs)

Two-Step Data Augmentation for Masked Face Detection and Recognition: Turning Fake Masks to Real

This paper presents a two-step generative data augmentation framework that combines rule-based mask warping with unpaired image-to-image translation to address the scarcity of masked face datasets, achieving performance improvements with minimal training data; the authors explicitly note that the work began as a resource-constrained coursework project and lacks downstream quantitative evaluation.

Yan Yang, George Bebis, Mircea Nicolescu (2026-03-10, cs.LG)

SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks

The paper introduces SALVE, a unified framework that combines sparse autoencoders and feature-level saliency mapping to discover, validate, and precisely edit neural network weights, enabling interpretable and robust control over both convolutional and transformer-based models.

Vegard Flovik (2026-03-10, cs.LG)
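For readers unfamiliar with sparse autoencoders, here is a minimal forward pass. Activations are encoded into a wider, ReLU-sparse feature vector and then decoded back. Training minimizes reconstruction error plus an L1 penalty. A simple form of latent editing is to overwrite one feature's activation and decode again. This is a generic SAE sketch in plain Python with toy weights, not SALVE's architecture. SALVE itself edits network weights and uses saliency mapping, and this sketch attempts neither. All names are hypothetical.

```python
def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sae_encode(x, w_enc, b_enc, b_dec):
    """z = ReLU(W_enc (x - b_dec) + b_enc); most entries end up at zero."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    return [max(0.0, a + b) for a, b in zip(_matvec(w_enc, centered), b_enc)]


def sae_decode(z, w_dec, b_dec):
    """x_hat = W_dec z + b_dec."""
    return [a + b for a, b in zip(_matvec(w_dec, z), b_dec)]


def sae_loss(x, z, x_hat, l1):
    """Squared reconstruction error plus an L1 sparsity penalty on z."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + l1 * sum(abs(a) for a in z)


def edit_feature(z, index, value):
    """Return a copy of z with one feature's activation overwritten."""
    edited = list(z)
    edited[index] = value
    return edited


# Toy weights: 2-D activations, 3 dictionary features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

x = [1.0, -2.0]
z = sae_encode(x, w_enc, b_enc, b_dec)
x_hat = sae_decode(z, w_dec, b_dec)
print(z, x_hat, sae_loss(x, z, x_hat, l1=0.1))
print(sae_decode(edit_feature(z, 0, 0.0), w_dec, b_dec))
```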

ReMeDI: Refined Memory for Disambiguation of Identities with SAM3 in Surgical Segmentation

The paper introduces ReMeDI-SAM3, a training-free extension of SAM3 that enhances surgical instrument segmentation in endoscopy by implementing relevance-aware memory filtering, piecewise interpolation, and feature-based re-identification to overcome challenges like occlusions and rapid motion, achieving significant zero-shot performance improvements over existing methods.

Valay Bundele, Mehran Hosseinzadeh, Hendrik P. A. Lensch (2026-03-10, cs)

It is not always greener on the other side: Greenery perception across demographics and personalities in multiple cities

This study analyzes the discrepancies between objective and subjective urban greenery perceptions across five countries using street view imagery and a survey of 1,000 participants, revealing that while demographics and personality have little influence, an individual's geographic location is a primary factor shaping how they perceive green spaces.

Matias Quintana, Fangqi Liu, Jussi Torkko, Youlong Gu, Xiucheng Liang, Yujun Hou, Koichi Ito, Yihan Zhu, Mahmoud Abdelrahman, Tuuli Toivonen, Yi Lu, Filip Biljecki (2026-03-10, cs)

Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting

Re-Depth Anything is a test-time self-supervised framework that enhances monocular depth estimation by fusing foundation models with large-scale 2D diffusion priors to perform label-free refinement via generative re-lighting and Score Distillation Sampling, achieving state-of-the-art results without direct depth tensor optimization.

Ananta R. Bhattarai, Helge Rhodin (2026-03-10, cs.LG)
