Activation Function Design Sustains Plasticity in Continual Learning
This paper demonstrates that careful activation function design can sustain model plasticity in continual learning. The proposed Smooth-Leaky and Randomized Smooth-Leaky nonlinearities act as a lightweight, architecture-agnostic remedy for loss of adaptability, requiring neither additional model capacity nor task-specific tuning.
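The summary does not give the functional form of these nonlinearities. As a hedged illustration only, the sketch below shows one plausible way to construct a "smooth leaky" activation: a softplus blend that behaves like `alpha * x` for large negative inputs and like `x` for large positive inputs, so the gradient never vanishes, which is the usual argument for why such shapes preserve plasticity. The function names, the blend formula, and the slope-sampling scheme in the randomized variant are assumptions, not the paper's actual definitions.

```python
import numpy as np

def smooth_leaky(x, alpha=0.1):
    """Hypothetical smooth-leaky activation (assumed form, not the
    paper's): a softplus blend with slope ~alpha for x << 0 and
    slope ~1 for x >> 0, with no kink at zero. The derivative
    alpha + (1 - alpha) * sigmoid(x) is strictly positive everywhere,
    so units never produce exactly zero gradient."""
    return alpha * x + (1.0 - alpha) * np.logaddexp(0.0, x)

def randomized_smooth_leaky(x, rng, alpha_range=(0.05, 0.3)):
    """Hypothetical randomized variant (assumed): samples a fresh
    negative-side slope per unit, diversifying unit responses."""
    alpha = rng.uniform(*alpha_range, size=x.shape)
    return alpha * x + (1.0 - alpha) * np.logaddexp(0.0, x)

x = np.array([-100.0, 0.0, 100.0])
y = smooth_leaky(x, alpha=0.1)        # ~[-10.0, 0.62, 100.0]
z = randomized_smooth_leaky(x, np.random.default_rng(0))
```

The key property this sketch tries to capture is that, unlike ReLU, no input region has an exactly zero derivative, so no unit can become permanently "dead" as the data distribution shifts across tasks.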