cs.CV papers | Gist.Science

Learning in the Null Space: Small Singular Values for Continual Learning

This paper introduces NESS, a continual learning method that mitigates catastrophic forgetting by constraining task-specific updates to an approximate null space derived from the smallest singular values of input representations, thereby enabling efficient adaptation while preserving performance on previous tasks.

Cuong Anh Pham, Praneeth Vepakomma, Samuel Horváth | 2026-02-26 | 🤖 cs.LG
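
To make the null-space idea in the summary above concrete, here is a minimal NumPy sketch. It is not the NESS implementation: the function names, the choice of `k`, and the synthetic low-rank features are all assumptions made for illustration. It takes the right singular vectors of the old-task feature matrix that have the smallest singular values, and projects a weight update onto their span so that the update barely changes the layer's outputs on old-task inputs.

```python
import numpy as np


def approx_null_space_basis(features, k):
    """Return the k right singular vectors with the smallest singular values.

    features: (n_samples, d) matrix of input representations from earlier tasks.
    The result has shape (d, k) with orthonormal columns.
    """
    # full_matrices=True guarantees d right singular vectors even if n_samples < d.
    _, _, vt = np.linalg.svd(features, full_matrices=True)
    # Singular values are sorted in descending order, so the last rows of vt
    # span the directions the old features occupy least.
    return vt[-k:].T


def project_update(delta_w, basis):
    """Restrict a weight update of shape (out_dim, d) to span(basis)."""
    return delta_w @ basis @ basis.T


# Hypothetical data: old-task features that lie in a 3-dimensional subspace of R^8.
rng = np.random.default_rng(0)
old_features = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 8))

basis = approx_null_space_basis(old_features, k=5)
delta_w = rng.normal(size=(4, 8))          # an unconstrained update
safe_delta = project_update(delta_w, basis)

# Largest change the projected update causes in the outputs on old inputs.
drift = np.abs(old_features @ safe_delta.T).max()
print(f"max output drift on old inputs: {drift:.2e}")
```

Because the synthetic features have rank 3, the five discarded directions form an exact null space and the drift is at floating-point level. With real representations the small singular values are nonzero, so the null space is only approximate and `k` trades plasticity against forgetting.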

Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context

This paper introduces "geometry-as-context," an autoregressive framework that iteratively estimates scene geometry and restores novel views to achieve superior scene consistency and camera control in video generation, overcoming the error accumulation and non-differentiability issues of previous methods.

JiaKui Hu, Jialun Liu, Liying Yang + 7 more | 2026-02-26 | 💻 cs

A Framework for Cross-Domain Generalization in Coronary Artery Calcium Scoring Across Gated and Non-Gated Computed Tomography

The paper presents CARD-ViT, a self-supervised Vision Transformer framework trained exclusively on ECG-gated CT data that successfully enables automated Coronary Artery Calcium scoring on non-gated scans, thereby facilitating scalable cardiovascular risk assessment using routine chest imaging without requiring additional scans or annotations.

Mahmut S. Gokmen, Moneera N. Haque, Steve W. Leung + 6 more | 2026-02-26 | 🤖 cs.AI

Directed Ordinal Diffusion Regularization for Progression-Aware Diabetic Retinopathy Grading

This paper proposes Directed Ordinal Diffusion Regularization (D-ODR), a novel method that enforces the unidirectional nature of diabetic retinopathy progression through a directed graph and multi-scale diffusion, thereby preventing biologically implausible reverse transitions and achieving superior grading performance compared to existing state-of-the-art approaches.

Huangwei Chen, Junhao Jia, Ruocheng Li + 7 more | 2026-02-26 | 💻 cs

Mobile-Ready Automated Triage of Diabetic Retinopathy Using Digital Fundus Images

This paper presents a lightweight, mobile-optimized deep learning framework using MobileNetV3 with a Consistent Rank Logits head to efficiently and accurately assess diabetic retinopathy severity from fundus images, achieving a Quadratic Weighted Kappa score of 0.9019 to enable scalable early-stage screening.

Aadi Joshi, Manav S. Sharma, Vijay Uttam Rathod + 3 more | 2026-02-26 | 💻 cs
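
For readers unfamiliar with the Quadratic Weighted Kappa metric quoted in the summary above, the following is a small sketch of the standard definition. The function name and the toy labels are assumptions; it does not reproduce the paper's evaluation or its 0.9019 score. Disagreements are penalised by the squared distance between grades, and the result is normalised by the penalty expected if predictions and labels were independent.

```python
import numpy as np


def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Quadratic Weighted Kappa for integer grades in 0..num_classes-1."""
    # Confusion matrix of observed (true, predicted) pairs.
    observed = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1

    # Counts expected under independence, from the two marginal histograms.
    hist_true = observed.sum(axis=1)
    hist_pred = observed.sum(axis=0)
    expected = np.outer(hist_true, hist_pred) / observed.sum()

    # Quadratic penalty: 0 on the diagonal, 1 for the most distant grades.
    idx = np.arange(num_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_classes - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


# Toy examples with five severity grades (0 to 4), as in DR grading.
perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], 5)
reversed_3 = quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], 3)
print(perfect, reversed_3)
```

A score of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement. Since the penalty grows with the square of the grade distance, confusing grade 0 with grade 4 costs far more than confusing adjacent grades.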

Learning to Fuse and Reconstruct Multi-View Graphs for Diabetic Retinopathy Grading

This paper proposes MVGFDR, an end-to-end multi-view graph fusion framework that explicitly disentangles shared and view-specific features through graph initialization, frequency-domain fusion, and masked cross-view reconstruction to achieve superior diabetic retinopathy grading on multi-view fundus images.

Haoran Li, Yuxin Lin, Huan Wang + 9 more | 2026-02-26 | 💻 cs

MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving

MindDriver is a novel progressive multimodal reasoning framework that bridges the gap between semantic understanding and physical trajectory planning for autonomous driving by introducing a human-like thinking process, supported by a feedback-guided data annotation pipeline and progressive reinforcement fine-tuning, which achieves superior performance in both open-loop and closed-loop evaluations.

Lingjun Zhang, Yujian Yuan, Changjie Wu + 7 more | 2026-02-26 | 💻 cs

Global-Local Dual Perception for MLLMs in High-Resolution Text-Rich Image Translation

This paper introduces GLoTran, a global-local dual perception framework for MLLMs that combines low-resolution global context with multi-scale region-level details to overcome challenges in high-resolution text-rich image translation, supported by the newly constructed large-scale GLoD dataset.

Junxin Lu, Tengfei Song, Zhanglin Wu + 9 more | 2026-02-26 | 💻 cs

Global-Aware Edge Prioritization for Pose Graph Initialization

This paper proposes a globally-aware edge prioritization framework for Structure-from-Motion pose graph initialization that leverages a GNN to predict edge reliability and guide a connectivity-aware construction process, resulting in more accurate and compact 3D reconstructions compared to existing retrieval-based methods.

Tong Wei, Giorgos Tolias, Jiri Matas + 1 more | 2026-02-26 | 💻 cs

Dream-SLAM: Dreaming the Unseen for Active SLAM in Dynamic Environments

Dream-SLAM is a novel monocular active SLAM method that enhances localization accuracy, mapping quality, and exploration efficiency in dynamic environments by "dreaming" cross-spatio-temporal images and semantically plausible structures to mitigate data incompleteness and enable long-horizon planning.

Xiangqi Meng, Pengxu Hou, Zhenjun Zhao + 4 more | 2026-02-26 | 💻 cs

PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning

This paper introduces PanoEnv, a large-scale 3D spatial reasoning benchmark for panoramic images, and proposes a curriculum-based reinforcement learning framework with GRPO that significantly enhances the 3D spatial intelligence of Vision-Language Models, achieving state-of-the-art performance on omnidirectional perception tasks.

Zekai Lin, Xu Zheng | 2026-02-26 | 💻 cs
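
The summary above mentions GRPO without saying what it does. As a hedged illustration of the group-relative step that gives GRPO its name, the sketch below normalises rewards within a group of answers sampled for the same prompt. The function name, the `eps` constant, and the example rewards are assumptions; the paper's reward design and curriculum are not shown.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of responses to the same prompt.

    Each response's advantage is its reward minus the group mean, divided by
    the group standard deviation, so no learned value function is needed.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Hypothetical rewards for four sampled answers to one spatial-reasoning question:
# two correct (reward 1) and two incorrect (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Answers that beat the group average get a positive advantage and are reinforced, while below-average answers are discouraged. If every answer in a group receives the same reward, all advantages are zero and that group contributes no learning signal.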

World Guidance: World Modeling in Condition Space for Action Generation

The paper proposes WoG (World Guidance), a framework that enhances Vision-Language-Action models by training them to simultaneously predict future actions and compact condition representations derived from future observations, thereby achieving superior fine-grained action generation and generalization in both simulation and real-world environments.

Yue Su, Sijin Chen, Haixin Shi + 7 more | 2026-02-26 | 💻 cs

RGB-Event HyperGraph Prompt for Kilometer Marker Recognition based on Pre-trained Foundation Models

This paper addresses the challenges of Kilometer Marker Recognition for autonomous metro trains in complex environments by proposing a robust multi-modal method that adapts a pre-trained RGB OCR foundation model to event camera data and introducing the first large-scale synchronized RGB-Event dataset, EvMetro5K, to validate the approach.

Xiaoyu Xian, Shiao Wang, Xiao Wang + 2 more | 2026-02-26 | 🤖 cs.AI

RT-RMOT: A Dataset and Framework for RGB-Thermal Referring Multi-Object Tracking

This paper introduces RT-RMOT, a new task for all-day referring multi-object tracking, along with the first RGB-Thermal dataset (RefRT) and the RTrack framework, which leverages a multimodal large language model enhanced by Group Sequence Policy Optimization and specialized reward strategies to achieve robust tracking in challenging low-visibility conditions.

Yanqiu Yu, Zhifan Jin, Sijia Chen + 4 more | 2026-02-26 | 💻 cs

SPGen: Stochastic scanpath generation for paintings using unsupervised domain adaptation

The paper introduces SPGen, a novel deep learning model that utilizes unsupervised domain adaptation and stochastic sampling to accurately predict human eye movement scanpaths on paintings, thereby advancing the analysis and preservation of cultural heritage.

Mohamed Amine Kerkouri, Marouane Tliba, Aladine Chetouani + 1 more | 2026-02-26 | 💻 cs

AutoSew: A Geometric Approach to Stitching Prediction with Graph Neural Networks

AutoSew is a fully automatic, geometry-based framework that utilizes Graph Neural Networks and optimal transport to predict stitch correspondences directly from 2D pattern contours, achieving high accuracy in assembling garments without relying on manual annotations or semantic cues.

Pablo Ríos-Navarro, Elena Garces, Jorge Lopez-Moreno | 2026-02-26 | 💻 cs

NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training

The paper proposes NESTOR, a nested Mixture-of-Experts neural operator that combines image-level and token-level expert modules to capture both global and local dependencies, thereby enabling effective large-scale pre-training across diverse PDE systems and enhancing generalization to downstream tasks.

Dengdi Sun, Xiaoya Zhou, Xiao Wang + 4 more | 2026-02-26 | 🤖 cs.AI

AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting

AdaSpot is a novel framework for precise event spotting that enhances efficiency and localization accuracy by processing low-resolution videos globally while adaptively selecting and analyzing high-resolution regions of interest through an unsupervised, task-aware strategy, achieving state-of-the-art performance on standard benchmarks.

Artur Xarles, Sergio Escalera, Thomas B. Moeslund + 1 more | 2026-02-26 | 💻 cs

WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation

WeatherCity is a novel framework that enables flexible, high-fidelity, and temporally consistent 4D urban scene reconstruction with controllable multi-weather transformations by combining text-guided image editing, a shared-feature weather Gaussian representation, and a physics-driven dynamic model.

Wenhua Wu, Huai Guan, Zhe Liu + 1 more | 2026-02-26 | 💻 cs

Brain3D: Brain Report Automation via Inflated Vision Transformers in 3D

The paper introduces Brain3D, a specialized vision-language framework that converts 2D pretrained encoders into native 3D architectures to automate neuroradiology report generation from brain tumor MRIs, achieving significantly higher clinical accuracy and perfect specificity on healthy scans compared to 2D baselines through a three-stage alignment process.

Mariano Barone, Francesco Di Serio, Giuseppe Riccio + 4 more | 2026-02-26 | 💻 cs
