cs.CV papers | Gist.Science

RGB-Event HyperGraph Prompt for Kilometer Marker Recognition based on Pre-trained Foundation Models

This paper addresses the challenges of Kilometer Marker Recognition for autonomous metro trains in complex environments by proposing a robust multi-modal method that adapts a pre-trained RGB OCR foundation model to event camera data and introducing the first large-scale synchronized RGB-Event dataset, EvMetro5K, to validate the approach.

Xiaoyu Xian, Shiao Wang, Xiao Wang + 2 more | 2026-02-26 | 🤖 cs.AI

RT-RMOT: A Dataset and Framework for RGB-Thermal Referring Multi-Object Tracking

This paper introduces RT-RMOT, a new task for all-day referring multi-object tracking, along with the first RGB-Thermal dataset (RefRT) and the RTrack framework, which leverages a multimodal large language model enhanced by Group Sequence Policy Optimization and specialized reward strategies to achieve robust tracking in challenging low-visibility conditions.

Yanqiu Yu, Zhifan Jin, Sijia Chen + 4 more | 2026-02-26 | 💻 cs

SPGen: Stochastic scanpath generation for paintings using unsupervised domain adaptation

The paper introduces SPGen, a novel deep learning model that utilizes unsupervised domain adaptation and stochastic sampling to accurately predict human eye movement scanpaths on paintings, thereby advancing the analysis and preservation of cultural heritage.

Mohamed Amine Kerkouri, Marouane Tliba, Aladine Chetouani + 1 more | 2026-02-26 | 💻 cs

AutoSew: A Geometric Approach to Stitching Prediction with Graph Neural Networks

AutoSew is a fully automatic, geometry-based framework that utilizes Graph Neural Networks and optimal transport to predict stitch correspondences directly from 2D pattern contours, achieving high accuracy in assembling garments without relying on manual annotations or semantic cues.

Pablo Ríos-Navarro, Elena Garces, Jorge Lopez-Moreno | 2026-02-26 | 💻 cs

NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training

The paper proposes NESTOR, a nested Mixture-of-Experts neural operator that combines image-level and token-level expert modules to capture both global and local dependencies, thereby enabling effective large-scale pre-training across diverse PDE systems and enhancing generalization to downstream tasks.

Dengdi Sun, Xiaoya Zhou, Xiao Wang + 4 more | 2026-02-26 | 🤖 cs.AI

AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting

AdaSpot is a novel framework for precise event spotting that enhances efficiency and localization accuracy by processing low-resolution videos globally while adaptively selecting and analyzing high-resolution regions of interest through an unsupervised, task-aware strategy, achieving state-of-the-art performance on standard benchmarks.

Artur Xarles, Sergio Escalera, Thomas B. Moeslund + 1 more | 2026-02-26 | 💻 cs

WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation

WeatherCity is a novel framework that enables flexible, high-fidelity, and temporally consistent 4D urban scene reconstruction with controllable multi-weather transformations by combining text-guided image editing, a shared-feature weather Gaussian representation, and a physics-driven dynamic model.

Wenhua Wu, Huai Guan, Zhe Liu + 1 more | 2026-02-26 | 💻 cs

Brain3D: Brain Report Automation via Inflated Vision Transformers in 3D

The paper introduces Brain3D, a specialized vision-language framework that converts 2D pretrained encoders into native 3D architectures to automate neuroradiology report generation from brain tumor MRIs. Using a three-stage alignment process, it achieves significantly higher clinical accuracy than 2D baselines and perfect specificity on healthy scans.

Mariano Barone, Francesco Di Serio, Giuseppe Riccio + 4 more | 2026-02-26 | 💻 cs

GeoDiv: Framework For Measuring Geographical Diversity In Text-To-Image Models

The paper introduces GeoDiv, a novel framework leveraging large language and vision-language models to systematically measure and reveal significant geographical biases and socio-economic stereotypes in text-to-image generation, demonstrating how current models disproportionately portray countries like India, Nigeria, and Colombia in impoverished ways.

Abhipsa Basu, Mohana Singh, Shashank Agnihotri + 2 more | 2026-02-26 | 💻 cs

Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels

The paper presents Lumosaic, a compact active hyperspectral video system that synchronizes a narrowband LED array with coded-exposure pixels to achieve real-time, high-fidelity 31-channel video reconstruction of dynamic scenes, significantly outperforming existing passive snapshot methods in both spectral accuracy and temporal stability.

Dhruv Verma, Andrew Qiu, Roberto Rangel + 8 more | 2026-02-26 | ⚡ eess

WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs

WeaveTime is a model-agnostic framework that addresses the time-agnostic limitations of current Video-LLMs in streaming scenarios by introducing a lightweight Temporal Reconstruction objective to instill order-aware representations and a Past-Current Dynamic Focus Cache for uncertainty-triggered retrieval, thereby improving accuracy and reducing latency without architectural changes.

Yulin Zhang, Cheng Shi, Sibei Yang | 2026-02-26 | 💻 cs

MedTri: A Platform for Structured Medical Report Normalization to Enhance Vision-Language Pretraining

This paper introduces MedTri, a deployable framework that normalizes heterogeneous medical reports into structured, anatomy-grounded triplets to remove stylistic noise and significantly enhance the performance and generalizability of medical vision-language pretraining across X-ray and CT modalities.

Yuetan Chu, Xinhua Ma, Xinran Jin + 2 more | 2026-02-26 | 💻 cs

NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors

This paper introduces NoLan, a training-free framework that mitigates object hallucinations in Large Vision-Language Models by identifying language decoder priors as the primary cause and dynamically suppressing them during decoding.

Lingfeng Ren, Weihao Yu, Runpeng Yu + 1 more | 2026-02-26 | 💬 cs.CL

CASR: A Robust Cyclic Framework for Arbitrary Large-Scale Super-Resolution with Distribution Alignment and Self-Similarity Awareness

CASR is a robust, single-model cyclic framework for arbitrary-scale super-resolution that mitigates cross-scale distribution shifts and texture inconsistencies by reformulating ultra-magnification as a sequence of in-distribution transitions guided by structural alignment and self-similarity priors.

Wenhao Guo, Zhaoran Zhao, Peng Lu + 3 more | 2026-02-26 | 💻 cs

Mixed Magnification Aggregation for Generalizable Region-Level Representations in Computational Pathology

This paper proposes a region-level mixing encoder that fuses multi-magnification tile representations through masked embedding modeling pretraining to enhance generalizable region-level features and improve biomarker prediction performance in computational pathology.

Eric Zimmermann, Julian Viret, Michal Zelechowski + 7 more | 2026-02-26 | 💻 cs

Off-The-Shelf Image-to-Image Models Are All You Need To Defeat Image Protection Schemes

This paper demonstrates that off-the-shelf image-to-image generative AI models can be simply repurposed as generic denoisers to effectively defeat a wide range of image protection schemes, outperforming specialized attacks and revealing a critical vulnerability in current defense mechanisms.

Xavier Pleimling, Sifat Muhammad Abdullah, Gunjan Balde + 4 more | 2026-02-26 | 🤖 cs.AI

WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos

WHOLE is a novel method that holistically reconstructs hand and object motion in world space from challenging egocentric videos by leveraging a learned generative prior to jointly reason about their interactions, thereby achieving state-of-the-art performance in handling occlusions and ensuring consistent hand-object relations.

Yufei Ye, Jiaman Li, Ryan Rong + 1 more | 2026-02-26 | 💻 cs

Towards Attributions of Input Variables in a Coalition

This paper addresses the challenge of partitioning input variables in Shapley value-based Explainable AI by analyzing attribution conflicts caused by AND-OR interactions. It proposes a new attribution metric for variable coalitions and three faithfulness evaluation metrics, all validated across diverse domains.

Xinhao Zheng, Huiqi Deng, Quanshi Zhang | 2026-02-25 | 🤖 cs.AI

Interpretable Medical Image Classification using Prototype Learning and Privileged Information

This paper proposes Proto-Caps, an interpretable medical image classification model that integrates capsule networks, prototype learning, and privileged information to achieve state-of-the-art accuracy in lung nodule malignancy prediction while providing visual, case-based reasoning for radiologist validation.

Luisa Gallee, Meinrad Beer, Michael Goetz | 2026-02-25 | 🤖 cs.AI

Coherent and Multi-modality Image Inpainting via Latent Space Optimization

This paper introduces PILOT, a tuning-free image inpainting method that works with pre-trained diffusion models, optimizing their latent spaces with semantic centralization and background preservation losses to generate coherent, multi-modal content.

Lingzhi Pan, Tong Zhang, Bingyuan Chen + 4 more | 2026-02-25 | 💻 cs
