cs.CV papers | Gist.Science

Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding

The paper introduces SpecTemp, a reinforcement learning-based framework that makes long video understanding more efficient by decoupling temporal perception from reasoning. In its cooperative dual-model design, a lightweight draft MLLM proposes salient frames that a powerful target MLLM then verifies, which significantly accelerates inference while maintaining competitive accuracy.

Pengfei Hu, Meng Cao, Yingyao Wang + 6 more | 2026-03-02 | 💻 cs

TARDis: Time Attenuated Representation Disentanglement for Incomplete Multi-Modal Tumor Segmentation and Classification

The paper proposes TARDis, a novel physics-aware framework that addresses incomplete multi-modal tumor segmentation by disentangling time-invariant anatomical features from time-dependent hemodynamic dynamics, enabling robust diagnosis even when temporal CT phases are missing due to clinical constraints.

Zishuo Wan, Qinqin Kang, Na Li + 6 more | 2026-03-02 | 💻 cs

Self-Supervised AI-Generated Image Detection: A Camera Metadata Perspective

This paper introduces a self-supervised AI-generated image detection framework that leverages EXIF metadata to learn intrinsic photographic features, achieving state-of-the-art generalization and robustness across diverse generative models through one-class and binary detection strategies.

Nan Zhong, Mian Zou, Yiran Xu + 4 more | 2026-03-02 | 💻 cs

FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models

This paper introduces FRIEDA, a rigorous benchmark for evaluating multi-step cartographic reasoning in large vision-language models, revealing a significant performance gap between current state-of-the-art systems and human capabilities in interpreting complex spatial relationships across map images.

Jiyoon Pyo, Yuankun Jiao, Dongwon Jung + 11 more | 2026-03-02 | 🤖 cs.AI

Sharp Monocular View Synthesis in Less Than a Second

SHARP is a novel, real-time monocular view synthesis method that uses a single feedforward neural network pass to regress metric 3D Gaussian parameters from a single image, achieving state-of-the-art photorealistic rendering with significantly reduced synthesis time and superior zero-shot generalization compared to prior models.

Lars Mescheder, Wei Dong, Shiwei Li + 10 more | 2026-03-02 | 🤖 cs.LG

Geometric-Photometric Event-based 3D Gaussian Ray Tracing

This paper proposes a novel event-based 3D Gaussian Splatting framework that decouples geometry and radiance rendering into event-by-event and snapshot-based branches, respectively, to achieve state-of-the-art, prior-free 3D reconstruction with high temporal resolution and sharp edge details.

Kai Kohyama, Yoshimitsu Aoki, Guillermo Gallego + 1 more | 2026-03-02 | 🤖 cs.AI

ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving

ColaVLA is a unified vision-language-action framework that addresses the latency and modality mismatch of existing VLM-based planners by transferring cognitive reasoning into a compact latent space and employing a hierarchical parallel decoder to achieve state-of-the-art, efficient, and safe trajectory planning on the nuScenes benchmark.

Qihang Peng, Xuesong Chen, Chenye Yang + 2 more | 2026-03-02 | 💻 cs

Inference-time Physics Alignment of Video Generative Models with Latent World Models

This paper introduces WMReward, an inference-time alignment method that leverages a latent world model (VJEPA-2) as a reward signal to steer video generation trajectories, significantly improving physics plausibility across various conditioning settings and securing first place in the ICCV 2025 Perception Test PhysicsIQ Challenge.

Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich + 7 more | 2026-03-02 | 💻 cs

CPiRi: Channel Permutation-Invariant Relational Interaction for Multivariate Time Series Forecasting

CPiRi is a novel framework for multivariate time series forecasting that combines a spatio-temporal decoupling architecture with permutation-invariant regularization to overcome the limitations of existing channel-dependent and independent models, achieving state-of-the-art performance, robustness to channel reordering, and strong inductive generalization to unseen channels.

Jiyuan Xu, Wenyu Zhang, Xin Jing + 3 more | 2026-03-02 | 💻 cs

Scale Equivariance Regularization and Feature Lifting in High Dynamic Range Modulo Imaging

This paper proposes a learning-based HDR restoration framework for modulo imaging that combines scale-equivariant regularization with a feature lifting input design to effectively distinguish natural edges from wrapping artifacts and achieve state-of-the-art reconstruction performance.

Brayan Monroy, Jorge Bacca | 2026-03-02 | ⚡ eess

Imagine a City: CityGenAgent for Procedural 3D City Generation

This paper introduces CityGenAgent, a natural language-driven framework that utilizes a two-stage learning strategy of Supervised Fine-Tuning and Reinforcement Learning to hierarchically generate high-quality, editable, and semantically aligned 3D cities through interpretable Block and Building programs.

Zishan Liu, Zecong Tang, RuoCheng Wu + 6 more | 2026-03-02 | 💻 cs

Erase at the Core: Representation Unlearning for Machine Unlearning

The paper introduces Erase at the Core (EC), a model-agnostic framework that enforces comprehensive machine unlearning by applying multi-layer contrastive learning and deep supervision to eliminate superficial forgetting and substantially reduce representational similarity across the entire network hierarchy while preserving performance on retained data.

Jaewon Lee, Yongwoo Kim, Donghyun Kim | 2026-03-02 | 🤖 cs.LG

PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion

PixelRush is a novel, training-free framework that achieves ultra-fast, high-resolution image generation by enabling efficient patch-based denoising in a single step, reducing 4K image creation time from minutes to approximately 20 seconds while maintaining superior visual quality.

Hong-Phuc Lai, Phong Nguyen, Anh Tran | 2026-03-02 | 💻 cs

Beyond Ground: Map-Free LiDAR Relocalization for UAVs

This paper introduces MAILS, a novel map-free LiDAR relocalization framework designed specifically for UAVs that leverages locality-preserving attention and coordinate-independent mechanisms to achieve high-precision positioning in GNSS-denied environments. The authors also create a new large-scale dataset to address the lack of realistic UAV flight data.

Hengyu Mu, Jianshi Wu, Yuxin Guo + 5 more | 2026-03-02 | ⚡ eess

COOPERTRIM: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception

COOPERTRIM is an adaptive data selection framework for cooperative perception that leverages temporal continuity and a novel conformal uncertainty metric to dynamically filter redundant information, achieving significant bandwidth reduction (up to 80%) while maintaining or improving detection and segmentation accuracy compared to existing methods.

Shilpa Mukhopadhyay, Amit Roy-Chowdhury, Hang Qiu | 2026-03-02 | 💻 cs

Diff-Aid: Inference-time Adaptive Interaction Denoising for Rectified Text-to-Image Generation

Diff-Aid is a lightweight, plug-and-play inference-time method that adaptively modulates text-image interactions across transformer blocks and denoising timesteps to enhance prompt adherence and visual quality in rectified text-to-image generation while providing interpretable insights into the alignment process.

Binglei Li, Mengping Yang, Zhiyu Tan + 2 more | 2026-03-02 | 💻 cs

SceneTok: A Compressed, Diffusable Token Space for 3D Scenes

SceneTok introduces a novel tokenizer that compresses 3D scene view sets into a small, permutation-invariant set of unstructured tokens, enabling state-of-the-art reconstruction, flexible novel view synthesis, and efficient scene generation with significantly higher compression than existing methods.

Mohammad Asim, Christopher Wewer, Jan Eric Lenssen | 2026-03-02 | 🤖 cs.AI

Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis

This paper proposes a learning-free, prototype-guided framework for multimodal dataset distillation that leverages CLIP embeddings and an unCLIP decoder to synthesize images, thereby achieving state-of-the-art cross-architecture generalization without the computational costs and architectural limitations of existing optimization-based methods.

Junhyeok Choi, Sangwoo Mo, Minwoo Chae | 2026-03-02 | 💻 cs

One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image

One2Scene is a novel framework that generates geometrically consistent, explorable 3D scenes from a single image. It decomposes the task into panorama generation, 3D scaffold construction via multi-view stereo matching on sparse anchor views, and novel view synthesis, thereby overcoming the severe distortions and artifacts that existing methods commonly produce under large camera motions.

Pengfei Wang, Liyi Chen, Zhiyuan Ma + 3 more | 2026-03-02 | 💻 cs

Test-Time Training with KV Binding Is Secretly Linear Attention

This paper reframes Test-Time Training (TTT) with KV binding not as a memorization-based online meta-learning process, but as a form of learned linear attention, a perspective that explains puzzling model behaviors and enables principled architectural simplifications and efficient parallel formulations.

Junchen Liu, Sven Elflein, Or Litany + 2 more | 2026-03-02 | 🤖 cs.AI
