cs.CV papers | Gist.Science

Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding

This paper presents a comparative evaluation of supervised learning and open-vocabulary vision models for post-disaster scene understanding across multiple datasets, concluding that while foundation models offer flexibility, supervised training remains the most reliable approach for accurately detecting small objects and delineating boundaries in cluttered disaster scenes when annotations are available.

Anna Michailidou, Georgios Angelidis, Vasileios Argyriou + 2 more | 2026-03-03 | 💻 cs

You Only Need One Stage: Novel-View Synthesis From A Single Blind Face Image

The paper proposes NVB-Face, a novel one-stage framework that directly generates consistent, high-quality novel-view face images from a single degraded (blind) input by extracting features and transforming them into 3D-aware latent representations via a diffusion model, thereby outperforming traditional two-stage restoration-and-synthesis pipelines.

Taoyue Wang, Xiang Zhang, Xiaotian Li + 2 more | 2026-03-03 | 🤖 cs.AI

Perspective-Equivariant Fine-tuning for Multispectral Demosaicing without Ground Truth

The paper proposes Perspective-Equivariant Fine-tuning for Demosaicing (PEFD), a framework for high-fidelity multispectral image reconstruction without ground truth. It leverages projective geometry and adapts pretrained foundation models, and it outperforms existing methods on real-world datasets.

Andrew Wang, Mike Davies | 2026-03-03 | 💻 cs

MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention

MixerCSeg is an efficient, state-of-the-art architecture for crack segmentation that integrates CNN-like local texture analysis, Transformer-style global dependency modeling, and Mamba-inspired sequential context processing within a unified encoder, enhanced by specialized edge-aware and multi-scale fusion modules to achieve high performance with minimal computational cost.

Zilong Zhao, Zhengming Ding, Pei Niu + 2 more | 2026-03-03 | 🤖 cs.AI

TIMI: Training-Free Image-to-3D Multi-Instance Generation with Spatial Fidelity

The paper proposes TIMI, a training-free framework that achieves high spatial fidelity in image-to-3D multi-instance generation by leveraging pre-trained model priors through an Instance-aware Separation Guidance module for disentanglement and a Spatial-stabilized Geometry-adaptive Update module for geometric preservation, outperforming existing methods without additional training overhead.

Xiao Cai, Lianli Gao, Pengpeng Zeng + 3 more | 2026-03-03 | 💻 cs

Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

This paper proposes AOT, a training-free method that optimizes token reduction in Video Large Language Models by establishing local and global token anchors and aggregating informative contexts via optimal transport to efficiently eliminate redundancy while preserving spatiotemporal fidelity.

Jinlong Li, Liyuan Jiang, Haonan Zhang + 1 more | 2026-03-03 | 💻 cs

UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation

UniTalking is a unified, end-to-end diffusion framework that leverages Multi-Modal Transformer Blocks and pre-trained video priors to generate high-fidelity, lip-synchronized talking portraits with personalized voice cloning, achieving superior performance over existing open-source methods.

Hebeizi Li, Zihao Liang, Benyuan Sun + 4 more | 2026-03-03 | 💻 cs

SeaVIS: Sound-Enhanced Association for Online Audio-Visual Instance Segmentation

The paper introduces SeaVIS, the first online framework for audio-visual instance segmentation that utilizes Causal Cross Attention Fusion and Audio-Guided Contrastive Learning to effectively track and segment sounding objects in continuous video streams while suppressing silent instances.

Yingjian Zhu, Ying Wang, Yuyang Hong + 5 more | 2026-03-03 | 💻 cs

Unifying Language-Action Understanding and Generation for Autonomous Driving

This paper introduces LinkVLA, a novel Vision-Language-Action model for autonomous driving that unifies language and action tokens in a shared codebook, incorporates an auxiliary action understanding objective for bidirectional semantic alignment, and employs a coarse-to-fine generation strategy to significantly improve instruction following, driving performance, and inference efficiency.

Xinyang Wang, Qian Liu, Wenjie Ding + 7 more | 2026-03-03 | 💻 cs

Revisiting Global Token Mixing in Task-Dependent MRI Restoration: Insights from Minimal Gated CNN Baselines

This paper demonstrates that the effectiveness of global token mixing in MRI restoration is highly task-dependent, showing that while local gated CNNs suffice for reconstruction and super-resolution tasks constrained by physics or preserved low-frequency data, global models are superior for denoising tasks involving spatially heteroscedastic noise.

Xiangjian Hou, Chao Qin, Chang Ni + 3 more | 2026-03-03 | ⚡ eess

Deepfake Forensics Adapter: A Dual-Stream Network for Generalizable Deepfake Detection

This paper introduces the Deepfake Forensics Adapter (DFA), a novel dual-stream framework that integrates a frozen CLIP model with global and local forensic analysis streams to achieve state-of-the-art generalization in deepfake detection, particularly demonstrating significant performance improvements on the challenging DFDC dataset.

Jianfeng Liao, Yichen Wei, Raymond Chan Ching Bon + 3 more | 2026-03-03 | 💻 cs

VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

The paper introduces VidDoS, the first universal Energy-Latency Attack framework for Video-LLMs that employs masked teacher forcing and refusal penalties to generate instance-agnostic triggers, causing extreme token expansion and inference latency that lead to critical safety violations in real-time applications.

Duoxun Tang, Dasen Dai, Jiyao Wang + 3 more | 2026-03-03 | 🤖 cs.AI

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

This paper introduces MM-Mem, a cognition-inspired pyramidal multimodal memory architecture that leverages Fuzzy-Trace Theory and a Semantic Information Bottleneck to progressively distill verbatim visual details into abstract semantic schemas, thereby enabling efficient long-horizon video understanding through hierarchical storage and entropy-driven retrieval.

Niu Lian, Yuting Wang, Hanshu Yao + 5 more | 2026-03-03 | 💬 cs.CL

UltraStar: Semantic-Aware Star Graph Modeling for Echocardiography Navigation

To address the limitations of existing sequential models in handling noisy echocardiography probe trajectories, this paper proposes UltraStar, a semantic-aware star graph framework that reformulates navigation as anchor-based global localization by connecting the current view directly to representative historical keyframes, thereby achieving robust performance and better scalability on large-scale datasets.

Teng Wang, Haojun Jiang, Chenxi Li + 6 more | 2026-03-03 | 💻 cs

WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments

This paper introduces WildCross, a large-scale cross-modal benchmark featuring over 476K annotated RGB frames and synchronized lidar data designed to advance place recognition and metric depth estimation in unstructured natural environments where existing urban-focused datasets fall short.

Joshua Knights, Joseph Reid, Kaushik Roy + 3 more | 2026-03-03 | 💻 cs

SCATR: Mitigating New Instance Suppression in LiDAR-based Tracking-by-Attention via Second Chance Assignment and Track Query Dropout

This paper presents SCATR, a novel LiDAR-based tracking-by-attention framework that mitigates new instance suppression and bridges the performance gap with detection-based methods through two architecture-agnostic training strategies: Second Chance Assignment and Track Query Dropout, achieving state-of-the-art results on the nuScenes benchmark.

Brian Cheong, Letian Wang, Sandro Papais + 1 more | 2026-03-03 | 💻 cs

ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models

The paper proposes ATA, a novel training-free, plug-and-play framework that enhances Vision-Language-Action models by introducing implicit reasoning through complementary attention-guided and action-guided strategies, thereby improving task success and robustness without the need for additional annotations or retraining.

Cheng Yang, Jianhao Jiao, Lingyi Huang + 8 more | 2026-03-03 | 🤖 cs.AI

Radiometrically Consistent Gaussian Surfels for Inverse Rendering

This paper introduces RadioGS, a novel inverse rendering framework that leverages a radiometric consistency constraint and Gaussian surfels to accurately disentangle material properties from complex global illumination effects, enabling efficient relighting and superior performance over existing Gaussian-based methods.

Kyu Beom Han, Jaeyoon Kim, Woo Jae Kim + 2 more | 2026-03-03 | 💻 cs

PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval

This paper introduces PhotoBench, the first benchmark constructed from authentic personal albums, which aims to shift photo retrieval from simple visual matching to complex, intent-driven reasoning. It exposes critical limitations in current unified embedding and agentic systems regarding non-visual constraints and multi-source fusion.

Tianyi Xu, Rong Shan, Junjie Wu + 11 more | 2026-03-03 | 🤖 cs.AI

Rate-Distortion Signatures of Generalization and Information Trade-offs

This paper introduces a rate-distortion-theoretic framework that characterizes the generalization trade-offs of human and machine vision systems using geometric signatures of slope and curvature, revealing that while both follow a common lossy-compression principle, humans exhibit smoother and more flexible trade-offs compared to the steeper, more brittle regimes of modern deep networks.

Leyla Roksan Caglar, Pedro A. M. Mediano, Baihan Lin | 2026-03-03 | 🧬 q-bio