cs.CV papers | Gist.Science

CityGuard: Graph-Aware Private Descriptors for Bias-Resilient Identity Search Across Urban Cameras

CityGuard is a privacy-preserving, graph-aware transformer framework that enables robust, bias-resilient person re-identification across distributed urban cameras by integrating dispersion-adaptive metric learning, spatially conditioned attention for coarse geometric alignment, and differentially private embeddings to balance retrieval accuracy with data protection.

Rong Fu, Yibo Meng, Jia Yee Tan + 5 more | 2026-03-06 | cs

CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis

The paper introduces CARE, a molecular-guided foundation model that utilizes a two-stage pretraining strategy to automatically partition whole slide images into biologically relevant, adaptive regions, achieving superior performance across diverse pathology tasks with significantly less pretraining data than existing models.

Di Zhang, Zhangpeng Gong, Xiaobo Pang + 14 more | 2026-03-06 | cs

When LoRA Betrays: Backdooring Text-to-Image Models by Masquerading as Benign Adapters

This paper introduces MasqLoRA, the first systematic framework that exploits the modular nature of Low-Rank Adaptation (LoRA) to stealthily inject backdoors into text-to-image diffusion models, enabling attackers to trigger malicious visual outputs via specific textual prompts while maintaining benign behavior otherwise.

Liangwei Lyu, Jiaqi Xu, Jianwei Ding + 1 more | 2026-03-06 | cs

RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations

The paper introduces RobustVisRAG, a causality-guided dual-path framework that disentangles visual semantics from degradation factors to significantly improve retrieval and generation performance under various visual distortions, validated by a new large-scale benchmark and substantial experimental gains.

I-Hsiang Chen, Yu-Wei Liu, Tse-Yu Wu + 3 more | 2026-03-06 | cs

Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos

This paper proposes LFG, a label-free, teacher-guided framework that leverages unposed, in-the-wild ego-centric videos to pretrain a unified pseudo-4D representation for autonomous driving, achieving state-of-the-art planning performance on the NAVSIM benchmark using only a single monocular camera without relying on poses, labels, or LiDAR.

Matthew Strong, Wei-Jer Chang, Quentin Herau + 4 more | 2026-03-06 | cs

Diffusion Probe: Generated Image Result Prediction Using CNN Probes

Diffusion Probe is a model-agnostic framework that leverages statistical properties of early-stage cross-attention maps in text-to-image diffusion models to accurately predict final image quality, thereby enabling efficient early termination of low-potential generations and reducing computational overhead.

Benlei Cui, Bukun Huang, Zhizeng Ye + 7 more | 2026-03-06 | cs

DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer

DiffusionHarmonizer is an online, single-step generative framework that leverages a custom data curation pipeline to transform imperfect neural reconstruction renderings into temporally consistent, photorealistic simulations, effectively resolving artifacts and harmonizing inserted dynamic objects for autonomous robot development.

Yuxuan Zhang, Katarína Tóthová, Zian Wang + 7 more | 2026-03-06 | cs

UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images

UFO-4D introduces a unified feedforward framework that reconstructs dense, explicit 4D representations from just two unposed images by jointly estimating dynamic 3D Gaussian splats, geometry, motion, and camera pose through a self-supervised approach that leverages shared geometric primitives to significantly outperform prior methods.

Junhwa Hur, Charles Herrmann, Songyou Peng + 4 more | 2026-03-06 | cs

Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design

Dr. Seg challenges the assumption that language-based GRPO training transfers seamlessly to visual perception by introducing a plug-and-play framework with a Look-to-Confirm mechanism and Distribution-Ranked Reward module that significantly enhances performance in complex visual scenarios without requiring architectural modifications.

Haoxiang Sun, Tao Wang, Chenwei Tang + 2 more | 2026-03-06 | cs

AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution

This paper proposes AlignVAR, a globally consistent visual autoregressive framework for image super-resolution that overcomes locality bias and error accumulation through Spatial Consistency Autoregression and Hierarchical Consistency Constraint, achieving superior structural coherence and perceptual fidelity with significantly faster inference and fewer parameters than diffusion-based methods.

Cencen Liu, Dongyang Zhang, Wen Yin + 6 more | 2026-03-06 | cs

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

The paper introduces SOLACE, a fully unsupervised post-training framework for text-to-image generation that leverages intrinsic self-confidence signals derived from noise recovery to optimize model performance without external reward models or annotated datasets.

Seungwook Kim, Minsu Cho | 2026-03-06 | cs

Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving

Dr.Occ is a 3D semantic occupancy prediction framework for autonomous driving that leverages a depth-guided view transformer for precise geometric alignment and a region-guided expert transformer to address spatial class imbalance, achieving significant performance improvements over existing vision-only baselines on the Occ3D-nuScenes benchmark.

Xubo Zhu, Haoyang Zhang, Fei He + 4 more | 2026-03-06 | cs

FreeAct: Freeing Activations for LLM Quantization

FreeAct is a novel quantization framework that improves Large Language Model performance by relaxing rigid one-to-one transformation constraints to dynamically allocate token-specific activation transformations, thereby addressing the distinct distribution patterns in diffusion and multimodal models.

Xiaohao Liu, Xiaobo Xia, Manyi Zhang + 6 more | 2026-03-06 | cs

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Kiwi-Edit addresses the limitations of instruction-based video editing and the scarcity of reference-guided training data by introducing a scalable data generation pipeline to create the RefVIE dataset and a unified architecture that synergizes learnable queries with latent visual features to achieve state-of-the-art controllable video editing.

Yiqi Lin, Guoqiang Liang, Ziyun Zeng + 3 more | 2026-03-06 | cs

Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels

Track4World is a feedforward model that enables efficient, holistic 3D tracking of every pixel in a monocular video by leveraging a global 3D scene representation and a novel 3D correlation scheme to simultaneously estimate dense 2D/3D flows and reconstruct 4D dynamics.

Jiahao Lu, Jiayi Xu, Wenbo Hu + 5 more | 2026-03-06 | cs

Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation

The paper introduces PVT-GDLA, a linear-time decoder architecture featuring Gated Differential Linear Attention that combines noise-canceling kernel paths, adaptive gating, and local token mixing to achieve state-of-the-art, high-fidelity medical image segmentation with superior efficiency compared to existing CNN and Transformer baselines.

Hongbo Zheng, Afshin Bozorgpour, Dorit Merhof + 1 more | 2026-03-06 | cs

MultiShadow: Multi-Object Shadow Generation for Image Compositing via Diffusion Model

This paper introduces MultiShadow, a diffusion-based framework that leverages multimodal conditioning and attention mechanisms to generate physically plausible, geometrically consistent shadows for multiple foreground objects simultaneously, addressing a critical gap in existing single-object shadow generation methods.

Waqas Ahmed, Dean Diepeveen, Ferdous Sohel | 2026-03-06 | cs

IoUCert: Robustness Verification for Anchor-based Object Detectors

The paper introduces IoUCert, a novel formal verification framework that overcomes the challenges of non-linear coordinate transformations and IoU metrics to enable the first robustness verification of realistic, anchor-based object detection models like SSD and YOLO.

Benedikt Brückner, Alejandro J. Mercado, Yanghao Zhang, Panagiotis Kouvaros, Alessio Lomuscio | 2026-03-06 | cs.CR

DMD-augmented Unpaired Neural Schrödinger Bridge for Ultra-Low Field MRI Enhancement

This paper proposes a DMD-augmented Unpaired Neural Schrödinger Bridge framework that enhances ultra-low-field (64 mT) MRI by translating unpaired 64 mT scans to 3 T quality, using diffusion-guided distribution matching and anatomical structure preservation to achieve superior realism and structural fidelity.

Youngmin Kim, Jaeyun Shin, Jeongchan Kim + 5 more | 2026-03-06 | cs

TumorFlow: Physics-Guided Longitudinal MRI Synthesis of Glioblastoma Growth

The paper presents TumorFlow, a biophysically-conditioned generative framework that synthesizes realistic, patient-specific 3D longitudinal MRI sequences of glioblastoma growth by integrating tumor-infiltration maps with mechanistic growth models to enable controllable progression visualization and synthetic data generation.

Valentin Biller, Niklas Bubeck, Lucas Zimmer + 6 more | 2026-03-06 | cs