cs.CV papers | Gist.Science

LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention

The paper introduces LAMM-ViT, a novel Vision Transformer that enhances AI face detection by integrating Region-Guided Multi-Head Attention with dynamic Layer-aware Mask Modulation to capture hierarchical structural inconsistencies across diverse generative models, achieving state-of-the-art generalization performance.

Jiangling Zhang, Weijie Zhu, Jirui Huang + 1 more | 2026-02-27 | cs

Reflectance Prediction-based Knowledge Distillation for Robust 3D Object Detection in Compressed Point Clouds

This paper proposes a Reflectance Prediction-based Knowledge Distillation (RPKD) framework that enhances 3D object detection robustness in low-bitrate compressed point clouds by discarding reflectance during transmission, reconstructing it via a geometry-based prediction module, and utilizing a cross-source distillation strategy to transfer knowledge from raw to compressed data.

Hao Jing, Anhong Wang, Yifan Zhang + 2 more | 2026-02-27 | cs

Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation

BriGeS is a resource-efficient method for generalized monocular depth estimation that fuses geometric and semantic foundation models via a trainable Bridging Gate and Attention Temperature Scaling to achieve state-of-the-art performance in complex scenes.

Sanggyun Ma, Wonjoon Choi, Jihun Park + 4 more | 2026-02-27 | cs

Sparse Imagination for Efficient Visual World Model Planning

This paper proposes "Sparse Imagination," a transformer-based visual world model planning method that uses a randomized grouped attention strategy to dynamically reduce the number of tokens processed during latent rollout, significantly accelerating inference while maintaining high control fidelity for real-time robotic applications.

Junha Chun, Youngjoon Jeong, Taesup Kim | 2026-02-27 | cs.AI

LinGuinE: Longitudinal Guidance Estimation for Volumetric Tumour Segmentation

LinGuinE is a novel, training-free PyTorch framework that achieves state-of-the-art longitudinal volumetric tumour segmentation and lesion tracking across multiple datasets by combining image registration with guided segmentation from a single radiologist prompt, enabling flexible, direction-agnostic analysis without requiring longitudinal data training.

Nadine Garibli, Mayank Patwari, Bence Csiba + 2 more | 2026-02-27 | eess

Human-Guided Shade Artifact Suppression in CBCT-to-MDCT Translation via Schrödinger Bridge with Conditional Diffusion

This paper proposes a novel human-guided framework for CBCT-to-MDCT translation that leverages a Schrödinger Bridge formulation with conditional diffusion and classifier-free guidance to effectively suppress shade artifacts while preserving anatomical fidelity and aligning with clinical preferences through iterative human feedback.

Sung Ho Kang, Hyun-Cheol Park | 2026-02-27 | cs

Is Exchangeability better than I.I.D to handle Data Distribution Shifts while Pooling Data for Data-scarce Medical image segmentation?

This paper addresses the "Data Addition Dilemma" in medical image segmentation by proposing an exchangeability-based framework that controls foreground-background feature discrepancies across deep network layers, achieving state-of-the-art performance on five datasets including a novel curated ultrasound collection.

Ayush Roy, Samin Enam, Jun Xia + 2 more | 2026-02-27 | cs.LG

LayerT2V: A Unified Multi-Layer Video Generation Framework

LayerT2V is a unified framework that generates semantically consistent, editable multi-layer videos (including background, foregrounds, and alpha mattes) in a single inference pass by leveraging temporal serialization within a shared DiT backbone, supported by the new large-scale VidLayer dataset.

Guangzhao Li, Kangrui Cen, Baixuan Zhao + 5 more | 2026-02-27 | cs.AI

RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer

RAP is a unified framework that enables real-time, high-quality audio-driven portrait animation by introducing a hybrid attention mechanism for fine-grained audio control and a static-dynamic training-inference paradigm to overcome the limitations of compressed latent representations.

Fangyu Du, Taiqing Li, Qian Qiao + 7 more | 2026-02-27 | eess

Adaptive Hybrid Caching for Efficient Text-to-Video Diffusion Model Acceleration

This paper proposes MixCache, a training-free framework that accelerates video DiT inference by employing a context-aware triggering mechanism and an adaptive hybrid strategy to dynamically select optimal caching granularities, thereby significantly improving both generation speed and quality.

Yuanxin Wei, Lansong Diao, Bujiao Chen + 6 more | 2026-02-27 | cs.LG

Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP

This paper introduces Dyslexify, a training-free defense mechanism that selectively ablates specific attention heads in CLIP vision encoders to neutralize typographic attacks, significantly improving robustness against text-based manipulations while preserving standard recognition accuracy.

Lorenz Hufe, Constantin Venhoff, Erblina Purelku + 3 more | 2026-02-27 | cs.AI

Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios

This paper addresses the limitations of current risk-oriented methods in constructing multimodal safety datasets by proposing a novel image-oriented self-adaptive pipeline that automatically generates a 35k real-world safety dataset and introduces a standardized evaluation metric to validate its effectiveness across various tasks.

Jingen Qu, Lijun Li, Bo Zhang + 2 more | 2026-02-27 | cs.CL

Loc$^2$: Interpretable Cross-View Localization via Depth-Lifted Local Feature Matching

This paper proposes Loc$^2$, an interpretable and lightweight cross-view localization method that estimates ground-level camera pose by learning direct ground-aerial feature correspondences, lifting them to bird's-eye-view space via monocular depth, and applying scale-aware Procrustes alignment without requiring pixel-level annotations.

Zimin Xia, Chenghao Xu, Alexandre Alahi | 2026-02-27 | cs

ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting

This paper proposes ST-GS, a novel framework that enhances vision-based 3D semantic occupancy prediction for autonomous driving by introducing a guidance-informed spatial aggregation strategy and a geometry-aware temporal fusion scheme to achieve state-of-the-art performance and superior temporal consistency on the nuScenes benchmark.

Xiaoyang Yan, Muleilan Pei, Shaojie Shen | 2026-02-27 | cs

Visual Instruction Pretraining for Domain-Specific Foundation Models

This paper introduces Visual Instruction Pretraining (ViTP), a novel paradigm that leverages high-level reasoning to enhance low-level perceptual features through end-to-end pretraining of a Vision Transformer within a Vision-Language Model, achieving state-of-the-art performance across diverse remote sensing and medical imaging benchmarks.

Yuxuan Li, Yicheng Zhang, Wenhao Tang + 4 more | 2026-02-27 | cs

PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data

PartSAM is the first promptable 3D part segmentation model trained natively on a large-scale dataset of over five million shape-part pairs, utilizing a triplane-based encoder-decoder architecture to achieve superior open-world generalization and accurate decomposition of both surface and internal structures compared to existing 2D-transfer methods.

Zhe Zhu, Le Wan, Rui Xu + 6 more | 2026-02-27 | cs

Secure and reversible face anonymization with diffusion models

This paper introduces the first diffusion-based framework for secure and reversible face anonymization that utilizes secret-key conditioning to enable high-quality identity protection and authorized reconstruction while preventing unauthorized de-anonymization.

Pol Labarbarie, Vincent Itier, William Puech | 2026-02-27 | cs.LG

Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation

This paper proposes Asynchronous Denoising Diffusion Models, a novel framework that assigns distinct timesteps to individual pixels to enable prompt-related regions to leverage clearer contextual information from unrelated areas, thereby significantly improving text-to-image alignment.

Zijing Hu, Yunze Tong, Fengda Zhang + 3 more | 2026-02-27 | cs

Detection and Measurement of Hailstones with Multimodal Large Language Models

This study demonstrates that pre-trained multimodal large language models, particularly when enhanced with two-stage prompting strategies that leverage reference objects, can detect hailstones and measure their diameters from crowdsourced social media images with an average error of 1.12 cm, offering a promising complement to traditional hail sensors for rapid severe weather assessment.

Moritz Alker, David C. Schedl, Andreas Stöckl | 2026-02-27 | cs.AI

Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

The paper proposes FlowRVS, a novel one-stage generative framework that reformulates Referring Video Object Segmentation as a language-guided continuous flow deformation problem, leveraging pretrained text-to-video models to achieve state-of-the-art performance by directly mapping video representations to target masks while overcoming the limitations of traditional cascaded approaches.

Zanyi Wang, Dengyang Jiang, Liuzhuozheng Li + 6 more | 2026-02-27 | cs
