cs.CV papers | Gist.Science

MomentMix Augmentation with Length-Aware DETR for Temporally Robust Moment Retrieval

This paper proposes MomentMix, a data augmentation strategy combining ForegroundMix and BackgroundMix, and a Length-Aware Decoder to address feature diversity limitations and prediction biases, thereby significantly improving the localization accuracy of short moments in Video Moment Retrieval tasks.

Seojeong Park, Jiho Choi, Kyungjune Baek + 1 more | 2026-02-27 | 🤖 cs.AI

Joint Optimization for 4D Human-Scene Reconstruction in the Wild

This paper proposes JOSH, an optimization-based method that jointly reconstructs 4D human motion and surrounding scenes from monocular web videos by leveraging human-scene contact constraints, along with its efficient learning-based variant JOSH3R trained on pseudo-labels derived from JOSH.

Zhizheng Liu, Joe Lin, Wayne Wu + 1 more | 2026-02-27 | 💻 cs

Diffusion or Non-Diffusion Adversarial Defenses: Rethinking the Relation between Classifier and Adversarial Purifier

This paper challenges the prevailing reliance on diffusion models for adversarial defense by demonstrating that non-diffusion purifiers can achieve superior robustness, transferability, and cross-dataset generalization; notably, a purifier trained only on CIFAR-10 outperforms ImageNet-trained diffusion models when applied to ImageNet.

Yuan-Chih Chen, Chun-Shien Lu | 2026-02-27 | 💻 cs

Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation

This paper introduces Dual-IPO, an iterative framework that simultaneously and progressively optimizes both a CoT-guided reward model and a video generation model to enhance text-to-video synthesis quality and human preference alignment without requiring extensive manual annotations.

Xiaomeng Yang, Mengping Yang, Jia Gong + 3 more | 2026-02-27 | 🤖 cs.AI

RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers

The paper proposes RelaCtrl, a relevance-guided framework that optimizes control signal integration in Diffusion Transformers by dynamically tailoring layer configurations and introducing a Two-Dimensional Shuffle Mixer, achieving superior performance with only 15% of the parameters and computational complexity of PixArt-delta.

Ke Cao, Jing Wang, Ao Ma + 11 more | 2026-02-27 | 💻 cs

CLIP-Free, Label Free, Unsupervised Concept Bottleneck Models

This paper introduces U-F$^2$-CBM, a novel method that transforms any frozen visual classifier into an unsupervised, label-free, and CLIP-free Concept Bottleneck Model by aligning its class distribution with vision-language counterparts, thereby achieving state-of-the-art performance without requiring manual annotations or pre-trained CLIP models.

Fawaz Sammani, Jonas Fischer, Nikos Deligiannis | 2026-02-27 | 💻 cs

UniFuture: A 4D Driving World Model for Future Generation and Perception

UniFuture introduces a unified 4D driving world model that jointly generates future RGB images and depth maps through a dual-latent sharing scheme and multi-scale latent interaction, achieving superior performance in both dynamic scene forecasting and geometric perception compared to existing specialized models.

Dingkang Liang, Dingyuan Zhang, Xin Zhou + 7 more | 2026-02-27 | 💻 cs

GmNet: Revisiting Gating Mechanisms From A Frequency View

Inspired by the convolution theorem, this paper analyzes gating mechanisms from a frequency perspective to reveal their role in managing frequency responses, leading to the proposal of GmNet, a lightweight model that mitigates low-frequency bias and achieves high performance in image classification.

Yifan Wang, Xu Ma, Yitian Zhang + 5 more | 2026-02-27 | 💻 cs
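
As a rough illustration of the frequency view in the GmNet summary above: a gate is an elementwise product, and by the convolution theorem an elementwise product in the signal domain corresponds to a circular convolution of the two spectra (scaled by 1/N for an unnormalized DFT). The sketch below checks that identity numerically on a toy 1D signal. It is not the paper's code; the function names and the toy values are hypothetical and chosen only for illustration.

```python
import cmath


def dft(signal):
    """Naive unnormalized discrete Fourier transform of a 1D sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def circular_convolve(a, b):
    """Circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[m] * b[(k - m) % n] for m in range(n)) for k in range(n)]


def gated_spectrum_two_ways(features, gate):
    """Spectrum of the gated signal, computed directly and via the spectra."""
    n = len(features)
    # Route 1: gate elementwise, then transform.
    direct = dft([f * g for f, g in zip(features, gate)])
    # Route 2: transform both, convolve the spectra, scale by 1/N.
    via_conv = [c / n for c in circular_convolve(dft(features), dft(gate))]
    return direct, via_conv


# Toy inputs (hypothetical values, not taken from the paper).
features = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, -2.0, 1.5]
gate = [0.9, 0.1, 0.5, 0.7, 0.2, 0.8, 0.3, 0.6]

direct, via_conv = gated_spectrum_two_ways(features, gate)
max_gap = max(abs(d - c) for d, c in zip(direct, via_conv))
print(max_gap < 1e-9)
```

Because the gated spectrum is a convolution of the two spectra, a gate can move energy between frequency bands, which is one way to read the summary's claim that gating manages frequency responses.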

ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models

The paper introduces ViT-Linearizer, a cross-architecture distillation framework that transfers the rich representations of quadratic-complexity Vision Transformers into efficient linear-time recurrent models (such as Mamba) via activation matching and masked prediction, achieving competitive ImageNet accuracy while significantly reducing inference costs for high-resolution tasks.

Guoyizhe Wei, Rama Chellappa | 2026-02-27 | 🤖 cs.AI

LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention

The paper introduces LAMM-ViT, a novel Vision Transformer that enhances AI face detection by integrating Region-Guided Multi-Head Attention with dynamic Layer-aware Mask Modulation to capture hierarchical structural inconsistencies across diverse generative models, achieving state-of-the-art generalization performance.

Jiangling Zhang, Weijie Zhu, Jirui Huang + 1 more | 2026-02-27 | 💻 cs

Reflectance Prediction-based Knowledge Distillation for Robust 3D Object Detection in Compressed Point Clouds

This paper proposes a Reflectance Prediction-based Knowledge Distillation (RPKD) framework that enhances 3D object detection robustness in low-bitrate compressed point clouds by discarding reflectance during transmission, reconstructing it via a geometry-based prediction module, and utilizing a cross-source distillation strategy to transfer knowledge from raw to compressed data.

Hao Jing, Anhong Wang, Yifan Zhang + 2 more | 2026-02-27 | 💻 cs

Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation

BriGeS is a resource-efficient method for generalized monocular depth estimation that fuses geometric and semantic foundation models via a trainable Bridging Gate and Attention Temperature Scaling to achieve state-of-the-art performance in complex scenes.

Sanggyun Ma, Wonjoon Choi, Jihun Park + 4 more | 2026-02-27 | 💻 cs

Sparse Imagination for Efficient Visual World Model Planning

This paper proposes "Sparse Imagination," a transformer-based visual world model planning method that uses a randomized grouped attention strategy to dynamically reduce token processing during latent rollout, thereby significantly accelerating inference while maintaining high control fidelity for real-time robotic applications.

Junha Chun, Youngjoon Jeong, Taesup Kim | 2026-02-27 | 🤖 cs.AI

LinGuinE: Longitudinal Guidance Estimation for Volumetric Tumour Segmentation

LinGuinE is a novel, training-free PyTorch framework that achieves state-of-the-art longitudinal volumetric tumour segmentation and lesion tracking across multiple datasets by combining image registration with guided segmentation from a single radiologist prompt, enabling flexible, direction-agnostic analysis without requiring longitudinal data training.

Nadine Garibli, Mayank Patwari, Bence Csiba + 2 more | 2026-02-27 | ⚡ eess

Human-Guided Shade Artifact Suppression in CBCT-to-MDCT Translation via Schrödinger Bridge with Conditional Diffusion

This paper proposes a novel human-guided framework for CBCT-to-MDCT translation that leverages a Schrödinger Bridge formulation with conditional diffusion and classifier-free guidance to effectively suppress shade artifacts while preserving anatomical fidelity and aligning with clinical preferences through iterative human feedback.

Sung Ho Kang, Hyun-Cheol Park | 2026-02-27 | 💻 cs

Is Exchangeability better than I.I.D to handle Data Distribution Shifts while Pooling Data for Data-scarce Medical image segmentation?

This paper addresses the "Data Addition Dilemma" in medical image segmentation by proposing an exchangeability-based framework that controls foreground-background feature discrepancies across deep network layers, achieving state-of-the-art performance on five datasets including a novel curated ultrasound collection.

Ayush Roy, Samin Enam, Jun Xia + 2 more | 2026-02-27 | 🤖 cs.LG

LayerT2V: A Unified Multi-Layer Video Generation Framework

LayerT2V is a unified framework that generates semantically consistent, editable multi-layer videos (including background, foregrounds, and alpha mattes) in a single inference pass by leveraging temporal serialization within a shared DiT backbone, supported by the new large-scale VidLayer dataset.

Guangzhao Li, Kangrui Cen, Baixuan Zhao + 5 more | 2026-02-27 | 🤖 cs.AI

RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer

RAP is a unified framework that enables real-time, high-quality audio-driven portrait animation by introducing a hybrid attention mechanism for fine-grained audio control and a static-dynamic training-inference paradigm to overcome the limitations of compressed latent representations.

Fangyu Du, Taiqing Li, Qian Qiao + 7 more | 2026-02-27 | ⚡ eess

Adaptive Hybrid Caching for Efficient Text-to-Video Diffusion Model Acceleration

This paper proposes MixCache, a training-free framework that accelerates video DiT inference by employing a context-aware triggering mechanism and an adaptive hybrid strategy to dynamically select optimal caching granularities, thereby significantly improving both generation speed and quality.

Yuanxin Wei, Lansong Diao, Bujiao Chen + 6 more | 2026-02-27 | 🤖 cs.LG

Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP

This paper introduces Dyslexify, a training-free defense mechanism that selectively ablates specific attention heads in CLIP vision encoders to neutralize typographic attacks, significantly improving robustness against text-based manipulations while preserving standard recognition accuracy.

Lorenz Hufe, Constantin Venhoff, Erblina Purelku + 3 more | 2026-02-27 | 🤖 cs.AI
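
To make the idea of ablating attention heads in the Dyslexify summary above concrete: in standard multi-head attention, the layer output for a token is the sum of each head's output passed through that head's slice of the output projection, so removing a head amounts to dropping its term from the sum. The sketch below assumes zero-ablation, which may differ from the ablation the paper actually uses, and says nothing about how the heads are selected. The function name, shapes, and toy values are hypothetical, and the bias term is omitted.

```python
def combine_heads(head_outputs, out_proj, ablated=frozenset()):
    """Sum each head's projected contribution, skipping ablated heads.

    head_outputs: one vector of length d_head per head, for a single token.
    out_proj: one d_head x d_model matrix per head (that head's slice of
        the attention output projection).
    ablated: indices of heads to zero-ablate (an assumption of this sketch).
    """
    d_model = len(out_proj[0][0])
    result = [0.0] * d_model
    for head, (vec, weights) in enumerate(zip(head_outputs, out_proj)):
        if head in ablated:
            continue  # zero-ablation: this head contributes nothing
        for i, value in enumerate(vec):
            for j in range(d_model):
                result[j] += value * weights[i][j]
    return result


# Toy example with 2 heads, d_head = 2, d_model = 3 (hypothetical values).
head_outputs = [[1.0, 2.0], [3.0, -1.0]]
out_proj = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # slice for head 0
    [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]],  # slice for head 1
]

full = combine_heads(head_outputs, out_proj)
defended = combine_heads(head_outputs, out_proj, ablated={1})
print(full)      # [0.0, 1.0, 3.0]
print(defended)  # [1.0, 2.0, 0.0]
```

Since no weights are updated, an intervention of this kind is training-free, in line with the summary's description.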
