cs.CV papers | Gist.Science

Distractor-free Generalizable 3D Gaussian Splatting

This paper introduces DGGS, a novel framework that achieves distractor-free generalizable 3D Gaussian Splatting by employing a scene-agnostic mask prediction module during training and two-stage reference scoring together with a pruning mechanism during inference to ensure stable, high-quality reconstruction in unseen scenes.

Yanqi Bao, Jing Liao, Jing Huo + 1 more | 2026-02-27 | 💻 cs

From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects

This paper proposes a framework that enhances Open Vocabulary Object Detection models for open-world settings by introducing Pseudo Unknown Embedding and Multi-Scale Contrastive Anchor Learning to identify and incrementally learn novel objects, thereby addressing limitations in detecting far-out-of-distribution items and reducing misclassifications while maintaining state-of-the-art performance.

Zizhao Li, Zhengkang Xiang, Joseph West + 1 more | 2026-02-27 | 🤖 cs.AI

Enhancing Sketch Animation: Text-to-Video Diffusion Models with Temporal Consistency and Rigidity Constraints

This paper proposes a novel text-to-sketch-animation method that leverages a pre-trained text-to-video diffusion model guided by SDS loss, while introducing length-area regularization for temporal consistency and As-Rigid-As-Possible loss to preserve sketch topology, thereby outperforming state-of-the-art approaches in both quantitative and qualitative evaluations.

Gaurav Rai, Ojaswa Sharma | 2026-02-27 | 💻 cs

PPT: Pretraining with Pseudo-Labeled Trajectories for Motion Forecasting

The paper introduces PPT, a scalable pretraining framework that leverages automatically generated pseudo-labeled trajectories from off-the-shelf detectors to enhance motion forecasting models' performance and generalization, particularly in low-data and cross-domain scenarios, while reducing reliance on costly manual annotations.

Yihong Xu, Yuan Yin, Éloi Zablocki + 3 more | 2026-02-27 | 💻 cs

IV-tuning: Parameter-Efficient Transfer Learning for Infrared-Visible Tasks

The paper proposes IV-tuning, a parameter-efficient transfer learning framework that leverages pre-trained visual models with only 3% trainable backbone parameters to overcome the generalization limitations of full fine-tuning and achieve state-of-the-art performance across various infrared-visible tasks.

Yaming Zhang, Chenqiang Gao, Fangcen Liu + 4 more | 2026-02-27 | 💻 cs

MomentMix Augmentation with Length-Aware DETR for Temporally Robust Moment Retrieval

This paper proposes MomentMix, a data augmentation strategy combining ForegroundMix and BackgroundMix, and a Length-Aware Decoder to address feature diversity limitations and prediction biases, thereby significantly improving the localization accuracy of short moments in Video Moment Retrieval tasks.

Seojeong Park, Jiho Choi, Kyungjune Baek + 1 more | 2026-02-27 | 🤖 cs.AI

Joint Optimization for 4D Human-Scene Reconstruction in the Wild

This paper proposes JOSH, an optimization-based method that jointly reconstructs 4D human motion and surrounding scenes from monocular web videos by leveraging human-scene contact constraints, along with its efficient learning-based variant JOSH3R trained on pseudo-labels derived from JOSH.

Zhizheng Liu, Joe Lin, Wayne Wu + 1 more | 2026-02-27 | 💻 cs

Diffusion or Non-Diffusion Adversarial Defenses: Rethinking the Relation between Classifier and Adversarial Purifier

This paper challenges the prevailing reliance on diffusion models for adversarial defense by demonstrating that non-diffusion purifiers can achieve superior robustness, transferability, and cross-dataset generalization: notably, a purifier trained only on CIFAR-10 outperforms ImageNet-trained diffusion models when applied to ImageNet.

Yuan-Chih Chen, Chun-Shien Lu | 2026-02-27 | 💻 cs

Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation

This paper introduces Dual-IPO, an iterative framework that simultaneously and progressively optimizes both a CoT-guided reward model and a video generation model to enhance text-to-video synthesis quality and human preference alignment without requiring extensive manual annotations.

Xiaomeng Yang, Mengping Yang, Jia Gong + 3 more | 2026-02-27 | 🤖 cs.AI

RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers

The paper proposes RelaCtrl, a relevance-guided framework that optimizes control signal integration in Diffusion Transformers by dynamically tailoring layer configurations and introducing a Two-Dimensional Shuffle Mixer, achieving superior performance with only 15% of the parameters and computational complexity of PixArt-delta.

Ke Cao, Jing Wang, Ao Ma + 11 more | 2026-02-27 | 💻 cs

CLIP-Free, Label Free, Unsupervised Concept Bottleneck Models

This paper introduces U-F$^2$-CBM, a novel method that transforms any frozen visual classifier into an unsupervised, label-free, and CLIP-free Concept Bottleneck Model by aligning its class distribution with vision-language counterparts, thereby achieving state-of-the-art performance without requiring manual annotations or pre-trained CLIP models.

Fawaz Sammani, Jonas Fischer, Nikos Deligiannis | 2026-02-27 | 💻 cs

UniFuture: A 4D Driving World Model for Future Generation and Perception

UniFuture introduces a unified 4D driving world model that jointly generates future RGB images and depth maps through a dual-latent sharing scheme and multi-scale latent interaction, achieving superior performance in both dynamic scene forecasting and geometric perception compared to existing specialized models.

Dingkang Liang, Dingyuan Zhang, Xin Zhou + 7 more | 2026-02-27 | 💻 cs

GmNet: Revisiting Gating Mechanisms From A Frequency View

Inspired by the convolution theorem, this paper analyzes gating mechanisms from a frequency perspective to reveal their role in managing frequency responses, leading to the proposal of GmNet, a lightweight model that mitigates low-frequency bias and achieves high performance in image classification.

Yifan Wang, Xu Ma, Yitian Zhang + 5 more | 2026-02-27 | 💻 cs

ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models

The paper introduces ViT-Linearizer, a cross-architecture distillation framework that transfers the rich representations of quadratic-complexity Vision Transformers into efficient linear-time recurrent models (such as Mamba) via activation matching and masked prediction, achieving competitive ImageNet accuracy while significantly reducing inference costs for high-resolution tasks.

Guoyizhe Wei, Rama Chellappa | 2026-02-27 | 🤖 cs.AI

LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention

The paper introduces LAMM-ViT, a novel Vision Transformer that enhances AI face detection by integrating Region-Guided Multi-Head Attention with dynamic Layer-aware Mask Modulation to capture hierarchical structural inconsistencies across diverse generative models, achieving state-of-the-art generalization performance.

Jiangling Zhang, Weijie Zhu, Jirui Huang + 1 more | 2026-02-27 | 💻 cs

Reflectance Prediction-based Knowledge Distillation for Robust 3D Object Detection in Compressed Point Clouds

This paper proposes a Reflectance Prediction-based Knowledge Distillation (RPKD) framework that enhances 3D object detection robustness in low-bitrate compressed point clouds by discarding reflectance during transmission, reconstructing it via a geometry-based prediction module, and utilizing a cross-source distillation strategy to transfer knowledge from raw to compressed data.

Hao Jing, Anhong Wang, Yifan Zhang + 2 more | 2026-02-27 | 💻 cs

Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation

BriGeS is a resource-efficient method for generalized monocular depth estimation that fuses geometric and semantic foundation models via a trainable Bridging Gate and Attention Temperature Scaling to achieve state-of-the-art performance in complex scenes.

Sanggyun Ma, Wonjoon Choi, Jihun Park + 4 more | 2026-02-27 | 💻 cs

Sparse Imagination for Efficient Visual World Model Planning

This paper proposes "Sparse Imagination," a transformer-based visual world model planning method that uses a randomized grouped attention strategy to dynamically reduce the number of tokens processed during latent rollout, thereby significantly accelerating inference while maintaining high control fidelity for real-time robotic applications.

Junha Chun, Youngjoon Jeong, Taesup Kim | 2026-02-27 | 🤖 cs.AI

LinGuinE: Longitudinal Guidance Estimation for Volumetric Tumour Segmentation

LinGuinE is a novel, training-free PyTorch framework that achieves state-of-the-art longitudinal volumetric tumour segmentation and lesion tracking across multiple datasets by combining image registration with guided segmentation from a single radiologist prompt, enabling flexible, direction-agnostic analysis without requiring longitudinal data training.

Nadine Garibli, Mayank Patwari, Bence Csiba + 2 more | 2026-02-27 | ⚡ eess

Human-Guided Shade Artifact Suppression in CBCT-to-MDCT Translation via Schrödinger Bridge with Conditional Diffusion

This paper proposes a novel human-guided framework for CBCT-to-MDCT translation that leverages a Schrödinger Bridge formulation with conditional diffusion and classifier-free guidance to effectively suppress shade artifacts while preserving anatomical fidelity and aligning with clinical preferences through iterative human feedback.

Sung Ho Kang, Hyun-Cheol Park | 2026-02-27 | 💻 cs
