SGIFormer: Semantic-guided and Geometric-enhanced Interleaving Transformer for 3D Instance Segmentation

This paper introduces SGIFormer, a novel 3D instance segmentation method that combines Semantic-guided Mix Query initialization with a Geometric-enhanced Interleaving Transformer decoder to overcome existing limitations in query initialization and scalability, achieving state-of-the-art performance on major benchmarks while balancing accuracy and efficiency.

Lei Yao, Yi Wang, Moyun Liu + 1 more · 2026-02-27 · cs

From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects

This paper proposes a framework that enhances Open Vocabulary Object Detection models for open-world settings by introducing Pseudo Unknown Embedding and Multi-Scale Contrastive Anchor Learning to identify and incrementally learn novel objects, thereby addressing limitations in detecting far-out-of-distribution items and reducing misclassifications while maintaining state-of-the-art performance.

Zizhao Li, Zhengkang Xiang, Joseph West + 1 more · 2026-02-27 · cs.AI

Enhancing Sketch Animation: Text-to-Video Diffusion Models with Temporal Consistency and Rigidity Constraints

This paper proposes a novel text-to-sketch-animation method that leverages a pre-trained text-to-video diffusion model guided by SDS loss, while introducing length-area regularization for temporal consistency and As-Rigid-As-Possible loss to preserve sketch topology, thereby outperforming state-of-the-art approaches in both quantitative and qualitative evaluations.

Gaurav Rai, Ojaswa Sharma · 2026-02-27 · cs

Diffusion or Non-Diffusion Adversarial Defenses: Rethinking the Relation between Classifier and Adversarial Purifier

This paper challenges the prevailing reliance on diffusion models for adversarial defense by demonstrating that non-diffusion purifiers can achieve superior robustness, transferability, and cross-dataset generalization; notably, a purifier trained only on CIFAR-10 outperforms ImageNet-trained diffusion models when evaluated on ImageNet.

Yuan-Chih Chen, Chun-Shien Lu · 2026-02-27 · cs

ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models

The paper introduces ViT-Linearizer, a cross-architecture distillation framework that transfers the rich representations of quadratic-complexity Vision Transformers into efficient linear-time recurrent models (such as Mamba) via activation matching and masked prediction, achieving competitive ImageNet accuracy while significantly reducing inference costs for high-resolution tasks.

Guoyizhe Wei, Rama Chellappa · 2026-02-27 · cs.AI

Reflectance Prediction-based Knowledge Distillation for Robust 3D Object Detection in Compressed Point Clouds

This paper proposes a Reflectance Prediction-based Knowledge Distillation (RPKD) framework that enhances 3D object detection robustness in low-bitrate compressed point clouds: reflectance is discarded during transmission, reconstructed at the receiver via a geometry-based prediction module, and a cross-source distillation strategy transfers knowledge from raw to compressed data.

Hao Jing, Anhong Wang, Yifan Zhang + 2 more · 2026-02-27 · cs