Efficient Domain-Adaptive Multi-Task Dense Prediction with Vision Foundation Models
This paper introduces FAMDA, a simple yet effective unsupervised domain adaptation framework for multi-task dense prediction. FAMDA leverages Vision Foundation Models (VFMs) as teachers within a self-training paradigm, using them to generate high-quality pseudo-labels on unlabeled target-domain data. These pseudo-labels supervise the training of highly efficient student networks, which achieve state-of-the-art multi-task dense prediction performance while remaining suitable for resource-constrained robotics applications.
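To make the self-training recipe concrete, the following is a minimal PyTorch sketch of the teacher-student loop the abstract describes; the module names (`teacher`, `student`), the two example tasks (segmentation and depth), and the confidence threshold are illustrative assumptions, not FAMDA's actual API.

```python
import torch
import torch.nn.functional as F

def train_student(teacher, student, target_loader, optimizer, device="cuda"):
    """Hypothetical self-training loop: a frozen VFM teacher pseudo-labels
    unlabeled target-domain images; a compact multi-task student learns
    from those pseudo-labels. Details here are assumptions for illustration."""
    teacher.eval()   # VFM teacher stays frozen; it only produces pseudo-labels
    student.train()
    for images in target_loader:          # unlabeled target-domain batch
        images = images.to(device)
        with torch.no_grad():
            # Teacher emits dense predictions for each task.
            seg_logits_t, depth_t = teacher(images)
            pseudo_seg = seg_logits_t.argmax(dim=1)           # (B, H, W) class map
            conf = seg_logits_t.softmax(dim=1).amax(dim=1)    # per-pixel confidence
            mask = conf > 0.9   # keep high-confidence pixels (assumed threshold)

        seg_logits_s, depth_s = student(images)
        # Multi-task loss: confidence-masked cross-entropy for segmentation,
        # L1 regression against the teacher's depth estimate.
        seg_loss = F.cross_entropy(seg_logits_s, pseudo_seg, reduction="none")
        seg_loss = (seg_loss * mask).sum() / mask.sum().clamp(min=1)
        depth_loss = F.l1_loss(depth_s, depth_t)
        loss = seg_loss + depth_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The design point the sketch illustrates is that the expensive VFM runs only in inference mode to produce supervision, so all gradient computation happens in the small student, keeping the deployed model cheap.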