cs.CV papers | Gist.Science

Seeing Through Uncertainty: A Free-Energy Approach for Real-Time Perceptual Adaptation in Robust Visual Navigation

This paper introduces FEP-Nav, a biologically-inspired framework that enables robust real-time visual navigation by minimizing Variational Free Energy through a dual-mechanism architecture of top-down decoding and adaptive normalization, allowing autonomous agents to maintain performance under noisy and shifting sensory conditions without gradient-based updates.

Maytus Piriyajitakonkij, Rishabh Dev Yadav, Mingfei Sun + 2 more | 2026-03-06 | 💻 cs

InstructHumans: Editing Animated 3D Human Textures with Instructions

The paper presents InstructHumans, a novel framework that introduces a modified Score Distillation Sampling for Editing (SDS-E) combined with spatial regularization and gradient-based sampling to enable instruction-driven 3D human texture editing that maintains high fidelity and consistency with the original avatar.

Jiayin Zhu, Linlin Yang, Angela Yao | 2026-03-06 | 💻 cs

EasyAnimate: High-Performance Video Generation Framework with Hybrid Windows Attention and Reward Backpropagation

EasyAnimate is a high-performance video generation framework that leverages diffusion transformers enhanced by Hybrid Window Attention for improved efficiency, reward backpropagation for better quality alignment, and additional optimizations like token-length training and multimodal text encoding to achieve state-of-the-art results.

Jiaqi Xu, Kunzhe Huang, Xinyi Zou + 5 more | 2026-03-06 | 💻 cs

Motion-Aware Animatable Gaussian Avatars Deblurring

This paper presents a novel method for directly reconstructing sharp 3D human Gaussian avatars from blurry multi-view videos by leveraging a 3D-aware physics-based blur model and a motion model to jointly optimize avatar representation and motion parameters.

Muyao Niu, Yifan Zhan, Qingtian Zhu + 5 more | 2026-03-06 | 💻 cs

Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation

The paper introduces Track Anything Behind Everything (TABE), a zero-shot pipeline that leverages test-time fine-tuning of a pretrained video diffusion model to perform amodal video object segmentation using only a single visible query mask, eliminating the need for class-specific pretraining or retraining.

Finlay G. C. Hudson, William A. P. Smith | 2026-03-06 | 💻 cs

Learnable Sparsity for Vision Generative Models

This paper proposes a model-agnostic, retraining-free structural pruning framework for diffusion models that utilizes a learnable differentiable mask and a novel end-to-end objective with time step gradient checkpointing to achieve up to 20% parameter reduction in models like SDXL and FLUX while preserving performance and minimizing memory costs.

Yang Zhang, Er Jin, Wenzhong Liang + 5 more | 2026-03-06 | 💻 cs

Flatness Guided Test-Time Adaptation for Vision-Language Models

This paper proposes Flatness-Guided Adaptation (FGA), a novel framework for Vision-Language Models that unifies training and test-time procedures by leveraging sharpness-aware prompt tuning to identify flat minima and a sharpness-based sample selection strategy to align them with test data, thereby achieving superior performance with reduced computational overhead compared to existing test-time adaptation methods.

Aodi Li, Liansheng Zhuang, Xiao Long + 2 more | 2026-03-06 | 💻 cs

3D Dynamics-Aware Manipulation: Endowing Manipulation Policies with 3D Foresight

This paper introduces a 3D dynamics-aware manipulation framework that enhances policy performance by integrating 3D world modeling through self-supervised depth and flow prediction tasks, thereby providing policies with crucial 3D foresight for robust depth-wise manipulation without compromising inference speed.

Yuxin He, Ruihao Zhang, Xianzu Wu + 3 more | 2026-03-06 | 💻 cs

MedFuncta: A Unified Framework for Learning Efficient Medical Neural Fields

This paper introduces MedFuncta, a unified meta-learning framework that encodes diverse medical images into compact 1D latent vectors to train shared, continuous neural fields at scale, while optimizing training efficiency through sparse supervision and a novel frequency schedule, and releases the accompanying MedNF dataset with over 500,000 latent vectors to advance large-scale medical neural field research.

Paul Friedrich, Florentin Bieder, Julian McGinnis + 3 more | 2026-03-06 | 💻 cs

RapidPoseTriangulation: Multi-view Multi-person Whole-body Human Pose Triangulation in a Millisecond

This paper introduces RapidPoseTriangulation, a publicly available algorithm that enables fast, generalizable, and whole-body multi-person pose triangulation across multiple views in milliseconds.

Daniel Bermuth, Alexander Poeppel, Wolfgang Reif | 2026-03-06 | 💻 cs

Noise2Ghost: Self-supervised deep convolutional reconstruction for ghost imaging

The paper introduces Noise2Ghost, a self-supervised deep learning method that achieves superior noise reduction and reconstruction quality in ghost imaging without requiring clean reference data, thereby enabling high-quality imaging in low-light scenarios such as dose-sensitive x-ray fluorescence and biological studies.

Mathieu Manni, Dmitry Karpov, K. Joost Batenburg + 2 more | 2026-03-06 | 🔬 physics

Collaborative Learning of Local 3D Occupancy Prediction and Versatile Global Occupancy Mapping

This paper proposes LMPOcc, a plug-and-play framework that leverages a lightweight fusion module to integrate global occupancy priors into local 3D semantic prediction while simultaneously updating global maps via multi-vehicle crowdsourcing, thereby achieving state-of-the-art performance and enabling scalable, open-vocabulary 3D scene understanding.

Shanshuai Yuan, Julong Wei, Muer Tie + 3 more | 2026-03-06 | 💻 cs

PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing

PhysLLM is a novel framework that enhances remote photoplethysmography (rPPG) by synergizing Large Language Models with domain-specific components, utilizing Text Prototype Guidance and a Dual-Domain Stationary algorithm to achieve state-of-the-art accuracy and robustness against illumination changes and motion artifacts.

Yiping Xie, Bo Zhao, Mingtong Dai + 6 more | 2026-03-06 | 💻 cs

ReactDance: Hierarchical Representation for High-Fidelity and Coherent Long-Form Reactive Dance Generation

ReactDance is a novel diffusion framework that achieves high-fidelity, coherent long-form reactive dance generation by employing Hierarchical Finite Scalar Quantization for fine-grained spatial control and a Blockwise Local Context strategy for efficient, temporally consistent sequence synthesis.

Jingzhong Lin, Xinru Li, Yuanyuan Qi + 8 more | 2026-03-06 | 💻 cs

RESAR-BEV: An Explainable Progressive Residual Autoregressive Approach for Camera-Radar Fusion in BEV Segmentation

RESAR-BEV is an explainable, progressive residual autoregressive framework for camera-radar fusion in Bird's-Eye-View segmentation that achieves state-of-the-art performance (54.0% mIoU) and real-time speed (14.6 FPS) on the nuScenes dataset by employing a coarse-to-fine Drive-Transformer and Modifier-Transformer architecture, robust dual-path voxel encoding, and decoupled supervision to overcome multi-modal misalignment and sensor noise.

Zhiwen Zeng, Yunfei Yin, Zheng Yuan + 2 more | 2026-03-06 | 💻 cs

DHECA-SuperGaze: Dual Head-Eye Cross-Attention and Super-Resolution for Unconstrained Gaze Estimation

This paper introduces DHECA-SuperGaze, a deep learning framework that enhances unconstrained gaze estimation by integrating super-resolution for low-quality images and a dual head-eye cross-attention module to model head-eye interactions, while also correcting annotation errors in the Gaze360 dataset to achieve state-of-the-art accuracy and robust generalization.

Franko Šikić, Donik Vršnak, Sven Lončarić | 2026-03-06 | 💻 cs

OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation

This paper proposes OSPO, a self-improving framework that enhances fine-grained text-to-image alignment and reduces object hallucinations by autonomously constructing object-centric preference data and employing an object-weighted SimPO loss, outperforming both prior self-improving methods and specialized diffusion models.

Yoonjin Oh, Yongjin Kim, Hyomin Kim + 2 more | 2026-03-06 | 💻 cs

EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models

The paper proposes EDITOR, an effective and interpretable prompt inversion technique for text-to-image diffusion models that combines pre-trained captioning initialization, latent space refinement, and embedding-to-text conversion to outperform existing methods in image similarity, textual alignment, and generalizability while enabling diverse downstream applications.

Mingzhe Li, Kejing Xia, Gehao Zhang + 5 more | 2026-03-06 | 💻 cs

HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition

HypeVPR is a hierarchical embedding framework that leverages the intrinsic properties of hyperbolic space to effectively capture the hierarchical relationship between perspective and equirectangular views, enabling robust visual place recognition with improved retrieval speed and reduced storage requirements.

Suhan Woo, Seongwon Lee, Jinwoo Jang + 1 more | 2026-03-06 | 💻 cs

FLAIR-HUB: Large-scale Multimodal Dataset for Land Cover and Crop Mapping

IGN introduces FLAIR-HUB, a large-scale, multi-sensor dataset featuring 20 cm resolution annotations across 2,528 km² of France to address challenges in land cover and crop mapping, demonstrating that fusing diverse modalities significantly enhances deep learning model performance.

Anatol Garioud, Sébastien Giordano, Nicolas David + 1 more | 2026-03-06 | 💻 cs
