When to Lock Attention: Training-Free KV Control in Video Diffusion

KV-Lock is a training-free framework for DiT-based video diffusion models that dynamically adjusts background key-value locking and classifier-free guidance scales based on hallucination detection to simultaneously enhance foreground quality and maintain background consistency.

Tianyi Zeng, Jincheng Gao, Tianyi Wang, Zijie Meng, Miao Zhang, Jun Yin, Haoyuan Sun, Junfeng Jiao, Christian Claudel, Junbo Tan, Xueqian Wang · 2026-03-11 · 🤖 cs.AI
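To make the "key-value locking" idea concrete: a minimal sketch of attention in which tokens flagged as background draw their keys and values from a cached reference pass instead of the current denoising step. This is an illustration of the general mechanism only, not the paper's implementation; the function name, the boolean `bg_mask`, and the cached `k_ref`/`v_ref` tensors are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kv_locked_attention(q, k, v, k_ref, v_ref, bg_mask):
    """Single-head attention where background tokens' keys/values are
    locked to cached reference values.

    q, k, v        : (T, d) current-step queries, keys, values
    k_ref, v_ref   : (T, d) cached keys/values from a reference pass
    bg_mask        : (T,) bool, True = background token (locked)
    """
    # Swap in the cached K/V wherever the mask marks background.
    k = np.where(bg_mask[:, None], k_ref, k)
    v = np.where(bg_mask[:, None], v_ref, v)
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return attn @ v
```

With an all-False mask this reduces to ordinary attention on the current step; with an all-True mask it attends entirely over the locked reference, which is the "background frozen" extreme.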

DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics

The paper presents DiffWind, a physics-informed differentiable framework that unifies wind-object interaction modeling, video-based reconstruction, and forward simulation by combining 3D Gaussian Splatting, the Material Point Method, and Lattice Boltzmann constraints to accurately recover and simulate wind-driven object dynamics from video observations.

Yuanhang Lei, Boming Zhao, Zesong Yang, Xingxuan Li, Tao Cheng, Haocheng Peng, Ru Zhang, Yang Yang, Siyuan Huang, Yujun Shen, Ruizhen Hu, Hujun Bao, Zhaopeng Cui · 2026-03-11 · 💻 cs

AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering

This paper introduces AutoViVQA, a large-scale automatically constructed dataset for Vietnamese Visual Question Answering, and evaluates transformer-based multimodal models alongside various automatic metrics to assess their performance and alignment with human judgment in the Vietnamese context.

Nguyen Anh Tuong, Phan Ba Duc, Nguyen Trung Quoc, Tran Dac Thinh, Dang Duy Lan, Nguyen Quoc Thinh, Tung Le · 2026-03-11 · 🤖 cs.AI

TemporalDoRA: Temporal PEFT for Robust Surgical Video Question Answering

The paper introduces TemporalDoRA, a parameter-efficient fine-tuning method that integrates lightweight temporal attention into the low-rank adaptation branch of vision encoders to enhance robustness against linguistic variations in surgical video question answering, validated by a new colonoscopy dataset and improved Out-of-Template performance.

Luca Carlini, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi, Sophia Bano, Mobarak I. Hoque · 2026-03-11 · 💻 cs
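The summary's core idea, lightweight temporal attention inside a low-rank adapter branch, can be sketched as follows. This is a hypothetical illustration, not the paper's TemporalDoRA: the class name, the zero-initialized up-projection, and the frame-to-frame attention over the low-rank codes are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TemporalLowRankBranch:
    """Hypothetical adapter: a low-rank branch whose bottleneck mixes
    information across frames with a small attention step, added as a
    residual to the frozen backbone's per-frame features."""

    def __init__(self, d, r, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0.0, 0.02, (d, r))  # down-projection (trainable)
        self.B = np.zeros((r, d))               # up-projection, zero-init

    def __call__(self, x):
        """x: (T, d) per-frame features; returns (T, d)."""
        h = x @ self.A                                   # (T, r) low-rank codes
        attn = softmax(h @ h.T / np.sqrt(h.shape[-1]))   # frame-to-frame weights
        h = attn @ h                                     # temporal mixing
        return x + h @ self.B                            # residual adapter output
```

The zero-initialized `B` is the standard LoRA-style trick: at the start of fine-tuning the branch is an identity map, so the adapted model exactly reproduces the frozen backbone.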

TriFusion-SR: Joint Tri-Modal Medical Image Fusion and Super-Resolution

The paper proposes TriFusion-SR, a wavelet-guided conditional diffusion framework that jointly performs tri-modal medical image fusion and super-resolution by decomposing features into frequency bands and employing rectified wavelet features with adaptive spatial-frequency fusion to achieve state-of-the-art performance in resolution and perceptual quality.

Fayaz Ali Dharejo, Sharif S. M. A., Aiman Khalil, Nachiket Chaudhary, Rizwan Ali Naqvi, Radu Timofte · 2026-03-11 · 💻 cs

FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation

The paper proposes FrameDiT, a novel video generation architecture that introduces Matrix Attention to efficiently model global spatio-temporal dynamics by processing frames as matrices, thereby achieving state-of-the-art video quality and temporal coherence while maintaining computational efficiency comparable to local factorized attention.

Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen Tran · 2026-03-11 · 💻 cs

EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

This paper introduces EXPLORE-Bench, a benchmark derived from real first-person videos to evaluate the ability of multimodal large language models to perform long-horizon egocentric scene prediction, revealing significant performance gaps compared to humans and demonstrating that stepwise reasoning offers partial improvements at a computational cost.

Chengjun Yu, Xuhan Zhu, Chaoqun Du, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha · 2026-03-11 · 🤖 cs.AI

FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis

FetalAgents is a novel multi-agent system that dynamically orchestrates specialized vision experts to deliver robust, end-to-end fetal ultrasound analysis and structured clinical reporting across multiple tasks, outperforming existing specialized models and multimodal large language models.

Xiaotian Hu, Junwei Huang, Mingxuan Liu, Kasidit Anmahapong, Yifei Chen, Yitong Luo, Yiming Huang, Xuguang Bai, Zihan Li, Yi Liao, Haibo Qu, Qiyuan Tian · 2026-03-11 · 💻 cs

M²-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs

The paper introduces M²-Occ, a robust 3D semantic occupancy prediction framework that leverages a Multi-view Masked Reconstruction module and a Feature Memory Module to maintain geometric and semantic coherence under incomplete multi-camera inputs, significantly outperforming existing methods in scenarios with missing views.

Kaixin Lin, Kunyu Peng, Di Wen, Yufan Chen, Ruiping Liu, Kailun Yang · 2026-03-11 · ⚡ eess

Let's Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments

This paper introduces Step-Aware Contrastive Alignment (SACA), a framework for Vision-Language Navigation in Continuous Environments that uses a perception-grounded auditor to extract dense, step-level supervision from imperfect trajectories, mitigating both the compounding errors of supervised fine-tuning and the sparse rewards of reinforcement fine-tuning to achieve state-of-the-art performance.

Haoyuan Li, Rui Liu, Hehe Fan, Yi Yang · 2026-03-11 · 💻 cs

ENIGMA-360: An Ego-Exo Dataset for Human Behavior Understanding in Industrial Scenarios

This paper introduces ENIGMA-360, a publicly released, temporally synchronized ego-exo dataset containing 360 annotated procedural videos from real industrial scenarios to advance human behavior understanding and establish baselines for tasks like action segmentation and interaction detection.

Francesco Ragusa, Rosario Leonardi, Michele Mazzamuto, Daniele Di Mauro, Camillo Quattrocchi, Alessandro Passanisi, Irene D'Ambra, Antonino Furnari, Giovanni Maria Farinella · 2026-03-11 · 💻 cs

LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos

This paper introduces LAP, a novel procedure planning model that leverages a fine-tuned Vision Language Model to convert visual observations into distinctive text embeddings for a diffusion-based planner, achieving state-of-the-art performance on multiple benchmarks by effectively resolving visual ambiguities through language.

Lei Shi, Victor Aregbede, Andreas Persson, Martin Längkvist, Amy Loutfi, Stephanie Lowry · 2026-03-11 · 💻 cs

PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments

This paper introduces PanoAffordanceNet, a novel framework and the first high-quality dataset (360-AGD) designed to enable holistic affordance grounding in 360-degree indoor environments by addressing challenges like geometric distortion and semantic dispersion through distortion-aware calibration and multi-level constraints.

Guoliang Zhu, Wanjun Jia, Caoyang Shao, Yuheng Zhang, Zhiyong Li, Kailun Yang · 2026-03-11 · ⚡ eess

Ego: Embedding-Guided Personalization of Vision-Language Models

The paper proposes "Ego," an efficient personalization method for vision-language models that extracts visual tokens representing target concepts via internal attention mechanisms to serve as memory, enabling strong performance across single-concept, multi-concept, and video personalization tasks without requiring additional training stages or external modules.

Soroush Seifi, Simon Gardier, Vaggelis Dorovatas, Daniel Olmeda Reino, Rahaf Aljundi · 2026-03-11 · 🤖 cs.AI