cs.CV papers | Gist.Science

Physics-Driven 3D Gaussian Rendering for Zero-Shot MRI Super-Resolution

This paper proposes a zero-shot MRI super-resolution framework that leverages explicit, physics-driven 3D Gaussian representations and parallel brick-based rasterization to achieve high-quality, efficient reconstruction without relying on costly paired training data.

Shuting Liu, Lei Zhang, Wei Huang, Zhao Zhang, Zizhou WangWed, 11 Ma💻 cs

Decoder-Free Distillation for Quantized Image Restoration

This paper introduces Quantization-aware Distilled Restoration (QDR), a novel framework that employs Decoder-Free Distillation and Learnable Magnitude Reweighting to overcome bottlenecks in quantization-aware training, enabling high-performance, edge-deployed image restoration models that achieve near-FP32 accuracy and significant speedups.

S. M. A. Sharif, Abdur Rehman, Seongwan Kim, Jaeho LeeWed, 11 Ma💻 cs

Grounding Synthetic Data Generation With Vision and Language Models

This paper proposes a vision-language grounded framework for interpretable synthetic data generation and evaluation in remote sensing, introducing the ARAS400k dataset which demonstrates that augmenting real data with synthetic images consistently outperforms real-data-only baselines in semantic segmentation and image captioning tasks.

Ümit Mert Ça\u{g}lar, Alptekin TemizelWed, 11 Ma🤖 cs.AI

X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models

This paper introduces X-GS, an extensible open framework that unifies 3D Gaussian Splatting with downstream multimodal models through a real-time, semantically enriched pipeline capable of processing unposed video streams for tasks like object detection and zero-shot captioning.

Yueen Ma, Irwin KingWed, 11 Ma💬 cs.CL

OTPL-VIO: Robust Visual-Inertial Odometry with Optimal Transport Line Association and Adaptive Uncertainty

This paper presents OTPL-VIO, a robust stereo visual-inertial odometry system that enhances performance in low-texture and illumination-challenging environments by employing a training-free deep descriptor with entropy-regularized optimal transport for line association and introducing adaptive uncertainty weighting to stabilize estimation.

Zikun Chen, Wentao Zhao, Yihe Niu, Tianchen Deng, Jingchuan WangWed, 11 Ma💻 cs

When to Lock Attention: Training-Free KV Control in Video Diffusion

KV-Lock is a training-free framework for DiT-based video diffusion models that dynamically adjusts background key-value locking and classifier-free guidance scales based on hallucination detection to simultaneously enhance foreground quality and maintain background consistency.

Tianyi Zeng, Jincheng Gao, Tianyi Wang, Zijie Meng, Miao Zhang, Jun Yin, Haoyuan Sun, Junfeng Jiao, Christian Claudel, Junbo Tan, Xueqian WangWed, 11 Ma🤖 cs.AI

DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics

The paper presents DiffWind, a physics-informed differentiable framework that unifies wind-object interaction modeling, video-based reconstruction, and forward simulation by combining 3D Gaussian Splatting, the Material Point Method, and Lattice Boltzmann constraints to accurately recover and simulate wind-driven object dynamics from video observations.

Yuanhang Lei, Boming Zhao, Zesong Yang, Xingxuan Li, Tao Cheng, Haocheng Peng, Ru Zhang, Yang Yang, Siyuan Huang, Yujun Shen, Ruizhen Hu, Hujun Bao, Zhaopeng CuiWed, 11 Ma💻 cs

VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM

VarSplat is an uncertainty-aware 3D Gaussian Splatting SLAM system that explicitly learns per-splat appearance variance to render differentiable uncertainty maps, thereby guiding tracking and optimization toward reliable regions to achieve robust pose estimation and high-fidelity reconstruction in challenging real-world environments.

Anh Thuan Tran, Jana KoseckaWed, 11 Ma💻 cs

Improving 3D Foot Motion Reconstruction in Markerless Monocular Human Motion Capture

This paper introduces FootMR, a method that refines 3D foot motion in markerless monocular human motion capture by lifting 2D keypoints to 3D using large-scale motion capture data and context-aware residual prediction, alongside the new MOOF dataset, to significantly outperform existing state-of-the-art approaches in accuracy and generalization.

Tom Wehrbein, Bodo RosenhahnWed, 11 Ma💻 cs

AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering

This paper introduces AutoViVQA, a large-scale automatically constructed dataset for Vietnamese Visual Question Answering, and evaluates transformer-based multimodal models alongside various automatic metrics to assess their performance and alignment with human judgment in the Vietnamese context.

Nguyen Anh Tuong, Phan Ba Duc, Nguyen Trung Quoc, Tran Dac Thinh, Dang Duy Lan, Nguyen Quoc Thinh, Tung LeWed, 11 Ma🤖 cs.AI

DRIFT: Dual-Representation Inter-Fusion Transformer for Automated Driving Perception with 4D Radar Point Clouds

This paper introduces DRIFT, a dual-path Transformer model that effectively fuses fine-grained local and coarse-grained global features from sparse 4D radar point clouds to achieve state-of-the-art performance in automated driving perception tasks like object detection and free road estimation.

Siqi Pei, Andras Palffy, Dariu M. GavrilaWed, 11 Ma💻 cs

TemporalDoRA: Temporal PEFT for Robust Surgical Video Question Answering

The paper introduces TemporalDoRA, a parameter-efficient fine-tuning method that integrates lightweight temporal attention into the low-rank adaptation branch of vision encoders to enhance robustness against linguistic variations in surgical video question answering, validated by a new colonoscopy dataset and improved Out-of-Template performance.

Luca Carlini, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi, Sophia Bano, Mobarak I. HoqueWed, 11 Ma💻 cs

TriFusion-SR: Joint Tri-Modal Medical Image Fusion and SR

The paper proposes TriFusion-SR, a wavelet-guided conditional diffusion framework that jointly performs tri-modal medical image fusion and super-resolution by decomposing features into frequency bands and employing rectified wavelet features with adaptive spatial-frequency fusion to achieve state-of-the-art performance in resolution and perceptual quality.

Fayaz Ali Dharejo, Sharif S. M. A., Aiman Khalil, Nachiket Chaudhary, Rizwan Ali Naqvi, Radu TimofteWed, 11 Ma💻 cs

ProGS: Towards Progressive Coding for 3D Gaussian Splatting

ProGS introduces a novel streaming-friendly codec that organizes 3D Gaussian Splatting data into an octree structure with mutual information enhancement, achieving a 45-fold reduction in file size and over 10% visual improvement while enabling progressive coding for varying bandwidth conditions.

Zhiye Tang, Lingzhuo Liu, Shengjie Jiao, Qiudan Zhang, Junhui Hou, You Yang, Xu WangWed, 11 Ma💻 cs

GSStream: 3D Gaussian Splatting based Volumetric Scene Streaming System

This paper introduces GSStream, a novel volumetric scene streaming system for 3D Gaussian Splatting that leverages collaborative viewport prediction and deep reinforcement learning-based bitrate adaptation to overcome bandwidth challenges and deliver high-quality, real-time immersive experiences.

Zhiye Tang, Qiudan Zhang, Lei Zhang, Junhui Hou, You Yang, Xu WangWed, 11 Ma💻 cs

FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation

The paper proposes FrameDiT, a novel video generation architecture that introduces Matrix Attention to efficiently model global spatio-temporal dynamics by processing frames as matrices, thereby achieving state-of-the-art video quality and temporal coherence while maintaining computational efficiency comparable to local factorized attention.

Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen TranWed, 11 Ma💻 cs

EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

This paper introduces EXPLORE-Bench, a benchmark derived from real first-person videos to evaluate the ability of multimodal large language models to perform long-horizon egocentric scene prediction, revealing significant performance gaps compared to humans and demonstrating that stepwise reasoning offers partial improvements at a computational cost.

Chengjun Yu, Xuhan Zhu, Chaoqun Du, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun ZhaWed, 11 Ma🤖 cs.AI

FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis

FetalAgents is a novel multi-agent system that dynamically orchestrates specialized vision experts to deliver robust, end-to-end fetal ultrasound analysis and structured clinical reporting across multiple tasks, outperforming existing specialized models and multimodal large language models.

Xiaotian Hu, Junwei Huang, Mingxuan Liu, Kasidit Anmahapong, Yifei Chen, Yitong Luo, Yiming Huang, Xuguang Bai, Zihan Li, Yi Liao, Haibo Qu, Qiyuan TianWed, 11 Ma💻 cs

$M^2$ -Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs

The paper introduces $M^2$ -Occ, a robust 3D semantic occupancy prediction framework that leverages a Multi-view Masked Reconstruction module and a Feature Memory Module to maintain geometric and semantic coherence under incomplete multi-camera inputs, significantly outperforming existing methods in scenarios with missing views.

Kaixin Lin, Kunyu Peng, Di Wen, Yufan Chen, Ruiping Liu, Kailun YangWed, 11 Ma⚡ eess

Let's Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments

This paper introduces Step-Aware Contrastive Alignment (SACA), a novel framework that enhances Vision-Language Navigation in Continuous Environments by utilizing a perception-grounded auditor to extract dense, step-level supervision from imperfect trajectories, thereby overcoming the limitations of compounding errors in supervised fine-tuning and sparse rewards in reinforcement fine-tuning to achieve state-of-the-art performance.

Haoyuan Li, Rui Liu, Hehe Fan, Yi YangWed, 11 Ma💻 cs

← Previous Next →

cs.CV