VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

VideoChat-M1 introduces a novel multi-agent system for video understanding that employs a learnable Collaborative Policy Planning paradigm, where multiple agents dynamically generate, execute, and refine tool invocation strategies through interaction and multi-agent reinforcement learning to achieve state-of-the-art performance across diverse video benchmarks.

Boyu Chen, Zikang Wang, Zhengrong Yue + 9 more · 2026-03-05 · cs

Tracing 3D Anatomy in 2D Strokes: A Multi-Stage Projection Driven Approach to Cervical Spine Fracture Identification

This paper presents an automated, multi-stage pipeline that identifies cervical spine fractures by fusing orthogonal 2D segmentations to estimate 3D volumes of interest, which are then analyzed using a 2.5D CNN-Transformer ensemble to achieve diagnostic performance comparable to expert radiologists while reducing computational dimensionality.

Fabi Nahian Madhurja, Rusab Sarmun, Muhammad E. H. Chowdhury + 3 more · 2026-03-05 · cs.AI

When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance

This paper proposes Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies category-specific safety directions to resolve harmful conflicts in text-to-image diffusion models, thereby significantly reducing overall harmful output rates compared to existing methods.

Yongli Xiang, Ziming Hong, Zhaoqing Wang + 3 more · 2026-03-05 · cs

Momentum Memory for Knowledge Distillation in Computational Pathology

The paper proposes Momentum Memory Knowledge Distillation (MoMKD), a cross-modal framework that uses a momentum-updated memory to aggregate genomic and histopathology information across batches and decouples branch gradients, overcoming the limitations of batch-local alignment and enabling robust, generalizable cancer diagnosis from histology-only inference.
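The core of a momentum-updated memory is an exponential moving average that accumulates feature statistics across batches rather than aligning modalities within a single batch. A minimal sketch of that update rule, assuming a simple mean-feature memory and a decay coefficient `beta` (the function name, shapes, and hyperparameters here are illustrative assumptions, not MoMKD's actual implementation):

```python
import numpy as np

def momentum_update(memory, batch_features, beta=0.99):
    """EMA-style momentum update of a cross-batch memory bank.

    memory:         (d,) running memory vector
    batch_features: (b, d) features from the current batch
    beta:           decay; higher beta = slower, more stable memory
    """
    return beta * memory + (1.0 - beta) * batch_features.mean(axis=0)

# Toy run: memory drifts toward the running feature mean across batches.
memory = np.zeros(4)
for _ in range(100):
    batch = np.ones((8, 4))  # stand-in batch features
    memory = momentum_update(memory, batch)
```

Because the memory persists across batches, alignment targets are no longer limited to whatever samples happen to co-occur in one batch, which is the batch-local limitation the summary refers to.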

Yongxin Guo, Hao Lu, Onur C. Koyun + 3 more2026-03-05💻 cs

Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models

This paper introduces Spatial Credit Redistribution (SCR), a training-free inference-time method that mitigates hallucinations in Vision-Language Models by redistributing suppressed visual attention from dominant patches to their spatial neighbors, thereby significantly reducing hallucination rates across multiple benchmarks while preserving generation quality and maintaining negligible latency.
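The redistribution idea can be illustrated with a toy attention map over a patch grid: a fraction of the mass concentrated on dominant patches is moved to their 4-connected spatial neighbors, keeping total attention constant. This is a hedged sketch of the general mechanism only; the function name, `top_k`/`alpha` parameters, and neighborhood choice are assumptions, not SCR's published formulation:

```python
import numpy as np

def redistribute_attention(attn, grid, top_k=2, alpha=0.5):
    """Move a fraction alpha of the attention mass on the top_k
    dominant patches to their 4-neighbors on an (h, w) patch grid.
    Total attention mass is preserved."""
    h, w = grid
    attn = attn.copy()
    dominant = np.argsort(attn)[-top_k:]  # indices of dominant patches
    for idx in dominant:
        r, c = divmod(idx, w)
        neighbors = [(r + dr, c + dc)
                     for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                     if 0 <= r + dr < h and 0 <= c + dc < w]
        share = alpha * attn[idx] / len(neighbors)
        attn[idx] *= (1 - alpha)          # suppress the dominant patch
        for nr, nc in neighbors:
            attn[nr * w + nc] += share    # credit its spatial neighbors
    return attn

# Usage: uniform attention over a 3x3 patch grid.
attn = np.full(9, 1 / 9)
out = redistribute_attention(attn, grid=(3, 3))
```

Preserving the total mass is what lets such a method run at inference time without retraining: it reshapes where visual attention lands rather than how much of it there is.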

Niamul Hassan Samin, Md Arifur Rahman, Abdullah Ibne Hanif Arean + 2 more · 2026-03-05 · cs.AI