cs.CV papers | Gist.Science

Factuality Matters: When Image Generation and Editing Meet Structured Visuals

This paper addresses the limitations of current image generation models in handling structured visuals by introducing a comprehensive framework that includes a 1.3-million-pair dataset, a unified VLM-FLUX.1 model trained with a three-stage curriculum and external reasoning, and the StructBench benchmark with StructScore metric to evaluate and improve factual fidelity in chart and diagram generation and editing.

Le Zhuo, Songhao Han, Yuandong Pu + 8 more | 2026-03-05 | cs

TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics

TIGeR is a novel framework that enhances Vision-Language Models for robotics by integrating external computational tools to perform precise geometric calculations, supported by a new dataset and a two-stage training pipeline, thereby achieving centimeter-level accuracy in real-world manipulation tasks.

Yi Han, Enshen Zhou, Shanyu Rong + 6 more | 2026-03-05 | cs.AI

Topological Alignment of Shared Vision-Language Embedding Space

This paper introduces ToMCLIP, a topology-aware framework that enhances multilingual vision-language alignment by applying persistent homology to preserve the global geometric structure of shared embedding spaces, thereby improving zero-shot accuracy and retrieval performance compared to existing instance-level methods.

Junwon You, Dasol Kang, Jae-Hun Jung | 2026-03-05 | cs.AI

Composition-Grounded Data Synthesis for Visual Reasoning

This paper introduces COGS, a data-efficient framework that synthesizes large-scale reasoning datasets by decomposing seed questions into primitive factors and recomposing them with new images, thereby significantly enhancing the visual reasoning capabilities of multi-modal large language models in annotation-scarce domains like charts and webpages.

Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong + 5 more | 2026-03-05 | cs.LG

A Geometry-Based View of Mahalanobis OOD Detection

This paper reveals that the reliability of Mahalanobis-based out-of-distribution detection is highly dependent on the geometric properties of the feature space, specifically within-class spectral structure and local intrinsic dimensionality, and proposes a radially scaled $\ell_2$ normalization method that dynamically adjusts feature radii to optimize detection performance based on these geometric signals.

Denis Janiak, Jakub Binkowski, Tomasz Kajdanowicz | 2026-03-05 | cs.LG

Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model

The paper introduces Kaleido, an open-sourced framework for multi-subject reference video generation that overcomes existing limitations in consistency and background disentanglement through a dedicated data construction pipeline and a novel Reference Rotary Positional Encoding (R-RoPE) mechanism.

Zhenxing Zhang, Jiayan Teng, Zhuoyi Yang + 6 more | 2026-03-05 | cs.AI

Weakly Supervised Concept Learning with Class-Level Priors for Interpretable Medical Diagnosis

This paper introduces Prior-guided Concept Predictor (PCP), a weakly supervised framework that leverages class-level concept priors and regularization to enable reliable, interpretable medical diagnosis without costly concept annotations, significantly outperforming zero-shot baselines while matching fully supervised models.

Md Nahiduzzaman, Steven Korevaar, Alireza Bab-Hadiashar + 1 more | 2026-03-05 | cs

Improving Multi-View Reconstruction via Texture-Guided Gaussian-Mesh Joint Optimization

This paper proposes a unified framework for seamless Gaussian-mesh joint optimization that simultaneously refines mesh geometry and vertex colors using texture guidance and differentiable rendering to achieve high-quality 3D reconstructions suitable for downstream editing tasks.

Zhejia Cai, Puhua Jiang, Shiwei Mao + 2 more | 2026-03-05 | cs.AI

Re-coding for Uncertainties: Edge-awareness Semantic Concordance for Resilient Event-RGB Segmentation

This paper proposes an Edge-awareness Semantic Concordance framework that leverages latent edge cues and uncertainty indicators to effectively fuse heterogeneous event and RGB modalities, significantly improving semantic segmentation resilience under extreme conditions such as low light and camera motion.

Nan Bao, Yifan Zhao, Lin Zhu + 1 more | 2026-03-05 | cs

NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization

NeuCLIP is a novel optimization framework that reformulates the contrastive loss using convex and variational analysis to replace inefficient per-sample normalizer estimators with a compact neural network, enabling more accurate and efficient large-scale CLIP training across datasets ranging from millions to billions of samples.

Xiyuan Wei, Chih-Jen Lin, Tianbao Yang | 2026-03-05 | cs.LG

Scriboora: Rethinking Human Pose Forecasting

This paper introduces a unified pipeline for human pose forecasting that addresses reproducibility issues, demonstrates that adapting recent speech models improves state-of-the-art performance, and evaluates model robustness against realistic noise from pose estimation through the introduction of a new dataset variation and unsupervised finetuning.

Daniel Bermuth, Alexander Poeppel, Wolfgang Reif | 2026-03-05 | cs

MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis

MatPedia introduces a universal generative foundation model that leverages a novel joint RGB-PBR representation and video diffusion architecture to unify text-to-material, image-to-material, and intrinsic decomposition tasks, achieving high-fidelity material synthesis by effectively transferring visual priors from large-scale RGB data.

Di Luo, Shuhui Yang, Mingxin Yang + 6 more | 2026-03-05 | cs

VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

VideoChat-M1 introduces a novel multi-agent system for video understanding that employs a learnable Collaborative Policy Planning paradigm, where multiple agents dynamically generate, execute, and refine tool invocation strategies through interaction and multi-agent reinforcement learning to achieve state-of-the-art performance across diverse video benchmarks.

Boyu Chen, Zikang Wang, Zhengrong Yue + 9 more | 2026-03-05 | cs

UniLight: A Unified Representation for Lighting

The paper introduces UniLight, a unified latent space representation that aligns diverse lighting modalities—including text, images, irradiance, and environment maps—through contrastive learning and spherical harmonics prediction, enabling consistent cross-modal transfer and flexible lighting control in image synthesis tasks.

Zitian Zhang, Iliyan Georgiev, Michael Fischer + 3 more | 2026-03-05 | cs

Measurement-Consistent Langevin Corrector for Stabilizing Latent Diffusion Inverse Problem Solvers

This paper proposes the Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play module that stabilizes latent diffusion inverse problem solvers by bridging the gap between solver dynamics and stable reverse diffusion processes through measurement-consistent Langevin updates.

Lee Hyoseok, Sohwi Lim, Eunju Cha + 1 more | 2026-03-05 | cs.LG

3D Wavelet-Based Structural Priors for Controlled Diffusion in Whole-Body Low-Dose PET Denoising

The paper proposes WCC-Net, a 3D wavelet-conditioned ControlNet framework that integrates explicit frequency-domain structural priors into a diffusion model to achieve superior anatomical consistency and denoising performance in whole-body low-dose PET imaging compared to existing CNN, GAN, and diffusion-based methods.

Peiyuan Jing, Yue Yang, Chun-Wun Cheng + 8 more | 2026-03-05 | cs.AI

Tracing 3D Anatomy in 2D Strokes: A Multi-Stage Projection Driven Approach to Cervical Spine Fracture Identification

This paper presents an automated, multi-stage pipeline that identifies cervical spine fractures by fusing orthogonal 2D segmentations to estimate 3D volumes of interest, which are then analyzed using a 2.5D CNN-Transformer ensemble to achieve diagnostic performance comparable to expert radiologists while reducing computational dimensionality.

Fabi Nahian Madhurja, Rusab Sarmun, Muhammad E. H. Chowdhury + 3 more | 2026-03-05 | cs.AI

Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation

This paper proposes VRFT-Aug, a novel visual reinforcement fine-tuning framework that integrates perception and reasoning augmentation strategies to significantly outperform existing supervised and reinforcement learning baselines in high-stakes medical imaging tasks.

Guangjing Yang, ZhangYuan Yu, Ziyuan Qin + 7 more | 2026-03-05 | cs.AI

First International StepUP Competition for Biometric Footstep Recognition: Methods, Results and Remaining Challenges

The First International StepUP Competition used the newly released UNB StepUP-P150 dataset to advance biometric footstep recognition. The top team achieved a 10.77% equal error rate, and the results highlight persistent challenges in generalizing to unfamiliar footwear.

Robyn Larracy, Eve MacDonald, Angkoon Phinyomark + 5 more | 2026-03-05 | cs.LG

VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

VidEoMT is a simple, high-speed video segmentation model that eliminates complex tracking modules by leveraging a plain ViT encoder enhanced with a lightweight query propagation and fusion mechanism to achieve competitive accuracy at up to 160 FPS.

Narges Norouzi, Idil Esen Zulfikar, Niccolò Cavagnero + 4 more | 2026-03-05 | cs
