Interpretable Debiasing of Vision-Language Models for Social Fairness
This paper introduces DeBiasLens, an interpretable, model-agnostic framework that uses sparse autoencoders to identify and selectively deactivate social-attribute neurons in vision-language models, mitigating social biases without compromising semantic knowledge.
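The core mechanism described above can be illustrated with a minimal sketch: a sparse autoencoder (SAE) maps model activations into an overcomplete feature space, the features flagged as encoding a social attribute are zeroed, and the remaining features are decoded back into an activation vector. This is not the paper's implementation; the weights, dimensions, and the `debias` helper below are illustrative assumptions, with toy random weights standing in for an SAE actually trained on the model's activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy dimensions; real SAEs are far larger

# Toy SAE parameters (in practice, trained to reconstruct activations).
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU gives a sparse, non-negative feature code.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    return f @ W_dec + b_dec

def debias(x, attribute_features):
    """Reconstruct the activation with the identified social-attribute
    SAE features zeroed out, leaving all other features untouched."""
    f = encode(x)
    f[..., attribute_features] = 0.0
    return decode(f)

x = rng.normal(0, 1.0, d_model)              # one model activation vector
x_clean = debias(x, attribute_features=[3, 17, 42])  # hypothetical feature ids
```

In a real pipeline, `x_clean` would replace `x` in the model's forward pass at the layer the SAE was trained on, so that downstream computation proceeds without the ablated attribute features.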