cs.CV papers | Gist.Science

A multimodal slice discovery framework for systematic failure detection and explanation in medical image classification

This paper introduces the first automated multimodal auditing framework for medical image classification that overcomes the limitations of existing unimodal approaches by enabling systematic discovery and explanation of hidden failures, as validated on the MIMIC-CXR-JPG dataset.

Yixuan Liu, Kanwal K. Bhatia, Ahmed E. Fetit | 2026-03-02 | cs.LG

Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume

This paper introduces UMPIRE, a training-free, efficient uncertainty quantification framework for Multimodal Large Language Models that leverages internal modality features to compute incoherence-adjusted semantic volumes, demonstrating superior performance in error detection and calibration across diverse modalities and challenging settings without relying on external tools.

Gregory Kang Ruey Lau, Hieu Dao, Nicole Kan Hui Lin + 1 more | 2026-03-02 | cs.CL

SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching

This paper introduces SenCache, a principled, training-free framework that accelerates video diffusion model inference by dynamically selecting caching timesteps based on a theoretical analysis of model output sensitivity to input perturbations, thereby achieving superior visual quality compared to existing heuristic methods.

Yasaman Haghighi, Alexandre Alahi | 2026-03-02 | cs.LG

MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy

The paper introduces MuViT, a transformer architecture that fuses true multi-resolution microscopy observations within a shared world-coordinate system to effectively integrate wide-field context with high-resolution detail, demonstrating consistent performance improvements over existing baselines across various microscopy tasks.

Albert Dominguez Mantes, Gioele La Manno, Martin Weigert | 2026-03-02 | cs.LG

Enhancing Spatial Understanding in Image Generation via Reward Modeling

This paper introduces a novel approach to enhance spatial understanding in text-to-image generation by constructing a large-scale preference dataset, developing a high-performance reward model called SpatialScore, and leveraging it for online reinforcement learning to significantly improve the accuracy of complex spatial relationships in generated images.

Zhenyu Tang, Chaoran Feng, Yufan Deng + 5 more | 2026-03-02 | cs

Joint Geometric and Trajectory Consistency Learning for One-Step Real-World Super-Resolution

This paper proposes GTASR, a lightweight one-step Real-World Super-Resolution framework that overcomes the limitations of existing Consistency Models by introducing Trajectory Alignment and Dual-Reference Structural Rectification to eliminate consistency drift and ensure structural coherence.

Chengyan Deng, Zhangquan Chen, Li Yu + 3 more | 2026-03-02 | cs

Histopathology Image Normalization via Latent Manifold Compaction

This paper introduces Latent Manifold Compaction (LMC), an unsupervised framework that harmonizes histopathology images by compacting stain-induced latent manifolds to learn batch-invariant embeddings, thereby significantly improving cross-batch generalization and outperforming state-of-the-art normalization methods in downstream classification and detection tasks.

Xiaolong Zhang, Jianwei Zhang, Selim Sevim + 3 more | 2026-03-02 | cs.LG

Hierarchical Action Learning for Weakly-Supervised Action Segmentation

The paper proposes the Hierarchical Action Learning (HAL) model, which exploits the different temporal evolution rates of low-level visual and high-level action latent variables within a hierarchical causal framework to achieve weakly-supervised action segmentation that is both strictly identifiable and state-of-the-art.

Junxian Huang, Ruichu Cai, Hao Zhu + 5 more | 2026-03-02 | cs

Mode Seeking meets Mean Seeking for Fast Long Video Generation

This paper proposes a Decoupled Diffusion Transformer that combines a global Flow Matching head for long-term narrative coherence with a local Distribution Matching head for short-video fidelity, enabling the fast generation of high-quality, minute-scale videos by effectively bridging the gap between limited long-form data and abundant short-form data.

Shengqu Cai, Weili Nie, Chao Liu + 8 more | 2026-03-02 | cs.LG

BSDM: Background Suppression Diffusion Model for Hyperspectral Anomaly Detection

This paper proposes BSDM, a novel unsupervised background suppression diffusion model that learns latent background distributions and adapts to diverse domains via a statistical offset module to effectively detect hyperspectral anomalies without requiring labeled data.

Jitao Ma, Weiying Xie, Yunsong Li + 1 more | 2026-02-27 | cs

StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning

StableMaterials is a novel semi-supervised learning framework that leverages Latent Diffusion Models and adversarial distillation to generate diverse, high-resolution, and tileable photorealistic PBR materials with minimal reliance on annotated data.

Giuseppe Vecchio | 2026-02-27 | cs

SGIFormer: Semantic-guided and Geometric-enhanced Interleaving Transformer for 3D Instance Segmentation

This paper introduces SGIFormer, a novel 3D instance segmentation method that combines Semantic-guided Mix Query initialization with a Geometric-enhanced Interleaving Transformer decoder to overcome existing limitations in query initialization and scalability, achieving state-of-the-art performance on major benchmarks while balancing accuracy and efficiency.

Lei Yao, Yi Wang, Moyun Liu + 1 more | 2026-02-27 | cs

Open-Set Deepfake Detection: A Parameter-Efficient Adaptation Method with Forgery Style Mixture

This paper proposes a parameter-efficient Open-Set Deepfake detection method that leverages a forgery-style mixture formulation and lightweight modules within a pre-trained Vision Transformer to achieve superior generalization across unseen forgery domains while significantly reducing computational costs.

Chenqi Kong, Anwei Luo, Peijun Bao + 5 more | 2026-02-27 | cs

Abstracted Gaussian Prototypes for True One-Shot Concept Learning

This paper introduces the Abstracted Gaussian Prototypes (AGP) framework, a low-complexity, standalone system that achieves "true" one-shot learning by encoding visual concepts as Gaussian mixture models to simultaneously perform robust classification and generate human-indistinguishable novel class variants without relying on pre-training.

Chelsea Zou, Kenneth J. Kurtz | 2026-02-27 | cs.AI

SplatSDF: Boosting SDF-NeRF via Architecture-Level Fusion with Gaussian Splats

SplatSDF is a novel architecture that accelerates the convergence and improves the geometric accuracy of SDF-NeRF by directly fusing pre-trained 3D Gaussian splats into the network via a sparse injection strategy, enabling practical deployment on robotic systems without relying on consistency losses.

Runfa Blark Li, Keito Suzuki, Bang Du + 3 more | 2026-02-27 | cs

Distractor-free Generalizable 3D Gaussian Splatting

This paper introduces DGGS, a novel framework that achieves distractor-free generalizable 3D Gaussian Splatting by employing a scene-agnostic mask prediction module during training and a two-stage reference scoring mechanism with pruning during inference to ensure stable, high-quality reconstruction in unseen scenes.

Yanqi Bao, Jing Liao, Jing Huo + 1 more | 2026-02-27 | cs

From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects

This paper proposes a framework that enhances Open Vocabulary Object Detection models for open-world settings by introducing Pseudo Unknown Embedding and Multi-Scale Contrastive Anchor Learning to identify and incrementally learn novel objects, thereby addressing limitations in detecting far-out-of-distribution items and reducing misclassifications while maintaining state-of-the-art performance.

Zizhao Li, Zhengkang Xiang, Joseph West + 1 more | 2026-02-27 | cs.AI

Enhancing Sketch Animation: Text-to-Video Diffusion Models with Temporal Consistency and Rigidity Constraints

This paper proposes a novel text-to-sketch-animation method that leverages a pre-trained text-to-video diffusion model guided by a Score Distillation Sampling (SDS) loss, while introducing length-area regularization for temporal consistency and an As-Rigid-As-Possible loss to preserve sketch topology, thereby outperforming state-of-the-art approaches in both quantitative and qualitative evaluations.

Gaurav Rai, Ojaswa Sharma | 2026-02-27 | cs

PPT: Pretraining with Pseudo-Labeled Trajectories for Motion Forecasting

The paper introduces PPT, a scalable pretraining framework that leverages automatically generated pseudo-labeled trajectories from off-the-shelf detectors to enhance motion forecasting models' performance and generalization, particularly in low-data and cross-domain scenarios, while reducing reliance on costly manual annotations.

Yihong Xu, Yuan Yin, Éloi Zablocki + 3 more | 2026-02-27 | cs

IV-tuning: Parameter-Efficient Transfer Learning for Infrared-Visible Tasks

The paper proposes IV-tuning, a parameter-efficient transfer learning framework that leverages pre-trained visual models with only 3% trainable backbone parameters to overcome the generalization limitations of full fine-tuning and achieve state-of-the-art performance across various infrared-visible tasks.

Yaming Zhang, Chenqiang Gao, Fangcen Liu + 4 more | 2026-02-27 | cs
