cs.CV papers | Gist.Science

Aligned explanations in neural networks

This paper introduces Pointwise-interpretable Networks (PiNets), a modeling framework that ensures "explanatory alignment" by combining statistical intelligence with a pseudo-linear structure to produce neural network explanations that directly underlie predictions rather than merely rationalizing them.

Corentin Lobet, Francesca Chiaromonte2026-03-03📊 stat

TP-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models

TP-Blend is a lightweight, training-free framework that achieves precise object-style blending in diffusion models by combining Cross-Attention Object Fusion for spatially aware feature reassignment and Self-Attention Style Fusion for detail-sensitive texture modulation, enabling simultaneous high-fidelity object replacement and style transfer.

Xin Jin, Yichuan Zhong, Yapeng Tian2026-03-03🤖 cs.AI

Copy-Trasform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints

This paper presents a zero-shot 3D object alignment framework that optimizes relative pose using CLIP-driven gradients and geometry-aware constraints via a differentiable renderer, achieving semantically faithful and physically plausible results without requiring new model training.

Rotem Gatenyo, Ohad Fried2026-03-03💻 cs

Counterfactual Explanations on Robust Perceptual Geodesics

This paper introduces Perceptual Counterfactual Geodesics (PCG), a method that generates robust and semantically valid counterfactual explanations by tracing geodesics under a perceptually aligned Riemannian metric, thereby overcoming the off-manifold artifacts and adversarial vulnerabilities inherent in existing latent-space optimization approaches.

Eslam Zaher, Maciej Trzaskowski, Quan Nguyen + 1 more2026-03-03🤖 cs.LG

Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Vision-DeepResearch introduces a novel multimodal deep-research paradigm that leverages multi-turn, multi-entity, and multi-scale visual and textual search, trained via cold-start supervision and reinforcement learning, to significantly outperform existing models and strong closed-source foundation models in solving complex, noise-heavy real-world questions.

Wenxuan Huang, Yu Zeng, Qiuchen Wang + 13 more2026-03-03🤖 cs.AI

When Anomalies Depend on Context: Learning Conditional Compatibility for Anomaly Detection

This paper introduces the CAAD-3K benchmark and a conditional compatibility learning framework that leverages vision-language representations to detect anomalies based on subject-context compatibility, thereby addressing the limitations of traditional methods that treat abnormality as an intrinsic property independent of context.

Shashank Mishra, Didier Stricker, Jason Rambach2026-03-03🤖 cs.LG

Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning

This paper introduces HitEmotion, a Theory-of-Mind-grounded benchmark and a corresponding framework combining ToM-guided reasoning chains with TMPO reinforcement learning to significantly enhance the deep emotional understanding and reasoning capabilities of multimodal large language models.

Meng Luo, Bobo Li, Shanqing Xu + 8 more2026-03-03💻 cs

Gradient-Aligned Calibration for Post-Training Quantization of Diffusion Models

This paper proposes a novel post-training quantization method for diffusion models that optimizes calibration sample weights to align gradients across timesteps, thereby overcoming the sub-optimality of uniform weighting and significantly improving quantization performance.

Dung Anh Hoang, Cuong Pham anh Trung Le, Jianfei Cai + 1 more2026-03-03🤖 cs.LG

Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning

This paper introduces CaCoVID, a reinforcement learning-based token compression framework for video large language models that optimizes token selection by explicitly maximizing their contribution to correct predictions rather than relying on attention scores, thereby significantly reducing computational overhead while maintaining performance.

Yinchao Ma, Qiang Zhou, Zhibin Wang + 4 more2026-03-03🤖 cs.AI

CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions

This paper introduces CloDS, an unsupervised framework that learns cloth dynamics from multi-view visual observations under unknown conditions by employing a three-stage pipeline featuring a novel dual-position opacity modulation for robust video-to-geometry grounding.

Yuliang Zhan, Jian Li, Wenbing Huang + 3 more2026-03-03🤖 cs.AI

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

To address the limitations of existing benchmarks in evaluating multimodal large language models' visual and textual search capabilities, this paper introduces the Vision-DeepResearch Benchmark (VDR-Bench), a rigorously curated dataset of 2,000 instances designed for realistic assessment, alongside a proposed multi-round cropped-search workflow that effectively enhances visual retrieval performance.

Yu Zeng, Wenxuan Huang, Zhen Fang + 14 more2026-03-03💬 cs.CL

Investigating Disability Representations in Text-to-Image Models

This study investigates the underexplored representation of disability in text-to-image models like Stable Diffusion XL and DALL-E 3, revealing persistent imbalances and affective framing issues that highlight the urgent need for continuous evaluation and refinement to foster more inclusive portrayals.

Yang Tian, Yu Fan, Liudmila Zavolokina + 1 more2026-03-03💬 cs.CL

RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing

The paper introduces RFDM, a residual flow diffusion model that enables efficient, causal, and variable-length video editing by adapting 2D image-to-image diffusion to predict frame residuals, achieving performance comparable to 3D models with significantly lower computational costs.

Mohammadreza Salehi, Mehdi Noroozi, Luca Morreale + 4 more2026-03-03💻 cs

Single-Slice-to-3D Reconstruction in Medical Imaging and Natural Objects: A Comparative Benchmark with SAM 3D

This paper benchmarks five state-of-the-art image-to-3D foundation models on medical and natural datasets, revealing that while all struggle with severe depth ambiguity in single-slice reconstruction, SAM3D best preserves topological similarity to medical shapes, ultimately demonstrating that reliable medical 3D inference requires domain-specific adaptation beyond current zero-shot capabilities.

Yan Luo, Advaith Ravishankar, Serena Liu + 2 more2026-03-03💻 cs

EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

EchoTorrent is a novel multi-modal video generation framework that overcomes latency and temporal stability challenges through a fourfold design involving multi-teacher training, adaptive CFG calibration, hybrid long tail forcing, and VAE decoder refinement to enable swift, sustained, and high-fidelity streaming inference with precise audio-lip synchronization.

Rang Meng, Yingjie Yin, Yuming Li + 1 more2026-03-03💻 cs

Deformation-Free Cross-Domain Image Registration via Position-Encoded Temporal Attention

This paper introduces GPEReg-Net, a novel framework for deformation-free cross-domain image registration that factorizes images into domain-invariant scene and appearance components recombined via AdaIN, enhanced by a position-encoded temporal attention mechanism to achieve state-of-the-art performance and efficiency on retinal and synthetic benchmarks.

Yiwen Wang, Jiahao Qin2026-03-03🤖 cs.AI

OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis

OmniCT introduces a unified slice-volume Large Vision-Language Model that overcomes the limitations of existing fragmented approaches by integrating spatial consistency and organ-level semantic enhancements to achieve comprehensive, high-precision CT analysis across both local and global clinical tasks.

Tianwei Lin, Zhongwei Qiu, Wenqiao Zhang + 12 more2026-03-03🤖 cs.AI

Prefer-DAS: Learning from Local Preferences and Sparse Prompts for Domain Adaptive Segmentation of Electron Microscopy

Prefer-DAS is a novel domain adaptive segmentation framework for electron microscopy that leverages sparse prompts and local preference alignment through self-training and contrastive learning to achieve high-performance, flexible segmentation with minimal annotation requirements.

Jiabao Chen, Shan Xiong, Jialin Peng2026-03-03💻 cs

Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

The paper introduces Hepato-LLaVA, a specialized Multi-modal Large Language Model featuring a novel Sparse Topo-Pack Attention mechanism and the clinically validated HepatoPathoVQA dataset, to achieve state-of-the-art performance in hepatocellular carcinoma diagnosis and captioning on gigapixel whole slide images by effectively addressing resolution constraints and feature aggregation inefficiencies.

Yuxuan Yang, Zhonghao Yan, Yi Zhang + 6 more2026-03-03💻 cs

Leveraging Causal Reasoning Method for Explaining Medical Image Segmentation Models

This paper introduces a causal reasoning-based explanation framework for medical image segmentation that quantifies the influence of input regions and network components via average treatment effects, demonstrating superior faithfulness over existing methods and revealing significant heterogeneity in model perceptual strategies.

Limai Jiang, Ruitao Xie, Bokai Yang + 6 more2026-03-03💻 cs

← Previous Next →