cs.CV papers | Gist.Science

InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning

This paper presents InterCoG, a text-vision interleaved chain-of-grounding reasoning framework for fine-grained image editing in complex multi-entity scenes. It first deduces target locations through text-based spatial reasoning, then performs visual grounding and outcome specification; a new dataset and auxiliary training modules support spatial precision.

Yecong Wan, Fan Li, Chunwei Wang + 3 more | 2026-03-04 | 💻 cs

What Helps---and What Hurts: Bidirectional Explanations for Vision Transformers

This paper introduces BiCAM, a bidirectional class activation mapping method that captures both supportive and suppressive contributions in Vision Transformers to enhance explanation faithfulness and enable efficient adversarial detection without retraining.

Qin Su, Tie Luo | 2026-03-04 | 🤖 cs.AI

PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts

This paper introduces PromptStereo, a zero-shot stereo matching method that enhances the iterative refinement stage by integrating monocular structure and stereo motion cues as prompts into a Prompt Recurrent Unit (PRU), thereby achieving state-of-the-art generalization performance while preserving inherent monocular depth priors.

Xianqi Wang, Hao Yang, Hangtian Wang + 4 more | 2026-03-04 | 💻 cs

Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy

The paper introduces Nano-EmoX, a compact 2.2B-parameter multimodal language model trained via the Perception-to-Empathy (P2E) curriculum framework, which unifies six core affective tasks across a three-level cognitive hierarchy to achieve state-of-the-art performance in emotional intelligence from low-level perception to high-level empathy.

Jiahao Huang, Fengyan Lin, Xuechao Yang + 4 more | 2026-03-04 | 🤖 cs.AI

SimRecon: SimReady Compositional Scene Reconstruction from Real Videos

SimRecon is a novel framework that achieves high-fidelity, physically plausible compositional scene reconstruction from real videos by integrating a "Perception-Generation-Simulation" pipeline with two specialized bridging modules: Active Viewpoint Optimization for visual fidelity and a Scene Graph Synthesizer for physical plausibility.

Chong Xia, Kai Zhu, Zizhuo Wang + 3 more | 2026-03-04 | 💻 cs

OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution

This paper introduces OnlineX, a feed-forward framework that achieves unified online 3D reconstruction and semantic understanding by employing a decoupled active-to-stable state evolution paradigm to resolve cumulative drift while jointly modeling visual and language fields for real-time, high-fidelity performance.

Chong Xia, Fangfu Liu, Yule Wang + 2 more | 2026-03-04 | 💻 cs

HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images

The paper proposes HiFi-Inpaint, a novel framework that utilizes Shared Enhancement Attention and a Detail-Aware Loss to overcome data and supervision limitations, achieving state-of-the-art, high-fidelity generation of detail-preserving human-product images.

Yichen Liu, Donghao Zhou, Jie Wang + 9 more | 2026-03-04 | 💻 cs

Forecasting as Rendering: A 2D Gaussian Splatting Framework for Time Series Forecasting

This paper introduces TimeGS, a time series forecasting framework that reframes prediction as 2D generative rendering. It uses adaptive Gaussian kernels and a chronologically continuous rasterization mechanism to overcome the topological mismatches and resolution inefficiencies of existing 2D reshaping methods, and reports state-of-the-art performance.

Yixin Wang, Yifan Hu, Peiyuan Liu + 3 more | 2026-03-04 | 🤖 cs.AI

CamDirector: Towards Long-Term Coherent Video Trajectory Editing

CamDirector is a novel video trajectory editing framework that achieves long-term temporal coherence and precise camera control by combining a hybrid warping scheme with a world cache and a history-guided autoregressive diffusion model, validated by a new benchmark called iPhone-PTZ.

Zhihao Shi, Kejia Yin, Weilin Wan + 5 more | 2026-03-04 | 💻 cs

Social-JEPA: Emergent Geometric Isomorphism

This paper demonstrates that independent agents trained with predictive learning objectives on distinct viewpoints of the same environment naturally develop geometrically isomorphic latent spaces, enabling zero-shot knowledge transfer and efficient interoperability without parameter sharing or coordination.

Haoran Zhang, Youjin Wang, Yi Duan + 6 more | 2026-03-04 | 🤖 cs.AI

From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification

This study presents a multimodal animal identification framework that leverages a massive dataset of 1.9 million images and synthetic textual descriptions to achieve an 84.28% Top-1 accuracy, representing an 11% improvement over unimodal baselines through systematic ablation of encoders and an optimal gated fusion strategy.

Vasiliy Kudryavtsev, Kirill Borodin, German Berezin + 3 more | 2026-03-04 | 💻 cs

Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection

This paper proposes PDP, a novel prompt-decoupled framework for Incremental Object Detection that utilizes a dual-pool prompting paradigm to separate task-general and task-specific knowledge while employing a prototypical pseudo-label generation module to mitigate prompt drift, thereby achieving state-of-the-art performance on MS-COCO and PASCAL VOC benchmarks.

Yaoteng Zhang, Zhou Qing, Junyu Gao + 1 more | 2026-03-04 | 🤖 cs.AI

AutoFFS: Adversarial Deformations for Facial Feminization Surgery Planning

The paper introduces AutoFFS, a novel data-driven framework that utilizes adversarial free-form deformations to generate quantitative, counterfactual skull morphologies for objective and reproducible preoperative planning in Facial Feminization Surgery.

Paul Friedrich, Florentin Bieder, Florian M. Thieringer + 1 more | 2026-03-04 | ⚡ eess

Loss Design and Architecture Selection for Long-Tailed Multi-Label Chest X-Ray Classification

This paper presents a systematic evaluation of loss functions, architectures, and post-training strategies for long-tailed multi-label chest X-ray classification on the CXR-LT 2026 benchmark, demonstrating that LDAM-DRW combined with a ConvNeXt-Large backbone and classifier re-training achieves a top-5 ranking with 0.3950 mAP while offering practical insights into the development-to-test performance gap.

Nikhileswara Rao Sulake | 2026-03-04 | ⚡ eess

HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding

HAMMER is a novel framework that leverages multimodal large language models to achieve intention-driven 3D affordance grounding by aggregating interaction intentions into contact-aware embeddings and employing hierarchical cross-modal integration with multi-granular geometry lifting for accurate 3D localization.

Lei Yao, Yong Chen, Yuejiao Su + 3 more | 2026-03-04 | 💻 cs

Preconditioned Score and Flow Matching

This paper identifies that the ill-conditioned covariance of intermediate distributions in flow matching and score-based diffusion causes optimization bias and stagnation, and proposes reversible preconditioning maps to reshape this geometry, thereby enabling continued progress along suppressed directions and yielding better-trained models.

Shadab Ahamed, Eshed Gal, Simon Ghyselincks + 3 more | 2026-03-04 | 🤖 cs.AI

MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry

MERG3R is a training-free, model-agnostic divide-and-conquer framework that enables neural visual geometry models to scale to large, unordered image collections by partitioning data into manageable subsets and merging local reconstructions into a globally consistent 3D model, thereby overcoming GPU memory limitations while improving accuracy and scalability.

Leo Kaixuan Cheng, Abdus Shaikh, Ruofan Liang + 3 more | 2026-03-04 | 💻 cs

Beyond Caption-Based Queries for Video Moment Retrieval

This paper investigates the performance degradation of existing Video Moment Retrieval methods when transitioning from caption-based to search queries, identifies language and multi-moment gaps alongside a decoder-query collapse as key causes, and proposes architectural modifications to significantly improve generalization on multi-moment search queries.

David Pujol-Perich, Albert Clapés, Dima Damen + 2 more | 2026-03-04 | 💻 cs

Retrieving Patient-Specific Radiomic Feature Sets for Transparent Knee MRI Assessment

This paper proposes a transparent, patient-specific radiomic framework that employs a two-stage retrieval strategy to select compact, complementary feature sets for knee MRI diagnosis, achieving performance competitive with deep learning models while offering enhanced interpretability through auditable links between specific anatomical regions and clinical outcomes.

Yaxi Chen, Simin Ni, Jingjing Zhang + 7 more | 2026-03-04 | 💻 cs

Cultural Counterfactuals: Evaluating Cultural Biases in Large Vision-Language Models with Counterfactual Examples

This paper introduces "Cultural Counterfactuals," a high-quality synthetic dataset of nearly 60,000 images created by placing diverse individuals into varied cultural contexts to enable the precise measurement and evaluation of cultural biases related to religion, nationality, and socioeconomic status in Large Vision-Language Models.

Phillip Howard, Xin Su, Kathleen C. Fraser | 2026-03-04 | 💻 cs
