Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs
This paper proposes VC-STaR, a self-improving framework that leverages visual contrastive pairs to mitigate hallucinations in model-generated rationales. The framework yields the VisCoR-55K dataset, which significantly enhances the visual reasoning capabilities of Vision-Language Models.