cs.CV papers | Gist.Science

VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction

VLMFusionOcc3D is a robust multimodal framework for autonomous driving that leverages Vision-Language Models to resolve semantic ambiguities and employs a weather-aware adaptive fusion mechanism to significantly improve 3D semantic occupancy prediction accuracy, particularly under adverse weather conditions.

A. Enes Doruk, Hasan F. Ates · 2026-03-04 · cs

Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild

This paper introduces DrPose, a direct reward fine-tuning algorithm that leverages a novel dataset of pose-image pairs to optimize multi-view diffusion models for generating 3D humans with more natural and diverse poses from single images, eliminating the need for expensive 3D assets.

Seunguk Do, Minwoo Huh, Joonghyuk Shin + 1 more · 2026-03-04 · cs

Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective

This paper proposes IB-IUMAD, a novel incremental unified multimodal anomaly detection framework that mitigates catastrophic forgetting by leveraging a Mamba decoder to disentangle inter-object feature coupling and an information bottleneck module to filter redundant features, thereby preserving discriminative information across evolving categories.

Kaifang Long, Lianbo Ma, Jiaqi Liu + 2 more · 2026-03-04 · cs

SEP-YOLO: Fourier-Domain Feature Representation for Transparent Object Instance Segmentation

This paper introduces SEP-YOLO, a novel framework that combines frequency-domain detail enhancement and multi-scale spatial refinement to achieve state-of-the-art transparent object instance segmentation, while also providing high-quality annotations for the Trans10K dataset.

Fengming Zhang, Tao Yan, Jianchao Huang · 2026-03-04 · cs

OmniFashion: Towards Generalist Fashion Intelligence via Multi-Task Vision-Language Learning

To overcome the limitations of fragmented supervision in fashion intelligence, the authors introduce FashionX, a comprehensive million-scale dataset, and OmniFashion, a unified vision-language framework that enables multi-task reasoning and interactive dialogue across diverse fashion applications.

Zhengwei Yang, Andi Long, Hao Li + 3 more · 2026-03-04 · cs

Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory

This paper introduces M3IRT, a multimodal item response theory framework that decomposes model ability and item difficulty into image-only, text-only, and cross-modal components to filter out shortcut questions, thereby enabling more reliable and cost-effective evaluation of genuine cross-modal reasoning in Multimodal Large Language Models.

Shunki Uebayashi, Kento Masui, Kyohei Atarashi + 5 more · 2026-03-04 · cs.CL

DREAM: Where Visual Understanding Meets Text-to-Image Generation

DREAM is a unified framework that synergistically combines discriminative and generative objectives through Masking Warmup and Semantically Aligned Decoding, achieving state-of-the-art performance in both visual understanding and text-to-image generation on the CC12M dataset.

Chao Li, Tianhong Li, Sai Vidyaranya Nuthalapati + 8 more · 2026-03-04 · cs.LG

VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation

The paper introduces VisionCreator, a native visual-generation agentic model that unifies understanding, thinking, planning, and creation capabilities through specialized training on a novel dataset and benchmark, demonstrating superior performance over larger closed-source models in complex visual creation tasks.

Jinxiang Lai, Zexin Lu, Jiajun He + 11 more · 2026-03-04 · cs

ReCo-Diff: Residual-Conditioned Deterministic Sampling for Cold Diffusion in Sparse-View CT

The paper introduces ReCo-Diff, a residual-conditioned deterministic sampling framework that enhances sparse-view CT reconstruction by continuously correcting predictions based on observation residuals, thereby achieving superior accuracy, stability, and robustness compared to existing cold diffusion methods.

Yong Eun Choi, Hyoung Suk Park, Kiwan Jeon + 2 more · 2026-03-04 · cs

FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution

FiDeSR is a high-fidelity, one-step diffusion framework for real-world image super-resolution that combines a detail-aware training weighting strategy, a residual-in-residual noise refinement mechanism, and low/high-frequency adaptive enhancers to simultaneously achieve superior perceptual quality and faithful content restoration.

Aro Kim, Myeongjin Jang, Chaewon Moon + 3 more · 2026-03-04 · cs

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

ShareVerse is a multi-agent video generation framework that enables consistent shared world modeling by leveraging a large-scale CARLA dataset, a spatial concatenation strategy for multi-view coherence, and cross-agent attention mechanisms to ensure geometric and interactive consistency across agents.

Jiayi Zhu, Jianing Zhang, Yiying Yang + 2 more · 2026-03-04 · cs.AI

Intelligent Pathological Diagnosis of Gestational Trophoblastic Diseases via Visual-Language Deep Learning Model

This paper presents GTDoctor, a visual-language deep learning model and its associated GTDiagnosis software system, which significantly improve the speed, accuracy, and consistency of gestational trophoblastic disease pathological diagnosis through automated lesion segmentation and personalized analysis.

Yuhang Liu, Yueyang Cang, Wenge Que + 12 more · 2026-03-04 · cs.AI

MiM-DiT: MoE in MoE with Diffusion Transformers for All-in-One Image Restoration

This paper proposes MiM-DiT, a unified image restoration framework that integrates a dual-level Mixture-of-Experts architecture with pretrained diffusion transformers to effectively handle diverse and fine-grained degradation types through adaptive coarse-grained and fine-grained expert selection.

Lingshun Kong, Jiawei Zhang, Zhengpeng Duan + 6 more · 2026-03-04 · cs

From "What" to "How": Constrained Reasoning for Autoregressive Image Generation

The paper proposes CoR-Painter, a novel framework that enhances autoregressive image generation by introducing a "How-to-What" paradigm with constrained reasoning to explicitly derive spatial and compositional rules before generating detailed descriptions, thereby achieving state-of-the-art performance in spatial accuracy and coherence.

Ruxue Yan, Xubo Liu, Wenya Guo + 3 more · 2026-03-04 · eess

TenExp: Mixture-of-Experts-Based Tensor Decomposition Structure Search Framework

The paper proposes TenExp, a novel unsupervised framework that leverages a mixture-of-experts approach to dynamically search for and activate optimal single or mixed tensor decompositions, thereby overcoming the limitations of existing methods confined to fixed factor-interaction families.

Ting-Wei Zhou, Xi-Le Zhao, Sheng Liu + 3 more · 2026-03-04 · cs

Cross-view geo-localization, Image retrieval, Multiscale geometric modeling, Frequency domain enhancement

This paper proposes the Spatial and Frequency Domain Enhancement Network (SFDE), a lightweight three-branch architecture that leverages complementary spatial and frequency domain representations to effectively address geometric asymmetry and texture inconsistencies in cross-view geo-localization, achieving state-of-the-art performance through multiscale structural modeling and frequency invariance.

Hongying Zhang, ShuaiShuai Ma · 2026-03-04 · cs

Seeing Clearly without Training: Mitigating Hallucinations in Multimodal LLMs for Remote Sensing

This paper introduces RSHBench, a benchmark for diagnosing hallucinations in remote sensing visual question-answering, and proposes RADAR, a training-free inference method that leverages intrinsic attention to improve grounding and reduce hallucinations in multimodal large language models.

Yi Liu, Jing Zhang, Di Wang + 3 more · 2026-03-04 · cs

HiLoRA: Hierarchical Low-Rank Adaptation for Personalized Federated Learning

This paper proposes HiLoRA, a hierarchical Low-Rank Adaptation framework for Federated Learning that leverages a three-tier adapter structure and subspace-based client clustering to effectively capture global, subgroup, and client-specific knowledge, thereby enhancing both personalization and generalization in Vision Transformer models.

Zihao Peng, Nan Zou, Jiandian Zeng + 4 more · 2026-03-04 · cs

Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology, Radiology, and Natural Language

The paper introduces UNICORN, a unified public benchmark featuring a standardized two-step evaluation framework and a novel aggregate metric to systematically assess the cross-modality and cross-task generalization of medical foundation models across diverse imaging and natural language data from multiple institutions.

Michelle Stegeman, Lena Philipp, Fennie van der Graaf + 19 more · 2026-03-04 · cs

R3GW: Relightable 3D Gaussians for Outdoor Scenes in the Wild

R3GW introduces a novel method for reconstructing outdoor scenes from unconstrained photo collections by separating the scene into relightable foreground and non-reflective sky components, enabling state-of-the-art physically based relighting and high-quality novel view synthesis under arbitrary illumination conditions.

Margherita Lea Corona, Wieland Morgenstern, Peter Eisert + 1 more · 2026-03-04 · cs
