cs.CV papers | Gist.Science

DREAM: Where Visual Understanding Meets Text-to-Image Generation

DREAM is a unified framework that synergistically combines discriminative and generative objectives through Masking Warmup and Semantically Aligned Decoding, achieving state-of-the-art performance in both visual understanding and text-to-image generation on the CC12M dataset.

Chao Li, Tianhong Li, Sai Vidyaranya Nuthalapati + 8 more | 2026-03-04 | 🤖 cs.LG

VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation

The paper introduces VisionCreator, a native visual-generation agentic model that unifies understanding, thinking, planning, and creation capabilities through specialized training on a novel dataset and benchmark, demonstrating superior performance over larger closed-source models in complex visual creation tasks.

Jinxiang Lai, Zexin Lu, Jiajun He + 11 more | 2026-03-04 | 💻 cs

ReCo-Diff: Residual-Conditioned Deterministic Sampling for Cold Diffusion in Sparse-View CT

The paper introduces ReCo-Diff, a residual-conditioned deterministic sampling framework that enhances sparse-view CT reconstruction by continuously correcting predictions based on observation residuals, thereby achieving superior accuracy, stability, and robustness compared to existing cold diffusion methods.

Yong Eun Choi, Hyoung Suk Park, Kiwan Jeon + 2 more | 2026-03-04 | 💻 cs

FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution

FiDeSR is a high-fidelity, one-step diffusion framework for real-world image super-resolution that combines a detail-aware training weighting strategy, a residual-in-residual noise refinement mechanism, and low/high-frequency adaptive enhancers to simultaneously achieve superior perceptual quality and faithful content restoration.

Aro Kim, Myeongjin Jang, Chaewon Moon + 3 more | 2026-03-04 | 💻 cs

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

ShareVerse is a multi-agent video generation framework that enables consistent shared world modeling by leveraging a large-scale CARLA dataset, a spatial concatenation strategy for multi-view coherence, and cross-agent attention mechanisms to ensure geometric and interactive consistency across agents.

Jiayi Zhu, Jianing Zhang, Yiying Yang + 2 more | 2026-03-04 | 🤖 cs.AI

Intelligent Pathological Diagnosis of Gestational Trophoblastic Diseases via Visual-Language Deep Learning Model

This paper presents GTDoctor, a visual-language deep learning model and its associated GTDiagnosis software system, which significantly improve the speed, accuracy, and consistency of gestational trophoblastic disease pathological diagnosis through automated lesion segmentation and personalized analysis.

Yuhang Liu, Yueyang Cang, Wenge Que + 12 more | 2026-03-04 | 🤖 cs.AI

MiM-DiT: MoE in MoE with Diffusion Transformers for All-in-One Image Restoration

This paper proposes MiM-DiT, a unified image restoration framework that integrates a dual-level Mixture-of-Experts architecture with pretrained diffusion transformers to effectively handle diverse and fine-grained degradation types through adaptive coarse-grained and fine-grained expert selection.

Lingshun Kong, Jiawei Zhang, Zhengpeng Duan + 6 more | 2026-03-04 | 💻 cs

From "What" to "How": Constrained Reasoning for Autoregressive Image Generation

The paper proposes CoR-Painter, a novel framework that enhances autoregressive image generation by introducing a "How-to-What" paradigm with constrained reasoning to explicitly derive spatial and compositional rules before generating detailed descriptions, thereby achieving state-of-the-art performance in spatial accuracy and coherence.

Ruxue Yan, Xubo Liu, Wenya Guo + 3 more | 2026-03-04 | ⚡ eess

TenExp: Mixture-of-Experts-Based Tensor Decomposition Structure Search Framework

The paper proposes TenExp, a novel unsupervised framework that leverages a mixture-of-experts approach to dynamically search for and activate optimal single or mixed tensor decompositions, thereby overcoming the limitations of existing methods confined to fixed factor-interaction families.

Ting-Wei Zhou, Xi-Le Zhao, Sheng Liu + 3 more | 2026-03-04 | 💻 cs

Cross-view geo-localization, Image retrieval, Multiscale geometric modeling, Frequency domain enhancement

This paper proposes the Spatial and Frequency Domain Enhancement Network (SFDE), a lightweight three-branch architecture that leverages complementary spatial and frequency domain representations to effectively address geometric asymmetry and texture inconsistencies in cross-view geo-localization, achieving state-of-the-art performance through multiscale structural modeling and frequency invariance.

Hongying Zhang, ShuaiShuai Ma | 2026-03-04 | 💻 cs

Seeing Clearly without Training: Mitigating Hallucinations in Multimodal LLMs for Remote Sensing

This paper introduces RSHBench, a benchmark for diagnosing hallucinations in remote sensing visual question-answering, and proposes RADAR, a training-free inference method that leverages intrinsic attention to improve grounding and reduce hallucinations in multimodal large language models.

Yi Liu, Jing Zhang, Di Wang + 3 more | 2026-03-04 | 💻 cs

HiLoRA: Hierarchical Low-Rank Adaptation for Personalized Federated Learning

This paper proposes HiLoRA, a hierarchical Low-Rank Adaptation framework for Federated Learning that leverages a three-tier adapter structure and subspace-based client clustering to effectively capture global, subgroup, and client-specific knowledge, thereby enhancing both personalization and generalization in Vision Transformer models.

Zihao Peng, Nan Zou, Jiandian Zeng + 4 more | 2026-03-04 | 💻 cs

Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology, Radiology, and Natural Language

The paper introduces UNICORN, a unified public benchmark featuring a standardized two-step evaluation framework and a novel aggregate metric to systematically assess the cross-modality and cross-task generalization of medical foundation models across diverse imaging and natural language data from multiple institutions.

Michelle Stegeman, Lena Philipp, Fennie van der Graaf + 19 more | 2026-03-04 | 💻 cs

R3GW: Relightable 3D Gaussians for Outdoor Scenes in the Wild

R3GW introduces a novel method for reconstructing outdoor scenes from unconstrained photo collections by separating the scene into relightable foreground and non-reflective sky components, enabling state-of-the-art physically based relighting and high-quality novel view synthesis under arbitrary illumination conditions.

Margherita Lea Corona, Wieland Morgenstern, Peter Eisert + 1 more | 2026-03-04 | 💻 cs

NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing

This paper presents NOVA, a pair-free video editing framework that combines sparse user-provided keyframe guidance with dense motion and texture synthesis, trained via a degradation-simulation strategy to achieve high edit fidelity and temporal consistency without requiring large-scale paired datasets.

Tianlin Pan, Jiayi Dai, Chenpu Yuan + 7 more | 2026-03-04 | 💻 cs

Structure-Aware Text Recognition for Ancient Greek Critical Editions

This paper addresses the limitations of visual language models in recognizing the complex layouts of Ancient Greek critical editions by introducing a large-scale synthetic corpus and a real-world benchmark, demonstrating that while zero-shot performance lags behind traditional tools, fine-tuned models like Qwen3VL-8B can achieve state-of-the-art accuracy.

Nicolas Angleraud, Antonia Karamolegkou, Benoît Sagot + 1 more | 2026-03-04 | 💻 cs

ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink

The paper introduces ScribeTokens, a fixed-vocabulary tokenization method for digital ink that decomposes pen movements into unit pixel steps, demonstrating superior performance over vector representations in both handwritten text generation and recognition, particularly when enhanced by a novel next-ink-token prediction pretraining strategy.

Douglass Wang | 2026-03-04 | 💻 cs

Scale-invariant Gaussian derivative residual networks

This paper introduces GaussDerResNets, a novel deep learning architecture that combines scale-covariant Gaussian derivative layers with residual skip connections to achieve provable scale invariance and superior generalization to unseen image scales across multiple datasets while maintaining high accuracy and computational efficiency.

Andrzej Perzanowski, Tony Lindeberg | 2026-03-04 | 🤖 cs.LG

Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models

By probing LVLMs with a synthetic directed graph dataset, this study reveals that while node and structural information are linearly encoded early in the vision encoder, edge representations emerge only later in the language model's text tokens, explaining the models' persistent struggles with relational understanding.

Haruto Yoshida, Keito Kudo, Yoichi Aoki + 4 more | 2026-03-04 | 💬 cs.CL

Multimodal-Prior-Guided Importance Sampling for Hierarchical Gaussian Splatting in Sparse-View Novel View Synthesis

This paper introduces a multimodal-prior-guided importance sampling framework for hierarchical 3D Gaussian Splatting that fuses photometric, semantic, and geometric cues to strategically refine sparse-view novel view synthesis, thereby achieving state-of-the-art reconstruction quality while mitigating overfitting and noise.

Kaiqiang Xiong, Zhanke Wang, Ronggang Wang | 2026-03-04 | 💻 cs
