ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving

This paper presents ImagiDrive, a unified end-to-end autonomous driving framework that synergistically integrates a Vision-Language Model-based driving agent with a Driving World Model-based scene imaginer to iteratively refine planning decisions through a closed-loop imagination-and-planning process, demonstrating superior robustness and performance on nuScenes and NAVSIM datasets.

Jingyu Li, Bozhou Zhang, Xin Jin + 3 more · 2026-03-03 · 💻 cs

CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models

This paper introduces CineTrans, a novel framework that leverages a newly constructed Cine250K dataset and a training-free, mask-based control mechanism derived from attention map analysis to generate coherent, cinematic multi-shot videos with stable, film-style transitions, significantly outperforming existing baselines in transition control and temporal consistency.

Xiaoxue Wu, Bingjie Gao, Yu Qiao + 2 more · 2026-03-03 · 💻 cs

MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding

This paper introduces MOON, the first generative Multimodal Large Language Model designed for e-commerce product understanding, which leverages guided Mixture-of-Experts, semantic region detection, and specialized negative sampling to overcome existing alignment and noise challenges while establishing a new large-scale benchmark for evaluation.

Daoze Zhang, Chenghan Fu, Zhanheng Nie + 7 more · 2026-03-03 · 🤖 cs.AI

Disentangled Multi-modal Learning of Histology and Transcriptomics for Cancer Characterization

This paper proposes a disentangled multi-modal learning framework that addresses heterogeneity, multi-scale integration, and data dependency challenges in cancer characterization by decomposing histology and transcriptomics into tumor and microenvironment subspaces, aligning signals across magnifications, enabling transcriptome-agnostic inference, and aggregating informative tokens to outperform state-of-the-art methods in diagnosis, prognosis, and survival prediction.

Yupei Zhang, Xiaofei Wang, Anran Liu + 2 more · 2026-03-03 · ⚡ eess

Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution

This paper proposes TADSR, a time-aware one-step diffusion network that enhances real-world image super-resolution by introducing a time-aware VAE encoder and a time-aware VSD loss to fully leverage the generative priors of pre-trained stable diffusion models across different timesteps, achieving state-of-the-art performance with controllable fidelity-realism trade-offs in a single step.

Tianyi Zhang, Zheng-Peng Duan, Peng-Tao Jiang + 4 more · 2026-03-03 · ⚡ eess

RTGMFF: Enhanced fMRI-based Brain Disorder Diagnosis via ROI-driven Text Generation and Multimodal Feature Fusion

The paper introduces RTGMFF, a novel multimodal framework that enhances fMRI-based brain disorder diagnosis by integrating deterministic ROI-driven text generation with a hybrid frequency-spatial encoder and adaptive semantic alignment to overcome signal noise and inter-subject variability, achieving superior performance on ADHD-200 and ABIDE benchmarks.

Junhao Jia, Yifei Sun, Yunyou Liu + 5 more · 2026-03-03 · 💻 cs

Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

This paper introduces T2I-CoReBench, a comprehensive benchmark featuring 1,080 complex prompts and a 12-dimensional taxonomy to rigorously evaluate text-to-image models' composition and reasoning capabilities, revealing that models struggle with high-density composition and that implicit reasoning remains an even more critical bottleneck.

Ouxiang Li, Yuan Wang, Xinting Hu + 7 more2026-03-03💻 cs

UniView: Enhancing Novel View Synthesis From A Single Image By Unifying Reference Features

UniView addresses the ill-posed nature of single-image novel view synthesis by leveraging a multimodal large language model to retrieve similar reference images and integrating their features through a plug-and-play adapter with a decoupled triple attention mechanism, thereby significantly reducing distortions and outperforming state-of-the-art methods.

Haowang Cui, Rui Chen, Jiaze Wang + 2 more · 2026-03-03 · 💻 cs

Improved 3D Scene Stylization via Text-Guided Generative Image Editing with Region-Based Control

This paper presents an improved 3D scene stylization framework that leverages text-guided generative image editing with a reference-based attention mechanism and multi-depth view generation to ensure high-quality, view-consistent results, while introducing a novel region-controlled loss function for applying distinct styles to specific semantic areas within a scene.

Haruo Fujiwara, Yusuke Mukuta, Tatsuya Harada · 2026-03-03 · 💻 cs