cs.CV papers | Gist.Science

Adaptive Reinforcement for Open-ended Medical Reasoning via Semantic-Guided Reward Collapse Mitigation

This paper introduces ARMed, a novel reinforcement learning framework that mitigates reward collapse through adaptive semantic rewards and chain-of-thought supervision to significantly enhance open-ended medical reasoning in vision-language models.

Yizhou Liu, Dingkang Yang, Zizhi Chen + 5 more | 2026-03-03 | 💻 cs

Disentangled Multi-modal Learning of Histology and Transcriptomics for Cancer Characterization

This paper proposes a disentangled multi-modal learning framework that addresses heterogeneity, multi-scale integration, and data dependency challenges in cancer characterization by decomposing histology and transcriptomics into tumor and microenvironment subspaces, aligning signals across magnifications, enabling transcriptome-agnostic inference, and aggregating informative tokens to outperform state-of-the-art methods in diagnosis, prognosis, and survival prediction.

Yupei Zhang, Xiaofei Wang, Anran Liu + 2 more | 2026-03-03 | ⚡ eess

Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution

This paper proposes TADSR, a time-aware one-step diffusion network that enhances real-world image super-resolution by introducing a time-aware VAE encoder and a time-aware VSD loss to fully leverage the generative priors of pre-trained stable diffusion models across different timesteps, achieving state-of-the-art performance with controllable fidelity-realism trade-offs in a single step.

Tianyi Zhang, Zheng-Peng Duan, Peng-Tao Jiang + 4 more | 2026-03-03 | ⚡ eess

FastAvatar: Towards Unified and Fast 3D Avatar Reconstruction with Large Gaussian Reconstruction Transformers

FastAvatar introduces a unified, feedforward framework leveraging a Large Gaussian Reconstruction Transformer to rapidly reconstruct high-quality, animatable 3D Gaussian avatars from diverse daily recordings within seconds, enabling incremental quality improvement through flexible data utilization.

Yue Wu, Xuanhong Chen, Yufan Wu + 3 more | 2026-03-03 | 💻 cs

Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection

The paper proposes Gradient-based Influence-Aware Constrained Decoding (GACD), a finetuning-free inference method that mitigates multimodal hallucinations in large language models by using first-order Taylor gradients to estimate and suppress spurious visual-text correlations while rebalancing cross-modal contributions.

Shan Wang, Maying Shen, Nadine Chang + 3 more | 2026-03-03 | 💬 cs.CL

RTGMFF: Enhanced fMRI-based Brain Disorder Diagnosis via ROI-driven Text Generation and Multimodal Feature Fusion

The paper introduces RTGMFF, a novel multimodal framework that enhances fMRI-based brain disorder diagnosis by integrating deterministic ROI-driven text generation with a hybrid frequency-spatial encoder and adaptive semantic alignment to overcome signal noise and inter-subject variability, achieving superior performance on ADHD-200 and ABIDE benchmarks.

Junhao Jia, Yifei Sun, Yunyou Liu + 5 more | 2026-03-03 | 💻 cs

Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

This paper introduces T2I-CoReBench, a comprehensive benchmark featuring 1,080 complex prompts and a 12-dimensional taxonomy to rigorously evaluate text-to-image models' composition and reasoning capabilities, revealing that models struggle with high-density composition and that implicit reasoning remains a critical bottleneck.

Ouxiang Li, Yuan Wang, Xinting Hu + 7 more | 2026-03-03 | 💻 cs

UniView: Enhancing Novel View Synthesis From A Single Image By Unifying Reference Features

UniView addresses the ill-posed nature of single-image novel view synthesis by leveraging a multimodal large language model to retrieve similar reference images and integrating their features through a plug-and-play adapter with a decoupled triple attention mechanism, thereby significantly reducing distortions and outperforming state-of-the-art methods.

Haowang Cui, Rui Chen, Jiaze Wang + 2 more | 2026-03-03 | 💻 cs

Improved 3D Scene Stylization via Text-Guided Generative Image Editing with Region-Based Control

This paper presents an improved 3D scene stylization framework that leverages text-guided generative image editing with a reference-based attention mechanism and multi-depth view generation to ensure high-quality, view-consistent results, while introducing a novel region-controlled loss function for applying distinct styles to specific semantic areas within a scene.

Haruo Fujiwara, Yusuke Mukuta, Tatsuya Harada | 2026-03-03 | 💻 cs

LADB: Latent Aligned Diffusion Bridges for Semi-Supervised Domain Translation

The paper proposes Latent Aligned Diffusion Bridges (LADB), a semi-supervised framework that aligns source and target distributions in a shared latent space to enable high-fidelity, controllable domain translation using partially paired data, thereby overcoming the data scarcity and annotation costs associated with traditional diffusion models.

Xuqin Wang, Tao Wu, Yanfeng Zhang + 6 more | 2026-03-03 | 💻 cs

TrueSkin: Towards Fair and Accurate Skin Tone Recognition and Generation

This paper introduces TrueSkin, a comprehensive dataset of 7,299 images across six skin tone classes, to benchmark and improve the fairness and accuracy of existing large multimodal and generative models, which currently struggle with systematic biases in skin tone recognition and synthesis.

Haoming Lu | 2026-03-03 | 💻 cs

BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching

This paper proposes BWCache, a training-free method that accelerates DiT-based video generation by dynamically caching and reusing block features across diffusion timesteps based on a similarity threshold, achieving up to a 6$\times$ speedup while maintaining visual fidelity.

Hanshuai Cui, Zhiqing Tang, Zhifei Xu + 3 more | 2026-03-03 | 🤖 cs.AI
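
As a rough illustration of the threshold-based block caching described in the BWCache summary above, the sketch below wraps a single block and reuses its last output whenever the new input is close enough to the input it was last computed on. This is a toy under stated assumptions, not the authors' implementation: the class `CachedBlock`, the helper `relative_change`, the choice to compare inputs (rather than block outputs), and the L1-relative similarity measure are all hypothetical.

```python
from typing import Callable, List, Optional

Vector = List[float]


def relative_change(a: Vector, b: Vector) -> float:
    """L1 distance between a and b, divided by the L1 norm of b (hypothetical metric)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) or 1e-12  # guard against division by zero
    return num / den


class CachedBlock:
    """Wraps a block function and skips recomputation when the input barely changed."""

    def __init__(self, fn: Callable[[Vector], Vector], threshold: float) -> None:
        self.fn = fn
        self.threshold = threshold
        self.last_input: Optional[Vector] = None
        self.last_output: Optional[Vector] = None
        self.computed = 0  # number of real forward passes
        self.reused = 0    # number of cache hits

    def __call__(self, x: Vector) -> Vector:
        # Reuse the cached output if the input is within the similarity threshold
        # of the input that produced it.
        if (
            self.last_input is not None
            and relative_change(x, self.last_input) < self.threshold
        ):
            self.reused += 1
            return self.last_output
        # Otherwise run the block and refresh the cache.
        self.last_input = list(x)
        self.last_output = self.fn(x)
        self.computed += 1
        return self.last_output


# Toy "block" that doubles every feature; three consecutive "timesteps".
block = CachedBlock(lambda v: [2.0 * t for t in v], threshold=0.1)
out_first = block([1.0, 1.0])    # first call: always computed
out_second = block([1.01, 1.0])  # tiny change: cached output is reused
out_third = block([2.0, 2.0])    # large change: recomputed
```

A lower threshold in this sketch trades speed for fidelity: fewer calls are served from the cache, so the outputs track the true block more closely.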

Brain-HGCN: A Hyperbolic Graph Convolutional Network for Brain Functional Network Analysis

The paper proposes Brain-HGCN, a hyperbolic graph convolutional network that leverages negatively curved space and a signed aggregation mechanism to accurately model the hierarchical topology of brain functional networks, achieving superior performance in psychiatric disorder classification compared to standard Euclidean methods.

Junhao Jia, Yunyou Liu, Cheng Yang + 4 more | 2026-03-03 | 💻 cs

Person Identification from Egocentric Human-Object Interactions using 3D Hand Pose

This paper introduces I2S, a lightweight, real-time framework that achieves state-of-the-art user identification (97.52% F1-score) in AR-based security systems by analyzing 3D hand poses and human-object interactions through a novel multi-stage feature extraction process.

Muhammad Hamza, Danish Hamid, Muhammad Tahir Akram | 2026-03-03 | 🤖 cs.LG

Geodesic Prototype Matching via Diffusion Maps for Interpretable Fine-Grained Recognition

This paper introduces GeoProto, a novel prototype-based recognition framework that leverages diffusion maps and differentiable Nyström interpolation to model the intrinsic nonlinear geometry of deep features, thereby significantly improving the interpretability and accuracy of fine-grained classification compared to traditional Euclidean methods.

Junhao Jia, Yunyou Liu, Yifei Sun + 4 more | 2026-03-03 | 💻 cs

Does FLUX Already Know How to Perform Physically Plausible Image Composition?

The paper proposes SHINE, a training-free framework that leverages pretrained diffusion models like FLUX to achieve physically plausible, high-fidelity image composition with accurate lighting and reflections, while introducing the ComplexCompo benchmark to rigorously evaluate performance in challenging scenarios.

Shilin Lu, Zhuming Lian, Zihan Zhou + 3 more | 2026-03-03 | 🤖 cs.AI

QuadGPT: Native Quadrilateral Mesh Generation with Autoregressive Models

This paper introduces QuadGPT, the first end-to-end autoregressive framework that generates native quadrilateral meshes with superior geometric and topological quality by employing a unified tokenization method for mixed topologies and a specialized Reinforcement Learning fine-tuning strategy.

Jian Liu, Chunshi Wang, Song Guo + 9 more | 2026-03-03 | 💻 cs

DistillKac: Few-Step Image Generation via Damped Wave Equations

DistillKac is a fast, few-step image generation framework that leverages damped wave equations and stochastic Kac representation to enforce finite-speed probability transport, enabling stable classifier-free guidance and efficient endpoint-only distillation for high-quality sample synthesis.

Weiqiao Han, Chenlin Meng, Christopher D. Manning + 1 more | 2026-03-03 | 📊 stat

Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach

This paper addresses the limitations of existing visual emotion evaluation methods for Multimodal Large Language Models (MLLMs) by proposing an open-vocabulary, automated Emotion Statement Judgment framework that reveals current models' strengths in context-based interpretation but highlights significant gaps in understanding subjective perception compared to humans.

Daiqing Wu, Dongbao Yang, Sicheng Zhao + 2 more | 2026-03-03 | 💻 cs

COMPASS: Robust Feature Conformal Prediction for Medical Segmentation Metrics

The paper introduces COMPASS, a robust framework that generates efficient and valid conformal prediction intervals for medical segmentation metrics by calibrating directly in the model's feature space rather than treating the segmentation-to-metric pipeline as a black box.

Matt Y. Cheung, Ashok Veeraraghavan, Guha Balakrishnan | 2026-03-03 | ⚡ eess
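
For readers unfamiliar with conformal prediction intervals, the sketch below shows the standard split conformal baseline for a scalar metric, which treats the predictor as a black box. It is not the COMPASS method, which the summary above says calibrates in the model's feature space instead. The function names and the toy calibration data are hypothetical.

```python
import math
from typing import List, Tuple


def conformal_quantile(scores: List[float], alpha: float) -> float:
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, the calibration set is too small for the requested
    coverage and the only valid interval is unbounded.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def split_conformal_interval(
    pred: float,
    cal_preds: List[float],
    cal_truths: List[float],
    alpha: float = 0.1,
) -> Tuple[float, float]:
    """Symmetric interval around pred with target marginal coverage 1 - alpha."""
    # Nonconformity score: absolute error of the predicted metric on held-out data.
    scores = [abs(p - t) for p, t in zip(cal_preds, cal_truths)]
    q = conformal_quantile(scores, alpha)
    return pred - q, pred + q


# Toy calibration set: predicted vs. true metric values (made-up numbers).
cal_preds = [10, 20, 30, 40, 50, 60, 70, 80, 90]
cal_truths = [11, 22, 33, 44, 55, 66, 77, 88, 99]

lo, hi = split_conformal_interval(100, cal_preds, cal_truths, alpha=0.5)
wide_lo, wide_hi = split_conformal_interval(100, cal_preds, cal_truths, alpha=0.05)
```

With only nine calibration points, the 95% request in the last line cannot be met by a finite interval, which is why the quantile helper falls back to infinity.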
