Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning

Pix2Key is a novel composed image retrieval framework that utilizes semantic decomposition and self-supervised visual dictionary learning to represent queries and candidates as open-vocabulary dictionaries, thereby achieving superior intent-aware matching and diversity-aware reranking without relying on supervised triplets.

Guoyizhe Wei, Yang Jiao, Nan Xi + 4 more · 2026-02-27 · 💻 cs
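
Matching queries and candidates represented as open-vocabulary dictionaries can be pictured as similarity between weighted term sets. A minimal sketch (not the paper's actual matcher; the dictionary format and weighting here are assumptions) using cosine similarity over weighted word dictionaries:

```python
import math

def dict_similarity(query_dict, cand_dict):
    """Cosine similarity between two weighted word dictionaries --
    a simplified stand-in for open-vocabulary dictionary matching."""
    dot = sum(w * cand_dict.get(k, 0.0) for k, w in query_dict.items())
    nq = math.sqrt(sum(w * w for w in query_dict.values()))
    nc = math.sqrt(sum(w * w for w in cand_dict.values()))
    return dot / (nq * nc) if nq and nc else 0.0
```

A candidate sharing no terms with the query scores 0.0; an identical dictionary scores 1.0, which is the intuition behind intent-aware ranking over such representations.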

HARU-Net: Hybrid Attention Residual U-Net for Edge-Preserving Denoising in Cone-Beam Computed Tomography

This paper introduces HARU-Net, a novel Hybrid Attention Residual U-Net architecture that integrates hybrid attention transformers and residual learning to effectively denoise low-dose Cone-Beam Computed Tomography (CBCT) images while preserving critical anatomical edges, outperforming state-of-the-art methods in both image quality metrics and computational efficiency.

Khuram Naveed, Ruben Pauwels · 2026-02-27 · ⚡ eess
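
The core idea of combining attention with residual learning can be illustrated with a toy channel-attention residual block. This is a generic sketch of the pattern, not HARU-Net's actual block (the gating weights `w` and the global-average-pool gating are assumptions):

```python
import numpy as np

def channel_attention_residual(x, w):
    """Toy residual block with channel attention:
    gate each channel by a sigmoid of its global-average statistic,
    then add the identity skip connection.
    x: feature map of shape (C, H, W); w: (C, C) gating weights."""
    pooled = x.mean(axis=(1, 2))                 # global average pool -> (C,)
    gate = 1.0 / (1.0 + np.exp(-(w @ pooled)))   # sigmoid channel weights in (0, 1)
    return x + gate[:, None, None] * x           # skip connection + gated path
```

The skip connection lets the block pass edges through unchanged while the attention gate decides per channel how much of the transformed (here, identity) path to add, which is the mechanism behind edge-preserving denoising designs.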

DisQ-HNet: A Disentangled Quantized Half-UNet for Interpretable Multimodal Image Synthesis: Applications to Tau-PET Synthesis from T1 and FLAIR MRI

DisQ-HNet is a novel, interpretable framework that synthesizes tau-PET images from T1 and FLAIR MRI by employing a Partial Information Decomposition-guided vector-quantized encoder and a Half-UNet decoder to disentangle modality contributions while preserving anatomical details and disease-relevant signals for Alzheimer's disease analysis.

Agamdeep S. Chopra, Caitlin Neher, Tianyi Ren + 2 more · 2026-02-27 · 🤖 cs.AI
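
The vector-quantized encoder at the heart of such designs maps each latent vector to its nearest entry in a learned codebook. A minimal sketch of that lookup step (the shapes and L2 metric are the standard VQ-VAE convention, not details confirmed by this paper):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each latent vector to its nearest codebook entry (L2 distance).
    z: (N, D) latent vectors; codebook: (K, D) learned entries.
    Returns the quantized vectors and their codebook indices."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
    idx = d.argmin(axis=1)                                     # nearest entry per vector
    return codebook[idx], idx
```

Because every latent is replaced by a discrete codebook index, the representation becomes inspectable, which is what makes quantized encoders attractive for interpretable synthesis.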

DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation

DrivePTS is a progressive learning framework that enhances autonomous driving scene generation by mitigating interdependencies among geometric conditions, enriching semantic context through multi-view hierarchical text descriptions, and improving structural fidelity via a frequency-guided loss, thereby achieving state-of-the-art realism and controllability.

Zhechao Wang, Yiming Zeng, Lufan Ma + 4 more · 2026-02-27 · 🤖 cs.AI
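
A frequency-guided loss typically compares images in the Fourier domain so that high-frequency structure (edges, fine geometry) is penalized explicitly. A minimal sketch, assuming a simple L1 distance between 2D FFT magnitude spectra (the paper's exact formulation is not specified here):

```python
import numpy as np

def frequency_loss(pred, target):
    """Sketch of a frequency-domain loss: mean L1 distance between
    the magnitude spectra of prediction and target (2D FFT)."""
    fp = np.abs(np.fft.fft2(pred))    # magnitude spectrum of prediction
    ft = np.abs(np.fft.fft2(target))  # magnitude spectrum of target
    return np.abs(fp - ft).mean()
```

Unlike a pixel-space L1 loss, this term reacts to blurred or missing structure even when per-pixel averages look similar, which is why frequency guidance helps structural fidelity.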

Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

This paper exposes a critical evaluation pitfall: common human preference models are biased toward large guidance scales, producing inflated scores even when image quality degrades. It proposes a guidance-aware evaluation framework (GA-Eval) together with a new method (TDG), and shows that many recent diffusion-guidance improvements are illusory: simply increasing the CFG scale often outperforms them in practice.

Dian Xie, Shitong Shao, Lichen Bai + 5 more · 2026-02-27 · 🤖 cs.AI
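
The CFG scale at issue here is the weight in the standard classifier-free guidance update, which extrapolates from the unconditional noise prediction toward the conditional one. A minimal sketch of that combination (standard CFG, not the paper's TDG method):

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by `guidance_scale`.
    scale = 1 recovers the conditional prediction; larger scales
    push the sample further toward the condition."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Large scales strengthen prompt adherence but are known to cause over-saturation and artifacts, which is exactly the regime where the paper argues biased preference models still reward the output.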

BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model

BetterScene enhances novel view synthesis for sparse, unconstrained real-world photos by integrating a feed-forward 3D Gaussian Splatting model with a Stable Video Diffusion backbone that is fine-tuned via temporal equivariance regularization and vision foundation model-aligned representations within its VAE module to produce consistent, artifact-free views.

Yuci Han, Charles Toth, John E. Anderson + 2 more · 2026-02-27 · 🤖 cs.AI

ϕ-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models

This paper introduces ϕ-DPO, a novel Fairness Direct Preference Optimization framework for Large Multimodal Models that mitigates both catastrophic forgetting and data imbalance-induced bias through a new loss function and pairwise preference alignment, achieving state-of-the-art performance in continual learning benchmarks.

Thanh-Dat Truong, Huu-Thien Tran, Jackson Cothren + 2 more · 2026-02-27 · 🤖 cs.LG
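
ϕ-DPO builds on the standard DPO objective, which scores a chosen/rejected pair by the policy's log-probability margin relative to a reference model. A sketch of the vanilla DPO loss for one pair (the paper's fairness-aware modification is not reproduced here):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair:
    logp_w / logp_l are the policy log-probs of the chosen/rejected
    responses; ref_logp_* are the frozen reference model's log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy matches the reference the margin is zero and the loss is ln 2; widening the chosen-over-rejected margin drives the loss toward zero. Fairness variants such as ϕ-DPO reweight or regularize this objective so that imbalanced preference data does not bias the update.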