cs.CV papers | Gist.Science

Quality-Aware Robust Multi-View Clustering for Heterogeneous Observation Noise

The paper proposes Quality-Aware Robust Multi-View Clustering (QARMVC), a novel framework that addresses heterogeneous observation noise by leveraging reconstruction discrepancies to generate instance-level quality scores, which then guide a hierarchical learning strategy to adaptively suppress noise and construct a robust global consensus.

Peihan Wu, Guanjie Cheng, Yufei Tong + 2 more | 2026-02-27 | 🤖 cs.AI

Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

This paper exposes a critical evaluation pitfall where common human preference models are biased toward large guidance scales, leading to inflated scores despite degraded image quality, and proposes a novel guidance-aware evaluation framework (GA-Eval) alongside a new method (TDG) to demonstrate that many recent diffusion guidance improvements are illusory and that simply increasing CFG scales often outperforms them in practice.

Dian Xie, Shitong Shao, Lichen Bai + 5 more | 2026-02-27 | 🤖 cs.AI
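
For readers unfamiliar with the term, the "guidance scale" in the summary above is the weight applied in classifier-free guidance (CFG). The sketch below shows only the standard CFG combination step on toy vectors; it is not the paper's GA-Eval framework or its TDG method, and the function name `cfg_combine` and the sample values are illustrative assumptions.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance combination.

    eps_uncond / eps_cond are the model's noise predictions without and with
    the text condition (toy lists of floats here). scale is the guidance
    scale: 0 ignores the prompt, 1 is plain conditional sampling, and larger
    values extrapolate further along the (cond - uncond) direction.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    # Hypothetical predictions, chosen only to make the arithmetic visible.
    eps_uncond = [0.0, 1.0, -1.0]
    eps_cond = [1.0, 1.0, 0.0]
    for scale in (0.0, 1.0, 3.0, 7.5):
        print(scale, cfg_combine(eps_uncond, eps_cond, scale))
```

Because the output grows linearly with `scale`, a metric that rewards strong prompt adherence can be raised simply by increasing it, which is the kind of bias the summary describes.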

GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views

GIFSplat introduces a purely feed-forward, iterative refinement framework for 3D Gaussian Splatting from sparse unposed views that distills a frozen diffusion prior into Gaussian-level cues to achieve state-of-the-art reconstruction quality with inference times on the order of seconds, eliminating the need for camera poses or test-time optimization.

Tianyu Chen, Wei Xiang, Kang Han + 4 more | 2026-02-27 | 💻 cs

Causal Motion Diffusion Models for Autoregressive Motion Generation

This paper introduces Causal Motion Diffusion Models (CMDM), a unified framework that combines a Motion-Language-Aligned Causal VAE with an autoregressive diffusion transformer to achieve high-quality, real-time, and temporally consistent motion generation while overcoming the limitations of existing bidirectional and unstable autoregressive approaches.

Qing Yu, Akihisa Watanabe, Kent Fujiwara | 2026-02-27 | 💻 cs

BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model

BetterScene enhances novel view synthesis for sparse, unconstrained real-world photos by integrating a feed-forward 3D Gaussian Splatting model with a Stable Video Diffusion backbone that is fine-tuned via temporal equivariance regularization and vision foundation model-aligned representations within its VAE module to produce consistent, artifact-free views.

Yuci Han, Charles Toth, John E. Anderson + 2 more | 2026-02-27 | 🤖 cs.AI

$\phi$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models

This paper introduces $\phi$-DPO, a novel Fairness Direct Preference Optimization framework for Large Multimodal Models that mitigates both catastrophic forgetting and data imbalance-induced bias through a new loss function and pairwise preference alignment, achieving state-of-the-art performance on continual learning benchmarks.

Thanh-Dat Truong, Huu-Thien Tran, Jackson Cothren + 2 more | 2026-02-27 | 🤖 cs.LG

LoR-LUT: Learning Compact 3D Lookup Tables via Low-Rank Residuals

The paper presents LoR-LUT, a unified low-rank formulation that generates compact, interpretable 3D lookup tables by combining basis LUTs with low-rank residual corrections to achieve expert-level image retouching with high perceptual fidelity and a sub-megabyte model size.

Ziqi Zhao, Abhijit Mishra, Shounak Roychowdhury | 2026-02-27 | 💻 cs

DP-aware AdaLN-Zero: Taming Conditioning-Induced Heavy-Tailed Gradients in Differentially Private Diffusion

This paper proposes DP-aware AdaLN-Zero, a sensitivity-aware conditioning mechanism for diffusion transformers that mitigates heavy-tailed gradients caused by heterogeneous contexts, thereby reducing clipping bias and improving utility in differentially private time-series tasks without compromising standard performance.

Tao Huang, Jiayang Meng, Xu Yang + 2 more | 2026-02-27 | 🤖 cs.LG
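
To illustrate why heavy-tailed gradients matter under differential privacy, here is a minimal sketch of the generic DP-SGD aggregation step (per-sample L2 clipping followed by Gaussian noise). It does not implement DP-aware AdaLN-Zero; the function names, the toy gradients, and the parameter values are assumptions chosen for illustration.

```python
import math
import random


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def clip_gradient(grad, clip_norm):
    """Scale grad down so its L2 norm is at most clip_norm."""
    norm = l2_norm(grad)
    factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def private_mean_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Generic DP-SGD step: clip each sample, sum, add noise, average."""
    dim = len(per_sample_grads[0])
    clipped = [clip_gradient(g, clip_norm) for g in per_sample_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(per_sample_grads) for x in noisy]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical batch: three ordinary gradients and one large outlier.
    grads = [[0.3, 0.4], [0.2, -0.1], [-0.4, 0.3], [30.0, 40.0]]
    for g in grads:
        print(g, "->", clip_gradient(g, clip_norm=1.0))
    print(private_mean_gradient(grads, 1.0, 0.5, rng))
```

The outlier is shrunk by a factor of 50 while the other gradients pass through unchanged, so the clipped average no longer matches the true average. That mismatch is the clipping bias the summary refers to, and it grows when large-norm gradients are frequent.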

Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery

The paper introduces SATtxt, a vision-language foundation model that distills multi-spectral knowledge into an RGB-only inference framework aligned with instruction-augmented LLMs, thereby achieving superior zero-shot classification, retrieval, and linear probing performance on satellite imagery benchmarks.

Minh Kha Do, Wei Xiang, Kang Han + 5 more | 2026-02-27 | 💻 cs

Coded-E2LF: Coded Aperture Light Field Imaging from Events

This paper presents Coded-E2LF, a purely event-based computational imaging method that uses a coded aperture to reconstruct 4-D light fields with pixel-level accuracy from a stationary event-only camera, the first demonstration of such reconstruction without relying on intensity images.

Tomoya Tsuchida, Keita Takahashi, Chihiro Tsutake + 2 more | 2026-02-27 | 💻 cs

CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection

This paper introduces CGSA, a novel source-free domain adaptive object detection framework that integrates object-centric learning with a DETR-based detector by employing hierarchical slot awareness and class-guided contrastive learning to achieve superior cross-domain adaptation without requiring source data.

Boyang Dai, Zeng Fan, Zihao Qi + 2 more | 2026-02-27 | 🤖 cs.AI

Instruction-based Image Editing with Planning, Reasoning, and Generation

This paper proposes a novel framework for instruction-based image editing that bridges understanding and generation by leveraging a multi-modality chain-of-thought approach to separately handle planning, editing region reasoning, and hint-guided generation, thereby achieving superior performance on complex real-world images compared to prior single-modality methods.

Liya Ji, Chenyang Qi, Qifeng Chen | 2026-02-27 | 🤖 cs.AI

CRAG: Can 3D Generative Models Help 3D Assembly?

The paper proposes CRAG, a novel framework that reformulates 3D assembly as a joint generation and pose estimation task to simultaneously synthesize missing geometry and predict part poses, thereby achieving state-of-the-art performance on complex, incomplete objects by leveraging the mutual reinforcement between structural reasoning and holistic shape inference.

Zeyu Jiang, Sihang Li, Siqi Tan + 8 more | 2026-02-27 | 💻 cs

QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition

This paper challenges the notion that quadrifocal tensors are impractical by introducing a novel synchronization framework based on Tucker decomposition and joint optimization with lower-order tensors, enabling the effective recovery of multiple camera views from higher-order geometric constraints.

Daniel Miao, Gilad Lerman, Joe Kileel | 2026-02-27 | 🔢 math

Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models

This paper proposes a plug-and-play Multimodal Weight Allocation Module (MWAM) that utilizes a Frequency Ratio Metric to quantify and dynamically rebalance modality contributions during training, thereby mitigating catastrophic performance degradation caused by missing modalities and enhancing robustness across diverse multimodal architectures.

Siqi Lu, Wanying Xu, Yongbin Zheng + 3 more | 2026-02-27 | 💻 cs

Interactive Medical-SAM2 GUI: A Napari-based semi-automatic annotation tool for medical images

This paper introduces Interactive Medical-SAM2 GUI, an open-source, Napari-based desktop application that streamlines semi-automatic 3D medical image annotation by integrating SAM2-style mask propagation with interactive box and point prompting to enable efficient, cohort-oriented workflows for research.

Woojae Hong, Jong Ha Hwang, Jiyong Chung + 3 more | 2026-02-27 | 💻 cs

Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing

This paper addresses the limitations of existing small-scale audio-visual quality assessment datasets by proposing a practical crowdsourcing framework and systematic data preparation strategy to construct YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, which features 1,620 user-generated sequences with rich annotations for multimodal perception research.

Renyu Yang, Jian Jin, Lili Meng + 4 more | 2026-02-27 | 💻 cs

Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

This paper introduces a novel framework for monocular open-vocabulary 3D occupancy prediction in indoor scenes that leverages geometry-only supervision and 3D Language-Embedded Gaussians, enhanced by an opacity-aware Poisson-based aggregation operator and a progressive temperature decay schedule to overcome feature mixing and convergence challenges, thereby achieving state-of-the-art performance on the Occ-ScanNet benchmark.

Changqing Zhou, Yueru Luo, Han Zhang + 2 more | 2026-02-27 | 💻 cs

SPMamba-YOLO: An Underwater Object Detection Network Based on Multi-Scale Feature Enhancement and Global Context Modeling

This paper proposes SPMamba-YOLO, a novel underwater object detection network that integrates a Spatial Pyramid Pooling Enhanced Layer Aggregation Network (SPPELAN), a Pyramid Split Attention (PSA) mechanism, and a Mamba-based state space modeling module to effectively address challenges like light attenuation and small targets, achieving a 4.9% mAP@0.5 improvement over YOLOv8n on the URPC2022 dataset.

Guanghao Liao, Zhen Liu, Liyuan Cao + 2 more | 2026-02-27 | 💻 cs
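
As background for the mAP@0.5 figure quoted above: the "0.5" is the intersection-over-union (IoU) threshold at which a predicted box counts as matching a ground-truth box. The sketch below shows only that matching test, assuming boxes given as `(x1, y1, x2, y2)` corner coordinates; the function names are illustrative, and a full mAP computation additionally involves class labels, confidence ranking, and one-to-one matching.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches_at_threshold(pred_box, gt_box, threshold=0.5):
    """True if the prediction overlaps the ground truth enough to count."""
    return iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    gt = (0.0, 0.0, 2.0, 2.0)
    print(iou(gt, (1.0, 0.0, 3.0, 2.0)))                   # half-shifted box
    print(matches_at_threshold((0.0, 0.0, 2.0, 3.0), gt))  # slightly taller box
```

Small targets are hard under this criterion because a shift of a few pixels removes a large fraction of the overlap.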

ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

This paper introduces ViCLIP-OT, a novel foundation vision-language model that combines CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport loss to achieve state-of-the-art performance in Vietnamese image-text retrieval across both in-domain and zero-shot settings.

Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham | 2026-02-27 | 🤖 cs.AI

