FedARKS: Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration for Person Re-identification

FedARKS is a federated learning framework for person re-identification that overcomes reliance on global features and uniform averaging by introducing Robust Knowledge and Knowledge Selection mechanisms, which capture subtle domain-invariant details and prioritize high-quality client contributions for improved domain generalization.

Xin Xu, Binchang Ma, Zhixi Yu, Wei Liu · 2026-03-09 · cs

Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion

Place-it-R1 is an end-to-end framework that leverages Multimodal Large Language Models (MLLMs) with Chain-of-Thought reasoning to orchestrate video diffusion via a "Think-then-Place" paradigm, ensuring physically consistent and environment-aware video object insertion through iterative refinement and user-controllable plausibility-fidelity trade-offs.

Bohai Gu, Taiyi Wu, Dazhao Du, Jian Liu, Shuai Yang, Xiaotong Zhao, Alan Zhao, Song Guo · 2026-03-09 · cs.AI

Longitudinal NSCLC Treatment Progression via Multimodal Generative Models

This paper introduces a Virtual Treatment (VT) framework that utilizes dose-aware multimodal conditional image-to-image translation, specifically leveraging diffusion-based models, to synthesize plausible longitudinal CT scans of non-small cell lung cancer (NSCLC) tumor evolution under radiotherapy, thereby supporting in-silico treatment monitoring and adaptive radiotherapy research.

Massimiliano Mantegna, Elena Mulero Ayllón, Alice Natalina Caragliano, Francesco Di Feola, Claudia Tacconi, Michele Fiore, Edy Ippolito, Carlo Greco, Sara Ramella, Philippe C. Cattin, Paolo Soda, Matteo Tortora, Valerio Guarrasi · 2026-03-09 · cs

VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models

This paper introduces VLM-RobustBench, a comprehensive benchmark evaluating the robustness of four vision-language model families across 133 corruption settings, revealing that current models are semantically strong but spatially fragile, with low-severity geometric distortions causing significantly larger performance drops than visually severe photometric corruptions.

Rohit Saxena, Alessandro Suglia, Pasquale Minervini · 2026-03-09 · cs.AI

A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement

This paper proposes a semi-supervised framework for breast ultrasound segmentation that leverages training-free, appearance-based prompts in vision-language models to generate structurally consistent pseudo-labels, which are then refined through a dual-teacher mechanism and contrastive learning to achieve fully supervised-level performance with only 2.5% labeled data.

Ruili Li, Jiayi Ding, Ruiyu Li, Yilun Jin, Shiwen Ge, Yuwen Zeng, Xiaoyong Zhang, Eichi Takaya, Jan Vrba, Noriyasu Homma · 2026-03-09 · cs

Making Training-Free Diffusion Segmentors Scale with the Generative Power

This paper addresses the scalability limitations of training-free diffusion segmentors by identifying and bridging two gaps, in attention map aggregation and in token score imbalance, through two proposed techniques, auto aggregation and per-pixel rescaling, thereby enabling better utilization of powerful generative models for semantic segmentation.

Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Longtao Huang, Qingming Huang · 2026-03-09 · cs
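The per-pixel rescaling idea, balancing per-class score maps so that no class dominates a pixel purely by raw scale before the per-pixel argmax, can be sketched generically. The function name `per_pixel_rescale` and the min-max-then-softmax scheme below are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def per_pixel_rescale(score_maps):
    """Illustrative per-pixel rescaling of class score maps.

    score_maps: array of shape (C, H, W), one raw score map per class
    (e.g. aggregated attention). Each class map is first min-max rescaled
    to [0, 1] so classes are comparable in scale, then scores are
    softmax-normalised across classes independently at every pixel.
    """
    lo = score_maps.min(axis=(1, 2), keepdims=True)
    hi = score_maps.max(axis=(1, 2), keepdims=True)
    scaled = (score_maps - lo) / np.maximum(hi - lo, 1e-8)
    # numerically stable softmax over the class axis, per pixel
    e = np.exp(scaled - scaled.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

# A segmentation map is then the argmax over classes at every pixel.
maps = np.random.default_rng(0).normal(size=(3, 4, 4))
probs = per_pixel_rescale(maps)
seg = probs.argmax(axis=0)
```

The per-class min-max step is one simple way to make differently scaled score maps commensurable; the paper's actual rescaling may differ.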

Contrastive-to-Self-Supervised: A Two-Stage Framework for Script Similarity Learning

This paper proposes a two-stage framework that first trains a contrastive encoder on labeled invented alphabets and then uses teacher-student distillation to learn unsupervised, deformation-invariant embeddings for historically attested scripts, effectively bridging supervised discriminative learning with unsupervised discovery of latent cross-script similarities without requiring ground-truth evolutionary relationships.

Claire Roman, Philippe Meyer · 2026-03-09 · cs.AI
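The two-stage pattern this summary describes, contrastive pretraining on labeled pairs followed by teacher-student distillation, can be sketched generically. The NT-Xent contrastive objective and MSE distillation loss below are common illustrative choices, not the paper's actual objectives:

```python
import numpy as np

def ntxent_loss(z1, z2, tau=0.5):
    """Stage 1 (illustrative): NT-Xent contrastive loss over two views.

    z1, z2: (N, d) embeddings of two views of the same N items; row i of
    z1 and row i of z2 form a positive pair, all other rows are negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)          # (2N, d)
    sim = z @ z.T / tau                           # pairwise similarities
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                # exclude self-similarity
    # positive for row i is row i+n (and vice versa)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

def distill_loss(teacher_emb, student_emb):
    """Stage 2 (illustrative): match student embeddings to a frozen teacher."""
    return np.mean((teacher_emb - student_emb) ** 2)
```

In the two-stage pattern, the stage-1 encoder trained with a contrastive loss becomes the frozen teacher, and an unlabeled-data student is fit by minimising the distillation loss against its outputs.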

Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots

This paper introduces the Motion Turing Test framework and the HHMotion dataset to evaluate human-likeness in humanoid robots by analyzing kinematic data, revealing current motion deviations and demonstrating that a specialized baseline model outperforms multimodal large language models in automatically predicting human-likeness scores.

Mingzhe Li, Mengyin Liu, Zekai Wu, Xincheng Lin, Junsheng Zhang, Ming Yan, Zengye Xie, Changwang Zhang, Chenglu Wen, Lan Xu, Siqi Shen, Cheng Wang · 2026-03-09 · cs

CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

This paper introduces CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that leverages patient context, guideline-based severity weighting, and a comprehensive error taxonomy to achieve superior alignment with radiologist judgments compared to existing metrics.

Mohammed Baharoon, Thibault Heintz, Siavash Raissi, Mahmoud Alabbad, Mona Alhammad, Hassan AlOmaish, Sung Eun Kim, Oishi Banerjee, Pranav Rajpurkar · 2026-03-09 · cs.AI