AdaGen: Learning Adaptive Policy for Image Synthesis

AdaGen introduces a general, learnable framework that employs reinforcement learning with an adversarial reward to dynamically adapt step-specific parameters during iterative image synthesis, thereby overcoming the limitations of static, manually designed schedules and achieving superior performance across diverse generative models with reduced inference costs.

Zanlin Ni, Yulin Wang, Yeguo Hua, Renping Zhou, Jiayi Guo, Jun Song, Bo Zheng, Gao Huang · 2026-03-10 · cs

TrajPred: Trajectory-Conditioned Joint Embedding Prediction for Surgical Instrument-Tissue Interaction Recognition in Vision-Language Models

TrajPred is a novel framework that enhances surgical instrument-tissue interaction recognition in vision-language models by encoding instrument trajectories to capture temporal motion cues and generating fine-grained visual semantic embeddings, thereby significantly improving performance and vision-text alignment on the CholecT50 benchmark.

Jiajun Cheng, Xiaofan Yu, Subarna, Sainan Liu, Shan Lin · 2026-03-10 · cs

OV-DEIM: Real-time DETR-Style Open-Vocabulary Object Detection with GridSynthetic Augmentation

This paper presents OV-DEIM, a real-time end-to-end DETR-style open-vocabulary object detector that combines the DEIMv2 framework with a query supplement strategy and a novel GridSynthetic data augmentation technique to achieve state-of-the-art performance and efficiency, particularly for rare categories.

Leilei Wang, Longfei Liu, Xi Shen, Xuanlong Yu, Ying Tiffany He, Fei Richard Yu, Yingyi Chen · 2026-03-10 · cs

Two Frames Matter: A Temporal Attack for Text-to-Video Model Jailbreaking

This paper introduces TFM, a temporal attack framework that exploits the vulnerability of text-to-video models to sparse boundary conditions: by supplying only start and end frames and implicitly substituting sensitive cues, it induces the generation of harmful content, bypassing existing safety filters and significantly increasing jailbreak success rates.

Moyang Chen, Zonghao Ying, Wenzhuo Xu, Quancheng Zou, Deyue Zhang, Dongdong Yang, Xiangzheng Zhang · 2026-03-10 · cs

Looking Back and Forth: Cross-Image Attention Calibration and Attentive Preference Learning for Multi-Image Hallucination Mitigation

This paper proposes CAPL, a framework that mitigates multi-image hallucinations in large vision-language models by introducing a selectable image token interaction mechanism for fine-grained cross-image alignment and a preference learning strategy that trains the model to rely on genuine visual evidence rather than textual priors.

Xiaochen Yang, Hao Fang, Jiawei Kong, Yaoxin Mao, Bin Chen, Shu-Tao Xia · 2026-03-10 · cs

Efficient Chest X-ray Representation Learning via Semantic-Partitioned Contrastive Learning

This paper introduces Semantic-Partitioned Contrastive Learning (S-PCL), a streamlined self-supervised pre-training framework for Chest X-rays that achieves superior accuracy and computational efficiency by enforcing agreement between randomly partitioned semantic subsets, thereby eliminating the need for heavy augmentations, auxiliary decoders, or momentum encoders.

Wangyu Feng, Shawn Young, Lijian Xu · 2026-03-10 · cs

Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge

This paper introduces EyExIn, a data-efficient framework that enhances retinal Vision Language Models by employing a dual-stream encoding strategy and a deep expert injection mechanism to bridge perception and reasoning gaps, achieving state-of-the-art precision in ophthalmic diagnosis while mitigating hallucinations.

Shuai Lu, Meng Wang, Jia Guo, Jiawei Du, Bo Liu, Shengzhu Yang, Weihang Zhang, Huazhu Fu, Huiqi Li · 2026-03-10 · cs