VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models

This paper introduces Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that exploits the visual instruction-following capabilities of Image-to-Video models by disguising malicious text prompts as benign visual cues in reference images, achieving high attack success rates across state-of-the-art commercial systems.

Bowen Zheng, Yongli Xiang, Ziming Hong + 4 more · 2026-03-03 · 💻 cs

HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles

HorizonForge is a unified framework that enables photorealistic, controllable driving scene generation with arbitrary trajectories and vehicles by combining editable Gaussian-Mesh representations and noise-aware video diffusion, significantly outperforming existing methods in fidelity and consistency while introducing the HorizonSuite benchmark for standardized evaluation.

Yifan Wang, Francesco Pittaluga, Zaid Tasneem + 3 more · 2026-03-03 · 💻 cs

Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps

This paper introduces Light-Geometry Interaction (LGI) maps, a novel representation derived from monocular depth that encodes light-aware occlusion to enable a unified, physics-consistent pipeline for joint shadow generation and relighting, addressing common artifacts like floating shadows through a bridge-matching generative model trained on a newly curated large-scale benchmark.

Shan Wang, Peixia Li, Chenchen Xu + 4 more · 2026-03-03 · 💻 cs

Certainty-Validity: A Diagnostic Framework for Discrete Commitment Systems

This paper introduces the Certainty-Validity (CVS) Framework, a diagnostic tool for discrete commitment systems that exposes the critical flaw of standard accuracy metrics by distinguishing between appropriate uncertainty and harmful confident hallucinations, ultimately arguing that effective training for reasoning systems should prioritize maximizing the CVS score to prevent models from overcommitting to ambiguous data.

Datorien L. Anderson · 2026-03-03 · 🤖 cs.LG

Multimodal Modular Chain of Thoughts in Energy Performance Certificate Assessment

This paper introduces Multimodal Modular Chain of Thoughts (MMCoT), a cost-efficient framework that uses Vision-Language models to improve automated Energy Performance Certificate (EPC) pre-assessment by decomposing the estimation into structured reasoning stages; it achieved statistically significant accuracy gains over standard prompting on a UK residential dataset.

Zhen Peng, Peter J. Bentley · 2026-03-03 · 🤖 cs.AI

VoxelDiffusionCut: Non-destructive Internal-part Extraction via Iterative Cutting and Structure Estimation

This paper proposes VoxelDiffusionCut, a novel method that leverages a diffusion model to iteratively estimate internal 3D structures from observed cutting surfaces and plan non-destructive cuts, thereby enabling the safe extraction of target components like batteries and motors from complex products by effectively capturing predictive uncertainty to avoid erroneous damage.

Takumi Hachimine, Yuhwan Kwon, Cheng-Yu Kuo + 2 more · 2026-03-03 · 💻 cs

QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference

QuickGrasp is a responsive, QoS-aware system that bridges the accuracy-latency trade-off in video-language querying by employing a local-first architecture with on-demand edge augmentation, shared vision representations, and adaptive tokenization to match large model performance while significantly reducing response delays.

Miao Zhang, Ruixiao Zhang, Jianxin Shi + 3 more · 2026-03-03 · ⚡ eess