EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

EchoTorrent is a multi-modal video generation framework that addresses latency and temporal stability challenges through a fourfold design: multi-teacher training, adaptive CFG calibration, hybrid long tail forcing, and VAE decoder refinement. Together, these enable swift, sustained, and high-fidelity streaming inference with precise audio-lip synchronization.

Rang Meng, Yingjie Yin, Yuming Li + 1 more · 2026-03-03 · cs

Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

The paper introduces Hepato-LLaVA, a specialized multi-modal large language model (MLLM) featuring a novel Sparse Topo-Pack Attention mechanism, together with the clinically validated HepatoPathoVQA dataset. By addressing resolution constraints and feature aggregation inefficiencies, it achieves state-of-the-art performance in hepatocellular carcinoma diagnosis and captioning on gigapixel whole slide images.

Yuxuan Yang, Zhonghao Yan, Yi Zhang + 6 more · 2026-03-03 · cs

VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models

This paper introduces Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that exploits the visual instruction-following capabilities of Image-to-Video models by disguising malicious text prompts as benign visual cues in reference images, achieving high attack success rates across state-of-the-art commercial systems.

Bowen Zheng, Yongli Xiang, Ziming Hong + 4 more · 2026-03-03 · cs

HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles

HorizonForge is a unified framework for photorealistic, controllable driving scene generation with arbitrary trajectories and vehicles. It combines editable Gaussian-Mesh representations with noise-aware video diffusion, significantly outperforming existing methods in fidelity and consistency. The paper also introduces the HorizonSuite benchmark for standardized evaluation.

Yifan Wang, Francesco Pittaluga, Zaid Tasneem + 3 more · 2026-03-03 · cs

Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps

This paper introduces Light-Geometry Interaction (LGI) maps, a novel representation derived from monocular depth that encodes light-aware occlusion. LGI maps enable a unified, physics-consistent pipeline for joint shadow generation and relighting, in which a bridge-matching generative model trained on a newly curated large-scale benchmark addresses common artifacts such as floating shadows.

Shan Wang, Peixia Li, Chenchen Xu + 4 more · 2026-03-03 · cs

Certainty-Validity: A Diagnostic Framework for Discrete Commitment Systems

This paper introduces the Certainty-Validity (CVS) Framework, a diagnostic tool for discrete commitment systems. By distinguishing appropriate uncertainty from harmful confident hallucinations, the framework exposes a critical flaw in standard accuracy metrics. The paper argues that effective training for reasoning systems should prioritize maximizing the CVS score, so that models do not overcommit to ambiguous data.

Datorien L. Anderson · 2026-03-03 · cs.LG