Single-Slice-to-3D Reconstruction in Medical Imaging and Natural Objects: A Comparative Benchmark with SAM 3D

This paper benchmarks five state-of-the-art image-to-3D foundation models on medical and natural datasets, revealing that while all struggle with severe depth ambiguity in single-slice reconstruction, SAM 3D best preserves topological similarity to medical shapes, ultimately demonstrating that reliable medical 3D inference requires domain-specific adaptation beyond current zero-shot capabilities.

Yan Luo, Advaith Ravishankar, Serena Liu + 2 more · 2026-03-03 · cs

EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

EchoTorrent is a novel multi-modal video generation framework that overcomes latency and temporal stability challenges through a fourfold design involving multi-teacher training, adaptive CFG calibration, hybrid long tail forcing, and VAE decoder refinement to enable swift, sustained, and high-fidelity streaming inference with precise audio-lip synchronization.

Rang Meng, Yingjie Yin, Yuming Li + 1 more · 2026-03-03 · cs

Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

The paper introduces Hepato-LLaVA, a specialized multi-modal large language model featuring a novel Sparse Topo-Pack Attention mechanism and the clinically validated HepatoPathoVQA dataset, achieving state-of-the-art performance in hepatocellular carcinoma diagnosis and captioning on gigapixel whole slide images by addressing resolution constraints and feature-aggregation inefficiencies.

Yuxuan Yang, Zhonghao Yan, Yi Zhang + 6 more · 2026-03-03 · cs

VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models

This paper introduces Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that exploits the visual instruction-following capabilities of Image-to-Video models by disguising malicious text prompts as benign visual cues in reference images, achieving high attack success rates across state-of-the-art commercial systems.

Bowen Zheng, Yongli Xiang, Ziming Hong + 4 more · 2026-03-03 · cs

HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles

HorizonForge is a unified framework that enables photorealistic, controllable driving-scene generation with arbitrary trajectories and vehicles by combining editable Gaussian-mesh representations with noise-aware video diffusion, significantly outperforming existing methods in fidelity and consistency while introducing the HorizonSuite benchmark for standardized evaluation.

Yifan Wang, Francesco Pittaluga, Zaid Tasneem + 3 more · 2026-03-03 · cs

Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps

This paper introduces Light-Geometry Interaction (LGI) maps, a novel representation derived from monocular depth that encodes light-aware occlusion, enabling a unified, physics-consistent pipeline for joint shadow generation and relighting; a bridge-matching generative model trained on a newly curated large-scale benchmark addresses common artifacts such as floating shadows.

Shan Wang, Peixia Li, Chenchen Xu + 4 more · 2026-03-03 · cs