cs.CV papers | Gist.Science

VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models

This paper introduces Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that exploits the visual instruction-following capabilities of Image-to-Video models by disguising malicious text prompts as benign visual cues in reference images, achieving high attack success rates across state-of-the-art commercial systems.

Bowen Zheng, Yongli Xiang, Ziming Hong + 4 more | 2026-03-03 | 💻 cs

HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles

HorizonForge is a unified framework that enables photorealistic, controllable driving scene generation with arbitrary trajectories and vehicles by combining editable Gaussian-Mesh representations and noise-aware video diffusion, significantly outperforming existing methods in fidelity and consistency while introducing the HorizonSuite benchmark for standardized evaluation.

Yifan Wang, Francesco Pittaluga, Zaid Tasneem + 3 more | 2026-03-03 | 💻 cs

Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps

This paper introduces Light-Geometry Interaction (LGI) maps, a novel representation derived from monocular depth that encodes light-aware occlusion to enable a unified, physics-consistent pipeline for joint shadow generation and relighting, addressing common artifacts like floating shadows through a bridge-matching generative model trained on a newly curated large-scale benchmark.

Shan Wang, Peixia Li, Chenchen Xu + 4 more | 2026-03-03 | 💻 cs

PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning

PhotoAgent is an autonomous image editing system that leverages explicit aesthetic planning, tree search, and closed-loop visual feedback to execute multi-step editing tasks without requiring detailed user prompts, supported by the newly introduced UGC-Edit benchmark for evaluation.

Mingde Yao, Zhiyuan You, King-Man Tam + 2 more | 2026-03-03 | 💻 cs

OmniGAIA: Towards Native Omni-Modal AI Agents

This paper introduces OmniGAIA, a comprehensive benchmark for evaluating omni-modal agents on complex reasoning and tool-use tasks across video, audio, and image modalities, alongside OmniAtlas, a native omni-modal foundation agent trained with advanced strategies to bridge the gap between current bi-modal models and next-generation real-world AI assistants.

Xiaoxi Li, Wenxiang Jiao, Jiarui Jin + 8 more | 2026-03-03 | 💬 cs.CL

HELMLAB: An Analytical, Data-Driven Color Space for Perceptual Distance in UI Design Systems

This paper introduces HELMLAB, a novel 72-parameter analytical color space for UI design systems that utilizes learned transformations and perceptual corrections to achieve a 20.2% improvement in perceptual distance accuracy over CIEDE2000 while maintaining invertibility and providing practical utilities for design workflows.

Gorkem Yildiz | 2026-03-03 | 💻 cs

AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

The paper introduces AgentVista, a comprehensive benchmark spanning 25 sub-domains that evaluates generalist multimodal agents on ultra-challenging, realistic long-horizon tasks requiring hybrid tool use, revealing significant performance gaps in current state-of-the-art models.

Zhaochen Su, Jincheng Gao, Hangyu Guo + 10 more | 2026-03-03 | 💻 cs

V-MORALS: Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space

This paper introduces V-MORALS, a novel method that estimates Regions of Attraction in a learned latent space using only image-based trajectory data and Morse Graphs, thereby overcoming the limitations of existing approaches that require full state knowledge or known system dynamics.

Faiz Aladin, Ashwin Balasubramanian, Lars Lindemann + 1 more | 2026-03-03 | 🤖 cs.LG

Hierarchical Multi-Scale Graph Learning with Knowledge-Guided Attention for Whole-Slide Image Survival Analysis

The paper proposes HMKGN, a hierarchical multi-scale graph network that leverages knowledge-guided attention to model spatially organized, multi-scale interactions within whole-slide images, significantly outperforming existing methods in cancer survival prediction across four TCGA cohorts.

Bin Xu, Yufei Zhou, Boling Song + 6 more | 2026-03-03 | ⚡ eess

AoE: Always-on Egocentric Human Video Collection for Embodied AI

This paper introduces the Always-on Egocentric (AoE) system, a low-cost, scalable data collection framework that leverages distributed human agents and smartphones to generate high-quality egocentric interaction data for training embodied AI foundation models.

Bowen Yang, Zishuo Li, Yang Sun + 15 more | 2026-03-03 | 💻 cs

Learning Under Extreme Data Scarcity: Subject-Level Evaluation of Lightweight CNNs for fMRI-Based Prodromal Parkinsons Detection

This study demonstrates that in the context of extreme data scarcity for prodromal Parkinson's disease detection using fMRI, enforcing strict subject-level evaluation reveals severe information leakage in standard image-level splits and shows that lightweight models like MobileNet V1 generalize more reliably than deeper architectures.

Naimur Rahman | 2026-03-03 | 🤖 cs.LG
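
The leakage described in the summary above arises when image slices from one subject land in both the training and test sets. A minimal sketch of a subject-level split follows, assuming each sample carries a subject label; the function name `subject_level_split`, the toy labels, and the 25% test fraction are illustrative assumptions, not the paper's protocol.

```python
import random
from collections import defaultdict


def subject_level_split(subject_ids, test_fraction=0.25, seed=0):
    """Split sample indices so that no subject appears in both sets.

    subject_ids[i] is the (hypothetical) subject label of sample i.
    """
    # Group sample indices by the subject they came from.
    by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)

    # Shuffle subjects, not samples, so whole subjects move together.
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)

    n_test = max(1, round(len(subjects) * test_fraction))
    held_out = set(subjects[:n_test])

    train = [i for s in subjects if s not in held_out for i in by_subject[s]]
    test = [i for s in subjects if s in held_out for i in by_subject[s]]
    return sorted(train), sorted(test)


# Toy data: 4 subjects with 3 image slices each (labels are made up).
subject_ids = ["s1"] * 3 + ["s2"] * 3 + ["s3"] * 3 + ["s4"] * 3
train_idx, test_idx = subject_level_split(subject_ids)
train_subjects = {subject_ids[i] for i in train_idx}
test_subjects = {subject_ids[i] for i in test_idx}
```

An image-level split would shuffle the 12 indices directly, which can place slices of the same subject on both sides and inflate test accuracy.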

Certainty-Validity: A Diagnostic Framework for Discrete Commitment Systems

This paper introduces the Certainty-Validity (CVS) Framework, a diagnostic tool for discrete commitment systems that exposes a flaw in standard accuracy metrics by distinguishing appropriate uncertainty from harmful confident hallucinations, and argues that training for reasoning systems should prioritize maximizing the CVS score so that models do not overcommit to ambiguous data.

Datorien L. Anderson | 2026-03-03 | 🤖 cs.LG

Automated Quality Check of Sensor Data Annotations

This paper presents an open-source tool that automates the quality assurance of multi-sensor railway training data by detecting nine common annotation errors with high precision, thereby significantly reducing manual workload and accelerating the development of AI-driven automated driving systems.

Niklas Freund, Zekiye Ilknur-Öz, Tobias Klockau + 3 more | 2026-03-03 | 💻 cs

Multimodal Modular Chain of Thoughts in Energy Performance Certificate Assessment

This paper introduces Multimodal Modular Chain of Thoughts (MMCoT), a cost-efficient framework utilizing Vision-Language models to improve automated Energy Performance Certificate (EPC) pre-assessment by decomposing the estimation into structured reasoning stages, which demonstrated statistically significant accuracy gains over standard prompting on a UK residential dataset.

Zhen Peng, Peter J. Bentley | 2026-03-03 | 🤖 cs.AI

VoxelDiffusionCut: Non-destructive Internal-part Extraction via Iterative Cutting and Structure Estimation

This paper proposes VoxelDiffusionCut, a novel method that leverages a diffusion model to iteratively estimate internal 3D structures from observed cutting surfaces and plan non-destructive cuts, thereby enabling the safe extraction of target components like batteries and motors from complex products by effectively capturing predictive uncertainty to avoid erroneous damage.

Takumi Hachimine, Yuhwan Kwon, Cheng-Yu Kuo + 2 more | 2026-03-03 | 💻 cs

Efficient Image Super-Resolution with Multi-Scale Spatial Adaptive Attention Networks

This paper proposes the Multi-scale Spatial Adaptive Attention Network (MSAAN), a lightweight image super-resolution framework that integrates novel modules for multi-scale feature aggregation and spatial adaptive attention to achieve superior reconstruction fidelity with significantly reduced computational complexity compared to state-of-the-art methods.

Sushi Rao, Jingwei Li | 2026-03-03 | 💻 cs

BiSe-Unet: A Lightweight Dual-path U-Net with Attention-refined Context for Real-time Medical Image Segmentation

The paper introduces BiSe-Unet, a lightweight dual-path U-Net architecture that combines an attention-refined context path with a shallow spatial path and a depthwise separable decoder to achieve real-time, high-precision medical image segmentation on resource-constrained edge devices like the Raspberry Pi 5.

M Iffat Hossain, Laura Brattain | 2026-03-03 | 💻 cs
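
To illustrate why a depthwise separable decoder suits edge devices, here is a rough parameter count comparing a standard convolution with its depthwise separable counterpart. The kernel size and channel widths are assumed for illustration and are not taken from the paper; bias terms are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    # One k x k filter per (input channel, output channel) pair.
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    # Depthwise step: one k x k filter per input channel.
    # Pointwise step: a 1 x 1 convolution mixing c_in channels into c_out.
    return k * k * c_in + c_in * c_out


# Assumed layer shape: 3x3 kernel, 64 input channels, 128 output channels.
std = standard_conv_params(3, 64, 128)        # 73728 weights
sep = depthwise_separable_params(3, 64, 128)  # 8768 weights
ratio = sep / std                             # roughly 0.12
```

For this assumed layer, the separable form needs about 12% of the weights of the standard convolution.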

NovaLAD: A Fast, CPU-Optimized Document Extraction Pipeline for Generative AI and Data Intelligence

NovaLAD is a fast, CPU-optimized document extraction pipeline that leverages concurrent YOLO models, rule-based grouping, and selective Vision LLM processing to convert unstructured documents into structured formats with state-of-the-art accuracy on the DP-Bench benchmark, enabling efficient Generative AI applications without requiring GPU hardware.

Aman Ulla | 2026-03-03 | 🤖 cs.AI

CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers

This paper introduces CT-Flow, an agentic framework that leverages the Model Context Protocol to transform static 3D CT analysis into a dynamic, tool-mediated workflow, achieving state-of-the-art performance on the newly curated CT-FlowBench by autonomously orchestrating complex diagnostic tasks through iterative tool use.

Yannian Gu, Xizhuo Zhang, Linjie Mu + 4 more | 2026-03-03 | 🤖 cs.AI

QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference

QuickGrasp is a responsive, QoS-aware system that bridges the accuracy-latency trade-off in video-language querying by employing a local-first architecture with on-demand edge augmentation, shared vision representations, and adaptive tokenization to match large model performance while significantly reducing response delays.

Miao Zhang, Ruixiao Zhang, Jianxin Shi + 3 more | 2026-03-03 | ⚡ eess
