LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos

This paper introduces LAP, a novel procedure planning model that leverages a fine-tuned Vision Language Model to convert visual observations into distinctive text embeddings for a diffusion-based planner, achieving state-of-the-art performance on multiple benchmarks by effectively resolving visual ambiguities through language.

Lei Shi, Victor Aregbede, Andreas Persson, Martin Längkvist, Amy Loutfi, Stephanie Lowry · Wed, 11 Ma · 💻 cs

ENIGMA-360: An Ego-Exo Dataset for Human Behavior Understanding in Industrial Scenarios

This paper introduces ENIGMA-360, a publicly released, temporally synchronized ego-exo dataset containing 360 annotated procedural videos from real industrial scenarios to advance human behavior understanding and establish baselines for tasks like action segmentation and interaction detection.

Francesco Ragusa, Rosario Leonardi, Michele Mazzamuto, Daniele Di Mauro, Camillo Quattrocchi, Alessandro Passanisi, Irene D'Ambra, Antonino Furnari, Giovanni Maria Farinella · Wed, 11 Ma · 💻 cs

Let's Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments

This paper introduces Step-Aware Contrastive Alignment (SACA), a novel framework that enhances Vision-Language Navigation in Continuous Environments by utilizing a perception-grounded auditor to extract dense, step-level supervision from imperfect trajectories, thereby overcoming the limitations of compounding errors in supervised fine-tuning and sparse rewards in reinforcement fine-tuning to achieve state-of-the-art performance.

Haoyuan Li, Rui Liu, Hehe Fan, Yi Yang · Wed, 11 Ma · 💻 cs

Ensuring Data Freshness in Multi-Rate Task Chains Scheduling

This paper proposes a task-based scheduling framework that ensures end-to-end data freshness in safety-critical multi-rate systems by introducing a Consensus Offset Search algorithm to align task releases with data lifespan constraints, thereby eliminating the artificial latency of Logical Execution Time and the inefficiency of redundant oversampling while preserving Global EDF schedulability.

José Luis Conradi Hoffmann, Antônio Augusto Fröhlich · Wed, 11 Ma · 💻 cs

FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis

FetalAgents is a novel multi-agent system that dynamically orchestrates specialized vision experts to deliver robust, end-to-end fetal ultrasound analysis and structured clinical reporting across multiple tasks, outperforming existing specialized models and multimodal large language models.

Xiaotian Hu, Junwei Huang, Mingxuan Liu, Kasidit Anmahapong, Yifei Chen, Yitong Luo, Yiming Huang, Xuguang Bai, Zihan Li, Yi Liao, Haibo Qu, Qiyuan Tian · Wed, 11 Ma · 💻 cs

WVA: A Global Optimization Control Plane for llmd

The paper introduces WVA, a global optimization control plane co-designed with the llmd inference engine that leverages internal saturation states and fragmentation-aware strategies to achieve significantly higher throughput, fewer request failures, and lower power consumption compared to traditional Kubernetes autoscalers when managing heterogeneous LLM workloads.

Abhishek Malvankar, Lionel Villard, Mohammed Abdi, Evgeny Shindin, Braulio Dumba, Vishakha Ramani, Asser Tantawi, Tamar Eilam · Wed, 11 Ma · 💻 cs

A Regularized Ensemble Kalman Filter for Stochastic Phase Field Models of Brittle Fracture

This paper proposes a regularized ensemble Kalman filter framework that integrates sensor displacement data into stochastic phase-field models of brittle fracture to infer the evolving displacement and phase-field states, thereby correcting model predictions while ensuring physical consistency through a novel regularization step.

Lucas Hermann, Ralf Jänicke, Knut Andreas Meyer, Ulrich Römer · Wed, 11 Ma · 💻 cs

Idempotent Slices with Applications to Code-Size Reduction

This paper formalizes the concept of idempotent backward slices and presents a sound, efficient algorithm for extracting them from Gated Static Single Assignment (GSA) form to enable a novel sparse code-size reduction optimization that merges non-contiguous instructions, achieving up to 7.24% size reduction in specific benchmarks.

Rafael Alvarenga de Azevedo, Daniel Augusto Costa de Sa, Rodrigo Caetano Rocha, Fernando Magno Quintão Pereira · Wed, 11 Ma · 💻 cs

FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation

The paper proposes FrameDiT, a novel video generation architecture that introduces Matrix Attention to efficiently model global spatio-temporal dynamics by processing frames as matrices, thereby achieving state-of-the-art video quality and temporal coherence while maintaining computational efficiency comparable to local factorized attention.

Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen Tran · Wed, 11 Ma · 💻 cs

Robotic Scene Cloning: Advancing Zero-Shot Robotic Scene Adaptation in Manipulation via Visual Prompt Editing

This paper introduces Robotic Scene Cloning (RSC), a novel method that enhances zero-shot robotic manipulation by editing existing operation trajectories through visual prompting and condition injection to generate accurate, scene-consistent samples that significantly improve policy generalization in real-world environments.

Binyuan Huang, Yuqing Wen, Yucheng Zhao, Yaosi Hu, Tiancai Wang, Chang Wen Chen, Haoqiang Fan, Zhenzhong Chen · Wed, 11 Ma · 💻 cs

TriFusion-SR: Joint Tri-Modal Medical Image Fusion and SR

The paper proposes TriFusion-SR, a wavelet-guided conditional diffusion framework that jointly performs tri-modal medical image fusion and super-resolution by decomposing features into frequency bands and employing rectified wavelet features with adaptive spatial-frequency fusion to achieve state-of-the-art performance in resolution and perceptual quality.

Fayaz Ali Dharejo, Sharif S. M. A., Aiman Khalil, Nachiket Chaudhary, Rizwan Ali Naqvi, Radu Timofte · Wed, 11 Ma · 💻 cs

An Empirical Study of Interaction Smells in Multi-Turn Human-LLM Collaborative Code Generation

This paper introduces the concept of "Interaction Smells" in multi-turn human-LLM code generation, establishes a taxonomy based on real-world data, analyzes their distribution across leading models, and proposes the Invariant-aware Constraint Evolution (InCE) framework to effectively mitigate these issues and improve task success rates.

Binquan Zhang, Li Zhang, Lin Shi, Song Wang, Yuwei Qian, Linhui Zhao, Fang Liu, An Fu, Yida Ye · Wed, 11 Ma · 💻 cs

TemporalDoRA: Temporal PEFT for Robust Surgical Video Question Answering

The paper introduces TemporalDoRA, a parameter-efficient fine-tuning method that integrates lightweight temporal attention into the low-rank adaptation branch of vision encoders to enhance robustness against linguistic variations in surgical video question answering, validated by a new colonoscopy dataset and improved Out-of-Template performance.

Luca Carlini, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi, Sophia Bano, Mobarak I. Hoque · Wed, 11 Ma · 💻 cs