Let's Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments

This paper introduces Step-Aware Contrastive Alignment (SACA), a novel framework for Vision-Language Navigation in Continuous Environments that uses a perception-grounded auditor to extract dense, step-level supervision from imperfect trajectories. By addressing both the compounding errors of supervised fine-tuning and the sparse rewards of reinforcement fine-tuning, SACA achieves state-of-the-art performance.

Haoyuan Li, Rui Liu, Hehe Fan, Yi Yang · Wed, 11 Ma · cs

ENIGMA-360: An Ego-Exo Dataset for Human Behavior Understanding in Industrial Scenarios

This paper introduces ENIGMA-360, a publicly released, temporally synchronized ego-exo dataset containing 360 annotated procedural videos from real industrial scenarios to advance human behavior understanding and establish baselines for tasks like action segmentation and interaction detection.

Francesco Ragusa, Rosario Leonardi, Michele Mazzamuto, Daniele Di Mauro, Camillo Quattrocchi, Alessandro Passanisi, Irene D'Ambra, Antonino Furnari, Giovanni Maria Farinella · Wed, 11 Ma · cs

LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos

This paper introduces LAP, a novel procedure planning model that leverages a fine-tuned Vision Language Model to convert visual observations into distinctive text embeddings for a diffusion-based planner, achieving state-of-the-art performance on multiple benchmarks by effectively resolving visual ambiguities through language.

Lei Shi, Victor Aregbede, Andreas Persson, Martin Längkvist, Amy Loutfi, Stephanie Lowry · Wed, 11 Ma · cs

The Richest Paradigm You're Not Using: Commercial Videogames at the Intersection of Human-Computer Interaction and Cognitive Science

This paper argues that commercial videogames serve as a powerful, underutilized research environment at the intersection of human-computer interaction and cognitive science, offering ecologically valid contexts to study perception, attention, and executive functioning through a systematic framework that maps game affordances to cognitive demands.

Jaap Munneke, Jennifer E. Corbett · Wed, 11 Ma · cs

Epistemic Closure: Autonomous Mechanism Completion for Physically Consistent Simulation

This paper introduces a Neuro-Symbolic Generative Agent that overcomes the "Implicit Context" problem in scientific discovery by autonomously validating and completing physical mechanisms through dimensionless scaling analysis, thereby preventing physical hallucinations and ensuring thermodynamically consistent simulations.

Yue Wua, Tianhao Su, Rui Hu, Mingchuan Zhao, Shunbo Hu, Deng Pan, Jizhong Huang · Wed, 11 Ma · cs

MuxGel: Simultaneous Dual-Modal Visuo-Tactile Sensing via Spatially Multiplexing and Deep Reconstruction

MuxGel is a spatially multiplexed visuo-tactile sensor that overcomes the opacity trade-off in existing GelSight-style devices by using a checkerboard coating to simultaneously capture pre-contact vision and post-contact tactile signals through a single camera, with high-fidelity reconstruction achieved via a deep learning framework.

Zhixian Hu, Zhengtong Xu, Sheeraz Athar, Juan Wachs, Yu She · Wed, 11 Ma · cs

Deblurring structural edges in variable thickness topology optimization via density-gradient-informed projection

This paper introduces a density-gradient-informed (DGI) projection method combined with a robust penalization strategy to effectively eliminate low-thickness regions and deblur structural edges in variable thickness topology optimization, achieving sharp solid-void transitions with negligible impact on structural compliance.

Gabriel Stankiewicz, Chaitanya Dev, Paul Steinmann · Wed, 11 Ma · cs

TIMID: Time-Dependent Mistake Detection in Videos of Robot Executions

This paper introduces TIMID, a weakly supervised video anomaly detection framework that leverages task and mistake prompts to detect complex, time-dependent errors in robot executions. It addresses the limitations of existing models and out-of-the-box VLMs, and contributes a novel multi-robot simulation dataset for zero-shot evaluation.

Nerea Gallego (University of Zaragoza), Fernando Salanova (University of Zaragoza), Claudio Mannarano (University of Zaragoza, University of Torino), Cristian Mahulea (University of Zaragoza), Eduardo Montijano (University of Zaragoza) · Wed, 11 Ma · cs

Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency

This paper introduces Test-time Ego-Exo Adaptation for Action Anticipation (TE²A³), a novel task addressed by the Dual-Clue enhanced Prototype Growing Network (DCPGN), which uses a Multi-Label Prototype Growing Module and a Dual-Clue Consistency Module to bridge the inter-view gap and adapt models online without target-view training data.

Zhaofeng Shi, Heqian Qiu, Lanxiao Wang, Qingbo Wu, Fanman Meng, Lili Pan, Hongliang Li · Wed, 11 Ma · cs

Expressive Power of Property Graph Constraint Languages

This paper presents the first systematic study of the expressive power of the PG-Keys language by establishing a unifying framework to compare it with Graph Functional Dependencies (GFD) and Graph Generating Dependencies (GGD), ultimately revealing a strict hierarchy of expressiveness that clarifies PG-Keys' capabilities within the context of the upcoming GQL standard.

Stefania Dumbrava, Nadime Francis, Victor Marsault, Steven Sailly · Wed, 11 Ma · cs

RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding

This paper introduces a new fine-grained Audio-Visual Learning task called Region-Aware Sound Source Understanding (RA-SSU), supported by two novel datasets (f-Music and f-Lifescene) and a state-of-the-art model named SSUFormer, which utilizes specialized modules to achieve precise sound source segmentation and detailed frame-level textual descriptions.

Muyi Sun, Yixuan Wang, Hong Wang, Chen Su, Man Zhang, Xingqun Qi, Qi Li, Zhenan Sun · Wed, 11 Ma · cs

ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation

ConfCtrl is a confidence-aware video interpolation framework that enables precise camera control in video diffusion for novel view synthesis by combining confidence-weighted point cloud projections with a Kalman-inspired predict-update mechanism to balance pose guidance and geometric consistency while reconstructing unseen regions.

Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Yang Bai, Chi Zhang, Ziyuan Liu, Abhinav Valada · Wed, 11 Ma · cs

EmoSURA: Towards Accurate Evaluation of Detailed and Long-Context Emotional Speech Captions

This paper introduces EmoSURA, a novel evaluation framework that improves the assessment of long-form emotional speech captions by decomposing them into atomic perceptual units for audio-grounded verification, addressing the limitations of traditional metrics and LLM judges while providing the standardized SURABench resource.

Xin Jing, Andreas Triantafyllopoulos, Jiadong Wang, Shahin Amiriparian, Jun Luo, Björn Schuller · Wed, 11 Ma · cs

BrainSTR: Spatio-Temporal Contrastive Learning for Interpretable Dynamic Brain Network Modeling

BrainSTR is a spatio-temporal contrastive learning framework that enhances the interpretability of dynamic brain network modeling for neuropsychiatric diagnosis by adaptively partitioning brain states, identifying critical phases, and extracting sparse, disease-specific connectivity patterns to construct a discriminative semantic space validated across ASD, BD, and MDD datasets.

Guiliang Guo, Guangqi Wen, Lingwen Liu, Ruoxian Song, Peng Cao, Jinzhu Yang, Fei Wang, Xiaoli Liu, Osmar R. Zaiane · Wed, 11 Ma · cs

VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models

This paper introduces VLM-Loc, a framework that leverages large vision-language models to achieve precise text-to-point-cloud localization by transforming 3D maps into bird's-eye-view images and scene graphs for enhanced spatial reasoning, alongside the release of the CityLoc benchmark for systematic evaluation.

Shuhao Kang, Youqi Liao, Peijie Wang, Wenlong Liao, Qilin Zhang, Benjamin Busam, Xieyuanli Chen, Yun Liu · Wed, 11 Ma · cs