cs papers | Gist.Science

CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval

The paper proposes CMMR-VLN, a vision-and-language navigation framework that enhances large language model agents with structured multimodal memory retrieval and reflection-based updates to selectively leverage prior experiences, significantly improving performance in long-horizon and unfamiliar scenarios compared to existing methods.

Haozhou Li, Xiangyu Dong, Huiyan Jiang, Yaoming Zhou, Xiaoguang Ma | 2026-03-10 | 💻 cs

Dual-Horizon Hybrid Internal Model for Low-Gravity Quadrupedal Jumping with Hardware-in-the-Loop Validation

This paper introduces a Dual-Horizon Hybrid Internal Model that enables stable, continuous quadrupedal jumping under lunar gravity using only proprioceptive sensing, validated through the MATRIX hardware-in-the-loop testbed which emulates reduced gravity and lunar terrain in real time.

Haozhe Xu, Yifei Zhao, Wenhao Feng, Zhipeng Wang, Hongrui Sang, Cheng Cheng, Xiuxian Li, Zhen Yin, Bin He | 2026-03-10 | 💻 cs

SafarDB: FPGA-Accelerated Distributed Transactions via Replicated Data Types

SafarDB is a novel FPGA-accelerated distributed transaction system that co-designs a network-attached replication engine with a custom FPGA network interface to achieve significantly lower latency and higher throughput for both Conflict-Free and Well-coordinated Replicated Data Types compared to state-of-the-art RDMA-based implementations.

Javad Saberlatibari, Prithviraj Yuvaraj, Mohsen Lesani, Philip Brisk, Mohammad Sadoghi | 2026-03-10 | 💻 cs

ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation

This paper proposes the ViSA-enhanced framework, a triple-phase collaborative architecture that leverages structured visual prompting to enable Vision-Language Models to perform direct spatial reasoning on image planes, achieving a 70.3% improvement in success rate over state-of-the-art aerial Vision-Language Navigation methods on the CityNav benchmark.

Haoyu Tong, Xiangyu Dong, Xiaoguang Ma, Haoran Zhao, Yaoming Zhou, Chenghao Lin | 2026-03-10 | 💻 cs

It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models

This paper addresses the significant challenge of analog clock reading in state-of-the-art Vision-Language Models by introducing the real-world, diverse TickTockVQA dataset and the Swap-DPO fine-tuning framework, which together substantially improve spatial-temporal reasoning and accuracy under complex visual conditions.

Jaeha Choi, Jin Won Lee, Siwoo You, Jangho Lee | 2026-03-10 | 💻 cs

Structure-Preserving Graph Contrastive Learning for Mathematical Information Retrieval

This paper proposes Variable Substitution, a domain-specific graph augmentation technique that preserves the structural and semantic integrity of mathematical formulas, significantly enhancing the performance of graph contrastive learning models for mathematical information retrieval compared to generic strategies.

Chun-Hsi Ku, Hung-Hsuan Chen | 2026-03-10 | 💻 cs

PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents

This paper introduces PIRA-Bench, a novel benchmark, together with the PIRF baseline framework; both are designed to advance GUI agents from reactive instruction-following to proactive intent recommendation by evaluating their ability to anticipate user needs from noisy, continuous visual inputs.

Yuxiang Chai, Shunye Tang, Han Xiao, Rui Liu, Hongsheng Li | 2026-03-10 | 💻 cs

Alignment–Process–Outcome: Rethinking How AIs and Humans Collaborate

This paper proposes a unified dynamic framework using "task" and "intent" lenses to reconceptualize human-AI collaboration, arguing that alignment, process structure, and outcome quality are non-linearly related and require a structural analysis beyond simple outcome metrics.

Haichang Li, Anjun Zhu, Arpit Narechania | 2026-03-10 | 💻 cs

Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared

This paper proposes "Missing No More," a novel dictionary-guided framework that addresses the challenge of missing infrared modality in image fusion by learning a shared convolutional dictionary to enable interpretable coefficient-domain inference and fusion, thereby avoiding uncontrolled pixel-space generation while improving perceptual quality and downstream detection performance.

Yafei Zhang, Meng Ma, Huafeng Li, Yu Liu | 2026-03-10 | 💻 cs

Vector Field Augmented Differentiable Policy Learning for Vision-Based Drone Racing

This paper introduces DiffRacing, a novel framework that enhances differentiable policy learning for vision-based drone racing by integrating vector fields to provide stable gradient signals for balancing high-speed gate traversal with obstacle avoidance, while employing a differentiable Delta Action Model to enable robust sim-to-real transfer without explicit system identification.

Yang Su, Feng Yu, Yu Hu, Xinze Niu, Linzuo Zhang, Fangyu Sun, Danping Zou | 2026-03-10 | 💻 cs

VSDiffusion: Taming Ill-Posed Shadow Generation via Visibility-Constrained Diffusion

VSDiffusion is a visibility-constrained two-stage diffusion framework that addresses the ill-posed nature of shadow generation by incorporating visibility priors through a shadow-gated cross-attention branch and a learned soft prior map to produce geometrically consistent and realistic cast shadows.

Jing Li, Jing Zhang | 2026-03-10 | 💻 cs

AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis

AffordGrasp is a diffusion-based framework that generates physically stable and semantically accurate human grasps by integrating affordance-aware latent representations with a dual-conditioning process to bridge the modality gap between 3D object geometry and textual interaction instructions.

Xiaofei Wu, Yi Zhang, Yumeng Liu, Yuexin Ma, Yujiao Shi, Xuming He | 2026-03-10 | 💻 cs

Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model

This paper introduces MambaDance, a novel dance generation framework that replaces Transformers with a Mamba-based diffusion model and employs a Gaussian-based beat representation to effectively capture the sequential, rhythmic, and music-synchronized nature of dance across varying sequence lengths.

Sangjune Park, Inhyeok Choi, Donghyeon Soon, Youngwoo Jeon, Kyungdon Joo | 2026-03-10 | 💻 cs

Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades

This paper proposes a two-stage cascaded framework that generates controllable complex human motion videos by first using an autoregressive model to synthesize 2D skeleton sequences from text descriptions and then employing a pose-conditioned diffusion model with adaptive layer fusion to render high-fidelity videos, supported by a new synthetic dataset designed to overcome limitations in existing benchmarks.

Ashkan Taghipour, Morteza Ghahremani, Zinuo Li, Hamid Laga, Farid Boussaid, Mohammed Bennamoun | 2026-03-10 | 💻 cs

QualiTeacher: Quality-Conditioned Pseudo-Labeling for Real-World Image Restoration

QualiTeacher introduces a novel framework for real-world image restoration that transforms imperfect pseudo-labels into conditional supervisory signals by explicitly conditioning the student model on estimated label quality, thereby enabling the learning of a quality-graded restoration manifold that avoids artifact mimicry and achieves superior generalization.

Fengyang Xiao, Jingjia Feng, Peng Hu, Dingming Zhang, Lei Xu, Guanyi Qin, Lu Li, Chunming He, Sina Farsiu | 2026-03-10 | 💻 cs

The Unit Gap: How Sharing Works in Boolean Circuits

This paper establishes that the size difference between optimal Boolean circuits and formulas over the AIG basis is strictly limited to 0 or 1, characterizing the precise conditions under which sharing occurs and proving that any non-zero gap arises exclusively from a single gate with fan-out 2.

Kirill Krinkin | 2026-03-10 | 💻 cs

Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout

This paper presents a robust multimodal framework for the 10th ABAW Expression Recognition Challenge that utilizes a dual-branch Transformer with safe cross-attention and modality dropout to dynamically fuse audio and visual data, effectively addressing partial occlusions, missing modalities, and class imbalance to achieve 60.79% accuracy on the Aff-Wild2 validation set.

Jun Yu, Naixiang Zheng, Guoyuan Wang, Yunxiang Zhang, Lingsi Zhu, Jiaen Liang, Wei Huang, Shengping Liu | 2026-03-10 | 💻 cs

Samyama: A Unified Graph-Vector Database with In-Database Optimization, Agentic Enrichment, and Hardware Acceleration

This paper introduces Samyama, a high-performance, unified graph-vector database written in Rust that integrates persistent storage, vector indexing, native optimization solvers, and agentic LLM enrichment into a single engine, achieving competitive throughput and latency on commodity hardware while offering GPU-accelerated enterprise features.

Madhulatha Mandarapu, Sandeep Kunkunuru | 2026-03-10 | 💻 cs

CEMR: An Effective Subgraph Matching Algorithm with Redundant Extension Elimination

The paper proposes CEMR, a novel subgraph matching algorithm that significantly improves efficiency on large graphs by eliminating redundant computations through common extension merging and reusing, while further optimizing performance with two pruning techniques.

Linglin Yang, Xunbin Su, Lei Zou, Xiangyang Gou, Yinnian Lin | 2026-03-10 | 💻 cs

Distributed Coordination Algorithms with Efficient Communication for Open Multi-Agent Systems with Dynamic Communication Links and Processing Delays

This paper proposes and analyzes three communication-efficient distributed algorithms that achieve finite-time quantized average consensus in open multi-agent systems with dynamic directed links, arbitrary bounded processing delays, and continuous node turnover, while establishing novel topological conditions for convergence and demonstrating superior performance through simulations.

Jiaqi Hu, Karl H. Johansson, Apostolos I. Rikos | 2026-03-10 | 💻 cs
