cs papers | Gist.Science

OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks

This paper introduces OmniEarth, a comprehensive benchmark comprising 9,275 images and 44,210 verified instructions that evaluates Vision-Language Models across 28 geospatial tasks with a focus on perception, reasoning, and robustness, revealing significant performance gaps in current models for remote sensing applications.

Ronghao Fu, Haoran Liu, Weijie Zhang, Zhiwen Lin, Xiao Yang, Peng Zhang, Bo YangWed, 11 Ma💻 cs

MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning

The paper introduces MORE-R1, a novel Large Vision-Language Model that leverages a two-stage training process combining Supervised Fine-Tuning on automatically constructed stepwise reasoning data and Reinforcement Learning with Group Relative Policy Optimization to achieve state-of-the-art performance in Multimodal Object-Entity Relation Extraction.

Xiang Yuan, Xu Chu, Xinrong Chen, Haochen Li, Zonghong Dai, Hongcheng Fan, Xiaoyue Yuan, Weiping Li, Tong MoWed, 11 Ma💻 cs

Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity

PruneSID is a training-free, synergistic importance-diversity framework that significantly enhances Vision-Language Model efficiency by employing Principal Semantic Components Analysis and Intra-group Non-Maximum Suppression to achieve state-of-the-art accuracy with extreme token compression and faster prefilling speeds.

Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Guangming Lu, Jun Yu, Wenjie PeiWed, 11 Ma💻 cs

StyleVLA: Driving Style-Aware Vision Language Action Model for Autonomous Driving

StyleVLA is a physics-informed Vision Language Action model built on Qwen3-VL-4B that generates diverse, kinematically feasible driving trajectories tailored to specific styles, significantly outperforming state-of-the-art proprietary models on domain-specific autonomous driving tasks.

Yuan Gao, Dengyuan Hua, Mattia Piccinini, Finn Rasmus Schäfer, Korbinian Moller, Lin Li, Johannes BetzWed, 11 Ma💻 cs

Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion

This paper proposes a novel component-aware, self-refining framework that combines a Self-Attention-based Autoencoder, a Coordinate-Preserving Gated Fusion module, and a Spatially Adaptive Refinement Revisor to generate high-fidelity, semantically accurate photorealistic images from freehand sketches, significantly outperforming existing GAN and diffusion models across diverse facial and non-facial datasets.

Ali Zia, Muhammad Umer Ramzan, Usman Ali, Muhammad Faheem, Abdelwahed Khamis, Shahnawaz QureshiWed, 11 Ma💻 cs

Streaming Autoregressive Video Generation via Diagonal Distillation

This paper introduces Diagonal Distillation, an asymmetric autoregressive framework that leverages temporal context and implicit optical flow to enable high-fidelity, real-time streaming video generation with a 277.3x speedup while mitigating error accumulation and motion incoherence.

Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-HsuanYang, Weiyang LiuWed, 11 Ma💻 cs

Towards Viewpoint-centric Artifact-based Regulatory Requirements Engineering for Compliance by Design

This paper reports on the synthesis and seeks feedback for the future evaluation of an Artefact Model for Regulatory Requirements Engineering (AM4RRE), aiming to bridge the gap between organizational regulatory processes and ad-hoc software development practices to achieve systematic, integrated compliance by design.

Oleksandr KosenkovWed, 11 Ma💻 cs

SurgFed: Language-guided Multi-Task Federated Learning for Surgical Video Understanding

The paper proposes SurgFed, a language-guided multi-task federated learning framework that utilizes Language-guided Channel Selection and Language-guided Hyper Aggregation to overcome tissue and task diversity challenges, thereby improving surgical video segmentation and depth estimation across heterogeneous clinical environments.

Zheng Fang, Ziwei Niu, Ziyue Wang, Zhu Zhuo, Haofeng Liu, Shuyang Qian, Jun Xia, Yueming JinWed, 11 Ma💻 cs

EmbC-Test: How to Speed Up Embedded Software Testing Using LLMs and RAG

This paper introduces EmbC-Test, a Retrieval-Augmented Generation (RAG) pipeline that leverages project-specific artifacts to ground large language models, enabling the automated generation of syntactically correct and runtime-valid embedded C tests that reduce testing time by up to 66% while producing 270 tests per hour.

Maximilian Harnot, Sebastian Komarnicki, Michal Polok, Timo OksanenWed, 11 Ma💻 cs

Avoiding Big Integers: Parallel Multimodular Algebraic Verification of Arithmetic Circuits

This paper introduces a hybrid algebraic verification technique implemented in TalisMan2.0 that utilizes parallel multimodular reasoning to efficiently verify arithmetic circuits with large operands while avoiding the computational overhead of big integer arithmetic.

Clemens Hofstadler, Daniela Kaufmann, Chen ChenWed, 11 Ma💻 cs

Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation

The paper presents Context-Nav, a training-free framework for text-goal instance navigation that combines caption-driven frontier ranking for global exploration with viewpoint-aware 3D spatial verification to accurately disambiguate target objects in cluttered environments, achieving state-of-the-art performance on InstanceNav and CoIN-Bench.

Won Shik Jang, Ue-Hwan KimWed, 11 Ma💻 cs

Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning

This paper investigates the reliability of Vision-Language Models (VLMs) in autonomous driving by exposing their tendencies toward response inconsistency and weak temporal reasoning, and subsequently proposes the FutureVQA benchmark and a self-supervised chain-of-thought tuning method to enhance grounded future scene reasoning without requiring temporal labels.

Chun-Peng Chang, Chen-Yu Wang, Holger Caesar, Alain PaganiWed, 11 Ma💻 cs

Beyond Short-Horizon: VQ-Memory for Robust Long-Horizon Manipulation in Non-Markovian Simulation Benchmarks

This paper introduces RuleSafe, a new long-horizon articulated manipulation benchmark featuring non-Markovian safe-unlocking tasks, and proposes VQ-Memory, a vector-quantized temporal representation that significantly enhances the planning, generalization, and efficiency of Vision-Language-Action models in complex robotic simulations.

Wang Honghui, Jing Zhi, Ao Jicong, Song Shiji, Li Xuelong, Huang Gao, Bai ChenjiaWed, 11 Ma💻 cs

RESBev: Making BEV Perception More Robust

The paper introduces RESBev, a plug-and-play framework that enhances the robustness of Bird's-eye-view (BEV) perception systems against sensor degradation and adversarial attacks by reframing the problem as latent semantic prediction to reconstruct clean features from corrupted observations.

Lifeng Zhuo, Kefan Jin, Zhe Liu, Hesheng WangWed, 11 Ma💻 cs

DCAU-Net: Differential Cross Attention and Channel-Spatial Feature Fusion for Medical Image Segmentation

This paper proposes DCAU-Net, a novel medical image segmentation framework that combines Differential Cross Attention to efficiently model long-range dependencies while reducing computational complexity, and a Channel-Spatial Feature Fusion strategy to adaptively integrate semantic and spatial details, thereby achieving enhanced segmentation accuracy and robustness.

Yanxin Li, Hui Wan, Libin LanWed, 11 Ma💻 cs

Dynamic Multimodal Expression Generation for LLM-Driven Pedagogical Agents: From User Experience Perspective

This paper proposes a large language model-driven method for generating dynamic, semantically aligned speech and gestures for pedagogical agents in virtual reality, demonstrating through user experience experiments that such multimodal expressions significantly enhance learning effectiveness, engagement, and social presence while reducing fatigue and boredom.

Ninghao Wan, Jiarun Song, Fuzheng YangWed, 11 Ma💻 cs

Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization

This paper proposes a reinforcement learning-based post-training strategy that extends Group Relative Policy Optimization (GRPO) with hybrid and process-level rewards to enable existing unified vision-language models to generate high-quality multimodal interleaved outputs without relying on large-scale interleaved datasets.

Ming Nie, Chunwei Wang, Jianhua Han, Hang Xu, Li ZhangWed, 11 Ma💻 cs

Memory-Guided View Refinement for Dynamic Human-in-the-loop EQA

This paper introduces DynHiL-EQA, a new dataset for dynamic human-in-the-loop Embodied Question Answering, and proposes DIVRR, a training-free framework that enhances robustness and inference efficiency by refining ambiguous views and selectively managing memory in dynamic environments.

Xin Lu, Rui Li, Xun Huang, Weixin Li, Chuanqing Zhuang, Jiayuan Li, Zhengda Lu, Jun Xiao, Yunhong WangWed, 11 Ma💻 cs

NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models

This paper introduces NS-VLA, a novel Neuro-Symbolic Vision-Language-Action framework that integrates symbolic encoding, solving, and online reinforcement learning to achieve superior data efficiency, zero-shot generalizability, and expanded exploration in robotic manipulation compared to existing methods.

Ziyue Zhu, Shangyang Wu, Shuai Zhao, Zhiqiu Zhao, Shengjie Li, Yi Wang, Fang Li, Haoran LuoWed, 11 Ma💻 cs

Compartmentalization-Aware Automated Program Repair

This paper presents a specialized, LLM-based Automated Program Repair framework designed to automatically secure cross-compartment interfaces in compartmentalized software by integrating a fuzzer, compartment-aware analysis techniques, and a validation component to address the limitations of existing general-purpose repair tools.

Jia Hu, Youcheng Sun, Pierre OlivierWed, 11 Ma💻 cs

← Previous Next →