RetoVLA: Reusing Register Tokens for Spatial Reasoning in Vision-Language-Action Models
This paper introduces RetoVLA, a lightweight Vision-Language-Action (VLA) model that improves spatial reasoning and real-world robotic performance. Instead of discarding the register tokens produced by the vision encoder, RetoVLA reuses them to inject global spatial context into the action-planning module, adding no extra parameters.
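The core mechanism can be sketched as follows. This is a minimal illustrative sketch, not RetoVLA's actual architecture: it assumes a ViT-style encoder with learnable register tokens (as in "registers" for Vision Transformers) and a hypothetical action head that cross-attends over those tokens instead of discarding them. All class and parameter names here are invented for illustration.

```python
import torch
import torch.nn as nn

class RegisterViT(nn.Module):
    """Toy ViT-style encoder with learnable register tokens.

    Stand-in for the paper's vision backbone (assumption). The register
    tokens attend alongside patch tokens and would normally be dropped
    after encoding.
    """
    def __init__(self, dim=64, n_registers=4, n_layers=2):
        super().__init__()
        self.n_registers = n_registers
        self.registers = nn.Parameter(torch.zeros(1, n_registers, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patch_tokens):
        b = patch_tokens.size(0)
        x = torch.cat([self.registers.expand(b, -1, -1), patch_tokens], dim=1)
        x = self.encoder(x)
        # Split outputs: registers are usually discarded here; RetoVLA
        # instead forwards them to the action module.
        return x[:, self.n_registers:], x[:, :self.n_registers]

class ActionHead(nn.Module):
    """Hypothetical action-planning head that reuses register tokens.

    A learned action query cross-attends over the encoded registers to
    pull in global spatial context, then projects to an action vector
    (7-DoF here as an example). The real fusion scheme may differ.
    """
    def __init__(self, dim=64, action_dim=7):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(dim, action_dim)

    def forward(self, registers):
        q = self.query.expand(registers.size(0), -1, -1)
        ctx, _ = self.attn(q, registers, registers)
        return self.out(ctx.squeeze(1))
```

Because the register tokens already exist inside the encoder, routing them to the action head adds no new parameters beyond the (small) cross-attention the head would need anyway, which matches the paper's "lightweight" claim.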