ActivePose: Active 6D Object Pose Estimation and Tracking for Robotic Manipulation

ActivePose proposes an active 6D object pose estimation and tracking framework that integrates a Vision-Language Model with "robotic imagination" to dynamically resolve viewpoint-induced ambiguities through Next-Best-View selection, and employs a diffusion policy for robust camera trajectory control, significantly outperforming classical baselines in both simulation and real-world robotic manipulation tasks.

Sheng Liu, Zhe Li, Weiheng Wang, Han Sun, Heng Zhang, Hongpeng Chen, Yusen Qin, Arash Ajoudani, Yizhao Wang · 2026-03-10 · cs

WHU-STree: A Multi-modal Benchmark Dataset for Street Tree Inventory

This paper introduces WHU-STree, a comprehensive, multi-modal benchmark dataset featuring synchronized point clouds and high-resolution images of over 21,000 street trees across two cities, designed to overcome limitations in existing datasets by enabling diverse inventory tasks and advancing research in multi-modal fusion and cross-domain generalization for urban tree management.

Ruifei Ding, Zhe Chen, Wen Fan + 5 more · 2026-03-10 · cs

GeoAware-VLA: Implicit Geometry Aware Vision-Language-Action Model

GeoAware-VLA improves the viewpoint generalization of Vision-Language-Action models by integrating features from a frozen, pretrained geometric vision model via a lightweight projection layer, achieving significant zero-shot gains on unseen camera poses across both simulation benchmarks and real-world robotic platforms, without requiring explicit 3D training data.

Ali Abouzeid, Malak Mansour, Qinbo Sun, Zezhou Sun, Dezhen Song · 2026-03-10 · cs

OIPP: Object-Adaptive Impact Point Predictor for Catching Diverse In-Flight Objects

This paper introduces the Object-Adaptive Impact Point Predictor (OIPP) and a new real-world dataset of 8,000 diverse trajectories, enabling basket-equipped quadruped robots to accurately predict the landing positions of various in-flight objects, even during early flight stages and for unseen objects, thereby significantly improving catching success rates.

Ngoc Huy Nguyen, Kazuki Shibata, Takamitsu Matsubara · 2026-03-10 · cs

Efficient Construction of Implicit Surface Models From a Single Image for Motion Generation

This paper introduces Fast Image-to-Neural Surface (FINS), a lightweight framework that efficiently reconstructs high-fidelity implicit surfaces and SDF fields from a single image within seconds by leveraging multi-resolution hash grids and pre-trained foundation models, outperforming existing methods in speed and accuracy for robotics applications.

Wei-Teng Chu, Tianyi Zhang, Matthew Johnson-Roberson, Weiming Zhi · 2026-03-10 · cs

Quantized Visual Geometry Grounded Transformer

This paper introduces QuantVGGT, the first quantization framework for billion-scale Visual Geometry Grounded Transformers (VGGTs), which overcomes their unique calibration and distribution challenges through Dual-Smoothed Fine-Grained Quantization and Noise-Filtered Diverse Sampling, achieving significant memory savings and speedups while maintaining high reconstruction accuracy.

Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu · 2026-03-10 · cs

Autonomous UAV-Quadruped Docking in Complex Terrains via Active Posture Alignment and Constraint-Aware Control

This paper presents an autonomous docking framework for UAVs and quadruped robots in GPS-denied, complex terrains, utilizing a deep reinforcement learning-based posture stabilization system for the ground robot and a three-phase, constraint-aware control strategy for the UAV to achieve successful landings on steep slopes and uneven surfaces.

Haozhe Xu, Cheng Cheng, Hongrui Sang, Zhipeng Wang, Qiyong He, Xiuxian Li, Bin He · 2026-03-10 · cs