cs papers | Gist.Science

SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation

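This paper proposes SurgCUT3R, a framework for 3D reconstruction from monocular endoscopic video in surgical scenes, where supervision data is scarce and drift accumulates over long sequences. It builds a large-scale pseudo-ground-truth depth generation pipeline from public stereo datasets, adopts a hybrid supervision strategy, and designs a hierarchical inference architecture, achieving continuous surgical scene understanding that is both accurate and efficient.

Kaiyuan Xu, Fangzhou Hong, Daniel Elson, Baoru Huang | 2026-03-10 | 💻 cs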

T2SGrid: Temporal-to-Spatial Gridification for Video Temporal Grounding

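This paper proposes T2SGrid, a framework that rearranges the frames of a video clip in temporal order into a composite grid image, turning video temporal understanding into a spatial understanding task. This addresses the high computational cost, sparse attention, and loss of spatial detail that existing temporal modeling methods face, and yields superior performance on video temporal grounding benchmarks.

Chaohong Guo, Yihan He, Yongwei Nie, Fei Ma, Xuemiao Xu, Chengjiang Long | 2026-03-10 | 💻 cs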
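
The gridification step described in the T2SGrid summary can be illustrated with a minimal sketch. It assumes equally sized frames stored as NumPy arrays; the function name `frames_to_grid`, the row-major layout, and the zero padding of the last row are illustrative assumptions, not details taken from the paper.

```python
import math
import numpy as np

def frames_to_grid(frames, cols):
    """Tile equally sized frames (H, W, C) into one image, row-major in time order."""
    frames = np.asarray(frames)
    n, h, w, c = frames.shape
    rows = math.ceil(n / cols)
    # Assumption: pad with blank frames so the last grid row is complete.
    pad = rows * cols - n
    if pad:
        blank = np.zeros((pad, h, w, c), dtype=frames.dtype)
        frames = np.concatenate([frames, blank], axis=0)
    # (rows, cols, h, w, c) -> (rows, h, cols, w, c) -> (rows*h, cols*w, c)
    grid = frames.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)

# Five tiny 2x3 single-channel frames; frame t is filled with the value t.
clip = np.stack([np.full((2, 3, 1), t, dtype=np.uint8) for t in range(5)])
grid = frames_to_grid(clip, cols=3)
```

Here the five frames land in a 2x3 grid: frames 0 to 2 fill the top row, frames 3 and 4 start the bottom row, and the last cell is blank padding.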

VSL-Skin: Individually Addressable Phase-Change Voxel Skin for Variable-Stiffness and Virtual Joints Bridging Soft and Rigid Robots

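This paper proposes VSL-Skin, the first variable-stiffness lattice skin system with individually addressable voxel control at centimeter-level precision. Using phase-change materials, it achieves nearly two orders of magnitude of stiffness modulation, 30% axial compression, and self-repair while preserving structural integrity, which supports programmable virtual joints and bridges the gap between soft and rigid robots.

Zihan Oliver Zeng, Jiajun An, Preston Luk, Upinder Kaur | 2026-03-10 | 💻 cs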

Configurable Runtime Orchestration for Dynamic Data Retrieval in Distributed Systems

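This paper proposes a configuration-based runtime orchestration framework that generates the execution graph dynamically at request time and schedules steps in parallel according to their dependencies. It addresses the limited integration flexibility that predefined workflows cause in distributed systems, enabling efficient, low-latency dynamic data retrieval without redeploying code.

Abhiram Kandiraju | 2026-03-10 | 💻 cs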
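
As a rough illustration of request-time graph construction with dependency-aware parallel scheduling, the sketch below uses only the Python standard library. The config schema (`source`, `needs`), the `fetchers` registry, and `run_plan` are hypothetical names invented for this example; the paper's actual framework may differ.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from graphlib import TopologicalSorter

def run_plan(config, fetchers, max_workers=4):
    """Build an execution graph from config at request time; run ready steps in parallel."""
    graph = {name: set(step.get("needs", [])) for name, step in config.items()}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    results, pending = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every step whose dependencies have all finished.
            for name in sorter.get_ready():
                inputs = {dep: results[dep] for dep in graph[name]}
                future = pool.submit(fetchers[config[name]["source"]], inputs)
                pending[future] = name
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                name = pending.pop(future)
                results[name] = future.result()
                sorter.done(name)  # unblocks dependents
    return results

# Hypothetical config: "orders" and "prefs" both depend on "user" and can run concurrently.
config = {
    "user": {"source": "user_api"},
    "orders": {"source": "orders_api", "needs": ["user"]},
    "prefs": {"source": "prefs_api", "needs": ["user"]},
    "page": {"source": "merge", "needs": ["orders", "prefs"]},
}
fetchers = {
    "user_api": lambda inputs: {"id": 7},
    "orders_api": lambda inputs: [f"order-for-{inputs['user']['id']}"],
    "prefs_api": lambda inputs: {"theme": "dark"},
    "merge": lambda inputs: {"orders": inputs["orders"], "prefs": inputs["prefs"]},
}
result = run_plan(config, fetchers)
```

Because the graph is derived from `config` on each call, changing the retrieval flow in this sketch means editing data rather than code, which is the property the summary emphasizes.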

Optimizing Multi-Modal Models for Image-Based Shape Retrieval: The Role of Pre-Alignment and Hard Contrastive Learning

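The paper proposes combining pre-aligned multimodal encoders (such as OpenShape and Point-BERT) with multimodal hard contrastive learning (HCL) to perform zero-shot and supervised image-to-3D-shape retrieval without view synthesis or retraining on the target database, outperforming existing methods on multiple datasets.

Paul Julius Kühn, Cedric Spengler, Michael Weinmann, Arjan Kuijper, Saptarshi Neil Sinha | 2026-03-10 | 💻 cs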

Perception-Aware Multimodal Spatial Reasoning from Monocular Images

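The paper proposes a perception-aware multimodal spatial reasoning framework that introduces visual reference tokens (VRT) for object-level grounding and builds a multimodal chain-of-thought dataset. With standard supervised fine-tuning alone, it substantially outperforms existing methods on the SURDS benchmark, including those post-trained with reinforcement learning, and markedly improves spatial understanding in monocular driving scenes.

Yanchun Cheng, Rundong Wang, Xulei Yang, Alok Prakash, Daniela Rus, Marcelo H Ang Jr, ShiJie Li | 2026-03-10 | 💻 cs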

ADAS-TO: A Large-Scale Multimodal Naturalistic Dataset and Empirical Characterization of Human Takeovers during ADAS Engagement

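This paper releases ADAS-TO, the first large-scale naturalistic driving dataset focused on the transition from ADAS to human takeover. It contains 15,659 synchronized video and CAN log clips from 327 drivers. By combining kinematic screening with vision-language model analysis, the paper characterizes risk in critical takeover events and finds that actionable visual cues appear 3 seconds in advance, providing a basis for developing semantics-aware warning systems.

Yuhang Wang, Yiyao Xu, Jingran Sun, Hao Zhou | 2026-03-10 | 💻 cs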

Foundational World Models Accurately Detect Bimanual Manipulator Failures

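The paper proposes a probabilistic world model built on the compressed latent space of a pretrained vision foundation model (Cosmos Tokenizer). Combined with a conformal prediction framework that produces uncertainty metrics, it forms a runtime monitor that detects anomalous failures in bimanual manipulation tasks efficiently and accurately, with very few parameters and without explicitly defined failure modes.

Isaac R. Ward, Michelle Ho, Houjun Liu, Aaron Feldman, Joseph Vincent, Liam Kruse, Sean Cheong, Duncan Eddy, Mykel J. Kochenderfer, Mac Schwager | 2026-03-10 | 💻 cs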
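
To make the conformal prediction idea concrete, here is a minimal sketch of a generic split-conformal threshold applied to scalar uncertainty scores. It assumes that higher scores mean more surprising observations and that calibration scores come from failure-free runs; the function names and the numbers are illustrative, not the paper's monitor.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[k - 1]

def is_anomalous(score, threshold):
    """Flag a runtime score that exceeds the calibrated threshold."""
    return score > threshold

# Illustrative calibration scores from nominal (failure-free) runs.
calibration = [0.3, 0.1, 0.4, 0.2, 0.5, 0.9, 0.6, 0.8, 0.7, 1.0]
threshold = conformal_threshold(calibration, alpha=0.25)
```

Under the usual exchangeability assumption, a nominal score exceeds this threshold with probability at most `alpha`, so no failure modes need to be enumerated in advance.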

MipSLAM: Alias-Free Gaussian Splatting SLAM

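This paper proposes MipSLAM, a real-time framework that uses an elliptical adaptive anti-aliasing algorithm, spectral-aware pose graph optimization, and a local frequency-domain perceptual loss to address the aliasing artifacts and trajectory drift of existing 3D Gaussian Splatting SLAM systems, achieving high-fidelity rendering and robust localization across multiple resolutions.

Yingzhao Li, Yan Li, Shixiong Tian, Yanjie Liu, Lijun Zhao, Gim Hee Lee | 2026-03-10 | 💻 cs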

AdaGen: Learning Adaptive Policy for Image Synthesis

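AdaGen proposes a general adaptive framework based on reinforcement learning and an adversarial reward mechanism. It formulates the scheduling of step-wise parameters during image generation as a Markov decision process and optimizes it dynamically, reducing inference cost while significantly improving image quality and diversity across multiple generative paradigms.

Zanlin Ni, Yulin Wang, Yeguo Hua, Renping Zhou, Jiayi Guo, Jun Song, Bo Zheng, Gao Huang | 2026-03-10 | 💻 cs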

Large Language Model-Driven Full-Component Evolution of Adaptive Large Neighborhood Search

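The paper proposes a closed-loop evolutionary framework driven by large language models that automatically redesigns all seven core components of adaptive large neighborhood search (ALNS), significantly improving solution quality on TSPLIB benchmarks and revealing counterintuitive design patterns.

Shaohua Yu, Tianyu Chen, Linyan Liu | 2026-03-10 | 💻 cs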

TrajPred: Trajectory-Conditioned Joint Embedding Prediction for Surgical Instrument-Tissue Interaction Recognition in Vision-Language Models

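The paper proposes TrajPred, a framework that encodes surgical instrument trajectories to introduce temporal motion cues and combines prompt tuning with verb rephrasing to produce fine-grained visual-semantic embeddings, significantly improving both the accuracy of instrument-tissue interaction recognition in robotic surgery and vision-text alignment.

Jiajun Cheng, Xiaofan Yu, Subarna, Sainan Liu, Shan Lin | 2026-03-10 | 💻 cs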

cs

Privacy-Preserving Patient Identity Management Framework for Secure Healthcare Access

Two-Stage Path Following for Mobile Manipulators via Dimensionality-Reduced Graph Search and Numerical Optimization

An Extended Consent-Based Access Control Framework: Pre-Commit Validation and Emergency Access

Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures

SuperSkillsStack: Agency, Domain Knowledge, Imagination, and Taste in Human-AI Design Education

OV-DEIM: Real-time DETR-Style Open-Vocabulary Object Detection with GridSynthetic Augmentation

Enhancing Web Agents with a Hierarchical Memory Tree

Two Frames Matter: A Temporal Attack for Text-to-Video Model Jailbreaking