cs papers | Gist.Science

Beyond Imitation: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models

This paper proposes RL-Co, a reinforcement learning-based sim-real co-training framework that combines supervised fine-tuning on mixed real and simulated data with interactive simulation fine-tuning anchored by real-world data, achieving significant improvements in real-world success rates, generalization, and data efficiency for Vision-Language-Action models.

Liangzhi Shi, Shuaihang Chen, Feng Gao, Yinuo Chen, Kang Chen, Tonghe Zhang, Hongzhi Zang, Weinan Zhang, Chao Yu, Yu Wang | 2026-03-09 | 💻 cs

DAV-GSWT: Diffusion-Active-View Sampling for Data-Efficient Gaussian Splatting Wang Tiles

DAV-GSWT is a data-efficient framework that combines diffusion priors with active view sampling to synthesize high-fidelity Gaussian Splatting Wang Tiles from minimal input observations, enabling the generation of expansive, photorealistic landscapes without relying on densely sampled exemplar reconstructions.

Rong Fu, Jiekai Wu, Haiyun Wei, Yee Tan Jia, Yang Li, Xiaowen Ma, Wangyu Wu, Simon Fong | 2026-03-09 | 💻 cs

Operational Agency: A Permeable Legal Fiction for Tracing Culpability in AI Systems

This paper proposes "Operational Agency," a legal framework utilizing an "Operational Agency Graph" to trace and apportion human culpability in autonomous AI systems by evaluating their goal-directedness, foresight, and safety architecture, thereby ensuring accountability without granting AI legal personhood.

Anirban Mukherjee, Hannah Hanwen Chang | 2026-03-09 | 💻 cs

Robust Self-Supervised Cross-Modal Super-Resolution against Real-World Misaligned Observations

The paper proposes RobSelf, a robust self-supervised model that jointly optimizes a misalignment-aware feature translator and a content-aware reference filter to achieve state-of-the-art cross-modal super-resolution on real-world misaligned data with significantly improved efficiency.

Xiaoyu Dong, Jiahuan Li, Ziteng Cui, Naoto Yokoya | 2026-03-09 | 💻 cs

UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment

UrbanAlign proposes a novel post-hoc calibration framework that aligns frozen vision-language models with human preferences for urban scene assessment by mining interpretable dimensions, extracting robust concept scores via an Observer-Debater-Judge chain, and calibrating them through locally-weighted ridge regression, achieving state-of-the-art accuracy without any model retraining.

Yecheng Zhang, Rong Zhao, Zhizhou Sha, Yong Li, Lei Wang, Ce Hou, Wen Ji, Hao Huang, Yunshan Wan, Jian Yu, Junhao Xia, Yuru Zhang, Chunlei Shi | 2026-03-09 | 💻 cs

Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models

This paper demonstrates that affordance reasoning in Vision Foundation Models can be achieved in a zero-shot, training-free manner by fusing DINO's inherent geometric part structures with Flux's verb-conditioned interaction priors, thereby establishing geometric and interaction perception as the fundamental, composable building blocks of affordance understanding.

Qing Zhang, Xuesong Li, Jing Zhang | 2026-03-09 | 💻 cs

StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives

StoryTailor is a zero-shot pipeline that generates temporally coherent, action-rich multi-subject visual narratives on a single RTX 4090 by synergizing Gaussian-Centered Attention, Action-Boost Singular Value Reweighting, and a Selective Forgetting Cache to simultaneously ensure action faithfulness, subject identity fidelity, and cross-frame background continuity.

Jinghao Hu, Yuhe Zhang, GuoHua Geng, Kang Li, Han Zhang | 2026-03-09 | 💻 cs

UniVBench: Towards Unified Evaluation for Video Foundation Models

The paper introduces UniVBench, a comprehensive benchmark featuring 200 high-quality, human-created multi-shot videos and a unified agentic evaluation system (UniV-Eval) to holistically assess video foundation models across understanding, generation, editing, and reconstruction tasks, addressing the limitations of existing fragmented and task-specific evaluations.

Jianhui Wei, Xiaotian Zhang, Yichen Li, Yuan Wang, Yan Zhang, Ziyi Chen, Zhihang Tang, Wei Xu, Zuozhu Liu | 2026-03-09 | 💻 cs

Protein Graph Neural Networks for Heterogeneous Cryo-EM Reconstruction

This paper introduces a geometry-aware Graph Neural Network autodecoder that leverages protein-structure priors and ellipsoidal support lifting to achieve higher accuracy in heterogeneous single-particle cryo-EM reconstruction compared to traditional MLP-based methods.

Jonathan Krook, Axel Janson, Joakim Andén + 2 more | 2026-03-09 | 💻 cs

Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache

The paper introduces DPCache, a training-free acceleration framework for diffusion models that formulates sampling as a global path planning problem and utilizes dynamic programming on a path-aware cost tensor to select optimal key timesteps, thereby achieving significant speedups with minimal quality loss and even surpassing full-step baselines in certain metrics.

Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang, Hao Jiang, Zhengzheng Jin, Xu Liu, Pipei Huang | 2026-03-09 | 💻 cs

Unified Learning-to-Rank for Multi-Channel Retrieval in Large-Scale E-Commerce Search

This paper proposes a unified, query-dependent learning-to-rank model that effectively merges heterogeneous retrieval channels for large-scale e-commerce search by jointly optimizing business KPIs and capturing short-term user intent, resulting in a 2.85% conversion lift and deployment on Target.com while meeting strict latency constraints.

Aditya Gaydhani, Guangyue Xu, Dhanush Kamath, Ankit Singh, Alex Li | 2026-03-09 | 💻 cs

Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

This paper introduces Synthetic Visual Genome 2 (SVG2), a large-scale automated panoptic video scene graph dataset with over 636K videos, and presents TRaSER, a novel model that leverages trajectory-aligned token mechanisms to significantly outperform existing baselines in scene graph generation and downstream video question answering tasks.

Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Winson Han, Quan Kong, Rajat Saini, Ranjay Krishna | 2026-03-09 | 💻 cs

Learning Robust Control Policies for Inverted Pose on Miniature Blimp Robots

This paper presents a novel framework that combines a calibrated 3D simulation environment, a robust TD3-based control policy with domain randomization, and a mapping layer to successfully enable miniature blimp robots to achieve and maintain inverted poses in real-world settings.

Yuanlin Yang, Lin Hong, Fumin Zhang | 2026-03-09 | 💻 cs

Adaptive Dynamic Dehazing via Instruction-Driven and Task-Feedback Closed-Loop Optimization for Diverse Downstream Task Adaptation

This paper proposes a novel adaptive dynamic dehazing framework that utilizes a closed-loop optimization mechanism combining task performance feedback and text-based instruction guidance to enable real-time, training-free adaptation of dehazing outputs for diverse downstream vision tasks.

Yafei Zhang, Shuaitian Song, Huafeng Li, Shujuan Wang, Yu Liu | 2026-03-09 | 💻 cs

Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark

This paper introduces PanScale, a large-scale cross-scale pansharpening dataset and benchmark, alongside ScaleFormer, a novel transformer-based architecture that achieves superior generalization across varying image resolutions by reframing scale adaptation as sequence length generalization through tokenization and rotary positional encoding.

Ke Cao, Xuanhua He, Xueheng Li, Lingting Zhu, Yingying Wang, Ao Ma, Zhanjie Zhang, Man Zhou, Chengjun Xie, Jie Zhang | 2026-03-09 | 💻 cs

From OCR to Analysis: Tracking Correction Provenance in Digital Humanities Pipelines

This paper proposes a provenance-aware framework for tracking OCR correction lineage in digital humanities pipelines, demonstrating that recording edit details at the span level significantly improves the reproducibility and interpretability of downstream NLP tasks by revealing how textual transformations impact scholarly analysis.

Haoze Guo, Ziqi Wei | 2026-03-09 | 💻 cs

Mobile-VTON: High-Fidelity On-Device Virtual Try-On

Mobile-VTON is a high-fidelity, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices by utilizing a modular TGT architecture with feature-guided adversarial distillation and trajectory-consistency training to match server-based performance without requiring cloud computing.

Zhenchen Wan, Ce Chen, Runqi Lin, Jiaxin Huang, Tianxi Chen, Yanwu Xu, Tongliang Liu, Mingming Gong | 2026-03-09 | 💻 cs

ROSER: Few-Shot Robotic Sequence Retrieval for Scalable Robot Learning

The paper introduces ROSER, a lightweight few-shot retrieval framework that extracts reusable, task-centric segments from unlabeled robotic logs using only 3-5 reference examples, thereby overcoming data scarcity by enabling scalable, high-accuracy utilization of large-scale continuous interaction datasets without task-specific training.

Zillur Rahman, Eddison Pham, Alejandro Daniel Noel, Cristian Meo | 2026-03-09 | 💻 cs

FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters

FastLightGen is a novel algorithm that simultaneously compresses model parameters and reduces inference steps through an optimized teacher-student distillation framework, achieving state-of-the-art efficiency and visual quality in video generation with significantly fewer resources.

Shitong Shao, Yufei Gu, Zeke Xie | 2026-03-09 | 💻 cs

VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning

This paper introduces VSearcher, a reinforcement learning-based multimodal search agent that transforms static models into capable long-horizon web browsers through an iterative data synthesis pipeline and an SFT-then-RL training strategy, achieving superior performance on the proposed MM-SearchExam benchmark.

Ruiyang Zhang, Qianguo Sun, Chao Song, Yiyan Qi, Zhedong Zheng | 2026-03-09 | 💻 cs
