cs.CV papers | Gist.Science

MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling

MoSA is a novel framework that decouples human video generation into structure and appearance components, utilizing a 3D structure transformer and specialized constraints to achieve superior motion coherence and realistic human-environment interactions compared to existing models.

Haoyu Wang, Hao Tang, Donglin Di + 5 more · 2026-02-25 · 💻 cs

Decouple, Reorganize, and Fuse: A Multimodal Framework for Cancer Survival Prediction

This paper proposes DeReF, a novel multimodal framework for cancer survival prediction that addresses limitations in existing fusion methods by introducing a random feature reorganization strategy between modality decoupling and dynamic Mixture-of-Experts fusion to enhance feature diversity and inter-modal information interaction.

Huayi Wang, Haochao Ying, Yuyang Xu + 5 more · 2026-02-25 · 💻 cs

Learning Unified Representations from Heterogeneous Data for Robust Heart Rate Modeling

This paper proposes a robust heart rate modeling framework that addresses source and user heterogeneity through random feature dropout, history-aware attention, and contrastive learning, achieving significant performance improvements on a new benchmark dataset (PARROTAO) and existing public data.

Zhengdong Huang, Zicheng Xie, Wentao Tian + 3 more · 2026-02-25 · 🤖 cs.LG

EHWGesture -- A dataset for multimodal understanding of clinical gestures

This paper introduces EHWGesture, a comprehensive multimodal video dataset featuring synchronized RGB-Depth and event camera recordings with precise motion capture ground truth, designed to advance clinical gesture understanding through diverse multi-view data and embedded action quality assessments.

Gianluca Amprimo, Alberto Ancilotto, Alessandro Savino + 5 more · 2026-02-25 · 🤖 cs.AI

PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models

This paper introduces Proportionate Credit Policy Optimization (PCPO), a novel framework that stabilizes reinforcement learning for text-to-image models by correcting disproportionate credit assignment, thereby accelerating convergence, preventing model collapse, and significantly outperforming state-of-the-art baselines like DanceGRPO.

Jeongjae Lee, Jong Chul Ye · 2026-02-25 · 🤖 cs.AI

On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations

This paper introduces RobustVLA, a framework that enhances Vision-Language-Action models against diverse multi-modal perturbations through output-level adversarial optimization and input-level semantic consistency, achieving significant performance gains over state-of-the-art baselines on both simulated and real-world robotic tasks.

Jianing Guo, Zhenhong Wu, Chang Tu + 13 more · 2026-02-25 · 🤖 cs.AI

DeLTa: Demonstration and Language-Guided Novel Transparent Object Manipulation

The paper proposes DeLTa, a novel framework that integrates depth estimation, 6D pose estimation, and vision-language planning to enable precise, long-horizon manipulation of novel transparent objects using only a single demonstration and natural language instructions.

Taeyeop Lee, Gyuree Kang, Bowen Wen + 5 more · 2026-02-25 · 💻 cs

Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models

This paper introduces Spatial-DISE, a unified benchmark and large-scale dataset grounded in a four-quadrant taxonomy of spatial reasoning, which reveals significant gaps between current Vision-Language Models and human competence while providing a scalable framework for advancing human-like spatial intelligence.

Xinmiao Huang, Qisong He, Zhenglin Huang + 5 more · 2026-02-25 · 💻 cs

UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

UniGenBench++ is a unified, multilingual semantic evaluation benchmark for text-to-image generation that addresses the limitations of existing datasets through a diverse, hierarchically structured set of 600 prompts and 27 fine-grained criteria, leveraging both a state-of-the-art MLLM and a trained offline evaluator to systematically assess model robustness and semantic consistency.

Yibin Wang, Zhimin Li, Yuhang Zang + 8 more · 2026-02-25 · 💻 cs

egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks

This paper introduces egoEMOTION, the first dataset combining egocentric visual and physiological signals with self-reported emotion and personality data from 43 participants, to establish new benchmarks for affect and trait recognition in real-world scenarios.

Matthias Jammot, Björn Braun, Paul Streli + 2 more · 2026-02-25 · 💻 cs

Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes

This paper proposes a novel framework that integrates 3D acoustic localization from a phased microphone array with dynamic RGB-D point clouds to generate 4D audio-visual representations, enabling precise spatial mapping of surgical tool-tissue interactions for enhanced multimodal scene understanding in dynamic operating rooms.

Jonas Hein, Lazaros Vlachopoulos, Maurits Geert Laurent Olthof + 3 more · 2026-02-25 · ⚡ eess

SpecAware: A Spectral-Content Aware Foundation Model for Unifying Multi-Sensor Learning in Hyperspectral Remote Sensing Mapping

This paper introduces SpecAware, a novel spectral-content aware foundation model that leverages a hypernetwork-driven embedding process and a new 400k-scale dataset to unify multi-sensor hyperspectral remote sensing learning by dynamically adapting to varying spectral channels through sensor meta-attributes and image semantic features.

Renjie Ji, Xue Wang, Chao Niu + 3 more · 2026-02-25 · 💻 cs

A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding

This paper introduces VCFlow, a novel hierarchical architecture inspired by the human visual system's ventral-dorsal streams that achieves subject-agnostic fMRI-based visual reconstruction with high efficiency and minimal accuracy loss, eliminating the need for extensive subject-specific training data.

Jingyu Lu, Haonan Wang, Qixiang Zhang + 1 more · 2026-02-25 · 🤖 cs.AI

Changes in Real Time: Online Scene Change Detection with Multi-View Fusion

This paper presents the first pose-agnostic, label-free online Scene Change Detection method that leverages multi-view fusion, PnP-based pose estimation, and 3D Gaussian Splatting to achieve real-time performance exceeding 10 FPS while surpassing the accuracy of existing offline approaches.

Chamuditha Jayanga Galappaththige, Jason Lai, Lloyd Windrim + 3 more · 2026-02-25 · 💻 cs

CuriGS: Curriculum-Guided Gaussian Splatting for Sparse View Synthesis

CuriGS is a curriculum-guided framework that enhances sparse-view 3D Gaussian Splatting reconstruction by progressively training with pseudo-views of increasing perturbation levels, which are selectively promoted to the training set based on multi-signal quality metrics to overcome supervision scarcity and overfitting.

Zijian Wu, Mingfeng Jiang, Zidian Lin + 5 more · 2026-02-25 · 💻 cs

Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers

This paper introduces Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible framework that reduces Diffusion Transformer parameters by 50% with minimal performance loss through redundant layer identification and a plug-and-play alternating distillation scheme, enabling efficient deployment in resource-constrained environments.

Jian Ma, Qirong Peng, Xujie Zhu + 3 more · 2026-02-25 · 💻 cs

Seeing What Matters: Visual Preference Policy Optimization for Visual Generation

This paper introduces Visual Preference Policy Optimization (ViPO), a lightweight and architecture-agnostic variant of Group Relative Policy Optimization that enhances visual generation by replacing coarse scalar rewards with structured, pixel-level advantage maps to better align models with human preferences and correct localized artifacts.

Ziqi Ni, Yuanzhi Liang, Rui Li + 4 more · 2026-02-25 · 💻 cs

The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection

This paper introduces KeyTailor, a novel framework featuring a keyframe-driven details injection strategy to enhance garment dynamics and background integrity in video virtual try-on without increasing architectural complexity, accompanied by the large-scale ViT-HD dataset to address data limitations.

Qingdong He, Xueqin Chen, Yanjie Pan + 7 more · 2026-02-25 · 💻 cs

CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving

CogFlow is a novel three-stage framework that enhances visual mathematical problem solving by introducing a knowledge internalization stage and specialized reward mechanisms to ensure extracted visual cues are faithfully integrated into reasoning, supported by a new high-quality dataset called MathCog.

Shuhang Chen, Yunqiu Xu, Junjie Xie + 7 more · 2026-02-25 · 🤖 cs.AI

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Fast-ThinkAct is an efficient Vision-Language-Action framework that utilizes preference-guided distillation of verbalizable latent reasoning to significantly reduce inference latency while maintaining strong performance in long-horizon planning, few-shot adaptation, and failure recovery.

Chi-Pin Huang, Yunze Man, Zhiding Yu + 4 more · 2026-02-25 · 🤖 cs.AI
