cs.CV papers | Gist.Science

Point Cloud as a Foreign Language for Multi-modal Large Language Model

The paper introduces SAGE, the first end-to-end multi-modal large language model that treats raw point clouds as a "foreign language" via a lightweight 3D tokenizer and semantic alignment-based preference optimization, achieving superior performance and efficiency over existing encoder-based methods in 3D understanding tasks.

Sneha Paul, Zachary Patterson, Nizar Bouguila | Wed, 11 Ma | cs

Progressive Split Mamba: Effective State Space Modelling for Image Restoration

The paper proposes Progressive Split-Mamba (PS-Mamba), a topology-aware hierarchical state-space framework that addresses the spatial distortion and long-range decay limitations of standard Mamba models in image restoration by employing geometry-consistent partitioning and symmetric cross-scale shortcuts to effectively balance local structural preservation with global coherence.

Mohammed Hassanin, Nour Moustafa, Weijian Deng, Ibrahim Radwan | Wed, 11 Ma | cs

RTFDNet: Fusion-Decoupling for Robust RGB-T Segmentation

RTFDNet is a three-branch encoder-decoder network that unifies synergistic feature fusion with cross-modal and region decoupling regularization to achieve robust RGB-T semantic segmentation, enabling strong performance even when sensor signals are partially missing without requiring multi-stage training.

Kunyu Tan, Mingjian Liang | Wed, 11 Ma | cs

Agentic AI as a Network Control-Plane Intelligence Layer for Federated Learning over 6G

This paper proposes an Agentic AI framework that serves as a control-plane intelligence layer for 6G networks, utilizing specialized agents to dynamically manage federated learning tasks by integrating network conditions with learning objectives to optimize client selection, resource allocation, and scheduling.

Loc X. Nguyen, Ji Su Yoon, Huy Q. Le, Yu Qiao, Avi Deb Raha, Eui-Nam Huh, Nguyen H. Tran, Choong Seon Hong | Wed, 11 Ma | cs

Rotation Equivariant Mamba for Vision Tasks

This paper introduces EQ-VMamba, the first rotation-equivariant visual Mamba architecture that incorporates a specialized cross-scan strategy and group Mamba blocks to achieve superior performance and robustness across various vision tasks while reducing parameter counts by approximately 50% compared to non-equivariant baselines.

Zhongchen Zhao, Qi Xie, Keyu Huang, Lei Zhang, Deyu Meng, Zongben Xu | Wed, 11 Ma | cs

Transformer-Based Multi-Region Segmentation and Radiomic Analysis of HR-pQCT Imaging

This paper introduces a novel, fully automated framework that utilizes a SegFormer transformer to segment multiple anatomical regions in HR-pQCT images and extract radiomic features, demonstrating that soft tissue analysis outperforms traditional bone-based metrics in detecting osteoporosis.

Mohseu Rashid Subah, Mohammed Abdul Gani Zilani, Thomas L. Nickolas, Matthew R. Allen, Stuart J. Warden, Rachel K. Surowiec | Wed, 11 Ma | cs

Progressive Representation Learning for Multimodal Sentiment Analysis with Incomplete Modalities

This paper proposes PRLF, a Progressive Representation Learning Framework that addresses missing modalities in Multimodal Sentiment Analysis by dynamically estimating modality reliability and iteratively aligning incomplete features with a dominant modality to enhance robustness and performance.

Jindi Bao, Jianjun Qian, Mengkai Yan, Jian Yang | Wed, 11 Ma | cs

Training-free Motion Factorization for Compositional Video Generation

This paper proposes a training-free, model-agnostic motion factorization framework that decomposes the motion in complex video generation into three categories (motionlessness, rigid motion, and non-rigid motion) through a "planning before generation" paradigm, synthesizing diverse instances with controlled appearance and motion.

Zixuan Wang, Ziqin Zhou, Feng Chen, Duo Peng, Yixin Hu, Changsheng Li, Yinjie Lei | Wed, 11 Ma | cs

MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration

MedKCO is a medical vision-language pretraining framework that overcomes the limitations of simultaneous concept learning by employing a two-level curriculum for data ordering and a self-paced asymmetric contrastive loss to dynamically adjust the learning objective, thereby significantly improving feature representations and downstream task performance.

Chenran Zhang, Ruiqi Wu, Tao Zhou, Yi Zhou | Wed, 11 Ma | cs

Chain of Event-Centric Causal Thought for Physically Plausible Video Generation

This paper proposes a framework for physically plausible video generation that models phenomena as causally connected event sequences by integrating physics-driven chain-of-thought reasoning with transition-aware cross-modal prompting to ensure dynamic continuity and physical consistency.

Zixuan Wang, Yixin Hu, Haolan Wang, Feng Chen, Yan Liu, Wen Li, Yinjie Lei | Wed, 11 Ma | cs

OmniEdit: A Training-free framework for Lip Synchronization and Audio-Visual Editing

OmniEdit is a novel, training-free framework that achieves robust lip synchronization and audio-visual editing by reformulating the editing paradigm to replace the edit sequence with a target sequence, thereby eliminating the need for supervised fine-tuning and ensuring a smooth, stable generation process.

Lixiang Lin, Siyuan Jin, Jinshan Zhang | Wed, 11 Ma | cs

Intelligent Spatial Estimation for Fire Hazards in Engineering Sites: An Enhanced YOLOv8-Powered Proximity Analysis Framework

This paper presents an enhanced dual-model YOLOv8 framework that integrates fire and smoke detection with proximity-based risk assessment to generate quantitative hazard scores and real-time situational awareness for engineering sites.

Ammar K. AlMhdawi, Nonso Nnamoko, Alaa Mashan Ubaid | Wed, 11 Ma | cs

Spectral-Structured Diffusion for Single-Image Rain Removal

The paper introduces SpectralDiff, a spectral-structured diffusion framework that leverages structured spectral perturbations and a full-product U-Net architecture to efficiently and effectively remove multi-directional rain streaks from single images.

Yucheng Xing, Xin Wang | Wed, 11 Ma | cs

Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning

This paper proposes a novel diffusion-based framework that enhances Copy Detection Pattern authentication by integrating printer signatures and ControlNet to effectively distinguish genuine prints from high-quality counterfeits, outperforming traditional methods in generalization and accuracy.

Bolutife Atoki, Iuliia Tkachenko, Bertrand Kerautret, Carlos Crispim-Junior | Wed, 11 Ma | cs

SkipGS: Post-Densification Backward Skipping for Efficient 3DGS Training

SkipGS is a plug-and-play method that accelerates 3D Gaussian Splatting training by adaptively skipping redundant backward passes during the post-densification phase based on view-specific loss statistics, achieving a 23.1% reduction in total training time without compromising reconstruction quality.

Jingxing Li, Yongjae Lee, Deliang Fan | Wed, 11 Ma | cs

SurgCalib: Gaussian Splatting-Based Hand-Eye Calibration for Robot-Assisted Minimally Invasive Surgery

This paper presents SurgCalib, a markerless, Gaussian Splatting-based framework that achieves accurate hand-eye calibration for the da Vinci surgical robot by refining kinematic estimates through a differentiable rendering pipeline, thereby overcoming cable-driven inaccuracies and avoiding the sterility issues associated with traditional fiducial markers.

Zijian Wu, Shuojue Yang, Yu Chung Lee, Eitan Prisman, Yueming Jin, Septimiu E. Salcudean | Wed, 11 Ma | cs

SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing

The paper introduces SVG-EAR, a parameter-free method that enhances sparse video generation in Diffusion Transformers by using semantic clustering for linear compensation and error-aware routing to selectively compute high-error blocks, thereby achieving significant speedups while maintaining generation fidelity.

Xuanyi Zhou, Qiuyang Mang, Shuo Yang, Haocheng Xi, Jintao Zhang, Huanzhi Mao, Joseph E. Gonzalez, Kurt Keutzer, Ion Stoica, Alvin Cheung | Wed, 11 Ma | cs

LiM-YOLO: Less is More with Pyramid Level Shift and Normalized Auxiliary Branch for Ship Detection in Optical Remote Sensing Imagery

LiM-YOLO is a streamlined ship detection model for optical remote sensing imagery that achieves state-of-the-art accuracy with fewer parameters by shifting the detection pyramid from P3-P5 to P2-P4 to better resolve small vessels and employing Group Normalization to stabilize training on high-resolution inputs.

Seon-Hoon Kim, Hyeji Sim, Youeyun Jung, Ok-Chul Jung, Yerin Kim | Wed, 11 Ma | eess
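
As a rough illustration of why the P3-P5 to P2-P4 shift in the LiM-YOLO summary matters for small vessels, the sketch below assumes the conventional feature-pyramid naming in which level P_k has stride 2^k relative to the input. The function names, the 1024-pixel input, and the 12-pixel object size are hypothetical values chosen for illustration, not figures from the paper.

```python
# Sketch under the common convention that pyramid level P_k has stride 2**k.
# Function names and the sizes used below are illustrative assumptions only.

def pyramid_grid_sizes(input_size, levels):
    """Map each level name to (stride, feature-map side) for a square input."""
    return {f"P{k}": (2 ** k, input_size // 2 ** k) for k in levels}


def cells_covered(object_px, stride):
    """Approximate number of feature-map cells spanned by an object side."""
    return object_px / stride


if __name__ == "__main__":
    size = 1024  # hypothetical input resolution
    print("P3-P5:", pyramid_grid_sizes(size, range(3, 6)))
    print("P2-P4:", pyramid_grid_sizes(size, range(2, 5)))
    # A hypothetical 12-pixel vessel at the finest level of each pyramid.
    print("cells at P3:", cells_covered(12, 8))
    print("cells at P2:", cells_covered(12, 4))
```

Under this convention the finest detection level moves from stride 8 to stride 4, so a small object spans twice as many cells per side on the finest feature map.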

Exploiting Completeness Perception with Diffusion Transformer for Unified 3D MRI Synthesis

This paper introduces CoPeDiT, a unified 3D MRI synthesis framework that leverages a self-perceptive latent diffusion model with completeness-aware prompts to generate high-fidelity, structurally consistent images without relying on external manual guidance for missing data.

Junkai Liu, Nay Aung, Theodoros N. Arvanitis, Joao A. C. Lima, Steffen E. Petersen, Le Zhang | Wed, 11 Ma | eess

TIDE: Text-Informed Dynamic Extrapolation with Step-Aware Temperature Control for Diffusion Transformers

TIDE is a training-free method that enables Diffusion Transformers to generate high-resolution images with arbitrary aspect ratios by introducing a text anchoring mechanism to correct prompt information loss and a step-aware dynamic temperature control to eliminate artifacts caused by attention dilution.

Yihua Liu, Fanjiang Ye, Bowen Lin, Rongyu Fang, Chengming Zhang | Wed, 11 Ma | cs
