VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI

The paper introduces VocSegMRI, a multimodal framework that leverages cross-attention fusion and contrastive learning to integrate video, audio, and phonological signals, achieving state-of-the-art vocal tract segmentation in real-time MRI with a Dice score of 0.95 and robust performance even when audio is unavailable.

Daiqi Liu, Tomás Arias-Vergara, Johannes Enk, Fangxu Xing, Maureen Stone, Jerry L. Prince, Jana Hutter, Andreas Maier, Jonghye Woo, Paula Andrea Pérez-Toro
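For context on the 0.95 figure reported above: the Dice score is the standard region-overlap metric for segmentation, 2|A∩B| / (|A|+|B|). The paper's fusion architecture is not reproduced here; this is only a minimal NumPy sketch of the metric itself.

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    # eps guards against division by zero when both masks are empty
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

# Two 4x4 blocks with half their area in common score 0.5
a = np.zeros((8, 8), dtype=bool); a[2:6, 2:6] = True
b = np.zeros((8, 8), dtype=bool); b[2:6, 4:8] = True
print(dice_score(a, a))  # ~1.0
print(dice_score(a, b))  # ~0.5
```

A score of 0.95 therefore means predicted and ground-truth vocal-tract masks overlap almost completely.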

CoRe-GS: Coarse-to-Refined Gaussian Splatting with Semantic Object Focus

CoRe-GS is a coarse-to-refined Gaussian Splatting framework that accelerates 3D reconstruction for robotic applications by selectively optimizing only task-relevant points of interest, thereby significantly reducing training time and mitigating artifacts while maintaining high-quality semantic segmentation.

Hannah Schieber, Dominik Frischmann, Victor Schaack, Simon Boche, Angela Schoellig, Stefan Leutenegger, Daniel Roth

Improving Large Vision-Language Models' Understanding for Flow Field Data

This paper introduces FieldLVLM, a novel framework that enhances Large Vision-Language Models' ability to interpret complex scientific field data by combining a specialized pipeline for extracting physical features into structured text with a data-compressed tuning strategy, resulting in superior performance on scientific benchmarks.

Xiaomei Zhang, Hanyu Zheng, Xiangyu Zhu, Jinghuan Wei, Junhong Zou, Zhen Lei, Zhaoxiang Zhang

SpikeSMOKE: Spiking Neural Networks for Monocular 3D Object Detection with Cross-Scale Gated Coding

This paper proposes SpikeSMOKE, a low-power monocular 3D object detection framework based on Spiking Neural Networks that introduces a Cross-Scale Gated Coding mechanism and a lightweight residual block to overcome information loss and computational inefficiency, achieving superior performance on KITTI and other datasets while significantly reducing energy consumption and model complexity compared to traditional ANN-based approaches.

Xuemei Chen, Huamin Wang, Jing Peng, Hangchi Shen, Shukai Duan, Shiping Wen, Tingwen Huang

M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for optical-SAR Object Detection

This paper introduces M4-SAR, a large-scale, multi-resolution, multi-polarization, and multi-source dataset with nearly one million labeled instances, alongside a unified benchmarking toolkit and a novel end-to-end fusion framework (E2E-OSDet) that collectively advance optical-SAR object detection by demonstrating significant performance gains over single-source methods in complex environments.

Chao Wang, Wei Lu, Xiang Li, Jian Yang, Lei Luo

Zooming In on Fakes: A Novel Dataset for Localized AI-Generated Image Detection with Forgery Amplification Approach

This paper introduces BR-Gen, a large-scale dataset of 150,000 locally forged images with diverse scene-aware annotations, and proposes NFA-ViT, a noise-guided Vision Transformer that amplifies subtle forgery traces to significantly improve the detection and generalization of localized AI-generated image forgeries.

Lvpan Cai, Haowei Wang, Jiayi Ji, Yanshu Zhoumen, Shen Chen, Taiping Yao, Xiaoshuai Sun

A Survey on Wi-Fi Sensing Generalizability: Taxonomy, Techniques, Datasets, and Future Research Prospects

This survey provides a comprehensive review of over 200 papers on Wi-Fi sensing generalizability, offering a structured taxonomy of techniques to address domain shifts, summarizing key datasets, and outlining future research directions and community resources.

Fei Wang, Tingting Zhang, Wei Xi, Han Ding, Ge Wang, Di Zhang, Yuanhao Cui, Fan Liu, Jinsong Han, Jie Xu, Tony Xiao Han

Unveiling the Potential of iMarkers: Invisible Fiducial Markers for Advanced Robotics

This paper introduces iMarkers, a novel class of invisible fiducial markers detectable only by robots and AR devices, which overcome the visual aesthetic limitations of traditional markers while offering customizable production, robust detection algorithms, and proven effectiveness across diverse robotics scenarios.

Ali Tourani, Deniz Isinsu Avsar, Hriday Bavle, Jose Luis Sanchez-Lopez, Jan Lagerwall, Holger Voos

ARSGaussian: 3D Gaussian Splatting with LiDAR for Aerial Remote Sensing Novel View Synthesis

This paper introduces ARSGaussian, a novel view synthesis method for aerial remote sensing that integrates LiDAR constraints, distortion-aware coordinate transformations, and geometric consistency losses to mitigate floaters and overgrowth while achieving high-precision geo-alignment, supported by the newly released AIR-LONGYAN dataset.

Yiling Yao, Bing Zhang, Wenjuan Zhang, Lianru Gao, Dailiang Peng, Bocheng Li, Yaning Wang, Bowen Wang

TIMotion: Temporal and Interactive Framework for Efficient Human-Human Motion Generation

TIMotion is an efficient framework for human-human motion generation that improves upon existing methods by introducing Causal Interactive Injection, Role-Evolving Scanning, and Localized Pattern Amplification to better model temporal dynamics and interactive roles, thereby achieving superior performance on benchmark datasets.

Yabiao Wang, Shuo Wang, Jiangning Zhang, Ke Fan, Jiafu Wu, Zhucun Xue, Yong Liu

Leveraging whole slide difficulty in Multiple Instance Learning to improve prostate cancer grading

This paper introduces the concept of Whole Slide Difficulty (WSD), derived from diagnostic disagreements between expert and non-expert pathologists, and demonstrates that leveraging this metric through multi-task learning or weighted loss functions significantly improves the accuracy of prostate cancer Gleason grading in Multiple Instance Learning models, particularly for higher-grade cases.

Marie Arrivat, Rémy Peyret, Elsa Angelini, Pietro Gori
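One of the two strategies mentioned above, a WSD-weighted loss, can be illustrated generically: scale each slide's loss by its difficulty so hard slides contribute more gradient. The paper's exact formulation is not given here; the function and `alpha` parameter below are illustrative assumptions, shown as a minimal NumPy sketch of difficulty-weighted cross-entropy.

```python
import numpy as np

def wsd_weighted_ce(logits: np.ndarray, labels: np.ndarray,
                    difficulty: np.ndarray, alpha: float = 1.0) -> float:
    """Cross-entropy per slide, scaled by a per-slide difficulty score in [0, 1]."""
    # numerically stable log-softmax
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels]
    # harder slides (larger WSD) get proportionally larger weight
    weights = 1.0 + alpha * difficulty
    return float((weights * ce).mean())

# With uniform logits, per-slide CE is ln(2); zero difficulty leaves it unweighted
logits = np.zeros((2, 2))
labels = np.array([0, 1])
print(wsd_weighted_ce(logits, labels, np.zeros(2)))  # ~0.6931 (= ln 2)
```

The alternative the summary mentions, multi-task learning, would instead predict WSD as an auxiliary output alongside the Gleason grade.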

Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction

This paper proposes an interpretable text-motion retrieval framework that represents 3D human motion as joint-angle pseudo-images processed by Vision Transformers and aligns them with text via a token-wise late interaction mechanism, thereby overcoming the limitations of global-embedding methods by capturing fine-grained correspondences and improving retrieval accuracy.

Yao Zhang, Zhuchenyang Liu, Yanlan He, Thomas Ploetz, Yu Xiao
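Token-wise late interaction, as described in the entry above, is commonly instantiated as ColBERT-style MaxSim: each text token is matched against its most similar motion patch, and the per-token maxima are summed. The paper's exact mechanism may differ; this is a minimal NumPy sketch of the generic technique.

```python
import numpy as np

def late_interaction_score(text_tokens: np.ndarray, motion_patches: np.ndarray) -> float:
    """MaxSim late interaction: sum over text tokens of the max cosine
    similarity to any motion-patch embedding."""
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
    m = motion_patches / np.linalg.norm(motion_patches, axis=1, keepdims=True)
    sim = t @ m.T                     # (num_text_tokens, num_patches)
    return float(sim.max(axis=1).sum())  # best-matching patch per token, summed

# Each text token finds a perfectly aligned patch, so the score is 2.0
text = np.array([[1.0, 0.0], [0.0, 1.0]])
patches = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(late_interaction_score(text, patches))  # 2.0
```

Unlike a single global embedding, this keeps the token-patch similarity matrix, which is what makes the retrieval interpretable at the level of individual correspondences.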