cs.CV papers | Gist.Science

Learning From Design Procedure To Generate CAD Programs for Data Augmentation

This paper proposes a novel data augmentation paradigm that leverages Large Language Models to generate diverse, industry-resembling CAD programs by conditioning them on reference surfaces and modeling procedures, thereby addressing the scarcity of complex, spline-based geometric data in existing training sets.

Yan-Ying Chen, Dule Shu, Matthew Hong, Andrew Taber, Jonathan Li, Matthew Klenk | Tue, 10 Ma | 🤖 cs.LG

PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection

PaQ-DETR is a unified object detection framework that addresses query utilization imbalance by dynamically generating image-specific queries from shared latent patterns and employing a quality-aware one-to-many assignment strategy, resulting in consistent mAP improvements across various DETR backbones.

Zhengjian Kang, Jun Zhuang, Kangtong Mo, Qi Chen, Rui Liu, Ye Zhang | Tue, 10 Ma | 💻 cs

DLRMamba: Distilling Low-Rank Mamba for Edge Multispectral Fusion Object Detection

The paper proposes DLRMamba, a novel framework for edge-based multispectral object detection that combines a Low-Rank SS2D module to reduce parameter redundancy with a Structure-Aware Distillation strategy to preserve feature fidelity, achieving superior efficiency and accuracy on resource-constrained hardware.

Qianqian Zhang, Leon Tabaro, Ahmed M. Abdelmoniem, Junshe An | Tue, 10 Ma | 💻 cs

Small Target Detection Based on Mask-Enhanced Attention Fusion of Visible and Infrared Remote Sensing Images

This paper introduces ESM-YOLO+, a lightweight visible-infrared fusion network that employs a Mask-Enhanced Attention Fusion module and training-time Structural Representation enhancement to achieve high-precision small-target detection in complex remote sensing scenes while significantly reducing model complexity compared to baselines.

Qianqian Zhang, Xiaolong Jia, Ahmed M. Abdelmoniem, Li Zhou, Junshe An | Tue, 10 Ma | 💻 cs

HIERAMP: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation

This paper proposes HIERAMP, a method that leverages the coarse-to-fine generation capability of Vision Autoregressive (VAR) models to amplify hierarchical semantics through dynamic class token injection, thereby improving dataset distillation performance by better capturing object structures and details without explicitly optimizing global proximity.

Lin Zhao, Xinru Jiang, Xi Xiao, Qihui Fan, Lei Lu, Yanzhi Wang, Xue Lin, Octavia Camps, Pu Zhao, Jianyang Gu | Tue, 10 Ma | 💻 cs

Extracting and analyzing 3D histomorphometric features related to perineural and lymphovascular invasion in prostate cancer

This study presents a 3D histomorphometric analysis pipeline using nnU-Net segmentation on optically cleared prostatectomy specimens to extract features related to perineural and lymphovascular invasion, demonstrating that 3D perineural invasion features significantly outperform their 2D counterparts in predicting 5-year biochemical recurrence in prostate cancer.

Sarah S. L. Chow, Rui Wang, Robert B. Serafin, Yujie Zhao, Elena Baraznenok, Xavier Farré, Jennifer Salguero-Lopez, Gan Gao, Huai-Ching Hsieh, Lawrence D. True, Priti Lal, Anant Madabhushi, Jonathan T. C. Liu | Tue, 10 Ma | 💻 cs

Virtual Intraoperative CT (viCT): Sequential Anatomic Updates for Modeling Tissue Resection Throughout Endoscopic Sinus Surgery

This paper introduces Virtual Intraoperative CT (viCT), a method that sequentially updates preoperative CT scans during endoscopic sinus surgery by integrating monocular endoscopic video-derived 3D reconstructions to visualize evolving tissue resection boundaries with submillimeter accuracy, thereby addressing the limitations of static image guidance.

Nicole M. Gunderson, Graham J. Harris, Jeremy S. Ruthberg, Pengcheng Chen, Di Mao, Randall A. Bly, Waleed M. Abuzeid, Eric J. Seibel | Tue, 10 Ma | 💻 cs

SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation

SurgCUT3R is a novel framework that addresses the challenges of data scarcity and pose drift in monocular endoscopic video reconstruction by leveraging a synthetic data generation pipeline, hybrid supervision, and a hierarchical inference strategy to achieve robust, accurate, and efficient 3D surgical scene understanding.

Kaiyuan Xu, Fangzhou Hong, Daniel Elson, Baoru Huang | Tue, 10 Ma | 💻 cs

Conditional Unbalanced Optimal Transport Maps: An Outlier-Robust Framework for Conditional Generative Modeling

This paper introduces Conditional Unbalanced Optimal Transport Maps (CUOTM), a robust conditional generative framework that mitigates the outlier sensitivity of classical Conditional Optimal Transport by relaxing distribution-matching constraints via Csiszár divergence penalties while preserving conditioning marginals through a theoretically justified triangular $c$-transform parameterization.

Jiwoo Yoon, Kyumin Choi, Jaewoong Choi | Tue, 10 Ma | 🤖 cs.LG

T2SGrid: Temporal-to-Spatial Gridification for Video Temporal Grounding

The paper proposes T2SGrid, a novel framework that reformulates video temporal grounding as a spatial understanding task by arranging video frames into composite grid images via overlapping sliding windows, thereby overcoming the limitations of existing temporal encoding methods and achieving superior performance on standard benchmarks.

Chaohong Guo, Yihan He, Yongwei Nie, Fei Ma, Xuemiao Xu, Chengjiang Long | Tue, 10 Ma | 💻 cs

Optimizing Multi-Modal Models for Image-Based Shape Retrieval: The Role of Pre-Alignment and Hard Contrastive Learning

This paper proposes a novel approach to image-based shape retrieval that leverages pre-aligned multi-modal encoders and a hard contrastive learning loss to achieve state-of-the-art performance in both zero-shot and supervised settings, eliminating the need for explicit view-based supervision or view synthesis.

Paul Julius Kühn, Cedric Spengler, Michael Weinmann, Arjan Kuijper, Saptarshi Neil Sinha | Tue, 10 Ma | 💻 cs

Perception-Aware Multimodal Spatial Reasoning from Monocular Images

This paper proposes a perception-aware multimodal reasoning framework that enhances Vision-Language Models' spatial understanding in monocular driving scenarios by representing objects with Visual Reference Tokens and utilizing a Multimodal Chain-of-Thought dataset, achieving significant performance gains on the SURDS benchmark through standard supervised fine-tuning.

Yanchun Cheng, Rundong Wang, Xulei Yang, Alok Prakash, Daniela Rus, Marcelo H Ang Jr, ShiJie Li | Tue, 10 Ma | 💻 cs

ADAS-TO: A Large-Scale Multimodal Naturalistic Dataset and Empirical Characterization of Human Takeovers during ADAS Engagement

This paper introduces ADAS-TO, the first large-scale naturalistic multimodal dataset of 15,659 ADAS-to-manual takeover events from 327 drivers, which combines kinematic and vision-language analysis to characterize safety-critical scenarios and demonstrate that actionable visual cues often precede takeovers by over three seconds.

Yuhang Wang, Yiyao Xu, Jingran Sun, Hao Zhou | Tue, 10 Ma | 💻 cs

MipSLAM: Alias-Free Gaussian Splatting SLAM

MipSLAM is a novel 3D Gaussian Splatting SLAM framework that achieves high-fidelity anti-aliased rendering and robust pose estimation by integrating an Elliptical Adaptive Anti-aliasing algorithm, a Spectral-Aware Pose Graph Optimization module, and a local frequency-domain perceptual loss to overcome aliasing artifacts and trajectory drift.

Yingzhao Li, Yan Li, Shixiong Tian, Yanjie Liu, Lijun Zhao, Gim Hee Lee | Tue, 10 Ma | 💻 cs

AdaGen: Learning Adaptive Policy for Image Synthesis

AdaGen introduces a general, learnable framework that employs reinforcement learning with an adversarial reward to dynamically adapt step-specific parameters during iterative image synthesis, thereby overcoming the limitations of static, manually-designed schedules and achieving superior performance across diverse generative models with reduced inference costs.

Zanlin Ni, Yulin Wang, Yeguo Hua, Renping Zhou, Jiayi Guo, Jun Song, Bo Zheng, Gao Huang | Tue, 10 Ma | 💻 cs

TrajPred: Trajectory-Conditioned Joint Embedding Prediction for Surgical Instrument-Tissue Interaction Recognition in Vision-Language Models

TrajPred is a novel framework that enhances surgical instrument-tissue interaction recognition in vision-language models by encoding instrument trajectories to capture temporal motion cues and generating fine-grained visual semantic embeddings, thereby significantly improving performance and vision-text alignment on the CholecT50 benchmark.

Jiajun Cheng, Xiaofan Yu, Subarna, Sainan Liu, Shan Lin | Tue, 10 Ma | 💻 cs

OV-DEIM: Real-time DETR-Style Open-Vocabulary Object Detection with GridSynthetic Augmentation

This paper presents OV-DEIM, a real-time end-to-end DETR-style open-vocabulary object detector that combines the DEIMv2 framework with a query supplement strategy and a novel GridSynthetic data augmentation technique to achieve state-of-the-art performance and efficiency, particularly for rare categories.

Leilei Wang, Longfei Liu, Xi Shen, Xuanlong Yu, Ying Tiffany He, Fei Richard Yu, Yingyi Chen | Tue, 10 Ma | 💻 cs

Two Frames Matter: A Temporal Attack for Text-to-Video Model Jailbreaking

This paper introduces TFM, a temporal attack framework that exploits the vulnerability of text-to-video models to generate harmful content by providing only sparse boundary conditions (start and end frames) and implicitly substituting sensitive cues, thereby bypassing existing safety filters and significantly increasing jailbreak success rates.

Moyang Chen, Zonghao Ying, Wenzhuo Xu, Quancheng Zou, Deyue Zhang, Dongdong Yang, Xiangzheng Zhang | Tue, 10 Ma | 💻 cs

Fine-Grained 3D Facial Reconstruction for Micro-Expressions

This paper proposes a novel fine-grained 3D facial reconstruction method for micro-expressions that integrates global dynamic features with locally-enriched cues from 2D motions, facial priors, and 3D geometry to overcome data scarcity and achieve superior geometric accuracy and perceptual detail compared to state-of-the-art approaches.

Che Sun, Xinjie Zhang, Rui Gao, Xu Chen, Yuwei Wu, Yunde Jia | Tue, 10 Ma | 💻 cs

Looking Back and Forth: Cross-Image Attention Calibration and Attentive Preference Learning for Multi-Image Hallucination Mitigation

This paper proposes CAPL, a framework that mitigates multi-image hallucinations in large vision-language models by introducing a selectable image token interaction mechanism for fine-grained cross-image alignment and a preference learning strategy that trains the model to rely on genuine visual evidence rather than textual priors.

Xiaochen Yang, Hao Fang, Jiawei Kong, Yaoxin Mao, Bin Chen, Shu-Tao Xia | Tue, 10 Ma | 💻 cs
