cs.CV papers | Gist.Science

MotionBits: Video Segmentation through Motion-Level Analysis of Rigid Bodies

This paper introduces MotionBits, a novel concept and learning-free segmentation method that identifies the smallest manipulable rigid bodies through kinematic spatial twist equivalence, outperforming state-of-the-art embodied perception models on the new MoRiBo benchmark and enabling more effective downstream robotic manipulation and reasoning tasks.

Howard H. Qian, Kejia Ren, Yu Xiang, Vicente Ordonez, Kaiyu Hang · 2026-03-10 · cs

Active View Selection with Perturbed Gaussian Ensemble for Tomographic Reconstruction

This paper introduces Perturbed Gaussian Ensemble, an active view selection framework for sparse-view CT that leverages stochastic density scaling of uncertain Gaussian primitives to identify high-variance projections, thereby significantly improving reconstruction fidelity and reducing geometric artifacts compared to existing methods.

Yulun Wu, Ruyi Zha, Wei Cao, Yingying Li, Yuanhao Cai, Yaoyao Liu · 2026-03-10 · cs

An Extended Topological Model For High-Contrast Optical Flow

This paper introduces an extended 3-manifold topological model for high-contrast optical flow that resolves the limitations of previous torus-based approaches by identifying that the most significant motion patches are concentrated near binary step-edge circles rather than the torus, thereby offering new insights into the topological and geometric structures underlying visual data inference.

Brad Turow, Jose A. Perea · 2026-03-10 · math

ColonSplat: Reconstruction of Peristaltic Motion in Colonoscopy with Dynamic Gaussian Splatting

This paper introduces ColonSplat, a dynamic Gaussian Splatting framework that achieves superior 3D reconstruction of peristaltic colon motion by preserving global geometric consistency, supported by a new synthetic benchmark dataset called DynamicColon and a critical analysis of existing methods' limitations.

Weronika Smolak-Dyżewska, Joanna Kaleta, Diego Dall'Alba, Przemysław Spurek · 2026-03-10 · cs

IGLU: The Integrated Gaussian Linear Unit Activation Function

This paper introduces IGLU, a novel parametric activation function derived from a scale mixture of GELU gates that utilizes a Cauchy CDF to provide heavy-tailed gradient properties and robustness against vanishing gradients, alongside a computationally efficient rational approximation (IGLU-Approx) that achieves competitive or superior performance across vision and language tasks compared to standard baselines like ReLU and GELU.

Mingi Kang, Zai Yang, Jeova Farias Sales Rocha Neto · 2026-03-10 · cs.LG
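To make the "Cauchy CDF gate" idea in the IGLU summary concrete, here is a minimal sketch. It assumes the gate is simply the Cauchy CDF applied to the input, so the unit is `x * F(x)`; the summary does not state the paper's exact formula, parameterization, or rational approximation, and the names `cauchy_gated_unit` and `sigma` are hypothetical. The comparison with GELU only illustrates why a heavy-tailed gate leaves a larger gradient for strongly negative inputs.

```python
import math


def cauchy_cdf(x: float, sigma: float = 1.0) -> float:
    """CDF of a zero-centered Cauchy distribution with scale sigma."""
    return 0.5 + math.atan(x / sigma) / math.pi


def cauchy_gated_unit(x: float, sigma: float = 1.0) -> float:
    """Assumed form (not confirmed by the summary): input times a Cauchy-CDF gate."""
    return x * cauchy_cdf(x, sigma)


def gelu(x: float) -> float:
    """Exact GELU: input times the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def numeric_grad(f, x: float, h: float = 1e-5) -> float:
    """Central-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


if __name__ == "__main__":
    # Compare gradient magnitudes as inputs become more negative.
    for x in (-2.0, -4.0, -6.0):
        g_cauchy = abs(numeric_grad(cauchy_gated_unit, x))
        g_gelu = abs(numeric_grad(gelu, x))
        print(f"x={x:5.1f}  |grad| Cauchy gate={g_cauchy:.3e}  GELU={g_gelu:.3e}")
```

The Gaussian gate in GELU decays exponentially in the negative tail, while the Cauchy gate decays only polynomially, which is the "heavy-tailed gradient" property the summary refers to.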

A prior information informed learning architecture for flying trajectory prediction

This paper proposes a hardware-efficient trajectory prediction framework that integrates environmental priors with a Dual-Transformer-Cascaded (DTC) architecture to accurately predict the landing points of flying objects, such as tennis balls, and outperforms existing methods in complex real-world scenarios.

Xianda Huang, Zidong Han, Ruibo Jin, Zhenyu Wang, Wenyu Li, Xiaoyang Li, Yi Gong · 2026-03-10 · cs

PICS: Pairwise Image Compositing with Spatial Interactions

The paper introduces PICS, a self-supervised framework that improves pairwise image compositing by employing an Interaction Transformer with mask-guided Mixture-of-Experts and adaptive blending to explicitly model spatial interactions and preserve physical consistency between objects and backgrounds.

Hang Zhou, Xinxin Zuo, Sen Wang, Li Cheng · 2026-03-10 · cs

OPTED: Open Preprocessed Trachoma Eye Dataset Using Zero-Shot SAM 3 Segmentation

This paper introduces OPTED, an open-source preprocessed trachoma eye dataset derived from 2,832 images using a zero-shot SAM 3 pipeline to automatically extract and standardize regions of interest, thereby addressing the scarcity of high-quality data for automated trachoma classification in Sub-Saharan Africa.

Kibrom Gebremedhin, Hadush Hailu, Bruk Gebregziabher · 2026-03-10 · cs

Learning From Design Procedure To Generate CAD Programs for Data Augmentation

This paper proposes a novel data augmentation paradigm that leverages Large Language Models to generate diverse, industry-resembling CAD programs by conditioning them on reference surfaces and modeling procedures, thereby addressing the scarcity of complex, spline-based geometric data in existing training sets.

Yan-Ying Chen, Dule Shu, Matthew Hong, Andrew Taber, Jonathan Li, Matthew Klenk · 2026-03-10 · cs.LG

PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection

PaQ-DETR is a unified object detection framework that addresses query utilization imbalance by dynamically generating image-specific queries from shared latent patterns and employing a quality-aware one-to-many assignment strategy, resulting in consistent mAP improvements across various DETR backbones.

Zhengjian Kang, Jun Zhuang, Kangtong Mo, Qi Chen, Rui Liu, Ye Zhang · 2026-03-10 · cs

DLRMamba: Distilling Low-Rank Mamba for Edge Multispectral Fusion Object Detection

The paper proposes DLRMamba, a novel framework for edge-based multispectral object detection that combines a Low-Rank SS2D module to reduce parameter redundancy with a Structure-Aware Distillation strategy to preserve feature fidelity, achieving superior efficiency and accuracy on resource-constrained hardware.

Qianqian Zhang, Leon Tabaro, Ahmed M. Abdelmoniem, Junshe An · 2026-03-10 · cs

Small Target Detection Based on Mask-Enhanced Attention Fusion of Visible and Infrared Remote Sensing Images

This paper introduces ESM-YOLO+, a lightweight visible-infrared fusion network that employs a Mask-Enhanced Attention Fusion module and training-time Structural Representation enhancement to achieve high-precision small-target detection in complex remote sensing scenes while significantly reducing model complexity compared to baselines.

Qianqian Zhang, Xiaolong Jia, Ahmed M. Abdelmoniem, Li Zhou, Junshe An · 2026-03-10 · cs

HIERAMP: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation

This paper proposes HIERAMP, a method that leverages the coarse-to-fine generation capability of Vision Autoregressive (VAR) models to amplify hierarchical semantics through dynamic class token injection, thereby improving dataset distillation performance by better capturing object structures and details without explicitly optimizing global proximity.

Lin Zhao, Xinru Jiang, Xi Xiao, Qihui Fan, Lei Lu, Yanzhi Wang, Xue Lin, Octavia Camps, Pu Zhao, Jianyang Gu · 2026-03-10 · cs

Extracting and analyzing 3D histomorphometric features related to perineural and lymphovascular invasion in prostate cancer

This study presents a 3D histomorphometric analysis pipeline using nnU-Net segmentation on optically cleared prostatectomy specimens to extract features related to perineural and lymphovascular invasion, demonstrating that 3D perineural invasion features significantly outperform their 2D counterparts in predicting 5-year biochemical recurrence in prostate cancer.

Sarah S. L. Chow, Rui Wang, Robert B. Serafin, Yujie Zhao, Elena Baraznenok, Xavier Farré, Jennifer Salguero-Lopez, Gan Gao, Huai-Ching Hsieh, Lawrence D. True, Priti Lal, Anant Madabhushi, Jonathan T. C. Liu · 2026-03-10 · cs

Virtual Intraoperative CT (viCT): Sequential Anatomic Updates for Modeling Tissue Resection Throughout Endoscopic Sinus Surgery

This paper introduces Virtual Intraoperative CT (viCT), a method that sequentially updates preoperative CT scans during endoscopic sinus surgery by integrating monocular endoscopic video-derived 3D reconstructions to visualize evolving tissue resection boundaries with submillimeter accuracy, thereby addressing the limitations of static image guidance.

Nicole M. Gunderson, Graham J. Harris, Jeremy S. Ruthberg, Pengcheng Chen, Di Mao, Randall A. Bly, Waleed M. Abuzeid, Eric J. Seibel · 2026-03-10 · cs

SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation

SurgCUT3R is a novel framework that addresses the challenges of data scarcity and pose drift in monocular endoscopic video reconstruction by leveraging a synthetic data generation pipeline, hybrid supervision, and a hierarchical inference strategy to achieve robust, accurate, and efficient 3D surgical scene understanding.

Kaiyuan Xu, Fangzhou Hong, Daniel Elson, Baoru Huang · 2026-03-10 · cs

Conditional Unbalanced Optimal Transport Maps: An Outlier-Robust Framework for Conditional Generative Modeling

This paper introduces Conditional Unbalanced Optimal Transport Maps (CUOTM), a robust conditional generative framework that mitigates the outlier sensitivity of classical Conditional Optimal Transport by relaxing distribution-matching constraints via Csiszár divergence penalties while preserving conditioning marginals through a theoretically justified triangular $c$-transform parameterization.

Jiwoo Yoon, Kyumin Choi, Jaewoong Choi · 2026-03-10 · cs.LG

T2SGrid: Temporal-to-Spatial Gridification for Video Temporal Grounding

The paper proposes T2SGrid, a novel framework that reformulates video temporal grounding as a spatial understanding task by arranging video frames into composite grid images via overlapping sliding windows, thereby overcoming the limitations of existing temporal encoding methods and achieving superior performance on standard benchmarks.

Chaohong Guo, Yihan He, Yongwei Nie, Fei Ma, Xuemiao Xu, Chengjiang Long · 2026-03-10 · cs
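The gridification step described in the T2SGrid summary can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: frames are stood in for by their indices, each window is laid out row-major into a `rows x cols` grid, and the function names and the `stride` parameter are hypothetical. A real pipeline would tile pixel arrays into one composite image per window.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(frames: Sequence[T], window: int, stride: int) -> List[List[T]]:
    """Split frames into fixed-size windows; windows overlap when stride < window."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(frames) < window:
        # Too few frames for a full window: return what exists as one short window.
        return [list(frames)] if frames else []
    # Trailing frames that do not fill a complete window are left out in this sketch.
    last_start = len(frames) - window
    return [list(frames[s:s + window]) for s in range(0, last_start + 1, stride)]


def to_grid(window_frames: Sequence[T], cols: int) -> List[List[T]]:
    """Lay out one window row-major so temporal order reads left-to-right, top-to-bottom."""
    return [list(window_frames[i:i + cols]) for i in range(0, len(window_frames), cols)]


def gridify(frames: Sequence[T], rows: int, cols: int, stride: int) -> List[List[List[T]]]:
    """Turn a frame sequence into a list of rows x cols grids, one per window."""
    return [to_grid(w, cols) for w in sliding_windows(frames, rows * cols, stride)]


if __name__ == "__main__":
    # Eight frame indices, 2x2 grids, stride 2: consecutive grids share two frames.
    for grid in gridify(list(range(8)), rows=2, cols=2, stride=2):
        print(grid)
```

Choosing a stride smaller than the window size is what makes the windows overlap, so an event that straddles a window boundary still appears whole in at least one grid.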

Optimizing Multi-Modal Models for Image-Based Shape Retrieval: The Role of Pre-Alignment and Hard Contrastive Learning

This paper proposes a novel approach to image-based shape retrieval that leverages pre-aligned multi-modal encoders and a hard contrastive learning loss to achieve state-of-the-art performance in both zero-shot and supervised settings, eliminating the need for explicit view-based supervision or view synthesis.

Paul Julius Kühn, Cedric Spengler, Michael Weinmann, Arjan Kuijper, Saptarshi Neil Sinha · 2026-03-10 · cs

Perception-Aware Multimodal Spatial Reasoning from Monocular Images

This paper proposes a perception-aware multimodal reasoning framework that enhances Vision-Language Models' spatial understanding in monocular driving scenarios by representing objects with Visual Reference Tokens and utilizing a Multimodal Chain-of-Thought dataset, achieving significant performance gains on the SURDS benchmark through standard supervised fine-tuning.

Yanchun Cheng, Rundong Wang, Xulei Yang, Alok Prakash, Daniela Rus, Marcelo H Ang Jr, ShiJie Li · 2026-03-10 · cs
