cs.CV papers | Gist.Science

Implicit U-KAN2.0: Dynamic, Efficient and Interpretable Medical Image Segmentation

This paper introduces Implicit U-KAN 2.0, a novel medical image segmentation model that combines second-order neural ordinary differential equations (SONO) with MultiKAN layers in a two-phase encoder-decoder architecture to achieve superior performance, enhanced interpretability, and dimension-independent approximation capabilities while reducing computational costs.

Chun-Wun Cheng, Yining Zhao, Yanqi Cheng + 3 more | 2026-03-05 | 🤖 cs.LG

Beyond Accuracy: What Matters in Designing Well-Behaved Image Classification Models?

This paper presents a large-scale analysis of 326 image classification models across nine quality dimensions beyond accuracy, revealing that vision-language models, self-supervised initialization, and dataset size significantly influence model behavior, and introduces the QUBA score to holistically rank and recommend models based on specific user needs.

Robin Hesse, Doğukan Bağcı, Bernt Schiele + 2 more | 2026-03-05 | 🤖 cs.LG

Beyond the Encoder: Joint Encoder-Decoder Contrastive Pre-Training Improves Dense Prediction

This paper introduces DeCon, a self-supervised learning framework that jointly pre-trains encoders and decoders using a weighted contrastive loss, achieving state-of-the-art performance across various dense prediction tasks and datasets by significantly enhancing representation quality compared to conventional encoder-only approaches.

Sébastien Quetin, Tapotosh Ghosh, Farhad Maleki | 2026-03-05 | 💻 cs

Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy

This paper presents a unified physics-based framework that leverages Vision-Language Models and a novel VLM-Guided Relative Movement Dynamics (RMD) representation to automatically generate reward functions for synthesizing scalable, long-horizon human-object interactions across diverse object types without manual reward engineering.

Zekai Deng, Ye Shi, Kaiyang Ji + 3 more | 2026-03-05 | 💻 cs

Generating Fine Details of Entity Interactions

This paper introduces \data, a dataset of fine-grained interaction prompts, and proposes \model, a novel framework leveraging Multimodal Large Language Models for prompt decomposition, image critique, and targeted refinement to significantly enhance the generation of complex object interactions in text-to-image synthesis.

Xinyi Gu, Jiayuan Mao | 2026-03-05 | 🤖 cs.LG

When Memory Becomes a Vulnerability: Towards Multi-turn Jailbreak Attacks against Text-to-Image Generation Systems

This paper introduces Inception, the first multi-turn jailbreak attack that exploits the memory mechanisms of text-to-image systems by embedding malicious intent across segmented and recursively expanded conversational turns, achieving a 20% higher success rate than state-of-the-art methods in bypassing safety filters.

Shiqian Zhao, Jiayang Liu, Yiming Li + 9 more | 2026-03-05 | 💻 cs

Intelligent Diagnosis Using Dual-Branch Attention Network for Rare Thyroid Carcinoma Recognition with Ultrasound Imaging

This paper proposes the Channel-Spatial Attention Synergy Network (CSASN), a novel multitask learning framework that integrates dual-branch EfficientNet and ViT architectures with attention mechanisms to effectively address data imbalance and morphological heterogeneity for the accurate diagnosis of rare thyroid carcinoma subtypes using ultrasound imaging.

Peiqi Li, Yincheng Gao, Renxing Li + 10 more | 2026-03-05 | 💻 cs

Apple's Synthetic Defocus Noise Pattern: Characterization and Forensic Applications

This paper characterizes Apple's Synthetic Defocus Noise Pattern (SDNP) found in iPhone portrait-mode images, proposing a modeling method to mitigate its interference with PRNU-based camera source verification while demonstrating its utility for tracing images across different iPhone models and iOS versions.

David Vázquez-Padín, Fernando Pérez-González, Pablo Pérez-Miguélez | 2026-03-05 | 💻 cs

Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

This paper introduces Multi-Objective Balanced Covering (MoB), a novel visual token pruning framework that leverages Hausdorff distance and $\epsilon$-covering theory to derive a closed-form error bound and dynamically balance prompt alignment with visual preservation, achieving significant inference acceleration with minimal performance loss across diverse multimodal models.

Yangfu Li, Hongjian Zhan, Tianyi Chen + 2 more | 2026-03-05 | 💬 cs.CL

From Press to Pixels: Evolving Urdu Text Recognition

This paper introduces the Urdu Newspaper Benchmark (UNB) dataset and a novel pipeline combining YOLOv11x for layout analysis, SwinIR for super-resolution, and fine-tuned LLMs to demonstrate that modern language models significantly outperform traditional OCR systems in recognizing complex, low-quality Urdu newspaper text.

Samee Arif, Sualeha Farid | 2026-03-05 | 💻 cs

Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation

This paper proposes "Feature Mixing," an extremely simple and fast modality-agnostic method for multimodal outlier synthesis that achieves state-of-the-art performance in out-of-distribution detection and segmentation while offering significant speedups, alongside the introduction of a new multimodal dataset called CARLA-OOD.

Moru Liu, Hao Dong, Jessica Kelly + 2 more | 2026-03-05 | 🤖 cs.AI

BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behavioural Change

This paper introduces the BAH dataset, a multimodal collection of 1,427 videos from 300 participants annotated for ambivalence and hesitancy recognition, alongside baseline benchmarking results that highlight the need for advanced models to support personalized digital health interventions.

Manuela González-González, Soufiane Belharbi, Muhammad Osama Zeeshan + 6 more | 2026-03-05 | 🤖 cs.LG

Do We Need All the Synthetic Data? Targeted Image Augmentation via Diffusion Models

This paper introduces TADA, a targeted diffusion-based augmentation framework that selectively generates synthetic images for hard-to-learn examples to improve classifier generalization with significantly reduced computational overhead compared to full-dataset augmentation.

Dang Nguyen, Jiping Li, Jinghao Zheng + 1 more | 2026-03-05 | 🤖 cs.LG

Structural Vibration Monitoring with Diffractive Optical Processors

This paper presents a low-power, cost-effective diffractive optical system that integrates a passive diffractive layer with a shallow neural network to remotely and accurately reconstruct 3D structural vibration spectra, overcoming the scalability and complexity limitations of traditional Structural Health Monitoring solutions.

Yuntian Wang, Zafer Yilmaz, Yuhang Li + 5 more | 2026-03-05 | 🔬 physics.optics

EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations

EgoWorld is a novel framework that reconstructs semantically coherent egocentric views from rich exocentric observations—including point clouds, 3D hand poses, and text—by leveraging depth estimation and diffusion models to overcome the limitations of existing 2D-based translation methods and achieve state-of-the-art performance across diverse datasets.

Junho Park, Andrew Sangwoo Ye, Taein Kwon | 2026-03-05 | 🤖 cs.AI

Partial Weakly-Supervised Oriented Object Detection

This paper proposes the Partial Weakly-Supervised Oriented Object Detection (PWOOD) framework, which leverages partially weak annotations and unlabeled data through an Orientation-and-Scale-aware Student model and a Class-Agnostic Pseudo-Label Filtering strategy to achieve performance comparable to semi-supervised methods while significantly reducing annotation costs.

Mingxin Liu, Peiyuan Zhang, Yuan Liu + 8 more | 2026-03-05 | 💻 cs

Fast Equivariant Imaging: Acceleration for Unsupervised Learning via Augmented Lagrangian and Auxiliary PnP Denoisers

This paper introduces Fast Equivariant Imaging (FEI), a novel unsupervised learning framework that leverages the Augmented Lagrangian method and auxiliary Plug-and-Play denoisers to achieve a 10x training acceleration and improved generalization for deep imaging tasks like X-ray CT reconstruction and inpainting without requiring ground-truth data.

Guixian Xu, Jinglai Li, Junqi Tang | 2026-03-05 | 🤖 cs.LG

D2Dewarp: Dual Dimensions Geometric Representation Learning Based Document Image Dewarping

This paper proposes D2Dewarp, a fine-grained document image dewarping model that leverages dual-dimensional horizontal-vertical geometric representation learning and a new large-scale dataset (DocDewarpHV) to achieve superior rectification results compared to state-of-the-art methods.

Heng Li, Xiangping Wu, Qingcai Chen | 2026-03-05 | 💻 cs

VITA: Vision-to-Action Flow Matching Policy

VITA is a novel, noise-free, and conditioning-free flow matching framework that accelerates inference by directly mapping visual representations to structured latent actions via a jointly trained autoencoder and flow latent decoding, achieving state-of-the-art performance on diverse robotic tasks.

Dechen Gao, Boqi Zhao, Andrew Lee + 6 more | 2026-03-05 | 🤖 cs.AI

Classification of Histopathology Slides with Persistent Homology Convolutions

This paper introduces Persistent Homology Convolutions, a novel method that captures local topological features in histopathology slides, demonstrating that this approach outperforms standard CNNs in classification accuracy and hyperparameter robustness by effectively integrating geometric information into deep learning models.

Shrunal Pothagoni, Benjamin Schweinhart | 2026-03-05 | 💻 cs
