cs.CV papers | Gist.Science

Yolo-Key-6D: Single Stage Monocular 6D Pose Estimation with Keypoint Enhancements

Yolo-Key-6D is a novel single-stage, end-to-end framework that achieves real-time monocular 6D pose estimation with competitive accuracy by integrating a keypoint-based auxiliary head for enhanced 3D geometry understanding and utilizing a continuous 9D rotation representation for stable training.

Kemal Alperen Çetiner, Hazım Kemal Ekenel | 2026-03-05 | 💻 cs
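
The summary above mentions a continuous 9D rotation representation. The listing does not say how the nine values become a rotation, so the following is only a sketch under a common assumption: the network emits nine unconstrained numbers, which are reshaped to a 3x3 matrix and projected onto the nearest rotation matrix with an SVD. The function name `rotation_from_9d` is hypothetical and is not taken from the paper.

```python
import numpy as np


def rotation_from_9d(raw9):
    """Project 9 unconstrained values onto the closest rotation matrix.

    Assumption: SVD-based orthogonalization; the paper's exact mapping
    is not described in this listing.
    """
    m = np.asarray(raw9, dtype=float).reshape(3, 3)
    u, _, vt = np.linalg.svd(m)
    # u @ vt is orthogonal with determinant +1 or -1; flip the last
    # singular direction when needed so the result is a proper rotation.
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r = rotation_from_9d(rng.normal(size=9))
    print(np.round(r @ r.T, 6))  # approximately the identity matrix
```

Because every 3x3 input maps to some valid rotation and nearby inputs map to nearby rotations (away from degenerate matrices), a regression loss on the nine values avoids the discontinuities that Euler angles or quaternions introduce.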

UniSync: Towards Generalizable and High-Fidelity Lip Synchronization for Challenging Scenarios

The paper introduces UniSync, a unified lip synchronization framework that combines mask-free pose-anchored training with mask-based blending inference to achieve high-fidelity, generalizable results across diverse real-world scenarios, including stylized avatars and challenging lighting conditions, while also proposing the RealWorld-LipSync benchmark for evaluation.

Ruidi Fan, Yang Zhou, Siyuan Wang + 3 more | 2026-03-05 | 💻 cs

A novel network for classification of cuneiform tablet metadata

This paper introduces a novel convolution-inspired network that effectively classifies cuneiform tablet metadata by integrating local and global information from high-resolution point clouds, outperforming the state-of-the-art Point-BERT model while addressing challenges posed by limited annotated datasets.

Frederik Hagelskjær | 2026-03-05 | 🤖 cs.AI

From Misclassifications to Outliers: Joint Reliability Assessment in Classification

This paper proposes a unified evaluation framework with new metrics (DS-F1 and DS-AURC) and an improved method (SURE+) to jointly assess and enhance classifier reliability by integrating out-of-distribution detection and in-distribution failure prediction, demonstrating that double scoring functions significantly outperform traditional single scoring approaches.

Yang Li, Youyang Sha, Yinzhi Wang + 4 more | 2026-03-05 | 🤖 cs.LG
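
For context on the metric family named above: the listing does not define DS-AURC, so the sketch below shows only the conventional single-score AURC (area under the risk-coverage curve) that such a metric presumably extends. It is not the paper's double-scoring variant, and the function name `aurc` is hypothetical.

```python
import numpy as np


def aurc(confidence, correct):
    """Conventional single-score AURC (lower is better).

    Samples are ranked by confidence; at each coverage level k/n the
    risk is the error rate among the k most confident predictions.
    """
    confidence = np.asarray(confidence, dtype=float)
    errors = 1.0 - np.asarray(correct, dtype=float)
    # Most confident predictions first.
    order = np.argsort(-confidence, kind="stable")
    cumulative_errors = np.cumsum(errors[order])
    kept = np.arange(1, len(errors) + 1)
    risks = cumulative_errors / kept
    # Average the risk over all coverage levels.
    return float(risks.mean())


if __name__ == "__main__":
    print(aurc([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1]))
```

A classifier whose mistakes are concentrated among its least confident predictions gets a lower AURC than one with the same accuracy but poorly ranked confidences.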

Architecture and evaluation protocol for transformer-based visual object tracking in UAV applications

This paper proposes a Modular Asynchronous Tracking Architecture (MATA) that integrates a transformer-based tracker with an Extended Kalman Filter and ego-motion compensation to address UAV tracking challenges, while introducing a hardware-independent evaluation protocol and a new Normalized Time to Failure (NT2F) metric to better quantify robustness and real-time performance on embedded systems.

Augustin Borne, Pierre Notin, Christophe Hennequin + 4 more | 2026-03-05 | 💻 cs

Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks

This paper introduces FGAesthetics, a large-scale fine-grained image aesthetic assessment database with pairwise comparison annotations, and proposes FGAesQ, a novel framework that leverages relative ranks through specialized tokenization and alignment techniques to achieve superior discriminative performance in both fine-grained and coarse-grained aesthetic evaluation scenarios.

Zhichao Yang, Jianjie Wang, Zhixianhe Zhang + 4 more | 2026-03-05 | 💻 cs

N-gram Injection into Transformers for Dynamic Language Model Adaptation in Handwritten Text Recognition

This paper proposes an N-gram Injection (NGI) method that dynamically adapts Transformer-based handwritten text recognition models to target language distributions at inference time by injecting external n-gram language models, thereby significantly reducing performance gaps caused by language shifts without requiring additional training on target data.

Florent Meyer, Laurent Guichard, Denis Coquenet + 3 more | 2026-03-05 | 💻 cs

DISC: Dense Integrated Semantic Context for Large-Scale Open-Set Semantic Mapping

The paper introduces DISC, a fully GPU-accelerated framework that utilizes a novel single-pass, distance-weighted mechanism to extract dense semantic context from CLIP embeddings, enabling efficient, real-time open-set semantic mapping that significantly outperforms existing state-of-the-art methods in accuracy and scalability.

Felix Igelbrink, Lennart Niecksch, Martin Atzmueller + 1 more | 2026-03-05 | 💻 cs

Cross-Modal Mapping and Dual-Branch Reconstruction for 2D-3D Multimodal Industrial Anomaly Detection

This paper presents CMDR-IAD, a lightweight unsupervised framework that achieves state-of-the-art industrial anomaly detection by combining bidirectional 2D-3D cross-modal mapping with dual-branch reconstruction to robustly handle noisy, weak-texture, or missing modalities without relying on memory banks.

Radia Daci, Vito Renò, Cosimo Patruno + 4 more | 2026-03-05 | 🤖 cs.AI

Slice-wise quality assessment of high b-value breast DWI via deep learning-based artifact detection

This study demonstrates that a DenseNet121-based deep learning model effectively detects hyper- and hypointense intensity artifacts on high b-value (1500 s/mm²) breast diffusion-weighted MRI slices, achieving high AUROC scores and providing promising results for automated quality assessment.

Ameya Markale, Luise Brock, Ihor Horishnyi + 10 more | 2026-03-05 | 💻 cs

Spatial Causal Prediction in Video

This paper introduces Spatial Causal Prediction (SCP), a new task paradigm and benchmark (SCP-Bench) designed to evaluate and improve video models' ability to infer unseen spatial states and causal outcomes beyond visible observations, revealing significant gaps between current AI and human intelligence in this domain.

Yanguang Zhao, Jie Yang, Shengqiong Wu + 9 more | 2026-03-05 | 💻 cs

RVN-Bench: A Benchmark for Reactive Visual Navigation

The paper introduces RVN-Bench, a new collision-aware benchmark built on Habitat 2.0 and HM3D scenes that enables the training and evaluation of safe, robust indoor visual navigation policies for mobile robots in unseen, cluttered environments.

Jaewon Lee, Jaeseok Heo, Gunmin Lee + 3 more | 2026-03-05 | 🤖 cs.AI

Towards Generalized Multimodal Homography Estimation

This paper proposes a training data synthesis method that generates diverse, unaligned image pairs from single inputs alongside a novel network architecture to enhance the robustness and generalization of multimodal homography estimation across unseen domains.

Jinkun You, Jiaxin Cheng, Jie Zhang + 1 more | 2026-03-05 | 🤖 cs.AI

Structural Action Transformer for 3D Dexterous Manipulation

This paper proposes the Structural Action Transformer (SAT), a novel 3D dexterous manipulation policy that reframes actions as variable-length, unordered joint trajectories and utilizes an Embodied Joint Codebook to achieve superior sample efficiency and cross-embodiment skill transfer from heterogeneous datasets.

Xiaohan Lei, Min Wang, Bohong Weng + 2 more | 2026-03-05 | 💻 cs

ProFound: A moderate-sized vision foundation model for multi-task prostate imaging

The paper introduces ProFound, a domain-specialized vision foundation model pre-trained on over 22,000 prostate MRI volumes via self-supervised learning, which demonstrates superior or competitive performance across 11 diverse clinical tasks compared to state-of-the-art specialized and foundation models.

Yipei Wang, Yinsong Xu, Weixi Yi + 11 more | 2026-03-05 | 💻 cs

BLOCK: An Open-Source Bi-Stage MLLM Character-to-Skin Pipeline for Minecraft

This paper introduces BLOCK, an open-source bi-stage pipeline that leverages a large multimodal model to generate consistent 3D character previews and a fine-tuned FLUX.2 model with a novel EvolveLoRA curriculum to decode these previews into pixel-perfect Minecraft skins.

Hengquan Guo | 2026-03-05 | 🤖 cs.AI

UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization

This paper proposes UniRain, a unified image deraining framework that combines a RAG-based dataset distillation pipeline for selecting high-quality training samples and a multi-objective reweighted optimization strategy within an asymmetric MoE architecture to effectively restore images degraded by diverse rain streaks and raindrops across both daytime and nighttime conditions.

Qianfeng Yang, Qiyuan Guan, Xiang Chen + 3 more | 2026-03-05 | 💻 cs

Scaling Dense Event-Stream Pretraining from Visual Foundation Models

This paper proposes a novel self-supervised pretraining method that leverages structure-aware distillation from visual foundation models to overcome annotation bottlenecks and semantic collapse, enabling scalable learning of versatile, fine-grained representations from dense event streams.

Zhiwen Chen, Junhui Hou, Zhiyu Zhu + 2 more | 2026-03-05 | 💻 cs

Dual-Solver: A Generalized ODE Solver for Diffusion Models with Dual Prediction

Dual-Solver is a generalized ODE solver for diffusion models that employs learnable parameters to dynamically interpolate prediction types, select integration domains, and adjust residuals, thereby significantly improving image quality and CLIP scores in low-function-evaluation regimes across various backbones.

Soochul Park, Yeon Ju Lee | 2026-03-05 | 🤖 cs.LG

Phi-4-reasoning-vision-15B Technical Report

This technical report introduces Phi-4-reasoning-vision-15B, a compact open-weight multimodal model that achieves competitive performance in scientific, mathematical, and UI reasoning through strategic architecture choices, rigorous data curation, and a hybrid training approach, demonstrating that smaller models can excel with significantly less compute.

Jyoti Aneja, Michael Harrison, Neel Joshi + 3 more | 2026-03-05 | 🤖 cs.AI
