Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting

MVD-HuGaS is a novel framework that achieves state-of-the-art free-view 3D human rendering from a single image by leveraging a fine-tuned multi-view diffusion model to generate consistent multi-view images, an alignment module for joint Gaussian and pose optimization, and a depth-based facial distortion mitigation module to ensure high-fidelity reconstruction.

Kaiqiang Xiong, Rui Peng, Jiahao Wu + 5 more · 2026-03-04 · cs

Articulation in Motion: Prior-free Part Mobility Analysis for Articulated Objects by Dynamic-Static Disentanglement

This paper presents Articulation in Motion (AiM), a prior-free framework that leverages a dual-Gaussian scene representation and sequential RANSAC to automatically segment articulated objects into rigid parts, estimate their kinematics, and reconstruct interactive 3D replicas from a single static scan and an interaction video without requiring prior knowledge of the number of parts.

Hao Ai, Wenjie Chang, Jianbo Jiao + 2 more · 2026-03-04 · cs
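The AiM summary above relies on sequential RANSAC to peel off one rigid part at a time from the motion between two observations. A minimal toy sketch of that general idea, with invented 2D data and thresholds (this is not AiM's actual pipeline or its dual-Gaussian representation): repeatedly RANSAC-fit a single rigid transform between the static scan and the moved scan, label the inliers as one part, remove them, and repeat.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy scene: two rigid "parts" seen in a static scan (P)
# and after interaction (Q). Part A rotates 30 degrees, part B translates.
def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

part_a = rng.uniform(-1.0, 1.0, (60, 2))
part_b = rng.uniform(2.0, 4.0, (60, 2))
P = np.vstack([part_a, part_b])
Q = np.vstack([part_a @ rot(np.pi / 6).T, part_b + np.array([1.5, 0.0])])

def fit_rigid(p, q):
    # Kabsch/Procrustes: least-squares rotation R and translation t
    # such that q ~ p @ R.T + t.
    pc, qc = p.mean(0), q.mean(0)
    H = (p - pc).T @ (q - qc)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    return R, qc - pc @ R.T

def sequential_ransac(P, Q, thresh=0.05, iters=200, min_part=10):
    remaining = np.arange(len(P))
    labels = -np.ones(len(P), dtype=int)
    part = 0
    while len(remaining) >= min_part:
        best = None
        for _ in range(iters):
            idx = rng.choice(remaining, 3, replace=False)
            R, t = fit_rigid(P[idx], Q[idx])
            err = np.linalg.norm(P[remaining] @ R.T + t - Q[remaining], axis=1)
            inliers = remaining[err < thresh]
            if best is None or len(inliers) > len(best):
                best = inliers
        if len(best) < min_part:
            break
        labels[best] = part          # peel off one rigid part...
        remaining = np.setdiff1d(remaining, best)
        part += 1                    # ...and repeat on what is left
    return labels

labels = sequential_ransac(P, Q)
```

Because each part moves under its own rigid transform, a single RANSAC model can only collect one part's points as inliers, so the loop discovers the parts one by one without knowing their number in advance.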

HDINO: A Concise and Efficient Open-Vocabulary Detector

HDINO is a concise and efficient open-vocabulary detector that eliminates reliance on manually curated datasets and resource-intensive feature extraction by employing a two-stage training strategy with a One-to-Many Semantic Alignment Mechanism and Difficulty Weighted Classification Loss to achieve state-of-the-art performance on COCO with significantly fewer training images than existing methods.

Hao Zhang, Yiqun Wang, Qinran Lin + 2 more · 2026-03-04 · cs

TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration

TC-Padé is a novel feature prediction framework that leverages Trajectory-Consistent Padé approximation with adaptive coefficient modulation and step-aware strategies to significantly accelerate diffusion models in low-step regimes while maintaining high generation quality and overcoming the trajectory drift limitations of existing polynomial-based methods.

Benlei Cui, Shaoxuan He, Bukun Huang + 8 more · 2026-03-04 · cs
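The motivation for Padé over polynomial prediction can be seen in a generic one-liner (this illustrates the mathematical principle only, not TC-Padé's feature predictor): a [1/1] Padé approximant of exp(x) spends the same two coefficients as the degree-1 Taylor polynomial, yet stays accurate much further from the expansion point, which is exactly what matters when few diffusion steps are available.

```python
import numpy as np

# Degree-1 Taylor polynomial of exp(x) around 0: two coefficients.
def taylor1(x):
    return 1.0 + x

# [1/1] Pade approximant of exp(x) around 0: also two free coefficients,
# but a rational form, so it can bend with the function's curvature.
def pade11(x):
    return (1.0 + x / 2.0) / (1.0 - x / 2.0)

x = 1.0
err_taylor = abs(np.exp(x) - taylor1(x))  # |e - 2|
err_pade = abs(np.exp(x) - pade11(x))     # |e - 3|
```

At x = 1 the Padé error is well under half the Taylor error for the same coefficient budget; rational extrapolation of cached features in low-step regimes exploits the same effect.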

TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation

TagaVLM is an end-to-end framework that enhances Vision-Language Navigation by explicitly injecting topological structures into the VLM backbone via Spatial Topology Aware Residual Attention and Interleaved Navigation Prompts, achieving state-of-the-art performance on the R2R benchmark and demonstrating that targeted architectural improvements on smaller models can outperform brute-force scaling for embodied spatial reasoning.

Jiaxing Liu, Zexi Zhang, Xiaoyan Li + 3 more · 2026-03-04 · cs