ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport
This paper introduces ViCLIP-OT, a novel foundation vision-language model for Vietnamese image-text retrieval. ViCLIP-OT combines CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport loss, achieving state-of-the-art retrieval performance in both in-domain and zero-shot settings.