OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving

The paper proposes OD-RASE, an ontology-driven framework that leverages large-scale visual language models and diffusion models to proactively identify accident-prone road structures and generate reliable infrastructure improvement proposals, thereby enhancing the safety of autonomous driving systems.

Kota Shimomura, Masaki Nambata, Atsuya Ishikawa, Ryota Mimura, Takayuki Kawabuchi, Takayoshi Yamashita, Koki Inoue · 2026-03-09 · cs

SLER-IR: Spherical Layer-wise Expert Routing for All-in-One Image Restoration

The paper proposes SLER-IR, a novel all-in-one image restoration framework that utilizes spherical layer-wise expert routing, a spherical uniform degradation embedding with contrastive learning, and a global-local granularity fusion module to effectively overcome feature interference and spatial non-uniform degradations, achieving state-of-the-art performance across multiple restoration tasks.

Peng Shurui, Xin Lin, Shi Luo, Jincen Ou, Dizhe Zhang, Lu Qi, Truong Nguyen, Chao Ren · 2026-03-09 · cs

LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Generative Real-World Super-Resolution

LucidNFT is a multi-reward reinforcement learning framework for generative real-world super-resolution that addresses faithfulness hallucinations and optimization bottlenecks by introducing a degradation-robust consistency evaluator, a decoupled advantage normalization strategy, and a large-scale real-degradation dataset to achieve superior perceptual-faithfulness trade-offs.

Song Fei, Tian Ye, Sixiang Chen, Zhaohu Xing, Jianyu Lai, Lei Zhu · 2026-03-09 · cs

Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models

This paper introduces Skeleton-to-Image Encoding (S2I), a novel method that transforms heterogeneous 3D skeleton sequences into standardized image-like formats to leverage powerful vision-pretrained models for effective self-supervised skeleton representation learning and cross-modal action recognition.

Siyuan Yang, Jun Liu, Hao Cheng, Chong Wang, Shijian Lu, Hedvig Kjellstrom, Weisi Lin, Alex C. Kot · 2026-03-09 · cs.AI

CR-QAT: Curriculum Relational Quantization-Aware Training for Open-Vocabulary Object Detection

This paper proposes CR-QAT, a framework combining curriculum-based progressive quantization and text-centric relational knowledge distillation to mitigate the severe performance degradation of naive low-bit quantization in open-vocabulary object detection, thereby achieving significant accuracy improvements on zero-shot benchmarks.

Jinyeong Park, Donghwa Kim, Brent ByungHoon Kang, Hyeongboo Baek, Jibum Kim · 2026-03-09 · cs

Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions

This paper introduces DynUAV, a comprehensive benchmark featuring over 1.7 million annotations across 42 video sequences to address the limitations of existing datasets in evaluating multi-object tracking under the intense ego-motion, scale variations, and motion blur characteristic of complex UAV operations.

Jingtao Ye, Kexin Zhang, Xunchi Ma, Yuehan Li, Guangming Zhu, Peiyi Shen, Linhua Jiang, Xiangdong Zhang, Liang Zhang · 2026-03-09 · cs

HarvestFlex: Strawberry Harvesting via Vision-Language-Action Policy Adaptation in the Wild

This paper introduces HarvestFlex, the first study demonstrating that vision-language-action policies can be successfully adapted to real-world greenhouse strawberry harvesting using a closed-loop system with three-view RGB sensing and minimal teleoperated data, achieving a 74.0% success rate without relying on depth sensors or explicit geometric calibration.

Ziyang Zhao, Shuheng Wang, Zhonghua Miao, Ya Xiong · 2026-03-09 · cs

Technical Report: Automated Optical Inspection of Surgical Instruments

This technical report details a collaboration with industry leaders in Pakistan's Sialkot surgical cluster to develop an Automated Optical Inspection system using deep learning models (YOLOv8, ResNet-152, and EfficientNet-b4) on a new dataset of 4,414 images to detect manufacturing defects in surgical instruments, thereby enhancing patient safety and manufacturing quality.

Zunaira Shafqat, Atif Aftab Ahmed Jilani, Qurrat Ul Ain · 2026-03-09 · cs.AI

RePer-360: Releasing Perspective Priors for 360° Depth Estimation via Self-Modulation

RePer-360 is a distortion-aware self-modulation framework that adapts perspective-trained depth foundation models to 360° panoramic depth estimation, preserving pretrained priors through a lightweight geometry-aligned guidance module and a Self-Conditioned AdaLN-Zero mechanism while achieving superior performance with only 1% of the training data.

Cheng Guan, Chunyu Lin, Zhijie Shen, Junsong Zhang, Jiyuan Wang · 2026-03-09 · cs