CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird's-Eye-View Semantic Segmentation

CycleBEV is a training-only regularization framework that enhances Bird's-Eye-View semantic segmentation by introducing an inverse view transformation network to enforce cycle consistency between perspective and BEV spaces, thereby improving geometric and semantic feature learning without increasing inference complexity.
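To illustrate the general idea (not CycleBEV's actual networks, which are learned and nonlinear), a cycle-consistency regularizer penalizes the round-trip error of mapping features to BEV space and back. This toy sketch uses invertible linear maps as hypothetical stand-ins for the forward and inverse view transformations:

```python
import numpy as np

# Toy stand-ins for the two view-transformation networks. Linear maps are a
# hypothetical simplification; the paper's networks are learned modules.
rng = np.random.default_rng(0)
W_fwd = rng.normal(size=(8, 8))   # perspective features -> BEV features
W_inv = np.linalg.inv(W_fwd)      # BEV -> perspective (idealized inverse)

def cycle_consistency_loss(x):
    """Mean squared error between a feature and its round-trip reconstruction."""
    bev = W_fwd @ x               # forward view transformation
    recon = W_inv @ bev           # inverse view transformation
    return float(np.mean((x - recon) ** 2))

x = rng.normal(size=8)
loss = cycle_consistency_loss(x)  # near zero when the inverse is exact
```

During training, minimizing this term alongside the segmentation loss pushes the forward transform to preserve information; at inference the inverse branch is simply dropped, which is why the regularization is training-only.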

Jeongbin Hong, Dooseop Choi, Taeg-Hyun An + 2 more · 2026-03-02 · cs.AI

Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning

This paper introduces HDFLIM, a framework that achieves efficient image captioning by aligning frozen vision and language models through hyperdimensional computing operations like binding and bundling, thereby eliminating the need for computationally intensive multimodal fine-tuning while maintaining performance comparable to end-to-end training methods.
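Binding and bundling are the two standard hyperdimensional-computing operations the summary refers to. As a generic illustration (not HDFLIM's alignment procedure), for bipolar hypervectors binding is an elementwise product, which is its own inverse, and bundling is an elementwise majority vote that stays similar to each input:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 10_000  # hypervectors are typically very high-dimensional

def random_hv():
    """Random bipolar {-1, +1} hypervector."""
    return rng.choice([-1, 1], size=D)

def bind(a, b):
    """Binding: elementwise product; result is dissimilar to both inputs."""
    return a * b

def bundle(*vs):
    """Bundling: sign of the elementwise sum; similar to each input."""
    return np.sign(np.sum(vs, axis=0))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

img, txt = random_hv(), random_hv()
pair = bind(img, txt)           # associate an image code with a text code
recovered = bind(pair, img)     # unbinding with img recovers txt exactly
```

Because these operations are fixed algebra rather than learned layers, aligning frozen encoders through them avoids multimodal fine-tuning, which is the efficiency argument the paper makes.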

Abhishek Dalvi, Vasant Honavar · 2026-03-02 · cs.AI

Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering

This paper introduces Semantically Decoupled Latent Steering (SDLS), a training-free inference-time framework that utilizes LLM-driven semantic decomposition and QR-based orthogonalization to generate intervention vectors that specifically suppress prior-comparison hallucinations in radiology report generation while preserving clinical accuracy.
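QR-based orthogonalization, as named in the summary, can be sketched generically: factor the directions to be preserved into an orthonormal basis Q, then subtract the projection onto Q from the suppression direction so the resulting steering vector leaves the preserved subspace untouched. The direction names below are hypothetical placeholders, not SDLS's actual decomposition:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64  # hidden-state dimension (hypothetical)

# Hypothetical latent directions: one to suppress (prior-comparison phrasing)
# and one to preserve (clinical-finding semantics).
v_suppress = rng.normal(size=d)
v_preserve = rng.normal(size=d)

# QR factorization of the preserve direction(s) yields an orthonormal basis Q.
Q, _ = np.linalg.qr(v_preserve.reshape(d, 1))

# Remove the component of the suppression direction lying in span(Q).
steer = v_suppress - Q @ (Q.T @ v_suppress)
steer /= np.linalg.norm(steer)
```

Adding a scaled `steer` to the model's hidden states at inference then dampens the unwanted behavior without moving activations along the preserved clinical directions, matching the training-free framing of the paper.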

Ao Li, Rui Liu, Mingjie Li + 6 more · 2026-03-02 · cs

HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit

HiDrop is a novel framework that significantly accelerates Multimodal Large Language Models (MLLMs) by aligning token pruning with hierarchical layer functions through Late Injection, Concave Pyramid Pruning, and Early Exit mechanisms, achieving a 90% reduction in visual tokens with a 1.72x training speedup while maintaining original performance.
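The "concave pyramid" schedule can be read as keeping most visual tokens in early layers and pruning aggressively in late ones. The sketch below is a guess at that shape under a simple quadratic schedule, paired with generic top-k score-based pruning; neither function is HiDrop's actual mechanism:

```python
import numpy as np

def concave_keep_schedule(num_layers, final_keep=0.1):
    """Concave-decreasing per-layer keep ratios: gentle early, steep late.
    A hypothetical quadratic shape, not the paper's exact schedule."""
    t = np.arange(num_layers) / max(num_layers - 1, 1)
    return 1.0 - (1.0 - final_keep) * t**2

def prune_tokens(scores, keep_ratio):
    """Keep indices of the top-k visual tokens by importance score."""
    k = max(1, int(round(len(scores) * keep_ratio)))
    idx = np.argsort(scores)[::-1][:k]
    return np.sort(idx)

sched = concave_keep_schedule(32)          # keep ratio per transformer layer
kept = prune_tokens(np.arange(10.0), 0.5)  # highest-scoring half survives
```

Ending the schedule near a 10% keep ratio mirrors the reported 90% token reduction; Late Injection and Early Exit then bound which layers see visual tokens at all.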

Hao Wu, Yingqi Fan, Jinyang Dai + 3 more · 2026-03-02 · cs.CL

Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities?

This paper introduces VGUBench to demonstrate that while Unified Multimodal Large Language Models exhibit strong textual reasoning and visual rendering capabilities individually, they fail to maintain semantic equivalence when required to generate visual answers, revealing a critical breakdown in cross-modal semantic alignment rather than a lack of generation fidelity.

Hongbo Jiang, Jie Li, Yunhang Shen + 4 more · 2026-03-02 · cs

StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation

StemVLA is an open-source Vision-Language-Action model that enhances robot manipulation performance on long-horizon tasks by explicitly integrating predicted future 3D spatial geometry and aggregated 4D historical spatiotemporal representations to improve spatial reasoning and decision-making in dynamic environments.

Jiasong Xiao, Yutao She, Kai Li + 3 more · 2026-03-02 · cs