From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
This paper introduces MM-Mem, a cognition-inspired pyramidal multimodal memory architecture for long-horizon video agents. Drawing on Fuzzy-Trace Theory, MM-Mem applies a Semantic Information Bottleneck to progressively distill verbatim visual details into abstract semantic schemas, enabling efficient long-horizon video understanding through hierarchical storage and entropy-driven retrieval.
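To make the architecture concrete, below is a minimal Python sketch of the pyramidal-memory idea: a verbatim level that overflows into progressively more abstract gist levels, plus an entropy-based retrieval heuristic. All names (`PyramidalMemory`, `_distill`, `retrieve`), the mean-pooling "distillation", and the softmax-entropy routing rule are illustrative assumptions, not the paper's actual API or method.

```python
# Illustrative sketch only; the paper's real distillation and retrieval
# mechanisms are learned, not the simple heuristics shown here.
import math
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    embedding: list[float]  # feature vector of the stored item
    payload: str            # verbatim frame caption or distilled gist

@dataclass
class PyramidalMemory:
    # levels[0] holds verbatim entries; higher levels hold progressively
    # more abstract gists produced by the distillation step.
    levels: list[list[MemoryEntry]] = field(default_factory=lambda: [[], [], []])
    capacity: int = 8  # per-level budget before distillation triggers

    def write(self, entry: MemoryEntry) -> None:
        self.levels[0].append(entry)
        for i in range(len(self.levels) - 1):
            if len(self.levels[i]) > self.capacity:
                # Compress the oldest half of this level into one gist entry
                # at the next level (a stand-in for the semantic bottleneck).
                half = self.capacity // 2
                chunk, self.levels[i] = self.levels[i][:half], self.levels[i][half:]
                self.levels[i + 1].append(self._distill(chunk))

    @staticmethod
    def _distill(chunk: list[MemoryEntry]) -> MemoryEntry:
        # Placeholder: mean-pool embeddings and join payloads; a real system
        # would use a learned summarizer to produce the semantic schema.
        dim = len(chunk[0].embedding)
        pooled = [sum(e.embedding[d] for e in chunk) / len(chunk) for d in range(dim)]
        return MemoryEntry(pooled, " | ".join(e.payload for e in chunk))

    def retrieve(self, query: list[float], k: int = 3) -> list[MemoryEntry]:
        # One plausible reading of "entropy-driven retrieval": score each
        # level by the entropy of its softmax-normalized query similarities
        # and search the most decisive (lowest-entropy) level first.
        def sims(entries: list[MemoryEntry]) -> list[float]:
            return [sum(q * x for q, x in zip(query, e.embedding)) for e in entries]

        def entropy(scores: list[float]) -> float:
            exps = [math.exp(s) for s in scores]
            z = sum(exps)
            return -sum((x / z) * math.log(x / z + 1e-12) for x in exps)

        results: list[MemoryEntry] = []
        for level in sorted((l for l in self.levels if l), key=lambda l: entropy(sims(l))):
            ranked = sorted(level, key=lambda e: -sum(q * x for q, x in zip(query, e.embedding)))
            results.extend(ranked[: k - len(results)])
            if len(results) >= k:
                break
        return results
```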