cs.CV papers | Gist.Science

Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents

The paper introduces Earth-Agent, a novel agentic framework that unifies RGB and spectral Earth observation data within an MCP-based tool ecosystem to enable complex, multi-step quantitative reasoning, accompanied by the Earth-Bench benchmark for comprehensive evaluation of such capabilities.

Peilin Feng, Zhutao Lv, Junyan Ye + 8 more | 2026-03-04 | 💻 cs

PROFusion: Robust and Accurate Dense Reconstruction via Camera Pose Regression and Optimization

PROFusion achieves robust and accurate real-time dense 3D reconstruction under unstable camera motions by combining a learning-based camera pose regression network for reliable initialization with an optimization-based refinement algorithm to align depth images with scene geometry.

Siyan Dong, Zijun Wang, Lulu Cai + 2 more | 2026-03-04 | 💻 cs

Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting

Proxy-GS introduces a fast, sub-millisecond proxy system to provide occlusion awareness that simultaneously accelerates rendering through efficient culling and improves training quality by guiding densification toward visible surfaces, thereby outperforming existing methods in both speed and fidelity for large-scale, occluded scenes.

Yuanyuan Gao, Yuning Gong, Yifei Liu + 6 more | 2026-03-04 | 💻 cs

EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model

EchoGen introduces the first feed-forward subject-driven generation framework built on Visual Auto-Regressive (VAR) models, utilizing a novel dual-path injection strategy to achieve high-fidelity, zero-shot subject generation with significantly faster inference speeds than existing diffusion-based methods.

Ruixiao Dong, Zhendong Wang, Keli Liu + 5 more | 2026-03-04 | 💻 cs

TTT3R: 3D Reconstruction as Test-Time Training

TTT3R is a training-free method that improves the length generalization of 3D reconstruction models by framing reconstruction as an online learning problem and deriving a closed-form learning rate from alignment confidence, yielding significant gains in pose estimation while remaining efficient.

Xingyu Chen, Yue Chen, Yuliang Xiu + 2 more | 2026-03-04 | 💻 cs

BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

BindWeave is a unified framework that leverages an MLLM-DiT architecture to perform deep cross-modal reasoning for grounding complex prompt semantics, thereby enabling high-fidelity, subject-consistent video generation across diverse single and multi-subject scenarios.

Zhaoyang Li, Dongjun Qian, Kai Su + 6 more | 2026-03-04 | 💻 cs

Arbitrary Generative Video Interpolation

The paper introduces ArbInterp, a novel generative video frame interpolation framework that overcomes the limitations of fixed-length synthesis by enabling flexible, high-fidelity interpolation at arbitrary timestamps and durations through a Timestamp-aware Rotary Position Embedding and an appearance-motion decoupled segment-wise generation strategy.

Guozhen Zhang, Haiguang Wang, Chunyu Wang + 3 more | 2026-03-04 | 💻 cs

D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

The D2E framework demonstrates that scaling vision-action pretraining on large-scale, standardized desktop gaming data enables a 1B-parameter model to achieve state-of-the-art performance in real-world embodied AI tasks, effectively bridging the gap between digital interactions and physical robot manipulation and navigation.

Suhwan Choi, Jaeyoon Jung, Haebin Seong + 7 more | 2026-03-04 | 🤖 cs.AI

Human3R: Everyone Everywhere All at Once

Human3R is a unified, feed-forward framework that enables real-time, single-pass online 4D reconstruction of multiple humans, dense 3D scenes, and camera trajectories from monocular videos, eliminating the need for multi-stage pipelines and heavy pre-processing dependencies.

Yue Chen, Xingyu Chen, Yuxuan Xue + 3 more | 2026-03-04 | 💻 cs

MIRAGE: Runtime Scheduling for Multi-Vector Image Retrieval with Hierarchical Decomposition

MIRAGE is an efficient runtime scheduling framework for multi-vector image retrieval that employs a novel hierarchical decomposition paradigm with automatic parameter configuration to significantly enhance retrieval accuracy while reducing computational costs by up to 3.5 times compared to existing systems.

Maoliang Li, Ke Li, Yaoyang Liu + 5 more | 2026-03-04 | 💻 cs

Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment

This paper reveals that the generalization of reinforcement learning-based image quality assessment models stems from their conversion of visual data into compact text representations, leading to the proposal of RALI, a lightweight algorithm that directly aligns images with these representations to achieve comparable performance with significantly reduced computational costs.

Shijie Zhao, Xuanyu Zhang, Weiqi Li + 4 more | 2026-03-04 | 💻 cs

Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

This paper employs mechanistic interpretability to map the internal information flow of VideoLLMs, revealing a consistent three-stage pathway of cross-frame interaction, video-language integration, and answer generation that enables effective temporal reasoning while allowing for significant attention edge pruning without performance loss.

Minji Kim, Taekyung Kim, Bohyung Han | 2026-03-04 | 💻 cs

Self-Aug: Query and Entropy Adaptive Decoding for Large Vision-Language Models

This paper introduces Self-Aug, a training-free decoding strategy for Large Vision-Language Models that combines query-dependent self-augmentation prompting with entropy-adaptive thresholding to significantly reduce hallucinations and improve factual consistency.

Eun Woo Im, Muhammad Kashif Ali, Vivek Gupta | 2026-03-04 | 🤖 cs.AI

Inpainting the Red Planet: Diffusion Models for the Reconstruction of Martian Environments in Virtual Reality

This paper proposes an unconditional diffusion model trained on augmented HiRISE heightmaps to reconstruct missing Martian terrain data in virtual reality, demonstrating superior accuracy and perceptual similarity compared to traditional interpolation methods.

Giuseppe Lorenzo Catalano, Agata Marta Soccini | 2026-03-04 | 🤖 cs.AI

CASR-Net: An Image Processing-focused Deep Learning-based Coronary Artery Segmentation and Refinement Network for X-ray Coronary Angiogram

This paper introduces CASR-Net, a three-stage deep learning pipeline featuring a novel multichannel preprocessing strategy and a Self-ONN-based UNet architecture that achieves state-of-the-art coronary artery segmentation and refinement on X-ray angiograms, thereby enhancing the accuracy of coronary artery disease diagnosis.

Alvee Hassan, Rusab Sarmun, Muhammad E. H. Chowdhury + 4 more | 2026-03-04 | 🤖 cs.AI

Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects

Kinematify is an automated framework that synthesizes physically consistent, high-degree-of-freedom articulated objects directly from arbitrary RGB images or text by combining Monte Carlo Tree Search for kinematic topology inference with geometry-driven optimization for joint parameter estimation.

Jiawei Wang, Dingyou Wang, Jiaming Hu + 3 more | 2026-03-04 | 💻 cs

Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision

This paper introduces DetGain, an architecture-agnostic online data curation method for object detection that dynamically selects informative training samples by estimating their marginal contributions to dataset-level Average Precision, thereby improving accuracy and robustness across various detectors.

Zitang Sun, Masakazu Yoshimura, Junji Otsuka + 2 more | 2026-03-04 | 💻 cs

PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation

PrismAudio is a novel video-to-audio generation framework that addresses objective entanglement and human preference alignment by integrating a decomposed Chain-of-Thought reasoning structure with multi-dimensional rewards and a computationally efficient Fast-GRPO algorithm, achieving state-of-the-art performance across semantic, temporal, aesthetic, and spatial dimensions.

Huadai Liu, Kaicheng Luo, Wen Wang + 6 more | 2026-03-04 | ⚡ eess

Markovian Scale Prediction: A New Era of Visual Autoregressive Generation

The paper introduces Markov-VAR, a novel visual autoregressive model that reformulates next-scale prediction as a Markov process using a sliding window to compress historical context, thereby significantly improving both generation quality and computational efficiency compared to traditional full-context VAR approaches.

Yu Zhang, Jingyi Liu, Yiwei Shi + 4 more | 2026-03-04 | 💻 cs

ALARM: Automated MLLM-Based Anomaly Detection in Complex-EnviRonment Monitoring with Uncertainty Quantification

This paper introduces ALARM, an automated framework that leverages multi-modal large language models integrated with uncertainty quantification and quality-assurance techniques to achieve robust and reliable visual anomaly detection in complex, ambiguous environments.

Congjing Zhang, Feng Lin, Xinyi Zhao + 5 more | 2026-03-04 | 🤖 cs.AI
