Gist.Science
Category: cs.CV (4144 papers)

Frequency-Aware Error-Bounded Caching for Accelerating Diffusion Transformers

This paper introduces SpectralCache, a training-free, frequency-aware caching framework that accelerates Diffusion Transformers by dynamically scheduling timesteps, managing cumulative error budgets, and decomposing features to achieve a 2.46x speedup with minimal quality loss.

Guandong Li · 2026-03-06 · cs
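The core idea — reusing cached features across timesteps until an accumulated error budget is spent — can be sketched generically. This is a minimal illustration of error-bounded caching, not SpectralCache's actual algorithm; the budget value and the scalar stand-ins for feature tensors are invented.

```python
# Hypothetical sketch: serve a cached feature across diffusion timesteps
# while the accumulated relative drift stays under an error budget, and
# recompute (resetting the budget) once it is exhausted.

def run_with_cache(step_outputs, budget=0.15):
    """`step_outputs` is the sequence of per-step features we would get
    without caching; plain floats stand in for feature tensors."""
    cached = None
    accumulated = 0.0
    recomputes = 0
    served = []
    for x in step_outputs:
        if cached is None:
            cached, accumulated = x, 0.0
            recomputes += 1
        else:
            drift = abs(x - cached) / (abs(cached) + 1e-8)
            accumulated += drift
            if accumulated > budget:   # budget exhausted: refresh the cache
                cached, accumulated = x, 0.0
                recomputes += 1
        served.append(cached)
    return served, recomputes

served, n = run_with_cache([1.0, 1.01, 1.02, 1.5, 1.51], budget=0.1)
```

With this toy trace, only two of five steps trigger a real computation; the rest reuse the cache, which is where the speedup of such schemes comes from.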

Dark3R: Learning Structure from Motion in the Dark

Dark3R is a novel framework that enables robust structure-from-motion and novel view synthesis in extreme low-light conditions (SNR < -4 dB) by adapting large-scale 3D foundation models through teacher-student distillation trained on noisy-clean raw image pairs without 3D supervision.

Andrew Y Guo, Anagh Malik, SaiKiran Tedla + 7 more · 2026-03-06 · cs

OpenFrontier: General Navigation with Visual-Language Grounded Frontiers

OpenFrontier is a training-free, lightweight navigation framework that achieves robust zero-shot generalization in open-world environments by leveraging vision-language models to identify semantic frontiers as visual anchors for goal-directed navigation, eliminating the need for dense 3D mapping, policy training, or model fine-tuning.

Esteban Padilla, Boyang Sun, Marc Pollefeys + 1 more · 2026-03-06 · cs

ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking

This paper introduces ORMOT, a novel task extending referring multi-object tracking to omnidirectional imagery to overcome field-of-view limitations, supported by the newly constructed ORSet dataset and the ORTrack large vision-language model framework.

Sijia Chen, Zihan Zhou, Yanqiu Yu + 2 more · 2026-03-06 · cs

Fusion-CAM: Integrating Gradient and Region-Based Class Activation Maps for Robust Visual Explanations

This paper introduces Fusion-CAM, a novel framework that enhances visual explanations for deep neural networks by adaptively fusing denoised gradient-based and region-based Class Activation Maps to overcome the limitations of noise and over-smoothing in existing methods.

Hajar Dekdegue, Moncef Garouani, Josiane Mothe + 1 more · 2026-03-06 · cs

Loop Closure via Maximal Cliques in 3D LiDAR-Based SLAM

This paper introduces CliReg, a novel deterministic loop closure validation algorithm that replaces RANSAC with a maximal clique search on feature compatibility graphs to achieve more robust and accurate 3D LiDAR-based SLAM performance under noisy and ambiguous conditions.

Javier Laserna, Saurabh Gupta, Oscar Martinez Mozos + 2 more · 2026-03-06 · cs
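The clique idea is worth seeing concretely: build a graph whose nodes are point correspondences, connect two nodes when they preserve pairwise distance (as any rigid motion must), and keep the largest mutually consistent set. The sketch below is a generic, brute-force illustration of that validation step with invented toy data; it is not CliReg itself, which would use an efficient clique solver.

```python
# Hypothetical sketch: validate loop-closure correspondences by finding the
# largest set that is mutually distance-consistent, instead of RANSAC sampling.
import itertools
import math

def compatible(c1, c2, tol=0.1):
    """Two correspondences (p_src, p_dst) are compatible if they preserve
    the pairwise distance between the two scans (rigid-motion invariant)."""
    (a1, b1), (a2, b2) = c1, c2
    return abs(math.dist(a1, a2) - math.dist(b1, b2)) <= tol

def max_clique(corrs, tol=0.1):
    """Brute-force maximum clique over the compatibility graph — fine for a
    handful of matches; real systems use Bron-Kerbosch-style solvers."""
    n = len(corrs)
    for r in range(n, 0, -1):
        for subset in itertools.combinations(range(n), r):
            if all(compatible(corrs[i], corrs[j], tol)
                   for i, j in itertools.combinations(subset, 2)):
                return [corrs[i] for i in subset]
    return []

# Three consistent matches (pure translation by (5, 0)) plus one outlier.
corrs = [
    ((0.0, 0.0), (5.0, 0.0)),
    ((1.0, 0.0), (6.0, 0.0)),
    ((0.0, 1.0), (5.0, 1.0)),
    ((2.0, 2.0), (9.0, 9.0)),   # outlier: breaks distance consistency
]
inliers = max_clique(corrs)
```

Unlike RANSAC, the result is deterministic: the same input always yields the same inlier set, which is part of the robustness claim such methods make.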

Video-based Locomotion Analysis for Fish Health Monitoring

This paper presents a video-based system utilizing a YOLOv11 detector within a multi-object tracking framework to estimate fish locomotion activities, such as swimming direction and speed, for effective health monitoring and disease detection in aquaculture.

Timon Palm, Clemens Seibold, Anna Hilsmann + 1 more · 2026-03-06 · cs
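Once a tracker yields per-frame centroids, deriving speed and heading is straightforward geometry. The following is a toy illustration of that last step under assumed camera calibration; the frame rate, pixel scale, and coordinates are invented, and this is not the paper's pipeline.

```python
# Hypothetical sketch: per-frame swimming speed and heading from a list of
# tracked (x, y) pixel centroids of one fish.
import math

def locomotion(track, fps=25.0, px_per_cm=10.0):
    """Return per-segment speed (cm/s) and heading (degrees)."""
    speeds, headings = [], []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        dx, dy = x1 - x0, y1 - y0
        dist_cm = math.hypot(dx, dy) / px_per_cm   # pixels -> centimetres
        speeds.append(dist_cm * fps)               # distance per second
        headings.append(math.degrees(math.atan2(dy, dx)))
    return speeds, headings

speeds, headings = locomotion([(0, 0), (10, 0), (10, 10)], fps=25, px_per_cm=10)
```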

MobileFetalCLIP: Selective Repulsive Knowledge Distillation for Mobile Fetal Ultrasound Analysis

The paper introduces MobileFetalCLIP, a framework utilizing Selective Repulsive Knowledge Distillation to train a compact 11.4M parameter student model that outperforms its 304M parameter teacher in fetal ultrasound analysis while enabling real-time deployment on mobile devices.

Numan Saeed, Fadillah Adamsyah Maani, Mohammad Yaqub · 2026-03-06 · cs.AI
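One generic reading of a "selective repulsive" objective: pull the student toward the teacher on examples where the teacher is reliable, and push it away where the teacher errs. The sketch below illustrates that reading only — the loss form, weights, and distributions are invented, and MobileFetalCLIP's actual objective certainly differs in detail.

```python
# Hypothetical sketch of a selective attract/repel distillation loss.
import math

def kl(p, q):
    """KL(p || q) over two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def selective_kd_loss(student, teacher, teacher_correct, alpha=1.0):
    """Attract the student toward the teacher when the teacher is right;
    repel (negate the divergence term) when it is wrong."""
    d = kl(teacher, student)
    return alpha * d if teacher_correct else -alpha * d

good = selective_kd_loss([0.7, 0.3], [0.8, 0.2], teacher_correct=True)
bad = selective_kd_loss([0.7, 0.3], [0.8, 0.2], teacher_correct=False)
```

Naively negating a divergence is unbounded, so a real system would clip or reweight the repulsive term; the point here is only the selectivity, which is how a small student can end up beating its much larger teacher on the teacher's failure modes.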

RelaxFlow: Text-Driven Amodal 3D Generation

RelaxFlow is a training-free, text-driven framework that resolves semantic ambiguity in amodal 3D generation by decoupling rigid observation control from relaxed structural guidance through a novel relaxation mechanism, enabling the completion of occluded regions while strictly preserving input fidelity.

Jiayin Zhu, Guoji Fu, Xiaolu Liu + 3 more · 2026-03-06 · cs.AI

SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning

This paper proposes SAIL, a weakly-supervised dense video captioning framework that improves temporal localization and description by constructing semantically-aware masks through cross-modal alignment and enhancing training signals with LLM-generated synthetic captions via an inter-mask mechanism.

Ye-Chan Kim, SeungJu Cha, Si-Woo Kim + 3 more · 2026-03-06 · cs.AI

Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model

The paper introduces CompACT, a compact discrete tokenizer that compresses observations into just 8 tokens to enable computationally efficient, real-time decision planning within world models while maintaining competitive performance.

Dongwon Kim, Gawon Seo, Jinsung Lee + 2 more · 2026-03-06 · cs.AI
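The mechanics of mapping an observation to a fixed budget of 8 discrete tokens can be sketched with plain vector quantization: split the feature vector into 8 chunks and replace each chunk with the index of its nearest codebook entry. This is the generic VQ idea with an invented codebook and shapes, not CompACT's actual architecture.

```python
# Hypothetical sketch: nearest-neighbour codebook lookup producing exactly
# `n_tokens` discrete tokens per observation.

def tokenize(observation, codebook, n_tokens=8):
    """Split a flat feature vector into `n_tokens` equal chunks and map
    each chunk to the index of its nearest codebook entry (squared L2)."""
    chunk = len(observation) // n_tokens
    tokens = []
    for t in range(n_tokens):
        seg = observation[t * chunk:(t + 1) * chunk]
        dists = [sum((a - b) ** 2 for a, b in zip(seg, code))
                 for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

codebook = [[0.0, 0.0], [1.0, 1.0]]   # 2 codes of dimension 2
obs = [0.1, 0.0, 0.9, 1.1] * 4        # 16-dim toy observation
tokens = tokenize(obs, codebook)
```

Planning then operates on these 8 integers rather than the full observation, which is what makes the search computationally cheap.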

NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries

The paper presents NaiLIA, a multimodal retrieval method that effectively aligns dense, multi-layered intent descriptions with user-specified color palettes to retrieve nail design images, outperforming existing vision-language models on a newly constructed benchmark of over 10,000 annotated images.

Kanon Amemiya, Daichi Yashima, Kei Katsumata + 4 more · 2026-03-06 · cs

RealWonder: Real-Time Physical Action-Conditioned Video Generation

RealWonder is the first real-time system that generates action-conditioned videos from a single image by bridging 3D reconstruction, physics simulation, and a distilled video generator to simulate the physical consequences of forces, robotic manipulations, and camera controls on various materials.

Wei Liu, Ziyu Chen, Zizhang Li + 3 more · 2026-03-06 · cs.AI

Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes

This paper introduces the Longest Stable Prefix (LSP) scheduler, a training-free inference paradigm for Diffusion Language Models that accelerates generation by up to 3.4x through contiguous prefix absorption, thereby resolving KV cache fragmentation and improving hardware efficiency without compromising output quality.

Pengxiang Li, Joey Tsai, Hongwei Xue + 2 more · 2026-03-06 · cs
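The contrast with scattered acceptance is easy to show: a baseline finalizes any position whose confidence clears a threshold (leaving holes that fragment the KV cache), while prefix absorption finalizes only the longest contiguous run from the start. The sketch below illustrates that scheduling idea with invented scores; it is not the LSP scheduler itself.

```python
# Hypothetical sketch: scattered acceptance vs contiguous-prefix acceptance
# over per-position confidence scores from one denoising step.

def scattered_accept(conf, thresh=0.9):
    """Baseline: accept every position above threshold, wherever it sits."""
    return [i for i, c in enumerate(conf) if c >= thresh]

def longest_stable_prefix(conf, thresh=0.9):
    """Accept positions 0..k-1 only while every one of them is stable, so
    the accepted region is always contiguous from the start."""
    k = 0
    while k < len(conf) and conf[k] >= thresh:
        k += 1
    return list(range(k))

conf = [0.99, 0.95, 0.97, 0.42, 0.96, 0.91]
scattered = scattered_accept(conf)     # holes fragment the KV cache
prefix = longest_stable_prefix(conf)   # contiguous, cache-friendly
```

The prefix variant accepts fewer positions per step, but every accepted position maps to a dense, reusable KV-cache region, which is where the hardware efficiency comes from.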

EdgeDAM: Real-time Object Tracking for Mobile Devices

EdgeDAM is a lightweight, real-time single-object tracking framework for mobile devices that achieves robust performance under occlusion and distractor interference by introducing a Dual-Buffer Distractor-Aware Memory mechanism and a Confidence-Driven Switching strategy with Held-Box Stabilization.

Syed Muhammad Raza, Syed Murtaza Hussain Abidi, Khawar Islam + 2 more · 2026-03-06 · cs

HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token

This paper introduces HALP, a method that detects hallucinations in vision-language models before text generation by analyzing internal representations, achieving high accuracy across diverse architectures and enabling efficient safety interventions like early abstention.

Sai Akhil Kogilathota, Sripadha Vallabha E G, Luzhe Sun + 1 more · 2026-03-06 · cs
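The pre-generation detection pattern is essentially a probe on internal state: score a pooled hidden representation with a learned classifier and abstain before emitting any token if the score is high. The weights, features, and threshold below are invented for illustration; HALP's learned probe is not shown here.

```python
# Hypothetical sketch: a linear-logistic probe on a pooled hidden-state
# vector, used to abstain before any generation happens.
import math

def probe_score(hidden, weights, bias=0.0):
    """Logistic score from a pooled hidden-state vector."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def should_abstain(hidden, weights, threshold=0.5):
    """Flag a likely hallucination before emitting a single token."""
    return probe_score(hidden, weights) >= threshold

weights = [1.5, -2.0, 0.5]        # hypothetical learned probe weights
grounded = [0.2, 0.9, 0.1]        # low score -> answer normally
ungrounded = [1.0, -0.5, 0.8]     # high score -> abstain early
```

Because the check runs before decoding, its cost is one dot product rather than a full generation pass, which is what makes interventions like early abstention cheap.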

Towards 3D Scene Understanding of Gas Plumes in LWIR Hyperspectral Images Using Neural Radiance Fields

This paper proposes a novel Neural Radiance Field (NeRF) approach enhanced with an adaptive weighted MSE loss to reconstruct 3D scenes from sparse Longwave Infrared Hyperspectral images, demonstrating its effectiveness in improving gas plume detection performance compared to standard methods.

Scout Jarman, Zigfried Hampel-Arias, Adra Carr + 1 more · 2026-03-06 · cs

Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

This paper introduces MM-Lifelong, a large-scale dataset of natural, unscripted daily life footage spanning up to a month, and proposes the Recursive Multimodal Agent (ReMA) to overcome the working memory and localization limitations of existing models in long-term multimodal understanding.

Guo Chen, Lidong Lu, Yicheng Liu + 17 more · 2026-03-06 · cs

Accelerating Text-to-Video Generation with Calibrated Sparse Attention

The paper introduces CalibAtt, a training-free method that accelerates text-to-video generation by identifying and skipping stable, negligible attention connections through an offline calibration process, achieving up to 1.58x speedup while maintaining generation quality across various models.

Shai Yehezkel, Shahar Yadin, Noam Elata + 2 more · 2026-03-06 · cs
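The offline calibration step can be pictured simply: record how much weight each query-key connection ever receives across a calibration set, then build a static mask that skips connections whose peak weight stays negligible. The grid sizes, numbers, and threshold below are invented to illustrate the pattern; this is not CalibAtt's actual procedure.

```python
# Hypothetical sketch: build a static sparsity mask from attention maps
# collected over calibration prompts.

def calibrate_mask(attn_maps, eps=0.05):
    """`attn_maps`: list of [Q][K] attention-weight grids from calibration
    prompts. Keep a connection iff its maximum observed weight exceeds
    `eps`; everything else is skipped at inference time."""
    q = len(attn_maps[0])
    k = len(attn_maps[0][0])
    return [[max(m[i][j] for m in attn_maps) > eps for j in range(k)]
            for i in range(q)]

# Two calibration runs over a toy 2x3 attention grid.
maps = [
    [[0.90, 0.08, 0.02], [0.01, 0.97, 0.02]],
    [[0.85, 0.13, 0.02], [0.02, 0.95, 0.03]],
]
mask = calibrate_mask(maps, eps=0.05)
kept = sum(v for row in mask for v in row)
```

Because the mask is computed once offline, inference pays no per-step selection cost, which is how such schemes stay training-free while still pruning work.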

FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning

FaceCam is a novel system that generates high-quality portrait videos with customizable camera trajectories by introducing a scale-aware conditioning representation and specialized data generation strategies, effectively overcoming geometric distortions and visual artifacts common in existing methods without relying on 3D priors.

Weijie Lyu, Ming-Hsuan Yang, Zhixin Shu · 2026-03-06 · cs



Thank you to arXiv, bioRxiv, and medRxiv for use of their open access interoperability.

Gist.Science is a product of Bition B.V.
Verdunplein 17, 5627SZ Eindhoven
KvK: 95743731 | BTW-ID: NL867271966B01
mail@gist.science

Made in the Netherlands 🇳🇱