cs.CV papers | Gist.Science

FreeFly-Thinking : Aligning Chain-of-Thought Reasoning with Continuous UAV Navigation

The paper introduces FreeFly-Thinking, an end-to-end Vision-Language Navigation framework for UAVs that leverages a two-stage training strategy and explicit chain-of-thought reasoning to achieve robust and efficient navigation in complex outdoor urban environments.

Jiaxu Zhou, Shaobo Wang, Zhiyuan Yang, Zhenjun Yu, Tao LiTue, 10 Ma💻 cs

FastSTAR: Spatiotemporal Token Pruning for Efficient Autoregressive Video Synthesis

FastSTAR is a training-free acceleration framework for Spacetime Autoregressive (STAR) video generation that utilizes a novel Spatiotemporal Token Pruning strategy to identify and skip redundant computations, achieving up to a 2.01x speedup with minimal quality degradation.

Sungwoong Yune, Suheon Jeong, Joo-Young KimTue, 10 Ma💻 cs

Shaping Parameter Contribution Patterns for Out-of-Distribution Detection

This paper proposes Shaping Parameter Contribution Patterns (SPCP), a training-time method that enhances out-of-distribution detection by encouraging classifiers to adopt dense, boundary-oriented parameter contribution patterns instead of relying on sparse, brittle ones that lead to overconfident predictions on anomalous inputs.

Haonan Xu, Yang YangTue, 10 Ma🤖 cs.LG

VINO: Video-driven Invariance for Non-contextual Objects via Structural Prior Guided De-contextualization

VINO is a self-supervised learning framework that overcomes the "co-occurrence trap" in dense video by using a teacher-student distillation approach with structural priors to force representations to focus on foreground objects rather than background context, achieving state-of-the-art unsupervised object discovery performance.

Seul-Ki Yeom, Marcel Simon, Eunbin Lee, Tae-Ho KimTue, 10 Ma💻 cs

LightMedSeg: Lightweight 3D Medical Image Segmentation with Learned Spatial Anchors

LightMedSeg is a lightweight, modular 3D medical image segmentation architecture that leverages anatomical priors, adaptive context modeling, and computational efficiency techniques to achieve high accuracy with minimal parameters and FLOPs, making it a deployable solution for resource-constrained clinical environments.

Kavyansh Tyagi, Vishwas Rathi, Puneet GoyalTue, 10 Ma🤖 cs.LG

Single Image Super-Resolution via Bivariate `A Trous Wavelet Diffusion

This paper introduces BATDiff, an unsupervised single-image super-resolution model that leverages bivariate A-trous wavelet transforms and cross-scale parent-child dependencies to generate sharper, more structurally consistent high-frequency details while minimizing artifacts and dataset-driven hallucinations.

Heidari Maryam, Anantrasirichai Nantheera, Achim AlinTue, 10 Ma💻 cs

HY-WU (Part I): An Extensible Functional Neural Memory Framework and An Instantiation in Text-Guided Image Editing

The paper proposes HY-WU, a memory-first adaptation framework that replaces static weight overwriting with a functional neural memory module to synthesize instance-specific weight updates on-the-fly, thereby enabling robust continual learning and personalization without degrading previously learned behaviors.

Tencent HY TeamTue, 10 Ma💻 cs

FabricGen: Microstructure-Aware Woven Fabric Generation

FabricGen is an end-to-end framework that generates realistic woven fabric materials from text by decoupling macro-scale texture synthesis via fine-tuned diffusion models from micro-scale weaving pattern generation using a specialized LLM-driven procedural geometric model.

Yingjie Tang, Di Luo, Zixiong Wang, Xiaoli Ling, jian Yang, Beibei WangTue, 10 Ma💻 cs

PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation

This paper introduces PresentBench, a fine-grained, rubric-based benchmark comprising 238 instances and detailed binary checklists that enables more reliable, human-aligned evaluation of automated slide generation, revealing that NotebookLM currently outperforms other methods in this domain.

Xin-Sheng Chen, Jiayu Zhu, Pei-lin Li, Hanzheng Wang, Shuojin Yang, Meng-Hao GuoTue, 10 Ma💻 cs

LEPA: Learning Geometric Equivariance in Satellite Remote Sensing Data with a Predictive Architecture

This paper introduces LEPA, a learned architecture that conditions on geometric augmentations to accurately predict transformed satellite image embeddings, effectively overcoming the limitations of standard interpolation in non-convex geospatial foundation model manifolds and significantly improving geometric adjustment performance.

Erik Scheurer, Rocco Sedona, Stefan Kesselheim, Gabriele CavallaroTue, 10 Ma💻 cs

Variational Flow Maps: Make Some Noise for One-Step Conditional Generation

This paper introduces Variational Flow Maps (VFMs), a framework that enables high-quality, single-step conditional generation and inverse problem solving by learning a noise adapter to align the initial noise distribution with observations, thereby bypassing the need for iterative sampling trajectories required by traditional diffusion models.

Abbas Mammadov, So Takao, Bohan Chen, Ricardo Baptista, Morteza Mardani, Yee Whye Teh, Julius BernerTue, 10 Ma🤖 cs.LG

Virtual Try-On for Cultural Clothing: A Benchmarking Study

This paper introduces BD-VITON, a new benchmark dataset featuring culturally diverse Bangladeshi garments with complex draping and layering challenges, and evaluates the performance of state-of-the-art virtual try-on models on this dataset to demonstrate significant improvements over zero-shot inference.

Muhammad Tausif Ul Islam, Shahir Awlad, Sameen Yeaser Adib, Md. Atiqur Rahman, Sabbir Ahmed, Md. Hasanul KabirTue, 10 Ma💻 cs

MAviS: A Multimodal Conversational Assistant For Avian Species

This paper introduces MAviS, a domain-adaptive multimodal conversational assistant for avian species that leverages the newly created MAviS-Dataset and is evaluated on the MAviS-Bench to achieve state-of-the-art performance in fine-grained bird species understanding and multimodal question answering.

Yevheniia Kryklyvets, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jinxing Zhou, Fahad Shabzan Khan, Rao Anwer, Salman Khan, Hisham CholakkalTue, 10 Ma💻 cs

Training for Trustworthy Saliency Maps: Adversarial Training Meets Feature-Map Smoothing

This paper proposes a training-centered approach that combines adversarial training with a lightweight feature-map smoothing block to generate saliency maps that are simultaneously sparse, input-stable, and output-stable, thereby enhancing their perceived trustworthiness and sufficiency.

Dipkamal Bhusal, Md Tanvirul Alam, Nidhi RastogiTue, 10 Ma💻 cs

StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models

This paper introduces StructSAM, a novel token merging framework that preserves structural boundaries and spectral properties in Segment Anything Models (SAM) by using gradient-based energy scores and grid-based screening to achieve significant computational savings with minimal accuracy loss across natural and medical imaging benchmarks.

Duy M. H. Nguyen, Tuan A. Tran, Duong Nguyen, Siwei Xie, Trung Q. Nguyen, Mai T. N. Truong, Daniel Palenicek, An T. Le, Michael Barz, TrungTin Nguyen, Tuan Dam, Ngan Le, Minh Vu, Khoa Doan, Vien Ngo, Pengtao Xie, James Zou, Daniel Sonntag, Jan Peters, Mathias NiepertTue, 10 Ma🤖 cs.LG

Faster-HEAL: An Efficient and Privacy-Preserving Collaborative Perception Framework for Heterogeneous Autonomous Vehicles

Faster-HEAL is a lightweight, privacy-preserving collaborative perception framework that addresses the challenges of heterogeneous autonomous vehicles by using low-rank visual prompt fine-tuning and pyramid fusion to align diverse features into a unified space, achieving superior detection performance with significantly reduced computational overhead compared to state-of-the-art methods.

Armin Maleki, Hayder RadhaTue, 10 Ma💻 cs

A Lightweight Digital-Twin-Based Framework for Edge-Assisted Vehicle Tracking and Collision Prediction

This paper proposes a lightweight, digital-twin-based framework for edge-assisted vehicle tracking and collision prediction that achieves approximately 88% accuracy by relying on object detection and K-D tree-indexed path maps instead of computationally intensive trajectory prediction models.

Murat Arda Onsu, Poonam Lohan, Burak Kantarci, Aisha Syed, Matthew Andrews, Sean KennedyTue, 10 Ma💻 cs

AgrI Challenge: A Data-Centric AI Competition for Cross-Team Validation in Agricultural Vision

The AgrI Challenge introduces a data-centric competition framework featuring Cross-Team Validation to demonstrate that while single-source training suffers from significant generalization gaps in agricultural vision, collaborative multi-source training on independently collected, heterogeneous datasets dramatically improves model robustness and real-world performance.

Mohammed Brahimi, Karim Laabassi, Mohamed Seghir Hadj Ameur, Aicha Boutorh, Badia Siab-Farsi, Amin Khouani, Omar Farouk Zouak, Seif Eddine Bouziane, Kheira Lakhdari, Abdelkader Nabil BenghanemTue, 10 Ma🤖 cs.LG

N-Tree Diffusion for Long-Horizon Wildfire Risk Forecasting

The paper introduces N-Tree Diffusion (NT-Diffusion), a hierarchical diffusion model that improves long-horizon wildfire risk forecasting by sharing early denoising stages across prediction horizons to reduce computational redundancy while maintaining probabilistic accuracy.

Yucheng Xing, Xin WangTue, 10 Ma🤖 cs.LG

AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions

This paper introduces AQuA, a fine-grained dataset that categorizes ambiguous visual questions into four levels with corresponding optimal response strategies, demonstrating that fine-tuning Vision-Language Models on this dataset enables them to effectively recognize ambiguity and adaptively generate context-appropriate responses such as seeking clarification or listing alternatives, thereby outperforming existing baselines.

Jihyoung Jang, Hyounghun KimTue, 10 Ma💬 cs.CL

← Previous Next →