LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs
This paper introduces LatentLens, a novel interpretability method that reveals the semantic meaning of visual tokens across all layers of Vision-Language Models by matching each visual token's hidden state to contextualized text representations. The results demonstrate that visual tokens are far more interpretable than previously believed.
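The core matching step can be sketched as a nearest-neighbor lookup: for a visual token's hidden state at some layer, find the closest entry in a bank of contextualized text representations. This is a minimal illustrative sketch assuming cosine similarity; the function name, the toy vectors, and the bank construction are all hypothetical, not the paper's actual implementation.

```python
import numpy as np

def nearest_text_match(visual_state, text_bank, text_labels):
    """Match one visual token hidden state to its nearest
    contextualized text representation by cosine similarity.

    visual_state: (d,) hidden state of a visual token at some layer
    text_bank:    (n, d) bank of contextualized text representations
    text_labels:  list of n strings describing each bank entry
    """
    v = visual_state / np.linalg.norm(visual_state)
    t = text_bank / np.linalg.norm(text_bank, axis=1, keepdims=True)
    sims = t @ v                      # cosine similarity to every bank entry
    idx = int(np.argmax(sims))
    return text_labels[idx], float(sims[idx])

# Toy example with a hypothetical 4-dim hidden space and 3 bank entries.
rng = np.random.default_rng(0)
text_bank = rng.normal(size=(3, 4))
labels = ["dog", "car", "tree"]
# A visual state lying near the "car" representation should match it.
visual = text_bank[1] + 0.01 * rng.normal(size=4)
label, score = nearest_text_match(visual, text_bank, labels)
```

In the method described above, this lookup would be repeated for every visual token at every layer, which is what lets interpretability be assessed layer by layer rather than only at the output.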