Text to Automata Diagrams: Comparing TikZ Code Generation with Direct Image Synthesis

This study evaluates the effectiveness of vision-language and large language models in converting scanned student-drawn automata diagrams into TikZ code, finding that while direct image-to-text generation often yields errors, human-corrected descriptions significantly improve the accuracy of the resulting digital diagrams for educational applications like automated grading.

Ethan Young, Zichun Wang, Aiden Taylor, Chance Jewell, Julian Myers, Satya Sri Rajiteswari Nimmagadda, Anthony White, Aniruddha Maiti, Ananya Jana · 2026-03-10 · cs
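
As a concrete illustration of the target output format, here is a minimal Python sketch that serializes a small automaton description into TikZ. The description schema (state list, accepting set, transition triples) is hypothetical and is not the paper's intermediate representation; the emitted code assumes a LaTeX preamble with \usetikzlibrary{automata, positioning}.

```python
# Minimal sketch: render a structured automaton description as TikZ code.
# The schema below is an illustrative assumption, not the paper's format.

def automaton_to_tikz(states, initial, accepting, transitions):
    """transitions: list of (src, symbol, dst) triples."""
    lines = [r"\begin{tikzpicture}[->, node distance=2.5cm, auto]"]
    for i, s in enumerate(states):
        opts = ["state"]
        if s == initial:
            opts.append("initial")
        if s in accepting:
            opts.append("accepting")
        pos = f", right of={states[i - 1]}" if i else ""   # simple left-to-right layout
        lines.append(rf"  \node[{', '.join(opts)}{pos}] ({s}) {{${s}$}};")
    for src, sym, dst in transitions:
        loop = "[loop above] " if src == dst else ""
        lines.append(rf"  \path ({src}) edge {loop}node {{{sym}}} ({dst});")
    lines.append(r"\end{tikzpicture}")
    return "\n".join(lines)

print(automaton_to_tikz(["q0", "q1"], "q0", {"q1"},
                        [("q0", "a", "q1"), ("q1", "b", "q1")]))
```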

VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer

VisualAD is a language-free, zero-shot anomaly detection framework that leverages a frozen Vision Transformer backbone with learnable normality and abnormality tokens, along with spatial-aware cross-attention and self-alignment modules, to achieve state-of-the-art performance across industrial and medical domains without relying on text encoders or cross-modal alignment.

Yanning Hou, Peiyuan Li, Zirui Liu, Yitong Wang, Yanran Ruan, Jianfeng Qiu, Ke Xu · 2026-03-10 · cs
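
To make the token-based design concrete, here is a minimal PyTorch sketch of anomaly scoring with two learnable tokens over frozen ViT patch features. Only the frozen-backbone-plus-two-tokens idea comes from the summary; the dimensions, cosine scoring, and softmax over the token pair are assumptions.

```python
import torch
import torch.nn.functional as F

class TokenAnomalyHead(torch.nn.Module):
    """Score each ViT patch by similarity to learned normality/abnormality tokens."""
    def __init__(self, dim=768):
        super().__init__()
        self.normal = torch.nn.Parameter(torch.randn(dim))    # learnable normality token
        self.abnormal = torch.nn.Parameter(torch.randn(dim))  # learnable abnormality token

    def forward(self, patch_tokens):                   # (B, N, D) from a frozen ViT
        toks = torch.stack([self.normal, self.abnormal])          # (2, D)
        sim = F.cosine_similarity(patch_tokens.unsqueeze(2),      # (B, N, 2)
                                  toks.view(1, 1, 2, -1), dim=-1)
        return sim.softmax(dim=-1)[..., 1]             # per-patch anomaly probability

head = TokenAnomalyHead()
feats = torch.randn(2, 196, 768)                       # e.g. 14x14 patch grid
print(head(feats).shape)                               # torch.Size([2, 196])
```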

SGG-R^3: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation

The paper introduces SGG-R^3, a structured reasoning framework that combines chain-of-thought-guided supervised fine-tuning with relation augmentation and a novel dual-granularity reward scheme in reinforcement learning to achieve end-to-end unbiased Scene Graph Generation with improved recall and reduced bias on long-tailed distributions.

Jiaye Feng, Qixiang Yin, Yuankun Liu, Tong Mo, Weiping Li · 2026-03-10 · cs
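
A minimal sketch of what a dual-granularity reward could look like: a coarse graph-level term (overall triplet recall) plus a fine relation-level term (mean per-predicate recall, so rare predicates count as much as frequent ones, which is what matters on long-tailed data). The exact terms and weighting in SGG-R^3 may differ; beta is an assumed mixing weight.

```python
from collections import defaultdict

def dual_granularity_reward(pred_triplets, gold_triplets, beta=0.5):
    """Triplets are (subject, predicate, object) tuples."""
    preds, gold = set(pred_triplets), set(gold_triplets)
    graph_reward = len(preds & gold) / max(len(gold), 1)       # coarse: triplet recall

    per_predicate = defaultdict(list)                          # fine: per-predicate recall
    for s, p, o in gold:
        per_predicate[p].append((s, p, o) in preds)
    relation_reward = sum(sum(v) / len(v) for v in per_predicate.values())
    relation_reward /= max(len(per_predicate), 1)

    return beta * graph_reward + (1 - beta) * relation_reward

print(dual_granularity_reward(
    [("man", "riding", "horse")],
    [("man", "riding", "horse"), ("horse", "on", "grass")]))   # 0.5
```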

Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time

This paper introduces EcoG-Bench, a rigorous bilingual benchmark for egocentric co-speech grounding that reveals a significant performance gap between humans and state-of-the-art MLLMs, highlighting how multimodal interface limitations rather than reasoning deficits hinder the alignment of speech with pointing gestures in situated collaboration.

Weijie Zhou, Xuantang Xiong, Zhenlin Hu, Xiaomeng Zhu, Chaoyang Zhao, Honghui Dong, Zhengyou Zhang, Ming Tang, Jinqiao Wang · 2026-03-10 · cs

Extend Your Horizon: A Device-Agnostic Surgical Tool Tracking Framework with Multi-View Optimization for Augmented Reality

This paper presents a device-agnostic surgical tool tracking framework that fuses multiple sensing modalities within a dynamic scene graph to overcome line-of-sight occlusions and enhance the robustness of augmented reality visualization in operating rooms.

Jiaming Zhang, Mingxu Liu, Hongchao Shu, Ruixing Liang, Yihao Liu, Ojas Taskar, Amir Kheradmand, Mehran Armand, Alejandro Martin-Gomez · 2026-03-10 · cs
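
A minimal sketch of the fusion idea: a scene-graph node that merges tool-pose estimates from several sensors by confidence weighting and falls back to the last fused pose when every sensor is occluded. The weighting rule and the translation-only blending are assumptions; a real tracker would average rotations on SE(3) as well.

```python
import numpy as np

class ToolNode:
    """One node of a dynamic scene graph holding a fused 4x4 tool pose."""
    def __init__(self):
        self.pose = np.eye(4)                       # last fused pose

    def update(self, estimates):
        """estimates: list of (pose 4x4, confidence in [0, 1]) per sensor."""
        visible = [(p, c) for p, c in estimates if c > 0.0]
        if not visible:                             # all sensors occluded:
            return self.pose                        # keep the last fused pose
        total = sum(c for _, c in visible)
        t = sum(c * p[:3, 3] for p, c in visible) / total   # weighted translation
        self.pose = max(visible, key=lambda e: e[1])[0].copy()  # best-sensor rotation
        self.pose[:3, 3] = t
        return self.pose

node = ToolNode()
print(node.update([(np.eye(4), 0.9), (np.eye(4), 0.0)])[:3, 3])  # [0. 0. 0.]
```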

On the Feasibility and Opportunity of Autoregressive 3D Object Detection

The paper introduces AutoReg3D, an autoregressive 3D object detector that reformulates LiDAR-based detection as a sequence generation task using a near-to-far ordering to eliminate reliance on hand-crafted components like anchors and NMS, thereby achieving competitive performance while enabling the integration of advanced language model techniques such as reinforcement learning.

Zanming Huang, Jinsu Yoo, Sooyoung Jeon, Zhenzhen Liu, Mark Campbell, Kilian Q Weinberger, Bharath Hariharan, Wei-Lun Chao, Katie Z Luo · 2026-03-10 · cs
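
The near-to-far ordering is simple to illustrate: sequence targets are sorted by range from the ego vehicle, so the autoregressive decoder emits close (typically better-observed) objects first. A minimal sketch, assuming a standard [x, y, z, dx, dy, dz, yaw] box layout in the ego frame:

```python
import numpy as np

def near_to_far_sequence(boxes):
    """Sort boxes by BEV distance to the ego vehicle for sequence targets."""
    ranges = np.linalg.norm(boxes[:, :2], axis=1)   # distance in the x-y plane
    return boxes[np.argsort(ranges)]

boxes = np.array([[30.0, 5.0, 0.0, 4.0, 2.0, 1.5, 0.0],
                  [ 8.0, 1.0, 0.0, 4.0, 2.0, 1.5, 0.1]])
print(near_to_far_sequence(boxes)[:, 0])            # [ 8. 30.]
```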

ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation

This paper proposes the ViSA-enhanced framework, a triple-phase collaborative architecture that leverages structured visual prompting to enable Vision-Language Models to perform direct spatial reasoning on image planes, achieving a 70.3% improvement in success rate over state-of-the-art aerial Vision-Language Navigation methods on the CityNav benchmark.

Haoyu Tong, Xiangyu Dong, Xiaoguang Ma, Haoran Zhao, Yaoming Zhou, Chenghao Lin · 2026-03-10 · cs
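
One common form of structured visual prompting is overlaying a labeled grid on the observation so the VLM can answer in cell coordinates (e.g. "fly toward B3") and reason directly on the image plane. The sketch below shows that generic idea with PIL; the grid size and label scheme are assumptions, not ViSA's actual prompt design.

```python
from PIL import Image, ImageDraw

def add_grid_prompt(img, rows=4, cols=4):
    """Overlay a labeled reference grid (A1..D4) on an image."""
    img = img.convert("RGB").copy()
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for r in range(rows):
        for c in range(cols):
            x0, y0 = c * w / cols, r * h / rows
            draw.rectangle([x0, y0, x0 + w / cols, y0 + h / rows], outline="red")
            draw.text((x0 + 4, y0 + 2), f"{chr(65 + r)}{c + 1}", fill="red")
    return img

add_grid_prompt(Image.new("RGB", (640, 480))).save("prompted.png")
```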

It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models

This paper addresses the significant challenge of analog clock reading in state-of-the-art Vision-Language Models by introducing the real-world, diverse TickTockVQA dataset and the Swap-DPO fine-tuning framework, which together substantially improve spatial-temporal reasoning and accuracy under complex visual conditions.

Jaeha Choi, Jin Won Lee, Siwoo You, Jangho Lee · 2026-03-10 · cs
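
For orientation, here is a standard DPO-style preference loss plus one guess at how a "swap" hard negative could be built for clocks: reading the minute hand as the hour hand and vice versa. The paper defines the real Swap-DPO pair construction; both swapped_reading and the beta value below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO: inputs are summed token log-probs under policy / frozen reference."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

def swapped_reading(h, m):
    """Hypothetical hard negative: interpret each hand as the other one."""
    return f"{(m // 5) or 12}:{h * 5 % 60:02d}"

print(swapped_reading(4, 50))                        # "10:20" for a clock showing 4:50
print(dpo_loss(torch.tensor([-2.0]), torch.tensor([-5.0]),
               torch.tensor([-3.0]), torch.tensor([-4.0])).item())
```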

Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared

This paper proposes "Missing No More," a novel dictionary-guided framework that addresses the challenge of missing infrared modality in image fusion by learning a shared convolutional dictionary to enable interpretable coefficient-domain inference and fusion, thereby avoiding uncontrolled pixel-space generation while improving perceptual quality and downstream detection performance.

Yafei Zhang, Meng Ma, Huafeng Li, Yu Liu · 2026-03-10 · cs
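
Coefficient-domain inference with a shared convolutional dictionary can be sketched with a few unrolled ISTA steps, fusing in the coefficient domain and decoding once at the end. Everything here (atom count, step count, the max-fusion rule, and ir_hat standing in for coefficients recovered for the missing infrared input) is an assumption, not the paper's algorithm.

```python
import torch
import torch.nn.functional as F

K, k = 16, 5                                        # number of atoms, kernel size
D = torch.randn(K, 1, k, k) * 0.1                   # shared convolutional dictionary

def soft(x, t):
    """Soft-thresholding (proximal step for the L1 sparsity penalty)."""
    return torch.sign(x) * torch.clamp(x.abs() - t, min=0)

def infer_codes(x, steps=20, lam=0.01, eta=0.1):
    """ISTA: alternate a gradient step on the reconstruction with shrinkage."""
    z = torch.zeros(x.shape[0], K, x.shape[2], x.shape[3])
    for _ in range(steps):
        resid = x - F.conv_transpose2d(z, D, padding=k // 2)   # decode and compare
        z = soft(z + eta * F.conv2d(resid, D, padding=k // 2), lam * eta)
    return z

vis, ir_hat = torch.rand(1, 1, 32, 32), torch.rand(1, 1, 32, 32)
fused_codes = torch.maximum(infer_codes(vis), infer_codes(ir_hat))  # fuse coefficients
fused = F.conv_transpose2d(fused_codes, D, padding=k // 2)          # decode to image
print(fused.shape)                                  # torch.Size([1, 1, 32, 32])
```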

Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades

This paper proposes a two-stage cascaded framework that generates controllable complex human motion videos by first using an autoregressive model to synthesize 2D skeleton sequences from text descriptions and then employing a pose-conditioned diffusion model with adaptive layer fusion to render high-fidelity videos, supported by a new synthetic dataset designed to overcome limitations in existing benchmarks.

Ashkan Taghipour, Morteza Ghahremani, Zinuo Li, Hamid Laga, Farid Boussaid, Mohammed Bennamoun · 2026-03-10 · cs
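
A minimal sketch of what per-layer adaptive fusion of pose conditioning could look like inside the diffusion model: each denoiser layer blends its features with projected skeleton features through a learned gate. The sigmoid-gate form is an assumption, not the paper's fusion module.

```python
import torch

class AdaptiveLayerFusion(torch.nn.Module):
    """Blend video features with pose features via a learned per-channel gate."""
    def __init__(self, channels):
        super().__init__()
        self.proj = torch.nn.Conv2d(channels, channels, 1)       # project pose features
        self.gate = torch.nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, video_feat, pose_feat):
        g = torch.sigmoid(self.gate)     # starts at 0.5; training adapts pose influence
        return video_feat + g * self.proj(pose_feat)

fuse = AdaptiveLayerFusion(64)
print(fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)).shape)
```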

QualiTeacher: Quality-Conditioned Pseudo-Labeling for Real-World Image Restoration

QualiTeacher introduces a novel framework for real-world image restoration that transforms imperfect pseudo-labels into conditional supervisory signals by explicitly conditioning the student model on estimated label quality, thereby enabling the learning of a quality-graded restoration manifold that avoids artifact mimicry and achieves superior generalization.

Fengyang Xiao, Jingjia Feng, Peng Hu, Dingming Zhang, Lei Xu, Guanyi Qin, Lu Li, Chunming He, Sina Farsiu · 2026-03-10 · cs
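
One way to realize "conditioning the student on estimated label quality" is FiLM-style feature modulation driven by a quality scalar q, with the supervision weighted by q so poor pseudo-labels are not imitated outright. The sketch below shows that pattern; the FiLM form and the quality-weighted L1 loss are assumptions, not QualiTeacher's exact design.

```python
import torch

class QualityFiLM(torch.nn.Module):
    """Map a quality scalar q to per-channel (scale, shift) feature modulation."""
    def __init__(self, channels):
        super().__init__()
        self.mlp = torch.nn.Linear(1, 2 * channels)

    def forward(self, feat, q):                     # feat: (B,C,H,W), q: (B,1)
        scale, shift = self.mlp(q).chunk(2, dim=-1)
        return feat * (1 + scale[..., None, None]) + shift[..., None, None]

film = QualityFiLM(32)
feat, q = torch.randn(4, 32, 16, 16), torch.rand(4, 1)
out = film(feat, q)
loss = (q.view(-1, 1, 1, 1) * (out - feat).abs()).mean()   # quality-weighted L1 (sketch)
print(out.shape, loss.item() >= 0)
```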

Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout

This paper presents a robust multimodal framework for the 10th ABAW Expression Recognition Challenge that utilizes a dual-branch Transformer with safe cross-attention and modality dropout to dynamically fuse audio and visual data, effectively addressing partial occlusions, missing modalities, and class imbalance to achieve 60.79% accuracy on the Aff-Wild2 validation set.

Jun Yu, Naixiang Zheng, Guoyuan Wang, Yunxiang Zhang, Lingsi Zhu, Jiaen Liang, Wei Huang, Shengping Liu · 2026-03-10 · cs
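
The two robustness tricks named above are easy to sketch together: modality dropout randomly removes one stream during training, and a "safe" cross-attention falls back to the surviving stream when the other is missing. The residual-fusion fallback and the dropout rate are assumptions, not the paper's exact module.

```python
import torch

class SafeCrossAttention(torch.nn.Module):
    """Video queries attend to audio; skip cleanly when audio is absent."""
    def __init__(self, dim=256, heads=4, p_drop=0.3):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.p_drop = p_drop

    def forward(self, video, audio):
        if self.training and torch.rand(()) < self.p_drop:
            audio = None                            # modality dropout: drop the stream
        if audio is None:                           # safe path: bypass cross-attention
            return video
        out, _ = self.attn(video, audio, audio)
        return video + out                          # residual fusion

layer = SafeCrossAttention()
v, a = torch.randn(2, 10, 256), torch.randn(2, 20, 256)
print(layer(v, a).shape)                            # torch.Size([2, 10, 256])
```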