cs.AR papers | Gist.Science

Nemo: A Low-Write-Amplification Cache for Tiny Objects on Log-Structured Flash Devices

Nemo is a novel flash cache design that reduces application-level write amplification for tiny-object workloads by intentionally increasing hash collisions to improve set fill rates, while simultaneously maintaining high memory efficiency and low miss ratios through a bloom filter-based indexing mechanism and hybrid hotness tracking.

Xufeng Yang, Tingting Tan, Jingxin Hu, Congming Gao, Mingyang Liu, Tianyang Jiang, Jian Chen, Linbo Long, Yina Lv, Jiwu ShuWed, 11 Ma💻 cs

bsort: A theoretically efficient non-comparison-based sorting algorithm for integer and floating-point numbers

The paper introduces bsort, a non-comparison-based sorting algorithm that unifies the sorting of signed/unsigned integers and floating-point numbers with an asymptotic time complexity of $O(wn)$ and auxiliary space of $O(w)$ , demonstrating competitive performance against optimized hybrid libraries for small word sizes.

Benjamín GuzmánWed, 11 Ma💻 cs

ChatNeuroSim: An LLM Agent Framework for Automated Compute-in-Memory Accelerator Deployment and Optimization

This paper introduces ChatNeuroSim, an LLM-based agent framework that automates the entire Compute-in-Memory (CIM) accelerator deployment and optimization workflow—including simulation, parameter management, and design space pruning—to significantly reduce the time and effort required for design space exploration.

Ming-Yen Lee, Shimeng YuWed, 11 Ma💻 cs

Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service

This paper introduces Kareto, an adaptive multi-objective optimizer that efficiently navigates the complex configuration space of tiered KV cache storage to dynamically balance cost, throughput, and latency, significantly outperforming static strategies in LLM inference services.

Xianzhe Zheng, Zhengheng Wang, Ruiyan Ma, Rui Wang, Xiyu Wang, Rui Chen, Peng Zhang, Sicheng Pan, Zhangheng Huang, Chenxin Wu, Yi Zhang, Bo Cai, Kan Liu, Teng Ma, Yin Du, Dong Deng, Sai Wu, Guoyun Zhu, Wei Zhang, Feifei LiWed, 11 Ma💻 cs

FormalRTL: Verified RTL Synthesis at Scale

FormalRTL is a novel end-to-end multi-agent framework that leverages software reference models as formal specifications to enable scalable, verified, and reliable register-transfer level (RTL) code generation for complex industrial hardware designs.

Kezhi Li, Min Li, Xiangyu Wen, Shibo Zhao, Jieying Wu, Junhua Huang, Qiang XuWed, 11 Ma💻 cs

Fair and Square: Replacing One Real Multiplication with a Single Square and One Complex Multiplication with Three Squares When Performing Matrix Multiplication and Convolutions

This paper proposes a method to significantly reduce hardware resource requirements for matrix multiplications and convolutions by asymptotically replacing real and complex multiplications with fewer squaring operations, which are more efficient to implement in circuits.

Vincenzo LiguoriWed, 11 Ma💻 cs

AnalogToBi: Device-Level Analog Circuit Topology Generation via Bipartite Graph and Grammar Guided Decoding

AnalogToBi is a novel framework for automatic device-level analog circuit topology generation that combines bipartite graph representations, grammar-guided decoding, and data augmentation to achieve high validity and novelty while ensuring electrical correctness without human intervention.

Seungmin Kim, Mingun Kim, Yuna Lee, Yulhwa KimWed, 11 Ma💻 cs

VeriInteresting: An Empirical Study of Model Prompt Interactions in Verilog Code Generation

This paper presents an empirical study mapping the interactions between model characteristics and prompt engineering strategies for Verilog code generation, revealing which trends generalize across diverse language models and benchmarks through controlled experiments.

Luca Collini, Andrew Hennesee, Patrick Yubeaton, Siddharth Garg, Ramesh KarriWed, 11 Ma💻 cs

A Hybrid Residue Floating Numerical Architecture with Formal Error Bounds for High Throughput FPGA Computation

This paper introduces the Hybrid Residue Floating Numerical Architecture (HRFNA), a formally verified numerical system combining carry-free residue arithmetic with lightweight exponent scaling that achieves significantly higher throughput, reduced resource usage, and improved energy efficiency on FPGAs compared to IEEE 754 standards while maintaining rigorous, bounded numerical error.

Mostafa DarvishiWed, 11 Ma💻 cs

TrainDeeploy: Hardware-Accelerated Parameter-Efficient Fine-Tuning of Small Transformer Models at the Extreme Edge

TrainDeeploy is a novel framework that enables efficient, parameter-efficient on-device fine-tuning of both CNN and Transformer models on ultra-low-power, memory-constrained RISC-V SoCs, achieving significant reductions in memory usage and computational overhead while supporting end-to-end training at the extreme edge.

Run Wang, Victor J. B. Jung, Philip Wiese, Francesco Conti, Alessio Burrello, Luca BeniniWed, 11 Ma🤖 cs.LG

The AetherFloat Family: Block-Scale-Free Quad-Radix Floating-Point Architectures for AI Accelerators

The AetherFloat Family introduces a novel block-scale-free, quad-radix floating-point architecture that eliminates the hardware overhead of dynamic scaling and IEEE 754 inefficiencies, achieving significant area, power, and latency reductions in AI accelerators through explicit mantissas, base-4 scaling, and stochastic rounding.

Keita MorisakiWed, 11 Ma🤖 cs.LG

Data-Rate-Aware High-Speed CNN Inference on FPGAs

This paper presents a data-rate-aware CNN accelerator architecture for FPGAs that utilizes multi-pixel processing and design-space exploration to optimize hardware utilization and resource efficiency across varying data rates, thereby enabling the efficient implementation of complex CNNs on a single device.

Tobias Habermann, Martin KummWed, 11 Ma🤖 cs.LG

Performance Analysis of Edge and In-Sensor AI Processors: A Comparative Review

This paper reviews the landscape of ultra-low-power edge and in-sensor AI processors and empirically benchmarks a segmentation model on GAP9, STM32N6, and Sony IMX500 platforms to demonstrate that while in-sensor processing offers superior energy-delay performance, different architectures provide distinct trade-offs between latency, energy efficiency, and power budgets.

Luigi Capogrosso, Pietro Bonazzi, Michele MagnoWed, 11 Ma🤖 cs.LG

KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware

KernelCraft introduces the first benchmark evaluating agentic LLM systems that use feedback-driven workflows to automatically generate and optimize low-level kernels for emerging hardware with novel ISAs, demonstrating their ability to produce valid, high-performance code that rivals or exceeds traditional compiler baselines.

Jiayi Nie, Haoran Wu, Yao Lai, Zeyu Cao, Cheng Zhang, Binglei Lou, Erwei Wang, Jianyi Cheng, Timothy M. Jones, Robert Mullins, Rika Antonova, Yiren ZhaoWed, 11 Ma🤖 cs.LG

Two Teachers Better Than One: Hardware-Physics Co-Guided Distributed Scientific Machine Learning

The paper introduces EPIC, a hardware- and physics-co-guided distributed scientific machine learning framework that significantly reduces communication latency and energy consumption while preserving physical fidelity by performing lightweight local encoding and physics-aware decoding with cross-attention for tasks like full-waveform inversion.

Yuchen Yuan, Junhuan Yang, Hao Wan, Yipei Liu, Hanhan Wu, Youzuo Lin, Lei YangWed, 11 Ma🤖 cs.LG

The $qs$ Inequality: Quantifying the Double Penalty of Mixture-of-Experts at Inference

This paper introduces the $qs$ inequality to demonstrate that Mixture-of-Experts (MoE) models suffer from a structural "double penalty" of routing fragmentation and memory constraints during inference, often rendering them significantly less efficient than quality-matched dense models for long-context serving despite their training-time FLOP advantages.

Vignesh Adhinarayanan, Nuwan JayasenaWed, 11 Ma🤖 cs.LG

DendroNN: Dendrocentric Neural Networks for Energy-Efficient Classification of Event-Based Data

This paper introduces DendroNN, a novel dendrocentric neural network that leverages non-differentiable sequence detection and a rewiring phase to efficiently classify event-based spatiotemporal data, achieving competitive accuracy with up to 4x higher energy efficiency than state-of-the-art neuromorphic hardware through a dedicated asynchronous digital architecture.

Jann Krausse, Zhe Su, Kyrus Mama, Maryada, Klaus Knobloch, Giacomo Indiveri, Jürgen BeckerWed, 11 Ma🤖 cs.AI

Wrong Code, Right Structure: Learning Netlist Representations from Imperfect LLM-Generated RTL

This paper proposes a cost-effective framework that leverages structurally informative but functionally imperfect LLM-generated RTL to train netlist representation models, effectively overcoming data scarcity and outperforming methods reliant on scarce high-quality labeled datasets.

Siyang Cai, Cangyuan Li, Yinhe Han, Ying WangWed, 11 Ma🤖 cs.AI

Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

This paper presents a systematic layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4 quantization across three Qwen2.5 model scales, revealing that MLP up- and down-projection layers are the most sensitive components while sensitivity patterns vary by format and model depth rather than being confined to final blocks.

Musa Cim, Burak Topcu, Mahmut Taylan KandemirWed, 11 Ma🤖 cs.AI

Architectural Design and Performance Analysis of FPGA based AI Accelerators: A Comprehensive Review

This paper reviews FPGA-based AI accelerators for deep learning, highlighting their advantages over ASICs and GPUs, detailing key hardware optimization techniques such as loop pipelining and quantization, and analyzing state-of-the-art designs to identify challenges for future innovations.

Soumita Chatterjee, Sudip Ghosh, Tamal Ghosh, Hafizur RahamanWed, 11 Ma🤖 cs.AI

cs.AR