Efficient Coupled-Cluster Python Frameworks for Next-Generation GPUs: A Comparative Study of CuPy and PyTorch on the Hopper and Grace Hopper Architecture
This paper presents new batching algorithms and a generic tensor contraction protocol for coupled-cluster singles and doubles (CCSD) calculations on NVIDIA Hopper and Grace Hopper GPUs. Optimized implementations using CuPy and PyTorch achieve up to a 16-fold speedup over previous hybrid CPU-GPU approaches. PyTorch shows a 20% performance advantage on the H100, while both libraries perform similarly on the GH200.
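To make the idea of a batched tensor contraction concrete, the following is a minimal sketch and not the paper's algorithm. It evaluates a CCSD-like particle-particle ladder term, R[i,j,a,b] = sum over c,d of V[a,b,c,d] T[i,j,c,d], one slab of the virtual index a at a time, so that only a slice of the large integral tensor needs to reside in device memory. The function name `ladder_term_batched`, the parameter `batch_size`, and the choice of batching over the first virtual index are assumptions made for illustration. NumPy serves as a stand-in backend here; passing `xp=cupy` should work in the same way, since CuPy provides `asarray`, `zeros`, and `einsum` with matching semantics.

```python
import numpy as np  # stand-in backend; CuPy mirrors the calls used below


def ladder_term_batched(v_vvvv, t2, batch_size, xp=np):
    """Hypothetical helper: R[i,j,a,b] = sum_{c,d} V[a,b,c,d] * T[i,j,c,d].

    The contraction is evaluated in slabs along the first virtual index `a`,
    so only `batch_size` rows of V are moved to the backend at once.
    """
    nvirt = v_vvvv.shape[0]
    t2_dev = xp.asarray(t2)  # amplitudes stay resident on the backend
    result = xp.zeros(t2.shape, dtype=t2_dev.dtype)
    for start in range(0, nvirt, batch_size):
        stop = min(start + batch_size, nvirt)
        # Transfer only the current slab V[start:stop, :, :, :].
        v_block = xp.asarray(v_vvvv[start:stop])
        # Contract over c and d; the output slab has shape (i, j, stop-start, b).
        result[:, :, start:stop, :] = xp.einsum(
            "abcd,ijcd->ijab", v_block, t2_dev
        )
    return result


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nocc, nvirt = 2, 5
    v = rng.standard_normal((nvirt, nvirt, nvirt, nvirt))
    t2 = rng.standard_normal((nocc, nocc, nvirt, nvirt))
    batched = ladder_term_batched(v, t2, batch_size=2)
    full = np.einsum("abcd,ijcd->ijab", v, t2)
    print(np.allclose(batched, full))
```

In this sketch a smaller `batch_size` lowers peak memory use at the cost of more, smaller contractions; how the paper chooses its batch dimensions and sizes is not stated in this summary.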