Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning
Prism is a memory-centric framework for co-serving multiple LLMs on shared GPUs. It uses kvcached, a novel GPU memory ballooning technique, to dynamically reclaim GPU memory from models that are not using it and reallocate it to models that need it. This unifies spatial and temporal sharing and improves cost-efficiency and SLO adherence in production environments.
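To make the ballooning idea concrete, here is a minimal toy sketch in Python. It is not the kvcached API or Prism's implementation: the class names `SharedPagePool` and `ModelKVCache`, the page-granular accounting, and the all-or-nothing growth policy are assumptions made purely for illustration, and no real GPU memory is touched. The sketch only shows the general pattern of models mapping memory on demand from a shared pool and handing idle memory back so a co-located model can use it.

```python
class SharedPagePool:
    """Hypothetical stand-in for one GPU's physical memory, counted in pages."""

    def __init__(self, total_pages: int) -> None:
        self.free_pages = total_pages


class ModelKVCache:
    """Hypothetical per-model KV cache that borrows pages from a shared pool."""

    def __init__(self, name: str, pool: SharedPagePool) -> None:
        self.name = name
        self.pool = pool
        self.mapped = 0  # pages currently backed by physical memory
        self.in_use = 0  # pages that hold live KV entries

    def grow(self, pages: int) -> bool:
        """Claim `pages` more pages for KV data, mapping from the pool on demand."""
        need = max(0, self.in_use + pages - self.mapped)
        if need > self.pool.free_pages:
            return False  # not enough free memory; the caller must wait or evict
        self.pool.free_pages -= need
        self.mapped += need
        self.in_use += pages
        return True

    def shrink(self, pages: int) -> None:
        """Mark pages as no longer holding live KV data (e.g. requests finished)."""
        self.in_use -= min(pages, self.in_use)

    def deflate(self) -> int:
        """Return every idle mapped page to the pool and report how many."""
        idle = self.mapped - self.in_use
        self.mapped -= idle
        self.pool.free_pages += idle
        return idle


if __name__ == "__main__":
    pool = SharedPagePool(total_pages=100)
    model_a = ModelKVCache("model-a", pool)
    model_b = ModelKVCache("model-b", pool)

    model_a.grow(80)            # model A takes most of the GPU during a burst
    print(model_b.grow(40))     # False: only 20 pages are free
    model_a.shrink(50)          # A's burst ends, leaving 50 mapped pages idle
    model_a.deflate()           # the idle pages go back to the shared pool
    print(model_b.grow(40))     # True: B can now use the reclaimed memory
```

In this toy model, a static partition would have left model B unable to grow even though model A's pages were idle; returning idle pages to the pool is what lets the two models share the same memory over time.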