Original authors: Hetvi Shastri, Pragya Sharma, Walid A. Hanafy, David Irwin, Mani Srivastava, Prashant Shenoy
Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.
Technical Summary: FMplex – Model Virtualization for Serving Extensible Foundation Models
Problem Statement
Foundation Models (FMs) have become the backbone for diverse downstream applications across language, vision, time-series, and multimodal domains. However, existing model-serving systems (e.g., NVIDIA Triton) are designed around an "instance-per-task" paradigm, where each customized task loads a separate, independent copy of the model. This approach is inefficient for FMs because:
- Resource Waste: FMs consist of a massive, shared backbone (often gigabytes in size) and lightweight task-specific extensions (heads, adapters). Loading a full backbone for every task replicates the heaviest component, wasting accelerator memory.
- Lost Efficiency: Independent instances prevent the amortization of batching and loading costs across tasks.
- Interference and Isolation: Simply co-locating tasks on a shared GPU without logical separation leads to cross-task interference, where load spikes from one task degrade the performance of others.
- Lifecycle Rigidity: Current systems couple the task lifecycle to the physical model instance, making it difficult to add, remove, or modify tasks without redeploying the entire backbone.
The paper argues that the FM backbone should be treated as a shared system substrate (analogous to a CPU or memory in OS virtualization) rather than a per-task deployment artifact.
Methodology: FMplex
The authors present FMplex, a serving system that introduces Foundation Model Virtualization. The core concept is the Virtual Foundation Model (vFM), a logically private FM instance presented to each task, which is backed by a shared physical FM instance.
Key Architectural Components
Virtual Foundation Model (vFM) Abstraction:
- Decoupling: The vFM decouples the task's logical view (customization, state, lifecycle) from the physical backbone.
- Structure: Each vFM includes a Virtual Queue (for request routing), Task Extensions (encoders, decoders, and PEFT adapters like LoRA), and State/Accounting (SLOs, priorities, weights).
- Mechanism: When a task invokes its vFM, FMplex intercepts the call, routes it through the virtual queue, and executes it on the shared physical backbone, applying task-specific adapters as needed.
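The decoupling described above can be illustrated with a toy model. This is a minimal sketch, not FMplex's implementation: `SharedBackbone`, `VirtualFM`, and `serve_once` are hypothetical names, the "backbone" is a stand-in function, and adapters are omitted. It only shows several logically private vFMs, each with its own queue and decoder, being served by one pass over a single physical backbone.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List

Vector = List[float]


class SharedBackbone:
    """Stand-in for the heavy physical FM (hypothetical); counts its invocations."""

    def __init__(self) -> None:
        self.calls = 0

    def forward(self, batch: List[Vector]) -> List[Vector]:
        self.calls += 1
        # Toy "embedding": double every feature.
        return [[2.0 * x for x in row] for row in batch]


@dataclass
class VirtualFM:
    """Per-task logical view: private queue, task extension, accounting state."""
    name: str
    backbone: SharedBackbone
    decoder: Callable[[Vector], float]   # task-specific head
    weight: float = 1.0                  # accounting state (unused in this toy)
    queue: Deque[Vector] = field(default_factory=deque)

    def submit(self, x: Vector) -> None:
        self.queue.append(x)


def serve_once(vfms: List[VirtualFM]) -> Dict[str, List[float]]:
    """Drain every virtual queue into one batch and run the backbone once."""
    results: Dict[str, List[float]] = {v.name: [] for v in vfms}
    owners: List[VirtualFM] = []
    batch: List[Vector] = []
    for v in vfms:
        while v.queue:
            owners.append(v)
            batch.append(v.queue.popleft())
    if not batch:
        return results
    hidden = vfms[0].backbone.forward(batch)   # single shared pass
    for v, h in zip(owners, hidden):
        results[v.name].append(v.decoder(h))   # per-task extension
    return results


backbone = SharedBackbone()
forecast = VirtualFM("forecast", backbone, decoder=sum)
anomaly = VirtualFM("anomaly", backbone, decoder=max)
forecast.submit([1.0, 2.0])
anomaly.submit([3.0, 5.0])
print(serve_once([forecast, anomaly]), backbone.calls)
```

Both tasks receive their own outputs even though the backbone ran only once, which is the property the vFM abstraction is meant to provide.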
Batch-Aware Fair Queueing (BFQ) Scheduler:
- Challenge: Standard fair-share schedulers (e.g., Start-Time Fair Queueing) operate on a per-request basis and do not account for the efficiency gains of request batching, which is critical for FM throughput.
- Solution: BFQ is a work-conserving scheduler that approximates weighted fair sharing while optimizing for batching.
- Operation: It assigns start/finish tags to requests based on task weights. It iteratively forms batches up to a maximum size (Bmax) or until an SLO deadline would be violated.
- Adapter Handling: When requests in a batch use incompatible adapters, BFQ first runs them together over the common backbone and then applies the differing adapters sequentially, preserving fairness without sacrificing batching efficiency.
- Token-Based Support: For token-based FMs (e.g., LLMs), BFQ charges token-level work in service-time units to maintain consistency with request-level runtimes.
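The tagging and batch-formation steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the tag rules follow classic Start-Time Fair Queueing, the virtual-time update after a batch is a choice made here, `batch_time` is a hypothetical latency model supplied by the caller, and adapter handling is not modeled. The `cost` argument is expressed in service-time units, so a token-based workload could pass a per-request estimate instead of the default.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(order=True)
class Request:
    start_tag: float                          # ordering key: smaller runs first
    seq: int                                  # tie-breaker: arrival order
    task: str = field(compare=False)
    deadline: float = field(compare=False)    # absolute SLO deadline
    finish_tag: float = field(compare=False)


class BatchFairQueue:
    """Toy batch-aware fair queue (hypothetical names and update rules)."""

    def __init__(self, weights: Dict[str, float], b_max: int) -> None:
        self.weights = weights
        self.b_max = b_max
        self.vtime = 0.0
        self.last_finish = {task: 0.0 for task in weights}
        self.heap: List[Request] = []
        self.seq = 0

    def enqueue(self, task: str, deadline: float, cost: float = 1.0) -> Request:
        # SFQ-style tags: a heavier weight advances the finish tag more slowly.
        start = max(self.vtime, self.last_finish[task])
        finish = start + cost / self.weights[task]
        self.last_finish[task] = finish
        req = Request(start, self.seq, task, deadline, finish)
        self.seq += 1
        heapq.heappush(self.heap, req)
        return req

    def next_batch(self, now: float, batch_time: Callable[[int], float]) -> List[Request]:
        batch: List[Request] = []
        while self.heap and len(batch) < self.b_max:
            candidate = self.heap[0]
            done_at = now + batch_time(len(batch) + 1)
            tightest = min([r.deadline for r in batch] + [candidate.deadline])
            # Stop growing once a larger batch would miss the tightest deadline.
            # The first request is always taken, so the queue stays work-conserving.
            if batch and done_at > tightest:
                break
            batch.append(heapq.heappop(self.heap))
        if batch:
            # Assumption: advance virtual time to the largest start tag served.
            self.vtime = max(r.start_tag for r in batch)
        return batch


q = BatchFairQueue({"A": 3.0, "B": 1.0}, b_max=4)
for _ in range(6):
    q.enqueue("A", deadline=1e9)
for _ in range(6):
    q.enqueue("B", deadline=1e9)
first = q.next_batch(now=0.0, batch_time=lambda n: 0.01 * n)
print([r.task for r in first])  # ['A', 'B', 'A', 'A']
```

With weights 3:1 and both tasks backlogged, the first batch of four holds three requests from A and one from B, so the batch itself reflects the weight ratio rather than being filled in arrival order.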
Task-API and Serving Stack:
- Task-API: A programming interface allowing users to construct task pipelines by attaching encoders, decoders, and adapters to a vFM. It supports both inference and fine-tuning using the same pipeline object.
- FMplex-Controller: A cluster-level controller that manages the deployment plan. It uses a "Max-Share" heuristic to bind tasks to existing physical backbones whenever possible, minimizing new backbone instantiation.
- Elastic Adaptation: When load changes, the system can rebind a task's vFM to a different existing physical backbone, moving only lightweight task-state (queues, adapters) rather than reloading the heavy backbone.
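The sharing-first placement idea can be sketched as below. The summary does not spell out the Max-Share heuristic, so everything here is an assumption made for illustration: `BackboneInstance`, `place`, and `rebind` are hypothetical names, capacity is modeled as a single load budget, and the first instance with enough headroom wins. The point is only the ordering of choices: reuse an existing backbone when possible, instantiate a new one as a last resort, and move only the task's lightweight state on rebind.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BackboneInstance:
    model: str
    capacity: float                                         # hypothetical load budget, e.g. requests/s
    tasks: Dict[str, float] = field(default_factory=dict)   # task name -> demand

    @property
    def load(self) -> float:
        return sum(self.tasks.values())


def place(task: str, model: str, demand: float,
          instances: List[BackboneInstance],
          capacity: float = 100.0) -> BackboneInstance:
    """Bind a task to an existing backbone if one fits; otherwise create one."""
    for inst in instances:
        if inst.model == model and inst.load + demand <= inst.capacity:
            inst.tasks[task] = demand          # share: no new backbone is loaded
            return inst
    inst = BackboneInstance(model, capacity, {task: demand})
    instances.append(inst)                     # last resort: new physical backbone
    return inst


def rebind(task: str, src: BackboneInstance, dst: BackboneInstance) -> None:
    """Move only the task's lightweight state; both backbones stay loaded."""
    dst.tasks[task] = src.tasks.pop(task)


cluster: List[BackboneInstance] = []
for name in ("t1", "t2", "t3"):
    place(name, "ts-fm", demand=40.0, instances=cluster)
print(len(cluster), [sorted(i.tasks) for i in cluster])
```

In this toy run the first two tasks share one backbone and the third spills to a second instance only because the assumed load budget is exhausted.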
Key Contributions
- FM Virtualization for Deployment Sharing: The introduction of the vFM abstraction, which allows multiple independently customized tasks to share a single physical FM instance while maintaining logical isolation and independent lifecycles.
- Sharing-Based Serving Stack: An end-to-end system integrating Task-API for extensible task construction and FMplex-Controller for sharing-aware cluster deployment.
- Prototype Implementation: A functional prototype supporting multiple modalities (time-series, vision, LLMs, VLMs) and runtimes (PyTorch, vLLM), demonstrating flexibility across heterogeneous FMs.
- Comprehensive Evaluation: An evaluation across 7 backbone FMs (16 variants) and 92 downstream tasks.
Experimental Results
The evaluation was conducted on a 16-node AWS cluster (NVIDIA T4 GPUs) using synthetic and real-world traces (Azure Functions).
Latency Reduction:
- Compared to Spatial Partitioning (isolating tasks on GPU partitions), FMplex reduced latency by up to 80%.
- Compared to Best-Effort Co-location (multiple full instances on one GPU without isolation), FMplex reduced latency by up to 33.3%.
- At cluster scale, FMplex reduced mean latency by 15% and P99 latency by 26% compared to best-effort co-location.
Resource Efficiency and Scalability:
- Memory: FMplex significantly reduces GPU memory usage. For example, co-locating 10 time-series tasks on a shared backbone required only 1.17× the memory of a single task, compared to 10× for independent deployment.
- Throughput: FMplex sustained up to 6× more tasks at low load (where memory is the bottleneck) and 8–12% more tasks at moderate/high load (where compute is the bottleneck) compared to best-effort co-location.
- Fairness: Under asymmetric service weights (e.g., 3:1), FMplex maintained fairness scores of 0.97–0.98 while sustaining 84 RPS. In contrast, non-batched fair-sharing achieved similar fairness at only 37 RPS, and unmanaged sharing dropped fairness to 0.66.
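The memory figure in the list above can be unpacked with a small back-of-the-envelope calculation. It assumes a simple linear model that the summary does not state: a single task's footprint is one backbone plus one extension, and a shared deployment of n tasks holds one backbone plus n extensions. Under that assumption, the reported 1.17× for 10 tasks implies how small each task's extension is relative to a single-task deployment.

```python
def extension_fraction(shared_ratio: float, n_tasks: int) -> float:
    """Fraction f of a single-task footprint taken by the task extension.

    Linear model (an assumption): shared footprint = 1 + (n - 1) * f,
    measured in units of one single-task deployment.
    """
    return (shared_ratio - 1.0) / (n_tasks - 1)


def shared_footprint(f: float, n_tasks: int) -> float:
    """Footprint of n tasks on one shared backbone, in single-task units."""
    return 1.0 + (n_tasks - 1) * f


f = extension_fraction(1.17, 10)       # about 0.019, i.e. roughly 2% per task
saving = 1.0 - 1.17 / 10.0             # about 0.883 versus 10 independent copies
print(f"extension share per task: {f:.4f}")
print(f"memory saved vs independent deployment: {saving:.1%}")
```

Under this model each additional task costs about 2% of a full deployment, and sharing saves roughly 88% of the memory that ten independent instances would use. Real deployments may include other per-task costs that this sketch ignores.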
Adaptation Overhead:
- FMplex demonstrated rapid adaptation to workload surges. Rebinding a task to an existing backbone took 0.5 seconds, whereas loading a new backbone instance (as required by non-sharing systems) took ~58 seconds, causing a two-order-of-magnitude latency spike.
Overhead:
- The scheduling overhead introduced by FMplex (queue handling and tag computation) was minimal, averaging 0.35 ms per request, which is negligible compared to backbone execution times.
Significance and Claims
The paper claims that FMplex addresses the fundamental mismatch between the architecture of Foundation Models (heavy shared backbones, lightweight extensions) and current serving systems (per-instance deployment). By treating the FM backbone as a virtualization substrate, FMplex enables:
- Deployment Sharing: Amortizing the heavy memory and compute costs of the backbone across multiple tasks.
- Task Isolation: Providing per-task performance guarantees and isolation without the resource penalty of full model replication.
- Operational Flexibility: Allowing tasks to be added, removed, or modified dynamically without redeploying the underlying infrastructure.
The authors position FMplex not just as an optimization for specific models, but as a generalizable system layer that extends classical virtualization principles to the domain of Foundation Model serving, enabling more efficient and scalable AI infrastructure.