An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU

Imagine you are trying to bake a massive, multi-layered cake (a Large Language Model) in a very small kitchen (a single GPU).

The problem is that the cake is so huge that it requires more counter space, mixing bowls, and ingredients than your tiny kitchen can possibly hold. In the world of AI, this "counter space" is VRAM (Video RAM). Most high-end graphics cards only have enough space for a small cake, but researchers want to bake giant 100-billion-parameter cakes.

Usually, to bake these giant cakes, you need a massive commercial kitchen (a GPU Cluster or cloud supercomputer) that costs a fortune. But what if you only have one home oven?

Enter SlideFormer. It's a clever new "kitchen hack" that lets you bake some of the largest AI cakes on a single consumer-grade graphics card (like an NVIDIA RTX 4090) by using your computer's regular memory (RAM) and NVMe SSD as extra workspace.

Here is how SlideFormer works, explained through three simple analogies:

1. The "Conveyor Belt" Kitchen (The Sliding Window)

The Problem: In traditional methods, the chef (the GPU) cooks a layer of the cake, then stops completely to wait for the assistant (the CPU) to move the ingredients to the fridge and mix them. The chef stands idle, wasting time.

The SlideFormer Solution: Imagine a conveyor belt.

The GPU is the chef working on the current layer.
While the chef is busy with that layer, the assistant is simultaneously moving the ingredients for the next layer onto the belt and, in the fridge, mixing the ingredients from the layer the chef just finished.
As soon as the chef finishes the current layer, they immediately slide to the next one without stopping.
The Magic: The GPU never stops cooking. It overlaps its work with the CPU's work. It's like a perfectly choreographed dance where no one ever stands still waiting for the other.

2. The "Smart Fridge" (Heterogeneous Memory)

The Problem: Your kitchen counter (GPU VRAM) is tiny (24 GB), but your fridge (CPU RAM) is huge (up to 256 GB). Old methods tried to keep everything on the counter, which was impossible. Others tried to move things to the fridge but did it clumsily, causing the chef to wait for the door to open.

The SlideFormer Solution: SlideFormer treats the fridge like a smart, pre-organized pantry.

Instead of grabbing random items from the fridge when needed (which causes clutter and delays), SlideFormer pre-sets specific "slots" in the fridge.
It uses a sliding window: Only the ingredients needed for the current layer are on the counter. The rest are neatly packed in the fridge.
When the chef needs the next layer, it's already pre-packed and waiting. This eliminates the "searching" time and ensures the counter never gets cluttered, allowing the chef to work with a much larger recipe than the counter size would normally allow.

3. The "Express Elevator" (GPUDirect Storage)

The Problem: Sometimes the cake is so big it doesn't even fit in the fridge. You have to store some parts in the basement (your NVMe SSD). Usually, moving things from the basement to the counter requires a slow, bumpy ride through the living room (the CPU), which slows everything down.

The SlideFormer Solution: SlideFormer installs a direct express elevator (called GPUDirect Storage) from the basement straight to the kitchen counter.

It bypasses the living room (the CPU) entirely.
Data moves directly from the NVMe SSD to the graphics card. This is faster and frees up the CPU to do other helpful tasks, like mixing ingredients, rather than just acting as a delivery truck.

The Result: What Can You Do Now?

Before SlideFormer, if you wanted to fine-tune (customize) a giant AI model, you needed a supercomputer or had to settle for a tiny, less effective model.

With SlideFormer on a single RTX 4090 (a powerful card you can buy for a few thousand dollars):

You can bake the giants: You can now fine-tune models with 123 billion or more parameters (like Mistral Large) on a single card.
You can cook faster: It runs roughly 1.4x to 6.3x faster than previous offloading methods.
You can cook more at once: You can use much larger "batches" (cooking multiple cakes at once), making the process even more efficient.

Why Does This Matter?

This is the "democratization" of AI. It means that individual researchers, students, and small startups don't need to rent expensive cloud servers to work with the latest, most powerful AI models. They can do it right on their own desktop computers, turning a "VRAM wall" (a barrier that used to stop them) into a manageable hurdle.

In short: SlideFormer is the ultimate kitchen hack that turns a small home kitchen into a factory capable of baking the world's biggest cakes, simply by organizing the workflow and using every inch of space efficiently.

1. Problem Statement

Fine-tuning Large Language Models (LLMs) is essential for domain adaptation but faces a critical memory bottleneck on single-GPU systems.

The VRAM Wall: Full-parameter fine-tuning requires storing parameters, gradients, optimizer states (for Adam, often about 3x the parameter size), and activations. For an 8B-parameter model, the total exceeds 128 GB, far beyond the VRAM of high-end consumer GPUs (e.g., an RTX 4090 with 24 GB).
Limitations of Existing Solutions:
- Distributed Parallelism: Techniques like Pipeline or Tensor Parallelism require multiple GPUs, which are inaccessible to individuals/small labs.
- Parameter-Efficient Fine-Tuning (PEFT): Methods like LoRA often underperform compared to full-parameter fine-tuning on specialized tasks.
- Existing Offloading Systems (ZeRO-Offload, ZeRO-Infinity, ColossalAI): These were designed primarily for multi-GPU clusters. On a single GPU, they suffer from:
  - Synchronous updates that leave the GPU idle while waiting for CPU updates.
  - Inefficient memory management (fragmentation, on-demand allocation).
  - Lack of fine-grained optimizations for modern LLM architectures (e.g., handling CrossEntropyLoss bottlenecks).
The Opportunity: Host memory capacity (up to 256 GB of DDR5) is growing faster than GPU VRAM, and the gap keeps widening. SlideFormer aims to exploit this "heterogeneous" hardware landscape to democratize LLM fine-tuning.
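The memory accounting behind the "VRAM wall" can be checked with a few lines of arithmetic. The sketch below is illustrative only: the per-parameter byte breakdown (BF16 weights and gradients plus FP32 master weights and two Adam moments) is a common mixed-precision setup assumed here, not a figure taken from the source, and the function names are hypothetical. Activations are left out, so the real footprint is higher.

```python
# Back-of-the-envelope memory estimate for full-parameter fine-tuning.
# ASSUMPTION (not from the source): mixed-precision Adam with BF16 weights
# and gradients plus FP32 master weights, momentum, and variance.
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}


def training_state_gb(num_params: float) -> float:
    """Weights + gradients + optimizer state, in decimal gigabytes."""
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / 1e9


def fits_in_vram(num_params: float, vram_gb: float) -> bool:
    """True if the training state alone (no activations) fits in VRAM."""
    return training_state_gb(num_params) <= vram_gb


if __name__ == "__main__":
    for name, n in [("8B", 8e9), ("24B", 24e9), ("123B", 123e9)]:
        state = training_state_gb(n)
        print(f"{name}: {state:.0f} GB of training state, "
              f"fits in 24 GB VRAM: {fits_in_vram(n, 24)}")
```

Under these assumptions the training state costs 16 bytes per parameter, which is why even an 8B model cannot keep its full state on a 24 GB card and some form of offloading becomes necessary.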

2. Methodology: SlideFormer Architecture

SlideFormer employs a holistic heterogeneous co-design consisting of three core pillars to maximize hardware utilization and minimize memory footprints.

A. Layer-Sliding Asynchronous Engine

Instead of treating the model as a monolithic block or using coarse-grained parameter groups, SlideFormer adopts a layer-granular approach.

Sliding Window: The GPU maintains a small, active window of layers (parameters and gradients) while the rest reside in CPU RAM or NVMe.
Asynchronous Pipelining: The system overlaps three distinct operations:
1. GPU Computation: Backward pass for the current layer ($L_i$).
2. CPU Update: Optimizer step for the previous layer ($L_{i-1}$) using host-resident states.
3. I/O Transfers: Asynchronous transfer of gradients from GPU to CPU ($\mathrm{d2h}$) and prefetching of next-layer parameters from CPU to GPU ($\mathrm{h2d}$).
Thread-Based Design: Unlike prior works using multi-process engines (which incur IPC overhead), SlideFormer uses a lightweight thread-based engine with dedicated CUDA streams and CPU threads to ensure non-blocking execution.
Overlap Condition: The system achieves "lossless" offloading when the backward computation time ($T_{bwd}$) is greater than or equal to the sum of the gradient transfer and update times ($T_{d2h} + T_{update}$), i.e., $T_{bwd} \ge T_{d2h} + T_{update}$.
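To make the overlap condition concrete, the following sketch compares a fully synchronous backward pass with a two-stage pipeline in which the GPU works on one layer while the gradient transfer and CPU update of the previous layer proceed in the background. This is a simplified timing model, not the SlideFormer engine: the function names are hypothetical, per-layer times are assumed to be identical, and parameter prefetching is assumed to be fully hidden.

```python
def is_lossless(t_bwd: float, t_d2h: float, t_update: float) -> bool:
    """Overlap condition from the text: T_bwd >= T_d2h + T_update."""
    return t_bwd >= t_d2h + t_update


def synchronous_time(num_layers: int, t_bwd: float,
                     t_d2h: float, t_update: float) -> float:
    """GPU waits for the transfer and the CPU update after every layer."""
    return num_layers * (t_bwd + t_d2h + t_update)


def pipelined_time(num_layers: int, t_bwd: float,
                   t_d2h: float, t_update: float) -> float:
    """Two-stage pipeline: backward of one layer overlaps with the
    transfer + update of the previously finished layer."""
    offload = t_d2h + t_update
    steady_state = max(t_bwd, offload)  # the slower stage sets the pace
    # First backward fills the pipeline, the last offload drains it.
    return t_bwd + (num_layers - 1) * steady_state + offload


if __name__ == "__main__":
    layers, t_bwd, t_d2h = 32, 10, 3  # illustrative times in milliseconds
    for t_update in (5, 9):
        print(f"t_update={t_update} ms, "
              f"lossless={is_lossless(t_bwd, t_d2h, t_update)}, "
              f"sync={synchronous_time(layers, t_bwd, t_d2h, t_update)} ms, "
              f"pipelined={pipelined_time(layers, t_bwd, t_d2h, t_update)} ms")
```

When the condition holds, the pipelined total is the pure GPU compute time plus one trailing transfer and update; when it fails, the CPU side becomes the bottleneck and sets the pace for every layer.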

B. Efficient Heterogeneous Memory Management

The system minimizes memory fragmentation and peak usage through pre-allocation and shared buffers.

Pre-allocated GPU Cache Queue: Instead of dynamic allocation, SlideFormer uses a fixed-size queue of GPU cache units. As layers slide through the pipeline, they reuse these pre-allocated slots, eliminating fragmentation and reallocation overhead.
Shared CPU Buffers:
- Gradients: A single pinned tensor is shared across layers for gradients, shrinking the CPU-side gradient buffer to roughly $1/\text{num\_layers}$ of what per-layer buffers would need.
- Type Conversion: A shared buffer handles FP32-to-BF16/FP16 conversion on the CPU before transfer, avoiding GPU-side conversion costs.
Sliding Activation Checkpointing: Activations are asynchronously offloaded to CPU/NVMe after the forward pass and reloaded for the backward pass, keeping VRAM usage for activations minimal.
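The pre-allocated cache queue described above can be illustrated with a small fixed pool of buffers. This is a minimal sketch using plain `bytearray` objects as stand-ins for GPU cache units; the class name `CacheQueue`, its methods, and the `slide` helper are hypothetical and not part of SlideFormer's API. The point it shows is that however many layers slide through, only the buffers allocated at start-up are ever used.

```python
from collections import deque


class CacheQueue:
    """Fixed pool of equally sized buffers reused as layers slide through."""

    def __init__(self, num_slots: int, slot_bytes: int):
        # All buffers are allocated once, up front.
        self._slots = [bytearray(slot_bytes) for _ in range(num_slots)]
        self._free = deque(range(num_slots))
        self._resident = {}  # layer index -> slot index

    def load(self, layer: int, payload: bytes) -> bytearray:
        """Copy a layer's payload into a free slot and return that slot."""
        if not self._free:
            raise RuntimeError("window full: evict a layer first")
        slot = self._free[0]
        buf = self._slots[slot]
        if len(payload) > len(buf):
            raise ValueError("payload larger than a cache slot")
        self._free.popleft()
        buf[:len(payload)] = payload  # in-place copy, no new allocation
        self._resident[layer] = slot
        return buf

    def evict(self, layer: int) -> None:
        """Return the layer's slot to the free queue for reuse."""
        self._free.append(self._resident.pop(layer))

    def resident_layers(self) -> list:
        return sorted(self._resident)


def slide(num_layers: int, window: int, slot_bytes: int = 8):
    """Slide a window over all layers; report which buffers were touched."""
    cache = CacheQueue(window, slot_bytes)
    buffer_ids = set()
    for layer in range(num_layers):
        if layer >= window:
            cache.evict(layer - window)  # oldest layer leaves the window
        buf = cache.load(layer, bytes([layer % 256]) * slot_bytes)
        buffer_ids.add(id(buf))
    return cache, buffer_ids


if __name__ == "__main__":
    cache, ids = slide(num_layers=12, window=3)
    print("distinct buffers used:", len(ids))
    print("layers resident at the end:", cache.resident_layers())
```

Because slots are recycled in FIFO order, memory usage is fixed by the window size rather than by model depth, which is the property that removes fragmentation and reallocation from the hot path.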

C. Integrated I/O and Compute Co-Design

GPUDirect Storage (GDS): SlideFormer pioneers the use of GDS for LLM fine-tuning. This allows data to move directly between NVMe and GPU, bypassing the CPU bounce buffer. This reduces CPU utilization and PCIe contention, freeing CPU resources for the asynchronous engine.
Optimized Triton Kernels:
- Fused LinearCrossEntropy (LCE): A critical innovation. For models with large vocabularies (e.g., Llama-3.1), the intermediate logits tensor ($B \times S \times V$: batch size by sequence length by vocabulary size) consumes massive VRAM. SlideFormer fuses the projection and loss calculation, computing gradients in chunks to avoid materializing the full logits tensor. This reduces output layer memory usage by more than 80%.
- Other fused kernels for RoPE, RMSNorm, and SwiGLU further improve throughput.
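The idea behind the fused LinearCrossEntropy kernel can be sketched in pure Python: project and score a few tokens at a time so that only one chunk of logits exists at any moment. This is a toy illustration of the chunking principle, not the Triton kernel; it covers only the forward loss (the kernel described above also computes gradients per chunk), and all function names are hypothetical.

```python
import math
import random


def logits_row(h, weight):
    """Project one hidden vector onto the vocabulary (one row of logits)."""
    return [sum(hi * wi for hi, wi in zip(h, w_row)) for w_row in weight]


def token_loss(logits, target):
    """Numerically stable cross-entropy for a single token."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def full_lce(hidden, weight, targets):
    """Reference: materialize every logits row, then average the loss."""
    all_logits = [logits_row(h, weight) for h in hidden]  # tokens x vocab
    losses = [token_loss(l, t) for l, t in zip(all_logits, targets)]
    return sum(losses) / len(targets)


def chunked_lce(hidden, weight, targets, chunk_size):
    """Same loss, but at most `chunk_size` logits rows are alive at once."""
    total, peak_rows = 0.0, 0
    for start in range(0, len(hidden), chunk_size):
        stop = start + chunk_size
        chunk_logits = [logits_row(h, weight) for h in hidden[start:stop]]
        peak_rows = max(peak_rows, len(chunk_logits))
        total += sum(token_loss(l, t)
                     for l, t in zip(chunk_logits, targets[start:stop]))
        # chunk_logits is dropped here before the next chunk is projected
    return total / len(targets), peak_rows


if __name__ == "__main__":
    rng = random.Random(0)
    tokens, dim, vocab = 10, 4, 7  # tiny illustrative sizes
    hidden = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(tokens)]
    weight = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab)]
    targets = [rng.randrange(vocab) for _ in range(tokens)]
    loss, peak = chunked_lce(hidden, weight, targets, chunk_size=3)
    print("losses match:", abs(loss - full_lce(hidden, weight, targets)) < 1e-9)
    print("peak logits rows held:", peak, "of", tokens)
```

The chunk size trades peak memory against the number of projection calls: smaller chunks hold fewer logits rows at once, and the loss is unchanged because cross-entropy is a sum over independent tokens.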

3. Key Contributions

Layer-Sliding Architecture: A novel pipeline that treats the GPU as a sliding window, enabling full overlap of GPU compute, CPU updates, and multi-tier I/O.
Fixed-Footprint Memory Management: A pre-allocated cache queue and shared buffer strategy that eliminates fragmentation and significantly reduces peak CPU/GPU memory.
Advanced I/O Integration: First integration of GPUDirect Storage for LLM offloading and the development of fused Triton kernels (specifically LCE) to solve critical memory bottlenecks in modern architectures.
Scalability: Enables fine-tuning of 123B+ parameter models on a single RTX 4090, supporting batch sizes 8x larger and model sizes 6x larger than existing baselines.

4. Experimental Results

SlideFormer was evaluated on an NVIDIA RTX 4090 (24 GB), an A100 (80 GB), and an AMD RX 7900XT (20 GB), with up to 256 GB of CPU RAM.

Throughput: SlideFormer achieves 1.40× to 6.27× higher throughput compared to baselines (ZeRO-Offload, ZeRO-Infinity, ColossalAI, LoHan).
Memory Efficiency:
- GPU Memory: Reduced by >50% compared to ZeRO-Offload.
- CPU Memory: Reduced by ~40% due to shared buffers and optimized layouts.
Model Size Limits:
- Can fine-tune 123B+ models on a single RTX 4090 (with NVMe offloading).
- Can fine-tune 24B models at >95% peak GPU performance without NVMe offloading (using only CPU RAM).
Hardware Compatibility: Maintains >95% peak performance on both NVIDIA and AMD GPUs, demonstrating broad applicability.
NVMe Offloading: Performance scales near-linearly with NVMe drive count. Offloading strategies can reduce CPU memory usage by 60-80% with a manageable throughput penalty (30-50%).

5. Significance

SlideFormer moves full-parameter LLM fine-tuning from "cluster-dependent" to "single-machine accessible."

Democratization: It allows individual researchers and small labs with consumer-grade hardware (e.g., a single RTX 4090 and a high-RAM PC) to perform full-parameter fine-tuning on state-of-the-art models (Llama, Qwen, Mistral) that were previously restricted to data centers.
Efficiency: By solving the synchronization and memory fragmentation issues inherent in previous offloading methods, it proves that heterogeneous co-design can achieve near-native GPU performance even when the model does not fit in VRAM.
Practicality: The system is implemented on top of PyTorch and Transformers, which keeps it compatible with recent model architectures and makes it a practical tool for real-world domain adaptation.