veScale-FSDP: Flexible and High-Performance FSDP at Scale

veScale-FSDP is a redesigned, high-performance FSDP system that introduces a flexible RaggedShard format and structure-aware planning to overcome the limitations of existing systems in supporting block-wise quantization and non-element-wise optimizers, thereby achieving significantly higher throughput, lower memory usage, and efficient scaling to tens of thousands of GPUs.

Zezhou Wang, Youjie Li, Zhiqi Lin, Jiacheng Yang, Cong Xie, Guanyu Feng, Zheng Zhong, Ziyue Huang, Hongyu Zhu, Zhi Zhang, Yanghua Peng, Xin Liu

Published 2026-03-02

Imagine you are trying to build a massive skyscraper (a giant AI model) using a team of thousands of construction workers (GPUs).

In the past, the standard way to organize this team was FSDP (Fully Sharded Data Parallel). Think of FSDP as a strict foreman who says: "We are going to cut the blueprints into tiny, equal-sized square tiles. Every worker gets exactly one tile. If a tile needs to be glued to another, we have to stop, gather all the tiles back together, glue them, and then cut them up again."

This works okay for simple buildings, but modern AI models are like complex, futuristic architecture with weird shapes, curved walls, and special materials. The old "square tile" method causes two big problems:

  1. The "Square Peg in a Round Hole" Problem: New, fancy construction techniques (like "block-wise quantization" or "matrix optimizers") need to work on specific chunks of the blueprint (like a whole 128x128 block). The old foreman keeps cutting these blocks in half, forcing workers to stop and reassemble them constantly.
  2. The "Wasted Space" Problem: Because the tiles must be perfectly equal, workers often end up with empty space on their desks (padding) just to make the math work. This wastes memory and slows everyone down.

Enter veScale-FSDP.

The authors of this paper invented a new system that acts like a smart, flexible project manager instead of a rigid foreman. Here is how it works, using simple analogies:

1. The "Ragged Shard" (The Flexible Puzzle)

Instead of forcing every worker to hold a perfect square tile, RaggedShard lets workers hold irregular puzzle pieces.

  • The Old Way: If a blueprint section is 100 units wide and you have 3 workers, the old system cuts it into 33, 33, and 34. But if the "fancy construction" works on 32-unit blocks, those cuts land inside blocks rather than between them, splitting a block across two workers.
  • The veScale Way: RaggedShard says, "Worker A, you take the first 32-unit block. Worker B, you take the next two blocks (64 units). Worker C, you take the rest."
  • The Result: The workers can now use the special construction techniques (like block-wise quantization) without ever having to stop and reassemble the pieces. The blocks stay intact, just like a puzzle piece that fits perfectly in your hand.
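
The 100-unit example above can be checked with a few lines of Python. This is a minimal sketch of the idea only, not the veScale-FSDP API: the function names `even_shards`, `ragged_shards`, and `split_blocks` are hypothetical and exist just for this illustration.

```python
def even_shards(total, workers):
    """Near-equal contiguous ranges; the last worker absorbs the remainder."""
    base = total // workers
    bounds = [i * base for i in range(workers)] + [total]
    return [(bounds[i], bounds[i + 1]) for i in range(workers)]


def ragged_shards(total, block, blocks_per_worker):
    """Uneven ranges whose cuts all land on multiples of `block`.

    `blocks_per_worker` lists the block count for every worker except the
    last one, which takes whatever is left.
    """
    shards, start = [], 0
    for n_blocks in blocks_per_worker:
        end = min(start + n_blocks * block, total)
        shards.append((start, end))
        start = end
    shards.append((start, total))
    return shards


def split_blocks(shards, block):
    """Indices of the blocks that a shard boundary cuts through."""
    cuts = [end for _, end in shards[:-1]]
    return sorted(c // block for c in cuts if c % block != 0)


even = even_shards(100, 3)
ragged = ragged_shards(100, 32, [1, 2])

print(even, split_blocks(even, 32))      # cuts at 33 and 66 fall inside blocks
print(ragged, split_blocks(ragged, 32))  # cuts at 32 and 96 fall between blocks
```

With even sharding, two of the four blocks are split across workers; with the ragged layout, none are, at the cost of workers holding different amounts of data.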

2. The "Planning Algorithm" (The Traffic Controller)

Now that everyone has different-sized puzzle pieces, how do we pass them around without causing a traffic jam?

  • The Problem: If you just line up the pieces randomly, you might end up with a huge gap (padding) between pieces, or a piece might get split across two workers' desks, requiring a messy hand-off.
  • The Solution: The paper introduces a Planning Algorithm. Imagine a traffic controller who looks at all the puzzle pieces and says: "Okay, put the big pieces here, the small pieces there, and arrange them so that when we pass them down the line, they fit together perfectly with zero gaps."
  • The Magic: Finding the best arrangement is a hard combinatorial problem (NP-hard), yet the algorithm produces an efficient arrangement in milliseconds, so little space is wasted and no time is lost shuffling pieces around.
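
To see why the arrangement matters, here is a toy version of the layout problem. It is an illustrative sketch, not the paper's planning algorithm: it assumes a flat buffer cut every `shard_size` elements, tensors described as hypothetical `(name, numel, block)` tuples, and it simply tries every ordering, which only works for a handful of tensors.

```python
from itertools import permutations


def padding_for(order, shard_size):
    """Padding needed to lay tensors back to back without splitting a block.

    The buffer is cut every `shard_size` elements. A block that would
    straddle a cut is pushed to the next shard, and the skipped elements
    count as padding. Assumes every block fits in one shard.
    """
    offset = padding = 0
    for _name, numel, block in order:
        for _ in range(numel // block):
            room = shard_size - offset % shard_size
            if block > room:
                padding += room
                offset += room
            offset += block
    return padding


# Hypothetical tensors: (name, number of elements, indivisible block size).
tensors = [("a", 48, 48), ("b", 64, 32), ("c", 16, 16)]
shard_size = 64

naive = padding_for(tensors, shard_size)
best = min(permutations(tensors), key=lambda o: padding_for(o, shard_size))

print("given order:", naive, "padded elements")
print("best order: ", [name for name, _, _ in best],
      padding_for(best, shard_size), "padded elements")
```

In this example the given order wastes 16 elements while a reordering wastes none. Exhaustive search grows factorially with the number of tensors, which is why a real planner needs something much smarter than this loop.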

3. The "Distributed Buffer" (The Super-Highway)

Once the pieces are arranged, they need to move between workers.

  • The Old Way: Workers often had to copy their pieces onto a temporary clipboard (memory) before passing them, which is slow and messy.
  • The veScale Way: They built a Distributed Buffer (DBuffer). Think of this as a super-highway where the pieces are already parked in the right spots. When a worker needs a piece, they just point to it. No copying, no moving, no waiting. It's "zero-copy" access.
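
The "just point to it" idea can be shown on a single machine with Python's standard library. This is a sketch of zero-copy views in general, not of DBuffer itself: the `layout` table and parameter names are made up for the example, and it leaves out the communication between workers entirely.

```python
from array import array

# One flat buffer holds every parameter back to back (here: 8 floats).
flat = array("f", [0.0] * 8)
buf = memoryview(flat)

# Hypothetical layout: parameter name -> (offset, length) in the flat buffer.
layout = {"weight": (0, 6), "bias": (6, 2)}

# Slicing a memoryview yields a view onto the same memory, not a copy.
params = {name: buf[off:off + n] for name, (off, n) in layout.items()}

params["bias"][0] = 1.5   # write through the view...
print(flat[6])            # ...and the flat buffer holds the new value

flat[0] = 2.0             # write to the flat buffer...
print(params["weight"][0])  # ...and the view sees it without any copy
```

Because each parameter is only a window onto the shared buffer, handing the whole buffer to a collective operation would need no packing step first.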

Why Does This Matter? (The Results)

Because of these three innovations, the team can build their skyscraper much faster and with fewer resources:

  • Speed: veScale-FSDP delivers 5% to 66% higher throughput than previous FSDP systems. It's like going from a slow, stop-and-go commute to a high-speed train.
  • Memory: It uses 16% to 30% less memory. This is like fitting the same amount of furniture into a smaller apartment by arranging it perfectly, rather than needing a warehouse.
  • Scale: They can now train models on tens of thousands of GPUs without the system crashing or running out of memory.

The Bottom Line

veScale-FSDP is a system upgrade that stops treating AI models like rigid, uniform blocks of data. Instead, it treats them like flexible, organic structures. By allowing workers to hold irregular pieces (RaggedShard), arranging them perfectly (Planning), and moving them instantly (DBuffer), it unlocks the ability to train the next generation of super-intelligent AI models that were previously too difficult or expensive to build.

It's the difference between trying to build a castle with only square Lego bricks versus having a set of custom-shaped bricks that fit together perfectly.
