Covenant-72B: Pre-Training a 72B LLM with Trustless Peers Over-the-Internet

The paper introduces Covenant-72B, a 72-billion-parameter language model pre-trained on 1.1 trillion tokens through the largest permissionless, globally distributed training collaboration to date. It demonstrates that open, blockchain-supported participation can achieve performance competitive with centralized training at unprecedented scale.

Joel Lidin, Amir Sarfi, Erfan Miahi, Quentin Anthony, Shivam Chauhan, Evangelos Pappas, Benjamin Thérien, Eugene Belilovsky, Samuel Dare

Published Tue, 10 Ma

Imagine you want to build a giant, super-smart robot brain (a Large Language Model, or LLM) that knows everything in the world. Usually, only the richest companies can do this because they need to buy thousands of expensive super-computers and hook them all together in a single, massive data center. It's like trying to build a skyscraper, but you can only use one crane, and that crane costs a billion dollars.

The Covenant-72B paper is about a radical new way to build that skyscraper.

Instead of one giant crane, they used thousands of small, ordinary cranes scattered all over the world, connected by the regular internet. And the best part? Anyone could bring their crane to the job site, no permission needed.

Here is how they pulled off this massive, decentralized construction project, explained simply:

1. The Problem: The "Internet Traffic Jam"

If you try to connect thousands of computers over the regular internet to train a giant AI, you hit a wall. The internet is slow and unreliable compared to the super-fast cables inside a data center.

  • The Analogy: Imagine trying to coordinate a dance routine with 20,000 people spread across different countries. If everyone has to shout their moves to everyone else every single second, the noise and lag would make the dance impossible. The computers would spend more time waiting for messages than actually learning.
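To see why the "traffic jam" is real, here is a back-of-envelope calculation. The 72-billion-parameter count comes from the paper; the fp16 precision and the 100 Mbit/s home connection are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope: why naively sharing full model updates over home
# internet fails. Assumed numbers: 72B parameters, 2 bytes each (fp16),
# and a 100 Mbit/s home connection.
params = 72e9
bytes_per_param = 2                            # fp16 (assumption)
payload_gb = params * bytes_per_param / 1e9    # ~144 GB per full sync

link_mbps = 100                                # home link (assumption)
seconds = payload_gb * 8 / (link_mbps / 1e3)   # GB -> gigabits, / Gbit/s
hours = seconds / 3600

print(f"{payload_gb:.0f} GB per sync, ~{hours:.1f} hours to send once")
```

A single uncompressed synchronization would take hours on its own, and training requires many thousands of synchronizations: this is why heavy compression is not optional but essential.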

2. The Solution: The "Group Chat" Strategy

The team used a clever trick called SparseLoCo.

  • The Analogy: Instead of sending a full, high-definition video of every move (which is huge and slow), the computers only send a text message saying, "I moved my left foot up."
  • How it works: The computers do most of the work locally (learning on their own) and only occasionally send a tiny, heavily compressed summary of what they learned to the group. A technique called "error feedback" remembers whatever detail didn't fit into the summary and folds it into the next message, so nothing important is lost over time. It's like sending a postcard instead of a movie, but keeping notes on everything the postcard left out so it can go on the next one.
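The "postcard" trick above can be sketched as top-k compression with an error-feedback buffer: send only the largest entries of an update, and carry everything else forward. This is a minimal illustration in the spirit of SparseLoCo, not the paper's actual implementation; the 1% ratio and array size are made up:

```python
import numpy as np

def compress_with_error_feedback(grad, error_buf, k_ratio=0.01):
    """Send only the largest entries; remember the rest for next round."""
    corrected = grad + error_buf                # add what we failed to send before
    k = max(1, int(k_ratio * corrected.size))
    idx = np.argsort(np.abs(corrected))[-k:]    # indices of the top-k magnitudes
    sparse = np.zeros_like(corrected)
    sparse[idx] = corrected[idx]                # the tiny "text message" we transmit
    new_error = corrected - sparse              # leftover detail, carried forward
    return sparse, new_error

rng = np.random.default_rng(0)
error = np.zeros(1000)
for step in range(5):
    grad = rng.standard_normal(1000)            # stand-in for a local update
    msg, error = compress_with_error_feedback(grad, error)
    # msg has ~1% nonzero entries; the other 99% accumulate in `error`
    # and get their chance to be sent in a later round
```

The key property is that `sparse + new_error` always equals the full corrected update, so information is deferred rather than discarded.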

3. The "Trustless" Element: The Blockchain Bouncer

Usually, when you let random people join a project, you worry they might cheat, break things, or send garbage data.

  • The Analogy: Imagine a massive potluck dinner where anyone can bring a dish. How do you know someone didn't bring a plate of rocks?
  • The Fix: They used a system called Gauntlet (running on a blockchain). Think of Gauntlet as a super-strict, automated food critic.
    • Every time someone brings a "dish" (a piece of AI learning data), the critic tastes it immediately.
    • If the dish tastes good (the math checks out), the critic gives them points and adds their dish to the main pot.
    • If the dish is bad or they are trying to cheat, they get zero points and are ignored.
    • This creates a system where people are rewarded for being honest and helpful, and there is no need for a central boss to say, "You are allowed to join."
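The bouncer logic above can be sketched as a simple scoring rule: try each contributed update against a check, pay out points only if it actually helps, and merge only accepted updates. Everything below (the quadratic toy loss, the point values) is illustrative and not the real Gauntlet protocol:

```python
# Toy "bouncer": validate each contributed update before it joins the
# shared model. Honest work earns points; garbage earns zero and is ignored.
model, target = 0.0, 10.0
loss = lambda m: (m - target) ** 2       # stand-in for a real validation check

def score_update(model, delta):
    """Accept an update only if it lowers the loss; reward the improvement."""
    before, after = loss(model), loss(model + delta)
    return max(0.0, before - after)

honest_delta = 1.0      # nudges the model toward the target
garbage_delta = -5.0    # a "plate of rocks"

points = {}
for name, delta in [("honest", honest_delta), ("cheater", garbage_delta)]:
    pts = score_update(model, delta)
    points[name] = pts
    if pts > 0:
        model += delta                   # only accepted dishes go into the pot

print(points)
```

Because the check is automatic and the reward depends only on measured improvement, no central authority has to decide in advance who is trustworthy.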

4. The Result: A Giant Brain Built by the Crowd

They managed to train a 72-billion-parameter model (a very large AI) using this method.

  • The Scale: They fed the AI about 1.1 trillion tokens (small pieces of words) of training data.
  • The Hardware: They didn't use a supercomputer. They used a mix of computers owned by volunteers, some with 8 powerful graphics cards, all connected via regular home internet.
  • The Performance: The resulting AI, Covenant-72B, performs competitively with models trained by big tech companies in their expensive data centers, even though it was built by a crowd of strangers over the regular internet.

5. Why This Matters

This is a huge deal for democratization.

  • Before: Only the "rich kids" (big tech companies) could build the smartest AI because they had the money for the expensive data centers.
  • Now: This paper proves that if you have a smart enough way to coordinate, anyone with a computer and an internet connection can help build the next generation of AI. It turns AI training from a "closed club" into an "open global party."

Summary

The paper describes a successful experiment where they built a world-class AI by:

  1. Letting anyone join (no whitelist).
  2. Using a "text message" system to avoid internet traffic jams.
  3. Using a blockchain bouncer to stop cheaters.
  4. Proving that a crowd of strangers can build a giant brain just as well as a single giant company.

It's the difference between building a house with one expensive crane versus building it with a thousand volunteers using hand tools, but doing it so efficiently that the house ends up just as strong.