Original authors: Hoa La, Ahan Gupta, Alex Morehead, Jianlin Cheng, Minjia Zhang

Published 2026-06-16



Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to build a massive, intricate 3D puzzle of a protein. For a long time, scientists used a specific set of rules (like the old "AlphaFold2") that worked well. But the newest version, AlphaFold3, is like upgrading from a flat 2D puzzle to a complex, multi-layered 3D structure where every single piece interacts with every other piece in a 3D space.

While this new model is scientifically amazing, it has a huge problem: it eats up computer memory like a black hole and runs incredibly slowly.

This paper introduces MegaFold, a new "system" (a set of specialized tools and rules) designed to make training these new 3D protein models possible on modern computers without them crashing or taking forever.

Here is how MegaFold solves the four main problems, explained with simple analogies:

The Problem: Why the Old Way Fails

The new AlphaFold3 model looks at proteins in a way that creates a "cubic" explosion of data.

The Analogy: Imagine you are organizing a party. In the old model, you just needed to know who was friends with whom (a simple list). In the new model, you have to track a 3D map of how every guest is interacting with every other guest simultaneously. If you have 100 guests, the old way needs 10,000 notes; the new way needs 1,000,000 notes. If you try to write all those notes on a single notepad (the computer's memory), the notepad explodes.

The Solution: MegaFold's Four Tools

MegaFold fixes this by introducing four specific upgrades:

1. The "Smart Notepad" (EvoFlash-3D)

The Problem: The computer tries to write down the entire 3D interaction map at once, filling up its memory instantly.
The MegaFold Fix: Instead of writing the whole map on the notepad, MegaFold uses a "scratchpad" (fast, temporary memory) to calculate the interactions in small chunks, one by one, and then throws the notes away immediately.
The Result: It never needs to hold the whole massive map in memory at once. This allows the computer to handle much longer protein chains without running out of space.

2. The "Team Relay" (EvoSP-3D)

The Problem: When you use multiple computers (GPUs) to work together, they usually split the work by giving each computer a different slice of the guest list. But because AlphaFold3's data is a 2D grid of interactions, a simple one-dimensional split doesn't work; the computers lose track of who is talking to whom.
The MegaFold Fix: MegaFold invents a new way for the computers to pass data. Instead of just splitting the list, they split the grid of interactions. They pass pieces of the puzzle back and forth in a specific, efficient pattern (like a relay race) so that every computer knows exactly which part of the 3D map it is responsible for, without wasting time waiting for others.

3. The "Assembly Line" (EvoFusion)

The Problem: The computer was doing the job in tiny, inefficient steps. It would calculate a number, stop, save it, load a new tool, calculate again, stop, and save again. This is like a chef chopping an onion, then walking to the fridge to get a knife, then walking back to chop again.
The MegaFold Fix: MegaFold combines these tiny steps into one giant, smooth motion. It fuses the "chopping," "mixing," and "cooking" into a single, continuous action.
The Result: The computer stops wasting time starting and stopping tasks. It runs much faster because it keeps the "assembly line" moving without interruption.

4. The "Pre-Prepared Ingredients" (EvoPipe)

The Problem: Before the computer can even start building the protein, a human (or a slow CPU) has to do a lot of messy research to gather the right data (like finding evolutionary history). This takes a long time and leaves the powerful computer sitting idle, waiting for the data.
The MegaFold Fix: MegaFold realizes that some of this research is always the same for the same protein. It does the hard work once ahead of time and stores it in a "pantry" (a cache). When the computer needs the data, it just grabs the pre-prepared ingredients instantly.
The Result: The powerful computer spends far less time waiting for data, so it stays busy much more consistently.

The Results

The paper tested MegaFold on two types of powerful computer chips (NVIDIA and AMD). Here is what they found:

Longer Proteins: MegaFold allowed the team to train on protein sequences 3.36 times longer than before.
Faster Speed: It made the training process 1.73 times faster on NVIDIA chips and 1.62 times faster on AMD chips.
Less Memory: It lowered peak GPU memory usage by up to 1.23 times, which helps avoid the "out of memory" crashes that usually happen with these models.

Summary

Think of AlphaFold3 as a super-powerful engine that was too heavy for its car chassis (the computer system). MegaFold didn't just build a bigger car; it redesigned the engine, the transmission, and the fuel system so that the super-powerful engine could actually run efficiently. This allows scientists to train these next-generation models on a wider variety of computers, making the science of protein folding faster and more accessible.

Technical Summary: MegaFold

Problem Statement

Recent advances in biomolecular modeling, specifically AlphaFold3 (AF3), have introduced a paradigm shift from standard transformer architectures to models utilizing 3D attention over 2D pairwise representations. While AF3 enables accurate prediction of protein-protein, protein-DNA/RNA, and protein-ligand interactions, this architectural change introduces severe system-level bottlenecks that existing Large Language Model (LLM) optimizations cannot address.

The paper identifies four fundamental challenges rooted in the cubic scaling of 3D attention:

Cubic Activation Memory: Unlike transformers, where attention scales quadratically ($O(N^2)$), AF3's 3D attention over 2D pairwise representations produces tensors that scale cubically ($O(N^3)$). This leads to activation memory dominating GPU usage (up to 96% of total memory), causing Out-of-Memory (OOM) errors at moderate sequence lengths (e.g., >590 residues) even with small batch sizes.
Incompatibility with Sequence Parallelism (SP): Standard SP strategies for transformers rely on 1D sequence sharding. AF3's 2D pairwise representations and alternating reduction axes across layers create complex multi-axis communication patterns that conventional SP cannot support without excessive resharding or synchronization.
Kernel Fragmentation: The interleaving of 3D attention, transition layers, and projection operations over mixed-dimensional tensors results in thousands of fine-grained kernels per iteration. This fragmentation prevents effective kernel fusion, leading to poor warp occupancy, high launch overhead, and inefficient memory bandwidth utilization.
Host-Side Preprocessing Stalls: Constructing the evolutionary context (e.g., Multiple Sequence Alignments, MSA) required for 3D attention involves heavy CPU preprocessing with many data dependencies. This creates significant GPU idle time that cannot be hidden by conventional prefetching pipelines.
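
To make the first challenge concrete, the following back-of-the-envelope sketch compares how many bytes the attention logits alone would need under quadratic and cubic scaling. The function name `logit_bytes` and the 2-bytes-per-element assumption (e.g. bf16) are illustrative choices, not taken from the paper, and the sketch ignores heads, batch size, and everything else a real training run stores.

```python
def logit_bytes(n_tokens: int, bytes_per_elem: int = 2) -> dict:
    """Rough byte counts for attention logits, ignoring heads and batch.

    Assumption: 2 bytes per element (e.g. bf16). Real footprints also depend
    on head count, saved activations, and implementation details.
    """
    quadratic = n_tokens ** 2 * bytes_per_elem  # sequence attention: one N x N map
    cubic = n_tokens ** 3 * bytes_per_elem      # pairwise attention: N maps of N x N
    return {"quadratic": quadratic, "cubic": cubic}


if __name__ == "__main__":
    for n in (100, 590, 1450):
        sizes = logit_bytes(n)
        print(f"N={n}: quadratic={sizes['quadratic'] / 2**20:.2f} MiB, "
              f"cubic={sizes['cubic'] / 2**30:.2f} GiB")
```

The ratio between the two counts is exactly N, so the gap widens linearly as sequences grow.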

Methodology: MegaFold

MegaFold is a cross-platform system designed to address these challenges through a co-design of memory-efficient kernels, communication strategies, operator fusion, and host-device pipelining. It is implemented on top of an open-source AF3 training framework and supports both NVIDIA and AMD GPUs.

1. EvoFlash-3D: Memory-Efficient 3D Attention

To address cubic memory growth, MegaFold introduces EvoFlash-3D, a Triton-based kernel for 3D attention.

Mechanism: Instead of materializing the full $O(N^3)$ attention map in global memory (HBM), the kernel uses an online, numerically stable softmax strategy. It incrementally computes attention logits within fast scratchpad memory (shared memory) during both forward and backward passes.
Optimization: This reduces peak activation memory complexity from $O(N^3)$ to $O(N^2)$ while preserving numerical correctness. The kernel handles the unique additive broadcasted pair-bias term of AF3, which introduces multi-dimensional data dependencies not present in standard FlashAttention.
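
The online-softmax idea can be illustrated on the CPU with plain Python. This is a minimal sketch of the arithmetic only, not the Triton kernel: the tensor layout (queries and keys indexed as `[i][j]` and `[i][t]`, bias indexed as `[j][t]` and shared across `i`), the function names, and the chunk size are all assumptions made for illustration. The chunked version keeps just a running maximum, a running sum, and an output accumulator per query, so it never holds a full row of logits at once.

```python
import math


def logit(q, k, bias, i, j, t):
    """Scaled dot product of query (i, j) and key (i, t) plus pair bias (j, t)."""
    d = len(q[i][j])
    dot = sum(a * b for a, b in zip(q[i][j], k[i][t]))
    return dot / math.sqrt(d) + bias[j][t]


def naive_attention(q, k, v, bias):
    """Reference: builds all N logits for each (i, j) before the softmax."""
    n, d = len(q), len(q[0][0])
    out = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            logits = [logit(q, k, bias, i, j, t) for t in range(n)]
            peak = max(logits)
            weights = [math.exp(x - peak) for x in logits]
            total = sum(weights)
            out[i][j] = [sum(w * v[i][t][c] for t, w in enumerate(weights)) / total
                         for c in range(d)]
    return out


def chunked_attention(q, k, v, bias, chunk=2):
    """Online softmax: keeps only a running max, sum, and output accumulator."""
    n, d = len(q), len(q[0][0])
    out = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            run_max, run_sum, acc = -math.inf, 0.0, [0.0] * d
            for start in range(0, n, chunk):
                keys = range(start, min(start + chunk, n))
                logits = [logit(q, k, bias, i, j, t) for t in keys]
                new_max = max(run_max, max(logits))
                rescale = math.exp(run_max - new_max)  # 0.0 on the first chunk
                run_sum *= rescale
                acc = [a * rescale for a in acc]
                for t, x in zip(keys, logits):
                    w = math.exp(x - new_max)
                    run_sum += w
                    for c in range(d):
                        acc[c] += w * v[i][t][c]
                run_max = new_max
            out[i][j] = [a / run_sum for a in acc]
    return out
```

Both functions should agree up to floating-point rounding; the difference is that the chunked one only ever materializes `chunk` logits per query at a time.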

2. EvoSP-3D: Communication-Efficient Sharding

To enable training on longer sequences, MegaFold proposes EvoSP-3D, a sharding strategy tailored for 2D pairwise representations.

Mechanism: It utilizes a hybrid Data Parallel (DP) and Sequence Parallel (SP) approach. While DP groups handle different batch inputs, SP groups partition the 2D pairwise tensor ($[N_{token}, N_{token}, h_{pair}]$).
Communication: As AF3 layers alternate the attention axes (e.g., row-wise to col-wise), EvoSP-3D uses all_to_all collectives to rotate the sharding layout of the pairwise tensor between stages. This allows successive attention layers to operate on the correct 2D slices without materializing the full tensor.
Diffusion Stage: In the diffusion decoder, augmented atom coordinates are distributed across SP groups without communication, while the pairwise representation is gathered (all_gather) to allow diffusion to operate on the full context.
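
The layout rotation described under Communication can be simulated in a single process. The sketch below is a toy model under stated assumptions: ranks are plain list entries rather than GPUs, `all_to_all` is a hand-written stand-in for a real collective, the grid holds one value per cell instead of an $h_{pair}$-wide vector, and $N$ is assumed to divide evenly by the number of ranks. All function names are hypothetical.

```python
def shard_rows(grid, world):
    """Give each rank a contiguous band of rows (all columns)."""
    step = len(grid) // world
    return [grid[r * step:(r + 1) * step] for r in range(world)]


def all_to_all(send):
    """Toy collective: send[src][dst] is delivered as recv[dst][src]."""
    world = len(send)
    return [[send[src][dst] for src in range(world)] for dst in range(world)]


def rows_to_cols(row_shards):
    """Rotate a row-sharded layout into a column-sharded one."""
    world = len(row_shards)
    step = len(row_shards[0][0]) // world
    # Each rank cuts its local rows into one column chunk per destination rank.
    send = [[[row[dst * step:(dst + 1) * step] for row in local]
             for dst in range(world)]
            for local in row_shards]
    recv = all_to_all(send)
    # Stacking the received pieces in source-rank order restores global row order.
    return [[row for piece in recv[rank] for row in piece]
            for rank in range(world)]


if __name__ == "__main__":
    grid = [[(i, j) for j in range(4)] for i in range(4)]
    cols = rows_to_cols(shard_rows(grid, world=2))
    for rank, shard in enumerate(cols):
        print(rank, shard)
```

In this toy model each rank holds only $N^2 / P$ cells before and after the exchange, which mirrors the property that no rank needs the full pairwise tensor to switch attention axes.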

3. EvoFusion: Fused Operator Stack

To mitigate kernel fragmentation, MegaFold introduces EvoFusion, which restructures the internal execution of critical operators rather than simply fusing adjacent operators at their boundaries.

Scope: Targets Layer Normalization, Linear Projections, and SwiGLU activations, which collectively account for ~20% of activation memory and kernel invocations.
Mechanism: By restructuring the inner loops of these operators, EvoFusion eliminates redundant intermediate materialization (e.g., avoiding writing a full $N \times N$ tensor to HBM between normalization and linear projection). This reduces shared-memory pressure and kernel launch overhead while maintaining efficient reduction and GeMM schedules.
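
As a rough illustration of what "eliminating intermediate materialization" means, the sketch below contrasts an unfused Layer Normalization followed by a linear projection with a version that normalizes and projects each row in one pass. It is a generic fusion example in plain Python, not the EvoFusion kernels: the function names are hypothetical, and the learnable scale and shift of Layer Normalization are omitted for brevity.

```python
import math


def layer_norm_row(row, eps=1e-5):
    """Normalize one row to zero mean and unit variance (no learnable scale/shift)."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * inv_std for x in row]


def unfused_norm_linear(x, weight):
    """Two steps: the full normalized tensor exists before the projection starts."""
    normalized = [layer_norm_row(row) for row in x]  # full intermediate tensor
    columns = list(zip(*weight))                     # weight is [d_in][d_out]
    return [[sum(a * w for a, w in zip(row, col)) for col in columns]
            for row in normalized]


def fused_norm_linear(x, weight, eps=1e-5):
    """One pass per row: statistics, normalization, and projection together."""
    columns = list(zip(*weight))
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((a - mean) ** 2 for a in row) / len(row)
        inv_std = 1.0 / math.sqrt(var + eps)
        # Normalized values are consumed immediately and never stored as a tensor.
        out.append([sum((a - mean) * inv_std * w for a, w in zip(row, col))
                    for col in columns])
    return out
```

On a GPU the analogous saving is that the normalized tensor is never written to and re-read from HBM between the two operators.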

4. EvoPipe: Determinism-Aware Host-Device Pipeline

To eliminate GPU idle time caused by CPU preprocessing, MegaFold implements EvoPipe.

Mechanism: It decouples deterministic preprocessing steps (e.g., RDKit-based atom construction, MSA feature extraction) from the stochastic training loop. These deterministic features are precomputed and cached offline.
Execution: During training, the system performs lightweight lookups to retrieve cached features, applying only non-deterministic augmentations (e.g., random cropping) on the fly. This ensures consistent GPU utilization by removing variable host-side stalls.
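
The split between cached deterministic features and on-the-fly random augmentation might look like the following sketch. Everything here is a hypothetical stand-in: the `FeatureCache` class, the in-memory dictionary (a real pipeline would more likely persist features on disk), and the `featurize` callable that represents expensive steps such as MSA feature extraction.

```python
import random


class FeatureCache:
    """Stores the result of a deterministic featurizer, keyed by sample ID."""

    def __init__(self, featurize):
        self._featurize = featurize  # hypothetical expensive, deterministic function
        self._store = {}

    def get(self, sample_id):
        if sample_id not in self._store:
            self._store[sample_id] = self._featurize(sample_id)
        return self._store[sample_id]


def precompute(cache, sample_ids):
    """Offline pass: pay the featurization cost once, before training starts."""
    for sample_id in sample_ids:
        cache.get(sample_id)


def random_crop(features, crop_len, rng):
    """Stochastic augmentation that must stay inside the training loop."""
    if len(features) <= crop_len:
        return list(features)
    start = rng.randrange(len(features) - crop_len + 1)
    return features[start:start + crop_len]


def training_batch(cache, sample_ids, crop_len, rng):
    """Cheap lookup plus on-the-fly cropping for each sample."""
    return [random_crop(cache.get(s), crop_len, rng) for s in sample_ids]
```

A typical use would call `precompute` once, then call `training_batch` every iteration with a seeded `random.Random` instance, so that only the lightweight crop runs on the host during training.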

Key Results

The system was evaluated on NVIDIA H200 (141GB) and AMD MI250 (64GB) GPUs using the AlphaFold3-PyTorch framework.

Sequence Length Scalability: MegaFold enables training with up to 3.36× longer sequence lengths on 32 GPUs compared to baseline PyTorch implementations. On 32 AMD GPUs, it supports sequences up to ~1450 residues, whereas baselines fail at ~590 residues.
Execution Time: MegaFold reduces end-to-end per-iteration execution time by up to 1.73× on NVIDIA and 1.62× on AMD GPUs.
Memory Efficiency: Peak GPU memory usage is reduced by up to 1.23× (roughly a 19% reduction, since $1 - 1/1.23 \approx 0.19$), allowing the freed capacity to be spent on longer sequences.
Model FLOPs Utilization (MFU): MegaFold achieves significantly higher MFU (up to 44.9% at 1.81B parameters) compared to baselines, which often hit OOM limits before reaching larger scales.
Cross-Platform Performance: The system demonstrates consistent gains across NVIDIA and AMD hardware, validating the portability of the Triton-based kernel designs.

Significance and Claims

The paper claims that MegaFold is the first end-to-end system to provide integrated support for training next-generation 3D-attention protein models. Its significance lies in demonstrating that:

LLM Optimizations are Insufficient: Standard transformer optimizations (FlashAttention, standard SP, compiler fusion) fail to address the unique cubic scaling and mixed-dimensional dependencies of AF3-style models.
System-Level Co-Design is Essential: Efficient training of these scientific models requires dedicated system support that jointly addresses kernel execution, parallelization, operator fusion, and host-device pipelining.
Scalability Barrier Shift: By addressing the specific bottlenecks of 3D attention, MegaFold shifts the OOM barrier, enabling the training of larger biomolecular complexes that were previously infeasible, thereby facilitating the scaling of emerging scientific workloads on modern GPU platforms.

MegaFold: Efficient Training of Next-Generation 3D Attention Protein Models on Cross-Platform GPUs