Composer: A Search Framework for Hybrid Neural Architecture Design

The paper introduces Composer, a principled search framework that efficiently discovers hybrid neural architectures by searching over small-scale designs and extrapolating the winners to larger scales, yielding models with higher accuracy, lower validation loss, and better efficiency than Llama 3.2.

Bilge Acun, Prasoon Sinha, Newsha Ardalani, Sangmin Bae, Alicia Golden, Chien-Yu Lin, Meghana Madhyastha, Fei Sun, Neeraja J. Yadwadkar, Carole-Jean Wu

Published 2026-03-12

Imagine you are trying to build the perfect recipe for a giant, delicious cake (a Large Language Model, or LLM). For years, bakers have used the same standard recipe: one cup of flour (Attention layers) followed immediately by one cup of sugar (MLP layers), repeated over and over. It works well, but it's a bit boring, and maybe there's a tastier combination out there.

Some recent bakers tried mixing things up—maybe two cups of sugar for every cup of flour, or putting all the flour at the start and all the sugar at the end. These "hybrid" recipes sometimes taste better, but finding the perfect mix is incredibly hard.

Why? Because the number of possible recipes is astronomical. If you have a 32-layer cake and each layer can be either flour or sugar, there are 2³² — over 4 billion — ways to arrange them. Testing them all by baking a full-sized cake for every single attempt would take forever and cost a fortune in ingredients (computing power).
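The arithmetic behind that "over 4 billion" is a one-liner: each of the 32 layers independently picks one of two ingredients, so the count of distinct recipes is 2 to the 32nd power.

```python
# Each of the 32 layers is either Attention ("flour") or MLP ("sugar"),
# so the number of distinct layer orderings is 2**32.
num_layers = 32
num_recipes = 2 ** num_layers
print(num_recipes)  # 4294967296 -- just over 4 billion
```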

Enter Composer, a new "smart kitchen assistant" designed by researchers at Meta and UT Austin. Composer doesn't just guess; it uses a clever, scientific framework to find the best recipe without baking a million cakes.

Here is how Composer works, broken down into simple steps:

1. The "Taste Test" Kitchen (Small-Scale Search)

Instead of baking a massive 8-foot-tall cake to test a recipe, Composer bakes tiny, 2-inch mini-cakes.

  • The Problem: Usually, a tiny cake doesn't taste the same as a big one. If a recipe works for a mini-cake, it might fail for a giant one.
  • The Solution: Composer uses a special "proxy" ingredient called MAD. Think of MAD as a "flavor simulator." It's a synthetic dataset that acts like a super-fast taste test. It tells Composer, "Hey, this mix of flour and sugar has potential," without needing to bake the whole thing.

2. The "Smart Chef" (The Search Engine)

Composer has a chef who uses Bayesian Optimization. Imagine a chef who keeps a notebook of every cake they've ever tried.

  • Instead of randomly mixing ingredients, the chef looks at the notebook, predicts which new mix is most likely to be delicious, and tries that one.
  • They try three different strategies:
    • One-Shot: Propose a complete recipe in one go, bake a small cake, and see how it tastes.
    • Layer-by-Layer: Build the cake one layer at a time, fixing the bottom layers and only changing the top ones.
    • Middle-Out: Fix the top and bottom, and only experiment with the middle layers.
  • The Discovery: The chef found that the standard "1 cup flour, 1 cup sugar" recipe isn't the best. The winner was a 1:2 ratio (one cup of flour, two cups of sugar) arranged in a specific, non-linear pattern.
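To make the "notebook" idea concrete, here is a deliberately tiny toy sketch of surrogate-guided search in the spirit of Bayesian Optimization. Everything in it is hypothetical: the `taste` function stands in for actually training a mini-model, and the nearest-neighbor `predict` surrogate is a stand-in for the real probabilistic model the paper's search engine would use.

```python
import random

random.seed(0)
NUM_LAYERS = 8  # a tiny mini-cake for illustration

def taste(recipe):
    # Hypothetical stand-in for training a small model and scoring it:
    # here we simply reward recipes close to a hidden "best" pattern.
    target = (0, 1, 1, 0, 1, 1, 0, 1)  # roughly a 1:2 flour-to-sugar mix
    return sum(a == b for a, b in zip(recipe, target))

def predict(recipe, notebook):
    # Toy surrogate: rate a candidate by its most similar past recipe
    # (Hamming similarity) -- the "chef's notebook" from the analogy.
    if not notebook:
        return 0.0
    return max(score * sum(a == b for a, b in zip(recipe, past)) / len(recipe)
               for past, score in notebook)

notebook = []  # (recipe, observed score) pairs
for _ in range(40):
    # Propose a handful of random candidates, but only "bake" the one
    # the surrogate predicts will taste best.
    candidates = [tuple(random.randint(0, 1) for _ in range(NUM_LAYERS))
                  for _ in range(16)]
    best = max(candidates, key=lambda r: predict(r, notebook))
    notebook.append((best, taste(best)))

print(max(notebook, key=lambda entry: entry[1]))
```

The key design point survives the simplification: every bake updates the notebook, and every new proposal is filtered through what the notebook already knows, so the search spends its baking budget on promising recipes instead of random ones.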

3. The "Aggregator" (The Committee)

After baking hundreds of mini-cakes, the chef has a list of the top 10 best-tasting ones. But which one is the true winner?

  • Instead of just picking the single best mini-cake (which might have been a lucky fluke), Composer acts like a committee. It looks at the top 10 recipes and asks: "What ingredient appeared most often in the best cakes?"
  • It creates a "super-recipe" by taking the most popular flour layer, the most popular sugar layer, and so on. This smooths out the luck and finds the robust, reliable pattern.
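The committee step above is essentially a per-layer majority vote. The sketch below illustrates it with made-up recipes ("A" for Attention/flour, "M" for MLP/sugar); the real aggregator works over the search's actual top-10 architectures.

```python
from collections import Counter

# Hypothetical top recipes from the small-scale search
# ("A" = Attention/flour, "M" = MLP/sugar).
top_recipes = [
    "AMMAMMAM",
    "AMMAMMMM",
    "MMMAMMAM",
    "AMMAMAAM",
    "AMMMMMAM",
]

# Per-layer majority vote: at each depth, keep whichever ingredient
# appeared most often among the best recipes.
consensus = "".join(
    Counter(layers).most_common(1)[0][0]
    for layers in zip(*top_recipes)
)
print(consensus)  # → AMMAMMAM
```

Because each layer is voted on independently, a single lucky outlier recipe cannot drag the final design away from the pattern the whole top-10 agrees on.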

4. The "Giant Baker" (The Extrapolator)

Now that they have the perfect mini-recipe, they need to bake the giant 8-foot cake.

  • Stretching: Imagine taking the mini-recipe and stretching it out like taffy. If the mini-cake had a pattern of "Flour-Sugar-Sugar," the big cake keeps that same ordering, but each section is widened with extra layers to fill the larger size.
  • Stacking: Imagine taking the mini-cake and stacking 10 of them on top of each other to make a tower.
  • Composer found that stretching worked best for finding creative new patterns, while stacking was great for consistency.
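The two scaling modes can be sketched in a few lines. This is an illustrative simplification, not the paper's exact extrapolation procedure: `stack` tiles the whole mini-recipe, while `stretch` widens each section of it to fill the target depth.

```python
def stack(pattern, copies):
    # Stacking: repeat the whole mini-recipe end to end, like a tower.
    return pattern * copies

def stretch(pattern, target_len):
    # Stretching: spread the mini-recipe across the larger depth,
    # duplicating each layer roughly evenly (like pulling taffy).
    return "".join(pattern[i * len(pattern) // target_len]
                   for i in range(target_len))

mini = "AMM"  # Flour-Sugar-Sugar
print(stack(mini, 4))     # AMMAMMAMMAMM
print(stretch(mini, 12))  # AAAAMMMMMMMM
```

Note how the two outputs differ: stacking preserves the fine-grained alternation (good for consistency), while stretching produces long runs of the same ingredient (a genuinely new large-scale pattern), matching the trade-off described above.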

The Results: Why Should You Care?

When the researchers baked the final "Composer" cakes (the new hybrid models) and compared them to the industry standard (Llama 3.2), the results were impressive:

  • Taste: They were smarter. They made fewer mistakes on logic and reasoning tasks (about 2% better on average).
  • Efficiency: They were faster and cheaper to run. Because they used fewer "flour" (Attention) layers, they needed less memory and processed information 1.25 times faster.
  • Cost: They found these better recipes using a tiny fraction of the computing power usually required.

The Big Picture

Before Composer, designing a new AI architecture was like trying to find a needle in a haystack by looking at the whole haystack at once. It was slow, manual, and relied on gut feelings.

Composer is like a high-tech metal detector that scans a tiny patch of the haystack, figures out where the needles are likely to be, and then tells you exactly where to dig for the giant needle. It proves that we don't need to stick to the old, rigid recipes. By mixing and matching ingredients in new ways, we can build AI that is not only smarter but also faster and more efficient.