The Rise and Fall of G in AGI

This paper applies psychometric principal component analysis to LLM benchmark data from 2019–2025. It finds that while a strong "general intelligence" factor initially dominated model performance, its explanatory power is declining as models increasingly specialize in reasoning and outsource tasks to tools, producing a complex hierarchy of capabilities that inverts the ideal of parsimonious mechanisms.

David C. Krakauer

Published 2026-04-14

This is an AI-generated explanation of a preprint that has not been peer-reviewed.

The Big Idea: Is AI Getting "Smarter" or Just "Different"?

Imagine you are a teacher trying to measure the "general intelligence" of a class of students. You give them a test with math problems, history questions, and coding puzzles.

Over time, you notice a pattern: if a student is good at math, they are usually good at history and coding too. In psychology, this is called the "Positive Manifold." It suggests there is one big, underlying engine of intelligence (called g) that powers all these skills. If you boost that engine, the student gets better at everything at once.
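To make the positive manifold concrete, here is a minimal Python sketch (with made-up student scores, not data from the paper): when every test score is one shared ability plus a little test-specific noise, all the pairwise correlations come out positive, which is exactly the pattern a single g factor produces.

```python
# Toy illustration of a positive manifold (made-up numbers, not the paper's data).
# Each row is a student; the columns are math, history, and coding scores.
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=200)                           # one hidden "general ability" per student
noise = rng.normal(scale=0.5, size=(200, 3))       # test-specific quirks
scores = g[:, None] + noise                        # every test = g + noise

print(np.corrcoef(scores, rowvar=False).round(2))  # all pairwise correlations are positive
```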

This paper asks: Does this same pattern exist in Artificial Intelligence (AI)? And if so, is it still true today?

The author, David Krakauer, says: "Yes, it used to. But now, it's changing."


Part 1: The "Rise" – The Era of the Giant Engine (2019–2024)

The Analogy: The "Super-Student" who just studies harder.

In the early days of Large Language Models (LLMs), the recipe for success was simple: Make the model bigger and feed it more data.

  • What happened: Every time a new AI was released, it got better at everything. If it got 10% better at math, it also got 10% better at writing poetry and solving logic puzzles.
  • The Result: The AI behaved like a perfect "Super-Student." There was a single "General Intelligence" score (let's call it G) that predicted how well the AI would do on any test.
  • The Metaphor: Imagine a car engine. In this era, everyone just kept adding more cylinders to the engine. The car got faster, and it got faster at every task (driving, drifting, towing) simultaneously. The "General Intelligence" was very strong and dominant.

The Paper's Finding: During this time, the "General Intelligence" factor explained about 90% of the differences between models. It was a one-dimensional world: Bigger = Better at Everything.
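Numbers like "explained about 90% of the variance" come from a principal component analysis: stack models as rows and benchmark scores as columns, then look at how much of the total spread the first component carries. Below is a minimal sketch with simulated scores; the model count, benchmark count, and noise level are all invented for illustration, not taken from the paper.

```python
# One-dimensional era, simulated: every benchmark tracks a single "bigger = better" axis.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
scale = rng.normal(size=50)                       # one overall-capability value per model
scores = np.tile(scale[:, None], (1, 6))          # six benchmarks, all driven by that axis
scores += rng.normal(scale=0.3, size=(50, 6))     # plus modest benchmark-specific noise

pca = PCA().fit(scores)
print(pca.explained_variance_ratio_[0].round(2))  # first component ~0.9 by construction
```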


Part 2: The "Fall" – The Era of Specialization (Late 2024–Present)

The Analogy: The "Tool-Using" Specialist.

Recently, AI labs stopped just making "bigger" models. They started building models that use tools (like calculators, web search, and code interpreters) and models that are specifically trained to reason step-by-step.

  • What happened: The AI landscape got messy. Some models became amazing at deep reasoning (solving hard physics problems) but stopped being as good at simple tasks. Others became great at writing code but less good at general knowledge.
  • The Result: The single "Super-Student" engine cracked. The "General Intelligence" score (G) dropped from explaining 90% of the variance down to about 77% (a toy version of this drop is sketched just after this list).
  • The Metaphor: The car engine didn't just get bigger; it got re-engineered. Now, you have:
    • Race Cars: Incredible at speed (reasoning), but bad at hauling cargo.
    • Trucks: Incredible at hauling (execution), but slow on the track.
    • Hybrids: They use external tools (like a tow truck) to do the heavy lifting.
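A toy version of the fall, extending the simulated setup above (numbers chosen for illustration, not fitted to the paper's data): once models also differ along a second, independent reasoning-versus-execution axis, the first component no longer carries almost everything.

```python
# Specialization era, simulated: a general axis plus an independent "style" axis that
# boosts reasoning-style benchmarks while hurting execution-style ones.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
scale = rng.normal(size=50)                # general capability per model
style = rng.normal(size=50)                # reasoning-vs-execution specialization per model
reasoning = scale[:, None] + 0.5 * style[:, None] + rng.normal(scale=0.3, size=(50, 3))
execution = scale[:, None] - 0.5 * style[:, None] + rng.normal(scale=0.3, size=(50, 3))
scores = np.hstack([reasoning, execution])

pca = PCA().fit(scores)
print(pca.explained_variance_ratio_[:2].round(2))  # first share drops well below 0.9;
                                                   # a second component now matters
```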

The paper calls this the "Rise and Fall of G." The "Fall" isn't that AI is getting dumber; it's that AI is becoming more diverse. The single "General Intelligence" factor is no longer the whole story.


Part 3: The "Hedgehog" vs. The "Fox"

The paper uses a famous metaphor from the philosopher Isaiah Berlin to describe what's happening inside the AI:

  • The Hedgehog (The Old AI): "Knows one big thing." In the early days, the AI was a Hedgehog. It had one giant brain (General Intelligence) that did everything.
  • The Fox (The New AI): "Knows many things." The new AI is a Fox. It is actually a collection of many different specialized skills working together.

The Twist: The paper argues that the "General Intelligence" we see today is actually a mask. It hides the fact that the AI is becoming a "Society of Minds."

  • When you remove the "General" part of the score, you see that Reasoning and Execution (doing things) are actually fighting each other.
  • If an AI gets really good at deep reasoning, it often gets slightly worse at simple execution, and vice versa. They are trading off against each other, as the sketch below illustrates.
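That tradeoff is visible in the second component's loadings. Reusing the same simulated setup (illustrative only, not the paper's factor loadings): after the general component, the benchmarks split into two groups with opposite signs, so moving along this axis helps one group and hurts the other.

```python
# The residual structure, simulated: the second principal component is bipolar,
# loading with one sign on the reasoning benchmarks and the opposite sign on the
# execution benchmarks (the overall sign of a component is arbitrary in PCA).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
scale, style = rng.normal(size=50), rng.normal(size=50)
reasoning = scale[:, None] + 0.5 * style[:, None] + rng.normal(scale=0.3, size=(50, 3))
execution = scale[:, None] - 0.5 * style[:, None] + rng.normal(scale=0.3, size=(50, 3))
scores = np.hstack([reasoning, execution])

pc2 = PCA(n_components=2).fit(scores).components_[1]
print(pc2.round(2))   # roughly [+, +, +, -, -, -]: reasoning and execution trade off
```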

Part 4: The "Ptolemaic Succession" – Why We Keep Adding More Tests

The Analogy: The Geocentric Universe.

In ancient astronomy, people thought the Earth was the center of the universe. When planets moved in weird ways, they didn't change the theory; they just added a new circle (an "epicycle") to explain the movement. The model got more and more complicated, but it never found a simple law.

The paper argues AI benchmarking is doing the same thing:

  1. AI learns to use a tool (like a calculator).
  2. We create a new test to measure that.
  3. AI learns to use a web browser.
  4. We create another test.

We are just adding more "epicycles" (more tests) to keep up with the AI. The paper suggests we need a Newtonian approach: Find the simple, underlying laws of how these tools and models work together, rather than just testing them on isolated tasks.

The Final Takeaway: Intelligence is a Team Sport

The most important conclusion of the paper is this:

Intelligence is no longer just about the "brain" (the model); it's about the "brain + the tools."

  • Old View: Intelligence is a property of the AI itself.
  • New View: Intelligence is a property of the AI + its tools (calculators, search engines, code).

Just as a human isn't "smart" because of their brain alone, but because they have access to libraries, the internet, and writing, AI is becoming smart because it can use external tools.

In short: The era of the "One-Size-Fits-All" Super-Intelligence is fading. We are entering an era of specialized, tool-using intelligences that are more complex, more diverse, and perhaps more "human" in their ability to outsource work to tools. The "General Intelligence" score is dropping not because AI is failing, but because it is finally learning to be a Fox instead of a Hedgehog.
