Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

This paper introduces "intelligence per watt" (IPW) as a metric for evaluating the efficiency of local AI inference. Through a large-scale empirical study, the authors show that small language models running on local accelerators can accurately handle the majority of real-world queries while offering significantly better energy efficiency than cloud-based alternatives, validating local inference as a viable strategy for redistributing demand away from centralized infrastructure.

Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya, Adrian Gamarra Lafuente, Medhya Goel, Rebecca Joseph, Shlok Natarajan, Etash Kumar Guha, Shang Zhu, Ben Athiwaratkun, John Hennessy, Azalia Mirhoseini, Christopher Ré

Published 2026-02-27

Imagine the world of Artificial Intelligence as a massive, bustling city. Right now, almost everyone who needs a smart assistant (like a chatbot or a reasoning engine) has to travel to the city center, the Cloud. This is a giant, super-powered data center where the "super-brains" (the biggest, most expensive AI models) live.

The problem? The city center is getting overcrowded. Millions of people are trying to get in at once, the traffic is jammed, and the electricity bill for keeping the lights on in that giant city is becoming astronomical.

This paper asks a simple question: Can we stop everyone from going to the city center? Can we let people solve their problems right in their own neighborhoods (on their laptops and phones)?

To answer this, the researchers invented a new way to measure success called "Intelligence Per Watt."

The New Scorecard: Intelligence Per Watt (IPW)

Think of AI like a car.

  • Old Scorecard: "How fast can this car go?" (This is just about raw power and accuracy).
  • New Scorecard (IPW): "How many miles can this car drive on one gallon of gas?"

In the AI world, "miles" is how smart the answer is, and "gallons" is how much electricity it takes to get that answer.

The researchers wanted to know: Can a small, local AI (running on your laptop) give you a good answer without eating up all your battery?
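The core idea reduces to a simple ratio: useful output (e.g. answer accuracy) divided by the energy spent producing it. The sketch below is purely illustrative; the function name, accuracy scores, and energy figures are made-up numbers, not values from the paper, and the paper's actual metric may be defined over different units.

```python
# Illustrative sketch of the "intelligence per watt" ratio:
# correctness delivered per unit of energy consumed.
# All names and numbers here are hypothetical.

def intelligence_per_watt(accuracy: float, energy_joules: float) -> float:
    """Higher is better: more correct answers per joule of energy."""
    if energy_joules <= 0:
        raise ValueError("energy must be positive")
    return accuracy / energy_joules

# Hypothetical comparison on the same query:
# the cloud model is slightly more accurate but uses far more energy.
local_ipw = intelligence_per_watt(accuracy=0.85, energy_joules=50.0)
cloud_ipw = intelligence_per_watt(accuracy=0.90, energy_joules=500.0)

print(local_ipw > cloud_ipw)  # prints True: local wins on efficiency
```

Note the framing: a model can lose slightly on raw accuracy yet win decisively on efficiency, which is exactly the trade-off the paper's scorecard is designed to surface.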

The Big Experiment

The team didn't just guess; they ran a massive test drive.

  • The Cars: They tested over 20 different "local" AI models (smaller, lighter brains) and 8 different types of computer chips (from Apple's M4 Max to powerful NVIDIA servers).
  • The Road Trip: They fed these models 1 million real-world questions people actually ask, ranging from "Write me a poem" to "Solve this complex physics problem."

What They Discovered (The Plot Twist)

Here are the three main takeaways, explained simply:

1. The Neighborhoods Are Ready (88.7% Success Rate)

For a long time, we thought you needed a supercomputer to get a good answer. The study found that 88.7% of the time, a small AI running on your own laptop can answer a question just as well as the giant cloud supercomputer.

  • The Analogy: It's like realizing that for 9 out of 10 trips (going to the grocery store, picking up kids, checking the weather), you don't need a massive semi-truck. A small, efficient sedan (your laptop) gets the job done perfectly.
  • The Catch: The "sedan" still struggles with the really heavy, complex jobs (like advanced engineering or deep scientific research). For those, you still need the "semi-truck" in the cloud.

2. The Engines Are Getting Amazingly Efficient (5.3x Improvement)

Between 2023 and 2025, things got a lot better, very fast.

  • The Analogy: Imagine if, in just two years, your car suddenly got 5 times more efficient. It could drive 5 miles on the same amount of gas it used to drive 1 mile.
  • Why? Two things happened:
    1. Smarter Brains: The AI models got better at learning (algorithmic advances).
    2. Better Engines: The computer chips (like Apple's M4) got much better at doing math without burning energy.
  • The result? Local AI is becoming a viable alternative to the cloud for a huge chunk of daily tasks.

3. The "Smart Dispatcher" Saves the Day

The researchers proposed a system where a "Smart Dispatcher" looks at your question and decides: "Is this a simple question? Send it to the laptop. Is this a hard question? Send it to the cloud."

  • The Analogy: Imagine a traffic controller who directs 80% of the cars to local roads and only sends the heavy trucks to the highway.
  • The Result: If we do this, we could save 60% to 80% of the energy, computing power, and money currently wasted by sending everything to the cloud. Even if the dispatcher makes a few mistakes (sending a simple question to the cloud), the savings are still massive.
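The dispatcher idea can be sketched as a threshold router: score each query's difficulty, then send easy ones to the local model and hard ones to the cloud. The keyword heuristic below is a toy stand-in; the paper's actual router is not described here, and a real system would use a learned difficulty classifier.

```python
# Minimal sketch of the "smart dispatcher": route each query to a small
# local model or a large cloud model based on estimated difficulty.
# The scorer is a toy heuristic, purely for illustration.

def estimate_difficulty(query: str) -> float:
    """Toy stand-in for a learned difficulty score in [0, 1]."""
    hard_words = {"prove", "derive", "theorem", "physics"}
    words = query.lower().split()
    return min(1.0, sum(w in hard_words for w in words) / 2)

def route(query: str, threshold: float = 0.5) -> str:
    """Return 'cloud' for hard queries, 'local' for everything else."""
    return "cloud" if estimate_difficulty(query) >= threshold else "local"

print(route("Write me a poem about autumn"))              # prints local
print(route("Derive the theorem from first principles"))  # prints cloud
```

The design point worth noting: routing errs cheaply. Misrouting an easy query to the cloud wastes some energy but never hurts answer quality, which is why the paper can claim large savings even with an imperfect dispatcher.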

Why This Matters

This paper is a wake-up call. It tells us that the future of AI isn't just about building bigger, hungrier data centers. It's about distributing the work.

  • For You: Your laptop might soon be smart enough to handle your daily AI needs without needing an internet connection or draining your battery.
  • For the Planet: By moving work from giant, energy-hungry data centers to our personal devices, we can drastically cut down on electricity usage and carbon emissions.
  • For the Industry: It proves that we don't need to choose between "Smart" and "Efficient." We can have both, if we measure "Intelligence Per Watt" instead of just raw power.

In short: The "Super-Brain" in the cloud is still the king for the hardest problems, but the "Smart Assistant" on your desk is now strong and efficient enough to handle almost everything else. And that's a huge win for everyone.
