Menu Pricing of Large Language Models

This paper develops a theoretical framework demonstrating that optimal pricing for Large Language Models reduces to one-dimensional screening via committed-spend token budgets priced at marginal cost, a mechanism that aligns with observed industry practices and adapts to competitive dynamics between proprietary and open-source models.

Dirk Bergemann, Alessandro Bonatti, Alex Smolin

Published 2026-03-10

Imagine you are running a massive, high-tech bakery. But instead of selling loaves of bread, you sell "Brain Power" (Large Language Models, or LLMs).

Your customers are businesses and individuals who want to use your brain power for all sorts of things: writing emails, coding software, analyzing legal documents, or generating art. The problem is, every customer is different. Some need a tiny spark of intelligence for a quick task; others need a roaring fire of computation for a massive project. And you don't know exactly what they are going to do with your power until they start using it.

This paper by Bergemann, Bonatti, and Smolin is a guide on how to price your "Brain Power" so you make the most money without scaring customers away.

Here is the breakdown using simple analogies:

1. The Problem: A Messy Kitchen

Usually, pricing is hard because customers are complicated.

  • The Mystery: You don't know if a customer is a "light user" (just checking the weather) or a "heavy user" (training a robot army).
  • The Hidden Action: Even if you sell them a "token bucket" (a bucket of brain power), you can't see how they pour it. Do they use it on easy tasks or hard ones?
  • The Math Nightmare: In theory, this is a math problem with infinite dimensions. It's like trying to price a car when you don't know if the buyer wants to drive it to the grocery store or race it in the Indy 500, and you can't see which gears they shift into.

2. The Big Discovery: The "Magic Summary"

The authors found a magic trick. Because the way LLMs work is consistent (mathematically "homogeneous"), you don't need to know the details of every single task a customer does.

The Analogy: Imagine every customer has a "Brain Power Score."

  • One customer might do 1,000 easy tasks.
  • Another might do 10 very hard tasks.
  • If their total value is the same, the two customers look identical to you as a seller.

You can ignore the messy details and just look at this single Score. This turns a terrifying, infinite-dimensional math problem into a simple, one-dimensional problem: "How much does this customer's Score tell us they are willing to pay?"
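The "one score is enough" idea can be sketched as a toy calculation. Everything below (the cent values, the `score` and `buys` helpers) is our own illustration of the reduction, not the paper's notation:

```python
# Toy illustration: a customer's purchase decision depends only on a
# one-dimensional score, not on the mix of tasks behind it.
# All values are in cents and are invented for this example.

def score(customer):
    """One-dimensional summary: total value of all planned tasks."""
    return sum(count * value for count, value in customer)

def buys(customer, flat_fee):
    """Under homogeneity, the decision hinges on the score alone."""
    return score(customer) >= flat_fee

light = [(1000, 1)]            # 1,000 easy tasks worth 1 cent each
heavy = [(10, 100)]            # 10 hard tasks worth $1.00 each
mixed = [(500, 1), (5, 100)]   # a different mix with the same total value

# All three customers have score 1000, so at any flat fee they make
# identical choices, even though their task portfolios differ.
```

The seller never needs to see the portfolio; pricing the budget against the score alone is what collapses the screening problem to one dimension.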

3. The Solution: The "Spending Pass" (Token Budgets)

So, how do you sell this? You don't sell "100 words of text." You sell a Spending Pass.

  • The Mechanism: You give the customer a budget of "credits" (tokens).
  • The Price: You charge them a flat fee upfront for the budget.
  • The Twist: Inside that budget, they can spend the credits however they want.
    • Using a "smart" model (like a Ferrari) costs more credits per mile.
    • Using a "basic" model (like a sedan) costs fewer credits per mile.
    • They can mix and match.

Why this works: It's like giving a customer a prepaid card for a theme park. You charge them $100 for the card. They can ride the gentle carousel 50 times or the scary rollercoaster 5 times. You don't care what they ride; you just know that the person with the "high score" (the thrill-seeker) will buy the big card, and the "low score" person will buy the small one.
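The mechanics of the Spending Pass can be sketched in a few lines. The credit costs below are invented for illustration; the point is only that one prepaid budget covers any mix of cheap and expensive requests:

```python
# A minimal sketch of spending a prepaid credit budget across two models.
# The per-request credit costs are assumptions, not real prices.
CREDIT_COST = {"smart": 10, "basic": 1}   # credits per request on each model

def spend(budget, requests):
    """Serve requests in order until the committed budget runs out."""
    served = []
    for model in requests:
        cost = CREDIT_COST[model]
        if cost > budget:
            break                # can't afford the next request: stop
        budget -= cost
        served.append(model)
    return served, budget

served, remaining = spend(25, ["basic", "smart", "basic", "smart", "smart"])
```

With a budget of 25 credits, the customer above completes four requests and is stopped on the fifth; a different customer could spend the same 25 credits on 25 basic requests instead. The seller only set the budget and the credit prices.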

4. The Three Real-World Strategies

The paper shows that the big AI companies (like OpenAI and Anthropic) are already doing this, but in slightly different ways:

A. The "Anthropic" Style (Quantity Only)

  • The Setup: Everyone gets access to the same models (Ferraris and Sedans).
  • The Difference: The "Pro" plan gives you a bigger bucket of tokens. The "Max" plan gives you a giant bucket.
  • The Logic: You are screening based on how much they want to use, not which model they use.

B. The "OpenAI" Style (Quality + Quantity)

  • The Setup: The "Free" plan gets the basic sedan. The "Plus" plan gets the sedan + a few rides on the Ferrari. The "Pro" plan gets unlimited access to the Ferrari.
  • The Logic: You are screening based on both how much they use and how smart the model needs to be. High-value customers get the best tools and more of them.
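A stylized version of this two-dimensional menu makes the screening visible. The tier names echo the text, but every fee, allowance, and per-query value below is an assumption for illustration, not an actual OpenAI price (amounts in cents):

```python
# Stylized "quality + quantity" menu. Fees and allowances are invented.
MENU = {
    "Free": {"fee": 0,     "basic": 50,  "smart": 0},
    "Plus": {"fee": 2000,  "basic": 500, "smart": 50},
    "Pro":  {"fee": 20000, "basic": 500, "smart": float("inf")},
}

def best_tier(want_basic, want_smart, v_basic=5, v_smart=50):
    """Each customer picks the tier maximizing surplus: the value of the
    usage the tier allows, minus the flat fee."""
    def surplus(tier):
        plan = MENU[tier]
        value = (v_basic * min(want_basic, plan["basic"])
                 + v_smart * min(want_smart, plan["smart"]))
        return value - plan["fee"]
    return max(MENU, key=surplus)
```

Customers self-select: a light user stays on Free, a moderate user who wants some smart-model access picks Plus, and a heavy smart-model user pays for Pro, without the seller ever observing their tasks.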

C. The "GitHub/Quora" Style (The Aggregator)

  • The Setup: These companies don't make their own models; they resell others'. They sell a "Credit Pack."
  • The Logic:
    • Poe (Quora): You buy 12 million points. If you run out, you stop. (Maximum Spend).
    • GitHub Copilot: You buy a plan with a limit, but if you go over, you can keep going at a higher price. (Minimum Spend).
    • Both are just different ways of managing that "Spending Pass."
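The two aggregator variants differ only in what happens at the budget boundary. A sketch, with all numbers made up:

```python
# Two budget variants from the text, with invented numbers.

def maximum_spend_bill(usage, cap, fee):
    """Poe-style: flat fee, usage hard-capped at the credit bucket."""
    return fee, min(usage, cap)

def minimum_spend_bill(usage, cap, fee, overage_rate):
    """Copilot-style: flat fee covers the cap; extra usage is billed
    per unit at a higher rate."""
    overage = max(0, usage - cap)
    return fee + overage * overage_rate, usage
```

Under the cap, the two plans are identical; over it, one cuts the customer off while the other keeps serving at a marked-up per-unit price.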

5. The "API" Exception: The Grocery Store

There is one place where this fancy pricing doesn't happen: Developer APIs (where programmers build apps).

  • The Price: It's strictly linear. $0.0001 per token. No bundles, no discounts.
  • Why? The market is too competitive. If OpenAI tried to sell a complex "Spending Pass" to a developer, that developer would simply switch to Google or Anthropic, which offer a simple "pay-as-you-go" price.
  • The Paper's Take: This is actually the "efficient" price. It's like a grocery store selling flour by the pound. No tricks, just the cost of the flour plus a tiny bit of profit.
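Linear pricing has a simple signature: the bill is rate times usage, full stop. The rate below is the article's illustrative figure, not any provider's actual price:

```python
# Linear (pay-as-you-go) API pricing: no bundles, no volume discounts.
RATE_PER_TOKEN = 0.0001  # dollars per token (illustrative rate)

def api_bill(tokens):
    """The bill scales exactly with usage."""
    return tokens * RATE_PER_TOKEN

# Doubling usage exactly doubles the bill -- the hallmark of a linear
# price, and the opposite of a flat-fee Spending Pass.
```

Contrast this with the consumer plans above: there, a flat fee buys a budget, so the marginal token inside the budget is free. Here, every token costs the same, which is what "efficient, flour-by-the-pound" pricing means.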

6. The Competition Twist

What happens if a "Big Guy" (Proprietary Leader) competes with a "Small Guy" (Open Source)?

  • The Small Guy: Sells tokens at the bare minimum cost (marginal cost).
  • The Big Guy: Has to be clever.
    • Low-end customers: They just buy from the Small Guy. The Big Guy ignores them.
    • Middle customers: The Big Guy sells them just enough tokens so they don't feel the need to buy more from the Small Guy. It's a "deterrence" zone.
    • High-end customers: The Big Guy sells them a premium package, ignoring the Small Guy because these customers want the Big Guy's special features anyway.

The Bottom Line

The paper argues that the confusing pricing we see today (subscriptions, token limits, credit systems) isn't random. It is, in fact, the profit-maximizing design that the theory predicts for selling AI.

By selling budgets instead of specific tasks, companies can:

  1. Hide the complexity of the technology from the user.
  2. Automatically sort customers by how much they are willing to pay.
  3. Make the most profit while keeping the system efficient.

It turns out, the best way to sell "Brain Power" is to sell a wallet of credits and let the customer decide how to spend it.