Bayesian Hierarchical Models and the Maximum Entropy Principle

This paper demonstrates that when the conditional prior given the hyperparameters in a Bayesian hierarchical model is a canonical maximum entropy distribution, the resulting dependent marginal prior also possesses a maximum entropy property, but with respect to a constraint on the marginal distribution of a function of the unknown parameters. This clarifies the implicit information assumptions of such models.

Brendon J. Brewer

Published Thu, 12 Ma

Here is an explanation of the paper using simple language and creative analogies.

The Big Idea: The "Hidden Chef" of Statistics

Imagine you are a chef trying to create a recipe for a giant pot of soup (the data). You have $n$ ingredients ($x_1, x_2, \dots, x_n$), but you don't know exactly what they are yet. You need to decide on a "prior belief"—a starting guess about what the soup might taste like before you actually taste it.

Usually, statisticians use a Maximum Entropy rule. Think of this as the "Principle of Least Assumption." It says: "Don't guess anything you don't have to. Just spread your possibilities as evenly as possible, unless you have a specific rule to follow."

For example, if you know the average temperature of the soup must be 50°C, the "Maximum Entropy" rule gives you a very specific, standard recipe (called a canonical distribution) that fits that rule perfectly.
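To make that "standard recipe" concrete, here is a minimal numerical sketch (the discrete temperature grid and the 0–80 range are my illustrative choices, not from the paper): maximizing entropy subject to a mean constraint yields the canonical exponential-family form, with a Lagrange multiplier tuned so the constraint is satisfied.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical discrete grid of soup temperatures (illustrative only)
temps = np.linspace(0.0, 80.0, 81)

def canonical(lam):
    # MaxEnt under a mean constraint has the canonical form p_i ∝ exp(lam * x_i)
    w = np.exp(lam * (temps - temps.mean()))  # centred for numerical stability
    return w / w.sum()

def mean_gap(lam):
    # How far the canonical distribution's mean is from the 50 °C target
    return canonical(lam) @ temps - 50.0

# Tune the Lagrange multiplier until the constraint E[x] = 50 holds
lam = brentq(mean_gap, -1.0, 1.0)
p = canonical(lam)
```

Note that with only a mean constraint the solution is exponentially tilted rather than bell-shaped; constraining the variance as well is what produces the Gaussian shape on the real line.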

The Problem:
Sometimes, we don't actually know the exact average temperature. We just have a feeling about it. Maybe we think the average is likely around 50, but it could be 40 or 60.

In the past, statisticians would say: "Okay, let's pretend we do know the average is 50. We'll use that standard recipe, but we'll add a 'secret sauce' (a hyperparameter) that we don't know the exact amount of. We'll guess a range for that secret sauce and mix it all together."

This creates a Hierarchical Model. It's like saying, "I don't know the exact temperature, so I'll assume the temperature is a random variable drawn from a distribution of temperatures."

The Confusion:
The author, Brendon Brewer, noticed something weird. When you mix all those different "standard recipes" together (because the temperature is unknown), the final result is a messy "mixture." It doesn't look like a clean, standard Maximum Entropy recipe anymore.

People thought: "Oh no! We lost the Maximum Entropy principle. This messy mixture isn't the most 'ignorant' or 'fair' guess anymore."

The Discovery:
Brewer says: "Wait a minute! You haven't lost the principle. You just changed the rule."

He proves that this messy mixture IS actually a Maximum Entropy distribution. It just isn't following the rule about the average temperature. Instead, it is following a rule about the shape of the distribution of the average temperature itself.


The Analogy: The "Indirect" Rule

Let's use a metaphor to make this crystal clear.

Scenario A: The Direct Rule (Standard MaxEnt)

Imagine you are a teacher grading a class of 100 students.

  • The Rule: "The class average must be exactly 75."
  • The Result: You assign grades to every student in a very specific, standard way (a bell curve) to ensure that average is hit. This is the "Canonical Distribution."

Scenario B: The Indirect Rule (The Hierarchical Model)

Now, imagine you don't know what the class average should be. You only know that the "ideal average" is a mystery.

  • The Setup: You decide that the "ideal average" could be anywhere between 60 and 90, and you have a hunch about how likely each number is (maybe 75 is most likely, but 60 is possible).
  • The Process: You imagine a thousand different classrooms, each with a different fixed average (60, 61, 62... 90). For each classroom, you assign grades using the "Standard Rule" from Scenario A. Then, you mix all those classrooms together into one giant pile of grades.
  • The Result: The final pile of grades looks messy. It's not a perfect bell curve.

The Old View: "This messy pile is bad. It's not a clean Maximum Entropy solution."

Brewer's New View: "Actually, this messy pile IS the perfect Maximum Entropy solution, but for a different question."

Instead of asking, "What is the distribution of grades if the average is 75?"
You are now asking, "What is the distribution of grades if I have a specific uncertainty about what the average is?"

Brewer shows that the messy pile is the most "honest" (maximum entropy) way to distribute grades given that you have a specific belief about how the class average behaves.

The "Indicator" Trick (How it works mathematically)

The paper uses a clever mathematical trick to prove this.

Imagine you have a function $T$ (like the class average).

  1. Standard MaxEnt says: "I know the average is 75." So, I adjust the grades to hit that target.
  2. Hierarchical MaxEnt says: "I don't know the average, but I know the probability of the average being 60, 70, or 80."

Brewer proves that if you take the "Standard" recipe and mix it according to your beliefs about the average, the result is mathematically identical to taking the original "ignorant" recipe and applying a new rule: "Make sure the final grades look like they came from a specific distribution of averages."
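This identity can be sanity-checked numerically on a toy discrete example (the state space, the function T, and the hyperprior below are all invented for illustration): mixing canonical distributions whose sufficient statistic is T gives exactly the maximum entropy distribution subject to a constraint on the marginal distribution of T, i.e. the distribution that spreads each level-set's probability as evenly as possible.

```python
import numpy as np

# Toy state space where T takes repeated values (level sets)
T = np.array([0, 0, 1, 1, 1, 2])  # a function of the state

def canonical(theta):
    # Canonical family with sufficient statistic T: p(x) ∝ exp(theta * T(x))
    w = np.exp(theta * T)
    return w / w.sum()

# Mix the canonical distributions according to a hyperprior on theta
thetas = np.array([-1.0, 0.0, 1.0])
hyper = np.array([0.25, 0.5, 0.25])
mixture = sum(h * canonical(t) for h, t in zip(hyper, thetas))

# Marginal distribution of T induced by the mixture
g = np.array([mixture[T == t].sum() for t in [0, 1, 2]])

# MaxEnt subject to "the marginal of T equals g": spread g(t)
# uniformly over each level set {x : T(x) = t}
sizes = np.array([(T == t).sum() for t in [0, 1, 2]])
maxent = g[T] / sizes[T]

# The mixture and the constrained MaxEnt solution coincide
assert np.allclose(mixture, maxent)
```

The check works because every canonical component is flat within each level set of T, so the mixture is too; all the mixing does is reshape how much probability each level set receives, which is precisely a constraint on the distribution of T.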

It's like saying:

  • Old Way: "I know the soup is 50°C."
  • New Way: "I don't know the temperature, but I know the distribution of temperatures I'm willing to accept. Therefore, the soup must look like this specific mixture."

Why Does This Matter?

This is a huge deal for two reasons:

  1. It Validates the "Messy" Models: Scientists often use these hierarchical models because they are practical. They are easier to build when you don't know the exact numbers. Before this paper, people worried they were breaking the "Maximum Entropy" rules. Now we know they are not breaking the rules; they are just applying the rules to a different, more complex constraint.
  2. It Explains What We Are Assuming: When you build a hierarchical model, you aren't just guessing numbers. You are explicitly stating: "I believe the underlying parameters (like the average) follow this specific pattern." Brewer's work tells us exactly what information we are putting into the model.

Summary in One Sentence

Brendon Brewer proves that when you use a "hierarchical" model (where you guess the rules for your rules), you aren't abandoning the "Maximum Entropy" principle; you are simply applying it to the uncertainty of the rules themselves rather than the rules directly.

It's like realizing that while you can't predict the exact weather, you can predict the most "fair" distribution of weather patterns if you know how much you trust your weather forecast.