Thermodynamics of Reinforcement Learning Curricula

This paper proposes a geometric framework for reinforcement learning curricula by interpreting reward parameters as coordinates on a task manifold, demonstrating that optimal curricula correspond to geodesics that minimize excess thermodynamic work, and applying this insight to derive the "MEW" algorithm for principled temperature annealing in maximum-entropy RL.

Jacob Adamczyk, Juan Sebastian Rojas, Rahul V. Kulkarni

Published 2026-03-16

Imagine you are teaching a robot to walk. If you just throw it into a complex, rocky forest immediately, it will likely fall over, get frustrated, and learn very slowly. But if you start it on a flat, smooth sidewalk, then move it to a grassy lawn, and finally take it to the rocky forest, it learns much faster. This is called Curriculum Learning: giving the student (the robot) a sequence of tasks that get progressively harder.

However, most people design these curricula by guessing or by using a simple "linear" approach (e.g., "make the rocks 10% harder every day"). This paper argues that this simple approach is wrong because the "space" of learning tasks isn't flat like a sidewalk; it's a bumpy, curved landscape.

Here is the paper's core idea, broken down into simple concepts and analogies:

1. The Landscape of Learning (The "Task Manifold")

Think of every possible version of a task (e.g., a video game with different gravity, different wind speeds, or different reward rules) as a point on a giant, invisible map.

  • The Old Way: We assume this map is flat. If you want to go from "Easy Mode" to "Hard Mode," you just draw a straight line.
  • The New Way: The authors say this map is actually a curved mountain range. Some paths are smooth and easy to walk; others are steep cliffs or muddy swamps. If you try to walk in a straight line across a mountain, you might get stuck in a swamp or slide down a cliff.

2. The "Friction" of Learning

Why is the map curved? Because of friction.
In physics, friction is the resistance you feel when sliding a heavy box across the floor. In this paper, "friction" is the difficulty of adapting.

  • If you change the rules of the game slightly, and the robot can adapt instantly, there is low friction.
  • If you change the rules and the robot has to completely relearn how to move, struggling for a long time before it gets it right, there is high friction.

The authors discovered that this "friction" isn't the same everywhere. It depends on the current state of the robot's brain (its policy). Some directions in the task space are "slippery" (easy to learn), while others are "sticky" (hard to learn).
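This idea of direction-dependent "stickiness" can be sketched as a metric on the task space: the same-sized change costs more along a sticky direction than a slippery one. The matrix values and task dimensions below are hypothetical, purely for illustration:

```python
import numpy as np

# Hypothetical 2-D task space (say, wind speed and gravity). "Friction" is
# a direction-dependent metric: a step of the same size costs more in a
# "sticky" direction than in a "slippery" one.
friction_metric = np.array([[10.0, 0.0],   # sticky direction (hard to adapt)
                            [0.0,  0.5]])  # slippery direction (easy to adapt)

def adaptation_cost(step):
    """Quadratic cost of a small task change, measured by the metric."""
    step = np.asarray(step, dtype=float)
    return float(step @ friction_metric @ step)

sticky = adaptation_cost([0.1, 0.0])    # same step size, different cost...
slippery = adaptation_cost([0.0, 0.1])  # ...depending on direction
assert sticky > slippery
```

The key point is that the cost depends on *where* and *in which direction* you move, which is exactly what makes the landscape curved rather than flat.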

3. The Thermodynamic Connection (The "Heat" Analogy)

The paper uses Thermodynamics (the physics of heat and energy) to solve this.

  • Imagine the robot's learning process is like a gas in a container.
  • Changing the task parameters is like compressing or expanding that gas.
  • If you change the task too fast, the system gets "hot" and chaotic. This creates wasted energy (called "excess work"). In learning terms, this wasted energy is the time the robot spends confused, making mistakes, and unlearning bad habits.
  • The goal is to move from Task A to Task B in a way that generates the least amount of wasted energy.
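In the linear-response picture the paper draws on, excess work grows with the *square* of how fast you change the task, so spreading the same change over more steps wastes less. A minimal numerical sketch (the friction function here is a placeholder, not the paper's):

```python
import numpy as np

def excess_work(schedule, friction, dt):
    """Approximate excess (wasted) work of a protocol lambda(t):
    W_ex ~ sum of friction(lambda) * (d lambda / dt)^2 * dt."""
    lam = np.asarray(schedule, dtype=float)
    rates = np.diff(lam) / dt                 # speed of task change per step
    mid = 0.5 * (lam[:-1] + lam[1:])          # friction evaluated mid-step
    return float(np.sum(friction(mid) * rates**2 * dt))

friction = lambda lam: np.ones_like(lam)      # constant friction, for illustration

fast = np.linspace(0.0, 1.0, 11)     # same change, rushed in 10 steps
slow = np.linspace(0.0, 1.0, 101)    # spread over 100 steps

w_fast = excess_work(fast, friction, dt=1.0)
w_slow = excess_work(slow, friction, dt=1.0)
assert w_slow < w_fast  # the slower protocol wastes less work
```

Because the cost is quadratic in the rate, going ten times slower wastes ten times less work for the same total change.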

4. The Solution: The "Geodesic" Path

In geometry, the shortest path between two points on a curved surface (like the Earth) isn't a straight line on a flat map; it's a curve called a geodesic (like the great-circle route an airplane flies).

  • The Paper's Discovery: The optimal curriculum is a geodesic on this learning map.
  • How it works: The robot should move slowly through the "sticky" parts of the map (high friction) where learning is hard. It should move quickly through the "slippery" parts (low friction) where learning is easy.
  • The Result: By following this curved path, the robot learns faster and more efficiently than if it tried to rush through the hard parts or move in a straight line.
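The "slow through sticky, fast through slippery" pacing can be sketched numerically. This is an illustrative reconstruction, not the paper's implementation: for a one-dimensional task parameter, a minimum-excess-work schedule covers equal "thermodynamic length" per step, so it advances slowly where friction is high:

```python
import numpy as np

def geodesic_schedule(friction, n_steps, grid=1000):
    """Pace a 1-D curriculum lambda in [0, 1] so that each step covers equal
    length in the metric sqrt(friction(lambda)): slow progress where friction
    is high, fast where it is low. A sketch, not the paper's algorithm."""
    lam = np.linspace(0.0, 1.0, grid)
    speed = np.sqrt(friction(lam))            # local cost of moving
    # Cumulative "thermodynamic length" along the path (trapezoid rule).
    length = np.concatenate(
        [[0.0], np.cumsum(0.5 * (speed[1:] + speed[:-1]) * np.diff(lam))])
    targets = np.linspace(0.0, length[-1], n_steps)
    return np.interp(targets, length, lam)    # lambda value at each step

# Hypothetical friction profile: a "sticky swamp" in the middle of the path.
friction = lambda lam: 1.0 + 20.0 * np.exp(-50.0 * (lam - 0.5) ** 2)

schedule = geodesic_schedule(friction, n_steps=11)
steps = np.diff(schedule)
# The smallest steps land where friction peaks, near lambda = 0.5.
assert steps.min() < steps[0] and steps.min() < steps[-1]
```

The schedule still reaches the hard task; it just refuses to sprint through the swamp.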

5. Real-World Application: "MEW" (Minimum Excess Work)

The authors turned this theory into a practical algorithm called MEW. They tested it on a high-dimensional robot (a digital Humanoid) learning to walk.

  • The Problem: Standard methods often lower the "temperature" (a measure of how random/exploratory the robot is) too quickly. This makes the robot stop exploring and get stuck in a bad habit, causing it to fail.
  • The MEW Fix: The algorithm acts like a smart thermostat.
    • If the robot is struggling (high variance in rewards), the algorithm says, "Slow down! Don't change the rules yet. Let the robot settle."
    • If the robot is doing great and stable, the algorithm says, "Great! Let's speed up and make the task harder."
  • The Outcome: The robot learned to walk more stably and efficiently than with standard methods.
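The thermostat behavior described above can be sketched in a few lines. The function name, parameters, and exact update rule here are illustrative assumptions, not the paper's formula; the point is only the qualitative logic of slowing annealing when reward variance is high:

```python
def mew_style_anneal(temperature, reward_variance,
                     base_rate=0.01, variance_scale=1.0, min_temp=0.01):
    """Hedged sketch of a MEW-style 'smart thermostat': cool the temperature
    slowly when reward variance is high (the agent is still struggling) and
    at full speed when it is low (the agent is stable)."""
    rate = base_rate / (1.0 + variance_scale * reward_variance)
    return max(min_temp, temperature - rate)

temp = 1.0
t_unstable = mew_style_anneal(temp, reward_variance=9.0)  # barely cools
t_stable = mew_style_anneal(temp, reward_variance=0.0)    # cools at full rate
assert temp - t_stable > temp - t_unstable
```

Dividing the annealing rate by a variance term is one simple way to encode "let the robot settle before changing the rules"; the paper derives its schedule from the excess-work objective instead.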

Summary Analogy

Imagine you are driving a car from City A to City B.

  • Standard Curriculum: You drive at a constant speed, ignoring that there is a massive traffic jam (high friction) ahead. You get stuck, waste gas, and arrive late.
  • This Paper's Curriculum: You have a GPS that knows the road conditions. It tells you to drive slowly through the traffic jam to avoid wasting fuel, and then speed up on the empty highway. You arrive faster and with less wasted energy.

In a nutshell: This paper proves that learning isn't just about what you teach, but how you pace the teaching. By treating learning as a physical journey across a bumpy landscape, we can find the smoothest, most efficient path for AI to master new skills.
