Learning to Recommend in Unknown Games

This paper establishes that a moderator can learn agents' utility functions in unknown multi-agent games through recommendation feedback, demonstrating that quantal-response models enable precise identification with logarithmic sample complexity while best-response models yield a broader identifiable set, and further providing a low-regret online algorithm for both scenarios.

Arwa Alanqary, Zakaria Baba, Manxi Wu, Alexandre M. Bayen

Published 2026-03-06

Imagine you are a traffic controller in a busy city, but there's a catch: you don't know what the drivers want. You can't ask them, "Do you prefer the scenic route or the highway?" because they might lie, or they might not even know themselves yet.

All you can do is suggest a route (a recommendation) and watch what they actually do.

  • If they take your route, they are "compliant."
  • If they ignore you and take a different road, they are "deviating."

This paper is about how a smart AI (the moderator) can figure out the drivers' hidden preferences just by watching these choices over and over again, even though the drivers are playing a complex game where what one person does affects everyone else.

Here is the breakdown of the paper's ideas using simple analogies:

1. The Two Types of Drivers (Feedback Models)

The paper tests two different ways drivers might react to your suggestions:

  • The "Perfectly Rational" Driver (Best Response):
    Imagine a driver who is a math genius. If you suggest a route, they instantly calculate the perfect route for them based on what they think everyone else will do. If your suggestion isn't the absolute best, they ignore you immediately.

    • The Problem: This driver's feedback is coarse. If you suggest a route and they ignore it, you only learn "this wasn't the best" — you never learn how much worse it was. It's like guessing a number between 1 and 100 and only ever hearing "No." The paper shows that with these drivers, you can never pin down their exact preferences; you can only narrow them to a fuzzy range (what the paper calls the identifiable set).
  • The "Human" Driver (Quantal Response):
    Imagine a driver who is mostly rational but sometimes makes mistakes or takes a risk. If your suggestion is slightly worse than the alternative, they might still take it by accident. If it's much worse, they definitely won't.

    • The Good News: Because their choices are "noisy," their behavior reveals more information. If they accept a slightly worse route, that tells you the gap between the routes is small; if they reject a route outright, the gap is large. The paper proves that with these "human-like" drivers, the AI can pin down their exact preferences after only a logarithmic number of observations (logarithmic sample complexity), which is extremely fast.
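The contrast between the two drivers can be sketched with a toy logit (quantal-response) choice rule, a standard model of noisy rationality. This is an illustrative sketch, not the paper's exact setup; the `rationality` parameter and utility numbers are made up:

```python
import math

def quantal_choice_prob(utilities, action, rationality=1.0):
    """Logit (quantal response) probability of picking `action`.
    Higher `rationality` pushes behavior toward perfect best response."""
    weights = [math.exp(rationality * u) for u in utilities]
    return weights[action] / sum(weights)

def best_response_complies(utilities, action):
    """Perfectly rational driver: complies only if `action` is optimal."""
    return utilities[action] == max(utilities)

# Two routes; the recommended route 0 is slightly worse (utility 1.0 vs 1.2).
utils = [1.0, 1.2]

# The quantal driver still complies fairly often...
p = quantal_choice_prob(utils, action=0, rationality=1.0)

# ...and the compliance rate pins down the exact utility gap, since
# p = 1 / (1 + exp(-rationality * (u0 - u1)))
#   =>  u0 - u1 = log(p / (1 - p)) / rationality
recovered_gap = math.log(p / (1 - p)) / 1.0   # recovers -0.2 exactly

# The best-response driver reveals only one bit: "not optimal."
complies = best_response_complies(utils, action=0)   # False
```

The point of the sketch: the quantal driver's compliance *frequency* is an invertible function of the utility gap, while the best-response driver's yes/no answer only reveals its sign.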

2. The "Game" of Guessing

The drivers aren't just deciding in a vacuum; they are playing a game.

  • Analogy: Imagine a group of friends deciding where to eat. If Alice suggests "Pizza," Bob might say "No, I want Sushi" because he knows Charlie loves Sushi.
  • The AI has to learn not just what Alice likes, but how Alice's choice depends on what she thinks Bob and Charlie will do.
  • The paper shows that by watching who deviates from the suggestion, the AI can map out the entire "preference landscape" of the group, provided the drivers aren't perfectly robotic.

3. The "Regret" (The Scorecard)

How do we know if the AI is doing a good job? We measure Regret.

  • The Metaphor: Imagine the AI is a coach giving play calls to a football team.
    • If the team follows the play and scores, the coach is happy (Zero Regret).
    • If the team ignores the play and runs a different one, the coach feels "Regret" because the team could have scored more if they had listened.
  • The paper designs a smart algorithm that minimizes this regret. It's like a coach who learns from every mistake. Even if the coach doesn't know the players' strengths at first, after a few games, the coach starts calling plays that the team wants to follow, because they've learned what makes the players tick.
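The coach's scorecard can be written down in a few lines. This is a minimal illustration of cumulative regret, not the paper's formal definition; the round values are invented:

```python
def cumulative_regret(best_values, realized_values):
    """Sum of per-round gaps between the best achievable value and the
    value the moderator's recommendation actually achieved."""
    return sum(b - r for b, r in zip(best_values, realized_values))

# Early rounds: agents deviate and outcomes are poor; later rounds:
# the moderator has learned, recommendations are followed, gaps vanish.
best_possible = [10, 10, 10, 10, 10]
achieved      = [ 4,  6,  9, 10, 10]

regret = cumulative_regret(best_possible, achieved)  # 6 + 4 + 1 + 0 + 0 = 11
```

A "low-regret" algorithm is one whose cumulative regret grows slower than the number of rounds, so the *average* per-round gap shrinks toward zero, exactly the "coach who learns from every mistake."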

4. The Secret Weapon: "Cutting the Cake"

How does the AI learn so fast? It uses a mathematical trick called Cutting-Plane Methods.

  • The Analogy: Imagine you are trying to find a hidden treasure inside a giant, foggy box. You don't know where it is.
    • You guess a spot.
    • The treasure tells you, "No, it's not in this half of the box."
    • You cut the box in half and throw away the empty half.
    • You repeat this. With every guess, you cut the search space in half.
  • The paper's algorithm does this, but instead of a box, it's cutting through a complex geometric shape representing all possible driver preferences. Every time a driver deviates, the AI "cuts away" all the theories about what they like that are now proven wrong.
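The "cut away the wrong half" idea can be sketched in one dimension as bisection over a hypothetical hidden preference parameter, where each observed deviation tells us which side of the current guess the truth lies on. (The paper's method cuts a high-dimensional set of utility functions with halfspaces, but the halving logic below is the same; the oracle and parameter here are illustrative.)

```python
def bisect_preference(oracle, lo=0.0, hi=1.0, tol=1e-6):
    """Localize a hidden scalar theta in [lo, hi].
    `oracle(guess)` returns True if theta > guess — the "deviation"
    that tells us which half of the search space to keep.
    Each query halves the interval, so about log2((hi - lo) / tol)
    queries suffice."""
    queries = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if oracle(mid):
            lo = mid      # theta is in the upper half: discard the lower half
        else:
            hi = mid      # theta is in the lower half: discard the upper half
        queries += 1
    return (lo + hi) / 2, queries

hidden_theta = 0.3141  # unknown to the learner
estimate, n = bisect_preference(lambda guess: hidden_theta > guess)
# n is log2(1 / 1e-6), i.e. 20 queries for six-digit accuracy
```

This halving is why the search is logarithmic: doubling the desired precision costs only one more query, not twice as many.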

5. Why This Matters

This isn't just about traffic or football. This is the future of AI on the internet.

  • Online Marketplaces: An AI suggesting prices to sellers.
  • Social Media: An algorithm recommending posts to users.
  • Energy Grids: A system telling factories when to turn off machines.

In all these cases, the AI cannot force people to do things; it can only suggest. This paper gives us the mathematical proof that if people act like humans (making small mistakes), the AI can learn exactly what they want and give them near-perfect suggestions very quickly. But if people act like perfect robots, the AI can only narrow their preferences down to a range, never pin them down exactly — though it can still learn to make suggestions with low regret.

In a nutshell:
The paper teaches us that to teach an AI how to lead a group of strategic people, you need people who are slightly imperfect. Their "mistakes" are actually the clues the AI needs to learn the rules of the game and become a perfect leader.