Evolution of cooperation with Q-learning: the impact of… — Plain-Language Explanation

Original authors: Guozhong Zheng, Zhenwei Ding, Jiqiang Zhang, Shengfeng Deng, Weiran Cai, Li Chen

Published 2026-02-04



Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you and a friend are playing a game where you each have to decide whether to be nice (Cooperate) or look out for yourself at the other's expense (Defect). This is the classic "Prisoner's Dilemma." If you are both nice, you both win a little. If you both look out for yourselves, you both lose a bit. But if one is nice and the other isn't, the "nice" one gets crushed, and the "selfish" one gets a huge reward.

Usually, scientists studying this game assume both players see the world exactly the same way. Either both know what the other person did last time, or both know only what they themselves did.

This paper asks a different question: What happens if the two players see the game differently? What if one player is watching their friend's moves, while the other player is only watching their own?

The researchers used a computer algorithm called "Q-learning" (think of it as a digital student that learns by trial and error, keeping a mental scorecard of what works and what doesn't) to simulate this. They tested three different "vision" setups:

The "You and You" Team: Both players only watch what the other person does.
The "Me and Me" Team: Both players only watch what they themselves do.
The "You and Me" Team (Asymmetric): One player watches the other, while the other player only watches themselves.

Here is what they found, explained simply:

1. The "You and You" Team (Watching the Other)

When both players are only focused on what the other person is doing, the game is a mess. It's like two people trying to dance while staring only at each other's feet; they can't find a rhythm. They keep switching between being nice and being mean, but they never settle into a stable pattern of cooperation. Eventually, they usually give up and just look out for themselves.

2. The "Me and Me" Team (Watching Themselves)

When both players only focus on their own past actions, things are more stable, but they get stuck easily.

The Good: If the temptation to be mean is low, they can get stuck in a "happy loop" where they are both nice forever.
The Bad: If the temptation to be mean is high, they get stuck in a "sad loop" where they are both mean forever.
The Catch: Once they pick a loop (happy or sad), it's very hard to switch. It's like a train that has left the station; it's either going to the destination of "Friendship" or "Betrayal," and it rarely changes tracks once it starts.

3. The "You and Me" Team (The Mixed Vision)

This is where the magic happens. When one player watches the other, and the other watches themselves, the game becomes dynamic and surprisingly effective.

The researchers discovered a complex, three-part story that plays out over time:

Phase 1: The Honeymoon. The two players figure out that being nice works. They start cooperating.
Phase 2: The Breakup. One player (the one watching the other) starts to get greedy. They realize they can get a bigger reward by being mean while the other person is still being nice. They exploit their partner. The nice partner, confused but trying to be good, keeps being nice for a while (tolerance), but eventually gets hurt.
Phase 3: The Rebuild. The nice partner finally snaps. They decide to be mean too, just to teach the greedy partner a lesson. This "punishment" hurts the greedy player, who then realizes, "Hey, being mean isn't working anymore." The greedy player switches back to being nice. The cycle resets, and they build a stronger, more resilient cooperation than before.

The Big Takeaway

The most surprising finding is that this mixed vision (Asymmetric) setup actually leads to faster and stronger cooperation than the setups where everyone sees the world the same way.

Think of it like a relationship:

If you and your partner both only look at your own feelings, you might get stuck in a rut.
If you both only stare at each other, you might get anxious and unstable.
But if one of you is focused on the relationship (watching the other) and the other is focused on their own growth (watching themselves), you create a dynamic where you can forgive mistakes, learn from them, and build a stronger bond.

The paper concludes that how we perceive information matters more than we thought. The structure of what we know—and who knows what—determines whether we end up in a cycle of betrayal or a stable cycle of cooperation. The "mixed vision" creates a natural rhythm of trust, betrayal, punishment, and forgiveness that mirrors real human behavior, allowing cooperation to survive even when it's difficult.

Technical Summary: Evolution of Cooperation with Q-learning: The Impact of Information Perception

Problem Statement
The emergence and stability of cooperation in social dilemmas, particularly the Prisoner's Dilemma (PD), remain central challenges in evolutionary game theory. While reinforcement learning (RL) has emerged as a powerful paradigm for studying social behavior, existing literature largely assumes that individuals possess symmetric information perception—meaning all agents access identical types of information (e.g., only their own actions, only neighbors' actions, or both) when making decisions. This assumption contrasts with real-world observations where information perception is often asymmetric, shaped by factors such as age, experience, culture, and social status. This study addresses the gap in understanding how asymmetric information perception influences the evolution of cooperation within a two-player RL framework.

Methodology
The authors employ the Q-learning algorithm to model the evolution of cooperation in an iterated two-player Prisoner's Dilemma game. The study defines three distinct information perception schemes to test the impact of information structure:

Scheme I (Symmetric "You + You"): Both players base their state perception on the opponent's action.
Scheme II (Symmetric "Me + Me"): Both players base their state perception on their own action.
Scheme III (Asymmetric "You + Me"): One player perceives the opponent's action, while the other perceives their own action.
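The three schemes differ only in which of last round's two actions each agent treats as its state. A minimal Python sketch of that mapping follows. The names `Scheme` and `perceived_states`, the 0/1 action encoding, and the choice of player 0 as the opponent-watching agent in Scheme III are assumptions of this illustration, not details taken from the paper.

```python
from enum import Enum

C, D = 0, 1  # action encoding (an assumption of this sketch)


class Scheme(Enum):
    YOU_YOU = "I"    # both players perceive the opponent's action
    ME_ME = "II"     # both players perceive their own action
    YOU_ME = "III"   # player 0 perceives the opponent, player 1 perceives itself


def perceived_states(scheme, last_actions):
    """Return (state of player 0, state of player 1) from last round's actions."""
    a0, a1 = last_actions
    if scheme is Scheme.YOU_YOU:
        return a1, a0
    if scheme is Scheme.ME_ME:
        return a0, a1
    # Scheme III: player 0 looks at player 1's action; player 1 looks at its own.
    return a1, a1
```

Under this reading, both agents in Scheme III end up conditioning on the same action (the self-watching player's last move), whereas in Schemes I and II the two states can differ.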

The agents utilize a Q-table to score actions ($C$ or $D$) within specific states. The system evolves through synchronous updates involving exploration (with probability $\epsilon$) and exploitation based on the Q-values. The payoff matrix follows the strong PD version ($T > R > P > S$ and $T+S < 2R$), with the dilemma strength controlled by parameter $b$. The study analyzes time-averaged cooperation preferences, probability density functions (PDFs) of cooperation levels, and the temporal evolution of Q-values to uncover underlying mechanisms.
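The setup described above can be sketched as a short simulation. This is a hedged illustration, not the authors' code. The payoff parametrization ($R = 1$, $P = 0$, $T = 1 + b$, $S = -b$) is one common choice that satisfies the stated inequalities and is assumed here. The learning rate `alpha`, discount factor `gamma`, the default values, random tie-breaking, zero-initialized Q-tables, and measuring cooperation preference as the fraction of rounds played as $C$ are likewise assumptions. The update is the standard Q-learning rule.

```python
import random

C, D = 0, 1  # action (and state) encoding; an assumption of this sketch


def payoffs(b):
    """Payoff table indexed as table[my_action][opponent_action].

    Assumed parametrization: R = 1, P = 0, T = 1 + b, S = -b, which satisfies
    T > R > P > S and T + S < 2R for any b > 0.
    """
    R, P, T, S = 1.0, 0.0, 1.0 + b, -b
    return [[R, S], [T, P]]


def choose(q_row, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit.

    Ties between the two Q-values are broken at random (an assumption).
    """
    if rng.random() < epsilon or q_row[C] == q_row[D]:
        return rng.choice((C, D))
    return C if q_row[C] > q_row[D] else D


def simulate(b, sees_opponent, rounds=10_000, alpha=0.1, gamma=0.9,
             epsilon=0.01, seed=0):
    """Run one iterated game and return each player's fraction of C moves.

    sees_opponent[i] is True if player i's state is the opponent's previous
    action, and False if it is player i's own previous action.
    """
    rng = random.Random(seed)
    table = payoffs(b)
    # One 2x2 Q-table per player, q[i][state][action], started at zero.
    q = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(2)]
    last = [rng.choice((C, D)), rng.choice((C, D))]
    coop_counts = [0, 0]

    def state(i, actions):
        return actions[1 - i] if sees_opponent[i] else actions[i]

    for _ in range(rounds):
        states = [state(i, last) for i in range(2)]
        acts = [choose(q[i][states[i]], epsilon, rng) for i in range(2)]
        rewards = [table[acts[i]][acts[1 - i]] for i in range(2)]
        # Synchronous update: both tables learn from the same round's outcome.
        for i in range(2):
            s, a, s_next = states[i], acts[i], state(i, acts)
            target = rewards[i] + gamma * max(q[i][s_next])
            q[i][s][a] += alpha * (target - q[i][s][a])
            coop_counts[i] += (acts[i] == C)
        last = acts
    return [count / rounds for count in coop_counts]
```

In this sketch, `simulate(0.3, sees_opponent=(True, True))` corresponds to Scheme I, `(False, False)` to Scheme II, and `(True, False)` to Scheme III. A single run with placeholder hyperparameters should not be expected to reproduce the paper's figures; averaging over many seeds and initial conditions would be needed for any comparison.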

Key Results
The study reveals that information structure fundamentally alters the evolutionary dynamics of cooperation:

Scheme I (Opponent-focused): Cooperation is highly unstable. Even at low dilemma strengths, the system tends to evolve toward mutual defection. The PDF of cooperation preference exhibits a trimodal distribution, indicating a lack of stable cooperative states.
Scheme II (Self-focused): The system exhibits bistability and a first-order-like phase transition. Depending on initial conditions, the system converges to either mutual cooperation or mutual defection. Once a stable state is reached, it is generally maintained, though the region of cooperation shrinks as dilemma strength increases.
Scheme III (Asymmetric): This scenario yields the most complex and robust dynamics. While it also displays bistability, it is characterized by a unique "bounce" between full cooperation and full defection. Notably, Scheme III achieves the highest cooperation preference with the shortest convergence time of the three schemes, particularly at moderate dilemma strengths ($b \approx 0.3$).

Mechanistic Analysis
Through a detailed analysis of the Q-value evolution in the asymmetric scenario (Scheme III), the authors identify a cyclical process comprising three stages:

Emergence: Cooperation emerges through a cycle of exploitation and tolerance. One player (the "Me" agent) initially tolerates the other's defection, allowing mutual cooperation to form via positive feedback.
Breakdown: The tolerance is eventually eroded by repeated exploitation. The "Me" agent switches to defection as a punishment strategy, leading to a collapse into mutual defection.
Reconstruction: Following the collapse, simultaneous cooperative exploration allows the system to escape mutual defection. The roles of exploiter and tolerator reverse, and through a similar cycle of punishment and tolerance, mutual cooperation is reestablished.

This dynamic mirrors psychological shifts in human behavior, where cooperation is not a static state but a process of emergence, breakdown, and reconstruction.

Significance and Claims
The paper claims that information structure is a critical determinant in fostering cooperation. Specifically, it demonstrates that asymmetric information perception can catalyze the emergence of cooperation more rapidly and robustly than symmetric structures. The findings underscore that:

Information Structure Matters: The specific way agents perceive information (the opponent's action vs. their own action) dictates the stability and speed of cooperative evolution.
Complexity in Asymmetry: Asymmetric scenarios introduce rich dynamical behaviors. Alongside bistability, which the self-focused symmetric scheme also shows, they exhibit oscillatory transitions between full cooperation and full defection that are not observed in the symmetric models.
Realism: The observed dynamics of emergence, breakdown, and reconstruction in the asymmetric model align more closely with the complexities of human decision-making and real-world social interactions than previous symmetric models.

The authors conclude that while this work focuses on simplified two-player scenarios, it provides a foundational step toward understanding how diverse information perceptions shape cooperative relationships, suggesting that future research should explore more complex social networks and integrate moral preferences into RL frameworks.
