Influence-Based Reward Modulation for Implicit Communication in Human-Robot Interaction

Here is an explanation of the paper, translated into simple language with some creative analogies.

The Big Idea: The "Silent Dance" of Robots and Humans

Imagine you are walking down a crowded hallway. You see someone coming toward you. Without saying a word, you both subtly shift your hips or tilt your head, and suddenly, you both know exactly who is going to step left and who is going to step right. You pass each other smoothly. No shouting, no hand signals—just a silent, intuitive understanding.

This is implicit communication. It's the "vibe" or the "flow" between people.

This paper asks a big question: Can we teach robots to do this? Not by programming them with complex rules about human psychology, but by teaching them to "feel" the flow of information between them and a human.

The Problem: Robots Are Too Clueless

Currently, for a robot to understand a human, it usually needs a "cheat sheet." It needs to know: "If the human looks left, they probably want to go left." Or it needs to build a complex mathematical model of the human's brain.

The problem is, humans are messy. We don't always follow rules, and no "cheat sheet" can cover every situation. The authors wanted a way for robots to learn to communicate without knowing exactly what the human is thinking and without a pre-written rulebook.

The Solution: The "Influence Score" (Transfer Entropy)

The authors came up with a clever trick using a math concept called Transfer Entropy.

Think of Transfer Entropy as a "Silent Influence Score."

High Score: It means "My actions are strongly affecting what you do next." (e.g., I step left, and you immediately step right because of me).
Low Score: It means "We are ignoring each other." (e.g., I step left, and you keep walking straight because you didn't notice me).

The paper proposes adding this "Score" to the robot's reward system. Instead of just getting points for "reaching the goal," the robot gets extra points for making its actions influence the human (or resisting that influence, depending on the situation).

The Two Modes: The "Good Cop" and the "Bad Cop"

The researchers tested two different ways to use this score:

1. The "Good Cop" (Boosting Influence)

Goal: Collaboration and Transparency.
How it works: The robot is rewarded for making its moves so clear that the human can't help but react to them.

The Analogy: Imagine a dance partner who moves so clearly that you instinctively know where to step. They aren't forcing you; they are just so legible that you fall into sync.
The Result: In experiments (like a virtual hallway game), when the robot tried to boost this influence, humans became better at collaborating. They figured out the robot's intentions faster and worked together more smoothly. The robot became "legible."

2. The "Bad Cop" (Resisting Influence)

Goal: Independence or Competition.
How it works: The robot is rewarded for ignoring the human's influence. It tries to be a "wall" that the human cannot push.

The Analogy: Imagine a stubborn mule that refuses to move even if you pull the rope. It resists your influence.
The Result: When the robot did this, humans found it harder to predict what the robot would do. Collaboration dropped. However, in a competitive scenario (where the robot and human were racing against each other), the robot became more "selfish" and independent, which sometimes helped it win, but made the interaction feel less cooperative.

The Experiments: From Video Games to Real Robots

The team didn't just talk about this; they tested it in three ways:

The Grid World (Video Game): Two dots on a screen had to decide whether to meet or pass each other in a narrow corridor.
- Finding: The "Good Cop" robots helped humans win more often in team games. The "Bad Cop" robots made humans struggle to coordinate.
The Virtual Human Study: Real people played the game with a computer-controlled agent.
- Finding: People felt the "Good Cop" robot was more human-like and easier to understand, even though they couldn't consciously explain why. They just felt the flow was better.
The Real Robot: They put a physical robot (a Fetch robot) in a real hallway with real humans.
- Finding: The results held up! When the robot tried to be "influential," humans walked more cooperatively. When it tried to be "independent," humans had a harder time coordinating.

The Wild Card: The Highway Test

They also tested this on a self-driving car simulation (the "Highway" environment).

The Twist: Here, "boosting influence" made the car too aggressive. It started driving faster and cutting closer to other cars to "influence" them to move.
The Lesson: Sometimes, you don't want to be too influential. On a highway, you want to be predictable and safe, not a social butterfly trying to force a dance. This shows that the robot needs to know when to be a "Good Cop" and when to be a "Bad Cop."

The Takeaway

This paper is like teaching a robot to listen to the room rather than just following a script.

Without this method: A robot is like a person shouting instructions in a foreign language. It's loud, confusing, and annoying.
With this method: The robot is like a skilled dancer. It adjusts its moves based on the flow of the room. It doesn't need to know your name or your history; it just needs to feel how its moves change your next step.

By simply tweaking the robot's "reward" to care about how much it influences you, the robot learns to communicate without saying a word. It's a step toward robots that feel less like machines and more like natural partners in our daily lives.

Here is a detailed technical summary of the paper "Influence-Based Reward Modulation for Implicit Communication in Human-Robot Interaction."

1. Problem Statement

In Human-Robot Interaction (HRI), successful collaboration and competition often rely on implicit communication—subtle, non-verbal exchanges of information (e.g., trajectory changes, timing) rather than explicit verbal or signal-based communication.

Current Limitations: Most existing HRI approaches for implicit communication rely on explicitly modeling human intentions (e.g., using inverse reinforcement learning or belief models) or require pre-existing social knowledge. These methods are often brittle, computationally expensive, and difficult to generalize to real-world scenarios where human intent is unknown or dynamic.
The Gap: There is a need for a framework that enables robots to foster implicit communication and adapt to human behavior without requiring explicit models of human intent or prior knowledge of the task.

2. Methodology

The authors propose a novel framework that treats communication as the degree of influence agents have on one another, quantified and modulated using Transfer Entropy (TE) within a Partially Observable Markov Decision Process (POMDP).

Core Concept: Transfer Entropy (TE)

TE measures the directed flow of information from one stochastic process to another. It quantifies how much the uncertainty about an agent's next action is reduced by knowing the past actions of another agent, beyond what the agent's own past already explains.

Formula: $TE(X \to Y) = H(Y_t | Y_{t-1}, \dots) - H(Y_t | Y_{t-1}, \dots, X_{t-1}, \dots)$
Interpretation: High TE from Agent B to Agent A means that knowing Agent B's history makes Agent A's next action substantially more predictable than Agent A's own history alone would, indicating strong influence of B on A.
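
As a rough illustration of the formula above, the following sketch estimates TE from two discrete action sequences by counting frequencies. It assumes a history length of one step and a simple plug-in (frequency-count) estimator; the function name `transfer_entropy` and the toy sequences are illustrative choices, not taken from the paper.

```python
import math
import random
from collections import Counter


def transfer_entropy(source, target):
    """Plug-in estimate of TE(source -> target) in bits, history length 1.

    Computes H(Y_t | Y_{t-1}) - H(Y_t | Y_{t-1}, X_{t-1}) from empirical counts.
    """
    # Count (y_t, y_{t-1}, x_{t-1}) triples over the aligned sequences.
    triples = Counter(zip(target[1:], target[:-1], source[:-1]))
    n = sum(triples.values())

    pair_yy = Counter()   # counts of (y_t, y_{t-1})
    pair_yx = Counter()   # counts of (y_{t-1}, x_{t-1})
    single_y = Counter()  # counts of y_{t-1}
    for (y_t, y_prev, x_prev), c in triples.items():
        pair_yy[(y_t, y_prev)] += c
        pair_yx[(y_prev, x_prev)] += c
        single_y[y_prev] += c

    te = 0.0
    for (y_t, y_prev, x_prev), c in triples.items():
        p_full = c / pair_yx[(y_prev, x_prev)]               # p(y_t | y_prev, x_prev)
        p_self = pair_yy[(y_t, y_prev)] / single_y[y_prev]   # p(y_t | y_prev)
        te += (c / n) * math.log2(p_full / p_self)
    return te


# Toy data (assumption): y simply copies x with a one-step delay,
# so x should strongly influence y but not the other way around.
rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(2000)]
y = [0] + x[:-1]

print("TE(x -> y):", transfer_entropy(x, y))
print("TE(y -> x):", transfer_entropy(y, x))
```

In this toy setup the estimate from `x` to `y` should come out close to one bit, while the reverse direction should stay near zero, which matches the "directed" reading of TE. Longer histories or continuous signals would need a different estimator.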

Reward Modulation Strategy

The authors augment the standard reward function ($r$) with a TE-based term ($\phi \cdot TE$) to control the agent's behavior:
$Reward = \phi \cdot TE + r$

Positive-TE ($\phi > 0$): The agent is rewarded for increasing the influence it has on others (or being influenced by others). This promotes legibility, transparency, and collaboration. The agent acts in a way that makes its intentions clear to others, facilitating mutual understanding.
Negative-TE ($\phi < 0$): The agent is rewarded for resisting influence. This promotes social independence and conservative behavior, useful in competitive scenarios where the agent wants to avoid being manipulated or to maintain autonomy.
Non-TE ($\phi = 0$): Baseline agent with no influence modulation.
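
The three modes above differ only in the sign of the weight on the TE term. A minimal sketch of that reward, assuming a TE estimate is already available for the current step (the function name `modulated_reward` and the numeric values are hypothetical):

```python
def modulated_reward(task_reward: float, te_estimate: float, phi: float) -> float:
    """Return phi * TE + r, the TE-modulated reward described in the text."""
    return phi * te_estimate + task_reward


# Same task reward and TE estimate, evaluated under the three modes.
for label, phi in [("Positive-TE", 1.0), ("Negative-TE", -1.0), ("Non-TE", 0.0)]:
    value = modulated_reward(task_reward=1.0, te_estimate=0.5, phi=phi)
    print(label, value)
```

With a positive weight, steps with higher TE earn more reward; with a negative weight, the same steps are penalized; with zero weight, the agent sees only the task reward.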

Implementation Details

Discrete Setting (Corridor Dilemma): Implemented using Q-learning. The Q-table is used to derive probability distributions over actions. To calculate TE, the authors marginalize the Q-table over the other agent's history to simulate a "no influence" scenario, comparing the resulting entropy against the full observation scenario.
Continuous Setting (Highway Driving): Extended to Deep Reinforcement Learning (DRL). Since continuous action spaces do not allow direct Q-table marginalization, the authors use Monte Carlo estimation. They sample the source signal (other agents' states), feed them into the policy network, and average the resulting action distributions to approximate the marginal policy required for TE calculation.
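
The two settings above share one idea: compare the entropy of a "no influence" marginal policy with the entropy of the policy that sees the other agent. The sketch below illustrates this on a toy example, once with exact averaging over the other agent's states and once with Monte Carlo sampling. Several parts are assumptions made for illustration: the softmax mapping from Q-values to action probabilities, the uniform weighting over the other agent's states, the tiny hypothetical Q-table, and the use of a discrete action distribution even in the sampled version (a continuous policy would need a density-based entropy estimate).

```python
import math
import random


def softmax(values, temperature=1.0):
    """Turn Q-values into action probabilities (assumed mapping)."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def average(dists):
    """Element-wise mean of a list of action distributions."""
    n = len(dists)
    return [sum(d[a] for d in dists) / n for a in range(len(dists[0]))]


def influence_exact(policy, own_state, other_states):
    """H(marginal policy) - mean H(policy given the other agent's state)."""
    conds = [policy(own_state, o) for o in other_states]
    return entropy(average(conds)) - sum(entropy(c) for c in conds) / len(conds)


def influence_monte_carlo(policy, own_state, sample_other, n_samples, rng):
    """Same quantity, with the marginal approximated from sampled source states."""
    conds = [policy(own_state, sample_other(rng)) for _ in range(n_samples)]
    return entropy(average(conds)) - sum(entropy(c) for c in conds) / len(conds)


# Hypothetical Q-table: key = (own_state, other_state), value = Q for [left, right].
Q = {
    ("mid", "other_left"): [0.0, 3.0],
    ("mid", "other_right"): [3.0, 0.0],
}


def q_policy(own_state, other_state):
    return softmax(Q[(own_state, other_state)])


others = ["other_left", "other_right"]
exact = influence_exact(q_policy, "mid", others)
approx = influence_monte_carlo(
    q_policy, "mid", lambda r: r.choice(others), 5000, random.Random(0)
)
print("exact:", exact)
print("monte carlo:", approx)
```

Because the toy policy switches its preferred action depending on where the other agent is, the marginal policy is far more uncertain than the conditional one and the influence term is large. A policy that ignores the other agent would give a value of zero.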

3. Key Contributions

Model-Free Framework: A novel approach to implicit communication that does not require explicit human behavior modeling or prior knowledge of human intent.
Influence Modulation via TE: Theoretical and practical demonstration that manipulating Transfer Entropy in the reward function can effectively control social dynamics (collaboration vs. competition).
Extensive Validation: The framework was validated across three distinct levels:
- Simulations: Self-play and interactions with rule-based social force models.
- Virtual Human-Agent Experiments: Users interacting with trained RL agents in a grid-world.
- Real-World Human-Robot Experiments: Physical interactions with a Fetch robot in a corridor navigation task.
Complex DRL Extension: Application to the Highway-env (autonomous driving) with continuous state spaces and multi-agent interactions.

4. Results

A. Corridor Dilemma (Collaboration vs. Competition)

Simulation:
- Positive-TE agents achieved the highest collaboration success rates (91.72% in Pos-TE vs. Pos-TE pairs) and fairer competitive outcomes.
- They outperformed rule-based Social Force models, which failed to adapt proactively.
- Negative-TE agents reduced collaboration performance but maintained competitive stability.
Virtual Human-Agent Study:
- Humans interacting with Positive-TE agents showed significantly higher success rates in both collaboration and competition compared to Negative-TE agents.
- Humans rated Positive-TE agents as more "legible" and "human-like," though the measured differences in perception were small.
Real-World Human-Robot Study:
- Collaboration: Humans achieved higher success rates when interacting with a Positive-TE robot than with a Negative-TE robot, a difference that approached but did not reach statistical significance.
- Competition: Results were mixed; humans performed slightly better against Negative-TE robots, likely due to physical advantages (speed) and proxemics overriding the algorithmic objectives in the continuous physical environment.

B. Highway Environment (Autonomous Driving)

Positive-TE ($\phi > 0$): Agents became assertive and interactive. They drove faster, maintained shorter gaps to leading vehicles, and attempted to trigger lane changes. This led to higher collision rates in some cases, indicating that "more interaction" is not always safer in high-speed contexts.
Negative-TE ($\phi < 0$): Agents behaved conservatively, maintaining larger gaps and lower speeds. However, excessive suppression ($\phi < -4$) led to irrational decisions and increased collisions because the agents ignored other vehicles entirely.
Conclusion: The framework successfully modulates behavior, but the sign of the TE reward must be context-dependent (positive for social navigation, potentially negative for high-risk driving).

5. Significance and Implications

Generalizability: The framework is applicable to diverse HRI tasks (navigation, driving, handovers) without needing task-specific human models.
Control of Information Asymmetry: By tuning the TE reward, designers can intentionally create information asymmetry to bias the robot toward altruism (ceding to humans) or autonomy (resisting influence), effectively implementing a form of "Asimov-like" reframing.
Implicit Communication: The work suggests that robots can learn to communicate implicitly by optimizing for information flow, making interactions more natural and intuitive for humans who are unaware of the underlying algorithm.
Context-Awareness: The study highlights that "more influence" is not universally better; the optimal strategy depends on whether the scenario requires cooperation (corridor) or safety/autonomy (highway).

6. Limitations and Future Work

Interpretability: While the method is model-free, the specific behavioral outcomes of different TE magnitudes can be hard to interpret without explicit intent modeling.
Confounding Factors: Real-world experiments were subject to fatigue and physical variables (e.g., human speed advantages).
Future Directions:
- Developing methods to automatically infer the appropriate TE sign based on the interaction context.
- Integrating explicit Theory of Mind (ToM) mechanisms.
- Combining implicit (TE-based) and explicit communication channels.