A Review of Reward Functions for Reinforcement Learning in the context of Autonomous Driving

This paper reviews reward functions for reinforcement learning in autonomous driving, categorizing them into safety, comfort, progress, and traffic-rule compliance. It highlights their current limitations in standardization and context-awareness, and proposes future research directions toward more robust, conflict-resolving reward designs.

Ahmed Abouelazm, Jonas Michel, J. Marius Zoellner

Published 2026-03-05

Imagine you are teaching a robot to drive a car. You can't just tell it, "Drive safely." You have to give it a scorecard—a set of rules that tells it when it's doing a good job and when it's doing a bad job. In the world of Artificial Intelligence, this scorecard is called a Reward Function.

This paper is essentially a report card on how researchers are currently writing these scorecards for self-driving cars. The authors argue that while we are getting better at teaching robots to drive, the way we grade them is messy, inconsistent, and sometimes dangerous.

Here is a breakdown of their findings using simple analogies:

1. The Four Pillars of the Scorecard

The authors looked at hundreds of research papers and realized that every reward function tries to balance four main goals. Think of these as the four judges on a reality TV show:

  • Safety (The Bodyguard): This is the most important judge. The robot must not crash.
    • The Problem: Currently, the scorecard is too blunt. If the robot bumps a shopping cart, it gets the same "bad score" as if it hit a pedestrian at high speed. Some researchers try to fix this by making the penalty worse if the car is going fast, but it's still a guess.
  • Progress (The Marathon Runner): The robot needs to get to its destination.
    • The Problem: If you just tell the robot "Go fast," it might drive straight into a wall because hitting the wall is technically "moving forward." The current scorecards sometimes encourage the robot to be so eager to finish that it ignores obstacles.
  • Comfort (The Taxi Passenger): The ride shouldn't feel like a rollercoaster. No sudden jerks or hard braking.
    • The Problem: Many researchers ignore this completely! They focus so much on not crashing that they teach the robot to drive like a nervous teenager slamming on the brakes.
  • Traffic Rules (The Police Officer): Stay in the lane, stop at red lights, don't speed.
    • The Problem: The scorecards are often too rigid. They don't know when it's okay to break a rule (like driving slightly over the speed limit to merge safely into fast traffic).
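To make the four "judges" concrete, here is a minimal sketch of them as separate reward terms. All state fields, thresholds, and penalty values are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch of the four reward "judges" as separate terms.
# Field names and magnitudes are illustrative only.
from dataclasses import dataclass

@dataclass
class State:
    collided: bool      # did the ego vehicle hit anything?
    speed: float        # current speed (m/s)
    speed_limit: float  # posted limit (m/s)
    accel: float        # longitudinal acceleration (m/s^2)
    progress: float     # distance covered this step (m)

def r_safety(s: State) -> float:
    # Blunt penalty: a shopping-cart bump scores the same as a
    # high-speed pedestrian collision -- the flaw noted above.
    return -100.0 if s.collided else 0.0

def r_progress(s: State) -> float:
    return s.progress            # rewards any forward motion, even toward a wall

def r_comfort(s: State) -> float:
    return -abs(s.accel)         # penalizes jerky acceleration and hard braking

def r_rules(s: State) -> float:
    over = max(0.0, s.speed - s.speed_limit)
    return -over                 # rigid: no context for when speeding is acceptable
```

Each term has exactly the weakness its "judge" is criticized for: safety is blunt, progress is blind, comfort is optional, and the rules term never bends.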

2. The "Smoothie" Problem (Aggregation)

How do you combine these four judges into one final score?

  • Current Method (The Smoothie): Most researchers just mix everything together in a blender. They add the Safety score, the Progress score, and the Comfort score.
    • The Flaw: If the "Progress" judge is too loud, it drowns out the "Safety" judge. The robot might decide, "I'll crash into that wall because the points I get for moving forward are higher than the points I lose for crashing."
  • Weighted Method: Some try to give Safety a "louder voice" by multiplying its score by a big number.
    • The Flaw: It's like trying to tune a radio by guessing. You have to manually tweak the numbers over and over, and what works in a rainy city might fail on a sunny highway.
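The "smoothie" and its weighted variant can be sketched in a few lines. The specific penalties and weights below are made-up numbers chosen to show the failure mode, not values from the paper:

```python
# Hypothetical weighted-sum ("smoothie") aggregation. The weights are
# hand-tuned guesses -- exactly the flaw described above.
def blended_reward(safety, progress, comfort, rules,
                   w=(1.0, 1.0, 1.0, 1.0)):
    ws, wp, wc, wr = w
    return ws * safety + wp * progress + wc * comfort + wr * rules

# With equal weights, a big progress bonus outvotes a crash penalty:
risky = blended_reward(safety=-10.0, progress=15.0, comfort=0.0, rules=0.0)
# risky is positive, so crashing "pays".

# Giving safety a "louder voice" fixes this one case, but the multiplier
# is still a guess that may not transfer to other scenarios:
safer = blended_reward(safety=-10.0, progress=15.0, comfort=0.0, rules=0.0,
                       w=(5.0, 1.0, 1.0, 1.0))
```

Note that the fix is purely numerical: nothing in the structure says safety must win, so every new scenario reopens the tuning problem.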

3. The "One-Size-Fits-All" Trap (Lack of Context)

Imagine a teacher who grades a student's math test the same way they grade their art project. That's what's happening with self-driving cars.

  • The current scorecards are often built for one specific situation (like a highway).
  • If you take that same scorecard and put the car in a busy city with pedestrians, it gets confused. It doesn't know that the rules change based on the weather, the time of day, or the type of road. It lacks "common sense."

4. The Missing "Rulebook"

The authors suggest we stop guessing with numbers and start using a Rulebook.

  • The Analogy: Instead of mixing ingredients in a blender, imagine a strict hierarchy of laws.
    • Rule 1: Never hit a person. (If you break this, the game is over).
    • Rule 2: Never hit a car.
    • Rule 3: Don't speed.
    • Rule 4: Be comfortable.
  • This way, the robot knows that Safety always beats Progress. You don't need to guess the numbers; the rules are clear.
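The hierarchy above can be expressed without any weights at all, by comparing behaviors rule by rule, highest priority first. This sketch uses Python's lexicographic tuple comparison; the rule names and violation counts are illustrative:

```python
# Hypothetical lexicographic "rulebook": behaviors are compared rule by
# rule, highest-priority rule first. No magic numbers to tune -- a
# safety violation can never be bought back with progress points.
def rulebook_score(hit_person, hit_car, speeding, discomfort):
    # Lower violation counts are better; earlier entries dominate later
    # ones under Python's lexicographic tuple comparison.
    return (hit_person, hit_car, speeding, discomfort)

fast_but_deadly = rulebook_score(hit_person=1, hit_car=0, speeding=0, discomfort=0)
slow_and_safe   = rulebook_score(hit_person=0, hit_car=0, speeding=2, discomfort=5)

best = min(fast_but_deadly, slow_and_safe)  # slow_and_safe wins outright
```

However many speeding or comfort violations the safe behavior racks up, it still beats any behavior that hits a person, because Rule 1 is compared first.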

5. The "Context Machine"

To fix the lack of common sense, they suggest using Reward Machines.

  • The Analogy: Think of this as a GPS that changes the rules as you drive.
    • When you are on the highway, the machine says, "Okay, speed is important."
    • When you enter a school zone, the machine instantly switches to a new set of rules: "Stop! Look for kids! Slow down!"
    • This allows the robot to adapt its "scorecard" to the situation, just like a human driver does.
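A reward machine is essentially a finite-state machine whose current state decides which scorecard applies. Here is a minimal sketch; the state names, events, and weights are assumptions for illustration:

```python
# Hypothetical reward machine: a finite-state machine whose current
# state selects the active reward weights. States, events, and weights
# are illustrative only.
WEIGHTS = {
    "highway":     {"progress": 1.0, "safety": 1.0},
    "school_zone": {"progress": 0.1, "safety": 10.0},
}

def transition(state, event):
    # Events would come from perception or map data in a real system.
    if state == "highway" and event == "enter_school_zone":
        return "school_zone"
    if state == "school_zone" and event == "exit_school_zone":
        return "highway"
    return state  # unrecognized events leave the state unchanged

state = "highway"
state = transition(state, "enter_school_zone")
weights = WEIGHTS[state]   # safety now dominates progress
```

The key design point is that the switch is explicit and discrete: instead of one blended scorecard trying to cover every situation, each situation gets its own.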

6. The Missing Safety Check

Finally, the paper points out a scary gap: We have no way to automatically test if the scorecard is broken.

  • Currently, if a researcher designs a new scorecard, they have to manually check if it makes the robot do stupid things.
  • The authors say we need a "Safety Inspector" tool that automatically runs thousands of crazy scenarios (like a deer jumping out, or a tire blowing out) to see if the robot's scorecard makes it behave safely.
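One simple form such a "Safety Inspector" could take is a property check: generate many random scenarios and flag any case where a crashing outcome scores at least as well as a safe one. The reward function below is a deliberately broken stand-in, not the paper's proposal:

```python
# Hypothetical automated check: fuzz a reward function over random
# scenarios and flag any case where crashing scores at least as well
# as staying safe. The reward function is a deliberately flawed toy.
import random

def reward(crashed, progress):
    return (-50.0 if crashed else 0.0) + progress

def find_broken_case(trials=10_000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        progress = rng.uniform(0.0, 100.0)
        if reward(crashed=True, progress=progress) >= reward(crashed=False, progress=0.0):
            return progress   # a crash the scorecard still rewards
    return None               # no counterexample found in this budget

bad = find_broken_case()      # finds a crash worth more than standing still
```

Any scenario with enough progress to outweigh the fixed crash penalty is a counterexample, so the check finds one quickly; a real inspector would of course use richer scenarios (deer, blowouts) than a single random number.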

The Bottom Line

The paper concludes that while Reinforcement Learning is a powerful tool for self-driving cars, we are currently teaching them with a broken grading system. We need to move away from "guessing the right numbers" and toward structured Rulebooks and context-aware systems that understand that driving in a snowstorm is different from driving on a sunny track. Until we fix the scorecard, the robot driver will never be truly safe.