Learning Risk Preferences in Markov Decision Processes: an Application to the Fourth Down Decision in the National Football League

Imagine you are watching a football game. It's 4th down, and the team has to make a huge choice: Go for it (try to get the first down), Kick a Field Goal (try for 3 points), or Punt (kick the ball away to the other team).

For decades, statisticians have looked at these decisions and said, "Wait a minute! The coaches are playing it too safe. If they just used a computer model to calculate the odds, they would go for it way more often."

But the coaches keep making the "safe" choice. Why?

This paper asks a fascinating question: What if the coaches aren't making mistakes? What if they are actually playing a different game than the statisticians think they are?

Here is the simple breakdown of how the authors solved this mystery.

1. The Detective Work: "Inverse Optimization"

Usually, if you want to know what a coach is thinking, you ask them. But coaches rarely say, "I'm scared of losing the ball."

Instead, the authors used a method called Inverse Optimization. Think of it like this:

Normal Math: You know the rules and the goal, so you calculate the best move.
Inverse Math: You see the move the person actually made, and you work backward to figure out what their "hidden rulebook" must have been.

The authors assumed the coaches were making the best possible decision for their specific goals. They just had to figure out what those goals were.

2. The "Risk" Meter: The Quantile

To explain the coaches' behavior, the authors realized the coaches weren't trying to maximize their average points (the long-run payoff a risk-neutral model cares about). They were trying to limit the damage when things go badly.

They used a concept called a Quantile. Imagine a line of people ranked from "Lucky" to "Unlucky."

If you are Risk-Neutral (like a standard computer model), you care about the average person in the middle of the line.
If you are Risk-Averse (like a nervous coach), you care about the person standing 10% of the way up from the "Unlucky" end of the line. You want that person's outcome to be as good as possible, so that even if things go badly, the damage is limited.

The authors found that NFL coaches are essentially playing a game where they are trying to optimize roughly the 30th to 40th percentile of possible outcomes, not the average. They are terrified of the "worst-case scenario" (turning the ball over on 4th down), even if the "average" outcome suggests they should take the risk.

3. The "Field Half" Analogy

The study discovered a funny quirk in how coaches think, depending on where they are on the field.

In Your Own Half (The "Home" Zone): The coaches are super conservative. They are like a parent driving a car with a child in the backseat. They would rather drive 10 miles per hour under the speed limit than risk a single scratch on the bumper. They almost never go for it here.
In the Opponent's Half (The "Away" Zone): The coaches become much bolder. They are like a surfer riding a wave. They are willing to take more risks because the reward (scoring points) is right there, and the "worst-case" scenario (giving the ball back) feels less catastrophic than it does in their own territory.

4. The "Time Travel" Discovery

The authors also looked at how this has changed over time (from 2014 to 2022).

Then: Coaches were extremely scared of the worst-case scenario.
Now: Coaches are slowly becoming a little bit braver. They are starting to trust the math a little more, or perhaps they are just tired of losing games by being too safe.

5. The "Video Game" Connection

To make this work, the authors built a massive Video Game of the NFL.

They fed the game 9 years of real-life play-by-play data.
They programmed the "physics" of the game (how likely a team is to get a first down, how likely a field goal is to go in).
Then, they ran the "Inverse Optimization" engine. It asked: "If the coaches are playing this game perfectly, what 'Risk Meter' setting must they have turned on?"

The Big Takeaway

The paper concludes that coaches aren't "stupid" or "bad at math." They are just risk-averse.

They are playing a game where the penalty for a mistake is so high (losing the ball in a good spot) that they are willing to accept a lower average score just to avoid that one bad outcome.

In simple terms:
If a statistician says, "On average, you should go for it," the coach thinks, "But what if I fail? Then I look like an idiot and we lose." The coach is optimizing for not looking like an idiot, not for winning the most points on average.

This study gives us a new way to understand human decision-making: sometimes, the "safe" choice isn't a mistake; it's a calculated move to avoid the worst possible nightmare.

Here is a detailed technical summary of the paper "Learning Risk Preferences in Markov Decision Processes: An Application to the Fourth Down Decision in the National Football League."

1. Problem Statement

For decades, empirical observations of NFL coaches' fourth-down decisions have been inconsistent with prescriptions derived from statistical models that maximize win probability or expected points. While prior research established that coaches are generally "overly conservative" (preferring punts or field goals over going for a first down), the specific risk preferences driving these decisions remained unquantified.

The core problem addressed is an inverse optimization problem: Given observed decisions (actions) by coaches in specific game states, can we infer the underlying objective function (specifically, the risk measure) that makes these observed actions optimal? The authors aim to move beyond simply labeling coaches as "suboptimal" and instead characterize their implicit risk sensitivity using a quantile-based framework.

2. Methodology

The authors propose a framework combining Markov Decision Processes (MDPs) and Inverse Optimization (IO) to estimate risk preferences.

A. Forward Model: The MDP

The fourth-down decision and subsequent game flow are modeled as an MDP:

States ($S$): Defined by possession, down, yardline (binned), and yards to go. The state space is partitioned into scoring states and play states.
Actions ($A$): $\{ \text{Go for it (GO)}, \text{Field Goal Attempt (FGA)}, \text{Punt (PUNT)} \}$.
Transitions: Estimated empirically from NFL play-by-play data (2014–2022 seasons). The model assumes a fixed, league-average policy ($\bar{\pi}$) governs all future decisions after the current fourth-down play.
Rewards: Defined from the perspective of the team with possession (e.g., +6.95 for a touchdown, +3 for a field goal, -2 for a safety, 0 otherwise).
Value Function: Instead of maximizing the expected value (risk-neutral), the model assumes coaches maximize a specific quantile of the next-state value distribution.
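To make the contrast between the two objectives concrete, here is a minimal sketch, not the authors' implementation. The helper names (`discrete_quantile`, `expected_value`, `best_action`) and every number in `toy_state` are hypothetical, chosen only for illustration. The quantile is taken as the lower $\tau$-quantile of a discrete distribution, which is an assumption about the exact convention.

```python
import numpy as np

def discrete_quantile(values, probs, tau):
    """Lower tau-quantile: the smallest value v with P(V <= v) >= tau."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(values)
    cdf = np.cumsum(probs[order])
    # Small tolerance guards against floating-point error in the cumulative sum.
    idx = np.searchsorted(cdf, tau - 1e-12, side="left")
    return float(values[order][min(idx, len(values) - 1)])

def expected_value(values, probs):
    """Risk-neutral score: the probability-weighted mean."""
    return float(np.dot(values, probs))

def best_action(dists, score):
    """Pick the action whose next-state value distribution scores highest."""
    return max(dists, key=lambda a: score(*dists[a]))

# Hypothetical next-state values (in points) and probabilities for one state.
toy_state = {
    "GO":   ([-1.5, 2.5], [0.45, 0.55]),
    "FGA":  ([-0.6, 3.0], [0.70, 0.30]),
    "PUNT": ([-0.5, 0.3], [0.50, 0.50]),
}

risk_neutral = best_action(toy_state, expected_value)
risk_averse = best_action(toy_state, lambda v, p: discrete_quantile(v, p, tau=0.3))
print(risk_neutral, risk_averse)
```

With these made-up numbers, the expectation-maximizing choice is GO, while the $\tau = 0.3$ quantile objective selects PUNT: the punt has the least bad outcome in the lower tail even though its average is the lowest.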

B. The Inverse Optimization Framework

The goal is to find the quantile parameter $\tau \in [0, 1]$ that minimizes the discrepancy between observed actions and the actions prescribed by the quantile-optimal policy.

Objective: Minimize the average Hamming distance (loss) between observed actions $a_j$ and the optimal actions $a^*_j$ derived from a candidate quantile objective function $q^\tau$.
Quantile Parameterization: The authors use the $\tau$-quantile of the next-state value distribution $V(\sigma, a)$.
- Low $\tau$ (e.g., 0.1) implies high risk aversion (optimizing for the worst-case outcomes).
- High $\tau$ (e.g., 0.9) implies high risk tolerance (optimizing for best-case outcomes).
- $\tau = 0.5$ corresponds to the median; $\tau = 1$ corresponds to the maximum (risk-seeking).
Partitioning: To allow for context-dependent risk preferences, the state space is partitioned into two sets: Opponent Half (yards to endzone < 50) and Own Half (yards to endzone > 50). This allows for distinct quantile estimates ($\tau_1, \tau_2$) for each region.
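The inverse step can be sketched as a grid search over $\tau$: for each candidate value, compute the quantile-optimal action in every observed state and count mismatches with what the coach actually did. This is an illustrative sketch under stated assumptions, not the paper's estimator. The function names, the grid, the tie-breaking rule (smallest minimizing $\tau$), and all toy distributions are hypothetical.

```python
import numpy as np

def discrete_quantile(values, probs, tau):
    """Lower tau-quantile: the smallest value v with P(V <= v) >= tau."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(values)
    cdf = np.cumsum(probs[order])
    idx = np.searchsorted(cdf, tau - 1e-12, side="left")
    return float(values[order][min(idx, len(values) - 1)])

def quantile_policy(dists, tau):
    """Action maximizing the tau-quantile of its next-state value distribution."""
    return max(dists, key=lambda a: discrete_quantile(*dists[a], tau))

def fit_tau(observations, grid):
    """Return (tau_hat, loss): the grid point with the lowest average Hamming loss.

    observations: list of (dists, observed_action) pairs.
    Ties are broken by taking the smallest minimizing tau (an assumption).
    """
    losses = [
        np.mean([quantile_policy(dists, tau) != action for dists, action in observations])
        for tau in grid
    ]
    best = int(np.argmin(losses))
    return float(grid[best]), float(losses[best])

# Two hypothetical states with made-up value distributions.
state_a = {
    "GO":   ([-1.5, 2.5], [0.45, 0.55]),
    "FGA":  ([-0.6, 3.0], [0.70, 0.30]),
    "PUNT": ([-0.5, 0.3], [0.50, 0.50]),
}
state_b = {
    "GO":   ([-1.0, 3.0], [0.30, 0.70]),
    "FGA":  ([-0.5, 2.0], [0.20, 0.80]),
    "PUNT": ([-0.3, 0.1], [0.50, 0.50]),
}

# Hypothetical observed decisions; the two state_a decisions conflict with each other.
observations = [(state_a, "PUNT"), (state_b, "FGA"), (state_a, "GO")]

grid = np.round(np.linspace(0.05, 0.95, 19), 2)
tau_hat, loss = fit_tau(observations, grid)
print(tau_hat, loss)
```

Because the loss is piecewise constant in $\tau$, the minimizer is generally an interval, not a single point, so some tie-breaking convention is needed. Fitting separate values for the two field regions would amount to calling `fit_tau` once on each subset of observations.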

C. Estimation and Inference

Data: 9 seasons of NFL play-by-play data via nflfastR.
Regularization: Raw empirical estimates of transition probabilities are unstable in regions where coaches rarely "go for it." The authors apply bivariate monotonic smoothness constraints (using shape-constrained additive models) to the quantile estimates to ensure intuitive decision boundaries.
Uncertainty Quantification: A bootstrap procedure (resampling at the game level) is used to generate confidence intervals for the estimated $\tau$ values, accounting for variability in transition probabilities and decision paths.
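Resampling at the game level can be sketched as follows. This is a simplified illustration, assuming a percentile interval and a stand-in estimator (`go_rate`, the share of GO decisions). In the paper's setting the estimator would instead re-fit the transition probabilities and $\tau$ on each resample. All names and the toy data are hypothetical.

```python
import numpy as np

def game_level_bootstrap(plays_by_game, estimator, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling whole games with replacement.

    plays_by_game: dict mapping a game id to the list of plays in that game.
    estimator: function mapping a flat list of plays to a number.
    """
    rng = np.random.default_rng(seed)
    game_ids = list(plays_by_game)
    stats = []
    for _ in range(n_boot):
        # Draw games, not individual plays, so within-game dependence is preserved.
        sampled = rng.choice(len(game_ids), size=len(game_ids), replace=True)
        plays = [play for i in sampled for play in plays_by_game[game_ids[i]]]
        stats.append(estimator(plays))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def go_rate(plays):
    """Stand-in estimator: fraction of fourth downs where the team went for it."""
    return float(np.mean([play == "GO" for play in plays]))

# Hypothetical fourth-down decisions grouped by game.
plays_by_game = {
    "game_1": ["PUNT", "GO", "FGA", "PUNT"],
    "game_2": ["PUNT", "PUNT", "FGA"],
    "game_3": ["GO", "GO", "PUNT", "FGA", "PUNT"],
    "game_4": ["FGA", "PUNT"],
}

lo, hi = game_level_bootstrap(plays_by_game, go_rate, n_boot=200, seed=1)
print(lo, hi)
```

Resampling games instead of single plays keeps each game's sequence of decisions together, which is the point of a game-level bootstrap: plays within a game are not independent of one another.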

3. Key Contributions

First Inverse Quantile MDP: The paper is the first to apply inverse optimization specifically to Quantile MDPs, moving beyond standard expected-value or utility-based inverse reinforcement learning.
Quantifiable Risk Profiles: Instead of a binary "risky vs. safe" label, the authors provide a continuous metric ($\tau$) representing the specific quantile of the outcome distribution coaches optimize for.
Contextual Risk Sensitivity: The framework successfully demonstrates that risk preferences are not static but vary significantly based on:
- Field Position: Coaches are more risk-tolerant in the opponent's half.
- Win Probability: Risk tolerance increases as win probability decreases (desperation).
- Time: Risk tolerance has increased over the 2014–2022 period.
- Coach/Team: Significant heterogeneity exists between individual coaches.

4. Key Results

Conservative Baseline: Generally, coaches optimize for low quantiles ($\tau \approx 0.3$ to $0.4$), confirming they are risk-averse compared to the risk-neutral reference point ($\tau = 0.5$, the median) or the "4th Down Bot" (a win-probability maximizing model).
Field Region Disparity:
- Opponent Half: Coaches exhibit higher risk tolerance (higher $\tau$).
- Own Half: Coaches are extremely risk-averse (lower $\tau$), rarely deviating from conservative play.
Win Probability Interaction:
- In low win probability scenarios (teams losing), coaches in the Opponent Half become significantly more risk-tolerant, aligning their behavior closer to the win-probability model.
- In high win probability scenarios, coaches remain conservative regardless of field position.
Temporal Trend: There is a statistically significant increase in risk tolerance across the league from 2014 to 2022, particularly in the opponent's half.
Performance Correlation: A regression analysis reveals a positive association between a coach's estimated risk tolerance ($\hat{\tau}$) and their average points gained on fourth down. This suggests that excessive risk aversion (very low $\tau$) is correlated with lower team performance, which is consistent with conservative play carrying a measurable cost.

5. Significance and Implications

Theoretical Advancement: The paper bridges the gap between decision theory and sports analytics by providing a rigorous method to "learn" the risk measure of a decision-maker directly from data, without assuming a specific utility function form.
Practical Application: The findings offer a nuanced explanation for the "fourth down gap." It is not merely that coaches are wrong; it is that they are optimizing for a different objective (low quantiles) driven by risk aversion, which is more pronounced in their own territory.
Behavioral Change: By quantifying the specific "cost" of risk aversion in terms of points and wins, the study provides a data-driven argument for coaches to adjust their risk profiles, particularly in the opponent's half or when trailing.
Generalizability: The methodology is applicable to any domain involving sequential decision-making under uncertainty where the decision-maker's risk profile is unknown (e.g., finance, logistics, healthcare).

In summary, the paper successfully reframes the fourth-down decision problem from a "coaches vs. math" conflict to a "different risk preferences" analysis, using inverse optimization to map observed behavior to specific quantile-based risk measures.

