Imagine you've hired a team of incredibly smart, super-fast digital assistants (Large Language Models, or LLMs) to run your business, negotiate deals, or even drive your car. You expect them to be helpful, but you soon realize they have a weird personality quirk: they are too nice.
In the world of economics, this is a problem. If your AI agent is negotiating a price, it might give away the store just to be "polite," ignoring the fact that it should be making a profit. If it's driving a car, it might sacrifice the passenger to save a pedestrian, even when the passenger is your own family member.
This paper is about teaching these digital assistants to have a "personality" that matches your specific goals, rather than just being a generic, overly helpful robot.
Here is the breakdown of the paper using simple analogies:
1. The Problem: The "Overly Polite" Intern
The authors started by testing standard AI models (like GPT-4o) in classic economic games, like the Prisoner's Dilemma (a game where two people have to decide whether to trust each other or betray each other).
- What happened: The AI acted like a golden retriever. It cooperated far too much, even when cooperation hurt its own score, and it barely responded to the "rules of the game" (the incentives; a tiny payoff sketch follows just after this list).
- The Analogy: Imagine hiring an intern to run your lemonade stand. Instead of charging the highest price the market will bear, the intern gives the lemonade away for free because they think "sharing is caring." They aren't bad; they just haven't been trained to understand business.
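To make "ignoring the incentives" concrete, here is a minimal Prisoner's Dilemma sketch in Python. The payoff numbers are illustrative assumptions for this explainer, not values taken from the paper.

```python
# Illustrative one-shot Prisoner's Dilemma. The payoff numbers are made up
# for this explainer, not taken from the paper.
MY_PAYOFF = {  # (my_move, their_move) -> points I score
    ("cooperate", "cooperate"): 3,
    ("cooperate", "defect"): 0,
    ("defect", "cooperate"): 5,
    ("defect", "defect"): 1,
}

# Whatever the other player does, defecting scores strictly more for me:
for their_move in ("cooperate", "defect"):
    assert MY_PAYOFF[("defect", their_move)] > MY_PAYOFF[("cooperate", their_move)]

# An agent that cooperates regardless of these numbers is the "golden retriever"
# behaviour described above: friendly, but blind to the incentives.
```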
2. The Solution: The "Training Camp" (Fine-Tuning)
The authors didn't just tell the AI, "Be smarter!" (which is like giving a vague instruction to a confused intern). Instead, they created a training camp.
They took the AI and taught it two specific "personalities" using a method called Supervised Fine-Tuning:
- Personality A: "Homo Economicus" (The Rational Businessperson)
- The Vibe: "I am here to maximize my own profit. I will play the game to win, but I won't be mean; I'll just be smart."
- The Training: They fed the AI thousands of examples where the "best move" was to act in self-interest.
- Personality B: "Homo Moralis" (The Moral Kantian)
- The Vibe: "I care about myself, but I also care about what would happen if everyone acted like me. I want to do the 'right' thing, even if it's hard."
- The Training: They fed the AI examples where the best move involved balancing self-interest with a rule like, "If everyone did this, would the world be better?" (a toy version of this objective is sketched just after this list).
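One standard way economists write this second objective down is Alger and Weibull's "Homo moralis" utility: a weighted blend of your own payoff and the payoff you would get if everyone copied your strategy. The sketch below applies that formulation to the illustrative Prisoner's Dilemma payoffs from earlier; the morality weight kappa and the payoff numbers are assumptions, and the paper's exact training objective and data may differ.

```python
# Toy versions of the two trained objectives, on the same illustrative
# Prisoner's Dilemma payoffs as the earlier sketch. kappa ("degree of
# morality") is an illustrative parameter, not a value from the paper.
MY_PAYOFF = {("cooperate", "cooperate"): 3, ("cooperate", "defect"): 0,
             ("defect", "cooperate"): 5, ("defect", "defect"): 1}

def homo_economicus_utility(my_move, their_move):
    """Pure self-interest: just my own payoff."""
    return MY_PAYOFF[(my_move, their_move)]

def homo_moralis_utility(my_move, their_move, kappa=0.5):
    """Blend of my own payoff and the payoff I'd get if everyone played my
    move: the Kantian 'what if everyone did this?' term."""
    selfish = MY_PAYOFF[(my_move, their_move)]
    kantian = MY_PAYOFF[(my_move, my_move)]  # everyone copies my strategy
    return (1 - kappa) * selfish + kappa * kantian

# Against a cooperator: the rational objective prefers defecting (5 > 3),
# while a sufficiently moral agent prefers cooperating (3.0 > 1.8 at kappa=0.8).
print(homo_economicus_utility("defect", "cooperate"),
      homo_moralis_utility("defect", "cooperate", kappa=0.8))
print(homo_economicus_utility("cooperate", "cooperate"),
      homo_moralis_utility("cooperate", "cooperate", kappa=0.8))
```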
3. The Results: Different Personalities, Different Outcomes
After this "training camp," the AI agents changed their behavior permanently. They didn't just act differently because of a prompt; they actually thought differently.
Test 1: The "Moral Machine" (Self-Driving Cars)
Imagine a self-driving car facing an unavoidable crash. It must choose: stay the course and hit the 10 pedestrians, saving the passenger (you), or swerve and sacrifice the passenger to save the 10 pedestrians.
- The Standard AI: Always swerves to save the most lives, even if you are the passenger. It's a "martyr."
- The "Rational" AI: It says, "If I'm the passenger, I want to live! I'll buy a car that protects me. If I'm a stranger, I'll agree to save the 10 people." It changes its mind based on who is in the car.
- The "Moral" AI: It says, "If everyone follows the rule of saving the most lives, that's the right thing to do." So, it swerves to save the 10 people, even if it's your family member in the car. It is consistent.
Test 2: The "Price War" (Two Competing Shops)
Imagine two AI agents running competing shops. They can either compete (low prices) or collude (high prices, like a secret agreement). A toy pricing example after this list shows why collusion is tempting.
- The Standard AI: When told to be "profitable," it immediately raises prices to monopoly levels (too high!). It's too eager to collude.
- The "Rational" AI: It plays smart. If the game is competitive, it lowers prices to win customers. If the game allows for cooperation, it finds a middle ground.
- The "Moral" AI: It is the most stable. It refuses to raise prices too high even when encouraged to be greedy. It acts like a "rule-follower" that keeps the market competitive and fair.
4. Why This Matters
The paper argues that the values we train into an AI agent are a strategic design choice, not an afterthought.
- The Old Way: We just hope the AI is "safe" and "helpful." But in a business or market, "helpful" might mean "giving away your profits."
- The New Way: We can explicitly design the AI's "brain" to have a specific set of values.
- Want a ruthless negotiator? Train it to be Rational.
- Want a fair, stable market player? Train it to be Moral.
- Want an ethical driver? Train it to be Moral.
The Big Takeaway
Think of AI not as a blank slate, but as a student. If you just let it read the internet, it learns a messy mix of human behaviors (some nice, some greedy, some confused).
This paper shows that if you give the student a specific textbook (a small dataset based on economic theory) and teach them a specific philosophy (Rational vs. Moral), they will become a consistent, predictable, and useful agent for that specific job.
It turns AI alignment from a vague "be good" instruction into a precise engineering task: "Build an agent that thinks like a rational economist" or "Build an agent that thinks like a moral philosopher."