Imagine you are trying to teach a robot to drive a delivery truck. But there's a catch: you can't let the robot drive around the city and crash into things to learn (that's too dangerous and expensive). Instead, you only have a video recording of a human driver who already did the job. This is Offline Reinforcement Learning: learning from a past dataset without touching the real world.
Now, imagine the human driver had to balance three conflicting goals:
- Speed: Get the package there fast.
- Safety: Don't hit any pedestrians.
- Fuel Economy: Don't waste gas.
If you simply tell the robot "do what the human did," it might copy the human's bad habits (like speeding when the road is empty but ignoring safety). If you try to tell the robot "be perfect," it might get confused and do nothing. The challenge is finding a fair compromise that balances all three goals automatically.
This is where a new algorithm called FairDICE was supposed to come in. The original authors claimed they built a "magic sauce" that could automatically figure out the perfect balance between speed, safety, and fuel, without needing a human to tweak the settings.
The Replication Study: "Wait, the Sauce is Just Water?"
A team of researchers decided to test this "magic sauce" to see if it actually works. They tried to rebuild the algorithm from scratch using the code the original authors published. Here is what they found, explained simply:
1. The Big Mistake: The "Copy-Paste" Glitch
The researchers discovered a massive bug in the code, like a chef who accidentally forgot to add the spice to the stew.
- The Theory: The algorithm was supposed to look at the past data, calculate how important each goal was, and then adjust the robot's behavior to be fairer.
- The Reality: Because of a coding error (a "broadcasting" bug, where mismatched array shapes silently scramble a calculation), the algorithm effectively ignored the "importance" calculations. It just blindly copied the human driver's actions, exactly as if it were doing simple "Behavior Cloning" (copying homework).
- The Result: The original paper showed amazing results, but those results were actually just the robot copying the human. The "magic" of balancing goals wasn't actually happening in the continuous environments (the complex driving scenarios).
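To see how a broadcasting bug can quietly erase per-sample importance weights, here is an illustrative NumPy toy (not the authors' actual code, and the numbers are made up). When a column of weights is multiplied against a flat row of values, NumPy builds an all-pairs grid instead of matching them one-to-one, and averaging that grid reduces to the plain "copy the human" loss times a constant:

```python
import numpy as np

# Hypothetical per-sample importance weights, stored as a column: shape (3, 1)
weights = np.array([[0.1], [2.0], [0.5]])
# Log-probability of each human action under the robot's policy: shape (3,)
log_probs = np.array([-0.5, -1.2, -0.3])

# Buggy version: (3, 1) * (3,) broadcasts to a (3, 3) all-pairs grid,
# so every weight is averaged against every sample and the pairing is lost.
buggy_loss = -(weights * log_probs).mean()

# The buggy loss is just the unweighted Behavior Cloning loss
# scaled by a constant (the average weight) -- the weights do nothing.
plain_bc_loss = -log_probs.mean()

# Fixed version: flatten the weights so shapes line up one-to-one.
fixed_loss = -(weights.squeeze(-1) * log_probs).mean()
```

In the buggy version, changing which sample gets a high weight changes nothing about which actions the robot favors, which is exactly the "it was secretly just Behavior Cloning" finding described above.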
2. Fixing the Sauce
Once the researchers fixed the code, they tried again.
- Good News: The theory does work! In simple, toy-like environments (like a robot moving through a grid of rooms), the algorithm successfully learned to balance goals better than just copying the human. It proved the math was sound.
- Bad News: In the complex, real-world-like environments, the algorithm became extremely sensitive. It's like a car that only drives well if you turn the steering wheel to exactly 42.3 degrees. If you turn it to 42.4 degrees, it crashes.
  - The algorithm needs a specific setting (called Beta) to work.
  - The original paper claimed you could use any setting and it would work fine. The replication showed that if you pick the wrong setting, the algorithm performs worse than just copying the human.
  - The Catch: To find the right setting, you usually have to test it in the real world (Online), which defeats the purpose of "Offline" learning.
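The sensitivity to Beta can be sketched with a tiny toy (hypothetical numbers, and a generic softmax-style weighting used as a stand-in for the algorithm's actual objective). Beta acts like a temperature: too large and the weights are uniform, so the robot just copies the data; too small and nearly all the weight piles onto a single sample, so most of the data is ignored:

```python
import numpy as np

def importance_weights(advantages, beta):
    """Softmax-style weights: higher-scoring samples get more weight.
    Beta controls how sharply the dataset is re-weighted (an illustrative
    stand-in for the temperature-like setting discussed above)."""
    w = np.exp(advantages / beta)
    return w / w.sum()

# Hypothetical per-sample scores for three transitions in the dataset.
adv = np.array([0.0, 0.1, 0.2])

w_large_beta = importance_weights(adv, beta=10.0)   # near-uniform: plain copying
w_small_beta = importance_weights(adv, beta=0.01)   # nearly one-hot: ignores most data
```

Neither extreme is useful, and the window in between depends on the dataset, which is why finding it offline (without real-world trial runs) is the hard part.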
3. The Stress Tests
The researchers also put the fixed algorithm through some tough tests:
- Negative Rewards: What if the goals are "don't lose money" instead of "make money"? The algorithm handled this okay.
- Biased Data: What if the human driver in the video was terrible at safety but great at speed? The algorithm could partially fix this, but if the data was really biased, the robot couldn't learn to be fair. It's hard to teach a robot to be fair if the only teacher you have was unfair.
- High Complexity: What if there are 100 different goals (like balancing 100 different people's needs)? The algorithm scaled up well and handled it.
- Image Inputs: What if the robot has to look at a video camera instead of numbers? It worked, though the improvement over just copying was small.
The Final Verdict
Think of FairDICE as a brilliant new recipe for a cake that promises to taste perfect no matter what ingredients you have.
- The Theory: The recipe is mathematically sound. It should work.
- The Practice: The original paper served you what was secretly a store-bought cake (thanks to the bug), and it happened to taste good.
- The Real Cake: When you bake the cake correctly, it can taste amazing, but only if you are a master baker who knows exactly how much sugar to add. If you guess the sugar amount, it will taste terrible.
Conclusion:
FairDICE is a fascinating idea with a solid theoretical foundation. However, the original paper was too optimistic. It claimed the algorithm was "plug-and-play" (easy to use), but in reality it requires careful tuning and high-quality data to work. It's not yet a magic wand that solves fairness problems automatically, but it is a promising tool that needs more polishing before it can be trusted in the real world.