This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.
The Big Idea: Two Different Games
Imagine there are two very different games being played with Artificial Intelligence (AI) in medicine.
- The "Chat Game" (The Chat Problem): This is what current chatbots (like the one you are talking to now) are really good at. The goal is to sound human, be polite, and give an answer that makes the user feel satisfied.
- The "Treatment Game" (The Treatment Problem): This is the real, high-stakes medical decision. The goal is to figure out the single best action to keep a specific patient alive and healthy, balancing risks (side effects) against rewards (curing the disease).
The author, Samuel Weisenthal, argues that chatbots are currently playing the Chat Game, but people are mistakenly thinking they are playing the Treatment Game. Just because a bot sounds smart doesn't mean it can save your life.
1. The Treatment Problem: The Ultimate Puzzle
Imagine you are a doctor trying to decide if a patient should take a cholesterol pill (a statin).
- The Goal: You want to maximize the patient's "happiness" (utility). This means avoiding heart attacks and avoiding bad side effects like muscle pain.
- The Difficulty: Every patient is different. One person might hate muscle pain more than they fear a heart attack; another might feel the opposite.
- The Solution: To solve this perfectly, you need to know exactly what would happen if you gave the pill vs. if you didn't. In the real world, we usually do this through Clinical Trials (flipping a coin to see who gets the pill) or by looking at Observational Data (looking at past records).
The Analogy: Solving the Treatment Problem is like trying to predict the weather for a specific person's backyard. You need precise data about wind, rain, and temperature to know if they should bring an umbrella.
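The "maximize utility" idea above can be made concrete with a few lines of code. This is a minimal sketch, not the paper's method: the risk probabilities, the utility values, and the two patients are all invented for illustration.

```python
# Toy sketch of the Treatment Problem as expected-utility maximization.
# All probabilities and utilities below are made up for illustration.

def expected_utility(p_heart_attack, p_side_effect, u_heart_attack, u_side_effect):
    """Expected utility of one action for one patient."""
    return p_heart_attack * u_heart_attack + p_side_effect * u_side_effect

# Patient A fears heart attacks far more than muscle pain.
patient_a = {"u_heart_attack": -100, "u_side_effect": -10}
# Patient B dreads the side effects almost as much as a heart attack.
patient_b = {"u_heart_attack": -100, "u_side_effect": -80}

for name, prefs in [("A", patient_a), ("B", patient_b)]:
    # Hypothetical statin effect: lowers heart-attack risk from 10% to 5%,
    # but introduces a 30% chance of muscle pain.
    eu_statin = expected_utility(0.05, 0.30, **prefs)
    eu_no_statin = expected_utility(0.10, 0.00, **prefs)
    best = "statin" if eu_statin > eu_no_statin else "no statin"
    print(f"Patient {name}: statin={eu_statin:.1f}, none={eu_no_statin:.1f} -> {best}")
```

With these invented numbers, the same drug is the right call for Patient A and the wrong call for Patient B. That is the whole point of the Treatment Problem: the best action depends on each patient's risks and preferences, not on what sounds like a good answer.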
2. The Chat Problem: The Mirror Game
Now, imagine a chatbot. Its job isn't to predict the weather; its job is to mimic a human conversation.
- How it works: The chatbot looks at millions of past conversations (training data). If a human asked, "Should I take a statin?" and the top answer in the database was "Yes, take it," the bot says "Yes."
- The Trap: The bot isn't calculating the risk of a heart attack. It is just copying what humans usually say. It is playing a game of mimicry: it tries to sound like a doctor, not to be one.
The Analogy: A chatbot is like a parrot. If you teach a parrot to say "The sky is blue," it will say it perfectly. But if you ask the parrot, "Is the sky blue right now?" it doesn't look outside; it just repeats what it was taught. It mimics the words, not the truth.
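The parrot analogy can be sketched as code. This toy "chatbot" just answers with the most frequent human reply in a made-up pile of past conversations; notice that no patient outcome appears anywhere in it.

```python
from collections import Counter

# Toy sketch of the Chat Problem: answer with whatever humans most often
# said, with no model of patient outcomes anywhere. The "training data"
# below is made up for illustration.

training_data = {
    "Should I take a statin?": ["Yes, take it", "Yes, take it", "Ask your doctor"],
}

def chatbot_answer(question):
    """Return the most frequent human reply seen for this question."""
    replies = training_data.get(question, ["I don't know"])
    return Counter(replies).most_common(1)[0][0]

print(chatbot_answer("Should I take a statin?"))  # echoes the majority reply
```

Real chatbots are vastly more sophisticated than a frequency count, but the objective is the same family: sound like the training data, not compute what will actually happen to this patient.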
3. Why Imitation Isn't Enough
The paper warns us about Imitation Learning. This is when an AI learns by copying human doctors' notes.
- The Problem: If human doctors are wrong, the AI will copy their mistakes.
- The Statin Example: Imagine a community of doctors who all prescribe statins to everyone over 50, even if it's not necessary. If an AI learns by copying them, it will also prescribe statins to everyone over 50. It won't realize that for some people, the side effects aren't worth it.
- The Missing Piece: Imitation doesn't care about the outcome. It only cares about looking like the teacher. It doesn't know if the patient actually got better or got sick.
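The statin example above can be sketched directly. In this toy simulation, invented for illustration, a "perfect" imitator reproduces the doctors' over-50 rule exactly, because its only goal is matching the teacher; outcomes never enter the data.

```python
# Toy sketch of imitation learning: the AI copies the doctors' rule and
# inherits their mistakes, because the objective is "match the teacher",
# not "help the patient". The rule and patients are invented.

# The community habit the AI observes: statins for everyone over 50.
def doctor_policy(age):
    return "statin" if age > 50 else "no statin"

# Training set of (age, doctor's action) pairs -- note: no outcomes.
dataset = [(age, doctor_policy(age)) for age in range(30, 81)]

# A "perfect" imitator: copy the action taken for the nearest patient.
def imitation_policy(age):
    _, action = min(dataset, key=lambda pair: abs(pair[0] - age))
    return action

# A 55-year-old with severe statin side effects still gets "statin",
# because nothing in the data records whether patients got better.
print(imitation_policy(55))
```

The imitator agrees with the doctors on every patient, mistakes included. It has no way to discover that for some people the side effects outweigh the benefit, because that information was never part of what it was trained to match.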
4. The "Magic" of Chatbots vs. The Reality of Medicine
Why are chatbots so good at chatting but bad at medicine?
- Chatbots can experiment: If a chatbot gives a weird answer, the worst thing that happens is a user says, "That was weird." The engineers can try again. They can run thousands of "trials" in a day to see what users like.
- Doctors cannot experiment: A doctor cannot randomly give a patient a dangerous drug just to see what happens. The stakes are too high. You can't "A/B test" a human heart.
The Analogy:
- Chatbots are like a video game developer. They can crash the game a million times to fix the bugs.
- Doctors are like pilots. They can't crash the plane to see if the landing gear works. They have to rely on strict rules and proven data.
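The "thousands of trials in a day" loop is essentially an A/B test. Here is a minimal sketch with invented reply names and thumbs-up rates: for chat, running this experiment is cheap and harmless, which is exactly what is impossible with treatments.

```python
import random
random.seed(2)

# Toy sketch of why chatbot builders can "experiment": an A/B test over
# two candidate replies, scored by simulated user thumbs-ups. The reply
# names and rates are invented for illustration.

TRUE_THUMBS_UP_RATE = {"reply_A": 0.55, "reply_B": 0.70}  # hidden from the engineers

def run_trial(reply):
    """One user interaction: did this user like the reply?"""
    return random.random() < TRUE_THUMBS_UP_RATE[reply]

# Thousands of harmless trials in a day:
observed = {reply: sum(run_trial(reply) for _ in range(5_000)) / 5_000
            for reply in TRUE_THUMBS_UP_RATE}
winner = max(observed, key=observed.get)
print(observed, "->", winner)
```

A bad reply costs almost nothing, so the loop can run until the better reply is found. A bad treatment can cost a life, so this loop is exactly what medicine forbids.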
5. Can We Train a Bot to Be a Real Doctor?
The author asks: Can we build a bot that actually solves the Treatment Problem?
- Theoretically, yes: If we could feed the bot data on what happens to patients after they take a drug (did they have a heart attack? did they get muscle pain?), the bot could learn to maximize patient health.
- The Hurdle: We can't easily get that data. We can't force people to take drugs just to train the AI. We have to rely on messy, real-world records (Observational Data), which are full of hidden biases (for example, perhaps mostly wealthier patients took the drug, so the drug looks more effective than it really is).
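The hidden-bias problem (statisticians call it confounding) is easy to demonstrate. In this toy simulation, with all numbers invented, the drug does literally nothing, yet a naive comparison of the records makes it look very effective, because wealth drives both who takes it and who stays healthy.

```python
import random
random.seed(1)

# Toy sketch of confounding in observational data: wealth drives both
# who takes the drug and who stays healthy, so a naive comparison makes
# the drug look effective even though, in this simulation, it has NO
# effect at all. All numbers are invented.

def simulate_patient():
    wealthy = random.random() < 0.5
    # Wealthier patients are more likely to take the drug...
    took_drug = random.random() < (0.8 if wealthy else 0.2)
    # ...and more likely to stay healthy for unrelated reasons.
    healthy = random.random() < (0.9 if wealthy else 0.5)
    return took_drug, healthy

patients = [simulate_patient() for _ in range(100_000)]

def healthy_rate(group):
    return sum(healthy for _, healthy in group) / len(group)

treated = [p for p in patients if p[0]]
untreated = [p for p in patients if not p[0]]

print(f"healthy among treated:   {healthy_rate(treated):.2f}")   # ~0.82
print(f"healthy among untreated: {healthy_rate(untreated):.2f}") # ~0.58
```

A useless drug appears to boost the healthy rate by over twenty percentage points. This is why learning the Treatment Problem from past records, rather than from randomized trials, is so treacherous.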
The "Moonshot" Idea:
The author suggests that maybe we can use mathematical models to analyze millions of medical notes and find the best treatment strategies. But this is a "Moonshot"—a huge, risky, long-term goal. It's not something we can do tomorrow.
The Bottom Line
- Chatbots are great tools: They can help doctors find information, summarize notes, or chat with patients to keep them calm.
- Chatbots are NOT doctors: They are currently trained to imitate conversation, not to optimize health.
- Don't be fooled: Just because a bot sounds confident and uses big medical words doesn't mean it has done the math to save your life.
Final Metaphor:
Think of a chatbot as a very knowledgeable tour guide. They can tell you the history of a city, the best restaurants, and the rules of the road. But if you ask them to drive the car through a storm, you should be very careful. They know the words for "steering wheel" and "brake," but they haven't actually learned how to keep the car from crashing in the rain. That requires a different kind of training—one that involves real risk and real consequences.