Imagine you are a chef trying to create the perfect recipe for a new dish. You want to know if your dish is truly delicious, not just because you've tasted it a hundred times while cooking it, but because a stranger, who has never seen your kitchen, takes a bite and says, "Wow."
In the world of Machine Learning, this "stranger" is the Test Set: the data the model has never seen before.
The problem, as this paper points out, is that many AI chefs are cheating. They are tasting the final dish while they are still cooking it, or they are peeking at the stranger's notes before the stranger even arrives. This is called Data Leakage. It makes the AI look smarter than it actually is, leading to false confidence and failed predictions in the real world.
The author, Simon Roth, proposes a solution not as a "checklist" (a list of rules to remember), but as a Grammar—a set of strict, unbreakable rules built into the very language of how we build AI.
Here is how the paper works, explained through simple analogies:
1. The Problem: The "Leaky Kitchen"
Currently, building an AI is like cooking in a kitchen with no walls. You can accidentally use the ingredients meant for the final tasting (the Test Set) to season the soup while it's cooking.
- The old way: You write code, run it, and hope you didn't cheat. If you did, you might only find out years later when the paper is published and the results don't hold up.
- The new way: Build a kitchen with locked doors. If you try to walk into the "Final Tasting Room" before you've finished cooking, the door physically won't open.
2. The Solution: A "Grammar" for AI
Just as English grammar has rules that make a sentence "correct" or "nonsense" (e.g., "The cat sat" is good; "Sat the cat" is bad), this paper defines a Grammar of Machine Learning.
It breaks the AI building process down into 7 Basic Moves (Primitives). You can only do these moves in a specific order, and the "language" itself will stop you if you try to cheat.
The 7 Moves (The Primitives):
- Split: You take your pile of ingredients (Data) and lock them into three separate, labeled jars: Training (for practice), Validation (for practice checks), and Test (the final exam).
- The Rule: You cannot touch the "Test" jar until the very end.
- Prepare: You wash and chop your ingredients.
- The Rule: You must do this inside the Training jar. If you wash ingredients from the Test jar while prepping the Training jar, the system yells, "Stop! You're cheating!"
- Fit: You cook the recipe (train the model).
- Evaluate: You taste the soup during cooking to see if it needs more salt.
- The Rule: You can do this as many times as you want, but only with the Validation jar.
- Assess: The final moment. You serve the dish to the stranger (the Test Set).
- The Rule: You can only do this once. Once you serve it, the "Assess" button breaks. You cannot taste it again to tweak the recipe. If you try, the system locks you out.
- Predict: Using the finished recipe to cook for new customers.
- Explain: Telling people why the soup tastes the way it does.
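The ordering rules above can be sketched as a tiny state machine. Everything here (the class name `Experiment`, the method names, the error messages) is a hypothetical illustration of the idea, not the paper's actual API:

```python
import random

class GrammarError(Exception):
    """Raised when a primitive is used out of order."""

class Experiment:
    """Toy enforcement of the ordering: split -> fit -> evaluate
    (as often as you like) -> assess (exactly once)."""

    def __init__(self, data):
        self.data = data          # list of (features, target) pairs
        self.splits = None
        self.model = None
        self.assessed = False

    def split(self, train=0.6, val=0.2, seed=0):
        rng = random.Random(seed)
        rows = self.data[:]
        rng.shuffle(rows)
        n = len(rows)
        a, b = int(n * train), int(n * (train + val))
        self.splits = {"train": rows[:a], "val": rows[a:b], "test": rows[b:]}
        return self

    def fit(self):
        if self.splits is None:
            raise GrammarError("fit requires a prior split")   # the ID check
        # toy "model": memorise the mean of the training targets
        ys = [y for _, y in self.splits["train"]]
        self.model = sum(ys) / len(ys)
        return self

    def evaluate(self):
        if self.model is None:
            raise GrammarError("evaluate requires a fitted model")
        return self._score(self.splits["val"])   # validation jar only

    def assess(self):
        if self.model is None:
            raise GrammarError("assess requires a fitted model")
        if self.assessed:
            raise GrammarError("assess may run only once")     # the one-time pass
        self.assessed = True
        return self._score(self.splits["test"])

    def _score(self, rows):
        # negative mean absolute error: higher is better
        return -sum(abs(y - self.model) for _, y in rows) / len(rows)
```

With this structure, `Experiment(data).fit()` raises immediately (no split yet), and a second call to `assess()` raises instead of returning a fresh score: the "Assess button" really does break after one press.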
3. The "Hard Constraints" (The Security Guards)
The paper introduces four "Hard Constraints" that act like security guards at the door of your kitchen:
- Guard 1 (The One-Time Pass): You can only serve the final dish to the stranger once. If you try to serve it again to tweak the score, the guard stops you. This prevents "Selection Leakage" (cheating by picking the best score out of 100 tries).
- Guard 2 (The Clean Apron): You cannot wash your hands (preprocess data) using the Test ingredients. You must wash them only with the Training ingredients. This prevents "Estimation Leakage."
- Guard 3 (The ID Check): You can't cook (Fit) unless you have a valid ID from the "Split" step. If you try to cook raw data that hasn't been split, the system rejects it.
- Guard 4 (No Peeking): You cannot look at the Test labels (the answers) before you split the data.
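Guard 2 has a concrete coding shape: any statistic used for preprocessing (a mean, a standard deviation, an imputation value) must be computed from the training split alone and then merely applied to the test split. A minimal sketch, with hypothetical helper names:

```python
def fit_scaler(train):
    """Learn standardisation parameters from the TRAINING split only."""
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    return mean, var ** 0.5

def apply_scaler(params, values):
    """Apply previously learned parameters; never re-fit on new data."""
    mean, std = params
    return [(x - mean) / std for x in values]

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0]

params = fit_scaler(train)              # statistics come from train only
scaled_test = apply_scaler(params, test)
# The leaky version would call fit_scaler(train + test), letting the
# outlier 10.0 drag the mean upward and inflate the spread, so the
# test point would look far less surprising than it really is.
```

The fit/apply separation is why libraries expose distinct "fit" and "transform" steps; the grammar's contribution is making the leaky ordering impossible rather than merely discouraged.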
4. Why This Matters: The "Magic" of the Grammar
The author tested this idea with three different computer languages (Python, R, and Julia). In all three, the same Grammar could be implemented and its rules enforced.
- Before: If you wanted to cheat, you could. You just had to be careful not to get caught.
- Now: If you try to cheat, the code crashes or refuses to run. It's not a warning; it's a hard stop.
The paper proves that cheating (Data Leakage) inflates the AI's performance scores significantly. By forcing the grammar, the AI can no longer "fake" its intelligence. It has to prove it on the final exam, and it can only take that exam once.
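The inflation mechanism behind the one-time pass is easy to demonstrate: even models that are pure coin flips look impressive if you evaluate many of them on the same test set and report only the best. This toy simulation is illustrative, not one of the paper's experiments; all names and numbers are made up:

```python
import random

rng = random.Random(42)
test_labels = [rng.randint(0, 1) for _ in range(50)]

def random_model_accuracy():
    """A 'model' that guesses labels uniformly at random."""
    guesses = [rng.randint(0, 1) for _ in test_labels]
    hits = sum(g == y for g, y in zip(guesses, test_labels))
    return hits / len(test_labels)

single = random_model_accuracy()                       # one honest assess
best_of_100 = max(random_model_accuracy() for _ in range(100))

# A coin flip is right about half the time, but the best of 100 coin
# flips on a 50-example test set reliably scores well above 50% --
# apparent skill created purely by re-using the test set.
```

Under the grammar, only `single` is obtainable: the second call to Assess would be refused, so the best-of-100 number can never be computed, let alone published.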
5. The "Chomsky" Connection
The author compares this to famous grammarians.
- Wilkinson (Graphics): made a grammar for charts (the basis of ggplot2).
- Codd (Databases): made a grammar for working with data (the relational model behind SQL).
- Roth (AI): makes a grammar for how we learn from data.
Chomsky's famous sentence "Colorless green ideas sleep furiously" is grammatically correct yet meaningless. In the same way, this grammar ensures your AI workflow is structurally sound. It doesn't guarantee your AI is good (you could still choose a bad recipe), but it guarantees you didn't cheat to make it look good.
Summary
Think of this paper as the invention of a locked, tamper-proof kitchen for AI.
- Old Way: "Please don't peek at the test answers." (People peek anyway).
- New Way: The test answers are in a vault that only opens after the cooking is done, and the vault has a timer that only lets you open it one time.
If you try to break the rules, the kitchen locks up. This ensures that when an AI says, "I am 95% accurate," it actually means it, not that it just memorized the test.