Here is an explanation of the paper "Why Code, Why Now" using simple language, analogies, and metaphors.
The Big Mystery: Why is AI Good at Coding but Bad at Everything Else?
Imagine you have a super-smart robot student.
- In Math Class (Coding): It gets an A+. It can write complex programs, fix bugs, and solve logic puzzles. It seems to understand the rules perfectly.
- In Gym Class (Reinforcement Learning): It keeps tripping over its own feet. No matter how many times it tries to run, jump, or play a game, it never seems to get better. It just gets confused or gives up.
The paper asks: Why is this happening? Is the robot just "dumb" at games? Or is the game itself broken?
The author, Zhimin Zhao, argues that the problem isn't the robot's brain size. It's the nature of the homework.
The Core Idea: The "Feedback Loop"
To learn anything, you need feedback.
- Coding is like a strict math teacher. If you miss a semicolon, the teacher immediately points to the exact line and says, "Wrong here." If you get the logic right, the program runs. The feedback is dense, local, and instant.
- Reinforcement Learning (RL) is like a game of "Hot and Cold" played in the dark. You take a step, and the teacher just says "Good" or "Bad" at the very end of the game. They don't tell you which step was wrong. You might have taken 100 steps, and only the last one mattered.
The Analogy:
- Coding is like learning to drive with a GPS and a co-pilot who screams, "Turn left now!" or "You missed the turn!"
- RL is like learning to drive blindfolded: you drive for 10 miles and are then told, "You crashed." You have no idea whether you crashed because you turned too early, hit a pothole, or drove too fast.
The paper says: You can't learn a skill if you don't know where you made the mistake.
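The "Hot and Cold" difference can be made concrete with a toy sketch (this example is mine, not from the paper): an agent wanders a 1-D corridor, and we count how many feedback signals it actually receives per episode under each regime.

```python
import random

GOAL = 10  # position the agent must reach in a 1-D corridor

def run_episode(policy, dense_feedback):
    """Walk 20 steps; return the feedback signals the learner gets to see."""
    pos, signals = 0, []
    for _ in range(20):
        step = policy()              # -1 (left) or +1 (right)
        pos += step
        if dense_feedback:
            # "Turn left now!": a verdict after every single step
            signals.append(1 if step > 0 else -1)
    if not dense_feedback:
        # One verdict at the very end, with no hint about which step mattered
        signals = [1 if pos >= GOAL else -1]
    return signals

random.seed(0)
policy = lambda: random.choice([-1, 1])
dense = run_episode(policy, dense_feedback=True)
sparse = run_episode(policy, dense_feedback=False)
print(len(dense), len(sparse))  # → 20 1
```

Twenty signals per episode versus one: the dense learner can assign blame to individual steps, while the sparse learner has to guess which of its 20 moves caused the final verdict.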
The 5 Levels of "Learnability"
The author creates a "Ladder of Learning" to explain why some tasks are easy for AI and others are impossible, no matter how big the AI gets.
Level 0: The Black Hole (Unobservable)
- The Analogy: Trying to guess the combination of a safe that has no dial, no sound, and no keyhole.
- What happens: You can try a billion combinations, but you get zero information.
- Result: Impossible. No amount of computing power helps.
Level 1: The Moving Target (Adversarial)
- The Analogy: Playing chess against an opponent who changes the rules of the game while you are thinking about your move.
- What happens: As soon as you learn a strategy, the game changes to trick you.
- Result: Unstable. You can never get good enough because the goalpost keeps moving. (This is why many RL agents fail).
Level 2: The Noisy Room (Stochastic)
- The Analogy: Trying to hear a friend in a crowded, noisy bar. You can't hear every word perfectly, but if you listen long enough, you can figure out the conversation.
- What happens: The signal is fuzzy, but it exists.
- Result: Learnable. This is how most current AI (like image recognition) works. It just needs a lot of data to filter out the noise.
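Here is a quick sketch of "listening long enough" (a toy illustration, not from the paper): each observation is the true signal plus random noise, and averaging enough observations recovers the signal.

```python
import random

random.seed(42)
TRUE_SIGNAL = 0.7   # the "conversation" hidden under the noise

def noisy_sample():
    # Every individual observation is fuzzy: signal plus random noise
    return TRUE_SIGNAL + random.gauss(0, 1.0)

few = sum(noisy_sample() for _ in range(10)) / 10
many = sum(noisy_sample() for _ in range(100_000)) / 100_000
print(round(few, 3), round(many, 3))  # the big average lands near 0.7
```

No single sample is trustworthy, but the noise cancels out over many samples. That is the sense in which Level 2 is learnable: it just takes a lot of data.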
Level 3: The One-Way Mirror (Indirect)
- The Analogy: A writer who only gets feedback when they make a mistake. If they write a sentence that makes sense, no one says anything. If they write nonsense, someone crosses it out.
- What happens: You know what not to do, but you never get a "Gold Star" for doing it right. You just keep trying until you stop making mistakes.
- Result: Learnable, but slow. This is how code generation works during training: the model sees millions of valid programs and absorbs their patterns, without ever receiving an explicit "this is correct" label for any of them.
Level 4: The Perfect Judge (Direct)
- The Analogy: A math test with an answer key. You write an answer, and a machine instantly says "Right" or "Wrong" with 100% certainty.
- What happens: Immediate, perfect verification.
- Result: Highly Learnable. This is why code is so easy for AI. Compilers and type checkers act as perfect judges.
The Secret Sauce of Code:
Code generation is special because it sits on Level 3 (learning from valid examples) but is propped up by Level 4 (compilers that instantly verify errors). It's like learning to swim with a life vest that instantly pulls you up if you sink.
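You can see this "perfect judge" directly in Python, whose built-in compile() gives an instant, unambiguous verdict on any candidate program (syntax only, so it won't catch logic bugs, but the verdict it does give is 100% certain):

```python
def perfect_judge(source: str) -> bool:
    """A Level-4 verifier: Python's own compiler says Right or Wrong,
    instantly and with complete certainty."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

candidates = [
    "def add(a, b): return a + b",   # valid
    "def add(a, b) return a + b",    # missing colon: rejected instantly
]
verdicts = [perfect_judge(c) for c in candidates]
print(verdicts)  # → [True, False]
```

No human grader, no ambiguity, no delay: every candidate gets a verdict the moment it is written. Very few domains outside code have a judge like this.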
The "Expressibility Trap" (Why Bigger isn't Always Better)
There is a common myth: "If we just make the AI bigger and give it more data, it will solve everything."
The paper says: No.
The Analogy:
Imagine you are trying to find a specific needle in a haystack.
- Small AI: Can't find it.
- Big AI: Finds it faster.
- Super-Big AI: But if there is no needle at all (Level 0), or the needle keeps moving (Level 1), a bigger AI just searches the wrong places faster.
The paper argues that Expressibility (what a model can represent) is different from Learnability (what it can actually learn from the feedback available).
- If a task has no clear rules or feedback, making the AI smarter just makes it better at guessing wrong things.
- The Ceiling: The limit of AI isn't how big the model is; it's whether the task has a structure that allows learning.
What Should We Do Next?
The author suggests we stop trying to force AI to learn "hard" things (like general reasoning or complex strategy games) and start re-engineering the problems to make them easier to learn.
Four Strategies:
- Break it down: Don't ask the AI to "Write a whole movie." Ask it to "Write the next sentence." (Small, local steps are easier to learn).
- Engineer better feedback: Instead of just saying "Good job," give the AI a specific hint like "The character's motivation was unclear in paragraph 2."
- Lower the bar: Don't aim for perfection. Aim for "good enough for this step."
- Change the game: Turn a hard problem into a different, easier one. (e.g., Instead of asking "Is this medical diagnosis correct?", ask "Does this image look like the training images of cancer?").
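Strategy 2 ("engineer better feedback") is essentially what RL researchers call reward shaping. A minimal sketch, with made-up function names and a toy trajectory: the sparse signal delivers one verdict at the end, while the shaped signal scores every single step by the progress it made toward the goal.

```python
def sparse_reward(positions, goal):
    """Original signal: silence at every step, one verdict at the end."""
    n_steps = len(positions) - 1
    return [0.0] * (n_steps - 1) + [1.0 if positions[-1] == goal else -1.0]

def shaped_reward(positions, goal):
    """Engineered signal: reward each step by how much closer it got to the goal."""
    return [abs(goal - a) - abs(goal - b)   # positive if this step made progress
            for a, b in zip(positions, positions[1:])]

traj = [0, 1, 2, 1, 2, 3]   # a wandering walk toward goal = 3
print(sparse_reward(traj, 3))  # → [0.0, 0.0, 0.0, 0.0, 1.0]
print(shaped_reward(traj, 3))  # → [1, 1, -1, 1, 1]
```

The shaped version tells the learner exactly which step was a mistake (the third one), and the per-step rewards telescope: they sum to the total progress made, so the engineered feedback stays faithful to the original goal while being far denser.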
The Bottom Line
Code is easy for AI not because AI is smart, but because code is "friendly" to learning. It has clear rules, instant feedback, and no moving goalposts.
Most other problems (like understanding human emotion, playing complex strategy games, or proving deep math theorems) are "unfriendly." They lack the clear feedback loops that AI needs to learn.
The future of AI isn't about building bigger brains; it's about finding tasks that have a clear path to learning, or redesigning the tasks so they do.