Theory of Code Space: Do Code Agents Understand Software Architecture?

This paper introduces Theory of Code Space (ToCS), a benchmark demonstrating that AI code agents exhibit significant, model-dependent variability in their ability to maintain coherent architectural beliefs and utilize active exploration or self-scaffolding during multi-file software engineering tasks.

Grigory Sapunov

Published Mon, 09 Ma

Imagine you are trying to understand a massive, mysterious city. You don't have a map, and you can't see the whole thing at once. You have to walk around, peek into buildings, read street signs, and try to figure out how the subway lines connect to the power grid.

This is exactly what AI "code agents" (smart computer programs that write and fix software) are supposed to do. But there's a problem: while these AIs are great at solving small, isolated puzzles, they often get completely lost when asked to navigate a whole software project. They might fix one file but accidentally break three others because they don't understand how the pieces fit together.

This paper introduces a new test called ToCS (Theory of Code Space) to see if AI agents can actually build a mental map of a software city, or if they are just wandering around blindly.

Here is the breakdown of their findings using simple analogies:

1. The "Active vs. Passive" Paradox

The Analogy: Imagine two students trying to learn a new city.

  • Student A (Active): Walks around, opens doors, and asks locals questions.
  • Student B (Passive): Is handed a giant, 500-page atlas of the city all at once.

Usually, we think the student with the atlas (Passive) should win. But the paper found something weird: it depends on the student.

  • One AI model (GPT) actually did better when it walked around and explored step-by-step. When it was handed the whole atlas at once, it got overwhelmed and confused (like trying to drink from a firehose).
  • Another AI model (Gemini) did the opposite. It got lost when it had to walk around, but it understood the city perfectly when it was given the whole atlas at once.

The Lesson: "Active exploration" isn't just a default skill; it's a specific talent that some AIs have and others lack.
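To make the two strategies concrete, here is a toy sketch in Python. Everything in it is illustrative, not the paper's actual harness: the in-memory "repository", the function names, and the import-following heuristic are all assumptions used only to show the contrast between handing over the whole atlas at once and exploring it step by step.

```python
# A toy in-memory "repository": file name -> source text.
REPO = {
    "main.py": "import utils\nimport db\n",
    "utils.py": "import db\n",
    "db.py": "CONNECTION = 'sqlite'\n",
}

def passive_context(repo):
    """Passive: hand the agent the entire 'atlas' at once."""
    return "\n".join(f"# {name}\n{src}" for name, src in repo.items())

def active_context(repo, start="main.py"):
    """Active: walk the import graph one file at a time, like a
    student exploring the city street by street."""
    seen, queue, chunks = set(), [start], []
    while queue:
        name = queue.pop(0)
        if name in seen or name not in repo:
            continue
        seen.add(name)
        src = repo[name]
        chunks.append(f"# {name}\n{src}")
        # Follow "import X" lines to decide where to explore next.
        for line in src.splitlines():
            if line.startswith("import "):
                queue.append(line.split()[1] + ".py")
    return "\n".join(chunks)
```

Both functions end up covering the same three files here; the paper's point is that which *delivery order* works better depends on the model, not on the repository.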

2. The "Memory Blackout" (Belief Instability)

The Analogy: Imagine you are drawing a map of the city on a piece of paper. Every few steps, you have to stop and show your map to a teacher.

  • The Stable Student: Draws a little bit, shows the map, adds a little more, shows the map again. The map keeps getting better and never loses what was already drawn.
  • The Unstable Student: Draws a great map, shows it to the teacher, then erases half of it before drawing the next part. By the end, they've forgotten the streets they discovered in the first half of the tour.

The Lesson: The paper found that some AI models suffer from "catastrophic forgetting." A smaller model (Gemini 2.5 Flash) kept a perfect, stable map. A much larger, "smarter" sibling (Gemini 2.5 Pro) built a good map, then suddenly forgot everything it had learned in a single step. Size doesn't always equal stability.
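The "erased map" failure can be made precise. Assuming the agent's beliefs are serialized as JSON-like dictionaries (the article mentions a JSON map), one simple stability check, sketched here as an assumption rather than the paper's actual metric, is that no snapshot should lose keys discovered earlier:

```python
def is_stable(snapshots):
    """A belief trace is 'stable' if every snapshot keeps all the
    keys (discovered facts) from the snapshot before it."""
    for prev, curr in zip(snapshots, snapshots[1:]):
        if not prev.keys() <= curr.keys():
            return False  # something learned earlier was erased
    return True

# The stable student: the map only ever grows.
stable_trace = [{"a": 1}, {"a": 1, "b": 2}, {"a": 1, "b": 2, "c": 3}]

# The unstable student: a good map, then half of it erased in one step.
unstable_trace = [{"a": 1, "b": 2}, {"c": 3}]
```

Under this toy definition, the first trace passes and the second fails at the single step where the earlier keys vanish.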

3. The "Self-Scaffolding" Effect

The Analogy: Imagine you are building a tower of blocks.

  • Method A: You build a block, put it down, and forget it. Then you build the next one.
  • Method B: You build a block, write down exactly where it is on a sticky note, and stick that note to the wall. Then you use that note to help you place the next block.

The Lesson: When the AI was allowed to keep its "notes" (the JSON map it wrote) in its memory while it kept working, one model (GPT) got significantly better. It used its own previous notes to help it understand the future. However, for another model, keeping the notes didn't help at all. This means the ability to "use your own notes to help you think" is a skill that varies wildly between different AIs.
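The sticky-note loop is easy to sketch. This is a minimal, hypothetical version: the prompt format, the `scaffolded_loop` name, and the stub agent are all inventions for illustration, standing in for a real LLM call that would return an updated JSON map.

```python
import json

def scaffolded_loop(steps, agent):
    """Self-scaffolding: feed the agent's own prior notes back into
    each step's context, instead of starting from a blank slate."""
    notes = {}  # the running JSON "belief map"
    for step in steps:
        context = f"Previous notes:\n{json.dumps(notes)}\n\nTask:\n{step}"
        notes = agent(context, notes)
    return notes

def stub_agent(context, notes):
    """Stand-in for an LLM call: just records each task it saw,
    without discarding anything already in the notes."""
    updated = dict(notes)
    task = context.split("Task:\n")[1]
    updated[task] = "seen"
    return updated
```

Running `scaffolded_loop(["inspect db.py", "patch utils.py"], stub_agent)` accumulates both tasks in the notes, which is exactly the behavior the paper found helps one model and does nothing for another.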

4. The "Hidden Rules" Test

The Analogy: In a real city, there are rules like "No trucks allowed on this bridge" or "All water pipes must go through the main valve." These aren't written on the street signs; you have to infer them by looking at the pipes and valves.
The paper planted these "hidden rules" in the code.

  • The Result: Most AIs missed them. But when the researchers gave the AI a better "exam prompt" (a clearer set of instructions on what to look for), the AIs suddenly got much better at finding these hidden rules. This proves that sometimes the AI can understand the architecture, but it just needed a clearer question to answer.

The Big Takeaway

The paper concludes that AI code agents are not yet ready to be independent architects.

They are like very talented interns who are great at writing code for a single room but often fail to understand the blueprint of the whole building. Some interns are good at exploring new rooms, while others are better if you hand them the blueprints first. Some forget what they learned five minutes ago, while others remember everything perfectly.

Why does this matter?
If we want AI to fix complex software bugs or build new systems, we can't just ask them to "go fix it." We need to design systems that:

  1. Help them remember what they've seen (like a digital sticky note).
  2. Teach them how to explore efficiently without getting overwhelmed.
  3. Give them clearer instructions on what "rules" to look for.

The authors released their test (ToCS) as open-source so that other researchers can use it to build better, more reliable AI architects.