Terminal Is All You Need: Design Properties for Human-AI Agent Collaboration

This paper argues that the widespread effectiveness of terminal-based AI agent tools stems from their inherent representational compatibility, transparency, and low barriers to entry, and proposes these design properties as essential standards that any future human-AI interface modality must deliberately replicate.

Alexandre De Masi

Published Thu, 12 Ma

Here is an explanation of the paper using simple language and everyday analogies.

The Big Idea: Why "Old School" Terminals Are Winning

Imagine you are hiring a super-smart robot assistant to help you fix your house. You have two ways to talk to it:

  1. The "Magic Remote" (GUI): You point at a picture of the house on a screen, and the robot tries to figure out which button to press or which wall to paint based on what it sees. This is hard. The robot often gets confused by shadows, colors, or weird layouts. It's like trying to teach a dog to drive by showing it a photo of a steering wheel.
  2. The "Walkie-Talkie" (Terminal): You speak to the robot, and it speaks back in plain text commands. "Go to the kitchen, pick up the hammer, and hit the nail." The robot types it out, you read it, say "Yes," and it does it.

The paper argues that the "Walkie-Talkie" (the computer terminal) is currently the best way for humans and AI to work together.

Even though the tech world is obsessed with fancy graphical interfaces (like clicking icons on a screen), the most effective AI tools right now are actually text-based. The authors say this isn't an accident; it's because the terminal naturally solves three big problems that fancy screens struggle with.


The Three Secret Ingredients for Success

The paper identifies three "design properties" that make the terminal work so well. Think of these as the three legs of a sturdy stool.

1. Speaking the Same Language (Representational Compatibility)

  • The Problem: AI models (the brains of the robot) are basically giant text processors. They think in words and code. If you show them a picture of a button, they have to do a huge amount of mental gymnastics to translate "red circle at the top right" into "click here."
  • The Terminal Solution: The terminal speaks the robot's native language: Text.
  • The Analogy: Imagine you are a chef (the AI) who only speaks French.
    • GUI: You show the chef a picture of a tomato. They have to guess, "Is that a tomato? Is it red? Where is the knife?" It's slow and error-prone.
    • Terminal: You hand the chef a recipe card that says "Chop 2 tomatoes." The chef reads it and acts immediately. No guessing, no translation needed.
  • Why it matters: When the human and the AI are both reading and writing text, there is zero friction. The AI doesn't have to "see" the screen; it just reads the instructions.

2. The "Glass Box" (Transparency)

  • The Problem: When you use a fancy app, the AI might click a button, and you just see the result. You don't know why it clicked there or what it was thinking. If it makes a mistake, you have no idea how to stop it. It's like a black box.
  • The Terminal Solution: The terminal is a Glass Box. Every step the AI takes is written down in a log.
  • The Analogy:
    • GUI: You tell a self-driving car to "Go to the store." It suddenly swerves. You have no idea if it saw a pedestrian, a pothole, or if it just got confused. You can't intervene until it's too late.
    • Terminal: The car says: "I am turning left because the traffic light is green. I am slowing down because there is a dog. Do you approve?"
    • The "Approval Gate": The terminal pauses and asks, "Do you want me to do this?" You can say "Yes," "No," or "Wait, change that." You are always in the loop.
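The approval gate described above can be sketched as a tiny loop: the agent proposes a command as plain text, and nothing runs until the human says yes, says no, or rewrites it. This is an illustrative sketch, not the paper's implementation; the function names and the "edit:" convention are invented for the example.

```python
# Minimal sketch of a terminal "approval gate": the proposed command is
# plain text, so the human can approve it, reject it, or edit it in place
# before anything executes. (Names and conventions are illustrative.)

def approval_gate(proposed_command: str, get_approval) -> str:
    """Show the proposed command, then act only on the human's decision."""
    print(f"Agent proposes: {proposed_command}")
    decision = get_approval()  # in a real terminal: input("Approve? [y/n/edit] ")
    if decision == "y":
        return f"executed: {proposed_command}"
    elif decision.startswith("edit:"):
        # The human rewrote the plan; the edited text becomes the command.
        return f"executed: {decision[len('edit:'):].strip()}"
    return "skipped"

# The human stays in the loop for every step:
approval_gate("rm old_logs/", lambda: "y")            # runs as proposed
approval_gate("rm -rf /", lambda: "n")                # blocked by the human
approval_gate("ls", lambda: "edit: ls -la")           # runs the edited version
```

Because the proposal is just text, "Wait, change that" costs nothing: the human edits a string rather than wrestling with a half-completed GUI action.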

3. No "Expertise Tax" (Low Barriers to Entry)

  • The Problem: Traditionally, using a computer terminal was hard. You had to memorize complex codes (like rm -rf /), which scared most people. It was like a secret club for hackers.
  • The Terminal Solution: AI changes the rules. Now, you don't need to know the secret codes. You just speak in natural language.
  • The Analogy:
    • Old Way: You want to find all your old photos. You have to learn a complex command like find . -name "*.jpg" -size +10M. If you get the syntax wrong, you get a cryptic error or nothing at all.
    • New Way (AI + Terminal): You just say, "Find all my big photos." The AI translates your English into the complex code, runs it, and shows you the results.
  • Why it matters: It lowers the barrier. You don't need to be a computer expert to use the powerful tools; you just need to know how to talk.
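The "translation" step the AI performs can be illustrated with a toy sketch. A real agent would use a language model; this lookup table is a stand-in that only shows the shape of the idea: English in, shell command out. The table entries are hypothetical.

```python
# Toy sketch of "the AI translates your English into the command".
# A real agent uses a language model; this dictionary is a placeholder
# that illustrates the translation step, nothing more.

REQUEST_TO_COMMAND = {
    "find all my big photos": 'find . -name "*.jpg" -size +10M',
    "show hidden files": "ls -la",
}

def translate(request: str) -> str:
    """Map a natural-language request to a shell command (or admit defeat)."""
    return REQUEST_TO_COMMAND.get(
        request.lower().strip(), "echo 'request not understood'"
    )

translate("Find all my big photos")  # the user never touches find's syntax
```

The user never has to learn find's flags; the complex syntax lives on the AI's side of the conversation, while the human side stays plain English.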

The "Mixed-Initiative" Dance

The paper also talks about how humans and AI should take turns leading the dance.

In a good terminal setup:

  1. You say what you want ("Fix the login bug").
  2. The AI proposes a plan ("I will check line 42 and add a safety check").
  3. You review the plan. You can say "Yes," "No," or "Actually, skip that part and do this instead."
  4. The AI adjusts and executes.

This is called Mixed-Initiative. The human stays in charge, but the AI does the heavy lifting. The text stream makes this natural because a plain-text plan is easy to stop, read, and edit. In a graphical interface, it's much harder to "pause" a robot mid-click and tell it to change its mind.
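The four-step dance above is easy to sketch precisely because the plan is just a list of text lines: the human can keep a step, rewrite it, or delete it before anything runs. This is an illustrative sketch with invented step contents, not the paper's implementation.

```python
# Sketch of the mixed-initiative review step: the agent's plan is a list
# of plain-text steps, so the human can rewrite or drop any of them.
# (Plan contents and the edit format are illustrative.)

def review_plan(plan: list[str], edits: dict) -> list[str]:
    """Apply the human's edits: new text replaces a step, None drops it."""
    reviewed = []
    for i, step in enumerate(plan):
        if i in edits:
            if edits[i] is not None:
                reviewed.append(edits[i])  # human rewrote this step
            # edits[i] is None -> human vetoed this step entirely
        else:
            reviewed.append(step)          # human approved this step as-is
    return reviewed

# The agent proposes; the human tweaks step 1 and vetoes the risky step 2.
plan = ["check line 42", "add a safety check", "rewrite the whole module"]
final = review_plan(plan, {1: "add a null check", 2: None})
```

Only after the human signs off on `final` does the agent execute anything, which is exactly the "AI adjusts and executes" step in the list above.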

The Takeaway

The authors aren't saying we should throw away all our fancy screens and go back to the 1980s. They are saying: "The terminal is the gold standard for how AI should talk to us."

If we want to build AI that works well with graphical interfaces (like clicking buttons on a website), we need to engineer those screens to act like terminals.

  • Give the AI text descriptions of what it sees (not just pictures).
  • Show the AI's "thought process" on the screen so we can read it.
  • Let us interrupt and correct the AI easily using plain English.
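The first bullet, giving the AI text descriptions instead of pictures, can be sketched as rendering a screen the way an accessibility tree does: every element becomes a readable line of text. The field names and the login screen here are invented for illustration.

```python
# Sketch of "engineer the screen to act like a terminal": expose the GUI
# to the agent as structured text (similar in spirit to an accessibility
# tree) rather than as pixels. Field names and elements are illustrative.

login_screen = [
    {"role": "textbox", "label": "Username"},
    {"role": "textbox", "label": "Password"},
    {"role": "button", "label": "Sign in"},
]

def describe(elements: list) -> str:
    """Render the UI state as plain text both agent and human can read."""
    return "\n".join(f"- {e['role']}: {e['label']}" for e in elements)

print(describe(login_screen))
# The agent now reads "- button: Sign in" instead of guessing at pixels.
```

With this representation, the same three properties carry over: the AI reads its native language (text), the human can audit exactly what the AI sees (transparency), and corrections happen in plain English rather than pixel coordinates.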

In short: The terminal isn't just a leftover tool from the past; it's a design blueprint for the future of human-AI teamwork. It works because it's clear, honest, and easy to talk to.