Link Wars: The Semantic Crisis. Is the debate over or is it just beginning?

This paper argues that the current fragmentation in high-performance interconnects stems from a fundamental "semantic crisis" caused by implicit, vendor-specific assumptions about time and ordering, and proposes that adopting explicit, testable link semantics through the Open Atomic Ethernet (OAE) standard is essential to restore correctness and enable convergence.

Paul Borrill

Published Tue, 10 Ma

Here is an explanation of the paper "Link Wars: The Semantic Crisis" using simple language, everyday analogies, and metaphors.

The Big Picture: A World of Broken Promises

Imagine you are running a massive global logistics company. You have trucks (data) moving between warehouses (servers) every second. For the last 50 years, the industry has been obsessed with one thing: Speed. "How fast can we build a bigger truck?" "How many lanes can we add to the highway?"

But this paper argues that we have hit a wall. We have built incredibly fast highways, but the rules of the road are a mess.

The author, Paul Borrill, calls this the "Semantic Crisis." In plain English, "semantics" means meaning. The crisis is that different trucking companies (tech vendors like NVIDIA, Intel, Google, etc.) are using different languages to describe what "delivery" actually means.

  • Company A says: "I dropped the package at the door. I'm done." (Even if the door was locked.)
  • Company B says: "I dropped the package, but I won't know if it was accepted until I get a text message back... which might never come."
  • Company C says: "I promise it's there, but you can't ask me how I know."

Because they can't agree on what "delivered" means, software engineers have to build massive, complex safety nets to catch the mistakes. This slows everything down and makes systems fragile.


The Core Problem: The "Forward-Only" Mistake

The paper identifies a single, deep-rooted error in how we design computer networks. It calls this the FITO (Forward-In-Time-Only) Category Mistake.

The Analogy: The One-Way Mailbox
Imagine you send a letter to a friend.

  1. You drop it in the mailbox.
  2. You wait.
  3. You hope your friend gets it.
  4. If they get it, they might send a postcard back saying, "Got it!"

This is how almost all computer networks work today. The sender sends data and then has to wait and guess if it arrived. The sender has no way of knowing the truth instantly.

The paper argues this is a design choice, not a law of physics. We could build a system where the sender and receiver shake hands before the transaction is considered "done." But because we stuck with the "One-Way Mailbox" model, we have to invent crazy workarounds to make sure data doesn't get lost or corrupted.
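To make the "One-Way Mailbox" problem concrete, here is a minimal toy sketch. It is not taken from the paper: the names `SenderView` and `one_way_send` are made up for illustration, and the model assumes the sender simply waits for a reply until a timeout. It shows that two very different realities look identical from the sender's side.

```python
from enum import Enum


class SenderView(Enum):
    CONFIRMED = "confirmed"
    UNKNOWN = "unknown"  # lost? slow? delivered but the reply was lost?


def one_way_send(delivered: bool, ack_returned: bool) -> SenderView:
    """What the sender can conclude once its timeout expires (toy model)."""
    if delivered and ack_returned:
        return SenderView.CONFIRMED
    # Every other case looks the same to the sender: silence.
    return SenderView.UNKNOWN


# Three different things that can happen on the wire (hypothetical cases).
cases = {
    "delivered, reply arrived": one_way_send(True, True),
    "delivered, reply lost": one_way_send(True, False),
    "never delivered": one_way_send(False, False),
}

for what_happened, sender_view in cases.items():
    print(f"{what_happened:26s} -> sender sees: {sender_view.value}")
```

The last two cases produce the same answer for the sender, which is why software above the network has to add its own retries and duplicate checks.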

The Symptoms: How the Crisis Shows Up

Because of this "One-Way Mailbox" problem, the industry has developed four bad habits (pathologies):

1. The "Universal Fence" (The Over-Protective Parent)

The Metaphor: Imagine a parent who wants to make sure their child crosses the street safely. Instead of just watching the child cross, the parent stops every single car in the city and forces everyone to wait until the child is safely across.
The Reality: In computer networks (specifically RDMA), the sender doesn't know whether a message arrived, so software ends up placing a "fence" after every single operation: each one must wait for a confirmation before the next can start. This kills speed. It turns a super-fast highway into a single-lane road where everyone stops at every intersection.
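A rough back-of-the-envelope sketch shows why this hurts. The numbers below (a 2,000 ns round trip, one new operation issued every 100 ns, 1,000 operations) are assumptions chosen for illustration, not measurements from the paper, and the function names are hypothetical.

```python
def fenced_time_ns(n_ops: int, rtt_ns: int) -> int:
    """Every operation waits a full round trip before the next one starts."""
    return n_ops * rtt_ns


def pipelined_time_ns(n_ops: int, rtt_ns: int, issue_gap_ns: int) -> int:
    """Operations are issued back to back; only the last one's round trip
    is left over once the pipeline drains."""
    return (n_ops - 1) * issue_gap_ns + rtt_ns


N_OPS = 1_000          # assumed number of operations
RTT_NS = 2_000         # assumed round-trip time per confirmation
ISSUE_GAP_NS = 100     # assumed gap between back-to-back operations

fenced = fenced_time_ns(N_OPS, RTT_NS)
pipelined = pipelined_time_ns(N_OPS, RTT_NS, ISSUE_GAP_NS)

print(f"fence after every op: {fenced} ns")
print(f"pipelined:            {pipelined} ns")
print(f"slowdown from fences: {fenced / pipelined:.1f}x")
```

Under these assumed numbers, fencing every operation costs roughly twenty times more than letting operations overlap. The exact ratio depends entirely on the real round-trip time and issue rate.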

2. The "Fire-and-Forget" (The Ghost Writer)

The Metaphor: You write a letter, throw it into a black hole, and assume it arrived. If it didn't, you don't know until your friend stops talking to you.
The Reality: In AI training (for example, on clusters of NVIDIA GPUs), the computer sends data and immediately moves on to the next task. If the data got lost or arrived out of order, the AI might learn the wrong thing, but the system won't know until it's too late. It's like building a house on a foundation you haven't checked.
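Here is a tiny sketch of that failure mode. It is a toy, not how any real GPU library works: `fire_and_forget` is a hypothetical function, and the "link" is just a list that silently skips some items.

```python
def fire_and_forget(values, dropped_indices):
    """Send every value without checking; the link silently drops some."""
    received = [v for i, v in enumerate(values) if i not in dropped_indices]
    sender_believes_ok = True  # the sender never asks, so it never finds out
    return sender_believes_ok, sum(received)


updates = [1, 2, 3, 4]          # stand-in for pieces of a training update
expected_total = sum(updates)   # what the receiver should end up with

believes_ok, actual_total = fire_and_forget(updates, dropped_indices={2})

print("sender believes all is well:", believes_ok)
print("expected total:", expected_total, "| actual total:", actual_total)
```

The sender reports success while the receiver is working with the wrong total, and nothing in the exchange flags the difference.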

3. The "Secret Handshake" (The Walled Garden)

The Metaphor: Only people wearing a specific brand of hat are allowed into the club. If you don't have the hat, you can't talk to anyone inside, even if you speak the same language.
The Reality: Companies like NVIDIA build their own private networks (NVLink). They work great inside their own club, but they are a mystery to everyone else. You can't easily mix their chips with Intel's chips, because the two sides don't agree on the rules of "completion."

4. The "Tower of Babel" (Multi-Cloud Chaos)

The Metaphor: You are trying to coordinate a project between a team in New York, one in London, and one in Tokyo. The New York team says "Done" means "I signed the paper." The London team says "Done" means "I mailed the paper." The Tokyo team says "Done" means "I thought about the paper."
The Reality: When you try to run a computer program across different cloud providers (AWS, Google, Azure), they all define "success" differently. The software has to spend much of its effort translating between these different meanings, which slows everything down.
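One way to picture the translation burden is sketched below. Everything in it is an assumption for illustration: the three levels of "done", the provider names `cloud_a`, `cloud_b`, and `cloud_c`, and the idea that portable software can only rely on the weakest meaning in the group. None of it describes a real provider's behavior.

```python
from enum import IntEnum


class Done(IntEnum):
    """Hypothetical meanings of 'done', ordered from weakest to strongest."""
    LEFT_THE_SENDER = 1
    REACHED_THE_RECEIVER = 2
    VISIBLE_TO_RECEIVER_SOFTWARE = 3


# Made-up providers, each with its own private meaning of "done".
providers = {
    "cloud_a": Done.VISIBLE_TO_RECEIVER_SOFTWARE,
    "cloud_b": Done.REACHED_THE_RECEIVER,
    "cloud_c": Done.LEFT_THE_SENDER,
}


def portable_guarantee(meanings):
    """Code that must run everywhere can only count on the weakest promise."""
    return min(meanings.values())


weakest = portable_guarantee(providers)
needs_extra_checks = [name for name, d in providers.items()
                      if d < Done.VISIBLE_TO_RECEIVER_SOFTWARE]

print("portable guarantee:", weakest.name)
print("providers needing extra software checks:", needs_extra_checks)
```

In this toy picture, the extra checks for the weaker providers are the "translation" work the software has to carry.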


The Solution: Open Atomic Ethernet (OAE)

The paper proposes a new way to build networks called Open Atomic Ethernet (OAE).

The Analogy: The Two-Way Handshake
Instead of the "One-Way Mailbox," imagine a Two-Way Handshake.

  • Sender: "I am handing you this package."
  • Receiver: "I have caught it. It is mine."
  • Both: "Transaction complete. We both know it's done."

This happens instantly and is guaranteed. There is no guessing. There is no waiting for a postcard.

Key Features of OAE:

  1. Bilateral Transactions: Both sides agree on the outcome at the same time.
  2. Explicit Contracts: You can tell the network, "I need this to be strictly ordered" or "I don't care about order, just get it there fast." The network guarantees exactly what you asked for.
  3. No More Guessing: If a connection breaks, both sides know immediately. No more "Is it lost? Or is it just slow?"
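The three features above can be sketched as a toy model. This is not the OAE specification or any real API: `Endpoint`, `Ordering`, `Outcome`, and `bilateral_transfer` are hypothetical names. The sketch also decides the outcome in one place, which skips the hard part of how real hardware gets both ends to agree. It only illustrates the goal: an ordering contract that is stated explicitly, and a result that both sides record identically.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Ordering(Enum):
    STRICT = "strict"      # "I need this to be strictly ordered"
    RELAXED = "relaxed"    # "I don't care about order, just get it there"


class Outcome(Enum):
    COMMITTED = "committed"
    ABORTED = "aborted"


@dataclass
class Endpoint:
    name: str
    inbox: List[Tuple[Ordering, bytes]] = field(default_factory=list)
    last_outcome: Optional[Outcome] = None


def bilateral_transfer(sender: Endpoint, receiver: Endpoint, payload: bytes,
                       ordering: Ordering, link_up: bool) -> Outcome:
    """Toy transfer: both endpoints always end with the same recorded outcome."""
    outcome = Outcome.COMMITTED if link_up else Outcome.ABORTED
    if outcome is Outcome.COMMITTED:
        receiver.inbox.append((ordering, payload))
    # Neither side is left guessing: both learn the same result.
    sender.last_outcome = outcome
    receiver.last_outcome = outcome
    return outcome


a, b = Endpoint("sender"), Endpoint("receiver")
first = bilateral_transfer(a, b, b"package-1", Ordering.STRICT, link_up=True)
second = bilateral_transfer(a, b, b"package-2", Ordering.RELAXED, link_up=False)

print("first:", first.value, "| second:", second.value)
print("both sides agree:", a.last_outcome is b.last_outcome)
```

When the link is down in the second transfer, both endpoints record "aborted" and the receiver's inbox is unchanged, so there is no "lost or just slow?" question left over.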

Why This Matters for the Future

The paper argues that we are currently stuck in a loop:

  • We need faster AI and bigger data centers.
  • We build faster chips.
  • But because the "rules of the road" (semantics) are broken, we have to add more software layers to fix the mistakes.
  • This slows us down again.

The "Superscalar" Analogy:
Think of a computer processor. In the old days, it did things one by one (Step 1, then Step 2). Then, engineers realized they could do Step 1 and Step 2 at the same time, as long as they made sure the final result looked like they happened in order. They separated execution from completion.

The paper says we need to do the same thing for networks. We need to stop forcing the network to be "slow and safe" (like the universal fence) and start making it "fast and explicit" (like the handshake).
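The processor trick described above, doing work in any order but reporting it finished in the original order, can be sketched in a few lines. This is a simplified illustration in the spirit of a processor's reorder buffer, not code from the paper, and `retire_in_order` is a hypothetical name.

```python
def retire_in_order(finish_order):
    """Take sequence numbers in the order the work actually finished and
    return the order in which they are reported complete (0, 1, 2, ...)."""
    finished = set()
    next_to_retire = 0
    retired = []
    for seq in finish_order:
        finished.add(seq)
        # Report as many operations as possible without skipping ahead.
        while next_to_retire in finished:
            retired.append(next_to_retire)
            finished.remove(next_to_retire)
            next_to_retire += 1
    return retired


# Work finishes out of order: operation 2 is done before 0 and 1.
executed = [2, 0, 1, 3]
reported = retire_in_order(executed)

print("finished executing in order:", executed)
print("reported complete in order: ", reported)
```

Execution runs ahead freely, while completion is still announced in a well-defined order. That separation is what the paper wants networks to offer explicitly.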

The Conclusion: Is the War Over?

The paper asks: Will we ever agree on one standard?

  • The Pessimist says: No. Companies want to sell their own proprietary "secret sauce" (like NVLink). They will keep fighting for market share.
  • The Optimist says: Yes, but only if we stop arguing about speed and start arguing about trust.

If we ask, "Which network is the fastest?" we will keep fragmenting.
If we ask, "Which network gives us a guaranteed, testable promise that the data arrived?" then we can finally agree on one standard.

The Bottom Line:
We have been trying to solve a "meaning" problem with "speed" solutions. We need to stop building faster highways and start fixing the traffic laws. Until we agree on what "delivered" actually means, our computers will keep tripping over their own feet, no matter how fast they run.