Token-Oriented Object Notation vs JSON: A Benchmark of Plain and Constrained Decoding Generation

This benchmark study evaluates Token-Oriented Object Notation (TOON) against JSON for LLM data serialization. It finds that while TOON offers promising token efficiency for complex structures via in-context learning, its advantage is often negated by prompt overhead in short contexts, and it currently underperforms constrained decoding for simple structures. This suggests TOON's true potential follows a non-linear scaling curve dependent on task complexity.

Ivan Matveev

Published 2026-03-05
📖 4 min read · ☕ Coffee break read

Imagine you are a chef (the AI) trying to send a complex recipe to a sous-chef (the computer system) so they can cook it perfectly.

This paper is a taste test comparing three different ways to write down that recipe:

  1. JSON (The Classic Cookbook): The standard, well-known format everyone uses. It's reliable, but sometimes a bit wordy.
  2. JSON-SO (The Strict Sous-Chef): A version where the chef is forced to follow a rigid checklist. They can't make a mistake, but they might get confused if the checklist is too strict.
  3. TOON (The New Shorthand): A brand-new, ultra-compact language designed specifically to save space. It's like writing a recipe in a secret code that uses fewer words.
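To make the "shorthand" concrete, here is a rough sketch in Python comparing the same two-row table in JSON and in TOON's tabular array style, using character count as a crude stand-in for tokens. The TOON string is a hand-written approximation of the format's header-plus-rows syntax, not output from an official encoder.

```python
import json

# The same small "users" table in both formats.
users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

# Standard JSON: field names are repeated for every row.
as_json = json.dumps({"users": users})

# TOON-style tabular array (hand-written approximation):
# the header declares the row count and field names once,
# then each row is a bare comma-separated line.
as_toon = "users[2]{id,name}:\n  1,Alice\n  2,Bob"

# Character count as a crude proxy for token count.
print(len(as_json), len(as_toon))
```

The savings come from stating the field names once in the header instead of once per row, which is why the gap widens as the table grows.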

The author, Ivan Matveev, wanted to see if this new "secret code" (TOON) is actually better than the old ways, especially when you have to pay for every word you write (tokens).

The Setup: The "Prompt Tax"

Here's the catch: JSON is like a language the chef already knows. You just say, "Write a recipe," and they do it.

TOON, however, is new. The chef has never seen it before. So, before they can write the recipe, you have to spend a lot of time explaining the rules: "Okay, use 2-space indentation, put arrays in brackets, don't forget the commas..."

This explanation is called the "Prompt Tax." It's like paying an entrance fee before you can even start cooking. If the recipe is short, that entrance fee might cost more than the savings you get from using the shorthand.
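The "entrance fee" framing reduces to simple break-even arithmetic: the fixed cost of teaching the format must be recovered through per-record savings. A minimal sketch follows; all token counts are made-up illustrative numbers, not measurements from the paper.

```python
# Hypothetical token budget: teaching TOON costs a fixed overhead,
# but each serialized record is cheaper than its JSON equivalent.
spec_overhead = 400      # tokens spent explaining TOON's rules (assumed)
json_per_record = 30     # tokens per record in JSON (assumed)
toon_per_record = 18     # tokens per record in TOON (assumed)

def break_even_records(overhead, json_cost, toon_cost):
    """Smallest record count at which TOON's savings cover the prompt tax."""
    saving_per_record = json_cost - toon_cost
    # Round up: a partial record's savings can't pay off the fee.
    return -(-overhead // saving_per_record)

n = break_even_records(spec_overhead, json_per_record, toon_per_record)
print(n)  # below this many records, plain JSON is cheaper
```

Under these assumed numbers the fee only pays off past a few dozen records, which is exactly the short-context trap the study describes.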

The Results: What Happened in the Kitchen?

1. The "Simple Salad" Test (Easy Data)

When the task was simple (like a list of users or a simple order), JSON-SO (The Strict Sous-Chef) was the winner.

  • Why? Because the strict rules prevented the chef from making typos, and since the recipe was short, the "Prompt Tax" of TOON made it too expensive.
  • Analogy: If you are just ordering a coffee, it's faster to say "Coffee" than to explain a new, complex ordering system.

2. The "Complex Banquet" Test (Hard Data)

When the task got complicated (deep nesting, like a company with departments and employees), TOON started to struggle.

  • The Problem: The new shorthand is great for flat lists, but it gets messy when you have to go deep into layers (like a Russian nesting doll). The chef got confused by the rules and made mistakes.
  • The Fix: The system had to try again and again (repair loops), which ate up all the time and money saved by the shorthand.

3. The "Sweet Spot"

TOON shined brightest in the middle ground: Standard business documents like invoices or orders.

  • These are structured enough to be predictable but not so deep that the rules break.
  • The Verdict: If you are sending huge amounts of data (like a massive database dump), the savings from TOON eventually pay off the "Prompt Tax." But for small tasks, it's not worth the effort.

The Big Takeaways (In Plain English)

  • Don't reinvent the wheel for small jobs: If you are just sending a simple list, stick to standard JSON or use the "Strict Sous-Chef" (JSON-SO). It's faster and cheaper.
  • TOON is a specialist, not a generalist: It works amazingly well for "tabular" data (rows and columns, like Excel sheets) but fails miserably at complex, deep hierarchies (like a family tree or a complex file system).
  • The "Prompt Tax" is real: Because TOON is new, you have to teach the AI how to use it every time. If your data is small, that teaching cost is too high. You only save money if you are generating massive amounts of data.
  • The "Repair Loop" Trap: If the AI messes up the new shorthand, fixing it is expensive. Because the long TOON instructions must be re-sent with every retry, a single repair round roughly doubles the cost.
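The repair-loop trap falls out of the same token arithmetic: every retry re-sends the full format instructions alongside the data. A toy sketch, with token figures that are illustrative assumptions:

```python
# Each repair attempt re-sends the whole TOON spec plus the payload,
# so one failed generation can wipe out the format's savings.
def total_cost(spec_tokens, payload_tokens, attempts):
    """Total input tokens across an initial try and (attempts - 1) repairs."""
    return attempts * (spec_tokens + payload_tokens)

clean = total_cost(400, 600, attempts=1)     # one-shot success
repaired = total_cost(400, 600, attempts=2)  # one repair round
print(clean, repaired)  # the single retry doubles the bill
```

Because the spec is a fixed cost inside every attempt, formats with long instructions are punished hardest by retries.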

The Final Recommendation

Think of TOON like a high-speed train.

  • If you are traveling a short distance (small data), driving a car (JSON) is faster because you don't have to walk to the station and buy a ticket (the Prompt Tax).
  • If you are traveling thousands of miles (massive datasets), the train is unbeatable. It's fast and efficient, but only if you stay on the tracks. If you try to take the train off-road (complex, deep data structures), it derails.

Bottom line: TOON is a promising tool for specific, high-volume jobs, but it's not ready to replace the standard JSON format for everything just yet.