CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models

This paper introduces CAKE, a novel benchmark of 188 expert-validated questions spanning four cognitive levels and five cloud-native topics. It evaluates large language models' understanding of cloud architecture and reveals how response formats and model configurations shape the resulting assessment.

Original authors: Tim Lukas Adam, Phongsakon Mark Konrad, Riccardo Terrenzi, Florian Girardo Lukas, Rahime Yilmaz, Krzysztof Sierszecki, Serkan Ayvaz

Published 2026-04-08

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are hiring a new architect to design a massive, futuristic city made entirely of digital clouds. You have two types of candidates: tiny, eager apprentices (small AI models) and seasoned, master builders (large AI models).

This paper, CAKE, is essentially a rigorous job interview designed to test how well these AI "architects" actually understand the complex rules of building cloud cities.

Here is the story of their findings, broken down into simple concepts:

1. The Problem: The "Fake Expert" Trap

Until now, we've tested AI on coding tasks (like asking them to write a specific line of code) or general trivia. But asking an AI to design a cloud system is different. It's like asking someone to write a sentence versus asking them to design a bridge.

The researchers realized there was no test to see if an AI truly understood the concepts behind cloud architecture (like how to make a system resilient or how to break a big app into smaller pieces). They needed a way to see if the AI was a genuine expert or just a "parrot" repeating facts it memorized.

2. The Solution: The CAKE Exam

They created CAKE (Cloud Architecture Knowledge Evaluation). Think of this as a multi-level driving test for AI architects.

  • The Questions: They built 188 questions, vetted by human experts.
  • The Levels (Bloom's Taxonomy): The test isn't just one thing. It has four levels of difficulty, like climbing a ladder:
    1. Recall: "What is a container?" (Simple memory).
    2. Analyze: "Why did this system crash?" (Understanding relationships).
    3. Design: "Draw me a system that won't crash." (Creating new solutions).
    4. Implement: "Write the actual plan to build it." (Doing the hard work).
  • The Topics: They covered five key areas: Patterns, Quality, Breaking things down, Cloud deployment, and "Technical Debt" (the mess you leave behind for later). A sketch of what one such question might look like follows this list.
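
To make the exam's structure concrete, here is a minimal sketch of what a single CAKE question record could look like in Python. The field names, category labels, and example question are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

# Hypothetical categories based on the description above; the paper's
# dataset may use different labels and fields.
BLOOM_LEVELS = ("recall", "analyze", "design", "implement")
TOPICS = ("patterns", "quality", "decomposition", "deployment", "technical_debt")

@dataclass
class CakeQuestion:
    text: str                  # the question itself
    bloom_level: str           # one of BLOOM_LEVELS
    topic: str                 # one of TOPICS
    fmt: str                   # "mcq" or "free_response"
    choices: list[str] | None  # answer options for MCQ items, else None
    answer: str                # gold answer: a letter for MCQ, a rubric otherwise

# Example record at the "Analyze" level.
q = CakeQuestion(
    text="Why might a retry storm take down an otherwise healthy service?",
    bloom_level="analyze",
    topic="quality",
    fmt="mcq",
    choices=["A) Cascading load amplification", "B) DNS caching",
             "C) Too many logs", "D) Slow CI pipeline"],
    answer="A",
)
```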

3. The Test Subjects

They didn't just test one AI. They tested 22 different versions of AI models, ranging from tiny ones (0.5 billion "brain cells," or parameters) to massive ones (70 billion). They also tested them in three "modes," sketched in code after this list:

  • Base Mode: Just answer the question.
  • Think Mode (+think): "Take a deep breath and think step-by-step before answering."
  • Tool Mode (+tool): "Go search the internet and use tools to help you."
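
To see how the three modes differ in practice, here is a minimal sketch of how each one might wrap the same question. The exact prompts and tool pipeline are assumptions; the paper defines its own setup.

```python
# Base mode: just the question.
def base_prompt(question: str) -> str:
    return f"Answer the following question.\n\n{question}"

# +think mode: request explicit step-by-step reasoning before the answer.
def think_prompt(question: str) -> str:
    return f"Think step by step, then state your final answer.\n\n{question}"

# +tool mode: answer with retrieved context (e.g., web search snippets) in view.
def tool_prompt(question: str, search_snippets: list[str]) -> str:
    context = "\n".join(f"- {s}" for s in search_snippets)
    return f"Use the search results below to answer.\n\n{context}\n\n{question}"
```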

4. The Big Surprises (The Results)

🍰 The "Multiple Choice" Ceiling

When the AI took the Multiple Choice (MCQ) part of the test, the results were surprisingly boring for the big models.

  • The Analogy: Imagine a trivia game. Once a student has studied enough (here, once a model reaches about 3 billion parameters), they get almost 100% of the answers right.
  • The Finding: Whether the AI was a 3-billion-parameter model or a 70-billion-parameter giant, they both scored near-perfectly on multiple-choice questions. The test couldn't tell the difference between a good student and a genius because the questions were too easy for the big ones.

📝 The "Free Response" Reality Check

Then, they asked the models to write out their own answers (Free Response).

  • The Analogy: This is like asking the student to explain their reasoning or draw the blueprint, rather than just circling "A" or "B."
  • The Finding: Here, the differences exploded. The tiny models struggled to explain why they chose a design, while the big models wrote clear, logical plans.
  • Key Insight: Multiple-choice questions lie. They make small models look smarter than they are. Only free-response questions reveal who can actually do the work.

🧠 The "Thinking" and "Tools" Twist

  • Thinking (+think): Telling the AI to "think step-by-step" helped the small models write better essays (free response), but sometimes confused them on multiple-choice questions. It's like a student over-analyzing a simple math problem and getting it wrong.
  • Tools (+tool): Giving the AI internet access was a disaster for the tiny models. They got lost in the search results and gave worse answers. It's like giving a compass to a toddler; they just spin in circles. The AI needed to be at least a certain size (around 8 billion parameters) before it could use tools effectively.

5. The "Conviction" Meter

One clever trick the researchers used was asking the AI the same question three times; a sketch of this check follows the list below.

  • If the AI gave the same answer all three times, they called it "High Conviction." These answers were right 90% of the time.
  • If the AI changed its mind between runs, it was "Low Conviction." These answers were only right 55% of the time.
  • Takeaway: If an AI seems unsure (changes its answer), humans should double-check its work.
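
The consistency check itself is simple enough to sketch in a few lines. This is a minimal illustration of the idea as described above, assuming three independent runs per question; the paper's exact procedure may differ.

```python
from collections import Counter

def conviction(answers: list[str]) -> str:
    # "High conviction" = the model gave the identical answer on every run.
    top_count = Counter(answers).most_common(1)[0][1]
    return "high" if top_count == len(answers) else "low"

# Three runs on one multiple-choice item:
print(conviction(["A", "A", "A"]))  # high -> right ~90% of the time, per the paper
print(conviction(["A", "C", "A"]))  # low  -> right ~55% of the time, per the paper
```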

6. The Final Verdict

The paper concludes with a warning for anyone using AI in software architecture:

Don't trust the multiple-choice scores.

If you ask an AI to pick the right answer from a list, even a small, cheap AI will look like a genius. But if you ask it to design a system or explain a complex decision, you need a much larger, smarter model.

In short:

  • Small Models: Good at memorizing facts and picking the right letter on a test.
  • Big Models: Good at actually designing and explaining complex systems.
  • The Test: You must ask them to write, not just pick, to know what they can really do.

The researchers made their "exam questions" (the CAKE dataset) public so everyone can keep testing and improving these digital architects.
