2DIO: A Cache-Accurate Storage Microbenchmark

This paper introduces 2DIO, a portable storage microbenchmark that uses a compact parameter triplet to generate cache-accurate I/O traces with tunable complex behaviors, such as performance cliffs and plateaus, enabling researchers to systematically explore cache dynamics and faithfully replicate real-world workloads.

Original authors: Yirong Wang, Isaac Khor, Peter Desnoyers

Published 2026-03-23

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a chef trying to test a new, high-tech oven. To see if it works well, you need to bake different kinds of cakes. But here's the problem: you don't have the actual cakes (real-world data) because they are too big to carry, or they are secret recipes you can't share.

So, you try to make "fake" cakes (synthetic data) to test the oven.

The Problem with Old Tools
For years, the tools used to make these fake cakes were like a baker who only knew how to make one type of cake: a simple sponge cake that gets flatter and flatter the more you add ingredients. In the world of computer storage, this means old tools could only create workloads where adding more memory (cache) always helped a little bit, but never suddenly made a huge difference.

But real life isn't like that. Sometimes, adding a tiny bit more memory causes a massive jump in speed (a "cliff"). Other times, adding a ton of memory does absolutely nothing (a "plateau"). Old tools couldn't create these weird, real-world scenarios, so they couldn't accurately test if a new storage system was truly good.

Enter 2DIO: The "Master Chef" of Fake Data
The paper introduces 2DIO, a new tool that acts like a master chef who can bake any kind of cake, no matter how strange the recipe.

Here is how it works, using simple analogies:

1. The Two Ingredients: "Recency" and "Frequency"

To understand how a computer cache works, you need to understand two things about the data it stores:

  • Recency (The "Just Ate" Factor): Did we just look at this item? If yes, it's likely to stay in the cache.
  • Frequency (The "Favorite" Factor): Do we look at this item a lot over a long time? If yes, it's likely to stay in the cache.

Old tools only looked at Frequency. They assumed that if you liked a song, you'd play it randomly throughout the day.
2DIO looks at both. It understands that sometimes you listen to a song on repeat for 10 minutes (high recency), and then you never touch it for a week (low frequency).
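To make the two ingredients concrete, here is a minimal Python sketch (not code from the paper) that measures both on a toy playlist. The function name `frequency_and_gaps` and the two example traces are made up for illustration.

```python
from collections import Counter

def frequency_and_gaps(trace):
    """Return (play counts per item, gaps between repeat plays of the same item)."""
    counts = Counter(trace)
    last_seen = {}
    gaps = []
    for position, item in enumerate(trace):
        if item in last_seen:
            # Number of steps since this item was last played.
            gaps.append(position - last_seen[item])
        last_seen[item] = position
    return counts, gaps

# Song "a" is played three times in both traces (same frequency) ...
bursty = ["a", "a", "a", "b", "c", "d", "e", "f"]
spread = ["a", "b", "c", "a", "d", "e", "f", "a"]

bursty_counts, bursty_gaps = frequency_and_gaps(bursty)
spread_counts, spread_gaps = frequency_and_gaps(spread)

# ... but the gaps between plays (recency) are very different.
print(bursty_counts["a"], bursty_gaps)  # 3 [1, 1]
print(spread_counts["a"], spread_gaps)  # 3 [3, 4]
```

A tool that only tracks frequency cannot tell these two playlists apart, even though a small cache would treat them very differently.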

2. The Secret Sauce: The "Sleep Schedule"

The magic of 2DIO is a clever trick it uses called IRD (Inter-Reference Distance).

Imagine you have a list of 1,000 different songs.

  • Old Tools: They pick a song, play it, and then pick the next song completely at random based on how popular the song is. This creates a smooth, predictable pattern.
  • 2DIO: It gives every song a "sleep schedule."
    • Song A might say: "I will be played, then I will sleep for exactly 5 minutes, then I will be played again."
    • Song B might say: "I will be played, then I will sleep for 100 years."
    • Song C might say: "I will be played, then sleep for 1 minute, then 1 minute, then 1 minute..."
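The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the generator described in the paper: `popularity_trace` and `sleep_schedule_trace` are hypothetical names, every song gets one fixed sleep length, and ties between songs waking at the same time are broken alphabetically.

```python
import heapq
import random

def popularity_trace(weights, length, seed=0):
    """'Old tool' style: every pick is independent, weighted by popularity."""
    rng = random.Random(seed)
    items = list(weights)
    return rng.choices(items, weights=[weights[i] for i in items], k=length)

def sleep_schedule_trace(sleeps, length):
    """'Sleep schedule' style: each song is replayed after its own fixed sleep."""
    # Heap of (wake_time, song); every song is awake at time 0.
    heap = [(0, song) for song in sorted(sleeps)]
    heapq.heapify(heap)
    trace = []
    while len(trace) < length:
        wake, song = heapq.heappop(heap)
        trace.append(song)
        # Put the song back to sleep for its scheduled duration.
        heapq.heappush(heap, (wake + sleeps[song], song))
    return trace

# Song A sleeps 5 ticks, song B sleeps "100 years", song C sleeps 1 tick.
schedule = {"A": 5, "B": 10**9, "C": 1}
scheduled = sleep_schedule_trace(schedule, 50)
print("".join(scheduled[:8]))   # ABCCCCCA
print(scheduled.count("B"))     # 1

popular = popularity_trace({"A": 1, "B": 1, "C": 8}, 50)
```

In the scheduled trace, song B shows up once and never returns, which is something a purely popularity-based pick cannot guarantee.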

By carefully designing these sleep schedules, 2DIO can force the computer's memory to behave in specific ways.

  • If you want a Performance Cliff (where a tiny bit more memory causes a huge speed boost), 2DIO creates a "sleep schedule" where many songs sleep for the same length of time. A cache just too small to hold a song through that sleep misses on all of them; a slightly larger one suddenly hits on all of them.
  • If you want a Plateau (where more memory does nothing), it creates a schedule with no sleep lengths in a given range, so growing the cache through that range captures no additional hits.
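A cliff and its surrounding plateaus are easy to reproduce with a toy experiment. The sketch below assumes an LRU (least recently used) cache, which this summary does not specify, and uses a made-up workload: a playlist of 100 songs played in the same order ten times, so every song has the same sleep length.

```python
from collections import OrderedDict

def lru_hit_ratio(trace, cache_size):
    """Replay a trace through an LRU cache and return the fraction of hits."""
    cache = OrderedDict()
    hits = 0
    for item in trace:
        if item in cache:
            hits += 1
            cache.move_to_end(item)        # mark as most recently used
        else:
            cache[item] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used item
    return hits / len(trace)

# 100 songs played in a fixed loop, ten times over.
loop = list(range(100)) * 10

for size in (50, 99, 100, 200):
    print(size, lru_hit_ratio(loop, size))
# 50 0.0
# 99 0.0
# 100 0.9
# 200 0.9
```

Growing the cache from 50 to 99 slots changes nothing (a plateau), one more slot takes the hit ratio from 0 to 0.9 (a cliff), and doubling the cache after that changes nothing again.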

3. The "Compact Recipe" (The Parameter Triplet)

The best part is that 2DIO doesn't need a massive file to describe these complex behaviors. It uses a tiny triplet of numbers (a "recipe"):

  1. How often do we pick a random song? (Frequency)
  2. What is the "sleep schedule" for the popular songs? (Recency)
  3. How big is the playlist? (Scale)

Because the recipe is so small, researchers can easily tweak the numbers to see "What if?" scenarios.

  • "What if we add a spike in the sleep schedule here?" -> Boom, you get a performance cliff.
  • "What if we make the sleep times more random?" -> Boom, you get a smooth curve.
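As a rough illustration of how small such a recipe can be, here is a hypothetical Python sketch that mirrors the three questions above. The `Recipe` class, its field names, and the `generate` function are assumptions made for this example and are not 2DIO's actual parameters or algorithm. Sleep lengths are assumed to be positive integers, and "replay the song from `sleep` steps ago" is only an approximation of a true sleep schedule.

```python
import random
from dataclasses import dataclass

@dataclass
class Recipe:
    p_random: float       # how often we pick a random song (hypothetical name)
    sleep_lengths: list   # sleep lengths to sample from, positive ints (hypothetical)
    playlist_size: int    # how many distinct songs exist (hypothetical)

def generate(recipe, length, seed=0):
    """Mix random picks with replays of the song played `sleep` steps ago."""
    rng = random.Random(seed)
    trace = []
    for _ in range(length):
        sleep = rng.choice(recipe.sleep_lengths)
        if rng.random() < recipe.p_random or sleep > len(trace):
            # Random pick (also used while the history is still too short).
            trace.append(rng.randrange(recipe.playlist_size))
        else:
            # Replay whatever was played `sleep` steps ago.
            trace.append(trace[-sleep])
    return trace

# A "spike" in the schedule: every replay uses the same sleep length.
spiky = generate(Recipe(p_random=0.0, sleep_lengths=[3], playlist_size=1000), 30)

# "More random" sleep times, plus some purely random picks.
smooth = generate(Recipe(p_random=0.2, sleep_lengths=list(range(1, 50)), playlist_size=1000), 30)
```

Changing one field and regenerating the trace is all it takes to try a new "What if?" scenario, which is the appeal of a compact recipe.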

4. Why This Matters

Think of 2DIO as a flight simulator for storage systems.

  • Before, flight simulators could only fly in perfect, sunny weather. If you wanted to test a plane in a storm, you had to wait for a real storm (which is dangerous and rare).
  • Now, with 2DIO, you can dial in "Storm Level 5" or "Turbulence Level 9" instantly. You can test if your new storage system can handle the worst-case scenarios without ever needing to wait for a real-world disaster to happen.

Summary

2DIO is a tool that lets computer scientists create synthetic storage workloads whose cache behavior closely matches that of messy, unpredictable real-world data. It does this by controlling when data is accessed (recency) and how often (frequency), allowing them to test storage systems against complex behaviors like sudden speed boosts (cliffs) or stretches where extra memory does not help (plateaus), all with a tiny, easy-to-share set of numbers.
