Cross-Project Flakiness: A Case Study of the OpenStack… — Plain-Language Explanation

Original authors: Tao Xiao, Dong Wang, Shane McIntosh, Hideaki Hata, Yasutaka Kamei

Published 2026-05-29

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are part of a massive, global construction crew building a giant, complex cloud city called OpenStack. This city isn't built by one person; it's built by thousands of workers (developers) working on hundreds of different neighborhoods (projects) like Cinder, Glance, and Nova. To make sure the city doesn't collapse, every time someone adds a new brick or changes a pipe, they run a series of automated "safety checks" (tests).

Ideally, these safety checks should be like a perfect traffic light: Green means "Go, the change is safe," and Red means "Stop, there's a problem."

But sometimes, the traffic light flickers. It turns Red for no good reason, then Green when you check again, then Red again. In the software world, this is called "Flakiness." A flaky test is "moody": it passes on some runs and fails on others, even though nothing changed in the code.

This paper is a detective story about how this "moody" behavior spreads across the entire OpenStack city, not just in one neighborhood.

The Two Big Problems They Found

The researchers discovered two specific ways this "moody" behavior causes trouble:

1. The "Contagious" Glitch (Cross-Project Flakiness)
Imagine a specific safety check (a test) that is supposed to verify if a door lock works. In this city, that same lock-check is used in the Cinder neighborhood, the Glance neighborhood, and the Nova neighborhood.

The Problem: The lock-check is "moody." It fails randomly in all three neighborhoods.
The Impact: Because the neighborhoods share this one test, a single glitchy test stops progress in multiple places at once. The researchers found that 55% of all neighborhoods in OpenStack are affected by these contagious glitches. It's like a single bad apple rotting the whole barrel, but the apple is actually a test that everyone is using.

2. The "Pick-and-Choose" Glitch (Inconsistent Flakiness)
Now, imagine that same lock-check is used in the Cinder neighborhood and the Nova neighborhood.

The Problem: In Cinder, the test is perfectly reliable (always Green). But in Nova, the exact same test is "moody" (flickering between Red and Green).
The Impact: This is confusing! It suggests the test itself may not be the culprit; something about the environment in Nova is causing the trouble. It's like a car that starts perfectly in your driveway but sputters every time you try to start it at a friend's house. The researchers found over 1,100 of these "pick-and-choose" glitches.
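
The two categories above can be told apart mechanically once per-project run histories are available. The following is a minimal Python sketch, not the researchers' actual tooling: the record layout, the classify helper, the label names, and the sample data are all hypothetical, and it assumes every run of a test within a project was made against unchanged code.

```python
from collections import defaultdict


def classify(runs):
    """Label each test from (test, project, outcome) records.

    Assumption: all runs of a test within one project used unchanged code,
    so seeing both "pass" and "fail" there means the test is flaky there.
    """
    # test name -> project -> set of outcomes observed
    seen = defaultdict(lambda: defaultdict(set))
    for test, project, outcome in runs:
        seen[test][project].add(outcome)

    labels = {}
    for test, by_project in seen.items():
        flaky = {p for p, o in by_project.items() if o == {"pass", "fail"}}
        reliable = {p for p, o in by_project.items() if o == {"pass"}}
        if flaky and reliable:
            labels[test] = "inconsistently flaky"  # moody in some places only
        elif len(flaky) >= 2:
            labels[test] = "cross-project flaky"   # moody in several places
        elif flaky:
            labels[test] = "flaky in one project"
        else:
            labels[test] = "not flaky"
    return labels


# Hypothetical run history mirroring the two lock-check stories above.
runs = [
    ("test_lock", "cinder", "pass"), ("test_lock", "cinder", "fail"),
    ("test_lock", "glance", "fail"), ("test_lock", "glance", "pass"),
    ("test_lock", "nova", "pass"), ("test_lock", "nova", "fail"),
    ("test_door", "cinder", "pass"), ("test_door", "cinder", "pass"),
    ("test_door", "nova", "pass"), ("test_door", "nova", "fail"),
]

print(classify(runs))
```

In this sketch a test that is moody everywhere lands in the first category, and a test that is reliable in at least one project but moody in another lands in the second. The paper's own definitions may draw the line differently.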

The Big Surprise: Even "Unit" Tests Are Getting Sick

Usually, developers think of Unit Tests as the "microscopes" of the software world. They look at tiny, isolated pieces of code (like a single function) in a vacuum. They are supposed to be the most stable, predictable tests because they don't talk to the outside world.

The Paper's Shocking Finding:
The researchers found that 70% of these "microscope" tests are actually involved in the "Contagious" glitches.

Analogy: It's like finding out that the tiny, isolated screws holding your toaster together are the same screws causing the whole kitchen's electrical system to short out. We assumed these small tests were safe and isolated, but in a giant ecosystem, they are deeply connected and can spread instability everywhere.

Why Does This Happen? (The Causes)

The team dug into the logs to find out why the tests were acting up in some places but not others. They found three main culprits:

The "Race Condition" (The 89% Killer): This is the most common cause. Imagine two workers trying to grab the same tool at the exact same millisecond. Sometimes Worker A gets it; sometimes Worker B gets it. If the test tries to grab a resource (like a server or a file) that is already being used by something else, it fails. If it gets it, it passes. This randomness is called a "race condition."
Mismatched Configurations: It's like trying to bake a cake using a recipe from one country but ingredients from another. The test expects a specific setup (like a specific version of a library or a specific server speed), but the environment doesn't match.
Dependency Issues: One neighborhood might have updated its "power grid" (a software library), while the neighborhood next door hasn't. The test works in the updated neighborhood but fails in the one still running the old version.
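
The race condition described above can be reduced to a toy example. This is a sketch under stated assumptions rather than anything taken from OpenStack: SharedPort, run_test, and the job names are hypothetical, and the arrival order is passed in explicitly so the example stays deterministic. In a real CI system that order is decided by timing.

```python
class SharedPort:
    """A resource only one job can hold at a time (hypothetical stand-in)."""

    def __init__(self):
        self.owner = None

    def try_claim(self, job):
        # The first caller wins; every later caller is refused.
        if self.owner is None:
            self.owner = job
            return True
        return False


def run_test(arrival_order):
    """Return the test's verdict for one CI run.

    arrival_order lists the jobs in the order they reach the shared port.
    The code under test is identical in every run; only the order differs.
    """
    port = SharedPort()
    claimed = {job: port.try_claim(job) for job in arrival_order}
    return "PASS" if claimed["test"] else "FAIL"


# Same test, same code, two different timings.
print(run_test(["test", "other_job"]))  # PASS
print(run_test(["other_job", "test"]))  # FAIL
```

Nothing in the code being reviewed changes between the two calls, which is why a failure of this kind looks random to the developer who triggered it.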

The Cost of the "Wait and See" Approach

When a test fails, the standard reaction in OpenStack is to say, "Oh, it must be a glitch. Let's just run it again (recheck) and wait."

The Cost: The researchers calculated that this "recheck and wait" habit has wasted 1,156 days of computing time, along with the money that computing time costs.
The Analogy: It's like a traffic cop seeing a red light, assuming the sensor is broken, and waving cars through, then checking again, then waving them through again. This wastes fuel (computing resources) and delays everyone's commute (code reviews).
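
The arithmetic behind a figure like this is easy to sketch. The snippet below is an assumption about how such a total could be added up, not the paper's method: it treats every failed run that was followed by a recheck as wasted, and the history and its durations are invented sample values.

```python
from datetime import timedelta

# Hypothetical CI history for one code change: (outcome, job duration).
# Each "fail" here was followed by a recheck that reran the jobs.
history = [
    ("fail", timedelta(minutes=50)),
    ("fail", timedelta(minutes=45)),
    ("pass", timedelta(minutes=55)),
]


def wasted_time(history):
    """Sum the durations of failed runs that were simply retried."""
    return sum(
        (duration for outcome, duration in history if outcome == "fail"),
        timedelta(),
    )


print(wasted_time(history))  # 1:35:00
```

Summed over every change in a large ecosystem, small per-change losses like this one are how a total measured in days accumulates.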

What Do the Workers Say? (Developer Feedback)

The researchers asked the actual builders (developers) about this.

The Frustration: Many developers feel helpless. They say, "I'm new, I don't know who to ask, so I just keep hitting 'recheck' until it passes."
The Reality: They admit that fixing these issues is hard because it requires talking to multiple teams. If a test fails in Nova because of a problem in Cinder, the Nova developer has to wait for the Cinder team to fix it.
The Tool Gap: They mentioned that while tools exist to help, they often break or get abandoned because no one has the time to maintain them. They need a dedicated "mechanic" for the CI system, not just volunteers doing it on the side.

The Takeaway

The paper concludes that in a giant, connected software ecosystem, you can't treat tests as isolated islands.

For Developers: Stop just "rechecking" and waiting. Investigate why a test failed, even if it seems unrelated to your code.
For Team Leads: You need to standardize how tests are run across all neighborhoods. If one neighborhood uses a specific tool, every neighborhood should. You also need to centralize the tracking of these glitches so everyone knows which "screws" are loose.
For the Future: We need better tools to automatically tell us why a test is flaky (e.g., "It failed because the server was down," not just "It failed").
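
As a rough illustration of what such a tool might do at its simplest, here is a rule-based sketch that maps a failure log to one of the three causes named earlier. Everything in it is an assumption made for illustration: the explain_failure helper, the pattern list, and the sample log lines are hypothetical, and a real tool would need rules tuned to actual CI logs.

```python
import re

# Illustrative patterns only (assumptions, not taken from the paper).
CAUSE_PATTERNS = [
    ("race condition", re.compile(r"address already in use", re.IGNORECASE)),
    ("configuration mismatch", re.compile(r"no such option", re.IGNORECASE)),
    ("dependency issue", re.compile(r"ModuleNotFoundError|ImportError")),
]


def explain_failure(log_text):
    """Return the first matching cause label, or "unknown" if none match."""
    for cause, pattern in CAUSE_PATTERNS:
        if pattern.search(log_text):
            return cause
    return "unknown"


# Hypothetical log lines.
print(explain_failure("OSError: [Errno 98] Address already in use"))
print(explain_failure("ModuleNotFoundError: No module named 'example_lib'"))
print(explain_failure("AssertionError"))
```

A lookup table like this only recognizes failures someone has already diagnosed, which is part of why the takeaway above asks for better tooling rather than treating the problem as solved.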

In short, the paper argues that to keep the OpenStack city running smoothly, we need to stop treating test failures as random bad luck and start treating them as a systemic coordination problem that affects the whole city.

Cross-Project Flakiness: A Case Study of the OpenStack Ecosystem