Experimental Analysis of FreeRTOS Dependability through Targeted Fault Injection Campaigns

This paper presents KRONOS, a non-intrusive software-based fault injection framework used to evaluate FreeRTOS dependability under ionizing radiation, revealing that corruption of pointer and scheduler variables frequently causes system crashes while many Task Control Block fields have limited impact on availability.

Original authors: Luca Mannella, Stefano Di Carlo, Alessandro Savino

Published 2026-03-27

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine a busy, high-stakes airport control tower. This tower is run by a sophisticated computer system (the RTOS, or Real-Time Operating System) that manages hundreds of flights (tasks), ensuring they take off and land on time, in the right order, and without crashing into each other.

In the real world, this control tower isn't just sitting in a quiet office; it's often located in space or high-altitude environments where cosmic rays (like invisible, tiny bullets) can hit the computer chips. These hits can flip a single "bit" of data—a 0 becomes a 1, or vice versa. This is called a Single Event Upset (SEU).
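To make the bit flip concrete, here is a minimal sketch of what one upset does to a stored number. The function name `flip_bit`, the 32-bit word size, and the `tick_count` value are illustrative assumptions, not details taken from the paper.

```python
def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Return `value` with exactly one bit inverted, as a single upset would leave it."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside the word")
    # XOR with a one-bit mask inverts that bit and leaves the others alone.
    return (value ^ (1 << bit)) & ((1 << width) - 1)


tick_count = 1000                      # hypothetical 32-bit counter value
corrupted = flip_bit(tick_count, 20)   # one upset in bit 20
print(tick_count, "->", corrupted)     # prints: 1000 -> 1049576
```

A single flipped bit can change a value by a lot or a little depending on its position, which is part of why the effects are hard to predict by intuition alone.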

The researchers behind this paper wanted to answer a scary question: "If a cosmic ray hits our control tower's brain, how badly does the airport crash?"

Here is a breakdown of their work using simple analogies:

1. The Problem: Invisible Rain

Usually, when we test how strong a building is, we might throw a ball at it. But you can't easily throw a "cosmic ray" at a computer chip in a lab without expensive, specialized equipment.

The researchers realized that by the time a cosmic ray hits a chip and causes a problem, the damage has already traveled up to the computer's "brain" (the software). So, instead of trying to shoot the chip with radiation, they decided to pretend the damage had already happened. They built a tool to manually flip the bits in the software's memory to see what happens next.

2. The Tool: KRONOS (The "Digital Saboteur")

The researchers built a software tool called KRONOS. Think of KRONOS as a ghostly saboteur that can sneak into the control tower's computer while it's running.

  • Non-Intrusive: It doesn't tear the tower apart or rewire it to run the test; the system under test is left as it is, and KRONOS just whispers lies into the computer's ear.
  • Post-Propagation: It assumes the "cosmic ray" has already done its damage and is now sitting in the computer's memory.
  • The Mission: KRONOS picks a specific piece of data (like a list of flight times or a pointer to a pilot's name), flips a bit, and then watches to see if the airport keeps running smoothly, slows down, or explodes into chaos.
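The mission described above can be sketched on a toy memory image. This is only an illustration of the general idea, not KRONOS itself: the symbol names, the memory layout, and the helper functions `pick_target` and `inject_bit_flip` are all hypothetical.

```python
import random


def inject_bit_flip(memory: bytearray, address: int, bit: int) -> None:
    """Invert one bit of one byte in place, leaving every other byte untouched."""
    memory[address] ^= 1 << bit


def pick_target(symbols: dict, rng: random.Random):
    """Choose a random symbol, then a random byte and bit inside it."""
    name = rng.choice(sorted(symbols))
    start, size = symbols[name]
    return name, start + rng.randrange(size), rng.randrange(8)


# Hypothetical memory map: symbol name -> (start offset, size in bytes).
symbols = {"tick_count": (0, 4), "current_task_ptr": (4, 4), "ready_list": (8, 8)}
memory = bytearray(16)  # toy memory image, all zeros

rng = random.Random(0)  # fixed seed so a run can be repeated
name, address, bit = pick_target(symbols, rng)
before = bytes(memory)
inject_bit_flip(memory, address, bit)

# Exactly one byte should differ from the snapshot taken before the injection.
changed = [i for i in range(len(memory)) if memory[i] != before[i]]
print(name, address, bit, changed)
```

Seeding the random generator is a design choice that makes each injection reproducible, which matters when a campaign has thousands of runs and a surprising result needs to be replayed.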

3. The Experiment: Testing the Weak Links

They ran thousands of tests (a "campaign") on FreeRTOS, a very popular operating system used in everything from pacemakers to satellites. They targeted four main areas of the control tower:

  • Global Variables (The Signage): Things like "How many flights are active?" or "What is the current time?"
  • Pointers (The Finger Posts): These are like signs pointing to specific locations (e.g., "The list of delayed flights is here"). If you break the sign, the controller looks in the wrong place.
  • Lists (The Queues): The actual lines of flights waiting to take off or land.
  • TCBs, or Task Control Blocks (The Pilot's ID Badge): A file containing all the details of one plane (its priority, its stack, its name). Every task has its own.
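To see why a broken "finger post" behaves so differently from a smudged ID badge, consider the toy model below. It is a sketch under heavy assumptions: the task table, the field names, and the `dereference` helper are invented for illustration, and on real hardware a corrupted pointer would read garbage or trigger a fault rather than raise a tidy Python exception.

```python
def dereference(table, index):
    """Follow a 'pointer' into the task table; None stands in for a crash."""
    try:
        return table[index]
    except IndexError:
        return None


tasks = [
    {"name": "telemetry", "priority": 2},
    {"name": "control", "priority": 5},
]
current = 1  # plays the role of the pointer to the running task

# Fault A: flip bit 4 of the pointer, so it now refers to slot 17.
after_pointer_fault = dereference(tasks, current ^ (1 << 4))

# Fault B: flip the lowest bit of the first character of the task's name.
name = tasks[current]["name"]
tasks[current]["name"] = chr(ord(name[0]) ^ 1) + name[1:]
after_name_fault = dereference(tasks, current)

print(after_pointer_fault)
print(after_name_fault)
```

In this toy, the corrupted pointer finds nothing at all, while the corrupted name still leads to the right task with the right priority, only misspelled.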

4. The Results: What Happened?

When KRONOS started flipping bits, the results were a mix of "oops" and "disaster."

  • The "Crash" Zone (70% of the time):
    When KRONOS messed with Pointers (the finger posts) or Critical Scheduler Variables (the main decision-maker), the system almost always crashed.

    • Analogy: Imagine the controller follows a sign for "Gate A," but one flipped bit has made the sign point at "The Ocean" instead. The plane tries to take off into the ocean. Boom. System crash.
    • Key Finding: The most dangerous thing to break was the Current TCB (the active pilot's ID) or the Scheduler (the person deciding who flies next). If these get corrupted, the whole system stops.
  • The "Delay" Zone (3% of the time):
    Sometimes the system didn't crash, but it got confused and took too long to finish its job.

    • Analogy: The controller is still working, but they are stuck in traffic because they are looking for a flight manifest that got lost. The plane lands, but 10 minutes late. This is bad for a satellite, but maybe okay for a toaster.
  • The "Silent" Zone (Rare):
    Very rarely, the system finished its job on time but gave the wrong answer (e.g., the plane landed at the wrong airport, but the controller didn't notice). This is called Silent Data Corruption. It's the most dangerous because no one knows anything is wrong until it's too late. The researchers found this happened very rarely in FreeRTOS.

  • The "Resilient" Zone:
    Surprisingly, messing with some less important lists (like a list of planes that were already cancelled) didn't cause a crash. The system was robust enough to ignore those specific errors.
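The four zones above amount to comparing each faulty run against a fault-free reference run. Here is a minimal sketch of such a comparison; the `RunResult` fields, the category labels, and the `slack_ticks` tolerance are assumptions made for illustration, and the paper's exact taxonomy and thresholds may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    finished: bool          # did the workload reach its end?
    output: Optional[int]   # result produced, if any
    elapsed_ticks: int      # how long the run took


def classify(golden: RunResult, faulty: RunResult, slack_ticks: int = 0) -> str:
    """Compare a faulty run with the fault-free (golden) run."""
    if not faulty.finished:
        return "crash"
    if faulty.elapsed_ticks > golden.elapsed_ticks + slack_ticks:
        return "delay"
    if faulty.output != golden.output:
        return "silent data corruption"
    return "no effect"


golden = RunResult(finished=True, output=42, elapsed_ticks=100)
print(classify(golden, RunResult(False, None, 30)))    # crash
print(classify(golden, RunResult(True, 42, 180)))      # delay
print(classify(golden, RunResult(True, 41, 100)))      # silent data corruption
print(classify(golden, RunResult(True, 42, 100)))      # no effect
```

The order of the checks is a design choice: a run that never finishes is counted as a crash before its timing or output is even considered.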

5. The Big Takeaway

The researchers discovered that FreeRTOS is like a house of cards.

  • If you poke the foundation (the scheduler and pointers), the whole house collapses immediately.
  • If you poke the decorations (some minor lists), the house might wobble but stay standing.

They also found that it didn't matter if the damage was a "one-time glitch" (transient) or a "permanent broken sign" (permanent). If the critical data was corrupted, the result was usually the same: Game Over.
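The difference between the two kinds of damage can be sketched with a toy memory cell. This models a permanent fault as a bit stuck at its flipped value, which is an assumption for illustration; the `Cell` class and its method names are hypothetical and not taken from the paper.

```python
class Cell:
    """One memory word with optional stuck bits (hypothetical model)."""

    def __init__(self, value: int):
        self._value = value
        self._stuck_mask = 0   # which bit positions are stuck
        self._stuck_bits = 0   # the values those positions are stuck at

    def write(self, value: int) -> None:
        self._value = value

    def read(self) -> int:
        # Stuck positions override whatever was last written.
        return (self._value & ~self._stuck_mask) | self._stuck_bits

    def inject_transient(self, bit: int) -> None:
        """Flip the bit once; a later write repairs it."""
        self._value ^= 1 << bit

    def inject_permanent(self, bit: int) -> None:
        """Freeze the bit at the opposite of its current value."""
        mask = 1 << bit
        flipped = (self.read() ^ mask) & mask
        self._stuck_mask |= mask
        self._stuck_bits = (self._stuck_bits & ~mask) | flipped


transient, permanent = Cell(5), Cell(5)
transient.inject_transient(1)
permanent.inject_permanent(1)
print(transient.read(), permanent.read())   # both show the corrupted value

transient.write(5)
permanent.write(5)
print(transient.read(), permanent.read())   # only the transient fault is gone
```

In this toy, both faults look identical until the value is rewritten. If the system reads critical data and crashes before anything overwrites it, the two kinds of fault never get the chance to behave differently.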

Why Does This Matter?

This study is crucial for engineers building safety-critical systems (like self-driving cars or Mars rovers). It tells them:

  1. Don't trust everything: You can't just assume the software will handle a cosmic ray hit gracefully.
  2. Protect the VIPs: You need to put extra armor (hardware protection or software checks) specifically around the Scheduler and Pointers, because those are the weak points.
  3. Better Design: Now that we know exactly where the system breaks, we can design better safety nets to catch those errors before they cause a crash.

In short, the paper is a "stress test" for the brain of our digital world, showing us exactly which neurons need to be protected so our satellites and planes don't fall out of the sky.
