Identification and mitigation of memory block timing issue in ITk ABCStar during ASIC production

This paper describes how a timing flaw in the ABCStar ASIC was identified after it threatened production yields. The flaw was mitigated by raising the core operating voltage and adjusting the clock duty cycle, which avoided costly process changes or redesigns and allowed production of ATLAS ITk detector modules to continue.

Published 2026-05-22

Original authors: B. Ashmanskas, J. Botte, J. R. Dandoy, J. Dopke, N. Dressnandt, B. J. Gallop, J. J. John, P. T. Keener, T. Koffas, J. Kroll, R. P. McGovern, M. F. Newcomer, B. J. Norman, P. W. Phillips, C. Sawyer, R. Scouten, P. Vicente Leitao, M. Warren

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Story of the "Star" Chip That Stuttered

Imagine the ATLAS experiment at CERN as a massive, high-speed camera taking pictures of particles colliding at nearly the speed of light. To do this, it needs a huge number of tiny readout chips called ABCStar. These chips act as the camera's "eyes": they read signals from silicon strips and send the data on to a central computer.

Before the camera could be built, engineers had to manufacture these chips. They expected about 90% of the chips to work perfectly. However, during testing, they found a terrifying problem: on some batches of chips, only 2% worked. The rest were failing.

The Mystery: A "Silicon-Proven" Ghost

The engineers were confused. The failing chips weren't broken in any obvious way; they passed almost every test. They could read analog signals, handle power, and do complex math. The only thing they failed was a specific digital test that checked whether they could store data and recall it correctly.

The data was being stored in SRAM blocks (think of these as the chip's short-term memory notebooks). These specific memory blocks had been used in many other successful chips before. In the industry, this is called being "silicon proven." It's like using a tire design that has been on millions of cars without ever having a blowout. Everyone assumed these tires were perfect.

The engineers suspected the memory itself was broken, but they were wrong. The memory was fine. The problem was the traffic controller (the "glue logic") that told the memory when to write and when to read.

The Root Cause: A Timing Mismatch

Here is the analogy: Imagine a relay race where a runner (the data) has to hand a baton to a teammate (the memory) exactly when a whistle blows.

  • The Plan: The whistle blows, the runner sprints, and the teammate catches the baton.
  • The Reality: In some of these chips, the runner was slightly slower than the engineers expected. The timing models for the "silicon proven" memory had been produced with older tools, so they did not account for a runner that might be a little sluggish in this specific factory batch.
  • The Result: The teammate tried to catch the baton too early. The runner wasn't there yet. The baton was dropped. In chip terms, this is a bit flip or a timing error. The data got corrupted.

This happened mostly on the edges of the silicon wafers (like the edges of a pizza), where the manufacturing process is slightly less uniform, making the "runners" even slower.
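The relay-race picture corresponds to a simple timing budget: the glue logic must deliver stable data some time before the memory samples it. The sketch below is only an illustration of that check. The function names and every number in it are hypothetical and are not taken from the paper.

```python
def handoff_slack_ns(window_ns: float, logic_delay_ns: float, setup_ns: float) -> float:
    """Time to spare (positive) or shortfall (negative) at the memory input.

    window_ns      -- time available between launching the data and the memory sampling it
    logic_delay_ns -- how long the glue logic (the "runner") takes to deliver the data
    setup_ns       -- how early the memory (the "teammate") needs the data to be stable
    """
    return window_ns - (logic_delay_ns + setup_ns)


def handoff_ok(window_ns: float, logic_delay_ns: float, setup_ns: float) -> bool:
    """True when the data arrives in time, i.e. the slack is not negative."""
    return handoff_slack_ns(window_ns, logic_delay_ns, setup_ns) >= 0.0


# Hypothetical values chosen only to illustrate the two cases.
WINDOW_NS = 10.0
SETUP_NS = 1.0
typical_chip = handoff_slack_ns(WINDOW_NS, logic_delay_ns=8.0, setup_ns=SETUP_NS)
slow_chip = handoff_slack_ns(WINDOW_NS, logic_delay_ns=9.5, setup_ns=SETUP_NS)

print(f"typical chip slack: {typical_chip:+.1f} ns")
print(f"slow chip slack:    {slow_chip:+.1f} ns")
```

A chip whose logic is only slightly slower than the model predicted can flip the slack from positive to negative, which is why a small manufacturing shift at the wafer edge can produce a large change in the number of failing chips.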

The Investigation: Finding the Fix

The team had to find a way to fix this without throwing away millions of dollars worth of chips or redesigning the whole thing from scratch (which would take years). They tested two main ideas:

1. The "Speed Boost" (Voltage Increase)

If the runner is slow, give them a caffeine shot.

  • The Fix: They increased the electrical voltage supplied to the chip's digital brain from 1.20 Volts to 1.25 Volts.
  • The Effect: Higher voltage makes the transistors (the runners) switch faster. Suddenly, the runner was fast enough to reach the handoff on time.
  • The Result: In batches where only 2% of chips had been passing, the yield rose to about 80%.
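Why a 0.05 V change can matter is easier to see with a rough delay model. The sketch below uses the alpha-power law, a common textbook approximation for how gate delay depends on supply voltage. It is not the model used in the paper, and the threshold voltage and exponent are hypothetical placeholders; only the 1.20 V and 1.25 V supply values come from the text above.

```python
def relative_gate_delay(vdd: float, vth: float = 0.4, alpha: float = 1.3) -> float:
    """Gate delay in arbitrary units from the alpha-power law: delay ~ Vdd / (Vdd - Vth)^alpha.

    vth and alpha are hypothetical values for illustration, not ABCStar parameters.
    """
    if vdd <= vth:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return vdd / (vdd - vth) ** alpha


# Compare the nominal and the raised core voltage quoted in the article.
ratio = relative_gate_delay(1.25) / relative_gate_delay(1.20)
print(f"delay at 1.25 V relative to 1.20 V: {ratio:.3f}")
```

With these placeholder parameters the delay drops by a few percent. For a path that misses its deadline by a small margin, a change of that size can be enough to move it from failing to passing.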

2. The "Longer Pause" (Clock Duty Cycle)

If the runner is still a bit slow, tell the teammate to wait a little longer before trying to catch the baton.

  • The Fix: The chip runs on a clock signal that ticks back and forth. The engineers realized the "high" part of the tick (when the logic is active) was too short. They physically swapped two wires on the circuit board so the "high" part lasted longer.
  • The Effect: This gave the logic more time to settle and get ready before the memory tried to grab the data.
  • The Result: This added an extra layer of safety, ensuring the chips wouldn't fail even if they got a little older or colder.
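The effect of the duty cycle can be put into numbers with simple arithmetic: the "high" part of each tick lasts for the clock period multiplied by the duty cycle. The sketch below rests on assumptions that are not in the text above: a 40 MHz clock, a 45% starting duty cycle, and the idea that the swapped wires are the two halves of a differential clock pair, so that swapping them inverts the signal.

```python
def high_phase_ns(clock_mhz: float, duty: float) -> float:
    """Duration of the clock's high phase in nanoseconds."""
    period_ns = 1000.0 / clock_mhz
    return period_ns * duty


def duty_after_inversion(duty: float) -> float:
    """Inverting a clock turns its low phase into its high phase."""
    return 1.0 - duty


# Assumed values for illustration only.
CLOCK_MHZ = 40.0
DUTY_BEFORE = 0.45

before_ns = high_phase_ns(CLOCK_MHZ, DUTY_BEFORE)
after_ns = high_phase_ns(CLOCK_MHZ, duty_after_inversion(DUTY_BEFORE))

print(f"high phase before swap: {before_ns:.2f} ns")
print(f"high phase after swap:  {after_ns:.2f} ns")
```

Under these assumptions the logic gains a couple of nanoseconds per cycle without any change to the chip itself, which is the kind of margin the "Longer Pause" is meant to provide.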

The "What If" Scenario: Changing the Factory

The team also talked to the factory (the foundry) about changing the manufacturing process to make the transistors naturally faster.

  • The Problem: They had already made 300 wafers with the "slow" process. You can't un-bake a cake. If they changed the process now, they would have to scrap all the existing wafers and start over, costing a fortune and delaying the project.
  • The Test: They tried "fast" transistors on new experimental wafers. These chips worked, but the change caused side effects (such as altering the sensitivity of the analog circuitry).
  • The Verdict: Since the "Speed Boost" (voltage) and "Longer Pause" (wiring swap) worked perfectly on the existing chips, they decided not to change the factory process. It was cheaper, faster, and safer to just tweak how the chips were used.

The Final Outcome

The team proved that by simply turning up the voltage slightly and swapping two wires, they could save the project.

  • Yield: They went from a disaster (2% working) to a success (over 80% working).
  • Power: The extra voltage used a tiny bit more power (about 3% more), which the cooling system of the detector could easily handle.
  • Radiation: They tested the chips under heavy radiation (like they would face in the particle collider) and found the fix still worked.
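A back-of-the-envelope check shows why a small voltage step costs only a few percent in power. The sketch assumes that switching power in digital logic scales with the square of the supply voltage, which is a standard approximation, and that only part of the total power is affected by the raised voltage. The `digital_fraction` value is hypothetical and is not a figure from the paper.

```python
def dynamic_power_ratio(v_old: float, v_new: float) -> float:
    """Ratio of switching power after/before, assuming power scales with V squared."""
    return (v_new / v_old) ** 2


def total_power_increase(digital_fraction: float, v_old: float, v_new: float) -> float:
    """Fractional rise in total power when only `digital_fraction` of it scales with V squared."""
    return digital_fraction * (dynamic_power_ratio(v_old, v_new) - 1.0)


# Supply values from the article; the digital share is a made-up illustration.
digital_only = dynamic_power_ratio(1.20, 1.25) - 1.0
whole_chip = total_power_increase(digital_fraction=0.35, v_old=1.20, v_new=1.25)

print(f"rise in digital switching power: {digital_only:.1%}")
print(f"rise in total power (assumed 35% digital share): {whole_chip:.1%}")
```

The point of the sketch is only that a rise of several percent in one supply domain becomes a smaller rise overall once the unaffected parts of the power budget are included.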

The Big Lesson

The paper ends with a crucial lesson for all engineers: Don't assume "proven" is perfect.

Just because a component (like the memory block) worked in the past doesn't mean it will work perfectly in every new design, especially when combined with new manufacturing variations. The team learned that even "silicon proven" blocks need to be re-checked with the specific tools and conditions of the new project. If they had done this earlier, they might have caught the issue sooner.

Thanks to this detective work, the ATLAS ITk detector is now being assembled with these chips, and they are expected to run reliably for the lifetime of the experiment.
