Metriq: A Collaborative Platform for Benchmarking Quantum Computers

Imagine the world of quantum computing as a bustling, chaotic city where every builder (like IBM, Quantinuum, Rigetti, etc.) is constructing their own unique skyscraper. Some are made of glass, some of steel, some of wood. They all claim to be the "tallest" or "fastest," but they measure height with different rulers, use different units of time, and refuse to let anyone else inspect their blueprints.

This is the problem the paper "Metriq" addresses. It introduces a new, open-source platform designed to be the universal "Consumer Reports" for quantum computers.

Here is a breakdown of how Metriq works, using simple analogies:

1. The Problem: A Tower of Babel

Right now, if you want to know which quantum computer is best, it's like trying to compare cars where one manufacturer measures speed in "miles per hour," another in "how many clouds they can fly through," and a third just says, "Trust us, we're fast."

The Issue: There is no standard way to compare them. The data is scattered, inconsistent, and often hidden behind paywalls or specific company tools.
The Result: It's impossible to get a clear, honest picture of who is actually winning the race.

2. The Solution: Metriq (The Universal Translator)

Metriq is a collaborative platform built by an independent group (the Unitary Foundation) that acts as a neutral referee. It doesn't belong to any one company. It has three main parts that work together like a well-oiled machine:

The Runner (metriq-gym): Think of this as a universal remote control. Instead of needing a different remote for every TV brand, Metriq has one remote that can talk to any supported quantum computer, whether it comes from IBM, Quantinuum, Rigetti, or another provider, as well as to local simulators. It sends the same set of tests to every machine.
The Dataset (metriq-data): This is the public library of results. Every time a test is run, the results are saved in a standardized format. No more hiding data in private spreadsheets. Everyone can see the raw numbers.
The Website (metriq-web): This is the interactive dashboard. It takes all that raw data and turns it into easy-to-read charts, graphs, and leaderboards. You can filter by "fastest," "most accurate," or "best for machine learning."

3. The Tests: The "Driver's License" Exam

Metriq doesn't just run one test; it runs a whole battery of exams to see how the "cars" handle different terrains. The suite has two types of tests:

A. The "Engine Room" Tests (System-Level)
These check the basic health of the machine, like checking oil pressure or tire tread.

BSEQ (Bell State Effective Qubits): Imagine trying to hold hands with a partner across a crowded room. Can you stay connected without letting go? This test checks if the computer can keep qubits "entangled" (connected) across the whole chip.
EPLG (Error Per Layered Gate): This is like counting how many times you stumble while walking a tightrope. It measures how many errors creep in when the computer performs layers of basic two-qubit moves side by side.
CLOPS (Circuit Layer Operations Per Second): This is the speedometer. It measures how fast the computer can actually execute instructions in sustained use, including overhead such as compiling and loading the program.

B. The "Road Trip" Tests (Application-Level)
These tests see if the computer can actually do something useful, not just run in circles.

QML Kernel: Can the computer recognize patterns in data? (Like a very basic version of AI).
WIT (Wormhole-inspired Teleportation): A fancy physics test. It runs a protocol inspired by a "wormhole" (a shortcut through space) to see if the computer can preserve information while moving it around.
LR-QAOA: A puzzle-solving test. Can the computer find the best solution to a complex optimization problem (like finding the shortest route for a delivery truck)?

4. The Score: The "Metriq Score"

How do you combine all these different tests into one number?
Imagine you are rating a restaurant. You have scores for:

Food quality (0-10)
Service speed (0-10)
Ambiance (0-10)

Just as a reviewer might roll those three ratings into one overall rating, Metriq combines all the different test results into a single Metriq Score.

The Twist: The tests are weighted by difficulty. Passing a test with 100 qubits is worth much more than passing one with 5 qubits, just as driving a Formula 1 car is harder than driving a go-kart.
The Baseline: They pick one computer (currently an IBM machine called "Torino") as the "standard" and give it a score of 100. If another machine scores 120, it's 20% better than the standard. If it scores 50, it's half as good.

5. Why This Matters

Transparency: No more "black box" results. If a company claims their computer is the best, Metriq lets you see the raw data to verify it.
Community Power: Anyone can suggest a new test or run a test on their own lab equipment. It's a living, breathing project that grows with the technology.
Future-Proofing: As quantum computers get bigger and better, Metriq updates its tests. It's designed to track progress over years, not just take a snapshot today.

The Bottom Line

Metriq gives the quantum world its first fair, open, and standardized way to say, "Okay, let's see who is actually the best at what they claim to do."

It turns a chaotic race with different rules into a clear, transparent competition where the best technology can finally shine. And the best part? It's free for everyone to use, check, and improve.

Here is a detailed technical summary of the paper "Metriq: A Collaborative Platform for Benchmarking Quantum Computers."

1. Problem Statement

The current landscape of quantum computer benchmarking is fragmented and inconsistent. Key challenges include:

Vendor Lock-in: Existing tools (e.g., IBM Qiskit, IQM utilities) are tied to specific hardware stacks, preventing fair cross-platform comparisons.
Lack of Standardization: Benchmarks are often defined, executed, and compared differently across providers, leading to inconsistent evaluation methodologies.
Isolated Data: Most results appear as isolated case studies with little public, standardized, cross-platform data. There is a lack of longitudinal tracking of device performance over time.
Absence of Third-Party Verification: There is a need for unbiased, community-driven frameworks to provide objective assessments beyond vendor-run benchmarks.

2. Methodology: The Metriq Platform

The authors introduce Metriq, an open-source, collaborative platform designed to unify benchmark definition, execution, data collection, and presentation. The platform consists of three modular components:

A. Runner: `metriq-gym`

Function: A Python-based framework for defining and executing benchmarks across heterogeneous hardware providers (IBM, Quantinuum, Rigetti, IQM, OriginQ, etc.) and local simulators.
Key Features:
- Vendor-Agnostic: Uses the qBraid SDK to abstract cloud APIs and hardware architectures.
- Schema-Driven: All benchmark parameters are defined in explicit JSON schemas, ensuring reproducibility and preventing configuration drift.
- Asynchronous Execution: Supports non-blocking job dispatch and polling, essential for large-scale campaigns across multiple backends.
- Local Simulation: Treats local simulators (e.g., Qiskit Aer) as first-class devices for rapid prototyping and validation.
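The asynchronous dispatch-and-poll pattern described above can be sketched with Python's `asyncio`. This is a minimal illustration, not the metriq-gym API: `FakeBackend`, `run_on`, and `campaign` are hypothetical names, and the fake backend simply reports completion after a fixed number of polls.

```python
import asyncio


class FakeBackend:
    """Hypothetical stand-in for a remote device (not a metriq-gym or qBraid class)."""

    def __init__(self, name: str, polls_until_done: int) -> None:
        self.name = name
        self._remaining = polls_until_done

    def submit(self, benchmark: str) -> str:
        # A real provider would return an opaque job ID; this one is synthetic.
        return f"{self.name}:{benchmark}"

    def is_done(self, job_id: str) -> bool:
        # Pretend the job finishes after a fixed number of status checks.
        self._remaining -= 1
        return self._remaining <= 0


async def run_on(backend: FakeBackend, benchmark: str, poll_interval: float = 0.01) -> dict:
    job_id = backend.submit(benchmark)  # dispatch returns immediately
    polls = 0
    while not backend.is_done(job_id):
        polls += 1
        await asyncio.sleep(poll_interval)  # yield so other jobs can make progress
    return {"job_id": job_id, "polls": polls}


async def campaign(backends: list[FakeBackend], benchmark: str) -> list[dict]:
    # One coroutine per backend; gather() returns results in input order.
    return await asyncio.gather(*(run_on(b, benchmark) for b in backends))


results = asyncio.run(
    campaign([FakeBackend("device_a", 3), FakeBackend("device_b", 1)], "bseq")
)
```

Because each coroutine sleeps between status checks, a slow device does not block results from a fast one, which is the property that matters for multi-backend campaigns.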

B. Dataset: `metriq-data`

Function: A version-controlled, schema-validated collection of benchmark results stored as JSON files in a GitHub repository.
Structure: Organized by `{source}/{version}/{provider}/{device}/{timestamp}_{benchmark-type}.json`.
Benefit: Enables offline analysis, automated aggregation, and transparent provenance tracking. It is the first public dataset to systematically aggregate benchmarking evidence across diverse quantum hardware.
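As a rough illustration of how the directory layout supports automated aggregation, the sketch below splits a result path into its components. The `ResultPath` class, the `parse_result_path` helper, and the example path values are hypothetical, and the parser assumes the timestamp itself contains no underscore (the timestamp format is not specified here).

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ResultPath:
    source: str
    version: str
    provider: str
    device: str
    timestamp: str
    benchmark_type: str


def parse_result_path(path: str) -> ResultPath:
    """Split {source}/{version}/{provider}/{device}/{timestamp}_{benchmark-type}.json."""
    parts = PurePosixPath(path).parts
    if len(parts) != 5 or not parts[-1].endswith(".json"):
        raise ValueError(f"unexpected layout: {path!r}")
    source, version, provider, device, filename = parts
    stem = filename[: -len(".json")]
    # Assumption: the first underscore separates timestamp from benchmark type.
    timestamp, sep, benchmark_type = stem.partition("_")
    if not sep or not timestamp or not benchmark_type:
        raise ValueError(f"unexpected file name: {filename!r}")
    return ResultPath(source, version, provider, device, timestamp, benchmark_type)


# Hypothetical path, for illustration only.
example = parse_result_path(
    "metriq-gym/v0.3/ibm/ibm_torino/2025-06-01T120000Z_mirror_circuits.json"
)
```

Grouping parsed paths by `device` and sorting by `timestamp` would be one way to build the longitudinal view the dataset is meant to enable.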

C. Visualization: `metriq-web`

Function: An interactive web application (TypeScript/Vega) that visualizes the dataset.
Features: Supports filtering by provider/benchmark, time-series plotting, and machine-readable data export. It serves as a central hub for community discussion and result exploration.

3. The Metriq Score (MS)

To summarize heterogeneous benchmark outcomes into a single comparable metric, the authors propose the Metriq Score, a composite index calculated in three steps:

Within-Benchmark Aggregation: Linear aggregation of results across different circuit widths (qubit counts), weighted by the circuit width to emphasize larger scales.
Baseline Normalization: Results are normalized against a designated baseline device (e.g., ibm_torino), whose score is fixed at 100. Values above 100 indicate better performance than the baseline; values below 100 indicate worse.
Across-Benchmark Aggregation: Subscores are combined using weights derived from the effective circuit scale ($\mu_b$) of each benchmark. This ensures that benchmarks probing larger, more challenging regimes (where classical simulation is difficult) contribute more to the final score.
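The three steps can be sketched as follows. This is an interpretation under stated assumptions rather than the paper's exact formula: it assumes every result is higher-is-better, that step 1 is a width-weighted mean, and that step 3 is a mean weighted directly by $\mu_b$. The function names and all numbers are illustrative.

```python
def width_weighted_mean(results: dict[int, float]) -> float:
    """Step 1: aggregate one benchmark's results, weighting each by circuit width."""
    total_width = sum(results)  # sum of the widths (the dict keys)
    return sum(width * value for width, value in results.items()) / total_width


def normalized_subscore(device: dict[int, float], baseline: dict[int, float]) -> float:
    """Step 2: express the device's aggregate relative to the baseline (baseline = 100)."""
    return 100.0 * width_weighted_mean(device) / width_weighted_mean(baseline)


def metriq_score(subscores: dict[str, float], mu: dict[str, float]) -> float:
    """Step 3: combine subscores, weighting each benchmark by its effective scale mu_b."""
    total_mu = sum(mu[name] for name in subscores)
    return sum(mu[name] * score for name, score in subscores.items()) / total_mu


# Illustrative numbers only: width -> higher-is-better result.
baseline_results = {4: 0.9, 8: 0.8}
device_results = {4: 0.9, 8: 0.9}

subscore = normalized_subscore(device_results, baseline_results)
overall = metriq_score({"bench_a": 108.0, "bench_b": 90.0}, {"bench_a": 8, "bench_b": 4})
```

Here the device matches the baseline at width 4 and beats it at width 8, so its subscore lands above 100. Lower-is-better metrics such as an error rate would need to be inverted before step 2, which this sketch omits.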

4. Benchmark Suite

The paper presents a curated suite of 8 benchmarks spanning system-level metrics and application-inspired tasks:

System-Level Benchmarks:

BSEQ (Bell State Effective Qubits): Measures the largest connected component of qubits that can violate the CHSH inequality, quantifying device-wide entanglement quality.
EPLG (Error Per Layered Gate): Quantifies the error rate of parallel two-qubit gate layers on connected chains, sensitive to crosstalk and calibration uniformity.
Mirror Circuits: A scalable, verifiable method in which a circuit is followed by its inverse, so the ideal output is known in advance. The success probability (reported as a polarization) reflects the errors accumulated across the circuit and hence overall gate fidelity.
CLOPS (Circuit Layer Operations Per Second): A speed-oriented metric measuring the sustained rate of gate layer execution, capturing compilation efficiency and runtime overhead.
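To make the BSEQ definition concrete, the sketch below takes a measured CHSH value for each coupled qubit pair, keeps only pairs that exceed the classical bound of 2, and returns the size of the largest connected group of qubits. The function name, input format, and example values are assumptions for illustration; the benchmark's actual estimation procedure and statistical thresholds are not described here.

```python
from collections import defaultdict

CLASSICAL_CHSH_BOUND = 2.0  # local hidden-variable models cannot exceed this


def largest_violating_component(chsh_by_edge: dict[tuple[int, int], float]) -> int:
    """Size of the largest connected set of qubits linked by CHSH-violating pairs."""
    graph: dict[int, set[int]] = defaultdict(set)
    for (a, b), chsh in chsh_by_edge.items():
        if chsh > CLASSICAL_CHSH_BOUND:
            graph[a].add(b)
            graph[b].add(a)

    seen: set[int] = set()
    best = 0
    for start in list(graph):
        if start in seen:
            continue
        # Depth-first traversal of one connected component.
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        best = max(best, size)
    return best


# Illustrative values on a 5-qubit line; the pair (2, 3) fails to violate CHSH.
example_edges = {(0, 1): 2.6, (1, 2): 2.4, (2, 3): 1.9, (3, 4): 2.5}
bseq = largest_violating_component(example_edges)
```

In this example the failing pair splits the chain into groups of three and two qubits, so the result is 3.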

Application-Inspired Benchmarks:

QML Kernel: Evaluates the ability to compute kernel matrix elements for machine learning, testing coherent multi-qubit entanglement and parameterized rotations.
WIT (Wormhole-inspired Teleportation): Runs a teleportation protocol inspired by traversable wormhole dynamics in the AdS/CFT correspondence, testing multi-stage coherence and information preservation.
LR-QAOA (Linear-Ramp QAOA): A deterministic variant of QAOA for MaxCut problems, probing performance at large width and depth without classical optimization loops.
QFT (Quantum Fourier Transform): Tests the accumulation of phase information and coherence across structured multi-qubit circuits.
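The "no classical optimization loop" property of LR-QAOA comes from fixing the circuit angles in advance with a linear ramp. The sketch below shows one common linear-ramp parameterization together with a MaxCut cost function for scoring measured bitstrings. The slope values, function names, and example graph are assumptions, and the exact schedule used in Metriq is not specified here.

```python
def linear_ramp_schedule(
    p: int, delta_beta: float, delta_gamma: float
) -> tuple[list[float], list[float]]:
    """Fixed angles for p layers: mixer angles ramp down, cost angles ramp up."""
    betas = [(1 - k / p) * delta_beta for k in range(p)]
    gammas = [((k + 1) / p) * delta_gamma for k in range(p)]
    return betas, gammas


def cut_value(edges: list[tuple[int, int]], bits: str) -> int:
    """Number of edges whose endpoints fall on opposite sides of the cut."""
    return sum(1 for a, b in edges if bits[a] != bits[b])


# Illustrative inputs: 4 layers and a 4-node ring graph.
betas, gammas = linear_ramp_schedule(p=4, delta_beta=0.4, delta_gamma=0.8)
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
best_cut = cut_value(ring, "0101")
```

Since the angles depend only on the layer index, the same circuit can be generated for any width and depth, and solution quality can be judged by comparing the cut values of sampled bitstrings against the best known cut.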

5. Key Results

The authors executed the suite on more than 10 quantum computers from multiple vendors (IBM, Quantinuum, Rigetti, IQM, OriginQ) between March 2025 and March 2026.

Cross-Platform Comparison: The resulting dataset (Table I) reveals significant performance disparities. For example, Quantinuum H2-2 and IBM Heron devices generally outperform others in fidelity-based benchmarks (BSEQ, Mirror Circuits), while IBM devices show higher CLOPS scores due to faster gate times.
Correlations:
- Strong positive correlations exist between most benchmarks (e.g., Mirror Circuits and QML Kernel have $\rho = 0.991$), suggesting they probe similar underlying physical bottlenecks (two-qubit gate errors).
- The Metriq Score is highly correlated with vendor-reported two-qubit gate fidelity ($\rho = 0.982$).
- BSEQ correlates strongly with LR-QAOA, highlighting the importance of entanglement connectivity for optimization tasks.
Cost Analysis: The paper provides a cost breakdown (Table XIII), showing that while some benchmarks are cheap, others (like EPLG) incur significant costs, emphasizing the need for "frugal" benchmarking strategies.
Compilation Impact: The study highlights how compilation settings (e.g., "verbatim" mode on AWS Braket) drastically affect results. Without verbatim compilation, symmetric circuits (like QML Kernel) were optimized away to identity, masking hardware noise.
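The correlation coefficients quoted above compare two benchmarks across the set of devices. This summary does not say whether $\rho$ is a Pearson or a rank (Spearman) coefficient; the sketch below assumes Pearson and uses made-up per-device scores, not values from the paper.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Hypothetical per-device scores for two benchmarks (one entry per device).
mirror_scores = [0.91, 0.74, 0.55, 0.32]
kernel_scores = [0.88, 0.70, 0.58, 0.30]
rho = pearson(mirror_scores, kernel_scores)
```

A value of `rho` close to 1 means devices that rank well on one benchmark also rank well on the other, which is the pattern the paper attributes to shared bottlenecks such as two-qubit gate errors.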

6. Significance and Future Directions

Reproducibility & Transparency: Metriq establishes a "living" benchmarking ecosystem where results are versioned, reproducible, and open to community scrutiny, moving away from one-off vendor snapshots.
Community Governance: By decoupling execution from presentation and using open schemas, it allows the community to refine benchmarks and weights as hardware evolves.
Future Roadmap:
- Logical Qubits: The framework is designed to extend to fault-tolerant benchmarks (e.g., logical Bell-pair factories) as error-corrected devices emerge.
- Error Mitigation: Integration with tools like Mitiq to report both raw and mitigated performance.
- Expanded Coverage: Supporting more hardware modalities (neutral atoms, photonics) and integrating with HPC resources.

In conclusion, Metriq provides the first unified, open, and continuously evolving infrastructure for cross-platform quantum benchmarking, offering a practical foundation for tracking the progress of quantum hardware as it scales toward utility.
