Every claim Atlas makes, with its source. Measured, not asserted — a self-characterising instrument for quantum-compute triage: it routes a circuit to its cheapest correct compute tier, and it measures the limits of that judgment.
| Claim | Measured | Source |
|---|---|---|
| Route accuracy | 99.56% (2,506 / 2,517 certified) · 0 false-alarm | §3 |
| Calibration | Brier 0.0076 held-out · conformal correctness ≥98.67% | §5 |
| Confident-wrong (FalseVerify) | 0 / 1,983 — vs AlphaFold2 33.6% | §8 |
| Information transported | 83.4% of route uncertainty · multi-estimator +59% over best single | §7 |
| Cost structure | cheap pre-check carries 732× more bits/second — proofreading-optimal | §9 |
| Validity & limits | 97% in-domain · 87% of residual uncertainty is the theoretical floor (Leone), not missing data | §8 |
| Hardware-validated | TVD≈0.06 on a Heron r2-class device · auditable job IDs | §2 |
| Simulability certificate | agreement of independent folds → STRONG/FIRM/WEAK/SPLIT/NULL, hash-stamped · 0 false-STRONG | §10 |
Deciding simulability exactly is super-exponential (Leone et al., arXiv:2602.22330), so a calibrated abstention is the optimal answer.
| Layer | vs. SoTA |
|---|---|
| Quantum simulation (statevector) | well below — quimb / Aer own this |
| Tensor networks · stabilizer · compiler · hardware | below — cotengra / Stim / Qiskit own these |
| Route adjudication — which method should you run? | potentially ahead — almost no one builds this layer |
| Explainability · failure-mode awareness | at or above |
| Triage workflow as a category | possibly unique |
The practical ceiling is a function of three things: time budget × circuit entanglement × your hardware — not a fixed qubit count. Below: per-estimator wall-clock on a dev Apple M4 (30 s/case), for bounded-entanglement (area-law) circuits — 1D nearest-neighbour chains. Reproducible: benchmarks/scale_ceiling.py.
| Estimator | Honest regime | Reached on M4 (≤30 s) | Scaling |
|---|---|---|---|
| Clifford (Stim) | any stabilizer circuit | n=800 in 18.8 s | polynomial → routinely far higher |
| MPS bond (quimb) | area-law / structured | n=640 in 2.4 s (truncated, bond ≤64) | collapses for volume-law / 2D-dense |
| treewidth (cotengra) | low-treewidth topology | n=320 in 1.7 s | collapses for dense graphs |
| statevector (exact) | any circuit (dense) | n≈28 | 2ⁿ memory (hard wall) |
Read honestly (audited by an adversarial reviewer): only Clifford (Stim) scales generically — it's polynomial. The MPS and treewidth numbers hold for bounded-entanglement / structured circuits; for volume-law or 2D-dense circuits the bond saturates (≈2^(n/2)) and these n drop sharply. Above n=20 the MPS bond is a truncated lower bound (cap 64), not exact. We state the regime, not just the number. Source: benchmarks/scale_ceiling.md · scale_ceiling.json.
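The "2ⁿ memory (hard wall)" entry for the exact statevector can be made concrete with a few lines of arithmetic. This is a minimal sketch, assuming complex128 amplitudes (16 bytes each); the benchmark's actual precision is not stated here, so treat `bytes_per_amplitude` as an assumption.

```python
GIB = 2 ** 30

def statevector_bytes(n_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory for a dense n-qubit statevector (assumes complex128 amplitudes)."""
    return (2 ** n_qubits) * bytes_per_amplitude

def max_qubits(memory_bytes: int, bytes_per_amplitude: int = 16) -> int:
    """Largest n whose dense statevector still fits in memory_bytes."""
    n = 0
    while statevector_bytes(n + 1, bytes_per_amplitude) <= memory_bytes:
        n += 1
    return n

# Each extra qubit doubles the footprint, which is why the wall is hard.
for n in (20, 28, 32, 40):
    print(f"n={n}: {statevector_bytes(n) / GIB:.3f} GiB")
```

Under that assumption n=28 needs 4 GiB for the state alone, which is consistent with n≈28 being the practical limit on a development laptop once workspace overhead is included.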
Atlas's core claim — "if it routes CPU, the classical distribution reproduces the hardware result within error" — is grounded in real QPU job submissions on public-access superconducting hardware (Heron r2 class, 156-qubit, us-east, open-instance), not simulation. Each row below is an auditable job with a real usage time.
| Job (prefix) | QPU class | Mode | Completed | Usage | Status |
|---|---|---|---|---|---|
| d8t2gatbh0os7… | Heron r2 (kingston) | Job | 2026-06-23 | 8 s | ✓ mirror-RB |
| d8snb9tbh0os7… | Heron r2 (kingston) | Job | 2026-06-22 | 3 m 55 s | ✓ |
| f3dc1a59-bd8d… | Heron r2 (kingston) | Batch | 2026-06-22 | 8 s | ✓ |
| c966ac61-5777… | superconducting (fez) | Batch | 2026-06-23 | 20 s | ✓ |
| d8sobh4bp3hs7… · d8sobglposuc7… | Heron r2 (kingston) | Job | 2026-06-22 | 3 s ea. | ✓ PT |
Result: TVD(ideal, QPU) ≈ 0.059 / 0.055 for the validated families (GHZ-4, Clifford+T-5); mirror-RB per-layer fidelity 6.7%–11.2%, readout-corrected. Full data, formulas and all job IDs: benchmarks/QPU_RESULTS.md · qpu_mirror_results.json.
Scope (honest): validated on specific circuit families, one device class, specific dates. Real data, narrow scope — growing.
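For readers who want to check a TVD figure against their own counts, the total variation distance between an ideal distribution and measured shot counts can be sketched as follows. The counts below are hypothetical and chosen only for illustration; they are not the data from the jobs listed above.

```python
def normalize(counts: dict) -> dict:
    """Turn raw shot counts into an empirical probability distribution."""
    total = sum(counts.values())
    return {bitstring: c / total for bitstring, c in counts.items()}

def total_variation_distance(p: dict, q: dict) -> float:
    """TVD(p, q) = 1/2 * sum over outcomes of |p(x) - q(x)|."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in outcomes)

# Ideal GHZ-4 output: equal weight on all-zeros and all-ones.
ideal_ghz4 = {"0000": 0.5, "1111": 0.5}

# Hypothetical counts (illustration only, NOT the QPU results above).
counts = {"0000": 470, "1111": 472, "0001": 20, "1110": 18, "0111": 20}

tvd = total_variation_distance(ideal_ghz4, normalize(counts))
print(f"TVD = {tvd:.3f}")
```

Outcomes missing from either distribution are treated as probability zero, so leakage into non-GHZ bitstrings counts fully toward the distance.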
"2,517 certified" answers the only question that matters: certified against an exact classical oracle, on a corpus that is fully specified, hash-pinned and regenerable — and on which Atlas was never fitted (pure held-out evaluation, no train split).
| What | Definition |
|---|---|
| Corpus | Stratified, deterministic: 7 families (line, ladder, grid, star, heavy-hex, dense-core, all-to-all-sparse) × n = 8 … 40 × 3 depths × 4 T-densities × 8 seeds. Hash-pinned: sha256 66f9d6…06e6af22. |
| "Certified" = | An exact classical oracle applies — Stim (Clifford, any n) / non-truncated MPS (n≤20) / exact treewidth / statevector. The cheapest exact-certified route is the ground truth. No exact certificate ⇒ certified=False, row excluded from accuracy. |
| Selection | Evaluation-only — Atlas is not trained on this set; it is a held-out test corpus. The confidence threshold is chosen on a separate split half (§5). |
| Result | 2,517 / 2,517 rows oracle-certified. Route accuracy 99.56% (2,506/2,517). False-alarm: 0. |
Honest boundary (the hard question): the genuinely quantum-hard regime (n>24, no classical oracle) has no ground truth by construction — false-security there is unmeasurable, not zero. On the hard-verified subset (n≤20, exact ground truth, 25 circuits) false-security is 1/25; Wilson 95% CI [0.7%, 19.5%] — wide, and we say so. Atlas claims neither classical impossibility nor quantum advantage. Download: benchmark_manifest.json · scaled_results.csv (+ext, moat) · benchmarks/oracle.py.
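The Wilson interval quoted for the hard-verified subset can be recomputed from the two numbers given (1 failure in 25 circuits). A minimal sketch of the standard two-sided Wilson score interval, written here for checking and not taken from benchmarks/oracle.py:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple:
    """Two-sided Wilson score interval for a binomial proportion.

    z = 1.959964 is the 97.5th percentile of the standard normal (95% CI).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

low, high = wilson_interval(1, 25)
print(f"1/25 -> [{100 * low:.1f}%, {100 * high:.1f}%]")
```

With only 25 circuits the interval stays wide no matter how few failures are observed, which is the reason the subset result is reported with its interval and not as a point estimate.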
The classifier was attacked across multiple vectors — irrational rotations, T-identity camouflage (T⁸=I), maximal cross-topology entanglement, mid-circuit measurements, permutation-camouflage, MPS-truncation, depth-ignoring statevector — to make it route an expensive circuit as cheap (a "false-security" verdict, the dangerous failure).
0 false-security across all attack rounds. The adjudicator holds because it invalidates truncated-MPS lower bounds, uses an always-valid statevector certificate, and takes the cheapest valid route. A separate finding — a triage denial-of-service via a C-level hang in the contraction optimizer — was found and closed (topology-aware reach-2q guard → killable subprocess; degrades to an honest "compute-bound → HPC" verdict instead of hanging).
Source & permanent regression test: benchmarks/adversarial_attack.py · adversarial_findings.json. CPython refs for the DoS fix: bugs.python.org/issue27422, cpython#96971.
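The defence described above (discard truncated bounds, always keep a valid statevector certificate, take the cheapest valid route) can be sketched in a few lines. The `Estimate` class, its field names and the example costs are hypothetical illustrations, not the interface of route_adjudicator.py.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass(frozen=True)
class Estimate:
    method: str        # e.g. "mps", "treewidth", "statevector" (hypothetical labels)
    cost_log2: float   # log2 of the estimated classical cost
    exact: bool        # False when the value is only a truncated lower bound

def cheapest_valid(estimates: Iterable[Estimate]) -> Estimate:
    """Pick the cheapest estimate among those that are valid certificates."""
    valid = [e for e in estimates if e.exact]
    if not valid:
        raise ValueError("no valid certificate; include the statevector bound")
    return min(valid, key=lambda e: e.cost_log2)

n = 30
estimates = [
    # A capped MPS bond looks cheap but is only a lower bound, so it is invalid.
    Estimate("mps", cost_log2=12.0, exact=False),
    Estimate("treewidth", cost_log2=26.0, exact=True),
    # The dense statevector cost 2^n is always a valid upper bound.
    Estimate("statevector", cost_log2=float(n), exact=True),
]

chosen = cheapest_valid(estimates)
print(chosen.method, chosen.cost_log2)
```

A naive minimum over all three entries would select the truncated MPS value, which is exactly the false-security verdict the attacks try to provoke.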
The raw confidence score is a heuristic 0–100 index. We recalibrated it into a true P(correct route) — fit on a selection half, measured on an untouched validation half — so the number means what it says.
| Metric (held-out validation half) | Raw heuristic | Isotonic-calibrated |
|---|---|---|
| Brier score (lower = better) | 0.103 | 0.0076 |
| Expected Calibration Error | 0.255 | 0.0069 |
The conformal certificate is the headline guarantee: at the selected threshold, held-out coverage 100% with route-correctness ≥ 98.67% (one-sided Wilson 95%, error bound ≤ 1.33%) — a valid held-out bound, not an in-sample one. All 11 errors in the corpus sit at raw confidence ≤ 34; above that, 0 / 2,517. We never display "100%" — the honest ceiling is the Wilson bound. The borderline zone (MPS bond ~2⁶–2¹⁰) and out-of-distribution circuits decline to "verify" rather than feign certainty. A two-layer noise model separates a measured local channel (real device data) from a clearly-labeled "toy global" heuristic.
Reproduce: HANDOFF_5ideas/atlas_recalibrate.py (isotonic + Platt + reliability diagram) · atlas_conformal.py (split conformal) · route_adjudicator.py. Data: calibration_report.json · kingston_calibration.json.
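The two metrics in the table above have short definitions. The sketch below implements the Brier score and an equal-width-bin Expected Calibration Error in plain Python; the bin count of 10 and the toy inputs are assumptions made for illustration, and atlas_recalibrate.py may bin differently.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted P(correct) and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean |confidence - accuracy| over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if members:
            confidence = sum(p for p, _ in members) / len(members)
            accuracy = sum(y for _, y in members) / len(members)
            ece += len(members) / total * abs(confidence - accuracy)
    return ece

# Toy data (illustration only): predicted P(correct route) and whether it was correct.
probs = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
outcomes = [1, 1, 1, 0, 0, 0]

print(f"Brier = {brier_score(probs, outcomes):.4f}")
print(f"ECE   = {expected_calibration_error(probs, outcomes):.4f}")
```

Both metrics should be computed on the untouched validation half only, as the text states; evaluating them on the half used to fit the calibrator would understate the error.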
Why "MEDIUM" is honesty, not a gap, and why we never claim certainty: deciding exactly whether a state is classically simulable (membership in the stabilizer polytope) is provably super-exponential, Ω(2^(n²)), unconditional under ETH (Leone, Eisert & Oliviero, arXiv:2602.22330, 2026). An efficient exact simulability classifier cannot exist. A multi-estimator verdict with calibrated, sometimes-MEDIUM confidence is therefore not a workaround; it is the only epistemically honest architecture for a provably intractable problem. When Atlas says MEDIUM, that is the theoretical ceiling speaking, not a bug.
Pre-flight simulability triage went from folklore to an active research topic in May–June 2026. Atlas is the measured, multi-engine, QPU-validated point in that space — complementary to the predict-from-features and pure-theory work.
| Work (2026) | Approach | How Atlas differs |
|---|---|---|
| Xing et al., arXiv:2606.11620 (family-aware ML) | Predicts MPS-bond threshold & runtime from static gate features (~50 ms, 79.5% exact) | Atlas measures the exact bond/treewidth; a predicted threshold can be wrong (a false-security risk), a measured one can't |
| Del Rey et al., arXiv:2605.28986 (simulability from samples) | Studies T-count & MPS bond as control variables for learning simulability | Atlas uses the same two as decision variables, cross-validated against Stim, in a verdict |
| Shao et al., arXiv:2606.00474 (TN complexity under noise) | Pure theory (when poly bond suffices); no tool, no implementation | Atlas is the running implementation with calibrated confidence |
| MatchCake / PennyLane blog | Free-fermion (matchgate) classical check before a QML advantage claim | Atlas covers the magic / T-gate regime matchgate methods don't |
The field recognised the gap; the published approaches predict (ML) or theorise. Atlas is the multi-engine system that measures and validates on real hardware. We cite this work because honesty about the landscape is the point — Atlas is first as a running product, not first to ask the question.
Declared next — future work, not yet shipped (cited so the path is honest, not claimed as done): family-aware MPS auto-tuning (Leonteva et al. arXiv:2606.23262) to cut dense-case latency; a noise-induced simulability regime via operator-scrambling γ_c (Dowling et al. arXiv:2605.18943) as a third noise verdict; an SRE second-order magic estimator (Lipardi et al. arXiv:2509.16799) beyond raw T-count; a ZX rank-width contraction bound (Kuyanov & Kissinger arXiv:2603.06764) to tighten the treewidth↔MPS divergence. The field is converging from separate resource measures toward coupled ones (magic × entanglement × interference); Atlas sits on that frontier as the measured layer. Roadmap, not claims.
Structural estimators with sharp theoretical thresholds (declared roadmap): a QAOA degree threshold — 2-local + depth O(log n) is exactly classically samplable (Āboliņš & Ambainis arXiv:2605.22758); an entanglement-graph topology check — circle/planar graph states look hard yet are topology-simulable (Hahn et al. arXiv:2603.08847); and an input-data wavelet exponent α=½ area-law/volume-law cut for QML (Kam et al. arXiv:2605.11557). Plus an original contribution only this corpus enables — the meta-triage question: when does measuring the estimators cost more than just simulating? We hold triage-time and simulation-time for all 2,517 circuits. Roadmap, not claims.
A measuring instrument, not just a tool. We computed — in bits — how much information about the true simulation route Atlas's verdict actually carries, and how much each physical estimator carries alone versus combined. This is the Shannon view of triage, and to our knowledge the first such measurement — possible only because we hold the predictor, the ground-truth corpus, and the QPU validation together.
| Quantity (bits, on the certified corpus) | Value | Reading |
|---|---|---|
| Prior route uncertainty H(route) | 0.245 | before Atlas |
| Verdict transports I(route ; verdict) | 0.205 | resolves 83.4% of it |
| Best single estimator (MPS bond) alone | 0.192 | the ablation, in bits |
| Multi-estimator (joint) | 0.305 | +59% over best single — synergy +0.113 bits |
| Residual H(route \| verdict) | 0.041 | exactly the MEDIUM / "oracle needed" zone |
The combined adjudicator demonstrably carries more information than any single estimator — the multi-estimator design earns its keep in bits, not rhetoric. The residual is irreducible: deciding exactly is super-exponential (Leone et al. arXiv:2602.22330), so a perfect channel cannot exist — what Atlas reports is a measured lower bound on how much signal passes. This is what reframes triage from "a tool that guesses" into an instrument that measures the classical-simulability frontier.
Reproduce: HANDOFF_5ideas/atlas_channel_capacity.py (mutual information in bits, Miller–Madow corrected). Data: channel_capacity.json. The measured MI is a lower bound on channel capacity (capacity = sup over input distributions); H(route) is small because the corpus is route-imbalanced, so we report the fraction (83.4%) and the synergy, not just absolute bits.
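For orientation, mutual information in bits with the Miller–Madow correction can be sketched as below. This is a generic plug-in estimator written for illustration, not the code in atlas_channel_capacity.py, and the toy labels are made up.

```python
import math
from collections import Counter

def entropy_bits(samples, miller_madow=True):
    """Plug-in Shannon entropy in bits, optionally with the Miller-Madow bias term."""
    n = len(samples)
    counts = Counter(samples)
    h = -sum(c / n * math.log2(c / n) for c in counts.values())
    if miller_madow:
        # Bias correction (m - 1) / (2N) nats, converted to bits;
        # m is the number of non-empty categories.
        h += (len(counts) - 1) / (2 * n * math.log(2))
    return h

def mutual_information_bits(xs, ys, miller_madow=True):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), each term estimated as above."""
    joint = list(zip(xs, ys))
    return (entropy_bits(xs, miller_madow)
            + entropy_bits(ys, miller_madow)
            - entropy_bits(joint, miller_madow))

# Toy labels (illustration only): true route vs. verdict for 100 circuits.
routes = ["cpu"] * 90 + ["qpu"] * 10
verdicts = ["cpu"] * 88 + ["qpu"] * 12

h_route = entropy_bits(routes)
mi = mutual_information_bits(routes, verdicts)
print(f"H(route) = {h_route:.3f} bits, I(route; verdict) = {mi:.3f} bits")
print(f"fraction of route uncertainty resolved = {mi / h_route:.1%}")
```

Because the correction is applied per entropy term, the corrected estimate can come out slightly negative for independent variables on small samples; reporting the resolved fraction alongside absolute bits, as done above, keeps an imbalanced corpus from hiding the effect.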
Third order. §7 measured how much signal the triage transports; this measures the validity boundary of the instrument itself — the Applicability Domain (a 20-year standard in QSAR/pharma cheminformatics, to our knowledge never applied to quantum-compute triage), the FalseVerify rate (the AlphaFold2 confidence pathology), and the split of remaining uncertainty into reducible vs. irreducible.
| Self-characterisation (certified corpus) | Value | Reading |
|---|---|---|
| FalseVerify — HIGH-confidence and wrong | 0 / 1,983 (Wilson ≤0.14%) | vs AlphaFold2 fold-switch 33.6% — confidence is not decoupled from correctness |
| Outside the Applicability Domain (leverage > h*) | 76 / 2,517 (3.0%) | where the predictor extrapolates — flagged, not trusted blindly |
| Remaining uncertainty: epistemic (reducible) | 67 / 534 | out of domain → fixable with more corpus |
| Remaining uncertainty: aleatoric (irreducible) | 467 / 534 | in domain but fundamentally ambiguous (Leone) — never fixable |
Two consequences. First, every Atlas error lives in the low-confidence, already-flagged region — the exact opposite of the AlphaFold2 pathology where a high score can be confidently wrong. Second, ~87% of the remaining uncertainty is aleatoric, not epistemic: Atlas is already near the data limit; what is left is the theoretical floor (deciding exactly is super-exponential, Leone et al. arXiv:2602.22330), not missing data. Separating these two sources of ignorance — reducible vs. irreducible — is standard in pharma/QSAR and, to our knowledge, has never been done for a quantum-simulability predictor.
Reproduce: HANDOFF_5ideas/atlas_applicability_domain.py (leverage AD, FalseVerify, epistemic/aleatoric). Data: applicability_domain.json. FalseVerify measured on the certified region; in the genuinely quantum-hard regime it is unmeasurable by construction. Epistemic/aleatoric is an operational proxy (out-of-domain vs in-domain), not a fundamental decomposition.
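A leverage-based Applicability Domain check can be sketched as follows. The warning threshold h* = 3k/n (k = number of descriptors plus intercept, n = training rows) is the common QSAR convention and is an assumption here, since the exact h* used by atlas_applicability_domain.py is not stated in this section; the one-dimensional descriptor data is a toy example.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def leverage(train, x):
    """h = x^T (X^T X)^-1 x, with an intercept column prepended to X and x."""
    rows = [[1.0] + list(r) for r in train]
    q = [1.0] + list(x)
    k = len(q)
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    y = solve(xtx, q)
    return sum(qi * yi for qi, yi in zip(q, y))

def outside_domain(train, x):
    """Flag a query whose leverage exceeds h* = 3k/n (assumed convention)."""
    k = len(train[0]) + 1
    h_star = 3 * k / len(train)
    return leverage(train, x) > h_star

# Toy one-descriptor training set (illustration only).
train = [[0.0], [1.0], [2.0], [3.0], [4.0]]
print(leverage(train, [2.0]), outside_domain(train, [3.0]), outside_domain(train, [10.0]))
```

A query far from the training descriptors gets a large leverage and is flagged as extrapolation, which matches the "flagged, not trusted blindly" reading in the table above.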
Biology buys precision with dissipation, in steps (Hopfield–Ninio kinetic proofreading; Landauer; the Thermodynamic Uncertainty Relation). Atlas is the same, unnamed: each estimator is a proofreading step that costs compute and buys information about the true route. We measured the curve.
| Proofreading step | Cost | Marginal info | Efficiency |
|---|---|---|---|
| #T + Clifford pre-check | ~0.001 s | +0.107 bits | 107 bits/s |
| MPS bond (quimb) | ~0.9 s | +0.132 bits | 0.15 bits/s |
| treewidth (cotengra greedy) | ~1.5 s | +0.066 bits | 0.04 bits/s |
Information efficiency falls 732× after the cheap pre-check and keeps falling — a textbook diminishing-returns (proofreading) signature. Two consequences, now empirical rather than assumed: the cheapest estimator carries the most bits per second, so Atlas's early-exit ordering is the thermodynamically efficient one; and when the full chain still leaves residual route-uncertainty, paying more dissipation is wasted — MEDIUM is the optimal stopping point, not a failure. To our knowledge, the first thermodynamic-proofreading characterisation of a quantum-simulability triage.
Reproduce: HANDOFF_5ideas/atlas_proofreading.py. Data: proofreading.json. Information is exact (bits, Miller–Madow); cost is representative per-estimator wall-clock (threshold_calibration.json / scale_ceiling.json). Landauer/TUR are the lens — we quantify bits and compute-cost, not Joules.
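The efficiency column is simply marginal information divided by cost, so it can be recomputed from the table. The sketch below uses the rounded values printed above; with those the drop after the pre-check comes out at roughly 730×, so the quoted 732× presumably reflects unrounded costs in proofreading.json.

```python
# (step, representative cost in seconds, marginal information in bits),
# copied from the rounded table values above.
steps = [
    ("#T + Clifford pre-check", 0.001, 0.107),
    ("MPS bond (quimb)", 0.9, 0.132),
    ("treewidth (cotengra greedy)", 1.5, 0.066),
]

def bits_per_second(cost_s: float, info_bits: float) -> float:
    """Information efficiency of one proofreading step."""
    return info_bits / cost_s

efficiencies = [bits_per_second(cost, bits) for _, cost, bits in steps]
drop_after_precheck = efficiencies[0] / efficiencies[1]

for (name, _, _), eff in zip(steps, efficiencies):
    print(f"{name}: {eff:.3g} bits/s")
print(f"drop after pre-check: {drop_after_precheck:.0f}x")

# Early-exit order: run the most efficient step first.
by_efficiency = sorted(steps, key=lambda s: bits_per_second(s[1], s[2]), reverse=True)
```

Sorting by bits per second reproduces the order in which the table lists the steps, which is the sense in which the early-exit ordering is the efficient one.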
The engine routes over physically-independent folds (magic, complexity, locality, free-fermion, BGC) — each measuring the same circuit from a different mathematical space. When ≥2 independent folds agree "cheap", that is corroboration, not a single prediction. Atlas turns that agreement into a signed, hash-stamped certificate; and the disagreement into a frontier map. Two faces, one call.
| Level | Agreement | Meaning |
|---|---|---|
| STRONG | ≥2 folds cheap, 0 hard | classically simulable — convergent validity |
| FIRM | classical majority, named dissenter | simulable, dissent documented |
| WEAK | a single fold carries it | provisional, thin evidence |
| SPLIT | even split | ON THE FRONTIER — the Leone-intractable region made observable (the convergence map) |
| NULL | no fold cheap | QPU-required |
Each certificate carries a SHA-256 content hash of the circuit, the per-fold signers (axis · cost · vote), the level, and the engine's actual route — an archivable, citable audit artifact, not a black-box verdict. Soundness battery (Clifford → STRONG; as the budget tightens the level degrades STRONG → SPLIT → WEAK exactly as the independent folds begin to disagree): 0 false-STRONG — no STRONG is ever issued on a hard verdict. The disagreement names which fold dissents and in which direction — that is the convergence map, from the same call.
Reproduce: physics_magnitude_lab.certificate.certificate(n, circuit, budget_log2=30) → STRONG/FIRM/WEAK/SPLIT/NULL + signers + hash. Soundness: scripts/certificate_validate.py → certificate_validation.json. Independence is the moat — agreement among methods built on different mathematics is corroboration a single tool cannot fake.
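The mapping from fold votes to a level, plus the content hash, can be illustrated with a small sketch. Everything here is hypothetical: the function names, the vote labels ("cheap", "hard", "abstain"), and the handling of cases the table does not spell out (a cheap minority is mapped to WEAK) are assumptions, not the behaviour of physics_magnitude_lab.certificate.

```python
import hashlib

def agreement_level(votes: dict) -> str:
    """Map per-fold votes to a certificate level (assumed tie-breaking rules)."""
    cheap = sum(1 for v in votes.values() if v == "cheap")
    hard = sum(1 for v in votes.values() if v == "hard")
    if cheap == 0:
        return "NULL"                      # no fold says cheap
    if hard == 0:
        return "STRONG" if cheap >= 2 else "WEAK"
    if cheap == hard:
        return "SPLIT"                     # even split: on the frontier
    return "FIRM" if cheap > hard else "WEAK"

def make_certificate(circuit_text: str, votes: dict) -> dict:
    """Bundle a SHA-256 hash of the circuit text with the signers and the level."""
    digest = hashlib.sha256(circuit_text.encode("utf-8")).hexdigest()
    return {
        "circuit_sha256": digest,
        "signers": dict(sorted(votes.items())),
        "level": agreement_level(votes),
    }

# Hypothetical fold votes for an illustrative circuit description.
votes = {"magic": "cheap", "locality": "cheap", "complexity": "abstain"}
cert = make_certificate("h 0; cx 0 1; cx 1 2", votes)
print(cert["level"], cert["circuit_sha256"][:12])
```

The invariant worth preserving in any real implementation is the one stated above: STRONG is unreachable as soon as a single fold votes hard.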
The same structure — stage cheap signals first, certify when independent witnesses agree, map the frontier when they split, and abstain honestly under irreducible uncertainty — appears independently across fields that never talk to each other. Each is an independent witness that this architecture is the right shape; their agreement is itself a STRONG certificate (convergent validity, one level up).
| Domain | Everyday mirror | What it corroborates | Grounded in |
|---|---|---|---|
| Medicine | a triage nurse — routes (home/clinic/ER), doesn't cure; reads cheap signs first; "this one, see a specialist" | route-not-compute · MEDIUM as the honest call | §3 · §5 |
| Molecular biology | kinetic proofreading (the ribosome) — staged dissipation buys fidelity | cheap pre-check first, escalate only as needed | §9 |
| Structural biology | AlphaFold2 pLDDT — a confidence score can be confidently wrong (FalseVerify 33.6%) | Atlas's confidence is not decoupled: 0 confident-wrong | §8 · §10 |
| Cheminformatics | Applicability Domain (QSAR) — a model fails silently outside its domain | the validity-boundary map | §8 |
| Finance | a credit-rating agency — convergent independent signals → an archived, citable rating (AAA…D) | STRONG/FIRM/WEAK/SPLIT/NULL, hash-stamped & archivable | §10 |
| Law | independent witnesses — strangers who agree = evidence, not hearsay; who contradict = locate the dispute | agreement certifies · disagreement maps the frontier | §10 |
| Thermodynamics | Landauer / TUR — speed × precision × dissipation is an irreducible trade-off | the proofreading cost curve is a physical law, not an engineering limit | §9 |
| Information theory | Shannon channel capacity — how many bits survive the noise | the triage transports 83.4% of the route signal | §7 |
| Statistics | classifier with reject option (selective prediction) — abstention is an optimal action | MEDIUM is the (predict, abstain) pair, not a failure | §5 · §8 |
Honest where the bridge splits (we flag it, we don't sell it): the credit-rating mirror carries a warning — the 2008 "issuer-pays" failure means whoever submits a circuit must never pay for a favourable certificate; the rating stays independent of payment, or it is worth zero. And the ribosome mirror is STRONG on "staged fidelity" but not on "voting" — the ribosome proofreads in series, the certificate's folds vote in parallel. Naming where an analogy stops crossing is the same discipline as naming where an estimator dissents.