Evidence & Audit Trail

Every claim Atlas makes, with its source. Measured, not asserted — a self-characterising instrument for quantum-compute triage: it routes a circuit to its cheapest correct compute tier, and it measures the limits of that judgment.

Atlas (Krenn·IQ) · updated 2026-06-24 · operator: Fco. Osvaldo Morales Vilchis

Instrument datasheet — Atlas at one glance. Each row is measured, reproducible, and detailed in the section linked.

Route accuracy	99.56% (2,506 / 2,517 certified) · 0 false-alarm	§3
Calibration	Brier 0.0076 held-out · conformal correctness ≥98.67%	§5
Confident-wrong (FalseVerify)	0 / 1,983 — vs AlphaFold2 33.6%	§8
Information transported	83.4% of route uncertainty · multi-estimator +59% over best single	§7
Cost structure	cheap pre-check carries 732× more bits/second — proofreading-optimal	§9
Validity & limits	97% in-domain · 87% of residual uncertainty is the theoretical floor (Leone), not missing data	§8
Hardware-validated	TVD≈0.06 on a Heron r2-class device · auditable job IDs	§2
Simulability certificate	agreement of independent folds → STRONG/FIRM/WEAK/SPLIT/NULL, hash-stamped · 0 false-STRONG	§10

MEDIUM is not a failure — deciding exactly is super-exponential (Leone et al. arXiv:2602.22330), so a calibrated abstention is the optimal answer.

Where Atlas stands vs. the state of the art — honestly, not marketed. Atlas does not try to beat the specialized engines, and shouldn't. It answers a different question.

Layer	vs. SoTA
Quantum simulation (statevector)	well below — quimb / Aer own this
Tensor networks · stabilizer · compiler · hardware	below — cotengra / Stim / Qiskit own these
Route adjudication — which method should you run?	potentially ahead — almost no one builds this layer
Explainability · failure-mode awareness	at or above
Triage workflow as a category	possibly unique

Most tools answer "run my method." Atlas answers "which method should you run, and how sure can I be?" — the CPU-vs-GPU compiler decision, for quantum. Time advantage, stated honestly: ~18–24 months as a product category; 0–12 months as science, pending external validation that the adjudicator beats simple heuristics on public corpora. We do not claim to lead in simulation, hardware, or compilation — because we don't.

1 Scale — the ceiling is time, not qubit count

The practical ceiling is a function of three things: time budget × circuit entanglement × your hardware — not a fixed qubit count. Below: per-estimator wall-clock on a dev Apple M4 (30 s/case), for bounded-entanglement (area-law) circuits — 1D nearest-neighbour chains. Reproducible: benchmarks/scale_ceiling.py.

Estimator	Honest regime	Reached on M4 (≤30 s)	Scaling
Clifford (Stim)	any stabilizer circuit	n=800 in 18.8 s	polynomial → routinely far higher
MPS bond (quimb)	area-law / structured	n=640 in 2.4 s (truncated, bond ≤64)	collapses for volume-law / 2D-dense
treewidth (cotengra)	low-treewidth topology	n=320 in 1.7 s	collapses for dense graphs
statevector (exact)	any circuit (dense)	n≈28	2ⁿ memory (hard wall)

Read honestly (audited by an adversarial reviewer): only Clifford (Stim) scales generically — it's polynomial. The MPS and treewidth numbers hold for bounded-entanglement / structured circuits; for volume-law or 2D-dense circuits the bond saturates (≈2^(n/2)) and these n drop sharply. Above n=20 the MPS bond is a truncated lower bound (cap 64), not exact. We state the regime, not just the number. Source: benchmarks/scale_ceiling.md · scale_ceiling.json.
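The "2ⁿ memory (hard wall)" entry for the exact statevector can be checked with simple arithmetic. This is a minimal sketch, assuming double-precision complex amplitudes (16 bytes each); the page does not state the dtype Atlas uses, and `statevector_bytes` is a hypothetical helper name.

```python
# Memory needed to hold a dense n-qubit statevector.
# Assumption: complex128 amplitudes (16 bytes each); not stated on this page.
BYTES_PER_AMPLITUDE = 16


def statevector_bytes(n_qubits: int) -> int:
    """Bytes required for 2**n_qubits complex amplitudes."""
    return (2 ** n_qubits) * BYTES_PER_AMPLITUDE


for n in (20, 28, 32, 40):
    gib = statevector_bytes(n) / 2**30
    print(f"n={n:2d}: {gib:,.3f} GiB")
```

Under that assumption n=28 already needs 4 GiB for the state alone, and every additional qubit doubles it, which is why the exact row stops near n≈28 on a laptop-class machine.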

2 Validated on real quantum hardware

Atlas's core claim — "if it routes CPU, the classical distribution reproduces the hardware result within error" — is grounded in real QPU job submissions on public-access superconducting hardware (Heron r2 class, 156-qubit, us-east, open-instance), not simulation. Each row below is an auditable job with a real usage time.

Job (prefix)	QPU class	Mode	Completed	Usage	Status
`d8t2gatbh0os7…`	Heron r2 (kingston)	Job	2026-06-23	8 s	✓ mirror-RB
`d8snb9tbh0os7…`	Heron r2 (kingston)	Job	2026-06-22	3 m 55 s	✓
`f3dc1a59-bd8d…`	Heron r2 (kingston)	Batch	2026-06-22	8 s	✓
`c966ac61-5777…`	superconducting (fez)	Batch	2026-06-23	20 s	✓
`d8sobh4bp3hs7…` · `d8sobglposuc7…`	Heron r2 (kingston)	Job	2026-06-22	3 s ea.	✓ PT

Result: TVD(ideal, QPU) ≈ 0.059 / 0.055 for the validated families (GHZ-4, Clifford+T-5); mirror-RB per-layer fidelity 6.7 %–11.2 % readout-corrected. Full data, formulas & all job-ids: benchmarks/QPU_RESULTS.md · qpu_mirror_results.json.
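For readers who want to recompute a TVD from raw counts, the standard definition is half the L1 distance between the two outcome distributions. The sketch below uses made-up counts for illustration only; they are not the job data from benchmarks/QPU_RESULTS.md, and the function names are hypothetical.

```python
def counts_to_probs(counts: dict[str, int]) -> dict[str, float]:
    """Normalise raw shot counts into a probability distribution."""
    shots = sum(counts.values())
    return {bits: c / shots for bits, c in counts.items()}


def total_variation_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """TVD(p, q) = 0.5 * sum over outcomes of |p(x) - q(x)|."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in outcomes)


# Ideal GHZ-4 distribution: half the weight on each of 0000 and 1111.
ideal_ghz4 = {"0000": 0.5, "1111": 0.5}

# Illustrative counts only (NOT measured data from this page).
illustrative_counts = {"0000": 480, "1111": 470, "0001": 30, "1110": 20}

tvd = total_variation_distance(ideal_ghz4, counts_to_probs(illustrative_counts))
print(f"TVD = {tvd:.3f}")
```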

Scope (honest): validated on specific circuit families, one device class, specific dates. Real data, narrow scope — growing.

3 The certified benchmark — 2,517 circuits, certified how

"2,517 certified" answers the only question that matters: certified against an exact classical oracle, on a corpus that is fully specified, hash-pinned and regenerable — and on which Atlas was never fitted (pure held-out evaluation, no train split).

What	Definition
Corpus	Stratified, deterministic: 7 families (line, ladder, grid, star, heavy-hex, dense-core, all-to-all-sparse) × n = 8 … 40 × 3 depths × 4 T-densities × 8 seeds. Hash-pinned: `sha256 66f9d6…06e6af22`.
"Certified" =	An exact classical oracle applies — Stim (Clifford, any n) / non-truncated MPS (n≤20) / exact treewidth / statevector. The cheapest exact-certified route is the ground truth. No exact certificate ⇒ `certified=False`, row excluded from accuracy.
Selection	Evaluation-only — Atlas is not trained on this set; it is a held-out test corpus. The confidence threshold is chosen on a separate split half (§5).
Result	2,517 / 2,517 rows oracle-certified. Route accuracy 99.56% (2,506/2,517). False-alarm: 0.
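The "Certified" rule in the table above can be sketched as a small selection function. This is an illustration under stated assumptions, not the code in benchmarks/oracle.py: `OracleResult`, `ground_truth_route`, and the use of wall-clock seconds as the cost measure are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OracleResult:
    route: str     # e.g. "stim", "mps", "treewidth", "statevector"
    exact: bool    # False for a truncated (non-exact) run
    cost_s: float  # assumed cost measure: wall-clock seconds


def ground_truth_route(results: list[OracleResult]) -> Optional[str]:
    """Return the cheapest exact-certified route, or None if no oracle is exact."""
    exact = [r for r in results if r.exact]
    if not exact:
        return None  # corresponds to certified=False: row excluded from accuracy
    return min(exact, key=lambda r: r.cost_s).route


rows = [
    OracleResult("mps", exact=False, cost_s=0.4),   # truncated, so not a certificate
    OracleResult("treewidth", exact=True, cost_s=1.7),
    OracleResult("statevector", exact=True, cost_s=12.0),
]
truth = ground_truth_route(rows)
print(truth)
```

In this toy case the cheaper MPS run is ignored because it is truncated, so the treewidth route becomes the ground truth.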

Honest boundary (the hard question): the genuinely quantum-hard regime (n>24, no classical oracle) has no ground truth by construction — false-security there is unmeasurable, not zero. On the hard-verified subset (n≤20, exact ground truth, 25 circuits) false-security is 1/25; Wilson 95% CI [0.7%, 19.5%] — wide, and we say so. Atlas claims neither classical impossibility nor quantum advantage. Download: benchmark_manifest.json · scaled_results.csv (+ext, moat) · benchmarks/oracle.py.
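The Wilson interval quoted for the hard-verified subset can be reproduced from the two counts alone. A minimal sketch, assuming the standard two-sided Wilson score interval with z = 1.96; `wilson_interval` is a hypothetical helper, not a function from the Atlas repository.

```python
from math import sqrt


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Two-sided Wilson score interval for k events in n trials."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


lo, hi = wilson_interval(1, 25)  # 1 false-security verdict in 25 circuits
print(f"[{lo:.1%}, {hi:.1%}]")
```

With 1 event in 25 trials this gives roughly 0.7% to 19.5%, matching the interval stated above; the width comes almost entirely from the small sample.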

4 Adversarially audited — 0 false-security

The classifier was attacked across multiple vectors — irrational rotations, T-identity camouflage (T⁸=I), maximal cross-topology entanglement, mid-circuit measurements, permutation-camouflage, MPS-truncation, depth-ignoring statevector — to make it route an expensive circuit as cheap (a "false-security" verdict, the dangerous failure).

0 false-security across all attack rounds. The adjudicator holds because it invalidates truncated-MPS lower bounds, uses an always-valid statevector certificate, and takes the cheapest valid route. A separate finding — a triage denial-of-service via a C-level hang in the contraction optimizer — was found and closed (topology-aware reach-2q guard → killable subprocess; degrades to an honest "compute-bound → HPC" verdict instead of hanging).
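The "killable subprocess" idea can be illustrated with the standard library alone: run the risky step in a child process with a deadline, and fall back to the honest verdict if it does not return. This is a sketch of the general pattern, not Atlas's actual guard; `run_guarded` and the snippet-based interface are hypothetical.

```python
import subprocess
import sys

FALLBACK_VERDICT = "compute-bound → HPC"


def run_guarded(python_snippet: str, timeout_s: float) -> str:
    """Run a snippet in a child interpreter; degrade to a fallback verdict on hang or crash."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", python_snippet],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # on expiry the child is killed, then TimeoutExpired is raised
        )
    except subprocess.TimeoutExpired:
        return FALLBACK_VERDICT
    if done.returncode != 0:
        return FALLBACK_VERDICT
    return done.stdout.strip()


# A stand-in for a contraction-path search that never returns.
print(run_guarded("import time; time.sleep(30)", timeout_s=0.5))
```

Because the work happens in a separate process, a hang inside C code cannot block the caller: the operating system can kill the child even when an in-process timeout would never fire.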

Source & permanent regression test: benchmarks/adversarial_attack.py · adversarial_findings.json. CPython refs for the DoS fix: bugs.python.org/issue27422, cpython#96971.

5 Calibrated confidence — "88" against what, exactly

The raw confidence score is a heuristic 0–100 index. We recalibrated it into a true P(correct route) — fit on a selection half, measured on an untouched validation half — so the number means what it says.

[Figure: Atlas reliability diagram, raw heuristic vs isotonic-calibrated, held-out validation half]

Metric (held-out validation half)	Raw heuristic	Isotonic-calibrated
Brier score (lower = better)	0.103	0.0076
Expected Calibration Error	0.255	0.0069

The conformal certificate is the headline guarantee: at the selected threshold, held-out coverage 100% with route-correctness ≥ 98.67% (one-sided Wilson 95%, error bound ≤ 1.33%) — a valid held-out bound, not an in-sample one. All 11 errors in the corpus sit at raw confidence ≤ 34; above that, 0 / 2,517. We never display "100%" — the honest ceiling is the Wilson bound. The borderline zone (MPS bond ~2⁶–2¹⁰) and out-of-distribution circuits decline to "verify" rather than feign certainty. A two-layer noise model separates a measured local channel (real device data) from a clearly-labeled "toy global" heuristic.

Reproduce: HANDOFF_5ideas/atlas_recalibrate.py (isotonic + Platt + reliability diagram) · atlas_conformal.py (split conformal) · route_adjudicator.py. Data: calibration_report.json · kingston_calibration.json.
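The two metrics in the table above have short definitions that are easy to check by hand. The sketch below is not atlas_recalibrate.py; it assumes equal-width bins for the Expected Calibration Error (the page does not state the binning) and uses toy data, not the Atlas validation half.

```python
def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared difference between predicted P(correct) and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs: list[float], outcomes: list[int],
                               n_bins: int = 10) -> float:
    """Weighted mean |confidence - accuracy| over equal-width bins (assumed binning)."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_conf - accuracy)
    return ece


# Toy data only: four confident predictions (one wrong) and one unconfident one.
toy_probs = [0.9, 0.9, 0.9, 0.9, 0.1]
toy_outcomes = [1, 1, 1, 0, 0]
print(brier_score(toy_probs, toy_outcomes))
print(expected_calibration_error(toy_probs, toy_outcomes))
```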

Why "MEDIUM" is honesty, not a gap, and why we never claim certainty: deciding exactly whether a state is classically simulable (membership in the stabilizer polytope) is provably super-exponential, Ω(2^(n²)), unconditional under ETH (Leone, Eisert & Oliviero, arXiv:2602.22330, 2026). An efficient exact simulability classifier therefore cannot exist. A multi-estimator verdict with calibrated, sometimes-MEDIUM confidence is not a workaround; it is the only epistemically honest architecture for a problem that is intractable to decide exactly. When Atlas says MEDIUM, that is the theoretical ceiling speaking, not a bug.

6 Field context — the problem the field just named (mid-2026)

Pre-flight simulability triage went from folklore to an active research topic in May–June 2026. Atlas is the measured, multi-engine, QPU-validated point in that space — complementary to the predict-from-features and pure-theory work.

Work (2026)	Approach	How Atlas differs
Xing et al., `arXiv:2606.11620` — family-aware ML	Predicts MPS-bond threshold & runtime from static gate features (~50 ms, 79.5% exact)	Atlas measures the exact bond/treewidth — a predicted threshold can be wrong (a false-security risk); a measured one can't
Del Rey et al., `arXiv:2605.28986` — simulability from samples	Studies T-count & MPS bond as control variables for learning simulability	Atlas uses the same two as decision variables, cross-validated against Stim, in a verdict
Shao et al., `arXiv:2606.00474` — TN complexity under noise	Pure theory (when poly bond suffices) — no tool, no implementation	Atlas is the running implementation with calibrated confidence
MatchCake / PennyLane blog	Free-fermion (matchgate) classical check before a QML advantage claim	Atlas covers the magic / T-gate regime matchgate methods don't

The field recognised the gap; the published approaches predict (ML) or theorise. Atlas is the multi-engine system that measures and validates on real hardware. We cite this work because honesty about the landscape is the point — Atlas is first as a running product, not first to ask the question.

Declared next — future work, not yet shipped (cited so the path is honest, not claimed as done): family-aware MPS auto-tuning (Leonteva et al. arXiv:2606.23262) to cut dense-case latency; a noise-induced simulability regime via operator-scrambling γ_c (Dowling et al. arXiv:2605.18943) as a third noise verdict; an SRE second-order magic estimator (Lipardi et al. arXiv:2509.16799) beyond raw T-count; a ZX rank-width contraction bound (Kuyanov & Kissinger arXiv:2603.06764) to tighten the treewidth↔MPS divergence. The field is converging from separate resource measures toward coupled ones (magic × entanglement × interference); Atlas sits on that frontier as the measured layer. Roadmap, not claims.

Structural estimators with sharp theoretical thresholds (declared roadmap): a QAOA degree threshold — 2-local + depth O(log n) is exactly classically samplable (Āboliņš & Ambainis arXiv:2605.22758); an entanglement-graph topology check — circle/planar graph states look hard yet are topology-simulable (Hahn et al. arXiv:2603.08847); and an input-data wavelet exponent α=½ area-law/volume-law cut for QML (Kam et al. arXiv:2605.11557). Plus an original contribution only this corpus enables — the meta-triage question: when does measuring the estimators cost more than just simulating? We hold triage-time and simulation-time for all 2,517 circuits. Roadmap, not claims.

7 Atlas as an instrument — the information it transports

A measuring instrument, not just a tool. We computed — in bits — how much information about the true simulation route Atlas's verdict actually carries, and how much each physical estimator carries alone versus combined. This is the Shannon view of triage, and to our knowledge the first such measurement — possible only because we hold the predictor, the ground-truth corpus, and the QPU validation together.

[Figure: Information in bits transported by each estimator alone vs. the combined multi-estimator adjudicator]

Quantity (bits, on the certified corpus)	Value	Reading
Prior route uncertainty H(route)	0.245	before Atlas
Verdict transports I(route ; verdict)	0.205	resolves 83.4% of it
Best single estimator (MPS bond) alone	0.192	the ablation, in bits
Multi-estimator (joint)	0.305	+59% over best single — synergy +0.113 bits
Residual H(route | verdict)	0.041	exactly the MEDIUM / "oracle needed" zone

The combined adjudicator demonstrably carries more information than any single estimator — the multi-estimator design earns its keep in bits, not rhetoric. The residual is irreducible: deciding exactly is super-exponential (Leone et al. arXiv:2602.22330), so a perfect channel cannot exist — what Atlas reports is a measured lower bound on how much signal passes. This is what reframes triage from "a tool that guesses" into an instrument that measures the classical-simulability frontier.

Reproduce: HANDOFF_5ideas/atlas_channel_capacity.py (mutual information in bits, Miller–Madow corrected). Data: channel_capacity.json. The measured MI is a lower bound on channel capacity (capacity = sup over input distributions); H(route) is small because the corpus is route-imbalanced, so we report the fraction (83.4%) and the synergy, not just absolute bits.
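A mutual-information estimate with the Miller–Madow correction can be sketched in a few lines. This is not atlas_channel_capacity.py: it assumes the correction is applied to each entropy term separately (the page does not say how it is applied), and the labels are toy data.

```python
from collections import Counter
from math import log, log2


def entropy_mm_bits(samples: list) -> float:
    """Plug-in entropy in bits plus the Miller–Madow bias term (K - 1) / (2 N ln 2)."""
    counts = Counter(samples)
    n = len(samples)
    plugin = -sum(c / n * log2(c / n) for c in counts.values())
    return plugin + (len(counts) - 1) / (2 * n * log(2))


def mutual_information_bits(xs: list, ys: list) -> float:
    """I(X;Y) = H(X) + H(Y) - H(X,Y), each term Miller–Madow corrected (assumed)."""
    joint = list(zip(xs, ys))
    return entropy_mm_bits(xs) + entropy_mm_bits(ys) - entropy_mm_bits(joint)


# Toy data: the verdict reproduces the route exactly, so I is about H(route).
routes = ["cpu"] * 50 + ["gpu"] * 50
verdicts = list(routes)
mi_perfect = mutual_information_bits(routes, verdicts)

# Toy data: the verdict is independent of the route, so I is about 0.
routes_ind = ["cpu", "cpu", "gpu", "gpu"] * 25
verdicts_ind = ["cpu", "gpu", "cpu", "gpu"] * 25
mi_independent = mutual_information_bits(routes_ind, verdicts_ind)
print(mi_perfect, mi_independent)
```

One property worth keeping in mind when reading bias-corrected numbers: the corrected estimate is not guaranteed to obey exact identities such as I ≥ 0 or I ≤ H, so small violations (like the slightly negative value in the independent toy case) reflect the estimator, not the data.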

8 When to trust the instrument — its Applicability Domain

Third order. §7 measured how much signal the triage transports; this measures the validity boundary of the instrument itself — the Applicability Domain (a 20-year standard in QSAR/pharma cheminformatics, to our knowledge never applied to quantum-compute triage), the FalseVerify rate (the AlphaFold2 confidence pathology), and the split of remaining uncertainty into reducible vs. irreducible.

[Figure: Atlas error rate per confidence band vs AlphaFold2 FalseVerify reference; all Atlas errors at low confidence]

Self-characterisation (certified corpus)	Value	Reading
FalseVerify — HIGH-confidence and wrong	0 / 1,983 (Wilson ≤0.14%)	vs AlphaFold2 fold-switch 33.6% — confidence is not decoupled from correctness
Outside the Applicability Domain (leverage > h*)	76 / 2,517 (3.0%)	where the predictor extrapolates — flagged, not trusted blindly
Remaining uncertainty: epistemic (reducible)	67 / 534	out of domain → fixable with more corpus
Remaining uncertainty: aleatoric (irreducible)	467 / 534	in domain but fundamentally ambiguous (Leone) — never fixable

Two consequences. First, every Atlas error lives in the low-confidence, already-flagged region — the exact opposite of the AlphaFold2 pathology where a high score can be confidently wrong. Second, ~87% of the remaining uncertainty is aleatoric, not epistemic: Atlas is already near the data limit; what is left is the theoretical floor (deciding exactly is super-exponential, Leone et al. arXiv:2602.22330), not missing data. Separating these two sources of ignorance — reducible vs. irreducible — is standard in pharma/QSAR and, to our knowledge, has never been done for a quantum-simulability predictor.

Reproduce: HANDOFF_5ideas/atlas_applicability_domain.py (leverage AD, FalseVerify, epistemic/aleatoric). Data: applicability_domain.json. FalseVerify measured on the certified region; in the genuinely quantum-hard regime it is unmeasurable by construction. Epistemic/aleatoric is an operational proxy (out-of-domain vs in-domain), not a fundamental decomposition.
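The headline figures in the table above follow from the raw counts. A minimal arithmetic sketch, assuming the "Wilson ≤0.14%" bound is the one-sided 95% Wilson upper limit (the one-sided convention is stated in §5, not here); for zero observed events that limit reduces to z² / (n + z²). The helper name is hypothetical.

```python
Z_ONE_SIDED_95 = 1.6449  # standard normal quantile for a one-sided 95% bound (assumed)


def wilson_upper_zero_events(n: int, z: float = Z_ONE_SIDED_95) -> float:
    """Wilson upper limit when 0 events are observed in n trials: z^2 / (n + z^2)."""
    return z * z / (n + z * z)


false_verify_upper = wilson_upper_zero_events(1983)  # 0 / 1,983 HIGH-confidence errors
outside_domain = 76 / 2517                           # leverage > h*
aleatoric_share = 467 / (67 + 467)                   # in-domain share of remaining uncertainty

print(f"FalseVerify upper bound: {false_verify_upper:.2%}")
print(f"Outside domain:          {outside_domain:.1%}")
print(f"Aleatoric share:         {aleatoric_share:.0%}")
```

These reproduce the roughly 0.14%, 3.0% and 87% quoted in this section, so the table and the prose are consistent with each other.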

9 The cost of knowing — a proofreading curve

Biology buys precision with dissipation, in steps (Hopfield–Ninio kinetic proofreading; Landauer; the Thermodynamic Uncertainty Relation). Atlas follows the same pattern, so far without the name: each estimator is a proofreading step that costs compute and buys information about the true route. We measured the curve.

[Figure: Cumulative information about the route (bits) vs cumulative cost; diminishing efficiency after the cheap pre-check]

Proofreading step	Cost	Marginal info	Efficiency
#T + Clifford pre-check	~0.001 s	+0.107 bits	107 bits/s
MPS bond (quimb)	~0.9 s	+0.132 bits	0.15 bits/s
treewidth (cotengra greedy)	~1.5 s	+0.066 bits	0.04 bits/s

Information efficiency falls 732× after the cheap pre-check and keeps falling — a textbook diminishing-returns (proofreading) signature. Two consequences, now empirical rather than assumed: the cheapest estimator carries the most bits per second, so Atlas's early-exit ordering is the thermodynamically efficient one; and when the full chain still leaves residual route-uncertainty, paying more dissipation is wasted — MEDIUM is the optimal stopping point, not a failure. To our knowledge, the first thermodynamic-proofreading characterisation of a quantum-simulability triage.

Reproduce: HANDOFF_5ideas/atlas_proofreading.py. Data: proofreading.json. Information is exact (bits, Miller–Madow); cost is representative per-estimator wall-clock (threshold_calibration.json / scale_ceiling.json). Landauer/TUR are the lens — we quantify bits and compute-cost, not Joules.
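The efficiency column is marginal information divided by cost, so it can be recomputed from the table. This sketch uses only the rounded values shown above, which the page marks as approximate ("~"); it is not atlas_proofreading.py.

```python
# (step, cost in seconds, marginal information in bits), copied from the table above.
steps = [
    ("#T + Clifford pre-check", 0.001, 0.107),
    ("MPS bond (quimb)", 0.9, 0.132),
    ("treewidth (cotengra greedy)", 1.5, 0.066),
]

# Efficiency of each proofreading step in bits per second.
efficiency = {name: info / cost for name, cost, info in steps}

# Drop in efficiency between the cheap pre-check and the next step.
drop_after_precheck = efficiency["#T + Clifford pre-check"] / efficiency["MPS bond (quimb)"]

# Total information bought by running the full chain.
cumulative_bits = sum(info for _, _, info in steps)

for name, value in efficiency.items():
    print(f"{name:30s} {value:8.3f} bits/s")
print(f"drop after pre-check: {drop_after_precheck:.0f}x, cumulative: {cumulative_bits:.3f} bits")
```

From the rounded table values the drop comes out near 730×, in line with the reported 732× given that the costs are approximate, and the three marginal contributions sum to 0.305 bits, the joint multi-estimator value in §7.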

10 Simulability Certificate — agreement as convergent validity

The engine routes over physically independent folds (magic, complexity, locality, free-fermion, BGC), each measuring the same circuit from a different mathematical space. When ≥2 independent folds agree "cheap", that is corroboration, not a single prediction. Atlas turns that agreement into a signed, hash-stamped certificate, and the disagreement into a frontier map. Two faces, one call.

Level	Agreement	Meaning
STRONG	≥2 folds cheap, 0 hard	classically simulable — convergent validity
FIRM	classical majority, named dissenter	simulable, dissent documented
WEAK	a single fold carries it	provisional, thin evidence
SPLIT	even split	ON THE FRONTIER — the Leone-intractable region made observable (the convergence map)
NULL	no fold cheap	QPU-required

Each certificate carries a SHA-256 content hash of the circuit, the per-fold signers (axis · cost · vote), the level, and the engine's actual route — an archivable, citable audit artifact, not a black-box verdict. Soundness battery (Clifford → STRONG; as the budget tightens the level degrades STRONG → SPLIT → WEAK exactly as the independent folds begin to disagree): 0 false-STRONG — no STRONG is ever issued on a hard verdict. The disagreement names which fold dissents and in which direction — that is the convergence map, from the same call.

Reproduce: physics_magnitude_lab.certificate.certificate(n, circuit, budget_log2=30) → STRONG/FIRM/WEAK/SPLIT/NULL + signers + hash. Soundness: scripts/certificate_validate.py → certificate_validation.json. Independence is the moat — agreement among methods built on different mathematics is corroboration a single tool cannot fake.
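To make the level table concrete, here is a toy reading of it as code. It is not the `physics_magnitude_lab` implementation: the vote vocabulary ("cheap" / "hard" / "abstain"), the handling of a cheap minority, the exact hashed payload, and the names `certificate_level` and `make_certificate` are all assumptions made for illustration.

```python
import hashlib
import json

FOLDS = ("magic", "complexity", "locality", "free-fermion", "BGC")  # fold names from the text


def certificate_level(votes: dict[str, str]) -> str:
    """Map per-fold votes to a level, following one reading of the table above."""
    cheap = sum(v == "cheap" for v in votes.values())
    hard = sum(v == "hard" for v in votes.values())
    if cheap == 0:
        return "NULL"    # no fold cheap
    if hard == 0:
        return "STRONG" if cheap >= 2 else "WEAK"  # WEAK: a single fold carries it
    if cheap == hard:
        return "SPLIT"   # even split
    if cheap > hard:
        return "FIRM"    # classical majority, dissenters recorded in the votes
    return "WEAK"        # ASSUMPTION: cheap minority is not covered by the table


def make_certificate(circuit_text: str, votes: dict[str, str]) -> dict:
    """Bundle level, signers and a SHA-256 hash of the content (assumed payload layout)."""
    level = certificate_level(votes)
    payload = json.dumps(
        {"circuit": circuit_text, "votes": votes, "level": level}, sort_keys=True
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"level": level, "signers": votes, "sha256": digest}


all_cheap = {fold: "cheap" for fold in FOLDS}
cert = make_certificate("H 0\nCX 0 1", all_cheap)
print(cert["level"], cert["sha256"][:12])
```

Under this reading the "0 false-STRONG" property is structural: STRONG is only reachable when no fold votes hard, so a single hard vote is enough to block it.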

11 Cross-domain corroboration — the design is not ad-hoc

The same structure — stage cheap signals first, certify when independent witnesses agree, map the frontier when they split, and abstain honestly under irreducible uncertainty — appears independently across fields that never talk to each other. Each is an independent witness that this architecture is the right shape; their agreement is itself a STRONG certificate (convergent validity, one level up).

Domain	Everyday mirror	What it corroborates	Grounded in
Medicine	a triage nurse — routes (home/clinic/ER), doesn't cure; reads cheap signs first; "this one, see a specialist"	route-not-compute · MEDIUM as the honest call	§3 · §5
Molecular biology	kinetic proofreading (the ribosome) — staged dissipation buys fidelity	cheap pre-check first, escalate only as needed	§9
Structural biology	AlphaFold2 pLDDT — a confidence score can be confidently wrong (FalseVerify 33.6%)	Atlas's confidence is not decoupled: 0 confident-wrong	§8 · §10
Cheminformatics	Applicability Domain (QSAR) — a model fails silently outside its domain	the validity-boundary map	§8
Finance	a credit-rating agency — convergent independent signals → an archived, citable rating (AAA…D)	STRONG/FIRM/WEAK/SPLIT/NULL, hash-stamped & archivable	§10
Law	independent witnesses — strangers who agree = evidence, not hearsay; who contradict = locate the dispute	agreement certifies · disagreement maps the frontier	§10
Thermodynamics	Landauer / TUR — speed × precision × dissipation is an irreducible trade-off	the proofreading cost curve is a physical law, not an engineering limit	§9
Information theory	Shannon channel capacity — how many bits survive the noise	the triage transports 83.4% of the route signal	§7
Statistics	classifier with reject option (selective prediction) — abstention is an optimal action	MEDIUM is the (predict, abstain) pair, not a failure	§5 · §8

Honest where the bridge splits (we flag it, we don't sell it): the credit-rating mirror carries a warning — the 2008 "issuer-pays" failure means whoever submits a circuit must never pay for a favourable certificate; the rating stays independent of payment, or it is worth zero. And the ribosome mirror is STRONG on "staged fidelity" but not on "voting" — the ribosome proofreads in series, the certificate's folds vote in parallel. Naming where an analogy stops crossing is the same discipline as naming where an estimator dissents.

What's open vs. what's ours.
The engine is open (Apache 2.0): Stim, quimb, cotengra, the orchestration, the route adjudicator. What we don't ship is the calibrated intelligence — the calibration corpus, the measured-hardware datasets, and the confidence model trained on them. You can verify every result on this page without us handing over the corpus. The code is open; the measurements are the moat.