1 / 10
Compass Quality Scorecard

A real scorecard for Compass — and what it tells us about where to focus next.

All seven accuracy dimensions ran at near-full coverage on last night's sweep — the first time we've had complete data for all of them at once. This is where Compass stands today, how we got here, and where the team is heading next.

May 26, 2026 · NCTQ × Starling Strategy

Bottom Line

All seven dimensions ran last night. None hits its threshold yet. Every gap is named.

Scores below land in a tight 83–91% range — four to sixteen points off their committed thresholds. "Coverage" = the share of each dimension's defined test scenarios that ran last night. Near-100% coverage means the score reflects almost the full test set, not a small sample.

Dimension
Score
Threshold
Coverage
Tested
Gap
Filter Accuracy
91%
≥95%
27 of 27 (100%)
May 26
−4 pts
Consistency
89%
≥95%
28 of 28 (100%)
May 26
−6 pts
Selection Accuracy
87%
≥95%
74 of 77 (96%)
May 26
−8 pts
Sort Accuracy
86%
≥98%
26 of 26 (100%)
May 26
−12 pts
Citation Accuracy
84%
≥98%
36 of 37 (97%)
May 26
−14 pts
Coverage-State Labeling
83%
≥98%
25 of 26 (96%)
May 26
−15 pts
Data Fidelity
83%
≥99%
75 of 75 (100%)
May 26
−16 pts

Source: May 26 nightly sweep, 2,947 trials across the active scenario set. Thresholds come from the NCTQ Accuracy Framework Worksheet.

The Trend

From May 22 to May 26 — scores are up across the board.

The chart below shows accuracy scores on the previous broad sweep (May 22, left) versus the validating nightly we ran last night (May 26, right). Six of seven dimensions moved up. Selection moved down thirteen points — explained below the chart.

Dimension
May 22 score
May 26 score
Consistency
36%
89%
Coverage-State Labeling
43%
83%
Filter Accuracy
53%
91%
Data Fidelity
45%
83%
Citation Accuracy
63%
84%
Sort Accuracy
78%
86%
Selection Accuracy (tested harder set)
100%
87%

About the Selection drop: the May 22 sweep tested a smaller subset of cases (48) than May 26 (74). The larger May 26 set includes harder cases that the smaller May 22 sample didn't see — so the comparison isn't a regression, it's a fuller test. From now on every nightly tests the full case set, making future comparisons direct.

The Headline

Three sentences.

1.
The evaluation methodology is in a better place. The scorecard now tests ~70 scenarios instead of the 3–5 we had before. The denominators are right. The numbers reflect what's really happening.
2.
Applied retrospectively, it shows Compass has improved. On May 22, six of seven dimensions sat between 36% and 78%. On May 26, those same six are between 83% and 91%.
3.
It tells us exactly where to focus. No dimension hits its threshold yet, but every gap is named — and they're all within sixteen points. The team has a clear list of fixes for the next 24–48 hours.

For Reference

The seven dimensions, in plain language.

Each dimension answers one fair question about a Compass answer. They come from NCTQ's Accuracy Framework Worksheet — the spec that's been the reference all along.

Selection Accuracy
Did Compass pick the right districts, year, and metric for the question being asked?
Data Fidelity
Do the numbers in the answer match what NCTQ has on file? No rounding errors, no fabrications.
Coverage-State Labeling
When a district isn't in NCTQ's set, does Compass say so clearly instead of pretending it has the data?
Filter Accuracy
Did the answer honor every constraint in the question — "large urban districts only," "since 2023," etc.?
Sort Accuracy
"Highest first" actually returned highest first. Ties were broken the way the question asked.
Citation Accuracy
Every claim ties back to a real source — the right contract, the right page, the right year.
Consistency
Ask the same question twice and get the same answer twice. No coin-flips.

What's Next

A real baseline. Scoped fixes.

Today's scorecard is the first one we'll measure ourselves against. Seven well-defined gaps, all in the 4-to-16 point range. The team is working through them now; every PR from here on lands or breaks a row on this page.

7 dimensions, locked 96–100% case coverage Daily full-coverage sweep Replayable history

The next four slides are appendix material — show-the-work math, methodology notes, and the rule catalog — for anyone who wants to look under the hood.

Appendix · Optional

Show The Work

Every score, case by case.

You don't have to read this — it's here if you want to verify the math.
Dimension (May 26 sweep)
Cases
Total verdicts
Skipped¹
Evaluated
Passed
Score
Filter Accuracy
27 / 27
1,878
664
1,214
1,127
91%
Consistency
28 / 28
2,850
1,383
1,467
1,359
89%
Selection Accuracy
74 / 77
5,974
3,278
2,696
2,420
87%
Sort Accuracy
26 / 26
1,939
1,205
734
659
86%
Citation Accuracy
36 / 37
2,857
1,640
1,217
1,041
84%
Coverage-State Labeling
25 / 26
1,755
1,059
696
593
83%
Data Fidelity
75 / 75
5,257
3,385
1,872
1,588
83%

¹ Skipped = rule didn't apply to that turn (e.g. citation check on a turn with no claims). Excluded from both numerator and denominator.

Appendix · Optional

The Rules

28 active rules judging each Compass answer.

Rules are grouped by how they check.
Deterministic rules
Regex / shape checks. Fast, exact, no LLM in the loop.
Citation format, table shape, district reference resolution, data-fingerprint match against NCTQ source rows, sort-direction consistency, coverage-state label whitelist
Judge-prompt rules
LLM rubric checks. Read the answer against the case's written spec.
Did the answer match the rubric? Was the right metric chosen? Was the constraint honored? Was the framing of a ranked answer correct? Was the multi-turn context carried forward?
Span-assertion rules
Trace-level checks against the underlying execution log.
Did the system actually invoke the right tool? Did persistence happen? Did the planning step produce the expected shape?
Scenario-fit rule
One rubric judge per case — overall fit.
The catch-all: does this answer plausibly satisfy the case's full expected_output rubric, considering every criterion together? One verdict per case-trial.

Each rule writes a pass/fail/skip verdict per turn. Skipped verdicts are excluded from the score so a rule that doesn't apply on a given turn doesn't move the math.

Appendix · Optional

What We Changed Today

One framework. Locked in code. A score has to earn the right to be shown.

Today's release fixes four issues at the source. The dimensions are committed to the NCTQ-approved framework, the math is a pure function we can replay against any past sweep, and a dimension only gets a published score when it cleared the coverage bar.

Before today

  • 9 dimensions, two of which weren't from the framework
  • Rubric criteria added rapidly, often mid-week
  • Dashboard showed whichever sweep finished last — sometimes a 3-case smoke
  • 100% from 3 cases looked identical to 100% from 75
  • No reliable way to compare last week to this week

From today on

  • 7 dimensions, matching the NCTQ Accuracy Framework Worksheet
  • Rubric changes go through review, with a dated record
  • Each score sourced from the most comprehensive recent sweep
  • Coverage % shown next to every score — under 50%, no score is published
  • Daily full-coverage sweep added to nightly
Appendix · Optional

How The Comparison Works

Replay past sweeps through today's math.

Every judgment a sweep produces is stored permanently with the sweep's identifier. The scoring math — average the trials, average the cases, average the scenarios — is a pure function. To compare today and last week, we run that same function against last week's stored judgments. Same dimensions, same math, applied uniformly.

The honest caveat: the rules each answer was judged against on May 22 were a subset of today's rules. The rubric record shows when each rule was added.

compass.verdicts every judgment, every sweep tagged with sweep_run_id scoring function K=3 macro-averaging, 7-dim canonical set same input → same output May 22 scorecard (replayed today) Data Fid 45% Coverage 43% May 23 scorecard (partial run) Selection 99% Filter 97% May 26 scorecard (today) Data Fid 83% Coverage 83%
Appendix · Optional

The Honest Bit

Why past numbers haven't been comparable.

For anyone wondering why scores moved week to week before the methodology lock.
Issue 1
The dimensions kept shifting. The scorecard had drifted to nine rows including two that aren't accuracy dimensions in our framework.
Issue 2
The rules underneath kept growing. The catalog of checks each answer faces grew from 4 rules on May 17 to 28 rules today. Stricter rules pulled scores down even when Compass was improving.
Issue 3
The dashboard picked the latest sweep, not the most comprehensive one. A 3-case post-audit smoke could overwrite a 75-case validating nightly — and both would read "100%."
Issue 4
A bookkeeping bug hid five full sweeps. Last night's nightly successfully ran all seven dimensions at near-full case coverage — but five of seven sweep_runs never flipped their status flag to completed. The dashboard's "latest completed sweep" rule skipped them. Fixed today.