Compass Quality Scorecard

A real scorecard for Compass — and what it tells us about where to focus next.

All seven accuracy dimensions ran at near-full coverage on last night's sweep — the first time we've had complete data for all of them at once. This is where Compass stands today, how we got here, and where the team is heading next.

May 26, 2026 · NCTQ × Starling Strategy

Bottom Line

All seven dimensions ran last night. None hits its threshold yet. Every gap is named.

Scores below land in a tight 83–91% range — four to sixteen points off their committed thresholds. "Coverage" = the share of each dimension's defined test scenarios that ran last night. Near-100% coverage means the score reflects almost the full test set, not a small sample.

Filter Accuracy

91%

≥95%

27 of 27 (100%)

May 26

−4 pts

Consistency

89%

≥95%

28 of 28 (100%)

May 26

−6 pts

Selection Accuracy

87%

≥95%

74 of 77 (96%)

May 26

−8 pts

Sort Accuracy

86%

≥98%

26 of 26 (100%)

May 26

−12 pts

Citation Accuracy

84%

≥98%

36 of 37 (97%)

May 26

−14 pts

Coverage-State Labeling

83%

≥98%

25 of 26 (96%)

May 26

−15 pts

Data Fidelity

83%

≥99%

75 of 75 (100%)

May 26

−16 pts

Source: May 26 nightly sweep, 2,947 trials across the active scenario set. Thresholds come from the NCTQ Accuracy Framework Worksheet.

The Trend

From May 22 to May 26 — scores are up across the board.

The chart below shows accuracy scores on the previous broad sweep (May 22, left) versus the validating nightly we ran last night (May 26, right). Six of seven dimensions moved up. Selection moved down thirteen points — explained below the chart.

Dimension

May 22 score

→

May 26 score

Consistency

36%

89%

Coverage-State Labeling

43%

83%

Filter Accuracy

53%

91%

Data Fidelity

45%

83%

Citation Accuracy

63%

84%

Sort Accuracy

78%

86%

Selection Accuracy (tested harder set)

100%

87%

About the Selection drop: the May 22 sweep tested a smaller subset of cases (48) than May 26 (74). The larger May 26 set includes harder cases that the smaller May 22 sample didn't see — so the comparison isn't a regression, it's a fuller test. From now on every nightly tests the full case set, making future comparisons direct.

The Headline

Three sentences.

1.

The evaluation methodology is in a better place. The scorecard now tests ~70 scenarios instead of the 3–5 we had before. The denominators are right. The numbers reflect what's really happening.

2.

Applied retrospectively, it shows Compass has improved. On May 22, six of seven dimensions sat between 36% and 78%. On May 26, those same six are between 83% and 91%.

3.

It tells us exactly where to focus. No dimension hits its threshold yet, but every gap is named — and they're all within sixteen points. The team has a clear list of fixes for the next 24–48 hours.

For Reference

The seven dimensions, in plain language.

Each dimension answers one fair question about a Compass answer. They come from NCTQ's Accuracy Framework Worksheet — the spec that's been the reference all along.

Selection Accuracy

Did Compass pick the right districts, year, and metric for the question being asked?

Data Fidelity

Do the numbers in the answer match what NCTQ has on file? No rounding errors, no fabrications.

Coverage-State Labeling

When a district isn't in NCTQ's set, does Compass say so clearly instead of pretending it has the data?

Filter Accuracy

Did the answer honor every constraint in the question — "large urban districts only," "since 2023," etc.?

Sort Accuracy

"Highest first" actually returned highest first. Ties were broken the way the question asked.

Citation Accuracy

Every claim ties back to a real source — the right contract, the right page, the right year.

Consistency

Ask the same question twice and get the same answer twice. No coin-flips.

What's Next

A real baseline. Scoped fixes.

Today's scorecard is the first one we'll measure ourselves against. Seven well-defined gaps, all in the 4-to-16 point range. The team is working through them now; every PR from here on lands or breaks a row on this page.

7 dimensions, locked 96–100% case coverage Daily full-coverage sweep Replayable history

The next four slides are appendix material — show-the-work math, methodology notes, and the rule catalog — for anyone who wants to look under the hood.

Appendix · Optional

Show The Work

Every score, case by case.

You don't have to read this — it's here if you want to verify the math.

Filter Accuracy

27 / 27

1,878

664

1,214

1,127

91%

Consistency

28 / 28

2,850

1,383

1,467

1,359

89%

Selection Accuracy

74 / 77

5,974

3,278

2,696

2,420

87%

Sort Accuracy

26 / 26

1,939

1,205

734

659

86%

Citation Accuracy

36 / 37

2,857

1,640

1,217

1,041

84%

Coverage-State Labeling

25 / 26

1,755

1,059

696

593

83%

Data Fidelity

75 / 75

5,257

3,385

1,872

1,588

83%

¹ Skipped = rule didn't apply to that turn (e.g. citation check on a turn with no claims). Excluded from both numerator and denominator.

Appendix · Optional

The Rules

28 active rules judging each Compass answer.

Rules are grouped by how they check.

Deterministic rules

Regex / shape checks. Fast, exact, no LLM in the loop.

Citation format, table shape, district reference resolution, data-fingerprint match against NCTQ source rows, sort-direction consistency, coverage-state label whitelist

Judge-prompt rules

LLM rubric checks. Read the answer against the case's written spec.

Did the answer match the rubric? Was the right metric chosen? Was the constraint honored? Was the framing of a ranked answer correct? Was the multi-turn context carried forward?

Span-assertion rules

Trace-level checks against the underlying execution log.

Did the system actually invoke the right tool? Did persistence happen? Did the planning step produce the expected shape?

Scenario-fit rule

One rubric judge per case — overall fit.

The catch-all: does this answer plausibly satisfy the case's full expected_output rubric, considering every criterion together? One verdict per case-trial.

Each rule writes a pass/fail/skip verdict per turn. Skipped verdicts are excluded from the score so a rule that doesn't apply on a given turn doesn't move the math.

Appendix · Optional

What We Changed Today

One framework. Locked in code. A score has to earn the right to be shown.

Today's release fixes four issues at the source. The dimensions are committed to the NCTQ-approved framework, the math is a pure function we can replay against any past sweep, and a dimension only gets a published score when it cleared the coverage bar.

Before today

9 dimensions, two of which weren't from the framework
Rubric criteria added rapidly, often mid-week
Dashboard showed whichever sweep finished last — sometimes a 3-case smoke
100% from 3 cases looked identical to 100% from 75
No reliable way to compare last week to this week

→

From today on

7 dimensions, matching the NCTQ Accuracy Framework Worksheet
Rubric changes go through review, with a dated record
Each score sourced from the most comprehensive recent sweep
Coverage % shown next to every score — under 50%, no score is published
Daily full-coverage sweep added to nightly

Appendix · Optional

How The Comparison Works

Replay past sweeps through today's math.

Every judgment a sweep produces is stored permanently with the sweep's identifier. The scoring math — average the trials, average the cases, average the scenarios — is a pure function. To compare today and last week, we run that same function against last week's stored judgments. Same dimensions, same math, applied uniformly.

The honest caveat: the rules each answer was judged against on May 22 were a subset of today's rules. The rubric record shows when each rule was added.

Appendix · Optional

The Honest Bit

Why past numbers haven't been comparable.

For anyone wondering why scores moved week to week before the methodology lock.

Issue 1

The dimensions kept shifting. The scorecard had drifted to nine rows including two that aren't accuracy dimensions in our framework.

Issue 2

The rules underneath kept growing. The catalog of checks each answer faces grew from 4 rules on May 17 to 28 rules today. Stricter rules pulled scores down even when Compass was improving.

Issue 3

The dashboard picked the latest sweep, not the most comprehensive one. A 3-case post-audit smoke could overwrite a 75-case validating nightly — and both would read "100%."

Issue 4

A bookkeeping bug hid five full sweeps. Last night's nightly successfully ran all seven dimensions at near-full case coverage — but five of seven sweep_runs never flipped their status flag to completed. The dashboard's "latest completed sweep" rule skipped them. Fixed today.