All seven accuracy dimensions ran at near-full coverage on last night's sweep — the first time we've had complete data for all of them at once. This is where Compass stands today, how we got here, and where the team is heading next.
Scores below land in a tight 83–91% range — four to sixteen points off their committed thresholds. "Coverage" = the share of each dimension's defined test scenarios that ran last night. Near-100% coverage means the score reflects almost the full test set, not a small sample.
Source: May 26 nightly sweep, 2,947 trials across the active scenario set. Thresholds come from the NCTQ Accuracy Framework Worksheet.
The chart below shows accuracy scores on the previous broad sweep (May 22, left) versus the validating nightly we ran last night (May 26, right). Six of seven dimensions moved up. Selection moved down thirteen points — explained below the chart.
About the Selection drop: the May 22 sweep tested a smaller subset of cases (48) than May 26 (74). The larger May 26 set includes harder cases that the smaller May 22 sample didn't see — so the comparison isn't a regression, it's a fuller test. From now on every nightly tests the full case set, making future comparisons direct.
Each dimension answers one fair question about a Compass answer. They come from NCTQ's Accuracy Framework Worksheet — the spec that's been the reference all along.
Today's scorecard is the first one we'll measure ourselves against. Seven well-defined gaps, all in the 4-to-16 point range. The team is working through them now; every PR from here on lands or breaks a row on this page.
The next four slides are appendix material — show-the-work math, methodology notes, and the rule catalog — for anyone who wants to look under the hood.
¹ Skipped = rule didn't apply to that turn (e.g. citation check on a turn with no claims). Excluded from both numerator and denominator.
expected_output rubric, considering every
criterion together? One verdict per case-trial.
Each rule writes a pass/fail/skip verdict per turn. Skipped verdicts are excluded from the score so a rule that doesn't apply on a given turn doesn't move the math.
Today's release fixes four issues at the source. The dimensions are committed to the NCTQ-approved framework, the math is a pure function we can replay against any past sweep, and a dimension only gets a published score when it cleared the coverage bar.
Every judgment a sweep produces is stored permanently with the sweep's identifier. The scoring math — average the trials, average the cases, average the scenarios — is a pure function. To compare today and last week, we run that same function against last week's stored judgments. Same dimensions, same math, applied uniformly.
The honest caveat: the rules each answer was judged against on May 22 were a subset of today's rules. The rubric record shows when each rule was added.
completed. The dashboard's "latest
completed sweep" rule skipped them. Fixed today.