What does it mean for a sales score to be calibrated?

A score is calibrated when its stated confidence matches reality over many cases. If a system labels a set of leads 80% likely to convert, about 80 percent of them should convert. Calibration measures the agreement between the number a score claims and the outcome frequency you actually observe.

Is calibration the same as accuracy?

No. Accuracy asks whether a single prediction was right. Calibration asks whether the confidence attached to many predictions was honest. A model can rank leads well and still be badly calibrated, because ranking and honest probability are different properties. Source: Assay calibration tutorial corpus, Essay 1.

How do you measure whether a sales score is calibrated?

Bucket predictions by their stated confidence, then compare each bucket's claimed probability to its observed outcome rate. Expected calibration error summarizes that gap into one number. A score with no published calibration error is a claim, not a measurement. Source: Assay calibration tutorial corpus, Essay 3.

Is a confidence score a probability that a claim is true?

Not in Assay's substrate. A confidence score there is a trust score over the evidence chain behind a claim, capped by a per-source-type ceiling, not a probability the claim is semantically correct. A calibrated outcome score and a confidence score answer different questions. Source: Assay confidence-scoring canon.

Calibrated, not just confident: what a sales score actually has to prove

Your scoring tool labels a lead “82% likely to close.” A rep reads it, a forecast inherits it, and an AI agent turns it into a sentence in an email. Nobody in that chain asks the one question that decides whether the number is worth anything: of all the leads this system has ever called 82%, how many actually closed?

Calibration answers that question, and most B2B scoring systems cannot answer it. The Commercial Truth manifesto argues that marketing has never had the measurement infrastructure other functions take for granted. A score that reports confidence it has not earned is one of the cleanest examples of that gap.

This piece is for any operator who acts on a score — the VP of GTM reading a pipeline forecast, the positioning owner trusting a messaging-test number, the revenue leader defending it to a board. Calibration is not a statistical nicety. It is the difference between a number you can stake a decision on and a number that only looks decisive.

What calibration actually means

A score is calibrated when its stated confidence matches the world over many cases. If a system assigns 80% to a group of leads, roughly 80 percent of that group should convert. Source: Assay calibration tutorial corpus, Essay 1.

The definition is deliberately about frequency, not about any single lead. Calibration is a property you can only see across a population of predictions, by sorting them into confidence buckets and checking each bucket against its own outcome rate. The 90% bucket should convert near 90 percent and the 30% bucket near 30 percent, and when those two columns track each other, the score is honest about what it knows.
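The bucket check can be sketched in a few lines of Python. This is a minimal illustration, not Assay's implementation: the function name reliability_table, the equal-width bucketing, and the twenty sample leads are all assumptions made for the example.

```python
from collections import defaultdict

def reliability_table(scores, outcomes, n_buckets=10):
    """Group predictions into equal-width confidence buckets and return,
    per non-empty bucket, (mean stated confidence, observed rate, count)."""
    buckets = defaultdict(list)
    for score, converted in zip(scores, outcomes):
        # A score of exactly 1.0 goes into the top bucket.
        index = min(int(score * n_buckets), n_buckets - 1)
        buckets[index].append((score, converted))
    rows = []
    for index in sorted(buckets):
        members = buckets[index]
        count = len(members)
        mean_score = sum(s for s, _ in members) / count
        observed = sum(c for _, c in members) / count
        rows.append((mean_score, observed, count))
    return rows

# Hypothetical data: ten leads scored 0.9 (nine converted) and
# ten leads scored 0.3 (three converted).
scores = [0.9] * 10 + [0.3] * 10
outcomes = [1] * 9 + [0] + [1] * 3 + [0] * 7
table = reliability_table(scores, outcomes)
for mean_score, observed, count in table:
    print(f"stated {mean_score:.2f}  observed {observed:.2f}  n={count}")
```

When the stated and observed columns sit close together in every row, the score is behaving as calibrated on that sample. With only ten leads per bucket, as here, the comparison is far too noisy to act on.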

This is why a model that says “95% sure” needs to be right 95 percent of the time, or the 95 is decoration. Source: Assay calibration tutorial corpus, Essay 1.

Why calibration is not accuracy

The common mistake is to treat calibration as a synonym for accuracy or for ranking quality. They are different properties, and a system can pass one while failing the other.

Accuracy asks whether individual predictions were right. Ranking asks whether the model orders leads well: whether the ones it scored higher really do close more often than the ones it scored lower. A model can rank perfectly and still be badly calibrated, because ordering says nothing about whether the attached numbers are honest. Source: Assay calibration tutorial corpus, Essay 1.

Take a calibrated scorer and multiply every probability by 1.2. The ranking is exactly the same, but calibration is broken: every stated confidence is now a fifth higher than the rate it should describe. The leads are in the right order; the confidences are inflated. That is the failure mode most teams never test for, because the dashboard still looks sorted and sensible.
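The scaling example is easy to verify. The sketch below assumes five hypothetical leads whose starting probabilities are taken to be calibrated; the helper rank_order is invented for the illustration.

```python
def rank_order(scores):
    """Indices of the leads sorted from highest score to lowest."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical probabilities for five leads, assumed calibrated.
calibrated = [0.10, 0.25, 0.40, 0.60, 0.80]
inflated = [p * 1.2 for p in calibrated]

# Ordering survives the scaling; the stated confidences do not.
same_ranking = rank_order(calibrated) == rank_order(inflated)
overstatement = [round(hot - p, 2) for hot, p in zip(inflated, calibrated)]

print(same_ranking)
print(overstatement)
```

Any ranking metric computed on the two lists would be identical, which is why a ranking-only evaluation cannot catch this failure. With a multiplier of 1.2, any starting score above roughly 0.83 would also exceed 1.0, a further sign that the inflated numbers are no longer probabilities.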

Why this matters now

The cost of a miscalibrated score used to be a slightly optimistic forecast. Now it is a chain of optimistic claims moving at machine speed.

GTM teams run more AI agents than ever, and those agents read numbers off dashboards and emit copy from them. An agent that ingests an overconfident 82% does not soften it — it asserts it downstream, in an email, a forecast, a board slide, each step inheriting confidence the original number never earned.

This is the substrate shift. When scores feed humans alone, a miscalibration is one bad guess; when they feed a fleet of agents, a miscalibration is a systematic, repeated overstatement. Calibration is the discipline that keeps that propagation honest, and it is one axis of PIL-B2 — Calibrated: every claim carries outcome evidence, with credible intervals over point estimates and “not yet” when the data is thin. Source: Assay brand canon, PIL-B2.

The primitives calibration introduces

Taking calibration seriously forces three things into how a score is reported.

First, a reliability check becomes mandatory. You bucket predictions by stated confidence and compare each bucket’s claim to its observed rate. Expected calibration error summarizes that gap into a single number — and a confidence score shipped without one is a claim, not a measurement. Source: Assay calibration tutorial corpus, Essay 3.

Second, the credible interval replaces the bare point estimate. A calibrated system reports a band it can defend, not a single tidy figure, because honest inference shows its uncertainty rather than hiding it. This is the brand glyph for PIL-B2 — the Credible Interval Bar wherever a confidence claim appears. Source: Assay brand canon, PIL-B2.

Third, “not yet” becomes a first-class output. When the sample is too thin to support a number worth acting on, the calibrated answer is “not yet,” not a precise-looking guess. Accordingly, this essay reports no measured Assay calibration figures — those are not yet published.
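The second and third points can be sketched together: report a band for a bucket's conversion rate, and decline to report when the sample is thin. This is an illustration under stated assumptions, not Assay's method. The uniform Beta(1, 1) prior, the min_leads threshold of 30, and the Monte Carlo quantile estimate are all choices made for the example.

```python
import random

def conversion_interval(conversions, leads, min_leads=30, level=0.95,
                        draws=20000, seed=0):
    """Return a (low, high) credible interval for a bucket's conversion
    rate, or the string "not yet" when the sample is too thin.

    Assumptions (not from the source): a uniform Beta(1, 1) prior, a
    hypothetical min_leads threshold of 30, and a Monte Carlo estimate
    of the posterior quantiles.
    """
    if leads < min_leads:
        return "not yet"
    rng = random.Random(seed)
    # Posterior under a uniform prior is Beta(conversions + 1, misses + 1).
    samples = sorted(
        rng.betavariate(conversions + 1, leads - conversions + 1)
        for _ in range(draws)
    )
    tail = (1 - level) / 2
    low = samples[int(tail * draws)]
    high = samples[int((1 - tail) * draws) - 1]
    return low, high

print(conversion_interval(78, 100))   # a band around the observed 78 percent
print(conversion_interval(4, 5))      # too few leads to report a number
```

The right threshold depends on how wide a band the decision can tolerate; 30 is a placeholder, not a recommendation.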

A separate primitive worth naming: in Assay’s substrate, a confidence score is a trust score over the evidence chain, capped by a per-source-type ceiling, not a probability a claim is semantically correct. Source: Assay confidence-scoring canon. A calibrated outcome score and that confidence score answer different questions, and conflating them is its own quiet error.

A worked example

Consider two lead-scoring systems on the same 1,000 leads. The numbers that follow are illustrative, not measured Assay results.

System A sorts the leads beautifully — its high scores really do close more than its low scores — but every probability runs hot. Of the leads it calls 80%, only about 55 percent convert. The ranking is right; the confidence is a fiction, and any forecast built on the raw numbers will overshoot.

System B ranks the leads slightly worse but is calibrated: of the leads it calls 80%, about 78 percent convert, and the same holds across every bucket. Its expected calibration error is small, and it reports each score as a credible interval rather than a point. For a forecast, System B is the one you can defend, because its 80 means roughly what 80 should mean. Source: Assay calibration tutorial corpus, Essay 3.
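To make the comparison concrete, here is a sketch of expected calibration error computed from bucket summaries, using the common definition: the lead-weighted average of the gap between stated confidence and observed rate. Only the 80% rows echo the illustrative figures above. The other buckets and all lead counts are invented for the example and are not measured results.

```python
def expected_calibration_error(buckets):
    """Weighted average of |stated confidence - observed rate| over buckets.

    Each bucket is a (stated_confidence, observed_rate, lead_count) tuple.
    """
    total = sum(count for _, _, count in buckets)
    return sum(
        count / total * abs(stated - observed)
        for stated, observed, count in buckets
    )

# Illustrative buckets for 1,000 leads. Only the 80% rows echo the figures in
# the text; the other rows and all lead counts are invented for this sketch.
system_a = [(0.20, 0.10, 400), (0.50, 0.30, 300), (0.80, 0.55, 300)]
system_b = [(0.20, 0.21, 400), (0.50, 0.48, 300), (0.80, 0.78, 300)]

ece_a = expected_calibration_error(system_a)
ece_b = expected_calibration_error(system_b)
print(f"System A ECE: {ece_a:.3f}")
print(f"System B ECE: {ece_b:.3f}")
```

Under these invented buckets, System A's error comes out roughly ten times System B's, even though the story says A ranks better.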

The lesson is not that one model is smarter. It is that a confident number and a calibrated number are not the same artifact, and only one of them survives contact with a board that asks how you know.

Closes / opens

Closes the LSO §E concept question of what calibration means for a sales score: a score is calibrated when its stated confidence matches observed outcome frequency over many cases — a property distinct from accuracy or ranking, and one most scoring systems never test.

Opens the natural follow-on — once you accept calibration, how do you measure it on a vendor’s own outputs, and why does almost no scoring vendor publish an expected calibration error on the numbers they sell?

The next time a tool reports a sales score, ask not whether it feels right but whether its confidence has been checked against reality. Numbers that can be scored, banded, and verified over time are the aim of the methodology Assay is developing for the Commercial Truth Index, which is intended to measure whether the numbers a platform emits are honestly grounded.

This essay is grounded in the Assay brand canon (PIL-B2), the calibration tutorial corpus (Essay 1 and Essay 3), and the confidence-scoring concept canon. Methodology for the Commercial Truth Index is in development.