Why do sales teams stop trusting lead scores?

Because uncalibrated scores don't match reality — a batch of '90%' leads that converts at 10% teaches reps the number is noise. Calibration realigns the score to observed frequency, which is what earns the score back its credibility.

Why don't most scoring vendors publish ECE?

Reporting it requires keeping the predictions made at the time and joining them to outcomes later. Many scoring stacks assign a number from static rules and never check it against what happened, so there is no loop to calibrate against.

How is a calibrated score different for small B2B segments?

A calibrated approach shrinks small-sample estimates toward the overall rate by an amount that depends on sample size, so a 20-lead segment isn't handed a 90% score off one lucky deal. Where data is too thin, the honest output is 'not yet' rather than a confident guess.

Expected calibration error for lead scoring


Expected calibration error (ECE) measures how far a scoring model’s stated confidence sits from observed reality: bucket leads by predicted score, compare each bucket’s predicted rate to its actual conversion rate, and average the gaps. A low ECE means a “90%” lead really converts near 90% of the time. Almost no scoring vendor reports it.

Your lead-scoring model says a thousand accounts are “90% likely to convert.” Sales has quietly stopped believing it, because the last batch of 90s closed nowhere near 90% of the time. Expected calibration error is the one number that tells you whether to trust the score — and almost no scoring vendor publishes it.

The Commercial Truth manifesto argues that B2B tools have traded measurement for confident-looking numbers. A fit score is the cleanest example: a percentage that looks like a probability but was never checked against outcomes. Calibration is the check. “Confident” is a marketing word; “calibrated” is a measurement word.

What does expected calibration error measure?

Calibration is a simple promise: when a model says 90%, the thing happens about 90% of the time. Expected calibration error (ECE) is how far, on average, a model falls short of that promise.

You compute it by bucketing predictions and checking each bucket against reality. Take every lead the model scored near 80% and look at how many actually converted. If 78% did, that bucket is well calibrated; if 12% did, it is off by 68 points. ECE is the average of those absolute gaps across all the buckets, weighted by how many leads fall in each.
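The bucketing recipe can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the function name, the choice of ten equal-width buckets, and the sample data are all assumptions made for the example.

```python
from typing import List, Tuple


def expected_calibration_error(
    scores: List[float], outcomes: List[int], n_bins: int = 10
) -> Tuple[float, List[Tuple[float, float, int]]]:
    """Return (ECE, rows); each row is (mean score, observed rate, lead count)."""
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("need equal-length, non-empty scores and outcomes")

    # Per bucket: [lead count, sum of scores, conversions].
    buckets = [[0, 0.0, 0] for _ in range(n_bins)]
    for score, converted in zip(scores, outcomes):
        # A score of exactly 1.0 goes in the top bucket.
        index = min(int(score * n_bins), n_bins - 1)
        buckets[index][0] += 1
        buckets[index][1] += score
        buckets[index][2] += converted

    ece = 0.0
    rows = []
    for count, score_sum, conversions in buckets:
        if count == 0:
            continue  # empty buckets carry no weight
        predicted = score_sum / count
        observed = conversions / count
        # Weight each bucket's absolute gap by its share of all leads.
        ece += (count / len(scores)) * abs(predicted - observed)
        rows.append((predicted, observed, count))
    return ece, rows


# Illustrative data only: 50 leads scored 0.80, of which 6 converted (12%),
# and 50 leads scored 0.20, of which 10 converted (20%).
scores = [0.80] * 50 + [0.20] * 50
outcomes = [1] * 6 + [0] * 44 + [1] * 10 + [0] * 40
ece, rows = expected_calibration_error(scores, outcomes)
print(f"ECE = {ece:.2f}")  # 0.5 * 0.68 + 0.5 * 0.00 = 0.34
```

Equal-width buckets are the simplest choice. With few leads per bucket, equal-count buckets may give steadier estimates, at the cost of less intuitive bucket edges.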

A well-calibrated scorer might sit within a few points across every bucket; a badly calibrated one can be off by fifty or more in the bins that matter most — the high-confidence leads sales actually works. The headline number hides where the model lies to you; the per-bucket view shows it.
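A small sketch makes the point about the headline number concrete. The data here is invented for illustration: a large, well-calibrated low-score bucket sits next to a small high-score bucket that is badly off, and the helper name `bucket_report` is hypothetical.

```python
from typing import Dict, List, Tuple


def bucket_report(
    scores: List[float], outcomes: List[int], n_bins: int = 10
) -> Dict[int, Tuple[int, float]]:
    """Map bucket index -> (lead count, |mean score - observed rate|)."""
    counts = [0] * n_bins
    score_sums = [0.0] * n_bins
    conversions = [0] * n_bins
    for score, converted in zip(scores, outcomes):
        index = min(int(score * n_bins), n_bins - 1)
        counts[index] += 1
        score_sums[index] += score
        conversions[index] += converted

    report = {}
    for i in range(n_bins):
        if counts[i]:
            gap = abs(score_sums[i] / counts[i] - conversions[i] / counts[i])
            report[i] = (counts[i], gap)
    return report


# Illustrative data only: 900 leads scored 0.10 (90 converted, spot on)
# and 100 leads scored 0.90 (40 converted, far short of 90%).
scores = [0.10] * 900 + [0.90] * 100
outcomes = [1] * 90 + [0] * 810 + [1] * 40 + [0] * 60

report = bucket_report(scores, outcomes)
total = sum(count for count, _ in report.values())
headline = sum(count / total * gap for count, gap in report.values())
worst = max(report, key=lambda i: report[i][1])

print(f"headline ECE = {headline:.2f}")  # 0.05
print(f"worst bucket = {worst}, gap = {report[worst][1]:.2f}")  # bucket 9, gap 0.50
```

A headline of five points looks healthy, yet the bucket sales actually works is off by fifty. Reporting the per-bucket rows next to the single number is what exposes that.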

This is the difference between a confident number and a measured one. A 90% score with no ECE is a claim about confidence the vendor has not earned. The same score with a published ECE is a claim it can defend.

The credible interval, not the confidence interval

There is a second word worth fixing while we are here. When a vendor says it is “89% confident,” ask which kind of confidence it means. A frequentist confidence interval is a procedural guarantee about long-run coverage: across many repeated samples, intervals built this way contain the true rate the stated share of the time. A Bayesian credible interval is a direct statement, given the data and a prior, about where the true rate probably lies.

Marketers always meant the second and were usually handed the first. At B2B sample sizes the difference is not academic — it is the gap between an interpretable band you can act on and a procedural defense you cannot.
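For a conversion rate, a credible interval can be sketched with nothing but the standard library. The example below assumes a uniform Beta(1, 1) prior and approximates the interval by sampling from the Beta posterior; both the prior and the sampling approach are choices made for illustration, and the function name is hypothetical.

```python
import random
from typing import Tuple


def beta_credible_interval(
    conversions: int,
    leads: int,
    level: float = 0.90,
    prior_a: float = 1.0,  # assumption: uniform prior
    prior_b: float = 1.0,
    draws: int = 20000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Approximate an equal-tailed credible interval for a conversion rate."""
    rng = random.Random(seed)  # seeded so the interval is repeatable
    a = prior_a + conversions
    b = prior_b + (leads - conversions)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper


# Same observed rate (60%), very different amounts of evidence.
small = beta_credible_interval(12, 20)
large = beta_credible_interval(120, 200)
print(f"12 of 20:   {small[0]:.2f} to {small[1]:.2f}")
print(f"120 of 200: {large[0]:.2f} to {large[1]:.2f}")
```

The band for the 20-lead segment comes out much wider than the band for the 200-lead one, which is the interpretable statement an operator can act on: given this data and this prior, the true rate probably lies in here.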

Why does almost nobody report ECE?

Reporting ECE requires two things most scoring stacks do not keep: the predictions, frozen at the moment they were made, and the outcomes, joined back to them. Legacy scoring assigns a number from static rules and never revisits it against what happened. There is no loop, so there is nothing to calibrate.
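The loop itself is not complicated. A minimal sketch, assuming predictions are stored as immutable records keyed by a lead identifier and outcomes arrive later as a simple mapping (the `Prediction` type and its field names are hypothetical):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)  # frozen: a stored prediction cannot be edited afterwards
class Prediction:
    lead_id: str
    score: float
    scored_at: str  # ISO date the score was issued
    model_version: str


def join_predictions_to_outcomes(
    predictions: List[Prediction], outcomes: Dict[str, bool]
) -> Tuple[List[Tuple[float, int]], List[str]]:
    """Pair each frozen score with its outcome; list leads still unresolved."""
    resolved = []
    pending = []
    for prediction in predictions:
        if prediction.lead_id in outcomes:
            resolved.append((prediction.score, int(outcomes[prediction.lead_id])))
        else:
            pending.append(prediction.lead_id)  # no outcome yet, so not scored against
    return resolved, pending


predictions = [
    Prediction("lead-001", 0.90, "2024-01-15", "v1"),
    Prediction("lead-002", 0.30, "2024-01-15", "v1"),
    Prediction("lead-003", 0.75, "2024-01-16", "v1"),
]
outcomes = {"lead-001": False, "lead-002": False}  # lead-003 is still open

resolved, pending = join_predictions_to_outcomes(predictions, outcomes)
print(resolved)  # [(0.9, 0), (0.3, 0)]
print(pending)   # ['lead-003']
```

The `resolved` pairs are exactly the input a calibration check needs. Keeping `pending` separate matters: counting open leads as non-conversions would make the model look worse calibrated than it is.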

This is where the substrate matters. When predictions and conversion outcomes both live as typed, source-attributed nodes, the join is built in — every score can be checked against the outcome it predicted, and the score updates as evidence arrives. Calibration stops being a one-time audit and becomes a standing property.

It also becomes reproducible. Because the inputs, the model version, and the seed are logged with each calculation, a score can be re-run from frozen inputs and land in the same place — which is what turns a confidence number from a claim into evidence you can audit.
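What “logged with each calculation” might look like can be sketched as follows. The scoring step is a toy stand-in (an average of seeded draws from a Beta distribution), and the record fields and input names are assumptions for illustration, not a description of any particular product.

```python
import hashlib
import json
import random


def score_with_audit_record(inputs: dict, model_version: str, seed: int) -> dict:
    """Score a lead and return the score with everything needed to re-run it."""
    rng = random.Random(seed)  # local generator: the seed fully determines the draws
    # Toy stand-in for a stochastic model: average draws from a Beta posterior.
    a = 1 + inputs["past_wins"]
    b = 1 + inputs["past_losses"]
    draws = [rng.betavariate(a, b) for _ in range(1000)]
    score = sum(draws) / len(draws)

    # Hash a canonical serialization so the frozen inputs can be verified later.
    frozen = json.dumps(inputs, sort_keys=True)
    return {
        "inputs_sha256": hashlib.sha256(frozen.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "seed": seed,
        "score": score,
    }


inputs = {"past_wins": 3, "past_losses": 9}
first = score_with_audit_record(inputs, model_version="v1", seed=42)
rerun = score_with_audit_record(inputs, model_version="v1", seed=42)
print(first == rerun)  # True: same inputs, version, and seed give the same record
```

The hash is what lets an auditor confirm that a re-run used the same inputs as the original calculation, rather than taking that on trust.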

The honest version of a score

B2B makes this harder in a useful way. Segments are small — a vertical might have 20 conversions, not 20,000 — so a naive model will hand a 90% score to a segment on the strength of one lucky deal. A calibrated approach shrinks small-sample estimates toward the overall rate by an amount that depends on the sample size, so a thin segment is scored with appropriate skepticism.

The mechanism has a name — empirical Bayes shrinkage — but the operator-facing version is simple. A winning segment with twelve conversions is reported as winning with the skepticism its sample size deserves, not crowned on one lucky deal; the action-ready number is the shrunken one, and the raw number is the optimistic one.
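A minimal sketch of the shrinkage arithmetic, using a Beta-Binomial form. One simplification should be stated loudly: a full empirical Bayes treatment estimates the prior strength from the spread across all segments, whereas here `prior_strength` is a fixed, assumed value (20 pseudo-leads) chosen only to show the mechanics.

```python
def shrunk_rate(
    conversions: int,
    leads: int,
    overall_rate: float,
    prior_strength: float = 20.0,  # assumption: fixed here, not estimated from data
) -> float:
    """Pull a segment's raw conversion rate toward the overall rate.

    The prior acts like `prior_strength` extra leads converting at the
    overall rate, so thin segments move a lot and deep segments barely move.
    """
    if leads < 0 or not 0 <= conversions <= leads:
        raise ValueError("need 0 <= conversions <= leads")
    return (conversions + prior_strength * overall_rate) / (leads + prior_strength)


overall = 0.10  # illustrative overall conversion rate

# One lucky deal: raw rate 100%, shrunk rate (1 + 2) / (1 + 20).
print(f"{shrunk_rate(1, 1, overall):.2f}")       # 0.14

# A deep segment: raw rate 90%, shrunk rate (900 + 2) / (1000 + 20).
print(f"{shrunk_rate(900, 1000, overall):.2f}")  # 0.88
```

The amount of shrinkage falls out of the sample size on its own: the single-deal segment lands near the overall rate, while the thousand-lead segment keeps almost all of its raw rate.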

And where the data simply is not there yet, the calibrated answer is “not yet” (insufficient data) rather than a confident guess. That honesty is not a weakness in the score. It is the thing that makes sales trust the scores that do carry confidence.
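In code, the “not yet” answer is simply a refusal to emit a number below some evidence threshold. The sketch below uses a minimum lead count of 30, which is an arbitrary assumption for illustration; a real system might instead abstain when a credible interval is too wide.

```python
def report_segment(conversions: int, leads: int, min_leads: int = 30) -> str:
    """Return a conversion rate, or abstain when the sample is too thin."""
    if leads < min_leads:
        # Abstain instead of reporting a rate the data cannot support.
        return "not yet (insufficient data)"
    return f"{conversions / leads:.0%}"


print(report_segment(1, 1))     # not yet (insufficient data)
print(report_segment(30, 120))  # 25%
```

Returning a distinct value for the abstention, rather than a low score, keeps “we do not know” from being read as “this segment is bad.”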

Where this gets measured

The methodology Assay is developing for the Commercial Truth Index puts ECE at the center of its calibration axis — measured across vendors, reported with a credible interval rather than a single number, alongside a “not yet” honesty rate. The point is not to crown a winner. It is that a confidence score you cannot check is decoration, and lead scoring is too consequential to decorate.

It may also be consequential to regulators. Depending on how a score is used, it can fall within the scope of rules such as the EU AI Act, whose high-risk obligations turn on the use case, and “the model was confident” is not a defense under any such regime. A published calibration metric is part of what a defensible, evidence-based posture looks like, though that is a topic for its own essay, and for counsel, not this one.

Closes / opens

For a GTM operator, the takeaway is small and sharp. Before you let a score route a rep’s week or trigger an automated send, ask the vendor for its expected calibration error — and treat the absence of an answer as the answer.

Closes the calibration corpus’s ECE seat (LSO §F.7): the metric that decides whether a lead score is measurement or decoration. Opens the threshold question — at what calibrated confidence should an automated outbound action be allowed to fire on its own?

This essay is grounded in Assay’s calibration tutorial corpus and the Commercial Truth Index calibration axis, with pillar context from the brand canon. Worked figures are illustrative. Methodology for the Commercial Truth Index is in development.