Hierarchical Bayesian models: how to score B2B segments that don't share a sample size
Your verticals have wildly uneven sample sizes, so pooled estimates hide the small ones and per-segment estimates trust noise. Hierarchical Bayesian models are the operator's answer.
Your dashboard says the enterprise vertical responded better to the new positioning than the mid-market vertical. The enterprise segment had 60 conversions; mid-market had 800. Before you reallocate spend on that comparison, it is worth asking whether the model behind those two numbers was even allowed to treat them differently.
Most segment-level reporting answers that question badly. It either pools every segment into one estimate — and quietly buries the segment with thin data — or it estimates each segment in full isolation, treating a 60-conversion read with the same trust as an 800-conversion read. The Commercial Truth manifesto argues that marketing has never had the measurement infrastructure other functions take for granted, and uneven B2B segments are one of the sharpest places that gap shows.
This piece is for the data-literate GTM operator who slices results by vertical, segment, or motion and then has to decide what to believe. Hierarchical Bayesian inference is the method that fits the shape of your data instead of forcing your data into the method. It is not a research-paper curiosity; it is the operational answer to “my segments do not have the same sample sizes.”
Why this matters now
GTM teams run more positioning experiments than ever, and they read those experiments at the segment level because that is where the budget decisions live. The trouble is that segment sample sizes in B2B are structurally uneven: a typical positioning experiment carries 8 to 30 conversions per arm in a niche segment and many hundreds in a core one. Source: Assay calibration tutorial corpus, Essay 4.
When those numbers are read by downstream systems, the error compounds. An AI agent that reads a segment lift off a dashboard and turns it into a confident sentence does not know the read came from 60 conversions. The misread propagates at machine speed, and the thin segment’s noise gets reported as fact across every surface the agent touches.
Two naive defaults produce that noise. Complete pooling assumes every segment behaves identically, so a real difference in the enterprise vertical disappears into the average. No pooling assumes every segment is its own island, so the 60-conversion read is trusted as much as the 800-conversion one. Neither is honest about what the data can actually support.
The primitives a hierarchical model introduces
A hierarchical model sits between the two extremes. It introduces three primitives, and the first is structural: partial pooling.
Partial pooling treats your segments as related rather than identical. Each vertical gets its own estimate, but every estimate is also informed by the others through a shared parent distribution. The model assumes your verticals are draws from a common population of verticals — different, but not unrelated — and it learns how different they are from the data itself.
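One common way to write this structure down, assuming each outcome is a binary hit or miss, is a beta-binomial hierarchy. The symbols are notation introduced here for illustration, not taken from the Assay canon: $y_j$ hits out of $n_j$ outcomes in segment $j$, a segment rate $\theta_j$, a population mean $\mu$, and a concentration $\kappa$ that controls how tightly segments cluster around $\mu$.

```latex
\begin{aligned}
y_j \mid \theta_j &\sim \mathrm{Binomial}(n_j,\ \theta_j), \\
\theta_j \mid \mu, \kappa &\sim \mathrm{Beta}\bigl(\mu\kappa,\ (1-\mu)\kappa\bigr), \\
(\mu, \kappa) &\sim p(\mu, \kappa).
\end{aligned}
```

The last line is what makes the model hierarchical: $\mu$ and $\kappa$ are not fixed in advance but given their own prior and estimated from all segments together.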
The consequence is adaptive shrinkage. Each segment estimate is pulled toward the overall mean by an amount that depends on how much data that segment has.
The 800-conversion segment barely moves, because it carries its own weight. The 60-conversion segment moves a lot, because the model trusts the population more than it trusts a thin local read. Source: Assay calibration tutorial corpus, Essay 5.
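One way to see why the two segments move by such different amounts is the beta-binomial special case, where the shrunken estimate is a weighted average of the segment's raw rate and the population mean. The sketch below assumes that simplified model. The function name `shrinkage_weight` and the prior strength of 100 are hypothetical choices for illustration, not values from Assay.

```python
def shrinkage_weight(n: int, kappa: float) -> float:
    """Weight the segment's own raw rate receives in a beta-binomial model.

    n is the segment's sample size; kappa is the prior strength, which acts
    like a count of pseudo-observations contributed by the population.
    The remaining weight, 1 - n / (n + kappa), goes to the population mean.
    """
    return n / (n + kappa)


KAPPA = 100.0  # hypothetical prior strength, chosen only for illustration

for n in (60, 800):
    own = shrinkage_weight(n, KAPPA)
    print(f"n={n}: {own:.0%} own data, {1 - own:.0%} population mean")
```

Under this assumed prior strength, the 800-sample segment keeps close to nine-tenths of its own raw rate, while the 60-sample segment keeps well under half. A different `KAPPA` changes the split but not the ordering.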
The third primitive is honest uncertainty as an output. A hierarchical model reports a credible interval per segment, and a thin segment simply gets a wider band. When the band is too wide to act on, the calibrated answer is “not yet,” not a confident-looking point estimate.
This essay reports no measured Assay lift figures; those are not yet published. Source: Assay Calibration product canon.
This is the difference between a confident answer and a calibrated one. A point estimate per segment looks decisive and hides its own fragility. A credible interval per segment shows you exactly how much the data is willing to claim.
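To make the "wider band" concrete, here is a minimal sketch that compares interval widths for a thin and a thick segment with the same raw rate. It is a simplification: each segment gets an independent Beta posterior under a flat prior (an assumption made here), whereas a hierarchical fit would replace that flat prior with the learned population distribution. The function name, the 90% level, and the example counts are all hypothetical.

```python
import random


def credible_interval(hits, n, prior_a=1.0, prior_b=1.0,
                      level=0.90, draws=20_000, seed=0):
    """Approximate an equal-tailed credible interval for a rate.

    Draws from the Beta(prior_a + hits, prior_b + misses) posterior and
    reads off the empirical quantiles. The flat Beta(1, 1) default prior
    is an assumption made to keep the sketch simple.
    """
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(prior_a + hits, prior_b + n - hits)
        for _ in range(draws)
    )
    tail = (1.0 - level) / 2.0
    lo = samples[int(tail * draws)]
    hi = samples[int((1.0 - tail) * draws) - 1]
    return lo, hi


# Same raw rate (30%), very different sample sizes.
thin = credible_interval(hits=18, n=60)
thick = credible_interval(hits=240, n=800)

for label, (lo, hi) in (("n=60", thin), ("n=800", thick)):
    print(f"{label}: {lo:.1%} to {hi:.1%} (width {hi - lo:.1%})")
```

Both segments share a 30% point estimate, but the thin segment's interval is several times wider. Whether a given width is too wide to act on is a judgment call this sketch does not make.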
A worked example
Consider a lead-score calibration run across three verticals. The numbers that follow are illustrative, not measured Assay results.
Vertical A is the core motion with 800 scored outcomes. Vertical B is an established adjacent segment with 220 outcomes. Vertical C is a new healthcare push with 60 outcomes.
The raw, per-segment hit rate looks like 22% for A, 31% for B, and 44% for C — and on those raw numbers, C looks like the runaway winner worth reweighting the whole funnel toward.
Run it three ways. Complete pooling averages all three into a single rate near 25% and reports that for every vertical, which erases the genuine signal that B and C are running hotter than A. No pooling reports 22%, 31%, and 44% as if each were equally trustworthy, so the 60-outcome read for C drives a major budget decision on the thinnest evidence in the set.
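The two naive reads can be reproduced in a few lines. This sketch uses the illustrative sample sizes and raw rates from the example; the variable names are introduced here, and complete pooling is computed as the sample-size-weighted average across verticals.

```python
# Illustrative figures from the worked example: (outcomes, raw hit rate).
segments = {"A": (800, 0.22), "B": (220, 0.31), "C": (60, 0.44)}

# No pooling: each vertical keeps its own raw rate, whatever its sample size.
no_pooling = {name: rate for name, (n, rate) in segments.items()}

# Complete pooling: one rate for everyone, weighted by sample size.
total_outcomes = sum(n for n, _ in segments.values())
total_hits = sum(n * rate for n, rate in segments.values())
complete_pooling = total_hits / total_outcomes

print(f"complete pooling: {complete_pooling:.1%} for every vertical")
for name, rate in no_pooling.items():
    print(f"no pooling, {name}: {rate:.0%}")
```

Because A contributes 800 of the 1,080 outcomes, the pooled rate sits much closer to A's raw rate than to B's or C's.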
The hierarchical model does neither. It reports A near 23% with a tight credible interval, because 800 outcomes barely need help, and it reports B near 30% with a moderate band.
It pulls C down from the raw 44% toward roughly 33% with a wide credible interval: the model trusts the shared distribution more than 60 noisy outcomes, and it tells you so by widening the band rather than hiding it. Source: Assay calibration tutorial corpus, Essay 5.
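A minimal sketch of the partial-pooling step, assuming a beta-binomial model with the population parameters fixed by hand. The population mean of 0.28 and prior strength of 132 are hypothetical values picked so the output lands near the illustrative figures in this example; they are not Assay parameters, and a real hierarchical fit would infer both from the data rather than take them as given.

```python
# Illustrative figures from the worked example: (outcomes, raw hit rate).
segments = {"A": (800, 0.22), "B": (220, 0.31), "C": (60, 0.44)}

# Hypothetical population parameters, fixed by hand for this sketch.
# A real hierarchical fit would infer both from the data.
POPULATION_MEAN = 0.28   # assumed mean hit rate across verticals
PRIOR_STRENGTH = 132.0   # assumed pseudo-observations behind that mean


def partial_pool(n, raw_rate, mean=POPULATION_MEAN, strength=PRIOR_STRENGTH):
    """Beta-binomial posterior mean: raw rate shrunk toward the population."""
    return (n * raw_rate + strength * mean) / (n + strength)


shrunk = {name: partial_pool(n, rate) for name, (n, rate) in segments.items()}

for name, (n, rate) in segments.items():
    print(f"{name}: raw {rate:.0%} -> shrunk {shrunk[name]:.1%} (n={n})")
```

With these assumed parameters, A moves by less than a percentage point, B by about one, and C by roughly eleven. The same formula produces all three; only the sample size differs.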
The lesson is not that C is bad. It is that C is promising but unproven, and only the hierarchical model says both halves of that sentence.
The raw read said “winner.” The honest read said “winner-shaped, at a sample size that has not earned the claim yet.” That distinction is the whole point of reading segments through a model that respects sample size.
Where this lives in the substrate
Inside Assay, this is not a notebook a data scientist runs once. It is the inference layer of the Calibration Engine.
Variations of canonical claims deploy across instrumented surfaces, outcome events flow into a unified bus, and a nightly hierarchical Bayesian model fits credible intervals per segment. Source: Assay Calibration product canon.
The output routes back to the typed knowledge layer rather than sitting in a chart. A shrunken, calibrated segment estimate becomes a proposed canon update — never an autonomous commit — that a human approves before the winning framing propagates to every downstream surface.
The credible-interval bar, filled when the sample is informative and ghosted when it is “not yet,” is the visible signature of that honesty. Source: Assay Calibration product canon.
That closes the loop the manifesto opens. Positioning stops being a static document and becomes an evidence-updating object, and the evidence is read through a model that knows your enterprise vertical and your new healthcare segment were never going to have the same sample size.
Closes / opens
Closes the LSO §F.7 calibration-tutorial cluster’s hierarchical-modeling question: when B2B segments carry wildly uneven sample sizes, complete pooling hides the small segments and no pooling trusts their noise — partial pooling sits between the two and reports an honest credible interval for each.
Opens the next question this raises — once you accept that each segment should be pulled toward the mean, exactly how much should it be pulled, and how do you choose the prior that governs that shrinkage without baking in the answer you wanted? That is its own calibration problem, and its own essay.
The raw segment read is the optimistic version of your results. The shrunken, interval-bearing read is the version you can defend in a board review. Producing numbers that can be scored, calibrated, and checked over time is the methodology Assay is developing for the Commercial Truth Index, which is intended to measure whether the estimates a platform emits are honestly grounded.
This essay is grounded in the Assay calibration tutorial corpus (Essay 5) and the Calibration product canon. Methodology for the Commercial Truth Index is in development.