How to draw a defensible conclusion from eight conversions

June 2, 2026

B2B positioning tests rarely clear 30 conversions per arm. Here is the small-sample inference method that produces an honest answer instead of a false winner or a shrug.

Every product marketer eventually has the conversation they would rather skip: the messaging test ran for six weeks, and I still cannot tell you which version won. One arm pulled ahead, then the other did. The sample never grew large enough to settle the question, and the deadline arrived first.

This is not a discipline problem. It is a sample-size problem, and it is structural to B2B. The Commercial Truth manifesto argues that marketing has never had measurement infrastructure; the absence shows most clearly here, where the method on the shelf was built for sample sizes B2B will never reach.

The conventional answer — run an A/B test, wait for statistical significance — is the wrong tool. It was designed for consumer-scale traffic, and it quietly breaks on the data a positioning experiment actually produces.

Why the conventional answer is wrong

A typical positioning experiment in B2B yields 8 to 30 conversions per arm (Source: the Calibration tutorial corpus, Essay 4). Enterprise sales cycles are long, niche segments are small, and the conversion event you care about is rare by construction.

Frequentist significance testing is underpowered at that scale. Its guarantees are statements about a long run of repeatable trials, and at 8 conversions there is no long run to lean on. The test either returns non-significance, and the team concludes “no effect” when a real lift may be sitting there undetected, or it returns a false positive through the back door of multiple testing.

That second failure is the more dangerous one. Run a portfolio of expressions and, even with no real differences, a few will clear p < 0.05 by chance alone (Source: the Calibration tutorial corpus, Essay 10, on multiple comparisons). A naive process treats those flukes as winners and writes them into the canon.
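The arithmetic behind that claim is short enough to check by hand. The sketch below assumes the comparisons are independent and that none of the expressions has a real effect; both are simplifying assumptions for illustration, and the function name is made up for this example.

```python
# Chance that at least one of k truly-null comparisons clears p < alpha,
# assuming the comparisons are independent (an illustrative assumption).
def prob_any_false_positive(k: int, alpha: float = 0.05) -> float:
    return 1.0 - (1.0 - alpha) ** k

for k in (1, 5, 10, 20, 50):
    share = prob_any_false_positive(k)
    print(f"{k:>2} expressions: {share:.0%} chance of at least one fluke")
```

Under those assumptions, a portfolio of twenty expressions has roughly a 64% chance of producing at least one "significant" result with nothing real behind it.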

The honest summary: at B2B sample sizes, a frequentist A/B test produces noise dressed as a verdict. The method has to fit the sample size, not the other way around.

The substrate-level approach

The fix is not a bigger sample. It is a model that uses the structure already present in your data — segments, time, and prior knowledge — instead of pretending each arm is an isolated coin flip.

Hierarchical Bayesian inference does exactly this. Segments contribute information to each other without being assumed identical, a property called partial pooling (Source: the Calibration tutorial corpus, Essay 5). A vertical with 60 outcomes borrows strength from one with 800, so the small segment still gets a usable estimate.
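A minimal sketch of what "borrowing strength" does to the numbers follows. It is a simplified stand-in, not a full hierarchical fit: the pooled rate is computed directly from the data, and `PRIOR_STRENGTH` is a hypothetical fixed value that a real hierarchical model would estimate rather than assume. The segment counts are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    name: str
    conversions: int
    outcomes: int

def partially_pooled_rate(seg: Segment, pooled_rate: float, prior_strength: float) -> float:
    """Shrink a segment's raw rate toward the pooled rate.

    prior_strength behaves like a count of pseudo-outcomes observed at the
    pooled rate, so small segments move more than large ones.
    """
    return (seg.conversions + prior_strength * pooled_rate) / (seg.outcomes + prior_strength)

# Hypothetical counts, chosen to mirror the 60-versus-800 example in the text.
segments = [Segment("small vertical", 9, 60), Segment("large vertical", 64, 800)]
pooled_rate = sum(s.conversions for s in segments) / sum(s.outcomes for s in segments)
PRIOR_STRENGTH = 100  # hypothetical; a full hierarchical model would learn this

for s in segments:
    raw = s.conversions / s.outcomes
    shrunk = partially_pooled_rate(s, pooled_rate, PRIOR_STRENGTH)
    print(f"{s.name}: raw {raw:.3f}, partially pooled {shrunk:.3f}")
```

The small segment's estimate moves noticeably toward the pooled rate while the large segment's barely changes, which is the behavior partial pooling is meant to produce.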

Three behaviors follow from that. The model shares information across segments, weights recent evidence over stale evidence, and detects non-stationarity when the market moves under you (Source: the Calibration product canon, objection OBJ-02).

The output is a credible interval, not a point estimate. Conditional on the model and the prior, a Bayesian credible interval states the probability that the true lift sits inside a range, which is what the marketer always wanted, rather than the procedural long-run guarantee a frequentist interval offers (Source: the Calibration tutorial corpus, Essay 2). When the interval is too wide to act on, the honest answer is “not yet”, reported as a status rather than a fabricated number.
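To make the idea concrete, here is a sketch of a credible interval on the lift between two arms. It assumes independent Beta posteriors with a flat Beta(1, 1) prior and no pooling across segments, so it is a simplification of the hierarchical model described above. The counts and function names are hypothetical.

```python
import random

def posterior_lift_samples(conv_a, n_a, conv_b, n_b, prior=(1.0, 1.0), draws=20000, seed=0):
    """Monte Carlo draws of (rate_b - rate_a) under independent Beta posteriors."""
    rng = random.Random(seed)
    a0, b0 = prior
    return [
        rng.betavariate(a0 + conv_b, b0 + n_b - conv_b)
        - rng.betavariate(a0 + conv_a, b0 + n_a - conv_a)
        for _ in range(draws)
    ]

def credible_interval(samples, mass=0.90):
    """Equal-tailed interval holding roughly `mass` of the posterior draws."""
    ordered = sorted(samples)
    lo_idx = int((1 - mass) / 2 * len(ordered))
    hi_idx = int((1 + mass) / 2 * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]

# Hypothetical counts: arm A converts 8 of 120, arm B converts 14 of 118.
lift = posterior_lift_samples(conv_a=8, n_a=120, conv_b=14, n_b=118)
low, high = credible_interval(lift)
p_positive = sum(d > 0 for d in lift) / len(lift)
print(f"90% credible interval on lift: {low:+.3f} to {high:+.3f}")
print(f"Posterior probability that B beats A: {p_positive:.2f}")
```

The two printed lines are the kind of statement the paragraph above describes: a range, and a direct probability that the lift is positive, both conditional on the stated prior.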

Seven steps to a defensible small-sample conclusion

  1. Define the conversion event before you start. Pick the one downstream outcome that matters — a booked meeting, a stage advance, a closed deal — and freeze the definition. Changing it mid-test is how you manufacture a winner.

  2. Name your segments up front. Vertical, company size, motion. Hierarchical models pool across these strata, so the strata have to be declared before data arrives, not discovered after.

  3. Set a weakly-informative prior, and write it down. Anchor on historical industry rates rather than a flat guess; an over-informative prior biases the result, an over-vague one destabilizes small-sample inference (Source: the Calibration tutorial corpus, Essay 8). Publish the prior so a reviewer can substitute their own and re-run.

  4. Fit the hierarchical model, not a per-arm test. Let the segment with 60 outcomes borrow strength from the segment with 800. The model returns a posterior distribution per segment, with recency weighting applied.

  5. Read the credible interval, not a p-value. A 90% credible interval of +4.0% to +14.0% on reply rate, mean +9.2%, n=312, is an honest finding (illustrative figures, drawn from the Calibration credible-interval glyph; not a measured result). An interval that crosses zero means the direction of the lift is not yet established; it is not evidence that the lift is zero.

  6. Gate on width, and accept “not yet.” If the interval is too wide to support a decision, mark the result insufficient and keep collecting. A platform that says “not yet” beats one that declares a winner it cannot defend.

  7. Route the winner through review before it becomes canon. A confirmed lift is a proposal, not an automatic edit. Once a human approves it, the canonical claim updates and every downstream surface refreshes from the same source.
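Steps 3, 5, and 6 can be sketched for a single arm as follows. This is an illustration under stated assumptions, not the Calibration implementation: it skips the hierarchical fit in step 4, the prior is a Beta distribution centred on a hypothetical historical rate, and `max_width` is a threshold you would have to choose for your own decision. All function names are invented for the example.

```python
import random

def beta_prior_from_rate(historical_rate, pseudo_outcomes):
    """Weakly-informative Beta prior: centred on a historical rate,
    but worth only a small number of pseudo-outcomes."""
    return historical_rate * pseudo_outcomes, (1 - historical_rate) * pseudo_outcomes

def gated_result(conversions, outcomes, prior, max_width, mass=0.90, draws=20000, seed=0):
    """Posterior credible interval plus a status that refuses to decide when it is too wide."""
    rng = random.Random(seed)
    a, b = prior
    post = sorted(
        rng.betavariate(a + conversions, b + outcomes - conversions) for _ in range(draws)
    )
    low = post[int((1 - mass) / 2 * draws)]
    high = post[int((1 + mass) / 2 * draws) - 1]
    status = "actionable" if high - low <= max_width else "not yet"
    # The prior is returned with the result so a reviewer can swap it and re-run.
    return {"low": low, "high": high, "status": status, "prior": prior}

prior = beta_prior_from_rate(historical_rate=0.05, pseudo_outcomes=20)  # hypothetical values
early = gated_result(conversions=8, outcomes=90, prior=prior, max_width=0.05)
later = gated_result(conversions=80, outcomes=900, prior=prior, max_width=0.05)
for label, result in (("early", early), ("later", later)):
    print(f"{label}: {result['low']:.3f} to {result['high']:.3f} -> {result['status']}")
```

With eight conversions the interval is wider than the threshold and the status is "not yet"; with ten times the data at the same rate, the interval narrows enough to pass the gate.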

What good looks like

A defensible small-sample conclusion is legible to a skeptic. It states a range, the sample behind it, and the prior that shaped it — and it shrinks visibly as more outcomes arrive.

You see two numbers, not one. The raw estimate is the optimistic reading; the shrunken estimate is pulled toward the overall mean by an amount that depends on sample size, and that shrunken number is the action-ready one (Source: the Calibration tutorial corpus, Essay 6). On a small healthcare arm with 12 conversions, shrinkage is the difference between “this segment is winning” and “this segment is winning, with the appropriate skepticism for the sample we have.”

Above all, the conclusion is calibrated. A claim made at 90% should be right about 90% of the time, and a system that reports calibrated credible intervals is making a claim it can defend rather than one it merely asserts.
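One way to see what "right about 90% of the time" means is a coverage check on simulated data. The sketch below draws a true rate from the same Beta prior the model uses, simulates outcomes, and counts how often the 90% interval contains the truth. Every parameter here is a hypothetical choice for illustration, and a real calibration check would use held-out outcomes rather than simulation.

```python
import random

def coverage(n_sims=400, n_outcomes=40, prior=(2.0, 18.0), mass=0.90, draws=1000, seed=1):
    """Share of simulated experiments whose credible interval contains the true rate."""
    rng = random.Random(seed)
    a, b = prior
    hits = 0
    for _ in range(n_sims):
        true_rate = rng.betavariate(a, b)  # the truth is drawn from the prior itself
        conversions = sum(rng.random() < true_rate for _ in range(n_outcomes))
        post = sorted(
            rng.betavariate(a + conversions, b + n_outcomes - conversions)
            for _ in range(draws)
        )
        low = post[int((1 - mass) / 2 * draws)]
        high = post[int((1 + mass) / 2 * draws) - 1]
        hits += low <= true_rate <= high
    return hits / n_sims

observed = coverage()
print(f"Nominal 90% intervals contained the true rate in {observed:.0%} of simulations")
```

When the prior matches the process that generated the data, observed coverage should land near the nominal 90%. A prior that is badly wrong would show up as coverage that drifts away from it.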

Anti-patterns to avoid

Peeking and stopping when it looks good. Watching a frequentist test and calling it the moment one arm leads inflates false positives. If you want to stop early on evidence, the decision rule has to be Bayesian by design, not a frequentist test you abandoned partway through.
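The inflation is easy to reproduce in simulation. The sketch below runs A/A tests, where both arms share the same true rate, and compares two rules: test once at the end, or test after every batch and stop at the first significant look. The rate, batch size, and number of looks are hypothetical, and the test is a standard two-proportion z-test.

```python
import math
import random

def z_stat(c_a, n_a, c_b, n_b):
    """Two-proportion z statistic with a pooled standard error."""
    pooled = (c_a + c_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (c_b / n_b - c_a / n_a) / se

def false_positive_rates(rate=0.10, batch=50, looks=10, n_sims=2000, seed=2):
    """Return (rate with peeking, rate with a single final test) when no effect exists."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _sim in range(n_sims):
        c_a = c_b = n = 0
        crossed = False
        z = 0.0
        for _look in range(looks):
            c_a += sum(rng.random() < rate for _ in range(batch))
            c_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            z = z_stat(c_a, n, c_b, n)
            crossed = crossed or abs(z) > 1.96  # would have stopped at this look
        peeking_hits += crossed
        final_hits += abs(z) > 1.96  # only the last look counts
    return peeking_hits / n_sims, final_hits / n_sims

peeking, fixed = false_positive_rates()
print(f"False positives with peeking: {peeking:.1%}; with one final test: {fixed:.1%}")
```

Both arms are identical by construction, so every "winner" here is a false positive. The peeking rule declares noticeably more of them than the single final test.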

Treating non-significance as proof of no effect. An underpowered test that fails to reach significance has not shown the lift is zero. It has shown the test was too small (Source: the Calibration tutorial corpus, Essay 7, on power). The correct label is “preliminary,” not “no effect.”
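A rough power calculation shows how small "too small" is. The sketch uses a normal approximation to a two-sided two-proportion z-test, which is itself shaky at these counts, so treat the output as an order-of-magnitude guide. The baseline rate, lift, and sample sizes are hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(p_control, p_variant, n_per_arm, alpha=0.05):
    """Normal-approximation power of a two-sided two-proportion z-test."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    se = math.sqrt(
        p_control * (1 - p_control) / n_per_arm
        + p_variant * (1 - p_variant) / n_per_arm
    )
    shift = abs(p_variant - p_control) / se
    # Probability the statistic lands beyond either critical value.
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)

# Hypothetical scenario: 5% baseline, a real lift to 7.5%.
for n in (160, 1600):
    print(f"n = {n} per arm: power about {approx_power(0.05, 0.075, n):.0%}")
```

At 160 outcomes per arm, which is about eight control conversions at this baseline, the test would miss this real lift most of the time. Non-significance in that setting says more about the test than about the effect.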

Testing fifty expressions and crowning the top one. Without a false-discovery correction, the apparent winner from a large family is frequently a fluke. This is how a canon fills with false-positive claims it will later have to retract — the slow erosion of trust that follows a wrong claim shipped with confidence.
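One common false-discovery correction is the Benjamini-Hochberg procedure; it is offered here as an example of the category, not as the method the source prescribes. The p-values below are invented for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of hypotheses that survive the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its scaled threshold.
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Hypothetical p-values for eight expressions.
p_values = [0.001, 0.012, 0.03, 0.04, 0.045, 0.2, 0.5, 0.8]
naive = [i for i, p in enumerate(p_values) if p < 0.05]
corrected = benjamini_hochberg(p_values)
print(f"Naive winners: {naive}")
print(f"After correction: {corrected}")
```

On these made-up inputs, five expressions clear the naive p < 0.05 bar but only two survive the correction, which is the gap between a canon of flukes and a canon you can defend.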

Hiding the prior. A Bayesian result without its prior published is not auditable. The honest version exposes the prior alongside the posterior so the inference can be re-run.

Closes / opens

Closes the LSO §F.7 calibration-tutorial cluster question of how to reason from the small samples B2B positioning actually produces — the answer is a hierarchical Bayesian model that pools across segments and reports credible intervals, not a frequentist A/B test that was built for traffic you do not have.

Opens the operational question downstream: once a small-sample winner is confirmed, how does that single claim propagate across every rep, document, and AI agent without drifting? That is the substrate problem — and the methodology Assay is developing for the Commercial Truth Index measures whether what those agents emit stays grounded, calibrated, and coherent as the canon updates.

Small samples are a feature of B2B, not a flaw to engineer around. The method has to fit the sample size you have — and at eight conversions, the honest method is the one that reports a range, not a verdict.

This essay is grounded in the Calibration tutorial corpus and the Calibration product canon. Methodology for the Commercial Truth Index is in development.

Frequently Asked Questions

Can you draw a conclusion from only eight conversions?
Yes, but not with a naive A/B test. Frequentist significance testing is underpowered at that scale and returns either non-significance or a false positive. A hierarchical Bayesian model that pools information across segments produces a credible interval, an honest range, rather than a binary verdict it has not earned.

Why do frequentist methods fail on small B2B samples?
Frequentist tests assume many repeatable trials. B2B positioning experiments run on 8 to 30 conversions per arm, where p-values are unstable and multiple testing manufactures false winners (Source: the Calibration tutorial corpus, Essay 4).

What does a hierarchical Bayesian model add?
It lets segments share information without being treated as identical. A vertical with 60 outcomes borrows strength from one with 800, recency is weighted, and non-stationarity is detected. The output is a credible interval per segment, not a single number that ignores sample size.

When should the system just say “not yet”?
When the posterior is too wide to act on. A calibrated platform gates the result as insufficient rather than declaring a winner. “Not yet” is a first-class output, not a failure.