Holdout Test

Direct definition: A holdout test withholds a campaign or journey treatment from a randomly chosen share of an eligible audience so you can compare outcomes with a group that keeps receiving business as usual. The gap between exposed and held-out performance is your best practical read on incremental lift inside a CRM program, separate from what attribution dashboards imply.

Why this matters

Lifecycle teams live inside tools that report opens, clicks, and attributed conversions. Those signals show correlation. They do not prove that an email or push caused someone to purchase or retain when they would have done the same thing without the message. That gap matters when you scale send volume, add aggressive promos, or argue for budget based on channel reports.

Holdouts sit in the same family as incrementality thinking. They trade perfect coverage for a cleaner counterfactual: what happens to people who do not get the treatment under the same time window and eligibility rules. For always-on flows, a holdout is sometimes the only honest answer to whether a journey pays for itself after creative fatigue, deliverability drag, and audience overlap.

They also calibrate other measurement. If attribution says a flow drives huge value but holdouts show flat lift, you stop trusting the dashboard story and debug identity, tracking windows, or cannibalization before you pour more traffic into the same segment.

How it works in practice

Start by defining the population and success metric. Examples: subscribers dormant for the past 90 days, users in their renewal window, or people who hit a product milestone. The metric might be purchase rate, renewal rate, reactivation, or revenue per user across a fixed window after the test starts.

Split the eligible group into a test split that receives the journey or campaign and a holdout split that does not. Random assignment matters. If you skew holdouts toward low-value users, lift looks artificially high. If you cherry-pick geographies, seasonality will lie to you.
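One way to make that split both random and reproducible is to hash each user ID with a per-test salt, so every system that checks a user gets the same answer without sharing state. A minimal sketch, where the salt string and the 15% holdout share are assumptions you would set per test:

```python
import hashlib

def assign_arm(user_id: str, holdout_pct: float = 0.15, salt: str = "winback-test") -> str:
    """Deterministically assign a user to 'holdout' or 'treatment'.

    Hashing the ID with a per-test salt gives a stable, effectively random
    split that the ESP, CRM, and warehouse can all reproduce independently.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "holdout" if bucket < holdout_pct else "treatment"

print(assign_arm("user-123"))  # same answer on every run and every system
```

Because assignment is a pure function of the ID and salt, re-running a sync never reshuffles arms, which removes one common source of contamination.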

Run the experience long enough for the metric to move. Short windows favor noise. Long windows invite overlap from other campaigns, so document what else is live and whether promos hit both groups.

Compare performance with the same time zone and attribution cut. If your product records purchases and your ESP records sends, reconcile on person or account ID, not on exports done on different days. If you use a warehouse layer, agree whether the metric is gross revenue, net of discounts, or margin after refunds.
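A minimal sketch of that reconciliation, assuming an assignment export and a warehouse purchase extract keyed on the same person ID; all IDs, dates, and the 28-day window here are hypothetical:

```python
from datetime import date, timedelta

# Hypothetical extracts keyed on the same person ID.
assignments = {"u1": "treatment", "u2": "holdout", "u3": "treatment", "u4": "holdout"}
purchases = [  # (person_id, purchase_date, net_revenue)
    ("u1", date(2024, 5, 3), 40.0),
    ("u3", date(2024, 7, 1), 25.0),   # outside the window: excluded
    ("u4", date(2024, 5, 10), 18.0),
]

test_start = date(2024, 5, 1)
window = timedelta(days=28)

# Count a conversion only inside the agreed window, so both arms are
# measured on the same cut regardless of when each export ran.
converted = {
    uid for uid, day, _ in purchases
    if uid in assignments and test_start <= day < test_start + window
}

for arm in ("treatment", "holdout"):
    n = sum(1 for a in assignments.values() if a == arm)
    wins = sum(1 for uid in converted if assignments[uid] == arm)
    print(arm, f"{wins}/{n}")
```

The same join is where the revenue-definition decision lands: swap `net_revenue` for gross or margin-after-refunds once the team has agreed which one the test reports.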

Decide in advance how you treat holdouts after the read. Some teams return them to the program, some keep a long-term control for drift. The wrong move is to silently merge holdouts into the same stream without tracking that you broke the comparison.

Common mistakes

  • Tiny holdouts. A 1% holdout on a rare event produces swingy numbers. You need enough volume for the metric to stabilize.
  • Contamination. The same user gets the treatment through another channel or a manual send while labeled as holdout. Instrument routing so holdout status is enforced across ESP, CRM, and ad audience syncs.
  • Confusing holdout with A/B. A/B compares creative A versus B. Holdout compares treatment versus absence. If you run both, know which question you answer first.
  • Ignoring interference. Heavy paid media or support outreach moves both arms. Note external shocks in the test log.
  • Peeking once and declaring victory. One clean read beats six anxious screenshots in Slack. Agree on minimum runtime and sample rules up front.

Example

A subscription brand suspects its win-back flow trains people to wait for coupons. They hold out 15% of lapsed users who meet the same dormancy window. After four weeks, exposed users renew at 6.2% and holdout users renew at 5.1%. The incremental lift is about one percentage point on that base, not the gap between exposed users and a global average from last quarter. Finance can compare that lift with coupon margin and message cost before changing the offer ladder.
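The same arithmetic, with hypothetical counts consistent with those rates (100,000 lapsed users, 15% held out), plus a quick two-proportion check that the gap is bigger than noise would produce:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical counts matching the example rates; plug in your own.
n_exposed, conv_exposed = 85_000, 5_270   # 6.2% renew
n_holdout, conv_holdout = 15_000, 765     # 5.1% renew

p1, p2 = conv_exposed / n_exposed, conv_holdout / n_holdout
lift = p1 - p2  # absolute incremental lift

# Two-proportion z-test on the pooled rate.
pooled = (conv_exposed + conv_holdout) / (n_exposed + n_holdout)
se = sqrt(pooled * (1 - pooled) * (1 / n_exposed + 1 / n_holdout))
z = lift / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"lift = {lift:.3%}, z = {z:.2f}, p = {p_value:.2g}")
```

At these volumes the 1.1-point gap clears significance comfortably; at a tenth of the volume the same rates would not, which is the sizing argument in numeric form.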

When holdouts are ethical and when they need extra care

Most lifecycle tests carry low risk, but medical reminders, fraud alerts, or safety notices should never sit behind a holdout without explicit review. If withholding communication could harm users or break regulation, choose a different measurement strategy such as stepped rollout or synthetic control comparisons.

For long sales cycles, align test windows with business reality. A four-week read on enterprise renewal might miss effects that show at day 120. Either extend the window or pair with pipeline metrics that update sooner. Document external shocks like competitor pricing moves so results stay interpretable when leadership replays the quarter.

Close the loop with tooling. Tag holdout IDs in your ESP and CRM, block sneaky manual sends that bypass flags, and reconcile nightly so drift does not accumulate. When tests end, archive the cohorts so analysts revisiting the results six months later still see who was excluded and why.
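One way to make the nightly reconciliation concrete is a check that flags any held-out ID appearing in the send log. A sketch under assumed data shapes (the IDs and message names are invented):

```python
def find_contamination(holdout_ids: set, send_log: list) -> dict:
    """Return held-out users who received a send anyway, with the offending sends."""
    hits = {}
    for uid, message in send_log:
        if uid in holdout_ids:
            hits.setdefault(uid, []).append(message)
    return hits

holdout_ids = {"u2", "u5"}
send_log = [("u1", "winback_email"), ("u5", "manual_promo"), ("u2", "winback_email")]
print(find_contamination(holdout_ids, send_log))
```

Wiring this into a nightly job that alerts on a non-empty result catches contamination while the test log can still explain it.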

Related terms

This topic connects to marketing incrementality, marketing attribution, and cohort analysis when you slice results by signup source or product tier.

FAQ

How large should a holdout group be?

It depends on baseline conversion and the lift you need to detect. Teams often start around 10% to 20% for journey tests where monthly volume is meaningful. If leadership cannot stomach leaving money on the table, shrink the holdout but extend the window so the estimate still firms up.
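Absolute volume matters more than the percentage, and a standard two-proportion power approximation gives a rough per-arm floor. A sketch, not a substitute for a proper power analysis; the 5% baseline and 1-point detectable lift are example inputs:

```python
from statistics import NormalDist

def holdout_size(base_rate: float, min_lift: float,
                 alpha: float = 0.05, power: float = 0.8) -> int:
    """Rough per-arm sample size to detect an absolute `min_lift` over
    `base_rate` with a two-sided test; treat as an order-of-magnitude guide."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p1, p2 = base_rate, base_rate + min_lift
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_a + z_b) ** 2 * var / min_lift ** 2) + 1

# e.g. 5% baseline renewal, want to detect a 1-point absolute lift
print(holdout_size(0.05, 0.01))  # roughly eight thousand users per arm
```

Halving the detectable lift roughly quadruples the required size, which is why small holdouts on rare events stay noisy no matter how long you wait.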

How is this different from A/B testing?

A/B testing asks which of two treatments wins. Holdout testing asks whether any treatment beats no treatment. You need both at different stages of program maturity.

What to do next

Pick one high-cost flow with enough volume to support a holdout, write down eligibility rules and the metric, then run the test before you redesign creative. Use the CRM Implementation Checklist 2026 for routing and QA discipline, and the CRM Implementation Playbook 2025 for how journeys fit into the wider rollout.

For economics on promo-heavy flows, sanity-check payback with the CAC Payback Calculator. If you want execution help across data and lifecycle, see the CRM Implementation service.

Need a measurement setup that matches your stack?

Holdouts only work when routing and identity are clean.

Explore CRM Implementation