A/B Testing From Scratch: Statistics, Implementation, and Knowing When to Stop

Most developers implement A/B testing incorrectly. They run a test, check the dashboard daily, stop the moment the numbers look green, and declare a winner. This workflow pushes the false positive rate far above the 5% most teams believe they are accepting. When the “winning” feature is fully shipped, the promised conversion lift vanishes, and the team slowly loses faith in experimentation altogether.

To run valid experiments, you do not need a degree in statistics. You just need to understand five core concepts and how they prevent common pitfalls like underpowered tests, early-stopping bias, and sample ratio mismatch.

The Five Foundational Concepts

1. Null Hypothesis

Every A/B test starts with the assumption that your new variant makes zero difference. This default assumption is the null hypothesis. Your goal is to gather enough user data to reject this assumption. A successful test doesn’t mathematically prove your variant is superior; it shows that a difference as large as the one you observed would be unlikely if the variant truly had no effect.

2. Statistical Significance and P-Value

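The p-value measures the probability of seeing results at least as extreme as yours if the new variant actually had zero impact. A p-value of 0.05 means that, in a world where the variant does nothing, a difference this large would show up about 5% of the time. It is not the probability that your observed difference is random noise, and it is not the probability that the variant works.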

The standard significance threshold is set at p < 0.05 (accepting a 5% false positive rate). For higher-stakes changes, you might target a stricter threshold like p < 0.01. Choosing this threshold is a business decision, not a scientific law - but keep in mind that a stricter threshold requires more data: moving from 0.05 to 0.01 at 80% power needs roughly 50% more users.

3. Statistical Power

While significance controls for false positives, statistical power protects against false negatives. Power is the probability that your test will detect a real change when one actually exists.

The industry standard is 80% power. If your test is underpowered, it might yield a non-significant result (like p = 0.08) even if your variant is better, simply because you didn’t gather enough data to lift the signal out of the background noise. Running underpowered tests means wasting time rejecting good ideas.
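To make the idea concrete, here is a minimal sketch of an approximate power calculation for two conversion rates. It assumes a normal approximation, equal group sizes, and a two-sided significance level of 0.05. The function names `normalCdf` and `approxPower` are illustrative and do not come from any library.

```javascript
// Standard normal CDF via the Abramowitz-Stegun erf approximation
// (absolute error around 1e-7, plenty for planning purposes).
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Approximate power to detect a shift from rate p1 to rate p2 with
// nPerVariant users in each group. Assumes two-sided alpha = 0.05
// (critical z = 1.959964) and ignores the negligible opposite tail.
function approxPower(p1, p2, nPerVariant, zAlpha = 1.959964) {
  const se = Math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / nPerVariant);
  return normalCdf(Math.abs(p2 - p1) / se - zAlpha);
}

console.log(approxPower(0.03, 0.04, 1000).toFixed(2)); // well below 0.80: underpowered
console.log(approxPower(0.03, 0.04, 5300).toFixed(2)); // close to the 0.80 target
```

With only 1,000 users per variant, a real lift from 3% to 4% would be missed most of the time, which is exactly the "rejecting good ideas" failure described above.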

4. Minimum Detectable Effect

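Before launching an experiment, determine the smallest improvement that justifies the effort of shipping the change. This is the Minimum Detectable Effect (MDE). If your checkout page converts at 3%, a lift of 0.05 percentage points (3.00% to 3.05%) is probably too small to matter to your business. Setting an MDE of 0.5 or 1 percentage point keeps the experiment grounded. Always state whether an MDE is absolute (percentage points) or relative (percent of the baseline); the two differ by a large factor at low conversion rates.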

Crucially, your MDE dictates your sample size. Trying to detect tiny, fractional improvements requires massive volumes of traffic; setting a realistic MDE ensures you aren’t chasing ghosts.

5. Sample Size

Calculate your target sample size before you start the experiment, never during or after. This figure depends on your baseline conversion rate, MDE, significance level, and statistical power.

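Instead of writing complex math from scratch, use a reliable tool like Evan Miller’s sample size calculator. Keep the fundamental rule in mind: required traffic grows with the inverse square of the effect, so halving the lift you want to detect roughly quadruples the sample. For instance, at a two-sided significance level of 0.05 and 80% power, detecting a 0.1 percentage point lift on a 3% baseline (3.0% to 3.1%) requires roughly 465,000 users per variant under the standard normal-approximation formula. Detecting a 1.0 percentage point lift (3% to 4%) requires only about 5,300 users per variant. Exact figures vary slightly between calculators.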
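For intuition about what such a calculator is doing, here is a minimal sketch of the textbook normal-approximation formula for two proportions. It assumes an absolute MDE, equal group sizes, a two-sided significance level of 0.05, and 80% power; the z-values are hard-coded for those settings, and the function name `sampleSizePerVariant` is illustrative. Dedicated calculators use slightly different variance terms, so their output will not match this exactly.

```javascript
// Approximate users needed per variant to detect an absolute lift of `mde`
// over `baseline`. Defaults assume two-sided alpha = 0.05 (z = 1.959964)
// and 80% power (z = 0.841621); other settings need other z-values.
function sampleSizePerVariant(baseline, mde, zAlpha = 1.959964, zBeta = 0.841621) {
  const p1 = baseline;
  const p2 = baseline + mde;
  // Sum of the Bernoulli variances of the two groups.
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (mde * mde));
}

console.log(sampleSizePerVariant(0.03, 0.01));  // 3% -> 4%
console.log(sampleSizePerVariant(0.03, 0.005)); // 3% -> 3.5%: roughly four times the traffic
console.log(sampleSizePerVariant(0.03, 0.001)); // 3% -> 3.1%: hundreds of thousands per variant
```

The `mde * mde` term in the denominator is what makes small effects so expensive to detect.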

Assignment Mechanisms

How you partition traffic determines whether the experiment remains clean.

User-Level Assignment

For logged-in users, assign them to a variant once and keep them there. Instead of storing these assignments in a database lookup table, use deterministic hashing. By hashing the user ID and experiment ID together, the user is mapped to the same variant every time they visit.

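const crypto = require('node:crypto');
// Hash user and experiment together so each experiment gets its own independent split.
const hash = crypto.createHash('md5').update(`${userId}:${experimentId}`).digest('hex');
const variant = parseInt(hash.substring(0, 8), 16) % 2 === 0 ? 'control' : 'variant';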

This stateless approach is fast, reliable, and ensures consistent user experiences across devices.

Session-Level Assignment

Some tests only require consistency within a single session - like testing minor layout tweaks for anonymous landing page visitors. However, avoid session-level splits for core product features, pricing experiments, or onboarding flows. A user seeing two different prices or layouts across two visits destroys their trust and compromises your data.

For anonymous traffic, store a persistent experiment cookie on the client’s browser. If the cookie is missing when they land, make the variant decision, store it in the cookie with a long expiry (such as 30 days), and use that value for subsequent requests. Keep in mind that users clearing their cookies will trigger reassignment, introducing a minor margin of error.
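The cookie flow described above can be sketched as a pure function that takes the incoming `Cookie` header and returns the variant plus an optional `Set-Cookie` value. The cookie name prefix `exp_`, the helper names, and the injectable `random` parameter are assumptions made for illustration; a real service would plug this into its own request and response handling.

```javascript
const COOKIE_MAX_AGE_SECONDS = 30 * 24 * 60 * 60; // 30 days, as in the text

// Parse a raw Cookie header ("a=1; b=2") into a plain object.
function parseCookies(header = '') {
  const cookies = {};
  for (const pair of header.split(';')) {
    const idx = pair.indexOf('=');
    if (idx === -1) continue; // skip empty or malformed fragments
    cookies[pair.slice(0, idx).trim()] = decodeURIComponent(pair.slice(idx + 1).trim());
  }
  return cookies;
}

// Return the visitor's variant, plus a Set-Cookie value when a new
// assignment was made. `random` is injectable so the logic can be tested.
function getOrAssignVariant(cookieHeader, experimentId, random = Math.random) {
  const cookieName = `exp_${experimentId}`;
  const existing = parseCookies(cookieHeader)[cookieName];
  if (existing === 'control' || existing === 'variant') {
    return { variant: existing, setCookie: null }; // returning visitor: reuse
  }
  const variant = random() < 0.5 ? 'control' : 'variant';
  const setCookie =
    `${cookieName}=${variant}; Max-Age=${COOKIE_MAX_AGE_SECONDS}; Path=/; SameSite=Lax`;
  return { variant, setCookie };
}

const firstVisit = getOrAssignVariant('', 'landing_hero');
console.log(firstVisit.variant, firstVisit.setCookie);
```

Validating the stored value against the known variant names means a tampered or stale cookie simply triggers a fresh assignment instead of leaking an unknown variant into the analysis.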

The Peeking Problem

Checking a live experiment dashboard and stopping early when the numbers look promising is the easiest way to ruin a test.

This is known as the peeking problem. Every time you check the results and decide whether to stop or continue, you introduce a new opportunity for a random fluctuation to look like a true effect. If you peek five times at evenly spaced points and stop at the first significant result, your actual false positive rate rises from 5% to roughly 14%; with ten peeks it approaches 20%.

For a clear mathematical explanation of why this happens, read Evan Miller’s breakdown of early stopping, “How Not To Run an A/B Test.”
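The inflation is easy to reproduce. The sketch below simulates A/A tests, where both arms share the same true conversion rate, so every "significant" result is a false positive. It compares stopping at the first look that crosses |z| > 1.96 against judging only the final look. The 10% rate, run count, look schedule, and the small seeded generator are arbitrary choices made for the illustration.

```javascript
// Small seeded PRNG (mulberry32) so the simulation is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pooled two-proportion z statistic; returns 0 when there is no variance yet.
function zStat(convA, convB, n) {
  const pooled = (convA + convB) / (2 * n);
  const se = Math.sqrt(pooled * (1 - pooled) * (2 / n));
  return se === 0 ? 0 : (convB / n - convA / n) / se;
}

// Run many A/A tests and count false positives under both stopping rules.
function simulatePeeking({ runs = 2000, looks = 5, usersPerLook = 200, rate = 0.1, seed = 42 } = {}) {
  const rand = mulberry32(seed);
  let peekingHits = 0;
  let finalHits = 0;
  for (let r = 0; r < runs; r++) {
    let convA = 0, convB = 0, n = 0, crossed = false, z = 0;
    for (let look = 0; look < looks; look++) {
      for (let i = 0; i < usersPerLook; i++) {
        if (rand() < rate) convA++;
        if (rand() < rate) convB++;
      }
      n += usersPerLook;
      z = zStat(convA, convB, n);
      if (Math.abs(z) > 1.96) crossed = true; // a peeker would have stopped here
    }
    if (crossed) peekingHits++;
    if (Math.abs(z) > 1.96) finalHits++; // only the last look counts
  }
  return { peekingRate: peekingHits / runs, finalOnlyRate: finalHits / runs };
}

console.log(simulatePeeking());
```

The final-look rate should sit near the nominal 5%, while the stop-at-first-crossing rate lands noticeably higher, even though nothing about the two arms differs.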

To avoid the peeking trap:

Calculate your target sample size before starting.
Wait until you have collected the full volume of traffic.
Check the results exactly once.

If your product requires early-stopping capabilities (for example, to abort a disastrous UX variant), use a sequential testing method such as the Sequential Probability Ratio Test (SPRT) or a group-sequential design with alpha spending. These methods are built for repeated looks: they set their stopping boundaries so that the overall false positive rate stays at your target no matter when you stop.

Sample Ratio Mismatch

A Sample Ratio Mismatch (SRM) occurs when your actual traffic split differs significantly from your intended split. If you set a 50/50 split but end up with 52% control and 48% variant across tens of thousands of users, a gap that large is almost never chance, and your results cannot be trusted. This mismatch points to a fundamental flaw in assignment or tracking.

Before analyzing your results, run a chi-square goodness-of-fit test. If the counts deviate from your expected allocation with a significance of p < 0.05 (a chi-square value above 3.84 for one degree of freedom), stop and debug. Common causes of SRM include CDN caching bypassing variant assignment, redirect failures, bots triggering event tracking but not assignment, or asymmetric telemetry loss.
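A minimal sketch of that check for a two-arm experiment follows. The function name `checkSrm` and its return shape are illustrative; the default critical value of 3.841 corresponds to the p < 0.05 threshold with one degree of freedom mentioned above.

```javascript
// Chi-square goodness-of-fit test for a two-arm split (1 degree of freedom).
// Compares observed counts against the counts the intended split predicts.
function checkSrm(controlCount, variantCount, expectedControlShare = 0.5, critical = 3.841) {
  const total = controlCount + variantCount;
  const expControl = total * expectedControlShare;
  const expVariant = total - expControl;
  const chiSquare =
    (controlCount - expControl) ** 2 / expControl +
    (variantCount - expVariant) ** 2 / expVariant;
  return { chiSquare, mismatch: chiSquare > critical };
}

console.log(checkSrm(5200, 4800)); // 52/48 on 10,000 users: chiSquare = 16, mismatch
console.log(checkSrm(52, 48));     // same ratio on 100 users: chiSquare = 0.16, no mismatch
```

The two calls show why the raw percentages are not enough: the same 52/48 ratio is unremarkable on 100 users and a clear red flag on 10,000.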

Analyzing Results

Once your sample size is complete and you have ruled out SRM, run a two-proportion z-test to calculate your p-value and confidence intervals.

When interpreting results, look past the headline p-value and study the confidence interval. If your test shows statistical significance, but the confidence interval for the conversion lift runs from 0.02 to 0.5 percentage points, the true lift could sit at the low end of that range and be negligible for the business. A statistically significant result is not automatically a practically significant one.
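As a sketch of that analysis, the function below runs a two-proportion z-test and returns the p-value together with a 95% confidence interval for the absolute lift. It assumes a normal approximation, uses the pooled standard error for the test and the unpooled one for the interval, and relies on an approximate normal CDF. All names and the example counts are made up for illustration.

```javascript
// Standard normal CDF via the Abramowitz-Stegun erf approximation.
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Two-proportion z-test: pooled standard error for the p-value,
// unpooled standard error for the 95% confidence interval on the lift.
function twoProportionZTest(convControl, nControl, convVariant, nVariant) {
  const pC = convControl / nControl;
  const pV = convVariant / nVariant;
  const pooled = (convControl + convVariant) / (nControl + nVariant);
  const sePooled = Math.sqrt(pooled * (1 - pooled) * (1 / nControl + 1 / nVariant));
  const z = (pV - pC) / sePooled;
  const pValue = 2 * (1 - normalCdf(Math.abs(z))); // two-sided
  const seDiff = Math.sqrt(pC * (1 - pC) / nControl + pV * (1 - pV) / nVariant);
  const lift = pV - pC; // absolute difference in conversion rate
  return { lift, z, pValue, ci95: [lift - 1.959964 * seDiff, lift + 1.959964 * seDiff] };
}

// Hypothetical counts: 300 of 10,000 control users converted, 360 of 10,000 variant users.
console.log(twoProportionZTest(300, 10000, 360, 10000));
```

In this made-up example the result is significant at the 0.05 level, yet the interval stretches from about 0.1 to about 1.1 percentage points, so the decision still depends on whether the low end of that range would be worth shipping.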

Guardrail Metrics

Never improve a single metric in a vacuum. Every test needs guardrail metrics - telemetry that must remain stable. For example, if you redesign a checkout flow to increase conversion rates (primary), you must track page load times, error rates, and support ticket submissions (guardrails). A 1% conversion lift is not worth a 5% surge in server errors.
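One simple way to make guardrails explicit is to give each one a tolerance and check it before looking at the primary metric. The sketch below assumes every guardrail is a "higher is worse" metric and compares raw relative changes; the metric names, values, and tolerances are hypothetical, and a production check would also test each change for statistical significance.

```javascript
// Each guardrail carries a tolerance: the largest relative increase
// over control that is still acceptable. All values are hypothetical.
const guardrails = [
  { name: 'error_rate', control: 0.010, variant: 0.0105, maxRelativeIncrease: 0.02 },
  { name: 'p95_load_ms', control: 1200, variant: 1210, maxRelativeIncrease: 0.05 },
  { name: 'support_tickets', control: 40, variant: 41, maxRelativeIncrease: 0.10 },
];

// Return the guardrails whose relative increase exceeds their tolerance.
function findGuardrailViolations(metrics) {
  return metrics
    .map((m) => ({ ...m, relativeChange: (m.variant - m.control) / m.control }))
    .filter((m) => m.relativeChange > m.maxRelativeIncrease);
}

for (const v of findGuardrailViolations(guardrails)) {
  console.log(`${v.name} rose ${(v.relativeChange * 100).toFixed(1)}%, over its tolerance`);
}
```

Agreeing on the tolerances before the experiment starts keeps the ship decision from turning into a negotiation after the numbers are in.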

Tracking Infrastructure

To run tests in-house, your database schema needs three entities:

Experiments: Stores metadata, hypothesis description, target sample size, and MDE.
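Assignments: Records which variant each user ID was exposed to. Even when the variant is computed by hashing, logging the exposure gives the analysis something to join against.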
Events: Logs user actions (e.g., page views, checkouts, signups).

By joining your assignments and events on the user ID, you can calculate the conversion rates for each group. While commercial tools like Optimizely, Statsig, or GrowthBook handle this infrastructure and stats engine for you, understanding this data model is key for writing clean telemetry.
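The join itself is small. The sketch below uses in-memory arrays as stand-ins for the Assignments and Events tables; the field names (`userId`, `variant`, `name`) and the function `conversionByVariant` are assumptions for illustration, not a prescribed schema.

```javascript
// Hypothetical rows standing in for the Assignments and Events tables.
const assignments = [
  { userId: 'u1', variant: 'control' },
  { userId: 'u2', variant: 'variant' },
  { userId: 'u3', variant: 'control' },
  { userId: 'u4', variant: 'variant' },
];
const events = [
  { userId: 'u2', name: 'page_view' },
  { userId: 'u2', name: 'checkout' },
  { userId: 'u2', name: 'checkout' }, // repeat purchase: still one converted user
  { userId: 'u3', name: 'checkout' },
  { userId: 'u9', name: 'checkout' }, // never assigned: excluded from the analysis
];

// Join events to assignments on userId and compute per-variant conversion.
// A user counts as converted once, no matter how many goal events they fired.
function conversionByVariant(assignmentRows, eventRows, goalEvent) {
  const converted = new Set(
    eventRows.filter((e) => e.name === goalEvent).map((e) => e.userId)
  );
  const stats = {};
  for (const { userId, variant } of assignmentRows) {
    if (!stats[variant]) stats[variant] = { users: 0, conversions: 0 };
    stats[variant].users += 1;
    if (converted.has(userId)) stats[variant].conversions += 1;
  }
  for (const s of Object.values(stats)) s.rate = s.conversions / s.users;
  return stats;
}

console.log(conversionByVariant(assignments, events, 'checkout'));
```

Driving the loop from assignments rather than events is the important design choice: the denominator is everyone who was exposed, and events from users who were never assigned cannot leak into either group.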

Post-Experiment Decisions

Reaching statistical significance doesn’t mean you automatically ship the change. Ask your team:

Is the improvement large enough to offset the ongoing maintenance cost of the new code?
Did any guardrail metrics slide?
Was the test traffic representative of your entire user base, or did it skew toward a specific channel?
Could external factors (like a holiday weekend or marketing campaign) have skewed the data?

Finally, document every test result, especially the failures and flat outcomes. Knowing what doesn’t work is just as valuable as finding a winner; it prevents you from shipping code that adds complexity without adding value.