The autopsy meeting happened on a Wednesday morning in April. We sat with the marketing director of a mid-sized e-commerce client — they sell specialty kitchen equipment, average order around $180, monthly revenue in the low six figures — and tried to explain why the new product detail page layout we'd been shipping since mid-January had quietly cost them roughly $40,000 in lost revenue over eight weeks.

The dashboard had said the new layout was a winner. Click-through to the add-to-cart button was up 9.4%. Statistical significance, the tool reported, was 96%. We'd called the test, shipped variant B to 100% of traffic, and moved on to the next experiment.

What the dashboard hadn't said: revenue per visitor on the new layout was down 6.1%.

Mistake one: clicks are not revenue

Variant B had a more prominent quantity selector and a redesigned "Add to Cart" button. People clicked add-to-cart more often. They also bought smaller average baskets. The new layout was, in effect, pushing customers to add a single item and check out faster, where the old layout encouraged scrolling, reading related items, and adding accessories.

The metric we picked, click-through to add-to-cart, was the thing we could measure most easily, and it turned out to be a poor stand-in for what the business actually cared about. Revenue per visitor is harder to measure cleanly because order size has high variance, but it's the metric that matched the business outcome. We optimized for a proxy. The proxy and the goal pulled in opposite directions.
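To make the gap concrete, here is a minimal sketch of computing both metrics side by side. The totals are made up for illustration and are not the client's data; `VariantTotals` and `relative_lift` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantTotals:
    """Aggregated totals for one variant (hypothetical structure)."""
    visitors: int
    add_to_cart_clicks: int
    revenue: float  # total order value from these visitors

    @property
    def click_rate(self) -> float:
        return self.add_to_cart_clicks / self.visitors

    @property
    def revenue_per_visitor(self) -> float:
        # Non-buyers count as zero revenue, so divide by ALL visitors.
        return self.revenue / self.visitors


def relative_lift(control: float, variant: float) -> float:
    """Relative change of the variant value versus the control value."""
    return variant / control - 1.0


# Illustrative numbers only: the proxy goes up while the dollar metric goes down.
control = VariantTotals(visitors=10_000, add_to_cart_clicks=800, revenue=36_000.0)
variant = VariantTotals(visitors=10_000, add_to_cart_clicks=880, revenue=33_840.0)

click_lift = relative_lift(control.click_rate, variant.click_rate)
rpv_lift = relative_lift(control.revenue_per_visitor, variant.revenue_per_visitor)

print(f"add-to-cart click lift:   {click_lift:+.1%}")
print(f"revenue per visitor lift: {rpv_lift:+.1%}")
```

Point estimates like these say nothing about noise. Because order values vary a lot, a revenue-per-visitor comparison would also need a confidence interval before anyone acts on it.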

This is the most common mistake we make. Easy-to-measure metrics drift away from the metrics that pay the bills. Every test we run now has a primary metric expressed in dollars or in something that is provably correlated with dollars on this specific site, with a separate set of guardrail metrics watching for things going sideways elsewhere.

Mistake two: peeking

The test we ran was set up in Google Optimize's successor (we're on GrowthBook now) with a "stop when significant" rule. The marketing director, who has dashboard access, checked it on day six and saw it was already trending positive. By day nine the tool flagged 96% significance. We declared the winner on day ten.

Day ten was too early. Peeking, meaning checking a running test and stopping as soon as the p-value crosses a threshold, inflates false positives dramatically. If you check ten times during a test instead of once at a pre-planned end point, your effective false-positive rate can be three to four times the nominal 5%. The 96% the tool flagged on day nine overstated the evidence considerably.
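The inflation is easy to reproduce with a small simulation. The sketch below is an approximation, not a model of any particular tool: it assumes an A/A test with no true difference, equally sized batches of traffic between looks, and a z-statistic that behaves like a scaled running sum of independent normal increments. The function name is hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_looks, n_sims=20_000, alpha=0.05, seed=1):
    """Share of A/A tests flagged 'significant' at any of n_looks checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            # Under the null, each batch adds an independent N(0, 1) increment.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / look ** 0.5  # z-statistic after `look` batches
            if abs(z) > z_crit:
                hits += 1
                break  # the "stop when significant" rule
    return hits / n_sims


single_look = false_positive_rate(n_looks=1)
ten_looks = false_positive_rate(n_looks=10)

print(f"one planned look:    {single_look:.1%} false positives")
print(f"ten looks, stop early: {ten_looks:.1%} false positives")
```

With one look the rate should sit near the nominal 5%. With ten looks and early stopping it should come out several times higher.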

We knew this. We've known it for years. The tool's "stop when significant" button is right there, glowing, and we still pushed it.

What we do now: pre-register the test duration based on a power calculation before we start, and we don't look at significance numbers until that date passes. The dashboard literally hides the significance column until the planned end date. We built this into our GrowthBook setup last month.
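The gating logic itself is simple. This is a sketch of the idea only, not GrowthBook's API or our actual configuration; `ExperimentPlan`, its fields, and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ExperimentPlan:
    """Pre-registered plan: fixed before the test starts (hypothetical structure)."""
    start: date
    planned_days: int  # taken from the power calculation, not from the dashboard

    @property
    def end(self) -> date:
        return self.start + timedelta(days=self.planned_days)

    def significance_visible(self, today: date) -> bool:
        # Significance numbers stay hidden until the planned end date.
        return today >= self.end


plan = ExperimentPlan(start=date(2024, 1, 15), planned_days=42)  # example dates

print(plan.significance_visible(date(2024, 1, 24)))  # day nine of the test
print(plan.significance_visible(plan.end))           # planned end date
```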

Mistake three: undersized sample

The test ran for ten days. Site traffic during that window was roughly 12,000 visitors per variant. For a baseline conversion rate of around 2% and a minimum detectable effect we'd actually have cared about (say an 8% relative lift, from 2.00% to 2.16%), a standard two-sided test at 5% significance and 80% power needs roughly 125,000 visitors per variant, about ten times what we collected.
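The requirement depends heavily on the assumed significance level, power, and baseline rate, so it is worth recomputing rather than trusting a remembered figure. Below is a minimal sketch using the usual normal-approximation formula for comparing two proportions. The 5% two-sided significance level and 80% power are our assumptions, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect a relative lift in a conversion rate.

    Two-sided test, normal approximation for the difference of two proportions.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


required = sample_size_per_variant(baseline=0.02, relative_lift=0.08)

# Traffic figure from the test: about 12,000 visitors per variant in ten days.
daily_per_variant = 12_000 / 10
days_needed = ceil(required / daily_per_variant)

print(f"required per variant: {required:,}")
print(f"days at current traffic: {days_needed}")
```

Under these assumptions the answer lands far above the 12,000 per variant we collected in ten days.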

What that means in practice is that the test was powered to detect only very large effects. With a sample that small, the 9.4% click lift it "found" could have been a real effect, could have been noise, or, worst case, could have been a real effect on the metric we were measuring that pointed in the wrong direction on the metric we cared about. Which is what happened.

We now run a power calculation before the test. If the math says we need six weeks of traffic to detect the effect we care about, we either commit to six weeks or we don't run the test. The middle path — running it for two weeks and hoping — is where money goes to die.

Mistake four: novelty effects

Some of the early lift on the new layout was almost certainly novelty. Returning customers, who account for about 30% of the client's revenue, behaved differently when they saw a layout they hadn't seen before. They clicked more because the page surprised them. After a few weeks their engagement settled below what the old layout had produced.

Ten days isn't long enough for novelty to wash out. We now require any layout test on a site with significant returning traffic to run for at least two weeks beyond a calculated minimum, and we split the analysis between new and returning visitors. The early signal was almost entirely driven by returning users, which should have been a red flag.

Mistake five: not pre-committing to the analysis plan

When the test ended we had multiple metrics on the dashboard — clicks, add-to-cart rate, checkout starts, completed orders, revenue per visitor, average order value. With six metrics and a willingness to call any of them the winner, the odds that at least one shows a "significant" lift purely by chance get uncomfortable.
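A back-of-the-envelope version of that risk is below. It assumes each metric is tested at a 5% threshold and that the metrics are independent, which ours are not (clicks, checkout starts, and orders move together), so treat the result as a rough ceiling rather than an exact figure. The Bonferroni correction is included only as a common reference point; it is not the rule we adopted.

```python
def family_wise_error_rate(n_metrics, alpha=0.05):
    """Chance that at least one of n independent null metrics looks significant."""
    return 1 - (1 - alpha) ** n_metrics


def bonferroni_alpha(n_metrics, alpha=0.05):
    """Per-metric threshold that keeps the overall error rate at or below alpha."""
    return alpha / n_metrics


fwer = family_wise_error_rate(n_metrics=6)
per_metric = bonferroni_alpha(n_metrics=6)

print(f"chance of at least one false 'winner' across six metrics: {fwer:.1%}")
print(f"Bonferroni per-metric threshold: {per_metric:.4f}")
```

Under the independence assumption, six metrics give roughly a one-in-four chance of a spurious "winner", which is why a single pre-declared primary metric matters.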

Now we pick one primary metric before the test starts. We write it in the test brief. The brief is signed off by the client. If the primary doesn't move, the test failed, no matter how many secondary metrics light up green.

What the autopsy actually produced

We rolled back to the old layout in late March. Revenue per visitor recovered within two weeks. We owed the client a process change, not a discount — the $40k was real, and pretending otherwise would have been worse than admitting it. We sat down and rewrote our internal experimentation playbook from scratch. It's about three pages now. Most of it is "don't do the thing the tool's UI wants you to do."

The honest version of what we learned is this: A/B testing tools have gotten very good at making it feel like you're doing science. The buttons say "significant," the numbers go up, and the chart looks like the kind of chart that wins arguments. None of that is the same thing as knowing whether the change is good for the business. The discipline that turns A/B testing into something useful is mostly about resisting the tool's defaults.

We're more boring about it now. Pre-registered, slow, single-primary-metric, no-peeking. We've lost some of the speed. We've stopped shipping the wrong winners.