A/B Testing an Error Message

A/B test methodology demonstration
  • Scenario: Dead-end error messages blocking workflows and driving support tickets
  • Approach: Hypothesis formation, sample size planning, statistical validation
  • Outcome: How I design and analyze A/B tests for UX copy

Sometimes the biggest wins hide in the smallest places. This is a methodology demonstration: how I would approach a single error message that's driving frustration and support tickets, and turn it into a recovery path.

Methodology Note

  • Format: Demonstration
  • Purpose: Show how I approach A/B testing and statistical validation

The Scenario

Imagine a SaaS platform that helps product teams manage release notes and sync content across tools like GitHub, JIRA, and Zendesk. The GitHub integration works fine most of the time, but occasionally the authentication expires and the sync fails.

When it happens, users see: "401 Unauthorized"

Original '401 Unauthorized' error message with no recovery guidance

Accurate, but a dead end. Users either abandon the workflow or open a support ticket. The question: can better copy help more users recover on their own?

Forming the Hypothesis

Before writing new copy, I would dig into why users fail. Support tickets tell you what's broken. Call recordings tell you how users think about it.

In this case, users likely understand authentication failed. They just don't know the fix is a simple reconnect. That gap between understanding and action is where UX copy lives.

The hypothesis: if the error message explains the issue and gives a clear next step, more users will reconnect GitHub on their own.

In A/B testing, I would set up a null hypothesis (H₀): there's no difference between the two messages. The goal is to see whether the new copy produces results unlikely to occur if that null hypothesis were true.

Designing the Test

The proposed message: "Authentication expired, please reconnect your GitHub account." It explains the issue and gives a clear next step.

  • Before: 401 Unauthorized
  • After: Authentication expired, please reconnect your GitHub account.
Improved error message explaining authentication issue with clear reconnection instructions

I would partner with a data scientist and engineer to split traffic 50/50 between the old and new messages.
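
In practice, the split is usually a deterministic bucket based on a stable user ID, so each user always sees the same message. Here's a minimal sketch of that kind of assignment; the function and experiment names are hypothetical, not from a real implementation:

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "error-copy-401") -> str:
    """Deterministically bucket a user into a 50/50 split.

    Hashing the experiment name together with the user ID keeps the
    assignment stable across sessions and independent of other experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # a number from 0 to 99
    return "variant" if bucket < 50 else "control"

# The same user always lands in the same group:
print(assign_variant("user-1234"))
```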

Before launching, I would use a sample size calculator (like Evan Miller's) to determine how long to run the test. With a minimum detectable effect of ~10 percentage points, 80% power, and 0.05 significance, I would need roughly 400 users per group, about 800 in total. Depending on error frequency, that might take two weeks.
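
The same calculation can be sketched in Python with statsmodels. The 55% baseline recovery rate here is an assumption for illustration; in practice it would come from existing analytics:

```python
# Required sample size for detecting a 10pp absolute lift in recovery rate
# (two-sided test, 80% power, alpha = 0.05). Baseline of 55% is assumed.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.55   # assumed current recovery rate
mde = 0.10        # minimum detectable effect, absolute

effect = proportion_effectsize(baseline + mde, baseline)  # Cohen's h
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)
print(f"~{n_per_group:.0f} users per group")  # ~375 here; round up in practice
```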

Analyzing Results

Once the test reaches the target sample size, I would compare recovery rates between groups. Here's what a successful result might look like:

A/B test results showing 10 percentage point improvement in recovery rate with confidence intervals
Illustrative example: 55% recovery (control) vs 65% (variant). Point estimate: +10.1pp with 95% CI of ±6.6pp.

With results like these, I would check two things (sketched in code after the list):

  • P-value: Is it under 0.05? If so, results this extreme would be unlikely if there were no real difference.
  • Confidence interval: Does the range exclude zero? A 95% CI of +3.5pp to +16.7pp says the data are consistent with a real improvement anywhere in that range, and not with no effect at all.
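
Both checks are quick to run. This sketch uses hypothetical counts chosen to roughly match the illustrative 55% vs 65% result above:

```python
# Two-proportion z-test and 95% CI for the difference in recovery rates.
# Counts are hypothetical: ~420 users per group, 55% vs 65% recovery.
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

recovered = [273, 231]  # variant, control: users who reconnected on their own
exposed = [420, 420]    # users who hit the error in each group

z_stat, p_value = proportions_ztest(recovered, exposed)
ci_low, ci_high = confint_proportions_2indep(
    recovered[0], exposed[0], recovered[1], exposed[1], method="wald"
)

print(f"p-value: {p_value:.4f}")  # well under 0.05 for these counts
print(f"95% CI for the lift: {ci_low * 100:+.1f}pp to {ci_high * 100:+.1f}pp")
```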

I would also check for secondary signals: did support ticket volume drop during the test period? That kind of qualitative confirmation strengthens the case.

When I Wouldn't A/B Test

A/B testing is a tool, not a religion. I wouldn't use it when:

  • Low traffic: If you can't reach sample size in a reasonable timeframe, the test drags on and external factors contaminate results.
  • High-stakes flows: If a bad variant causes real harm (failed payments, data loss), the risk isn't worth the learning.
  • Qualitative research already shows a clear winner: If 8 out of 10 usability participants fail on the same step, you don't need statistical significance to act.

Knowing when not to test is as important as knowing how.

The Same Rigor, Different Methods

A/B testing isn't always possible, but the statistical thinking applies everywhere. In my empty state framework work, I used Fisher's exact test to validate usability findings before scaling. In terminology research, chi-square tests and weighted scoring backed changes that measurably reduced support tickets.
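
Fisher's exact test in particular holds up at usability-study sample sizes. A minimal illustration with hypothetical counts:

```python
# Fisher's exact test on a small usability sample (counts are hypothetical):
# 8 of 10 participants failed the task on the old design vs 2 of 10 on the new one.
from scipy.stats import fisher_exact

table = [[8, 2],  # old design: failed, succeeded
         [2, 8]]  # new design: failed, succeeded

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"p = {p_value:.3f}")  # ~0.023 for these counts
```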

The method matters less than the mindset: design tests that can actually prove something, then let the numbers decide.