Finding the Right Words for Monitoring

A UX writing case study by Miguel Tomás Gomes
  • Problem: Terminology confusion. Internal teams used inconsistent language across UI, docs, and support.
  • Solution: A user research study. 40-participant terminology testing with chi-square validation and weighted scoring.
  • Result: 27% fewer support tickets. A standardized monitoring vocabulary with measurable business impact.

In enterprise software, the wrong words create internal friction and drive up support tickets. This project set out to standardize the terminology of a new Monitoring feature. What looked like a mere naming decision turned out to be an alignment exercise between developers, admins, and product owners.

Methodology Note

  • Company: Anonymized for confidentiality reasons
  • Data: Two rounds of unmoderated testing with 40 participants across product roles
  • Purpose: Demonstrate terminology standardization methodology
  • Tools: UserTesting, Figma
  • Target users: Developers, admins, and product owners

The Challenge

The Monitoring feature was designed to display logs and system traces. Some of these terms had roots in industry standards, others were unique to our product. None, however, worked consistently across our users.

Developers leaned toward precision, while product owners cared more about accessible language. Admins, on the other hand, wanted both.

In our product, the concept of a trace was labeled as "Activity". Additionally, what is commonly known as spans appeared in the UI as "Activity details". In the industry at large, however, these words have more precise meanings.

For instance, a trace is the complete record of how an application executes logic across services, while a span is a single sub-operation within that trace. Spans are defined by their start and end time, duration, and metadata. As such, they provide context to a trace.

screen showing a trace and spans

The mismatch between our nomenclature and industry-wide terminology created ambiguity for developers and product users, making it clear to us that both needed to be reconsidered.

Even internally, teams didn't fully agree on what each term meant. This inconsistency made its way into the UI and documentation, creating friction and confusion.

I needed to find terminology that was natural, widely understood, and still accurate enough to preserve credibility with developer users.

The Approach

As the only UX Writer assigned to this project, I developed and ran two rounds of unmoderated testing with a total of 40 participants of varied product roles. Each session followed the same pattern: first, participants answered a free-input prompt to capture their natural vocabulary.

Then, they saw a randomized multiple-choice set of four options, including their own input when possible. Finally, they were instructed to explain their reasoning in a free-input field.

Methodology

Splitting the test into two rounds helped control for context bias. The first round used text-only prompts and the second used UI mockups. Text-only prompts asked users to pick the most intuitive option for a given definition, while UI mockups added visual cues and a missing term with multiple options.

This way, I could see whether a term felt natural in isolation but confusing in the interface, or vice versa. Unless noted otherwise, the percentages reported in the findings come from the UI mockup round, which better predicted real-world performance in our product interface. To ensure statistical rigor, I added validation in two ways.

First, I used chi-square tests to check whether the overall preference patterns differed from random chance. For example, with four possible terms, random selection would predict 25% each; with three options, 33% each; with two options, 50% each.

Second, I used binomial tests to compare the top two options in each category, testing whether the leading choice significantly outperformed its closest competitor.
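The two checks above can be sketched in a few lines of standard-library Python. This is an illustrative reconstruction, not the project's actual analysis script; the function names and the uniform-expectation assumption are mine.

```python
import math

def chi_square_stat(observed):
    """Chi-square goodness-of-fit statistic against a uniform expectation."""
    n = sum(observed)
    expected = n / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial test: total probability of all outcomes
    no more likely than the observed count k under Binomial(n, p)."""
    pmf = lambda i: math.comb(n, i) * p**i * (1 - p) ** (n - i)
    observed = pmf(k)
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= observed + 1e-12)

# Umbrella-term round: Monitoring 24, Debugging 8, Analytics 6, Insights 2.
stat = chi_square_stat([24, 8, 6, 2])
print(round(stat, 2))  # 28.0, well above 16.27 (df = 3 critical value at p = 0.001)

# Head-to-head: Monitoring (24) vs. Debugging (8), n = 32, 50% chance baseline.
print(round(binomial_two_sided_p(24, 32), 4))  # ≈ 0.007, i.e. p < 0.01
```

The same two functions cover every category in the findings; only the observed counts change.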

I also introduced a weighted scoring system: terms we internally found to be widespread and recognized as industry standards received a ×1.25 multiplier (+25%).

Terms that caused hesitation or confusion were penalized with a 0.9 factor (-10%). This blend ensured that the final choices weren't driven by raw percentages alone, but by a balance of clarity and business strategy.

To illustrate how this worked in practice, here's an example calculation:

Term          Base score (share)   Industry multiplier   After multiplier   Confusion penalty   Final score
Example Term  6.0 (60%)            ×1.25                 7.5                ×0.9                6.75
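The scoring rules above can be expressed as a small helper. This is a hypothetical sketch of the described model, not production code; the function name and the 0–10 base scale are assumptions taken from the worked example.

```python
def final_score(share, industry_standard=False, causes_confusion=False):
    """Weighted term score: base is the preference share on a 0-10 scale;
    industry-standard terms get a x1.25 boost, confusing terms a x0.9 penalty."""
    score = share * 10          # e.g. a 60% share becomes a base score of 6.0
    if industry_standard:
        score *= 1.25           # +25% for recognized industry standards
    if causes_confusion:
        score *= 0.9            # -10% for terms that caused hesitation
    return round(score, 2)

# The worked example from the table: 60% share, both modifiers applied.
print(final_score(0.60, industry_standard=True, causes_confusion=True))  # 6.75
```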

Findings

The following results reflect participant preferences when shown terms in UI context, which proved most predictive of actual usage patterns.

The first question was whether "Monitoring" worked as the umbrella term. Other options included Debugging, Analytics, and Insights. When shown in UI context, 24 of 40 participants chose "Monitoring" (60%), followed by 8 for "Debugging" (20%), 6 for "Analytics" (15%), and 2 for "Insights" (5%).

User testing results showing 60% preference for 'Monitoring' over other terminology options

A chi-square test confirmed that preferences weren't random (p < 0.001), and a binomial test showed "Monitoring" significantly outperformed "Debugging" (24 vs. 8, p < 0.01). With its industry-standard multiplier, "Monitoring" clearly outperformed all alternatives.

Decision: Keep "Monitoring" as the umbrella term. Reserve "Debugging" for troubleshooting contexts.

Logs

The next test was simpler. Participants were asked what they would call a system-generated record of events. In the UI mockup round, thirty-six out of forty chose "Logs" (90%), with only four choosing anything else.

A binomial test confirmed that this preference was far stronger than the 50% baseline expected under chance (p < 0.001; 90% observed vs. a 50% baseline). Free-input responses also overwhelmingly used "Logs".

Decision: Keep "Logs" unchanged.

Activities, Events, Traces

The more difficult naming challenge emerged when we looked at how the product described a record of system behavior through time. In observability, the industry-standard term is a trace: a record that captures the sequence of events as an application executes logic, showing how different services interact.

Our UI, however, labeled this concept "Activities". On paper, the word sounded approachable. In practice, it was vague. Some participants assumed it meant audit logs of end-user actions, others were unsure whether it referred to system-level events.

In text-only prompts, many leaned toward "Events", finding it intuitive and less technical than either "Activities" or "Traces". But once shown UI mockups, comprehension shifted. With context, "Traces" gained ground. By the end of the second round, 24 of 40 participants (60%) favored "Traces", compared with 14 for "Events" (35%) and 2 for "Activities" (5%).

traces results pie chart

A chi-square test confirmed that preferences weren't random (p < 0.01). In the head-to-head binomial comparison, however, "Traces" led "Events" (24 vs. 14) without reaching significance at the 0.05 level, so raw preference alone could not settle the choice.

"Events" remained useful in specific contexts. Product owners found it approachable, and admins saw it as a good fit for user-facing occurrences. But under the scoring model, "Traces" carried the industry-standard multiplier, which pushed it above "Events" in the final evaluation.

Decision: Retire "Activities". Keep "Traces" as the canonical industry term, supported by UI context and documentation. Use "Events" only for user-facing occurrences, not system-level tracing.

Spans

The final terminology challenge centered on "span", the unit inside a trace. A span is a sub-operation with its duration and metadata. Multiple spans, linked together, form a trace that shows how a request flows across services.

But the term span proved nearly unusable at first. Few participants recognized it. Product users confused it with HTML tags, and only a handful of developers understood it correctly.

When participants were asked what they would call this unit in UI context, 28 out of 40 chose Operation (70%), 8 chose Job (20%), and only 4 chose Span (10%).

spans results pie chart

A chi-square test showed clear preferences beyond chance (p < 0.001), with "Operation" dominating the field. A binomial test confirmed "Operation" significantly outperformed "Job" (28 vs. 8, p < 0.001).

Under the scoring model, "Span" was penalized heavily for lack of comprehension, while Operation gained both industry credibility and cross-role clarity.

To preserve accuracy without alienating users, I recommended renaming it to "Sub-operation (span)" in the UI, paired with a tooltip explanation to educate gradually.

screen showing tooltip on a span visual element

Decision: Replace "Span" with "Sub-operation (span)", supported by tooltips.

Outcome

By the end of the research, the terminology set for the monitoring feature was standardized:

  • Monitoring as the umbrella label.
  • Logs remain unchanged.
  • Activities retired in favor of traces, with events reserved for user-facing contexts.
  • Span reframed as sub-operation (span) with tooltip support.

This consistency carried through the UI, documentation, and support, reducing ambiguity and improving usability. Developers, admins, and product owners now had a shared language.

Results

In follow-up interviews, participants described the revised terms as more natural and easier to navigate. I also checked in with support management three months after the changes went live; they reported a drop of up to 27% in tickets related to Monitoring.

More importantly, the methodology set a precedent. By combining free input with multiple choice, text-only prompts with UI context, and validating results through statistical testing and weighted scoring, the team gained a repeatable model for terminology research.

This outcome proved that words are not surface-level. They are fundamental to how everyone understands and interacts with a system.