Designing an AI-Assisted Usability Metrics Toolkit for Clinical Research Teams
AI-Enabled Usability Toolkit: Improving metric selection, analysis, and documentation for L&D research teams.
I built a structured prompt system that helps new and experienced researchers choose the right metric (SUS, NPS, SUM), calculate scores using sample or real data, generate study protocols, and auto-summarize findings.
Outcome: This reduced documentation time by 40% and ensured consistent analysis across formative and summative studies.
"It's a way to modernize how research is done"
Project Overview
Project Duration: Nov 2025
Role
User Experience Researcher - Assistant Manager
Problem observed in my org and other client teams
At GE HealthCare, I’m working on a long-running L&D monitoring product that has gone through multiple releases and teams over five years. This history created inconsistencies in how usability studies were planned and analyzed, leading me to develop an AI-assisted toolkit to standardize and streamline our research workflow.
Research teams struggled with:
- Sample sizes: determining appropriate sample sizes at each stage
- Metric choice: selecting the right metric (SUS vs NPS vs SUM)
- Metric errors: frequent calculation errors for statistical metrics
- New researchers: limited guidance for those newer to quant methods
Impact on project deliverables
⚠️ Misaligned or incorrect metrics
⚠️ Slower regulatory submissions
⚠️ Extra rework and review cycles
My Approach and Process
01
Understanding Researcher Needs
I began by mapping gaps in our current research workflow through quick interviews and informal discussions with UX researchers and designers. Several recurring pain points emerged:
- Confusion around which metrics to use and sample size selection
- Misalignment between study type and quant methodology (e.g., SUS/NPS misused in formative studies)
- Repetitive manual work in calculating SUS/SUM/task success
This clarified a core need: a structured, modular AI toolkit that guides researchers from
study planning → metric selection → calculation → stakeholder-ready storytelling.
02
Defining Scope of the AI Toolkit
I outlined clear boundaries to keep the system focused and usable:
Included:
- Study setup decision aids (sample size, methodology, participant mix)
- Metric selection guidance
- Calculators for SUS, SUM, task success, error rate, time on task
- Visualization and summary-generation prompts
Excluded:
- Deep statistical modeling
- Regulatory HF validation templates
- Automated inferential statistics
This ensured the toolkit remained lightweight, scalable, and aligned with day-to-day UX needs.
03
Designing a Modular Prompt System
Instead of a long “mega prompt,” I designed a modular library researchers can mix and match.
The system includes:
- Study Setup Prompts (participants, methodology, lifecycle stage)
- Metric Recommender (chooses the right scores for the study)
- Quant Calculators (SUS, SUM, task success, error rate)
- Analysis + Visualization Generators
- Stakeholder Summary Builders (ready-to-paste narratives)
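To illustrate the kind of arithmetic the quant calculators standardize, here is a minimal SUS scoring sketch using Brooke's standard formula (the function name is my own; the toolkit itself works through prompts rather than code):

```python
def sus_score(responses):
    """Compute a SUS score (0-100) from ten 1-5 Likert responses.

    Standard Brooke scoring: odd-numbered items are positively worded
    (contribution = response - 1); even-numbered items are negatively
    worded (contribution = 5 - response). The 0-40 total is scaled by 2.5.
    """
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses on a 1-5 scale")
    total = sum(
        (r - 1) if i % 2 == 0 else (5 - r)  # i is 0-based, so even i = odd item
        for i, r in enumerate(responses)
    )
    return total * 2.5


# A neutral response set (all 3s) scores exactly 50.
print(sus_score([3] * 10))  # 50.0
```

Encoding this once removes the most common manual error we saw: forgetting to reverse-score the even-numbered items.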
The usability engineering process I follow
The toolkit covers every phase, but for this case study I am showcasing two areas.

Phase 01: The study planner
1. PARTICIPANT NUMBER PLANNER
Prompt Template: “How Many Participants Do I Need?”
You are a senior UX research strategist helping me determine an appropriate participant sample size for my upcoming usability study.
Study details:
- Product: [describe product]
- Domain: [clinical, enterprise, consumer, etc.]
- Study type: [formative / summative / benchmark / comparison]
- Stage of lifecycle: [concept / early prototype / high-fidelity / pre-release]
- Key goals: [identify issues, measure efficiency, validate safety, benchmark usability]
- Number of tasks expected: [#]
- Constraints: [time, budget, access to clinicians, etc.]
What I need from you:
- Recommend a sample size range and justify it.
- Explain what level of confidence / robustness I can expect with that sample.
- Advise whether I should:
  - Run multiple small rounds, or one larger study
  - Use a within-subjects or between-subjects design
- Provide a summary paragraph for stakeholders.
- Ask up to 3 clarifying questions before giving the answer.
What we input:
- Product: bedside L&D maternal–fetal monitoring system
- Domain: clinical
- Study type: formative
- Stage: early prototype
- Key goals: identify usability breakdowns in documentation + interpretation
- Number of tasks: 6–8
- Constraints: limited nurse availability (max 6–8 nurses); must be in-person at hospital
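The planner's recommendations can be sanity-checked against the classic problem-discovery model from Nielsen and Landauer: the expected share of usability problems found with n participants is 1 − (1 − p)^n. A minimal sketch (p ≈ 0.31 is the widely cited average per-participant detection rate, not a value specific to this product):

```python
import math

def discovery_rate(n, p=0.31):
    """Expected proportion of usability problems found with n participants,
    assuming each problem is detected by any one participant with
    probability p (0.31 is Nielsen & Landauer's commonly cited average)."""
    return 1 - (1 - p) ** n

def participants_needed(target=0.80, p=0.31):
    """Smallest n whose expected discovery proportion reaches the target."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))


# With 6 nurses and the default p, we expect to surface roughly 89% of problems.
print(round(discovery_rate(6), 2))
```

This is why our constraint of 6–8 nurses is workable for a formative round: past five or six participants, each additional session yields diminishing new findings.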
AI Output
Below is a screenshot of the model's output for these inputs:

Phase 02: The metric recommender
1. Metric Recommender
Prompt Template: “What Should I Measure?”
You are a senior UX quant researcher helping me choose appropriate quantitative UX metrics for my usability study.
Study details:
- Product: [describe product]
- Domain: [clinical, enterprise, consumer, etc.]
- Study type: [formative / summative / benchmark / comparison]
- Stage of lifecycle: [concept / early prototype / high-fidelity / pre-release]
- Key goals: [identify issues, measure efficiency, validate safety, benchmark usability]
- Number of tasks expected: [#]
- Constraints: [time, budget, access to clinicians, etc.]
What I need from you:
Recommend a set of metrics that fit this study. Consider (but don't limit to):
- SUS
- NPS
- Task success / completion rate
- Critical error rate
- Non-critical error rate
- Time on task
- Confidence ratings
- Post-task satisfaction ratings
- SUM
For each recommended metric:
- Explain why it fits
- Label it Primary or Secondary
Flag metrics that should NOT be used in this context and explain why.
Suggest a simple study design, including:
- Number of tasks
- Approximate participant count
- Whether results should be considered diagnostic or benchmarking
Present recommendations in a Markdown table with columns:
Metric | Primary/Secondary | When to Collect | Why It’s Suitable | Cautions
Provide a short narrative summary (max 5 bullets) I can paste into a study plan.
What we input:
- Product: bedside L&D maternal–fetal monitoring system
- Domain: clinical
- Study type: formative
- Stage: early prototype
- Key goals: identify usability issues, validate documentation workflow, evaluate how easily clinicians interpret fetal/maternal signals
- Number of tasks: 6–8
- Constraints: limited nurse availability (max 6–8 nurses); must be in-person at hospital
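With only 6–8 nurses, a raw task-completion percentage is noisy, so any reported success rate should carry an interval. A minimal sketch using the adjusted Wald interval, which Sauro and Lewis recommend for small-sample completion rates (this helper is my own illustration, not part of the prompt library):

```python
import math

def adjusted_wald_ci(successes, n, z=1.96):
    """Adjusted Wald confidence interval for a task completion rate.

    Adds z^2/2 pseudo-successes and z^2 pseudo-trials before computing the
    normal-approximation interval, which behaves far better than the plain
    Wald interval when n is small (as in formative clinical studies).
    """
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


# 6 of 7 nurses completing a task still spans a wide range of plausible rates.
lo, hi = adjusted_wald_ci(6, 7)
print(f"{lo:.2f}-{hi:.2f}")
```

Seeing the width of that interval is a useful reminder to stakeholders that formative numbers are diagnostic, not benchmarks.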
AI Output
Below is a screenshot of the model's output for these inputs:

Impact of this project and value to my team
- Reduced time researchers spend on quant setup by ~50%
- Standardized outputs across formative/summative studies
- Reduced common errors in SUS/SUM calculations
- Helped junior researchers ramp up faster
- Improved consistency in clinical documentation and regulatory submissions
- Enabled quicker alignment with PMs and designers through ready-to-paste summaries
AI Ethical Considerations to remember
Human judgment remains primary: The toolkit supports decision-making but does not replace clinical or research expertise. All metric selections and interpretations require a researcher’s final review to avoid over-reliance on automated suggestions.
Bias-aware recommendations: Metric suggestions (e.g., when not to use NPS) are designed to avoid misleading results caused by small samples, skewed participant pools, or early-stage concepts.
Transparency in outputs: All AI-generated summaries, tables, and metric rationales clearly indicate their source and prompt structure, ensuring traceability for regulatory audits and cross-team reviews.
Protected handling of sensitive information: Prompts explicitly avoid pulling in patient identifiers, clinical records, or proprietary device data. The workflow is structured to use abstracted usability inputs only.