Designing an AI-Assisted Usability Metrics Toolkit for Clinical Research Teams
AI-Enabled Usability Toolkit: Improving metric selection, analysis, and documentation for L&D research teams.
I built a structured prompt system that helps new and experienced researchers choose the right metric (SUS, NPS, SUM), calculate scores using sample or real data, generate study protocols, and auto-summarize findings.
Outcome: This reduced documentation time by 40% and ensured consistent analysis across formative and summative studies.
"It's a way to modernize how research is done"
Project Overview
Project Duration: Nov 2025
Role
User Experience Researcher - Assistant Manager
Problem observed in my org and other client teams
At GE HealthCare, I’m working on a long-running L&D monitoring product that has gone through multiple releases and teams over five years. This history created inconsistencies in how usability studies were planned and analyzed, leading me to develop an AI-assisted toolkit to standardize and streamline our research workflow.
Research teams struggled with:
- Sample sizes: determining appropriate sample sizes at each stage
- Metric choice: selecting the right metric (SUS vs NPS vs SUM)
- Metric errors: frequent calculation errors for statistical metrics
- New researchers: limited guidance for those newer to quant methods
Impact on project deliverables
⚠️ Misaligned or incorrect metrics
⚠️ Slower regulatory submissions
⚠️ Extra rework and review cycles
My Approach and Process
01
Understanding Researcher Needs
I began by mapping gaps in our current research workflow through quick interviews and informal discussions with UX researchers and designers. Several recurring pain points emerged:
- Confusion around which metrics to use and sample size selection
- Misalignment between study type and quant methodology (e.g., SUS/NPS misused in formative studies)
- Repetitive manual work in calculating SUS/SUM/task success
This clarified a core need: a structured, modular AI toolkit that guides researchers from
study planning → metric selection → calculation → stakeholder-ready storytelling.
02
Defining Scope of the AI Toolkit
I outlined clear boundaries to keep the system focused and usable:
Included:
- Study setup decision aids (sample size, methodology, participant mix)
- Metric selection guidance
- Calculators for SUS, SUM, task success, error rate, time on task
- Visualization and summary-generation prompts
Excluded:
- Deep statistical modeling
- Regulatory HF validation templates
- Automated inferential statistics
This ensured the toolkit remained lightweight, scalable, and aligned with day-to-day UX needs.
03
Designing a Modular Prompt System
Instead of a long “mega prompt,” I designed a modular library researchers can mix and match.
The system includes:
- Study Setup Prompts (participants, methodology, lifecycle stage)
- Metric Recommender (chooses the right scores for the study)
- Quant Calculators (SUS, SUM, task success, error rate)
- Analysis + Visualization Generators
- Stakeholder Summary Builders (ready-to-paste narratives)
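To illustrate the kind of arithmetic the quant calculators standardize, here is a minimal SUS scoring sketch using Brooke's standard formula (the function name is my own; the toolkit itself works through prompts rather than code):

```python
def sus_score(responses):
    """Compute a SUS score (0-100) from ten 1-5 Likert responses.

    Standard Brooke scoring: odd-numbered items are positively worded
    (contribution = response - 1); even-numbered items are negatively
    worded (contribution = 5 - response). The 0-40 total is scaled by 2.5.
    """
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses on a 1-5 scale")
    total = sum(
        (r - 1) if i % 2 == 0 else (5 - r)  # i is 0-based, so even i = odd item
        for i, r in enumerate(responses)
    )
    return total * 2.5


# A neutral response set (all 3s) scores exactly 50.
print(sus_score([3] * 10))  # 50.0
```

Encoding this once removes the most common manual error we saw: forgetting to reverse-score the even-numbered items.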
The usability engineering process I follow
The toolkit covers every phase, but for this case study I am showcasing two areas.

Phase 01: The study planner
1. PARTICIPANT NUMBER PLANNER
Prompt Template: “How Many Participants Do I Need?”
You are a senior UX research strategist helping me determine an appropriate participant sample size for my upcoming usability study.
Study details:
- Product: [describe product]
- Domain: [clinical, enterprise, consumer, etc.]
- Study type: [formative / summative / benchmark / comparison]
- Stage of lifecycle: [concept / early prototype / high-fidelity / pre-release]
- Key goals: [identify issues, measure efficiency, validate safety, benchmark usability]
- Number of tasks expected: [#]
- Constraints: [time, budget, access to clinicians, etc.]
What I need from you:
- Recommend a sample size range and justify it.
- Explain what level of confidence / robustness I can expect with that sample.
- Advise whether I should:
  - Run multiple small rounds, or one larger study
  - Use a within-subjects or between-subjects design
- Provide a summary paragraph for stakeholders.
- Ask up to 3 clarifying questions before giving the answer.
What we input:
- Product: bedside L&D maternal–fetal monitoring system
- Domain: clinical
- Study type: formative
- Stage: early prototype
- Key goals: identify usability breakdowns in documentation + interpretation
- Number of tasks: 6–8
- Constraints: limited nurse availability (max 6–8 nurses); must be in-person at hospital
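The planner's recommendations can be sanity-checked against the classic problem-discovery model from Nielsen and Landauer: the expected share of usability problems found with n participants is 1 − (1 − p)^n. A minimal sketch (p ≈ 0.31 is the widely cited average per-participant detection rate, not a value specific to this product):

```python
import math

def discovery_rate(n, p=0.31):
    """Expected proportion of usability problems found with n participants,
    assuming each problem is detected by any one participant with
    probability p (0.31 is Nielsen & Landauer's commonly cited average)."""
    return 1 - (1 - p) ** n

def participants_needed(target=0.80, p=0.31):
    """Smallest n whose expected discovery proportion reaches the target."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))


# With 6 nurses and the default p, we expect to surface roughly 89% of problems.
print(round(discovery_rate(6), 2))
```

This is why our constraint of 6–8 nurses is workable for a formative round: past five or six participants, each additional session yields diminishing new findings.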
AI Output
Below is a screenshot of the model's output for these inputs:

Phase 02: The metric recommender
1. Metric Recommender
Prompt Template: “What Should I Measure?”
You are a senior UX quant researcher helping me choose appropriate quantitative UX metrics for my usability study.
Study details:
- Product: [describe product]
- Domain: [clinical, enterprise, consumer, etc.]
- Study type: [formative / summative / benchmark / comparison]
- Stage of lifecycle: [concept / early prototype / high-fidelity / pre-release]
- Key goals: [identify issues, measure efficiency, validate safety, benchmark usability]
- Number of tasks expected: [#]
- Constraints: [time, budget, access to clinicians, etc.]
What I need from you:
Recommend a set of metrics that fit this study. Consider (but don't limit to):
- SUS
- NPS
- Task success / completion rate
- Critical error rate
- Non-critical error rate
- Time on task
- Confidence ratings
- Post-task satisfaction ratings
- SUM
For each recommended metric:
- Explain why it fits
- Label it Primary or Secondary
Flag metrics that should NOT be used in this context and explain why.
Suggest a simple study design, including:
- Number of tasks
- Approximate participant count
- Whether results should be considered diagnostic or benchmarking
Present recommendations in a Markdown table with columns:
Metric | Primary/Secondary | When to Collect | Why It’s Suitable | Cautions
Provide a short narrative summary (max 5 bullets) I can paste into a study plan.
What we input:
- Product: bedside L&D maternal–fetal monitoring system
- Domain: clinical
- Study type: formative
- Stage: early prototype
- Key goals: identify usability issues, validate documentation workflow, evaluate how easily clinicians interpret fetal/maternal signals
- Number of tasks: 6–8
- Constraints: limited nurse availability (max 6–8 nurses); must be in-person at hospital
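With only 6–8 nurses, a raw task-completion percentage is noisy, so any reported success rate should carry an interval. A minimal sketch using the adjusted Wald interval, which Sauro and Lewis recommend for small-sample completion rates (this helper is my own illustration, not part of the prompt library):

```python
import math

def adjusted_wald_ci(successes, n, z=1.96):
    """Adjusted Wald confidence interval for a task completion rate.

    Adds z^2/2 pseudo-successes and z^2 pseudo-trials before computing the
    normal-approximation interval, which behaves far better than the plain
    Wald interval when n is small (as in formative clinical studies).
    """
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


# 6 of 7 nurses completing a task still spans a wide range of plausible rates.
lo, hi = adjusted_wald_ci(6, 7)
print(f"{lo:.2f}-{hi:.2f}")
```

Seeing the width of that interval is a useful reminder to stakeholders that formative numbers are diagnostic, not benchmarks.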
AI Output
Below is a screenshot of the model's output for these inputs:

Impact of this project and value to my team
- Reduced time researchers spend on quant setup by ~50%
- Standardized outputs across formative/summative studies
- Reduced common errors in SUS/SUM calculations
- Helped junior researchers ramp up faster
- Improved consistency in clinical documentation and regulatory submissions
- Enabled quicker alignment with PMs and designers through ready-to-paste summaries
AI Ethical Considerations to remember
Human judgment remains primary: The toolkit supports decision-making but does not replace clinical or research expertise. All metric selections and interpretations require a researcher’s final review to avoid over-reliance on automated suggestions.
Bias-aware recommendations: Metric suggestions (e.g., when not to use NPS) are designed to avoid misleading results caused by small samples, skewed participant pools, or early-stage concepts.
Transparency in outputs: All AI-generated summaries, tables, and metric rationales clearly indicate their source and prompt structure, ensuring traceability for regulatory audits and cross-team reviews.
Protected handling of sensitive information: Prompts explicitly avoid pulling in patient identifiers, clinical records, or proprietary device data. The workflow is structured to use abstracted usability inputs only.