Agent Skill · NVIDIA NIM

digital-health-clinical-asr-build

Stage 2 of the Clinical ASR Flywheel. Use when curating clinical terms, tagging IPA, and synthesizing a NeMo manifest. NOT for scoring (use /digital-health-clinical-asr-eval).

Provider: NVIDIA NIM
Path in repo: skills/digital-health-clinical-asr-build/SKILL.md

Skill body

Clinical ASR Flywheel — Stage 2 (Build the benchmark)

⚠ Agent: read this entire SKILL.md before answering. This stage is conversational and gated. Specifically: ask the user 1–2 specialty-aware clarifying questions before proposing terms (Step 2a), walk them through the two-tier IPA pipeline (override → merriam-webster → magpie_g2p) in Step 2c, hit the explicit QA-mode audition gate in Step 2d before full Cartesian synthesis, and name KER as the headline metric they’ll see in Stage 3. Skipping any of these defeats the methodology.

You are the curate-and-synthesize stage. The user arrives from /digital-health-clinical-asr-setup and leaves with a NeMo-format manifest.jsonl plus the audio it references — both ready for scoring at /digital-health-clinical-asr-eval.

Be conversational. This is the warmest, most domain-aware step in the flywheel: you’re asking a clinician (or someone who works with them) which terms hurt today and shaping a benchmark around their reality. Ask short, focused questions. Show the user what’s being added. Don’t lecture.

Data leaves your environment — disclose this to the user before any term is sent

This stage transmits user-curated content to two external services. Surface this to the user before invoking either call:

| Service | What gets sent | When |
|---|---|---|
| Merriam-Webster (dictionaryapi.com API or merriam-webster.com public site) | One HTTP request per term in the seed list; the term goes in the URL path | Step 2c (see the MW lookup paths there) |
| NVIDIA NVCF Magpie TTS (grpc.nvcf.nvidia.com) | Each generated clinical sentence (text, plus any SSML IPA wrappers) | Steps 2d and 2e, every synthesis call |

Both endpoints expect non-PHI synthetic content — the term list you curate, the sentences /data-designer (or your fallback templates) generates from it. Do not pass real patient records, real ASR transcripts, or any PHI through this skill. If the term list itself is sensitive (proprietary drug codenames, unreleased product names, customer-confidential indications), confirm with the user that external-API transmission is acceptable under their organization’s data-governance policy before proceeding.

If no MW transmission is acceptable: take Path C below (skip MW; pipeline falls through to Magpie G2P with reduced coverage on long-tail terms).

Purpose

Curate a clinical-specialty term list, generate eval audio for it through Magpie TTS with a two-tier IPA pipeline, and write a NeMo-format manifest tagged with the clinical-extension fields (term, entity_category, ipa_source, voice_id, noise_level, context_type). The output is the input to Stage 3.

By the end the user has:

$EVAL_DIR/cycle<N>/
├── audio/<slug>.wav        synthesized clips
├── manifest.jsonl          NeMo format + clinical extension
├── term_seed.csv           the curated input
└── pronunciation_overrides.csv   appendable across cycles

($EVAL_DIR is the user’s own choice — this skill does not impose a layout. The structure above is a recommendation, not a requirement.)
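As a rough illustration of the manifest layout described under Purpose, one row could be assembled as below. The three leading keys (audio_filepath, duration, text) are the usual NeMo manifest fields; the helper name make_manifest_line and every sample value (path, duration, voice id) are hypothetical. The authoritative schema remains references/manifest-schema.md.

```python
import json

# Fixed vocabularies quoted from this skill's text.
ENTITY_CATEGORIES = {"drug", "procedure", "anatomy", "condition", "lab", "role"}
IPA_SOURCES = {"override", "merriam-webster", "magpie_g2p"}


def make_manifest_line(audio_filepath, duration, text, term, entity_category,
                       ipa_source, voice_id, noise_level, context_type):
    """Return one JSON line: NeMo fields first, then the clinical extension."""
    if entity_category not in ENTITY_CATEGORIES:
        raise ValueError(f"unknown entity_category: {entity_category!r}")
    if ipa_source not in IPA_SOURCES:
        raise ValueError(f"unknown ipa_source: {ipa_source!r}")
    row = {
        "audio_filepath": audio_filepath,
        "duration": round(float(duration), 3),
        "text": text,
        "term": term,
        "entity_category": entity_category,
        "ipa_source": ipa_source,
        "voice_id": voice_id,
        "noise_level": noise_level,
        "context_type": context_type,
    }
    # ensure_ascii=False keeps any non-ASCII text readable in the file.
    return json.dumps(row, ensure_ascii=False)


# Sample values below are placeholders, not real voice ids or paths.
line = make_manifest_line(
    "audio/cefazolin-voice-a-clean-dictation.wav", 4.2,
    "Start cefazolin two grams before the incision.",
    "cefazolin", "drug", "override", "voice_a", "clean", "dictation",
)
```

Rejecting unknown category or ipa_source values at write time is one way to keep the Stage 3 breakdowns clean; whether to fail or merely warn is a design choice left to the operator.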

When to use this skill

Activate on user phrases like:

Do not activate when (also: if the message mentions auth, API key, gRPC, streaming, riva-build, NIM deploy, NGC, or Docker, route per the bullets below and stop):

Prerequisites

Instructions

2a. Specialty interview → term_seed.csv

Ask one question at a time. The goal is to surface 4–10 candidate terms with the right entity_category, not to write a textbook.

Questions, in order:

  1. What specialty / workflow is this for? (oncology dictation, ICU handoff, psych intake, ortho post-op, …)
  2. What ASR failure modes have you seen? — drug names, multi-word procedures, abbreviations, compound conditions.
  3. Which terms come up daily vs which are the hard ones? — daily-common terms become the sanity baseline; daily-hard terms become the signal.

Propose 4–10 candidate terms with entity_category. Confirm with the user before writing. Then write term_seed.csv:

term,entity_category
cefazolin,drug
acetabular reamer,procedure
tibial plateau,anatomy
femoroacetabular impingement,condition
hemoglobin a1c,lab
respiratory therapist,role

The category vocabulary is fixed. KER keys off it. Allowed values:

drug | procedure | anatomy | condition | lab | role

If the user proposes a new category, push back: either it maps to one of the six, or the methodology needs a deliberate extension (which is a future cycle’s job, not a one-off ad-hoc add).
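A minimal sketch of enforcing the fixed category vocabulary when loading term_seed.csv. The function name load_term_seed and the error-message wording are assumptions for illustration; only the two column names and the six categories come from this skill.

```python
import csv
import io

ALLOWED_CATEGORIES = ("drug", "procedure", "anatomy", "condition", "lab", "role")


def load_term_seed(fileobj):
    """Split a term_seed.csv stream into valid rows and readable errors."""
    rows, errors = [], []
    # Data starts on line 2 because line 1 is the header.
    for line_no, rec in enumerate(csv.DictReader(fileobj), start=2):
        term = (rec.get("term") or "").strip()
        category = (rec.get("entity_category") or "").strip()
        if not term:
            errors.append(f"line {line_no}: empty term")
        elif category not in ALLOWED_CATEGORIES:
            errors.append(
                f"line {line_no}: {term!r} has category {category!r}; "
                f"expected one of {', '.join(ALLOWED_CATEGORIES)}"
            )
        else:
            rows.append({"term": term, "entity_category": category})
    return rows, errors


sample = io.StringIO(
    "term,entity_category\n"
    "cefazolin,drug\n"
    "tibial plateau,anatomy\n"
    "stat dose,urgency\n"
)
seed_rows, seed_errors = load_term_seed(sample)
```

Returning the errors instead of raising lets the agent show every rejected row to the user in one message and ask how each should be remapped.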

2b. Sentence generation via /data-designer

Brief /data-designer with:

For each row in term_seed.csv, generate one or more natural English sentences embedding term in a way that fits the row’s entity_category. Output schema: {term, entity_category, sentence, context_type}. Generate 3–5 context_type variants per term. Initial context_type vocabulary: dictation, handoff, chart_note, history. Sentence length 10–30 words.

The output of this step is a per-term sentence variants file. Any filename is fine — pick one and use it consistently across the cycle directory.

Template fallback. If /data-designer is unavailable, use a 4-template fallback (one per context_type) and substitute term mechanically. Tag those rows in the manifest (context_type is set, the sentence is just less natural) so a future cycle can regenerate.
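The template fallback might look like the sketch below. The four template strings are invented placeholders (one per context_type, each inside the 10 to 30 word range), and the sentence_source key is a hypothetical way to tag template rows for later regeneration; the skill does not name that field.

```python
# Hypothetical templates: one per context_type, entity-agnostic on purpose.
TEMPLATES = {
    "dictation": "The patient was evaluated today and the plan now includes {term} as discussed.",
    "handoff": "For the next shift, please note that {term} was reviewed with the team overnight.",
    "chart_note": "Chart note for this visit documents {term} along with the current assessment and plan.",
    "history": "The history obtained at intake mentions {term} on at least one prior occasion.",
}


def fallback_sentences(seed_rows):
    """Substitute each term mechanically into every template."""
    out = []
    for row in seed_rows:
        for context_type, template in TEMPLATES.items():
            out.append({
                "term": row["term"],
                "entity_category": row["entity_category"],
                "sentence": template.format(term=row["term"]),
                "context_type": context_type,
                # Assumed tag so a future cycle can find and regenerate these.
                "sentence_source": "template",
            })
    return out


variants = fallback_sentences([{"term": "cefazolin", "entity_category": "drug"}])
```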

2c. Two-tier IPA tagging (the load-bearing quality lever)

Every term passes through the pipeline in this order: two explicit IPA tiers (override, then Merriam-Webster), followed by a G2P fall-through.

  1. Override: pronunciation_overrides.csv carries verified IPA the team has audited. If term matches a row here, the override wins.
  2. Merriam-Webster: for un-overridden terms, fetch the MW respelling, convert it to IPA, and validate it against Magpie’s en-US phoneme set. If both the conversion and the validation succeed, the term is tagged merriam-webster.
  3. Magpie G2P (fall-through): if neither the override nor MW produces a valid IPA, the plain text is passed to Magpie’s neural G2P at synthesis time. The row is tagged magpie_g2p.

Every manifest row carries the ipa_source tag (override | merriam-webster | magpie_g2p). The delta between merriam-webster and magpie_g2p rows in the Stage 3 leaderboard is the proof the pronunciation strategy is working — call it out explicitly when you produce the leaderboard.
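The resolution order can be sketched as a single function. The names resolve_ipa, mw_lookup, and is_valid_ipa are hypothetical; the real MW recipes and the Magpie validation call live in references/pronunciation-pipeline.md, so both are injected here as plain callables.

```python
def resolve_ipa(term, overrides, mw_lookup=None, is_valid_ipa=lambda ipa: True):
    """Return (ipa, ipa_source) following override -> merriam-webster -> magpie_g2p.

    overrides:    dict of lowercased term -> audited IPA
    mw_lookup:    callable(term) -> IPA string or None; None skips MW entirely
    is_valid_ipa: callable(ipa) -> bool, standing in for phoneme-set validation
    """
    key = term.strip().lower()
    if key in overrides:
        return overrides[key], "override"
    if mw_lookup is not None:
        try:
            ipa = mw_lookup(term)
        except Exception:
            ipa = None  # a failed lookup falls through instead of aborting the build
        if ipa and is_valid_ipa(ipa):
            return ipa, "merriam-webster"
    # No IPA: plain text goes to Magpie's own G2P at synthesis time.
    return None, "magpie_g2p"


overrides = {"cefazolin": "sɛfəˈzoʊlɪn"}
tagged = resolve_ipa("Cefazolin", overrides)
```

Passing mw_lookup=None models the case where no Merriam-Webster transmission is acceptable: every term without an override lands in magpie_g2p.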

Three MW lookup paths. Paths A and B tag the row merriam-webster; Path C skips MW, so its rows fall through to magpie_g2p.

  A. dictionaryapi.com JSON API with DICTIONARY_API_KEY (free at dictionaryapi.com). Recommended for standalone use.
  B. HTML scrape of merriam-webster.com. No key needed, but brittle to site HTML changes.
  C. Skip MW and fall through to Magpie G2P, with weaker long-tail coverage.

The recipes for Paths A and B and the full respelling→IPA table live in references/pronunciation-pipeline.md. The Path A function takes api_key as an argument (it never reads os.environ); pass None to skip MW.

pronunciation_overrides.csv schema:

term,ipa,verified_by,verified_at,notes
cefazolin,sɛfəˈzoʊlɪn,brandoing,2026-05-13,confirmed against MW respelling + ear test

Append-only across cycles. Re-running the build later picks up new entries automatically.
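One possible way to honor the append-only rule is sketched below. The helper names are hypothetical, and letting a later row for the same term supersede an earlier one is an assumption about how corrections should be handled, not something this skill specifies.

```python
import csv
import os

FIELDS = ["term", "ipa", "verified_by", "verified_at", "notes"]


def append_override(path, term, ipa, verified_by, verified_at, notes=""):
    """Append one audited row; write the header only when the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "term": term, "ipa": ipa, "verified_by": verified_by,
            "verified_at": verified_at, "notes": notes,
        })


def load_overrides(path):
    """Read term -> IPA; a later row for the same term wins (assumed policy)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {r["term"].strip().lower(): r["ipa"] for r in csv.DictReader(handle)}
```

Opening in append mode never rewrites earlier rows, so the audit trail of who verified what, and when, survives across cycles.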

2d. QA-mode synthesis (do not skip this gate)

Before running the full Cartesian product, synthesize one wav per term with: first voice, clean noise, default context. Audition each clip with the user.
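Selecting the QA-mode subset could be done roughly as follows. The function name qa_rows is hypothetical, and treating dictation as the "default context" is an assumption, since the skill does not say which context_type is the default.

```python
def qa_rows(sentences, voices, default_context="dictation"):
    """Pick one sentence per term: first voice, clean noise, default context.

    sentences: dicts with at least term, sentence, context_type
    voices:    ordered list of voice ids; only the first is used for QA
    """
    picked = {}
    for sentence in sentences:
        current = picked.get(sentence["term"])
        prefer = (
            current is None
            or (current["context_type"] != default_context
                and sentence["context_type"] == default_context)
        )
        if prefer:
            picked[sentence["term"]] = sentence
    # Copy each pick so the caller's sentence dicts are left untouched.
    return [dict(s, voice_id=voices[0], noise_level="clean") for s in picked.values()]


qa = qa_rows(
    [
        {"term": "cefazolin", "sentence": "s1", "context_type": "handoff"},
        {"term": "cefazolin", "sentence": "s2", "context_type": "dictation"},
        {"term": "tibial plateau", "sentence": "s3", "context_type": "history"},
    ],
    voices=["voice_a", "voice_b"],  # placeholder voice ids
)
```

A term with no sentence in the default context still gets a clip (its first available variant), so no term silently skips the audition.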

For every term tagged magpie_g2p, propose an IPA candidate using clinical suffix patterns and validate against Magpie’s en-US phoneme set before suggesting:

| Suffix | Stress pattern (example) |
|---|---|
| -mycin / -micin | …ˈmaɪsɪn (vancomycin, gentamicin) |
| -prazole | …ˈpreɪzoʊl (esomeprazole, omeprazole) |
| -statin | …ˈstætɪn (atorvastatin, rosuvastatin) |
| -sartan | …ˈsɑːrtən (losartan, valsartan) |
| -azole | …ˈeɪzoʊl (fluconazole, ketoconazole) |
| -cillin | …ˈsɪlɪn (amoxicillin, piperacillin) |
| -parin | …ˈpɛərɪn (enoxaparin, heparin) |
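The suffix table above can serve as a cheap lookup before any live probe. The sketch below only returns the suffix tail as a hint; the name suffix_hint is hypothetical, the "micin" entry is added as a spelling variant so that gentamicin matches, and a full candidate IPA still needs the stem plus validation against Magpie.

```python
# IPA tails copied from the suffix table; "micin" is an added spelling variant.
SUFFIX_IPA = {
    "mycin": "ˈmaɪsɪn",
    "micin": "ˈmaɪsɪn",
    "prazole": "ˈpreɪzoʊl",
    "statin": "ˈstætɪn",
    "sartan": "ˈsɑːrtən",
    "azole": "ˈeɪzoʊl",
    "cillin": "ˈsɪlɪn",
    "parin": "ˈpɛərɪn",
}


def suffix_hint(term):
    """Return (suffix, ipa_tail) for the longest matching suffix, else None."""
    word = term.strip().lower()
    # Longest first, so "prazole" is tried before the shorter "azole".
    for suffix in sorted(SUFFIX_IPA, key=len, reverse=True):
        if word.endswith(suffix):
            return suffix, SUFFIX_IPA[suffix]
    return None
```

Terms with no match (cisplatin, or the -mab biologics) return None and need a hand-built candidate.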

Phoneme-validation pattern — live-probe Magpie’s en-US neural G2P with a candidate IPA. If Magpie accepts the SSML, the IPA is in its inventory. Use the suffix patterns above as a pre-filter (cheap heuristic) and the live probe to confirm before committing to an override. The magpie_validates_ipa(ipa, api_key, voice_id) recipe — a minimal NVCF gRPC synthesis call that returns True/False fail-closed — is in references/pronunciation-pipeline.md.

Call it once per candidate IPA before showing it to the user. On user approval, append the verified IPA to pronunciation_overrides.csv. The row’s ipa_source flips from magpie_g2p to override on the next manifest generation.

HITL audition gate before Step 2e — fail-closed. Do not synthesize the full Cartesian product, do not promote any staged IPA candidate to pronunciation_overrides.csv, and do not advance to Stage 3 until one of the following has happened explicitly in conversation:

  1. The user confirms they have auditioned the QA clips and reports their verdict per clip (or per bucket: “the MW set sounds fine”, “fix pembrolizumab”, etc.). Provide the afplay (macOS) or paplay/aplay (Linux) commands so the user can play them — then halt and wait for their reply after listening. Paper-only approval via an AskUserQuestion prompt — clicking “Promote all” or “Lock in” without auditioning — does not satisfy this gate. Magpie-validating an IPA proves it’s in the phoneme inventory; it does not prove it matches the intended pronunciation. Only the user’s ears do that.
  2. The user explicitly opts to skip audition for this cycle, in deliberate language (e.g. “skip audition, accept the risk that mispronunciations may dilute the Stage 3 KER signal — log it as a cycle-N caveat”), not as a side-effect of a single click-through. Record the skip in a cycle-level note (e.g. eval/cycle<N>/cycle_notes.md) so a future operator can see the audition was deferred.

Magpie NVCF rate-limits aggressively on >100-row jobs, and a do-over costs both API credits and clock time — but the larger risk is shipping a manifest with mispronounced reference audio that quietly corrupts the Stage 3 KER signal. Time spent auditioning is cheaper than re-running the cycle.

2e. Full benchmark generation

After pronunciations are locked, generate the full Cartesian product |terms| × |voices| × |noise_levels| × |context_types|. Defaults: 2–4 Magpie en-US voices (Mia/Jason/Ray), [clean, snr_15db, snr_5db], [dictation, handoff, chart_note, history].
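The row plan for the Cartesian product can be enumerated before any synthesis call, which also makes the row count visible up front. In this sketch build_plan, slugify, and the slug scheme itself are hypothetical; the 100-row threshold is the one this skill uses for its rate-limit warning.

```python
import re
from itertools import product


def slugify(*parts):
    """Hypothetical slug scheme: lowercase, non-alphanumerics collapsed to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", "-".join(parts).lower()).strip("-")


def build_plan(terms, voices, noise_levels, context_types, warn_above=100):
    """Enumerate |terms| x |voices| x |noise_levels| x |context_types| rows."""
    rows = [
        {"term": t, "voice_id": v, "noise_level": n, "context_type": c,
         "slug": slugify(t, v, n, c)}
        for t, v, n, c in product(terms, voices, noise_levels, context_types)
    ]
    warning = None
    if len(rows) > warn_above:
        warning = (f"{len(rows)} rows exceeds {warn_above}: expect some "
                   "RESOURCE_EXHAUSTED drops and plan to re-run them")
    return rows, warning


# Placeholder voice ids; 10 x 2 x 2 x 3 reproduces the Scenario A row count.
plan, plan_warning = build_plan(
    [f"term{i}" for i in range(10)],
    ["voice_a", "voice_b"],
    ["clean", "snr_15db"],
    ["dictation", "handoff", "chart_note"],
)
```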

Self-contained synthesis — no /read-aloud required. The synthesize_row(row, all_overrides, out_dir, api_key) recipe — opens an NVCF gRPC stream, wraps overrides into SSML via render_sentence_with_overrides, writes 16-bit mono PCM to <out_dir>/audio/<slug>.wav — is in references/pronunciation-pipeline.md (§Synthesis call). Key invariant: all_overrides carries every entry from pronunciation_overrides.csv (including context-word overrides like intravenously) so the renderer wraps any override whose verbatim text appears in row['text']. Wrapping only row['term'] silently drops context-word overrides.
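To illustrate the invariant about all_overrides, here is one way a renderer with the same name could wrap every override found in the sentence. It is not the recipe from references/pronunciation-pipeline.md: the standard SSML phoneme element with alphabet="ipa" is assumed to be what Magpie accepts, and case-insensitive whole-word matching is an added assumption.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def render_sentence_with_overrides(text, all_overrides):
    """Wrap each override found in text in an SSML phoneme tag.

    all_overrides: dict of term -> IPA covering every row of
    pronunciation_overrides.csv, not just the row's target term.
    """
    if not all_overrides:
        return f"<speak>{escape(text)}</speak>"
    lookup = {k.lower(): v for k, v in all_overrides.items()}
    # Longest keys first so multi-word terms beat their own substrings.
    keys = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)",
        re.IGNORECASE,
    )
    out, pos = [], 0
    for match in pattern.finditer(text):
        ipa = lookup.get(match.group(0).lower())
        if ipa is None:
            continue  # defensive: leave unexpected matches as plain text
        out.append(escape(text[pos:match.start()]))
        out.append(
            f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>'
            f"{escape(match.group(0))}</phoneme>"
        )
        pos = match.end()
    out.append(escape(text[pos:]))
    return "<speak>" + "".join(out) + "</speak>"


ssml = render_sentence_with_overrides(
    "Give cefazolin now.", {"cefazolin": "sɛfəˈzoʊlɪn"}
)
# ssml == '<speak>Give <phoneme alphabet="ipa" ph="sɛfəˈzoʊlɪn">cefazolin</phoneme> now.</speak>'
```

Because the function scans the whole sentence against the full override dict, a context word with an override gets wrapped even when it is not the row's target term.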

Noise injection (clean → snr_15db → snr_5db) and the manifest schema (NeMo canonical fields + clinical extension, plus pre-flight schema and audio-existence checks) both live in references/manifest-schema.md.
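For readers who want the arithmetic behind the noise levels, mixing at a target SNR reduces to scaling the noise so that the speech-to-noise power ratio equals 10^(SNR/10). The sketch below is a generic pure-Python version, not the recipe in references/manifest-schema.md; the function name, the float-sample representation, and the choice of noise source are all assumptions.

```python
import math

# Mapping from this skill's noise_level labels to target SNR in dB.
NOISE_LEVELS = {"clean": None, "snr_15db": 15.0, "snr_5db": 5.0}


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that speech power / noise power hits snr_db.

    speech, noise: equal-length sequences of floats in [-1.0, 1.0].
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_speech == 0 or p_noise == 0:
        return list(speech)  # nothing to scale against
    # Solve p_speech / (gain^2 * p_noise) = 10^(snr_db / 10) for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    # Clamp to the valid sample range; heavy clipping would shift the real SNR.
    return [max(-1.0, min(1.0, s + gain * n)) for s, n in zip(speech, noise)]
```

Converting the 16-bit PCM written by the synthesis step to floats and back is left out here to keep the example focused on the gain calculation.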

Warn when product > 100 rows. Magpie NVCF rate-limits with ~5–10% RESOURCE_EXHAUSTED drops on big runs. Re-run the dropped rows.
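Re-running the dropped rows can be automated with a small retry loop. In this sketch ResourceExhausted is a hypothetical stand-in for whatever the gRPC client raises on a RESOURCE_EXHAUSTED status, and synthesize is any callable that renders one row; the pass count and backoff values are arbitrary.

```python
import time


class ResourceExhausted(Exception):
    """Hypothetical stand-in for a gRPC RESOURCE_EXHAUSTED failure."""


def synthesize_with_retries(rows, synthesize, max_passes=3, backoff_s=2.0,
                            sleep=time.sleep):
    """Run synthesize(row) over rows, re-running only the dropped ones.

    Returns (done, still_failed) so the caller can report what is missing.
    """
    pending, done = list(rows), []
    for attempt in range(max_passes):
        dropped = []
        for row in pending:
            try:
                synthesize(row)
                done.append(row)
            except ResourceExhausted:
                dropped.append(row)
        if not dropped:
            return done, []
        pending = dropped
        if attempt < max_passes - 1:
            sleep(backoff_s * (2 ** attempt))  # back off before the next pass
    return done, pending
```

Rows left in the second return value after the final pass should be reported to the user and kept out of the manifest, so the audio-existence check does not trip on missing files.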

Stage 2 completion checklist

Don’t consider Stage 2 done until all five sub-steps ran. Agents commonly stop after 2a or 2b; the goal is a synthesized manifest plus a hand-off:

Writes go only into the user-chosen $EVAL_DIR/cycle<N>/. Don’t write elsewhere, modify env, or install packages — those belong to /digital-health-clinical-asr-setup.

Examples

Scenario A: fresh oncology benchmark. User: “We’re seeing chemo drug names mistranscribed. Where do I start?” → Step 2a: confirm the specialty is oncology and ask which drug classes matter (immunotherapy biologics, platinum agents, taxanes). Propose ~10 candidates: cisplatin, paclitaxel, pembrolizumab, nivolumab, carboplatin, docetaxel, bevacizumab, trastuzumab, cetuximab, pemetrexed. Write term_seed.csv with entity_category=drug for every row. Step 2b: brief /data-designer for 4 context variants per term (40 sentences). Step 2c: run the MW lookup for each term; biologics like pembrolizumab will likely fall through to magpie_g2p, while platinum agents will likely hit MW. Step 2d: synthesize one QA wav per term, walk the user through the clips for pembrolizumab and the other biologics, and propose IPA candidates using -mab suffix stress patterns. Step 2e: on approval, run 10 terms × 2 voices × 2 noise levels × 3 contexts = 120 rows. That exceeds 100 rows, so warn the user about rate-limit drops before starting.

Scenario B: appending to an existing cycle. User: “I have a cycle-1 manifest and I want to add 5 more procedures.” → Run all five steps, scoped to the new terms only: 2a (specialty interview for the new terms), 2b (sentence generation for the additions), 2c (IPA pipeline for the additions), 2d (audition the new terms), and 2e (synthesize only the new term rows). Append to the existing manifest.jsonl. Do not regenerate audio for existing terms; cycle isolation is intentional so leaderboards diff cycle N vs cycle N+1 cleanly.
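For the append case in Scenario B, a guard like the following keeps existing terms untouched. The function name rows_to_append is hypothetical, and keying on the term field alone is an assumption; a stricter check could compare the full (term, voice_id, noise_level, context_type) tuple.

```python
import json


def rows_to_append(existing_manifest_lines, candidate_rows):
    """Keep only candidate rows whose term is absent from the manifest."""
    existing_terms = {
        json.loads(line)["term"]
        for line in existing_manifest_lines
        if line.strip()  # tolerate blank lines at the end of the file
    }
    return [row for row in candidate_rows if row["term"] not in existing_terms]


existing = ['{"term": "cefazolin", "text": "..."}', ""]
candidates = [
    {"term": "cefazolin", "text": "duplicate, must be skipped"},
    {"term": "acetabular reamer", "text": "new procedure row"},
]
new_rows = rows_to_append(existing, candidates)
```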

Artifacts produced

Troubleshooting

For anything not in this list, identify which upstream skill is implicated and route there. The digital-health-clinical-asr-build skill owns the methodology, not the TTS or DataDesigner internals.

Limitations

Next steps

References

Skill frontmatter

version: 1.1.0
author: Ben Randoing
tags: clinical-asr, dataset, ipa, magpie, nemo-manifest, flywheel
tools: Read, Write, Bash, Skill
license: Apache-2.0
compatibility: NVIDIA_API_KEY (required) for hosted Magpie TTS via NVCF. DICTIONARY_API_KEY (optional) for Merriam-Webster Medical Dictionary lookup. Stage 1 (/digital-health-clinical-asr-setup) must have been completed first. All TTS, IPA, and synthesis recipes are inlined; no sibling agent skill required.
metadata: {"author" => "Ben Randoing", "tags" => ["clinical-asr", "flywheel", "dataset", "ipa", "magpie"], "team" => "healthcare-tme", "domain" => "ai-ml", "stage" => 2, "previous_skill" => "digital-health-clinical-asr-setup", "next_skill" => "digital-health-clinical-asr-eval"}