Agent Skill · NVIDIA NIM

digital-health-clinical-asr-setup

Stage 1 of Clinical ASR Flywheel. Use when bootstrapping a cycle: NVCF+MW disclosure, NVIDIA_API_KEY check, deps install, TTS+ASR smoke test.

Provider: NVIDIA NIM Path in repo: skills/digital-health-clinical-asr-setup/SKILL.md

Skill body

Clinical ASR Flywheel — Stage 1 (Setup)

Agent: this file is the complete Stage 1 procedure. Do not invoke find, ls, rg, or grep looking for an installer or hidden config — there isn’t one. The four sections below (outbound-data disclosure, three numbered checks, sibling hand-off) are all required reading; don’t skip any. Function IDs, env-var conventions, and the smoke-test gate are inlined further down — answer from what’s actually written here rather than from prior Riva/NVCF familiarity.

Stage 1 has one job: prove the user can reach NVIDIA’s hosted speech stack with the NVIDIA_API_KEY they currently hold. Once a single clinical sentence round-trips through Magpie TTS → Parakeet/Nemotron ASR successfully, the user is cleared to advance to /digital-health-clinical-asr-build.

The four-stage flywheel exists to drive down KER (keyword error rate) on clinical entities — drugs, procedures, anatomy, conditions, labs, roles. WER averages obscure the failures that hurt clinically; KER is what Stage 3 will measure you against.

There is no installer script anywhere in this skill — not install.sh, not setup.py, nothing hidden. Stage 1 is the three steps below: verify the key, install Python deps, run the smoke test. Anything past Stage 1 is composed from sibling skills (/data-designer, /riva-tts, the inlined Stage 3 ASR recipe, /riva-asr-custom). If a user asks “what script installs everything?”, answer from this paragraph; don’t go searching.

Outbound data flows — surface before any text or audio is sent

Two external endpoints receive data during this flywheel. The user has to acknowledge both before Stage 2 begins, against whatever data-governance policy their organization enforces. Render the table below word-for-word in your response — a paraphrase doesn’t satisfy the disclosure; the literal phrasing is what counts.

Service What gets sent When Hosted by
NVIDIA NVCF (grpc.nvcf.nvidia.com) The clinical sentences you synthesize (text), and the WAV files you transcribe (audio) Every Stage 2 TTS call and every Stage 3 ASR call NVIDIA, governed by build.nvidia.com terms
Merriam-Webster (dictionaryapi.com JSON API or the public merriam-webster.com HTML site) Individual clinical terms (drug names, anatomy, procedures), one HTTP request per term Stage 2 IPA tagging — see “Two MW paths” below for which endpoint applies Merriam-Webster, governed by their API or site terms

The data is synthetic by construction — the flywheel manufactures sentences and audio from a user-curated term list, never from real patient encounters. That said: do not feed real patient transcripts, recorded clinical audio, or any PHI through any stage. If the term list itself contains sensitive material (codename drugs, unreleased product names), the user should consult their organization’s external-API policy before proceeding. Either endpoint can be turned off:

Recommend a copy of this notice lands in the user’s workspace README.md; bring it forward on first invocation if it isn’t already there.

Purpose

Get a fresh environment ready for Stage 2. Three things to confirm: key is present, deps import cleanly, hosted stack actually answers. Close by naming which skill to run next.

The four digital-health-clinical-asr-* skills are self-contained — every TTS, ASR, IPA-tagging, and scoring recipe lives inside them; no other agent skill needs installing to run the flywheel end-to-end.

This skill takes no opinion on workspace layout. The user decides where their cycle artifacts live; data/eval_sets/cycle<N>/ is not imposed.

When to use this skill

Activate on user phrases like:

Do not activate when:

Prerequisites

Requirement Required? Why How
NVIDIA_API_KEY (nvapi-…) Required Hosted Magpie TTS + Parakeet/Nemotron ASR via NVCF Issue at https://build.nvidia.com; export NVIDIA_API_KEY=... in shell
Python ≥ 3.10 Required NeMo client, scoring, manifest tools python3 --version
nvidia-riva-client, pandas, soundfile, requests Required TTS + ASR clients, manifest I/O, MW lookup pip install nvidia-riva-client pandas soundfile requests
DICTIONARY_API_KEY Optional Merriam-Webster Medical Dictionary lookup via the JSON API (Path A in the build skill — recommended) Free key at https://dictionaryapi.com. Path B (HTML scrape of merriam-webster.com, no key, brittle) is also documented in the build skill if you can’t get a key. Without either path, Stage 2 falls through to Magpie G2P with weaker long-tail coverage.
jiwer Optional Reference WER/CER against the inlined Levenshtein implementation pip install jiwer — the eval skill includes a pure-Python fallback
(Stage 4 only) NGC_API_KEY + CUDA host + NeMo container Optional, deferred Fine-tune workload Set up inside /digital-health-clinical-asr-finetune; defer until the eval shows KER > 0.3

Instructions

Scope. This skill performs read-only environment checks: confirming a key is exported (length-only), the Python version, that libraries import, and that the hosted NVCF stack responds to a single smoke-test round-trip. It does not install system packages, modify shell rc files, write to disk outside an explicit .venv/, or attempt to authenticate with the real key value. Validate; never mutate without explicit user direction.

1a. Verify NVIDIA_API_KEY (length-only — never echo the value)

# Export NVIDIA_API_KEY in your shell — never echo or commit the value
export NVIDIA_API_KEY=nvapi-...     # from https://build.nvidia.com

# Length-only check; the key value never appears in any log
test -n "$NVIDIA_API_KEY" && echo "NVIDIA_API_KEY len=${#NVIDIA_API_KEY}"

A length of 70+ is normal. If the output is empty or shows len=0, the user must paste a key from https://build.nvidia.com. Do not print the key, even truncated. To persist across shell sessions, add the export line to your shell rc (~/.bashrc, ~/.zshrc) — or use a per-directory tool like direnv.

1b. Install Python dependencies

python3 -m venv .venv
source .venv/bin/activate
pip install nvidia-riva-client pandas soundfile requests
# optional
pip install jiwer

For Stage 4 (fine-tune) only: nemo-toolkit and Docker + NVIDIA Container Toolkit are also required. Defer those to /digital-health-clinical-asr-finetune — there is no point installing them up front if the user may never reach Stage 4.

1c. Smoke-test the hosted NVCF stack

NVIDIA_API_KEY handling — load-bearing, do not deviate:

Verify the NVIDIA_API_KEY actually works against Magpie TTS and Parakeet/Nemotron ASR before advancing. The four skills inline every recipe needed; this round-trip just confirms the API key + network path are real.

The agent harness loads the NVIDIA_API_KEY shell variable and passes it as an explicit function argument to the helpers below. The recipe code itself does not read environment variables — auditors can see exactly which API keys cross the wire.

import wave, tempfile
import riva.client

NVCF_HOST = "grpc.nvcf.nvidia.com:443"
MAGPIE_FUNCTION_ID    = "877104f7-e885-42b9-8de8-f6e4c6303969"   # Magpie TTS
PARAKEET_FUNCTION_ID  = "d3fe9151-442b-4204-a70d-5fcc597fd610"   # Parakeet TDT 0.6B v2 (offline ASR)

def auth_for(function_id: str, api_key: str) -> riva.client.Auth:
    return riva.client.Auth(
        use_ssl=True, uri=NVCF_HOST,
        metadata_args=[
            ["function-id", function_id],
            ["authorization", f"Bearer {api_key}"],
        ],
    )

def smoke_test(api_key: str) -> str:
    """Caller passes api_key (the harness reads $NVIDIA_API_KEY at the shell;
    this code never touches the environment). Returns the ASR transcript."""

    # 1. TTS: "The patient was prescribed cefazolin."
    tts = riva.client.SpeechSynthesisService(auth_for(MAGPIE_FUNCTION_ID, api_key))
    pcm = b"".join(c.audio for c in tts.synthesize_online(
        text="The patient was prescribed cefazolin.",
        voice_name="Magpie-Multilingual.EN-US.Mia",
        language_code="en-US", sample_rate_hz=16000,
    ))
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        with wave.open(f, "wb") as w:
            w.setnchannels(1); w.setsampwidth(2); w.setframerate(16000); w.writeframes(pcm)
        wav_path = f.name

    # 2. ASR: transcribe the WAV we just synthesized.
    asr = riva.client.ASRService(auth_for(PARAKEET_FUNCTION_ID, api_key))
    with open(wav_path, "rb") as f:
        audio_bytes = f.read()
    config = riva.client.RecognitionConfig(
        encoding=riva.client.AudioEncoding.LINEAR_PCM,
        sample_rate_hertz=16000, language_code="en-US",
        max_alternatives=1, enable_automatic_punctuation=True,
    )
    response = asr.offline_recognize(audio_bytes, config)
    transcript = response.results[0].alternatives[0].transcript if response.results else ""
    print(f"TTS:  The patient was prescribed cefazolin.")
    print(f"ASR:  {transcript}")
    return transcript

# Invoke from the agent (api_key sourced by the harness, not by this code):
# smoke_test(api_key="<NVIDIA_API_KEY value>")

Run the smoke test — don’t defer it. This is the gate that proves Stages 2–4 can reach the hosted stack with the user’s current key. “I can run it later” is not an acceptable completion of Stage 1; either invoke smoke_test(api_key=…) now or, if the user has explicitly opted out, log the deferral in your closing summary so they know what they’re missing.

If the transcript matches the input within ~1 token, the hosted stack is reachable and the user can advance to Stage 2. If either call fails:

1d. (Optional) Verify Merriam-Webster lookup

Two paths produce a merriam-webster-tagged manifest row in Stage 2. Pick one (or neither — Magpie G2P fall-through is a valid posture):

Remember the data-disclosure note at the top: under either path, each clinical term in your seed list goes out as an HTTP request to a Merriam-Webster endpoint.

Examples

Fresh shell, never run before. User says something like “I want to start the flywheel.” → Quote the disclosure table first, then walk through 1a → 1b → 1c in order. On a green smoke test, point them at /digital-health-clinical-asr-build and explicitly name KER as the metric Stage 3 will judge them by.

Returning user, env already up. User says “I already have the env, just confirm I’m good to go.” → Skip the venv + pip install (1b). Run only the length check (1a) and the smoke test (1c). On green, advance.

Artifacts produced

No manifest, audio, or model artifact is produced at this stage — those come at Stages 2–4.

Troubleshooting

Limitations

Next steps

Mandatory close on success: finish the Stage 1 response by pointing the user explicitly to /digital-health-clinical-asr-build and naming KER (keyword error rate) as the headline measure they’ll see at Stage 3. Both pointers are required, not optional — they place the user inside the four-stage flywheel.

References

Skill frontmatter

version: 1.1.0 author: Ben Randoing tags: clinical-asrsetupflywheelbootstrap tools: ReadWriteBashSkill license: Apache-2.0 compatibility: NVIDIA_API_KEY (required) for hosted Magpie TTS + Parakeet/Nemotron ASR via NVCF. DICTIONARY_API_KEY (optional) for Merriam-Webster pronunciation lookup. NGC_API_KEY (optional) for Stage 4 fine-tune. Python 3.10+. metadata: {"author" => "Ben Randoing ", "tags" => ["clinical-asr", "flywheel", "setup", "bootstrap"], "team" => "healthcare-tme", "domain" => "ai-ml", "stage" => 1, "next_skill" => "digital-health-clinical-asr-build"}