Agent Skill · NVIDIA NIM

digital-health-clinical-asr-finetune

Stage 4 of the Clinical ASR Flywheel. Use when priority KER is above 0.3 to run stock NeMo SFT on Parakeet TDT v2 and offline cycle N+1 re-eval. NOT for generic word boosting (use /finetune-asr).

Provider: NVIDIA NIM
Path in repo: skills/digital-health-clinical-asr-finetune/SKILL.md

Skill body

Clinical ASR Flywheel — Stage 4 (Fine-tune)

⚠ Agent: read this entire SKILL.md before answering. The Critical-workflow-rules section, the base-model table (§4c), the stock-NeMo-SFT recipe (§4d), and the cycle-N+1 decision table (§4e) are all load-bearing — the do-not-SFT bases and broken-adapter warnings live there.

Agent: this file is self-contained. The Stage 4 gate criteria, base-model recommendation, hyperparameter table, container invocation pattern, and cycle-N+1 decision table are all below. Do not run file-discovery commands or open references/stage4-finetune.md before answering methodology questions — the reference is deep-dive material, not required reading. Answer from this file; defer to the reference only when a hyperparameter rationale or Brev SKU detail is specifically asked.

You are the adapt-and-measure stage. The user arrives from /digital-health-clinical-asr-eval with a manifest, a baseline KER number, and the decision-tree’s recommendation that fine-tuning is worth the GPU time. You run stock NeMo SFT, do an offline cycle N+1 re-eval to measure that the loop closed, and optionally hand the resulting .nemo to /riva-asr-custom for production serving.

The cycle KER from offline eval is the measurement that closes the loop. Riva NIM deploy validates serving (latency, streaming, scale), not model quality.

Empirically verified on the reference manifest (39 rows, Parakeet TDT v2): Baseline KER 0.513 → after 3 epochs of stock SFT: 0.128 (-75% relative). Drug names: 0.857 → 0.214. Conditions: 0.500 → 0.000. Procedures: 0.250 → 0.000.

Critical workflow rules (apply on every activation)

Surface these facts in any response, even if the user asks a narrow question:

  1. Read this entire SKILL.md before answering. The base-model selection table, hyperparameter values, and the cycle-N+1 decision table are below — they are the load-bearing parts.
  2. Verified result: Parakeet TDT v2 with the stock SFT recipe in §4d achieves KER 0.513 → 0.128 (−75% relative) in 3 epochs on the reference manifest. Cite this when the user asks whether SFT will help.
  3. Recipe is /opt/NeMo/examples/asr/speech_to_text_finetune.py inside nvcr.io/nvidia/nemo:25.11.01. Stock script, no patches, no custom adapter logic. The adapter-mixin path is broken on TDT/RNNT decoders (72 NaN tensors at any LR) — do not propose it.
  4. Recommended base is nvidia/parakeet-tdt-0.6b-v2. The full base-model table is in §4c.
  5. Do NOT fine-tune nvidia/nemotron-speech-streaming-en-0.6b. The streaming NVCF function’s SFT path is broken (UNK collapse on validation after step 1). For streaming serving at deploy time, Riva chunks a non-streaming base just fine. Warn the user proactively if they propose it.
  6. Gate the recommendation. Stage 4 only fires when priority-category KER > 0.3 and manifest has ≥ 100 rows (≥ 5 per priority category). Below those thresholds, route back to /digital-health-clinical-asr-build to grow the manifest first.
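The gate in rule 6 can be sketched as a small check. This is a minimal illustration, not part of the skill: the function name `stage4_gate`, its arguments, and the category labels in the example are hypothetical, and it assumes the gate fires when at least one priority category exceeds the KER threshold.

```python
from collections import Counter


def stage4_gate(rows, ker_by_category, priority_categories,
                ker_threshold=0.3, min_rows=100, min_rows_per_category=5):
    """Return (ok, reasons); ok is True only when every gate condition holds."""
    reasons = []
    counts = Counter(r["entity_category"] for r in rows)

    # Assumption: one priority category above the threshold is enough to fire.
    if not any(ker_by_category.get(c, 0.0) > ker_threshold for c in priority_categories):
        reasons.append(f"no priority category has KER > {ker_threshold}")
    if len(rows) < min_rows:
        reasons.append(f"manifest has {len(rows)} rows, need >= {min_rows}")
    for c in priority_categories:
        if counts[c] < min_rows_per_category:
            reasons.append(f"category {c!r} has {counts[c]} rows, need >= {min_rows_per_category}")
    return (not reasons, reasons)


# Toy manifest: 130 rows, drug KER 0.42 (the shape of Scenario A in the examples).
rows = [{"entity_category": "drug"}] * 100 + [{"entity_category": "condition"}] * 30
ok, reasons = stage4_gate(rows, {"drug": 0.42}, ["drug"])
print(ok, reasons)
```

When `ok` is False, the `reasons` list gives the message to surface before routing back to /digital-health-clinical-asr-build.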

Purpose

Run stock NeMo SFT (no custom adapter logic, no patches) in nvcr.io/nvidia/nemo:25.11.01 against a term-aware row-disjoint train/val split, produce a .nemo model, and re-eval offline as cycle N+1. Decide based on the cycle-N → cycle-N+1 KER delta whether to keep the model, grow the manifest, or accept that fine-tuning didn’t help. Optionally hand the .nemo to /riva-asr-custom for NIM deploy.

When to use this skill

Activate on user phrases like:

Do not activate when:

Prerequisites

Instructions

4a. Provision a GPU host (skip if you already have one)

Stage 4 needs a CUDA host with ≥ 16 GB VRAM (24 GB comfortable). If you have a local one that fits, skip this section. If not, use Brev — NVIDIA’s per-second-billed GPU host service. Recommended SKU: L40S 48 GB.

Cost disclosure — surface this to the user before any brev create. L40S 48 GB runs ~$1.50/hr at time of writing; a 3-epoch SFT run on a 100-row manifest finishes in 15–30 minutes (~$0.40–$0.75 of compute). The real risk is forgetting to stop the instance — overnight idle on L40S is ~$36, a week of idle is ~$250. Mitigations: (a) always wrap the workflow in a script that ends with brev stop; (b) set a calendar reminder when you start; (c) brev delete instead of brev stop if you don’t need to keep the disk (stop keeps disk at $0.10/GB-month — 200 GB ≈ $20/month of latent cost). Confirm the user accepts the per-hour cost shape and the idle risk before spinning anything up.
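The figures in the cost disclosure follow from simple arithmetic, sketched here so they can be recomputed if the rate changes. The rates are the ones quoted above at time of writing; the helper name `run_cost` is hypothetical. Note that the ~$36 idle figure corresponds to a full 24 hours at the quoted rate.

```python
HOURLY_USD = 1.50             # L40S 48 GB rate quoted above; check current pricing
DISK_USD_PER_GB_MONTH = 0.10  # stopped-instance disk rate quoted above


def run_cost(minutes, hourly=HOURLY_USD):
    """Compute cost in USD for an instance that runs for `minutes`."""
    return hourly * minutes / 60


print(f"15 min SFT run: ${run_cost(15):.2f}")
print(f"30 min SFT run: ${run_cost(30):.2f}")
print(f"24 h idle:      ${run_cost(24 * 60):.2f}")
print(f"7 days idle:    ${run_cost(7 * 24 * 60):.2f}")
print(f"200 GB disk:    ${200 * DISK_USD_PER_GB_MONTH:.2f}/month")
```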

Full setup walkthrough — CLI install (download-then-run, not curl-pipe), SKU choice, disk sizing, SSH config — is in references/stage4-finetune.md (§Brev provisioning).

Short happy-path once the CLI is installed. Do not run brev create until the user has explicitly typed YES at the confirmation prompt below — the gate is mandatory, not advisory, because everything after it bills against the user’s account by the second:

brev login                                  # browser auth

# Mandatory cost-confirmation gate — do NOT skip or auto-answer this.
echo "About to provision: digital-health-clinical-asr-sft on L40S 48 GB."
echo "Cost shape: ~\$1.50/hr while running; ~\$36/night if left idle; ~\$20/mo disk if you 'stop' instead of 'delete'."
read -rp "Type YES to provision (anything else cancels): " confirm
[ "$confirm" = "YES" ] || { echo "Cancelled — no GPU instance was created."; exit 1; }

brev create digital-health-clinical-asr-sft \
  --gpu l40s:1 --image ubuntu-22-04-cuda-12-4 --disk 200gi
brev ssh-config                             # writes ~/.ssh/config entries
rsync -avz ./cycle1/ digital-health-clinical-asr-sft:~/cycle1/
brev shell digital-health-clinical-asr-sft            # drops into the instance
nvidia-smi                                  # confirm GPU
docker pull nvcr.io/nvidia/nemo:25.11.01    # ~12 GB, once per instance

When done, always halt billing: brev stop digital-health-clinical-asr-sft (keeps disk) or brev delete digital-health-clinical-asr-sft (frees it). For path rewriting laptop → Brev → NeMo container, see references/container-paths.md.

4b. Term-aware train/val split

Row-disjoint, stratified by entity_category, default val fraction 0.2.

The same term may appear on both sides via different rows (different voice, context, noise). That’s expected and desirable — it measures acoustic + contextual robustness on the trained vocabulary, which is the standard ASR adaptation metric.

Singleton categories (one row total) get forced to train with a warning. If any priority category has < 5 rows, bail to /digital-health-clinical-asr-build — held-out validation will be too noisy to attribute movement.

Sketch:

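import json
import random
from collections import defaultdict

random.seed(42)  # fixed seed so the split is reproducible across cycles

# Load manifest.jsonl into a list of dicts, one per row.
with open("manifest.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

# Group rows by category so each category is split independently (stratified).
by_cat = defaultdict(list)
for r in rows:
    by_cat[r["entity_category"]].append(r)

train, val = [], []
for cat, cat_rows in by_cat.items():
    random.shuffle(cat_rows)
    if len(cat_rows) < 2:
        # Singleton: nothing to hold out, so keep it in train and warn.
        train.extend(cat_rows)
        print(f"warning: singleton category {cat}, forced to train")
        continue
    n_val = max(1, int(0.2 * len(cat_rows)))  # at least one validation row per category
    val.extend(cat_rows[:n_val])
    train.extend(cat_rows[n_val:])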

Write train.jsonl and validation.jsonl alongside the manifest. These are the inputs to speech_to_text_finetune.py.
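Writing the two files and confirming the split is row-disjoint could look like the following. It is a sketch under stated assumptions: rows are identified by the standard NeMo manifest field `audio_filepath`, the helper names `write_jsonl` and `assert_row_disjoint` are hypothetical, and the toy rows only stand in for the real `train` and `val` lists.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, rows):
    """Write one JSON object per line, the layout NeMo manifests use."""
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")


def assert_row_disjoint(train, val, key="audio_filepath"):
    """Raise if any row (identified by `key`) landed in both splits."""
    overlap = {r[key] for r in train} & {r[key] for r in val}
    if overlap:
        raise ValueError(f"{len(overlap)} rows appear in both splits: {sorted(overlap)[:3]}")


# Toy rows standing in for the output of the split (illustrative values only).
train = [
    {"audio_filepath": "audio/row1.wav", "text": "toy transcript one", "entity_category": "drug"},
    {"audio_filepath": "audio/row2.wav", "text": "toy transcript two", "entity_category": "drug"},
]
val = [
    {"audio_filepath": "audio/row3.wav", "text": "toy transcript three", "entity_category": "drug"},
]

assert_row_disjoint(train, val)
out_dir = Path(tempfile.mkdtemp())  # in practice: the directory holding the manifest
write_jsonl(out_dir / "train.jsonl", train)
write_jsonl(out_dir / "validation.jsonl", val)
print(f"wrote {len(train)} train rows and {len(val)} validation rows to {out_dir}")
```

The same term may still appear on both sides through different rows; the check only rejects the same audio row appearing twice.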

4c. Choose the base model

| Base | SFT viability | Notes |
| --- | --- | --- |
| nvidia/parakeet-tdt-0.6b-v2 | Empirically verified (KER 0.513 → 0.128 in 3 epochs, −75% relative) | NVIDIA’s current English ASR default. Stock NeMo SFT recipe works end-to-end. Recommended. |
| nvidia/nemotron-speech-streaming-en-0.6b | Don’t use for SFT | NVCF function is streaming-only; SFT path unreliable (UNK collapse on validation after first training step). For streaming serving, Riva chunks a non-streaming base just fine. |

Other Parakeet/Conformer bases (1.1B, CTC, RNNT, stt_en_conformer_ctc_large) + decoder → NIM container mapping: references/stage4-finetune.md. If the user asks to fine-tune Nemotron Speech Streaming, warn about the collapse and recommend Parakeet TDT v2.

4d. Stock NeMo SFT

In the NeMo container, invoke /opt/NeMo/examples/asr/speech_to_text_finetune.py directly. No custom adapter logic. No patches. The stock NeMo SFT script is the verified working recipe.

Hyperparameters (verified on Parakeet TDT v2, 39-row manifest):

init_from_pretrained_model: nvidia/parakeet-tdt-0.6b-v2
precision:                  bf16-mixed       # required for TDT numerical stability
lr:                         3e-4             # CosineAnnealing schedule
warmup_steps:               5                # tiny manifest; bump to 500 at production scale
epochs:                     3                # smoke; 10-30 for production
batch_size:                 4                # fits 16 GB VRAM; raise to 16 on L40S 48 GB
gradient_clip_val:          1.0              # defensive

Container invocation: docker run --gpus all --rm -it -v "$PWD:/workspace" nvcr.io/nvidia/nemo:25.11.01 python /opt/NeMo/examples/asr/speech_to_text_finetune.py with model.train_ds.manifest_filepath=/workspace/train.jsonl, model.validation_ds.manifest_filepath=/workspace/validation.jsonl, init_from_pretrained_model=nvidia/parakeet-tdt-0.6b-v2, and the hyperparameter overrides from the table above. Full docker-run line with config-path / config-name flags: references/stage4-finetune.md §Container invocation.
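One way to keep the overrides in sync with the hyperparameter table is to assemble the command from a dict. This sketch only builds and prints the command; it does not run it. The three keys named in the paragraph above come from this file, while the remaining Hydra keys (`trainer.*`, `model.optim.*`, `model.train_ds.batch_size`) are assumed mappings for the table's values and should be checked against the script's YAML config. The config-path / config-name flags are deliberately omitted because they live in the reference.

```python
import os
import shlex

IMAGE = "nvcr.io/nvidia/nemo:25.11.01"
SCRIPT = "/opt/NeMo/examples/asr/speech_to_text_finetune.py"

hparams = {
    # Keys stated in this skill file.
    "init_from_pretrained_model": "nvidia/parakeet-tdt-0.6b-v2",
    "model.train_ds.manifest_filepath": "/workspace/train.jsonl",
    "model.validation_ds.manifest_filepath": "/workspace/validation.jsonl",
    # Assumed Hydra key paths for the table's values; verify before use.
    "trainer.precision": "bf16-mixed",
    "trainer.max_epochs": 3,
    "trainer.gradient_clip_val": 1.0,
    "model.optim.lr": 3e-4,
    "model.optim.sched.warmup_steps": 5,
    "model.train_ds.batch_size": 4,
}

# Hydra overrides are plain key=value tokens appended after the script path.
overrides = [f"{key}={value}" for key, value in hparams.items()]

cmd = [
    "docker", "run", "--gpus", "all", "--rm", "-it",
    "-v", f"{os.getcwd()}:/workspace",
    IMAGE, "python", SCRIPT,
    *overrides,
]
print(shlex.join(cmd))
```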

Manifest paths inside the container. Host paths (e.g. $HOME/…) don’t resolve in /workspace. Rewrite snippet: references/container-paths.md.

The training run writes adapted_model.nemo and a training_run_info.json summary. Both go into a per-cycle subdirectory of the user’s choice (e.g. cycle<N>/models/<run>/; the layout doesn’t matter as long as it’s consistent across cycles).

4e. Offline cycle N+1 eval — close the loop

Re-transcribe the cycle’s audio with the fine-tuned .nemo using NeMo’s offline transcribe(). No Riva needed — this is measurement, not serving. NeMo’s offline path runs the same encoder + decoder graph the Riva NIM eventually serves.

Sketch:

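import nemo.collections.asr as nemo_asr

# Restore the fine-tuned checkpoint written by the SFT run in §4d.
model = nemo_asr.models.ASRModel.restore_from("adapted_model.nemo")
# List every audio file in the cycle's manifest; two rows are shown here.
audio_paths = ["audio/row1.wav", "audio/row2.wav"]
hyps = model.transcribe(audio_paths)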

Score the same four metrics (WER/CER/KER/SER) and the same five-section leaderboard the eval skill produces. Write them as leaderboard_cycle<N+1>.md. Compare against leaderboard_cycle<N>.md.

Decision table — cycle-N+1 vs cycle-N:

| Result | Action |
| --- | --- |
| KER dropped meaningfully on targeted categories (e.g. drug KER −20% or more, relative) | ✅ Keep the .nemo. Update the leaderboard. Advance to Step 4f if you want to deploy. |
| KER moved a little, you wanted more | Loop back to /digital-health-clinical-asr-build, expand the manifest. Tiny manifests rarely benefit from hyperparameter tweaks; signal density beats LR sweeps. |
| KER got worse | Overfit on a tiny manifest. Bail to /digital-health-clinical-asr-build and grow before retraining. Don’t tune harder on the same data. |
| No measurable change | Some categories may already be in the base model’s vocab. Sanity-check per-category numbers before concluding training “didn’t help.” |
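The decision table can be approximated in code once the two leaderboards are scored. This is a sketch, not the skill's own logic: the −20% relative threshold is the example figure from the table, while `noise_abs` (the cutoff for "no measurable change") and the returned labels are hypothetical choices to tune per project.

```python
def relative_change(ker_before, ker_after):
    """Relative KER change; negative values mean improvement."""
    if ker_before == 0:
        raise ValueError("baseline KER is 0, so relative change is undefined")
    return (ker_after - ker_before) / ker_before


def decide(ker_before, ker_after, keep_rel=-0.20, noise_abs=0.01):
    """Map a cycle-N -> cycle-N+1 KER pair onto one row of the decision table."""
    delta = ker_after - ker_before
    if abs(delta) <= noise_abs:
        return "no_change_check_per_category"
    if delta > 0:
        return "worse_grow_manifest"
    # delta < 0 implies ker_before > 0, so the relative change is defined here.
    if relative_change(ker_before, ker_after) <= keep_rel:
        return "keep_model"
    return "small_gain_grow_manifest"


# Verified overall numbers from the reference manifest.
print(round(relative_change(0.513, 0.128), 3), decide(0.513, 0.128))
```

Running it per category, not only on the overall KER, matches the table's advice to sanity-check per-category numbers.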

4f. (Optional) Deploy as a Riva NIM

Hand the .nemo to /riva-asr-custom. Pass the source architecture explicitly, because /riva-asr-custom can’t reliably detect CTC vs RNNT vs TDT from the .nemo alone, and the wrong NIM container produces a broken RMIR with no clear error:

| Source decoder | riva-build flag | NIM container family |
| --- | --- | --- |
| Conformer-CTC | decoder=greedy_ctc | parakeet-*-ctc-* |
| Conformer-RNNT | decoder=nemo | parakeet-rnnt-* |
| Conformer-TDT (default) | decoder=nemo | parakeet-tdt-* |
| Cache-Aware RNNT (Nemotron streaming) | decoder=nemo | nemotron-streaming-* ⚠ SFT broken on this base, see Limitations |
After deploy: re-run /digital-health-clinical-asr-eval against the new endpoint (ASR_ENDPOINT=localhost:50051) to validate that production-serving numbers match offline numbers. Any divergence is in Riva preprocessing or riva-build flags, not the model. Route to /riva-asr-custom.

Examples

Scenario A — gate met. User: “Drug KER 0.42, 130 rows. SFT?” → Yes (gate cleared). parakeet-tdt-0.6b-v2 (verified 0.513 → 0.128). No local GPU? Step 4a (Brev) → 4b (split) → 4d (stock SFT) → 4e (offline re-eval). If cycle-2 drug KER drops ≥ 20% relative, keep the .nemo; otherwise back to /digital-health-clinical-asr-build.

Scenario B — Nemotron Streaming. User: “SFT nvidia/nemotron-speech-streaming-en-0.6b?” → No (UNK collapse). Substitute parakeet-tdt-0.6b-v2. Riva chunks non-streaming bases for streaming serving — base doesn’t need to be streaming-native.

Scenario C — cycle 2 KER unchanged. User: “KER barely moved.” → Back to /digital-health-clinical-asr-build. Signal density beats LR sweeps. If magpie_g2p rows are bad but merriam-webster rows are good, the gap is pronunciation-coverage — /digital-health-clinical-asr-build Step 2d.

Artifacts produced

Troubleshooting

Limitations

Next steps

References

Skill frontmatter

version: 1.0.0
author: Ben Randoing
tags: clinical-asr, finetune, sft, nemo, parakeet, flywheel
tools: Read, Write, Bash, Skill
license: Apache-2.0
compatibility: Requires a CUDA host (24 GB VRAM comfortable, 16 GB workable with batch_size=4), the NeMo container (nvcr.io/nvidia/nemo:25.11.01), and the finetune-asr + riva-asr-custom skills installed alongside this one. No local GPU? Use Brev. NVIDIA_API_KEY required for the offline cycle N+1 eval round-trip and for any NIM deploy.
metadata: {"author" => "Ben Randoing", "tags" => ["clinical-asr", "flywheel", "finetune", "nemo-sft", "parakeet"], "team" => "healthcare-tme", "domain" => "ai-ml", "stage" => 4, "previous_skill" => "digital-health-clinical-asr-eval", "next_skill" => "riva-asr-custom"}