Agent Skill · NVIDIA NIM

dynamo-troubleshoot

Diagnose failed or unhealthy Dynamo deployments. Use when pods, model-cache jobs, PVCs, workers, frontend/router health, endpoints, or benchmark jobs fail; use recipe-runner/router-starter before this for normal bring-up.

Provider: NVIDIA NIM Path in repo: skills/dynamo-troubleshoot/SKILL.md

Skill body

Dynamo Troubleshoot

Purpose

Turn a Dynamo failure into a clear problem class, strongest signal, and next action. Start with read-only evidence, avoid secrets, and fix one layer at a time.

Prerequisites

Instructions

1. Collect A Read-Only Bundle

Run:

python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace "${NAMESPACE}"

If the user names a deployment, include it:

python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace "${NAMESPACE}" \
  --deployment-name <deployment-name>

Do not collect Kubernetes secrets. Do not print Hugging Face tokens.

2. Classify The Failure

Use references/failure-decision-tree.md and classify into one primary bucket:

3. Debug Top Down

Check in this order:

  1. namespace, storage class, GPU nodes, and HF secret existence
  2. PVC and model-download job
  3. DynamoGraphDeployment status and events
  4. pod status, describe pod, and container logs
  5. frontend service and port-forward
  6. /v1/models
  7. /v1/chat/completions
  8. benchmark job only after endpoint smoke test passes

4. Fix One Layer At A Time

Prefer the smallest reversible change:

After each fix, rerun the relevant readiness check before moving deeper.

Available Scripts

Script Purpose Arguments
scripts/collect_dynamo_debug_bundle.py Collect a read-only debug bundle (pods, events, jobs, PVCs, CR status) --namespace, --deployment-name, --output-dir

Invoke via the agentskills.io run_script() protocol:

run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo"])

Examples

Collect everything in a namespace for triage:

python3 scripts/collect_dynamo_debug_bundle.py --namespace dynamo-demo

Scope to a single failing deployment:

python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace dynamo-demo \
  --deployment-name qwen-vllm-disagg

Equivalent through the agent protocol:

run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo", "--deployment-name", "qwen-vllm-disagg"])

Output Contract

Return:

Limitations

Troubleshooting

Symptom Likely cause Next step
kubectl returns Forbidden on events/pods Service account lacks read RBAC Ask operator for read-only role binding on the namespace
Bundle missing DynamoGraphDeployment status Operator not installed or different namespace Verify dynamo-platform operator is installed and watching the namespace
Model-download job in Pending PVC unbound or HF secret missing Fix PVC binding or create the named HF secret, then rerun the job
Worker pods CrashLoopBackOff Image/runtime mismatch or GPU not available Inspect container logs; check nvidia.com/gpu allocatable on nodes

Benchmark

See BENCHMARK.md for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run /nvskills-ci on an upstream PR touching this skill.

References

Skill frontmatter

license: Apache-2.0 metadata: {"author" => "Dan Gil ", "tags" => ["dynamo", "kubernetes", "troubleshooting", "day-2"]}