Agent Skill · NVIDIA NIM

nemotron-customize

Plan, configure, and chain repo-native Nemotron customization steps into single-step or multi-step pipelines: curation, translation, SFT/PEFT (AutoModel or Megatron-Bridge), pretraining/CPT, RL alignment (DPO/RLVR/GRPO/RLHF), BYOB/MCQ benchmarks, checkpoint conversion, ModelOpt optimization, env profiles, and evaluation of trained checkpoints or existing/hosted endpoints. Use when a request names a Nemotron step or workflow, or asks to clean, translate, train, fine-tune, align, convert, optimize, evaluate, or compose these into a pipeline. Do NOT use for frontend/dashboard/visualization work, generic ML advice, billing/access, or non-Nemotron coding tasks.

Provider: NVIDIA NIM
Path in repo: skills/nemotron-customize/SKILL.md

Skill body

nemotron-customize

IMPORTANT: Read this file before answering any nemotron-customize, Nemotron customization, Curator curation, translation, SFT, PEFT, RL, conversion, optimization, checkpoint or existing/hosted-endpoint evaluation, or multi-step pipeline request. This applies whether the user names one step or asks you to compose several steps into a pipeline.

Evaluation requests count even when no training is involved: “evaluate”, “benchmark”, “smoke test”, or “score” an existing/hosted endpoint, an API/model ID, or a deployed model all route to eval/model_eval. Read this skill for those too.

Purpose

Turn a model-customization request into a repo-native Nemotron step pipeline. Plan the DAG, validate artifact wiring, and create only the YAML/config files needed to run existing steps.

Use this skill only for inspecting, configuring, validating, running, or submitting existing Nemotron steps or multi-step training/customization pipelines. For frontend, dashboard, visualization, generic ML advice, billing/access, or unrelated coding tasks, stop with a short scope note and do not inspect the step catalog or edit files in that turn.

Prerequisites

Limitations

Core Rule

Use bundled references first. The references/ folder is the first decision surface for routing, artifacts, patterns, hardware heuristics, and command shape. Use src/nemotron/steps/... only as a live verification/fallback source when you need exact current config fields, manifests, runner imports, or details missing from bundled references.

If sources disagree:

  1. Checked live repo files win for exact execution.
  2. Bundled references win for initial routing and planning.
  3. Upstream docs/context packs are used only for exceptional code generation or library API details.

Before You Begin

Safety

Keep Bash scoped to repo-safe commands such as uv run nemotron steps ..., targeted tests, git status/diff, and config validation. Never run environment dumps (env, printenv, broad export) or commands that expose secret values. For remote submissions, destructive changes, or expensive launches, confirm before execution.

When inspecting env/config files, avoid printing whole files that may contain secrets. Use targeted reads, report only section names and env-var names, and redact values for fields containing token, key, secret, password, credential, or auth.
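The redaction rule above can be sketched in Python as follows. This is a minimal illustration, not repo code: the function names, the `<redacted>` placeholder, and the example section and field names are hypothetical. It assumes the file has already been parsed into a dict of sections.

```python
# Substrings that mark a field as sensitive, taken from the rule above.
SENSITIVE_MARKERS = ("token", "key", "secret", "password", "credential", "auth")


def is_sensitive(field_name: str) -> bool:
    """Return True when the field name contains any sensitive marker."""
    lowered = field_name.lower()
    return any(marker in lowered for marker in SENSITIVE_MARKERS)


def summarize_sections(sections: dict) -> dict:
    """Copy parsed sections, replacing sensitive values with a placeholder.

    `sections` maps a section name to a dict of field name -> value.
    The input is not modified.
    """
    return {
        section: {
            name: "<redacted>" if is_sensitive(name) else value
            for name, value in fields.items()
        }
        for section, fields in sections.items()
    }


if __name__ == "__main__":
    # Hypothetical profile content, for illustration only.
    parsed = {"example_profile": {"nodes": 2, "api_key": "do-not-print"}}
    print(summarize_sections(parsed))
```

Substring matching is deliberately broad: a harmless field whose name happens to contain `key` is redacted too, which is the safer failure mode.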

Reference Map

| Question | Read first | Live fallback / verification |
| --- | --- | --- |
| Which step or category fits? | references/CATALOG.md | uv run nemotron steps list/show, then selected step.toml |
| Do artifacts chain? | references/ARTIFACTS.md | src/nemotron/steps/types.toml |
| What run shape should I emit? | references/COMMANDS.md | checked-in config YAML plus active profile TOML |
| Remote profile generation or selection | references/COMMANDS.md | active NEMOTRON_ENV_FILE, env.toml, or env.*.toml |
| What hardware/backend should I recommend? | references/HARDWARE.md | selected step [[models]] and [[strategies]] |
| Which cross-step guardrails apply? | references/PATTERNS.md | src/nemotron/steps/patterns/<id>.md |
| How do I run the full workflow? | references/WORKFLOW.md | selected step configs, step.py, and runners |
| Which upstream library API should generated code use? | references/context/index.toml -> matching pack | selected step.py, _runners/, upstream docs |
| New project scaffold, only when existing repo code cannot support the request | references/act/PROJECT.md | existing repo project/recipe shape |
| Per-stage code rules, only when existing repo code cannot support the request | references/act/STAGE.md | selected step.py and shared runner |

Do not start by reading category READMEs or step.toml for ordinary decisions. Select candidates from bundled references, then verify exact live details before writing configs or final commands.

Routing

Use references/CATALOG.md as the authoritative home for step selection and route-specific fast paths. Use ARTIFACTS.md, PATTERNS.md, and HARDWARE.md only to resolve artifact, cross-step, or hardware constraints after the catalog narrows the route.

Each step is independent; stitching steps together is your job. Compose any pipeline by artifact matching, working back from the user’s end goal: add an upstream step only when the next step consumes an artifact type that nothing upstream already produces. Do not rely on fixed, named step combinations.
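As a rough illustration of artifact matching, the Python sketch below plans a chain backward from a goal step. The `Step` record and `plan_chain` function are hypothetical. The artifact type names `jsonl` and `checkpoint` are assumptions for the example; the real types live in references/ARTIFACTS.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    consumes: frozenset
    produces: frozenset


def plan_chain(goal: str, catalog: dict, available: set) -> list:
    """Return step names in execution order, ending at `goal`.

    An upstream step is added only for an artifact type that is neither
    in `available` nor produced by a step already in the plan.
    """
    order = []
    have = set(available)

    def visit(step_name: str, stack: frozenset) -> None:
        if step_name in order:
            return
        if step_name in stack:
            raise ValueError(f"cycle detected at {step_name!r}")
        step = catalog[step_name]
        for artifact in sorted(step.consumes):
            if artifact in have:
                continue  # already satisfied; do not add another step
            producers = [
                s for s in catalog.values()
                if artifact in s.produces and s.name != step_name
            ]
            if not producers:
                raise ValueError(f"no step produces {artifact!r}")
            visit(producers[0].name, stack | {step_name})
        order.append(step_name)
        have.update(step.produces)

    visit(goal, frozenset())
    return order


# Illustrative catalog; artifact type names other than packed_parquet are assumed.
CATALOG = {
    "data_prep/sft_packing": Step(
        "data_prep/sft_packing", frozenset({"jsonl"}), frozenset({"packed_parquet"})
    ),
    "sft/megatron_bridge": Step(
        "sft/megatron_bridge", frozenset({"packed_parquet"}), frozenset({"checkpoint"})
    ),
}

if __name__ == "__main__":
    print(plan_chain("sft/megatron_bridge", CATALOG, {"jsonl"}))
```

If the user already holds a `packed_parquet` artifact, the same call with `{"packed_parquet"}` returns only the SFT step, which is the point of matching on artifacts instead of fixed step combinations.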

Instructions

Follow the flow that matches the request: a recommendation/plan, a single-step command, or a multi-step pipeline. In all cases, route from the bundled references first, gather required inputs, and verify the selected live step before presenting anything as runnable.

Recommendation Response

Use this shape for planning answers:

Decision, Why, Required inputs, Config/command, Avoid, and Next step. Call out the stack to avoid when the user’s constraints make it a poor fit.

Whenever the answer includes a command that touches a hosted service or remote execution, also state, in the answer:

Single-Step Command Flow

  1. Confirm repo root has pyproject.toml and src/nemotron/steps/.
  2. Read references/CATALOG.md and the selected section of references/COMMANDS.md.
  3. Verify the selected live step with uv run nemotron steps show <step_id> when available, or the selected step.toml when the CLI is unavailable.
  4. Read the requested checked-in config or user overlay before emitting the command.
  5. For remote execution, read NEMOTRON_ENV_FILE or repo-root env*.toml and pick an actual section whose profile matches the step.
  6. Emit the full command in one reply with the source tier: Verified, Repo-grounded, Reference-grounded, or Blocked.
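Steps 1, 3, and 6 of the flow above lend themselves to a small helper. The Python sketch below is illustrative only: the function names and the bracketed label format are hypothetical, and it builds the command without running it.

```python
from pathlib import Path

# Source tiers named in step 6.
SOURCE_TIERS = ("Verified", "Repo-grounded", "Reference-grounded", "Blocked")


def is_repo_root(path: Path) -> bool:
    """Step 1: the root must hold pyproject.toml and src/nemotron/steps/."""
    has_pyproject = (path / "pyproject.toml").is_file()
    has_steps = (path / "src" / "nemotron" / "steps").is_dir()
    return has_pyproject and has_steps


def show_command(step_id: str) -> list:
    """Step 3: argv for verifying the selected live step."""
    return ["uv", "run", "nemotron", "steps", "show", step_id]


def label_command(command: list, tier: str) -> str:
    """Step 6: attach the source tier to the full command text."""
    if tier not in SOURCE_TIERS:
        raise ValueError(f"unknown source tier: {tier!r}")
    return f"[{tier}] {' '.join(command)}"


if __name__ == "__main__":
    if is_repo_root(Path.cwd()):
        print(label_command(show_command("peft/automodel"), "Reference-grounded"))
    else:
        print("Not a repo root: pyproject.toml or src/nemotron/steps/ is missing.")
```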

Canonical command shapes live in references/COMMANDS.md.

Pipeline Workflow

For pipelines with two or more stages, use Orient -> Plan -> Act -> Verify. Read references/WORKFLOW.md for the phase checklist.

Catalog Mode

Use when the request maps to existing steps. Fast path:

references/CATALOG.md -> references/ARTIFACTS.md -> references/COMMANDS.md -> verify selected live manifest/config/profile -> add a new named config under the selected step’s config/ directory.

Customization Surface

Explorer Mode

Use only after confirming no existing step, runner, recipe, CLI, or YAML config surface can satisfy the request. Full procedure lives in references/WORKFLOW.md.

Configuration Alignment

Surface these constraints before commands or config writes:

Operational Nuances

Boundaries

Do:

Do not:

Examples

Single-step routing (LoRA on a small box). User: “LoRA fine-tune a HF model on 2 GPUs.” Route per CATALOG.md -> peft/automodel (HF base + small GPU count); do not offer Megatron-Bridge. Collect base model, JSONL data path, output dir, LoRA rank/alpha, then emit one uv run nemotron steps run peft/automodel -c <config> --dry-run ... command.

Multi-step pipeline (Super3 SFT). User: “data prep + SFT for Super3.” This is two stages, so plan first: SFT on Super3 -> Megatron-Bridge, which consumes packed_parquet, so data_prep/sft_packing is required upstream. Present the DAG (sft_packing -> sft/megatron_bridge), align pack_size/seq_length/tokenizer, wait for approval, then add new configs under src/nemotron/steps/<step>/config/<name>.yaml. Super3 needs a remote profile; state the env TOML prerequisite or mark Blocked.
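The alignment check in the Super3 example could be sketched as below. This rests on two assumptions that are not confirmed by the skill: that "aligned" means pack_size equals seq_length and both stages name the same tokenizer, and that these values sit at the top level of each config. Verify the real keys in the checked-in configs before relying on it.

```python
def alignment_issues(packing_cfg: dict, sft_cfg: dict) -> list:
    """Return human-readable mismatches between the packing and SFT configs.

    Assumes top-level keys `pack_size`, `seq_length`, and `tokenizer`;
    the actual config layout may differ.
    """
    issues = []
    pack_size = packing_cfg.get("pack_size")
    seq_length = sft_cfg.get("seq_length")
    if pack_size != seq_length:
        issues.append(f"pack_size={pack_size} does not match seq_length={seq_length}")
    pack_tok = packing_cfg.get("tokenizer")
    sft_tok = sft_cfg.get("tokenizer")
    if pack_tok != sft_tok:
        issues.append(f"tokenizer {pack_tok!r} does not match {sft_tok!r}")
    return issues


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    packing = {"pack_size": 4096, "tokenizer": "example-tokenizer"}
    sft = {"seq_length": 8192, "tokenizer": "example-tokenizer"}
    for issue in alignment_issues(packing, sft):
        print("Alignment issue:", issue)
```

An empty list means no mismatch was found under these assumptions; any entry is a reason to stop before writing configs.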

Hosted-endpoint evaluation (no training). User: “benchmark my hosted model endpoint.” Route to eval/model_eval with -c tiny_chat. Collect endpoint URL, model id, task IDs, and the auth env-var name (value exported, never inlined). See references/COMMANDS.md Evaluation Examples.
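One way to confirm the auth variable is exported without exposing it is to report only its name and whether it is set. The sketch below is a hypothetical helper, and `EXAMPLE_API_KEY` is a placeholder name, not one used by the repo.

```python
import os


def auth_env_status(var_name: str, environ=None) -> str:
    """Report whether an env var is set, never returning its value."""
    env = os.environ if environ is None else environ
    state = "set" if env.get(var_name) else "MISSING"
    return f"{var_name}: {state}"


if __name__ == "__main__":
    # Prints the variable name and status only.
    print(auth_env_status("EXAMPLE_API_KEY"))
```

An empty string counts as missing here, since an exported but empty credential would fail at the endpoint anyway.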

Troubleshooting

| Situation | Action |
| --- | --- |
| Artifact types do not chain | Recheck references/ARTIFACTS.md; insert a converter or change the DAG before writing configs. |
| Remote profile or --batch is unclear | Read active env TOML; do not guess profile names. |
| Config key is unclear | Verify selected checked-in config, step.py, and shared runner before editing. |
| Strategy points to a missing context pack | Skip the pack, use catalog/pattern text, and flag the plan with WARNING: <topic> docs unavailable. |
| Hardware looks too small | Use references/HARDWARE.md; suggest smaller model, AutoModel, then LoRA before full Megatron-Bridge. |
| Two Act attempts fail | Stop, explain what was tried and failed, and ask how to proceed. |
| No existing repo path matches | Check references/context/index.toml and selected source fallback; use Explorer mode only after naming the gap. |

Skill frontmatter

version: 0.1.1
license: Apache-2.0
metadata: {"version": "0.1.1", "author": "NVIDIA Nemotron Team", "tags": ["nemotron", "customization", "training", "pipelines"]}