Agent Skill · NVIDIA NIM

nemo-automodel-launcher-config

Configure NeMo AutoModel job launches for interactive runs, Slurm clusters, and SkyPilot cloud execution.

Provider: NVIDIA NIM
Path in repo: skills/nemo-automodel-launcher-config/SKILL.md

Skill body

Launcher Configuration

NeMo AutoModel supports three launch methods: interactive (torchrun), Slurm (HPC clusters), and SkyPilot (cloud-agnostic).

Instructions

For launcher questions, answer directly from this skill without inspecting the repository unless the user asks you to edit files. Keep the answer focused on the relevant launch YAML, required fields, and the expected runtime behavior.

Use these compact answer patterns for common questions:

For Slurm answers, start with this minimal template and then adjust only the fields the user asked about:

slurm:
  job_name: llm_finetune
  nodes: 2
  ntasks_per_node: 8
  time: "04:00:00"
  account: my_account
  partition: batch
  container_image: nvcr.io/nvidia/nemo:dev
  hf_home: ~/.cache/huggingface
  master_port: 13742
  env_vars:
    HF_TOKEN: "${HF_TOKEN}"

For Slurm-only questions, do not discuss SkyPilot or profiling unless the user asks. For profiling questions, say that the .nsys-rep report is written to the Slurm job's working or output directory, and that the launcher's Nsys output setting determines the location when one is configured.

Routing Boundary

Use this skill only for launch mechanics: interactive execution, Slurm, SkyPilot, containers, mounts, environment variables, rendezvous settings, and profiling.

Do not use this skill for implementing or registering new model architectures, Hugging Face state-dict adapters, model files, or capability flags. Those are model onboarding tasks, not launcher configuration tasks.

Launch Methods

  1. Interactive (default): runs torchrun on the current node. Suitable for single-node development and debugging.
  2. Slurm: submits a batch job to an HPC cluster scheduler. Handles multi-node setup, container management, and environment configuration.
  3. SkyPilot: cloud-agnostic job submission to AWS, GCP, Azure, Lambda, or Kubernetes. Supports spot instances.

Interactive Launch

# Single GPU
automodel finetune llm -c config.yaml

# Multi-GPU (all GPUs on current node)
torchrun --nproc_per_node=8 -m nemo_automodel._cli.app finetune llm -c config.yaml

No additional YAML section is needed for interactive mode. The CLI routes to torchrun automatically when no slurm: or skypilot: section is present in the config.
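The routing rule above can be sketched in a few lines of Python. This is a minimal illustration, not the CLI's actual code: `pick_launcher` is a hypothetical helper, and because the source does not say what happens when both sections are present, the sketch treats that case as an error.

```python
from typing import Any, Mapping


def pick_launcher(config: Mapping[str, Any]) -> str:
    """Return the launch path a parsed config would take (hypothetical helper)."""
    has_slurm = "slurm" in config
    has_skypilot = "skypilot" in config
    if has_slurm and has_skypilot:
        # Assumption: precedence is not documented, so refuse to guess.
        raise ValueError("config defines both slurm: and skypilot: sections")
    if has_slurm:
        return "slurm"
    if has_skypilot:
        return "skypilot"
    # Neither section present: fall back to the interactive (torchrun) path.
    return "interactive"
```

For example, a config that contains only model and dataset sections maps to "interactive", while adding a top-level slurm: section switches the result to "slurm".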

Slurm Configuration

The SlurmConfig dataclass generates an SBATCH script from a template.

YAML Example

slurm:
  job_name: llm_finetune
  nodes: 2
  ntasks_per_node: 8
  time: "04:00:00"
  account: my_account
  partition: batch
  container_image: nvcr.io/nvidia/nemo:dev
  hf_home: ~/.cache/huggingface
  extra_mounts:
    - source: /data
      dest: /data
  env_vars:
    WANDB_API_KEY: "${WANDB_API_KEY}"
    HF_TOKEN: "${HF_TOKEN}"
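To make the mapping from YAML fields to a batch script concrete, here is a rough sketch of how such fields could be rendered into SBATCH directives. `SlurmSketch` and `render_sbatch_header` are hypothetical names and are not the real SlurmConfig or its template. Only standard sbatch options are used, and container, mount, and environment handling are left out because the source does not describe how the template emits them.

```python
from dataclasses import dataclass


@dataclass
class SlurmSketch:
    """Illustrative subset of the slurm: section (not the real SlurmConfig)."""
    job_name: str
    nodes: int
    ntasks_per_node: int
    time: str
    account: str
    partition: str


def render_sbatch_header(cfg: SlurmSketch) -> str:
    """Render standard #SBATCH directives from the sketch fields."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={cfg.job_name}",
        f"#SBATCH --nodes={cfg.nodes}",
        f"#SBATCH --ntasks-per-node={cfg.ntasks_per_node}",
        f"#SBATCH --time={cfg.time}",
        f"#SBATCH --account={cfg.account}",
        f"#SBATCH --partition={cfg.partition}",
    ]
    return "\n".join(lines) + "\n"


def total_tasks(cfg: SlurmSketch) -> int:
    """Total task count across the job: nodes times tasks per node."""
    return cfg.nodes * cfg.ntasks_per_node


if __name__ == "__main__":
    example = SlurmSketch("llm_finetune", 2, 8, "04:00:00", "my_account", "batch")
    print(render_sbatch_header(example), end="")
    print(f"# total tasks: {total_tasks(example)}")
```

With the values from the YAML example (nodes: 2, ntasks_per_node: 8), the job runs 16 tasks in total.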

Key Fields

SkyPilot Configuration

The SkyPilotConfig dataclass defines cloud job parameters.

YAML Example

skypilot:
  cloud: aws
  accelerators: "H100:8"
  num_nodes: 2
  use_spot: true
  disk_size: 200
  region: us-east-1
  setup: "pip install nemo-automodel"
  env_vars:
    HF_TOKEN: "${HF_TOKEN}"

Key Fields

SkyPilot spot checklist

When using spot or preemptible instances:

Minimal spot-resume recipe keys:

step_scheduler:
  checkpoint_interval: 100

restore_from:
  path: /checkpoints/latest
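One way to catch a spot job that cannot resume is to check for the recipe keys above before submitting. The sketch below assumes the config has already been parsed into nested dictionaries. `missing_spot_resume_keys` is a hypothetical helper and not part of NeMo AutoModel.

```python
from typing import Any, List, Mapping


def missing_spot_resume_keys(config: Mapping[str, Any]) -> List[str]:
    """List the spot-resume keys that are absent (hypothetical pre-submit check)."""
    skypilot = config.get("skypilot") or {}
    if not skypilot.get("use_spot", False):
        # Not a spot job, so the resume keys are optional.
        return []
    missing = []
    if "checkpoint_interval" not in (config.get("step_scheduler") or {}):
        missing.append("step_scheduler.checkpoint_interval")
    if "path" not in (config.get("restore_from") or {}):
        missing.append("restore_from.path")
    return missing
```

An empty result means that both keys are present, or that the job does not request spot instances.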

Multi-Node Environment

For multi-node training (both Slurm and SkyPilot), the launcher automatically configures:

Nsys Profiling

Enable Nsight Systems profiling in Slurm jobs:

slurm:
  job_name: llm_profile
  nodes: 1
  ntasks_per_node: 8
  time: "00:30:00"
  account: my_account
  partition: batch
  container_image: nvcr.io/nvidia/nemo:dev
  nsys_enabled: true

This is a Slurm launcher setting. Normal Slurm fields such as job_name, nodes, ntasks_per_node, time, account or partition, and container_image still apply.

When nsys_enabled: true, the launcher wraps the training command with nsys profile and writes a .nsys-rep report file to the Slurm job working or output directory for performance analysis. Profiling is diagnostic-only: run it for a short investigation, expect overhead and large artifacts, and turn it off for normal production training.
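The wrapping step can be pictured as prefixing the training command with nsys profile. This is a minimal sketch, not the launcher's code: `wrap_with_nsys` and the default report name are hypothetical, and the only nsys option used is the standard -o output flag. The training command in the example is a placeholder.

```python
from typing import List


def wrap_with_nsys(train_cmd: List[str], enabled: bool, report_name: str = "profile") -> List[str]:
    """Prefix a command with `nsys profile` when profiling is enabled (hypothetical helper)."""
    if not enabled:
        # Profiling off: run the training command unchanged.
        return list(train_cmd)
    # nsys appends the .nsys-rep extension to the output name itself.
    return ["nsys", "profile", "-o", report_name] + list(train_cmd)


if __name__ == "__main__":
    # Placeholder training command for illustration only.
    print(" ".join(wrap_with_nsys(["python", "train.py"], enabled=True)))
```

Leaving the command untouched when the flag is off matches the guidance to disable profiling for normal production training.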

Code Anchors

Pitfalls

Skill frontmatter

when_to_use: Configuring Slurm or SkyPilot job submission, setting up multi-node launch scripts, debugging job submission failures, or switching between interactive and cluster launch modes.
license: Apache-2.0
metadata: {"author": "NVIDIA", "tags": ["nemo-automodel", "launcher-config"]}