Agent Skill · NVIDIA NIM

mcore-run-on-slurm

How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.

Provider: NVIDIA NIM Path in repo: skills/mcore-run-on-slurm/SKILL.md

Skill body

Run Megatron-LM on SLURM

Answer-First Constants

For text-only SLURM setup questions, answer with these constants before the full script:

Prerequisites

Minimal sbatch script

Save as run_megatron.slurm in the worktree:

#!/bin/bash
#SBATCH --job-name=megatron
#SBATCH --account=<SLURM_ACCOUNT>
#SBATCH --partition=<SLURM_PARTITION>
#SBATCH --nodes=<NODES>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<GPUS_PER_NODE>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

set -euo pipefail
cd <MEGATRON_WORKTREE>

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=${MASTER_PORT:-29500}
export NNODES=${SLURM_NNODES}
export GPUS_PER_NODE=<GPUS_PER_NODE>
export WORLD_SIZE=$((NNODES * GPUS_PER_NODE))

# Set CUDA_DEVICE_MAX_CONNECTIONS only when your configuration requires it
# (see the section below). Example for pre-Blackwell with TP>1 or CP>1
# (non-FSDP):
#   export CUDA_DEVICE_MAX_CONNECTIONS=1

srun --ntasks=${NNODES} --ntasks-per-node=1 bash -c '
  # NODE_RANK comes from SLURM_NODEID with one task per node.
  NODE_RANK=${SLURM_NODEID}
  uv run python -m torch.distributed.run \
    --nnodes='"${NNODES}"' \
    --nproc-per-node='"${GPUS_PER_NODE}"' \
    --node-rank=${NODE_RANK} \
    --master-addr='"${MASTER_ADDR}"' \
    --master-port='"${MASTER_PORT}"' \
    pretrain_gpt.py \
      <MEGATRON_ARGS>
'

Submit:

mkdir -p logs && JOB_ID=$(sbatch --parsable run_megatron.slurm)
echo "Submitted ${JOB_ID}"

Multi-node rules

CUDA_DEVICE_MAX_CONNECTIONS

The right value depends on your hardware and parallelism mode. Do not export it unconditionally:

Set it explicitly in the sbatch script when your configuration calls for it.

Containers

Many sites run Megatron-LM inside a container (enroot/pyxis on some clusters, singularity on others). If you do, the uv-managed .venv must live on a path that is visible from inside the container, and the container image must provide the CUDA / NCCL / torch versions the repo expects (see docker/.ngc_version.dev and .ngc_version.lts). The skeleton above stays the same; wrap the srun invocation with your scheduler’s container flags (--container-image=…, --container-mounts=…, etc.).

Monitor and collect

squeue -j "$JOB_ID" -o "%.10i %.8T %.10M %.6D %R"
sacct -j "$JOB_ID" --format=JobID,State,ExitCode,Elapsed
scancel "$JOB_ID"

If your training script writes a result artifact (a JSON metrics file from rank 0, a final checkpoint, etc.), poll for the artifact rather than waiting only on squeue state. Useful output usually appears before SLURM marks the job complete, and polling on the artifact lets you cancel the job as soon as it lands instead of holding the allocation until the timeout.

Failure diagnosis

Scan stderr from every rank, not just rank 0. The earliest non-NCCL Python traceback is usually the root cause; later NCCL timeouts on other ranks are downstream symptoms of the first crash.

Classify quickly:

Common pitfalls

Skill frontmatter

license: Apache-2.0 when_to_use: Submitting a SLURM job; writing or debugging an sbatch script; configuring multi-node distributed training; setting MASTER_ADDR / MASTER_PORT / WORLD_SIZE; diagnosing a SLURM job failure; 'how do I run on the cluster', 'sbatch', 'multi-node training'. metadata: {"author" => "Philip Petrakian "}