Agent Skill · NVIDIA NIM

tao-run-on-lepton

DGX Cloud Lepton managed GPU compute platform with run/status/cancel interface. Use when submitting TAO jobs to DGX Cloud, dispatching training/eval/inference to Lepton GPU resources, or managing Lepton workspace deployments. Trigger phrases include "run on Lepton", "submit to DGX Cloud", "Lepton job", "managed GPU on DGX Cloud".

Provider: NVIDIA NIM
Path in repo: skills/tao-run-on-lepton/SKILL.md

Skill body

Lepton

Managed GPU compute platform on DGX Cloud. Jobs are submitted as container workloads that run on dedicated or shared GPU node groups. Lepton handles scheduling, image pulling, log collection, and job lifecycle.

Use Lepton when you need cloud-based GPU compute without managing Kubernetes or SLURM infrastructure directly.

Preflight

Lepton is API-first; there is no docker-run alternative. This skill needs the TAO SDK with the Lepton extra. The nvidia-tao-sdk package is on public PyPI; the pinned version lives in versions.yaml (wheels.tao_sdk_lepton) and is resolved via scripts/resolve_versions_key.py:

PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_lepton)
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}
python -c "import leptonai" 2>/dev/null || {
  echo "MISSING: lepton extra not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}

If either import fails, the agent prompts the user to authorize the install via Bash, then re-runs the preflight before continuing.

Credentials

Launch Preflight

Before generating scripts or submitting jobs:

  1. Verify LEPTON_WORKSPACE_ID and LEPTON_AUTH_TOKEN are set.
  2. Verify the workspace API is reachable with the packaged helper: scripts/check_tao_launch_preflight.py --platform lepton ....
  3. For s3:// datasets/results, verify ACCESS_KEY and SECRET_KEY are set and the exact paths are readable with aws s3 ls.
  4. For NFS/Lustre mounted paths, require proof from Lepton volume/storage permissions that the path will be mounted into the job. Do not treat a local filesystem test -e on the agent host as proof for Lepton jobs.
  5. Verify model-specific credentials such as HF_TOKEN before launch.
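The environment-variable parts of this checklist (steps 1, 3, and 5) can be sketched as a small helper. This is a minimal illustration, not part of the TAO SDK: the function name `missing_env_vars` and its parameters are hypothetical, and it does not cover the API reachability or path-readability checks.

```python
import os
from typing import Mapping, Sequence

# Variable names taken from the preflight steps above.
LEPTON_VARS = ("LEPTON_WORKSPACE_ID", "LEPTON_AUTH_TOKEN")
S3_VARS = ("ACCESS_KEY", "SECRET_KEY")


def missing_env_vars(
    uses_s3: bool,
    extra: Sequence[str] = (),
    env: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    required = list(LEPTON_VARS)
    if uses_s3:
        required.extend(S3_VARS)
    required.extend(extra)  # e.g. ["HF_TOKEN"] for gated models
    return [name for name in required if not env.get(name)]
```

An empty return value means the credential variables are present; it says nothing about whether the token is still valid.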

Backend Details

LeptonSDK.create_job accepts these Lepton-specific kwargs (in addition to the platform-agnostic ones — image, command, gpu_count, env_vars, inputs, outputs, hooks):

Discovering the workspace’s shapes / volumes

shapes = sdk.list_resource_shapes()
# {<platform_id>: {"cluster": ..., "gpu_type": "gpu.8xh100-sxm",
#                   "gpu_count": 8, "instance_type": ..., ...}, ...}

volumes = sdk.get_volumes(node_group_id="my-h100-pool")
# [{"name": "lustre", "from_path": "/lustre", "type": "Lustre"}, ...]

prefixes = sdk.get_storage_permissions("lustre", "my-h100-pool")
# ["/lustre/fsw/portfolios/edgeai/...", ...]
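Once the permitted prefixes are known, a dataset path can be checked against them before launch. The sketch below assumes the prefixes are absolute directory paths and that a path is usable when it equals or sits below one of them; both the rule and the helper name `path_is_permitted` are assumptions, not SDK behavior.

```python
from pathlib import PurePosixPath
from typing import Iterable


def path_is_permitted(path: str, prefixes: Iterable[str]) -> bool:
    """True if `path` equals or sits below one of the permitted prefixes."""
    target = PurePosixPath(path)
    for prefix in prefixes:
        root = PurePosixPath(prefix)
        # Compare whole path components so /a/team2 does not match /a/team.
        if target == root or root in target.parents:
            return True
    return False
```

Comparing path components rather than raw strings avoids false matches on sibling directories that share a name prefix. The helper does not resolve `..` segments.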

Multi-node training (distributed)

Pass num_nodes > 1 to create_job for multi-node distributed training. The Lepton handler (tao_sdk/platforms/lepton/handler.py) configures the underlying LeptonJob by setting intra_job_communication=True (opens pod-to-pod networking), parallelism=num_nodes and completions=num_nodes (Lepton schedules N replicas), and exports WORLD_SIZE=num_nodes as a container env var.

Lepton’s native per-replica env vars use Lepton-specific names (LEPTON_JOB_WORKER_INDEX, LEPTON_JOB_TOTAL_WORKERS, LEPTON_JOB_WORKER_PREFIX, LEPTON_SUBDOMAIN), so the handler prepends a bootstrap that sources Lepton’s official translation script:

wget -O init.sh https://raw.githubusercontent.com/leptonai/scripts/main/lepton_env_to_pytorch.sh
chmod +x init.sh
source init.sh
# user command runs here

After sourcing, the following env vars are set:

| Env var | Source | Value |
| --- | --- | --- |
| MASTER_ADDR | script | ${LEPTON_JOB_WORKER_PREFIX}-0.${LEPTON_SUBDOMAIN} |
| MASTER_PORT | script | 29400 |
| NNODES | script | ${LEPTON_JOB_TOTAL_WORKERS} |
| NODE_RANK | script | ${LEPTON_JOB_WORKER_INDEX} |
| WORKER_ADDRS | script | comma-separated list of non-master worker hostnames |
| WORLD_SIZE | TAO SDK handler | num_nodes (TAO container’s convention; same value as NNODES) |
| NUM_GPU_PER_NODE | TAO SDK handler | gpu_count (read by TAO container’s entrypoint) |

job = sdk.create_job(
    image='nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt',
    command='dino train -e /tmp/spec.yaml',  # TAO entrypoint reads WORLD_SIZE + NUM_GPU_PER_NODE
    gpu_count=8,                          # GPUs per node
    num_nodes=4,                          # 4 × 8 = 32 GPUs total
    dedicated_node_group='my-h100-pool',
    inputs={'/data/train.json': 's3://bucket/coco/train.json'},
    outputs=['/results/'],
)

For raw torchrun-based commands (non-TAO containers):

command='torchrun --nnodes=$NNODES --nproc-per-node=8 --node-rank=$NODE_RANK '
        '--master-addr=$MASTER_ADDR --master-port=$MASTER_PORT train.py'
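For reference, the script-sourced rows of the env var table amount to a small mapping from Lepton's variables to the PyTorch ones. The Python sketch below mirrors that mapping for illustration only; it is not the contents of lepton_env_to_pytorch.sh, the function name is hypothetical, and WORKER_ADDRS is left out because the table does not spell out its hostname format.

```python
from typing import Mapping


def pytorch_env_from_lepton(env: Mapping[str, str]) -> dict[str, str]:
    """Derive the script-sourced rows of the table from Lepton's variables."""
    prefix = env["LEPTON_JOB_WORKER_PREFIX"]
    subdomain = env["LEPTON_SUBDOMAIN"]
    return {
        "MASTER_ADDR": f"{prefix}-0.{subdomain}",  # worker 0 acts as master
        "MASTER_PORT": "29400",
        "NNODES": env["LEPTON_JOB_TOTAL_WORKERS"],
        "NODE_RANK": env["LEPTON_JOB_WORKER_INDEX"],
    }
```

Every replica computes the same MASTER_ADDR, which is what lets torchrun rendezvous without further configuration.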

Two ways to run distributed jobs on Lepton

| Path | When to use |
| --- | --- |
| TAO SDK create_job(num_nodes=N) (this skill) | Programmatic submission from agent code; you want the SDK’s S3 wrapping, monitoring, failure analysis, and JobStore. |
| Lepton “Torchrun” job type (Lepton UI / lep CLI) | Hand-crafted submission via the Lepton console. Lepton’s UI has a first-class “Torchrun” mode that wires up the rendezvous for you, so no bootstrap script is needed. See the official example. |

Reference reading

Notes

Cloud Storage

Even though the platform is Lepton, the storage layer is S3-compatible. Always use aws as the cloud_metadata key and s3:// as the URI protocol for both datasets and results_dir.

The container’s get_cloud_storage_class_object() parses the URI protocol to look up credentials in CLOUD_METADATA[protocol][bucket].
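To make the protocol and bucket lookup concrete, here is a sketch of how an s3:// URI splits into the parts used as lookup keys. It uses only the standard library and is not the container's actual implementation; the helper name `split_storage_uri` is hypothetical.

```python
from urllib.parse import urlparse


def split_storage_uri(uri: str) -> tuple[str, str, str]:
    """Split 's3://bucket/key' into (protocol, bucket, key)."""
    parsed = urlparse(uri)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"not a cloud storage URI: {uri!r}")
    # The bucket is the authority part; the key is the path without its slash.
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")
```

A plain filesystem path has no scheme, so it raises ValueError instead of being treated as cloud storage.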

Shared Storage (NFS/Lustre)

Node groups can have NFS or Lustre volumes attached. The SDK auto-detects these and mounts them into containers for persistent cross-job data sharing.

SDK Functions

LeptonSDK.create_job() calls these automatically to detect mounts and build the appropriate Mount objects for job specs.

How the script runner uses mounts

When a Lustre mount is available:

Volume preference order

lustre > filestore > first available
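The preference order can be expressed as a short selection function over the list returned by get_volumes. This is a sketch under one assumption: that the preference is matched against each volume's `name` field, case-insensitively. The SDK may match on a different field, and `pick_volume` is a hypothetical name.

```python
from typing import Optional, Sequence

PREFERRED = ("lustre", "filestore")


def pick_volume(volumes: Sequence[dict]) -> Optional[dict]:
    """Pick a volume by the order lustre > filestore > first available."""
    for wanted in PREFERRED:
        for volume in volumes:
            # Assumption: the preference is matched against the volume name.
            if volume.get("name", "").lower() == wanted:
                return volume
    return volumes[0] if volumes else None
```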

Lustre Cache Invalidation

Lustre caches files persistently across jobs. There is no built-in invalidation. If upstream data changes but the S3 path stays the same, Lustre serves the stale cached version. To force a cache miss:

Monitoring

Job Status

Use sdk.get_job_status(job_id) for high-level status (Pending, Running, Complete, Error).
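A polling loop around get_job_status might look like the sketch below. It assumes the call returns one of the four status names as a plain string; the helper name `wait_for_job`, the poll interval, and the timeout are hypothetical choices, not SDK defaults.

```python
import time
from typing import Callable

TERMINAL = {"Complete", "Error"}


def wait_for_job(
    sdk,
    job_id: str,
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status or the timeout expires."""
    waited = 0.0
    while True:
        status = sdk.get_job_status(job_id)
        if status in TERMINAL:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job {job_id} still {status} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

The `sleep` parameter exists so the loop can be exercised in tests without real delays.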

Replica Status

Use sdk.get_job_replicas(job_id) during startup for detailed replica-level info. Each replica is a dict:

replicas = sdk.get_job_replicas(job_id)
for r in replicas:
    node = r["status"]["node"]["name"]           # e.g., "node-ip-10-50-111-24"
    node_group = r["status"]["node"]["node_group_id"]
    cpu = r["status"]["cpu"]                      # e.g., 2
    memory_mb = r["status"]["memory_in_mb"]       # e.g., 8192
    readiness = r["status"].get("readiness_issue")
    if readiness:
        reason = readiness["reason"]   # "InProgress", "Failed", "ConfigError"
        message = readiness["message"] # "Pulling image", "Mount point not found", etc.

Key readiness_issue patterns:

Replica status is especially useful when a job is stuck in Pending — it reveals whether the issue is image pulling, resource scheduling, or node health.
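When a job sits in Pending, the replica list can be reduced to just the replicas that report a problem. The sketch below relies only on the dict fields shown in the example above; the helper name `readiness_problems` and the `<unassigned>` placeholder are hypothetical.

```python
def readiness_problems(replicas: list[dict]) -> list[tuple[str, str, str]]:
    """Return (node, reason, message) for each replica reporting an issue."""
    problems = []
    for replica in replicas:
        status = replica.get("status", {})
        issue = status.get("readiness_issue")
        if not issue:
            continue  # replica is healthy or has not reported yet
        node = (status.get("node") or {}).get("name", "<unassigned>")
        problems.append((node, issue.get("reason", ""), issue.get("message", "")))
    return problems
```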

Job Logs

Use sdk.get_job_logs(job_id, tail=N) for the most recent N log lines. Logs are fetched from Lepton’s log collection service.

Parallel Jobs

For workflow stages that run in parallel (e.g., video generation x8):

  1. Launch: Call execute_step(plan, step_id, extra_args={"split_id": i}) for each split. Each call returns immediately with a job_id.
  2. Monitor: Poll all jobs: sdk.get_job_status(job_id) for each. Use get_job_replicas(job_id) for startup diagnostics.
  3. Completion: All jobs are done when every status is Complete or Error.
  4. Partial failure: Retry only failed splits — successful splits don’t need re-running. Pass the same split_id to execute_step.
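The monitor and partial-failure steps can be sketched as one pass over the launched jobs that groups split IDs by outcome. The sketch assumes get_job_status returns the status name as a string; `partition_splits` and the `jobs` mapping (split_id to job_id) are hypothetical bookkeeping kept by the caller.

```python
def partition_splits(sdk, jobs: dict[int, str]) -> dict[str, list[int]]:
    """Group split IDs by job outcome; `jobs` maps split_id -> job_id."""
    groups: dict[str, list[int]] = {"complete": [], "failed": [], "running": []}
    for split_id, job_id in sorted(jobs.items()):
        status = sdk.get_job_status(job_id)
        if status == "Complete":
            groups["complete"].append(split_id)
        elif status == "Error":
            groups["failed"].append(split_id)
        else:
            groups["running"].append(split_id)  # Pending or Running
    return groups
```

When `running` is empty the stage has finished, and `failed` lists exactly the split IDs to pass back to execute_step.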

Failure Analysis

When a job fails, use sdk.get_failure_analysis(job_id) for automatic root cause detection:

analysis = sdk.get_failure_analysis(job_id)
if analysis:
    print(analysis["err_class"])    # e.g., "ERR_PROGRAM"
    print(analysis["suggestion"])   # Human-readable fix
    for event in analysis.get("job_failure_by_node_event", []):
        print(event["node_event_name"], event["message"])
        # e.g., "OOM", "OOM encountered, victim process: cosmos-rl-evalu, pid: 3368483"

Returns:

Always call this on failed jobs before retrying — it distinguishes user errors (bad config, OOM) from infrastructure issues (node failure, eviction).

Failure Modes

OOM killed: Container exceeded GPU or system memory. Detection: get_failure_analysis() returns node_event_name: "OOM". Common causes: evaluation.batch_size too high, max_length too large for available KV cache. Recovery: reduce batch_size, add GPUs with tensor parallelism, or reduce max_length.
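The OOM detection rule can be written as a check over the failure analysis dict. The sketch uses only the fields shown in the Failure Analysis example; the helper name `has_oom_event` is hypothetical.

```python
from typing import Optional


def has_oom_event(analysis: Optional[dict]) -> bool:
    """True if any node event in the failure analysis is named "OOM"."""
    if not analysis:
        return False  # no analysis available for this job
    events = analysis.get("job_failure_by_node_event", [])
    return any(event.get("node_event_name") == "OOM" for event in events)
```

A True result suggests changing the job configuration before resubmitting, since retrying unchanged is likely to hit the same limit.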

Image pull failure: The TAO container image cannot be pulled from nvcr.io. Usually caused by a missing or expired image pull secret. The SDK auto-provisions the secret from NGC_KEY, but if NGC_KEY is invalid, the job will fail. Detection: check get_job_replicas(); readiness_issue.reason will show InProgress with message = "Pulling image" for extended periods, or Failed if the pull fails. Recovery: verify NGC_KEY is valid.

Resource unavailable: The requested GPU shape is not available. Job enters Queueing state indefinitely. Detection: the job has been Pending for more than 15 minutes and replicas show no node assignment. Recovery: try a different resource_shape or dedicated_node_group, or wait for resources.

Auth failure: Invalid or expired LEPTON_AUTH_TOKEN. All API calls fail with 401/403. Detection: job creation raises an exception immediately. Recovery: refresh the token and reinitialize the SDK.

Unhealthy node: The assigned node has infrastructure issues (mount failures, GPU errors, network problems). Detection: check get_job_replicas(); readiness_issue.reason = "ConfigError" with messages like "Mount point not found". The job stays Pending indefinitely on the bad node. Recovery: cancel the job and resubmit; Lepton will schedule on a different node. If the issue recurs, try a different dedicated_node_group or resource_shape.

Job eviction: On shared node groups, Lepton may evict jobs under resource pressure. Detection: job unexpectedly transitions from Running to Error. Recovery: retry, or use a dedicated_node_group.

Skill frontmatter

license: Apache-2.0
compatibility: Requires the tao-sdk Python package with the lepton extra (pip install 'tao-sdk[lepton]') plus LEPTON_WORKSPACE_ID and LEPTON_AUTH_TOKEN.
metadata: {"author": "NVIDIA Corporation", "version": "0.1.0"}
allowed-tools: Read Bash
tags: dgx-cloud, gpu, compute, lepton