Agent Skill · NVIDIA NIM

nemo-automodel-distributed-training

Guide for selecting and configuring distributed training strategies in NeMo AutoModel, including FSDP2, Megatron FSDP, DDP, and parallelism settings.

Provider: NVIDIA NIM Path in repo: skills/nemo-automodel-distributed-training/SKILL.md

Skill body

Distributed Training in NeMo AutoModel

Purpose

NeMo AutoModel uses PyTorch-native distributed training. All parallelism is orchestrated through a single MeshContext object that holds device meshes, strategy configs, and axis names.

Instructions

For conceptual distributed-training questions, answer directly from the quick patterns in this skill without inspecting the repository. Start with the strategy choice, then list only the YAML fields and constraints relevant to the question.

Use direct action verbs in the final answer: recommend the strategy, show the minimal YAML, state the sizing constraint, and name the unsupported strategies. Do not discuss model onboarding, recipes, Slurm, SkyPilot, or checkpointing unless the user asks.

Examples

TP plus PP for a large multi-node model

Recommend strategy: fsdp2. Mention tp_size, pp_size, cp_size, ep_size, and the pipeline sub-config. State that dp_size is inferred from world_size / (tp_size * pp_size * cp_size).

distributed:
  strategy: fsdp2
  tp_size: 8
  pp_size: 4
  cp_size: 1
  ep_size: 1
  pipeline:
    pp_schedule: interleaved1f1b
    pp_microbatch_size: 1

MoE expert parallelism

Recommend strategy: fsdp2 with ep_size > 1. Say this creates a separate moe_mesh; include the moe sub-config when relevant; state that ep_size must divide dp_size * cp_size. Do not recommend megatron_fsdp or ddp.

distributed:
  strategy: fsdp2
  ep_size: 8
  moe:
    reshard_after_forward: false

MegatronFSDP limitations

Say no for pipeline parallelism, expert parallelism, and sequence_parallel. Recommend fsdp2 for PP, EP, or sequence_parallel; mention that DDP is only simple data parallelism.

Strategy Selection

Three strategies are available, selected via the distributed.strategy YAML key:

Strategy YAML value Best for
FSDP2 fsdp2 General use, recommended default. Supports TP, PP, CP, EP, HSDP.
MegatronFSDP megatron_fsdp NVIDIA Megatron-style FSDP. No PP, no EP, no sequence_parallel.
DDP ddp Simple data parallelism only. No TP, PP, CP, or EP.

Decision tree:

When answering strategy-selection questions, state the chosen distributed.strategy first, then enumerate the YAML fields the user must set.

Quick TP + PP answer:

Quick MoE expert-parallel answer:

YAML Config Structure

The distributed section in the recipe YAML maps directly to parse_distributed_section() in recipes/_dist_setup.py:

distributed:
  strategy: fsdp2           # fsdp2 | megatron_fsdp | ddp
  dp_size: none             # auto-calculated from world_size / (tp * pp * cp)
  dp_replicate_size: none   # FSDP2-only, for HSDP
  tp_size: 1
  pp_size: 1
  cp_size: 1
  ep_size: 1

  # Strategy-specific flags (forwarded to the strategy dataclass):
  sequence_parallel: false
  activation_checkpointing: false
  defer_fsdp_grad_sync: true   # FSDP2 only

  # Sub-configs (optional):
  pipeline:
    pp_schedule: 1f1b
    pp_microbatch_size: 1
    # ... see PipelineConfig fields

  moe:
    reshard_after_forward: false
    # ... see MoEParallelizerConfig fields

The dp_size is always inferred:

dp_size = world_size / (tp_size * pp_size * cp_size)

Infrastructure Flow

YAML distributed section
    -> parse_distributed_section()          [recipes/_dist_setup.py]
    -> setup_distributed()                  [recipes/_dist_setup.py]
        -> create_device_mesh()             [components/distributed/device_mesh.py]
        -> MeshContext(...)                  [components/distributed/mesh.py]
    -> instantiate_infrastructure()         [_transformers/infrastructure.py]
        -> _instantiate_distributed()       -> FSDP2Manager / MegatronFSDPManager / DDPManager
        -> _instantiate_pipeline()          -> AutoPipeline (if pp_size > 1)
        -> parallelize_fn                   -> MoE parallelizer (if ep_size > 1) or PP wrapper
    -> apply_model_infrastructure()         [_transformers/infrastructure.py]
        -> _shard_pp() or _shard_ep_fsdp()  (applies sharding to the model)

FSDP2 Configuration

Basic FSDP2 (data parallelism only)

distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1

This auto-calculates dp_size = world_size and applies fully_shard() per transformer block via DTensor-based sharding.

FSDP2 with Tensor Parallelism

Keep TP within a single NVLink domain (typically one node):

distributed:
  strategy: fsdp2
  tp_size: 4        # 2, 4, or 8 -- must divide GPUs per node
  sequence_parallel: true

The TP plan is auto-selected based on the model type. Pass a custom plan via the Python API if needed:

config = FSDP2Config(sequence_parallel=True, tp_plan=my_custom_plan)

FSDP2 with Pipeline Parallelism

distributed:
  strategy: fsdp2
  pp_size: 2
  pipeline:
    pp_schedule: interleaved1f1b   # 1f1b, gpipe, interleaved_1f1b, etc.
    pp_microbatch_size: 4
    scale_grads_in_schedule: false

The model must have a _pp_plan attribute (set on the HF model class) for AutoPipeline to know how to split layers across stages. Models without _pp_plan are not compatible with PP.

FSDP2 with HSDP (Hybrid Sharded Data Parallel)

Intra-node full sharding + inter-node replication via a 2D DeviceMesh:

distributed:
  strategy: fsdp2
  dp_replicate_size: 2   # must divide dp_size

Constraint: dp_replicate_size < dp_size (pure replication with no sharding is not supported by FSDP2).

Activation Checkpointing

Trades compute for memory by recomputing activations during backward:

distributed:
  activation_checkpointing: true

This is forwarded to the strategy config for non-EP models, or read from MeshContext.activation_checkpointing for EP models.

Gradient Sync Deferral

FSDP2 defers gradient sync to the final micro-batch by default for communication overlap:

distributed:
  defer_fsdp_grad_sync: true   # default

Mixed Precision

FSDP2Config defaults to bfloat16 for all three precision knobs via MixedPrecisionPolicy(param_dtype=bf16, reduce_dtype=bf16, output_dtype=bf16, cast_forward_inputs=True). Override via the Python API:

from torch.distributed.fsdp import MixedPrecisionPolicy
config = FSDP2Config(
    mp_policy=MixedPrecisionPolicy(param_dtype=torch.float16, reduce_dtype=torch.float32),
)

Pipeline Parallelism

Requirements

  1. Model class must define _pp_plan (a dict mapping module FQNs to stages).
  2. pp_size > 1 in the distributed section.
  3. A pipeline sub-config with schedule and microbatch size.

Supported schedules

Defined in PipelineConfig.pp_schedule:

Example (8B model on 8 GPUs, PP=2 + DP=4)

distributed:
  strategy: fsdp2
  pp_size: 2

  pipeline:
    pp_schedule: interleaved1f1b
    pp_microbatch_size: 4
    scale_grads_in_schedule: false

checkpoint:
  model_save_format: safetensors
  save_consolidated: true

How it works

AutoPipeline.build() calls pipeline_model() which splits the model into stages using the model’s _pp_plan, creates PipelineStage objects, and builds the schedule. During training, schedule.step() drives forward and backward through the pipeline.

Context Parallelism

Use CP for long sequences (8K+). CP shards Q/K/V on the sequence dimension as DTensors.

Config

distributed:
  strategy: fsdp2
  cp_size: 2   # or 4, 8

Requirements

How it works

  1. After model sharding, apply_model_infrastructure() calls attach_context_parallel_hooks() on each model part (for non-TE models).
  2. At each training step, make_cp_batch_and_ctx() creates a CP context manager that shards the batch along the sequence dimension and sets up context_parallel() from torch.distributed.tensor.experimental.
  3. For TE attention models, make_cp_batch_for_te() uses THD format and TE’s thd_get_partitioned_indices for sharding.

CP with Sequence Packing

CP works with packed sequences. The packed_sequence_size must be divisible by cp_size. When using TE, chunks are sharded per-chunk via _shard_thd_chunk_for_te().

Sequence Packing

Packing multiple sequences into a single training sample for efficiency.

Config

packed_sequence:
  packed_sequence_size: 4096   # 0 = disabled

step_scheduler:
  local_batch_size: 1          # must be 1 for packed sequences

When packed_sequence_size > 0, the dataset collator packs sequences up to that length. local_batch_size must be 1 because each “sample” is already a packed batch.

MoE Distributed Training

Expert Parallelism

Set ep_size > 1 to distribute experts across GPUs. This creates a separate moe_mesh alongside the main device_mesh:

distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true

The moe_mesh shape is (pp_size, ep_shard_size, ep_size) with dimension names ("pp", "ep_shard", "ep").

Constraint: dp_cp_size (= dp_size * cp_size) must be divisible by ep_size.

MoE sub-config

distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true

  moe:
    reshard_after_forward: false
    ignore_router_for_ac: false
    wrap_outer_model: true

The moe sub-section maps to MoEParallelizerConfig and is only instantiated when ep_size > 1.

Full MoE example (Qwen3-30B-A3B on 8 GPUs)

distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1
  pp_size: 1
  ep_size: 8
  sequence_parallel: false
  activation_checkpointing: true

MegatronFSDP limitations

Despite its name, megatron_fsdp does not support expert parallelism (ep_size > 1), pipeline parallelism (pp_size > 1), or sequence_parallel. Use fsdp2 for these features.

Parallelism Sizing Guidelines

Dense models

Model size TP PP CP Strategy
< 3B 1 1 1 FSDP2 (DP only)
3-13B 2-4 1 1 FSDP2 + TP
13-70B 4-8 2-4 1 FSDP2 + TP + PP
70B+ 8 4-8 1 FSDP2 + TP + PP
Any + long seq (8K+) as above as above 2-8 add CP

MoE models

MoE models need less TP than dense models of similar total parameter count because only a fraction of parameters are active per token. EP is the primary scaling dimension:

Model TP PP EP Notes
Small MoE (<10B total) 1 1 8 EP only
Medium MoE (10-30B total) 1-2 1 8 small TP for shared layers
Large MoE (100B+ total) 1-2 4+ 8-64 PP for depth, EP for experts

Hardware topology rules

Code Anchors

Pitfalls

  1. TP across nodes destroys throughput. Always keep TP within a single NVLink domain. Use PP or DP for cross-node scaling.

  2. PP requires _pp_plan on the model class. Not all HF models have this. Check validate_hf_model_for_pipeline_support() before enabling PP.

  3. PP bubbles reduce GPU utilization. Use interleaved schedules (interleaved_1f1b) and smaller microbatches to reduce bubble time.

  4. FSDP2 requires DTensor-aware state dict saving. Use safetensors with save_consolidated: true for checkpoint compatibility.

  5. CP requires compatible attention. SDPA (Flash Attention or Efficient Attention) or TE attention only. SDPBackend.MATH is not compatible with DTensor.

  6. MoE EP size must evenly divide dp_size * cp_size. The device mesh creation asserts dp_cp_size % ep_size == 0.

  7. MegatronFSDP is more limited than FSDP2. It does not support PP (pp_size > 1), EP (ep_size > 1), or sequence_parallel. The MeshContext validation raises on these combinations.

  8. DDP supports nothing beyond data parallelism. No TP, PP, CP, EP, or HSDP. Validation raises on any of these.

  9. Activation checkpointing increases compute. It saves memory by recomputing activations during backward, but adds ~30% compute overhead.

  10. Mixed precision policy must match model expectations. The default bfloat16 policy works for most models. FP16 models may need a custom MixedPrecisionPolicy.

  11. packed_sequence_size must be divisible by cp_size when using CP with packed sequences.

  12. dp_replicate_size is FSDP2-only. Passing it with megatron_fsdp or ddp raises a ValueError.

Verification

Run the smallest recipe that exercises the requested strategy. Success means exit code 0, finite loss, no NCCL timeout, and log output matching the expected TP/PP/CP/EP sizes.

Skill frontmatter

when_to_use: Adding or modifying distributed training strategies (FSDP2, HSDP, DDP), debugging multi-GPU or multi-node failures, configuring context or tensor parallelism, or tuning sharding settings. license: Apache-2.0 metadata: {"author" => "NVIDIA", "tags" => ["nemo-automodel", "distributed-training"]}