Agent Skill · NVIDIA NIM

nemo-automodel-distributed-training

Guide for selecting and configuring distributed training strategies in NeMo AutoModel, including FSDP2, Megatron FSDP, DDP, and parallelism settings.

View SKILL.md on GitHub → Source repository Provider profile

Provider: NVIDIA NIM Path in repo: skills/nemo-automodel-distributed-training/SKILL.md

Skill body

Distributed Training in NeMo AutoModel

Purpose

NeMo AutoModel uses PyTorch-native distributed training. All parallelism is orchestrated through a single MeshContext object that holds device meshes, strategy configs, and axis names.

Instructions

For conceptual distributed-training questions, answer directly from the quick patterns in this skill without inspecting the repository. Start with the strategy choice, then list only the YAML fields and constraints relevant to the question.

Use direct action verbs in the final answer: recommend the strategy, show the minimal YAML, state the sizing constraint, and name the unsupported strategies. Do not discuss model onboarding, recipes, Slurm, SkyPilot, or checkpointing unless the user asks.

Examples

TP plus PP for a large multi-node model

Recommend strategy: fsdp2. Mention tp_size, pp_size, cp_size, ep_size, and the pipeline sub-config. State that dp_size is inferred from world_size / (tp_size * pp_size * cp_size).

distributed:
  strategy: fsdp2
  tp_size: 8
  pp_size: 4
  cp_size: 1
  ep_size: 1
  pipeline:
    pp_schedule: interleaved1f1b
    pp_microbatch_size: 1

MoE expert parallelism

Recommend strategy: fsdp2 with ep_size > 1. Say this creates a separate moe_mesh; include the moe sub-config when relevant; state that ep_size must divide dp_size * cp_size. Do not recommend megatron_fsdp or ddp.

distributed:
  strategy: fsdp2
  ep_size: 8
  moe:
    reshard_after_forward: false

MegatronFSDP limitations

Say no for pipeline parallelism, expert parallelism, and sequence_parallel. Recommend fsdp2 for PP, EP, or sequence_parallel; mention that DDP is only simple data parallelism.

Strategy Selection

Three strategies are available, selected via the distributed.strategy YAML key:

Strategy	YAML value	Best for
FSDP2	`fsdp2`	General use, recommended default. Supports TP, PP, CP, EP, HSDP.
MegatronFSDP	`megatron_fsdp`	NVIDIA Megatron-style FSDP. No PP, no EP, no sequence_parallel.
DDP	`ddp`	Simple data parallelism only. No TP, PP, CP, or EP.

Decision tree:

Single GPU: no distributed config needed (FSDP2Manager skips parallelization when world_size=1).
Multi-GPU single node: fsdp2 (default). Use ddp only if you need the simplest possible setup.
Multi-node: fsdp2 with appropriate TP/PP sizing.
MoE models with expert parallelism: fsdp2 with ep_size > 1 (creates a separate moe_mesh).
Large models (70B+): fsdp2 with PP + TP.
Long sequences (8K+): add CP (cp_size > 1).

When answering strategy-selection questions, state the chosen distributed.strategy first, then enumerate the YAML fields the user must set.

Quick TP + PP answer:

Use strategy: fsdp2; do not use megatron_fsdp when pipeline parallelism is required.
Set tp_size for tensor parallelism and pp_size for pipeline parallelism.
Add a pipeline: sub-config with pp_schedule and pp_microbatch_size.
Leave dp_size unset or none; it is inferred as world_size / (tp_size * pp_size * cp_size).
Keep TP inside a fast intra-node domain when possible, and use PP across model depth for 70B+ models.

Quick MoE expert-parallel answer:

Start with strategy: fsdp2 and ep_size > 1.
Include a moe: sub-config only when ep_size > 1; it maps to MoEParallelizerConfig.
Expect a separate moe_mesh for expert parallelism in addition to the main device_mesh.
Do not recommend megatron_fsdp or ddp for expert parallelism; megatron_fsdp has no EP support.
Before finishing an MoE EP answer, explicitly state that ep_size must divide dp_size * cp_size and that megatron_fsdp does not support EP, PP, or sequence_parallel.

YAML Config Structure

The distributed section in the recipe YAML maps directly to parse_distributed_section() in recipes/_dist_setup.py:

distributed:
  strategy: fsdp2           # fsdp2 | megatron_fsdp | ddp
  dp_size: none             # auto-calculated from world_size / (tp * pp * cp)
  dp_replicate_size: none   # FSDP2-only, for HSDP
  tp_size: 1
  pp_size: 1
  cp_size: 1
  ep_size: 1

  # Strategy-specific flags (forwarded to the strategy dataclass):
  sequence_parallel: false
  activation_checkpointing: false
  defer_fsdp_grad_sync: true   # FSDP2 only

  # Sub-configs (optional):
  pipeline:
    pp_schedule: 1f1b
    pp_microbatch_size: 1
    # ... see PipelineConfig fields

  moe:
    reshard_after_forward: false
    # ... see MoEParallelizerConfig fields

The dp_size is always inferred:

dp_size = world_size / (tp_size * pp_size * cp_size)

Infrastructure Flow

YAML distributed section
    -> parse_distributed_section()          [recipes/_dist_setup.py]
    -> setup_distributed()                  [recipes/_dist_setup.py]
        -> create_device_mesh()             [components/distributed/device_mesh.py]
        -> MeshContext(...)                  [components/distributed/mesh.py]
    -> instantiate_infrastructure()         [_transformers/infrastructure.py]
        -> _instantiate_distributed()       -> FSDP2Manager / MegatronFSDPManager / DDPManager
        -> _instantiate_pipeline()          -> AutoPipeline (if pp_size > 1)
        -> parallelize_fn                   -> MoE parallelizer (if ep_size > 1) or PP wrapper
    -> apply_model_infrastructure()         [_transformers/infrastructure.py]
        -> _shard_pp() or _shard_ep_fsdp()  (applies sharding to the model)

FSDP2 Configuration

Basic FSDP2 (data parallelism only)

distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1

This auto-calculates dp_size = world_size and applies fully_shard() per transformer block via DTensor-based sharding.

FSDP2 with Tensor Parallelism

Keep TP within a single NVLink domain (typically one node):

distributed:
  strategy: fsdp2
  tp_size: 4        # 2, 4, or 8 -- must divide GPUs per node
  sequence_parallel: true

The TP plan is auto-selected based on the model type. Pass a custom plan via the Python API if needed:

config = FSDP2Config(sequence_parallel=True, tp_plan=my_custom_plan)

FSDP2 with Pipeline Parallelism

distributed:
  strategy: fsdp2
  pp_size: 2
  pipeline:
    pp_schedule: interleaved1f1b   # 1f1b, gpipe, interleaved_1f1b, etc.
    pp_microbatch_size: 4
    scale_grads_in_schedule: false

The model must have a _pp_plan attribute (set on the HF model class) for AutoPipeline to know how to split layers across stages. Models without _pp_plan are not compatible with PP.

FSDP2 with HSDP (Hybrid Sharded Data Parallel)

Intra-node full sharding + inter-node replication via a 2D DeviceMesh:

distributed:
  strategy: fsdp2
  dp_replicate_size: 2   # must divide dp_size

Constraint: dp_replicate_size < dp_size (pure replication with no sharding is not supported by FSDP2).

Activation Checkpointing

Trades compute for memory by recomputing activations during backward:

distributed:
  activation_checkpointing: true

This is forwarded to the strategy config for non-EP models, or read from MeshContext.activation_checkpointing for EP models.

Gradient Sync Deferral

FSDP2 defers gradient sync to the final micro-batch by default for communication overlap:

distributed:
  defer_fsdp_grad_sync: true   # default

Mixed Precision

FSDP2Config defaults to bfloat16 for all three precision knobs via MixedPrecisionPolicy(param_dtype=bf16, reduce_dtype=bf16, output_dtype=bf16, cast_forward_inputs=True). Override via the Python API:

from torch.distributed.fsdp import MixedPrecisionPolicy
config = FSDP2Config(
    mp_policy=MixedPrecisionPolicy(param_dtype=torch.float16, reduce_dtype=torch.float32),
)

Pipeline Parallelism

Requirements

Model class must define _pp_plan (a dict mapping module FQNs to stages).
pp_size > 1 in the distributed section.
A pipeline sub-config with schedule and microbatch size.

Supported schedules

Defined in PipelineConfig.pp_schedule:

1f1b (one-forward-one-backward, default)
gpipe
interleaved_1f1b / interleaved1f1b
looped_bfs
dfs
v_schedule
zero_bubble

Example (8B model on 8 GPUs, PP=2 + DP=4)

distributed:
  strategy: fsdp2
  pp_size: 2

  pipeline:
    pp_schedule: interleaved1f1b
    pp_microbatch_size: 4
    scale_grads_in_schedule: false

checkpoint:
  model_save_format: safetensors
  save_consolidated: true

How it works

AutoPipeline.build() calls pipeline_model() which splits the model into stages using the model’s _pp_plan, creates PipelineStage objects, and builds the schedule. During training, schedule.step() drives forward and backward through the pipeline.

Context Parallelism

Use CP for long sequences (8K+). CP shards Q/K/V on the sequence dimension as DTensors.

Config

distributed:
  strategy: fsdp2
  cp_size: 2   # or 4, 8

Requirements

SDPA (Flash Attention or Efficient Attention backend) or Transformer Engine attention. SDPBackend.MATH is not compatible with DTensor.
Attention masks are automatically stripped; is_causal=True is set via forward pre-hooks registered by attach_context_parallel_hooks().

How it works

After model sharding, apply_model_infrastructure() calls attach_context_parallel_hooks() on each model part (for non-TE models).
At each training step, make_cp_batch_and_ctx() creates a CP context manager that shards the batch along the sequence dimension and sets up context_parallel() from torch.distributed.tensor.experimental.
For TE attention models, make_cp_batch_for_te() uses THD format and TE’s thd_get_partitioned_indices for sharding.

CP with Sequence Packing

CP works with packed sequences. The packed_sequence_size must be divisible by cp_size. When using TE, chunks are sharded per-chunk via _shard_thd_chunk_for_te().

Sequence Packing

Packing multiple sequences into a single training sample for efficiency.

Config

packed_sequence:
  packed_sequence_size: 4096   # 0 = disabled

step_scheduler:
  local_batch_size: 1          # must be 1 for packed sequences

When packed_sequence_size > 0, the dataset collator packs sequences up to that length. local_batch_size must be 1 because each “sample” is already a packed batch.

MoE Distributed Training

Expert Parallelism

Set ep_size > 1 to distribute experts across GPUs. This creates a separate moe_mesh alongside the main device_mesh:

distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true

The moe_mesh shape is (pp_size, ep_shard_size, ep_size) with dimension names ("pp", "ep_shard", "ep").

Constraint: dp_cp_size (= dp_size * cp_size) must be divisible by ep_size.

MoE sub-config

distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true

  moe:
    reshard_after_forward: false
    ignore_router_for_ac: false
    wrap_outer_model: true

The moe sub-section maps to MoEParallelizerConfig and is only instantiated when ep_size > 1.

Full MoE example (Qwen3-30B-A3B on 8 GPUs)

distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1
  pp_size: 1
  ep_size: 8
  sequence_parallel: false
  activation_checkpointing: true

MegatronFSDP limitations

Despite its name, megatron_fsdp does not support expert parallelism (ep_size > 1), pipeline parallelism (pp_size > 1), or sequence_parallel. Use fsdp2 for these features.

Parallelism Sizing Guidelines

Dense models

Model size	TP	PP	CP	Strategy
< 3B	1	1	1	FSDP2 (DP only)
3-13B	2-4	1	1	FSDP2 + TP
13-70B	4-8	2-4	1	FSDP2 + TP + PP
70B+	8	4-8	1	FSDP2 + TP + PP
Any + long seq (8K+)	as above	as above	2-8	add CP

MoE models

MoE models need less TP than dense models of similar total parameter count because only a fraction of parameters are active per token. EP is the primary scaling dimension:

Model	TP	PP	EP	Notes
Small MoE (<10B total)	1	1	8	EP only
Medium MoE (10-30B total)	1-2	1	8	small TP for shared layers
Large MoE (100B+ total)	1-2	4+	8-64	PP for depth, EP for experts

Hardware topology rules

TP must stay within a single NVLink domain (one node, typically 8 GPUs).
Use PP or DP for cross-node scaling.
TP across InfiniBand degrades throughput severely.

Code Anchors

components/distributed/config.py: FSDP2Config, MegatronFSDPConfig, DDPConfig.
components/distributed/mesh.py: MeshContext, strategy map, and mesh sizes.
components/distributed/device_mesh.py: device mesh and moe_mesh creation.
components/distributed/pipelining/config.py: PipelineConfig fields.
components/moe/config.py: MoEParallelizerConfig and MoEConfig.
recipes/_dist_setup.py: YAML parsing and distributed setup.

Pitfalls

TP across nodes destroys throughput. Always keep TP within a single NVLink domain. Use PP or DP for cross-node scaling.
PP requires _pp_plan on the model class. Not all HF models have this. Check validate_hf_model_for_pipeline_support() before enabling PP.
PP bubbles reduce GPU utilization. Use interleaved schedules (interleaved_1f1b) and smaller microbatches to reduce bubble time.
FSDP2 requires DTensor-aware state dict saving. Use safetensors with save_consolidated: true for checkpoint compatibility.
CP requires compatible attention. SDPA (Flash Attention or Efficient Attention) or TE attention only. SDPBackend.MATH is not compatible with DTensor.
MoE EP size must evenly divide dp_size * cp_size. The device mesh creation asserts dp_cp_size % ep_size == 0.
MegatronFSDP is more limited than FSDP2. It does not support PP (pp_size > 1), EP (ep_size > 1), or sequence_parallel. The MeshContext validation raises on these combinations.
DDP supports nothing beyond data parallelism. No TP, PP, CP, EP, or HSDP. Validation raises on any of these.
Activation checkpointing increases compute. It saves memory by recomputing activations during backward, but adds ~30% compute overhead.
Mixed precision policy must match model expectations. The default bfloat16 policy works for most models. FP16 models may need a custom MixedPrecisionPolicy.
packed_sequence_size must be divisible by cp_size when using CP with packed sequences.
dp_replicate_size is FSDP2-only. Passing it with megatron_fsdp or ddp raises a ValueError.

Verification

Run the smallest recipe that exercises the requested strategy. Success means exit code 0, finite loss, no NCCL timeout, and log output matching the expected TP/PP/CP/EP sizes.

Skill frontmatter

when_to_use: Adding or modifying distributed training strategies (FSDP2, HSDP, DDP), debugging multi-GPU or multi-node failures, configuring context or tensor parallelism, or tuning sharding settings. license: Apache-2.0 metadata: {"author" => "NVIDIA", "tags" => ["nemo-automodel", "distributed-training"]}