Agent Skill · NVIDIA NIM

nemo-mbridge-perf-expert-parallel-overlap

Validate and use MoE expert-parallel communication overlap in Megatron-Bridge, including overlap_moe_expert_parallel_comm, delay_wgrad_compute, and flex dispatcher backends such as DeepEP and HybridEP.

Provider: NVIDIA NIM
Path in repo: skills/nemo-mbridge-perf-expert-parallel-overlap/SKILL.md

Skill body

MoE Expert-Parallel Overlap Skill

References

What It Is

Expert-parallel (EP) overlap hides the cost of the token dispatch/combine all-to-all communication by running it concurrently with expert FFN compute. Optionally, delayed expert weight-gradient computation (delay_wgrad_compute) adds further overlap by deferring wgrad so that it runs concurrently with the next layer’s forward.

Bridge supports two dispatcher paths:

| Dispatcher | Backend | When to use |
|---|---|---|
| alltoall | Standard MoE all-to-all | Default, broadest compatibility |
| flex | DeepEP or HybridEP | Higher overlap on Ampere/Hopper/Blackwell |

Quick Decision

Use EP overlap when:

Prefer:

Avoid EP overlap when:

Expected outcome:

Correctness-First alltoall Benchmark

For the plain EP-overlap isolation benchmark, keep flex dispatch and delayed wgrad disabled. The measured shape was Qwen3 MoE 30B-A3B SFT on 16 H100 GPUs: EP=16, alltoall, BF16, global batch size 1024, CUDA graphs disabled, moe_permute_fusion=false, measured over iterations 3-8.

Use these overrides for the plain-overlap case:

--cuda_graph_impl none \
--moe_flex_dispatcher_backend None \
--moe_a2a_overlap false \
comm_overlap.overlap_moe_expert_parallel_comm=true \
comm_overlap.delay_wgrad_compute=false \
model.moe_shared_expert_overlap=false

Do not use --moe_a2a_overlap true for this isolation test: the performance harness helper enables both overlap_moe_expert_parallel_comm and delay_wgrad_compute, so it does not isolate plain EP overlap.

Steady-window timing from that benchmark:

| Case | Steady mean | Relative |
|---|---|---|
| no EP overlap | 41.25s | 1.000x |
| EP overlap | 31.31s | 1.317x |
| EP overlap plus delay_wgrad_compute | 31.20s | 1.322x |

This is evidence for enabling plain EP overlap on this inter-node all-to-all shape. It does not show a meaningful independent win from delayed wgrad (31.31s versus 31.20s, about 0.35%), and it does not validate fused MoE permutation, because that path was disabled for the runtime stack.
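The relative figures can be re-derived from the steady means. The sketch below only repeats the arithmetic on the three numbers in the timing table; the variable names are illustrative and not part of Megatron-Bridge.

```python
# Steady-window means in seconds, copied from the timing table.
steady_mean_s = {
    "no EP overlap": 41.25,
    "EP overlap": 31.31,
    "EP overlap plus delay_wgrad_compute": 31.20,
}

baseline = steady_mean_s["no EP overlap"]

# Relative speedup = baseline time / case time (higher is faster).
relative = {case: baseline / t for case, t in steady_mean_s.items()}

# Incremental effect of delayed wgrad on top of plain EP overlap.
wgrad_gain = (
    steady_mean_s["EP overlap"]
    / steady_mean_s["EP overlap plus delay_wgrad_compute"]
    - 1.0
)

for case, speedup in relative.items():
    print(f"{case}: {speedup:.3f}x")
print(f"delay_wgrad_compute on top of EP overlap: {wgrad_gain:.2%}")
```

Separating the incremental gain from the total speedup makes it easier to see that almost all of the improvement comes from plain EP overlap.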

Enablement

alltoall dispatcher

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = False
cfg.model.moe_shared_expert_overlap = False

cfg.model.expert_model_parallel_size = 8
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.bf16 = True
cfg.model.fp16 = False

Enable delay_wgrad_compute=True only after the plain overlap path is known to work and its extra compatibility constraints have been checked.

flex dispatcher (DeepEP or HybridEP)

from megatron.bridge.training.flex_dispatcher_backend import apply_flex_dispatcher_backend

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False

apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="deepep")
# or: apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="hybridep")

Compatibility And Constraints

Minimal Working Config

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = False
cfg.model.expert_model_parallel_size = 4
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.moe_shared_expert_overlap = False
cfg.model.bf16 = True

Use this as the correctness-first starting point. Add delayed wgrad, flex dispatch, and CUDA-graph interactions only after the plain overlap path is known to work.

Minimal Runnable Command

The following performance harness example is meant to run inside a Slurm allocation. Keep the model, parallelism, dispatcher, and runtime fixed, and vary only the two overlap overrides:

uv run python scripts/performance/run_script.py \
  -m qwen \
  -mr qwen3_30b_a3b \
  --task pretrain \
  -g h100 \
  -c bf16 \
  -ng 16 \
  -gn 8 \
  --max_steps 8 \
  --config_variant v1 \
  --cuda_graph_impl none \
  --moe_flex_dispatcher_backend None \
  --moe_a2a_overlap false \
  --tokenizer_type NullTokenizer \
  comm_overlap.overlap_moe_expert_parallel_comm=true \
  comm_overlap.delay_wgrad_compute=false \
  model.moe_shared_expert_overlap=false

Do not use --moe_a2a_overlap true when separating plain EP overlap from delayed wgrad: the performance harness helper enables both overlap_moe_expert_parallel_comm and delay_wgrad_compute.
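One way to keep the comparison honest is to generate the trailing overrides for each case from a single table, so that only the two overlap keys differ. This is a minimal sketch: the case labels and the helper name overrides_for are hypothetical, and only the key=value strings themselves come from the command above.

```python
# Override that stays fixed across all cases (taken from the command above).
FIXED_OVERRIDES = ["model.moe_shared_expert_overlap=false"]

# Only these two comm_overlap keys change between cases.
# The case labels are hypothetical names chosen for this sketch.
CASES = {
    "no_ep_overlap": {
        "overlap_moe_expert_parallel_comm": False,
        "delay_wgrad_compute": False,
    },
    "ep_overlap": {
        "overlap_moe_expert_parallel_comm": True,
        "delay_wgrad_compute": False,
    },
    "ep_overlap_plus_delayed_wgrad": {
        "overlap_moe_expert_parallel_comm": True,
        "delay_wgrad_compute": True,
    },
}


def overrides_for(case: str) -> list[str]:
    """Return the trailing key=value overrides for one benchmark case."""
    flags = CASES[case]
    varied = [
        f"comm_overlap.{key}={str(value).lower()}" for key, value in flags.items()
    ]
    return varied + FIXED_OVERRIDES


for name in CASES:
    print(name, " ".join(overrides_for(name)))
```

The table deliberately has no case with delayed wgrad enabled and EP overlap disabled, since delayed wgrad requires EP overlap.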

Unit test verification (the -k "moe" filter applies to both test files, not only the first):

uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py \
  tests/unit_tests/training/test_deepep.py \
  -k "moe" -q

Verification

Unit tests

uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py \
  tests/unit_tests/training/test_deepep.py -q

Log checks

After a successful run with EP overlap:

  1. Confirm no assertion errors during CommOverlapConfig finalization
  2. Confirm overlap_moe_expert_parallel_comm appears as True in the logged config
  3. If using flex dispatcher, confirm moe_token_dispatcher_type = "flex" and the correct backend in logs
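These log checks can be scripted. The sketch below assumes the config dump prints each field as `key: value` or `key = value` on one line; the real log format may differ, so treat the regular expression and the sample excerpt as assumptions. Both function names are hypothetical.

```python
import re


def logged_value(log_text: str, key: str):
    """Return the first value logged for `key`, or None if the key is absent.

    Assumes `key: value` or `key = value` on one line (an assumption about
    the log format); adjust the pattern if your logs look different.
    """
    match = re.search(rf"\b{re.escape(key)}\b\W+(\w+)", log_text)
    return match.group(1) if match else None


def check_ep_overlap_log(log_text: str, expect_flex_backend=None) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if "AssertionError" in log_text:
        problems.append("assertion error present in log")
    if logged_value(log_text, "overlap_moe_expert_parallel_comm") != "True":
        problems.append("overlap_moe_expert_parallel_comm is not logged as True")
    if expect_flex_backend is not None:
        if logged_value(log_text, "moe_token_dispatcher_type") != "flex":
            problems.append("moe_token_dispatcher_type is not 'flex'")
        if logged_value(log_text, "moe_flex_dispatcher_backend") != expect_flex_backend:
            problems.append(
                f"moe_flex_dispatcher_backend is not {expect_flex_backend!r}"
            )
    return problems


# Hypothetical log excerpt, for illustration only.
sample = '''
overlap_moe_expert_parallel_comm: True
moe_token_dispatcher_type: "flex"
moe_flex_dispatcher_backend: "deepep"
'''
print(check_ep_overlap_log(sample, expect_flex_backend="deepep"))
```

Returning a list of problems rather than raising on the first one lets a single pass over the log report every failed check.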

Success criteria

Code Anchors

Bridge overlap validation

if self.user_comm_overlap_cfg.overlap_moe_expert_parallel_comm is True:
    assert model_cfg.expert_model_parallel_size > 1, ...
    assert model_cfg.num_moe_experts > 1, ...
    assert model_cfg.moe_token_dispatcher_type in ["alltoall", "flex"], ...
    assert model_cfg.bf16 or model_cfg.fp16, ...
    assert is_torch_min_version("2.6.0"), ...
    # ... PP + VPP check, recompute checks, shared_expert_overlap check ...

Delayed wgrad validation

if self.user_comm_overlap_cfg.delay_wgrad_compute is True:
    # TE version checks for overlap_grad_reduce and gradient_accumulation_fusion
    # CUDA graph scope validations for delayed wgrad
    assert overlap_moe_expert_parallel_comm, ...

Flex-dispatcher activation

def apply_flex_dispatcher_backend(...):
    # GPU architecture check for DeepEP / HybridEP
    model_config.moe_token_dispatcher_type = "flex"
    model_config.moe_flex_dispatcher_backend = moe_flex_dispatcher_backend
    model_config.moe_shared_expert_overlap = False

Perf harness override

def _set_moe_a2a_overlap_overrides(recipe, moe_a2a_overlap=False):
    if moe_a2a_overlap:
        recipe.comm_overlap.overlap_moe_expert_parallel_comm = True
        recipe.comm_overlap.delay_wgrad_compute = True
        recipe.model.moe_shared_expert_overlap = False
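To see why this helper cannot isolate plain EP overlap, the anchor body can be run against a stand-in recipe. The sketch below copies the function as shown above and uses SimpleNamespace objects in place of the real recipe, so make_recipe and its initial values are hypothetical.

```python
from types import SimpleNamespace


def _set_moe_a2a_overlap_overrides(recipe, moe_a2a_overlap=False):
    # Body reproduced from the anchor above.
    if moe_a2a_overlap:
        recipe.comm_overlap.overlap_moe_expert_parallel_comm = True
        recipe.comm_overlap.delay_wgrad_compute = True
        recipe.model.moe_shared_expert_overlap = False


def make_recipe():
    # Stand-in recipe (hypothetical); the real recipe is a Bridge config object.
    return SimpleNamespace(
        comm_overlap=SimpleNamespace(
            overlap_moe_expert_parallel_comm=False,
            delay_wgrad_compute=False,
        ),
        model=SimpleNamespace(moe_shared_expert_overlap=True),
    )


enabled = make_recipe()
_set_moe_a2a_overlap_overrides(enabled, moe_a2a_overlap=True)

untouched = make_recipe()
_set_moe_a2a_overlap_overrides(untouched, moe_a2a_overlap=False)

# With the flag on, both overlap settings flip together.
print(enabled.comm_overlap)
# With the flag off, the helper leaves the recipe as it was.
print(untouched.comm_overlap)
```

Because the flag sets both fields at once, the isolation benchmark keeps --moe_a2a_overlap false and sets the two comm_overlap keys individually.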

Tests

| File | Coverage |
|---|---|
| tests/unit_tests/training/test_comm_overlap.py | EP overlap validation, delayed wgrad, CUDA graph + wgrad interaction |
| tests/unit_tests/training/test_deepep.py | DeepEP/HybridEP helper activation and GPU gating |

Failure Diagnosis

| Symptom | Likely Cause | How To Confirm | Fix |
|---|---|---|---|
| assert expert_model_parallel_size > 1 | EP not configured | Check expert_model_parallel_size | Set EP > 1 |
| assert moe_token_dispatcher_type | Wrong dispatcher | Check dispatcher type | Use "alltoall" or "flex" |
| assert on BF16/FP16 | Wrong precision | Check bf16 and fp16 | Set bf16 = True |
| hang during training | PyTorch < 2.6 | Check PyTorch version | Upgrade to >= 2.6.0 |
| assert virtual_pipeline_model_parallel_size | PP > 1 without VPP | Check PP and VPP config | Set VPP when PP > 1 |
| assert recompute_granularity | Full recompute enabled | Check recompute settings | Disable full recompute |
| assert overlap_moe_expert_parallel_comm required | delayed wgrad without EP overlap | Check delay_wgrad_compute without overlap | Enable EP overlap first |
| assert gradient_accumulation_fusion | CUDA graph + delayed wgrad | Check graph scope + wgrad settings | Enable gradient_accumulation_fusion |
| assert on attention bias | CUDA graph attn + delayed wgrad + bias | Check add_bias_linear / add_qkv_bias | Disable attention bias |
| no throughput gain from flex dispatcher | apply_flex_dispatcher_backend not called | Check moe_token_dispatcher_type in logs | Call apply_flex_dispatcher_backend(...) |
| DeepEP/HybridEP silently skipped | Unsupported GPU | Check warning logs | Run on Ampere/Hopper/Blackwell |
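Several of these rows are pure config checks and can be caught before a job is launched. The sketch below is a hypothetical preflight helper, not Bridge's validation code: diagnose is an invented name, the stand-in objects are SimpleNamespace instances, and the conditions are inferred from the table (the exact assertions in Bridge may be stricter). It assumes the model config exposes pipeline_model_parallel_size alongside the fields named in this document.

```python
from types import SimpleNamespace


def diagnose(model, comm_overlap) -> list[str]:
    """Return suggested fixes for config problems listed in the table above."""
    fixes = []
    if comm_overlap.overlap_moe_expert_parallel_comm:
        if model.expert_model_parallel_size <= 1:
            fixes.append("Set EP > 1")
        if model.moe_token_dispatcher_type not in ("alltoall", "flex"):
            fixes.append('Use "alltoall" or "flex"')
        if not (model.bf16 or model.fp16):
            fixes.append("Set bf16 = True")
        if (
            model.pipeline_model_parallel_size > 1
            and model.virtual_pipeline_model_parallel_size is None
        ):
            fixes.append("Set VPP when PP > 1")
    if (
        comm_overlap.delay_wgrad_compute
        and not comm_overlap.overlap_moe_expert_parallel_comm
    ):
        fixes.append("Enable EP overlap first")
    return fixes


# Stand-in objects (hypothetical) carrying the minimal working config values.
model = SimpleNamespace(
    expert_model_parallel_size=4,
    moe_token_dispatcher_type="alltoall",
    bf16=True,
    fp16=False,
    pipeline_model_parallel_size=1,
    virtual_pipeline_model_parallel_size=None,
)
comm_overlap = SimpleNamespace(
    overlap_moe_expert_parallel_comm=True,
    delay_wgrad_compute=False,
)
print(diagnose(model, comm_overlap))
```

The PyTorch version, recompute, CUDA graph, and GPU architecture rows are left out on purpose, because they depend on runtime state that this sketch does not model.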

Known Limitations

Skill frontmatter

license: Apache-2.0
when_to_use: Enabling EP overlap to hide dispatch/combine latency, or tracing a throughput regression to an EP overlap config change; 'overlap_moe_expert_parallel_comm', 'delay_wgrad_compute', 'flex dispatcher', 'DeepEP overlap', 'HybridEP overlap'.