Agent Skill · NVIDIA NIM

cupynumeric-migration-readiness

Pre-migration readiness assessor for porting NumPy to cuPyNumeric. Use BEFORE substantial porting work begins when the user asks whether code will scale on GPU, whether they should migrate to cuPyNumeric, which NumPy patterns transfer cleanly, what must be refactored before porting, or mentions pre-port assessment, scaling analysis, or refactor planning. Inspect the user's source code, look up NumPy usage, cross-reference the cuPyNumeric API support manifest, and distinguish distributed-scaling-friendly patterns from blockers such as unsupported APIs, scalar synchronization, host round-trips, Python/object-heavy control flow, shape/data-dependent branching, and in-place mutation hazards. Produce a verdict of READY, LIGHT REFACTOR, SIGNIFICANT REFACTOR, or NOT RECOMMENDED, with concrete refactor pointers.

Provider: NVIDIA NIM
Path in repo: skills/cupynumeric-migration-readiness/SKILL.md

Skill body

cuPyNumeric Migration Readiness

Purpose

Use this skill BEFORE the migration, not during. Answer one question: which of the user’s existing NumPy APIs will scale on cuPyNumeric, and which need refactoring, before they commit engineer-weeks to porting? To answer it: read the source, classify each NumPy idiom by its expected multi-GPU scaling on the Legate/NVIDIA GPU stack, cross-reference the bundled API-support manifest, and produce a structured verdict with per-finding reasoning and recipe pointers.

This is a static, read-only assessment. Inspect the user’s source with Read, Grep, and Glob. Do not execute the user’s code, modify or write files, or print environment variables or secrets. The legate and cuPyNumeric Doctor commands shown below are suggestions for the user to run, not actions this skill performs.

If you have not used this skill before, read references/getting-started.md first.

When to use this skill

Use when the user is about to migrate NumPy code to GPU and asks whether it will scale on cuPyNumeric / GPU, whether they should migrate, which parts will benefit, what must change before porting, or whether the port is worth it — or mentions pre-port assessment, scaling analysis, idiom analysis, GPU refactor planning, or identifying NumPy anti-patterns for GPU.

Decline and redirect when the request is not a pre-migration assessment:

A graph / sparse / ML / NLP workload that the user is asking to migrate is still in scope: assess it and return NOT RECOMMENDED via Gate 4. That is a verdict, not a decline.

Instructions

Run all five steps below, in order. Read the user’s code and reason about it semantically; do not emit a one-shot prose verdict.

Step 1 — Gather context

Elicit before scanning code. Each item below has a default tuned to the typical workload — use the default when the user does not volunteer specifics; do not block on questions.

State the defaults you applied at the top of the assessment so the user can correct them. If a value is indeterminable, say so plainly and proceed with the qualitative-only assessment — do not fabricate numbers beyond the defaults above.

Step 2 — Load the API support manifest

Read assets/api-support.md, the committed snapshot of the upstream NumPy-vs-cuPyNumeric comparison table. For each NumPy API the code calls, find its line and read the leading glyph:

If the Fetched: line is more than ~90 days old, refresh the snapshot — see the Available Scripts section.
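The staleness check above can be sketched in a few lines of Python. This is only an illustration: the exact layout of the Fetched: line is not specified here, so the sketch assumes an ISO date (for example `Fetched: 2025-01-15`), and `manifest_is_stale` is a hypothetical helper name, not part of the skill.

```python
import re
from datetime import date, timedelta

# Assumption: the manifest carries a line such as "Fetched: 2025-01-15".
FETCHED_RE = re.compile(r"^Fetched:\s*(\d{4}-\d{2}-\d{2})", re.MULTILINE)


def manifest_is_stale(manifest_text, today, max_age_days=90):
    """Return True when the snapshot is older than max_age_days or undated."""
    match = FETCHED_RE.search(manifest_text)
    if match is None:
        return True  # no parseable date: treat the snapshot as stale
    fetched = date.fromisoformat(match.group(1))
    return (today - fetched) > timedelta(days=max_age_days)


sample = "API support snapshot\nFetched: 2025-01-15\n"
print(manifest_is_stale(sample, today=date(2025, 3, 1)))  # False
print(manifest_is_stale(sample, today=date(2025, 6, 1)))  # True
```

Passing `today` explicitly keeps the check deterministic; a caller would normally supply `date.today()`.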

Step 3 — Read the code semantically

Walk the user’s files with Read and Grep and classify each region of array math against references/idioms-that-scale.md and references/idioms-that-block.md (full rationale and R-codes live there). Read semantically, not by regex: before flagging, confirm arr traces back to a cupynumeric array (or np.* aliased to it) and check whether the access sits inside a hot loop. Apply these rules:

For the deep “why,” read references/gpu-stack.md (memory, SM, communication, dispatch) and references/execution-model.md (lazy execution, sync points, mapper).
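To illustrate what "confirm arr traces back to a cupynumeric array (or np.* aliased to it)" means in practice, here is a minimal sketch that resolves import aliases with Python's standard `ast` module. It parses source text without executing it. The helper name `array_module_aliases` is hypothetical, and the sketch covers only import statements; tracing an individual variable back to its constructor still requires reading the code.

```python
import ast


def array_module_aliases(source):
    """Map each local import alias to the array module it comes from."""
    tracked = {"numpy", "cupynumeric"}
    aliases = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # e.g. "import cupynumeric as np" binds np -> cupynumeric
            for item in node.names:
                root = item.name.split(".")[0]
                if root in tracked:
                    aliases[item.asname or root] = root
        elif isinstance(node, ast.ImportFrom) and node.module:
            # e.g. "from cupynumeric import linalg as la" binds la -> cupynumeric
            root = node.module.split(".")[0]
            if root in tracked:
                for item in node.names:
                    aliases[item.asname or item.name] = root
    return aliases


sample = (
    "import cupynumeric as np\n"
    "import numpy as host_np\n"
    "x = np.zeros(10)\n"
)
print(array_module_aliases(sample))
# {'np': 'cupynumeric', 'host_np': 'numpy'}
```

A file that binds both modules, as in the sample, is worth a closer look: arrays crossing between `host_np` and `np` are candidate host round-trips.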

Step 4 — Produce a structured assessment

Deliver the report in this order. Cite file:line for every finding so the user can navigate.

  1. Verdict in one sentence — see “Verdict framework” below.
  2. What works (SCALES findings) — quote representative lines so the user sees what will speed up after the import swap.
  3. What blocks (BLOCKS findings) — each tied to idioms-that-block.md and a recipe in refactor-recipes.md.
  4. What’s fixable (REFACTOR findings) — group by recipe; one recipe often fixes many sites.
  5. Compatibility / cost notes (INFO findings) — SciPy boundaries, single-GPU-only linalg / FFT, RNG layout vs --gpus N.
  6. API support gaps — APIs the code calls that are unimplemented or single-GPU only per the manifest.
  7. Decision-framework summary — Gates 1–6 from references/decision-framework.md, marked pass / fail / uncertain.
  8. Recommended next steps — which recipes to apply first, whether to port one module first, and when to involve cuPyNumeric Doctor.

All 8 sections must appear, even when the verdict is READY or NOT RECOMMENDED. Under an empty section write “None for this code” or “n/a — see verdict” in one line — do NOT omit the heading; the headings are the structural contract the report is graded on. See assets/sample_report.md for worked reports.
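The "all 8 headings, even when empty" contract can be pictured with a small rendering sketch. The heading strings are taken from the list above; the function name `render_report` and the bullet layout are assumptions made for illustration, and assets/sample_report.md remains the reference for the real format.

```python
REPORT_SECTIONS = [
    "Verdict",
    "What works (SCALES findings)",
    "What blocks (BLOCKS findings)",
    "What's fixable (REFACTOR findings)",
    "Compatibility / cost notes (INFO findings)",
    "API support gaps",
    "Decision-framework summary",
    "Recommended next steps",
]


def render_report(findings):
    """Render every heading; empty sections get a one-line placeholder."""
    lines = []
    for number, title in enumerate(REPORT_SECTIONS, start=1):
        lines.append(f"{number}. {title}")
        entries = findings.get(title) or ["None for this code"]
        lines.extend(f"   - {entry}" for entry in entries)
    return "\n".join(lines)


report = render_report({"Verdict": ["READY: swap the import, then benchmark."]})
print(report)
```

Because the loop walks a fixed list of headings, a section can be empty but never missing.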

Step 5 — Hand off to cuPyNumeric Doctor for runtime validation

Direct the user to run cuPyNumeric Doctor once they have applied the recipes and the code runs:

CUPYNUMERIC_DOCTOR=1 CUPYNUMERIC_DOCTOR_FORMAT=json CUPYNUMERIC_DOCTOR_FILENAME=doctor-report.json legate --gpus 1 main.py

cuPyNumeric Doctor catches at runtime what source review can miss (scalar item access, ndarray iteration, advanced indexing, nonzero misuse, mpi4py import, in-place ops on views). End the assessment at: “now run with cuPyNumeric Doctor enabled; here is what to look for in its output.”
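For users who launch runs from a Python driver rather than a shell, the command above could be assembled as follows. This is a sketch for the user to run, not something the skill executes. The environment variable names and the `legate --gpus 1 main.py` invocation come from the command above; the helper name `doctor_invocation` is hypothetical.

```python
def doctor_invocation(script, gpus=1, report="doctor-report.json"):
    """Return (env_overrides, argv) mirroring the shell command above."""
    overrides = {
        "CUPYNUMERIC_DOCTOR": "1",
        "CUPYNUMERIC_DOCTOR_FORMAT": "json",
        "CUPYNUMERIC_DOCTOR_FILENAME": report,
    }
    argv = ["legate", "--gpus", str(gpus), script]
    return overrides, argv


overrides, argv = doctor_invocation("main.py")
print(argv)
# ['legate', '--gpus', '1', 'main.py']

# The user would then launch it with something like:
#   import os, subprocess
#   subprocess.run(argv, env={**os.environ, **overrides}, check=True)
```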

Verdict framework

Assign the verdict qualitatively, from the kinds of findings, not a score:

Verdict | When | Action
READY | No BLOCKS; few/no REFACTOR | Swap the import; benchmark
LIGHT REFACTOR | A few recipe-fixable patterns (R201–R206), or one or two simple BLOCKS | Apply 1–3 recipes from refactor-recipes.md; re-walk to READY
SIGNIFICANT REFACTOR | Multiple BLOCKS in hot paths, or any R108 (mpi4py); these are rewrites, not disqualifications | Real project; budget 1–3 engineer-weeks per module
NOT RECOMMENDED | Only two failures: Gate 2 (arrays below the 65,536 floor) or Gate 4 (wrong compute pattern). A pile of BLOCKS does not land here | Restructure first or use a different runtime

Apply these in order; the first match wins:

  1. Gate 4 fails (sparse / graph / ML / sequential / string) → NOT RECOMMENDED.
  2. Gate 2 fails (hot-path arrays < 65,536 elements/GPU, no realistic batching path) → NOT RECOMMENDED.
  3. Any R108 (mpi4py) → SIGNIFICANT REFACTOR (the parallelism-layer rewrite is the cost, not a disqualification).
  4. Multiple BLOCKS (R101–R111) across hot paths → SIGNIFICANT REFACTOR (count does not escalate past this; each BLOCKS has a documented recipe).
  5. One or two recipe-fixable BLOCKS (e.g., R101–R104 element-loop / sync) → LIGHT REFACTOR.
  6. Only REFACTOR patterns (R201–R206) → LIGHT REFACTOR; recipes are mechanical.
  7. No BLOCKS, no REFACTOR → READY.
  8. APIs missing from the manifest on the hot path → demote one tier (SIGNIFICANT stays SIGNIFICANT, never NOT RECOMMENDED). Single-GPU-only APIs matter only for multi-node.

Weigh the kinds of findings, not their count. One R101 in a hot loop outranks ten R001s — it destroys the scaling the R001s would have delivered. Conversely a pile of BLOCKS + R108 is still SIGNIFICANT, not NOT RECOMMENDED — the tiers measure engineering cost, not despair. NOT RECOMMENDED requires a size or compute-pattern failure. Full framework: references/decision-framework.md.
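The ordering of the rules can be made concrete with a small sketch. It is a deliberate simplification: the framework is qualitative, so the numeric cut-off between "one or two" and "multiple" BLOCKS (more than two) is an assumption, as are the `Findings` fields and the `assign_verdict` name. Rule 8 is modelled as a one-tier demotion capped at SIGNIFICANT REFACTOR.

```python
from dataclasses import dataclass

TIERS = ["READY", "LIGHT REFACTOR", "SIGNIFICANT REFACTOR"]
GATE2_FLOOR = 65_536  # elements per GPU, from the Gate 2 rule


@dataclass
class Findings:
    gate4_fails: bool = False            # wrong compute pattern
    hot_elements_per_gpu: int = GATE2_FLOOR
    can_batch: bool = False              # realistic batching path exists
    has_r108: bool = False               # mpi4py in use
    hot_path_blocks: int = 0             # BLOCKS findings in hot paths
    refactor_count: int = 0              # REFACTOR findings
    missing_hot_path_api: bool = False   # rule 8


def assign_verdict(f):
    """First match wins; only Gate 4 or Gate 2 reach NOT RECOMMENDED."""
    if f.gate4_fails:
        return "NOT RECOMMENDED"
    if f.hot_elements_per_gpu < GATE2_FLOOR and not f.can_batch:
        return "NOT RECOMMENDED"
    if f.has_r108 or f.hot_path_blocks > 2:   # assumption: "multiple" means > 2
        tier = 2
    elif f.hot_path_blocks >= 1 or f.refactor_count >= 1:
        tier = 1
    else:
        tier = 0
    if f.missing_hot_path_api:
        tier = min(tier + 1, 2)               # demote one tier, capped
    return TIERS[tier]


print(assign_verdict(Findings(has_r108=True, hot_path_blocks=9)))
# SIGNIFICANT REFACTOR
print(assign_verdict(Findings(hot_elements_per_gpu=1_000)))
# NOT RECOMMENDED
```

Note how a large BLOCKS count never reaches NOT RECOMMENDED: only the two gate checks at the top return it.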

What scales vs what blocks (at-a-glance)

Full taxonomy in idioms-that-scale.md and idioms-that-block.md. Pass over silently any API the manifest doesn’t list (out of scope of the upstream table — flagging it would be noise).

Reading order

The canonical, read-in-order guide lives in references/getting-started.md — read it once for orientation.

For a non-trivial assessment the must-reads are idioms-that-block.md, refactor-recipes.md, and decision-framework.md; the rest (idioms-that-scale.md, gpu-stack.md, execution-model.md, partitioning-and-balance.md, case-studies.md) are read on demand.

Limitations

Examples

A worked assessment of the bundled assets/examples/ fixtures (an example, not a template):

Verdict: LIGHT REFACTOR. scales_well.py translates cleanly; needs_refactor.py needs one allocation hoisted; blocks_scaling.py syncs every iteration via .item().

What works: scales_well.py:23-31 (stencil R005), :40-44 (reduction R002), :18-22 (elementwise R001).

What blocks: blocks_scaling.py:51-58 (R104: .item() in hot loop) → RR-sync.

What’s fixable: needs_refactor.py:21-28 (R201: alloc in loop) → RR-alloc.

Next: apply the recipes; re-walk to READY; enable CUPYNUMERIC_DOCTOR=1 on the first real run.

The full worked report is in assets/sample_report.md.
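To make the two patterns named in the worked assessment more tangible, here is a hedged before/after sketch written in plain NumPy. It is not taken from the bundled fixtures or from refactor-recipes.md; the function names are hypothetical, and the refactor shown (checking convergence every few iterations and hoisting the output buffer) is one plausible way to reduce per-iteration synchronization and allocation. After a port, the import would become `import cupynumeric as np`, subject to each call being supported per the manifest.

```python
import numpy as np  # a port would swap this for: import cupynumeric as np


def smooth_blocking(u, max_iters, tol):
    """Sync and allocate on every pass (the R104- and R201-style patterns)."""
    u = u.copy()
    for _ in range(max_iters):
        u_new = 0.5 * (u + np.roll(u, 1))        # fresh result array each pass
        delta = np.abs(u_new - u).max().item()   # scalar pulled to host each pass
        u = u_new
        if delta < tol:
            break
    return u


def smooth_refactored(u, max_iters, tol, check_every=10):
    """Reuse one output buffer and test convergence only every few passes."""
    u = u.copy()
    tmp = np.empty_like(u)                       # output buffer allocated once
    for i in range(max_iters):
        np.add(u, np.roll(u, 1), out=tmp)        # np.roll still makes a temporary
        tmp *= 0.5
        converged = False
        if (i + 1) % check_every == 0:           # one host sync per check_every passes
            converged = np.abs(tmp - u).max().item() < tol
        u, tmp = tmp, u                          # swap buffers instead of reallocating
        if converged:
            break
    return u


u0 = np.arange(8, dtype=np.float64)
a = smooth_blocking(u0, 25, 0.0)
b = smooth_refactored(u0, 25, 0.0)
print(np.allclose(a, b))  # True: same arithmetic when neither version stops early
```

The refactored version may run up to `check_every - 1` extra iterations before noticing convergence; that trade is usually acceptable when each skipped check avoids a host synchronization.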

Authoritative upstream references

Available Scripts

Script | Purpose | Arguments
scripts/fetch_api_support.py | Scrape the upstream comparison table into assets/api-support.md. Python stdlib only; standalone. | --default-path (write the committed assets/api-support.md); --docs-nvidia-url (use canonical docs.nvidia.com instead of the default GitHub Pages mirror)

The user runs this to refresh the manifest (python scripts/fetch_api_support.py --default-path).

Bundled references and assets

The references/ files are enumerated under “Reading order” above (R-code ranges: idioms-that-scale.md = R001–R007 / R301–R305; idioms-that-block.md = R101–R111 / R201–R206). Assets: assets/api-support.md (committed API snapshot, loaded in Step 2), assets/sample_report.md and assets/examples/*.py (worked report and fixtures).

Troubleshooting

Symptom | Cause | Fix
Fetched: line in the manifest > ~90 days old | Stale snapshot | Run fetch_api_support.py --default-path (user-run)
Manifest missing or scraper fails | Upstream HTML changed | WebFetch the comparison table for that assessment
NOT RECOMMENDED for many fixable BLOCKS | Heuristics applied out of order | Re-apply order: Gate 4 → Gate 2 → R108 → BLOCKS → REFACTOR; weigh kinds, not count
Kernel authoring or post-migration profiling | Out of scope | Decline and redirect (see “When to use”); no verdict

Skill frontmatter

license: CC-BY-4.0 OR Apache-2.0
compatibility: Knowledge-driven assessment; no cuPyNumeric install required. Runtime claims target Linux x86_64/aarch64 with NVIDIA compute capability >= 7.0 and CUDA 12.x/13.x. Runtime validation is delegated to cuPyNumeric Doctor.
metadata: {"author" => "NVIDIA Corporation ", "version" => "2.0.0", "tags" => ["cupynumeric", "legate", "numpy", "gpu", "distributed-computing"], "upstream" => "https://github.com/nv-legate/cupynumeric", "docs" => "https://docs.nvidia.com/cupynumeric/latest/"}