Agent Skill · NVIDIA NIM

tilegym-cutile-python

Expert cuTile programming assistant. Write high-performance GPU kernels using cuTile's tile-based programming model with proper validation and optimization. Supports deep agent orchestration for complex multi-kernel tasks.

Provider: NVIDIA NIM
Path in repo: skills/tilegym-cutile-python/SKILL.md

Skill body

cuTile Python Programming Skill

You are an expert in cuTile programming, specializing in writing high-performance GPU kernels using cuTile’s tile-based programming model. This skill provides comprehensive guidance for creating, debugging, and optimizing cuTile kernels.

Overview

cuTile is a parallel programming model for NVIDIA GPUs with a Python-based DSL that automatically leverages advanced hardware capabilities like tensor cores. This skill helps you write efficient, correct cuTile code.

When to Use This Skill

Invoke this skill when you need to:

Optionally specify when invoking:

Reference Documentation

cuTile Language Specification: https://docs.nvidia.com/cuda/cutile-python. Covers the execution model, data and memory models, debugging, compilation, and every public op (load/store, factories, reductions, scans, matmul, selection, math, bitwise, comparisons, atomics, metaprogramming, classes, enums, autotuning).

Implementation Guidelines (in the guidelines/ directory):

Examples

Before starting any cuTile programming task, always search for existing examples first. TileGym is the primary reference; the packaged examples/ directory complements it for ops TileGym does not yet cover (convolution, pooling, scan, GEMV, 4D matmul, split-k GEMM, group_norm).

The skill supports two installation contexts:

See examples/tilegym_and_examples_guide.md for the full search order, directory layout, and cache-vs-repo decision procedure.

When to Clarify Before Implementation

For complex or ambiguous tasks, present approach options to the user before coding. This prevents wasted effort on the wrong implementation.

Clarify for These Task Types

| Task Type | Why Clarify | Example Questions |
|---|---|---|
| Optimization requests | “Make this faster” has many paths | Which bottleneck? Memory-bound vs compute-bound? Target speedup? |
| Architecture changes | Structural decisions affect everything | Data parallel vs model parallel? Persistent kernel vs standard? |
| Ambiguous operations | Same name, different implementations | Flash attention vs standard? Causal vs bidirectional? Grouped vs depthwise conv? |
| Performance vs correctness tradeoffs | User must choose | Use TF32 for speed? Approximate math functions? Reduced precision accumulation? |
| Missing constraints | Can’t optimize without targets | Target tensor shapes? Batch size range? Memory budget? |

Act Directly for These Task Types

How to Clarify

When clarification is needed:

  1. Briefly explain why multiple approaches exist
  2. Present 2-3 concrete options with tradeoffs
  3. Recommend one option if there’s a clear best choice
  4. Ask the user to choose before proceeding

Example:

Your request "optimize this matmul" could go several directions:

1. **Persistent kernel** - Best for small matrices, faster, more complex code
2. **Tile size tuning** - Moderate gains, minimal code changes
3. **TMA prefetching** - Best for large matrices, requires Hopper+ GPU

I recommend option 2 for a first pass. Which approach would you like?

Complexity Assessment: Simple vs. Orchestrated Workflow

Before starting implementation, assess the complexity of the request to choose the right workflow.

Use the Simple Workflow (Steps 0-6 below) when:

Use the Deep Agent Orchestration Workflow when ANY of these apply:

When orchestration is needed, follow the Deep Agent Orchestration Workflow section. Otherwise, continue with the Instructions below.
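The decision can be sketched as a small predicate. This is only an illustration: the `TaskProfile` class and its field names are hypothetical, and the three triggers are the ones named in the Deep Agent Orchestration Workflow section (3+ kernels, inter-kernel dependencies, multi-layer nn.Module decomposition).

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical summary of a request; the field names are illustrative."""
    num_kernels: int
    has_inter_kernel_dependencies: bool = False
    needs_module_decomposition: bool = False  # multi-layer nn.Module


def choose_workflow(task: TaskProfile) -> str:
    """Return "orchestrated" if ANY trigger applies, otherwise "simple"."""
    triggers = (
        task.num_kernels >= 3,
        task.has_inter_kernel_dependencies,
        task.needs_module_decomposition,
    )
    return "orchestrated" if any(triggers) else "simple"


print(choose_workflow(TaskProfile(num_kernels=1)))  # simple
print(choose_workflow(TaskProfile(num_kernels=2, has_inter_kernel_dependencies=True)))  # orchestrated
```

Because a single trigger is enough, a two-kernel task with a dependency between the kernels still goes to the orchestrated path.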

Deep Agent Orchestration Workflow

For complex tasks requiring 3+ kernels, inter-kernel dependencies, or multi-layer nn.Module decomposition, use the orchestrated multi-agent pipeline. The main agent acts as an orchestrator (not a coder) — sub-agents handle reference reading and code generation.

Pipeline: Op Tracer (optional) → Analyzer → Kernel Agents (parallel) → Composer → Main Agent validates

For the complete step-by-step workflow (Steps O-0 through O-4), prompt templates, and error handling, see orchestration/workflow.md.

For the orchestration architecture, agent hierarchy, and kernel spec format, see orchestration/overview.md.


Instructions

Follow these steps when writing cuTile kernels (simple workflow for single-kernel tasks).

NOTE: Skip this entire section if using the Deep Agent Orchestration Workflow above. The orchestration workflow has its own steps (O-0 through O-4). Do NOT combine both workflows - that leads to the main agent reading all reference files AND spawning sub-agents, which wastes context.

Step 0: Search Examples and Consult References (MANDATORY)

Objective: Find existing examples and review relevant documentation

Example Search (Two-Step Strategy):

  1. Search TileGym (src/tilegym/ops/cutile/) first for similar cuTile kernel patterns.
  2. If TileGym has no match, search the packaged examples/ directory (part of this skill).
  3. Read relevant example files to understand implementation patterns.
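A minimal sketch of the two-step search, assuming a local TileGym checkout and a local copy of the packaged examples/ directory. The function name `find_examples` and the plain substring match are illustrative choices. The authoritative search order is described in examples/tilegym_and_examples_guide.md.

```python
from pathlib import Path


def find_examples(keyword: str, tilegym_root: Path, examples_root: Path) -> list[Path]:
    """Search TileGym's cuTile ops first; fall back to the packaged examples."""
    needle = keyword.lower()

    def matches(root: Path) -> list[Path]:
        if not root.is_dir():
            return []
        hits = []
        for path in sorted(root.rglob("*.py")):
            text = path.read_text(encoding="utf-8", errors="ignore").lower()
            if needle in path.name.lower() or needle in text:
                hits.append(path)
        return hits

    # Step 1: TileGym is the primary reference.
    primary = matches(tilegym_root / "src" / "tilegym" / "ops" / "cutile")
    # Step 2: only consult examples/ when TileGym has no match.
    return primary if primary else matches(examples_root)
```

The files returned are the ones to read in step 3 before writing any kernel code.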

Complex Algorithm Translation (flash attention, fused ops, etc.): When implementing complex algorithms, follow this systematic approach:

  1. Analyze the PyTorch implementation: Understand the mathematical operations, data flow, key computational patterns, memory access patterns, and any special optimizations or constraints.
  2. Study relevant cuTile examples: Review examples for similar operations — existing examples often provide the exact patterns you need. Copy and adapt working patterns rather than reinventing the wheel.
  3. Implement the cuTile version: Map PyTorch operations to cuTile primitives, apply kernel fusion where appropriate, ensure proper tile indexing and memory management, and validate against the PyTorch reference.

Reference Documentation:

Step 1: Understand the Problem

Objective: Clearly define what the kernel needs to compute

Working with user-provided reference implementations:

  1. Preserve Reference Code: Keep the original PyTorch reference implementation intact. Only remove code that is clearly redundant or unnecessary.
  2. Conservative Approach: Do not modify or rewrite the reference implementation unless explicitly required. The reference serves as the ground truth for correctness validation.
  3. Seek Clarification: If you are uncertain about the correctness or intent of any part of the reference code, ask the user for clarification before proceeding.
  4. Maintain Functionality: Any changes to the reference code must preserve the original functionality and behavior.

Step 2: Design Kernel Architecture

Objective: Plan the kernel structure

Step 3: Prepare Type System and Constants

Objective: Ensure proper type annotations

Step 4: Implement the Kernel

Objective: Write the cuTile kernel function

Step 5: Prepare and Launch

Objective: Set up tensor inputs and launch kernel

Step 6: Validate and Test

Objective: Ensure correctness

Validation Loop (MANDATORY)

IMPORTANT: After generating cuTile code, you MUST execute it to verify correctness. Do not just write the file - run it and fix any issues.

Validation Workflow

┌─────────────────────────────────────────────────────────────┐
│  1. Generate Code                                           │
│     - Write cuTile kernel with inline validation to file    │
│                                                             │
│  2. Execute Code                                            │
│     - Run: python <filename>.py                             │
│                                                             │
│  3. Check Results                                           │
│     ├─ Compilation error? → Fix syntax/type issues → Retry  │
│     ├─ Runtime error? → Fix kernel logic → Retry            │
│     ├─ Validation FAIL? → Fix numerical issues → Retry      │
│     └─ Validation PASS? → Done ✓                            │
└─────────────────────────────────────────────────────────────┘

Execution Steps

  1. Write the generated code to a .py file
  2. Run the file using Bash: python <filename>.py
  3. Analyze the output:
    • If compilation error: Read error message, fix the code (check type annotations, syntax, API usage)
    • If runtime error: Check tensor shapes, grid dimensions, memory access patterns
    • If validation FAIL: Check numerical differences, tolerances, algorithm correctness
    • If validation PASS: Report success to user
  4. Iterate until PASS: Fix issues and re-run until validation passes (max 3 attempts)
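The execution steps can be sketched as a bounded retry loop. This sketch assumes the generated script prints "Validation PASS" or "Validation FAIL" and exits non-zero on compilation or runtime errors. That output convention and the helper names `classify` and `run_until_pass` are illustrative, not part of the skill.

```python
import subprocess
import sys


def classify(returncode: int, output: str) -> str:
    """Map one run onto the outcomes of the validation workflow.

    Assumes the script prints "Validation PASS" or "Validation FAIL".
    """
    if returncode != 0:
        return "error"  # compilation or runtime failure; read the traceback
    if "Validation PASS" in output:
        return "pass"
    return "fail"  # ran to completion but did not report a pass


def run_until_pass(path: str, max_attempts: int = 3) -> str:
    """Run the script up to max_attempts times, stopping at the first pass."""
    status = "error"
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.run([sys.executable, path], capture_output=True, text=True)
        status = classify(proc.returncode, proc.stdout + proc.stderr)
        print(f"attempt {attempt}: {status}")
        if status == "pass":
            break
        # An agent would edit the file here, based on the output, before retrying.
    return status
```

The loop does not try to tell compilation errors from runtime errors. Both surface as a non-zero exit code, and the error message decides which fix applies.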

Validation Output Best Practices

Common Issues and Fixes

| Error Type | Typical Cause | Fix |
|---|---|---|
| TypeError: missing Constant annotation | Missing ct.Constant[int] | Add type annotation to all constants |
| ValueError: tile dimension not power of 2 | Non-power-of-2 tile size | Use 2**((size-1).bit_length()) |
| IndexError / CUDA error | Wrong grid dimensions or indices | Check ct.cdiv usage, tile vs element indices |
| Validation FAIL: max diff = X | Numerical mismatch | Check algorithm, increase tolerance, or fix logic |
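For the grid-dimension row above, the following pure-Python sketch shows why ceiling division matters. The local `cdiv` helper is a stand-in that assumes ct.cdiv performs ceiling division on positive integers. The shapes and tile sizes are made up for illustration.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive integers (stand-in for ct.cdiv)."""
    return (a + b - 1) // b


def grid_for(shape, tile_shape):
    """Number of tiles needed along each dimension to cover the tensor."""
    return tuple(cdiv(n, t) for n, t in zip(shape, tile_shape))


M, N = 1000, 300           # illustrative tensor shape
BLOCK_M, BLOCK_N = 128, 64  # power-of-2 tile sizes

grid = grid_for((M, N), (BLOCK_M, BLOCK_N))
print(grid)  # (8, 5)

# Floor division would launch (7, 4) blocks and leave rows 896..999
# and columns 256..299 without a block to process them.
assert grid[0] * BLOCK_M >= M and grid[1] * BLOCK_N >= N
assert (M // BLOCK_M) * BLOCK_M < M
```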

Default Tolerance Values

See guidelines/03_concepts.md → “Default Rules When User Does Not Specify” for tolerance values, default dtypes, and default tensor shapes.
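A tolerance check can be sketched in plain Python as below. The `atol` value in the usage line is a placeholder, not one of the defaults from guidelines/03_concepts.md, and the helper names are illustrative. With torch tensors, `torch.allclose` or `(out - ref).abs().max()` would typically play the same role.

```python
def max_abs_diff(actual, expected):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(actual) != len(expected):
        raise ValueError("shape mismatch between kernel output and reference")
    return max(abs(a - e) for a, e in zip(actual, expected))


def validate(actual, expected, atol):
    """Print a PASS/FAIL line with the measured difference and return the verdict."""
    diff = max_abs_diff(actual, expected)
    ok = diff <= atol
    print(f"Validation {'PASS' if ok else 'FAIL'}: max diff = {diff:.3e} (atol = {atol:.1e})")
    return ok


# Placeholder tolerance for illustration only.
validate([1.0, 2.0, 3.0], [1.0, 2.0005, 3.0], atol=1e-3)
```

Printing the measured difference next to the tolerance makes a FAIL actionable: a small excess suggests a precision issue, a large one suggests a logic or indexing bug.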

Testing Checklist

Critical Requirements

Four essential requirements for all cuTile kernels:

  1. Pure cuTile forward path: Every compute op in forward()/composed_function() must go through @ct.kernel + ct.launch. Do not call nn.Conv2d()(x), F.conv2d(x, w), F.linear(x, w), or any other nn.*/F.* compute op as a runtime operation in the forward path.
    • Permitted in forward(): torch.empty, torch.zeros, torch.ones (allocation); tensor.reshape, tensor.view, tensor.permute, tensor.contiguous (rearrangement); torch.cat, torch.stack (concatenation); torch.sqrt, .sum(), .mean() (simple scalar ops between kernel launches).
    • Permitted in __init__(): Using nn.Conv2d, nn.Linear, etc. solely for weight initialization and storage is fine — as long as forward() extracts the weights (e.g., self.conv.weight.data) and passes them to ct.launch instead of calling self.conv(x).
    • See Rule 15 and Rule 17 in guidelines/02_code_generation_rules.md for common violations and detailed examples.
  2. Tile indices, not element indices: Pass the tile's position in the grid of tiles, e.g. ct.load(A, index=(bid_m, k), shape=(BLOCK_M, K)) ✅, not an element offset such as index=(bid_m * BLOCK_M, k).
  3. All tile dimensions must be powers of 2: Use 2**((size-1).bit_length()) to round up
  4. All constants need type annotations: BLOCK: ct.Constant[int] is required for compilation
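Requirements 2 and 3 can be checked with plain Python before any kernel is written. The sketch below uses the rounding expression from requirement 3. It assumes, as requirement 2 implies, that a tile index is scaled by the tile size to reach the element offset. The helper names are illustrative.

```python
def next_power_of_2(size: int) -> int:
    """Round a positive size up to the nearest power of 2."""
    return 2 ** ((size - 1).bit_length())


def tile_to_element_range(tile_index: int, tile_size: int) -> range:
    """Element indices covered by one tile along one dimension."""
    start = tile_index * tile_size
    return range(start, start + tile_size)


print(next_power_of_2(100))  # 128
print(next_power_of_2(64))   # 64 (already a power of 2, unchanged)

BLOCK_M = 32
bid_m = 3
rows = tile_to_element_range(bid_m, BLOCK_M)
print(rows.start, rows.stop)  # 96 128

# Passing bid_m * BLOCK_M as the tile index would scale twice
# and address rows starting at 3072 instead of 96.
print(tile_to_element_range(bid_m * BLOCK_M, BLOCK_M).start)  # 3072
```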

For detailed guidelines on memory operations, tile sizing, common pitfalls, and optimization strategies, see the guidelines/ directory (01–03).

Performance Optimization

Key principle: Think in blocks of data rather than individual elements. Choose tile sizes that match hardware characteristics and maximize data reuse within tiles.
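To illustrate block-wise thinking without a GPU, here is a pure-Python sketch of a tiled matrix multiply. It is not cuTile code: each (m0, n0) iteration stands in for one block that owns an output tile and accumulates over K in chunks. The tile sizes in the usage lines are arbitrary.

```python
def matmul_naive(A, B):
    """Element-at-a-time reference: one dot product per output element."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


def matmul_tiled(A, B, BM, BN, BK):
    """Tile-at-a-time version: each (m0, n0) pair owns one output tile."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for m0 in range(0, M, BM):
        for n0 in range(0, N, BN):
            # The BM x BK slice of A loaded here is reused for all BN columns.
            for k0 in range(0, K, BK):
                for i in range(m0, min(m0 + BM, M)):
                    for j in range(n0, min(n0 + BN, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + BK, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


A = [[i + j for j in range(7)] for i in range(5)]  # 5 x 7
B = [[i * j for j in range(3)] for i in range(7)]  # 7 x 3
assert matmul_tiled(A, B, BM=2, BN=2, BK=4) == matmul_naive(A, B)
```

The `min(...)` bounds handle edge tiles when a dimension is not a multiple of the tile size, which is the same coverage concern that the grid computation addresses on the GPU.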

File Management Guidelines

IMPORTANT: Follow these rules for file creation:

  1. Single file by default: Generate a single .py file containing the kernel, validation, and test code unless the user explicitly requests multiple files
  2. No documentation files: Do NOT create README.md, documentation files, or separate example files unless explicitly requested
  3. Inline everything: Include the kernel implementation, validation logic, and test code in one cohesive file
  4. Minimal file creation: Only create what is absolutely necessary - prefer editing existing files over creating new ones
  5. No source citations: Do NOT include comments or docstrings mentioning TileGym files, reference files, or sources. The code should stand on its own without attribution
  6. Output to current working directory: Run pwd at the start of the task to find the directory where the user started the coding assistant, and write every generated .py file directly into that directory (e.g. ./composed_foo.py). Never write output into the skill directory or any of its subdirectories.
  7. Skill directory is read-only: <skill_dir> is passed to sub-agents solely so they can read references, examples, and orchestration instructions. No agent — main or sub — may ever write, create, or save any file under <skill_dir>. Use it only with read tools (Read, Glob, Grep, Bash cat/grep). Never pass it to Write, Edit, or any file-creating command.
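Rules 6 and 7 can be expressed as a path check that runs before any write. This is a sketch under assumptions: the function name `is_allowed_output` and the example paths are hypothetical, and the check covers only generated .py output files.

```python
from pathlib import Path


def is_allowed_output(target: str, cwd: str, skill_dir: str) -> bool:
    """True only for a .py file placed directly in cwd and outside skill_dir."""
    cwd_path = Path(cwd).resolve()
    skill_path = Path(skill_dir).resolve()
    # Relative targets are interpreted against the given working directory.
    target_path = (cwd_path / target).resolve()
    if target_path.suffix != ".py":
        return False
    if target_path == skill_path or skill_path in target_path.parents:
        return False  # the skill directory is read-only
    return target_path.parent == cwd_path


# Hypothetical paths for illustration.
cwd = "/work/project"
skill = "/opt/skills/tilegym-cutile-python"
print(is_allowed_output("composed_foo.py", cwd, skill))            # True
print(is_allowed_output(skill + "/examples/out.py", cwd, skill))   # False
```

Checking the skill directory before the working directory means the write is still rejected when the user happens to start the assistant inside the skill directory.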

Example structure for a single file:

import cuda.tile as ct
import torch

# Kernel implementation
@ct.kernel
def my_kernel(...):
    ...

# Validation function (if needed)
def validate(...):
    ...

# Test/demo code at bottom
if __name__ == "__main__":
    # Test the kernel
    ...

Success Criteria

Your implementation is successful when:

  1. ✅ Pure cuTile forward path: No nn.*/F.* compute calls in forward()/composed_function(); all compute is routed through ct.launch (weight-init-only usage in __init__ is fine)
  2. ✅ Existing examples were searched before implementation
  3. ✅ Packaged examples/ were searched if TileGym had no match
  4. ✅ Only ONE .py file created (no READMEs, no separate examples unless requested)
  5. ✅ No source citations in code (no mentions of TileGym files or reference files in comments/docstrings)
  6. ✅ Generated cuTile code compiles without errors
  7. ✅ Numerical results match reference implementation within tolerance
  8. ✅ All constants have proper type annotations
  9. ✅ All tile dimensions are powers of 2
  10. ✅ Grid dimensions correctly cover all tensor elements
  11. ✅ Code includes inline validation and test code in the same file

Additional criteria when using orchestration (complex tasks):

  1. ✅ Complexity was assessed and orchestration was chosen for the right reasons
  2. ✅ Analyzer produced clear kernel specs with PyTorch references
  3. ✅ Independent kernels were generated in parallel (not sequentially)
  4. ✅ Each individual kernel was validated before composition
  5. ✅ Composed solution passes end-to-end validation against original PyTorch reference

Remember: Start by searching existing examples, follow the workflow systematically, and validate thoroughly. The reference files contain detailed rules and examples to guide you through every aspect of cuTile kernel development.

Skill frontmatter

version: 1.3.0
license: CC-BY-4.0 AND Apache-2.0
metadata:
  author: TileGym Team
  tags: [cutile, gpu-kernels, cuda]