Agent Skill · NVIDIA NIM

nemo-data-designer-plugin

Use when the user wants to create a dataset, generate synthetic data, or build a data generation pipeline.

Provider: NVIDIA NIM
Path in repo: skills/nemo-data-designer-plugin/SKILL.md

Skill body

Before You Start

Do not explore the workspace first. The workflow’s Learn step gives you everything you need.

Goal

Build a synthetic dataset using the Data Designer library that matches this description:

$ARGUMENTS

Workflow

Use Autopilot mode if the user implies they don’t want to answer questions, for example by saying something like “be opinionated”, “you decide”, “make reasonable assumptions”, “just build it”, or “surprise me”. Otherwise, use Interactive mode (the default).
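The mode choice can be approximated in code. The following is only a rough sketch, assuming plain phrase matching on the user's request: the function name `select_mode` and the constant `AUTOPILOT_HINTS` are hypothetical and not part of the skill. The source asks whether the user *implies* a preference, which is a judgment call that keyword matching cannot fully capture.

```python
# Hypothetical helper: phrase list taken from the examples in the text above.
AUTOPILOT_HINTS = (
    "be opinionated",
    "you decide",
    "make reasonable assumptions",
    "just build it",
    "surprise me",
)


def select_mode(user_request: str) -> str:
    """Return "autopilot" if the request contains a known hint, else "interactive"."""
    text = user_request.casefold()
    if any(hint in text for hint in AUTOPILOT_HINTS):
        return "autopilot"
    # Interactive mode is the default when no hint is present.
    return "interactive"
```

A request such as "Make a product reviews dataset, just build it" would select Autopilot under this sketch, while a bare dataset description would fall through to Interactive.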

Read only the workflow file that matches the selected mode, then follow it:

Rules

Usage Tips and Common Pitfalls

Troubleshooting

Output Template

Write a Python file to the current directory with a load_config_builder() function returning a DataDesignerConfigBuilder. Name the file descriptively (e.g., customer_reviews.py). Use PEP 723 inline metadata for dependencies.

# /// script
# dependencies = [
#   "data-designer", # always required
#   "pydantic", # only if this script imports from pydantic
#   # add additional dependencies here
# ]
# ///
import data_designer.config as dd
from pydantic import BaseModel, Field


# Use Pydantic models when the output needs to conform to a specific schema
class MyStructuredOutput(BaseModel):
    field_one: str = Field(description="...")
    field_two: int = Field(description="...")


# Use custom generators when built-in column types aren't enough
@dd.custom_column_generator(
    required_columns=["col_a"],
    side_effect_columns=["extra_col"],
)
def generator_function(row: dict) -> dict:
    # add custom logic here that depends on "col_a" and update row in place
    row["name_in_custom_column_config"] = "custom value"
    row["extra_col"] = "extra value"
    return row


def load_config_builder() -> dd.DataDesignerConfigBuilder:
    config_builder = dd.DataDesignerConfigBuilder(
        # Declaring model configs programmatically here is the portable path:
        # it works for both local `run` and cluster `submit`, while the local
        # YAML registry alternative only works for `run`. The provider below
        # is a common default created during `nemo setup` — confirm it (or
        # discover others) with `nemo inference providers list`. See
        # references/nemo-platform-plugin-additions.md for the local-YAML alternative.
        model_configs=[
            dd.ModelConfig(
                alias="text",
                model="...",
                provider="default/nvidia-build",
                inference_parameters=dd.ChatCompletionInferenceParams(),
            ),
        ],
    )

    # Seed dataset (only if the user explicitly mentions a seed dataset path)
    # config_builder.with_seed_dataset(dd.LocalFileSeedSource(path="path/to/seed.parquet"))

    # config_builder.add_column(...)
    # config_builder.add_processor(...)

    return config_builder

Only include Pydantic models, custom generators, seed datasets, and extra dependencies when the task requires them. Prefer including model_configs when the dataset uses LLM columns: declaring the model configs in the script keeps the config portable between local run and cluster submit, while the local YAML registry alternative only works for run.

Skill frontmatter

argument-hint: describe the dataset you want to generate
license: Apache-2.0
metadata: {"owner" => "nemo-platform"}