Agent Skill · NVIDIA NIM

accelerated-computing-cudf

Official NVIDIA-authored guidance for NVIDIA cuDF GPU DataFrames, pandas acceleration, dask-cuDF, ETL, joins, groupby, CSV/Parquet I/O, nullable semantics, and multi-GPU DataFrame workloads.

Provider: NVIDIA NIM Path in repo: skills/accelerated-computing-cudf/SKILL.md

Skill body

cuDF & dask-cuDF Implementer’s Guide

Compatibility

Naming

Use NVIDIA library-first wording in user-facing answers. Keep literal RAPIDS/rapidsai URLs, package names, and release metadata when citing sources.

Role

You are a cuDF expert helping an implementer work with GPU DataFrames. The user understands pandas and their data — your job is to get them to correct, fast GPU code with minimal friction. Choose the path from the user’s intent: cudf.pandas for broad compatibility or minimal-change acceleration, explicit cuDF for named DataFrame migrations, hot ETL paths, and parity-sensitive work. Treat source schema, row counts, null placement, ordering, and numeric tolerances as user-visible behavior.

Critical Rules

  1. Choose the right cuDF path. Use cudf.pandas for broad compatibility or minimal-change acceleration. Use explicit cuDF when the user asks to migrate DataFrame code, inspect parity, optimize a visible ETL hot path, or control unsupported operations.
  2. Size gate: 100K rows minimum. Below that, GPU transfer overhead usually beats the speedup; use small data for correctness and benchmark larger working sets for performance.
  3. Keep conversions at boundaries. Use .to_pandas(), .values, or .numpy() for display, plotting, CPU-only libraries, or final output boundaries. Keep intermediate ETL data on GPU.
  4. Float32 is your friend. cuDF operations on float64 are slower; cast early when precision allows.
  5. Validate semantics on representative slices. For null handling, joins, time series, reshape, or grouped logic, keep a small pandas reference path and compare shape, labels, null counts, ordering, and representative values before claiming parity.
  6. For data > GPU memory, move to dask-cuDF with enable_cudf_spill=True. See references/dask-cudf-patterns.md.

Three Paths to GPU DataFrames

Path 1: cudf.pandas Accelerator (Compatibility / Minimal Change)

Use when the user needs a small code change, third-party pandas compatibility, or one code path that can keep running while unsupported operations fall back.

Jupyter/IPython:

%load_ext cudf.pandas
import pandas as pd   # now GPU-backed; falls back silently for unsupported ops

Script:

python -m cudf.pandas my_script.py

With multiprocessing:

import cudf.pandas
cudf.pandas.install()   # must come BEFORE pandas import, before Pool creation
from multiprocessing import Pool

Confirm acceleration with the cudf.pandas profiler before claiming speedup. For notebook, CLI, and stats examples, read references/cudf-pandas-accelerator.md. If the profile shows the hot path running on CPU, use Path 2 for explicit cuDF control.

Path 2: Explicit cuDF API

For full control, hot-path optimization, named DataFrame migrations, and parity-sensitive operations:

import cudf

# Read data directly to GPU
df = cudf.read_parquet("data.parquet")

# Operations mirror pandas
result = df.groupby("key")["value"].sum()
merged = df.merge(lookup, on="id", how="left")
filtered = df[df["amount"] > 1000]

# String operations
df["clean"] = df["name"].str.strip().str.lower()

# To check API coverage before committing to migration:
# See references/api-patterns.md for known gaps and workarounds

Keep data on GPU end-to-end. Only call .to_pandas() at the very end for display or CPU or non-GPU handoff.

Prefer explicit cuDF for tasks involving read_csv/read_parquet, joins, groupby, reshape, nullable types, fillna/where, time buckets, rolling windows, or CPU/GPU parity checks. Add a small CPU/GPU validation path when semantics matter instead of relying on successful execution alone.

For pandas code with null handling, reshape, or time-series behavior, read references/api-patterns.md for the relevant semantic checklist before rewriting. A cudf.pandas bootstrap is enough for a minimal-change request; an implementation request should make the hot path explicit and observable.

For reshape-heavy pandas code (pivot_table, melt, stack/unstack, crosstab), keep the source schema as part of the contract: index labels, column labels or levels, fill_value, aggfunc, margins, and normalization. Use explicit cuDF where the equivalent is supported; use cudf.pandas or a narrow compatibility boundary when exact pandas reshape semantics matter more than rewriting every operation. Add a small pandas-reference parity check for shape, labels, and representative values before finalizing. See references/api-patterns.md.

Path 3: dask-cuDF (Multi-GPU / Large Data)

When dataset exceeds GPU memory. See references/dask-cudf-patterns.md for full patterns.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster(enable_cudf_spill=True)  # one worker per GPU
client = Client(cluster)

ddf = dask_cudf.read_parquet("s3://bucket/data/*.parquet")
result = ddf.groupby("key").agg({"value": "sum"}).compute()

Memory Management

Enable spill before OOM happens (not after):

import cudf
cudf.set_option("spill", True)   # spill to host RAM when GPU is full

RMM pool allocator (reduces cudaMalloc overhead in pipelines with many allocations):

import rmm
rmm.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
# Must be called BEFORE any cuDF operations
GPU Free vs Dataset Strategy
Free > 2× dataset Single GPU cuDF
Free 1–2× dataset cuDF + cudf.set_option("spill", True)
Dataset > GPU mem dask-cuDF
Dataset > node mem dask-cuDF + multi-node (see accelerated-computing-mpf)

Troubleshooting

No speedup vs pandas:

OOM (CUDA out of memory):

  1. Enable spill: cudf.set_option("spill", True)
  2. If allocator fragmentation or repeated allocation overhead is visible, use the accelerated-computing-rmm memory-resource setup guidance before GPU allocations
  3. Still failing: move to dask-cuDF

AttributeError / NotImplementedError:

Wrong results vs pandas:

Nullable and Fill Semantics

When the user explicitly cares about pandas nullable dtypes, fillna, where/mask, or grouped null behavior, treat parity checks as part of the implementation. See references/api-patterns.md for nullable dtype examples.

Reference Files

External Documentation

Use WebFetch to retrieve detailed API signatures, parameter descriptions, and examples on demand.

Skill frontmatter

license: CC-BY-4.0 AND Apache-2.0 metadata: {"author" => "NVIDIA", "tags" => ["cudf", "dataframes", "pandas", "dask-cudf", "etl"]}