Agent Skill · NVIDIA NIM

rag-perf

Performance benchmarking for a deployed NVIDIA RAG Blueprint server: profiling pass + aiperf load test driven by a single YAML config. Not for accuracy / RAGAS scoring (use rag-eval) or for deploying / repairing services (use rag-blueprint).

Provider: NVIDIA NIM Path in repo: skills/rag-perf/SKILL.md

Skill body

RAG-Perf — config-driven perf benchmark CLI

Purpose

Drive a deployed NVIDIA RAG Blueprint server with a YAML config, run a server-side profiling pass (per-stage timing, citation quality, bottleneck inference) and an optional aiperf load test (TTFT / E2E / token & request throughput / error rate), and write a unified report. The CLI is intentionally minimal: rag-perf -c <config> plus --help / --version. Behaviour is fully config-driven; field variations belong in YAML.

Scope

Prerequisites

Instructions

  1. Pick a preset. The three under scripts/rag-perf/configs/ are:
    • quick_profile.yaml — profile-only, ~30 s. Skips load test. For fast iteration on retrieval / reranker tuning.
    • single_run.yaml — one concurrency level, profiling + aiperf, ~2 min. Regression checks.
    • sweep.yaml — multi-axis sweep. load.concurrency, rag.vdb_top_k, rag.reranker_top_k are all int | list[int]; any of them as a list becomes a sweep axis (Cartesian product).
  2. Edit the preset. Required: replace rag.collection_names: ["<collection_name>"] with a real collection on the deployed ingestor server. Verify the collection exists via GET /v1/collections on the ingestor. The placeholder <collection_name> validates fine but every request will fail at retrieval. Use a copied YAML preset for variants; the CLI surface is intentionally config-only.

  3. Run. From repo root:
    uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/single_run.yaml
    

    Same form for the other presets. The CLI accepts only -c / --config (required), --help, --version.

  4. Read stdout. Every invocation prints, in order: a startup banner, a one-line summary, the fully resolved config as YAML (so the run is reproducible from terminal output), per-grid-point progress with the shlex-joined aiperf command in copy-pastable form, a rich per-point summary table (stage breakdown with bars, citation quality, bottleneck, load-test block), and finally a side-by-side comparison table auto-labelled by whichever axis varied. See references/output-and-analysis.md.

  5. Inspect artifacts. Layout depends on run shape — flat for single-point + iterations=1, nested under iter_<i>/<point>/... otherwise. See references/output-and-analysis.md for the full directory tree, file purposes, and how to parse results.json / results.csv / report.md.

  6. Summarise for the user. When reporting back, follow the playbook in references/output-and-analysis.md#summarising-results-to-the-user: pick the canonical result file for the run shape, build a headline table (concurrency × top-k axes × TTFT × throughput × bottleneck × citation quality), compute scaling efficiency on sweeps, always flag zero citations / non-zero error rate / suspect llm_ttft_ms / small-sample p99, and propose a concrete next-experiment YAML.

  7. Tune. Schema is fully documented in docs/performance-benchmarking.md and the deeper-dive references below. Common knobs: turn aiperf.enabled: false for profile-only mode, increase load.iterations for variance estimation, set load.sleep_between_points_s: 60 for overnight Cartesian sweeps.

Examples

Profile-only (quickest signal on retrieval / reranker tuning):

uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/quick_profile.yaml

Output: rag-perf-results/quick_profile/run_<ts>/{profile_report.md, profile_results.json, profiling/}. The aiperf_rag_on/ directory is omitted. Filenames are profile_* because aiperf.enabled: false.

Single benchmark point with full report:

uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/single_run.yaml

Output: flat run_<ts>/{report.md, results.json, results.csv, profiling/, aiperf_rag_on/}.

Concurrency sweep:

uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/sweep.yaml

Output: nested run_<ts>/iter_1/<CR:_VDB-K:_RERANKER-K:_…>/{profiling,aiperf_rag_on}/ per point, plus aggregate report.md / results.json / results.csv at the run root.

Run unit tests:

uv sync --project scripts/rag-perf --extra dev   # one-time, installs pytest-asyncio
uv run --project scripts/rag-perf python -m pytest tests/unit/test_rag_perf/

Limitations

Troubleshooting

Error / signal Likely cause What to do
Configuration errors in <yaml>: • input — ... XOR rule Both input.file and input.synthetic set Pick one. The XOR validator runs at YAML load time.
input.file must end in .jsonl or .csv Extension other than .jsonl / .csv Rename or convert.
load.concurrency has duplicate values e.g. [2, 2, 4] Each concurrency maps to a unique point dir; dedupe.
warmup_requests must be >= 1 YAML had warmup_requests: 0 aiperf rejects warmup=0; minimum is 1.
LLM returned empty content (reasoning_content was populated — model exhausted its budget on chain-of-thought; raise min_query_tokens or set synthetic.disable_thinking=true). Reasoning model used CoT and ran out of tokens Set synthetic.disable_thinking: true (the default) or raise min_query_tokens.
✗ All N profiling requests failed across M point(s). + exit 1 Bad URL, server down, wrong collection Verify target.url, rag.collection_names (the <collection_name> placeholder will hit this).
Per-iteration ⚠ N profiling requests failed warning, run continues Some requests timed out / errored mid-run Check rag-server logs, raise target.timeout_s, drop concurrency.
RuntimeError: Random synthetic query generation failed at query N: ... LLM endpoint rejected a request mid-generation Partial JSONL is at synthetic.jsonl_output_path; fix endpoint and re-run with reduced num_queries, or point input.file at the partial file.
Citation count (mean): 0 and Citation relevance score: N/A for a non-empty deployment Collection mismatch between rag.collection_names and what’s actually ingested Run curl -s http://<ingestor>:8082/v1/collections to list real collections.
Tests error with ModuleNotFoundError: No module named 'pytest_asyncio' Dev extras missing uv sync --project scripts/rag-perf --extra dev.
CI: ModuleNotFoundError: No module named 'ruamel' from tests/unit/test_rag_perf/ rag-perf package missing from CI venv Add uv pip install -e ./scripts/rag-perf after the top-level install in the unit-tests job.

Gotchas

Source of truth

Piece Location
Driver scripts/rag-perf/rag_perf/cli.py (main is the single Click command)
Schema scripts/rag-perf/rag_perf/config.py (RunConfig and sub-models)
Orchestrator scripts/rag-perf/rag_perf/runner.py (BenchmarkRunner.run, RagProfiler, AiperfRunner)
aiperf plugin scripts/rag-perf/rag_perf/plugin/nvidia_rag.py
User-facing doc docs/performance-benchmarking.md
Presets scripts/rag-perf/configs/{quick_profile,single_run,sweep}.yaml
Sample queries scripts/rag-perf/examples/queries.jsonl
Synthetic prompts scripts/rag-perf/prompts/default_prompts.yaml
Config schema details references/config-schema.md
Synthetic-query generation references/synthetic-generation.md
Output layout & metric semantics references/output-and-analysis.md

Agent playbook

  1. Sync deps: uv sync --project scripts/rag-perf (one-time per checkout).
  2. Pick & customise a preset: copy scripts/rag-perf/configs/<preset>.yaml if you want a variant; always set rag.collection_names to a real collection.
  3. Run: uv run --project scripts/rag-perf rag-perf -c <config> from repo root.
  4. Read the per-point + aggregate tables on stdout. Bottleneck inference is in the per-point profiling section; comparison across points is the final aggregate table.
  5. Parse artifacts under output.dir/run_<ts>/ — see references/output-and-analysis.md. For multi-point runs, results.csv has one row per (point × iteration).
  6. Summarise for the user using the playbook in references/output-and-analysis.md#summarising-results-to-the-user — headline table, scaling-efficiency math for sweeps, mandatory flags for zero citations / non-zero errors / suspect llm_ttft_ms / low sample size, and a concrete next-experiment YAML.
  7. Tune retrieval / reranker: flip to quick_profile.yaml or aiperf.enabled: false for fast iteration, then return to single_run.yaml / sweep.yaml when characterising under load.
  8. Triage failures: see Troubleshooting above and references/output-and-analysis.md for empty-citation / bottleneck=N/A patterns.

Skill frontmatter

version: 2.6.0 license: Apache-2.0 compatibility: Repository checkout with uv; Python 3.11+; run from repo root; uv sync --project scripts/rag-perf (perf deps live in scripts/rag-perf/pyproject.toml); reachable RAG server (default http://localhost:8081); for synthetic queries an OpenAI-compatible chat-completions endpoint is required (default http://localhost:8999/v1/chat/completions); aiperf load-test phase uses the bundled nvidia_rag endpoint plugin, registered automatically when rag-perf is installed editable. metadata: {"tool-version" => "0.1.0", "author" => "NVIDIA RAG ", "github-url" => "https://github.com/NVIDIA-AI-Blueprints/rag", "endpoint-openapi-schemas" => ["docs/api_reference/openapi_schema_rag_server.json"], "argument-hint" => "rag-perf | aiperf | TTFT | latency | throughput | concurrency sweep | bottleneck | retrieval / reranker tuning | profile-only | synthetic queries | quick_profile.yaml | single_run.yaml | sweep.yaml | uv run --project scripts/rag-perf", "tags" => ["nvidia", "blueprint", "rag", "performance", "benchmarking", "aiperf", "nvidia-rag-blueprint"], "languages" => ["python", "shell"], "frameworks" => ["aiperf", "fastapi"], "domain" => "ai-ml"} allowed-tools: Read Grep Glob Bash(ls *) Bash(python3 *) Bash(uv *) Bash(cat *) Bash(curl *) Write Edit