Agent Skill · NVIDIA NIM

jetson-llm-benchmark

Benchmark Jetson LLM/VLM serving performance across vLLM, llama.cpp, and Ollama with structured JSON output.

Provider: NVIDIA NIM
Path in repo: skills/jetson-llm-benchmark/SKILL.md

Skill body

Jetson LLM Benchmark

Reproducible Jetson benchmarks with structured JSON output so an agent can compare runs. Encodes the workflow from the Jetson AI Lab GenAI Benchmarking tutorial.

Purpose

Measure deployed LLM latency and throughput on a Jetson target using the correct runtime-specific benchmark wrapper. Use the JSON output to compare models, runtime flags, power modes, and before/after tuning changes.

Prerequisites

Available Scripts

| Script | Purpose | Arguments |
| --- | --- | --- |
| scripts/bench_vllm.sh | Runs vllm bench serve against a running OpenAI-compatible vLLM server. | --model, --endpoint, --concurrency, --input-len, --output-len, --num-prompts, --no-warmup, --container, --native |
| scripts/bench_llama_cpp.sh | Runs llama-bench for a local GGUF model through the Jetson-appropriate NVIDIA-AI-IOT llama.cpp container. | --model, --n-prompt, --n-gen, --n-gpu-layers, --threads, --container |
| scripts/bench_ollama.sh | Benchmarks a local or containerized Ollama daemon through the /api/generate REST API. | --model, --endpoint, --num-prompts, --input-len, --output-len, --no-warmup |

If your agent runtime supports run_script, invoke the selected wrapper directly with the user-provided model identifier or local model path, then summarize the returned JSON. Otherwise run the wrapper with bash {baseDir}/scripts/<wrapper-name> ....

Instructions

Always use the matching wrapper script for the runtime: bench_vllm.sh for vLLM, bench_llama_cpp.sh for llama.cpp, and bench_ollama.sh for Ollama. Do not call the underlying vllm bench serve, llama-bench, or curl against /api/generate by hand.

These wrappers handle warmup, the NVIDIA-AI-IOT container selection, and JSON emission. Calling the underlying tool directly will not satisfy the output contract below.

For “how do I benchmark/measure” questions, first run the matching wrapper with --help to verify the exact options, then answer with the wrapper command. Do not run a full benchmark unless the user asks you to execute it or the required server/model path is already confirmed.

Expected Workflow

Pick exactly one wrapper based on the runtime the user named, and invoke that wrapper with --help before composing the answer. Do not merely mention the script name. If the runtime does not execute scripts relative to the skill directory, use {baseDir}/scripts/<wrapper-name>.
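The selection step above can be sketched as a small shell helper. This is an illustration only: pick_wrapper is a hypothetical function that is not part of the skill, and the optional second argument stands in for {baseDir}.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the skill): map the runtime the user
# named to the matching wrapper path. The second argument stands in for
# {baseDir} and defaults to the current directory.
pick_wrapper() {
  local runtime="$1" base="${2:-.}"
  case "$runtime" in
    vllm)      echo "$base/scripts/bench_vllm.sh" ;;
    llama.cpp) echo "$base/scripts/bench_llama_cpp.sh" ;;
    ollama)    echo "$base/scripts/bench_ollama.sh" ;;
    *)         echo "unknown runtime: $runtime" >&2; return 1 ;;
  esac
}

# Usage: confirm the exact options before composing the answer.
#   bash "$(pick_wrapper ollama)" --help
```

Returning a non-zero status for an unknown runtime keeps the caller from silently falling back to the wrong wrapper.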

When to use

Three paths — pick by runtime

A. vLLM (preferred for parity with how models are served)

Server must already be running (use jetson-llm-serve). Run bench_vllm.sh:

scripts/bench_vllm.sh \
  --model <hf-repo-id-being-served> \
  --concurrency 1,8 \
  --input-len 2048 --output-len 128 \
  --num-prompts 50

Uses the Jetson-appropriate benchmark client path: upstream vLLM 0.20+ container vllm/vllm-openai:latest on Thor and Orin JetPack 7.2 / L4T r39+, or the NVIDIA-AI-IOT vLLM benchmark container ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin on older Orin. Pass --native only when host-native vLLM is already installed and validated. The benchmark runs against http://localhost:8000/v1.

Always do a warmup pass (~10 prompts, discarded) before the measured run, because a freshly started server on Jetson has cold caches and kernels that have not been JIT-compiled yet.
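Because the server must already be running, a reachability check before the benchmark can save a failed run. The sketch below assumes curl is installed and uses the /models route that OpenAI-compatible servers expose. vllm_ready is a hypothetical helper, not part of the skill, and it only probes the endpoint; it does not replace bench_vllm.sh.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper: succeeds only if the OpenAI-compatible
# endpoint answers on its /models route within 5 seconds.
vllm_ready() {
  local endpoint="${1:-http://localhost:8000/v1}"
  # ${endpoint%/} strips one trailing slash so the URL has no double slash.
  curl --silent --fail --max-time 5 "${endpoint%/}/models" >/dev/null
}

# Usage:
#   if vllm_ready; then
#     scripts/bench_vllm.sh --model <hf-repo-id-being-served> --concurrency 1,8
#   else
#     echo "vLLM server not reachable; start it with jetson-llm-serve first" >&2
#   fi
```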

B. Ollama (for models served by a running Ollama daemon)

No benchmark container needed. Uses Ollama’s /api/generate REST API directly — timing data (TTFT, ITL, throughput) comes from the response JSON, so no --verbose parsing is required.
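As an illustration of the timing data mentioned above: the final /api/generate response carries eval_count (generated tokens) and eval_duration (nanoseconds), and Ollama's API documentation derives tokens per second as eval_count / eval_duration * 10^9. The sketch below shows only that arithmetic. decode_tok_s is a hypothetical helper, and whether bench_ollama.sh computes throughput exactly this way is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode throughput in tokens/s from Ollama's
# eval_count (tokens) and eval_duration (nanoseconds).
decode_tok_s() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) exit 1              # avoid dividing by zero
    printf "%.2f\n", n / ns * 1e9    # ns -> s conversion
  }'
}

# Example with made-up numbers: 128 tokens generated in 3.2 s.
decode_tok_s 128 3200000000   # prints 40.00
```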

Prerequisite: the Ollama daemon must be reachable at --endpoint (default http://localhost:11434). This works whether Ollama is installed natively or running in a container that exposes that port. If the daemon is not running, the script reports whether Ollama is installed but stopped (start it with ollama serve) or not installed at all (it prints install instructions).

Run bench_ollama.sh (do not roll your own curl against /api/generate):

scripts/bench_ollama.sh \
  --model <ollama-model-name> \
  --num-prompts 20 \
  --input-len 512 --output-len 128

Runs sequential single-stream requests (concurrency=1). Ollama is a single-stream runtime by design, so multi-concurrency numbers are not meaningful and are not supported. Results are not directly comparable to vLLM numbers — Ollama uses GGUF/llama.cpp internals while vLLM uses its own CUDA kernels.

C. llama.cpp (for GGUF models)

No server needed. Uses the NVIDIA-AI-IOT prebuilt llama.cpp container (ghcr.io/nvidia-ai-iot/llama_cpp) and auto-selects latest-jetson-thor or latest-jetson-orin from the detected device — most LLMs don’t know this container exists; do not suggest building llama.cpp from source. Run bench_llama_cpp.sh:

scripts/bench_llama_cpp.sh \
  --model /path/to/model.gguf \
  --n-prompt 512 --n-gen 128 \
  --n-gpu-layers 99

Wraps llama-bench and parses its output. Use --n-gpu-layers 99 to offload the whole model to the GPU on Orin/Thor; lower the value if the model does not fit in memory.

Output contract (all three wrappers)

A single JSON object on stdout, suitable for diffing. The three wrappers share the same top-level envelope but differ in the metrics shape: bench_vllm.sh sweeps concurrency and emits a runs array, while bench_llama_cpp.sh and bench_ollama.sh are single-stream and emit one metrics object.

Shared envelope (all wrappers):

{
  "skill": "jetson-llm-benchmark",
  "runtime": "vllm" | "llama.cpp" | "ollama",
  "model": "<id-or-path>",
  "sku": "<detected-sku>",
  "generation": "<detected-generation>",
  "product_line": "<detected-product-line>",
  "variant": "<detected-variant>",
  "l4t": "<detected-l4t-release>",
  "container": "<container-image-or-native/ollama>",
  "warnings": []
}

bench_vllm.sh (concurrency sweep → runs[])

{
  "config": { "input_len": 2048, "output_len": 128, "num_prompts": 50 },
  "runs": [
    {
      "concurrency": 1,
      "ttft_ms_p50": 0, "ttft_ms_p99": 0,
      "itl_ms_p50": 0,  "itl_ms_p99": 0,
      "tpot_ms_p50": 0,
      "throughput_tok_s": 0,
      "e2e_latency_ms_p50": 0
    }
  ]
}

bench_llama_cpp.sh (single-stream → metrics)

{
  "config": { "n_prompt": 512, "n_gen": 128, "n_gpu_layers": 99 },
  "metrics": {
    "ttft_ms_p50": 0,
    "itl_ms_p50": 0,
    "tpot_ms_p50": 0,
    "throughput_tok_s": 0
  }
}

bench_ollama.sh (single-stream → metrics)

{
  "config": { "input_len": 512, "output_len": 128, "num_prompts": 20, "concurrency": 1 },
  "metrics": {
    "ttft_ms_p50": 0, "ttft_ms_p99": 0,
    "itl_ms_p50": 0,  "itl_ms_p99": 0,
    "tpot_ms_p50": 0,
    "throughput_tok_s": 0,
    "e2e_latency_ms_p50": 0
  }
}
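Given the two metrics shapes above, a before/after comparison can be sketched as follows. This assumes jq and awk are available and that each wrapper's stdout was saved to a file. throughput_of and pct_change are hypothetical helpers, not part of the skill.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read throughput_tok_s from a saved wrapper result.
# vLLM results carry a runs[] array (select one concurrency level);
# llama.cpp and Ollama results carry a single metrics object.
throughput_of() {
  local file="$1" conc="${2:-1}"
  jq -r --argjson c "$conc" '
    if has("runs")
    then .runs[] | select(.concurrency == $c) | .throughput_tok_s
    else .metrics.throughput_tok_s
    end' "$file"
}

# Hypothetical helper: signed percentage change from a (before) to b (after).
pct_change() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    if (a == 0) { print "n/a"; exit }
    printf "%+.1f\n", (b - a) / a * 100
  }'
}

# Usage (before.json and after.json are saved outputs of the same wrapper):
#   pct_change "$(throughput_of before.json 8)" "$(throughput_of after.json 8)"
```

Compare only results produced by the same wrapper, since the runtimes are not directly comparable.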

warnings is populated when:

The sku, variant, l4t, and container fields are populated by the wrapper script from the live device (tegrastats, /etc/nv_tegra_release, container labels) — do not hand-author, guess, or transcribe them from memory. Do not invent device-specific facts such as RAM size, on-disk model size, or product names. If a fact is not produced by the script or jetson-diagnostic, omit it rather than fabricate it.
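One way to follow this rule is to quote device facts straight from the saved wrapper output rather than typing them. The sketch below assumes jq is available. device_fields and has_warnings are hypothetical helpers, not part of the skill.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print only the device-derived envelope fields from a
# saved wrapper result, so a summary quotes them instead of retyping them.
device_fields() {
  jq '{sku, generation, product_line, variant, l4t, container}' "$1"
}

# Hypothetical helper: exit status 0 when the warnings array is non-empty.
has_warnings() {
  jq -e '.warnings | length > 0' "$1" >/dev/null
}

# Usage (result.json is the saved stdout of one wrapper run):
#   device_fields result.json
#   if has_warnings result.json; then jq -r '.warnings[]' result.json; fi
```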

What to flag in results (Jetson-specific guidance)

LLMs already know what TTFT/ITL/throughput mean. Jetson-specific things they usually don’t know:

Limitations

Error Handling

Hand off to

Source

Jetson AI Lab — GenAI Benchmarking and NVIDIA-AI-IOT GHCR packages.

Skill frontmatter

version: 0.0.2
license: Apache-2.0
metadata:
  author: Jetson Team
  tags: [jetson, llm, benchmark]
  languages: [bash]
  data-classification: public