jetson-llm-benchmark
Benchmark Jetson LLM/VLM serving performance across vLLM, llama.cpp, and Ollama with structured JSON output.
Skill body
Jetson LLM Benchmark
Reproducible Jetson benchmarks with structured JSON output so an agent can compare runs. Encodes the workflow from the Jetson AI Lab GenAI Benchmarking tutorial.
Purpose
Measure deployed LLM latency and throughput on a Jetson target using the correct runtime-specific benchmark wrapper. Use the JSON output to compare models, runtime flags, power modes, and before/after tuning changes.
Prerequisites
- Run on the Jetson device that hosts the model runtime.
- For vLLM, start the OpenAI-compatible vLLM server first and know the served model ID.
- For Ollama, ensure the Ollama daemon is reachable at
--endpointand the named model is already pulled. - For llama.cpp/GGUF, provide a readable
.ggufmodel path on the host. - Put the device in the intended power mode before measuring. MAXN is preferred for comparable performance numbers.
Available Scripts
| Script | Purpose | Arguments |
|---|---|---|
scripts/bench_vllm.sh |
Runs vllm bench serve against a running OpenAI-compatible vLLM server. |
--model, --endpoint, --concurrency, --input-len, --output-len, --num-prompts, --no-warmup, --container, --native. |
scripts/bench_llama_cpp.sh |
Runs llama-bench for a local GGUF model through the Jetson-appropriate NVIDIA-AI-IOT llama.cpp container. |
--model, --n-prompt, --n-gen, --n-gpu-layers, --threads, --container. |
scripts/bench_ollama.sh |
Benchmarks a local or containerized Ollama daemon through the /api/generate REST API. |
--model, --endpoint, --num-prompts, --input-len, --output-len, --no-warmup. |
If your agent runtime supports run_script, invoke the selected wrapper directly with the user-provided model identifier or local model path, then summarize the returned JSON. Otherwise run the wrapper with bash {baseDir}/scripts/<wrapper-name> ....
Instructions
Always use the matching wrapper script for the runtime — do not call the underlying vllm bench serve, llama-bench, or curl against /api/generate by hand:
- vLLM →
scripts/bench_vllm.sh(required for the vLLM path) - llama.cpp / GGUF →
scripts/bench_llama_cpp.sh(required for the GGUF path) - Ollama →
scripts/bench_ollama.sh(required for the Ollama path)
These wrappers handle warmup, the NVIDIA-AI-IOT container selection, and JSON emission. Calling the underlying tool directly will not satisfy the output contract below.
For “how do I benchmark/measure” questions, first run the matching wrapper with
--help to verify the exact options, then answer with the wrapper command. Do
not run a full benchmark unless the user asks you to execute it or the required
server/model path is already confirmed.
Expected Workflow
Pick exactly one wrapper based on the runtime the user named, and invoke that
wrapper with --help before composing the answer. Do not merely mention the
script name. If the runtime does not execute scripts relative to the skill
directory, use {baseDir}/scripts/<wrapper-name>.
- Existing vLLM OpenAI-compatible server at
localhost:8000:{baseDir}/scripts/bench_vllm.sh --help, then show a command using--concurrency 1,8and the served model ID. - llama.cpp / GGUF /
llama-server:{baseDir}/scripts/bench_llama_cpp.sh --help, then show a command for the GGUF model path and report that prompt/generation speed maps to TTFT, ITL/TPOT, and throughput. - Ollama:
{baseDir}/scripts/bench_ollama.sh --help, then show a command with--model <ollama-tag>. Do not use vLLM or llama.cpp wrappers for Ollama.
When to use
- “Benchmark / measure / compare X on this Jetson.”
- After
jetson-llm-serveto actually quantify the deployment. - Before/after applying flags from
jetson-inference-mem-tuneto confirm the change helped.
Three paths — pick by runtime
A. vLLM (preferred for parity with how things are served)
Server must already be running (use jetson-llm-serve). Run bench_vllm.sh:
scripts/bench_vllm.sh \
--model <hf-repo-id-being-served> \
--concurrency 1,8 \
--input-len 2048 --output-len 128 \
--num-prompts 50
Uses the Jetson-appropriate benchmark client path: upstream vLLM 0.20+ container
vllm/vllm-openai:latest on Thor and Orin JetPack 7.2 / L4T r39+,
or the NVIDIA-AI-IOT vLLM benchmark container
ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin on older Orin. Pass
--native only when host-native vLLM is already installed and validated. It
runs against http://localhost:8000/v1. Always do a warmup pass first (~10
prompts, discarded) before the measured run — Jetson has cold caches and JIT’d
kernels.
B. Ollama (for models served by a running Ollama daemon)
No benchmark container needed. Uses Ollama’s /api/generate REST API directly —
timing data (TTFT, ITL, throughput) comes from the response JSON, so no
--verbose parsing is required.
Prerequisite: the Ollama daemon must be reachable at --endpoint (default
http://localhost:11434). This works whether Ollama is installed natively or
running in a container that exposes that port. If the daemon is not running,
the script will tell you whether Ollama is installed but stopped (ollama serve
to fix) or not installed at all (install instructions printed). Run
bench_ollama.sh (do not roll your own curl against /api/generate):
scripts/bench_ollama.sh \
--model <ollama-model-name> \
--num-prompts 20 \
--input-len 512 --output-len 128
Runs sequential single-stream requests (concurrency=1). Ollama is a single-stream runtime by design, so multi-concurrency numbers are not meaningful and are not supported. Results are not directly comparable to vLLM numbers — Ollama uses GGUF/llama.cpp internals while vLLM uses its own CUDA kernels.
C. llama.cpp (for GGUF models)
No server needed. Uses the NVIDIA-AI-IOT prebuilt llama.cpp container (ghcr.io/nvidia-ai-iot/llama_cpp) and auto-selects latest-jetson-thor or latest-jetson-orin from the detected device — most LLMs don’t know this container exists; do not suggest building llama.cpp from source. Run bench_llama_cpp.sh:
scripts/bench_llama_cpp.sh \
--model /path/to/model.gguf \
--n-prompt 512 --n-gen 128 \
--n-gpu-layers 99
Wraps llama-bench and parses its output. Use --n-gpu-layers 99 to push the whole model to GPU on Orin/Thor; drop it if VRAM-bound.
Output contract (all three wrappers)
A single JSON object on stdout, suitable for diffing. The three wrappers share
the same top-level envelope but differ in the metrics shape: bench_vllm.sh
sweeps concurrency and emits a runs array, while bench_llama_cpp.sh and
bench_ollama.sh are single-stream and emit one metrics object.
Shared envelope (all wrappers):
{
"skill": "jetson-llm-benchmark",
"runtime": "vllm" | "llama.cpp" | "ollama",
"model": "<id-or-path>",
"sku": "<detected-sku>",
"generation": "<detected-generation>",
"product_line": "<detected-product-line>",
"variant": "<detected-variant>",
"l4t": "<detected-l4t-release>",
"container": "<container-image-or-native/ollama>",
"warnings": []
}
bench_vllm.sh (concurrency sweep → runs[])
{
"config": { "input_len": 2048, "output_len": 128, "num_prompts": 50 },
"runs": [
{
"concurrency": 1,
"ttft_ms_p50": 0, "ttft_ms_p99": 0,
"itl_ms_p50": 0, "itl_ms_p99": 0,
"tpot_ms_p50": 0,
"throughput_tok_s": 0,
"e2e_latency_ms_p50": 0
}
]
}
bench_llama_cpp.sh (single-stream → metrics)
{
"config": { "n_prompt": 512, "n_gen": 128, "n_gpu_layers": 99 },
"metrics": {
"ttft_ms_p50": 0,
"itl_ms_p50": 0,
"tpot_ms_p50": 0,
"throughput_tok_s": 0
}
}
bench_ollama.sh (single-stream → metrics)
{
"config": { "input_len": 512, "output_len": 128, "num_prompts": 20, "concurrency": 1 },
"metrics": {
"ttft_ms_p50": 0, "ttft_ms_p99": 0,
"itl_ms_p50": 0, "itl_ms_p99": 0,
"tpot_ms_p50": 0,
"throughput_tok_s": 0,
"e2e_latency_ms_p50": 0
}
}
warnings is populated when:
nvpmodelis not in a recognized max-performance mode (MAXNorMAXN_*such asMAXN_SUPER); wattage-named modes are reported as warnings because they vary by Jetson SKU- Background processes >5% GPU during the run (use
jetson-diagnostic) tegrastatsshows thermal throttling during the run
The sku, variant, l4t, and container fields are populated by the wrapper script from the live device (tegrastats, /etc/nv_tegra_release, container labels) — do not hand-author, guess, or transcribe them from memory. Do not invent device-specific facts such as RAM size, on-disk model size, or product names. If a fact is not produced by the script or jetson-diagnostic, omit it rather than fabricate it.
What to flag in results (Jetson-specific guidance)
LLMs already know what TTFT/ITL/throughput mean. Jetson-specific things they usually don’t know:
- On Orin Nano/NX, single-stream
tok/sandconcurrency=8tok/sdiffer wildly because of memory bandwidth saturation, not compute. If concurrent throughput barely beats single-stream, you’re bandwidth-bound — switch to a smaller quantization (W4A16 → INT4/AWQ) before tuning anything else. - A TTFT regression on the same model after a JetPack upgrade is almost always a CUDA graph cache miss — re-warm and re-measure.
- Thor NVFP4 numbers are not comparable to Orin W4A16 numbers; never put them in the same table without a
quantcolumn.
Limitations
- vLLM measurements require an already-running OpenAI-compatible vLLM server. This skill benchmarks the server; it does not launch or tune the server.
- Ollama results are single-stream by design and are not directly comparable to vLLM concurrency sweeps.
- llama.cpp/GGUF benchmarking runs a NVIDIA-AI-IOT container by default. Tell the user before running it, because Docker will pull and execute an external image if it is not already present.
- Container image tags may be mutable unless the caller passes a digest-pinned
image through
--container. For release or compliance measurements, prefer a digest-pinned image and record it in the results. The default vLLM benchmark client image is upstream vLLM 0.20+ viavllm/vllm-openai:lateston Thor and Orin JetPack 7.2 / L4T r39+, and NVIDIA-AI-IOTghcr.io/nvidia-ai-iot/vllm:latest-jetson-orinon older Orin. - Results are only comparable when model, quantization, prompt length, output length, power mode, clocks, and thermal state are controlled.
Error Handling
- Exit
2: invalid arguments, missing--model, or a required model file is not readable. Re-run the wrapper with--helpand correct the path or model ID. - Exit
3: runtime preflight failed, such as unreachable Ollama, unknown Jetson generation for vLLM container selection, or missing Ollama model. Start the service, pull the model, or pass an explicit--container. - Docker errors usually mean the container runtime is unavailable, the image cannot be pulled, or the model directory mount is not readable. Report the exact stderr and do not fabricate benchmark numbers.
- Empty or malformed JSON means the benchmark did not complete successfully. Preserve the raw error, fix the runtime issue, and rerun.
Hand off to
jetson-inference-mem-tuneif results indicate memory pressure.jetson-speculative-decodingif TTFT is acceptable but TPOT is too slow.jetson-diagnosticifwarningsis non-empty.
Source
Jetson AI Lab — GenAI Benchmarking and NVIDIA-AI-IOT GHCR packages.