Agent Skill · NVIDIA NIM

jetson-llm-serve

Stand up vLLM or SGLang serving on Jetson, using upstream vLLM on Thor and Orin JetPack 7.2+, and NVIDIA-AI-IOT vLLM on older Orin.

Provider: NVIDIA NIM Path in repo: skills/jetson-llm-serve/SKILL.md

Skill body

Jetson LLM Serve

Encodes the Jetson AI Lab GenAI tutorial: on Orin JetPack 7.2 / L4T r39+, use upstream vLLM 0.20+ (vllm/vllm-openai:latest); on older Orin, pick the NVIDIA-AI-IOT prebuilt vLLM container; on Thor, use upstream vLLM 0.20+ or validated native vLLM 0.20+, and use NVIDIA SGLang 26.01 (nvcr.io/nvidia/sglang:26.01-py3, SGLang 0.5.5.post2) when SGLang is requested. Set MAXN, make Hugging Face credentials/cache available, and launch an OpenAI-compatible server. Works for both LLMs and VLMs.

Purpose

Provide a Jetson-appropriate serving recipe for an LLM or VLM using vLLM or SGLang, including runtime path, launch command, endpoint, and verification step.

When to use

For recipe-only questions, answer from this document without starting containers. Run live pre-flight checks only when the user asks you to check this device or execute the deployment.

Prerequisites

Instructions

For recipe questions, provide a complete launch recipe instead of trying to call jetson-llm-serve as a tool. A complete answer includes:

For VLM questions, explicitly say the VLM uses the same vLLM serving flow as an LLM with a different vision-language checkpoint. Do not omit vLLM or the Jetson container when answering VLM prompts.

Step 1 — Pick the runtime path (per Jetson family)

Use upstream vLLM 0.20+ on Thor (vllm/vllm-openai:latest, or a validated native vLLM 0.20+ install). On Orin JetPack 7.2 / L4T r39+, use upstream vLLM 0.20+ (vllm/vllm-openai:latest). On older Orin releases, use the NVIDIA-AI-IOT prebuilt vLLM image (packages) because it ships the correct CUDA / cuDNN / TensorRT stack for that JetPack. Use NVIDIA SGLang 26.01 (nvcr.io/nvidia/sglang:26.01-py3, SGLang 0.5.5.post2) on Thor when the user asks for SGLang, RAG, tool-use, or programmable serving; do not recommend native upstream SGLang on Orin unless a JetPack-matched release explicitly supports it.

Jetson family Runtime path
Thor (T5000, T4000) upstream vLLM 0.20+ (vllm/vllm-openai:latest) or NVIDIA SGLang 26.01 (nvcr.io/nvidia/sglang:26.01-py3, SGLang 0.5.5.post2)
AGX Orin / Orin NX / Nano Orin JetPack 7.2 / L4T r39+: upstream vLLM 0.20+ (vllm/vllm-openai:latest); older Orin: ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin

To detect the silicon era for image tags:

  1. Source the detector so exports survive in your shell:
    . skills/jetson-diagnostic/scripts/detect_jetson.sh
    
  2. Check JETSON_GENERATION (thor or orin) and choose the matching runtime path from the table above.
  3. Use JETSON_PRODUCT_LINE for a finer bucket such as thor-agx or orin-nano; JETSON_SKU remains the legacy identifier.

Do not use bash skills/jetson-diagnostic/scripts/detect_jetson.sh when you need exported variables in the caller; running with bash uses a subshell.

Step 2 — Set MAXN power mode

sudo nvpmodel -m 0 && sudo jetson_clocks

Skip this only if the user explicitly asks for a power-constrained run; otherwise benchmark and serving numbers will be inconsistent.

Step 3 — Run the server

On Thor with vLLM, use upstream vLLM 0.20+ (vllm/vllm-openai:latest) or a validated native vLLM 0.20+ install:

docker run --rm -it --runtime nvidia --network host --ipc host --name vllm \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:latest \
  vllm serve <hf-repo-id> \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --tensor-parallel-size 1

On Orin JetPack 7.2 / L4T r39+, use upstream vLLM 0.20+ (vllm/vllm-openai:latest). On older Orin releases, use the NVIDIA-AI-IOT container:

docker run --rm -it --runtime nvidia --network host --name vllm \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_TOKEN="$HF_TOKEN" \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve <hf-repo-id> \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85 \
    --tensor-parallel-size 1

HF_TOKEN is required only for gated/private Hugging Face models; omit the -e HF_TOKEN="$HF_TOKEN" line for public models that do not need Hub authentication. Passing HF_TOKEN as an environment variable can expose it through Docker inspect output, process metadata, or logs on shared systems. Prefer the narrowest-scoped token possible, rotate/revoke it after shared-container use, and use a mounted credential file or Docker secret when the deployment environment supports that pattern.

Wait for Application startup complete. Server is on http://0.0.0.0:8000/v1.

For SGLang on Thor, use NVIDIA SGLang 26.01 (nvcr.io/nvidia/sglang:26.01-py3), which packages SGLang 0.5.5.post2 and lists Jetson Thor support. Do not judge Thor SGLang support from older prerelease SGLang results:

docker run --rm -it --runtime nvidia --network host --ipc host --name sglang \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_TOKEN="$HF_TOKEN" \
  nvcr.io/nvidia/sglang:26.01-py3 \
  python3 -m sglang.launch_server \
    --model-path <hf-repo-id> \
    --host 0.0.0.0 \
    --port 8000 \
    --mem-fraction-static 0.60 \
    --max-running-requests 8

Use SGLang when the user needs RAG/tool-use workflows, structured generation, or SGLang-specific scheduling. For plain high-throughput OpenAI-compatible serving, prefer vLLM unless the user asks for SGLang.

SKU-appropriate defaults

Knob Orin Nano / NX AGX Orin / Thor
--max-model-len 4096 8192
--gpu-memory-utilization 0.85 0.85
--tensor-parallel-size 1 1

If the server OOMs at startup, lower --gpu-memory-utilization by 0.05 and re-launch (or run jetson-inference-mem-tune for a workload-aware recommendation).

Quantization preferences (matters more than the runtime)

For vLLM and SGLang, choose checkpoint formats by Jetson family:

Jetson family First choice Acceptable fallback
Thor NVFP4 when the model/runtime supports it W4A16
Orin Nano / NX W4A16 AWQ or GPTQ 4-bit
AGX Orin W4A16 AWQ or GPTQ 4-bit

For llama.cpp and Ollama, use GGUF model quantization names instead: recommend INT4 / Q4_K_M GGUF on both Orin and Thor, and choose a smaller INT4 GGUF model if memory is tight. Do not call GGUF Q4_K_M a W4A16/AWQ/GPTQ model. NVFP4 is Thor-preferred and Thor-tuned for runtimes that support it.

VLM mode

VLMs use the same flow as LLMs: same container, same vllm serve invocation, different vision-language checkpoint. The container handles image preprocessing. For a VLM-specific browser UI, use the live-vlm-webui container; for a generic chat UI for either, use Open WebUI pointed at http://<jetson-ip>:8000/v1.

Do not fabricate device capacity

Do not invent RAM totals, free-memory values, model sizes, JetPack versions, or SKU/variant names when giving a serving recipe. If capacity matters, either run the live pre-flight checks (when execution is allowed) or hand off to jetson-inference-mem-tune / jetson-memory-audit. If live data is not available, say the value is unknown and provide conservative defaults instead of quoting a made-up number.

Pre-flight checklist (the agent should verify before running Step 3)

Limitations

Hand off to

Source

Jetson AI Lab — Introduction to GenAI on Jetson: How to Run LLMs and VLMs and NVIDIA-AI-IOT GHCR packages.

Skill frontmatter

version: 0.0.1 license: Apache-2.0 metadata: {"author" => "Jetson Team", "tags" => ["jetson", "llm", "serving"], "languages" => ["markdown"], "data-classification" => "public"}