Agent Skill · NVIDIA NIM

jetson-inference-mem-tune

Pick the serving stack and per-runtime memory flags (vLLM, SGLang, llama.cpp, TensorRT Edge-LLM) for an LLM/VLM workload on any NVIDIA Jetson.

Provider: NVIDIA NIM Path in repo: skills/jetson-inference-mem-tune/SKILL.md

Skill body

Jetson Inference Memory Tuning

Recommends an inference runtime and the specific memory-related flags to pass to it, given the Jetson SKU/variant and the user’s workload. Does not include quantization recipe selection — that lives in the model-benchmarking skill — but it does point at the precision floor each runtime can serve efficiently.

Purpose

Turn a live jetson-memory-audit snapshot into runtime and launch-flag recommendations for LLM/VLM serving on Jetson. Use this when the user needs to fit a model, reduce OOM risk, or switch to a lower-memory serving stack.

When to use

Prerequisites

Available Scripts

Script Purpose Arguments
scripts/recommend.py Reads an audit JSON and emits runtime plus launch-flag recommendations. --audit PATH, --runtime, --workload, --target-mb, --human.

If your agent runtime supports run_script, invoke run_script("scripts/recommend.py", ["--audit", "/tmp/audit.json", "--runtime", "auto", "--workload", "llm-server"]) and summarize the returned JSON. Otherwise run it with python3 from the repository root.

Instructions

  1. Run jetson-memory-audit/scripts/audit.sh to capture the device baseline.
  2. Run scripts/recommend.py --audit /tmp/audit.json --runtime auto --workload llm-server --target-mb 6000 to get a JSON of runtime + flag recommendations.
  3. The agent presents the suggested runtime and the exact CLI flags. The user (or an outer agent) launches / restarts the server with those flags.
  4. Re-run the audit to verify.

Expected workflow

Use scripts/recommend.py for the specific prompt and answer from the JSON it emits. If direct execution is blocked, run it as python3 {baseDir}/scripts/recommend.py ....

Limitations

Error handling

Output contract for recommend.py

{
  "sku": "orin-nx",
  "variant": "orin-nx-16gb",
  "mem_total_gb": 16,
  "runtime": "vllm",
  "rationale": "Highest throughput at this memory budget given continuous batching + paged attention.",
  "launch_flags": [
    "--gpu-memory-utilization=0.55",
    "--max-model-len=4096",
    "--max-num-seqs=8",
    "--enable-prefix-caching"
  ],
  "alternatives": [
    { "runtime": "llama-cpp", "rationale": "Lower memory floor with GGUF Q4_K_M.", "launch_flags": ["-ngl 28", "-c 4096", "--no-mmap"] }
  ],
  "notes": ["Lower --gpu-memory-utilization further if you also run a small VLM alongside."]
}

Runtimes covered

Runtime Best for Key memory knobs Preferred install path
llama.cpp Tightest budget; GGUF; Orin Nano-class -ngl, -c, --mlock, --no-mmap ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-{orin,thor}
vLLM High-throughput serving with continuous batching --gpu-memory-utilization, --max-model-len, --max-num-seqs, --enable-prefix-caching Thor and Orin JetPack 7.2 / L4T r39+: upstream vLLM 0.20+ (vllm/vllm-openai) container or validated native vLLM 0.20+. Older Orin: NVIDIA-AI-IOT image
SGLang Programmable workflows (RAG, tool use, structured output) --mem-fraction-static, --mem-fraction-dynamic, --max-running-requests Thor: NVIDIA SGLang 26.01 (nvcr.io/nvidia/sglang:26.01-py3, SGLang 0.5.5.post2). Orin: JetPack-matched environment
TensorRT Edge-LLM NVIDIA-tuned production serving Build profile per SKU; paged-KV; KV reuse Vendor docs for the target JetPack

For Orin JetPack 7.2 / L4T r39+, upstream vLLM 0.20+ is supported. For older Orin releases, prefer NVIDIA-AI-IOT prebuilt vLLM images where available because they ship the matching CUDA/cuDNN/TensorRT stack for JetPack. For Thor, prefer upstream vLLM 0.20+ (vllm/vllm-openai) or a validated native vLLM 0.20+ install; for SGLang use NVIDIA SGLang 26.01 (nvcr.io/nvidia/sglang:26.01-py3, SGLang 0.5.5.post2) or newer NVIDIA SGLang release notes that explicitly list Jetson Thor support. Do not force an Orin-specific Jetson container path on Thor, and do not assume native upstream SGLang support on Orin.

Quantization recommendations

Use runtime-specific quantization names. vLLM and SGLang usually consume Hugging Face checkpoints such as W4A16, AWQ, GPTQ, FP16, or NVFP4. llama.cpp and Ollama consume GGUF models, so recommend INT4/Q4_K_M-style GGUF instead.

Runtime family Jetson family First choice Fallback
vLLM / SGLang Thor NVFP4 when the model/runtime supports it W4A16
vLLM / SGLang Orin Nano / NX W4A16 AWQ or GPTQ 4-bit
vLLM / SGLang AGX Orin W4A16 AWQ or GPTQ 4-bit
llama.cpp / Ollama Orin and Thor GGUF INT4 / Q4_K_M Smaller INT4 GGUF model if memory is tight

Do not describe GGUF Q4_K_M as W4A16/AWQ/GPTQ. Do not compare Thor NVFP4 results with Orin W4A16 results unless the output includes a quant field.

Runtime command guidance

Use recommend.py as the source of truth for memory knobs, then place its launch_flags into the matching serving command. Keep the command guidance in this skill instead of separate small reference files so agents ingest one complete instruction set.

For vLLM on Orin with JetPack 7.2 / L4T r39+, use upstream vLLM 0.20+ (vllm/vllm-openai:latest). On older Orin releases, use the NVIDIA-AI-IOT image:

docker run --rm -it --runtime nvidia --network host --name vllm \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_TOKEN="$HF_TOKEN" \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve <hf-model-id-or-local-path> \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.60 \
    --max-model-len 4096 \
    --max-num-seqs 8 \
    --enable-prefix-caching

For vLLM on Thor, use upstream vLLM 0.20+ (vllm/vllm-openai:latest) unless host-native vLLM 0.20+ is already installed and validated:

docker run --rm -it --runtime nvidia --network host --ipc host --name vllm \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:latest \
  vllm serve <hf-model-id-or-local-path> \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.75 \
    --max-model-len 8192 \
    --max-num-seqs 32 \
    --enable-prefix-caching

Thor vLLM note: do not judge Thor support from pre-0.20 vLLM results; upstream vLLM support starts at vLLM 0.20+.

For SGLang on Thor, use NVIDIA SGLang 26.01 (nvcr.io/nvidia/sglang:26.01-py3). NVIDIA SGLang 26.01 contains SGLang 0.5.5.post2 and explicitly lists Jetson Thor support. Avoid judging Thor support from older prerelease SGLang results. Avoid recommending gpt-oss or FP8 paths on Thor unless newer NVIDIA SGLang release notes say those known issues are fixed.

docker run --rm -it --runtime nvidia --network host --ipc host --name sglang \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_TOKEN="$HF_TOKEN" \
  nvcr.io/nvidia/sglang:26.01-py3 \
  python3 -m sglang.launch_server \
    --model-path <hf-model-id-or-local-path> \
    --host 0.0.0.0 \
    --port 8000 \
    --mem-fraction-static 0.60 \
    --max-running-requests 8

For llama.cpp, use the NVIDIA-AI-IOT llama.cpp image when available, or the llama-server binary from a JetPack-matched build. Start with GGUF INT4 / Q4_K_M on both Orin and Thor; choose a smaller INT4 GGUF model if the audit shows tight memory.

docker run --rm -it --runtime nvidia --network host --name llama-cpp \
  -v "$PWD/models:/models:ro" \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-<orin-or-thor> \
  llama-server \
    -m /models/<model>.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    -ngl 28 \
    -c 4096 \
    --no-mmap \
    --flash-attn

Procedure (the script encodes this)

  1. Pick the lightest runtime that satisfies the user’s required features (continuous batching? structured generation? CPU offload?).
  2. Pick the lowest precision that meets the user’s accuracy bar (model-benchmarking skill).
  3. Sweep the runtime’s memory knobs (start with gpu-memory-utilization for vLLM, n-gpu-layers and ctx-size for llama.cpp) to find the minimum footprint that sustains target throughput.
  4. Re-measure with jetson-memory-audit.

Safety

Read-only. The skill never starts, stops, or restarts a model server. It emits flags; the user (or an outer orchestration agent) is responsible for invoking the runtime.

Skill frontmatter

version: 0.0.1 license: Apache-2.0 metadata: {"author" => "Jetson Team", "tags" => ["jetson", "inference", "memory"], "languages" => ["python"], "data-classification" => "public"}