Agent Skill · NVIDIA NIM

jetson-speculative-decoding

Add EAGLE-3 or draft-model speculative decoding to a Jetson vLLM server when time per output token (TPOT) is the bottleneck.

Provider: NVIDIA NIM
Path in repo: skills/jetson-speculative-decoding/SKILL.md

Skill body

Jetson Speculative Decoding (vLLM)

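Speculative decoding lets a small “draft” model propose several tokens that the target model then verifies in a single forward pass, reducing per-token latency. On Jetson, whether it helps or hurts depends more on memory headroom than on draft quality: the GPU shares one unified memory pool with the CPU, so the draft weights compete with the target model's weights and KV cache. This skill encodes the parts an LLM won’t already know.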
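The propose-then-verify idea can be sketched in a few lines of Python. This is a toy illustration of the greedy case only, not how vLLM implements it: `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are plain functions over integer tokens, and the target is called once per position here, whereas a real engine scores all positions in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Return the tokens emitted by one draft-then-verify round (greedy case)."""
    # 1. The draft proposes k tokens autoregressively (cheap).
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposed position in order.
    emitted: List[Token] = []
    for i in range(k):
        expected = target(context + proposed[:i])
        if expected != proposed[i]:
            # First mismatch: keep the target's own token and stop.
            emitted.append(expected)
            return emitted
        emitted.append(proposed[i])

    # 3. Every proposal was accepted, so the target adds one bonus token.
    emitted.append(target(context + proposed))
    return emitted


def generate(target: NextToken, draft: NextToken,
             prompt: List[Token], n: int, k: int) -> List[Token]:
    """Generate n tokens by repeating speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        out.extend(speculative_step(target, draft, out, k))
    return out[len(prompt):][:n]


# Toy stand-ins for real models (assumptions for illustration only).
def toy_target(ctx: List[Token]) -> Token:
    return (sum(ctx) + 1) % 7


def toy_draft(ctx: List[Token]) -> Token:
    # Agrees with the target except when the context length is a multiple of 3.
    return toy_target(ctx) if len(ctx) % 3 else 0


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [1, 2], n=10, k=4))
```

In the greedy case the output matches what the target alone would produce; the draft only changes how many target passes are needed, which is why a poor draft costs speed rather than quality.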

Purpose

Tune an existing Jetson vLLM deployment for faster token generation by appending the right --speculative-config and validating whether it improves single-stream decode speed.

When to use

When NOT to use

Prerequisites

Instructions

Append --speculative-config to the vllm serve command shown in jetson-llm-serve.

EAGLE-3 (preferred when a head is published for the target model):

--speculative-config '{
  "method": "eagle3",
  "model": "<eagle3-head-repo-id>",
  "num_speculative_tokens": 5,
  "draft_tensor_parallel_size": 1
}'

Draft-model (fallback — pair a small same-family model):

--speculative-config '{
  "method": "draft_model",
  "model": "<small-draft-model-repo-id>",
  "num_speculative_tokens": 4,
  "draft_tensor_parallel_size": 1
}'
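When the serve command is assembled by a script, hand-writing the quoted JSON is easy to get wrong. A minimal sketch of building the flag in Python follows; `speculative_config_flag` is a hypothetical helper, and it accepts only the two methods and the four keys shown in this skill.

```python
import json
import shlex

# Only the two methods shown in this skill; the restriction is an assumption of this helper.
ALLOWED_METHODS = {"eagle3", "draft_model"}


def speculative_config_flag(method: str, model: str,
                            num_speculative_tokens: int,
                            draft_tensor_parallel_size: int = 1) -> str:
    """Return a shell-safe '--speculative-config <json>' string."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unsupported method: {method!r}")
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be >= 1")
    config = {
        "method": method,
        "model": model,
        "num_speculative_tokens": num_speculative_tokens,
        "draft_tensor_parallel_size": draft_tensor_parallel_size,
    }
    # shlex.quote wraps the JSON so the shell passes it as one argument.
    return "--speculative-config " + shlex.quote(json.dumps(config))


if __name__ == "__main__":
    # Placeholder repo id copied from the text above; substitute a real one.
    print(speculative_config_flag("eagle3", "<eagle3-head-repo-id>", 5))
```

The returned string can be appended to the existing `vllm serve` command line as-is.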

Jetson-specific tuning rules

How to verify it actually helped

  1. Run jetson-llm-benchmark (vLLM path) at --concurrency 1 before and after enabling speculation. Repeat both runs at --concurrency 8 to check for a throughput regression under load.
  2. Acceptance: target ≥30% improvement in throughput_tok_s and ≥20% drop in tpot_ms_p50 at concurrency 1.
  3. If the improvement is <10%, or throughput_tok_s regresses at concurrency 8, disable speculation: the draft model is costing more than it returns.
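The thresholds above can be applied mechanically once the before and after numbers are in hand. The sketch below assumes the two metrics have been copied into a small `BenchResult` record (a hypothetical type, not the benchmark's actual output format), reads "improvement" in step 3 as throughput improvement, and returns "inconclusive" for results between the 10% and 30% marks, a case the steps above do not rule on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchResult:
    throughput_tok_s: float  # tokens per second
    tpot_ms_p50: float       # median time per output token, in ms


def pct_change(before: float, after: float) -> float:
    """Signed percentage change from before to after."""
    return (after - before) / before * 100.0


def verdict(base_c1: BenchResult, spec_c1: BenchResult,
            base_c8: Optional[BenchResult] = None,
            spec_c8: Optional[BenchResult] = None) -> str:
    """Return 'keep', 'disable', or 'inconclusive' for a speculation trial."""
    gain = pct_change(base_c1.throughput_tok_s, spec_c1.throughput_tok_s)
    tpot_drop = -pct_change(base_c1.tpot_ms_p50, spec_c1.tpot_ms_p50)

    # Concurrency-8 check is applied only if both runs were provided.
    regressed_c8 = (
        base_c8 is not None and spec_c8 is not None
        and spec_c8.throughput_tok_s < base_c8.throughput_tok_s
    )

    if gain < 10.0 or regressed_c8:
        return "disable"
    if gain >= 30.0 and tpot_drop >= 20.0:
        return "keep"
    return "inconclusive"  # assumption: the skill does not rule on this band


if __name__ == "__main__":
    # Made-up numbers for illustration, not measurements.
    print(verdict(BenchResult(20.0, 50.0), BenchResult(28.0, 36.0)))
```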

Limitations

Error handling

Hand off to

Source

vLLM speculative decoding docs and the Jetson AI Lab GenAI tutorial.

Skill frontmatter

version: 0.0.1
license: Apache-2.0
metadata: {"author": "Jetson Team", "tags": ["jetson", "llm", "speculative-decoding"], "languages": ["markdown"], "data-classification": "public"}