A practical, end-to-end guide for serving LLMs with vLLM on a DGX Spark (GB10 Grace Blackwell), assembled from NVIDIA’s official playbook, the vLLM team’s deep-dive, model cards, and battle-tested community setups.
Warning
Flag choices on Spark are model- and image-specific, not hardware-wide defaults. The recipes below are starting points that worked for their authors against a specific container tag. Validate against the exact image you run, and pin a known-good tag/digest for anything you depend on. Copying a flag from one model’s recipe to another can silently regress throughput or output quality.
1. The landscape: current playbooks & walkthroughs
| Source | Best for | Link |
|---|---|---|
| NVIDIA official playbook | Canonical install steps; single / stacked / switched / troubleshooting tabs | https://build.nvidia.com/spark/vllm/instructions |
| NVIDIA dgx-spark-playbooks | Source-of-truth repo; benchmarking + cluster bootstrap | https://github.com/NVIDIA/dgx-spark-playbooks |
| DeepWiki: vLLM on Spark | Clean model support matrix + serving params | https://deepwiki.com/NVIDIA/dgx-spark-playbooks/4.2-vllm |
| vLLM team blog (Jun 2026) — authoritative config deep-dive | Why each flag matters, unified-memory behavior, JIT pre-warm | https://vllm.ai/blog/2026-06-01-vllm-dgx-spark |
| mark-ramsey-ri/vllm-dgx-spark | 1-to-N Spark orchestration, 41 model presets | https://github.com/mark-ramsey-ri/vllm-dgx-spark |
| AEON-7/vllm-dflash | Plug-and-play DFlash container for Spark | https://github.com/AEON-7/vllm-dflash |
| vLLM Recipes index | Per-model launch commands, kept current | https://recipes.vllm.ai/ |
| NGC vLLM container tags | Latest nvcr.io/nvidia/vllm build | https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm |
When vLLM is the right engine (and when it isn’t)
Four engines run well on Spark — vLLM, SGLang, llama.cpp, and TensorRT-LLM — and the right pick is set by your workload, not a star rating. This guide is vLLM-centric because vLLM is the strongest fit for serving; here’s the honest boundary:
| Engine | Reach for it when | Trade-off on Spark |
|---|---|---|
| vLLM | You’re serving concurrent requests and care about aggregate tok/s — a RAG backend, an internal API several apps call, batch scoring. Its paged KV cache + continuous batching shine the more concurrency you throw at it. | Heaviest to operate; on aarch64 you must track which container tag supports your arch. For a single user issuing one request at a time, much of its machinery is idle and you’re back to bandwidth-bound — llama.cpp is often simpler there. |
| llama.cpp / Ollama | Single-user, lightweight, GGUF, fast to start; lazy KV allocation makes it look far lighter on memory. | Lower aggregate throughput under real concurrency. |
| SGLang | Heavy prefix reuse / structured & agentic workloads (RadixAttention); also strong batched throughput. | Spark image/arch support tracks releases like vLLM. |
| TensorRT-LLM | Lowest latency / max single-stream speed on NVIDIA HW; NVIDIA’s own Spark spec-decode playbooks use it. | Heavier build, less flexible than vLLM; engine-compile step. |
Rule of thumb: concurrency → vLLM/SGLang; one local user → llama.cpp; squeeze every last token/s on one stream → TensorRT-LLM. Confirm on your own box — all four ship runnable Spark containers.
2. Hardware facts that drive every config
DGX Spark is a GB10 Grace Blackwell SoC with 128 GB unified memory shared by CPU and GPU, on a consumer-Blackwell sm_121 GPU, ~273 GB/s bandwidth, and an ARM64 (aarch64) host. Four consequences shape everything below:
- One memory pool for everything. Weights, KV cache, the OS, page cache, and the runtime all draw from the same 128 GB. --gpu-memory-utilization is a fraction of that shared pool, so leave headroom or you OOM with “free” RAM still showing.
- Memory-bandwidth bound for decode. Spark shines at small-batch, single-user interactive serving. Keep concurrency low (--max-num-seqs ≈ 1–4).
- NVFP4 MoE is the sweet spot. Quantized MoE with ~3–13 B active params (even at 25–120 B total) is the best fit: small active set = fast decode, and 4-bit weights free unified memory for KV cache. Prefer NVFP4/FP8/MXFP4 over BF16.
- Use sm_121-validated builds. The upstream stable vLLM image does not support GB10. Use the CUDA-13 nightly track (vllm/vllm-openai:cu130-nightly / :nightly) or NVIDIA’s NGC container (nvcr.io/nvidia/vllm:26.02-py3+).
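The shared-pool arithmetic from the first bullet can be sketched as a rough budget. This is an upper-bound estimate under stated assumptions: the helper names `kv_budget_gb` and `host_headroom_gb` are hypothetical, the 19 GB weight size is the Qwen3.6-35B NVFP4 figure quoted in §7.1.1, and activation buffers and CUDA graphs are ignored.

```bash
# kv_budget_gb TOTAL_GB UTILIZATION WEIGHTS_GB
# Rough upper bound on the GB left for KV cache once vLLM has claimed its
# fraction of the unified pool and loaded the weights.
# Assumption: ignores activations, CUDA graphs, and other runtime overhead.
kv_budget_gb() {
  awk -v total="$1" -v util="$2" -v weights="$3" \
    'BEGIN { printf "%.1f\n", total * util - weights }'
}

# host_headroom_gb TOTAL_GB UTILIZATION
# GB left outside vLLM's claim for the OS, page cache, and other processes.
host_headroom_gb() {
  awk -v total="$1" -v util="$2" \
    'BEGIN { printf "%.1f\n", total * (1 - util) }'
}

# Example: 128 GB pool, --gpu-memory-utilization 0.85, ~19 GB of weights
#   kv_budget_gb 128 0.85 19     -> 89.8
#   host_headroom_gb 128 0.85    -> 19.2
```

If the host headroom looks too thin for your desktop session or page cache, lower the utilization rather than hoping the remainder is enough.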
Measured GB10 performance (what to actually expect)
These are real numbers measured on one DGX Spark with this guide’s bench script (§10) — not datacenter figures or vendor claims. Decode is bandwidth-bound, so active parameter count dominates: both of these few-billion-active MoEs sit near their bandwidth ceiling, while a dense model the same size crawls (Gemma-4-31B ≈ 6 tok/s). The Qwen3.6-vs-Gemma gap below is mostly speculative decode — Qwen3.6’s built-in MTP draft, which Gemma-4-26B-A4B lacks.
Speculative decoding is the single biggest single-stream lever on Spark — often a bigger win than swapping models. A small drafter proposes tokens the big model verifies in one pass, so accepted tokens come nearly free. On GB10, AEON-7’s measurements take a 27B from ~12 tok/s to ~33 tok/s single-stream — up to 2.7x, content-dependent — in the chart below; full setup in §14.
Prefill / time-to-first-token is a different bottleneck — it’s compute-bound, and Marlin FP4 is slower for it. Short prompts hit first token in ~50–70 ms, but a long prompt scales badly: a dense ~90 B model at large context took ~133 s to first token. Keep prompts tight, and use --enable-prefix-caching so repeated system prompts aren’t re-prefilled.
Concurrency is where Spark surprises you. Single-stream reviews understate it — measured on one Spark (vLLM), the same Gemma-4-31B that does 6 tok/s solo reaches ~92 tok/s aggregate at 16 concurrent requests, and a small Qwen2.5-3B jumps 26 → 477 tok/s at the same concurrency (≈1,460 at 64). If you serve more than yourself, benchmark at the concurrency you’ll actually use, not at batch 1.
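To read the aggregate figures above per request, divide by the concurrency. A minimal sketch (the helper name `per_stream_tps` is hypothetical; the inputs are the numbers quoted in the paragraph above):

```bash
# per_stream_tps AGGREGATE_TOK_S CONCURRENCY
# Average decode rate each concurrent request sees.
per_stream_tps() {
  awk -v agg="$1" -v n="$2" 'BEGIN { printf "%.2f\n", agg / n }'
}

# Gemma-4-31B, 16 concurrent requests:  per_stream_tps 92 16    -> 5.75
# Qwen2.5-3B,  16 concurrent requests:  per_stream_tps 477 16   -> 29.81
# Qwen2.5-3B,  64 concurrent requests:  per_stream_tps 1460 64  -> 22.81
```

For Gemma-4-31B, each of the 16 streams still gets about 5.75 tok/s, close to the ~6 tok/s solo rate, which is why aggregate throughput climbs so steeply with batch size.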
Picking a model for the job
| Use case | Strong picks in this guide |
|---|---|
| Fastest general chat (decode tok/s) | Nemotron-3-Nano-30B-A3B (~56), Gemma-4-26B-A4B (~52) |
| Coding / agentic coding | Qwen3-Coder-Next (80B/3B, ~43 tok/s), Qwen3.6-27B (SWE-bench ~77), gemma-4-12B-coder (Python) |
| Reasoning / math | Qwen3.6-35B-A3B, Nemotron-3-Nano (AIME ~89) |
| Multimodal (image/video, +audio) | Nemotron-Nano-Omni-30B (video+audio+image, ~50 tok/s), Qwen3.6-35B/27B, Gemma-4-31B/26B |
| Long context | Qwen3.6 (256K native → 1M YaRN) |
| Max quality on one Spark | Qwen3.5-122B-A10B (~16 tok/s) |
(Throughput figures are the measured GB10 decode rates above; capability picks follow each model’s published evals — accuracy is quantization-, not hardware-, dependent, so a model’s published quality holds on Spark.)
3. Storing all models in ~/vllm
vLLM fetches weights through the Hugging Face stack, which caches under the directory named by HF_HOME. Point that at a host directory and mount it, and every model lives in ~/vllm, shared across containers.
mkdir -p ~/vllm
Pre-stage on the host (recommended) — files end up owned by you, the download happens once, and the first vllm serve doesn’t stall on a multi-GB pull:
pip install -U "huggingface_hub[cli]" # one-time
export HF_HOME=~/vllm
export HF_TOKEN=hf_xxx # for gated models; accept the license on the HF page first
hf download nvidia/Qwen3.6-35B-A3B-NVFP4
The mount, in every docker run:
-e HF_HOME=/models \
-v ~/vllm:/models \
Inside the container HF now caches at /models (= ~/vllm on host); pre-staged weights at ~/vllm/hub/... appear at /models/hub/... automatically.
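The host-to-container path mapping can be checked before launching. The sketch below relies on the standard Hugging Face hub cache layout (`hub/models--<org>--<name>/snapshots`); the helper names `hf_cache_dir` and `is_staged` are hypothetical.

```bash
# hf_cache_dir HF_HOME REPO_ID
# Prints where the hub cache stores a repo, e.g. nvidia/Foo ->
# <HF_HOME>/hub/models--nvidia--Foo
hf_cache_dir() {
  printf '%s/hub/models--%s\n' "$1" "$(printf '%s' "$2" | sed 's|/|--|g')"
}

# is_staged HF_HOME REPO_ID
# Succeeds if the repo has a snapshots directory, i.e. a download has landed.
is_staged() {
  [ -d "$(hf_cache_dir "$1" "$2")/snapshots" ]
}

# Host view:       hf_cache_dir "$HOME/vllm" nvidia/Qwen3.6-35B-A3B-NVFP4
# Container view:  hf_cache_dir /models nvidia/Qwen3.6-35B-A3B-NVFP4
```

Running `is_staged "$HOME/vllm" nvidia/Qwen3.6-35B-A3B-NVFP4 || hf download nvidia/Qwen3.6-35B-A3B-NVFP4` on the host (with HF_HOME exported as above) is one way to make the pre-stage step idempotent.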
Tip
Force fully-local loading (air-gapped, no network checks) with -e HF_HUB_OFFLINE=1 once weights are staged. If you let the container (root) download instead of pre-staging, files in ~/vllm are root-owned; add --user $(id -u):$(id -g) if that matters.
4. Is volume mounting necessary?
Yes — treat it as mandatory. A container filesystem is ephemeral: a freshly downloaded 27–120 B model is destroyed when the container is removed, so without a mount you re-pull many GB and re-pay the 10–15 min weight-load on every run. The vLLM team’s guidance is “download once, mount everywhere.”
The one other volume worth adding is the torch.compile cache (the slow part of cold start), which lives separately from your models at /root/.cache/vllm:
-v ~/vllm-compile-cache:/root/.cache/vllm \
The compile cache only pays off when you recreate a container (image/flag change) — a long-running --restart unless-stopped container compiles once anyway. It’s keyed to GPU arch + vLLM version + model + flags, so clear it (rm -rf ~/vllm-compile-cache/*) if a stale entry causes a startup hiccup after an upgrade. --ipc=host / --shm-size is shared memory (RAM), not a -v volume.
5. One-time host setup
nvidia-smi # GPU + driver visible
docker ps # else: sudo usermod -aG docker $USER && newgrp docker
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
mkdir -p ~/vllm
export HF_TOKEN=hf_xxx # gated models
# NGC pulls 401? -> docker login nvcr.io (user "$oauthtoken", password = NGC API key)
Warning
Unified-memory OOM valve. Because CPU and GPU share DRAM, the Linux page cache can hold memory CUDA can’t reclaim — an “OOM” well under 128 GB. If a big model fails to load after heavy file activity, flush caches first:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
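Before flushing, it can help to see how much of the pool the page cache is actually holding. A minimal sketch that reads the standard Linux /proc/meminfo fields (the helper names are hypothetical, and the optional file argument exists only so the parsing can be exercised on a saved copy):

```bash
# mem_available_gb [MEMINFO_FILE]
# Kernel's estimate of memory available without swapping, in GB.
# /proc/meminfo reports kB, so divide by 1024*1024.
mem_available_gb() {
  awk '/^MemAvailable:/ { printf "%.1f\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}

# cached_gb [MEMINFO_FILE]
# Page cache size in GB (the ^ anchor skips the SwapCached line).
cached_gb() {
  awk '/^Cached:/ { printf "%.1f\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}

# Typical use on the Spark host:
#   echo "available: $(mem_available_gb) GB, page cache: $(cached_gb) GB"
```

A large page-cache figure right after a multi-GB download is the situation the warning above describes.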
6. Baseline single-Spark recipe (template)
Minimal, model-agnostic shape — swap the model handle and per-model flags from §7.
export HF_TOKEN=hf_xxx # only for gated models
docker run -d --name vllm --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-e HF_HOME=/models \
-v ~/vllm:/models \
nvcr.io/nvidia/vllm:26.04-py3 \
vllm serve <HF_MODEL_HANDLE> \
--host 0.0.0.0 --port 8000 \
--max-model-len 65536 \
--gpu-memory-utilization 0.85 \
--max-num-seqs 4
Smoke-test (first request triggers JIT warmup — see §9):
curl -sS http://localhost:8000/v1/models | jq -r '.data[0].id'
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
-d '{"model":"<HF_MODEL_HANDLE>","messages":[{"role":"user","content":"12*17"}],"max_tokens":500}'
# model should answer 204 (= 12*17)
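Because weight loading takes minutes, the smoke test is easier to script behind a readiness poll. The sketch below is a generic retry loop; the helper name `wait_until` and the retry counts in the usage comment are assumptions, not part of any recipe.

```bash
# wait_until TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or TRIES attempts are used up.
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
  tries=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Poll the OpenAI-compatible endpoint for up to ~20 minutes:
#   wait_until 120 10 curl -sf http://localhost:8000/v1/models && echo ready
```

The `-f` flag makes curl exit non-zero on HTTP errors, so the loop keeps waiting until the server answers successfully.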
Note
Two container tracks: pick per recipe. NVIDIA’s NGC image nvcr.io/nvidia/vllm:26.04-py3 (vLLM 0.19.0) is the default for many nvidia/... checkpoints. Several recipes below use the upstream vllm/vllm-openai:nightly / :cu130-nightly track (needed for the newest architectures and for DFlash). A given model may need a newer container than you have; vllm --version inside the image tells you which version it ships.
Important
The two image families take the launch command differently, and this is the #1 first-run mistake. Upstream vllm/vllm-openai:* images already have vllm serve as their entrypoint, so you pass only the model handle + flags straight after the image tag. Writing vllm serve again produces vllm serve vllm serve <model> …, which fails with vllm: error: unrecognized arguments: serve. NVIDIA’s nvcr.io/nvidia/vllm:* images use a pass-through entrypoint, so those do take the full vllm serve <model> …. That’s why every vllm/vllm-openai recipe below starts with the bare model handle, while the NGC recipes start with vllm serve. (If a vllm/vllm-openai container is crash-looping on this error, remove it first with docker rm -f <name>, since --restart unless-stopped keeps respawning it and holds the name.)
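If you script launches across both image families, the entrypoint rule can be encoded once. A minimal sketch, assuming only the two families described above (the helper name `serve_prefix` is hypothetical):

```bash
# serve_prefix IMAGE
# Prints the command prefix the image expects before the model handle:
# nothing for upstream images (entrypoint is already `vllm serve`),
# "vllm serve" for NGC images (pass-through entrypoint).
serve_prefix() {
  case "$1" in
    vllm/vllm-openai:*)    printf '' ;;
    nvcr.io/nvidia/vllm:*) printf 'vllm serve' ;;
    *) echo "unknown image family: $1" >&2; return 1 ;;
  esac
}

# Usage (leave the substitution unquoted so it expands to zero or two words):
#   docker run ... "$IMAGE" $(serve_prefix "$IMAGE") "$MODEL" --port 8000
```

Any other image family is rejected rather than guessed, since a wrong prefix produces the crash loop described above.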
Important
Quantization flag rule. For NVIDIA ModelOpt NVFP4 checkpoints (the nvidia/... repos), pass --quantization modelopt. For compressed-tensors NVFP4 (Unsloth, llm-compressor community builds), vLLM auto-detects, so do not pass --quantization. Getting this wrong is a common first-run failure.
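For a locally downloaded checkpoint, the rule can be approximated by inspecting its files. This is a heuristic sketch, not vLLM's own detection: it assumes ModelOpt exports ship an `hf_quant_config.json` or declare a `quant_method` beginning with `modelopt` in `config.json`, and that anything else (including compressed-tensors) should be left to auto-detection. The helper name `quant_flag` is hypothetical.

```bash
# quant_flag CHECKPOINT_DIR
# Prints "--quantization modelopt" for checkpoints that look like ModelOpt
# exports, and nothing otherwise so vLLM can auto-detect.
# Assumption: file name and quant_method value reflect typical ModelOpt output.
quant_flag() {
  dir=$1
  if [ -f "$dir/hf_quant_config.json" ] ||
     grep -Eq '"quant_method"[[:space:]]*:[[:space:]]*"modelopt' \
       "$dir/config.json" 2>/dev/null; then
    printf '%s\n' '--quantization modelopt'
  fi
}

# Usage (unquoted so an empty result adds no argument):
#   docker run ... --model /models/x $(quant_flag ~/vllm/x) ...
```

Treat the result as a hint and confirm against the model card, which remains the authority on which loader a checkpoint needs.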
Container image → vLLM version (what actually ships, mid-2026)
The :nightly/:latest tags move, and a recipe written against one can break on a newer one (deprecated flags, changed kernels). Known-good mappings people run on Spark today:
| Image | vLLM | Notes |
|---|---|---|
| nvcr.io/nvidia/vllm:26.04-py3 | 0.19.0 | CUDA 13.2.1, PyTorch 2.12; the de-facto stable Spark NGC image. Ships no Ray (cluster scripts pip install it). |
| nvcr.io/nvidia/vllm:26.03-py3 | 0.17.1 | previous NGC |
| nvcr.io/nvidia/vllm:26.02-py3 | 0.15.1 | older; rejects MIXED_PRECISION checkpoints (mixed FP8+NVFP4 MoE) |
| nvcr.io/nvidia/vllm:25.12.post1-py3 | (Dec-2025) | the Spark/Jetson image NVIDIA points to for Nemotron-3-Nano |
| vllm/vllm-openai:nightly / :cu130-nightly | 0.23.x | upstream; newest archs + DFlash. Moves, so pin a digest for anything you depend on |
| vllm/vllm-openai:gemma4-cu130 | Gemma-4 build | required for Gemma 4 on GB10; the bare :gemma4 tag is v0.18.2-dev and crashes on sm121 (FP4 gemm Runner ... sm120/sm121) |
| vllm/vllm-openai:gemma | diffusion build | official aarch64-cu130 image with diffusion_gemma baked in for GB10 |
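Pinning a moving tag means recording the digest of the image you validated and running that reference instead. A minimal sketch using `docker image inspect` (the helper names `image_digest` and `is_pinned` are hypothetical):

```bash
# image_digest IMAGE
# Prints the repo@sha256:... reference of an image that was already pulled.
image_digest() {
  docker image inspect --format '{{index .RepoDigests 0}}' "$1"
}

# is_pinned REFERENCE
# Succeeds only if the reference ends in a full sha256 digest,
# i.e. it cannot silently move the way :nightly does.
is_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Typical flow after validating a nightly:
#   docker pull vllm/vllm-openai:nightly
#   PINNED=$(image_digest vllm/vllm-openai:nightly)
#   is_pinned "$PINNED" && echo "use $PINNED in docker run"
```

Storing the pinned reference next to each recipe makes it possible to return to a known-good build after a nightly regresses.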
Caution
vLLM 0.23 deprecated the FlashInfer-MoE env vars (VLLM_USE_FLASHINFER_MOE_FP4, VLLM_USE_FLASHINFER_MOE_FP8, …) in favour of the --moe-backend flag (marlin, flashinfer_cutlass, flashinfer_trtllm, flashinfer_cutedsl). On a 0.23+ image they log “Unknown vLLM environment variable detected” and do nothing. On sm121 (GB10), FP4 MoE must use --moe-backend marlin: the GPU has no native FP4 compute, so dense layers run FLASHINFER_CUTLASS while MoE layers fall back to Marlin W4A16. These older env vars still work on the NGC 0.19 image; prefer the CLI flag everywhere for portability.
Note
!!!!! or empty output from an FP4 model is an sm121 kernel problem, not your prompt. The working path is Marlin (--moe-backend marlin for NVFP4, VLLM_MXFP4_BACKEND=marlin for MXFP4).
GB10 (sm121, compute capability 12.1) has no native FP4, and vLLM’s auto-fallback chain trips on it: FlashInfer-TRTLLM MoE is “SM100+ only,” and the CUTLASS FP4 GEMM it falls back to fails on sm120/sm121 ([FP4 gemm Runner] Failed to run cutlass FP4 gemm on sm120). Marlin (W4A16) is the path that works — which is why these recipes force it. Garbage output usually means an image without the sm121 Marlin/PTX fixes; use NGC 26.04-py3, a current nightly, or a community Spark fork (eugr/spark-vllm-docker, christopher_owen/spark-vllm-mxfp4-docker).
7. Per-model configurations
Grouped by family: Qwen → Gemma → NVIDIA → other. Each model’s published quality benchmarks live on its card (accuracy is quantization-, not hardware-dependent); all commands assume ~/vllm is your model store.
Quick comparison
| Model | Publisher | Arch | Total / Active | Quant | Image | Key flags |
|---|---|---|---|---|---|---|
| Qwen3.6-35B-A3B-NVFP4 | NVIDIA (Qwen) | MoE+hybrid attn, multimodal | 35B / 3B | NVFP4 (modelopt) | vllm-openai:nightly | env vars, MTP-3, --quantization modelopt |
| Qwen3.6-27B-NVFP4 | Unsloth (Qwen) | dense + vision | 27B | NVFP4 (c-t) | vllm-openai:nightly | --dtype bfloat16, trust-remote-code |
| Qwen3-Coder-Next | Qwen | MoE (Gated-DeltaNet+attn), coder | 80B / 3B | FP8 | cu130-nightly | flashinfer attn, prefix-caching on, qwen3_coder |
| Qwen3.5-122B-A10B-NVFP4 | RedHatAI (Qwen) | MoE, multimodal | 122B / 10B | NVFP4 (c-t) | cu130-nightly | marlin, ~16 tok/s, Transformers 5+ |
| Gemma-4-31B-IT-NVFP4 | NVIDIA (Gemma) | dense, multimodal | 30.7B | NVFP4 (modelopt) | vllm-openai:gemma4-cu130 | --quantization modelopt, TP=1 |
| Gemma-4-26B-A4B-NVFP4 | NVIDIA (Gemma) | MoE, multimodal | 25.2B / 3.8B | NVFP4 (modelopt) | vllm-openai:gemma4-cu130 | TP=1, +gemma4_patched.py, marlin |
| gemma-4-12B-coder MTP-NVFP4 | community (Gemma) | dense coder + MTP draft | 12B | NVFP4 (c-t) | vllm-openai:nightly | bundled MTP, kv-cache fp8, thinking-on |
| Nemotron-3-Nano-30B-A3B-NVFP4 | NVIDIA | hybrid Mamba-2 + MoE | 30B / 3.5B | NVFP4 + fp8 KV | nvcr.io vllm:25.12.post1 | nano_v3 parser, Spark-tested |
| Nemotron-3-Nano-Omni-30B-A3B-NVFP4 | NVIDIA | Mamba-2 + MoE, omni (video/audio/image) | 31B / 3B | NVFP4 | vllm-openai:v0.20.0 | audio install, mm flags, nemotron_v3 |
(c-t = compressed-tensors auto-detect; env vars = the two sm_121a exports shown in 7.1.1.)
7.1 Qwen family
7.1.1 Qwen3.6-35B-A3B-NVFP4 — flagship MoE agent (official Spark recipe)
MoE with hybrid attention, 35B total / 3B active, multimodal (text/image/video), 262K context, Apache 2.0, ~19 GB NVFP4. NVIDIA’s recipe was written for an older upstream nightly; the command below is the version-corrected form that boots on current vllm/vllm-openai:nightly (vLLM 0.23.x) — see the two notes after it for what changed and why.
export HF_TOKEN=hf_xxx
docker run -d --name qwen36-35b --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
-e CUTE_DSL_ARCH=sm_121a \
vllm/vllm-openai:nightly \
nvidia/Qwen3.6-35B-A3B-NVFP4 --port 8000 \
--tensor-parallel-size 1 --trust-remote-code --dtype auto \
--quantization modelopt --kv-cache-dtype fp8 \
--moe-backend marlin \
--gpu-memory-utilization 0.85 --max-model-len 65536 \
--max-num-seqs 4 --max-num-batched-tokens 8192 \
--enable-chunked-prefill --async-scheduling --enable-prefix-caching \
--speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}' \
--reasoning-parser qwen3 \
--enable-auto-tool-choice --tool-call-parser qwen3_xml
Two changes from NVIDIA’s published recipe (verified on vLLM 0.23.x): (1) The env vars VLLM_USE_FLASHINFER_MOE_FP4=0 and VLLM_FP8_MOE_BACKEND=... are removed — the first was deprecated in vLLM 0.23 in favour of --moe-backend (you’ll see “Unknown vLLM environment variable detected” if you keep it), and the second was never a real vLLM variable. --moe-backend marlin is the correct, version-stable replacement and is required on sm121 (GB10 has no native FP4 compute, so MoE must use the Marlin W4A16 path). (2) --attention-backend flashinfer is dropped — forcing one global attention backend crashes at init_device on this hybrid model (it mixes full attention + Gated-DeltaNet/linear + Mamba layers); let vLLM auto-select per layer. Also note: the card’s --max-model-len 262144 reserves far too much KV cache on a 128 GB Spark — use 65536.
NVIDIA’s published card was later updated (and now lists --attention-backend flashinfer, --tool-call-parser qwen3_xml, --gpu-memory-utilization 0.4, and --load-format fastsafetensors). That flashinfer flag is the one this recipe deliberately drops — it crashed at init_device on the nightly used here for this hybrid model. If you’re on a newer build, it’s worth re-testing the card’s exact command; if it boots, prefer qwen3_xml as the tool-call parser (the card’s current choice) and 0.4 utilization for one user. If it crashes at init_device, fall back to the version-corrected command above. Either way, --moe-backend marlin stays required on sm121.
The final --enable-auto-tool-choice --tool-call-parser qwen3_xml line is required for agent / tool-calling clients (Hermes, Cline, OpenWebUI tools, any framework that sends tools). Omit it and the model still loads (/v1/models returns 200), but every agent request dies with HTTP 400: “auto” tool choice requires --enable-auto-tool-choice and --tool-call-parser to be set, because agents send tool_choice: "auto". What each of the three parser-related flags does:
- --enable-auto-tool-choice: turns on automatic tool selection. The server honours a request’s tools list with tool_choice: "auto" and decides, per response, whether to emit a tool call. Without it vLLM serves plain chat but refuses any tool-calling request (the 400 above).
- --tool-call-parser qwen3_xml: parses the model’s emitted tool call back into a structured OpenAI-style tool_calls object. It must match the model’s output format, not your client: Qwen3.6 emits the qwen3_xml format. Don’t reach for the hermes parser just because your agent is named Hermes; the parser tracks the model family.
- --reasoning-parser qwen3: splits the model’s <think> block out of the visible answer into a separate reasoning_content field, so chain-of-thought doesn’t leak into the reply. Not needed to clear the 400, but you want it on this reasoning model.
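To confirm the tool-calling flags took effect, send a request shaped like an agent's. The sketch below only builds the request body; the `get_weather` tool and the helper name `tool_request_body` are hypothetical placeholders, and the fields follow the OpenAI chat-completions schema that vLLM serves.

```bash
# tool_request_body MODEL
# Prints a chat-completions body with one (hypothetical) tool and
# tool_choice "auto", the shape agent frameworks send.
tool_request_body() {
  cat <<EOF
{
  "model": "$1",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Hypothetical example tool",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "tool_choice": "auto"
}
EOF
}

# Send it to the server from the recipe above:
#   tool_request_body nvidia/Qwen3.6-35B-A3B-NVFP4 |
#     curl -sS http://localhost:8000/v1/chat/completions \
#       -H "Content-Type: application/json" -d @-
```

A server missing the flags answers this with the HTTP 400 quoted above; a correctly configured one returns a normal completion, typically containing a structured tool_calls entry.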
If init_device fails instead, bisect: start from a minimal run (--quantization modelopt --moe-backend marlin --gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4 --reasoning-parser qwen3) and re-add --kv-cache-dtype fp8, prefix caching, chunked prefill, async scheduling, and MTP one at a time to find the offender.
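The bisect order can be generated rather than typed by hand. A minimal sketch that prints the cumulative flag sets, one per line, starting from the minimal run and re-adding the optional flags in the order listed above (the helper name `bisect_flag_sets` is hypothetical; the flags themselves come from the recipe):

```bash
# bisect_flag_sets
# Prints six flag sets: the minimal run, then one more optional flag
# re-added per line. The first line that fails to boot names the offender.
bisect_flag_sets() {
  flags='--quantization modelopt --moe-backend marlin --gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4 --reasoning-parser qwen3'
  printf '%s\n' "$flags"
  while IFS= read -r extra; do
    flags="$flags $extra"
    printf '%s\n' "$flags"
  done <<'EOF'
--kv-cache-dtype fp8
--enable-prefix-caching
--enable-chunked-prefill
--async-scheduling
--speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'
EOF
}

# Print the third attempt (minimal + fp8 KV + prefix caching):
#   bisect_flag_sets | sed -n 3p
```

Paste each line after the model handle in the docker run command, removing the previous container between attempts.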
Tip
MTP vs DFlash on this model. The recipe above uses the card’s built-in MTP (num_speculative_tokens:3), the most reliable spec-decode path on a stock image, though it can occasionally fall into output loops. The alternative is DFlash with the dedicated ~0.5 B drafter z-lab/Qwen3.6-35B-A3B-DFlash, which avoids the loops and reaches ~50 tok/s at 2.7–4.4 accepted tokens/step. Swap the --speculative-config line for:
--speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-35B-A3B-DFlash","num_speculative_tokens":15}' \
Two caveats that bite on Spark: (1) on quantized (NVFP4/FP8) weights under stock vLLM, DFlash acceptance collapses: only the first few of 15 tokens get accepted, so it slows down rather than speeds up. The fix is a DFlash-capable sm121 build (the production-validated ghcr.io/aeon-7/… image paired with AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4), or running a BF16 target instead (works on stock nightly, at much higher memory). (2) Keep --attention-backend triton_attn and set -e VLLM_USE_DEEP_GEMM=0; flash_attn crashes on sm121. Re-pull the drafter if you cloned it before 2026-04-19 (a long-context crash was fixed then). Full DFlash setup is in §14.
7.1.2 Qwen3.6-27B-NVFP4 — flagship-level coding in a 27B dense model (Unsloth)
Dense 27B with a vision encoder (text/image/video), hybrid Gated-DeltaNet + Gated-Attention layers, MTP-trained, 262K context (extensible to ~1M via YaRN), Apache 2.0, ~18 GB NVFP4 (compressed-tensors → no --quantization flag).
docker run -d --name qwen36-27b --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models \
vllm/vllm-openai:nightly \
unsloth/Qwen3.6-27B-NVFP4 \
--trust-remote-code --dtype bfloat16 \
--gpu-memory-utilization 0.85 --max-model-len 65536 \
--max-num-seqs 4 --reasoning-parser qwen3
The card’s default --max-model-len is a conservative 4096 (“increase only after checking memory”). 65536 is comfortable on Spark; go to 128K or beyond only if you have headroom, bearing in mind that the model leans on long context for thinking. For coding agents add --enable-auto-tool-choice --tool-call-parser qwen3_coder (both flags are needed, as explained in §7.1.1). MTP is supported by the architecture: you can try --speculative-config '{"method":"mtp","num_speculative_tokens":3}' and validate acceptance on your build. Thinking is on by default; pass chat_template_kwargs={"enable_thinking":false} (or {"preserve_thinking":true} for agents) per request.
Tip
Beyond 262K (YaRN, up to ~1M). This checkpoint is native to 262,144 tokens; for longer contexts enable static YaRN via an --hf-overrides block and the long-len escape hatch (mind that static YaRN can slightly dent short-context quality, so only enable it when you actually need the length):
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
...
unsloth/Qwen3.6-27B-NVFP4 --trust-remote-code --dtype bfloat16 \
--hf-overrides '{"text_config":{"rope_parameters":{"mrope_interleaved":true,"mrope_section":[11,11,10],"rope_type":"yarn","rope_theta":10000000,"partial_rotary_factor":0.25,"factor":4.0,"original_max_position_embeddings":262144}}}' \
--max-model-len 1010000
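The numbers in that override are related by simple arithmetic: the YaRN factor times the native length bounds the context you can request. A small sketch of the check (helper names `yarn_max_len` and `yarn_fits` are hypothetical):

```bash
# yarn_max_len NATIVE_LEN FACTOR
# Longest context the scaling covers: native length times the YaRN factor.
yarn_max_len() {
  awk -v n="$1" -v f="$2" 'BEGIN { printf "%d\n", n * f }'
}

# yarn_fits WANTED_LEN NATIVE_LEN FACTOR
# Succeeds if the requested --max-model-len is within the scaled range.
yarn_fits() {
  awk -v want="$1" -v n="$2" -v f="$3" 'BEGIN { exit !(want <= n * f) }'
}

# With the values from the tip above:
#   yarn_max_len 262144 4.0          -> 1048576
#   yarn_fits 1010000 262144 4.0     -> succeeds
```

So factor 4.0 covers the 1,010,000-token setting with a little margin; asking for more would need a larger factor.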
Tip
DFlash spec-decode for this model. A dedicated drafter z-lab/Qwen3.6-27B-DFlash pairs with Qwen3.6-27B; use it instead of MTP by appending:
--speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":15}' \
Until PR #40898 merges, this drafter needs a vLLM build with interleaved-SWA support, plus the usual Spark rules: --attention-backend triton_attn and -e VLLM_USE_DEEP_GEMM=0 (never flash_attn on sm121). The §7.1.1 quantized caveat applies: on NVFP4 under stock vLLM, confirm tokens past the first few are accepted, or use the prebuilt ghcr.io/aeon-7/aeon-vllm-ultimate image, whose AEON-7/Qwen3.6-27B-…-DFlash target runs at num_speculative_tokens:12 (~37.6 tok/s vs ~10.5 eager). Full DFlash setup is in §14.
7.1.3 Qwen3-Coder-Next — 80B/3B-active agentic coder, FP8 (Qwen)
80B total / 3B active MoE (512 experts, 10 active + 1 shared) on a hybrid Gated-DeltaNet + Gated-Attention layout, 256K context, non-reasoning for fast code, Apache 2.0. It performs at the level of models with 10–20× more active params and drops into Claude Code / Cline / Qwen Code. On Spark, FP8 is the path — unlike NVFP4, FP8 kernels run cleanly on sm121 — and the FP8 checkpoint (~85 GB) fits a 128 GB Spark with room for KV cache (Unsloth’s 4-bit dynamic GGUF runs in ~46 GB on llama.cpp if you want more headroom). A community benchmark measures ~43 tok/s single-stream (≈41 @ 16K, ≈39 @ 32K), with prefill from ~5,200 up to ~10,600 t/s.
docker run -d --name qwen3-coder-next --ipc=host --restart unless-stopped \
--gpus all --shm-size 32gb -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models \
vllm/vllm-openai:cu130-nightly \
Qwen/Qwen3-Coder-Next-FP8 \
--served-model-name qwen3-coder-next --host 0.0.0.0 --port 8000 \
--attention-backend flashinfer \
--enable-prefix-caching \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--gpu-memory-utilization 0.85 --max-model-len 170000 \
--max-num-seqs 4 --max-num-batched-tokens 32768
Warning
Two Spark-specific gotchas (community-verified). (1) The flags printed on the model card disable prefix caching for this architecture, which is fatal for coding agents that re-send a growing prompt every turn (0% cache hits = re-prefill on every request). Add --enable-prefix-caching explicitly; vLLM marks it experimental for this arch, but it works. (2) The default FLASH_ATTN backend caps you near ~60K tokens at 0.85 util, whereas --attention-backend flashinfer fits ~170K without fp8-quantizing the KV cache. Also, the model is non-reasoning, so omit --reasoning-parser. Unsloth’s unsloth/Qwen3-Coder-Next-FP8 dynamic quant claims ~25% more throughput; for the validated community image and a dual-Spark cluster variant, see eugr/spark-vllm-docker.
7.1.4 Qwen3.5-122B-A10B-NVFP4 — best all-rounder that still fits one Spark (community)
The strongest “max quality on a single Spark” pick. A 122B-total / 10B-active multimodal MoE; the BF16 original is ~234 GB, but the NVFP4 quant is 75.6 GB, leaving ~52 GB for KV cache and overhead. Compressed-tensors (RedHat) → no --quantization. A Spark owner who benchmarked it called it the most well-balanced single-Spark model, measuring ~16 tok/s decode with prefill from ~1,850 up to ~6,300 t/s as context deepens — slower than the A3B models, but markedly higher quality.
docker run -d --name qwen35-122b --ipc=host --restart unless-stopped \
--gpus all --shm-size 32gb -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models \
vllm/vllm-openai:cu130-nightly \
RedHatAI/Qwen3.5-122B-A10B-NVFP4 \
--served-model-name qwen3.5-122b --host 0.0.0.0 --port 8000 \
--trust-remote-code --moe-backend marlin \
--reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder \
--enable-prefix-caching \
--gpu-memory-utilization 0.85 --max-model-len 65536 \
--max-num-seqs 4 --max-num-batched-tokens 8192
Needs a recent vLLM (build from main or current nightly) and Transformers 5+. --moe-backend marlin is the sm121-safe default; the community author actually ran --moe-backend flashinfer_cutlass on this model and found it faster than Marlin here — worth A/B-ing on your build. If the chat output looks malformed, the Qwen3.5 chat template needs the fix shipped in eugr/spark-vllm-docker (mount it via --chat-template).
7.2 Gemma family
Important
Use vllm/vllm-openai:gemma4-cu130, never the bare :gemma4 tag. The image without the -cu130 suffix is v0.18.2-dev and crashes on GB10 with RuntimeError: [FP4 gemm Runner] Failed to run cutlass FP4 gemm on sm120/sm121. After ~90 s startup (≈84 s weight load + torch.compile), a healthy Gemma-4 NVFP4 run logs Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM (dense) and Using ‘MARLIN’ NvFp4 MoE backend (MoE). If a MoE model logs CUTLASS_FP4 for MoE instead of MARLIN, your --moe-backend marlin didn’t apply. Stop and fix it, because sm121 has no native FP4 MoE kernel.
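That log check can be automated. The sketch below greps the container log for the two markers named above; the helper name `check_moe_backend` and its OK/BAD/UNKNOWN output strings are hypothetical, and the patterns are deliberately loose because exact log wording can differ between builds.

```bash
# check_moe_backend
# Reads a vLLM startup log on stdin and reports which FP4 MoE backend it
# shows. Exit status: 0 Marlin, 1 CUTLASS_FP4, 2 no MoE backend line found.
check_moe_backend() {
  log=$(cat)
  if printf '%s\n' "$log" | grep -q 'CUTLASS_FP4'; then
    echo 'BAD: MoE on CUTLASS_FP4'
    return 1
  elif printf '%s\n' "$log" | grep -Eq 'MARLIN.* NvFp4 MoE backend'; then
    echo 'OK: MoE on Marlin'
  else
    echo 'UNKNOWN: no MoE backend line found'
    return 2
  fi
}

# Usage against a running container:
#   docker logs gemma4-26b 2>&1 | check_moe_backend
```

The dense-layer line (FLASHINFER_CUTLASS for NVFP4 GEMM) does not contain the string CUTLASS_FP4, so it is not mistaken for a bad MoE backend; a dense-only model simply reports UNKNOWN.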
7.2.1 Gemma-4-31B-IT-NVFP4 — dense multimodal (NVIDIA)
Dense 30.7B, multimodal (text/image/video), 256K context, 140+ languages, hybrid local/global attention with p-RoPE, NVIDIA Open Model License (+ Gemma terms), ~21 GB NVFP4. (Dense — no --moe-backend needed; its linear layers use FLASHINFER_CUTLASS automatically.)
export HF_TOKEN=hf_xxx
docker run -d --name gemma4-31b --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
vllm/vllm-openai:gemma4-cu130 \
nvidia/Gemma-4-31B-IT-NVFP4 \
--quantization modelopt --tensor-parallel-size 1 \
--gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4
Note
The card uses --tensor-parallel-size 8 on a server; on Spark use TP=1. Gated: accept the Gemma license + set HF_TOKEN. For tools/reasoning add --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4.
7.2.2 Gemma-4-26B-A4B-NVFP4 — MoE multimodal (NVIDIA)
MoE 25.2B total / 3.8B active (8 of 128 experts +1 shared), multimodal, 256K context, ~16.5 GB NVFP4 — a small, fast, high-quality Spark pick.
# 1. Download the community NVFP4 checkpoint — it bundles gemma4_patched.py,
# which fixes vLLM's expert scale-key mapping bug (#38912). The official
# nvidia/Gemma-4-26B-A4B-NVFP4 does NOT ship this patch.
hf download bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 \
--local-dir ~/vllm/gemma4-26b-a4b-nvfp4
# 2. Serve, mounting the patched loader OVER the image's gemma4.py
docker run -d --name gemma4-26b --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-v ~/vllm/gemma4-26b-a4b-nvfp4:/models/gemma4 \
-v ~/vllm/gemma4-26b-a4b-nvfp4/gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py \
vllm/vllm-openai:gemma4-cu130 \
--model /models/gemma4 --served-model-name gemma-4-26b \
--quantization modelopt --moe-backend marlin --trust-remote-code \
--kv-cache-dtype fp8 --max-model-len 65536 \
--gpu-memory-utilization 0.85 --max-num-seqs 4 \
--reasoning-parser gemma4 \
--enable-auto-tool-choice --tool-call-parser pythonic
Warning
This MoE needs three things on Spark: the patched gemma4_patched.py loader (the second -v mount above), --moe-backend marlin, and TP=1. Tool calls use the pythonic parser, not gemma4.
Why the loader: the NVFP4 checkpoint carries per-expert scale tensors (.input_scale, .weight_scale, .weight_scale_2) that vLLM 0.19’s gemma4.py can’t map onto FusedMoE params, so a stock image dies at load with KeyError: ‘…experts.0.down_proj.input_scale’ (vLLM #38912). The community bg-digitalservices checkpoint ships gemma4_patched.py, and the second -v mount overlays it onto the image’s copy — that mount is the whole fix. (NVIDIA’s own nvidia/Gemma-4-26B-A4B-NVFP4, published 2026-05-01, has the same scale keys and isn’t yet proven with this patch, so the community checkpoint is the known-good route.) On sm121 the native CUTLASS FP4 MoE path emits garbage (NaN scales, !!!!!), so Marlin is mandatory — a healthy log shows Using ‘MARLIN’ NvFp4 MoE backend for MoE and FLASHINFER_CUTLASS for dense; if MoE says CUTLASS_FP4, the flag didn’t take. The model is also TP=1 only (expert-parallel yes, tensor-parallel no).
Measured on a Spark with this exact recipe (this guide’s bench script, 3-run median, ~1,527-token prompt): decode 49.6 tok/s answer-only and 51.1 tok/s with thinking (--think) — effectively the same ~50 tok/s, the small gap being cross-session noise. That’s expected: with no speculative draft in this recipe, decode is pure memory bandwidth and doesn’t care whether the model is reasoning or answering — unlike Qwen3.6, where the MTP draft made thinking genuinely faster (higher acceptance on structured reasoning). At --output-tokens 100 the thinking run was ~97/100 reasoning chunks and never reached the answer, so 51.1 is a pure-reasoning rate; raise the budget to capture thinking+answer. Prefill ~5–6.5k tok/s (TTFT 235–298 ms for the identical prompt) — compute-bound, so it swings run-to-run unlike the rock-stable decode; either way it tracks the ~6,200 tok/s prefill on Qwen3.6-35B, so GB10 prefill is ~6k tok/s largely regardless of which small-active MoE you run. The decode corroborates the ~52 tok/s in §2’s chart.
Tip
Thinking is OFF by default on Gemma 4; turn it ON for anything needing a step of reasoning. (The prompt “I want to wash the car and the washing station is just 100m away. Should I walk there or should I drive?” returns gibberish with thinking off and is answered correctly with thinking on.)
Turn it on three ways. (1) Per request: add chat_template_kwargs={"enable_thinking": true} (curl) or extra_body={"chat_template_kwargs": {"enable_thinking": True}} (OpenAI client), with max_tokens 4096+. (2) Server-wide: add --default-chat-template-kwargs '{"enable_thinking": true}' to the docker run and every client gets it (per-request kwargs still override). (3) In Open WebUI: Admin Panel → Settings → Models → your model → Advanced Params → + Add Custom Parameter, name chat_template_kwargs, value {"enable_thinking": true}. Heads-up: on the gemma4-cu130 image the reasoning may land inside content instead of the separate reasoning_content field (vLLM #38855) — the model still reasons; that split is a parser bug fixed in newer builds.
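As a minimal sketch of option (1), the request body can be assembled as below. The helper name build_chat_payload and the served model name gemma4 are placeholders of ours, not vLLM identifiers; only the chat_template_kwargs field and the 4096-token budget come from the text above.

```python
import json


def build_chat_payload(model, prompt, enable_thinking=True, max_tokens=4096):
    """Build an OpenAI-style chat body with thinking toggled per request."""
    return {
        "model": model,  # placeholder: use your --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # thinking needs a generous budget
        # Forwarded by vLLM to the model's chat template.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


payload = build_chat_payload(
    "gemma4", "The car wash is 100 m away. Should I walk or drive?")
print(json.dumps(payload, indent=2))
```

POST the printed JSON to /v1/chat/completions with curl, or pass the same chat_template_kwargs dict through extra_body when using the OpenAI Python client.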
7.2.3 gemma-4-12B-coder (MTP-NVFP4) — Python specialist with a bundled MTP draft (community)
A weight-only NVFP4 (W4A16) build of a Gemma-4-12B coding fine-tune: 8.25 GB model + a 0.85 GB bundled MTP draft for ~1.6× single-stream decode speed. It auto-detects NVFP4 (no --quantization flag needed). Because the draft lives in assistant/, download to a local path and mount it.
# download (~9 GB total) into ~/vllm
hf download sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4 \
--local-dir ~/vllm/gemma4-coder
# easiest: one GPU, just chat
docker run -d --name gemma4-coder --ipc=host --shm-size 16gb --restart unless-stopped \
--gpus all -p 8000:8000 \
-v ~/vllm/gemma4-coder:/model:ro \
vllm/vllm-openai:nightly \
--model /model --served-model-name gemma4-coder \
--max-model-len 16384 --gpu-memory-utilization 0.92 --trust-remote-code
Add the bundled MTP draft for ~1.6× interactive speed (lossless):
--kv-cache-dtype fp8 \
--speculative-config '{"method":"mtp","model":"/model/assistant","num_speculative_tokens":3}'
Important
This model was trained to think first, so enable thinking per request or quality drops:
extra_body={"chat_template_kwargs": {"enable_thinking": True}}. It needs a recent nightly (one that registers Gemma4UnifiedForConditionalGeneration). With MTP you must use --kv-cache-dtype fp8 (NVFP4 KV breaks the draft).
Caution
The build is de-refused / not safety-aligned — add your own guardrails. It’s a superb algorithm/debug assistant but can write look-ahead bias into pandas/numpy time-series & back-test code (its reasoning sometimes states the right rule while the code does the opposite). Gate quant/accounting code; don’t ship it unreviewed.
7.3 NVIDIA models (proprietary)
Models NVIDIA itself designed and trained (not quantizations of someone else’s weights). Both are hybrid Mamba-2 + Transformer MoE models with native function calling: the Nano is the natural single-Spark starting point, and the Nano-Omni adds video/audio/image understanding.
7.3.1 NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 — best single-Spark Nemotron (Spark-tested)
A hybrid Mamba-2 + MoE (52 layers: 23 Mamba-2, 23 MoE with 128 experts +1 shared / 6 active, 6 GQA attention), 30B total / 3.5B active, text-only, 1M context (256K default), unified reasoning + non-reasoning, NVIDIA Nemotron Open Model License, ~18 GB on disk. NVFP4 weights with FP8 KV cache; attention and the Mamba layers feeding it stay BF16, and quantization-aware distillation (QAD) recovers accuracy. NVIDIA lists DGX Spark in this model’s tested hardware and ships a Spark/Jetson-specific container.
export HF_TOKEN=hf_xxx
# one-time: fetch the custom reasoning parser into ~/vllm (mounted at /models)
wget -P ~/vllm \
https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py
docker run -d --name nemotron-nano --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
-e VLLM_USE_FLASHINFER_MOE_FP4=1 \
-e VLLM_FLASHINFER_MOE_BACKEND=throughput \
nvcr.io/nvidia/vllm:25.12.post1-py3 \
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--served-model-name nemotron-3-nano --port 8000 \
--tensor-parallel-size 1 --trust-remote-code \
--max-model-len 262144 --max-num-seqs 8 \
--gpu-memory-utilization 0.85 \
--kv-cache-dtype fp8 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--reasoning-parser-plugin /models/nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3
Important
Use NVIDIA’s nvcr.io/nvidia/vllm:25.12.post1-py3 container for this model and fetch the nano_v3_reasoning_parser.py plugin (into ~/vllm, referenced at /models/...). --kv-cache-dtype fp8 is part of the official recipe here: unlike Qwen3.5, this model is built for it.
--max-num-seqs 8 is NVIDIA’s Spark-tested value; drop to 4 for more KV headroom at long context. The two VLLM_*FLASHINFER_MOE* env vars are valid on this Dec-2025 container; on a 0.23+ image they become no-ops (logged “Unknown…”) — drop them and add --moe-backend marlin (the required fallback if the FlashInfer FP4 MoE path ever errors on sm121). Driver: this image is built for NVIDIA driver ≥ 590.44. On a Spark still on 580.x it runs through CUDA forward-compatibility — a clean boot prints CUDA Forward Compatibility mode ENABLED … Using CUDA 13.1 driver 590.44 with kernel driver 580.x and proceeds normally. If that line instead reads compatibility mode is UNAVAILABLE while the container restart-loops, the shim couldn’t initialize: update the host driver to ≥ 590.44, or run on a nightly image (vllm/vllm-openai:cu130-nightly) matching your driver. (A --restart unless-stopped container failing for another reason — e.g. a bad flag — can momentarily print UNAVAILABLE on the colliding restart; fix the real error first.)
Reasoning is on by default — pass chat_template_kwargs={"enable_thinking":false} per request to turn it off. The model also supports a reasoning_budget (cap internal reasoning tokens to hit latency targets). Sampling: temperature=1.0, top_p=1.0 for reasoning; 0.6 / 0.95 for tool calling. Give each request a high max_tokens (≈10,000): the model always emits a reasoning trace first, so a low cap truncates it mid-thought — which returns empty or garbled content and silently breaks tool-call parsing (the single most common “tool calling doesn’t work” cause on this model). For up to 1M context add -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 and raise --max-model-len. With only 3.5B active params and ~18 GB resident, decode is fast and KV headroom is large — an excellent default Spark model. The --gpu-memory-utilization 0.85 added above is the memory cap: it bounds vLLM’s slice of the shared 128 GB pool. With only ~18 GB of weights you can safely lower it (e.g. 0.5) to leave more host RAM for the page cache and other services, with no decode penalty — lowering it shrinks KV-cache capacity, not speed. Prefer this over a Docker --memory cgroup cap, which is risky on unified memory (GPU allocations draw from the same DRAM, so a hard cap can trip the OOM-killer mid-load).
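The sampling and token-budget advice above can be captured in a small request builder. This is a sketch under assumptions: the function names and the mode labels are ours, and the truncation check relies on the standard OpenAI-style finish_reason value "length" rather than anything specific to this model.

```python
# Sampling presets quoted in the paragraph above; the mode names are ours.
SAMPLING = {
    "reasoning": {"temperature": 1.0, "top_p": 1.0},
    "tool_calling": {"temperature": 0.6, "top_p": 0.95},
}


def nemotron_body(model, messages, mode="reasoning", thinking=True,
                  max_tokens=10_000):
    """Build a chat-completions body with the matching sampling preset."""
    if mode not in SAMPLING:
        raise ValueError(f"unknown mode: {mode!r}")
    body = {"model": model, "messages": messages, "max_tokens": max_tokens}
    body.update(SAMPLING[mode])
    if not thinking:
        # Reasoning is on by default, so only send the kwarg to turn it off.
        body["chat_template_kwargs"] = {"enable_thinking": False}
    return body


def was_truncated(response):
    """True if any choice stopped because it ran into max_tokens."""
    return any(c.get("finish_reason") == "length"
               for c in response.get("choices", []))


body = nemotron_body("nemotron-3-nano",
                     [{"role": "user", "content": "hi"}],
                     mode="tool_calling")
print(body["temperature"], body["top_p"], body["max_tokens"])
```

Checking was_truncated on a response before parsing tool calls is a cheap way to tell a too-small max_tokens apart from a genuine parser problem.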
7.3.2 Nemotron-3-Nano-Omni-30B-A3B-NVFP4 — fast omni-modal agent (video + audio + image + text)
The Nano family’s omni sibling: a Mamba2-Transformer hybrid MoE (31B / 3B active) with a CRADIO-v4 vision encoder and a Parakeet speech encoder, so it ingests video (mp4 ≤ 2 min), audio (wav/mp3 ≤ 1 hr), images, and text and emits text, JSON, tool calls, reasoning, and word-level transcription timestamps — built for OCR, document intelligence, meeting/video summarization, and GUI/browser agents. 256K context, A3B decode speed (the same ~50-tok/s class as the text Nano), NVFP4 ~20 GB. NVIDIA ships a dedicated Spark recipe.
export HF_TOKEN=hf_xxx
docker run -d --name nemotron-omni --ipc=host --restart unless-stopped \
--gpus all --shm-size 16g -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
--entrypoint /bin/bash \
vllm/vllm-openai:v0.20.0 -c \
"pip install vllm[audio] && vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
--served-model-name nemotron-omni --host 0.0.0.0 --port 8000 --trust-remote-code \
--gpu-memory-utilization 0.8 --max-model-len 131072 \
--max-num-seqs 8 --max-num-batched-tokens 32768 --enable-prefix-caching \
--limit-mm-per-prompt '{\"video\":1,\"image\":1,\"audio\":1}' \
--media-io-kwargs '{\"video\":{\"fps\":2,\"num_frames\":256}}' \
--allowed-local-media-path=/ \
--reasoning-parser nemotron_v3 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder"
Important
This recipe breaks two of the guide’s usual rules on purpose. (1) It pins vLLM to exactly v0.20.0, the version NVIDIA validated for this model, instead of the nightly track. (2) It overrides the entrypoint to /bin/bash so it can pip install vllm[audio] before vllm serve: the stock image has no audio stack, and without it any audio input (including use_audio_in_video:true) fails. That’s the one case in this guide where you write vllm serve explicitly on a vllm/vllm-openai image, because the entrypoint is no longer vllm serve. Multimodal input needs --allowed-local-media-path and --limit-mm-per-prompt; tune video via --media-io-kwargs (drop to fps:1 / num_frames:128 for 1080p clips). If you OOM, lower --gpu-memory-utilization to 0.70 and/or --max-model-len to 32768; for more KV headroom you can add --kv-cache-dtype fp8 (the card’s non-BF16 path uses it; validate output quality first). If FP4 output is garbled on sm121, add --moe-backend marlin (see §6). Sampling: thinking mode temperature 0.6, top_p 0.95; instruct mode temperature 0.2. BF16/FP8 weights and a non-reasoning Instruct variant live on the same nvidia/Nemotron-3-Nano-Omni-30B-A3B-* repo family.
On the stable NGC 26.04 image, this runs vision-only. NVIDIA’s 26.04 release notes do list Nemotron 3 Nano Omni as supported — but only by overriding the architecture to the previous-generation vision-language model: --hf-overrides='{"architectures":["NemotronH_Nano_VL_V2"]}'. That loads the VL backbone and drops audio. So use 26.04 + that override if you only need image / video / text on the NGC track; stay on the v0.20.0 recipe above for the full omni stack (audio, use_audio_in_video, word-level transcription timestamps) — the version the model card actually requires.
8. Flag reference & tuning
| Flag | On Spark | Guidance |
|---|---|---|
| --gpu-memory-utilization | Fraction of the shared 128 GB | Start from the model recipe (0.4–0.92 seen). Leave headroom; lower if OOM. |
| --max-num-seqs | Concurrent sequences | Keep low (1–4); above ~4 the bandwidth tax outweighs batching. |
| --max-model-len | Prompt + completion cap | 65536 is a sane Spark default; raise toward model max only with headroom. |
| prefix caching | KV reuse across shared prefixes | On by default in V1; --enable-prefix-caching is usually redundant, but a few hybrid archs disable it, so re-enable it explicitly there (see §7.1.3). |
| --quantization modelopt | ModelOpt NVFP4 only | Pass for nvidia/... ModelOpt checkpoints; omit for compressed-tensors (auto-detected). |
| --reasoning-parser / --tool-call-parser + --enable-auto-tool-choice | Structured reasoning/tools | All three are needed for agent / tool-calling clients: without --enable-auto-tool-choice + a matching --tool-call-parser, a request with tool_choice:"auto" returns HTTP 400. The tool parser tracks the model family, not your client: qwen3_xml (Qwen3.6), qwen3_coder (Coder-Next), gemma4, mistral, nemotron_v3. See §7.1.1. |
| --kv-cache-dtype fp8 | Shrinks KV cache | Model-specific: the Qwen3.6-NVFP4 & gemma4-coder-MTP recipes use it; Qwen3.5 and DFlash do not. |
| --speculative-config '{"method":"mtp"...}' | Built-in speculative decode | Latency lever for MTP models (Qwen3.x, Nemotron, gemma4-coder). See §14. |
| --moe-backend / --attention-backend | Kernel pins | Leave auto unless a tested recipe pins one (marlin / cutlass / flashinfer / TRITON_*). |
| --enable-expert-parallel | MoE routing | Enable for MoE (Nemotron, Gemma-4-26B). |
| CUDA graphs | Per-step overhead | Keep enabled. |
| --load-format fastsafetensors | Faster weight load | Evaluate if the 10–15 min load matters. |
Tip
Stability ↔ throughput slider. A plain run leaves KV-cache dtype unset, spec-decode off, CUDA graphs on, backends auto. A tuned run layers in FP8 KV cache, async scheduling, MTP/DFlash, and pinned FP4 backends — each validated against the exact model + prompt shape + batch pattern + vLLM release. Want ~2–3× faster interactive decode? See §14 (MTP & DFlash).
Memory & KV cache: why RAM looks “maxed out”
Coming from llama.cpp/Ollama, the first shock on Spark is that vLLM pins almost the whole 128 GB at startup. This is by design, not a leak. vLLM uses PagedAttention and pre-allocates one large KV-cache pool sized to --gpu-memory-utilization, grabbing that memory up front (on Spark’s unified pool it shows as system RAM) no matter how many requests are actually running. llama.cpp instead allocates KV lazily, sized to the -c you pass — so it looks far lighter for the same weights. Read it straight from the startup log:
Model loading took 21.94 GiB memory ← weights (~same as llama.cpp)
Available KV cache memory: 76.49 GiB ← the pre-allocated KV pool
GPU KV cache size: 4,843,965 tokens ← pool capacity, in tokens
Maximum concurrency for 65,536 tokens per request: 73.91x
That token number is a pool capacity, not a context window — the single most common misread. Picture the KV pool as a parking lot measured in token-slots, and --max-model-len as how many slots one request may occupy. Then concurrency = pool_tokens ÷ max-model-len (here 4.84M ÷ 65,536 ≈ 74 simultaneous full-length requests). The max-num-seqs 4 × 65,536 ≈ 262K figure is just the minimum pool you’d want so four max-length requests can be in flight at once. The three knobs:
- --max-model-len: context window of a single request (prompt + output).
- --max-num-seqs: how many requests run concurrently.
- --gpu-memory-utilization: how big the pool is (total token capacity).
Empirically from that log, fp8 KV on this model costs ~16.6 KB/token (76.49 GiB ÷ 4.84M) — the conversion between “tokens of capacity” and GB.
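The parking-lot arithmetic is easy to reproduce. The sketch below plugs in the numbers from the startup log above; the function names are ours, and the per-token cost it derives applies only to that model with fp8 KV.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(pool_gib, pool_tokens):
    """Bytes of KV cache one token occupies, derived from the startup log."""
    return pool_gib * GIB / pool_tokens


def max_concurrency(pool_tokens, max_model_len):
    """How many full-length requests fit in the pool at once."""
    return pool_tokens / max_model_len


def min_pool_tokens(max_num_seqs, max_model_len):
    """Smallest pool that lets max_num_seqs full-length requests coexist."""
    return max_num_seqs * max_model_len


# Values read from the log lines above.
pool_gib, pool_tokens = 76.49, 4_843_965

per_token = kv_bytes_per_token(pool_gib, pool_tokens)
print(f"KV cost: {per_token / 1024:.1f} KiB/token")
print(f"concurrency @64K: {max_concurrency(pool_tokens, 65_536):.2f}x")
print(f"minimum pool for 4 x 64K: {min_pool_tokens(4, 65_536):,} tokens")
```

Swap in the two numbers from your own log to see how far the pool is over-provisioned for your --max-num-seqs and --max-model-len.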
To reclaim RAM (single user): the pool is wildly over-provisioned for one person — lower --gpu-memory-utilization. You only need pool_tokens ≥ max-num-seqs × max-model-len plus slack.
| --gpu-memory-utilization | KV pool (≈) | tokens (≈) | concurrency @64K | RAM freed vs 0.85 |
|---|---|---|---|---|
| 0.85 | ~76 GiB | ~4.8M | ~74x | — |
| 0.5 | ~39 GiB | ~2.4M | ~37x | ~45 GB |
| 0.4 | ~23 GiB | ~1.4M | ~22x | ~58 GB |
Note
--max-num-seqs does not shrink the pool; --gpu-memory-utilization does. --max-num-seqs only caps how many sequences share the pool. To free RAM, lower the utilization; to fit a long context, make sure the (smaller) pool still holds ≥ one full-length sequence.
Walkthrough: 65K → 256K context (one user)
The baseline recipe ships --max-model-len 65536 for a safe first boot, but this model’s native ceiling is 262,144 (256K) — 4× longer, reachable with no YaRN and no quality trade-off, because you’re still inside what the model was trained on. The only cost is memory, and for a single user that cost is small: one 256K sequence holds 262,144 tokens of KV ≈ ~4 GB (at 16.6 KB/token, fp8). So a handful of concurrent 256K chats fit in a fraction of the pool — the 76 GB the baseline grabbed was never needed.
Step 1 — raise the context to native max. This single change is enough to work:
--max-model-len 262144
Step 2 — right-size the pool for one user, reclaiming the RAM the longer context didn’t actually need:
--max-model-len 262144 \
--max-num-seqs 2 \
--gpu-memory-utilization 0.4 \
--kv-cache-dtype fp8
That leaves a ~1.4M-token pool ≈ ~5 full-length 256K sequences of headroom — plenty for one user plus Open WebUI’s background title/tag calls — while freeing ~58 GB versus the 0.85 default. Raise --gpu-memory-utilization for more concurrent 256K sessions, or drop --max-num-seqs to 1 if it’s strictly you. If you hit a Mamba/CUDA-graph size error when raising the context, reduce --max-cudagraph-capture-size (default 512).
Note
Going beyond native 256K needs YaRN, so prefer not to. To push toward ~1M you’d add -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 plus a YaRN --hf-overrides block (factor 2.0 ≈ 524K, factor 4.0 ≈ 1M). But YaRN stretches the model past its training and degrades short-prompt quality, and a ~1M prefill on GB10 (273 GB/s, Marlin FP4) takes minutes: fits ≠ fast. Stay at native 256K unless you truly need more; the exact override for this model is in the vLLM Qwen3.6 recipe:
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 --max-model-len 1010000 \
--hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}'
9. Pre-warm the JIT
The first request after boot triggers Inductor + FlashInfer codegen (~25–60 s). Fire a tiny warmup on the same path as real traffic, then short prompts return in <0.5 s:
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
-d '{"model":"<served-model-name>","messages":[{"role":"user","content":"ping"}],"max_tokens":3}'
This is separate from weight load (10–15 min for a 120B) — address that with fastsafetensors/InstantTensor if needed.
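Because weight load can take many minutes, a warmup fired too early just gets connection-refused. One way to sequence it, sketched here with helper names of our own, is to poll /health until it answers and only then send the warmup request. The 30-minute timeout and 5-second interval are arbitrary assumptions.

```python
import time
import urllib.request


def wait_until_ready(probe, timeout_s=1800.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # connection refused / timed out: server still loading
        if clock() >= deadline:
            return False
        sleep(interval_s)


def health_probe(base_url="http://localhost:8000"):
    """True once the server's /health endpoint answers 200."""
    with urllib.request.urlopen(base_url + "/health", timeout=5) as resp:
        return resp.status == 200


def warmup_body(served_model_name):
    """Same tiny request as the curl warmup above."""
    return {"model": served_model_name,
            "messages": [{"role": "user", "content": "ping"}],
            "max_tokens": 3}


# Typical use (performs real HTTP calls, so it is left commented out):
# if wait_until_ready(health_probe):
#     ...POST warmup_body("<served-model-name>") to /v1/chat/completions
```

The probe, sleep, and clock are injectable so the loop can be exercised without a running server.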
10. Verify & monitor
curl -sS http://localhost:8000/health
curl -sS http://localhost:8000/v1/models | jq
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
-d '{"model":"<served-model-name>","messages":[{"role":"user","content":"Explain quantum computing briefly."}],"max_tokens":200,"stream":true}'
Prometheus telemetry is at /metrics (no extra service). Watch KV-cache utilization (vllm:kv_cache_usage_perc) and TTFT / inter-token-latency histograms. Healthy single-user behavior: prefix-cached later turns don’t spike, decode rate stays steady, KV usage stays well below the context limit.
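If you would rather poll a single gauge from a script than run a full Prometheus stack, the text format is simple enough to read by hand. The sketch below is a deliberately naive parser of our own (it assumes label values contain no "}" character), and the sample text is illustrative rather than captured from a real server.

```python
def read_gauge(metrics_text, name):
    """Return the value of every sample of `name` in Prometheus text format."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line.startswith(name):
            continue  # also skips '# HELP' / '# TYPE' comment lines
        rest = line[len(name):]
        if not rest or rest[0] not in "{ ":
            continue  # a longer metric name that merely shares the prefix
        if rest[0] == "{":
            rest = rest.rpartition("}")[2]  # drop the label set
        fields = rest.split()
        if fields:
            values.append(float(fields[0]))
    return values


# Illustrative sample only; label names and values are made up.
sample = """\
# TYPE vllm:kv_cache_usage_perc gauge
vllm:kv_cache_usage_perc{model_name="demo"} 0.0375
vllm:kv_cache_usage_perc_other 5
"""
print(read_gauge(sample, "vllm:kv_cache_usage_perc"))
```

Feed it the body returned by the /metrics endpoint and alert when the value stays high, which would suggest the pool is undersized for your traffic.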
Tip
Confirm the fast paths are live. Check the startup log to verify NVFP4 GEMM kernels actually engaged: you want a line like Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM (the exact backend varies: FLASHINFER_CUTLASS or MARLIN on Spark). If it silently fell back, you lose the FP4 speed/memory win. For speculative decoding, confirm acceptance via curl -s localhost:8000/metrics | grep -i spec_decode (see §14.6).
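The log check can be scripted. This sketch assumes the startup line keeps the NvFp4LinearBackend.<NAME> shape quoted above, which may differ between vLLM builds; the function name is ours.

```python
import re

# Matches the backend name in lines like the one quoted above.
BACKEND_RE = re.compile(r"NvFp4LinearBackend\.([A-Z0-9_]+)")


def nvfp4_gemm_backends(log_text):
    """Return the distinct NVFP4 GEMM backends named in a startup log."""
    return sorted(set(BACKEND_RE.findall(log_text)))


log = "INFO Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM"
found = nvfp4_gemm_backends(log)
print(found if found else "no NVFP4 backend line found: possible fallback")
```

Pipe the container's captured log text into it; an empty result is the cue to read the log by hand before trusting any throughput number.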
Benchmark throughput on your own Spark
Don’t trust anyone’s tok/s — including this guide’s — until you’ve measured on your box, image, and flags. vLLM ships vllm bench. NVIDIA’s benchmarking playbook distinguishes offline (raw model throughput, no HTTP/scheduler) from online (end-to-end through the running server), parameterized by ISL/OSL (input/output sequence length).
# Offline — raw decode/prefill throughput inside the container (no server)
docker exec -it qwen36-35b vllm bench throughput \
--model nvidia/Qwen3.6-35B-A3B-NVFP4 --quantization modelopt \
--input-len 1024 --output-len 256 --num-prompts 64
# Online — against your already-running server (realistic: scheduler + KV cache)
docker exec -it qwen36-35b vllm bench serve \
--backend openai-chat --model nvidia/Qwen3.6-35B-A3B-NVFP4 \
--base-url http://localhost:8000 --endpoint /v1/chat/completions \
--dataset-name random --random-input-len 1024 --random-output-len 256 \
--num-prompts 100 --request-rate 4
Note
For a true single-stream decode number (what most “X tok/s” reviews quote), set --num-prompts 1 / --max-concurrency 1; sweep --request-rate (or --max-concurrency) upward to find your aggregate ceiling and the latency knee. Recall that a small model can go from tens of tok/s solo to many hundreds aggregate (§2). Match ISL/OSL to your real workload (a RAG turn ≈ long input/short output; a coding turn ≈ short input/long output), because throughput differs a lot between them. The same methodology carries over to trtllm-bench and the SGLang equivalents if you want a cross-engine comparison.
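A concurrency sweep is just the online command repeated with a rising --max-concurrency. The sketch below only builds and prints the command lines, using the flags shown above; the concurrency levels and the prompts-per-slot multiplier are assumptions of ours, so adjust them to your workload.

```python
import shlex


def bench_serve_cmd(model, base_url, input_len, output_len,
                    num_prompts, max_concurrency):
    """One `vllm bench serve` invocation as an argv list."""
    return ["vllm", "bench", "serve",
            "--backend", "openai-chat", "--model", model,
            "--base-url", base_url, "--endpoint", "/v1/chat/completions",
            "--dataset-name", "random",
            "--random-input-len", str(input_len),
            "--random-output-len", str(output_len),
            "--num-prompts", str(num_prompts),
            "--max-concurrency", str(max_concurrency)]


def concurrency_sweep(model, base_url, levels=(1, 2, 4, 8),
                      input_len=1024, output_len=256, prompts_per_slot=8):
    """Commands for a sweep; prompts_per_slot is an arbitrary choice."""
    return [bench_serve_cmd(model, base_url, input_len, output_len,
                            num_prompts=level * prompts_per_slot,
                            max_concurrency=level)
            for level in levels]


for cmd in concurrency_sweep("nvidia/Qwen3.6-35B-A3B-NVFP4",
                             "http://localhost:8000"):
    print(shlex.join(cmd))
```

Run each printed line inside the container (for example via docker exec) and plot output tok/s against concurrency; the point where latency climbs faster than throughput is the knee.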
Measuring prefill vs decode (single request)
vllm bench above gives aggregate throughput; sometimes you just want a quick, single-request read of prefill (prompt → first token) versus decode (the streaming speed) on their own. The script below does exactly that against any OpenAI-compatible vLLM endpoint: it warms up the JIT, sends a sized prompt, and reports TTFT, prefill tok/s, and decode tok/s from the server’s exact token counts.
Three details make its numbers honest on Spark:
- it forces full-length generation (ignore_eos) so the decode rate is measured over a real number of tokens, not a model that stops after a short reply;
- it makes every prompt unique so --enable-prefix-caching can’t replay cached KV and make prefill look artificially fast;
- it counts reasoning tokens too: Qwen3.6 and other reasoning models stream their <think> block as reasoning_content, not content, so a content-only counter would see nothing.
Save it as bench_prefill_decode.py:
#!/usr/bin/env python3
"""
bench_prefill_decode.py - measure vLLM PREFILL vs DECODE speed on a DGX Spark.

PREFILL = processing the whole prompt up to the first token (compute-bound;
          reported as TTFT and prompt_tokens / TTFT).
DECODE  = generating tokens after the first (bandwidth-bound; the streaming speed).

Two things this version gets right:
  * ignore_eos forces the model to emit the FULL --output-tokens, so the decode
    rate is measured over enough tokens to be stable (a 22-token reply is not).
  * each measured prompt carries a unique nonce, so --enable-prefix-caching on the
    server can't reuse cached KV and make prefill look artificially fast.

Reasoning models emit <think> as `reasoning_content`; this counts both that and
`content`. Thinking control: --think forces it ON (enable_thinking=true; needed
for Gemma 4, which defaults OFF), --no-think forces it OFF, and omitting both
leaves the model's own default (Qwen3 on, Gemma 4 off).

Usage:
    python3 bench_prefill_decode.py \
        --base-url http://localhost:8000/v1 \
        --model nvidia/Qwen3.6-35B-A3B-NVFP4 \
        --prompt-tokens 4000 --output-tokens 256 --runs 3
    # add --api-key "$VLLM_API_KEY" only if you launched vLLM with one
"""
import argparse
import json
import statistics
import sys
import time
import uuid

try:
    import requests
except ImportError:
    sys.exit("This script needs `requests`: pip install requests")


def build_prompt(approx_tokens: int, nonce: str) -> str:
    unit = "The DGX Spark is a compact GB10 system with 128 GB of unified memory. "
    reps = max(1, approx_tokens // len(unit.split()))
    # The nonce defeats prefix caching so prefill is actually measured each run.
    return (f"[run {nonce}] Read the text below, then reply.\n\n" + unit * reps)


def run_once(url, headers, model, prompt, max_tokens, enable_thinking=None, ignore_eos=True):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
        "stream": True,
        "stream_options": {"include_usage": True},
        "ignore_eos": ignore_eos,  # vLLM extension: generate the full budget
    }
    if enable_thinking is not None:
        payload["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    t0 = time.perf_counter()
    t_first = None
    usage = None
    finish = None
    n_content = n_reasoning = 0
    with requests.post(url, headers=headers, json=payload, stream=True, timeout=900) as r:
        if r.status_code != 200:
            sys.exit(f"HTTP {r.status_code}: {r.text}")
        for raw in r.iter_lines():
            if not raw:
                continue
            line = raw.decode("utf-8")
            if not line.startswith("data: "):
                continue
            data = line[len("data: "):]
            if data.strip() == "[DONE]":
                break
            obj = json.loads(data)
            if obj.get("usage"):
                usage = obj["usage"]
            for ch in (obj.get("choices") or []):
                delta = ch.get("delta") or {}
                # Detect generated text in ANY field (content / reasoning_content /
                # reasoning / ...) so field-name differences across builds don't hide it.
                field = None
                for k, v in delta.items():
                    if k == "role":
                        continue
                    if isinstance(v, str) and v:
                        field = k
                        break
                if field:
                    if field == "content":
                        n_content += 1
                    else:
                        n_reasoning += 1
                    if t_first is None:
                        t_first = time.perf_counter()
                if ch.get("finish_reason"):
                    finish = ch["finish_reason"]
    t_end = time.perf_counter()
    return t0, t_first, t_end, usage, n_content, n_reasoning, finish


def measure(url, headers, args, enable_thinking):
    nonce = uuid.uuid4().hex[:12]
    prompt = build_prompt(args.prompt_tokens, nonce)
    t0, t_first, t_end, usage, n_c, n_r, finish = run_once(
        url, headers, args.model, prompt, args.output_tokens,
        enable_thinking=enable_thinking, ignore_eos=True)
    p_tok = usage.get("prompt_tokens") if usage else None
    c_tok = usage.get("completion_tokens") if usage else None
    if t_first is None:
        print("\nNo per-token deltas were streamed.")
        print(f" usage: prompt_tokens={p_tok} completion_tokens={c_tok} finish_reason={finish}")
        if c_tok:
            print(" The model generated tokens but the stream exposed none of them. With")
            print(" thinking ON this is usually the reasoning parser buffering the <think>")
            print(" block: ignore_eos hit the token cap before </think> closed, so nothing")
            print(" flushed. Fix: use --no-think, or raise --output-tokens (e.g. 2048).")
        sys.exit(1)
    ttft = t_first - t0
    decode_t = t_end - t_first
    prefill_tps = (p_tok / ttft) if (p_tok and ttft > 0) else float("nan")
    decode_tps = ((c_tok - 1) / decode_t) if (c_tok and c_tok > 1 and decode_t > 0) else float("nan")
    return {"p_tok": p_tok, "c_tok": c_tok, "finish": finish, "ttft": ttft,
            "prefill_tps": prefill_tps, "decode_tps": decode_tps,
            "n_c": n_c, "n_r": n_r, "total": t_end - t0}


def main():
    ap = argparse.ArgumentParser(description="Measure vLLM prefill vs decode throughput.")
    ap.add_argument("--base-url", default="http://localhost:8000/v1")
    ap.add_argument("--model", required=True, help="must match an id from /v1/models")
    ap.add_argument("--api-key", default=None, help="only if launched with VLLM_API_KEY")
    ap.add_argument("--prompt-tokens", type=int, default=2000, help="approx prompt size")
    ap.add_argument("--output-tokens", type=int, default=256, help="decoded each run (forced)")
    ap.add_argument("--runs", type=int, default=1, help="measured runs; reports the median")
    ap.add_argument("--think", action="store_true",
                    help="force thinking ON (sends enable_thinking=true; needed for Gemma 4, which defaults OFF)")
    ap.add_argument("--no-think", action="store_true",
                    help="force thinking OFF (sends enable_thinking=false)")
    ap.add_argument("--no-warmup", action="store_true")
    args = ap.parse_args()

    url = args.base_url.rstrip("/") + "/chat/completions"
    headers = {"Content-Type": "application/json"}
    if args.api_key:
        headers["Authorization"] = f"Bearer {args.api_key}"
    if args.think and args.no_think:
        sys.exit("--think and --no-think are mutually exclusive")
    # None => send nothing and let the model's own default decide (Qwen3 on, Gemma 4 off)
    enable_thinking = True if args.think else (False if args.no_think else None)

    if not args.no_warmup:
        print("Warming up (JIT / torch.compile / CUDA-graph capture) ...", flush=True)
        run_once(url, headers, args.model,
                 build_prompt(64, uuid.uuid4().hex[:12]), max_tokens=8, enable_thinking=enable_thinking)

    print(f"Measuring {args.runs} run(s): ~{args.prompt_tokens} prompt tokens, "
          f"{args.output_tokens} output tokens each"
          f"{' (thinking ON)' if args.think else (' (thinking OFF)' if args.no_think else '')} ...", flush=True)
    runs = []
    for i in range(args.runs):
        m = measure(url, headers, args, enable_thinking)
        runs.append(m)
        if args.runs > 1:
            print(f" run {i + 1}: prefill {m['prefill_tps']:.0f} tok/s, "
                  f"decode {m['decode_tps']:.1f} tok/s (out {m['c_tok']}, finish {m['finish']})")

    last = runs[-1]
    prefill = statistics.median(r["prefill_tps"] for r in runs)
    decode = statistics.median(r["decode_tps"] for r in runs)
    ttft_med = statistics.median(r["ttft"] for r in runs)
    tpot = (1000.0 / decode) if decode and decode == decode else float("nan")

    print("\n================ results ({} run{}) ================".format(
        args.runs, "" if args.runs == 1 else "s, median"))
    print(f"prompt tokens (prefilled) : {last['p_tok']}")
    print(f"output tokens (decoded) : {last['c_tok']} (finish: {last['finish']})")
    if last["n_r"]:
        print(f" thinking / answer (chunk) : ~{last['n_r']} thinking / ~{last['n_c']} answer")
    print(f"TTFT (prefill latency) : {ttft_med * 1000:8.0f} ms")
    print(f"PREFILL throughput : {prefill:8.0f} tok/s (prompt_tokens / TTFT)")
    print(f"DECODE throughput : {decode:8.1f} tok/s ({tpot:.1f} ms/token)")
    print("=========================================================")
    if last["c_tok"] and last["c_tok"] < 64:
        print("WARNING: output < 64 tokens - decode number is noisy; raise --output-tokens.")
    seen = last["n_c"] + last["n_r"]
    if last["c_tok"] and seen and seen < last["c_tok"] / 4:
        print("WARNING: the stream wasn't incremental (few chunks for many tokens) - the")
        print(" DECODE rate may be unreliable. Prefer --no-think for clean numbers.")
    if last["n_r"] and not args.no_think:
        print("Reasoning model: thinking tokens count toward DECODE. Use --no-think for answer-only.")
    print("Prompts are unique per run (prefix-cache-proof) and forced to full length (ignore_eos),")
    print("so PREFILL reflects a real prefill and DECODE is measured over the full output.")


if __name__ == "__main__":
    main()
Run it single-stream on the Spark (median of 3 runs, answer-only):
python3 bench_prefill_decode.py \
--base-url http://localhost:8000/v1 \
--model nvidia/Qwen3.6-35B-A3B-NVFP4 \
--prompt-tokens 4000 --output-tokens 256 --runs 3 --no-think
Flags: --prompt-tokens sizes the prompt (TTFT scales ~linearly with it); --output-tokens is forced via ignore_eos; --runs N reports the median; --no-think forces answer-only and --think forces thinking on (needed for models like Gemma 4 that default off — omit both for the model’s own default); --api-key only if you launched vLLM with VLLM_API_KEY. For a reasoning model, give thinking-on runs a large --output-tokens (e.g. 2048) so it can close </think> and emit an answer.
Measured this way on a single DGX Spark — Qwen3.6-35B-A3B-NVFP4 on the §7.1.1 recipe (MTP-3, NVFP4), 3-run medians:
| prompt tok | mode | TTFT | prefill | decode |
|---|---|---|---|---|
| ~6,019 | answer-only | 967 ms | 6,225 tok/s | 102 tok/s |
| ~3,097 | thinking | 502 ms | 6,172 tok/s | 125 tok/s |
| ~6,019 | thinking | 965 ms | 6,240 tok/s | 117 tok/s |
| ~3,098 | thinking + answer | 507 ms | 6,115 tok/s | 115 tok/s |
Four patterns, all of which generalize past this one model: (1) prefill is flat at ~6,200 tok/s regardless of prompt size, so TTFT scales linearly — predict it as prompt_tokens ÷ 6,200 (≈0.5 s at 3K, ≈1.0 s at 6K, ≈10 s at 64K). (2) thinking decodes faster than answering (~120 vs ~102 tok/s) because MTP acceptance is higher on structured reasoning than on free-form prose — the same content-dependence speculative decoding shows everywhere (§14). (3) longer context decodes slightly slower (125 → 117 tok/s from 3K → 6K) because each token re-reads a larger KV cache. (4) MTP is landing ~3 tokens per step (~80% acceptance at num_speculative_tokens:3), which is what lifts decode above the bare bandwidth ceiling. For an agent, felt latency is dominated by TTFT (prompt size) and how much the model thinks — not the per-token rate.
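Pattern (1) turns into a back-of-the-envelope latency estimate. The sketch below hard-codes the rates measured above as defaults (about 6,200 tok/s prefill and about 102 tok/s answer-only decode); both are assumptions tied to this model and recipe, and the decode term ignores the mild slowdown at longer context.

```python
def predict_ttft_s(prompt_tokens, prefill_tps=6200.0):
    """Estimated time to first token, assuming a flat prefill rate."""
    return prompt_tokens / prefill_tps


def predict_turn_s(prompt_tokens, output_tokens,
                   prefill_tps=6200.0, decode_tps=102.0):
    """Estimated wall time for one turn: prefill plus streamed decode."""
    return predict_ttft_s(prompt_tokens, prefill_tps) + output_tokens / decode_tps


for n in (3_000, 6_000, 65_536):
    print(f"{n:>6} prompt tokens -> TTFT ~{predict_ttft_s(n):.1f} s")

# A 6K-token agent turn that emits 500 tokens (thinking included):
print(f"turn estimate: ~{predict_turn_s(6_000, 500):.1f} s")
```

Replace the two defaults with medians from your own bench_prefill_decode.py runs before relying on the estimate.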
11. Connecting a frontend & agents (Open WebUI / Hermes Agent)
Once vLLM is serving, it exposes an OpenAI-compatible API at http://<host>:8000/v1. Two common ways to put a face on it:
- Open WebUI: a chat frontend that talks straight to the model.
- Hermes Agent (Nous Research): an autonomous agent (terminal, file ops, web search, memory) that uses your model as its brain and exposes its own OpenAI API; Open WebUI then connects to the agent. The chain is vLLM → Hermes Agent → Open WebUI.
Important
The Docker-networking gotcha that defeats everyone: Open WebUI (and Hermes) run in their own containers, so http://localhost:8000/v1 points at their container, not your Spark host, and the connection fails even though vLLM is up. Use one of the options below. On Linux without Docker Desktop, host.docker.internal does not resolve unless the container was started with --add-host=host.docker.internal:host-gateway.
| From a container, reach a host-published port via | Use when |
|---|---|
| http://<SPARK_LAN_IP>:PORT/v1 (find it with hostname -I) | always works; simplest for an already-running container |
| http://172.17.0.1:PORT/v1 | the container is on the default docker0 bridge |
| http://host.docker.internal:PORT/v1 | only if started with --add-host=host.docker.internal:host-gateway |
| http://<container-name>:PORT/v1 | both containers share a user-defined network (most robust) |
Putting both on one network is the cleanest long-term fix (works because vLLM serves on --host 0.0.0.0):
docker network create ai 2>/dev/null
docker network connect ai qwen36-35b
docker network connect ai open-webui
# then address vLLM as http://qwen36-35b:8000/v1
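To find out which of the addresses in the table actually works from inside a given container, a small probe can help. The sketch below uses only the Python standard library and assumes the container image has Python 3 available. The helper names `candidate_base_urls` and `list_models` are illustrative, not part of vLLM or Open WebUI. It relies on the OpenAI-compatible `/v1/models` response carrying a `data` list of objects with an `id` field.

```python
import json
import urllib.error
import urllib.request


def candidate_base_urls(port=8000, lan_ip=None, container_name=None):
    """Base URLs to try from inside a container, in the order of the table above."""
    urls = []
    if lan_ip:
        urls.append(f"http://{lan_ip}:{port}/v1")            # host LAN address
    urls.append(f"http://172.17.0.1:{port}/v1")              # default docker0 bridge gateway
    urls.append(f"http://host.docker.internal:{port}/v1")    # needs --add-host=...:host-gateway
    if container_name:
        urls.append(f"http://{container_name}:{port}/v1")    # shared user-defined network
    return urls


def list_models(base_url, timeout=5.0):
    """Return the model ids served at base_url, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None  # DNS failure, refused connection, timeout, HTTP error, or non-JSON body
    if not isinstance(payload, dict):
        return None
    return [m.get("id") for m in payload.get("data", []) if isinstance(m, dict)]


# Example (run inside the client container; the IP below is a placeholder):
#   for url in candidate_base_urls(lan_ip="192.168.1.50", container_name="qwen36-35b"):
#       print(url, "->", list_models(url))
```

The first URL that returns a model list is the one to paste into the frontend; `None` means that route is not reachable from that container.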
11.1 Open WebUI → your model
In Open WebUI: Admin Settings → Connections → OpenAI → Manage (wrench icon) → Add New Connection → Standard / Compatible tab:
- API URL: http://<SPARK_LAN_IP>:8000/v1 (not localhost; see above)
- API Key: none (vLLM has no key unless you set --api-key / VLLM_API_KEY; any placeholder works)
Pick the model (it appears under its served name, e.g. nvidia/Qwen3.6-35B-A3B-NVFP4) and chat.
Tip
A big model can take longer to load than Open WebUI’s 10 s model-list fetch timeout. If the connection saves but no models appear, recreate Open WebUI with -e AIOHTTP_CLIENT_TIMEOUT_MODEL_LIST=30. Sanity-check reachability first from inside the container: docker exec -it open-webui curl -s http://<SPARK_LAN_IP>:8000/v1/models. If it lists the model, the UI will connect with the same URL.
11.2 Hermes Agent → your model → Open WebUI
Hermes Agent (Nous Research) is a separate install, not just a connection — it adds terminal, file ops, web search, memory, and skills on top of your model:
- Install Hermes Agent (quickstart); confirm with hermes --version.
- Point its brain at vLLM: in Hermes’s config (~/.hermes/.env), set its model-provider base URL to your Spark endpoint (http://<SPARK_LAN_IP>:8000/v1), model nvidia/Qwen3.6-35B-A3B-NVFP4, key none. (Exact variable names are in the Hermes quickstart; it accepts any OpenAI-compatible endpoint.)
- Enable Hermes’s own API server so Open WebUI can reach it. Add to ~/.hermes/.env:
    API_SERVER_ENABLED=true
    API_SERVER_KEY=your-secret-key
    API_SERVER_HOST=0.0.0.0 # default 127.0.0.1 is localhost-only; 0.0.0.0 lets the open-webui container reach it
- Run the gateway (keep it up with tmux/screen or a systemd service): hermes gateway → listens on :8642.
- Add Hermes as a second connection in Open WebUI (same Connections → OpenAI panel): URL http://<SPARK_LAN_IP>:8642/v1, API Key = your API_SERVER_KEY. The hermes-agent model appears in the dropdown; selecting it routes through the agent, with inline tool indicators as it works.
Note
For the agent to use tools reliably, the underlying model needs solid function-calling. On the vLLM side add --enable-auto-tool-choice --tool-call-parser qwen3_xml (Qwen3.6 has strong tool use; this is the same pair baked into the §7.1.1 recipe). If the Hermes connection tests OK but no model loads, it’s almost always a missing /v1 suffix; verify with curl http://<SPARK_LAN_IP>:8000/v1/models.
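A quick way to confirm the tool-calling flags took effect is to send one request with `tool_choice` set to `"auto"` before wiring up an agent. The sketch below builds such a request in the standard OpenAI chat-completions shape. The `get_time` tool and both helper names are hypothetical, chosen only for illustration.

```python
import json
import urllib.request


def build_tool_call_request(model, user_message):
    """Chat-completions body that asks the server to choose tools automatically."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_time",  # hypothetical tool, for illustration only
                "description": "Return the current time in a given IANA timezone.",
                "parameters": {
                    "type": "object",
                    "properties": {"timezone": {"type": "string"}},
                    "required": ["timezone"],
                },
            },
        }],
        "tool_choice": "auto",
        "max_tokens": 256,
    }


def post_chat(base_url, body, api_key="none", timeout=120):
    """POST the body to <base_url>/chat/completions and return the parsed JSON reply.

    Raises urllib.error.HTTPError on a non-2xx status (for example the HTTP 400
    listed in the troubleshooting table when the tool flags are missing).
    """
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Example (placeholder address; not executed here):
#   body = build_tool_call_request("nvidia/Qwen3.6-35B-A3B-NVFP4", "What time is it in Tokyo?")
#   reply = post_chat("http://192.168.1.50:8000/v1", body)
#   print(reply["choices"][0]["message"].get("tool_calls"))
```

A reply whose message contains a `tool_calls` entry means the parser is working; an HTTP 400 mentioning `--enable-auto-tool-choice` points back at the launch flags.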
12. Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| permission denied ... docker.sock | Not in docker group | sudo usermod -aG docker $USER && newgrp docker |
| OOM with free RAM showing | Page cache holds unified memory | sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' before launch |
| OOM during load/serve | Util too high / context too long | Lower --gpu-memory-utilization, reduce --max-model-len, or pick a smaller/more-quantized model |
| 401 / gated download fails | Missing token / license | export HF_TOKEN=..., accept the license on the HF page |
| unrecognized arguments: serve <model> | Added vllm serve with a vllm/vllm-openai image (entrypoint is already vllm serve) | Drop vllm serve; pass the model handle + flags directly. Keep vllm serve only for nvcr.io/nvidia/vllm:*. See §6 |
| HTTP 400: "auto" tool choice requires --enable-auto-tool-choice and --tool-call-parser | Agent/client sent tool_choice:"auto" but the server was launched without tool-calling flags (model loads, /v1/models is 200, only chat calls 400) | Add --enable-auto-tool-choice + a matching --tool-call-parser (e.g. qwen3_xml for Qwen3.6), then docker rm -f <name> and relaunch. See §7.1.1 |
| model type ... not recognized | Container too old for the arch | Newer NGC tag or vllm-openai:nightly; check vllm --version |
| unknown quantization / weights mis-load | Wrong quant flag | ModelOpt → --quantization modelopt; compressed-tensors → omit it |
| Stable image won’t run on GB10 | No sm_121 support | Use cu130-nightly / NGC image |
| Output repetition loops | --kv-cache-dtype fp8 on a model that dislikes it | Remove it (model-specific) |
| ~9× slower MoE | --enable-chunked-prefill on SSM+MoE | Remove it |
| First request slow, rest fine | JIT warmup | Pre-warm (§9) |
| rmi won’t delete image | Tag omitted | docker stop <name> && docker rm <name> then docker rmi <repo>:<tag> |
13. Scaling to multiple Sparks (brief)
A single Spark handles models up to ~100 GB of weights. For larger models or higher TP, link Sparks over the 200 Gb/s QSFP (ConnectX-7) ports and run a Ray cluster with --tensor-parallel-size = number of GPUs:
- 2 Sparks: direct QSFP cable. 3+: through a switch.
- Bind NCCL to the QSFP interface (NCCL_SOCKET_IFNAME=enP2p1s0f1np1); an Ethernet fallback costs 10–20× throughput.
- Mount the same ~/vllm on every node and stage weights on each.
- Easiest: the mark-ramsey-ri/vllm-dgx-spark scripts, or NVIDIA’s spark_cluster_setup.sh + the official multi-Spark playbook.
14. Speculative decoding on Spark: MTP & DFlash (step-by-step)
A small, cheap drafter proposes the next few tokens; the big model verifies them in one pass, and accepted tokens are free. It’s a latency win that shines at low concurrency, exactly Spark’s profile, and can speed up interactive decode by roughly 2–3× with no quality loss. Three approaches matter on Spark, easiest first:
- n-gram (prompt-lookup) — zero setup, any model, any image. Free speed when your output repeats the input (code edits, RAG, JSON, agent loops); nothing on novel prose. Start here — it costs nothing to add.
- MTP — one flag on a model that ships MTP heads (Qwen3.6, gemma4-coder). No download; broad, modest speedup.
- DFlash — a trained block-diffusion drafter (needs vLLM ≥ 0.21 + an sm121 build). The biggest win (up to ~6×) on code/reasoning, but per-target setup.
Rule of thumb: try n-gram first; reach for MTP if the model has it; invest in DFlash for maximum single-stream speed on one specific model.
14.1 Do they need special models, Docker, or configs?
| | n-gram | MTP | DFlash |
|---|---|---|---|
| Needs a drafter / model support? | No — any model | built-in MTP heads, or a paired draft (gemma4-coder) | Yes — a drafter trained for your target |
| Needs a special image? | No — even stock NGC 26.04 | No — standard Spark images | Usually — vLLM ≥ 0.21.0 + sm_121 build (nightly or prebuilt) |
| Config | '{"method":"ngram","num_speculative_tokens":4,"prompt_lookup_min":2,"prompt_lookup_max":5}' | '{"method":"mtp","num_speculative_tokens":1}' | '{"method":"dflash","model":"<drafter>","num_speculative_tokens":N}' |
| KV cache | any (fp8 ok) | per recipe | BF16 only (no fp8) |
| Best for | output that echoes input — code edits, RAG, JSON, agent loops | general decode on MTP-capable models | code/reasoning, whole-block accepts |
| Speedup | 1.5–2× on repetitive output; ~0 on novel prose | ~1.6–1.8× | up to ~6×; ~2.2–2.7× on Spark |
Note
MTP and DFlash both consume KV cache for speculative tokens, so they trade peak throughput for latency. Keep them on for interactive use; for bursty/concurrent serving add --speculative-disable-by-batch-size 32.
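Hand-typing the JSON in the Config row is a common source of quoting mistakes. One option is to generate the flag and let the standard library handle escaping. The helper name below is illustrative; the keys and values are the ones from the table, and the DFlash drafter id is the one this guide pairs with §7.1.1.

```python
import json
import shlex


def speculative_config_flag(method, num_speculative_tokens, **extra):
    """Return ['--speculative-config', '<compact json>'] for use in an argv list."""
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be >= 1")
    config = {"method": method, "num_speculative_tokens": num_speculative_tokens}
    config.update(extra)  # e.g. prompt_lookup_min/max for n-gram, model for DFlash
    return ["--speculative-config", json.dumps(config, separators=(",", ":"))]


ngram = speculative_config_flag("ngram", 4, prompt_lookup_min=2, prompt_lookup_max=5)
mtp = speculative_config_flag("mtp", 1)
dflash = speculative_config_flag("dflash", 15, model="z-lab/Qwen3.6-35B-A3B-DFlash")

# shlex.join produces a shell-safe string you can paste after a recipe.
for flag in (ngram, mtp, dflash):
    print(shlex.join(flag))
```

Passing the list form straight to `subprocess.run` avoids shell quoting entirely; the `shlex.join` output is for pasting into a `docker run` line.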
Which models in this guide can use which method:
| Model (this guide) | MTP | DFlash | How |
|---|---|---|---|
| Qwen3.6-35B-A3B-NVFP4 (§7.1.1) | ✅ built-in | ✅ z-lab/Qwen3.6-35B-A3B-DFlash | MTP in its recipe; DFlash (n=15) ~50 tok/s via a DFlash-capable sm121 build — see the §7.1.1 TIP |
| Qwen3.6-27B-NVFP4 (§7.1.2) | ✅ built-in (validate) | ✅ | MTP flag, or z-lab Qwen3.6-27B DFlash via the AEON Ultimate image |
| gemma-4-12B-coder (§7.2.3) | ✅ paired draft | — | bundled /model/assistant + kv fp8 |
| Gemma-4-31B-IT-NVFP4 (§7.2.1) | — | ✅ | z-lab/gemma-4-31B-it-DFlash (gemma4-dflash image) |
| Gemma-4-26B-A4B-NVFP4 (§7.2.2) | — | ✅ | z-lab/gemma-4-26B-A4B-it-DFlash (gemma4-dflash image) |
| Nemotron-3-Nano (§7.3.1) | — | — | card ships no speculator; run dense |
(Empty = no published speculator for that exact checkpoint today; “validate” = architecture supports it but confirm acceptance on your build.)
n-gram works on every model above — it needs no model support, just append its flag (§14.2); it’s simply less effective the less your output echoes your input.
14.2 n-gram (prompt-lookup) — zero everything
The cheapest speculative decoding there is: no drafter, no special image, no model support. vLLM scans the prompt + text-so-far for a repeat of the last few tokens and proposes whatever followed them last time. When your output reuses spans of the input — editing a file (most lines unchanged), quoting retrieved context (RAG), reformatting, structured/JSON output, or agent loops that echo schemas — acceptance is high and the speed-up is free, even on the stock NGC 26.04 image with any checkpoint.
# append to ANY §7 recipe — nothing else changes
--speculative-config '{"method":"ngram","num_speculative_tokens":4,"prompt_lookup_min":2,"prompt_lookup_max":5}'
num_speculative_tokens = tokens proposed per step (3–5 typical). prompt_lookup_min/max = shortest/longest history n-gram to match (smaller min = more, riskier matches). It works with fp8 KV cache and needs no ≥ 0.21 build — it’s core vLLM. The catch is the flip side of its strength: on novel prose there’s nothing to look up, so acceptance ≈ 0 — turn it off there, or use MTP/DFlash.
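To make the lookup concrete, here is a toy version of the idea in plain Python. It is a teaching sketch, not vLLM's implementation: it works on a list of string tokens, tries the longest suffix first, and takes the most recent earlier match. The parameter names mirror the config keys above only for readability.

```python
def propose_ngram_draft(tokens, num_speculative_tokens=4,
                        prompt_lookup_min=2, prompt_lookup_max=5):
    """Toy prompt-lookup drafter.

    Takes the last n tokens (longest n first), looks for an earlier occurrence
    of that n-gram in the history, and proposes the tokens that followed it.
    Returns an empty list when nothing matches (the "novel prose" case).
    """
    longest = min(prompt_lookup_max, len(tokens) - 1)
    for n in range(longest, prompt_lookup_min - 1, -1):
        suffix = tokens[-n:]
        # Scan backwards so the most recent earlier occurrence wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                return tokens[start + n:start + n + num_speculative_tokens]
    return []


# Output that echoes earlier text: the suffix "= foo (" already appeared once.
history = ["x", "=", "foo", "(", "a", ")", "\n", "y", "=", "foo", "("]
print(propose_ngram_draft(history))

# Novel text: no repeated n-gram, so there is nothing to propose.
print(propose_ngram_draft(["a", "b", "c", "d"]))
```

The first call proposes the four tokens that followed `= foo (` earlier in the history; the second returns an empty list, which is the zero-acceptance case described above.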
Tip
A close cousin, "method":"suffix" (suffix decoding), keeps a small suffix-tree cache instead of a fixed-length lookup and often accepts more on repetitive output: --speculative-config '{"method":"suffix","num_speculative_tokens":8}'. Also drafter-free and auto-resolved; try it if n-gram’s acceptance is only lukewarm on your workload.
14.3 MTP — zero-download (≈5 min)
Just a flag on an MTP-capable model (the §7.1.1 Qwen3.6 and §7.2.3 gemma4-coder recipes already use it):
Note
Two flavors of MTP. Most MTP models (Qwen3.6, DeepSeek-style) have the prediction heads baked into the checkpoint: nothing extra to download, just the flag. A few ship a small paired draft model instead: gemma4-coder bundles a 0.4 B draft in assistant/, so you point "model" at it (/model/assistant); see §7.2.3. Both use "method":"mtp"; only the second needs the "model" field.
docker run -d --name mtp --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 -e CUTE_DSL_ARCH=sm_121a \
vllm/vllm-openai:nightly \
nvidia/Qwen3.6-35B-A3B-NVFP4 --port 8000 --trust-remote-code \
--quantization modelopt --kv-cache-dtype fp8 --moe-backend marlin \
--gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4 \
--reasoning-parser qwen3 \
--speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'
num_speculative_tokens is usually 1 (some recipes use 3). For lowest latency at low concurrency, the vLLM recipe suggests disabling prefix caching. MTP reduces throughput under load — pair with --speculative-disable-by-batch-size 32.
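How much a larger num_speculative_tokens buys can be estimated with the standard simplification from the speculative-decoding literature: assume every draft token is accepted independently with the same probability. Real acceptance varies by position and content, so treat this as a rough guide only. The function name is illustrative.

```python
def expected_tokens_per_step(acceptance_rate, num_speculative_tokens):
    """Expected tokens emitted per verify step under i.i.d. acceptance.

    ASSUMPTION: each draft token is accepted independently with probability
    acceptance_rate; the target model always contributes one token per step.
    Closed form: (1 - a**(K + 1)) / (1 - a).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_speculative_tokens < 0:
        raise ValueError("num_speculative_tokens must be >= 0")
    if acceptance_rate == 1.0:
        return float(num_speculative_tokens + 1)
    a, k = acceptance_rate, num_speculative_tokens
    return (1.0 - a ** (k + 1)) / (1.0 - a)


for k in (1, 2, 3, 4, 5):
    print(f"K={k}: ~{expected_tokens_per_step(0.8, k):.2f} tokens/step at 80% acceptance")
```

At 80% acceptance and K=3 the formula gives about 2.95 tokens per step, in line with the roughly 3 tokens per step reported in §10. The gain from each extra draft token shrinks as K grows, which is one reason MTP recipes keep K small.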
14.4 DFlash on Spark — plug-and-play container (easiest)
The ghcr.io/aeon-7/vllm-dflash image is a prebuilt sm_121 vLLM with DFlash baked in; it serves a 27B dense Qwen3.5 with a block-diffusion drafter. Real Spark decode is content-dependent: ~60 tok/s on code/reasoning, ~30 on free prose (vs ~10 stock-eager); random/adversarial text shows no speedup.
# 1) download the target into ~/vllm
pip install -U "huggingface_hub[cli]"; export HF_TOKEN=hf_xxx
hf download AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
--local-dir ~/vllm/DFlash-Qwen3.5-27B-Uncensored-NVFP4
# 2) make a PERSISTENT API key — save it; you need the SAME value for every request
export VLLM_API_KEY=$(openssl rand -hex 32); echo "export VLLM_API_KEY=$VLLM_API_KEY" >> ~/.bashrc
docker run -d --name vllm-dflash --runtime nvidia --network host --ipc host \
--restart unless-stopped --ulimit memlock=-1:-1 \
-v ~/vllm/DFlash-Qwen3.5-27B-Uncensored-NVFP4:/models/target:ro \
-v ~/vllm/drafter-cache:/models/drafter-cache \
-e MODEL_PATH=/models/target \
-e SERVED_MODEL_NAME=dflash-qwen3.5-27b \
-e DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash \
-e DFLASH_NUM_SPEC_TOKENS=15 \
-e MAX_MODEL_LEN=65536 -e MAX_NUM_SEQS=16 \
-e GPU_MEMORY_UTILIZATION=0.85 -e MAX_NUM_BATCHED_TOKENS=32768 \
-e VLLM_API_KEY="$VLLM_API_KEY" -e HF_TOKEN="$HF_TOKEN" \
ghcr.io/aeon-7/vllm-dflash:latest
docker logs -f vllm-dflash # ~5–7 min cold boot
# 3) test (note the Bearer token)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" -H "Authorization: Bearer $VLLM_API_KEY" \
-d '{"model":"dflash-qwen3.5-27b","messages":[{"role":"user","content":"Write a haiku about GPUs."}],"max_tokens":200}'
Key env vars: DFLASH_DRAFTER (HF id of the drafter; empty = plain vLLM), DFLASH_NUM_SPEC_TOKENS (15 single-stream), KV_CACHE_DTYPE (stays BF16 with DFlash). AEON’s tuned Spark defaults are MAX_NUM_SEQS=16 / MAX_NUM_BATCHED_TOKENS=32768 for the 128 GB pool — lower MAX_NUM_SEQS if you want pure single-user latency. Persist VLLM_API_KEY (don’t inline a one-shot $(openssl …) you can’t reproduce).
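If client scripts also need the key, one way to keep it stable is to store it in a file and reuse it, rather than regenerating it. The sketch below is an assumption-laden example: the helper name and the key-file location are hypothetical, and `secrets.token_hex(32)` is used because it yields the same 64-hex-character shape as `openssl rand -hex 32`.

```python
import os
import secrets
from pathlib import Path


def load_or_create_api_key(path):
    """Return the key stored at path, creating a new 64-hex-char key if absent."""
    key_file = Path(path).expanduser()
    if key_file.exists():
        existing = key_file.read_text(encoding="utf-8").strip()
        if existing:
            return existing  # reuse: the server and every client must share this value
    key_file.parent.mkdir(parents=True, exist_ok=True)
    key = secrets.token_hex(32)
    # Mode 0o600 applies when the file is created: owner read/write only.
    fd = os.open(key_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(key + "\n")
    return key


# Example (hypothetical location; pick any path you back up):
#   key = load_or_create_api_key("~/.config/vllm/api_key")
#   headers = {"Authorization": f"Bearer {key}"}
```

Calling it twice returns the same value, so the container launch and later requests can both read the key from one place.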
Tip
AEON also ships a newer unified image, ghcr.io/aeon-7/aeon-vllm-ultimate:latest, with a Qwen3.6-27B target (AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash) at num_speculative_tokens: 12. It is currently their best-performing GB10 package (~37.6 tok/s single-stream vs ~10.5 stock-eager, with the z-lab sliding-window Qwen3.6-27B drafter). On Spark, leave the drafter’s attention backend at default; don’t set attention_backend inside the spec config.
14.5 DFlash the general way (other models)
DFlash uses vLLM’s speculators format; pass the drafter in --speculative-config. Available drafters today:
| Target (verifier) | DFlash drafter | Source |
|---|---|---|
| Qwen/Qwen3.6-35B-A3B | z-lab/Qwen3.6-35B-A3B-DFlash (~0.5B); pairs with §7.1.1 | Z Lab |
| Qwen/Qwen3.6-27B | z-lab/Qwen3.6-27B-DFlash; pairs with §7.1.2; needs PR #40898 SWA build | Z Lab |
| Qwen/Qwen3.5-27B | z-lab/Qwen3.5-27B-DFlash | Z Lab |
| Qwen/Qwen3.5-9B | z-lab/Qwen3.5-9B-DFlash | Z Lab |
| Qwen/Qwen3.5-4B · Qwen/Qwen3-8B | z-lab/Qwen3.5-4B-DFlash · z-lab/Qwen3-8B-DFlash-b16 (small/fast) | Z Lab |
| google/gemma-4-31B-it | z-lab/gemma-4-31B-it-DFlash (or RedHatAI/…speculator.dflash) | Z Lab / RedHatAI |
| google/gemma-4-26B-A4B-it | z-lab/gemma-4-26B-A4B-it-DFlash (needs gemma4-dflash image) | Z Lab |
| poolside/Laguna-XS.2 | poolside/Laguna-XS.2-speculator.dflash | poolside |
docker run --rm vllm/vllm-openai:nightly vllm --version # confirm >= 0.21.0
# Laguna XS.2 (coding; up to 7 tokens/step, ~70% acceptance on code)
docker run -d --name laguna --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -e VLLM_USE_DEEP_GEMM=0 -v ~/vllm:/models \
vllm/vllm-openai:nightly \
poolside/Laguna-XS.2 --trust-remote-code \
--enable-auto-tool-choice --tool-call-parser poolside_v1 --reasoning-parser poolside_v1 \
--speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
Note
DFlash needs vLLM ≥ 0.21.0 (PR #41880); the stock NGC 26.04 (0.19.0) won’t do it, so use nightly or a prebuilt image. VLLM_USE_DEEP_GEMM=0 is required for all DFlash (DeepGEMM is incompatible with the DFlash draft path), not just Laguna. On sm121, set the target --attention-backend triton_attn: z-lab’s own Spark builds use Triton, and flash_attn/FlashInfer crash on GB10; leave the drafter’s backend at default. Gemma-4 DFlash needs Z Lab’s build (ghcr.io/z-lab/vllm-openai:gemma4-dflash-cu130). The canonical drafter repo is github.com/z-lab/dflash. No drafter for your model? Train one: the speculators project (v0.5.0+) ships a DFlash training tutorial. The 4B/8B drafters (Qwen3.5-4B, Qwen3-8B, plus a Llama-3.1-8B drafter in the same collection) target small, fast models, where DFlash pushes single-stream into the hundreds of tok/s, useful for high-concurrency serving or draft-heavy agent loops; set num_speculative_tokens to the block size (8 for concurrency, 16 for longer accepts at concurrency 1).
14.6 Verify it’s actually accelerating (don’t fly blind)
Speculative decoding only helps if the drafter’s tokens are accepted. Confirm it, then tune:
- Check acceptance. vLLM logs a spec-decode summary (draft acceptance rate and mean accepted length per step) and exposes it on /metrics:
    curl -s http://localhost:8000/metrics | grep -i spec_decode
  Healthy MTP runs accept most proposed tokens (mean accepted length well above 1); DFlash on code can accept whole blocks. If acceptance is near zero, the method isn’t helping: the drafter/target are mismatched or the content is low-predictability (random/adversarial text).
- Tune num_speculative_tokens (K). Raise K for DFlash (it verifies a whole block in one pass; 7–15 is normal); keep K low for MTP (1–3) since each extra token costs a verify pass. Watch decode tok/s as you change it; past the acceptance sweet spot, throughput drops.
- A/B against a dense run. Time the same prompt with the spec flag removed. If tok/s isn’t clearly higher and output is identical, drop speculation for that workload.
- Concurrency check. Both methods tax throughput under load; if you serve bursts, set --speculative-disable-by-batch-size 32 so vLLM auto-disables speculation when batches grow.
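The acceptance check can be turned into a number by parsing the Prometheus text from /metrics. The sketch below is hedged on one point: the exact metric names differ between vLLM versions, so it matches on substrings (`spec_decode`, `accepted_tokens`, `draft_tokens`) instead of hard-coding names. The sample text is synthetic and only illustrates the format; confirm the real names with the grep above.

```python
def spec_decode_counters(metrics_text):
    """Sum Prometheus samples whose metric name mentions spec_decode (labels dropped)."""
    totals = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if "spec_decode" not in name or name.endswith("_created"):
            continue  # *_created samples are timestamps, not counts
        try:
            totals[name] = totals.get(name, 0.0) + float(value)
        except ValueError:
            continue
    return totals


def acceptance_rate(totals):
    """accepted draft tokens / proposed draft tokens, or None if nothing was drafted.

    ASSUMPTION: counters are identified by substring; adjust for your vLLM version.
    """
    accepted = sum(v for k, v in totals.items()
                   if "accepted_tokens" in k and "per_pos" not in k)
    drafted = sum(v for k, v in totals.items() if "draft_tokens" in k)
    return accepted / drafted if drafted else None


# Synthetic sample in Prometheus text format (numbers are made up).
SAMPLE = """\
# TYPE vllm:spec_decode_num_draft_tokens_total counter
vllm:spec_decode_num_draft_tokens_total{model_name="demo"} 300.0
vllm:spec_decode_num_accepted_tokens_total{model_name="demo"} 240.0
vllm:spec_decode_num_draft_tokens_created{model_name="demo"} 1.7e9
vllm:num_requests_running{model_name="demo"} 1.0
"""
print(acceptance_rate(spec_decode_counters(SAMPLE)))
```

Feed it the body of `curl -s http://localhost:8000/metrics` before and after a test prompt and compare the two readings; the counters are cumulative, so the difference between them is what reflects the prompt you just ran.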
14.7 The gotchas that bite people
Caution
- DFlash floor: vLLM ≥ 0.21.0 (AEON-7/Z-Lab image or nightly). MTP is fine on 0.19.0+.
- VLLM_USE_DEEP_GEMM=0 for DFlash: DeepGEMM is incompatible with the DFlash draft path (silent failure otherwise).
- On sm121, target --attention-backend triton_attn: flash_attn/FlashInfer crash on GB10; leave the drafter’s backend at default.
- DFlash KV cache stays BF16: never combine with --kv-cache-dtype fp8. (The MTP gemma4-coder draft is the opposite: it needs fp8 KV.) Relatedly, FP8-KV checkpoints emit scrambled output on non-Hopper GPUs (Spark included), e.g. Laguna-XS.2-FP8/INT4; use the BF16 checkpoint or disable FP8 KV.
- One drafter per target: a DFlash drafter is trained for a specific model.
- Gemma 4 paired-draft MTP needs current vLLM: if startup logs show SpeculativeConfig(method='draft_model', …) for a gemma4-coder assistant/ checkpoint, your vLLM build lacks Gemma-4 MTP support for that path; upgrade rather than forcing it through generic draft-model decoding.
- Higher K is cheap with DFlash (whole block in one pass), not with MTP: keep MTP num_speculative_tokens low (1–3); DFlash runs 7–15.
- Both hurt throughput under load: add --speculative-disable-by-batch-size 32.
- Acceptance is content-dependent: big on code/structured output, smaller on prose, none on random text.
15. Sources
- NVIDIA DGX Spark vLLM playbook: https://build.nvidia.com/spark/vllm/instructions
- NVIDIA dgx-spark-playbooks: https://github.com/NVIDIA/dgx-spark-playbooks
- vLLM team blog “vLLM on the DGX Spark” (Jun 2026): https://vllm.ai/blog/2026-06-01-vllm-dgx-spark
- vLLM speculative decoding / MTP / DFlash docs: https://docs.vllm.ai/en/latest/features/speculative_decoding/ · https://docs.vllm.ai/projects/speculators/en/latest/user_guide/algorithms/dflash/
- DFlash paper (Z Lab, arXiv 2602.06036): https://arxiv.org/abs/2602.06036 · Z Lab DFlash drafters: https://github.com/z-lab/dflash · AEON-7/vllm-dflash: https://github.com/AEON-7/vllm-dflash
- Model cards: nvidia/Qwen3.6-35B-A3B-NVFP4, unsloth/Qwen3.6-27B-NVFP4, nvidia/Gemma-4-31B-IT-NVFP4, bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 (ships gemma4_patched.py), nvidia/Gemma-4-26B-A4B-NVFP4, sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4, nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4, nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4, Qwen/Qwen3-Coder-Next-FP8 (all on https://huggingface.co)
- NVIDIA Spark benchmarking playbook (vllm bench / trtllm-bench): https://github.com/NVIDIA/dgx-spark-playbooks (connect-two-sparks/performance_benchmarking_guide.md)
- NVIDIA Spark speculative-decoding playbook: https://build.nvidia.com/spark/speculative-decoding
- Single-Spark model picks & extra cards: RedHatAI/Qwen3.5-122B-A10B-NVFP4, z-lab DFlash drafters (all on https://huggingface.co)
- NGC vLLM container tags: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm
Compiled June 2026. Performance charts are measured single-Spark numbers: the §2 decode chart and the §10 Qwen3.6 chart are this guide’s own bench-script runs on one DGX Spark; the §2 DFlash chart is AEON-7’s published Spark measurement. Container tags and model handles move fast — pin a known-good image digest for anything you depend on.