[!IMPORTANT] Where models live. Every command in this guide stores model weights in ~/vllm on the host and mounts it into the container at /models (via HF_HOME=/models). Download once, mount everywhere. A couple of community models that bundle a speculative-decode draft use a per-model local path under ~/vllm/<name> instead; those are flagged inline.
A practical, end-to-end guide for serving LLMs with vLLM on a DGX Spark (GB10 Grace Blackwell), assembled from NVIDIA’s official playbook, the vLLM team’s deep-dive, model cards, and battle-tested community setups.
[!WARNING] Flag choices on Spark are model- and image-specific, not hardware-wide defaults. The recipes below are starting points that worked for their authors against a specific container tag. Validate against the exact image you run, and pin a known-good tag/digest for anything you depend on. Copying a flag from one model’s recipe to another can silently regress throughput or output quality.
1. The landscape: current playbooks & walkthroughs
| Source | Best for | Link |
|---|---|---|
| NVIDIA official playbook | Canonical install steps; single / stacked / switched / troubleshooting tabs | https://build.nvidia.com/spark/vllm/instructions |
| NVIDIA dgx-spark-playbooks | Source-of-truth repo; benchmarking + cluster bootstrap | https://github.com/NVIDIA/dgx-spark-playbooks |
| DeepWiki: vLLM on Spark | Clean model support matrix + serving params | https://deepwiki.com/NVIDIA/dgx-spark-playbooks/4.2-vllm |
| vLLM team blog (Jun 2026) — authoritative config deep-dive | Why each flag matters, unified-memory behavior, JIT pre-warm | https://vllm.ai/blog/2026-06-01-vllm-dgx-spark |
| mark-ramsey-ri/vllm-dgx-spark | 1-to-N Spark orchestration, 41 model presets | https://github.com/mark-ramsey-ri/vllm-dgx-spark |
| AEON-7/vllm-dflash | Plug-and-play DFlash container for Spark | https://github.com/AEON-7/vllm-dflash |
| vLLM Recipes index | Per-model launch commands, kept current | https://recipes.vllm.ai/ |
| NGC vLLM container tags | Latest nvcr.io/nvidia/vllm build | https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm |
2. Hardware facts that drive every config
DGX Spark is a GB10 Grace Blackwell SoC with 128 GB unified memory shared by CPU and GPU, on a consumer-Blackwell sm_121 GPU, ~273 GB/s bandwidth, and an ARM64 (aarch64) host. Four consequences shape everything below:
- One memory pool for everything. Weights, KV cache, the OS, page cache, and the runtime all draw from the same 128 GB. --gpu-memory-utilization is a fraction of that shared pool: leave headroom or you OOM with “free” RAM still showing.
- Memory-bandwidth bound for decode. Spark shines at small-batch, single-user interactive serving. Keep concurrency low (--max-num-seqs ≈ 1–4).
- NVFP4 MoE is the sweet spot. Quantized MoE with ~3–13 B active params (even at 25–120 B total) is the best fit: small active set = fast decode, and 4-bit weights free unified memory for KV cache. Prefer NVFP4/FP8/MXFP4 over BF16.
- Use sm_121-validated builds. The upstream stable vLLM image does not support GB10. Use the CUDA-13 nightly track (vllm/vllm-openai:cu130-nightly / :nightly) or NVIDIA’s NGC container (nvcr.io/nvidia/vllm:26.02-py3+).
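The shared-pool and bandwidth points above lend themselves to quick arithmetic. The sketch below is a back-of-envelope aid, not a measurement: kv_budget_gb and decode_ceiling_tps are hypothetical helper names, the pool is treated as exactly 128 GB, and the decode ceiling assumes every active weight is read once per token at about 0.5 byte per 4-bit parameter (quantization scales, KV-cache reads, and kernel overhead are ignored, so real numbers will be lower).

```bash
# Hypothetical helpers; both are rough estimates under the assumptions stated above.

# Memory left for KV cache + runtime after weights, in GB.
# usage: kv_budget_gb <gpu_memory_utilization> <weights_gb> [pool_gb]
kv_budget_gb() {
  awk -v util="$1" -v weights="$2" -v pool="${3:-128}" \
    'BEGIN { printf "%.1f\n", pool * util - weights }'
}

# Bandwidth-only decode ceiling in tokens/s (active weights read once per token).
# usage: decode_ceiling_tps <active_params_billions> <bytes_per_param> [bandwidth_gb_s]
decode_ceiling_tps() {
  awk -v active="$1" -v bytes="$2" -v bw="${3:-273}" \
    'BEGIN { printf "%.1f\n", bw / (active * bytes) }'
}

kv_budget_gb 0.85 19        # ~19 GB checkpoint at 0.85 utilization -> 89.8
kv_budget_gb 0.85 65        # ~65 GB checkpoint                     -> 43.8
decode_ceiling_tps 3 0.5    # 3B active params at 4 bits            -> 182.0
decode_ceiling_tps 12 0.5   # 12B active params at 4 bits           -> 45.5
```

The gap between the two ceilings is the reason a small active parameter count matters more than total size for interactive decode on this hardware.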
3. Storing all models in ~/vllm
vLLM fetches weights through the Hugging Face stack, which caches under the directory named by HF_HOME. Point that at a host directory and mount it, and every model lives in ~/vllm, shared across containers.
mkdir -p ~/vllm
Pre-stage on the host (recommended) — files end up owned by you, the download happens once, and the first vllm serve doesn’t stall on a multi-GB pull:
pip install -U "huggingface_hub[cli]" # one-time
export HF_HOME=~/vllm
export HF_TOKEN=hf_xxx # for gated models; accept the license on the HF page first
hf download nvidia/Qwen3.6-35B-A3B-NVFP4
The mount, in every docker run:
-e HF_HOME=/models \
-v ~/vllm:/models \
Inside the container HF now caches at /models (= ~/vllm on host); pre-staged weights at ~/vllm/hub/... appear at /models/hub/... automatically.
[!TIP] Force fully-local loading (air-gapped, no network checks) with -e HF_HUB_OFFLINE=1 once weights are staged. If you let the container (root) download instead of pre-staging, files in ~/vllm are root-owned; add --user $(id -u):$(id -g) if that matters.
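Before switching to offline mode it helps to confirm what is actually staged. A minimal sketch, assuming the usual Hugging Face hub cache layout of one models--<org>--<name> directory per repo under <HF_HOME>/hub; list_staged_models is a hypothetical helper name, not part of any tool.

```bash
# Hypothetical helper: print the repo ids already staged under an HF_HOME-style directory.
# Assumes the hub cache layout <HF_HOME>/hub/models--<org>--<name>.
# usage: list_staged_models [hf_home_dir]     (defaults to ~/vllm)
list_staged_models() {
  hub_dir="${1:-$HOME/vllm}/hub"
  for entry in "$hub_dir"/models--*; do
    [ -d "$entry" ] || continue      # skips the unexpanded glob when nothing is staged
    # Strip the "models--" prefix, then turn the first "--" into the org/name slash.
    basename "$entry" | sed -e 's|^models--||' -e 's|--|/|'
  done
}

# Example (not run here): list_staged_models ~/vllm
```

Models fetched with --local-dir (as in the gemma4-coder recipe) live outside hub/ and will not show up in this listing.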
4. Is volume mounting necessary?
Yes — treat it as mandatory. A container filesystem is ephemeral: a freshly downloaded 27–120 B model is destroyed when the container is removed, so without a mount you re-pull many GB and re-pay the 10–15 min weight-load on every run. The vLLM team’s guidance is “download once, mount everywhere.”
The one other volume worth adding is the torch.compile cache (the slow part of cold start), which lives separately from your models at /root/.cache/vllm:
-v ~/vllm-compile-cache:/root/.cache/vllm \
[!NOTE] The compile cache only pays off when you recreate a container (image/flag change); a long-running --restart unless-stopped container compiles once anyway. It’s keyed to GPU arch + vLLM version + model + flags, so clear it (rm -rf ~/vllm-compile-cache/*) if a stale entry causes a startup hiccup after an upgrade. --ipc=host / --shm-size is shared memory (RAM), not a -v volume.
5. One-time host setup
nvidia-smi # GPU + driver visible
docker ps # else: sudo usermod -aG docker $USER && newgrp docker
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
mkdir -p ~/vllm
export HF_TOKEN=hf_xxx # gated models
# NGC pulls 401? -> docker login nvcr.io (user "$oauthtoken", password = NGC API key)
[!WARNING] Unified-memory OOM valve. Because CPU and GPU share DRAM, the Linux page cache can hold memory CUDA can’t reclaim — an “OOM” well under 128 GB. If a big model fails to load after heavy file activity, flush caches first:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
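To judge whether a flush is worth doing, it can help to see how much of the pool the page cache currently holds. The sketch below only reads /proc/meminfo; mem_summary_gib is a hypothetical helper name, and reading a large "cached" figure as a sign that a flush may help is this guide's heuristic, not a guarantee.

```bash
# Hypothetical helper: summarize free, page-cache, and available memory in GiB.
# /proc/meminfo reports values in kB; pass another file to test against a sample.
# usage: mem_summary_gib [meminfo_file]
mem_summary_gib() {
  awk '
    /^MemFree:/      { free = $2 }
    /^Cached:/       { cached = $2 }
    /^MemAvailable:/ { avail = $2 }
    END { printf "free=%.1f cached=%.1f available=%.1f\n",
                 free / 1048576, cached / 1048576, avail / 1048576 }
  ' "${1:-/proc/meminfo}"
}

# Example (not run here): mem_summary_gib
```

Running it before and after the drop_caches command shows how much memory the flush actually returned.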
6. Baseline single-Spark recipe (template)
Minimal, model-agnostic shape — swap the model handle and per-model flags from §7.
export HF_TOKEN=hf_xxx # only for gated models
docker run -d --name vllm --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-e HF_HOME=/models \
-v ~/vllm:/models \
nvcr.io/nvidia/vllm:26.04-py3 \
vllm serve <HF_MODEL_HANDLE> \
--host 0.0.0.0 --port 8000 \
--max-model-len 65536 \
--gpu-memory-utilization 0.85 \
--max-num-seqs 4
Smoke-test (first request triggers JIT warmup — see §9):
curl -sS http://localhost:8000/v1/models | jq -r '.data[0].id'
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
-d '{"model":"<HF_MODEL_HANDLE>","messages":[{"role":"user","content":"12*17"}],"max_tokens":500}'
# expect "204"
[!NOTE] Two container tracks, pick per recipe. NVIDIA’s NGC image nvcr.io/nvidia/vllm:26.04-py3 (vLLM 0.19.0) is the default for many nvidia/... checkpoints. Several recipes below use the upstream vllm/vllm-openai:nightly / :cu130-nightly track (needed for the newest architectures and for DFlash). A given model may need a newer container than you have; vllm --version inside the image tells you.
[!IMPORTANT] Quantization flag rule. For NVIDIA ModelOpt NVFP4 checkpoints (the nvidia/... repos), pass --quantization modelopt. For compressed-tensors NVFP4 (Unsloth, llm-compressor community builds), vLLM auto-detects; do not pass --quantization. Getting this wrong is a common first-run failure.
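The rule can be checked against a downloaded checkpoint rather than guessed from the repo name. The sketch below rests on an assumption about checkpoint layout that you should verify on your own files: that ModelOpt exports carry an hf_quant_config.json (or quant_method "modelopt" in config.json), and that compressed-tensors builds set quant_method "compressed-tensors" in config.json. quant_args_for is a hypothetical helper name.

```bash
# Hypothetical helper: print the --quantization arguments for a local checkpoint directory.
# ASSUMPTION: ModelOpt exports contain hf_quant_config.json (or quant_method "modelopt"
# in config.json); compressed-tensors builds set quant_method "compressed-tensors".
# usage: quant_args_for <checkpoint_dir>
quant_args_for() {
  ckpt_dir="$1"
  if [ -f "$ckpt_dir/hf_quant_config.json" ] ||
     grep -qs '"quant_method"[[:space:]]*:[[:space:]]*"modelopt"' "$ckpt_dir/config.json"; then
    echo "--quantization modelopt"
  elif grep -qs '"quant_method"[[:space:]]*:[[:space:]]*"compressed-tensors"' "$ckpt_dir/config.json"; then
    echo ""                          # auto-detected by vLLM: pass no flag
  else
    echo "unrecognized quantization layout in $ckpt_dir" >&2
    return 1
  fi
}

# Example (not run here): quant_args_for ~/vllm/gemma4-coder
```

For models in the HF cache, point it at the snapshot directory that holds config.json, not at the models--... directory itself.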
7. Per-model configurations
Grouped by family: Qwen → Gemma → NVIDIA → Mistral & other. Every model below ships a real benchmark from its card (charted inline). All commands assume ~/vllm is your model store.
Quick comparison
| Model | Publisher | Arch | Total / Active | Quant | Image | Key flags |
|---|---|---|---|---|---|---|
| Qwen3.6-35B-A3B-NVFP4 | NVIDIA (Qwen) | MoE+hybrid attn, multimodal | 35B / 3B | NVFP4 (modelopt) | vllm-openai:nightly | env vars, MTP-3, --quantization modelopt |
| Qwen3.6-27B-NVFP4 | Unsloth (Qwen) | dense + vision | 27B | NVFP4 (c-t) | vllm-openai:nightly | --dtype bfloat16, trust-remote-code |
| Qwen3.5-35B-A3B | Qwen (community) | MoE (SSM+MoE) | 35B / 3B | BF16 | cu130-nightly | qwen3_coder; no chunked-prefill |
| Gemma-4-31B-IT-NVFP4 | NVIDIA (Gemma) | dense, multimodal | 30.7B | NVFP4 (modelopt) | vllm-openai:gemma4-cu130 | --quantization modelopt, TP=1 |
| Gemma-4-26B-A4B-NVFP4 | NVIDIA (Gemma) | MoE, multimodal | 25.2B / 3.8B | NVFP4 (modelopt) | vllm-openai:gemma4-cu130 | TP=1 only, gemma4 parsers |
| gemma-4-12B-coder MTP-NVFP4 | community (Gemma) | dense coder + MTP draft | 12B | NVFP4 (c-t) | vllm-openai:nightly | bundled MTP, kv-cache fp8, thinking-on |
| DiffusionGemma-26B-A4B-NVFP4 | NVIDIA (Gemma) | diffusion MoE | 25.2B / 3.8B | NVFP4 (modelopt) | vllm-openai:gemma | V2 runner, TRITON_ATTN |
| Nemotron-3-Nano-30B-A3B-NVFP4 | NVIDIA | hybrid Mamba-2 + MoE | 30B / 3.5B | NVFP4 + fp8 KV | nvcr.io vllm:25.12.post1 | nano_v3 parser, Spark-tested |
| Nemotron-3-Super-120B-A12B-NVFP4 | NVIDIA | hybrid Mamba-Tx MoE | 120B / 12B | NVFP4 | cu130-nightly | MTP, reasoning nemotron_v3 |
| Mistral-Small-4-119B-NVFP4 | Mistral AI | MoE, multimodal | 119B / 6.5B | NVFP4 (c-t) | cu130-nightly | MLA, mistral parsers, SYSTEM_PROMPT |
| gpt-oss-120b | OpenAI | MoE | 120B | MXFP4 | nvcr.io vllm:26.04 | expert-parallel |
| Llama-3.3-70B-Instruct-NVFP4 | NVIDIA (Meta) | dense | 70B | NVFP4 | nvcr.io vllm:26.04 | gated, bandwidth-bound |
| Phi-4-multimodal-NVFP4 | NVIDIA (Microsoft) | dense, multimodal | — | NVFP4 | nvcr.io vllm:26.04 | trust-remote-code |
(c-t = compressed-tensors auto-detect; env vars = the four sm_121a exports shown in 7.1.1.)
7.1 Qwen family
7.1.1 Qwen3.6-35B-A3B-NVFP4 — flagship MoE agent (official Spark recipe)
MoE with hybrid attention, 35B total / 3B active, multimodal (text/image/video), 262K context, Apache 2.0, ~19 GB NVFP4. This is NVIDIA’s recommended Spark configuration verbatim — note the four required env vars and --quantization modelopt.
export HF_TOKEN=hf_xxx
docker run -d --name qwen36-35b --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
-e VLLM_USE_FLASHINFER_MOE_FP4=0 \
-e VLLM_FP8_MOE_BACKEND=flashinfer_cutlass \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
-e CUTE_DSL_ARCH=sm_121a \
vllm/vllm-openai:nightly \
vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 --port 8000 \
--tensor-parallel-size 1 --trust-remote-code --dtype auto \
--quantization modelopt --kv-cache-dtype fp8 \
--attention-backend flashinfer --moe-backend marlin \
--gpu-memory-utilization 0.85 --max-model-len 65536 \
--max-num-seqs 4 --max-num-batched-tokens 8192 \
--enable-chunked-prefill --async-scheduling --enable-prefix-caching \
--speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}' \
--reasoning-parser qwen3
[!CAUTION] Earlier community recipes floated --gpu-memory-utilization 0.4 and --max-model-len 262144 for this model. NVIDIA’s official card uses 0.85 / 65536 plus the four env vars above; use the official values. For agent/tool use add --enable-auto-tool-choice --tool-call-parser qwen3_coder.
7.1.2 Qwen3.6-27B-NVFP4 — flagship-level coding in a 27B dense model (Unsloth)
Dense 27B with a vision encoder (text/image/video), hybrid Gated-DeltaNet + Gated-Attention layers, MTP-trained, 262K context (extensible to ~1M via YaRN), Apache 2.0, ~18 GB NVFP4 (compressed-tensors → no --quantization flag).
docker run -d --name qwen36-27b --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models \
vllm/vllm-openai:nightly \
vllm serve unsloth/Qwen3.6-27B-NVFP4 \
--trust-remote-code --dtype bfloat16 \
--gpu-memory-utilization 0.85 --max-model-len 65536 \
--max-num-seqs 4 --reasoning-parser qwen3
[!NOTE] The card’s default --max-model-len is a conservative 4096 (“increase only after checking memory”). 65536 is comfortable on Spark; raise it to 128K or more only if you have headroom, since the model leans on long context for thinking. For coding agents add --enable-auto-tool-choice --tool-call-parser qwen3_coder. MTP is supported by the architecture: you can try --speculative-config '{"method":"mtp","num_speculative_tokens":3}' and validate acceptance on your build. Thinking is on by default; pass chat_template_kwargs={"enable_thinking":false} (or {"preserve_thinking":true} for agents) per request.
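Over raw HTTP, the per-request toggle is a chat_template_kwargs object at the top level of the request body. A minimal sketch: chat_payload is a hypothetical helper name, and it inserts the prompt verbatim, so the prompt must not contain double quotes or backslashes (use jq or a real JSON encoder for anything less trivial).

```bash
# Hypothetical helper: build a chat request that sets enable_thinking for one call.
# The prompt is inserted verbatim: no double quotes or backslashes allowed in it.
# usage: chat_payload <model> <prompt> <true|false>
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"chat_template_kwargs":{"enable_thinking":%s}}\n' \
    "$1" "$2" "$3"
}

# Example (not run here): send one request with thinking turned off.
# chat_payload unsloth/Qwen3.6-27B-NVFP4 "Reverse a linked list in Python" false |
#   curl -sS http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @-
```

The model field must match the served name, which is the HF handle unless the recipe sets --served-model-name.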
[!TIP] Beyond 262K (YaRN, up to ~1M). This checkpoint is native to 262,144 tokens; for longer contexts enable static YaRN via an --hf-overrides block and the long-len escape hatch. Static YaRN can slightly dent short-context quality, so only enable it when you actually need the length:
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
...
vllm serve unsloth/Qwen3.6-27B-NVFP4 --trust-remote-code --dtype bfloat16 \
--hf-overrides '{"text_config":{"rope_parameters":{"mrope_interleaved":true,"mrope_section":[11,11,10],"rope_type":"yarn","rope_theta":10000000,"partial_rotary_factor":0.25,"factor":4.0,"original_max_position_embeddings":262144}}}' \
--max-model-len 1010000
7.1.3 Qwen3.5-35B-A3B — coding-agent backend (community)
Popular for pointing local coding agents at Spark. Upstream nightly, BF16.
docker run -d --name qwen35 --ipc=host --restart unless-stopped \
--gpus all --shm-size 64gb -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models \
vllm/vllm-openai:cu130-nightly \
Qwen/Qwen3.5-35B-A3B \
--served-model-name qwen3.5-35b --host 0.0.0.0 --port 8000 \
--dtype bfloat16 --gpu-memory-utilization 0.9 --max-model-len 262144 \
--enable-prefix-caching --enable-auto-tool-choice \
--tool-call-parser qwen3_coder --reasoning-parser qwen3
[!WARNING] GB10 gotchas (community-verified): do not add --enable-chunked-prefill (≈9× throughput regression on SSM+MoE), and do not add --kv-cache-dtype fp8 (output-repetition loops on GB10) for this model. This is the opposite of the Qwen3.6-NVFP4 recipe, so never copy flags across models. There’s also a DFlash-accelerated dense sibling (Qwen3.5-27B) covered in §13.
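Because these two flags are easy to carry over by accident when adapting another recipe, a small pre-launch check can catch them. This is only a sketch of the idea: lint_qwen35_flags is a hypothetical helper name, and it encodes nothing beyond the two warnings stated above for this one model.

```bash
# Hypothetical lint: warn about the two options this guide advises against
# for Qwen3.5-35B-A3B on GB10. Returns 1 if any warning was printed.
# usage: lint_qwen35_flags <vllm serve arguments...>
lint_qwen35_flags() {
  status=0
  case " $* " in
    *" --enable-chunked-prefill "*)
      echo "warn: --enable-chunked-prefill (reported ~9x throughput regression)"
      status=1 ;;
  esac
  case " $* " in
    *" --kv-cache-dtype fp8 "*|*" --kv-cache-dtype=fp8 "*)
      echo "warn: --kv-cache-dtype fp8 (reported output-repetition loops)"
      status=1 ;;
  esac
  return "$status"
}

# Example (not run here):
# lint_qwen35_flags --dtype bfloat16 --enable-prefix-caching --kv-cache-dtype fp8
```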
7.2 Gemma family
7.2.1 Gemma-4-31B-IT-NVFP4 — dense multimodal (NVIDIA)
Dense 30.7B, multimodal (text/image/video), 256K context, 140+ languages, hybrid local/global attention with p-RoPE, NVIDIA Open Model License (+ Gemma terms), ~21 GB NVFP4.
export HF_TOKEN=hf_xxx
docker run -d --name gemma4-31b --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
vllm/vllm-openai:gemma4-cu130 \
vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
--quantization modelopt --tensor-parallel-size 1 \
--gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4
[!NOTE] The card uses --tensor-parallel-size 8 on a server; on Spark use TP=1. Gated: accept the Gemma license and set HF_TOKEN. For tools/reasoning add --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4.
7.2.2 Gemma-4-26B-A4B-NVFP4 — MoE multimodal (NVIDIA)
MoE 25.2B total / 3.8B active (8 of 128 experts +1 shared), multimodal, 256K context, ~14 GB NVFP4 — a small, fast, high-quality Spark pick.
docker run -d --name gemma4-26b --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
vllm/vllm-openai:gemma4-cu130 \
vllm serve nvidia/Gemma-4-26B-A4B-NVFP4 \
--quantization modelopt --tensor-parallel-size 1 --moe-backend marlin \
--trust-remote-code \
--enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 \
--gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4
[!WARNING] Per the card, this checkpoint currently works with TP=1 only in vLLM (expert-parallel is supported; tensor-parallel is not yet). MoE backend must be VLLM_CUTLASS or Marlin (FlashInfer-TRTLLM is pending a vLLM PR). Needs
--trust-remote-code.
7.2.3 gemma-4-12B-coder (MTP-NVFP4) — Python specialist with a bundled MTP draft (community)
A weight-only NVFP4 (W4A16) build of a Gemma-4-12B coding fine-tune: 8.25 GB model + a 0.85 GB bundled MTP draft for ~1.6× single-stream. Auto-detects NVFP4 (no --quantization). Because the draft lives in assistant/, download to a local path and mount it.
# download (~9 GB total) into ~/vllm
hf download sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4 \
--local-dir ~/vllm/gemma4-coder
# easiest: one GPU, just chat
docker run -d --name gemma4-coder --ipc=host --shm-size 16gb --restart unless-stopped \
--gpus all -p 8000:8000 \
-v ~/vllm/gemma4-coder:/model:ro \
vllm/vllm-openai:nightly \
--model /model --served-model-name gemma4-coder \
--max-model-len 16384 --gpu-memory-utilization 0.92 --trust-remote-code
Add the bundled MTP draft for ~1.6× interactive speed (lossless):
--kv-cache-dtype fp8 \
--speculative-config '{"method":"mtp","model":"/model/assistant","num_speculative_tokens":3}'
[!IMPORTANT] This model was trained to think first; enable it per request or quality drops: extra_body={"chat_template_kwargs":{"enable_thinking":true}}. Needs a recent nightly (registers Gemma4UnifiedForConditionalGeneration). With MTP you must use --kv-cache-dtype fp8 (NVFP4 KV breaks the draft).
[!CAUTION] The build is de-refused / not safety-aligned — add your own guardrails. It’s a superb algorithm/debug assistant but can write look-ahead bias into pandas/numpy time-series & back-test code (its reasoning sometimes states the right rule while the code does the opposite). Gate quant/accounting code; don’t ship it unreviewed.
7.2.4 DiffusionGemma-26B-A4B-NVFP4 — discrete-diffusion text gen (NVIDIA)
A diffusion LLM on the Gemma-4 26B-A4B MoE backbone (25.2B / 3.8B active) that emits parallel 256-token blocks, exceeding 1,100 tok/s at low batch (H100 FP8). Multimodal, 256K context, ~14 GB NVFP4. Uses the dedicated diffusion image.
docker run -d --name diffgemma --ipc=host --shm-size=16g --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
-e VLLM_USE_V2_MODEL_RUNNER=1 \
vllm/vllm-openai:gemma \
vllm serve nvidia/diffusiongemma-26B-A4B-it-NVFP4 \
--trust-remote-code --max-num-seqs 4 \
--attention-backend TRITON_ATTN \
--enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 \
--override-generation-config '{"max_new_tokens": null}' \
--default-chat-template-kwargs '{"enable_thinking":true}'
[!WARNING] The vllm/vllm-openai:gemma image and these flags are tentative until the supporting vLLM image is publicly released (per the card). Check the vLLM releases and the vllm/vllm-openai:gemma4 Docker Hub tags before relying on it.
7.3 NVIDIA models (proprietary)
Models NVIDIA itself designed and trained (not quantizations of someone else’s weights). Both are hybrid Mamba-2 + Transformer MoE reasoning models with native function calling: the Nano is the natural single-Spark starting point, the Super is the heavyweight.
7.3.1 NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 — best single-Spark Nemotron (Spark-tested)
A hybrid Mamba-2 + MoE (52 layers: 23 Mamba-2, 23 MoE with 128 experts +1 shared / 6 active, 6 GQA attention), 30B total / 3.5B active, text-only, 1M context (256K default), unified reasoning + non-reasoning, NVIDIA Nemotron Open Model License, ~18 GB on disk. NVFP4 weights with FP8 KV cache; attention and the Mamba layers feeding it stay BF16, and quantization-aware distillation (QAD) recovers accuracy. NVIDIA lists DGX Spark in this model’s tested hardware and ships a Spark/Jetson-specific container.
export HF_TOKEN=hf_xxx
# one-time: fetch the custom reasoning parser into ~/vllm (mounted at /models)
wget -P ~/vllm \
https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py
docker run -d --name nemotron-nano --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
-e VLLM_USE_FLASHINFER_MOE_FP4=1 \
-e VLLM_FLASHINFER_MOE_BACKEND=throughput \
nvcr.io/nvidia/vllm:25.12.post1-py3 \
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--served-model-name nemotron-3-nano --port 8000 \
--tensor-parallel-size 1 --trust-remote-code \
--max-model-len 262144 --max-num-seqs 8 \
--kv-cache-dtype fp8 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--reasoning-parser-plugin /models/nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3
[!IMPORTANT] On Spark (or Jetson Thor) use NVIDIA’s nvcr.io/nvidia/vllm:25.12.post1-py3 container for this model, and fetch the nano_v3_reasoning_parser.py plugin (downloaded into ~/vllm above, referenced at /models/...). --kv-cache-dtype fp8 is part of the official recipe here; unlike Qwen3.5, this model is built for it. --max-num-seqs 8 is NVIDIA’s Spark-tested value; drop to 4 for more KV headroom at long context.
[!NOTE] Reasoning is on by default; pass chat_template_kwargs={"enable_thinking":false} per request to turn it off. The model also supports a reasoning_budget (cap internal reasoning tokens to hit latency targets). Sampling: temperature=1.0, top_p=1.0 for reasoning; 0.6 / 0.95 for tool calling. For up to 1M context add -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 and raise --max-model-len. With only 3.5B active params and ~18 GB resident, decode is fast and KV headroom is large, which makes this an excellent default Spark model.
7.3.2 Nemotron-3-Super-120B-A12B-NVFP4 — flagship local MoE
NVIDIA’s own hybrid Mamba-Transformer LatentMoE, 120B total / 12B active, native NVFP4 pretraining, MTP, 1M context. The vLLM team’s reference Spark deployment (~23 tok/s decode; ~10–15 min first load).
docker run -d --name nemotron --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -v ~/vllm:/models \
vllm/vllm-openai:cu130-nightly \
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--served-model-name nemotron-3-super --trust-remote-code \
--max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4 \
--reasoning-parser nemotron_v3 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder
[!NOTE] Two NVIDIA-published NVFP4 checkpoints of other vendors’ models live in §7.4 because they follow those families: Llama-3.3-70B-NVFP4 (Meta) and Phi-4-multimodal-NVFP4 (Microsoft). The nvidia/... Qwen and Gemma NVFP4 checkpoints likewise live in their family groups above. This section is for models NVIDIA itself designed.
7.4 Mistral & other open models
7.4.1 Mistral-Small-4-119B-2603-NVFP4 — unified instruct + reasoning + coding (Mistral AI)
A granular MoE (128 experts, 4 active; 119B total / 6.5B active) fusing Instruct, Reasoning (Magistral), and Devstral skills, multimodal, 256K context, Apache 2.0, ~60 GB NVFP4 (fits one Spark). Compressed-tensors → no --quantization; uses MLA attention and Mistral parsers.
docker run -d --name mistral-small4 --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models \
vllm/vllm-openai:cu130-nightly \
vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
--tensor-parallel-size 1 --max-model-len 65536 \
--attention-backend TRITON_MLA \
--tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral \
--max-num-batched-tokens 16384 --max-num-seqs 4 \
--gpu-memory-utilization 0.85
[!IMPORTANT] Needs mistral_common >= 1.11.0 (bundled in recent vLLM; uv pip install -U vllm pulls it). For correct behavior, load the repo’s SYSTEM_PROMPT.txt and set reasoning_effort per request ("none" for fast replies, "high" for hard tasks; temp 0.7 with reasoning). The card’s example uses TP=2 + --max-num-seqs 128 for a multi-GPU server; on a single Spark use TP=1, --max-num-seqs 4, --max-model-len 65536.
7.4.2 gpt-oss-120b — open-weights MXFP4 MoE (OpenAI)
~65 GB native MXFP4 — fits one Spark with KV-cache room. Ungated. The 20B sibling (openai/gpt-oss-20b) is a great first smoke-test.
docker run -d --name gptoss --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models \
nvcr.io/nvidia/vllm:26.04-py3 \
vllm serve openai/gpt-oss-120b \
--host 0.0.0.0 --port 8000 \
--max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4 \
--enable-expert-parallel
[!NOTE] gpt-oss models use OpenAI’s Harmony response format and expose a reasoning_effort control ("low" / "medium" / "high") passed in the request body; there’s no separate --reasoning-parser flag. For tool calling add --enable-auto-tool-choice --tool-call-parser openai. Use a recent vLLM/NGC image (older builds predate Harmony parsing).
7.4.3 Llama-3.3-70B-Instruct-NVFP4 — gated dense (Meta, NVIDIA quant)
docker run -d --name llama70 --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -v ~/vllm:/models \
nvcr.io/nvidia/vllm:26.04-py3 \
vllm serve nvidia/Llama-3.3-70B-Instruct-NVFP4 \
--quantization modelopt --max-model-len 131072 \
--gpu-memory-utilization 0.85 --max-num-seqs 4
Dense 70B → memory-bandwidth-limited decode (slower than a similar-size MoE), but high single-user quality. Needs an accepted Llama license + HF_TOKEN.
7.4.4 Phi-4-multimodal-instruct-NVFP4 — omnimodal text + image + audio (Microsoft, NVIDIA quant)
Microsoft’s 5.6B omnimodal Phi-4 — text, image, and speech/audio — 128K context, Phi4MMForCausalLM with a custom processor. NVIDIA NVFP4 ModelOpt quant; small and cheap on Spark.
[!CAUTION] Known container gotcha: Phi-4-mm’s custom processor imports scipy, which is not installed in the NGC vLLM images, so serving fails with ImportError: ... scipy. Install it before vllm serve (or bake it into a derived image). Tracked in NVIDIA/dgx-spark-playbooks issue #69.
docker run -d --name phi4mm --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_HOME=/models -v ~/vllm:/models \
nvcr.io/nvidia/vllm:26.04-py3 \
bash -lc "pip install --no-cache-dir scipy && \
vllm serve nvidia/Phi-4-multimodal-instruct-NVFP4 \
--quantization modelopt --trust-remote-code \
--max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4"
[!NOTE] Needs --trust-remote-code (custom Phi4MM processor). Send images via OpenAI image_url blocks and audio via the audio input field (any soundfile-readable format). The NVFP4 checkpoint was originally published for TensorRT-LLM but runs under vLLM with the two requirements above.
8. Flag reference & tuning
| Flag | On Spark | Guidance |
|---|---|---|
| --gpu-memory-utilization | Fraction of the shared 128 GB | Start from the model recipe (0.4–0.92 seen). Leave headroom; lower if OOM. |
| --max-num-seqs | Concurrent sequences | Keep low (1–4); above ~4 the bandwidth tax outweighs batching. |
| --max-model-len | Prompt + completion cap | 65536 is a sane Spark default; raise toward model max only with headroom. |
| prefix caching | KV reuse across shared prefixes | On by default in V1; --enable-prefix-caching is redundant but harmless. |
| --quantization modelopt | ModelOpt NVFP4 only | Pass for nvidia/... ModelOpt checkpoints; omit for compressed-tensors (auto-detected). |
| --reasoning-parser / --tool-call-parser + --enable-auto-tool-choice | Structured reasoning/tools | Follow the model recipe (qwen3 / gemma4 / mistral / nemotron_v3). |
| --kv-cache-dtype fp8 | Shrinks KV cache | Model-specific: the Qwen3.6-NVFP4 & gemma4-coder-MTP recipes use it; Qwen3.5 and DFlash do not. |
| --speculative-config '{"method":"mtp"...}' | Built-in speculative decode | Latency lever for MTP models (Qwen3.x, Nemotron, gemma4-coder). See §13. |
| --moe-backend / --attention-backend | Kernel pins | Leave auto unless a tested recipe pins one (marlin / cutlass / flashinfer / TRITON_*). |
| --enable-expert-parallel | MoE routing | Enable for MoE (gpt-oss, Nemotron, Mistral, Gemma-4-26B). |
| CUDA graphs | Per-step overhead | Keep enabled. |
| --load-format fastsafetensors | Faster weight load | Evaluate if the 10–15 min load matters. |
[!TIP] Stability ↔ throughput slider. A plain run leaves KV-cache dtype unset, spec-decode off, CUDA graphs on, backends auto. A tuned run layers in FP8 KV cache, async scheduling, MTP/DFlash, and pinned FP4 backends — each validated against the exact model + prompt shape + batch pattern + vLLM release. Want ~2–3× faster interactive decode? See §13 (MTP & DFlash).
9. Pre-warm the JIT
The first request after boot triggers Inductor + FlashInfer codegen (~25–60 s). Fire a tiny warmup on the same path as real traffic, then short prompts return in <0.5 s:
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
-d '{"model":"<served-model-name>","messages":[{"role":"user","content":"ping"}],"max_tokens":3}'
This is separate from weight load (10–15 min for a 120B) — address that with fastsafetensors/InstantTensor if needed.
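Since weight load can take many minutes, the warm-up is easiest to automate as "poll /health, then send one tiny request". A minimal sketch under stated assumptions: wait_and_warm is a hypothetical helper name, the defaults (120 tries, 10 s apart, about 20 minutes) are arbitrary choices sized to the load times quoted above, and the request body mirrors the warm-up shown in this section.

```bash
# Hypothetical helper: wait until /health answers, then send one tiny warm-up request.
# usage: wait_and_warm <served-model-name> [base_url] [max_tries] [sleep_seconds]
wait_and_warm() {
  model="$1"
  base="${2:-http://localhost:8000}"
  tries="${3:-120}"
  pause="${4:-10}"
  # -f makes curl fail on HTTP errors, so the loop also covers "up but not ready".
  until curl -sf "$base/health" >/dev/null; do
    tries=$((tries - 1))
    [ "$tries" -gt 0 ] || { echo "server not ready" >&2; return 1; }
    sleep "$pause"
  done
  curl -sS "$base/v1/chat/completions" -H "Content-Type: application/json" \
    -d "{\"model\":\"$model\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}],\"max_tokens\":3}" \
    >/dev/null
}

# Example (not run here): wait_and_warm nemotron-3-super
```

If real traffic uses tools or long prompts, a warm-up request of the same shape exercises more of the same code path than a bare "ping".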
10. Verify & monitor
curl -sS http://localhost:8000/health
curl -sS http://localhost:8000/v1/models | jq
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
-d '{"model":"<served-model-name>","messages":[{"role":"user","content":"Explain quantum computing briefly."}],"max_tokens":200,"stream":true}'
Prometheus telemetry is at /metrics (no extra service). Watch KV-cache utilization (vllm:kv_cache_usage_perc) and TTFT / inter-token-latency histograms. Healthy single-user behavior: prefix-cached later turns don’t spike, decode rate stays steady, KV usage stays well below the context limit.
[!TIP] Confirm the fast paths are live. Check the startup log to verify NVFP4 GEMM kernels actually engaged (you want a line like Using NvFp4LinearBackend.VLLM_CUTLASS for NVFP4 GEMM); if it silently fell back, you lose the FP4 speed/memory win. For speculative decoding, confirm acceptance via curl -s localhost:8000/metrics | grep -i spec_decode (see §13.5).
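The /metrics output is plain Prometheus text, so the numbers worth watching can be pulled out with awk. A sketch, with the metric names called out as assumptions: vllm:kv_cache_usage_perc is the name given above, while the speculative-decode counter names (vllm:spec_decode_num_accepted_tokens and vllm:spec_decode_num_draft_tokens, with or without a _total suffix) are my assumption about current vLLM builds and should be checked against the grep output on your image. Both helper names are hypothetical.

```bash
# Hypothetical helpers: both read Prometheus text from stdin.

# KV-cache usage samples (one line per label set), printed as reported by vLLM.
kv_cache_usage() {
  awk '/^vllm:kv_cache_usage_perc[{ ]/ { print $NF }'
}

# Draft-token acceptance rate = accepted / drafted, summed over label sets.
# ASSUMPTION: counter names as described in the lead-in; verify on your build.
spec_acceptance() {
  awk '
    /^vllm:spec_decode_num_accepted_tokens(_total)?[{ ]/ { accepted += $NF }
    /^vllm:spec_decode_num_draft_tokens(_total)?[{ ]/    { drafted  += $NF }
    END { if (drafted > 0) printf "%.2f\n", accepted / drafted; else print "n/a" }
  '
}

# Examples (not run here):
# curl -s localhost:8000/metrics | kv_cache_usage
# curl -s localhost:8000/metrics | spec_acceptance
```

The counters are cumulative since server start, so the ratio is a lifetime average; diff two scrapes to get the rate for a specific workload.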
11. Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| permission denied ... docker.sock | Not in docker group | sudo usermod -aG docker $USER && newgrp docker |
| OOM with free RAM showing | Page cache holds unified memory | sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' before launch |
| OOM during load/serve | Util too high / context too long | Lower --gpu-memory-utilization, reduce --max-model-len, or pick a smaller/more-quantized model |
| 401 / gated download fails | Missing token / license | export HF_TOKEN=..., accept the license on the HF page |
| model type ... not recognized | Container too old for the arch | Newer NGC tag or vllm-openai:nightly; check vllm --version |
| unknown quantization / weights mis-load | Wrong quant flag | ModelOpt → --quantization modelopt; compressed-tensors → omit it |
| Stable image won’t run on GB10 | No sm_121 support | Use cu130-nightly / NGC image |
| Output repetition loops | --kv-cache-dtype fp8 on a model that dislikes it | Remove it (model-specific) |
| ~9× slower MoE | --enable-chunked-prefill on SSM+MoE | Remove it |
| First request slow, rest fine | JIT warmup | Pre-warm (§9) |
| rmi won’t delete image | Tag omitted | docker stop <name> && docker rm <name> then docker rmi <repo>:<tag> |
12. Scaling to multiple Sparks (brief)
A single Spark handles models up to ~100 GB (gpt-oss-120b MXFP4, Mistral-Small-4 NVFP4, Llama-70B NVFP4). For larger models or higher TP, link Sparks over the 200 Gb/s QSFP (ConnectX-7) ports and run a Ray cluster with --tensor-parallel-size = number of GPUs:
- 2 Sparks: direct QSFP cable. 3+: through a switch.
- Bind NCCL to the QSFP interface (NCCL_SOCKET_IFNAME=enP2p1s0f1np1); an Ethernet fallback costs 10–20× throughput.
- Mount the same ~/vllm on every node and stage weights on each.
- Easiest: the mark-ramsey-ri/vllm-dgx-spark scripts, or NVIDIA’s spark_cluster_setup.sh + the official multi-Spark playbook.
13. Speculative decoding on Spark: MTP & DFlash (step-by-step)
A small, cheap drafter proposes the next few tokens; the big model verifies them in one pass, so accepted tokens are free. It’s a latency win that shines at low concurrency, exactly Spark’s profile, and can speed up interactive decode roughly 2–3× with no quality loss. Two methods matter: MTP (built into the model) and DFlash (a separate diffusion drafter).
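How much a given acceptance rate buys can be estimated with the usual geometric-series result for speculative decoding. The sketch assumes each drafted token is accepted independently with the same probability a, which is a simplification, and it ignores the drafter's own cost, so it gives an optimistic bound rather than a predicted speedup. expected_tokens_per_pass is a hypothetical helper name.

```bash
# Hypothetical helper: expected tokens emitted per target-model pass when k tokens are
# drafted and each is accepted independently with probability a:
#   E = (1 - a^(k+1)) / (1 - a)
# usage: expected_tokens_per_pass <acceptance_rate> <num_speculative_tokens>
expected_tokens_per_pass() {
  awk -v a="$1" -v k="$2" 'BEGIN {
    if (a >= 1) e = k + 1                       # every draft accepted, plus the bonus token
    else        e = (1 - a ^ (k + 1)) / (1 - a)
    printf "%.2f\n", e
  }'
}

expected_tokens_per_pass 0.7 3   # prints 2.53
expected_tokens_per_pass 0.8 1   # prints 1.80
```

Raising num_speculative_tokens helps only while acceptance stays high, which is why the recipes in this guide pair a small value (1 to 3) with a check of the measured acceptance rate.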
13.1 Do they need special models, Docker, or configs?
| | MTP | DFlash |
|---|---|---|
| Special model? | Target has built-in MTP modules (no download) or ships a small paired draft model (e.g. gemma4-coder’s bundled assistant/). | Yes — a matching DFlash drafter checkpoint trained for your target. |
| Special Docker? | No — standard Spark images. | Usually — vLLM ≥ 0.21.0 + sm_121 build. NGC 26.04 (vLLM 0.19.0) is too old; use a prebuilt DFlash image or nightly. |
| Special config? | --speculative-config '{"method":"mtp","num_speculative_tokens":1}' | --speculative-config '{"method":"dflash","model":"<drafter>","num_speculative_tokens":N}' + KV cache BF16 (no fp8). |
| Models | DeepSeek V3/R1/V4, Qwen3.5/3.6, GLM-5.x, Gemma 4 (+gemma4-coder), Nemotron-3-Super, Mistral | Gemma-4-31B, Laguna XS.2, Qwen3.5-27B (or train your own) |
| Speedup | ~1.6–1.8× | up to ~6× over plain decoding (~2.5× over EAGLE-3); ~2.2–2.7× on Spark |
[!NOTE] Both consume KV cache for speculative tokens, so they trade peak throughput for latency. Keep them on for interactive use; for bursty/concurrent serving add --speculative-disable-by-batch-size 32.
Which models in this guide can use which method:
| Model (this guide) | MTP | DFlash | How |
|---|---|---|---|
| Qwen3.6-35B-A3B-NVFP4 (§7.1.1) | ✅ built-in | — | already in its recipe (num_speculative_tokens:3) |
| Qwen3.6-27B-NVFP4 (§7.1.2) | ✅ built-in (validate) | — | add --speculative-config '{"method":"mtp",...}' |
| Qwen3.5-35B-A3B (§7.1.3) | ✅ built-in | via 27B sibling | dense Qwen3.5-27B has a DFlash drafter |
| gemma-4-12B-coder (§7.2.3) | ✅ paired draft | — | bundled /model/assistant + kv fp8 |
| Gemma-4-31B-IT-NVFP4 (§7.2.1) | — | ✅ | RedHatAI/...speculator.dflash (Z-Lab image) |
| Nemotron-3-Super (§7.3.2) | ✅ built-in | — | {"method":"mtp","num_speculative_tokens":1} |
| Nemotron-3-Nano (§7.3.1) | — | — | card ships no speculator; run dense |
| Mistral-Small-4 (§7.4.1) | ✅ built-in (validate) | — | add the mtp flag and check acceptance |
(Empty = no published speculator for that exact checkpoint today; “validate” = architecture supports it but confirm acceptance on your build.)
13.2 MTP — zero-download (≈5 min)
Just a flag on an MTP-capable model (the §7.1.1 Qwen3.6 and §7.2.3 gemma4-coder recipes already use it):
[!NOTE] Two flavors of MTP. Most MTP models (Qwen3.6, Nemotron-Super, DeepSeek-style) have the prediction heads baked into the checkpoint: nothing extra to download, just the flag. A few ship a small paired draft model instead: gemma4-coder bundles a 0.4 B draft in assistant/, so you point "model" at it (/model/assistant); see §7.2.3. Both use "method":"mtp"; only the second needs the "model" field.
docker run -d --name mtp --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -v ~/vllm:/models \
nvcr.io/nvidia/vllm:26.04-py3 \
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--host 0.0.0.0 --port 8000 --trust-remote-code \
--max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4 \
--reasoning-parser nemotron_v3 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
num_speculative_tokens is usually 1 (some recipes use 3). For lowest latency at low concurrency, the vLLM recipe suggests disabling prefix caching. MTP reduces throughput under load — pair with --speculative-disable-by-batch-size 32.
13.3 DFlash on Spark — plug-and-play container (easiest)
The ghcr.io/aeon-7/vllm-dflash image is a prebuilt sm_121 vLLM with DFlash baked in; it serves a 27B dense Qwen3.5 with a 2B block-diffusion drafter, taking decode from ~12 → ~33 tok/s.
# 1) download the target into ~/vllm
pip install -U "huggingface_hub[cli]"; export HF_TOKEN=hf_xxx
hf download AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
--local-dir ~/vllm/DFlash-Qwen3.5-27B-Uncensored-NVFP4
# 2) make a persistent API key, then launch (drafter auto-downloads)
export VLLM_API_KEY=$(openssl rand -hex 32); echo "API key: $VLLM_API_KEY"
docker run -d --name vllm-dflash --runtime nvidia --network host --ipc host \
--restart unless-stopped \
-v ~/vllm/DFlash-Qwen3.5-27B-Uncensored-NVFP4:/models/target \
-v ~/vllm/drafter-cache:/models/drafter-cache \
-e MODEL_PATH=/models/target \
-e SERVED_MODEL_NAME=dflash-qwen3.5-27b \
-e DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash \
-e DFLASH_NUM_SPEC_TOKENS=15 \
-e MAX_MODEL_LEN=65536 -e MAX_NUM_SEQS=4 \
-e GPU_MEMORY_UTILIZATION=0.85 -e MAX_NUM_BATCHED_TOKENS=65536 \
-e VLLM_API_KEY="$VLLM_API_KEY" -e HF_TOKEN="$HF_TOKEN" \
ghcr.io/aeon-7/vllm-dflash:latest
docker logs -f vllm-dflash # ~5 min
# 3) test (note the Bearer token)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" -H "Authorization: Bearer $VLLM_API_KEY" \
-d '{"model":"dflash-qwen3.5-27b","messages":[{"role":"user","content":"Write a haiku about GPUs."}],"max_tokens":200}'
Key env vars: DFLASH_DRAFTER (HF id of the drafter; empty = plain vLLM), DFLASH_NUM_SPEC_TOKENS (15 is best single-stream, 5 for high concurrency), KV_CACHE_DTYPE (stays BF16 with DFlash). Spark presets, written as MAX_MODEL_LEN/DFLASH_NUM_SPEC_TOKENS/MAX_NUM_SEQS: default 65536/15/4 ≈ 33 tok/s; high-concurrency 32768/5/8 ≈ 85–92 tok/s total across streams.
13.4 DFlash the general way (other models)
DFlash uses vLLM’s speculators format; pass the drafter in --speculative-config. Available drafters today:
| Target (verifier) | DFlash drafter | Source |
|---|---|---|
| google/gemma-4-31B-it | RedHatAI/gemma-4-31B-it-speculator.dflash | RedHatAI / vLLM |
| poolside/Laguna-XS.2 | poolside/Laguna-XS.2-speculator.dflash | poolside |
| Qwen/Qwen3.5-27B | z-lab/Qwen3.5-27B-DFlash | Z Lab |
# confirm >= 0.21.0 (the image's entrypoint already launches the vLLM server, so override it to ask for the version)
docker run --rm --entrypoint vllm vllm/vllm-openai:nightly --version
# Laguna XS.2 (coding; up to 7 tokens/step, ~70% acceptance on code)
docker run -d --name laguna --ipc=host --restart unless-stopped \
--gpus all -p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -e VLLM_USE_DEEP_GEMM=0 -v ~/vllm:/models \
--entrypoint vllm vllm/vllm-openai:nightly \
serve poolside/Laguna-XS.2 --trust-remote-code \
--enable-auto-tool-choice --tool-call-parser poolside_v1 --reasoning-parser poolside_v1 \
--speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
[!NOTE] DFlash needs vLLM ≥ 0.21.0; the stock NGC 26.04 (0.19.0) won’t do it. Gemma-4 DFlash currently needs Z Lab’s build (ghcr.io/z-lab/vllm-openai:gemma4-dflash-cu130). No drafter for your model? Train one: the drafter is a small Qwen3-style stack and the speculators project ships a “Train DFlash” tutorial.
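The version floor can be checked mechanically before launching. This is a minimal sketch, assuming GNU sort with -V (version sort) is available; version_ge is an illustrative helper name, and pre-release or dev suffixes on nightly builds may not order the way you expect, so treat the result as a sanity check.

```shell
# Exit 0 when version $1 >= version $2. Relies on GNU `sort -V`:
# the floor ($2) must sort first, or tie, for the check to pass.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage (commented out because it needs Docker and the image):
#   have=$(docker run --rm --entrypoint vllm vllm/vllm-openai:nightly --version | awk '{print $NF}')
#   version_ge "$have" 0.21.0 || echo "vLLM $have is too old for DFlash" >&2

version_ge 0.19.0 0.21.0 || echo "0.19.0 is below the DFlash floor"
```

A plain string comparison would rank 0.9.x above 0.21.0, which is why the sketch leans on version sort instead.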
13.5 Verify it’s actually accelerating (don’t fly blind)
Speculative decoding only helps if the drafter’s tokens are accepted. Confirm it, then tune:
- Check acceptance. vLLM logs a spec-decode summary (draft acceptance rate and mean accepted length per step) and exposes it on /metrics:
  curl -s http://localhost:8000/metrics | grep -i spec_decode
  Healthy MTP runs accept most proposed tokens (mean accepted length well above 1); DFlash on code can accept whole blocks. If acceptance is near zero, the method isn’t helping: the drafter and target are mismatched, or the content is low-predictability (random/adversarial text).
- Tune num_speculative_tokens (K). Raise K for DFlash (it drafts a whole block in one pass; 7–15 is normal); keep K low for MTP (1–3), since each extra token adds a sequential draft step. Watch decode tok/s as you change it: past the acceptance sweet spot, throughput drops.
- A/B against a dense run. Time the same prompt with the spec flag removed. If output is identical but tok/s isn’t clearly higher, drop speculation for that workload.
- Concurrency check. Both methods tax throughput under load; if you serve bursts, set --speculative-disable-by-batch-size 32 so vLLM auto-disables speculation when batches grow.
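The acceptance check can be turned into a single number. This is a sketch under assumptions: it expects Prometheus counters whose names end in spec_decode_num_draft_tokens, spec_decode_num_accepted_tokens, and spec_decode_num_drafts (optionally with a _total suffix), which is what recent vLLM builds appear to export. Confirm the names against your own /metrics output first. The function name, and the "1 + accepted / drafts" definition of tokens per step, are illustrative.

```shell
# Summarize spec-decode counters from Prometheus text on stdin.
# ASSUMPTION: metric names as described above; verify on your build.
spec_decode_summary() {
  awk '
    /^#/ { next }                               # skip HELP/TYPE comment lines
    {
      name = $0; sub(/[{ ].*/, "", name)        # strip labels and value
      value = $NF
    }
    name ~ /spec_decode_num_accepted_tokens(_total)?$/ { accepted += value }
    name ~ /spec_decode_num_draft_tokens(_total)?$/    { drafted  += value }
    name ~ /spec_decode_num_drafts(_total)?$/          { steps    += value }
    END {
      if (drafted == 0 || steps == 0) { print "no spec-decode samples yet"; exit 1 }
      printf "acceptance_rate=%.3f mean_tokens_per_step=%.2f\n", accepted / drafted, 1 + accepted / steps
    }
  '
}

# Usage against a running server:
#   curl -s http://localhost:8000/metrics | spec_decode_summary
```

Because these are cumulative counters, compare two readings taken before and after a test prompt if you want the rate for that prompt alone.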
13.6 The gotchas that bite people
[!CAUTION]
- DFlash floor: vLLM ≥ 0.21.0 (AEON-7/Z-Lab image or nightly). MTP is fine on 0.19.0+.
- DFlash KV cache stays BF16: never combine with --kv-cache-dtype fp8. (Note: the MTP gemma4-coder draft is the opposite; it needs fp8 KV.)
- One drafter per target: a DFlash drafter is trained for a specific model.
- Higher K is cheap with DFlash (whole block in one pass), not with MTP: keep MTP num_speculative_tokens low.
- Both hurt throughput under load: add --speculative-disable-by-batch-size 32.
- Acceptance is content-dependent: big on code/structured output, smaller on prose, none on random text.
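The KV-cache gotcha is easy to lint for before launch. The helper below is a hypothetical preflight check, not part of vLLM: it scans an argument list and fails when a DFlash speculative config is combined with an fp8 KV cache. It assumes the method is written as the JSON string "dflash".

```shell
# Hypothetical preflight lint: fail when DFlash is combined with fp8 KV cache.
check_spec_flags() {
  local args="$*"
  case "$args" in
    *'"dflash"'*) ;;          # DFlash requested: keep checking
    *) return 0 ;;            # MTP or no speculation: nothing to flag here
  esac
  case "$args" in
    *"--kv-cache-dtype fp8"*|*"--kv-cache-dtype=fp8"*)
      echo "DFlash needs a BF16 KV cache: remove --kv-cache-dtype fp8" >&2
      return 1 ;;
  esac
  return 0
}

# Example: this combination is rejected with a message on stderr.
check_spec_flags --speculative-config '{"model":"<drafter>","method":"dflash"}' \
  --kv-cache-dtype fp8 || echo "fix the flags before launching"
```

The check deliberately leaves MTP plus fp8 alone, since the gemma4-coder recipe relies on that pairing.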
14. Sources
- NVIDIA DGX Spark vLLM playbook: https://build.nvidia.com/spark/vllm/instructions
- NVIDIA dgx-spark-playbooks: https://github.com/NVIDIA/dgx-spark-playbooks
- vLLM team blog “vLLM on the DGX Spark” (Jun 2026): https://vllm.ai/blog/2026-06-01-vllm-dgx-spark
- vLLM speculative decoding / MTP / DFlash docs: https://docs.vllm.ai/en/latest/features/speculative_decoding/ · https://docs.vllm.ai/projects/speculators/en/latest/user_guide/algorithms/dflash/
- DFlash paper (Z Lab, arXiv 2602.06036): https://arxiv.org/abs/2602.06036 · AEON-7/vllm-dflash: https://github.com/AEON-7/vllm-dflash
- Model cards: nvidia/Qwen3.6-35B-A3B-NVFP4, unsloth/Qwen3.6-27B-NVFP4, nvidia/Gemma-4-31B-IT-NVFP4, nvidia/Gemma-4-26B-A4B-NVFP4, sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4, nvidia/diffusiongemma-26B-A4B-it-NVFP4, mistralai/Mistral-Small-4-119B-2603-NVFP4, nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4, nvidia/Phi-4-multimodal-instruct-NVFP4 (all on https://huggingface.co)
- Phi-4-mm scipy container issue: https://github.com/NVIDIA/dgx-spark-playbooks/issues/69
- NGC vLLM container tags: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm
Compiled June 2026. Benchmark numbers are reproduced from each model’s Hugging Face card (NVFP4-vs-baseline tables, or vendor-reported gains). Container tags and model handles move fast — pin a known-good image digest for anything you depend on.