# vLLM on DGX Spark

*June 20, 2026*
 — by Flaviu Vlaicu




> [!IMPORTANT]
> **Where models live.** Every command in this guide stores model weights in **`~/vllm`** on the host, bind-mounts that directory into the container at `/models`, and sets `HF_HOME=/models` so the Hugging Face cache lands there. Download once, mount everywhere. A couple of community models that bundle a speculative-decode draft use a per-model local path under `~/vllm/<name>` instead; those are flagged inline.

A practical, end-to-end guide for serving LLMs with vLLM on a **DGX Spark (GB10 Grace Blackwell)**, assembled from NVIDIA's official playbook, the vLLM team's deep-dive, model cards, and battle-tested community setups.

> [!WARNING]
> Flag choices on Spark are **model- and image-specific**, not hardware-wide defaults. The recipes below are starting points that worked for their authors against a specific container tag. Validate against the exact image you run, and pin a known-good tag/digest for anything you depend on. Copying a flag from one model's recipe to another can silently regress throughput or output quality.

---

## 1. The landscape: current playbooks & walkthroughs

| Source | Best for | Link |
|---|---|---|
| **NVIDIA official playbook** | Canonical install steps; single / stacked / switched / troubleshooting tabs | https://build.nvidia.com/spark/vllm/instructions |
| **NVIDIA `dgx-spark-playbooks`** | Source-of-truth repo; benchmarking + cluster bootstrap | https://github.com/NVIDIA/dgx-spark-playbooks |
| **DeepWiki: vLLM on Spark** | Clean model support matrix + serving params | https://deepwiki.com/NVIDIA/dgx-spark-playbooks/4.2-vllm |
| **vLLM team blog (Jun 2026)** — *authoritative config deep-dive* | Why each flag matters, unified-memory behavior, JIT pre-warm | https://vllm.ai/blog/2026-06-01-vllm-dgx-spark |
| **`mark-ramsey-ri/vllm-dgx-spark`** | 1-to-N Spark orchestration, 41 model presets | https://github.com/mark-ramsey-ri/vllm-dgx-spark |
| **`AEON-7/vllm-dflash`** | Plug-and-play DFlash container for Spark | https://github.com/AEON-7/vllm-dflash |
| **vLLM Recipes index** | Per-model launch commands, kept current | https://recipes.vllm.ai/ |
| **NGC vLLM container tags** | Latest `nvcr.io/nvidia/vllm` build | https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm |

---

## 2. Hardware facts that drive every config

DGX Spark is a GB10 Grace Blackwell SoC with **128 GB unified memory shared by CPU and GPU**, on a consumer-Blackwell `sm_121` GPU, ~273 GB/s bandwidth, and an **ARM64 (aarch64)** host. Four consequences shape everything below:

- **One memory pool for everything.** Weights, KV cache, the OS, page cache, and the runtime all draw from the same 128 GB. `--gpu-memory-utilization` is a fraction of that shared pool — leave headroom or you OOM with "free" RAM still showing.
- **Memory-bandwidth bound for decode.** Spark shines at **small-batch, single-user interactive serving**. Keep concurrency low (`--max-num-seqs` ≈ 1–4).
- **NVFP4 MoE is the sweet spot.** Quantized MoE with ~3–13 B *active* params (even at 25–120 B total) is the best fit: small active set = fast decode, and 4-bit weights free unified memory for KV cache. Prefer NVFP4/FP8/MXFP4 over BF16.
- **Use sm_121-validated builds.** The upstream **stable** vLLM image does **not** support GB10. Use the CUDA-13 nightly track (`vllm/vllm-openai:cu130-nightly` / `:nightly`) or NVIDIA's NGC container (`nvcr.io/nvidia/vllm:26.02-py3`+).
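
To make the shared-pool arithmetic concrete, here is a rough sketch of how `--gpu-memory-utilization` divides the pool. The helper name `mem_budget` is hypothetical, and the sketch assumes the full 128 GB is visible to CUDA, which is optimistic because firmware and the OS reserve part of it.

```bash
# mem_budget TOTAL_GB UTILIZATION WEIGHTS_GB
# Prints three numbers in GB:
#   1) what vLLM may claim,
#   2) what is left for the OS and page cache,
#   3) what remains inside vLLM's share after weights (KV cache + runtime overhead).
mem_budget() {
  awk -v total="$1" -v util="$2" -v weights="$3" 'BEGIN {
    claimed = total * util
    printf "%.1f %.1f %.1f\n", claimed, total - claimed, claimed - weights
  }'
}

mem_budget 128 0.85 19   # a ~19 GB NVFP4 checkpoint at utilization 0.85
# prints: 108.8 19.2 89.8
```

At 0.85 about 19 GB stays with the host. Raising the fraction shrinks that headroom, which is the out-of-memory risk the first bullet describes.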

---

## 3. Storing all models in `~/vllm`

vLLM fetches weights through the Hugging Face stack, which caches under the directory named by `HF_HOME`. Point that at a host directory and mount it, and **every model lives in `~/vllm`**, shared across containers.

```bash
mkdir -p ~/vllm
```

**Pre-stage on the host (recommended)** — files end up owned by you, the download happens once, and the first `vllm serve` doesn't stall on a multi-GB pull:

```bash
pip install -U "huggingface_hub[cli]"        # one-time
export HF_HOME=~/vllm
export HF_TOKEN=hf_xxx                          # for gated models; accept the license on the HF page first
hf download nvidia/Qwen3.6-35B-A3B-NVFP4
```

**The mount, in every `docker run`:**

```bash
  -e HF_HOME=/models \
  -v ~/vllm:/models \
```

Inside the container HF now caches at `/models` (= `~/vllm` on host); pre-staged weights at `~/vllm/hub/...` appear at `/models/hub/...` automatically.
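
As a quick way to see that mapping, the sketch below derives the cache directory for a repo under both views. `hf_cache_dir` is a hypothetical helper. The `hub/models--<org>--<name>` layout is the Hugging Face cache convention; the snapshot revision directory beneath it is not predicted here.

```bash
# hf_cache_dir HF_HOME REPO_ID -> directory the Hugging Face cache uses for that repo
hf_cache_dir() {
  printf '%s/hub/models--%s\n' "$1" "$(printf '%s' "$2" | sed 's|/|--|g')"
}

hf_cache_dir "$HOME/vllm" nvidia/Qwen3.6-35B-A3B-NVFP4   # host view
hf_cache_dir /models      nvidia/Qwen3.6-35B-A3B-NVFP4   # container view
# container view prints: /models/hub/models--nvidia--Qwen3.6-35B-A3B-NVFP4
```

Listing the `snapshots/` directory under the host path is a simple way to confirm that a pre-staged download finished before you start a container.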

> [!TIP]
> Force fully-local loading (air-gapped, no network checks) with `-e HF_HUB_OFFLINE=1` once weights are staged. If you let the *container* (root) download instead of pre-staging, files in `~/vllm` are root-owned — add `--user $(id -u):$(id -g)` if that matters.
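
If you are unsure whether an earlier container run left root-owned files behind, a check along these lines lists them. `foreign_owned` is a hypothetical helper name, and the sketch only reports; it does not change ownership.

```bash
# foreign_owned DIR -> every path under DIR that is not owned by the current user
foreign_owned() {
  find "$1" ! -user "$(id -u)" -print
}

# Example: foreign_owned ~/vllm | head
```

An empty result means everything in the model store belongs to you.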

---

## 4. Is volume mounting necessary?

**Yes, treat it as mandatory.** A container filesystem is ephemeral: a freshly downloaded 27–120 B model is destroyed when the container is removed, so without a mount every new container re-pulls many GB before the 10–15 min weight load can even begin. The vLLM team's guidance is "download once, mount everywhere."

The **one** other volume worth adding is the **torch.compile cache** (the slow part of cold start), which lives separately from your models at `/root/.cache/vllm`:

```bash
  -v ~/vllm-compile-cache:/root/.cache/vllm \
```

> [!NOTE]
> The compile cache only pays off when you **recreate** a container (image or flag change); a long-running `--restart unless-stopped` container compiles once anyway. The cache is keyed to GPU arch + vLLM version + model + flags, so clear it (`rm -rf ~/vllm-compile-cache/*`) if a stale entry causes a startup hiccup after an upgrade. `--ipc=host` and `--shm-size` configure shared memory (RAM); they are **not** `-v` volumes.

---

## 5. One-time host setup

```bash
nvidia-smi                                   # GPU + driver visible
docker ps                                    # else: sudo usermod -aG docker $USER && newgrp docker
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
mkdir -p ~/vllm
export HF_TOKEN=hf_xxx                        # gated models
# NGC pulls return 401? -> docker login nvcr.io  (username is the literal string '$oauthtoken', password = NGC API key)
```

> [!WARNING]
> **Unified-memory OOM valve.** Because CPU and GPU share DRAM, the Linux page cache can hold memory CUDA can't reclaim — an "OOM" well under 128 GB. If a big model fails to load after heavy file activity, flush caches first: `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'`
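
Before flushing, it can help to see how much memory the kernel considers available and how much sits in the page cache. The sketch below reads `/proc/meminfo`. The helper name `meminfo_gib` is hypothetical, and how closely `MemAvailable` tracks what CUDA can actually allocate on GB10 is an assumption to verify on your system.

```bash
# meminfo_gib FIELD [FILE] -> value of a /proc/meminfo field in GiB (one decimal)
meminfo_gib() {
  awk -v field="$1:" '$1 == field { printf "%.1f\n", $2 / 1048576 }' "${2:-/proc/meminfo}"
}

# Report only where /proc/meminfo exists (Linux).
if [ -r /proc/meminfo ]; then
  echo "available:  $(meminfo_gib MemAvailable) GiB"
  echo "page cache: $(meminfo_gib Cached) GiB"
fi
```

A large page-cache figure next to a failed model load is the situation the warning above describes.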

---

## 6. Baseline single-Spark recipe (template)

Minimal, model-agnostic shape — swap the model handle and per-model flags from §7.

```bash
export HF_TOKEN=hf_xxx   # only for gated models

docker run -d --name vllm --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -e HF_HOME=/models \
  -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve <HF_MODEL_HANDLE> \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 4
```

Smoke-test (first request triggers JIT warmup — see §9):

```bash
curl -sS http://localhost:8000/v1/models | jq -r '.data[0].id'
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"<HF_MODEL_HANDLE>","messages":[{"role":"user","content":"12*17"}],"max_tokens":500}'
# expect "204"
```
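
Because a cold start can take many minutes, scripts usually need to wait for the server before sending the smoke test. A minimal polling sketch follows. `wait_until_ready` is a hypothetical helper, and the usage line assumes your image serves vLLM's `/health` endpoint on the published port.

```bash
# wait_until_ready TIMEOUT_S INTERVAL_S CMD [ARGS...]
# Re-runs CMD until it succeeds; returns 1 once TIMEOUT_S seconds have been waited.
wait_until_ready() {
  timeout_s=$1
  interval_s=$2
  shift 2
  waited=0
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  return 0
}

# Typical use: poll for up to 20 minutes, every 10 seconds.
# wait_until_ready 1200 10 curl -fsS http://localhost:8000/health && echo "server is up"
```

Passing the probe as a command keeps the loop independent of the endpoint, so the same helper can poll `/v1/models` instead.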

> [!NOTE]
> **Two container tracks — pick per recipe.** NVIDIA's NGC image `nvcr.io/nvidia/vllm:26.04-py3` (vLLM 0.19.0) is the default for many `nvidia/...` checkpoints. Several recipes below use the upstream **`vllm/vllm-openai:nightly`** / `:cu130-nightly` track (needed for the newest architectures and for DFlash). A given model may need a *newer* container than you have — `vllm --version` inside the image tells you.
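
Everything after the image name in `docker run` is appended to the image's entrypoint, so whether the command should start with `vllm serve` or directly with the model handle depends on the image. The recipes in this guide use both forms, so it is worth checking the tag you actually pulled. The sketch below is one way to do that. `serve_prefix` and `image_entrypoint` are hypothetical helpers, and which entrypoint a given tag ships is an assumption to verify rather than something this guide states.

```bash
# serve_prefix ENTRYPOINT_JSON
# Prints what belongs between the image name and the model handle.
serve_prefix() {
  case "$1" in
    *'"vllm"'*'"serve"'*) : ;;                      # entrypoint already runs `vllm serve`
    *)                    printf 'vllm serve\n' ;;  # generic entrypoint: spell the command out
  esac
}

# image_entrypoint IMAGE -> the image's ENTRYPOINT as JSON (needs Docker and a pulled image)
image_entrypoint() {
  docker image inspect --format '{{json .Config.Entrypoint}}' "$1"
}

# Examples (not run here):
#   serve_prefix "$(image_entrypoint vllm/vllm-openai:nightly)"
#   docker run --rm --entrypoint vllm nvcr.io/nvidia/vllm:26.04-py3 --version
```

The second example overrides the entrypoint so that `vllm --version` runs regardless of what the image defines.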

> [!IMPORTANT]
> **Quantization flag rule.** For **NVIDIA ModelOpt** NVFP4 checkpoints (the `nvidia/...` repos), pass `--quantization modelopt`. For **compressed-tensors** NVFP4 (Unsloth, llm-compressor community builds), vLLM **auto-detects** — do *not* pass `--quantization`. Getting this wrong is a common first-run failure.
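
In a launch script the rule can be written as a small lookup. This is only a sketch of the heuristic stated above (repos under `nvidia/` are treated as ModelOpt, everything else as auto-detected). `quant_flag` is a hypothetical helper, and the model card remains the authority for any specific checkpoint.

```bash
# quant_flag REPO_ID -> extra vLLM flag for that checkpoint, or nothing
quant_flag() {
  case "$1" in
    nvidia/*) printf '%s\n' '--quantization modelopt' ;;  # ModelOpt NVFP4, per the rule above
    *)        : ;;                                         # compressed-tensors: auto-detected
  esac
}

quant_flag nvidia/Gemma-4-31B-IT-NVFP4   # prints: --quantization modelopt
quant_flag unsloth/Qwen3.6-27B-NVFP4     # prints nothing
```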

---

## 7. Per-model configurations

Grouped by family: **Qwen → Gemma → NVIDIA → Mistral & other**. Where a model card publishes benchmarks, they are charted inline. All commands assume `~/vllm` is your model store.

### Quick comparison

| Model | Publisher | Arch | Total / Active | Quant | Image | Key flags |
|---|---|---|---|---|---|---|
| Qwen3.6-35B-A3B-NVFP4 | NVIDIA (Qwen) | MoE+hybrid attn, multimodal | 35B / 3B | NVFP4 (modelopt) | `vllm-openai:nightly` | env vars, MTP-3, `--quantization modelopt` |
| Qwen3.6-27B-NVFP4 | Unsloth (Qwen) | dense + vision | 27B | NVFP4 (c-t) | `vllm-openai:nightly` | `--dtype bfloat16`, trust-remote-code |
| Qwen3.5-35B-A3B | Qwen (community) | MoE (SSM+MoE) | 35B / 3B | BF16 | `cu130-nightly` | qwen3_coder; **no chunked-prefill** |
| Gemma-4-31B-IT-NVFP4 | NVIDIA (Gemma) | dense, multimodal | 30.7B | NVFP4 (modelopt) | `vllm-openai:gemma4-cu130` | `--quantization modelopt`, TP=1 |
| Gemma-4-26B-A4B-NVFP4 | NVIDIA (Gemma) | MoE, multimodal | 25.2B / 3.8B | NVFP4 (modelopt) | `vllm-openai:gemma4-cu130` | **TP=1 only**, gemma4 parsers |
| gemma-4-12B-coder MTP-NVFP4 | community (Gemma) | dense coder + MTP draft | 12B | NVFP4 (c-t) | `vllm-openai:nightly` | bundled MTP, `kv-cache fp8`, thinking-on |
| DiffusionGemma-26B-A4B-NVFP4 | NVIDIA (Gemma) | diffusion MoE | 25.2B / 3.8B | NVFP4 (modelopt) | `vllm-openai:gemma` | V2 runner, TRITON_ATTN |
| Nemotron-3-Nano-30B-A3B-NVFP4 | NVIDIA | hybrid Mamba-2 + MoE | 30B / 3.5B | NVFP4 + fp8 KV | `nvcr.io vllm:25.12.post1` | nano_v3 parser, **Spark-tested** |
| Nemotron-3-Super-120B-A12B-NVFP4 | NVIDIA | hybrid Mamba-Tx MoE | 120B / 12B | NVFP4 | `cu130-nightly` | MTP, reasoning nemotron_v3 |
| Mistral-Small-4-119B-NVFP4 | Mistral AI | MoE, multimodal | 119B / 6.5B | NVFP4 (c-t) | `cu130-nightly` | MLA, mistral parsers, SYSTEM_PROMPT |
| gpt-oss-120b | OpenAI | MoE | 120B | MXFP4 | `nvcr.io vllm:26.04` | expert-parallel |
| Llama-3.3-70B-Instruct-NVFP4 | NVIDIA (Meta) | dense | 70B | NVFP4 | `nvcr.io vllm:26.04` | gated, bandwidth-bound |
| Phi-4-multimodal-NVFP4 | NVIDIA (Microsoft) | dense, multimodal | — | NVFP4 | `nvcr.io vllm:26.04` | trust-remote-code |

*(c-t = compressed-tensors, auto-detected by vLLM; env vars = the four `-e` environment variables set in §7.1.1, including `CUTE_DSL_ARCH=sm_121a`.)*

---

### 7.1 Qwen family

#### 7.1.1 Qwen3.6-35B-A3B-NVFP4 — flagship MoE agent (official Spark recipe)

MoE with hybrid attention, **35B total / 3B active**, multimodal (text/image/video), 262K context, Apache 2.0, ~19 GB NVFP4. This is NVIDIA's recommended Spark configuration verbatim — note the **four required env vars** and `--quantization modelopt`.

```bash
export HF_TOKEN=hf_xxx
docker run -d --name qwen36-35b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_FP8_MOE_BACKEND=flashinfer_cutlass \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e CUTE_DSL_ARCH=sm_121a \
  vllm/vllm-openai:nightly \
  vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 --port 8000 \
    --tensor-parallel-size 1 --trust-remote-code --dtype auto \
    --quantization modelopt --kv-cache-dtype fp8 \
    --attention-backend flashinfer --moe-backend marlin \
    --gpu-memory-utilization 0.85 --max-model-len 65536 \
    --max-num-seqs 4 --max-num-batched-tokens 8192 \
    --enable-chunked-prefill --async-scheduling --enable-prefix-caching \
    --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}' \
    --reasoning-parser qwen3
```

> [!CAUTION]
> Earlier community recipes floated `--gpu-memory-utilization 0.4` and `--max-model-len 262144` for this model. NVIDIA's **official** card uses **0.85 / 65536** *plus the four env vars above* — use the official values. For agent/tool use add `--enable-auto-tool-choice --tool-call-parser qwen3_coder`.

<svg viewBox="0 0 760 448" xmlns="http://www.w3.org/2000/svg" font-family="ui-sans-serif,system-ui,-apple-system,Segoe UI,Roboto,Helvetica,Arial,sans-serif" role="img" aria-label="Qwen3.6-35B-A3B-NVFP4 — accuracy retained vs BF16" style="max-width:100%;height:auto"><rect x="0" y="0" width="760" height="448" rx="12" fill="#0b0f14"/><text x="22" y="32" fill="#e5e7eb" font-size="17" font-weight="700">Qwen3.6-35B-A3B-NVFP4 — accuracy retained vs BF16</text><text x="22" y="52" fill="#9ca3af" font-size="12">NVIDIA card, vLLM on GB300. NVFP4 holds full-precision quality across 8 benchmarks.</text><rect x="22" y="59" width="13" height="13" rx="3" fill="#94a3b8"/><text x="41" y="70" fill="#cbd5e1" font-size="12.5">BF16 baseline</text><rect x="156" y="59" width="13" height="13" rx="3" fill="#76b900"/><text x="175" y="70" fill="#cbd5e1" font-size="12.5">NVFP4</text><line x1="309.2" y1="76" x2="309.2" y2="428" stroke="#1f2933" stroke-width="1"/><text x="309.2" y="72" fill="#475569" font-size="10" text-anchor="middle">25</text><line x1="433.5" y1="76" x2="433.5" y2="428" stroke="#1f2933" stroke-width="1"/><text x="433.5" y="72" fill="#475569" font-size="10" text-anchor="middle">50</text><line x1="557.8" y1="76" x2="557.8" y2="428" stroke="#1f2933" stroke-width="1"/><text x="557.8" y="72" fill="#475569" font-size="10" text-anchor="middle">75</text><line x1="682.0" y1="76" x2="682.0" y2="428" stroke="#1f2933" stroke-width="1"/><text x="682.0" y="72" fill="#475569" font-size="10" text-anchor="middle">100</text><text x="173" y="101.5" fill="#cbd5e1" font-size="12" text-anchor="end">MMLU Pro</text><rect x="185" y="82" width="425.4" height="14" rx="3" fill="#94a3b8"/><text x="616.4" y="93.5" fill="#e5e7eb" font-size="11" font-weight="600">85.6%</text><rect x="185" y="99" width="422.4" height="14" rx="3" fill="#76b900"/><text x="613.5" y="110.5" fill="#e5e7eb" font-size="11" font-weight="600">85.0%</text><text x="173" y="146.5" fill="#cbd5e1" font-size="12" text-anchor="end">GPQA 
Diamond</text><rect x="185" y="127" width="422.0" height="14" rx="3" fill="#94a3b8"/><text x="613.0" y="138.5" fill="#e5e7eb" font-size="11" font-weight="600">84.9%</text><rect x="185" y="144" width="421.5" height="14" rx="3" fill="#76b900"/><text x="612.5" y="155.5" fill="#e5e7eb" font-size="11" font-weight="600">84.8%</text><text x="173" y="191.5" fill="#cbd5e1" font-size="12" text-anchor="end">τ²-Bench Tel.</text><rect x="185" y="172" width="474.6" height="14" rx="3" fill="#94a3b8"/><text x="665.6" y="183.5" fill="#e5e7eb" font-size="11" font-weight="600">95.5%</text><rect x="185" y="189" width="470.7" height="14" rx="3" fill="#76b900"/><text x="661.7" y="200.5" fill="#e5e7eb" font-size="11" font-weight="600">94.7%</text><text x="173" y="236.5" fill="#cbd5e1" font-size="12" text-anchor="end">SciCode</text><rect x="185" y="217" width="202.8" height="14" rx="3" fill="#94a3b8"/><text x="393.8" y="228.5" fill="#e5e7eb" font-size="11" font-weight="600">40.8%</text><rect x="185" y="234" width="201.8" height="14" rx="3" fill="#76b900"/><text x="392.8" y="245.5" fill="#e5e7eb" font-size="11" font-weight="600">40.6%</text><text x="173" y="281.5" fill="#cbd5e1" font-size="12" text-anchor="end">AIME 2025</text><rect x="185" y="262" width="443.3" height="14" rx="3" fill="#94a3b8"/><text x="634.3" y="273.5" fill="#e5e7eb" font-size="11" font-weight="600">89.2%</text><rect x="185" y="279" width="441.3" height="14" rx="3" fill="#76b900"/><text x="632.3" y="290.5" fill="#e5e7eb" font-size="11" font-weight="600">88.8%</text><text x="173" y="326.5" fill="#cbd5e1" font-size="12" text-anchor="end">AA-LCR</text><rect x="185" y="307" width="308.1" height="14" rx="3" fill="#94a3b8"/><text x="499.1" y="318.5" fill="#e5e7eb" font-size="11" font-weight="600">62.0%</text><rect x="185" y="324" width="308.1" height="14" rx="3" fill="#76b900"/><text x="499.1" y="335.5" fill="#e5e7eb" font-size="11" font-weight="600">62.0%</text><text x="173" y="371.5" fill="#cbd5e1" font-size="12" 
text-anchor="end">IFBench</text><rect x="185" y="352" width="309.6" height="14" rx="3" fill="#94a3b8"/><text x="500.6" y="363.5" fill="#e5e7eb" font-size="11" font-weight="600">62.3%</text><rect x="185" y="369" width="312.1" height="14" rx="3" fill="#76b900"/><text x="503.1" y="380.5" fill="#e5e7eb" font-size="11" font-weight="600">62.8%</text><text x="173" y="416.5" fill="#cbd5e1" font-size="12" text-anchor="end">MMMU Pro</text><rect x="185" y="397" width="368.3" height="14" rx="3" fill="#94a3b8"/><text x="559.3" y="408.5" fill="#e5e7eb" font-size="11" font-weight="600">74.1%</text><rect x="185" y="414" width="370.3" height="14" rx="3" fill="#76b900"/><text x="561.3" y="425.5" fill="#e5e7eb" font-size="11" font-weight="600">74.5%</text></svg>

#### 7.1.2 Qwen3.6-27B-NVFP4 — flagship-level coding in a 27B dense model (Unsloth)

Dense **27B** with a vision encoder (text/image/video), hybrid Gated-DeltaNet + Gated-Attention layers, **MTP-trained**, 262K context (extensible to ~1M via YaRN), Apache 2.0, ~18 GB NVFP4 (compressed-tensors → **no `--quantization` flag**).

```bash
docker run -d --name qwen36-27b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:nightly \
  vllm serve unsloth/Qwen3.6-27B-NVFP4 \
    --trust-remote-code --dtype bfloat16 \
    --gpu-memory-utilization 0.85 --max-model-len 65536 \
    --max-num-seqs 4 --reasoning-parser qwen3
```

> [!NOTE]
> The card's default `--max-model-len` is a conservative **4096** ("increase only after checking memory"). 65536 is comfortable on Spark; raise it to 128K or more when you have memory headroom, since the model leans on long context for thinking. For coding agents add `--enable-auto-tool-choice --tool-call-parser qwen3_coder`. MTP is supported by the architecture, so you can try `--speculative-config '{"method":"mtp","num_speculative_tokens":3}'` and validate the acceptance rate on your build. Thinking is **on by default**; pass `chat_template_kwargs={"enable_thinking":false}` (or `{"preserve_thinking":true}` for agents) per request.
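
For the per-request thinking toggle, the request body carries `chat_template_kwargs` at the top level. The sketch below builds such a body. `chat_payload` is a hypothetical helper, and it assumes the model name and prompt contain no characters that need JSON escaping.

```bash
# chat_payload MODEL PROMPT THINKING(true|false) -> JSON body for /v1/chat/completions
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"chat_template_kwargs":{"enable_thinking":%s}}\n' \
    "$1" "$2" "$3"
}

chat_payload unsloth/Qwen3.6-27B-NVFP4 "12*17" false

# Send it once the server is running:
# curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
#   -d "$(chat_payload unsloth/Qwen3.6-27B-NVFP4 '12*17' false)"
```

For prompts with quotes or newlines, build the body with a JSON-aware tool such as `jq` instead of `printf`.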

<svg viewBox="0 0 760 336" xmlns="http://www.w3.org/2000/svg" font-family="ui-sans-serif,system-ui,-apple-system,Segoe UI,Roboto,Helvetica,Arial,sans-serif" role="img" aria-label="Qwen3.6-27B — flagship-level coding in a 27B dense model" style="max-width:100%;height:auto"><rect x="0" y="0" width="760" height="336" rx="12" fill="#0b0f14"/><text x="22" y="32" fill="#e5e7eb" font-size="17" font-weight="700">Qwen3.6-27B — flagship-level coding in a 27B dense model</text><text x="22" y="52" fill="#9ca3af" font-size="12">Qwen card BF16 reference benchmarks (NVFP4 retains ~99%). Higher is better.</text><rect x="22" y="59" width="13" height="13" rx="3" fill="#76b900"/><text x="41" y="70" fill="#cbd5e1" font-size="12.5">Qwen3.6-27B</text><rect x="142" y="59" width="13" height="13" rx="3" fill="#38bdf8"/><text x="161" y="70" fill="#cbd5e1" font-size="12.5">Qwen3.5-27B</text><rect x="262" y="59" width="13" height="13" rx="3" fill="#f59e0b"/><text x="281" y="70" fill="#cbd5e1" font-size="12.5">Gemma-4-31B</text><line x1="309.2" y1="76" x2="309.2" y2="316" stroke="#1f2933" stroke-width="1"/><text x="309.2" y="72" fill="#475569" font-size="10" text-anchor="middle">25</text><line x1="433.5" y1="76" x2="433.5" y2="316" stroke="#1f2933" stroke-width="1"/><text x="433.5" y="72" fill="#475569" font-size="10" text-anchor="middle">50</text><line x1="557.8" y1="76" x2="557.8" y2="316" stroke="#1f2933" stroke-width="1"/><text x="557.8" y="72" fill="#475569" font-size="10" text-anchor="middle">75</text><line x1="682.0" y1="76" x2="682.0" y2="316" stroke="#1f2933" stroke-width="1"/><text x="682.0" y="72" fill="#475569" font-size="10" text-anchor="middle">100</text><text x="173" y="110.0" fill="#cbd5e1" font-size="12" text-anchor="end">SWE-bench Verified</text><rect x="185" y="82" width="383.7" height="14" rx="3" fill="#76b900"/><text x="574.7" y="93.5" fill="#e5e7eb" font-size="11" font-weight="600">77.2%</text><rect x="185" y="99" width="372.8" height="14" rx="3" fill="#38bdf8"/><text 
x="563.8" y="110.5" fill="#e5e7eb" font-size="11" font-weight="600">75.0%</text><rect x="185" y="116" width="258.4" height="14" rx="3" fill="#f59e0b"/><text x="449.4" y="127.5" fill="#e5e7eb" font-size="11" font-weight="600">52.0%</text><text x="173" y="172.0" fill="#cbd5e1" font-size="12" text-anchor="end">SWE-bench Pro</text><rect x="185" y="144" width="265.9" height="14" rx="3" fill="#76b900"/><text x="456.9" y="155.5" fill="#e5e7eb" font-size="11" font-weight="600">53.5%</text><rect x="185" y="161" width="254.5" height="14" rx="3" fill="#38bdf8"/><text x="445.5" y="172.5" fill="#e5e7eb" font-size="11" font-weight="600">51.2%</text><rect x="185" y="178" width="177.4" height="14" rx="3" fill="#f59e0b"/><text x="368.4" y="189.5" fill="#e5e7eb" font-size="11" font-weight="600">35.7%</text><text x="173" y="234.0" fill="#cbd5e1" font-size="12" text-anchor="end">LiveCodeBench v6</text><rect x="185" y="206" width="417.0" height="14" rx="3" fill="#76b900"/><text x="608.0" y="217.5" fill="#e5e7eb" font-size="11" font-weight="600">83.9%</text><rect x="185" y="223" width="401.1" height="14" rx="3" fill="#38bdf8"/><text x="592.1" y="234.5" fill="#e5e7eb" font-size="11" font-weight="600">80.7%</text><rect x="185" y="240" width="397.6" height="14" rx="3" fill="#f59e0b"/><text x="588.6" y="251.5" fill="#e5e7eb" font-size="11" font-weight="600">80.0%</text><text x="173" y="296.0" fill="#cbd5e1" font-size="12" text-anchor="end">AIME 2026</text><rect x="185" y="268" width="467.7" height="14" rx="3" fill="#76b900"/><text x="658.7" y="279.5" fill="#e5e7eb" font-size="11" font-weight="600">94.1%</text><rect x="185" y="285" width="460.2" height="14" rx="3" fill="#38bdf8"/><text x="651.2" y="296.5" fill="#e5e7eb" font-size="11" font-weight="600">92.6%</text><rect x="185" y="302" width="443.3" height="14" rx="3" fill="#f59e0b"/><text x="634.3" y="313.5" fill="#e5e7eb" font-size="11" font-weight="600">89.2%</text></svg>

> [!TIP]
> **Beyond 262K (YaRN, up to ~1M).** This checkpoint is native to 262,144 tokens; for longer contexts enable static YaRN via an `--hf-overrides` block and the long-len escape hatch (mind that static YaRN can slightly dent short-context quality — only enable it when you actually need the length):
> ```bash
>   -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
>   ... vllm serve unsloth/Qwen3.6-27B-NVFP4 --trust-remote-code --dtype bfloat16 \
>     --hf-overrides '{"text_config":{"rope_parameters":{"mrope_interleaved":true,"mrope_section":[11,11,10],"rope_type":"yarn","rope_theta":10000000,"partial_rotary_factor":0.25,"factor":4.0,"original_max_position_embeddings":262144}}}' \
>     --max-model-len 1010000
> ```

#### 7.1.3 Qwen3.5-35B-A3B — coding-agent backend (community)

Popular for pointing local coding agents at Spark. Upstream nightly, BF16.

```bash
docker run -d --name qwen35 --ipc=host --restart unless-stopped \
  --gpus all --shm-size 64gb -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:cu130-nightly \
  Qwen/Qwen3.5-35B-A3B \
    --served-model-name qwen3.5-35b --host 0.0.0.0 --port 8000 \
    --dtype bfloat16 --gpu-memory-utilization 0.9 --max-model-len 262144 \
    --enable-prefix-caching --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder --reasoning-parser qwen3
```

> [!WARNING]
> **GB10 gotchas (community-verified):** do **not** add `--enable-chunked-prefill` (≈9× throughput regression on SSM+MoE), and do **not** add `--kv-cache-dtype fp8` (output-repetition loops on GB10) for *this* model. This is the opposite of the Qwen3.6-NVFP4 recipe — never copy flags across models. There's also a DFlash-accelerated dense sibling (Qwen3.5-27B) covered in §13.
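
Since these flags are easy to carry over by accident, a launch wrapper can refuse them before starting the container. The sketch below is one hedged way to do that. `qwen35_flag_check` is a hypothetical helper that encodes only the two warnings above, and only for this model.

```bash
# qwen35_flag_check ARGS...
# Returns 1 and names the offender if ARGS contain a flag this guide warns
# against for Qwen3.5-35B-A3B on GB10; returns 0 otherwise.
qwen35_flag_check() {
  prev=""
  for arg in "$@"; do
    case "$arg" in
      --enable-chunked-prefill|--kv-cache-dtype=fp8)
        echo "avoid for this model: $arg"; return 1 ;;
      fp8)
        if [ "$prev" = "--kv-cache-dtype" ]; then
          echo "avoid for this model: $prev $arg"; return 1
        fi ;;
    esac
    prev=$arg
  done
  return 0
}

# Example: qwen35_flag_check --dtype bfloat16 --enable-prefix-caching && echo "flags look fine"
```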

---

### 7.2 Gemma family

#### 7.2.1 Gemma-4-31B-IT-NVFP4 — dense multimodal (NVIDIA)

Dense **30.7B**, multimodal (text/image/video), 256K context, 140+ languages, hybrid local/global attention with p-RoPE, NVIDIA Open Model License (+ Gemma terms), ~21 GB NVFP4.

```bash
export HF_TOKEN=hf_xxx
docker run -d --name gemma4-31b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:gemma4-cu130 \
  vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
    --quantization modelopt --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4
```

> [!NOTE]
> The card uses `--tensor-parallel-size 8` on a server — on Spark use **TP=1**. Gated: accept the Gemma license + set `HF_TOKEN`. For tools/reasoning add `--enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4`.

<svg viewBox="0 0 760 358" xmlns="http://www.w3.org/2000/svg" font-family="ui-sans-serif,system-ui,-apple-system,Segoe UI,Roboto,Helvetica,Arial,sans-serif" role="img" aria-label="Gemma-4-31B-IT-NVFP4 — accuracy retained vs full precision" style="max-width:100%;height:auto"><rect x="0" y="0" width="760" height="358" rx="12" fill="#0b0f14"/><text x="22" y="32" fill="#e5e7eb" font-size="17" font-weight="700">Gemma-4-31B-IT-NVFP4 — accuracy retained vs full precision</text><text x="22" y="52" fill="#9ca3af" font-size="12">NVIDIA card, vLLM on H100. NVFP4 within &lt;0.5 pts of baseline.</text><rect x="22" y="59" width="13" height="13" rx="3" fill="#94a3b8"/><text x="41" y="70" fill="#cbd5e1" font-size="12.5">Baseline</text><rect x="120" y="59" width="13" height="13" rx="3" fill="#76b900"/><text x="139" y="70" fill="#cbd5e1" font-size="12.5">NVFP4</text><line x1="309.2" y1="76" x2="309.2" y2="338" stroke="#1f2933" stroke-width="1"/><text x="309.2" y="72" fill="#475569" font-size="10" text-anchor="middle">25</text><line x1="433.5" y1="76" x2="433.5" y2="338" stroke="#1f2933" stroke-width="1"/><text x="433.5" y="72" fill="#475569" font-size="10" text-anchor="middle">50</text><line x1="557.8" y1="76" x2="557.8" y2="338" stroke="#1f2933" stroke-width="1"/><text x="557.8" y="72" fill="#475569" font-size="10" text-anchor="middle">75</text><line x1="682.0" y1="76" x2="682.0" y2="338" stroke="#1f2933" stroke-width="1"/><text x="682.0" y="72" fill="#475569" font-size="10" text-anchor="middle">100</text><text x="173" y="101.5" fill="#cbd5e1" font-size="12" text-anchor="end">GPQA Diamond</text><rect x="185" y="82" width="376.3" height="14" rx="3" fill="#94a3b8"/><text x="567.3" y="93.5" fill="#e5e7eb" font-size="11" font-weight="600">75.71%</text><rect x="185" y="99" width="375.0" height="14" rx="3" fill="#76b900"/><text x="566.0" y="110.5" fill="#e5e7eb" font-size="11" font-weight="600">75.46%</text><text x="173" y="146.5" fill="#cbd5e1" font-size="12" text-anchor="end">AIME 
2025</text><rect x="185" y="127" width="329.3" height="14" rx="3" fill="#94a3b8"/><text x="520.3" y="138.5" fill="#e5e7eb" font-size="11" font-weight="600">66.25%</text><rect x="185" y="144" width="327.7" height="14" rx="3" fill="#76b900"/><text x="518.7" y="155.5" fill="#e5e7eb" font-size="11" font-weight="600">65.94%</text><text x="173" y="191.5" fill="#cbd5e1" font-size="12" text-anchor="end">MMLU Pro</text><rect x="185" y="172" width="423.7" height="14" rx="3" fill="#94a3b8"/><text x="614.7" y="183.5" fill="#e5e7eb" font-size="11" font-weight="600">85.25%</text><rect x="185" y="189" width="422.2" height="14" rx="3" fill="#76b900"/><text x="613.2" y="200.5" fill="#e5e7eb" font-size="11" font-weight="600">84.94%</text><text x="173" y="236.5" fill="#cbd5e1" font-size="12" text-anchor="end">LiveCodeBench</text><rect x="185" y="217" width="352.4" height="14" rx="3" fill="#94a3b8"/><text x="543.4" y="228.5" fill="#e5e7eb" font-size="11" font-weight="600">70.9%</text><rect x="185" y="234" width="351.0" height="14" rx="3" fill="#76b900"/><text x="542.0" y="245.5" fill="#e5e7eb" font-size="11" font-weight="600">70.63%</text><text x="173" y="281.5" fill="#cbd5e1" font-size="12" text-anchor="end">SciCode</text><rect x="185" y="262" width="167.0" height="14" rx="3" fill="#94a3b8"/><text x="358.0" y="273.5" fill="#e5e7eb" font-size="11" font-weight="600">33.61%</text><rect x="185" y="279" width="164.9" height="14" rx="3" fill="#76b900"/><text x="355.9" y="290.5" fill="#e5e7eb" font-size="11" font-weight="600">33.18%</text><text x="173" y="326.5" fill="#cbd5e1" font-size="12" text-anchor="end">Term-Bench Hard</text><rect x="185" y="307" width="134.6" height="14" rx="3" fill="#94a3b8"/><text x="325.6" y="318.5" fill="#e5e7eb" font-size="11" font-weight="600">27.08%</text><rect x="185" y="324" width="134.6" height="14" rx="3" fill="#76b900"/><text x="325.6" y="335.5" fill="#e5e7eb" font-size="11" font-weight="600">27.08%</text></svg>

#### 7.2.2 Gemma-4-26B-A4B-NVFP4 — MoE multimodal (NVIDIA)

MoE **25.2B total / 3.8B active** (8 of 128 experts + 1 shared), multimodal, 256K context, ~14 GB NVFP4: a small, fast, high-quality Spark pick.

```bash
docker run -d --name gemma4-26b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:gemma4-cu130 \
  vllm serve nvidia/Gemma-4-26B-A4B-NVFP4 \
    --quantization modelopt --tensor-parallel-size 1 --moe-backend marlin \
    --trust-remote-code \
    --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 \
    --gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4
```

> [!WARNING]
> Per the card, this checkpoint currently works with **TP=1 only** in vLLM (expert-parallel is supported; tensor-parallel is not yet). MoE backend must be **VLLM_CUTLASS or Marlin** (FlashInfer-TRTLLM is pending a vLLM PR). Needs `--trust-remote-code`.

<svg viewBox="0 0 760 358" xmlns="http://www.w3.org/2000/svg" font-family="ui-sans-serif,system-ui,-apple-system,Segoe UI,Roboto,Helvetica,Arial,sans-serif" role="img" aria-label="Gemma-4-26B-A4B-NVFP4 — accuracy retained vs full precision" style="max-width:100%;height:auto"><rect x="0" y="0" width="760" height="358" rx="12" fill="#0b0f14"/><text x="22" y="32" fill="#e5e7eb" font-size="17" font-weight="700">Gemma-4-26B-A4B-NVFP4 — accuracy retained vs full precision</text><text x="22" y="52" fill="#9ca3af" font-size="12">NVIDIA card, vLLM on B200. 25.2B total / 3.8B active MoE.</text><rect x="22" y="59" width="13" height="13" rx="3" fill="#94a3b8"/><text x="41" y="70" fill="#cbd5e1" font-size="12.5">Baseline</text><rect x="120" y="59" width="13" height="13" rx="3" fill="#76b900"/><text x="139" y="70" fill="#cbd5e1" font-size="12.5">NVFP4</text><line x1="309.2" y1="76" x2="309.2" y2="338" stroke="#1f2933" stroke-width="1"/><text x="309.2" y="72" fill="#475569" font-size="10" text-anchor="middle">25</text><line x1="433.5" y1="76" x2="433.5" y2="338" stroke="#1f2933" stroke-width="1"/><text x="433.5" y="72" fill="#475569" font-size="10" text-anchor="middle">50</text><line x1="557.8" y1="76" x2="557.8" y2="338" stroke="#1f2933" stroke-width="1"/><text x="557.8" y="72" fill="#475569" font-size="10" text-anchor="middle">75</text><line x1="682.0" y1="76" x2="682.0" y2="338" stroke="#1f2933" stroke-width="1"/><text x="682.0" y="72" fill="#475569" font-size="10" text-anchor="middle">100</text><text x="173" y="101.5" fill="#cbd5e1" font-size="12" text-anchor="end">GPQA Diamond</text><rect x="185" y="82" width="399.1" height="14" rx="3" fill="#94a3b8"/><text x="590.1" y="93.5" fill="#e5e7eb" font-size="11" font-weight="600">80.3%</text><rect x="185" y="99" width="397.1" height="14" rx="3" fill="#76b900"/><text x="588.1" y="110.5" fill="#e5e7eb" font-size="11" font-weight="600">79.9%</text><text x="173" y="146.5" fill="#cbd5e1" font-size="12" text-anchor="end">AIME 
2025</text><rect x="185" y="127" width="442.1" height="14" rx="3" fill="#94a3b8"/><text x="633.1" y="138.5" fill="#e5e7eb" font-size="11" font-weight="600">88.95%</text><rect x="185" y="144" width="447.3" height="14" rx="3" fill="#76b900"/><text x="638.3" y="155.5" fill="#e5e7eb" font-size="11" font-weight="600">90.0%</text><text x="173" y="191.5" fill="#cbd5e1" font-size="12" text-anchor="end">MMLU Pro</text><rect x="185" y="172" width="422.4" height="14" rx="3" fill="#94a3b8"/><text x="613.5" y="183.5" fill="#e5e7eb" font-size="11" font-weight="600">85.0%</text><rect x="185" y="189" width="421.5" height="14" rx="3" fill="#76b900"/><text x="612.5" y="200.5" fill="#e5e7eb" font-size="11" font-weight="600">84.8%</text><text x="173" y="236.5" fill="#cbd5e1" font-size="12" text-anchor="end">LiveCodeBench</text><rect x="185" y="217" width="400.1" height="14" rx="3" fill="#94a3b8"/><text x="591.1" y="228.5" fill="#e5e7eb" font-size="11" font-weight="600">80.5%</text><rect x="185" y="234" width="396.6" height="14" rx="3" fill="#76b900"/><text x="587.6" y="245.5" fill="#e5e7eb" font-size="11" font-weight="600">79.8%</text><text x="173" y="281.5" fill="#cbd5e1" font-size="12" text-anchor="end">IFBench</text><rect x="185" y="262" width="386.5" height="14" rx="3" fill="#94a3b8"/><text x="577.5" y="273.5" fill="#e5e7eb" font-size="11" font-weight="600">77.77%</text><rect x="185" y="279" width="388.2" height="14" rx="3" fill="#76b900"/><text x="579.2" y="290.5" fill="#e5e7eb" font-size="11" font-weight="600">78.1%</text><text x="173" y="326.5" fill="#cbd5e1" font-size="12" text-anchor="end">IFEval</text><rect x="185" y="307" width="480.1" height="14" rx="3" fill="#94a3b8"/><text x="671.1" y="318.5" fill="#e5e7eb" font-size="11" font-weight="600">96.6%</text><rect x="185" y="324" width="479.1" height="14" rx="3" fill="#76b900"/><text x="670.1" y="335.5" fill="#e5e7eb" font-size="11" font-weight="600">96.4%</text></svg>

#### 7.2.3 gemma-4-12B-coder (MTP-NVFP4) — Python specialist with a bundled MTP draft (community)

A weight-only NVFP4 (W4A16) build of a Gemma-4-12B coding fine-tune: **8.25 GB** of weights plus a **0.85 GB bundled MTP draft** that gives ~1.6× single-stream decode speed. vLLM auto-detects the NVFP4 format (**no `--quantization`**). Because the draft lives in the checkpoint's `assistant/` subdirectory, download the repo to a local path and mount it.

```bash
# download (~9 GB total) into ~/vllm
hf download sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4 \
  --local-dir ~/vllm/gemma4-coder

# easiest: one GPU, just chat
docker run -d --name gemma4-coder --ipc=host --shm-size 16gb --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -v ~/vllm/gemma4-coder:/model:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-coder \
  --max-model-len 16384 --gpu-memory-utilization 0.92 --trust-remote-code
```

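Add the bundled **MTP draft** for ~1.6× interactive speed (lossless) by appending these flags to the `docker run` command above (put a trailing `\` on its last line first):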

```bash
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","model":"/model/assistant","num_speculative_tokens":3}'
```

> [!IMPORTANT]
> This model was **trained to think first**: enable thinking **per request** or quality drops. With the OpenAI Python client pass `extra_body={"chat_template_kwargs": {"enable_thinking": True}}`; in a raw JSON body the equivalent is `"chat_template_kwargs":{"enable_thinking":true}`. Needs a recent **nightly** (registers `Gemma4UnifiedForConditionalGeneration`). With MTP you **must** use `--kv-cache-dtype fp8` (NVFP4 KV breaks the draft).

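As a rough sketch of the per-request switch, the helper below builds a chat-completions body with `chat_template_kwargs.enable_thinking` set and shows how it could be piped to `curl`. The function name `think_request`, the `localhost:8000` endpoint, and the naive `printf` templating (which assumes the prompt contains no double quotes or backslashes) are illustrative assumptions, not something the model card prescribes.

```bash
# Build a chat request body with thinking enabled for this one request.
# think_request is a hypothetical helper; printf does no JSON escaping,
# so the prompt must not contain double quotes or backslashes.
# Args: served model name, prompt, optional max_tokens (default 1024).
think_request() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"chat_template_kwargs":{"enable_thinking":true},"max_tokens":%s}\n' \
    "$1" "$2" "${3:-1024}"
}

# Usage against the container started above (served name gemma4-coder):
#   think_request gemma4-coder "Merge two sorted lists in Python." \
#     | curl -sS http://localhost:8000/v1/chat/completions \
#         -H "Content-Type: application/json" -d @-
```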
> [!CAUTION]
> The build is **de-refused / not safety-aligned** — add your own guardrails. It's a superb algorithm/debug assistant but can write **look-ahead bias** into pandas/numpy time-series & back-test code (its reasoning sometimes states the right rule while the code does the opposite). Gate quant/accounting code; don't ship it unreviewed.

<svg viewBox="0 0 760 144" xmlns="http://www.w3.org/2000/svg" font-family="ui-sans-serif,system-ui,-apple-system,Segoe UI,Roboto,Helvetica,Arial,sans-serif" role="img" aria-label="gemma-4-12B-coder (MTP-NVFP4) — independent eval, greedy pass@1" style="max-width:100%;height:auto"><rect x="0" y="0" width="760" height="144" rx="12" fill="#0b0f14"/><text x="22" y="32" fill="#e5e7eb" font-size="17" font-weight="700">gemma-4-12B-coder (MTP-NVFP4) — independent eval, greedy pass@1</text><text x="22" y="52" fill="#9ca3af" font-size="12">8.25 GB Python/algorithm specialist. NVFP4 build = Q8 source parity (96%=96% on HumanEval[:50]).</text><rect x="22" y="59" width="13" height="13" rx="3" fill="#76b900"/><text x="41" y="70" fill="#cbd5e1" font-size="12.5">pass@1</text><line x1="309.2" y1="76" x2="309.2" y2="124" stroke="#1f2933" stroke-width="1"/><text x="309.2" y="72" fill="#475569" font-size="10" text-anchor="middle">25</text><line x1="433.5" y1="76" x2="433.5" y2="124" stroke="#1f2933" stroke-width="1"/><text x="433.5" y="72" fill="#475569" font-size="10" text-anchor="middle">50</text><line x1="557.8" y1="76" x2="557.8" y2="124" stroke="#1f2933" stroke-width="1"/><text x="557.8" y="72" fill="#475569" font-size="10" text-anchor="middle">75</text><line x1="682.0" y1="76" x2="682.0" y2="124" stroke="#1f2933" stroke-width="1"/><text x="682.0" y="72" fill="#475569" font-size="10" text-anchor="middle">100</text><text x="173" y="93.0" fill="#cbd5e1" font-size="12" text-anchor="end">HumanEval</text><rect x="185" y="82" width="448.3" height="14" rx="3" fill="#76b900"/><text x="639.3" y="93.5" fill="#e5e7eb" font-size="11" font-weight="600">90.2%</text><text x="173" y="121.0" fill="#cbd5e1" font-size="12" text-anchor="end">MBPP</text><rect x="185" y="110" width="425.9" height="14" rx="3" fill="#76b900"/><text x="616.9" y="121.5" fill="#e5e7eb" font-size="11" font-weight="600">85.7%</text></svg>

#### 7.2.4 DiffusionGemma-26B-A4B-NVFP4 — discrete-diffusion text gen (NVIDIA)

A **diffusion** LLM on the Gemma-4 26B-A4B MoE backbone (25.2B / 3.8B active) that emits **parallel 256-token blocks**, exceeding **1,100 tok/s at low batch** (H100 FP8). Multimodal, 256K context, ~14 GB NVFP4. Uses the dedicated diffusion image.

```bash
docker run -d --name diffgemma --ipc=host --shm-size=16g --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_USE_V2_MODEL_RUNNER=1 \
  vllm/vllm-openai:gemma \
  vllm serve nvidia/diffusiongemma-26B-A4B-it-NVFP4 \
    --trust-remote-code --max-num-seqs 4 \
    --attention-backend TRITON_ATTN \
    --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 \
    --override-generation-config '{"max_new_tokens": null}' \
    --default-chat-template-kwargs '{"enable_thinking":true}'
```

> [!WARNING]
> The `vllm/vllm-openai:gemma` image and these flags are **tentative until the supporting vLLM image is publicly released** (per the card). Check the [vLLM releases](https://github.com/vllm-project/vllm/releases) and the `vllm/vllm-openai` tags on Docker Hub before relying on it.

<svg viewBox="0 0 760 403" xmlns="http://www.w3.org/2000/svg" font-family="ui-sans-serif,system-ui,-apple-system,Segoe UI,Roboto,Helvetica,Arial,sans-serif" role="img" aria-label="DiffusionGemma-26B-A4B-NVFP4 — accuracy retained (thinking on)" style="max-width:100%;height:auto"><rect x="0" y="0" width="760" height="403" rx="12" fill="#0b0f14"/><text x="22" y="32" fill="#e5e7eb" font-size="17" font-weight="700">DiffusionGemma-26B-A4B-NVFP4 — accuracy retained (thinking on)</text><text x="22" y="52" fill="#9ca3af" font-size="12">NVIDIA card, vLLM on B100. Diffusion sampling; &gt;1,100 tok/s at low batch (H100 FP8).</text><rect x="22" y="59" width="13" height="13" rx="3" fill="#94a3b8"/><text x="41" y="70" fill="#cbd5e1" font-size="12.5">Baseline</text><rect x="120" y="59" width="13" height="13" rx="3" fill="#76b900"/><text x="139" y="70" fill="#cbd5e1" font-size="12.5">NVFP4</text><line x1="309.2" y1="76" x2="309.2" y2="383" stroke="#1f2933" stroke-width="1"/><text x="309.2" y="72" fill="#475569" font-size="10" text-anchor="middle">25</text><line x1="433.5" y1="76" x2="433.5" y2="383" stroke="#1f2933" stroke-width="1"/><text x="433.5" y="72" fill="#475569" font-size="10" text-anchor="middle">50</text><line x1="557.8" y1="76" x2="557.8" y2="383" stroke="#1f2933" stroke-width="1"/><text x="557.8" y="72" fill="#475569" font-size="10" text-anchor="middle">75</text><line x1="682.0" y1="76" x2="682.0" y2="383" stroke="#1f2933" stroke-width="1"/><text x="682.0" y="72" fill="#475569" font-size="10" text-anchor="middle">100</text><text x="173" y="101.5" fill="#cbd5e1" font-size="12" text-anchor="end">GPQA Diamond</text><rect x="185" y="82" width="344.9" height="14" rx="3" fill="#94a3b8"/><text x="535.9" y="93.5" fill="#e5e7eb" font-size="11" font-weight="600">69.4%</text><rect x="185" y="99" width="340.9" height="14" rx="3" fill="#76b900"/><text x="531.9" y="110.5" fill="#e5e7eb" font-size="11" font-weight="600">68.6%</text><text x="173" y="146.5" fill="#cbd5e1" 
font-size="12" text-anchor="end">AIME 2025</text><rect x="185" y="127" width="339.6" height="14" rx="3" fill="#94a3b8"/><text x="530.6" y="138.5" fill="#e5e7eb" font-size="11" font-weight="600">68.33%</text><rect x="185" y="144" width="334.6" height="14" rx="3" fill="#76b900"/><text x="525.6" y="155.5" fill="#e5e7eb" font-size="11" font-weight="600">67.33%</text><text x="173" y="191.5" fill="#cbd5e1" font-size="12" text-anchor="end">GSM8K</text><rect x="185" y="172" width="469.9" height="14" rx="3" fill="#94a3b8"/><text x="660.9" y="183.5" fill="#e5e7eb" font-size="11" font-weight="600">94.54%</text><rect x="185" y="189" width="467.2" height="14" rx="3" fill="#76b900"/><text x="658.2" y="200.5" fill="#e5e7eb" font-size="11" font-weight="600">94.01%</text><text x="173" y="236.5" fill="#cbd5e1" font-size="12" text-anchor="end">IFEval</text><rect x="185" y="217" width="467.2" height="14" rx="3" fill="#94a3b8"/><text x="658.2" y="228.5" fill="#e5e7eb" font-size="11" font-weight="600">94.01%</text><rect x="185" y="234" width="470.0" height="14" rx="3" fill="#76b900"/><text x="661.0" y="245.5" fill="#e5e7eb" font-size="11" font-weight="600">94.56%</text><text x="173" y="281.5" fill="#cbd5e1" font-size="12" text-anchor="end">HumanEval</text><rect x="185" y="262" width="467.6" height="14" rx="3" fill="#94a3b8"/><text x="658.6" y="273.5" fill="#e5e7eb" font-size="11" font-weight="600">94.09%</text><rect x="185" y="279" width="472.1" height="14" rx="3" fill="#76b900"/><text x="663.1" y="290.5" fill="#e5e7eb" font-size="11" font-weight="600">95.0%</text><text x="173" y="326.5" fill="#cbd5e1" font-size="12" text-anchor="end">MMMLU 0-shot</text><rect x="185" y="307" width="439.8" height="14" rx="3" fill="#94a3b8"/><text x="630.8" y="318.5" fill="#e5e7eb" font-size="11" font-weight="600">88.5%</text><rect x="185" y="324" width="438.0" height="14" rx="3" fill="#76b900"/><text x="629.0" y="335.5" fill="#e5e7eb" font-size="11" font-weight="600">88.13%</text><text x="173" y="371.5" 
fill="#cbd5e1" font-size="12" text-anchor="end">MMLU Pro</text><rect x="185" y="352" width="402.6" height="14" rx="3" fill="#94a3b8"/><text x="593.6" y="363.5" fill="#e5e7eb" font-size="11" font-weight="600">81.0%</text><rect x="185" y="369" width="401.1" height="14" rx="3" fill="#76b900"/><text x="592.1" y="380.5" fill="#e5e7eb" font-size="11" font-weight="600">80.7%</text></svg>

---

### 7.3 NVIDIA models (proprietary)

Models NVIDIA itself designed and trained (not quantizations of someone else's weights). Both are **hybrid Mamba-2 + Transformer MoE** reasoning models with native function calling: the **Nano** is the natural single-Spark starting point, the **Super** is the heavyweight.

#### 7.3.1 NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 — best single-Spark Nemotron (Spark-tested)

A hybrid **Mamba-2 + MoE** (52 layers: 23 Mamba-2, 23 MoE with 128 experts +1 shared / 6 active, 6 GQA attention), **30B total / 3.5B active**, text-only, **1M context** (256K default), unified reasoning + non-reasoning, NVIDIA Nemotron Open Model License, ~18 GB on disk. NVFP4 weights **with FP8 KV cache**; attention and the Mamba layers feeding it stay BF16, and quantization-aware distillation (QAD) recovers accuracy. NVIDIA lists **DGX Spark** in this model's tested hardware and ships a Spark/Jetson-specific container.

```bash
export HF_TOKEN=hf_xxx
# one-time: fetch the custom reasoning parser into ~/vllm (mounted at /models)
wget -P ~/vllm \
  https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py

docker run -d --name nemotron-nano --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
    --served-model-name nemotron-3-nano --port 8000 \
    --tensor-parallel-size 1 --trust-remote-code \
    --max-model-len 262144 --max-num-seqs 8 \
    --kv-cache-dtype fp8 \
    --enable-auto-tool-choice --tool-call-parser qwen3_coder \
    --reasoning-parser-plugin /models/nano_v3_reasoning_parser.py \
    --reasoning-parser nano_v3
```

> [!IMPORTANT]
> On Spark (or Jetson Thor) use NVIDIA's **`nvcr.io/nvidia/vllm:25.12.post1-py3`** container for this model, and fetch the **`nano_v3_reasoning_parser.py`** plugin (downloaded into `~/vllm` above, referenced at `/models/...`). `--kv-cache-dtype fp8` is part of the **official** recipe here — unlike Qwen3.5, this model is built for it. `--max-num-seqs 8` is NVIDIA's Spark-tested value; drop to 4 for more KV headroom at long context.

> [!NOTE]
> Reasoning is **on by default** — pass `chat_template_kwargs={"enable_thinking":false}` per request to turn it off. The model also supports a **`reasoning_budget`** (cap internal reasoning tokens to hit latency targets). Sampling: `temperature=1.0, top_p=1.0` for reasoning; `0.6 / 0.95` for tool calling. For up to 1M context add `-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` and raise `--max-model-len`. With only **3.5B active params** and ~18 GB resident, decode is fast and KV headroom is large — an excellent default Spark model.

<svg viewBox="0 0 760 358" xmlns="http://www.w3.org/2000/svg" font-family="ui-sans-serif,system-ui,-apple-system,Segoe UI,Roboto,Helvetica,Arial,sans-serif" role="img" aria-label="Nemotron-3-Nano-30B-A3B — accuracy retained vs BF16" style="max-width:100%;height:auto"><rect x="0" y="0" width="760" height="358" rx="12" fill="#0b0f14"/><text x="22" y="32" fill="#e5e7eb" font-size="17" font-weight="700">Nemotron-3-Nano-30B-A3B — accuracy retained vs BF16</text><text x="22" y="52" fill="#9ca3af" font-size="12">NVIDIA card (Nemo Evaluator). NVFP4 + FP8 KV after PTQ + quant-aware distillation; FP8 sits between.</text><rect x="22" y="59" width="13" height="13" rx="3" fill="#94a3b8"/><text x="41" y="70" fill="#cbd5e1" font-size="12.5">BF16 baseline</text><rect x="156" y="59" width="13" height="13" rx="3" fill="#76b900"/><text x="175" y="70" fill="#cbd5e1" font-size="12.5">NVFP4</text><line x1="309.2" y1="76" x2="309.2" y2="338" stroke="#1f2933" stroke-width="1"/><text x="309.2" y="72" fill="#475569" font-size="10" text-anchor="middle">25</text><line x1="433.5" y1="76" x2="433.5" y2="338" stroke="#1f2933" stroke-width="1"/><text x="433.5" y="72" fill="#475569" font-size="10" text-anchor="middle">50</text><line x1="557.8" y1="76" x2="557.8" y2="338" stroke="#1f2933" stroke-width="1"/><text x="557.8" y="72" fill="#475569" font-size="10" text-anchor="middle">75</text><line x1="682.0" y1="76" x2="682.0" y2="338" stroke="#1f2933" stroke-width="1"/><text x="682.0" y="72" fill="#475569" font-size="10" text-anchor="middle">100</text><text x="173" y="101.5" fill="#cbd5e1" font-size="12" text-anchor="end">MMLU-Pro</text><rect x="185" y="82" width="389.2" height="14" rx="3" fill="#94a3b8"/><text x="580.2" y="93.5" fill="#e5e7eb" font-size="11" font-weight="600">78.3%</text><rect x="185" y="99" width="384.7" height="14" rx="3" fill="#76b900"/><text x="575.7" y="110.5" fill="#e5e7eb" font-size="11" font-weight="600">77.4%</text><text x="173" y="146.5" fill="#cbd5e1" font-size="12" 
text-anchor="end">AIME 2025</text><rect x="185" y="127" width="442.8" height="14" rx="3" fill="#94a3b8"/><text x="633.8" y="138.5" fill="#e5e7eb" font-size="11" font-weight="600">89.1%</text><rect x="185" y="144" width="430.9" height="14" rx="3" fill="#76b900"/><text x="621.9" y="155.5" fill="#e5e7eb" font-size="11" font-weight="600">86.7%</text><text x="173" y="191.5" fill="#cbd5e1" font-size="12" text-anchor="end">GPQA</text><rect x="185" y="172" width="362.8" height="14" rx="3" fill="#94a3b8"/><text x="553.8" y="183.5" fill="#e5e7eb" font-size="11" font-weight="600">73.0%</text><rect x="185" y="189" width="357.3" height="14" rx="3" fill="#76b900"/><text x="548.3" y="200.5" fill="#e5e7eb" font-size="11" font-weight="600">71.9%</text><text x="173" y="236.5" fill="#cbd5e1" font-size="12" text-anchor="end">LiveCodeBench</text><rect x="185" y="217" width="339.5" height="14" rx="3" fill="#94a3b8"/><text x="530.5" y="228.5" fill="#e5e7eb" font-size="11" font-weight="600">68.3%</text><rect x="185" y="234" width="325.0" height="14" rx="3" fill="#76b900"/><text x="516.0" y="245.5" fill="#e5e7eb" font-size="11" font-weight="600">65.4%</text><text x="173" y="281.5" fill="#cbd5e1" font-size="12" text-anchor="end">TauBench V2 avg</text><rect x="185" y="262" width="243.5" height="14" rx="3" fill="#94a3b8"/><text x="434.5" y="273.5" fill="#e5e7eb" font-size="11" font-weight="600">49.0%</text><rect x="185" y="279" width="226.6" height="14" rx="3" fill="#76b900"/><text x="417.6" y="290.5" fill="#e5e7eb" font-size="11" font-weight="600">45.6%</text><text x="173" y="326.5" fill="#cbd5e1" font-size="12" text-anchor="end">IFBench</text><rect x="185" y="307" width="355.4" height="14" rx="3" fill="#94a3b8"/><text x="546.4" y="318.5" fill="#e5e7eb" font-size="11" font-weight="600">71.5%</text><rect x="185" y="324" width="351.4" height="14" rx="3" fill="#76b900"/><text x="542.4" y="335.5" fill="#e5e7eb" font-size="11" font-weight="600">70.7%</text></svg>

#### 7.3.2 Nemotron-3-Super-120B-A12B-NVFP4 — flagship local MoE

NVIDIA's own hybrid **Mamba-Transformer LatentMoE**, **120B total / 12B active**, native NVFP4 pretraining, MTP, 1M context. The vLLM team's reference Spark deployment (~23 tok/s decode; ~10–15 min first load).

```bash
docker run -d --name nemotron --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:cu130-nightly \
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --served-model-name nemotron-3-super --trust-remote-code \
    --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4 \
    --reasoning-parser nemotron_v3 \
    --enable-auto-tool-choice --tool-call-parser qwen3_coder
```

> [!NOTE]
> Two NVIDIA-**published** NVFP4 checkpoints of *other vendors'* models live in §7.4 because they follow those families: **Llama-3.3-70B-NVFP4** (Meta) and **Phi-4-multimodal-NVFP4** (Microsoft). The `nvidia/...` Qwen and Gemma NVFP4 checkpoints likewise live in their family groups above. This section is for models NVIDIA itself designed.

---

### 7.4 Mistral & other open models

#### 7.4.1 Mistral-Small-4-119B-2603-NVFP4 — unified instruct + reasoning + coding (Mistral AI)

A granular MoE (**128 experts, 4 active; 119B total / 6.5B active**) fusing Instruct, Reasoning (Magistral), and Devstral skills, multimodal, 256K context, Apache 2.0, ~60 GB NVFP4 (fits one Spark). Compressed-tensors → **no `--quantization`**; uses **MLA** attention and **Mistral** parsers.

```bash
docker run -d --name mistral-small4 --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:cu130-nightly \
  vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
    --tensor-parallel-size 1 --max-model-len 65536 \
    --attention-backend TRITON_MLA \
    --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral \
    --max-num-batched-tokens 16384 --max-num-seqs 4 \
    --gpu-memory-utilization 0.85
```

> [!IMPORTANT]
> Needs **`mistral_common >= 1.11.0`** (bundled in recent vLLM; `uv pip install -U vllm` pulls it). For correct behavior, load the repo's **`SYSTEM_PROMPT.txt`** and set **`reasoning_effort`** per request (`"none"` for fast replies, `"high"` for hard tasks; temp 0.7 with reasoning). The card's example uses TP=2 + `--max-num-seqs 128` for a multi-GPU server — on a single Spark use **TP=1**, `--max-num-seqs 4`, `--max-model-len 65536`.

<svg viewBox="0 0 760 176" xmlns="http://www.w3.org/2000/svg" font-family="ui-sans-serif,system-ui,-apple-system,Segoe UI,Roboto,Helvetica,Arial,sans-serif" role="img" aria-label="Mistral Small 4 119B — gains vs Mistral Small 3" style="max-width:100%;height:auto"><rect x="0" y="0" width="760" height="176" rx="12" fill="#0b0f14"/><text x="22" y="32" fill="#e5e7eb" font-size="17" font-weight="700">Mistral Small 4 119B — gains vs Mistral Small 3</text><text x="22" y="52" fill="#9ca3af" font-size="12">Vendor-reported (Mistral AI card). Unified instruct + reasoning + coding MoE, 119B / 6.5B active.</text><rect x="22.0" y="72" width="350.0" height="86" rx="10" fill="#11161d" stroke="#1f2933"/><text x="197.0" y="118" fill="#76b900" font-size="34" font-weight="800" text-anchor="middle">−40%</text><text x="197.0" y="142" fill="#cbd5e1" font-size="12" text-anchor="middle">end-to-end latency — latency-optimized</text><rect x="388.0" y="72" width="350.0" height="86" rx="10" fill="#11161d" stroke="#1f2933"/><text x="563.0" y="118" fill="#38bdf8" font-size="34" font-weight="800" text-anchor="middle">3×</text><text x="563.0" y="142" fill="#cbd5e1" font-size="12" text-anchor="middle">requests/sec — throughput-optimized</text></svg>

#### 7.4.2 gpt-oss-120b — open-weights MXFP4 MoE (OpenAI)

~65 GB native MXFP4 — fits one Spark with KV-cache room. Ungated. The 20B sibling (`openai/gpt-oss-20b`) is a great first smoke-test.

```bash
docker run -d --name gptoss --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve openai/gpt-oss-120b \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4 \
    --enable-expert-parallel
```

> [!NOTE]
> gpt-oss models use OpenAI's **Harmony** response format and expose a **`reasoning_effort`** control (`"low"`/`"medium"`/`"high"`) passed in the request body — there's no separate `--reasoning-parser` flag. For tool calling add `--enable-auto-tool-choice --tool-call-parser openai`. Use a recent vLLM/NGC image (older builds predate Harmony parsing).

#### 7.4.3 Llama-3.3-70B-Instruct-NVFP4 — gated dense (Meta, NVIDIA quant)

```bash
docker run -d --name llama70 --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve nvidia/Llama-3.3-70B-Instruct-NVFP4 \
    --quantization modelopt --max-model-len 131072 \
    --gpu-memory-utilization 0.85 --max-num-seqs 4
```

Dense 70B → memory-bandwidth-limited decode (slower than a similar-size MoE), but high single-user quality. Needs an accepted Llama license + `HF_TOKEN`.

#### 7.4.4 Phi-4-multimodal-instruct-NVFP4 — omnimodal text + image + audio (Microsoft, NVIDIA quant)

Microsoft's 5.6B **omnimodal** Phi-4 — text, image, **and speech/audio** — 128K context, `Phi4MMForCausalLM` with a custom processor. NVIDIA NVFP4 ModelOpt quant; small and cheap on Spark.

> [!CAUTION]
> **Known container gotcha:** Phi-4-mm's custom processor imports **`scipy`**, which is **not installed** in the NGC vLLM images — serving fails with `ImportError: ... scipy`. Install it before `vllm serve` (or bake it into a derived image). Tracked in NVIDIA/dgx-spark-playbooks issue #69.

```bash
docker run -d --name phi4mm --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  bash -lc "pip install --no-cache-dir scipy && \
    vllm serve nvidia/Phi-4-multimodal-instruct-NVFP4 \
      --quantization modelopt --trust-remote-code \
      --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4"
```

> [!NOTE]
> Needs **`--trust-remote-code`** (custom `Phi4MM` processor). Send images via OpenAI `image_url` blocks and audio via the audio input field (any `soundfile`-readable format). The NVFP4 checkpoint was originally published for TensorRT-LLM but runs under vLLM with the two requirements above.

---

## 8. Flag reference & tuning

| Flag | On Spark | Guidance |
|---|---|---|
| `--gpu-memory-utilization` | Fraction of the **shared 128 GB** | Start from the model recipe (0.4–0.92 seen). Leave headroom; lower if OOM. |
| `--max-num-seqs` | Concurrent sequences | Keep **low (1–4)**; above ~4 the bandwidth tax outweighs batching. |
| `--max-model-len` | Prompt + completion cap | 65536 is a sane Spark default; raise toward model max only with headroom. |
| prefix caching | KV reuse across shared prefixes | **On by default in V1**; `--enable-prefix-caching` is redundant but harmless. |
| `--quantization modelopt` | ModelOpt NVFP4 only | Pass for `nvidia/...` ModelOpt checkpoints; **omit** for compressed-tensors (auto-detected). |
| `--reasoning-parser` / `--tool-call-parser` + `--enable-auto-tool-choice` | Structured reasoning/tools | **Follow the model recipe** (qwen3 / gemma4 / mistral / nemotron_v3). |
| `--kv-cache-dtype fp8` | Shrinks KV cache | Model-specific: the Qwen3.6-NVFP4, gemma4-coder-MTP, and Nemotron-3-Nano recipes use it; Qwen3.5 and DFlash do **not**. |
| `--speculative-config '{"method":"mtp"...}'` | Built-in speculative decode | Latency lever for MTP models (Qwen3.x, Nemotron, gemma4-coder). See §13. |
| `--moe-backend` / `--attention-backend` | Kernel pins | Leave auto unless a tested recipe pins one (marlin / cutlass / flashinfer / TRITON_*). |
| `--enable-expert-parallel` | MoE routing | Enable for MoE (gpt-oss, Nemotron, Mistral, Gemma-4-26B). |
| CUDA graphs | Per-step overhead | Keep enabled. |
| `--load-format fastsafetensors` | Faster weight load | Evaluate if the 10–15 min load matters. |

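To make the `--gpu-memory-utilization` row concrete, here is a back-of-the-envelope sketch: the fraction times the 128 GB pool, minus the on-disk weight size, approximates what is left for KV cache and runtime buffers. `kv_budget_gb` is a hypothetical helper, and the estimate ignores activation memory, CUDA graphs, and whatever the OS itself is holding in unified memory, so treat the result as an upper bound rather than what vLLM will actually report.

```bash
# Rough upper bound on memory left for KV cache after weights load.
# Args: utilization fraction, weight size in GB, optional pool size in GB (default 128).
kv_budget_gb() {
  awk -v util="$1" -v weights="$2" -v total="${3:-128}" \
    'BEGIN { printf "%.1f\n", util * total - weights }'
}

kv_budget_gb 0.85 65   # gpt-oss-120b recipe: 0.85 * 128 - 65, prints 43.8
kv_budget_gb 0.85 60   # Mistral-Small-4 recipe: 0.85 * 128 - 60, prints 48.8
```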
> [!TIP]
> **Stability ↔ throughput slider.** A plain run leaves KV-cache dtype unset, spec-decode off, CUDA graphs on, backends auto. A tuned run layers in FP8 KV cache, async scheduling, MTP/DFlash, and pinned FP4 backends — each validated against the exact model + prompt shape + batch pattern + vLLM release. **Want ~2–3× faster interactive decode? See §13 (MTP & DFlash).**

---

## 9. Pre-warm the JIT

The **first** request after boot triggers Inductor + FlashInfer codegen (~25–60 s). Fire a tiny warmup on the **same path** as real traffic; after that, short prompts return in <0.5 s:

```bash
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"<served-model-name>","messages":[{"role":"user","content":"ping"}],"max_tokens":3}'
```

This is separate from **weight load** (10–15 min for a 120B) — address that with `fastsafetensors`/InstantTensor if needed.
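
When the container starts unattended (for example under `--restart unless-stopped`), the warmup can be scripted. The sketch below polls `/health` until the server answers and then sends the same tiny request. `wait_and_warm`, the default of 120 attempts, and the 10 s delay are illustrative choices (sized to outlast a 10–15 min weight load), not anything vLLM ships.

```bash
# Wait for the server, then fire one tiny chat request to trigger JIT codegen.
# Args: base URL, served model name, max attempts (default 120), delay in seconds (default 10).
wait_and_warm() {
  local base="$1" model="$2" tries="${3:-120}" delay="${4:-10}" i=1
  while [ "$i" -le "$tries" ]; do
    if curl -sf "$base/health" >/dev/null; then
      # Same endpoint as real traffic, so the same kernels get compiled.
      curl -sSf "$base/v1/chat/completions" \
        -H "Content-Type: application/json" \
        -d "{\"model\":\"$model\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}],\"max_tokens\":3}" \
        >/dev/null
      return $?
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "server at $base not healthy after $tries attempts" >&2
  return 1
}

# Usage: wait_and_warm http://localhost:8000 "<served-model-name>"
```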

---

## 10. Verify & monitor

```bash
curl -sS http://localhost:8000/health
curl -sS http://localhost:8000/v1/models | jq
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"<served-model-name>","messages":[{"role":"user","content":"Explain quantum computing briefly."}],"max_tokens":200,"stream":true}'
```

Prometheus telemetry is at `/metrics` (no extra service). Watch KV-cache utilization (`vllm:kv_cache_usage_perc`) and the TTFT / inter-token-latency histograms. Healthy single-user behavior: later turns that hit the prefix cache don't spike in latency, decode rate stays steady, and KV-cache usage stays well below 100%.
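
For a quick look without a Prometheus server, the KV-cache gauge can be pulled out of the text exposition with `awk`. This is a minimal sketch that assumes the gauge is reported as a 0–1 fraction, one line per engine; `kv_usage_pct` is a hypothetical helper name.

```bash
# Read Prometheus text on stdin and print KV-cache usage as a percentage,
# one line per matching series. "# HELP" / "# TYPE" comment lines are skipped
# because they do not start with the metric name.
kv_usage_pct() {
  awk '/^vllm:kv_cache_usage_perc/ { printf "%.1f\n", $NF * 100 }'
}

# Usage: curl -s http://localhost:8000/metrics | kv_usage_pct
```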

> [!TIP]
> **Confirm the fast paths are live.** Check the startup log to verify NVFP4 GEMM kernels actually engaged (you want a line like `Using NvFp4LinearBackend.VLLM_CUTLASS for NVFP4 GEMM`) — if it silently fell back, you lose the FP4 speed/memory win. For speculative decoding, confirm acceptance via `curl -s localhost:8000/metrics | grep -i spec_decode` (see §13.5).

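The `grep` in the tip above shows raw counters; the acceptance rate is their ratio. The sketch below assumes the counter names `vllm:spec_decode_num_accepted_tokens_total` and `vllm:spec_decode_num_draft_tokens_total`, which may differ on your image, so check the `grep` output first. `spec_accept_rate` is a hypothetical helper, and the ratio is cumulative since server start.

```bash
# Read Prometheus text on stdin; print accepted / drafted tokens as a ratio.
# Counter names are an assumption; verify them against your own /metrics output.
spec_accept_rate() {
  awk '
    /^vllm:spec_decode_num_accepted_tokens_total[{ ]/ { acc += $NF }
    /^vllm:spec_decode_num_draft_tokens_total[{ ]/    { drafted += $NF }
    END {
      if (drafted > 0) printf "%.3f\n", acc / drafted
      else print "n/a"   # no drafts recorded yet, or spec decode is off
    }'
}

# Usage: curl -s localhost:8000/metrics | spec_accept_rate
```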
---

## 11. Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| `permission denied ... docker.sock` | Not in `docker` group | `sudo usermod -aG docker $USER && newgrp docker` |
| OOM with free RAM showing | Page cache holds unified memory | `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'` before launch |
| OOM during load/serve | Util too high / context too long | Lower `--gpu-memory-utilization`, reduce `--max-model-len`, or pick a smaller/more-quantized model |
| 401 / gated download fails | Missing token / license | `export HF_TOKEN=...`, accept the license on the HF page |
| `model type ... not recognized` | Container too old for the arch | Newer NGC tag or `vllm-openai:nightly`; check `vllm --version` |
| `unknown quantization` / weights mis-load | Wrong quant flag | ModelOpt → `--quantization modelopt`; compressed-tensors → omit it |
| Stable image won't run on GB10 | No `sm_121` support | Use `cu130-nightly` / NGC image |
| Output repetition loops | `--kv-cache-dtype fp8` on a model that dislikes it | Remove it (model-specific) |
| ~9× slower MoE | `--enable-chunked-prefill` on SSM+MoE | Remove it |
| First request slow, rest fine | JIT warmup | Pre-warm (§9) |
| `docker rmi` won't delete image | Tag omitted, or a container still uses the image | `docker stop <name> && docker rm <name>` then `docker rmi <repo>:<tag>` |

---

## 12. Scaling to multiple Sparks (brief)

A single Spark handles models up to ~100 GB (gpt-oss-120b MXFP4, Mistral-Small-4 NVFP4, Llama-70B NVFP4). For larger models or higher TP, link Sparks over the 200 Gb/s QSFP (ConnectX-7) ports and run a Ray cluster with `--tensor-parallel-size` = number of GPUs:

- **2 Sparks:** direct QSFP cable. **3+:** through a switch.
- Bind NCCL to the QSFP interface (`NCCL_SOCKET_IFNAME=enP2p1s0f1np1`); an Ethernet fallback costs 10–20× throughput.
- **Mount the same `~/vllm` on every node** and stage weights on each.
- Easiest: the `mark-ramsey-ri/vllm-dgx-spark` scripts, or NVIDIA's `spark_cluster_setup.sh` + the official multi-Spark playbook.

---

## 13. Speculative decoding on Spark: MTP & DFlash (step-by-step)

A small, cheap drafter proposes the next few tokens and the big model verifies them in one pass, so accepted tokens are effectively free. It's a **latency** win that shines at **low concurrency**, exactly Spark's profile, and can speed up interactive decode by roughly **2–3×** with no quality loss. Two methods matter: **MTP** (built into the model) and **DFlash** (a separate diffusion drafter).

### 13.1 Do they need special models, Docker, or configs?

| | **MTP** | **DFlash** |
|---|---|---|
| **Special model?** | Target has built-in MTP modules (no download) **or** ships a small paired draft model (e.g. gemma4-coder's bundled `assistant/`). | **Yes** — a matching DFlash **drafter** checkpoint trained for your target. |
| **Special Docker?** | **No** — standard Spark images. | **Usually** — vLLM **≥ 0.21.0** + `sm_121` build. NGC `26.04` (vLLM 0.19.0) is too old; use a prebuilt DFlash image or nightly. |
| **Special config?** | `--speculative-config '{"method":"mtp","num_speculative_tokens":1}'` | `--speculative-config '{"method":"dflash","model":"<drafter>","num_speculative_tokens":N}'` + KV cache **BF16** (no fp8). |
| **Models** | DeepSeek V3/R1/V4, Qwen3.5/3.6, GLM-5.x, Gemma 4 (+gemma4-coder), Nemotron-3-Super, Mistral | Gemma-4-31B, Laguna XS.2, Qwen3.5-27B (or train your own) |
| **Speedup** | ~1.6–1.8× | up to ~6× / ~2.5× over EAGLE-3; ~2.2–2.7× on Spark |

> [!NOTE]
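> Both consume KV cache for speculative tokens, so they **trade peak throughput for latency**. Keep them on for interactive use; for bursty/concurrent serving, switch speculation off above a batch-size cutoff of 32. Recent vLLM releases take this as the `"disable_by_batch_size":32` key inside the `--speculative-config` JSON; older ones used the standalone `--speculative-disable-by-batch-size 32` flag.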

**Which models in this guide can use which method:**

| Model (this guide) | MTP | DFlash | How |
|---|---|---|---|
| Qwen3.6-35B-A3B-NVFP4 (§7.1.1) | ✅ built-in | — | already in its recipe (`num_speculative_tokens:3`) |
| Qwen3.6-27B-NVFP4 (§7.1.2) | ✅ built-in (validate) | — | add `--speculative-config '{"method":"mtp",...}'` |
| Qwen3.5-35B-A3B (§7.1.3) | ✅ built-in | via 27B sibling | dense Qwen3.5-27B has a DFlash drafter |
| gemma-4-12B-coder (§7.2.3) | ✅ paired draft | — | bundled `/model/assistant` + `kv fp8` |
| Gemma-4-31B-IT-NVFP4 (§7.2.1) | — | ✅ | `RedHatAI/...speculator.dflash` (Z-Lab image) |
| Nemotron-3-Super (§7.3.2) | ✅ built-in | — | `{"method":"mtp","num_speculative_tokens":1}` |
| Nemotron-3-Nano (§7.3.1) | — | — | card ships no speculator; run without speculative decoding |
| Mistral-Small-4 (§7.4.1) | ✅ built-in (validate) | — | add the mtp flag and check acceptance |

(A dash means no published speculator for that exact checkpoint today; "validate" means the architecture supports it but you should confirm acceptance on your build.)


### 13.2 MTP — zero-download (≈5 min)

Just a flag on an MTP-capable model (the §7.1.1 Qwen3.6 and §7.2.3 gemma4-coder recipes already use it):

> [!NOTE]
> **Two flavors of MTP.** Most MTP models (Qwen3.6, Nemotron-Super, DeepSeek-style) have the prediction heads **baked into the checkpoint** — nothing extra to download, just the flag. A few ship a **small paired draft model** instead: gemma4-coder bundles a 0.4 B draft in `assistant/`, so you point `"model"` at it (`/model/assistant`) — see §7.2.3. Both use `"method":"mtp"`; only the second needs the `"model"` field.

```bash
docker run -d --name mtp --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --host 0.0.0.0 --port 8000 --trust-remote-code \
    --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4 \
    --reasoning-parser nemotron_v3 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```

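`num_speculative_tokens` is usually **1** (some recipes use 3). For lowest latency at low concurrency, the vLLM recipe suggests disabling prefix caching (`--no-enable-prefix-caching`). MTP **reduces throughput under load**, so pair it with `--speculative-disable-by-batch-size 32`, which switches speculation off once the batch grows past 32 requests. If your build rejects that flag, check whether it expects `"disable_by_batch_size":32` inside the `--speculative-config` JSON instead.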
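
The two MTP flavors in the note above differ only in whether the JSON carries a `"model"` field. A minimal sketch of a helper that builds either form; the function name `mtp_spec_config` is made up for this guide and is not part of vLLM, while the JSON keys are the ones the recipes here already use:

```bash
# Hypothetical helper (not part of vLLM): prints a --speculative-config value.
#   $1 = num_speculative_tokens (defaults to 1)
#   $2 = optional path or HF id of a paired draft model
#        (omit it for checkpoints with built-in MTP heads)
mtp_spec_config() {
  local k="${1:-1}" draft="${2:-}"
  if [ -n "$draft" ]; then
    printf '{"method":"mtp","model":"%s","num_speculative_tokens":%d}\n' "$draft" "$k"
  else
    printf '{"method":"mtp","num_speculative_tokens":%d}\n' "$k"
  fi
}
```

`--speculative-config "$(mtp_spec_config 1)"` reproduces the Nemotron flag above. Passing a second argument, for example `/model/assistant`, yields the paired-draft form; take the K value for that model from its own recipe in §7.2.3 rather than from this sketch.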

### 13.3 DFlash on Spark — plug-and-play container (easiest)

The `ghcr.io/aeon-7/vllm-dflash` image is a prebuilt `sm_121` vLLM with DFlash baked in; it serves a 27B dense Qwen3.5 with a 2B block-diffusion drafter, raising decode from **~12 to ~33 tok/s**.

```bash
# 1) download the target into ~/vllm
pip install -U "huggingface_hub[cli]"; export HF_TOKEN=hf_xxx
hf download AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --local-dir ~/vllm/DFlash-Qwen3.5-27B-Uncensored-NVFP4

# 2) generate an API key and keep a copy (the export only lasts for this shell), then launch (drafter auto-downloads)
export VLLM_API_KEY=$(openssl rand -hex 32); echo "API key: $VLLM_API_KEY"
docker run -d --name vllm-dflash --runtime nvidia --network host --ipc host \
  --restart unless-stopped \
  -v ~/vllm/DFlash-Qwen3.5-27B-Uncensored-NVFP4:/models/target \
  -v ~/vllm/drafter-cache:/models/drafter-cache \
  -e MODEL_PATH=/models/target \
  -e SERVED_MODEL_NAME=dflash-qwen3.5-27b \
  -e DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash \
  -e DFLASH_NUM_SPEC_TOKENS=15 \
  -e MAX_MODEL_LEN=65536 -e MAX_NUM_SEQS=4 \
  -e GPU_MEMORY_UTILIZATION=0.85 -e MAX_NUM_BATCHED_TOKENS=65536 \
  -e VLLM_API_KEY="$VLLM_API_KEY" -e HF_TOKEN="$HF_TOKEN" \
  ghcr.io/aeon-7/vllm-dflash:latest
docker logs -f vllm-dflash    # ~5 min

# 3) test (note the Bearer token)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{"model":"dflash-qwen3.5-27b","messages":[{"role":"user","content":"Write a haiku about GPUs."}],"max_tokens":200}'
```

**Key env vars:** `DFLASH_DRAFTER` (HF id of the drafter; empty = plain vLLM), `DFLASH_NUM_SPEC_TOKENS` (15 is best single-stream, 5 for high concurrency), `KV_CACHE_DTYPE` (**stays BF16 with DFlash**). Spark presets, written as `MAX_MODEL_LEN` / `DFLASH_NUM_SPEC_TOKENS` / `MAX_NUM_SEQS`: default 65536 / 15 / 4 ≈ 33 tok/s; high-concurrency 32768 / 5 / 8 ≈ 85–92 tok/s total.
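
To switch between the two Spark presets without retyping them, the values can be wrapped in a small function. This is only a sketch: the name `dflash_preset_env` is hypothetical, and it sets just the three values quoted above (context length, speculative tokens, max sequences). `MAX_NUM_BATCHED_TOKENS` for the high-concurrency preset is not given in this guide, so it is left out.

```bash
# Hypothetical helper: prints the -e flags for one of the two Spark presets.
# Only the three values stated in this guide are emitted.
dflash_preset_env() {
  case "${1:-default}" in
    default)
      echo "-e MAX_MODEL_LEN=65536 -e DFLASH_NUM_SPEC_TOKENS=15 -e MAX_NUM_SEQS=4" ;;
    high-concurrency)
      echo "-e MAX_MODEL_LEN=32768 -e DFLASH_NUM_SPEC_TOKENS=5 -e MAX_NUM_SEQS=8" ;;
    *)
      echo "unknown preset: $1" >&2
      return 1 ;;
  esac
}
```

In the launch command, replace the matching `-e` flags with an unquoted `$(dflash_preset_env high-concurrency)` so the shell splits the output into separate arguments.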

### 13.4 DFlash the general way (other models)

DFlash uses vLLM's **speculators** format; pass the drafter in `--speculative-config`. Available drafters today:

| Target (verifier) | DFlash drafter | Source |
|---|---|---|
| `google/gemma-4-31B-it` | `RedHatAI/gemma-4-31B-it-speculator.dflash` | RedHatAI / vLLM |
| `poolside/Laguna-XS.2` | `poolside/Laguna-XS.2-speculator.dflash` | poolside |
| `Qwen/Qwen3.5-27B` | `z-lab/Qwen3.5-27B-DFlash` | Z Lab |

```bash
# confirm >= 0.21.0; --entrypoint overrides the image's default entrypoint,
# which would otherwise receive "vllm --version" as serve arguments
docker run --rm --entrypoint vllm vllm/vllm-openai:nightly --version

# Laguna XS.2 (coding; up to 7 tokens/step, ~70% acceptance on code)
docker run -d --name laguna --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -e VLLM_USE_DEEP_GEMM=0 -v ~/vllm:/models \
  --entrypoint vllm \
  vllm/vllm-openai:nightly \
  serve poolside/Laguna-XS.2 --trust-remote-code \
    --enable-auto-tool-choice --tool-call-parser poolside_v1 --reasoning-parser poolside_v1 \
    --speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
```

> [!NOTE]
> DFlash needs vLLM **≥ 0.21.0**; the stock NGC `26.04` (0.19.0) won't do it. Gemma-4 DFlash currently needs Z Lab's build (`ghcr.io/z-lab/vllm-openai:gemma4-dflash-cu130`). No drafter for your model? Train one — the drafter is a small Qwen3-style stack and the `speculators` project ships a "Train DFlash" tutorial.
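
The version floor in the note can be checked in a launch script before a container is started. A minimal sketch, assuming GNU `sort -V` is available; the helper names `version_ge` and `dflash_supported` are hypothetical:

```bash
# Hypothetical helper: succeeds when version $1 >= version $2.
# Relies on GNU sort's -V (version sort); pre-release suffixes such as
# "rc1" are not given any special ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Gate for the DFlash floor quoted in this guide (0.21.0).
dflash_supported() {
  version_ge "$1" "0.21.0"
}
```

`dflash_supported 0.19.0` fails, matching the stock NGC `26.04` build, while `dflash_supported 0.21.0` succeeds. Pass it the version number your image reports.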

### 13.5 Verify it's actually accelerating (don't fly blind)

Speculative decoding only helps if the drafter's tokens are **accepted**. Confirm it, then tune:

- **Check acceptance.** vLLM logs a spec-decode summary (draft acceptance rate and **mean accepted length** per step) and exposes it on `/metrics`:
  ```bash
  curl -s http://localhost:8000/metrics | grep -i spec_decode
  ```
  Healthy MTP runs accept most proposed tokens (mean accepted length well above 1); DFlash on code can accept whole blocks. If acceptance is near zero, the method isn't helping — the drafter/target are mismatched or the content is low-predictability (random/adversarial text).
- **Tune `num_speculative_tokens` (K).** Raise K for **DFlash** (the drafter proposes a whole block in one pass, so 7–15 is normal); keep K **low for MTP** (1–3), since each extra token costs another sequential draft step and later tokens tend to be accepted less often. Watch decode tok/s as you change it: past the acceptance sweet spot, throughput drops.
- **A/B against a dense run.** Time the same prompt with the spec flag removed, using greedy sampling (`temperature` 0) so the two outputs are comparable. Speculation should change speed, not content: unless tok/s is clearly higher *and* the output is identical, drop speculation for that workload.
- **Concurrency check.** Both methods tax throughput under load; if you serve bursts, set `--speculative-disable-by-batch-size 32` so vLLM auto-disables speculation when batches grow.
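
The acceptance check above can be reduced to two numbers instead of a raw `grep`. The sketch below reads Prometheus text on stdin and assumes the counters are named `vllm:spec_decode_num_drafts_total`, `vllm:spec_decode_num_draft_tokens_total` and `vllm:spec_decode_num_accepted_tokens_total`; confirm those names against your own `/metrics` output, since they may differ between builds. The function name `spec_decode_summary` is hypothetical, and mean accepted length is computed here as 1 + accepted / drafts, on the assumption that each step also yields one token from the target model.

```bash
# Hypothetical helper: summarize spec-decode counters from /metrics text on stdin.
# The metric names are assumptions; verify them with the grep shown earlier.
spec_decode_summary() {
  awk '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      name = $1
      sub(/[{].*/, "", name)            # drop the {label="..."} part
      value = $NF                       # sample value is the last field
      if (name == "vllm:spec_decode_num_drafts_total")          drafts   += value
      if (name == "vllm:spec_decode_num_draft_tokens_total")    proposed += value
      if (name == "vllm:spec_decode_num_accepted_tokens_total") accepted += value
    }
    END {
      if (drafts == 0 || proposed == 0) {
        print "no spec-decode samples found" > "/dev/stderr"
        exit 1
      }
      printf "acceptance_rate=%.3f mean_accepted_length=%.2f\n", \
             accepted / proposed, 1 + accepted / drafts
    }'
}
```

Run it as `curl -s http://localhost:8000/metrics | spec_decode_summary`. The counters are cumulative since server start, so take a reading before and after a test workload if you want figures for that workload alone.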

### 13.6 The gotchas that bite people

> [!CAUTION]
> - **DFlash floor:** vLLM ≥ 0.21.0 (AEON-7/Z-Lab image or nightly). MTP is fine on 0.19.0+.
> - **DFlash KV cache stays BF16** — never combine with `--kv-cache-dtype fp8`. (Note: the *MTP* gemma4-coder draft is the opposite — it *needs* fp8 KV.)
> - **One drafter per target** — a DFlash drafter is trained for a specific model.
> - **Higher K is cheap with DFlash** (whole block in one pass), not with MTP — keep MTP `num_speculative_tokens` low.
> - **Both hurt throughput under load** — add `--speculative-disable-by-batch-size 32`.
> - **Acceptance is content-dependent** — big on code/structured output, smaller on prose, none on random text.
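
A back-of-the-envelope model shows why K and acceptance interact the way these points describe. The sketch assumes every draft token is accepted independently with the same probability `alpha`, a common simplification rather than anything measured on Spark. Under that assumption one verify step yields (1 - alpha^(K+1)) / (1 - alpha) tokens on average. The helper name and the sample values are illustrative only.

```bash
# Hypothetical helper: expected tokens emitted per verify step, assuming each
# draft token is accepted independently with probability alpha.
#   $1 = alpha (0..1), $2 = K (num_speculative_tokens)
expected_tokens_per_step() {
  awk -v a="$1" -v k="$2" 'BEGIN {
    if (a >= 1) { printf "%.2f\n", k + 1; exit }   # every draft token accepted
    printf "%.2f\n", (1 - a ^ (k + 1)) / (1 - a)
  }'
}
```

With `alpha` at 0.7, `expected_tokens_per_step 0.7 7` prints 3.14 and `expected_tokens_per_step 0.7 15` prints 3.32, so doubling K buys little. At 0.9 the same change goes from 5.70 to 8.15. A larger K therefore pays off only when acceptance is high and the extra draft tokens are cheap to produce.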

---

## 14. Sources

- NVIDIA DGX Spark vLLM playbook — https://build.nvidia.com/spark/vllm/instructions
- NVIDIA `dgx-spark-playbooks` — https://github.com/NVIDIA/dgx-spark-playbooks
- vLLM team blog "vLLM on the DGX Spark" (Jun 2026) — https://vllm.ai/blog/2026-06-01-vllm-dgx-spark
- vLLM speculative decoding / MTP / DFlash docs — https://docs.vllm.ai/en/latest/features/speculative_decoding/ · https://docs.vllm.ai/projects/speculators/en/latest/user_guide/algorithms/dflash/
- DFlash paper (Z Lab, arXiv 2602.06036) — https://arxiv.org/abs/2602.06036 · `AEON-7/vllm-dflash` — https://github.com/AEON-7/vllm-dflash
- Model cards: `nvidia/Qwen3.6-35B-A3B-NVFP4`, `unsloth/Qwen3.6-27B-NVFP4`, `nvidia/Gemma-4-31B-IT-NVFP4`, `nvidia/Gemma-4-26B-A4B-NVFP4`, `sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4`, `nvidia/diffusiongemma-26B-A4B-it-NVFP4`, `mistralai/Mistral-Small-4-119B-2603-NVFP4`, `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4`, `nvidia/Phi-4-multimodal-instruct-NVFP4` (all on https://huggingface.co)
- Phi-4-mm scipy container issue — https://github.com/NVIDIA/dgx-spark-playbooks/issues/69
- NGC vLLM container tags — https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm

*Compiled June 2026. Benchmark numbers are reproduced from each model's Hugging Face card (NVFP4-vs-baseline tables, or vendor-reported gains). Container tags and model handles move fast — pin a known-good image digest for anything you depend on.*


---
*Source: [https://vlaicu.io/posts/dgx-vllm/](https://vlaicu.io/posts/dgx-vllm/)*
