# DGX vLLM

*June 20, 2026* — by Flaviu Vlaicu

> [!IMPORTANT]
> **Where models live.** Every command in this guide stores model weights in **`~/vllm`** on the host and mounts it into the container at `/models` (via `HF_HOME=/models`). Download once, mount everywhere. A couple of community models that bundle a speculative-decode draft use a per-model local path under `~/vllm/` instead — those are flagged inline.

A practical, end-to-end guide for serving LLMs with vLLM on a **DGX Spark (GB10 Grace Blackwell)**, assembled from NVIDIA's official playbook, the vLLM team's deep-dive, model cards, and battle-tested community setups.

> [!WARNING]
> Flag choices on Spark are **model- and image-specific**, not hardware-wide defaults. The recipes below are starting points that worked for their authors against a specific container tag. Validate against the exact image you run, and pin a known-good tag/digest for anything you depend on. Copying a flag from one model's recipe to another can silently regress throughput or output quality.

---

## 1. The landscape: current playbooks & walkthroughs

| Source | Best for | Link |
|---|---|---|
| **NVIDIA official playbook** | Canonical install steps; single / stacked / switched / troubleshooting tabs | https://build.nvidia.com/spark/vllm/instructions |
| **NVIDIA `dgx-spark-playbooks`** | Source-of-truth repo; benchmarking + cluster bootstrap | https://github.com/NVIDIA/dgx-spark-playbooks |
| **DeepWiki: vLLM on Spark** | Clean model support matrix + serving params | https://deepwiki.com/NVIDIA/dgx-spark-playbooks/4.2-vllm |
| **vLLM team blog (Jun 2026)** — *authoritative config deep-dive* | Why each flag matters, unified-memory behavior, JIT pre-warm | https://vllm.ai/blog/2026-06-01-vllm-dgx-spark |
| **`mark-ramsey-ri/vllm-dgx-spark`** | 1-to-N Spark orchestration, 41 model presets | https://github.com/mark-ramsey-ri/vllm-dgx-spark |
| **`AEON-7/vllm-dflash`** | Plug-and-play DFlash container for Spark | https://github.com/AEON-7/vllm-dflash |
| **vLLM Recipes index** | Per-model launch commands, kept current | https://recipes.vllm.ai/ |
| **NGC vLLM container tags** | Latest `nvcr.io/nvidia/vllm` build | https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm |

---

## 2. Hardware facts that drive every config

DGX Spark is a GB10 Grace Blackwell SoC with **128 GB unified memory shared by CPU and GPU**, on a consumer-Blackwell `sm_121` GPU, ~273 GB/s bandwidth, and an **ARM64 (aarch64)** host. Four consequences shape everything below:

- **One memory pool for everything.** Weights, KV cache, the OS, page cache, and the runtime all draw from the same 128 GB. `--gpu-memory-utilization` is a fraction of that shared pool — leave headroom or you OOM with "free" RAM still showing.
- **Memory-bandwidth bound for decode.** Spark shines at **small-batch, single-user interactive serving**. Keep concurrency low (`--max-num-seqs` ≈ 1–4).
- **NVFP4 MoE is the sweet spot.** Quantized MoE with ~3–13 B *active* params (even at 25–120 B total) is the best fit: small active set = fast decode, and 4-bit weights free unified memory for KV cache. Prefer NVFP4/FP8/MXFP4 over BF16.
- **Use sm_121-validated builds.** The upstream **stable** vLLM image does **not** support GB10. Use the CUDA-13 nightly track (`vllm/vllm-openai:cu130-nightly` / `:nightly`) or NVIDIA's NGC container (`nvcr.io/nvidia/vllm:26.02-py3`+).

---

## 3. Storing all models in `~/vllm`

vLLM fetches weights through the Hugging Face stack, which caches under the directory named by `HF_HOME`. Point that at a host directory and mount it, and **every model lives in `~/vllm`**, shared across containers.

```bash
mkdir -p ~/vllm
```

**Pre-stage on the host (recommended)** — files end up owned by you, the download happens once, and the first `vllm serve` doesn't stall on a multi-GB pull:

```bash
pip install -U "huggingface_hub[cli]"   # one-time
export HF_HOME=~/vllm
export HF_TOKEN=hf_xxx                  # for gated models; accept the license on the HF page first
hf download nvidia/Qwen3.6-35B-A3B-NVFP4
```

**The mount, in every `docker run`:**

```bash
-e HF_HOME=/models \
-v ~/vllm:/models \
```

Inside the container HF now caches at `/models` (= `~/vllm` on host); pre-staged weights at `~/vllm/hub/...` appear at `/models/hub/...` automatically.

> [!TIP]
> Force fully-local loading (air-gapped, no network checks) with `-e HF_HUB_OFFLINE=1` once weights are staged. If you let the *container* (root) download instead of pre-staging, files in `~/vllm` are root-owned — add `--user $(id -u):$(id -g)` if that matters.

---

## 4. Is volume mounting necessary?

**Yes — treat it as mandatory.** A container filesystem is ephemeral: a freshly downloaded 27–120 B model is destroyed when the container is removed, so without a mount you re-pull many GB **and** re-pay the 10–15 min weight-load on every run. The vLLM team's guidance is "download once, mount everywhere."
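As a minimal sketch of the "download once" idea, the helpers below check whether a repo is already staged under `~/vllm` before you launch a container. They assume the standard Hugging Face hub cache layout (`hub/models--ORG--NAME/snapshots/`); `hf_cache_dir` and `is_staged` are hypothetical names for illustration, not part of the `hf` or `vllm` CLIs.

```bash
# Hypothetical helpers (not part of hf or vllm).
# Assumption: the hub cache lives at $HF_HOME/hub/models--ORG--NAME.

hf_cache_dir() {
  # "org/name" becomes "models--org--name" under $HF_HOME/hub
  printf '%s/hub/models--%s\n' "${HF_HOME:-$HOME/vllm}" \
    "$(printf '%s' "$1" | sed 's|/|--|g')"
}

is_staged() {
  # A staged model has at least one entry under snapshots/.
  snap_dir="$(hf_cache_dir "$1")/snapshots"
  [ -d "$snap_dir" ] && [ -n "$(ls -A "$snap_dir" 2>/dev/null)" ]
}
```

A typical use on the host would be `is_staged nvidia/Qwen3.6-35B-A3B-NVFP4 || hf download nvidia/Qwen3.6-35B-A3B-NVFP4`, with `HF_HOME=~/vllm` exported as in §3. Note that a partially downloaded snapshot also counts as staged here.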
The **one** other volume worth adding is the **torch.compile cache** (the slow part of cold start), which lives separately from your models at `/root/.cache/vllm`:

```bash
-v ~/vllm-compile-cache:/root/.cache/vllm \
```

> [!NOTE]
> The compile cache only pays off when you **recreate** a container (image/flag change) — a long-running `--restart unless-stopped` container compiles once anyway. It's keyed to GPU arch + vLLM version + model + flags, so clear it (`rm -rf ~/vllm-compile-cache/*`) if a stale entry causes a startup hiccup after an upgrade. `--ipc=host` / `--shm-size` is shared memory (RAM), **not** a `-v` volume.

---

## 5. One-time host setup

```bash
nvidia-smi               # GPU + driver visible
docker ps                # else: sudo usermod -aG docker $USER && newgrp docker
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
mkdir -p ~/vllm
export HF_TOKEN=hf_xxx   # gated models
# NGC pulls 401? -> docker login nvcr.io (user "$oauthtoken", password = NGC API key)
```

> [!WARNING]
> **Unified-memory OOM valve.** Because CPU and GPU share DRAM, the Linux page cache can hold memory CUDA can't reclaim — an "OOM" well under 128 GB. If a big model fails to load after heavy file activity, flush caches first:
> `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'`

---

## 6. Baseline single-Spark recipe (template)

Minimal, model-agnostic shape — swap the model handle and per-model flags from §7.
```bash
export HF_TOKEN=hf_xxx   # only for gated models
docker run -d --name vllm --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -e HF_HOME=/models \
  -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve MODEL_HANDLE \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 4
```

Smoke-test (replace `MODEL_ID` with the id the first command prints; the first request triggers JIT warmup — see §9):

```bash
curl -sS http://localhost:8000/v1/models | jq -r '.data[0].id'
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"MODEL_ID","messages":[{"role":"user","content":"12*17"}],"max_tokens":500}'
# expect "204"
```

> [!NOTE]
> **Two container tracks — pick per recipe.** NVIDIA's NGC image `nvcr.io/nvidia/vllm:26.04-py3` (vLLM 0.19.0) is the default for many `nvidia/...` checkpoints. Several recipes below use the upstream **`vllm/vllm-openai:nightly`** / `:cu130-nightly` track (needed for the newest architectures and for DFlash). A given model may need a *newer* container than you have — `vllm --version` inside the image tells you.

> [!IMPORTANT]
> **Quantization flag rule.** For **NVIDIA ModelOpt** NVFP4 checkpoints (the `nvidia/...` repos), pass `--quantization modelopt`. For **compressed-tensors** NVFP4 (Unsloth, llm-compressor community builds), vLLM **auto-detects** — do *not* pass `--quantization`. Getting this wrong is a common first-run failure.

---

## 7. Per-model configurations

Grouped by family: **Qwen → Gemma → NVIDIA → Mistral & other**. Every model below ships a real benchmark from its card (charted inline). All commands assume `~/vllm` is your model store.
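The quantization flag rule from §6 can be sketched as a small pre-launch check on a staged checkpoint. This is an illustration under a stated assumption: that the checkpoint's `config.json` records `"quant_method": "modelopt"` for ModelOpt builds and `"compressed-tensors"` for compressed-tensors builds. Some ModelOpt repos keep this metadata in `hf_quant_config.json` instead, so treat the output as a hint and defer to the model card. `quant_flag` is a hypothetical helper name.

```bash
# Hypothetical helper: print the --quantization flag to use (or an empty
# line) based on "quant_method" in a checkpoint's config.json.
# Assumption: ModelOpt checkpoints record a method starting with "modelopt";
# anything else is left to vLLM's auto-detection.

quant_flag() {
  method="$(grep -o '"quant_method"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" 2>/dev/null \
    | head -n 1 | sed 's/.*"\([^"]*\)"$/\1/')" || true
  case "$method" in
    modelopt*) printf '%s\n' '--quantization modelopt' ;;
    *)         printf '\n' ;;   # compressed-tensors or unknown: pass nothing
  esac
}
```

The argument is the path to a `config.json` inside a staged snapshot directory under `~/vllm/hub/`.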
### Quick comparison

| Model | Publisher | Arch | Total / Active | Quant | Image | Key flags |
|---|---|---|---|---|---|---|
| Qwen3.6-35B-A3B-NVFP4 | NVIDIA (Qwen) | MoE+hybrid attn, multimodal | 35B / 3B | NVFP4 (modelopt) | `vllm-openai:nightly` | env vars, MTP-3, `--quantization modelopt` |
| Qwen3.6-27B-NVFP4 | Unsloth (Qwen) | dense + vision | 27B | NVFP4 (c-t) | `vllm-openai:nightly` | `--dtype bfloat16`, trust-remote-code |
| Qwen3.5-35B-A3B | Qwen (community) | MoE (SSM+MoE) | 35B / 3B | BF16 | `cu130-nightly` | qwen3_coder; **no chunked-prefill** |
| Gemma-4-31B-IT-NVFP4 | NVIDIA (Gemma) | dense, multimodal | 30.7B | NVFP4 (modelopt) | `vllm-openai:gemma4-cu130` | `--quantization modelopt`, TP=1 |
| Gemma-4-26B-A4B-NVFP4 | NVIDIA (Gemma) | MoE, multimodal | 25.2B / 3.8B | NVFP4 (modelopt) | `vllm-openai:gemma4-cu130` | **TP=1 only**, gemma4 parsers |
| gemma-4-12B-coder MTP-NVFP4 | community (Gemma) | dense coder + MTP draft | 12B | NVFP4 (c-t) | `vllm-openai:nightly` | bundled MTP, `kv-cache fp8`, thinking-on |
| DiffusionGemma-26B-A4B-NVFP4 | NVIDIA (Gemma) | diffusion MoE | 25.2B / 3.8B | NVFP4 (modelopt) | `vllm-openai:gemma` | V2 runner, TRITON_ATTN |
| Nemotron-3-Nano-30B-A3B-NVFP4 | NVIDIA | hybrid Mamba-2 + MoE | 30B / 3.5B | NVFP4 + fp8 KV | `nvcr.io vllm:25.12.post1` | nano_v3 parser, **Spark-tested** |
| Nemotron-3-Super-120B-A12B-NVFP4 | NVIDIA | hybrid Mamba-Tx MoE | 120B / 12B | NVFP4 | `cu130-nightly` | MTP, reasoning nemotron_v3 |
| Mistral-Small-4-119B-NVFP4 | Mistral AI | MoE, multimodal | 119B / 6.5B | NVFP4 (c-t) | `cu130-nightly` | MLA, mistral parsers, SYSTEM_PROMPT |
| gpt-oss-120b | OpenAI | MoE | 120B | MXFP4 | `nvcr.io vllm:26.04` | expert-parallel |
| Llama-3.3-70B-Instruct-NVFP4 | NVIDIA (Meta) | dense | 70B | NVFP4 | `nvcr.io vllm:26.04` | gated, bandwidth-bound |
| Phi-4-multimodal-NVFP4 | NVIDIA (Microsoft) | dense, multimodal | — | NVFP4 | `nvcr.io vllm:26.04` | trust-remote-code |

*(c-t = compressed-tensors auto-detect;
env vars = the four `sm_121a` exports shown in 7.1.1.)*

---

### 7.1 Qwen family

#### 7.1.1 Qwen3.6-35B-A3B-NVFP4 — flagship MoE agent (official Spark recipe)

MoE with hybrid attention, **35B total / 3B active**, multimodal (text/image/video), 262K context, Apache 2.0, ~19 GB NVFP4. This is NVIDIA's recommended Spark configuration verbatim — note the **four required env vars** and `--quantization modelopt`.

```bash
export HF_TOKEN=hf_xxx
docker run -d --name qwen36-35b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_FP8_MOE_BACKEND=flashinfer_cutlass \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e CUTE_DSL_ARCH=sm_121a \
  vllm/vllm-openai:nightly \
  vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 --port 8000 \
    --tensor-parallel-size 1 --trust-remote-code --dtype auto \
    --quantization modelopt --kv-cache-dtype fp8 \
    --attention-backend flashinfer --moe-backend marlin \
    --gpu-memory-utilization 0.85 --max-model-len 65536 \
    --max-num-seqs 4 --max-num-batched-tokens 8192 \
    --enable-chunked-prefill --async-scheduling --enable-prefix-caching \
    --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}' \
    --reasoning-parser qwen3
```

> [!CAUTION]
> Earlier community recipes floated `--gpu-memory-utilization 0.4` and `--max-model-len 262144` for this model. NVIDIA's **official** card uses **0.85 / 65536** *plus the four env vars above* — use the official values.

For agent/tool use add `--enable-auto-tool-choice --tool-call-parser qwen3_coder`.

#### 7.1.2 Qwen3.6-27B-NVFP4 — flagship-level coding in a 27B dense model (Unsloth)

Dense **27B** with a vision encoder (text/image/video), hybrid Gated-DeltaNet + Gated-Attention layers, **MTP-trained**, 262K context (extensible to ~1M via YaRN), Apache 2.0, ~18 GB NVFP4 (compressed-tensors → **no `--quantization` flag**).
```bash
docker run -d --name qwen36-27b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:nightly \
  vllm serve unsloth/Qwen3.6-27B-NVFP4 \
    --trust-remote-code --dtype bfloat16 \
    --gpu-memory-utilization 0.85 --max-model-len 65536 \
    --max-num-seqs 4 --reasoning-parser qwen3
```

> [!NOTE]
> The card's default `--max-model-len` is a conservative **4096** ("increase only after checking memory"). 65536 is comfortable on Spark; keep ≥128K only if you have headroom, since the model leans on long context for thinking. For coding agents add `--tool-call-parser qwen3_coder`. MTP is supported by the architecture — you can try `--speculative-config '{"method":"mtp","num_speculative_tokens":3}'` and validate acceptance on your build. Thinking is **on by default**; pass `chat_template_kwargs={"enable_thinking":false}` (or `{"preserve_thinking":true}` for agents) per request.

> [!TIP]
> **Beyond 262K (YaRN, up to ~1M).** This checkpoint is native to 262,144 tokens; for longer contexts enable static YaRN via an `--hf-overrides` block and the long-len escape hatch (mind that static YaRN can slightly dent short-context quality — only enable it when you actually need the length):
> ```bash
> -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
> ... vllm serve unsloth/Qwen3.6-27B-NVFP4 --trust-remote-code --dtype bfloat16 \
>   --hf-overrides '{"text_config":{"rope_parameters":{"mrope_interleaved":true,"mrope_section":[11,11,10],"rope_type":"yarn","rope_theta":10000000,"partial_rotary_factor":0.25,"factor":4.0,"original_max_position_embeddings":262144}}}' \
>   --max-model-len 1010000
> ```

#### 7.1.3 Qwen3.5-35B-A3B — coding-agent backend (community)

Popular for pointing local coding agents at Spark. Upstream nightly, BF16.
```bash
docker run -d --name qwen35 --ipc=host --restart unless-stopped \
  --gpus all --shm-size 64gb -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:cu130-nightly \
  Qwen/Qwen3.5-35B-A3B \
    --served-model-name qwen3.5-35b --host 0.0.0.0 --port 8000 \
    --dtype bfloat16 --gpu-memory-utilization 0.9 --max-model-len 262144 \
    --enable-prefix-caching --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder --reasoning-parser qwen3
```

> [!WARNING]
> **GB10 gotchas (community-verified):** do **not** add `--enable-chunked-prefill` (≈9× throughput regression on SSM+MoE), and do **not** add `--kv-cache-dtype fp8` (output-repetition loops on GB10) for *this* model. This is the opposite of the Qwen3.6-NVFP4 recipe — never copy flags across models.

There's also a DFlash-accelerated dense sibling (Qwen3.5-27B) covered in §13.

---

### 7.2 Gemma family

#### 7.2.1 Gemma-4-31B-IT-NVFP4 — dense multimodal (NVIDIA)

Dense **30.7B**, multimodal (text/image/video), 256K context, 140+ languages, hybrid local/global attention with p-RoPE, NVIDIA Open Model License (+ Gemma terms), ~21 GB NVFP4.

```bash
export HF_TOKEN=hf_xxx
docker run -d --name gemma4-31b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:gemma4-cu130 \
  vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
    --quantization modelopt --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4
```

> [!NOTE]
> The card uses `--tensor-parallel-size 8` on a server — on Spark use **TP=1**. Gated: accept the Gemma license + set `HF_TOKEN`. For tools/reasoning add `--enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4`.

#### 7.2.2 Gemma-4-26B-A4B-NVFP4 — MoE multimodal (NVIDIA)

MoE **25.2B total / 3.8B active** (8 of 128 experts +1 shared), multimodal, 256K context, ~14 GB NVFP4 — a small, fast, high-quality Spark pick.
```bash
docker run -d --name gemma4-26b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:gemma4-cu130 \
  vllm serve nvidia/Gemma-4-26B-A4B-NVFP4 \
    --quantization modelopt --tensor-parallel-size 1 --moe-backend marlin \
    --trust-remote-code \
    --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 \
    --gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4
```

> [!WARNING]
> Per the card, this checkpoint currently works with **TP=1 only** in vLLM (expert-parallel is supported; tensor-parallel is not yet). MoE backend must be **VLLM_CUTLASS or Marlin** (FlashInfer-TRTLLM is pending a vLLM PR). Needs `--trust-remote-code`.

#### 7.2.3 gemma-4-12B-coder (MTP-NVFP4) — Python specialist with a bundled MTP draft (community)

A weight-only NVFP4 (W4A16) build of a Gemma-4-12B coding fine-tune: **8.25 GB** model + a **0.85 GB bundled MTP draft** for ~1.6× single-stream. Auto-detects NVFP4 (**no `--quantization`**). Because the draft lives in `assistant/`, download to a local path and mount it.

```bash
# download (~9 GB total) into ~/vllm
hf download sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4 \
  --local-dir ~/vllm/gemma4-coder

# easiest: one GPU, just chat
docker run -d --name gemma4-coder --ipc=host --shm-size 16gb --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -v ~/vllm/gemma4-coder:/model:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-coder \
  --max-model-len 16384 --gpu-memory-utilization 0.92 --trust-remote-code
```

Add the bundled **MTP draft** for ~1.6× interactive speed (lossless):

```bash
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","model":"/model/assistant","num_speculative_tokens":3}'
```

> [!IMPORTANT]
> This model was **trained to think first** — enable it **per request** or quality drops: `extra_body={"chat_template_kwargs":{"enable_thinking":true}}`.
> Needs a recent **nightly** (registers `Gemma4UnifiedForConditionalGeneration`). With MTP you **must** use `--kv-cache-dtype fp8` (NVFP4 KV breaks the draft).

> [!CAUTION]
> The build is **de-refused / not safety-aligned** — add your own guardrails. It's a superb algorithm/debug assistant but can write **look-ahead bias** into pandas/numpy time-series & back-test code (its reasoning sometimes states the right rule while the code does the opposite). Gate quant/accounting code; don't ship it unreviewed.

#### 7.2.4 DiffusionGemma-26B-A4B-NVFP4 — discrete-diffusion text gen (NVIDIA)

A **diffusion** LLM on the Gemma-4 26B-A4B MoE backbone (25.2B / 3.8B active) that emits **parallel 256-token blocks**, exceeding **1,100 tok/s at low batch** (H100 FP8). Multimodal, 256K context, ~14 GB NVFP4. Uses the dedicated diffusion image.

```bash
docker run -d --name diffgemma --ipc=host --shm-size=16g --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_USE_V2_MODEL_RUNNER=1 \
  vllm/vllm-openai:gemma \
  vllm serve nvidia/diffusiongemma-26B-A4B-it-NVFP4 \
    --trust-remote-code --max-num-seqs 4 \
    --attention-backend TRITON_ATTN \
    --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 \
    --override-generation-config '{"max_new_tokens": null}' \
    --default-chat-template-kwargs '{"enable_thinking":true}'
```

> [!WARNING]
> The `vllm/vllm-openai:gemma` image and these flags are **tentative until the supporting vLLM image is publicly released** (per the card). Check the [vLLM releases](https://github.com/vllm-project/vllm/releases) and the `vllm/vllm-openai:gemma4` Docker Hub tags before relying on it.

---

### 7.3 NVIDIA models (proprietary)

Models NVIDIA itself designed and trained (not quantizations of someone else's weights).
Both are **hybrid Mamba-2 + Transformer MoE** reasoning models with native function calling: the **Nano** is the natural single-Spark starting point, the **Super** is the heavyweight.

#### 7.3.1 NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 — best single-Spark Nemotron (Spark-tested)

A hybrid **Mamba-2 + MoE** (52 layers: 23 Mamba-2, 23 MoE with 128 experts +1 shared / 6 active, 6 GQA attention), **30B total / 3.5B active**, text-only, **1M context** (256K default), unified reasoning + non-reasoning, NVIDIA Nemotron Open Model License, ~18 GB on disk. NVFP4 weights **with FP8 KV cache**; attention and the Mamba layers feeding it stay BF16, and quantization-aware distillation (QAD) recovers accuracy. NVIDIA lists **DGX Spark** in this model's tested hardware and ships a Spark/Jetson-specific container.

```bash
export HF_TOKEN=hf_xxx

# one-time: fetch the custom reasoning parser into ~/vllm (mounted at /models)
wget -P ~/vllm \
  https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py

docker run -d --name nemotron-nano --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
    --served-model-name nemotron-3-nano --port 8000 \
    --tensor-parallel-size 1 --trust-remote-code \
    --max-model-len 262144 --max-num-seqs 8 \
    --kv-cache-dtype fp8 \
    --enable-auto-tool-choice --tool-call-parser qwen3_coder \
    --reasoning-parser-plugin /models/nano_v3_reasoning_parser.py \
    --reasoning-parser nano_v3
```

> [!IMPORTANT]
> On Spark (or Jetson Thor) use NVIDIA's **`nvcr.io/nvidia/vllm:25.12.post1-py3`** container for this model, and fetch the **`nano_v3_reasoning_parser.py`** plugin (downloaded into `~/vllm` above, referenced at `/models/...`).
> `--kv-cache-dtype fp8` is part of the **official** recipe here — unlike Qwen3.5, this model is built for it. `--max-num-seqs 8` is NVIDIA's Spark-tested value; drop to 4 for more KV headroom at long context.

> [!NOTE]
> Reasoning is **on by default** — pass `chat_template_kwargs={"enable_thinking":false}` per request to turn it off. The model also supports a **`reasoning_budget`** (cap internal reasoning tokens to hit latency targets). Sampling: `temperature=1.0, top_p=1.0` for reasoning; `0.6 / 0.95` for tool calling. For up to 1M context add `-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` and raise `--max-model-len`.

With only **3.5B active params** and ~18 GB resident, decode is fast and KV headroom is large — an excellent default Spark model.

#### 7.3.2 Nemotron-3-Super-120B-A12B-NVFP4 — flagship local MoE

NVIDIA's own hybrid **Mamba-Transformer LatentMoE**, **120B total / 12B active**, native NVFP4 pretraining, MTP, 1M context. The vLLM team's reference Spark deployment (~23 tok/s decode; ~10–15 min first load).

```bash
docker run -d --name nemotron --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:cu130-nightly \
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --served-model-name nemotron-3-super --trust-remote-code \
    --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4 \
    --reasoning-parser nemotron_v3 \
    --enable-auto-tool-choice --tool-call-parser qwen3_coder
```

> [!NOTE]
> Two NVIDIA-**published** NVFP4 checkpoints of *other vendors'* models live in §7.4 because they follow those families: **Llama-3.3-70B-NVFP4** (Meta) and **Phi-4-multimodal-NVFP4** (Microsoft). The `nvidia/...` Qwen and Gemma NVFP4 checkpoints likewise live in their family groups above. This section is for models NVIDIA itself designed.
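
The per-request reasoning toggle described for Nemotron-3-Nano in §7.3.1 can be sketched as a request body with `chat_template_kwargs` at the top level. `chat_payload` is a hypothetical helper, the `max_tokens` value is arbitrary, and the model name assumes the `--served-model-name nemotron-3-nano` from that recipe.

```bash
# Hypothetical helper: build a chat-completions body that toggles thinking
# per request via chat_template_kwargs. Prompts containing double quotes or
# backslashes would need JSON escaping, which this sketch does not do.

chat_payload() {
  # $1 = served model name, $2 = prompt, $3 = true|false (enable_thinking)
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":256,"chat_template_kwargs":{"enable_thinking":%s}}\n' \
    "$1" "$2" "$3"
}

# Example against a running server (commented out so the sketch has no
# side effects):
# chat_payload nemotron-3-nano "12*17" false \
#   | curl -sS http://localhost:8000/v1/chat/completions \
#       -H "Content-Type: application/json" -d @-
```

With the OpenAI Python client the same field goes through `extra_body`, as shown for gemma4-coder in §7.2.3.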

---

### 7.4 Mistral & other open models

#### 7.4.1 Mistral-Small-4-119B-2603-NVFP4 — unified instruct + reasoning + coding (Mistral AI)

A granular MoE (**128 experts, 4 active; 119B total / 6.5B active**) fusing Instruct, Reasoning (Magistral), and Devstral skills, multimodal, 256K context, Apache 2.0, ~60 GB NVFP4 (fits one Spark). Compressed-tensors → **no `--quantization`**; uses **MLA** attention and **Mistral** parsers.

```bash
docker run -d --name mistral-small4 --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:cu130-nightly \
  vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
    --tensor-parallel-size 1 --max-model-len 65536 \
    --attention-backend TRITON_MLA \
    --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral \
    --max-num-batched-tokens 16384 --max-num-seqs 4 \
    --gpu-memory-utilization 0.85
```

> [!IMPORTANT]
> Needs **`mistral_common >= 1.11.0`** (bundled in recent vLLM; `uv pip install -U vllm` pulls it). For correct behavior, load the repo's **`SYSTEM_PROMPT.txt`** and set **`reasoning_effort`** per request (`"none"` for fast replies, `"high"` for hard tasks; temp 0.7 with reasoning). The card's example uses TP=2 + `--max-num-seqs 128` for a multi-GPU server — on a single Spark use **TP=1**, `--max-num-seqs 4`, `--max-model-len 65536`.

#### 7.4.2 gpt-oss-120b — open-weights MXFP4 MoE (OpenAI)

~65 GB native MXFP4 — fits one Spark with KV-cache room. Ungated. The 20B sibling (`openai/gpt-oss-20b`) is a great first smoke-test.
```bash
docker run -d --name gptoss --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve openai/gpt-oss-120b \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4 \
    --enable-expert-parallel
```

> [!NOTE]
> gpt-oss models use OpenAI's **Harmony** response format and expose a **`reasoning_effort`** control (`"low"`/`"medium"`/`"high"`) passed in the request body — there's no separate `--reasoning-parser` flag. For tool calling add `--enable-auto-tool-choice --tool-call-parser openai`. Use a recent vLLM/NGC image (older builds predate Harmony parsing).

#### 7.4.3 Llama-3.3-70B-Instruct-NVFP4 — gated dense (Meta, NVIDIA quant)

```bash
docker run -d --name llama70 --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve nvidia/Llama-3.3-70B-Instruct-NVFP4 \
    --quantization modelopt --max-model-len 131072 \
    --gpu-memory-utilization 0.85 --max-num-seqs 4
```

Dense 70B → memory-bandwidth-limited decode (slower than a similar-size MoE), but high single-user quality. Needs an accepted Llama license + `HF_TOKEN`.

#### 7.4.4 Phi-4-multimodal-instruct-NVFP4 — omnimodal text + image + audio (Microsoft, NVIDIA quant)

Microsoft's 5.6B **omnimodal** Phi-4 — text, image, **and speech/audio** — 128K context, `Phi4MMForCausalLM` with a custom processor. NVIDIA NVFP4 ModelOpt quant; small and cheap on Spark.

> [!CAUTION]
> **Known container gotcha:** Phi-4-mm's custom processor imports **`scipy`**, which is **not installed** in the NGC vLLM images — serving fails with `ImportError: ... scipy`. Install it before `vllm serve` (or bake it into a derived image). Tracked in NVIDIA/dgx-spark-playbooks issue #69.
```bash
docker run -d --name phi4mm --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  bash -lc "pip install --no-cache-dir scipy && \
    vllm serve nvidia/Phi-4-multimodal-instruct-NVFP4 \
      --quantization modelopt --trust-remote-code \
      --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4"
```

> [!NOTE]
> Needs **`--trust-remote-code`** (custom `Phi4MM` processor). Send images via OpenAI `image_url` blocks and audio via the audio input field (any `soundfile`-readable format). The NVFP4 checkpoint was originally published for TensorRT-LLM but runs under vLLM with the two requirements above.

---

## 8. Flag reference & tuning

| Flag | On Spark | Guidance |
|---|---|---|
| `--gpu-memory-utilization` | Fraction of the **shared 128 GB** | Start from the model recipe (0.4–0.92 seen). Leave headroom; lower if OOM. |
| `--max-num-seqs` | Concurrent sequences | Keep **low (1–4)**; above ~4 the bandwidth tax outweighs batching. |
| `--max-model-len` | Prompt + completion cap | 65536 is a sane Spark default; raise toward model max only with headroom. |
| prefix caching | KV reuse across shared prefixes | **On by default in V1**; `--enable-prefix-caching` is redundant but harmless. |
| `--quantization modelopt` | ModelOpt NVFP4 only | Pass for `nvidia/...` ModelOpt checkpoints; **omit** for compressed-tensors (auto-detected). |
| `--reasoning-parser` / `--tool-call-parser` + `--enable-auto-tool-choice` | Structured reasoning/tools | **Follow the model recipe** (qwen3 / gemma4 / mistral / nemotron_v3). |
| `--kv-cache-dtype fp8` | Shrinks KV cache | Model-specific: the Qwen3.6-NVFP4 & gemma4-coder-MTP recipes use it; Qwen3.5 and DFlash do **not**. |
| `--speculative-config '{"method":"mtp"...}'` | Built-in speculative decode | Latency lever for MTP models (Qwen3.x, Nemotron, gemma4-coder). See §13. |
| `--moe-backend` / `--attention-backend` | Kernel pins | Leave auto unless a tested recipe pins one (marlin / cutlass / flashinfer / TRITON_*). |
| `--enable-expert-parallel` | MoE routing | Enable for MoE (gpt-oss, Nemotron, Mistral, Gemma-4-26B). |
| CUDA graphs | Per-step overhead | Keep enabled. |
| `--load-format fastsafetensors` | Faster weight load | Evaluate if the 10–15 min load matters. |

> [!TIP]
> **Stability ↔ throughput slider.** A plain run leaves KV-cache dtype unset, spec-decode off, CUDA graphs on, backends auto. A tuned run layers in FP8 KV cache, async scheduling, MTP/DFlash, and pinned FP4 backends — each validated against the exact model + prompt shape + batch pattern + vLLM release. **Want ~2–3× faster interactive decode? See §13 (MTP & DFlash).**

---

## 9. Pre-warm the JIT

The **first** request after boot triggers Inductor + FlashInfer codegen (~25–60 s). Fire a tiny warmup on the **same path** as real traffic (replace `MODEL_ID` with your served model id), then short prompts return in <0.5 s:

```bash
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"MODEL_ID","messages":[{"role":"user","content":"ping"}],"max_tokens":3}'
```

This is separate from **weight load** (10–15 min for a 120B) — address that with `fastsafetensors`/InstantTensor if needed.

---

## 10. Verify & monitor

```bash
curl -sS http://localhost:8000/health
curl -sS http://localhost:8000/v1/models | jq
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"MODEL_ID","messages":[{"role":"user","content":"Explain quantum computing briefly."}],"max_tokens":200,"stream":true}'
```

Prometheus telemetry is at `/metrics` (no extra service). Watch KV-cache utilization (`vllm:kv_cache_usage_perc`) and TTFT / inter-token-latency histograms. Healthy single-user behavior: prefix-cached later turns don't spike, decode rate stays steady, KV usage stays well below the context limit.
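
As a small sketch of the KV-cache check above, the filter below pulls `vllm:kv_cache_usage_perc` out of the Prometheus text and prints it as a percentage. It assumes the gauge is reported as a 0 to 1 fraction and reads only the first matching series; `kv_usage_pct` is a hypothetical helper name.

```bash
# Hypothetical helper: read Prometheus text on stdin and print the
# vllm:kv_cache_usage_perc gauge as a percentage.
# Assumption: the gauge is a 0-1 fraction; only the first series is used.

kv_usage_pct() {
  # HELP/TYPE lines start with "#", so they never match the anchored pattern.
  awk '/^vllm:kv_cache_usage_perc/ { printf "%.1f\n", $NF * 100; exit }'
}

# Example against a running server (commented out):
# curl -s http://localhost:8000/metrics | kv_usage_pct
```

Polling this every few seconds during a long-context session (for example with `watch`) is a quick way to see whether `--max-model-len` leaves enough headroom.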

> [!TIP]
> **Confirm the fast paths are live.** Check the startup log to verify NVFP4 GEMM kernels actually engaged (you want a line like `Using NvFp4LinearBackend.VLLM_CUTLASS for NVFP4 GEMM`) — if it silently fell back, you lose the FP4 speed/memory win. For speculative decoding, confirm acceptance via `curl -s localhost:8000/metrics | grep -i spec_decode` (see §13.5).

---

## 11. Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| `permission denied ... docker.sock` | Not in `docker` group | `sudo usermod -aG docker $USER && newgrp docker` |
| OOM with free RAM showing | Page cache holds unified memory | `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'` before launch |
| OOM during load/serve | Util too high / context too long | Lower `--gpu-memory-utilization`, reduce `--max-model-len`, or pick a smaller/more-quantized model |
| 401 / gated download fails | Missing token / license | `export HF_TOKEN=...`, accept the license on the HF page |
| `model type ... not recognized` | Container too old for the arch | Newer NGC tag or `vllm-openai:nightly`; check `vllm --version` |
| `unknown quantization` / weights mis-load | Wrong quant flag | ModelOpt → `--quantization modelopt`; compressed-tensors → omit it |
| Stable image won't run on GB10 | No `sm_121` support | Use `cu130-nightly` / NGC image |
| Output repetition loops | `--kv-cache-dtype fp8` on a model that dislikes it | Remove it (model-specific) |
| ~9× slower MoE | `--enable-chunked-prefill` on SSM+MoE | Remove it |
| First request slow, rest fine | JIT warmup | Pre-warm (§9) |
| `rmi` won't delete image | Tag omitted | `docker stop NAME && docker rm NAME` then `docker rmi IMAGE:TAG` |

---

## 12. Scaling to multiple Sparks (brief)

A single Spark handles models up to ~100 GB (gpt-oss-120b MXFP4, Mistral-Small-4 NVFP4, Llama-70B NVFP4).
For larger models or higher TP, link Sparks over the 200 Gb/s QSFP (ConnectX-7) ports and run a Ray cluster with `--tensor-parallel-size` set to the number of GPUs:

- **2 Sparks:** direct QSFP cable. **3+:** through a switch.
- Bind NCCL to the QSFP interface (`NCCL_SOCKET_IFNAME=enP2p1s0f1np1`); falling back to Ethernet cuts throughput by 10–20×.
- **Mount the same `~/vllm` on every node** and stage weights on each.
- Easiest: the `mark-ramsey-ri/vllm-dgx-spark` scripts, or NVIDIA's `spark_cluster_setup.sh` + the official multi-Spark playbook.

---

## 13. Speculative decoding on Spark: MTP & DFlash (step-by-step)

A small, cheap drafter proposes the next few tokens and the big model verifies them in one pass, so accepted tokens are effectively free. It's a **latency** win that shines at **low concurrency**, exactly Spark's profile, and can speed up interactive decode by roughly **2–3×** with no quality loss. Two methods matter: **MTP** (built into the model) and **DFlash** (a separate diffusion drafter).

### 13.1 Do they need special models, Docker, or configs?

| | **MTP** | **DFlash** |
|---|---|---|
| **Special model?** | Target has built-in MTP modules (no download) **or** ships a small paired draft model (e.g. gemma4-coder's bundled `assistant/`). | **Yes**: a matching DFlash **drafter** checkpoint trained for your target. |
| **Special Docker?** | **No**: standard Spark images. | **Usually**: vLLM **≥ 0.21.0** + `sm_121` build. NGC `26.04` (vLLM 0.19.0) is too old; use a prebuilt DFlash image or nightly. |
| **Special config?** | `--speculative-config '{"method":"mtp","num_speculative_tokens":1}'` | `--speculative-config '{"method":"dflash","model":"","num_speculative_tokens":N}'` + KV cache **BF16** (no fp8). |
| **Models** | DeepSeek V3/R1/V4, Qwen3.5/3.6, GLM-5.x, Gemma 4 (+gemma4-coder), Nemotron-3-Super, Mistral | Gemma-4-31B, Laguna XS.2, Qwen3.5-27B (or train your own) |
| **Speedup** | ~1.6–1.8× | up to ~6× over plain decoding and ~2.5× over EAGLE-3; ~2.2–2.7× on Spark |

> [!NOTE]
> Both consume KV cache for speculative tokens, so they **trade peak throughput for latency**. Keep them on for interactive use; for bursty/concurrent serving add `--speculative-disable-by-batch-size 32`.

**Which models in this guide can use which method:**

| Model (this guide) | MTP | DFlash | How |
|---|---|---|---|
| Qwen3.6-35B-A3B-NVFP4 (§7.1.1) | ✅ built-in | — | already in its recipe (`num_speculative_tokens:3`) |
| Qwen3.6-27B-NVFP4 (§7.1.2) | ✅ built-in (validate) | — | add `--speculative-config '{"method":"mtp",...}'` |
| Qwen3.5-35B-A3B (§7.1.3) | ✅ built-in | via 27B sibling | dense Qwen3.5-27B has a DFlash drafter |
| gemma-4-12B-coder (§7.2.3) | ✅ paired draft | — | bundled `/model/assistant` + `kv fp8` |
| Gemma-4-31B-IT-NVFP4 (§7.2.1) | — | ✅ | `RedHatAI/...speculator.dflash` (Z-Lab image) |
| Nemotron-3-Super (§7.3.2) | ✅ built-in | — | `{"method":"mtp","num_speculative_tokens":1}` |
| Nemotron-3-Nano (§7.3.1) | — | — | card ships no speculator; run dense |
| Mistral-Small-4 (§7.4.1) | ✅ built-in (validate) | — | add the mtp flag and check acceptance |

(A dash means no published speculator for that exact checkpoint today; "validate" means the architecture supports it, but confirm acceptance on your build.)

### 13.2 MTP: zero-download (≈5 min)

Just a flag on an MTP-capable model (the §7.1.1 Qwen3.6 and §7.2.3 gemma4-coder recipes already use it):

> [!NOTE]
> **Two flavors of MTP.** Most MTP models (Qwen3.6, Nemotron-Super, DeepSeek-style) have the prediction heads **baked into the checkpoint**: nothing extra to download, just the flag. A few ship a **small paired draft model** instead: gemma4-coder bundles a 0.4 B draft in `assistant/`, so you point `"model"` at it (`/model/assistant`); see §7.2.3. Both use `"method":"mtp"`; only the second needs the `"model"` field.

```bash
docker run -d --name mtp --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --host 0.0.0.0 --port 8000 --trust-remote-code \
  --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4 \
  --reasoning-parser nemotron_v3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```

`num_speculative_tokens` is usually **1** (some recipes use 3). For lowest latency at low concurrency, the vLLM recipe suggests disabling prefix caching. MTP **reduces throughput under load**, so pair it with `--speculative-disable-by-batch-size 32`.

### 13.3 DFlash on Spark: plug-and-play container (easiest)

The `ghcr.io/aeon-7/vllm-dflash` image is a prebuilt `sm_121` vLLM with DFlash baked in; it serves a 27B dense Qwen3.5 with a 2B block-diffusion drafter, taking decode from **~12 → ~33 tok/s**.
```bash
# 1) download the target into ~/vllm
pip install -U "huggingface_hub[cli]"; export HF_TOKEN=hf_xxx
hf download AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --local-dir ~/vllm/DFlash-Qwen3.5-27B-Uncensored-NVFP4

# 2) make a persistent API key, then launch (drafter auto-downloads)
export VLLM_API_KEY=$(openssl rand -hex 32); echo "API key: $VLLM_API_KEY"
docker run -d --name vllm-dflash --runtime nvidia --network host --ipc host \
  --restart unless-stopped \
  -v ~/vllm/DFlash-Qwen3.5-27B-Uncensored-NVFP4:/models/target \
  -v ~/vllm/drafter-cache:/models/drafter-cache \
  -e MODEL_PATH=/models/target \
  -e SERVED_MODEL_NAME=dflash-qwen3.5-27b \
  -e DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash \
  -e DFLASH_NUM_SPEC_TOKENS=15 \
  -e MAX_MODEL_LEN=65536 -e MAX_NUM_SEQS=4 \
  -e GPU_MEMORY_UTILIZATION=0.85 -e MAX_NUM_BATCHED_TOKENS=65536 \
  -e VLLM_API_KEY="$VLLM_API_KEY" -e HF_TOKEN="$HF_TOKEN" \
  ghcr.io/aeon-7/vllm-dflash:latest
docker logs -f vllm-dflash   # ~5 min

# 3) test (note the Bearer token)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{"model":"dflash-qwen3.5-27b","messages":[{"role":"user","content":"Write a haiku about GPUs."}],"max_tokens":200}'
```

**Key env vars:** `DFLASH_DRAFTER` (HF id of the drafter; empty = plain vLLM), `DFLASH_NUM_SPEC_TOKENS` (15 best single-stream, 5 for high concurrency), `KV_CACHE_DTYPE` (**stays BF16 with DFlash**). Spark presets, given as `MAX_MODEL_LEN` / `DFLASH_NUM_SPEC_TOKENS` / `MAX_NUM_SEQS`: default 65536/15/4 ≈ 33 tok/s; high-concurrency 32768/5/8 ≈ 85–92 tok/s total.

### 13.4 DFlash the general way (other models)

DFlash uses vLLM's **speculators** format; pass the drafter in `--speculative-config`.
Available drafters today:

| Target (verifier) | DFlash drafter | Source |
|---|---|---|
| `google/gemma-4-31B-it` | `RedHatAI/gemma-4-31B-it-speculator.dflash` | RedHatAI / vLLM |
| `poolside/Laguna-XS.2` | `poolside/Laguna-XS.2-speculator.dflash` | poolside |
| `Qwen/Qwen3.5-27B` | `z-lab/Qwen3.5-27B-DFlash` | Z Lab |

```bash
docker run --rm vllm/vllm-openai:nightly vllm --version   # confirm >= 0.21.0

# Laguna XS.2 (coding; up to 7 tokens/step, ~70% acceptance on code)
docker run -d --name laguna --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -e VLLM_USE_DEEP_GEMM=0 -v ~/vllm:/models \
  vllm/vllm-openai:nightly \
  vllm serve poolside/Laguna-XS.2 --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser poolside_v1 --reasoning-parser poolside_v1 \
  --speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
```

> [!NOTE]
> DFlash needs vLLM **≥ 0.21.0**; the stock NGC `26.04` (0.19.0) won't do it. Gemma-4 DFlash currently needs Z Lab's build (`ghcr.io/z-lab/vllm-openai:gemma4-dflash-cu130`). No drafter for your model? Train one: the drafter is a small Qwen3-style stack and the `speculators` project ships a "Train DFlash" tutorial.

### 13.5 Verify it's actually accelerating (don't fly blind)

Speculative decoding only helps if the drafter's tokens are **accepted**. Confirm it, then tune:

- **Check acceptance.** vLLM logs a spec-decode summary (draft acceptance rate and **mean accepted length** per step) and exposes it on `/metrics`:

  ```bash
  curl -s http://localhost:8000/metrics | grep -i spec_decode
  ```

  Healthy MTP runs accept most proposed tokens (mean accepted length well above 1); DFlash on code can accept whole blocks. If acceptance is near zero, the method isn't helping: the drafter/target are mismatched or the content is low-predictability (random/adversarial text).
- **Tune `num_speculative_tokens` (K).** Raise K for **DFlash** (it drafts a whole block in one pass; 7–15 is normal); keep K **low for MTP** (1–3) since each extra token costs another sequential draft step. Watch decode tok/s as you change it: past the acceptance sweet spot, throughput drops.
- **A/B against a plain (non-speculative) run.** Time the same prompt at temperature 0 with the spec flag removed. Keep speculation only if tok/s is clearly higher *and* the output is identical; otherwise drop it for that workload.
- **Concurrency check.** Both methods tax throughput under load; if you serve bursts, set `--speculative-disable-by-batch-size 32` so vLLM auto-disables speculation when batches grow.

### 13.6 The gotchas that bite people

> [!CAUTION]
> - **DFlash floor:** vLLM ≥ 0.21.0 (AEON-7/Z-Lab image or nightly). MTP is fine on 0.19.0+.
> - **DFlash KV cache stays BF16:** never combine it with `--kv-cache-dtype fp8`. (Note: the *MTP* gemma4-coder draft is the opposite; it *needs* fp8 KV.)
> - **One drafter per target:** a DFlash drafter is trained for a specific model.
> - **Higher K is cheap with DFlash** (whole block in one pass), not with MTP; keep MTP `num_speculative_tokens` low.
> - **Both hurt throughput under load:** add `--speculative-disable-by-batch-size 32`.
> - **Acceptance is content-dependent:** big on code/structured output, smaller on prose, none on random text.

---

## 14. Sources

- NVIDIA DGX Spark vLLM playbook: https://build.nvidia.com/spark/vllm/instructions
- NVIDIA `dgx-spark-playbooks`: https://github.com/NVIDIA/dgx-spark-playbooks
- vLLM team blog "vLLM on the DGX Spark" (Jun 2026): https://vllm.ai/blog/2026-06-01-vllm-dgx-spark
- vLLM speculative decoding / MTP / DFlash docs: https://docs.vllm.ai/en/latest/features/speculative_decoding/ · https://docs.vllm.ai/projects/speculators/en/latest/user_guide/algorithms/dflash/
- DFlash paper (Z Lab, arXiv 2602.06036): https://arxiv.org/abs/2602.06036 · `AEON-7/vllm-dflash`: https://github.com/AEON-7/vllm-dflash
- Model cards: `nvidia/Qwen3.6-35B-A3B-NVFP4`, `unsloth/Qwen3.6-27B-NVFP4`, `nvidia/Gemma-4-31B-IT-NVFP4`, `nvidia/Gemma-4-26B-A4B-NVFP4`, `sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4`, `nvidia/diffusiongemma-26B-A4B-it-NVFP4`, `mistralai/Mistral-Small-4-119B-2603-NVFP4`, `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4`, `nvidia/Phi-4-multimodal-instruct-NVFP4` (all on https://huggingface.co)
- Phi-4-mm scipy container issue: https://github.com/NVIDIA/dgx-spark-playbooks/issues/69
- NGC vLLM container tags: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm

*Compiled June 2026. Benchmark numbers are reproduced from each model's Hugging Face card (NVFP4-vs-baseline tables, or vendor-reported gains). Container tags and model handles move fast, so pin a known-good image digest for anything you depend on.*

---

*Source: [https://vlaicu.io/posts/dgx-vllm/](https://vlaicu.io/posts/dgx-vllm/)*