vLLM on DGX Spark

[!IMPORTANT] Where models live. Every command in this guide stores model weights in ~/vllm on the host and mounts it into the container at /models (via HF_HOME=/models). Download once, mount everywhere. A couple of community models that bundle a speculative-decode draft use a per-model local path under ~/vllm/<name> instead — those are flagged inline.

A practical, end-to-end guide for serving LLMs with vLLM on a DGX Spark (GB10 Grace Blackwell), assembled from NVIDIA’s official playbook, the vLLM team’s deep-dive, model cards, and battle-tested community setups.

[!WARNING] Flag choices on Spark are model- and image-specific, not hardware-wide defaults. The recipes below are starting points that worked for their authors against a specific container tag. Validate against the exact image you run, and pin a known-good tag/digest for anything you depend on. Copying a flag from one model’s recipe to another can silently regress throughput or output quality.

1. The landscape: current playbooks & walkthroughs

Source	Best for	Link
NVIDIA official playbook	Canonical install steps; single / stacked / switched / troubleshooting tabs	https://build.nvidia.com/spark/vllm/instructions
NVIDIA `dgx-spark-playbooks`	Source-of-truth repo; benchmarking + cluster bootstrap	https://github.com/NVIDIA/dgx-spark-playbooks
DeepWiki: vLLM on Spark	Clean model support matrix + serving params	https://deepwiki.com/NVIDIA/dgx-spark-playbooks/4.2-vllm
vLLM team blog (Jun 2026) — authoritative config deep-dive	Why each flag matters, unified-memory behavior, JIT pre-warm	https://vllm.ai/blog/2026-06-01-vllm-dgx-spark
`mark-ramsey-ri/vllm-dgx-spark`	1-to-N Spark orchestration, 41 model presets	https://github.com/mark-ramsey-ri/vllm-dgx-spark
`AEON-7/vllm-dflash`	Plug-and-play DFlash container for Spark	https://github.com/AEON-7/vllm-dflash
vLLM Recipes index	Per-model launch commands, kept current	https://recipes.vllm.ai/
NGC vLLM container tags	Latest `nvcr.io/nvidia/vllm` build	https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm

2. Hardware facts that drive every config

DGX Spark is a GB10 Grace Blackwell SoC with 128 GB unified memory shared by CPU and GPU, on a consumer-Blackwell sm_121 GPU, ~273 GB/s bandwidth, and an ARM64 (aarch64) host. Four consequences shape everything below:

One memory pool for everything. Weights, KV cache, the OS, page cache, and the runtime all draw from the same 128 GB. --gpu-memory-utilization is a fraction of that shared pool — leave headroom or you OOM with “free” RAM still showing.
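Memory-bandwidth bound for decode. Each generated token has to stream the active weights from memory, so the ~273 GB/s link caps single-stream decode speed. Spark shines at small-batch, single-user interactive serving. Keep concurrency low (--max-num-seqs ≈ 1–4).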
NVFP4 MoE is the sweet spot. Quantized MoE with ~3–13 B active params (even at 25–120 B total) is the best fit: small active set = fast decode, and 4-bit weights free unified memory for KV cache. Prefer NVFP4/FP8/MXFP4 over BF16.
Use sm_121-validated builds. The upstream stable vLLM image does not support GB10. Use the CUDA-13 nightly track (vllm/vllm-openai:cu130-nightly / :nightly) or NVIDIA’s NGC container (nvcr.io/nvidia/vllm:26.02-py3+).
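The bandwidth point can be made concrete with a back-of-the-envelope ceiling. The sketch below is an illustration, not a benchmark: it assumes decode speed is bounded by bandwidth divided by the bytes of active weights read per token, and it ignores KV-cache reads, quantization scale overhead, and kernel overhead. The function name `decode_ceiling_tps` is made up for this example.

```shell
#!/bin/sh
# Rough decode ceiling: each generated token streams the active weights once,
# so tokens/s <= bandwidth / bytes_of_active_weights.
# Assumption: KV-cache reads, scale factors, and kernel overhead are ignored,
# so real throughput will be lower than the number printed here.
# Args: active params (billions), bits per weight, bandwidth in GB/s (default 273).
decode_ceiling_tps() (
  awk -v p="$1" -v bits="$2" -v bw="${3:-273}" \
    'BEGIN { printf "%.1f\n", bw / (p * bits / 8) }'
)

decode_ceiling_tps 3 4     # ~3B active, 4-bit weights
decode_ceiling_tps 12 4    # ~12B active, 4-bit weights
decode_ceiling_tps 70 4    # dense 70B, 4-bit weights
```

The three calls print 182.0, 45.5, and 7.8 tokens/s respectively. These are upper bounds only; for instance, the ~23 tok/s reported later for the 12B-active Nemotron-3-Super sits well under its 45.5 ceiling. The gap between a small active set and a dense 70B is the reason quantized MoE is the sweet spot on this hardware.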

3. Storing all models in `~/vllm`

vLLM fetches weights through the Hugging Face stack, which caches under the directory named by HF_HOME. Point that at a host directory and mount it, and every model lives in ~/vllm, shared across containers.

mkdir -p ~/vllm

Pre-stage on the host (recommended) — files end up owned by you, the download happens once, and the first vllm serve doesn’t stall on a multi-GB pull:

pip install -U "huggingface_hub[cli]"        # one-time
export HF_HOME=~/vllm
export HF_TOKEN=hf_xxx                          # for gated models; accept the license on the HF page first
hf download nvidia/Qwen3.6-35B-A3B-NVFP4

The mount, in every docker run:

  -e HF_HOME=/models \
  -v ~/vllm:/models \

Inside the container HF now caches at /models (= ~/vllm on host); pre-staged weights at ~/vllm/hub/... appear at /models/hub/... automatically.
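To check whether a model is already staged before launching a container, you can look for its cache directory directly. This is a minimal sketch that relies on the Hugging Face hub cache layout (`hub/models--<org>--<name>/snapshots/...`); the helper names `hf_cache_dir` and `is_staged` are hypothetical, and the fallback to `$HOME/vllm` is this guide's convention rather than the Hugging Face default.

```shell
#!/bin/sh
# Map a Hugging Face repo id to its cache directory.
# Args: repo id, optional cache root (defaults to $HF_HOME, then $HOME/vllm,
# which is this guide's convention, not the library default).
hf_cache_dir() (
  repo="$1"
  home="${2:-${HF_HOME:-$HOME/vllm}}"
  printf '%s/hub/models--%s\n' "$home" "$(printf '%s' "$repo" | sed 's|/|--|g')"
)

# Succeed when at least one snapshot of the repo exists locally (no network).
is_staged() (
  dir="$(hf_cache_dir "$1" "${2:-}")"
  [ -d "$dir/snapshots" ] && [ -n "$(ls -A "$dir/snapshots" 2>/dev/null)" ]
)

hf_cache_dir nvidia/Qwen3.6-35B-A3B-NVFP4 /models
```

The last line prints `/models/hub/models--nvidia--Qwen3.6-35B-A3B-NVFP4`, the path the container sees. On the host, `is_staged nvidia/Qwen3.6-35B-A3B-NVFP4 ~/vllm` tells you whether `hf download` still needs to run. It only checks that a snapshot directory exists, not that every shard finished downloading.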

[!TIP] Force fully-local loading (air-gapped, no network checks) with -e HF_HUB_OFFLINE=1 once weights are staged. If you let the container (root) download instead of pre-staging, files in ~/vllm are root-owned — add --user $(id -u):$(id -g) if that matters.

4. Is volume mounting necessary?

Yes: treat it as mandatory. A container filesystem is ephemeral, so a freshly downloaded 27–120B-parameter model is destroyed when the container is removed. Without a mount you re-download many GB for every new container, on top of the 10–15 min weight load that a large model needs at each start anyway. The vLLM team’s guidance is “download once, mount everywhere.”

The one other volume worth adding is the torch.compile cache (the slow part of cold start), which lives separately from your models at /root/.cache/vllm:

  -v ~/vllm-compile-cache:/root/.cache/vllm \

[!NOTE] The compile cache only pays off when you recreate a container (after an image or flag change); a long-running --restart unless-stopped container compiles once anyway. The cache is keyed to GPU arch + vLLM version + model + flags, so clear it (rm -rf ~/vllm-compile-cache/*) if a stale entry causes a startup failure after an upgrade. Note that --ipc=host and --shm-size configure shared memory (RAM); they are not -v volumes.

5. One-time host setup

nvidia-smi                                   # GPU + driver visible
docker ps                                    # else: sudo usermod -aG docker $USER && newgrp docker
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
mkdir -p ~/vllm
export HF_TOKEN=hf_xxx                        # gated models
# NGC pulls 401? -> docker login nvcr.io  (user is the literal string '$oauthtoken', password = NGC API key)

[!WARNING] Unified-memory OOM valve. Because CPU and GPU share DRAM, the Linux page cache can hold memory that CUDA can’t reclaim, so a load can hit an “OOM” while total usage is still well under 128 GB. If a big model fails to load after heavy file activity, flush caches first: sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
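A quick pre-flight check can tell you whether a cache flush is worth doing before a launch. The sketch below makes two assumptions that are not from NVIDIA or vLLM documentation: that `MemFree` in `/proc/meminfo` is a fair proxy for what CUDA can claim (it excludes page cache, matching the failure mode above), and that the launch needs roughly `--gpu-memory-utilization` × 128 GiB to be free. The helper names are hypothetical.

```shell
#!/bin/sh
# Free memory in GiB as reported by the kernel. MemFree excludes page cache.
# Assumption: cached pages count as unavailable to CUDA until dropped.
mem_free_gib() {
  awk '/^MemFree:/ { printf "%.1f\n", $2 / 1048576 }' /proc/meminfo
}

# Exit 0 when free memory is below util * total, i.e. the requested share of
# the pool is not currently free and a cache flush may help.
# Args: free GiB, utilization fraction, total GiB (default 128).
needs_cache_flush() {
  awk -v free="$1" -v util="$2" -v total="${3:-128}" \
    'BEGIN { exit !(free < util * total) }'
}

if needs_cache_flush "$(mem_free_gib)" 0.85; then
  echo "free memory is below 0.85 x 128 GiB; consider dropping caches first"
fi
```

For example, `needs_cache_flush 90 0.85` succeeds (90 GiB is below the 108.8 GiB target), while `needs_cache_flush 120 0.85` does not. Treat the result as a hint; the real limit depends on what else is resident.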

6. Baseline single-Spark recipe (template)

Minimal, model-agnostic shape — swap the model handle and per-model flags from §7.

export HF_TOKEN=hf_xxx   # only for gated models

docker run -d --name vllm --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -e HF_HOME=/models \
  -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve <HF_MODEL_HANDLE> \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 4

Smoke-test (first request triggers JIT warmup — see §9):

curl -sS http://localhost:8000/v1/models | jq -r '.data[0].id'
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"<HF_MODEL_HANDLE>","messages":[{"role":"user","content":"12*17"}],"max_tokens":500}'
# expect "204"
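Because weight loading can take many minutes, the smoke test is easier to script if it first waits for the server to come up. The loop below is a minimal sketch that polls the `/health` endpoint (used again in §10); the function name `wait_for_vllm`, the 30-minute default timeout, and the 10-second interval are arbitrary choices for illustration.

```shell
#!/bin/sh
# Poll /health until the server answers with success or the timeout expires.
# Args: base URL (default http://localhost:8000), timeout in seconds
# (default 1800), poll interval in seconds (default 10).
wait_for_vllm() {
  _url="${1:-http://localhost:8000}"
  _timeout="${2:-1800}"
  _interval="${3:-10}"
  _waited=0
  while [ "$_waited" -lt "$_timeout" ]; do
    # -f makes curl fail on HTTP errors, so only a healthy reply passes.
    if curl -fsS -o /dev/null --max-time 5 "$_url/health" 2>/dev/null; then
      echo "ready after ~${_waited}s"
      return 0
    fi
    sleep "$_interval"
    _waited=$((_waited + _interval))
  done
  echo "not ready after ${_timeout}s" >&2
  return 1
}
```

A typical use is `wait_for_vllm && curl -sS http://localhost:8000/v1/models`. If the loop times out, `docker logs vllm` (the container name from the template above) usually shows whether the load is still in progress or has failed.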

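[!NOTE] Two container tracks: pick per recipe. NVIDIA’s NGC image nvcr.io/nvidia/vllm:26.04-py3 (vLLM 0.19.0) is the default for many nvidia/... checkpoints. Several recipes below use the upstream vllm/vllm-openai:nightly / :cu130-nightly track (needed for the newest architectures and for DFlash). A given model may need a newer container than you have; vllm --version inside the image tells you. One difference to watch: the upstream vllm/vllm-openai images normally set the vLLM server as their entrypoint, so anything after the image name is passed to it as arguments (as in §7.1.3 and §7.2.3). If a recipe that spells out vllm serve after an upstream image fails with an argument error, drop the leading vllm serve or override the entrypoint with --entrypoint.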

[!IMPORTANT] Quantization flag rule. For NVIDIA ModelOpt NVFP4 checkpoints (the nvidia/... repos), pass --quantization modelopt. For compressed-tensors NVFP4 (Unsloth, llm-compressor community builds), vLLM auto-detects — do not pass --quantization. Getting this wrong is a common first-run failure.
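If you are unsure which family a staged checkpoint belongs to, its config files usually say. The sketch below is a heuristic, not vLLM's own detection logic: it assumes ModelOpt exports either ship an `hf_quant_config.json` or declare a `quant_method` starting with `modelopt` in `config.json`, and that anything else (such as `compressed-tensors`) should get no flag. The function name `quant_flag_for` is hypothetical.

```shell
#!/bin/sh
# Print the quantization flag to pass for a local checkpoint directory,
# or nothing when vLLM should auto-detect.
# Assumption: ModelOpt checkpoints carry hf_quant_config.json or a
# quant_method beginning with "modelopt" in config.json.
quant_flag_for() {
  _dir="$1"
  if [ -f "$_dir/hf_quant_config.json" ] ||
     grep -Eq '"quant_method"[[:space:]]*:[[:space:]]*"modelopt' "$_dir/config.json" 2>/dev/null; then
    echo "--quantization modelopt"
  fi
}
```

Point it at a snapshot directory inside the cache, for example `quant_flag_for ~/vllm/hub/models--<org>--<name>/snapshots/<revision>`, and splice the output into the `vllm serve` arguments. When a recipe in §7 disagrees with the heuristic, follow the recipe.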

7. Per-model configurations

Grouped by family: Qwen → Gemma → NVIDIA → Mistral & other. Every model below ships a real benchmark from its card (charted inline). All commands assume ~/vllm is your model store.

Quick comparison

Model	Publisher	Arch	Total / Active	Quant	Image	Key flags
Qwen3.6-35B-A3B-NVFP4	NVIDIA (Qwen)	MoE+hybrid attn, multimodal	35B / 3B	NVFP4 (modelopt)	`vllm-openai:nightly`	env vars, MTP-3, `--quantization modelopt`
Qwen3.6-27B-NVFP4	Unsloth (Qwen)	dense + vision	27B	NVFP4 (c-t)	`vllm-openai:nightly`	`--dtype bfloat16`, trust-remote-code
Qwen3.5-35B-A3B	Qwen (community)	MoE (SSM+MoE)	35B / 3B	BF16	`cu130-nightly`	qwen3_coder; no chunked-prefill
Gemma-4-31B-IT-NVFP4	NVIDIA (Gemma)	dense, multimodal	30.7B	NVFP4 (modelopt)	`vllm-openai:gemma4-cu130`	`--quantization modelopt`, TP=1
Gemma-4-26B-A4B-NVFP4	NVIDIA (Gemma)	MoE, multimodal	25.2B / 3.8B	NVFP4 (modelopt)	`vllm-openai:gemma4-cu130`	TP=1 only, gemma4 parsers
gemma-4-12B-coder MTP-NVFP4	community (Gemma)	dense coder + MTP draft	12B	NVFP4 (c-t)	`vllm-openai:nightly`	bundled MTP, `kv-cache fp8`, thinking-on
DiffusionGemma-26B-A4B-NVFP4	NVIDIA (Gemma)	diffusion MoE	25.2B / 3.8B	NVFP4 (modelopt)	`vllm-openai:gemma`	V2 runner, TRITON_ATTN
Nemotron-3-Nano-30B-A3B-NVFP4	NVIDIA	hybrid Mamba-2 + MoE	30B / 3.5B	NVFP4 + fp8 KV	`nvcr.io vllm:25.12.post1`	nano_v3 parser, Spark-tested
Nemotron-3-Super-120B-A12B-NVFP4	NVIDIA	hybrid Mamba-Tx MoE	120B / 12B	NVFP4	`cu130-nightly`	MTP, reasoning nemotron_v3
Mistral-Small-4-119B-NVFP4	Mistral AI	MoE, multimodal	119B / 6.5B	NVFP4 (c-t)	`cu130-nightly`	MLA, mistral parsers, SYSTEM_PROMPT
gpt-oss-120b	OpenAI	MoE	120B	MXFP4	`nvcr.io vllm:26.04`	expert-parallel
Llama-3.3-70B-Instruct-NVFP4	NVIDIA (Meta)	dense	70B	NVFP4	`nvcr.io vllm:26.04`	gated, bandwidth-bound
Phi-4-multimodal-NVFP4	NVIDIA (Microsoft)	dense, multimodal	5.6B	NVFP4	`nvcr.io vllm:26.04`	trust-remote-code

(c-t = compressed-tensors auto-detect; env vars = the four sm_121a exports shown in 7.1.1.)

7.1 Qwen family

7.1.1 Qwen3.6-35B-A3B-NVFP4 — flagship MoE agent (official Spark recipe)

MoE with hybrid attention, 35B total / 3B active, multimodal (text/image/video), 262K context, Apache 2.0, ~19 GB NVFP4. This is NVIDIA’s recommended Spark configuration verbatim — note the four required env vars and --quantization modelopt.

export HF_TOKEN=hf_xxx
docker run -d --name qwen36-35b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_FP8_MOE_BACKEND=flashinfer_cutlass \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e CUTE_DSL_ARCH=sm_121a \
  vllm/vllm-openai:nightly \
  vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 --port 8000 \
    --tensor-parallel-size 1 --trust-remote-code --dtype auto \
    --quantization modelopt --kv-cache-dtype fp8 \
    --attention-backend flashinfer --moe-backend marlin \
    --gpu-memory-utilization 0.85 --max-model-len 65536 \
    --max-num-seqs 4 --max-num-batched-tokens 8192 \
    --enable-chunked-prefill --async-scheduling --enable-prefix-caching \
    --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}' \
    --reasoning-parser qwen3

[!CAUTION] Earlier community recipes floated --gpu-memory-utilization 0.4 and --max-model-len 262144 for this model. NVIDIA’s official card uses 0.85 / 65536 plus the four env vars above — use the official values. For agent/tool use add --enable-auto-tool-choice --tool-call-parser qwen3_coder.

7.1.2 Qwen3.6-27B-NVFP4 — flagship-level coding in a 27B dense model (Unsloth)

Dense 27B with a vision encoder (text/image/video), hybrid Gated-DeltaNet + Gated-Attention layers, MTP-trained, 262K context (extensible to ~1M via YaRN), Apache 2.0, ~18 GB NVFP4 (compressed-tensors → no --quantization flag).

docker run -d --name qwen36-27b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:nightly \
  vllm serve unsloth/Qwen3.6-27B-NVFP4 \
    --trust-remote-code --dtype bfloat16 \
    --gpu-memory-utilization 0.85 --max-model-len 65536 \
    --max-num-seqs 4 --reasoning-parser qwen3

[!NOTE] The card’s default --max-model-len is a conservative 4096 (“increase only after checking memory”). 65536 is comfortable on Spark; keep ≥128K only if you have headroom, since the model leans on long context for thinking. For coding agents add --tool-call-parser qwen3_coder. MTP is supported by the architecture — you can try --speculative-config '{"method":"mtp","num_speculative_tokens":3}' and validate acceptance on your build. Thinking is on by default; pass chat_template_kwargs={"enable_thinking":false} (or {"preserve_thinking":true} for agents) per request.
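Over the REST API, the per-request thinking switch is a top-level `chat_template_kwargs` field in the request body. The helper below is a small sketch for building such a body; `chat_body` is a made-up name, `max_tokens` of 256 is arbitrary, and the function does no JSON escaping, so keep double quotes and backslashes out of the prompt.

```shell
#!/bin/sh
# Build a chat request body with thinking switched on or off.
# Args: served model name, prompt, "true" or "false".
# Limitation: no JSON escaping is done on the prompt.
chat_body() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":256,"chat_template_kwargs":{"enable_thinking":%s}}\n' \
    "$1" "$2" "$3"
}

chat_body unsloth/Qwen3.6-27B-NVFP4 "Reverse a string in Python" false
```

To send it, pipe the output into curl: `chat_body ... | curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @-`. Swapping `false` for `true` restores the default thinking behavior for that one request.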

[!TIP] Beyond 262K (YaRN, up to ~1M). This checkpoint is native to 262,144 tokens; for longer contexts enable static YaRN via an --hf-overrides block and the long-len escape hatch (mind that static YaRN can slightly dent short-context quality — only enable it when you actually need the length):
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  ... vllm serve unsloth/Qwen3.6-27B-NVFP4 --trust-remote-code --dtype bfloat16 \
    --hf-overrides '{"text_config":{"rope_parameters":{"mrope_interleaved":true,"mrope_section":[11,11,10],"rope_type":"yarn","rope_theta":10000000,"partial_rotary_factor":0.25,"factor":4.0,"original_max_position_embeddings":262144}}}' \
    --max-model-len 1010000

7.1.3 Qwen3.5-35B-A3B — coding-agent backend (community)

Popular for pointing local coding agents at Spark. Upstream nightly, BF16.

docker run -d --name qwen35 --ipc=host --restart unless-stopped \
  --gpus all --shm-size 64gb -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:cu130-nightly \
  Qwen/Qwen3.5-35B-A3B \
    --served-model-name qwen3.5-35b --host 0.0.0.0 --port 8000 \
    --dtype bfloat16 --gpu-memory-utilization 0.9 --max-model-len 262144 \
    --enable-prefix-caching --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder --reasoning-parser qwen3

[!WARNING] GB10 gotchas (community-verified): do not add --enable-chunked-prefill (≈9× throughput regression on SSM+MoE), and do not add --kv-cache-dtype fp8 (output-repetition loops on GB10) for this model. This is the opposite of the Qwen3.6-NVFP4 recipe — never copy flags across models. There’s also a DFlash-accelerated dense sibling (Qwen3.5-27B) covered in §13.
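Since these two flags are easy to carry over by accident from another recipe, a small guard in a launch script can catch them. This is only a sketch of the idea: `lint_qwen35_flags` is a hypothetical helper that scans the arguments you intend to pass to vLLM and fails if either flag named in the warning above is present.

```shell
#!/bin/sh
# Return non-zero if the argument list contains a flag that the warning
# above rules out for Qwen3.5-35B-A3B on GB10.
lint_qwen35_flags() {
  _prev=""
  _bad=0
  for _arg in "$@"; do
    case "$_arg" in
      --enable-chunked-prefill)
        echo "remove --enable-chunked-prefill" >&2; _bad=1 ;;
      --kv-cache-dtype=fp8*)
        echo "remove --kv-cache-dtype fp8" >&2; _bad=1 ;;
      fp8*)
        # Only flag fp8 when it is the value of --kv-cache-dtype.
        if [ "$_prev" = "--kv-cache-dtype" ]; then
          echo "remove --kv-cache-dtype fp8" >&2; _bad=1
        fi ;;
    esac
    _prev="$_arg"
  done
  return "$_bad"
}
```

Call it with the same arguments you would give the server, for example `lint_qwen35_flags --dtype bfloat16 --enable-prefix-caching || exit 1`, before the `docker run` line in your script.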

7.2 Gemma family

7.2.1 Gemma-4-31B-IT-NVFP4 — dense multimodal (NVIDIA)

Dense 30.7B, multimodal (text/image/video), 256K context, 140+ languages, hybrid local/global attention with p-RoPE, NVIDIA Open Model License (+ Gemma terms), ~21 GB NVFP4.

export HF_TOKEN=hf_xxx
docker run -d --name gemma4-31b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:gemma4-cu130 \
  vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
    --quantization modelopt --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4

[!NOTE] The card uses --tensor-parallel-size 8 on a server — on Spark use TP=1. Gated: accept the Gemma license + set HF_TOKEN. For tools/reasoning add --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4.

7.2.2 Gemma-4-26B-A4B-NVFP4 — MoE multimodal (NVIDIA)

MoE 25.2B total / 3.8B active (8 of 128 experts +1 shared), multimodal, 256K context, ~14 GB NVFP4 — a small, fast, high-quality Spark pick.

docker run -d --name gemma4-26b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:gemma4-cu130 \
  vllm serve nvidia/Gemma-4-26B-A4B-NVFP4 \
    --quantization modelopt --tensor-parallel-size 1 --moe-backend marlin \
    --trust-remote-code \
    --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 \
    --gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4

[!WARNING] Per the card, this checkpoint currently works with TP=1 only in vLLM (expert-parallel is supported; tensor-parallel is not yet). MoE backend must be VLLM_CUTLASS or Marlin (FlashInfer-TRTLLM is pending a vLLM PR). Needs --trust-remote-code.

7.2.3 gemma-4-12B-coder (MTP-NVFP4) — Python specialist with a bundled MTP draft (community)

A weight-only NVFP4 (W4A16) build of a Gemma-4-12B coding fine-tune: 8.25 GB model + a 0.85 GB bundled MTP draft for ~1.6× single-stream. Auto-detects NVFP4 (no --quantization). Because the draft lives in assistant/, download to a local path and mount it.

# download (~9 GB total) into ~/vllm
hf download sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4 \
  --local-dir ~/vllm/gemma4-coder

# easiest: one GPU, just chat
docker run -d --name gemma4-coder --ipc=host --shm-size 16gb --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -v ~/vllm/gemma4-coder:/model:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-coder \
  --max-model-len 16384 --gpu-memory-utilization 0.92 --trust-remote-code

Add the bundled MTP draft for ~1.6× interactive speed (lossless):

  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","model":"/model/assistant","num_speculative_tokens":3}'

[!IMPORTANT] This model was trained to think first — enable it per request or quality drops: extra_body={"chat_template_kwargs":{"enable_thinking":true}}. Needs a recent nightly (registers Gemma4UnifiedForConditionalGeneration). With MTP you must use --kv-cache-dtype fp8 (NVFP4 KV breaks the draft).

[!CAUTION] The build is de-refused / not safety-aligned — add your own guardrails. It’s a superb algorithm/debug assistant but can write look-ahead bias into pandas/numpy time-series & back-test code (its reasoning sometimes states the right rule while the code does the opposite). Gate quant/accounting code; don’t ship it unreviewed.

7.2.4 DiffusionGemma-26B-A4B-NVFP4 — discrete-diffusion text gen (NVIDIA)

A diffusion LLM on the Gemma-4 26B-A4B MoE backbone (25.2B / 3.8B active) that emits parallel 256-token blocks, exceeding 1,100 tok/s at low batch (H100 FP8). Multimodal, 256K context, ~14 GB NVFP4. Uses the dedicated diffusion image.

docker run -d --name diffgemma --ipc=host --shm-size=16g --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_USE_V2_MODEL_RUNNER=1 \
  vllm/vllm-openai:gemma \
  vllm serve nvidia/diffusiongemma-26B-A4B-it-NVFP4 \
    --trust-remote-code --max-num-seqs 4 \
    --attention-backend TRITON_ATTN \
    --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 \
    --override-generation-config '{"max_new_tokens": null}' \
    --default-chat-template-kwargs '{"enable_thinking":true}'

[!WARNING] The vllm/vllm-openai:gemma image and these flags are tentative until the supporting vLLM image is publicly released (per the card). Check the vLLM releases and the vllm/vllm-openai:gemma4 Docker Hub tags before relying on it.

7.3 NVIDIA models (first-party)

Models NVIDIA itself designed and trained (not quantizations of someone else’s weights). Both are hybrid Mamba-2 + Transformer MoE reasoning models with native function calling: the Nano is the natural single-Spark starting point, the Super is the heavyweight.

7.3.1 NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 — best single-Spark Nemotron (Spark-tested)

A hybrid Mamba-2 + MoE (52 layers: 23 Mamba-2, 23 MoE with 128 experts +1 shared / 6 active, 6 GQA attention), 30B total / 3.5B active, text-only, 1M context (256K default), unified reasoning + non-reasoning, NVIDIA Nemotron Open Model License, ~18 GB on disk. NVFP4 weights with FP8 KV cache; attention and the Mamba layers feeding it stay BF16, and quantization-aware distillation (QAD) recovers accuracy. NVIDIA lists DGX Spark in this model’s tested hardware and ships a Spark/Jetson-specific container.

export HF_TOKEN=hf_xxx
# one-time: fetch the custom reasoning parser into ~/vllm (mounted at /models)
wget -P ~/vllm \
  https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py

docker run -d --name nemotron-nano --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
    --served-model-name nemotron-3-nano --port 8000 \
    --tensor-parallel-size 1 --trust-remote-code \
    --max-model-len 262144 --max-num-seqs 8 \
    --kv-cache-dtype fp8 \
    --enable-auto-tool-choice --tool-call-parser qwen3_coder \
    --reasoning-parser-plugin /models/nano_v3_reasoning_parser.py \
    --reasoning-parser nano_v3

[!IMPORTANT] On Spark (or Jetson Thor) use NVIDIA’s nvcr.io/nvidia/vllm:25.12.post1-py3 container for this model, and fetch the nano_v3_reasoning_parser.py plugin (downloaded into ~/vllm above, referenced at /models/...). --kv-cache-dtype fp8 is part of the official recipe here — unlike Qwen3.5, this model is built for it. --max-num-seqs 8 is NVIDIA’s Spark-tested value; drop to 4 for more KV headroom at long context.

[!NOTE] Reasoning is on by default — pass chat_template_kwargs={"enable_thinking":false} per request to turn it off. The model also supports a reasoning_budget (cap internal reasoning tokens to hit latency targets). Sampling: temperature=1.0, top_p=1.0 for reasoning; 0.6 / 0.95 for tool calling. For up to 1M context add -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 and raise --max-model-len. With only 3.5B active params and ~18 GB resident, decode is fast and KV headroom is large — an excellent default Spark model.

7.3.2 Nemotron-3-Super-120B-A12B-NVFP4 — flagship local MoE

NVIDIA’s own hybrid Mamba-Transformer LatentMoE, 120B total / 12B active, native NVFP4 pretraining, MTP, 1M context. The vLLM team’s reference Spark deployment (~23 tok/s decode; ~10–15 min first load).

docker run -d --name nemotron --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:cu130-nightly \
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --served-model-name nemotron-3-super --trust-remote-code \
    --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4 \
    --reasoning-parser nemotron_v3 \
    --enable-auto-tool-choice --tool-call-parser qwen3_coder

[!NOTE] Two NVIDIA-published NVFP4 checkpoints of other vendors’ models live in §7.4 because they follow those families: Llama-3.3-70B-NVFP4 (Meta) and Phi-4-multimodal-NVFP4 (Microsoft). The nvidia/... Qwen and Gemma NVFP4 checkpoints likewise live in their family groups above. This section is for models NVIDIA itself designed.

7.4 Mistral & other open models

7.4.1 Mistral-Small-4-119B-2603-NVFP4 — unified instruct + reasoning + coding (Mistral AI)

A granular MoE (128 experts, 4 active; 119B total / 6.5B active) fusing Instruct, Reasoning (Magistral), and Devstral skills, multimodal, 256K context, Apache 2.0, ~60 GB NVFP4 (fits one Spark). Compressed-tensors → no --quantization; uses MLA attention and Mistral parsers.

docker run -d --name mistral-small4 --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:cu130-nightly \
  vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
    --tensor-parallel-size 1 --max-model-len 65536 \
    --attention-backend TRITON_MLA \
    --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral \
    --max-num-batched-tokens 16384 --max-num-seqs 4 \
    --gpu-memory-utilization 0.85

[!IMPORTANT] Needs mistral_common >= 1.11.0 (bundled in recent vLLM; uv pip install -U vllm pulls it). For correct behavior, load the repo’s SYSTEM_PROMPT.txt and set reasoning_effort per request ("none" for fast replies, "high" for hard tasks; temp 0.7 with reasoning). The card’s example uses TP=2 + --max-num-seqs 128 for a multi-GPU server — on a single Spark use TP=1, --max-num-seqs 4, --max-model-len 65536.

7.4.2 gpt-oss-120b — open-weights MXFP4 MoE (OpenAI)

~65 GB native MXFP4 — fits one Spark with KV-cache room. Ungated. The 20B sibling (openai/gpt-oss-20b) is a great first smoke-test.

docker run -d --name gptoss --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve openai/gpt-oss-120b \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4 \
    --enable-expert-parallel

[!NOTE] gpt-oss models use OpenAI’s Harmony response format and expose a reasoning_effort control ("low"/"medium"/"high") passed in the request body — there’s no separate --reasoning-parser flag. For tool calling add --enable-auto-tool-choice --tool-call-parser openai. Use a recent vLLM/NGC image (older builds predate Harmony parsing).

7.4.3 Llama-3.3-70B-Instruct-NVFP4 — gated dense (Meta, NVIDIA quant)

docker run -d --name llama70 --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve nvidia/Llama-3.3-70B-Instruct-NVFP4 \
    --quantization modelopt --max-model-len 131072 \
    --gpu-memory-utilization 0.85 --max-num-seqs 4

Dense 70B → memory-bandwidth-limited decode (slower than a similar-size MoE), but high single-user quality. Needs an accepted Llama license + HF_TOKEN.

7.4.4 Phi-4-multimodal-instruct-NVFP4 — omnimodal text + image + audio (Microsoft, NVIDIA quant)

Microsoft’s 5.6B omnimodal Phi-4 — text, image, and speech/audio — 128K context, Phi4MMForCausalLM with a custom processor. NVIDIA NVFP4 ModelOpt quant; small and cheap on Spark.

[!CAUTION] Known container gotcha: Phi-4-mm’s custom processor imports scipy, which is not installed in the NGC vLLM images — serving fails with ImportError: ... scipy. Install it before vllm serve (or bake it into a derived image). Tracked in NVIDIA/dgx-spark-playbooks issue #69.

docker run -d --name phi4mm --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  bash -lc "pip install --no-cache-dir scipy && \
    vllm serve nvidia/Phi-4-multimodal-instruct-NVFP4 \
      --quantization modelopt --trust-remote-code \
      --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4"

[!NOTE] Needs --trust-remote-code (custom Phi4MM processor). Send images via OpenAI image_url blocks and audio via the audio input field (any soundfile-readable format). The NVFP4 checkpoint was originally published for TensorRT-LLM but runs under vLLM with the two requirements above.
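If you would rather bake scipy into a derived image than install it on every start, a two-line Dockerfile is enough. The sketch below writes one from shell; the helper name `write_phi4mm_dockerfile`, the build-context directory, and the resulting image tag are all arbitrary examples.

```shell
#!/bin/sh
# Write a minimal Dockerfile that adds scipy on top of the NGC vLLM image.
# Arg: build-context directory (created if missing).
write_phi4mm_dockerfile() {
  mkdir -p "$1"
  cat > "$1/Dockerfile" <<'EOF'
FROM nvcr.io/nvidia/vllm:26.04-py3
RUN pip install --no-cache-dir scipy
EOF
}

# Example use (directory and tag names are placeholders):
#   write_phi4mm_dockerfile ./phi4mm-image
#   docker build -t vllm-phi4mm:26.04 ./phi4mm-image
```

With the derived image built, the recipe above can use it in place of the NGC tag and call `vllm serve` directly, without the `bash -lc "pip install ..."` wrapper.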

8. Flag reference & tuning

Flag	On Spark	Guidance
`--gpu-memory-utilization`	Fraction of the shared 128 GB	Start from the model recipe (0.4–0.92 seen). Leave headroom; lower if OOM.
`--max-num-seqs`	Concurrent sequences	Keep low (1–4 for most recipes; the Nemotron-3-Nano recipe uses NVIDIA’s tested value of 8). Beyond that the bandwidth tax tends to outweigh batching gains.
`--max-model-len`	Prompt + completion cap	65536 is a sane Spark default; raise toward model max only with headroom.
prefix caching	KV reuse across shared prefixes	On by default in V1; `--enable-prefix-caching` is redundant but harmless.
`--quantization modelopt`	ModelOpt NVFP4 only	Pass for `nvidia/...` ModelOpt checkpoints; omit for compressed-tensors (auto-detected).
`--reasoning-parser` / `--tool-call-parser` + `--enable-auto-tool-choice`	Structured reasoning/tools	Follow the model recipe (qwen3 / qwen3_coder / gemma4 / mistral / nemotron_v3 / nano_v3 / openai).
`--kv-cache-dtype fp8`	Shrinks KV cache	Model-specific: the Qwen3.6-35B-A3B-NVFP4, Nemotron-3-Nano, and gemma4-coder-MTP recipes use it; Qwen3.5 and DFlash do not.
`--speculative-config '{"method":"mtp"...}'`	Built-in speculative decode	Latency lever for MTP models (Qwen3.x, Nemotron, gemma4-coder). See §13.
`--moe-backend` / `--attention-backend`	Kernel pins	Leave auto unless a tested recipe pins one (marlin / cutlass / flashinfer / TRITON_*).
`--enable-expert-parallel`	MoE routing	Enable for MoE (gpt-oss, Nemotron, Mistral, Gemma-4-26B).
CUDA graphs	Per-step overhead	Keep enabled.
`--load-format fastsafetensors`	Faster weight load	Evaluate if the 10–15 min load matters.

[!TIP] Stability ↔ throughput slider. A plain run leaves KV-cache dtype unset, spec-decode off, CUDA graphs on, backends auto. A tuned run layers in FP8 KV cache, async scheduling, MTP/DFlash, and pinned FP4 backends — each validated against the exact model + prompt shape + batch pattern + vLLM release. Want ~2–3× faster interactive decode? See §13 (MTP & DFlash).

9. Pre-warm the JIT

The first request after boot triggers Inductor + FlashInfer codegen (~25–60 s). Fire a tiny warmup on the same path as real traffic; after that, short prompts return in <0.5 s:

curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"<served-model-name>","messages":[{"role":"user","content":"ping"}],"max_tokens":3}'

This is separate from weight load (10–15 min for a 120B) — address that with fastsafetensors/InstantTensor if needed.
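To see the warmup effect for yourself, time the same tiny request twice. The helpers below are a rough sketch: `elapsed_ms` and `ping_vllm` are hypothetical names, the timing relies on GNU `date` supporting `%N` (true on Spark's Linux host, not on macOS), and the measurement includes curl's own startup cost.

```shell
#!/bin/sh
# Print how long a command takes, in milliseconds.
# Assumption: GNU date with nanosecond support (%N).
elapsed_ms() {
  _start=$(date +%s%N)
  "$@" >/dev/null 2>&1
  _end=$(date +%s%N)
  echo $(( (_end - _start) / 1000000 ))
}

# Send one tiny chat request.
# Args: served model name, base URL (default http://localhost:8000).
ping_vllm() {
  curl -sS --max-time 300 "${2:-http://localhost:8000}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"$1\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}],\"max_tokens\":3}"
}

# Example: compare the first (cold) and second (warm) request.
#   echo "cold: $(elapsed_ms ping_vllm <served-model-name>) ms"
#   echo "warm: $(elapsed_ms ping_vllm <served-model-name>) ms"
```

If the second number is not dramatically smaller than the first, the server was probably already warm, or the request took a different code path than your real traffic does.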

10. Verify & monitor

curl -sS http://localhost:8000/health
curl -sS http://localhost:8000/v1/models | jq
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"<served-model-name>","messages":[{"role":"user","content":"Explain quantum computing briefly."}],"max_tokens":200,"stream":true}'

Prometheus telemetry is at /metrics (no extra service). Watch KV-cache utilization (vllm:kv_cache_usage_perc) and TTFT / inter-token-latency histograms. Healthy single-user behavior: prefix-cached later turns don’t spike, decode rate stays steady, KV usage stays well below the context limit.

[!TIP] Confirm the fast paths are live. Check the startup log to verify NVFP4 GEMM kernels actually engaged (you want a line like Using NvFp4LinearBackend.VLLM_CUTLASS for NVFP4 GEMM) — if it silently fell back, you lose the FP4 speed/memory win. For speculative decoding, confirm acceptance via curl -s localhost:8000/metrics | grep -i spec_decode (see §13.5).
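For a quick look at KV-cache pressure without a Prometheus server, you can filter the `/metrics` text directly. The sketch below assumes the gauge is exported as a 0–1 fraction under the name given above; the function name `kv_usage_pct` and the sample label values in the comment are illustrative.

```shell
#!/bin/sh
# Read Prometheus text from stdin and print KV-cache usage as a percentage,
# one line per matching series.
# Assumption: the gauge value is a fraction between 0 and 1.
# Example input line (labels are illustrative):
#   vllm:kv_cache_usage_perc{engine="0",model_name="demo"} 0.25
kv_usage_pct() {
  awk '/^vllm:kv_cache_usage_perc/ { printf "%.1f\n", $NF * 100 }'
}
```

Use it as `curl -s localhost:8000/metrics | kv_usage_pct`; wrapping that in `watch -n 5` gives a crude live view while you run a long-context request.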

11. Troubleshooting

Symptom	Cause	Fix
`permission denied ... docker.sock`	Not in `docker` group	`sudo usermod -aG docker $USER && newgrp docker`
OOM with free RAM showing	Page cache holds unified memory	`sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'` before launch
OOM during load/serve	Util too high / context too long	Lower `--gpu-memory-utilization`, reduce `--max-model-len`, or pick a smaller/more-quantized model
401 / gated download fails	Missing token / license	`export HF_TOKEN=...`, accept the license on the HF page
`model type ... not recognized`	Container too old for the arch	Newer NGC tag or `vllm-openai:nightly`; check `vllm --version`
`unknown quantization` / weights mis-load	Wrong quant flag	ModelOpt → `--quantization modelopt`; compressed-tensors → omit it
Stable image won’t run on GB10	No `sm_121` support	Use `cu130-nightly` / NGC image
Output repetition loops	`--kv-cache-dtype fp8` on a model that dislikes it	Remove it (model-specific)
~9× slower MoE	`--enable-chunked-prefill` on SSM+MoE	Remove it
First request slow, rest fine	JIT warmup	Pre-warm (§9)
`rmi` won’t delete image	Tag omitted	`docker stop <name> && docker rm <name>` then `docker rmi <repo>:<tag>`

12. Scaling to multiple Sparks (brief)

A single Spark handles models with up to ~100 GB of weights (gpt-oss-120b MXFP4, Mistral-Small-4 NVFP4, Llama-70B NVFP4). For larger models or higher TP, link Sparks over the 200 Gb/s QSFP (ConnectX-7) ports and run a Ray cluster with --tensor-parallel-size set to the total number of GPUs (one per Spark):

2 Sparks: direct QSFP cable. 3+: through a switch.
Bind NCCL to the QSFP interface (NCCL_SOCKET_IFNAME=enP2p1s0f1np1); an Ethernet fallback costs 10–20× throughput.
Mount the same ~/vllm on every node and stage weights on each.
Easiest: the mark-ramsey-ri/vllm-dgx-spark scripts, or NVIDIA’s spark_cluster_setup.sh + the official multi-Spark playbook.
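Before starting the cluster, it is worth confirming on each node that the QSFP interface exists under the expected name, since a typo here may lead NCCL to pick a different network. A minimal sketch, assuming a Linux host with `/sys/class/net`; `iface_exists` and the `QSFP_IF` variable are hypothetical names, and the interface name is the one quoted in the list above, which may differ on your unit (check `ip -br link`).

```shell
#!/bin/sh
# True when a network interface with this name exists on the host.
iface_exists() {
  [ -d "/sys/class/net/$1" ]
}

# Interface name from the list above; override QSFP_IF if yours differs.
QSFP_IF="${QSFP_IF:-enP2p1s0f1np1}"

if iface_exists "$QSFP_IF"; then
  export NCCL_SOCKET_IFNAME="$QSFP_IF"
  echo "NCCL will use $QSFP_IF"
else
  echo "interface $QSFP_IF not found; fix the name before launching" >&2
fi
```

Run it on every node, and pass the same variable into the container with `-e NCCL_SOCKET_IFNAME` so the processes inside see it.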

13. Speculative decoding on Spark: MTP & DFlash (step-by-step)

A small, cheap drafter proposes the next few tokens; the big model verifies them in one pass, so every accepted token costs almost nothing extra. It’s a latency win that shines at low concurrency, exactly Spark’s profile, and can speed up interactive decode by roughly 2–3× with no quality loss. Two methods matter: MTP (built into the model) and DFlash (a separate diffusion drafter).
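How much a drafter helps depends on how often its guesses are accepted. A standard simplification, which is an assumption here and not something measured on Spark, treats each draft token as accepted independently with probability a; with k draft tokens, one verification pass then yields (1 - a^(k+1)) / (1 - a) tokens on average. The sketch below evaluates that formula; `expected_tokens_per_step` is a made-up name.

```shell
#!/bin/sh
# Expected tokens emitted per verification pass with k draft tokens.
# Assumption: each draft token is accepted independently with probability a.
#   E = (1 - a^(k+1)) / (1 - a), and E = k + 1 when a = 1.
# Args: acceptance probability a (0..1), number of draft tokens k.
expected_tokens_per_step() {
  awk -v a="$1" -v k="$2" 'BEGIN {
    if (a >= 1) { printf "%.2f\n", k + 1; exit }
    printf "%.2f\n", (1 - a ^ (k + 1)) / (1 - a)
  }'
}

expected_tokens_per_step 0.7 3
expected_tokens_per_step 0.5 1
```

The two calls print 2.53 and 1.50. This is a ceiling on the speedup, not the speedup itself: the drafter's own cost and the extra verification work both eat into it, which is why measured acceptance on your build matters more than the formula.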

13.1 Do they need special models, Docker, or configs?

	MTP	DFlash
Special model?	Target has built-in MTP modules (no download) or ships a small paired draft model (e.g. gemma4-coder’s bundled `assistant/`).	Yes — a matching DFlash drafter checkpoint trained for your target.
Special Docker?	No — standard Spark images.	Usually — vLLM ≥ 0.21.0 + `sm_121` build. NGC `26.04` (vLLM 0.19.0) is too old; use a prebuilt DFlash image or nightly.
Special config?	`--speculative-config '{"method":"mtp","num_speculative_tokens":1}'`	`--speculative-config '{"method":"dflash","model":"<drafter>","num_speculative_tokens":N}'` + KV cache BF16 (no fp8).
Models	DeepSeek V3/R1/V4, Qwen3.5/3.6, GLM-5.x, Gemma 4 (+gemma4-coder), Nemotron-3-Super, Mistral	Gemma-4-31B, Laguna XS.2, Qwen3.5-27B (or train your own)
Speedup	~1.6–1.8×	up to ~6× over plain decoding and ~2.5× over EAGLE-3 (paper figures); ~2.2–2.7× on Spark

[!NOTE] Both consume KV cache for speculative tokens, so they trade peak throughput for latency. Keep them on for interactive use; for bursty/concurrent serving add --speculative-disable-by-batch-size 32.

Which models in this guide can use which method:

Model (this guide)	MTP	DFlash	How
Qwen3.6-35B-A3B-NVFP4 (§7.1.1)	✅ built-in	—	already in its recipe (`num_speculative_tokens:3`)
Qwen3.6-27B-NVFP4 (§7.1.2)	✅ built-in (validate)	—	add `--speculative-config '{"method":"mtp",...}'`
Qwen3.5-35B-A3B (§7.1.3)	✅ built-in	via 27B sibling	dense Qwen3.5-27B has a DFlash drafter
gemma-4-12B-coder (§7.2.3)	✅ paired draft	—	bundled `/model/assistant` + `kv fp8`
Gemma-4-31B-IT-NVFP4 (§7.2.1)	—	✅	`RedHatAI/...speculator.dflash` (Z-Lab image)
Nemotron-3-Super (§7.3.2)	✅ built-in	—	`{"method":"mtp","num_speculative_tokens":1}`
Nemotron-3-Nano (§7.3.1)	—	—	card ships no speculator; run dense
Mistral-Small-4 (§7.4.1)	✅ built-in (validate)	—	add the mtp flag and check acceptance

(A dash means no published speculator for that exact checkpoint today; “validate” means the architecture supports it but you should confirm acceptance on your build.)

13.2 MTP — zero-download (≈5 min)

[!NOTE] Two flavors of MTP. Most MTP models (Qwen3.6, Nemotron-Super, DeepSeek-style) have the prediction heads baked into the checkpoint, so there is nothing extra to download and the flag is all you need. A few ship a small paired draft model instead: gemma4-coder bundles a 0.4 B draft in assistant/, so you point "model" at it (/model/assistant); see §7.2.3. Both use "method":"mtp"; only the second needs the "model" field.

Just a flag on an MTP-capable model (the §7.1.1 Qwen3.6 and §7.2.3 gemma4-coder recipes already use it):

```
docker run -d --name mtp --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --host 0.0.0.0 --port 8000 --trust-remote-code \
    --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4 \
    --reasoning-parser nemotron_v3 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```

num_speculative_tokens is usually 1 (some recipes use 3). For lowest latency at low concurrency, the vLLM recipe suggests disabling prefix caching. MTP reduces throughput under load — pair with --speculative-disable-by-batch-size 32.
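
Since the two MTP flavors differ only in the "model" field, a small helper can keep the JSON consistent across launch scripts. This is a minimal sketch: `mtp_spec_config` is a hypothetical name, and it assumes the draft path contains no quotes or backslashes that would need JSON escaping.

```shell
# mtp_spec_config [K] [DRAFT_PATH]: print a --speculative-config value for MTP.
# With no DRAFT_PATH it emits the built-in-heads form; with one it adds "model".
mtp_spec_config() {
  local k="${1:-1}" draft="${2:-}"
  if [ -n "$draft" ]; then
    printf '{"method":"mtp","model":"%s","num_speculative_tokens":%d}\n' "$draft" "$k"
  else
    printf '{"method":"mtp","num_speculative_tokens":%d}\n' "$k"
  fi
}

mtp_spec_config 1                     # built-in heads
mtp_spec_config 1 /model/assistant    # paired draft model
```

In a launch command it would be used as `--speculative-config "$(mtp_spec_config 1)"`, with the quotes kept so the JSON stays a single argument.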

13.3 DFlash on Spark — plug-and-play container (easiest)

The ghcr.io/aeon-7/vllm-dflash image is a prebuilt sm_121 vLLM with DFlash baked in; it serves a 27B dense Qwen3.5 with a 2B block-diffusion drafter, taking decode from ~12 → ~33 tok/s.

```
# 1) download the target into ~/vllm
pip install -U "huggingface_hub[cli]"; export HF_TOKEN=hf_xxx
hf download AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --local-dir ~/vllm/DFlash-Qwen3.5-27B-Uncensored-NVFP4

# 2) make a persistent API key, then launch (drafter auto-downloads)
export VLLM_API_KEY=$(openssl rand -hex 32); echo "API key: $VLLM_API_KEY"
docker run -d --name vllm-dflash --runtime nvidia --network host --ipc host \
  --restart unless-stopped \
  -v ~/vllm/DFlash-Qwen3.5-27B-Uncensored-NVFP4:/models/target \
  -v ~/vllm/drafter-cache:/models/drafter-cache \
  -e MODEL_PATH=/models/target \
  -e SERVED_MODEL_NAME=dflash-qwen3.5-27b \
  -e DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash \
  -e DFLASH_NUM_SPEC_TOKENS=15 \
  -e MAX_MODEL_LEN=65536 -e MAX_NUM_SEQS=4 \
  -e GPU_MEMORY_UTILIZATION=0.85 -e MAX_NUM_BATCHED_TOKENS=65536 \
  -e VLLM_API_KEY="$VLLM_API_KEY" -e HF_TOKEN="$HF_TOKEN" \
  ghcr.io/aeon-7/vllm-dflash:latest
docker logs -f vllm-dflash    # ~5 min

# 3) test (note the Bearer token)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{"model":"dflash-qwen3.5-27b","messages":[{"role":"user","content":"Write a haiku about GPUs."}],"max_tokens":200}'
```

Key env vars: DFLASH_DRAFTER (HF id of the drafter; empty = plain vLLM), DFLASH_NUM_SPEC_TOKENS (15 is best single-stream, 5 for high concurrency), KV_CACHE_DTYPE (leave at BF16 with DFlash). Spark presets, written as MAX_MODEL_LEN / DFLASH_NUM_SPEC_TOKENS / MAX_NUM_SEQS: the default 65536 / 15 / 4 gives ≈ 33 tok/s single-stream; the high-concurrency 32768 / 5 / 8 gives ≈ 85–92 tok/s in aggregate.
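
The two presets can be captured in a small switch so the launch command does not have to be edited by hand. This is a sketch only: the function name `dflash_preset` and the preset labels `single` and `concurrent` are hypothetical, while the values are the ones quoted above.

```shell
# dflash_preset [single|concurrent]: print the -e flags for a Spark preset.
dflash_preset() {
  case "${1:-single}" in
    single)
      echo "-e MAX_MODEL_LEN=65536 -e DFLASH_NUM_SPEC_TOKENS=15 -e MAX_NUM_SEQS=4" ;;
    concurrent)
      echo "-e MAX_MODEL_LEN=32768 -e DFLASH_NUM_SPEC_TOKENS=5 -e MAX_NUM_SEQS=8" ;;
    *)
      echo "unknown preset: $1 (use single or concurrent)" >&2
      return 1 ;;
  esac
}

dflash_preset concurrent
# Usage inside the launch command (left unquoted on purpose so the flags split):
#   docker run -d --name vllm-dflash ... $(dflash_preset concurrent) ... ghcr.io/aeon-7/vllm-dflash:latest
```

The sketch covers only the three values the presets name; MAX_NUM_BATCHED_TOKENS and the other variables stay as in the launch command.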

13.4 DFlash the general way (other models)

DFlash uses vLLM’s speculators format; pass the drafter in --speculative-config. Available drafters today:

Target (verifier)	DFlash drafter	Source
`google/gemma-4-31B-it`	`RedHatAI/gemma-4-31B-it-speculator.dflash`	RedHatAI / vLLM
`poolside/Laguna-XS.2`	`poolside/Laguna-XS.2-speculator.dflash`	poolside
`Qwen/Qwen3.5-27B`	`z-lab/Qwen3.5-27B-DFlash`	Z Lab

```
# confirm >= 0.21.0 (the vllm-openai image already uses vLLM as its entrypoint,
# so override the entrypoint instead of passing "vllm" as an argument)
docker run --rm --entrypoint vllm vllm/vllm-openai:nightly --version

# Laguna XS.2 (coding; up to 7 tokens/step, ~70% acceptance on code)
docker run -d --name laguna --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -e VLLM_USE_DEEP_GEMM=0 -v ~/vllm:/models \
  --entrypoint vllm \
  vllm/vllm-openai:nightly \
  serve poolside/Laguna-XS.2 --trust-remote-code \
    --enable-auto-tool-choice --tool-call-parser poolside_v1 --reasoning-parser poolside_v1 \
    --speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
```

[!NOTE] DFlash needs vLLM ≥ 0.21.0; the stock NGC 26.04 (0.19.0) won’t do it. Gemma-4 DFlash currently needs Z Lab’s build (ghcr.io/z-lab/vllm-openai:gemma4-dflash-cu130). No drafter for your model? Train one — the drafter is a small Qwen3-style stack and the speculators project ships a “Train DFlash” tutorial.
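
The version floor is easy to check in a script before launching. The sketch below is one way to do it, assuming GNU `sort -V` is available (it is on Ubuntu-based hosts) and that you pass the bare version number rather than a line with other text around it. `version_ge` is a hypothetical helper name.

```shell
# version_ge A B: succeed when version A is at least version B.
# Pre-release/dev suffixes (e.g. "0.21.1.dev5+g123") are cut at the first
# character that is not a digit or dot, so only the numeric part is compared.
version_ge() {
  local a b
  a=${1%%[!0-9.]*}; a=${a%.}
  b=${2%%[!0-9.]*}; b=${b%.}
  [ "$(printf '%s\n%s\n' "$b" "$a" | sort -V | head -n1)" = "$b" ]
}

version_ge 0.19.0 0.21.0 || echo "0.19.0 is too old for DFlash"
version_ge 0.21.0 0.21.0 && echo "0.21.0 meets the DFlash floor"
```

Because suffixes are dropped, a dev build of 0.21.0 is treated the same as the 0.21.0 release, which is usually what you want for a nightly image.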

13.5 Verify it’s actually accelerating (don’t fly blind)

Speculative decoding only helps if the drafter’s tokens are accepted. Confirm it, then tune:

Check acceptance. vLLM logs a spec-decode summary (draft acceptance rate and mean accepted length per step) and exposes it on /metrics:
```
curl -s http://localhost:8000/metrics | grep -i spec_decode
```
Healthy MTP runs accept most proposed tokens (mean accepted length well above 1); DFlash on code can accept whole blocks. If acceptance is near zero, the method isn’t helping — the drafter/target are mismatched or the content is low-predictability (random/adversarial text).
Tune num_speculative_tokens (K). Raise K for DFlash (the drafter produces a whole block in one parallel pass, so 7–15 is normal); keep K low for MTP (1–3), since each extra token costs another sequential draft step and acceptance falls off with depth. Watch decode tok/s as you change it: past the acceptance sweet spot, throughput drops.
A/B against a dense run. Time the same prompt with the spec flag removed, using greedy sampling (temperature 0) so the two outputs are comparable. If tok/s isn’t clearly higher with speculation on, drop it for that workload.
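Concurrency check. Both methods tax throughput under load; if you serve bursts, set --speculative-disable-by-batch-size 32 so vLLM auto-disables speculation when batches grow. If your build rejects that flag, look for an equivalent disable_by_batch_size key inside --speculative-config (an assumption to confirm with vllm serve --help on your image).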
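
To turn the /metrics counters into a single number, a short awk filter is enough. This sketch assumes the counters are named like `vllm:spec_decode_num_accepted_tokens_total` and `vllm:spec_decode_num_draft_tokens_total`; those names are an assumption, so compare them with what the grep above prints on your build and adjust the patterns if they differ. `spec_decode_acceptance` is a hypothetical helper name.

```shell
# spec_decode_acceptance: read Prometheus text on stdin and print the overall
# draft acceptance rate (accepted tokens / drafted tokens), summed over series.
spec_decode_acceptance() {
  awk '
    /^#/ { next }                                                       # skip HELP/TYPE lines
    $1 ~ /spec_decode_num_accepted_tokens(_total)?([{]|$)/ { acc += $NF }
    $1 ~ /spec_decode_num_draft_tokens(_total)?([{]|$)/    { dr  += $NF }
    END {
      if (dr > 0) printf "acceptance_rate=%.3f\n", acc / dr
      else        print  "acceptance_rate=n/a"
    }'
}

# Offline demo with made-up counter values (not real measurements):
spec_decode_acceptance <<'EOF'
# TYPE vllm:spec_decode_num_draft_tokens_total counter
vllm:spec_decode_num_draft_tokens_total{engine="0"} 120.0
vllm:spec_decode_num_accepted_tokens_total{engine="0"} 90.0
EOF
# Against a live server:
#   curl -s http://localhost:8000/metrics | spec_decode_acceptance
```

The counters are cumulative since server start, so for a per-workload figure take two readings and difference them rather than reading one snapshot.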

13.6 The gotchas that bite people

[!CAUTION]
DFlash floor: vLLM ≥ 0.21.0 (AEON-7/Z-Lab image or nightly). MTP is fine on 0.19.0+.
DFlash KV cache stays BF16 — never combine with --kv-cache-dtype fp8. (Note: the MTP gemma4-coder draft is the opposite — it needs fp8 KV.)
One drafter per target — a DFlash drafter is trained for a specific model.
Higher K is cheap with DFlash (whole block in one pass), not with MTP — keep MTP num_speculative_tokens low.
Both hurt throughput under load — add --speculative-disable-by-batch-size 32.
Acceptance is content-dependent — big on code/structured output, smaller on prose, none on random text.
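
The DFlash-plus-fp8 mistake is easy to catch mechanically before a container starts. The sketch below is a hypothetical pre-flight check (`lint_spec_flags` is not part of vLLM). It only does substring matching on the arguments, so it assumes the JSON is written without spaces, as in the recipes above.

```shell
# lint_spec_flags ARGS...: return 1 if the arguments combine a DFlash
# speculative config with an fp8 KV cache, which the guide says never to do.
lint_spec_flags() {
  local cmd="$*" rc=0
  case "$cmd" in
    *'"method":"dflash"'*)
      case "$cmd" in
        *"--kv-cache-dtype fp8"*|*"--kv-cache-dtype=fp8"*)
          echo "DFlash needs a BF16 KV cache: remove --kv-cache-dtype fp8" >&2
          rc=1 ;;
      esac ;;
  esac
  return "$rc"
}

lint_spec_flags vllm serve some/model --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
  && echo "MTP with fp8 KV passes this check"
```

It deliberately leaves MTP with fp8 alone, since the gemma4-coder draft needs exactly that combination.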

14. Sources

NVIDIA DGX Spark vLLM playbook — https://build.nvidia.com/spark/vllm/instructions
NVIDIA dgx-spark-playbooks — https://github.com/NVIDIA/dgx-spark-playbooks
vLLM team blog “vLLM on the DGX Spark” (Jun 2026) — https://vllm.ai/blog/2026-06-01-vllm-dgx-spark
vLLM speculative decoding / MTP / DFlash docs — https://docs.vllm.ai/en/latest/features/speculative_decoding/ · https://docs.vllm.ai/projects/speculators/en/latest/user_guide/algorithms/dflash/
DFlash paper (Z Lab, arXiv 2602.06036) — https://arxiv.org/abs/2602.06036 · AEON-7/vllm-dflash — https://github.com/AEON-7/vllm-dflash
Model cards: nvidia/Qwen3.6-35B-A3B-NVFP4, unsloth/Qwen3.6-27B-NVFP4, nvidia/Gemma-4-31B-IT-NVFP4, nvidia/Gemma-4-26B-A4B-NVFP4, sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4, nvidia/diffusiongemma-26B-A4B-it-NVFP4, mistralai/Mistral-Small-4-119B-2603-NVFP4, nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4, nvidia/Phi-4-multimodal-instruct-NVFP4 (all on https://huggingface.co)
Phi-4-mm scipy container issue — https://github.com/NVIDIA/dgx-spark-playbooks/issues/69
NGC vLLM container tags — https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm

Compiled June 2026. Benchmark numbers are reproduced from each model’s Hugging Face card (NVFP4-vs-baseline tables, or vendor-reported gains). Container tags and model handles move fast — pin a known-good image digest for anything you depend on.
