[!IMPORTANT] Where models live. Every command in this guide stores model weights in ~/vllm on the host and mounts it into the container at /models (via HF_HOME=/models). Download once, mount everywhere. A couple of community models that bundle a speculative-decode draft use a per-model local path under ~/vllm/<name> instead — those are flagged inline.

A practical, end-to-end guide for serving LLMs with vLLM on a DGX Spark (GB10 Grace Blackwell), assembled from NVIDIA’s official playbook, the vLLM team’s deep-dive, model cards, and battle-tested community setups.

[!WARNING] Flag choices on Spark are model- and image-specific, not hardware-wide defaults. The recipes below are starting points that worked for their authors against a specific container tag. Validate against the exact image you run, and pin a known-good tag/digest for anything you depend on. Copying a flag from one model’s recipe to another can silently regress throughput or output quality.


1. The landscape: current playbooks & walkthroughs

| Source | Best for | Link |
| --- | --- | --- |
| NVIDIA official playbook | Canonical install steps; single / stacked / switched / troubleshooting tabs | https://build.nvidia.com/spark/vllm/instructions |
| NVIDIA dgx-spark-playbooks | Source-of-truth repo; benchmarking + cluster bootstrap | https://github.com/NVIDIA/dgx-spark-playbooks |
| DeepWiki: vLLM on Spark | Clean model support matrix + serving params | https://deepwiki.com/NVIDIA/dgx-spark-playbooks/4.2-vllm |
| vLLM team blog (Jun 2026), authoritative config deep-dive | Why each flag matters, unified-memory behavior, JIT pre-warm | https://vllm.ai/blog/2026-06-01-vllm-dgx-spark |
| mark-ramsey-ri/vllm-dgx-spark | 1-to-N Spark orchestration, 41 model presets | https://github.com/mark-ramsey-ri/vllm-dgx-spark |
| AEON-7/vllm-dflash | Plug-and-play DFlash container for Spark | https://github.com/AEON-7/vllm-dflash |
| vLLM Recipes index | Per-model launch commands, kept current | https://recipes.vllm.ai/ |
| NGC vLLM container tags | Latest nvcr.io/nvidia/vllm build | https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm |

2. Hardware facts that drive every config

DGX Spark is a GB10 Grace Blackwell SoC with 128 GB unified memory shared by CPU and GPU, on a consumer-Blackwell sm_121 GPU, ~273 GB/s bandwidth, and an ARM64 (aarch64) host. Four consequences shape everything below:

  • One memory pool for everything. Weights, KV cache, the OS, page cache, and the runtime all draw from the same 128 GB. --gpu-memory-utilization is a fraction of that shared pool — leave headroom or you OOM with “free” RAM still showing.
  • Memory-bandwidth bound for decode. Spark shines at small-batch, single-user interactive serving. Keep concurrency low (--max-num-seqs ≈ 1–4).
  • NVFP4 MoE is the sweet spot. Quantized MoE with ~3–13 B active params (even at 25–120 B total) is the best fit: small active set = fast decode, and 4-bit weights free unified memory for KV cache. Prefer NVFP4/FP8/MXFP4 over BF16.
  • Use sm_121-validated builds. The upstream stable vLLM image does not support GB10. Use the CUDA-13 nightly track (vllm/vllm-openai:cu130-nightly / :nightly) or NVIDIA’s NGC container (nvcr.io/nvidia/vllm:26.02-py3 or newer).
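
The shared-pool point is easier to reason about with numbers. The helper below is a rough sketch, not vLLM's exact accounting: the function name `spark_mem_budget` is hypothetical, and it assumes that everything vLLM reserves beyond the weights goes to KV cache, activations, and CUDA graphs.

```shell
# Rough unified-memory budget for one Spark (a simplification, not vLLM's exact accounting).
# Usage: spark_mem_budget <gpu-memory-utilization> <weights_gb> [total_gb]
spark_mem_budget() {
  awk -v util="$1" -v weights="$2" -v total="${3:-128}" 'BEGIN {
    reserved = total * util          # what vLLM may claim from the shared pool
    rest     = reserved - weights    # left for KV cache, activations, CUDA graphs
    host     = total - reserved      # left for the OS, page cache, everything else
    printf "reserved_gb=%.1f kv_and_runtime_gb=%.1f host_headroom_gb=%.1f\n", reserved, rest, host
  }'
}
```

For a ~19 GB NVFP4 checkpoint at 0.85, `spark_mem_budget 0.85 19` prints `reserved_gb=108.8 kv_and_runtime_gb=89.8 host_headroom_gb=19.2`, which shows how little is left for the host once the fraction creeps toward 0.9.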

3. Storing all models in ~/vllm

vLLM fetches weights through the Hugging Face stack, which caches under the directory named by HF_HOME. Point that at a host directory and mount it, and every model lives in ~/vllm, shared across containers.

mkdir -p ~/vllm

Pre-stage on the host (recommended) — files end up owned by you, the download happens once, and the first vllm serve doesn’t stall on a multi-GB pull:

pip install -U "huggingface_hub[cli]"        # one-time
export HF_HOME=~/vllm
export HF_TOKEN=hf_xxx                          # for gated models; accept the license on the HF page first
hf download nvidia/Qwen3.6-35B-A3B-NVFP4

The mount, in every docker run:

  -e HF_HOME=/models \
  -v ~/vllm:/models \

Inside the container HF now caches at /models (= ~/vllm on host); pre-staged weights at ~/vllm/hub/... appear at /models/hub/... automatically.
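
To check whether a model is already staged before starting a container, it helps to know where the cache puts it. The sketch below assumes the standard Hugging Face hub cache layout (`$HF_HOME/hub/models--<org>--<name>`); the function names `hf_cache_dir` and `is_staged` are illustrative, and the fallback to `$HOME/vllm` is this guide's convention rather than a Hugging Face default.

```shell
# Print where the Hugging Face cache keeps a repo under HF_HOME.
# Assumed layout: $HF_HOME/hub/models--<org>--<name>
hf_cache_dir() {
  repo=$1
  home=${HF_HOME:-$HOME/vllm}   # this guide's model store, not the HF default
  printf '%s/hub/models--%s\n' "$home" "$(printf '%s' "$repo" | sed 's#/#--#g')"
}

# Return 0 if the snapshots/ directory for the repo exists on the host.
is_staged() {
  [ -d "$(hf_cache_dir "$1")/snapshots" ]
}
```

Typical use on the host: `is_staged nvidia/Qwen3.6-35B-A3B-NVFP4 || hf download nvidia/Qwen3.6-35B-A3B-NVFP4`.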

[!TIP] Force fully-local loading (air-gapped, no network checks) with -e HF_HUB_OFFLINE=1 once weights are staged. If you let the container (root) download instead of pre-staging, files in ~/vllm are root-owned — add --user $(id -u):$(id -g) if that matters.


4. Is volume mounting necessary?

Yes. Treat it as mandatory. A container filesystem is ephemeral: a freshly downloaded 27–120 B model is destroyed when the container is removed, so without a mount every new container re-pulls many GB before it can even begin the 10–15 min weight load. The vLLM team’s guidance is “download once, mount everywhere.”

The one other volume worth adding is the torch.compile cache (the slow part of cold start), which lives separately from your models at /root/.cache/vllm:

  -v ~/vllm-compile-cache:/root/.cache/vllm \

[!NOTE] The compile cache only pays off when you recreate a container (image/flag change) — a long-running --restart unless-stopped container compiles once anyway. It’s keyed to GPU arch + vLLM version + model + flags, so clear it (rm -rf ~/vllm-compile-cache/*) if a stale entry causes a startup hiccup after an upgrade. --ipc=host / --shm-size is shared memory (RAM), not a -v volume.


5. One-time host setup

nvidia-smi                                   # GPU + driver visible
docker ps                                    # else: sudo usermod -aG docker $USER && newgrp docker
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
mkdir -p ~/vllm
export HF_TOKEN=hf_xxx                        # gated models
# NGC pulls 401? -> docker login nvcr.io  (user "$oauthtoken", password = NGC API key)

[!WARNING] Unified-memory OOM valve. Because CPU and GPU share DRAM, the Linux page cache can hold memory CUDA can’t reclaim — an “OOM” well under 128 GB. If a big model fails to load after heavy file activity, flush caches first: sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
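
Before flushing, it can be worth looking at how much memory is actually sitting in the page cache. The helper below is a minimal sketch that reads the standard `MemAvailable` and `Cached` fields of `/proc/meminfo`; the name `meminfo_summary` is hypothetical, and the numbers are only a hint, since what CUDA can really allocate on unified memory is exactly the thing in question.

```shell
# Report available memory and page cache, in GiB.
# Usage: meminfo_summary [path]   (path defaults to /proc/meminfo)
meminfo_summary() {
  awk '
    /^MemAvailable:/ { avail  = $2 }   # values in /proc/meminfo are in kB
    /^Cached:/       { cached = $2 }   # anchored, so SwapCached is not matched
    END { printf "available_gib=%.1f cached_gib=%.1f\n", avail / 1048576, cached / 1048576 }
  ' "${1:-/proc/meminfo}"
}
```

A large `cached_gib` right after copying or downloading weights is the situation where the `drop_caches` command above is most likely to help.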


6. Baseline single-Spark recipe (template)

Minimal, model-agnostic shape — swap the model handle and per-model flags from §7.

export HF_TOKEN=hf_xxx   # only for gated models

docker run -d --name vllm --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -e HF_HOME=/models \
  -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve <HF_MODEL_HANDLE> \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 4

Smoke-test (first request triggers JIT warmup — see §9):

curl -sS http://localhost:8000/v1/models | jq -r '.data[0].id'
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"<HF_MODEL_HANDLE>","messages":[{"role":"user","content":"12*17"}],"max_tokens":500}'
# expect "204"
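
Because the container starts detached and a large model can take many minutes to load, the smoke test only succeeds once the server is up. A small polling helper makes that wait explicit. This is a sketch: the function name `wait_for_vllm` and its defaults are illustrative, and it relies on the `/health` endpoint shown in §10.

```shell
# Poll the health endpoint until it answers or a timeout expires.
# Usage: wait_for_vllm [url] [timeout_seconds] [interval_seconds]
wait_for_vllm() {
  url=${1:-http://localhost:8000/health}
  timeout_s=${2:-900}    # generous default: big models load slowly
  interval_s=${3:-5}
  waited=0
  while [ "$waited" -lt "$timeout_s" ]; do
    # -f makes curl fail on HTTP errors, so only a healthy reply ends the loop
    if curl -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; then
      echo "ready after ${waited}s"
      return 0
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  echo "not ready after ${timeout_s}s" >&2
  return 1
}
```

Used as `wait_for_vllm && curl -sS http://localhost:8000/v1/models`, it keeps a script from firing requests at a server that is still loading weights.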

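[!NOTE] Two container tracks; pick per recipe. NVIDIA’s NGC image nvcr.io/nvidia/vllm:26.04-py3 (vLLM 0.19.0) is the default for many nvidia/... checkpoints. Several recipes below use the upstream vllm/vllm-openai:nightly / :cu130-nightly track (needed for the newest architectures and for DFlash). The two images are invoked differently: the upstream image’s entrypoint is already vllm serve, so only the model and flags follow the image name, while the NGC image needs vllm serve spelled out. A given model may need a newer container than you have; running vllm --version inside the image tells you which build it carries.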

[!IMPORTANT] Quantization flag rule. For NVIDIA ModelOpt NVFP4 checkpoints (the nvidia/... repos), pass --quantization modelopt. For compressed-tensors NVFP4 (Unsloth, llm-compressor community builds), vLLM auto-detects — do not pass --quantization. Getting this wrong is a common first-run failure.
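
When it is unclear which kind of checkpoint a repo contains, the files in the downloaded snapshot usually settle it. The sketch below is a heuristic under stated assumptions: ModelOpt exports ship an `hf_quant_config.json`, and compressed-tensors builds declare `"quant_method": "compressed-tensors"` in `config.json`. The function name `quant_flag_for` is hypothetical, and the grep-based JSON check is deliberately crude.

```shell
# Print the quantization flag a local checkpoint directory appears to need.
# Prints nothing (and succeeds) when vLLM should auto-detect the format.
# Usage: quant_flag_for <snapshot_dir>
quant_flag_for() {
  dir=$1
  if [ -f "$dir/hf_quant_config.json" ]; then
    # Assumption: this file marks a ModelOpt export
    echo "--quantization modelopt"
  elif grep -Eq '"quant_method"[[:space:]]*:[[:space:]]*"compressed-tensors"' "$dir/config.json" 2>/dev/null; then
    return 0   # compressed-tensors: pass no --quantization flag
  else
    echo "quant format not recognized in $dir" >&2
    return 1
  fi
}
```

Point it at a snapshot directory under `~/vllm/hub/models--<org>--<name>/snapshots/` and treat the answer as a hint to confirm against the model card, not as a guarantee.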


7. Per-model configurations

Grouped by family: Qwen → Gemma → NVIDIA → Mistral & other. Where a model card (or an independent evaluation) publishes benchmarks, the numbers are summarized inline; not every model has them. All commands assume ~/vllm is your model store.

Quick comparison

| Model | Publisher | Arch | Total / Active | Quant | Image | Key flags |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.6-35B-A3B-NVFP4 | NVIDIA (Qwen) | MoE+hybrid attn, multimodal | 35B / 3B | NVFP4 (modelopt) | vllm-openai:nightly | env vars, MTP-3, --quantization modelopt |
| Qwen3.6-27B-NVFP4 | Unsloth (Qwen) | dense + vision | 27B | NVFP4 (c-t) | vllm-openai:nightly | --dtype bfloat16, trust-remote-code |
| Qwen3.5-35B-A3B | Qwen (community) | MoE (SSM+MoE) | 35B / 3B | BF16 | cu130-nightly | qwen3_coder; no chunked-prefill |
| Gemma-4-31B-IT-NVFP4 | NVIDIA (Gemma) | dense, multimodal | 30.7B | NVFP4 (modelopt) | vllm-openai:gemma4-cu130 | --quantization modelopt, TP=1 |
| Gemma-4-26B-A4B-NVFP4 | NVIDIA (Gemma) | MoE, multimodal | 25.2B / 3.8B | NVFP4 (modelopt) | vllm-openai:gemma4-cu130 | TP=1 only, gemma4 parsers |
| gemma-4-12B-coder MTP-NVFP4 | community (Gemma) | dense coder + MTP draft | 12B | NVFP4 (c-t) | vllm-openai:nightly | bundled MTP, kv-cache fp8, thinking-on |
| DiffusionGemma-26B-A4B-NVFP4 | NVIDIA (Gemma) | diffusion MoE | 25.2B / 3.8B | NVFP4 (modelopt) | vllm-openai:gemma | V2 runner, TRITON_ATTN |
| Nemotron-3-Nano-30B-A3B-NVFP4 | NVIDIA | hybrid Mamba-2 + MoE | 30B / 3.5B | NVFP4 + fp8 KV | nvcr.io vllm:25.12.post1 | nano_v3 parser, Spark-tested |
| Nemotron-3-Super-120B-A12B-NVFP4 | NVIDIA | hybrid Mamba-Tx MoE | 120B / 12B | NVFP4 | cu130-nightly | MTP, reasoning nemotron_v3 |
| Mistral-Small-4-119B-NVFP4 | Mistral AI | MoE, multimodal | 119B / 6.5B | NVFP4 (c-t) | cu130-nightly | MLA, mistral parsers, SYSTEM_PROMPT |
| gpt-oss-120b | OpenAI | MoE | 120B | MXFP4 | nvcr.io vllm:26.04 | expert-parallel |
| Llama-3.3-70B-Instruct-NVFP4 | NVIDIA (Meta) | dense | 70B | NVFP4 | nvcr.io vllm:26.04 | gated, bandwidth-bound |
| Phi-4-multimodal-NVFP4 | NVIDIA (Microsoft) | dense, multimodal | | NVFP4 | nvcr.io vllm:26.04 | trust-remote-code |

(c-t = compressed-tensors, auto-detected; env vars = the four environment variables set in 7.1.1, including CUTE_DSL_ARCH=sm_121a.)


7.1 Qwen family

7.1.1 Qwen3.6-35B-A3B-NVFP4 — flagship MoE agent (official Spark recipe)

MoE with hybrid attention, 35B total / 3B active, multimodal (text/image/video), 262K context, Apache 2.0, ~19 GB NVFP4. This is NVIDIA’s recommended Spark configuration verbatim — note the four required env vars and --quantization modelopt.

export HF_TOKEN=hf_xxx
docker run -d --name qwen36-35b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_FP8_MOE_BACKEND=flashinfer_cutlass \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e CUTE_DSL_ARCH=sm_121a \
  vllm/vllm-openai:nightly \
  nvidia/Qwen3.6-35B-A3B-NVFP4 --port 8000 \
    --tensor-parallel-size 1 --trust-remote-code --dtype auto \
    --quantization modelopt --kv-cache-dtype fp8 \
    --attention-backend flashinfer --moe-backend marlin \
    --gpu-memory-utilization 0.85 --max-model-len 65536 \
    --max-num-seqs 4 --max-num-batched-tokens 8192 \
    --enable-chunked-prefill --async-scheduling --enable-prefix-caching \
    --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}' \
    --reasoning-parser qwen3

[!CAUTION] Earlier community recipes floated --gpu-memory-utilization 0.4 and --max-model-len 262144 for this model. NVIDIA’s official card uses 0.85 / 65536 plus the four env vars above — use the official values. For agent/tool use add --enable-auto-tool-choice --tool-call-parser qwen3_coder.

Qwen3.6-35B-A3B-NVFP4: accuracy retained vs BF16. NVIDIA card, vLLM on GB300. NVFP4 holds full-precision quality across 8 benchmarks.

| Benchmark | BF16 baseline | NVFP4 |
| --- | --- | --- |
| MMLU Pro | 85.6% | 85.0% |
| GPQA Diamond | 84.9% | 84.8% |
| τ²-Bench Tel. | 95.5% | 94.7% |
| SciCode | 40.8% | 40.6% |
| AIME 2025 | 89.2% | 88.8% |
| AA-LCR | 62.0% | 62.0% |
| IFBench | 62.3% | 62.8% |
| MMMU Pro | 74.1% | 74.5% |

7.1.2 Qwen3.6-27B-NVFP4 — flagship-level coding in a 27B dense model (Unsloth)

Dense 27B with a vision encoder (text/image/video), hybrid Gated-DeltaNet + Gated-Attention layers, MTP-trained, 262K context (extensible to ~1M via YaRN), Apache 2.0, ~18 GB NVFP4 (compressed-tensors → no --quantization flag).

docker run -d --name qwen36-27b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:nightly \
  unsloth/Qwen3.6-27B-NVFP4 \
    --trust-remote-code --dtype bfloat16 \
    --gpu-memory-utilization 0.85 --max-model-len 65536 \
    --max-num-seqs 4 --reasoning-parser qwen3

[!NOTE] The card’s default --max-model-len is a conservative 4096 (“increase only after checking memory”). 65536 is comfortable on Spark; go to 128K or beyond only if you have headroom, since the model leans on long context for thinking. For coding agents add --enable-auto-tool-choice --tool-call-parser qwen3_coder. MTP is supported by the architecture: you can try --speculative-config '{"method":"mtp","num_speculative_tokens":3}' and validate acceptance on your build. Thinking is on by default; pass chat_template_kwargs={"enable_thinking":false} (or {"preserve_thinking":true} for agents) per request.
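
For curl users, the per-request thinking switch is a top-level `chat_template_kwargs` field in the request body. The helper below is a small sketch for building such a body; the name `chat_body` and the fixed `max_tokens` are illustrative, and it does no JSON escaping, so it only suits prompts without quotes or backslashes.

```shell
# Build a chat-completions body with thinking switched on or off.
# Usage: chat_body <served-model-name> <prompt> [true|false]
# Note: no JSON escaping; keep quotes and backslashes out of the prompt.
chat_body() {
  model=$1
  prompt=$2
  thinking=${3:-true}
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":256,"chat_template_kwargs":{"enable_thinking":%s}}\n' \
    "$model" "$prompt" "$thinking"
}
```

It pairs with curl as `chat_body unsloth/Qwen3.6-27B-NVFP4 "Summarize RoPE in one line" false | curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @-`.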

Qwen3.6-27B: flagship-level coding in a 27B dense model. Qwen card BF16 reference benchmarks (NVFP4 retains ~99%). Higher is better.

| Benchmark | Qwen3.6-27B | Qwen3.5-27B | Gemma-4-31B |
| --- | --- | --- | --- |
| SWE-bench Verified | 77.2% | 75.0% | 52.0% |
| SWE-bench Pro | 53.5% | 51.2% | 35.7% |
| LiveCodeBench v6 | 83.9% | 80.7% | 80.0% |
| AIME 2026 | 94.1% | 92.6% | 89.2% |

[!TIP] Beyond 262K (YaRN, up to ~1M). This checkpoint is native to 262,144 tokens; for longer contexts enable static YaRN via an --hf-overrides block and the long-len escape hatch (mind that static YaRN can slightly dent short-context quality — only enable it when you actually need the length):

  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  ... unsloth/Qwen3.6-27B-NVFP4 --trust-remote-code --dtype bfloat16 \
    --hf-overrides '{"text_config":{"rope_parameters":{"mrope_interleaved":true,"mrope_section":[11,11,10],"rope_type":"yarn","rope_theta":10000000,"partial_rotary_factor":0.25,"factor":4.0,"original_max_position_embeddings":262144}}}' \
    --max-model-len 1010000
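
The numbers in the override are linked by simple arithmetic: static YaRN stretches the native window by `factor`, so the reachable length is roughly `factor × original_max_position_embeddings`. The helpers below sketch that check; the names `yarn_max_len` and `yarn_fits` are hypothetical.

```shell
# Longest context a static YaRN factor nominally reaches.
# Usage: yarn_max_len <factor> <original_max_position_embeddings>
yarn_max_len() {
  awk -v f="$1" -v o="$2" 'BEGIN { printf "%d\n", f * o }'
}

# Return 0 if the requested --max-model-len fits under factor x original.
# Usage: yarn_fits <factor> <original> <requested_len>
yarn_fits() {
  awk -v f="$1" -v o="$2" -v req="$3" 'BEGIN { exit !(req <= f * o) }'
}
```

With the values above, `yarn_max_len 4.0 262144` gives `1048576`, so `--max-model-len 1010000` fits, while a factor of 2.0 would not reach it.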

7.1.3 Qwen3.5-35B-A3B — coding-agent backend (community)

Popular for pointing local coding agents at Spark. Upstream nightly, BF16.

docker run -d --name qwen35 --ipc=host --restart unless-stopped \
  --gpus all --shm-size 64gb -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:cu130-nightly \
  Qwen/Qwen3.5-35B-A3B \
    --served-model-name qwen3.5-35b --host 0.0.0.0 --port 8000 \
    --dtype bfloat16 --gpu-memory-utilization 0.9 --max-model-len 262144 \
    --enable-prefix-caching --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder --reasoning-parser qwen3

[!WARNING] GB10 gotchas (community-verified): do not add --enable-chunked-prefill (≈9× throughput regression on SSM+MoE), and do not add --kv-cache-dtype fp8 (output-repetition loops on GB10) for this model. This is the opposite of the Qwen3.6-NVFP4 recipe — never copy flags across models. There’s also a DFlash-accelerated dense sibling (Qwen3.5-27B) covered in §13.


7.2 Gemma family

7.2.1 Gemma-4-31B-IT-NVFP4 — dense multimodal (NVIDIA)

Dense 30.7B, multimodal (text/image/video), 256K context, 140+ languages, hybrid local/global attention with p-RoPE, NVIDIA Open Model License (+ Gemma terms), ~21 GB NVFP4.

export HF_TOKEN=hf_xxx
docker run -d --name gemma4-31b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:gemma4-cu130 \
  nvidia/Gemma-4-31B-IT-NVFP4 \
    --quantization modelopt --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4

[!NOTE] The card uses --tensor-parallel-size 8 on a server — on Spark use TP=1. Gated: accept the Gemma license + set HF_TOKEN. For tools/reasoning add --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4.

Gemma-4-31B-IT-NVFP4: accuracy retained vs full precision. NVIDIA card, vLLM on H100. NVFP4 stays within about 0.4 points of baseline.

| Benchmark | Baseline | NVFP4 |
| --- | --- | --- |
| GPQA Diamond | 75.71% | 75.46% |
| AIME 2025 | 66.25% | 65.94% |
| MMLU Pro | 85.25% | 84.94% |
| LiveCodeBench | 70.9% | 70.63% |
| SciCode | 33.61% | 33.18% |
| Term-Bench Hard | 27.08% | 27.08% |

7.2.2 Gemma-4-26B-A4B-NVFP4 — MoE multimodal (NVIDIA)

MoE 25.2B total / 3.8B active (8 of 128 experts +1 shared), multimodal, 256K context, ~14 GB NVFP4 — a small, fast, high-quality Spark pick.

docker run -d --name gemma4-26b --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:gemma4-cu130 \
  nvidia/Gemma-4-26B-A4B-NVFP4 \
    --quantization modelopt --tensor-parallel-size 1 --moe-backend marlin \
    --trust-remote-code \
    --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 \
    --gpu-memory-utilization 0.85 --max-model-len 65536 --max-num-seqs 4

[!WARNING] Per the card, this checkpoint currently works with TP=1 only in vLLM (expert-parallel is supported; tensor-parallel is not yet). MoE backend must be VLLM_CUTLASS or Marlin (FlashInfer-TRTLLM is pending a vLLM PR). Needs --trust-remote-code.

Gemma-4-26B-A4B-NVFP4: accuracy retained vs full precision. NVIDIA card, vLLM on B200. 25.2B total / 3.8B active MoE.

| Benchmark | Baseline | NVFP4 |
| --- | --- | --- |
| GPQA Diamond | 80.3% | 79.9% |
| AIME 2025 | 88.95% | 90.0% |
| MMLU Pro | 85.0% | 84.8% |
| LiveCodeBench | 80.5% | 79.8% |
| IFBench | 77.77% | 78.1% |
| IFEval | 96.6% | 96.4% |

7.2.3 gemma-4-12B-coder (MTP-NVFP4) — Python specialist with a bundled MTP draft (community)

A weight-only NVFP4 (W4A16) build of a Gemma-4-12B coding fine-tune: 8.25 GB model + a 0.85 GB bundled MTP draft for ~1.6× single-stream decode speed. vLLM auto-detects the NVFP4 format (no --quantization). Because the draft lives in assistant/, download to a local path and mount it.

# download (~9 GB total) into ~/vllm
hf download sakamakismile/gemma-4-12B-coder-fable5-composer2.5-MTP-NVFP4 \
  --local-dir ~/vllm/gemma4-coder

# easiest: one GPU, just chat
docker run -d --name gemma4-coder --ipc=host --shm-size 16gb --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -v ~/vllm/gemma4-coder:/model:ro \
  vllm/vllm-openai:nightly \
  --model /model --served-model-name gemma4-coder \
  --max-model-len 16384 --gpu-memory-utilization 0.92 --trust-remote-code

Add the bundled MTP draft for ~1.6× interactive speed (lossless):

  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","model":"/model/assistant","num_speculative_tokens":3}'

[!IMPORTANT] This model was trained to think first — enable it per request or quality drops: extra_body={"chat_template_kwargs":{"enable_thinking":true}}. Needs a recent nightly (registers Gemma4UnifiedForConditionalGeneration). With MTP you must use --kv-cache-dtype fp8 (NVFP4 KV breaks the draft).

[!CAUTION] The build is de-refused / not safety-aligned — add your own guardrails. It’s a superb algorithm/debug assistant but can write look-ahead bias into pandas/numpy time-series & back-test code (its reasoning sometimes states the right rule while the code does the opposite). Gate quant/accounting code; don’t ship it unreviewed.

gemma-4-12B-coder (MTP-NVFP4): independent eval, greedy pass@1. 8.25 GB Python/algorithm specialist. NVFP4 build = Q8 source parity (96% = 96% on HumanEval[:50]).

| Benchmark | pass@1 |
| --- | --- |
| HumanEval | 90.2% |
| MBPP | 85.7% |

7.2.4 DiffusionGemma-26B-A4B-NVFP4 — discrete-diffusion text gen (NVIDIA)

A diffusion LLM on the Gemma-4 26B-A4B MoE backbone (25.2B / 3.8B active) that emits parallel 256-token blocks, exceeding 1,100 tok/s at low batch (H100 FP8). Multimodal, 256K context, ~14 GB NVFP4. Uses the dedicated diffusion image.

docker run -d --name diffgemma --ipc=host --shm-size=16g --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_USE_V2_MODEL_RUNNER=1 \
  vllm/vllm-openai:gemma \
  nvidia/diffusiongemma-26B-A4B-it-NVFP4 \
    --trust-remote-code --max-num-seqs 4 \
    --attention-backend TRITON_ATTN \
    --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 \
    --override-generation-config '{"max_new_tokens": null}' \
    --default-chat-template-kwargs '{"enable_thinking":true}'

[!WARNING] The vllm/vllm-openai:gemma image and these flags are tentative until the supporting vLLM image is publicly released (per the card). Check the vLLM releases and the vllm/vllm-openai:gemma4 Docker Hub tags before relying on it.

DiffusionGemma-26B-A4B-NVFP4: accuracy retained (thinking on). NVIDIA card, vLLM on B100. Diffusion sampling; >1,100 tok/s at low batch (H100 FP8).

| Benchmark | Baseline | NVFP4 |
| --- | --- | --- |
| GPQA Diamond | 69.4% | 68.6% |
| AIME 2025 | 68.33% | 67.33% |
| GSM8K | 94.54% | 94.01% |
| IFEval | 94.01% | 94.56% |
| HumanEval | 94.09% | 95.0% |
| MMMLU 0-shot | 88.5% | 88.13% |
| MMLU Pro | 81.0% | 80.7% |


7.3 NVIDIA models (designed and trained by NVIDIA)

Models NVIDIA itself designed and trained (not quantizations of someone else’s weights). Both are hybrid Mamba-2 + Transformer MoE reasoning models with native function calling: the Nano is the natural single-Spark starting point, the Super is the heavyweight.

7.3.1 NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 — best single-Spark Nemotron (Spark-tested)

A hybrid Mamba-2 + MoE (52 layers: 23 Mamba-2, 23 MoE with 128 experts +1 shared / 6 active, 6 GQA attention), 30B total / 3.5B active, text-only, 1M context (256K default), unified reasoning + non-reasoning, NVIDIA Nemotron Open Model License, ~18 GB on disk. NVFP4 weights with FP8 KV cache; attention and the Mamba layers feeding it stay BF16, and quantization-aware distillation (QAD) recovers accuracy. NVIDIA lists DGX Spark in this model’s tested hardware and ships a Spark/Jetson-specific container.

export HF_TOKEN=hf_xxx
# one-time: fetch the custom reasoning parser into ~/vllm (mounted at /models)
wget -P ~/vllm \
  https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py

docker run -d --name nemotron-nano --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
    --served-model-name nemotron-3-nano --port 8000 \
    --tensor-parallel-size 1 --trust-remote-code \
    --max-model-len 262144 --max-num-seqs 8 \
    --kv-cache-dtype fp8 \
    --enable-auto-tool-choice --tool-call-parser qwen3_coder \
    --reasoning-parser-plugin /models/nano_v3_reasoning_parser.py \
    --reasoning-parser nano_v3

[!IMPORTANT] On Spark (or Jetson Thor) use NVIDIA’s nvcr.io/nvidia/vllm:25.12.post1-py3 container for this model, and fetch the nano_v3_reasoning_parser.py plugin (downloaded into ~/vllm above, referenced at /models/...). --kv-cache-dtype fp8 is part of the official recipe here — unlike Qwen3.5, this model is built for it. --max-num-seqs 8 is NVIDIA’s Spark-tested value; drop to 4 for more KV headroom at long context.

[!NOTE] Reasoning is on by default — pass chat_template_kwargs={"enable_thinking":false} per request to turn it off. The model also supports a reasoning_budget (cap internal reasoning tokens to hit latency targets). Sampling: temperature=1.0, top_p=1.0 for reasoning; 0.6 / 0.95 for tool calling. For up to 1M context add -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 and raise --max-model-len. With only 3.5B active params and ~18 GB resident, decode is fast and KV headroom is large — an excellent default Spark model.

Nemotron-3-Nano-30B-A3B: accuracy retained vs BF16. NVIDIA card (Nemo Evaluator). NVFP4 + FP8 KV after PTQ + quant-aware distillation; FP8 sits between.

| Benchmark | BF16 baseline | NVFP4 |
| --- | --- | --- |
| MMLU-Pro | 78.3% | 77.4% |
| AIME 2025 | 89.1% | 86.7% |
| GPQA | 73.0% | 71.9% |
| LiveCodeBench | 68.3% | 65.4% |
| TauBench V2 avg | 49.0% | 45.6% |
| IFBench | 71.5% | 70.7% |

7.3.2 Nemotron-3-Super-120B-A12B-NVFP4 — flagship local MoE

NVIDIA’s own hybrid Mamba-Transformer LatentMoE, 120B total / 12B active, native NVFP4 pretraining, MTP, 1M context. The vLLM team’s reference Spark deployment (~23 tok/s decode; ~10–15 min first load).

docker run -d --name nemotron --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:cu130-nightly \
  nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --served-model-name nemotron-3-super --trust-remote-code \
    --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4 \
    --reasoning-parser nemotron_v3 \
    --enable-auto-tool-choice --tool-call-parser qwen3_coder

[!NOTE] Two NVIDIA-published NVFP4 checkpoints of other vendors’ models live in §7.4 because they follow those families: Llama-3.3-70B-NVFP4 (Meta) and Phi-4-multimodal-NVFP4 (Microsoft). The nvidia/... Qwen and Gemma NVFP4 checkpoints likewise live in their family groups above. This section is for models NVIDIA itself designed.


7.4 Mistral & other open models

7.4.1 Mistral-Small-4-119B-2603-NVFP4 — unified instruct + reasoning + coding (Mistral AI)

A granular MoE (128 experts, 4 active; 119B total / 6.5B active) fusing Instruct, Reasoning (Magistral), and Devstral skills, multimodal, 256K context, Apache 2.0, ~60 GB NVFP4 (fits one Spark). Compressed-tensors → no --quantization; uses MLA attention and Mistral parsers.

docker run -d --name mistral-small4 --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  vllm/vllm-openai:cu130-nightly \
  mistralai/Mistral-Small-4-119B-2603-NVFP4 \
    --tensor-parallel-size 1 --max-model-len 65536 \
    --attention-backend TRITON_MLA \
    --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral \
    --max-num-batched-tokens 16384 --max-num-seqs 4 \
    --gpu-memory-utilization 0.85

[!IMPORTANT] Needs mistral_common >= 1.11.0 (bundled in recent vLLM; uv pip install -U vllm pulls it). For correct behavior, load the repo’s SYSTEM_PROMPT.txt and set reasoning_effort per request ("none" for fast replies, "high" for hard tasks; temp 0.7 with reasoning). The card’s example uses TP=2 + --max-num-seqs 128 for a multi-GPU server — on a single Spark use TP=1, --max-num-seqs 4, --max-model-len 65536.

Figure: Mistral Small 4 119B, gains vs Mistral Small 3. Vendor-reported (Mistral AI card). Unified instruct + reasoning + coding MoE, 119B / 6.5B active. Panels: end-to-end latency (latency-optimized), −40%; requests/sec (throughput-optimized).

7.4.2 gpt-oss-120b — open-weights MXFP4 MoE (OpenAI)

~65 GB native MXFP4 — fits one Spark with KV-cache room. Ungated. The 20B sibling (openai/gpt-oss-20b) is a great first smoke-test.

docker run -d --name gptoss --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve openai/gpt-oss-120b \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4 \
    --enable-expert-parallel

[!NOTE] gpt-oss models use OpenAI’s Harmony response format and expose a reasoning_effort control ("low"/"medium"/"high") passed in the request body — there’s no separate --reasoning-parser flag. For tool calling add --enable-auto-tool-choice --tool-call-parser openai. Use a recent vLLM/NGC image (older builds predate Harmony parsing).

7.4.3 Llama-3.3-70B-Instruct-NVFP4 — gated dense (Meta, NVIDIA quant)

docker run -d --name llama70 --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve nvidia/Llama-3.3-70B-Instruct-NVFP4 \
    --quantization modelopt --max-model-len 131072 \
    --gpu-memory-utilization 0.85 --max-num-seqs 4

Dense 70B → memory-bandwidth-limited decode (slower than a similar-size MoE), but high single-user quality. Needs an accepted Llama license + HF_TOKEN.

7.4.4 Phi-4-multimodal-instruct-NVFP4 — omnimodal text + image + audio (Microsoft, NVIDIA quant)

Microsoft’s 5.6B omnimodal Phi-4 — text, image, and speech/audio — 128K context, Phi4MMForCausalLM with a custom processor. NVIDIA NVFP4 ModelOpt quant; small and cheap on Spark.

[!CAUTION] Known container gotcha: Phi-4-mm’s custom processor imports scipy, which is not installed in the NGC vLLM images — serving fails with ImportError: ... scipy. Install it before vllm serve (or bake it into a derived image). Tracked in NVIDIA/dgx-spark-playbooks issue #69.

docker run -d --name phi4mm --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  bash -lc "pip install --no-cache-dir scipy && \
    vllm serve nvidia/Phi-4-multimodal-instruct-NVFP4 \
      --quantization modelopt --trust-remote-code \
      --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4"

[!NOTE] Needs --trust-remote-code (custom Phi4MM processor). Send images via OpenAI image_url blocks and audio via the audio input field (any soundfile-readable format). The NVFP4 checkpoint was originally published for TensorRT-LLM but runs under vLLM with the two requirements above.


8. Flag reference & tuning

| Flag | On Spark | Guidance |
| --- | --- | --- |
| --gpu-memory-utilization | Fraction of the shared 128 GB | Start from the model recipe (0.4–0.92 seen). Leave headroom; lower if OOM. |
| --max-num-seqs | Concurrent sequences | Keep low (1–4); above ~4 the bandwidth tax outweighs batching. |
| --max-model-len | Prompt + completion cap | 65536 is a sane Spark default; raise toward model max only with headroom. |
| prefix caching | KV reuse across shared prefixes | On by default in V1; --enable-prefix-caching is redundant but harmless. |
| --quantization modelopt | ModelOpt NVFP4 only | Pass for nvidia/... ModelOpt checkpoints; omit for compressed-tensors (auto-detected). |
| --reasoning-parser / --tool-call-parser + --enable-auto-tool-choice | Structured reasoning/tools | Follow the model recipe (qwen3 / gemma4 / mistral / nemotron_v3 / nano_v3). |
| --kv-cache-dtype fp8 | Shrinks KV cache | Model-specific: the Qwen3.6-35B-A3B-NVFP4, Nemotron-3-Nano, and gemma4-coder-MTP recipes use it; Qwen3.5 and DFlash do not. |
| --speculative-config '{"method":"mtp"...}' | Built-in speculative decode | Latency lever for MTP models (Qwen3.x, Nemotron, gemma4-coder). See §13. |
| --moe-backend / --attention-backend | Kernel pins | Leave auto unless a tested recipe pins one (marlin / cutlass / flashinfer / TRITON_*). |
| --enable-expert-parallel | MoE routing | Enable for MoE (gpt-oss, Nemotron, Mistral, Gemma-4-26B). It shards experts across GPUs, so the effect shows up mainly in multi-Spark runs. |
| CUDA graphs | Per-step overhead | Keep enabled. |
| --load-format fastsafetensors | Faster weight load | Evaluate if the 10–15 min load matters. |

[!TIP] Stability ↔ throughput slider. A plain run leaves KV-cache dtype unset, spec-decode off, CUDA graphs on, backends auto. A tuned run layers in FP8 KV cache, async scheduling, MTP/DFlash, and pinned FP4 backends — each validated against the exact model + prompt shape + batch pattern + vLLM release. Want ~2–3× faster interactive decode? See §13 (MTP & DFlash).


9. Pre-warm the JIT

The first request after boot triggers Inductor + FlashInfer codegen (~25–60 s). Fire a tiny warmup on the same path as real traffic, then short prompts return in <0.5 s:

curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"<served-model-name>","messages":[{"role":"user","content":"ping"}],"max_tokens":3}'

This is separate from weight load (10–15 min for a 120B) — address that with fastsafetensors/InstantTensor if needed.


10. Verify & monitor

curl -sS http://localhost:8000/health
curl -sS http://localhost:8000/v1/models | jq
curl -sS http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"<served-model-name>","messages":[{"role":"user","content":"Explain quantum computing briefly."}],"max_tokens":200,"stream":true}'

Prometheus telemetry is at /metrics (no extra service). Watch KV-cache utilization (vllm:kv_cache_usage_perc) and the TTFT / inter-token-latency histograms. Healthy single-user behavior: prefix-cached later turns don’t spike, decode rate stays steady, and KV-cache usage stays well below 100%.
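
A quick way to read the KV-cache gauge without a dashboard is to filter the metrics text directly. This is a sketch under two assumptions: the endpoint emits the usual Prometheus text format with the sample value as the last field, and the gauge is a fraction between 0 and 1. The function name `kv_usage_pct` is illustrative.

```shell
# Read Prometheus text on stdin and print KV-cache usage as a percentage.
# Assumes the gauge value is a 0-1 fraction in the last field of the sample line.
kv_usage_pct() {
  # "# HELP" and "# TYPE" lines start with "#", so the anchor skips them
  awk '/^vllm:kv_cache_usage_perc/ { printf "%.1f\n", $NF * 100 }'
}
```

Run it as `curl -s localhost:8000/metrics | kv_usage_pct`, for example in a `watch` loop while a long prompt is in flight.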

[!TIP] Confirm the fast paths are live. Check the startup log to verify NVFP4 GEMM kernels actually engaged (you want a line like Using NvFp4LinearBackend.VLLM_CUTLASS for NVFP4 GEMM) — if it silently fell back, you lose the FP4 speed/memory win. For speculative decoding, confirm acceptance via curl -s localhost:8000/metrics | grep -i spec_decode (see §13.5).


11. Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| permission denied ... docker.sock | Not in docker group | sudo usermod -aG docker $USER && newgrp docker |
| OOM with free RAM showing | Page cache holds unified memory | sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' before launch |
| OOM during load/serve | Util too high / context too long | Lower --gpu-memory-utilization, reduce --max-model-len, or pick a smaller/more-quantized model |
| 401 / gated download fails | Missing token / license | export HF_TOKEN=..., accept the license on the HF page |
| model type ... not recognized | Container too old for the arch | Newer NGC tag or vllm-openai:nightly; check vllm --version |
| unknown quantization / weights mis-load | Wrong quant flag | ModelOpt → --quantization modelopt; compressed-tensors → omit it |
| Stable image won’t run on GB10 | No sm_121 support | Use cu130-nightly / NGC image |
| Output repetition loops | --kv-cache-dtype fp8 on a model that dislikes it | Remove it (model-specific) |
| ~9× slower MoE | --enable-chunked-prefill on SSM+MoE | Remove it |
| First request slow, rest fine | JIT warmup | Pre-warm (§9) |
| rmi won’t delete image | Tag omitted | docker stop <name> && docker rm <name> then docker rmi <repo>:<tag> |
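For the "OOM with free RAM showing" row, it helps to see how much memory the page cache holds before deciding to drop it. A minimal sketch, assuming the standard /proc/meminfo layout (values in kB); the function name page_cache_gib is made up for this example.

```shell
# Read /proc/meminfo-formatted text on stdin and print the page-cache
# size ("Cached:") in GiB. On unified-memory systems a large value here
# can crowd out the model even when RAM looks free.
page_cache_gib() {
  awk '$1 == "Cached:" { printf "%.1f\n", $2 / 1048576 }'
}
```

Usage: page_cache_gib < /proc/meminfo. If the figure is a large share of total memory, run the drop_caches command from the table before launching.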

12. Scaling to multiple Sparks (brief)

A single Spark handles models up to ~100 GB (gpt-oss-120b MXFP4, Mistral-Small-4 NVFP4, Llama-70B NVFP4). For larger models or higher TP, link Sparks over the 200 Gb/s QSFP (ConnectX-7) ports and run a Ray cluster with --tensor-parallel-size set to the total number of GPUs across the linked Sparks:

  • 2 Sparks: direct QSFP cable. 3+: through a switch.
  • Bind NCCL to the QSFP interface (NCCL_SOCKET_IFNAME=enP2p1s0f1np1); falling back to the regular Ethernet interface cuts throughput by 10–20×.
  • Mount the same ~/vllm on every node and stage weights on each.
  • Easiest: the mark-ramsey-ri/vllm-dgx-spark scripts, or NVIDIA’s spark_cluster_setup.sh + the official multi-Spark playbook.
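The interface binding from the list above can be generated once and reused on every node. This is a sketch with assumptions: the function name nccl_env_flags is hypothetical, and setting GLOO_SOCKET_IFNAME alongside NCCL_SOCKET_IFNAME is an assumption about what the distributed bootstrap needs on your image, not something the playbooks above state.

```shell
# Print docker -e flags that pin NCCL (and, as an assumption, Gloo) to one
# network interface.
# $1 = interface name; defaults to the QSFP interface named in the list above.
nccl_env_flags() {
  iface=${1:-enP2p1s0f1np1}
  printf '%s\n' "-e NCCL_SOCKET_IFNAME=$iface -e GLOO_SOCKET_IFNAME=$iface"
}
```

Usage: pass $(nccl_env_flags) unquoted inside the docker run command on each node so the shell splits it into separate arguments. Confirm the interface name on your own hardware first (for example with ip link), since it can differ between units.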

13. Speculative decoding on Spark: MTP & DFlash (step-by-step)

A small, cheap drafter proposes the next few tokens; the big model verifies them in one pass, so every accepted token costs no extra target-model pass. It’s a latency win that shines at low concurrency, exactly Spark’s profile, and can speed up interactive decode roughly 2–3× with no quality loss. Two methods matter: MTP (built into the model) and DFlash (a separate diffusion drafter).
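To get a feel for why acceptance matters, the standard geometric-series estimate gives the expected number of tokens emitted per verify pass. The sketch below assumes each drafted token is accepted independently with the same probability, which real workloads only approximate, and it ignores the drafter's own cost, so read the result as an upper bound on speedup. The function name is made up for this example.

```shell
# Expected tokens emitted per target-model verify pass.
# $1 = per-token acceptance probability a (0..1)
# $2 = number of drafted tokens K
# Formula: E = (1 - a^(K+1)) / (1 - a), and E = K + 1 when a = 1.
# Assumes independent, identical acceptance per drafted token.
expected_tokens_per_step() {
  awk -v a="$1" -v k="$2" 'BEGIN {
    if (a >= 1) e = k + 1
    else e = (1 - a ^ (k + 1)) / (1 - a)
    printf "%.2f\n", e
  }'
}
```

Usage: expected_tokens_per_step 0.8 3 prints 2.95, meaning about three tokens per pass instead of one. With a = 0 the result is 1.00: the target still emits its own token, so speculation adds cost but no gain.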

13.1 Do they need special models, Docker, or configs?

| | MTP | DFlash |
| --- | --- | --- |
| Special model? | Target has built-in MTP modules (no download) or ships a small paired draft model (e.g. gemma4-coder’s bundled assistant/). | Yes: a matching DFlash drafter checkpoint trained for your target. |
| Special Docker? | No: standard Spark images. | Usually: vLLM ≥ 0.21.0 + sm_121 build. NGC 26.04 (vLLM 0.19.0) is too old; use a prebuilt DFlash image or nightly. |
| Special config? | --speculative-config '{"method":"mtp","num_speculative_tokens":1}' | --speculative-config '{"method":"dflash","model":"<drafter>","num_speculative_tokens":N}' + KV cache BF16 (no fp8). |
| Models | DeepSeek V3/R1/V4, Qwen3.5/3.6, GLM-5.x, Gemma 4 (+gemma4-coder), Nemotron-3-Super, Mistral | Gemma-4-31B, Laguna XS.2, Qwen3.5-27B (or train your own) |
| Speedup | ~1.6–1.8× | up to ~6× / ~2.5× over EAGLE-3; ~2.2–2.7× on Spark |

[!NOTE] Both consume KV cache for speculative tokens, so they trade peak throughput for latency. Keep them on for interactive use; for bursty/concurrent serving add --speculative-disable-by-batch-size 32.

Which models in this guide can use which method:

| Model (this guide) | MTP | DFlash | How |
| --- | --- | --- | --- |
| Qwen3.6-35B-A3B-NVFP4 (§7.1.1) | ✅ built-in | | already in its recipe (num_speculative_tokens:3) |
| Qwen3.6-27B-NVFP4 (§7.1.2) | ✅ built-in (validate) | | add --speculative-config '{"method":"mtp",...}' |
| Qwen3.5-35B-A3B (§7.1.3) | ✅ built-in | via 27B sibling | dense Qwen3.5-27B has a DFlash drafter |
| gemma-4-12B-coder (§7.2.3) | ✅ paired draft | | bundled /model/assistant + kv fp8 |
| Gemma-4-31B-IT-NVFP4 (§7.2.1) | | RedHatAI/...speculator.dflash | (Z-Lab image) |
| Nemotron-3-Super (§7.3.2) | ✅ built-in | | {"method":"mtp","num_speculative_tokens":1} |
| Nemotron-3-Nano (§7.3.1) | | | card ships no speculator; run dense |
| Mistral-Small-4 (§7.4.1) | ✅ built-in (validate) | | add the mtp flag and check acceptance |

(Empty = no published speculator for that exact checkpoint today; “validate” = architecture supports it but confirm acceptance on your build.)

13.2 MTP — zero-download (≈5 min)

Just a flag on an MTP-capable model (the §7.1.1 Qwen3.6 and §7.2.3 gemma4-coder recipes already use it):

[!NOTE] Two flavors of MTP. Most MTP models (Qwen3.6, Nemotron-Super, DeepSeek-style) have the prediction heads baked into the checkpoint — nothing extra to download, just the flag. A few ship a small paired draft model instead: gemma4-coder bundles a 0.4 B draft in assistant/, so you point "model" at it (/model/assistant) — see §7.2.3. Both use "method":"mtp"; only the second needs the "model" field.

docker run -d --name mtp --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -v ~/vllm:/models \
  nvcr.io/nvidia/vllm:26.04-py3 \
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --host 0.0.0.0 --port 8000 --trust-remote-code \
    --max-model-len 131072 --gpu-memory-utilization 0.85 --max-num-seqs 4 \
    --reasoning-parser nemotron_v3 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

num_speculative_tokens is usually 1 (some recipes use 3). For lowest latency at low concurrency, the vLLM recipe suggests disabling prefix caching. MTP reduces throughput under load — pair with --speculative-disable-by-batch-size 32.
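Hand-quoting the JSON inside a shell command is an easy place to slip, so a small builder can help when switching between the two MTP flavors. A minimal sketch; the function name spec_config is hypothetical, while the JSON fields are the ones used in the commands above.

```shell
# Print a --speculative-config JSON string.
# $1 = method (mtp or dflash)
# $2 = num_speculative_tokens (integer)
# $3 = optional drafter path or Hugging Face id, emitted as the "model" field
spec_config() {
  if [ -n "${3:-}" ]; then
    printf '{"method":"%s","model":"%s","num_speculative_tokens":%d}\n' "$1" "$3" "$2"
  else
    printf '{"method":"%s","num_speculative_tokens":%d}\n' "$1" "$2"
  fi
}
```

Usage: --speculative-config "$(spec_config mtp 1)" reproduces the flag in the command above; adding a third argument such as /model/assistant covers the paired-draft flavor.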

13.3 DFlash on Spark — plug-and-play container (easiest)

The ghcr.io/aeon-7/vllm-dflash image is a prebuilt sm_121 vLLM with DFlash baked in; it serves a 27B dense Qwen3.5 with a 2B block-diffusion drafter, raising decode from ~12 to ~33 tok/s.

# 1) download the target into ~/vllm
pip install -U "huggingface_hub[cli]"; export HF_TOKEN=hf_xxx
hf download AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --local-dir ~/vllm/DFlash-Qwen3.5-27B-Uncensored-NVFP4

# 2) make a persistent API key, then launch (drafter auto-downloads)
export VLLM_API_KEY=$(openssl rand -hex 32); echo "API key: $VLLM_API_KEY"
docker run -d --name vllm-dflash --runtime nvidia --network host --ipc host \
  --restart unless-stopped \
  -v ~/vllm/DFlash-Qwen3.5-27B-Uncensored-NVFP4:/models/target \
  -v ~/vllm/drafter-cache:/models/drafter-cache \
  -e MODEL_PATH=/models/target \
  -e SERVED_MODEL_NAME=dflash-qwen3.5-27b \
  -e DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash \
  -e DFLASH_NUM_SPEC_TOKENS=15 \
  -e MAX_MODEL_LEN=65536 -e MAX_NUM_SEQS=4 \
  -e GPU_MEMORY_UTILIZATION=0.85 -e MAX_NUM_BATCHED_TOKENS=65536 \
  -e VLLM_API_KEY="$VLLM_API_KEY" -e HF_TOKEN="$HF_TOKEN" \
  ghcr.io/aeon-7/vllm-dflash:latest
docker logs -f vllm-dflash    # ~5 min

# 3) test (note the Bearer token)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{"model":"dflash-qwen3.5-27b","messages":[{"role":"user","content":"Write a haiku about GPUs."}],"max_tokens":200}'

Key env vars: DFLASH_DRAFTER (HF id of the drafter; empty = plain vLLM), DFLASH_NUM_SPEC_TOKENS (15 is best single-stream, 5 for high concurrency), KV_CACHE_DTYPE (stays BF16 with DFlash). Spark presets, written as MAX_MODEL_LEN / DFLASH_NUM_SPEC_TOKENS / MAX_NUM_SEQS: default 65536/15/4 ≈ 33 tok/s; high-concurrency 32768/5/8 ≈ 85–92 tok/s total across streams.
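Switching between the two presets means changing three values together, which a tiny helper can do. This is a sketch: the function name and the preset labels are made up, and reading each preset triple as MAX_MODEL_LEN / DFLASH_NUM_SPEC_TOKENS / MAX_NUM_SEQS is an inference from the launch command above rather than something the image documents here.

```shell
# Print docker -e flags for one of the two Spark presets quoted above.
# $1 = "default" (single-stream) or "high-concurrency"
# Assumption: each preset triple means context length / spec tokens / max seqs.
dflash_preset_flags() {
  case "$1" in
    default)          len=65536; k=15; seqs=4 ;;
    high-concurrency) len=32768; k=5;  seqs=8 ;;
    *) echo "unknown preset: $1" >&2; return 1 ;;
  esac
  printf '%s\n' "-e MAX_MODEL_LEN=$len -e DFLASH_NUM_SPEC_TOKENS=$k -e MAX_NUM_SEQS=$seqs"
}
```

Usage: replace the matching -e flags in the docker run command with $(dflash_preset_flags high-concurrency), left unquoted so the shell splits it into arguments.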

13.4 DFlash the general way (other models)

DFlash uses vLLM’s speculators format; pass the drafter in --speculative-config. Available drafters today:

| Target (verifier) | DFlash drafter | Source |
| --- | --- | --- |
| google/gemma-4-31B-it | RedHatAI/gemma-4-31B-it-speculator.dflash | RedHatAI / vLLM |
| poolside/Laguna-XS.2 | poolside/Laguna-XS.2-speculator.dflash | poolside |
| Qwen/Qwen3.5-27B | z-lab/Qwen3.5-27B-DFlash | Z Lab |

docker run --rm vllm/vllm-openai:nightly vllm --version   # confirm >= 0.21.0

# Laguna XS.2 (coding; up to 7 tokens/step, ~70% acceptance on code)
docker run -d --name laguna --ipc=host --restart unless-stopped \
  --gpus all -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME=/models -e VLLM_USE_DEEP_GEMM=0 -v ~/vllm:/models \
  vllm/vllm-openai:nightly \
  vllm serve poolside/Laguna-XS.2 --trust-remote-code \
    --enable-auto-tool-choice --tool-call-parser poolside_v1 --reasoning-parser poolside_v1 \
    --speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'

[!NOTE] DFlash needs vLLM ≥ 0.21.0; the stock NGC 26.04 (0.19.0) won’t do it. Gemma-4 DFlash currently needs Z Lab’s build (ghcr.io/z-lab/vllm-openai:gemma4-dflash-cu130). No drafter for your model? Train one — the drafter is a small Qwen3-style stack and the speculators project ships a “Train DFlash” tutorial.
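The version floor can be checked in a script rather than by eye. A minimal sketch that relies on GNU sort -V; the function name version_ge is hypothetical. Note that pre-release strings such as 0.21.0rc1 or 0.21.0.dev5 sort after 0.21.0 here and so count as meeting the floor, which may not be what you want.

```shell
# Succeed (exit 0) if version $1 is >= version $2.
# Uses GNU sort -V; pre-release suffixes are treated as newer than the
# plain release, so check those by hand.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

Usage: version_ge "$(docker run --rm vllm/vllm-openai:nightly vllm --version)" 0.21.0 && echo "DFlash-capable". This assumes the command prints a bare version string; strip any extra log lines first if your image prints them.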

13.5 Verify it’s actually accelerating (don’t fly blind)

Speculative decoding only helps if the drafter’s tokens are accepted. Confirm it, then tune:

  • Check acceptance. vLLM logs a spec-decode summary (draft acceptance rate and mean accepted length per step) and exposes it on /metrics:
    curl -s http://localhost:8000/metrics | grep -i spec_decode
    
    Healthy MTP runs accept most proposed tokens (mean accepted length well above 1); DFlash on code can accept whole blocks. If acceptance is near zero, the method isn’t helping — the drafter/target are mismatched or the content is low-predictability (random/adversarial text).
  • Tune num_speculative_tokens (K). Raise K for DFlash (it drafts a whole block in one pass, so 7–15 is normal); keep K low for MTP (1–3), since each extra token costs another sequential draft step and tends to be accepted less often. Watch decode tok/s as you change it: past the acceptance sweet spot, throughput drops.
  • A/B against a dense run. Time the same prompt with the spec flag removed. Output quality should be unchanged either way, so if tok/s isn’t clearly higher with speculation on, drop it for that workload.
  • Concurrency check. Both methods tax throughput under load; if you serve bursts, set --speculative-disable-by-batch-size 32 so vLLM auto-disables speculation when batches grow.
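The acceptance check in the first bullet can be turned into a single number. The sketch below divides accepted tokens by drafted tokens from the /metrics output. The counter names are assumptions based on vLLM's spec-decode counters and may differ on your build, so adjust the patterns to whatever grep -i spec_decode actually shows. The function name is made up for this example.

```shell
# Read Prometheus text on stdin and print the draft acceptance rate
# (accepted tokens / drafted tokens), summed across engines.
# Assumed counter names (verify on your build):
#   vllm:spec_decode_num_accepted_tokens[_total]
#   vllm:spec_decode_num_draft_tokens[_total]
spec_acceptance_rate() {
  awk '
    /^vllm:spec_decode_num_accepted_tokens(_total)?[{ ]/ { acc += $NF }
    /^vllm:spec_decode_num_draft_tokens(_total)?[{ ]/    { drafted += $NF }
    END {
      if (drafted > 0) printf "%.3f\n", acc / drafted
      else print "n/a"
    }'
}
```

Usage: curl -s http://localhost:8000/metrics | spec_acceptance_rate. The counters are cumulative since server start, so sample before and after a test run and compare if you want the rate for one workload.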

13.6 The gotchas that bite people

[!CAUTION]

  • DFlash floor: vLLM ≥ 0.21.0 (AEON-7/Z-Lab image or nightly). MTP is fine on 0.19.0+.
  • DFlash KV cache stays BF16 — never combine with --kv-cache-dtype fp8. (Note: the MTP gemma4-coder draft is the opposite — it needs fp8 KV.)
  • One drafter per target — a DFlash drafter is trained for a specific model.
  • Higher K is cheap with DFlash (whole block in one pass), not with MTP — keep MTP num_speculative_tokens low.
  • Both hurt throughput under load — add --speculative-disable-by-batch-size 32.
  • Acceptance is content-dependent — big on code/structured output, smaller on prose, none on random text.

14. Sources

Compiled June 2026. Benchmark numbers are reproduced from each model’s Hugging Face card (NVFP4-vs-baseline tables, or vendor-reported gains). Container tags and model handles move fast — pin a known-good image digest for anything you depend on.