Complete Setup & Operations Guide

Everything needed to build, run, update, and operate local LLMs on an NVIDIA DGX Spark (GB10 / sm_121) with llama.cpp and the llm helper command.


1. How the pieces fit

The Spark (GB10). Blackwell GPU at compute capability 12.1 (sm_121), 128 GB unified LPDDR5x shared between CPU and GPU, ~273 GB/s memory bandwidth. Bandwidth is the bottleneck for token generation, so Mixture-of-Experts (MoE) models with few active parameters run far faster than dense models of the same total size. Prefer MoE.

llama.cpp is the inference engine. You compile it once for the GB10 into a binary called llama-server, which loads a model and exposes an OpenAI-compatible HTTP API on a port. Everything else (Hermes, Open WebUI, curl) just talks to that API.

llm is a small wrapper script around llama-server. It does not replace llama.cpp — it launches it with the right Spark-tuned flags baked in (so you don’t retype a dozen options each time) and adds conveniences: start/stop, list running servers, tail logs, quick test, measure tok/s, rebuild. It’s the control panel; llama.cpp is the engine.

GGUF is the only model format llama.cpp runs. Models come from Hugging Face as .gguf files, usually quantized (Q8_0, Q4_K_M, IQ4_XS, …).

The flow:

GGUF model  ->  llama-server (built for sm_121)  ->  OpenAI API on a port  ->  clients
                        ^                                                      (Hermes,
                  launched by `llm`                                            Open WebUI)

2. Pre-flight checks (before installing anything)

nvidia-smi --query-gpu=compute_cap --format=csv   # must be 12.1 (GB10 / sm_121)
nvcc --version                                     # must be CUDA 13.x
cmake --version                                    # 3.31+ recommended for the 121a build
git --version
  • compute_cap must be 12.1 -> you build for the GB10’s sm_121 (arch 121, or 121a for native FP4 — see 4.1). Building for 120 (RTX 50-series) causes no kernel image is available for execution on the device.
  • nvcc must be CUDA 13.x; older toolkits don’t know sm_121.

Known-good stack (NVIDIA DGX OS release notes, mid-2026): DGX OS 7.4.0, GPU driver 580.126.09, CUDA Toolkit 13.0.2, kernel 6.17. Anything at or above this is fine; if your nvcc predates CUDA 13, update the toolkit before building.


3. Install prerequisites (one time)

sudo apt update
sudo apt install -y git cmake build-essential libssl-dev libcurl4-openssl-dev jq
  • libssl-dev (+ libcurl) -> HTTPS so -hf can download models from Hugging Face.
  • jq -> used by llm test / llm speed to parse responses.

What hf is. hf is the Hugging Face CLI (the huggingface_hub command-line tool, formerly huggingface-cli). It authenticates to Hugging Face and downloads model repos/files to your local cache or a chosen folder. llama.cpp’s -hf flag has its own built-in downloader, so hf is only needed when you want to pre-download in the foreground (visible progress, resumable) or fetch gated/private models.

Install it globally with uv (cleanest — nothing to activate):

uv tool install "huggingface_hub[cli]"
hf version          # verify

Or, the NVIDIA-playbook way, into a venv:

python3 -m venv ~/llama-cpp-venv && source ~/llama-cpp-venv/bin/activate
pip install -U "huggingface_hub[cli]"
hf version

For gated/private repos, authenticate once: hf auth login (paste a token from huggingface.co/settings/tokens).


4. Build llama.cpp

cd ~
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121a \
  -DLLAMA_OPENSSL=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"

These compile-time flags are the only things baked into the binary:

flagwhat it does
GGML_CUDA=ONCUDA backend (required)
CMAKE_CUDA_ARCHITECTURES=121atarget the GB10; a = native FP4 (see 4.1)
LLAMA_OPENSSL=ONHTTPS for -hf downloads (needs libssl-dev)
GGML_CUDA_FA_ALL_QUANTS=ONflash-attention kernels for all quant/KV combos, so -fa on works with q8_0 KV
CMAKE_BUILD_TYPE=Releaseoptimized binary

Verify and put on PATH:

~/llama.cpp/build/bin/llama-server --version
ln -sf ~/llama.cpp/build/bin/llama-server ~/.local/bin/llama-server

Keep the source in ~/llama.cpp (not ~/.cache, which can be wiped).

Build flags vs runtime flags: the cmake flags above bake in capabilities. Performance/behavior flags (--no-mmap, -fa on, --batch-size, --threads, --ctx-size, --jinja, sampling) are applied at runtime on the llama-server command line — they’re built into the llm command, not the binary.

4.1 The 121 vs 121a build (the “a”)

Both target the same chip (compute capability 12.1). The difference is the instruction set the compiler may emit:

  • 121 — the standard, portable feature set. PTX can JIT forward onto future GPUs. Conservative, “always works.”
  • 121aa = architecture-specific features: the newest tensor-core MMA ops and native FP4 / microscaling matmul (MXFP4, NVFP4). The code is locked to sm_121 (won’t run on other GPUs), but on a dedicated Spark that’s irrelevant.

121a is faster only on models that use the hardware FP4 path — MXFP4- and NVFP4-quantized models:

  • GPT-OSS-120B / GPT-OSS-20B (natively MXFP4)
  • any *MXFP4* GGUF
  • NVFP4 GGUFs that load on stock

It makes no difference for standard quants (Q4_K, Q5_K, Q6_K, Q8_0, IQ4_XS) — they don’t touch FP4 instructions. The gain, where it exists, is mostly in prefill (compute-bound); decode is bandwidth-bound on the Spark regardless.

Recommendation: keep 121a. It’s a strict superset with no downside on a dedicated Spark, and gives free FP4 acceleration the day you run an MXFP4/NVFP4 model. A FP4-capable build shows BLACKWELL_NATIVE_FP4 = 1 in the server log. Use plain 121 only if you need a portable binary. Verify the binary’s arch:

cuobjdump ~/llama.cpp/build/bin/llama-server 2>/dev/null | grep -m1 'arch =' \
  || find ~/llama.cpp/build -name '*.so' -exec cuobjdump {} \; 2>/dev/null | grep -m1 'arch ='
# -> arch = sm_121a

If cmake rejects 121a, your CMake is too old: pip install --upgrade --break-system-packages cmake, or bypass with -DCMAKE_CUDA_ARCHITECTURES=OFF -DCMAKE_CUDA_FLAGS="-gencode arch=compute_121a,code=sm_121a".


5. Updating llama.cpp

New architectures/features land constantly (a vision projector or MTP arch your build rejects often starts working after a pull).

cd ~/llama.cpp
git pull
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121a \
  -DLLAMA_OPENSSL=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"

This is an incremental rebuild — only changed files recompile (a minute or two). llm update does exactly this in one command.

  • Clean rebuild (rm -rf build first) only when you change a build flag (e.g. 121->121a) or a pull throws stale-cache errors.
  • Roll back if master breaks: git -C ~/llama.cpp checkout <tag-or-commit> then rebuild; git checkout master to return.
  • Fork (Step-3.7): cd ~/stepfun-llama.cpp && git pull && cmake --build build --config Release -j"$(nproc)".

6. The llm command

The full script is in Appendix A (so this guide is self-contained). Save it as ~/.local/bin/llm and make it executable — no .bashrc editing, it just has to be on your PATH:

mkdir -p ~/.local/bin
# paste the Appendix A script into the file, e.g.:  nano ~/.local/bin/llm
chmod +x ~/.local/bin/llm
echo "$PATH" | grep -q "$HOME/.local/bin" || { echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc; source ~/.bashrc; }
llm help

How it pairs with llama.cpp: llm run launches ~/llama.cpp/build/bin/llama-server with the Spark-tuned flags below already applied, redirects output to ~/llama-<port>.log, and backgrounds it. The other subcommands manage those processes.

Baked-in runtime flags (per launch): --n-gpu-layers 999 -fa on --no-mmap --cache-type-k q8_0 --cache-type-v q8_0 --batch-size 2048 --ubatch-size 2048 --threads 20 --ctx-size 65536 --jinja.

commandwhat it does
llm run <model> [port] [flags]start a server (default port 8080)
llm stop <port | all>stop one server (waits for it to die), or all
llm psrunning servers (port, pid, model) + memory
llm lslist local models in ~/models
llm log <port>follow a server’s log
llm wait <port>block until the server is ready
llm test <port> [prompt]quick chat test
llm speed <port> [prompt]report prefill / decode tok/s
llm updategit pull + rebuild llama.cpp

Model argument resolution (the <model>), in order:

  1. is a file -> used as-is (path or shell-expanded glob) — e.g. ~/models/gemma-12b/*Q8_0.gguf
  2. has a : -> repo:quant, downloaded from Hugging Face — e.g. unsloth/gemma-4-12b-it-GGUF:Q8_0
  3. bare name -> looked up under ~/models — e.g. gemma-12b -> ~/models/gemma-12b/*.gguf (skips mmproj/MTP files, prefers the first shard; errors and lists them if the folder has more than one quant)
  4. has /, no local match -> treated as an HF repo id

Env overrides (set per launch): LLM_CTX=<n> (context size), LLM_BIN=<path> (different binary, e.g. the fork), LLM_ARCH=<121|121a> (used by llm update), LLM_MODELS=<dir> (model directory, default ~/models), LLM_WAIT_TIMEOUT=<s> (how long llm wait waits before giving up, default 600).

llm run also validates up front (port range, numeric LLM_CTX, that the binary exists) and, after launching, checks the process survived its first second — if it died immediately (bad path, port race, bad arg, OOM at init) it prints the last log lines instead of leaving you to discover an empty server later.


7. Running models

# the three input forms:
llm run gemma-12b 8080                              # bare name -> ~/models/gemma-12b/*.gguf
llm run unsloth/gemma-4-12b-it-GGUF:Q8_0 8080       # repo:quant -> download from HF
llm run ~/models/gemma-12b/*Q8_0.gguf 8080          # explicit path

# full lifecycle:
llm run gemma-12b 8080 --alias gemma --temp 1.0 --top-p 0.95 --top-k 64
llm wait 8080                                       # wait until ready
llm test 8080 "what is a DGX Spark?"                # functional check
llm speed 8080                                      # tok/s
llm ps                                              # see it (and others) + memory
llm stop 8080

# multiple models at once — one per port:
llm run gemma-12b 8080
llm run qwen3-coder 8081
llm ps

# context size per model (small models -> big ctx; big models -> modest):
LLM_CTX=131072 llm run gemma-12b 8080               # 128K
LLM_CTX=0      llm run gemma-12b 8080               # model's full trained window
LLM_CTX=32768  llm run step-3.7 8082                # keep modest for a ~100GB model

Anything after the port is passed straight to llama-server (sampling, --alias, --mmproj, -np, …).


7.1 A-to-Z worked examples (Hugging Face -> running)

Hugging Face model pages show a copy-paste command. Translate it to llm:

you see on HF / Ollamarun it here as
llama-server -hf <repo>:<quant>llm run <repo>:<quant> <port>
ollama run hf.co/<repo>:<quant>llm run <repo>:<quant> <port> (drop the hf.co/)

The -hf <repo>:<quant> on HF is exactly the colon form llm run accepts — just add a port.

Example A — official ggml-org Gemma 26B (Q4_K_M), the recommended path:

# HF shows:  llama-server -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M

hf download ggml-org/gemma-4-26B-A4B-it-GGUF --include "*Q4_K_M*" --local-dir ~/models/gemma-26b
llm run gemma-26b 8080 --alias gemma-26b --temp 1.0 --top-p 0.95 --top-k 64
llm wait 8080
llm test 8080 "Explain MoE models in two sentences."
llm speed 8080
# then point a client at  http://<spark-ip>:8080/v1

Example B — community fine-tune (Mellum2 coder, thinking), the quick path:

# HF shows:  llama-server -hf yuxinlu1/Mellum2-12B-A2.5B-Claude-4.6-4.8-Opus-Thinking-GGUF:Q8_0

llm run yuxinlu1/Mellum2-12B-A2.5B-Claude-4.6-4.8-Opus-Thinking-GGUF:Q8_0 8081 --alias mellum2
llm wait 8081          # the -hf download is silent; this blocks until it's actually ready
llm test 8081 "Write a Python function that reverses a linked list."

(A 2.5B-active MoE — tiny and fast. Thinking model, so keep generation generous.)

Example C — translating an Ollama command:

# Ollama:  ollama run hf.co/yuxinlu1/gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUF:Q8_0
# drop "hf.co/", keep repo:quant, add a port:
llm run yuxinlu1/gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUF:Q8_0 8082 --alias gemma12-opus

For Gemma repos, prefer the hf download --include "*<quant>*" route — its glob skips the mmproj-* vision file, avoiding the unsupported gemma4uv projector that the bare -hf form may try to auto-load.


8. The model directory & downloading

Models live in ~/models (override with LLM_MODELS), one folder per model:

~/models/
  gemma-12b/    gemma-4-12b-it-Q8_0.gguf
  gemma-26b/    gemma-4-26B-A4B-it-Q8_0.gguf   mmproj-...gguf
  step-3.7/     Step-3.7-flash-IQ4_XS-00001-of-00003.gguf   ...

llm ls lists them; bare-name llm run resolves into them.

Picking a quant (the :Q8_0 / Q4_K_M / IQ4_XS part). GGUF files are quantized to save memory. Reading the names: the number is bits per weight (lower = smaller, faster, slightly lower quality); K = modern “K-quant”; M/S = medium/small variant; IQ = imatrix quant (better quality per bit); MXFP4 / NVFP4 = 4-bit microscaling formats that the 121a build accelerates; F16/BF16 = full half-precision (largest). Practical picks on the Spark: Q8_0 when it fits and you want max quality, Q4_K_M or IQ4_XS to fit bigger models with little quality loss. Rough size ≈ (params × bits) / 8 — e.g. a 26B at Q8_0 ≈ 27 GB, at Q4_K_M ≈ 16 GB.

Two download syntaxes — don’t mix them:

toolsyntax
llama.cpp / llm runrepo:quant with a colon -> llm run unsloth/...-GGUF:Q8_0 8080
hf download (HF CLI)repo + --include "*pattern*", or an exact repo filename.gguf (no colon)

For new/large models, download in the foreground first (visible progress, clear error if the name is wrong, resumable), then serve the local file:

hf download unsloth/gemma-4-26B-A4B-it-GGUF --include "*Q8_0*" --local-dir ~/models/gemma-26b
llm run gemma-26b 8080

The direct repo:quant form downloads in the background, so progress does not show in the log — the log stays quiet until the download finishes. Watch progress with du -sh ~/.cache/huggingface/hub/models--<org>--<repo> (or look for *.incomplete blobs).

Where downloaded models actually live (this trips people up — there are two locations):

  • llm run <repo>:<quant> / -hf caches into the Hugging Face cache: ~/.cache/huggingface/hub/models--<org>--<repo>/ — the bytes are in blobs/, with readable filenames symlinked under snapshots/<hash>/. The model runs straight from there. Bare-name llm run does not see these (they’re in the cache, not ~/models).
  • hf download … --local-dir ~/models/<name> writes real files exactly where you point it — the ~/models/<name> layout this guide uses, which is what bare-name llm run resolves against.
  • Relocate the cache with export HF_HOME=/big/disk (or HF_HUB_CACHE) if your home partition is small — 27 GB+ files add up fast.

Practical rule: use hf download --local-dir ~/models/<name> when you want a tidy, predictable path and bare-name runs (llm run <name>); the repo:quant shortcut is convenient for one-offs, but the bytes land in ~/.cache/huggingface.


9. Per-model sampling

llama.cpp’s defaults aren’t tuned per model — set these (append to llm run, or in your client):

modelsampling
Gemma 4--temp 1.0 --top-p 0.95 --top-k 64 (high temp helps its reasoning)
Qwen3.x--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 (check card: thinking vs not)
Nemotron / Step-3.7low temp / near-greedy — see model card + reasoning level

Always grab the exact values from the model card.

Thinking / reasoning models (e.g. Gemma 4 with thinking = 1, Mellum2-Thinking, Nemotron/Step-3.7 reasoning modes) put their chain-of-thought in a separate reasoning_content field before the answer in .content. Two consequences: keep max_tokens generous (a tight cap can cut off the answer after the thinking), and if a reply looks empty, check reasoning_content too. --jinja (already on) is what makes the model’s thinking template work.


10. Wiring into clients

All servers are OpenAI-compatible at http://<spark-ip>:<port>/v1.

  • Hermes: hermes model -> “Custom endpoint” -> http://<spark-ip>:8080/v1 -> blank key -> pick the model.
  • Open WebUI: Admin -> Connections -> OpenAI API -> http://<spark-ip>:8080/v1 -> any key -> refresh. (Add as OpenAI API, never Ollama.)

One model per port = one connection. For tool-calling agents, --jinja (already on) is required.


11. Always-on model (systemd)

For a daily-driver model that survives reboots. ~/.config/systemd/user/llama.service:

[Unit]
Description=llama.cpp server
After=network-online.target

[Service]
ExecStart=%h/llama.cpp/build/bin/llama-server --model %h/models/gemma-12b/gemma-4-12b-it-Q8_0.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 999 -fa on --no-mmap --cache-type-k q8_0 --cache-type-v q8_0 --batch-size 2048 --ubatch-size 2048 --threads 20 --ctx-size 65536 --jinja
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target
systemctl --user daemon-reload
systemctl --user enable --now llama
sudo loginctl enable-linger "$USER"        # start at boot without login

Use systemd for the always-on model and llm for experimenting on other ports.


12. Measuring tok/s

  • Web UI (http://<ip>:<port>) shows tok/s under each response.
  • llm speed <port> prints prefill and decode tok/s.
  • Server log (llm log <port>) prints tokens per second after each request — the eval time (generation) line is decode speed.
  • llama-bench for clean, repeatable numbers without the server: ~/llama.cpp/build/bin/llama-bench -m <model.gguf> -p 512 -n 128 -fa 1 -> pp=prefill, tg=decode.

Want more decode tok/s? The biggest lever is speculative decoding, which is now native upstream (landed in recent llama.cpp). Two ways:

  • MTP / co-trained drafter (best): use a model’s multi-token-prediction head — the *-MTP.gguf files in the unsloth Qwen3.6 / Gemma repos — via --spec-type draft-mtp with --model-draft <mtp.gguf>. Co-trained drafters reach roughly 45–85% token acceptance, which can come close to doubling decode speed. (This is the same gemma-4-12b-it-Q8_0-MTP.gguf file that wouldn’t load as a standalone model — it’s a draft head, and this is what it’s for.)
  • Plain draft model: pair a small same-family model via --model-draft <small.gguf>.

Run llm update first if your build predates MTP support, and check ~/llama.cpp/build/bin/llama-server --help for the exact draft flags (this area is moving fast). It’s additive to everything above; since decode is bandwidth-bound on the Spark, this helps more than any single flag. See NVIDIA’s Speculative Decoding playbook (§17).


13. HTTPS & security

  • The LLAMA_OPENSSL=ON you built is for downloading models (-hf), already in use.
  • The server API runs plain http:// (running without SSL in the log). On a trusted LAN that’s correct — every client speaks HTTP to :<port>/v1.
  • For access control, add an API key: llm run gemma-12b 8080 --api-key SECRET.
  • Only put the API behind HTTPS if you expose it beyond the LAN — via a reverse proxy (Caddy/nginx terminating TLS) plus an API key, not per-server certs.

14. What you can / can’t run on one Spark

can’t runwhy
> ~200-240B total (DeepSeek V4, Kimi K2.x, GLM-4.7/5.1 ~355B, Llama-405B)won’t fit in 128 GB even at 4-bit
non-GGUF only (safetensors/FP8/AWQ/GPTQ/EXL2)llama.cpp is GGUF-only -> use vLLM/SGLang/TRT-LLM
unsupported archs (NemotronH, Step-3.7)need a fork (StepFun, etc.)
NVFP4-native GGUFshistorically needed a fork, but upstream NVFP4 (+ MTP) support has landed — llm update may now run them
certain vision projectors (Gemma gemma4uv)run text-only until llama.cpp adds support
large dense models (70B+)fit but decode at single-digit tok/s — bandwidth-bound

Rule of thumb: GGUF + supported architecture + <= ~200B total + ideally MoE. Ceiling is the ~120B-class MoEs (GPT-OSS-120B, Qwen3.5-122B), with Qwen3-235B-A22B at the edge.


15. Gotchas & troubleshooting

symptomcause / fix
HTTPS is not supported … rebuild with -DLLAMA_OPENSSL=ONbinary built without SSL -> rebuild (sec 4)
no kernel image is availablebuilt for wrong arch -> use 121/121a
couldn't bind HTTP server socket, port: Nport in use -> llm ps, ss -ltnp | grep :N, use another port
log shows only nohup: ignoring input-hf downloading silently (check HF cache size), OR you’re reading the wrong ~/llama-<port>.log
empty/wrong log on a new porteach port has its own log -> cat ~/llama-<thatport>.log
unknown model architecture: 'gemma4-assistant'that’s the MTP draft head, not a standalone model — on recent llama.cpp (llm update) use it for speculative decoding (--spec-type draft-mtp, see §12), else just run the plain quant
failed to load CLIP model … mmprojGemma gemma4uv vision unsupported -> run text-only (single-file run avoids it)
* still in the path in an errorglob matched no file -> wrong dir/quant or not downloaded
nohup: unrecognized option '--temp'stale llmrun function -> update it or use the llm script
Repo id must use alphanumeric … from hfused repo:quant colon with hf download -> use --include instead
two copies of a model loadedstarted the same model on two ports -> llm stop the duplicate
long / multi-line prompt -> shell syntax error near '('don’t paste it as a CLI arg (each newline runs as a command); pipe it instead: cat prompt.txt | llm test <port>, or use the web UI
llm ps shows mem: instead of gpu:normal on the GB10 — nvidia-smi reports VRAM as [N/A] on unified memory, so llm ps falls back to system (unified) RAM
token loops / gibberishKV below q8_0, or wrong sampling -> keep q8_0, set the model’s recommended sampling

KV cache: keep at q8_0 or higher. Never use sliding-window attention (--swa-full) with agents — it drops the system prompt. Don’t use -ot ...exps=CPU / --n-cpu-moe (those are for VRAM-limited GPUs; the Spark wants full GPU offload).


16. Quick reference (cheatsheet)

Daily use — the llm command

llm run <name|repo:quant|file.gguf> [port] [flags]   start a server (default :8080)
llm ps                                               running servers (port pid model) + memory (unified on GB10)
llm ls                                               local models in ~/models
llm wait  <port>                                     block until ready
llm test  <port> "prompt"                            quick chat test
llm speed <port>                                     prefill / decode tok/s
llm log   <port>                                     follow the log
llm stop  <port | all>                               stop one / stop everything
llm update                                           git pull + rebuild llama.cpp

Model argument (how run resolves it)

~/models/x/*Q8_0.gguf     -> that exact file                  (a path / glob)
unsloth/...-GGUF:Q8_0     -> download from Hugging Face       (has a colon)
gemma-12b                 -> ~/models/gemma-12b/*.gguf        (bare name)

Per-launch overrides

LLM_CTX=131072 llm run ...                       bigger context (default 65536; 0 = full window)
LLM_MODELS=/data llm run ...                     different model directory
LLM_BIN=~/stepfun.../llama-server llm run ...    different binary (fork)
llm run ... 8080 --temp 1.0 --top-p 0.95 --top-k 64 --alias name   append any llama-server flag

Download (two syntaxes — don’t mix)

llm run <repo>:<quant> <port>                                          colon form (background dl)
hf download <repo> --include "*<quant>*" --local-dir ~/models/<name>   hf CLI (foreground, progress)

Translate HF / Ollama copy-paste: llama-server -hf R:Q and ollama run hf.co/R:Q -> llm run R:Q <port>

Build / update llama.cpp

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121a \
  -DLLAMA_OPENSSL=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
# update:  llm update          clean rebuild:  rm -rf ~/llama.cpp/build && llm update

Sampling (append to llm run or set in your client)

Gemma 4   --temp 1.0 --top-p 0.95 --top-k 64
Qwen3.x   --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0

Clientshttp://<spark-ip>:<port>/v1 (Hermes: Custom endpoint; Open WebUI: OpenAI API, not Ollama)

Quick fixes

port in use            llm ps ; ss -ltnp | grep :PORT ; use another port
empty log              wrong ~/llama-<port>.log, or -hf still downloading (silent)
unknown architecture   llm update  (or the model needs a fork)
mmproj / gemma4uv fail run text-only; hf download --include "*<quant>*" skips the mmproj
'*' in error path      glob matched nothing (wrong dir / quant / not downloaded)
long/multi-line prompt pipe it: cat prompt.txt | llm test PORT  (don't paste as an arg)
gibberish / loops      keep KV at q8_0; set the model's recommended sampling

17. Further reading (NVIDIA DGX Spark playbooks)

Official playbooks that pair with this setup:

How this guide differs from NVIDIA’s example (and why). The official llama.cpp instructions are deliberately minimal: they build with -DLLAMA_CURL=OFF (no HTTPS, so they hf download separately) for arch 121, and run a bare server with only --n-gpu-layers 99 --ctx-size 8192 --threads 8. This guide keeps all of that working but adds, on purpose:

  • build: LLAMA_OPENSSL=ON (so -hf downloads directly), the 121a native-FP4 target, GGML_CUDA_FA_ALL_QUANTS=ON, and Release.
  • runtime: the Spark-tuned flags the minimal example omits — --no-mmap (effectively mandatory on unified memory), -fa on, q8_0 KV cache, --batch-size/ubatch-size 2048, --threads 20, and --jinja (for tool-calling agents).

None of these change behavior incompatibly; they’re performance/quality/convenience upgrades the official example leaves out for brevity. Keep them.


Appendix A — the llm script (full source)

Save this as ~/.local/bin/llm and chmod +x it (see section 6). This is the complete, hardened script — copy it verbatim and the guide is self-contained.

#!/usr/bin/env bash
# llm — manage llama.cpp servers on the DGX Spark (GB10 / sm_121)
# install:  cp llm ~/.local/bin/llm && chmod +x ~/.local/bin/llm
# (make sure ~/.local/bin is on your PATH)
#
# env overrides:  LLM_CTX=<n>  LLM_BIN=<path>  LLM_ARCH=<121|121a>  LLM_MODELS=<dir>  LLM_WAIT_TIMEOUT=<s>

BIN="${LLM_BIN:-$HOME/llama.cpp/build/bin/llama-server}"
CTX="${LLM_CTX:-65536}"
MODELS="${LLM_MODELS:-$HOME/models}"

die() { echo "llm: $*" >&2; exit 1; }

cmd="${1:-help}"; shift 2>/dev/null || true

case "$cmd" in

  run)
    m="${1:-}"; [[ -z "$m" ]] && die "usage: llm run <name | repo:quant | file.gguf> [port] [extra flags]"
    shift
    p=8080; [[ "${1:-}" =~ ^[0-9]+$ ]] && { p="$1"; shift; }

    [[ "$CTX" =~ ^[0-9]+$ ]] || die "LLM_CTX must be a number (got '$CTX')"
    [[ "$p" =~ ^[0-9]+$ ]]   || die "port must be a number (got '$p')"
    p=$((10#$p))                                   # normalise (strip leading zeros, force base 10)
    (( p >= 1 && p <= 65535 )) || die "port out of range: $p"
    [[ -x "$BIN" ]] || die "llama-server not found/executable at '$BIN' — build it (see playbook) or set LLM_BIN"

    # --- resolve the model argument (existing file wins, then repo:quant, then name lookup) ---
    if [[ -f "$m" ]]; then
      f="--model"                                          # an existing file (path or expanded glob)
    elif [[ "$m" == *:* ]]; then
      f="-hf"                                              # repo:quant -> download from HF
    else
      hit=""                                               # resolve a short name under $MODELS
      if [[ -d "$MODELS/$m" ]]; then
        cands=()
        for g in "$MODELS/$m"/*.gguf; do
          [[ -e "$g" ]] || continue
          [[ "$g" == *mmproj* || "$g" == *-MTP* ]] && continue                                                                  # skip vision / draft files
          [[ "$g" == *-[0-9][0-9][0-9][0-9][0-9]-of-[0-9][0-9][0-9][0-9][0-9].gguf && "$g" != *-00001-of-*.gguf ]] && continue  # keep only the first shard
          cands+=("$g")
        done
        if [[ ${#cands[@]} -gt 1 ]]; then
          echo "llm: multiple models in $MODELS/$m — pass the exact file you want:" >&2
          for c in "${cands[@]}"; do echo "  $MODELS/$m/$(basename "$c")" >&2; done
          exit 1
        fi
        [[ ${#cands[@]} -eq 1 ]] && hit="${cands[0]}"
      elif [[ -f "$MODELS/$m" ]]; then
        hit="$MODELS/$m"
      else
        for g in "$MODELS/$m"*.gguf; do [[ -e "$g" ]] && { hit="$g"; break; }; done
      fi
      if [[ -n "$hit" ]]; then
        m="$hit"; f="--model"
      elif [[ "$m" == */* ]]; then
        f="-hf"                                            # looks like an HF repo id, no local match
      else
        die "'$m' not found in $MODELS and not a repo:quant — try 'llm ls' or use repo:quant"
      fi
    fi

    if ss -ltnH 2>/dev/null | awk '{print $4}' | grep -qE ":$p$"; then
      die "port $p already in use — see 'llm ps' or pick another port"
    fi

    nohup "$BIN" $f "$m" \
      --host 0.0.0.0 --port "$p" \
      --n-gpu-layers 999 -fa on --no-mmap \
      --cache-type-k q8_0 --cache-type-v q8_0 \
      --batch-size 2048 --ubatch-size 2048 --threads 20 \
      --ctx-size "$CTX" --jinja \
      "$@" \
      >"$HOME/llama-$p.log" 2>&1 &
    pid=$!
    echo "started $(basename "$m") on :$p (pid $pid, ctx=$CTX)"

    # catch immediate failures (bad path, port race, arg error, OOM at init)
    sleep 1
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "  ERROR: process exited immediately — last log lines:" >&2
      tail -n 6 "$HOME/llama-$p.log" 2>/dev/null | sed 's/^/    /' >&2
      exit 1
    fi
    echo "  log: llm log $p  |  wait: llm wait $p  |  test: llm test $p  |  ui: http://localhost:$p"
    if [[ "$f" == "-hf" ]]; then
      echo "  note: downloading from HF in the background — progress does NOT show in the log;"
      echo "        watch ~/.cache/huggingface, or pre-download with 'hf download' for a progress bar."
    fi
    ;;

  stop)
    p="${1:-8080}"
    if [[ "$p" == all ]]; then
      pkill -f "llama-server" && echo "stopped all servers" || echo "no servers running"
      exit 0
    fi
    [[ "$p" =~ ^[0-9]+$ ]] || die "stop: give a port number or 'all' (got '$p')"
    # anchor the port so 'stop 80' can't match :8080 and 'stop 8080' can't match :18080
    pat="llama-server.*--port $p"'( |$)'
    pkill -f "$pat" || { echo "nothing running on :$p"; exit 0; }
    for _ in {1..20}; do pgrep -f "$pat" >/dev/null || break; sleep 0.5; done
    if pgrep -f "$pat" >/dev/null; then
      pkill -9 -f "$pat"; echo "force-killed :$p (it ignored SIGTERM)"
    else
      echo "stopped :$p"
    fi
    ;;

  log)
    p="${1:-8080}"
    [[ -f "$HOME/llama-$p.log" ]] || die "no log at ~/llama-$p.log — has a server run on :$p?"
    exec tail -f "$HOME/llama-$p.log"
    ;;

  ps)
    found=0
    for pid in $(pgrep -f llama-server 2>/dev/null); do
      args=$(tr '\0' ' ' < "/proc/$pid/cmdline" 2>/dev/null) || continue
      port=$(grep -oP -- '--port \K\S+' <<<"$args" || true)
      model=$(grep -oP -- '--model \K\S+' <<<"$args" || true); model="${model##*/}"
      [[ -z "$model" ]] && model=$(grep -oP -- '-hf \K\S+' <<<"$args" || true)
      printf "  :%-6s pid %-8s %s\n" "${port:-?}" "$pid" "${model:-?}"
      found=1
    done
    [[ "$found" -eq 0 ]] && echo "  (no llama-servers running)"
    # GB10 has unified memory; nvidia-smi often reports VRAM as [N/A], so fall back to system RAM
    mem=""
    if command -v nvidia-smi >/dev/null 2>&1; then
      mem=$(nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits 2>/dev/null \
            | awk -F', ' 'NR==1 && $1 ~ /^[0-9]+$/ {printf "gpu: %s / %s MiB used", $1, $2}')
    fi
    [[ -z "$mem" ]] && command -v free >/dev/null 2>&1 && \
      mem=$(free -m | awk '/^Mem:/ {printf "mem: %s / %s MiB used (unified)", $3, $2}')
    [[ -n "$mem" ]] && echo "  $mem"
    ;;

  wait)
    p="${1:-8080}"; t=0; max="${LLM_WAIT_TIMEOUT:-600}"
    [[ "$max" =~ ^[0-9]+$ ]] || max=600
    command -v curl >/dev/null 2>&1 || die "curl not installed (sudo apt install -y curl)"
    until curl -sf "http://localhost:$p/health" >/dev/null 2>&1; do
      sleep 2; t=$((t+2))
      [[ $t -ge $max ]] && die ":$p not ready after ${max}s — check 'llm log $p' (still downloading? crashed?)"
    done
    echo ":$p ready"
    ;;

  test)
    p=8080; [[ "${1:-}" =~ ^[0-9]+$ ]] && { p="$1"; shift; }
    command -v jq   >/dev/null 2>&1 || die "jq not installed (sudo apt install -y jq)"
    command -v curl >/dev/null 2>&1 || die "curl not installed (sudo apt install -y curl)"
    # prompt precedence: args > piped stdin > default
    if   [[ $# -gt 0 ]]; then prompt="$*"
    elif [[ ! -t 0 ]];   then prompt="$(cat)"
    else                      prompt="Say hello in one short sentence."; fi
    payload=$(jq -nc --arg c "$prompt" '{messages:[{role:"user",content:$c}],max_tokens:256}')
    resp=$(curl -s "http://localhost:$p/v1/chat/completions" -H "Content-Type: application/json" -d "$payload")
    [[ -z "$resp" ]] && die "no response from :$p — running? ('llm ps')   ready? ('llm wait $p')"
    echo "$resp" | jq -r '.choices[0].message.content // .error.message // .' 2>/dev/null || echo "$resp"
    ;;

  speed)
    p=8080; [[ "${1:-}" =~ ^[0-9]+$ ]] && { p="$1"; shift; }
    command -v jq   >/dev/null 2>&1 || die "jq not installed (sudo apt install -y jq)"
    command -v curl >/dev/null 2>&1 || die "curl not installed (sudo apt install -y curl)"
    # prompt precedence: args > piped stdin > default
    if   [[ $# -gt 0 ]]; then prompt="$*"
    elif [[ ! -t 0 ]];   then prompt="$(cat)"
    else                      prompt="Write a detailed paragraph about GPUs."; fi
    payload=$(jq -nc --arg c "$prompt" '{messages:[{role:"user",content:$c}],max_tokens:200}')
    resp=$(curl -s "http://localhost:$p/v1/chat/completions" -H "Content-Type: application/json" -d "$payload")
    [[ -z "$resp" ]] && die "no response from :$p — running? ('llm ps')   ready? ('llm wait $p')"
    echo "$resp" | jq -r 'if .timings then "prefill: \(.timings.prompt_per_second) tok/s   |   decode: \(.timings.predicted_per_second) tok/s" else (.error.message // "no timings in response") end' 2>/dev/null || echo "$resp"
    ;;

  ls)
    [[ -d "$MODELS" ]] || die "no models directory at $MODELS (set LLM_MODELS or create it)"
    echo "models in $MODELS:"
    n=0
    for d in "$MODELS"/*/;     do [[ -d "$d" ]] && { echo "  $(basename "$d")/"; n=1; }; done
    for g in "$MODELS"/*.gguf; do [[ -e "$g" ]] && { echo "  $(basename "$g")"; n=1; }; done
    [[ "$n" -eq 0 ]] && echo "  (empty)"
    ;;

  update)
    arch="${LLM_ARCH:-121a}"
    [[ -d "$HOME/llama.cpp/.git" ]] || die "no git repo at ~/llama.cpp — clone it first (see playbook)"
    cd "$HOME/llama.cpp" || die "cannot cd to ~/llama.cpp"
    git pull && \
    cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$arch" \
      -DLLAMA_OPENSSL=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release && \
    cmake --build build --config Release -j"$(nproc)" && \
    echo "updated -> $(git -C "$HOME/llama.cpp" log -1 --oneline)"
    ;;

  help|--help|-h|*)
    cat <<'EOF'
llm — manage llama.cpp servers on the DGX Spark

  llm run <name | repo:quant | file.gguf> [port] [extra llama-server flags]
  llm stop <port | all>
  llm ps                       running servers (port, pid, model) + memory
  llm ls                       list local models in $LLM_MODELS
  llm log  <port>              follow a server's log
  llm wait <port>              block until the server is ready (timeout LLM_WAIT_TIMEOUT, def 600s)
  llm test <port> [prompt...]  quick chat test (needs jq; prompt via args OR piped stdin)
  llm speed <port> [prompt...] report prefill / decode tok/s (needs jq; args OR stdin)
  llm update                   git pull + rebuild llama.cpp

the model argument resolves in this order:
  is a file  -> used as-is                     e.g. ~/models/gemma-12b/*Q8_0.gguf
  has a ':'  -> repo:quant, download from HF   e.g. unsloth/gemma-4-12b-it-GGUF:Q8_0
  bare name  -> looked up in $LLM_MODELS        e.g. gemma-12b -> ~/models/gemma-12b/*.gguf
                (skips mmproj/MTP, prefers first shard; errors if >1 quant present)
  has a '/' and no local match -> HF repo id

env overrides:  LLM_CTX=<n>   LLM_BIN=<path>   LLM_ARCH=<121|121a>   LLM_MODELS=<dir>   LLM_WAIT_TIMEOUT=<s>

examples:
  llm run gemma-12b 8080                              # resolves ~/models/gemma-12b/*.gguf
  llm run unsloth/gemma-4-12b-it-GGUF:Q8_0 8080 --alias gemma12b
  LLM_CTX=131072 llm run ~/models/gemma-12b/*Q8_0.gguf 8081
  llm ps
  llm test 8080 "what is a DGX Spark?"
  cat prompt.txt | llm test 8080                     # long / multi-line prompt via stdin
EOF
    ;;
esac

Appendix B — shell-function alternative

If you prefer shell functions to the llm script in Appendix A, the same behavior can live as llmrun / llmstop / llmlog / llmps / llmwait / llmtest / llmupdate functions in ~/.llama-helpers.sh, sourced from ~/.bashrc with one line:

echo '[ -f ~/.llama-helpers.sh ] && . ~/.llama-helpers.sh' >> ~/.bashrc

Same flags and behavior. Pick one approach to avoid confusion — the llm script is recommended (works in any shell, no sourcing, and what the rest of this guide assumes). A stale copy of the function in ~/.bashrc was the cause of several earlier surprises, so if you go the function route, keep it in the single sourced file above and re-source after edits.