Complete Setup & Operations Guide
Everything needed to build, run, update, and operate local LLMs on an NVIDIA DGX
Spark (GB10 / sm_121) with llama.cpp and the llm helper command.
1. How the pieces fit
The Spark (GB10). Blackwell GPU at compute capability 12.1 (sm_121), 128 GB
unified LPDDR5x shared between CPU and GPU, ~273 GB/s memory bandwidth. Bandwidth is the
bottleneck for token generation, so Mixture-of-Experts (MoE) models with few active
parameters run far faster than dense models of the same total size. Prefer MoE.
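As a back-of-the-envelope illustration of why active parameters matter: each generated token has to stream the active weights from memory once, so bandwidth divided by active-weight bytes gives a rough ceiling on decode speed. The sketch below is an estimate only. The function name decode_ceiling is hypothetical, and it ignores KV-cache reads, always-active shared layers, and runtime overhead, so measured numbers will be lower.

```shell
#!/usr/bin/env bash
# Rough upper bound on decode speed for a bandwidth-bound machine.
# decode_ceiling <bandwidth GB/s> <active params, billions> <bits per weight>
decode_ceiling() {
  awk -v bw="$1" -v p="$2" -v bits="$3" \
    'BEGIN { printf "%.1f\n", bw / (p * bits / 8) }'
}

decode_ceiling 273 70 8   # dense 70B at 8 bits per weight
decode_ceiling 273 4 4    # MoE with ~4B active parameters at 4 bits per weight
```

With the Spark's ~273 GB/s, the dense 70B case comes out at a ceiling of about 3.9 tok/s, against about 136.5 tok/s for the 4B-active case, which is the gap the "Prefer MoE" advice is pointing at.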
llama.cpp is the inference engine. You compile it once for the GB10 into a binary
called llama-server, which loads a model and exposes an OpenAI-compatible HTTP API
on a port. Everything else (Hermes, Open WebUI, curl) just talks to that API.
llm is a small wrapper script around llama-server. It does not replace
llama.cpp — it launches it with the right Spark-tuned flags baked in (so you don’t retype
a dozen options each time) and adds conveniences: start/stop, list running servers, tail
logs, quick test, measure tok/s, rebuild. It’s the control panel; llama.cpp is the engine.
GGUF is the only model format llama.cpp runs. Models come from Hugging Face as .gguf
files, usually quantized (Q8_0, Q4_K_M, IQ4_XS, …).
The flow:
GGUF model  ->  llama-server (built for sm_121)  ->  OpenAI API on a port  ->  clients
                      ^                                                        (Hermes,
              launched by `llm`                                                 Open WebUI)
2. Pre-flight checks (before installing anything)
nvidia-smi --query-gpu=compute_cap --format=csv # must be 12.1 (GB10 / sm_121)
nvcc --version # must be CUDA 13.x
cmake --version # 3.31+ recommended for the 121a build
git --version
- compute_cap must be 12.1 -> you build for the GB10’s sm_121 (arch 121, or 121a for native FP4; see 4.1). Building for 120 (RTX 50-series) fails at runtime with “no kernel image is available for execution on the device”.
- nvcc must be CUDA 13.x; older toolkits don’t know sm_121.
Known-good stack (NVIDIA DGX OS release notes, mid-2026): DGX OS 7.4.0, GPU driver
580.126.09, CUDA Toolkit 13.0.2, kernel 6.17. Anything at or above this is fine; if your
nvcc predates CUDA 13, update the toolkit before building.
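The version checks above can be scripted. This is a minimal sketch, assuming GNU sort (for -V) is available, as it is on Ubuntu-based systems; version_ge is a hypothetical helper name, and the usage lines are commented out because they need nvcc and cmake installed.

```shell
#!/usr/bin/env bash
# version_ge A B -> succeeds when version A >= version B (needs GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Pre-flight usage (commented out so the sketch has no side effects):
# cuda=$(nvcc --version | sed -n 's/.*release \([0-9.]*\).*/\1/p')
# version_ge "$cuda" 13.0 || echo "CUDA $cuda is too old for sm_121"
# cm=$(cmake --version | awk 'NR==1 {print $3}')
# version_ge "$cm" 3.31 || echo "cmake $cm may reject the 121a target"
```

Comparing with sort -V rather than as plain strings keeps 3.9 from sorting above 3.31.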
3. Install prerequisites (one time)
sudo apt update
sudo apt install -y git cmake build-essential libssl-dev libcurl4-openssl-dev jq
- libssl-dev (+ libcurl) -> HTTPS, so -hf can download models from Hugging Face.
- jq -> used by llm test / llm speed to parse responses.
What hf is. hf is the Hugging Face CLI (the huggingface_hub command-line
tool, formerly huggingface-cli). It authenticates to Hugging Face and downloads model
repos/files to your local cache or a chosen folder. llama.cpp’s -hf flag has its own
built-in downloader, so hf is only needed when you want to pre-download in the foreground
(visible progress, resumable) or fetch gated/private models.
Install it globally with uv (cleanest — nothing to activate):
uv tool install "huggingface_hub[cli]"
hf version # verify
Or, the NVIDIA-playbook way, into a venv:
python3 -m venv ~/llama-cpp-venv && source ~/llama-cpp-venv/bin/activate
pip install -U "huggingface_hub[cli]"
hf version
For gated/private repos, authenticate once: hf auth login (paste a token from
huggingface.co/settings/tokens).
4. Build llama.cpp
cd ~
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES=121a \
-DLLAMA_OPENSSL=ON \
-DGGML_CUDA_FA_ALL_QUANTS=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
These compile-time flags are the only things baked into the binary:
| flag | what it does |
|---|---|
| GGML_CUDA=ON | CUDA backend (required) |
| CMAKE_CUDA_ARCHITECTURES=121a | target the GB10; a = native FP4 (see 4.1) |
| LLAMA_OPENSSL=ON | HTTPS for -hf downloads (needs libssl-dev) |
| GGML_CUDA_FA_ALL_QUANTS=ON | flash-attention kernels for all quant/KV combos, so -fa on works with q8_0 KV |
| CMAKE_BUILD_TYPE=Release | optimized binary |
Verify and put on PATH:
~/llama.cpp/build/bin/llama-server --version
mkdir -p ~/.local/bin   # may not exist yet on a fresh install
ln -sf ~/llama.cpp/build/bin/llama-server ~/.local/bin/llama-server
Keep the source in ~/llama.cpp (not ~/.cache, which can be wiped).
Build flags vs runtime flags: the cmake flags above bake in capabilities. Performance/behavior flags (--no-mmap, -fa on, --batch-size, --threads, --ctx-size, --jinja, sampling) are applied at runtime on the llama-server command line; they’re built into the llm command, not the binary.
4.1 The 121 vs 121a build (the “a”)
Both target the same chip (compute capability 12.1). The difference is the instruction set the compiler may emit:
- 121: the standard, portable feature set. PTX can JIT forward onto future GPUs. Conservative, “always works.”
- 121a: the a enables architecture-specific features, namely the newest tensor-core MMA ops and native FP4 / microscaling matmul (MXFP4, NVFP4). The code is locked to sm_121 (won’t run on other GPUs), but on a dedicated Spark that’s irrelevant.
121a is faster only on models that use the hardware FP4 path — MXFP4- and
NVFP4-quantized models:
- GPT-OSS-120B / GPT-OSS-20B (natively MXFP4)
- any *MXFP4* GGUF
- NVFP4 GGUFs that load on stock llama.cpp
It makes no difference for standard quants (Q4_K, Q5_K, Q6_K, Q8_0, IQ4_XS) — they don’t touch FP4 instructions. The gain, where it exists, is mostly in prefill (compute-bound); decode is bandwidth-bound on the Spark regardless.
Recommendation: keep 121a. It’s a strict superset with no downside on a dedicated
Spark, and gives free FP4 acceleration the day you run an MXFP4/NVFP4 model. An FP4-capable
build shows BLACKWELL_NATIVE_FP4 = 1 in the server log. Use plain 121 only if you need
a portable binary. Verify the binary’s arch:
cuobjdump ~/llama.cpp/build/bin/llama-server 2>/dev/null | grep -m1 'arch =' \
|| find ~/llama.cpp/build -name '*.so' -exec cuobjdump {} \; 2>/dev/null | grep -m1 'arch ='
# -> arch = sm_121a
If cmake rejects 121a, your CMake is too old: pip install --upgrade --break-system-packages cmake,
or bypass with -DCMAKE_CUDA_ARCHITECTURES=OFF -DCMAKE_CUDA_FLAGS="-gencode arch=compute_121a,code=sm_121a".
5. Updating llama.cpp
New architectures/features land constantly (a vision projector or MTP arch your build rejects often starts working after a pull).
cd ~/llama.cpp
git pull
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121a \
-DLLAMA_OPENSSL=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
This is an incremental rebuild — only changed files recompile (a minute or two).
llm update does exactly this in one command.
- Clean rebuild (rm -rf build first) only when you change a build flag (e.g. 121 -> 121a) or a pull throws stale-cache errors.
- Roll back if master breaks: git -C ~/llama.cpp checkout <tag-or-commit>, then rebuild; git checkout master to return.
- Fork (Step-3.7): cd ~/stepfun-llama.cpp && git pull && cmake --build build --config Release -j"$(nproc)".
6. The llm command
The full script is in Appendix A (so this guide is self-contained). Save it as
~/.local/bin/llm and make it executable — no .bashrc editing, it just has to be on
your PATH:
mkdir -p ~/.local/bin
# paste the Appendix A script into the file, e.g.: nano ~/.local/bin/llm
chmod +x ~/.local/bin/llm
echo "$PATH" | grep -q "$HOME/.local/bin" || { echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc; source ~/.bashrc; }
llm help
How it pairs with llama.cpp: llm run launches ~/llama.cpp/build/bin/llama-server
with the Spark-tuned flags below already applied, redirects output to ~/llama-<port>.log,
and backgrounds it. The other subcommands manage those processes.
Baked-in runtime flags (per launch): --n-gpu-layers 999 -fa on --no-mmap --cache-type-k q8_0 --cache-type-v q8_0 --batch-size 2048 --ubatch-size 2048 --threads 20 --ctx-size 65536 --jinja.
| command | what it does |
|---|---|
| llm run <model> [port] [flags] | start a server (default port 8080) |
| llm stop <port \| all> | stop one server (waits for it to die), or all |
| llm ps | running servers (port, pid, model) + memory |
| llm ls | list local models in ~/models |
| llm log <port> | follow a server’s log |
| llm wait <port> | block until the server is ready |
| llm test <port> [prompt] | quick chat test |
| llm speed <port> [prompt] | report prefill / decode tok/s |
| llm update | git pull + rebuild llama.cpp |
Model argument resolution (the <model>), in order:
- is a file -> used as-is (path or shell-expanded glob), e.g. ~/models/gemma-12b/*Q8_0.gguf
- has a : -> repo:quant, downloaded from Hugging Face, e.g. unsloth/gemma-4-12b-it-GGUF:Q8_0
- bare name -> looked up under ~/models, e.g. gemma-12b -> ~/models/gemma-12b/*.gguf (skips mmproj / MTP files, prefers the first shard; errors and lists them if the folder has more than one quant)
- has /, no local match -> treated as an HF repo id
Env overrides (set per launch): LLM_CTX=<n> (context size), LLM_BIN=<path>
(different binary, e.g. the fork), LLM_ARCH=<121|121a> (used by llm update),
LLM_MODELS=<dir> (model directory, default ~/models), LLM_WAIT_TIMEOUT=<s>
(how long llm wait waits before giving up, default 600).
llm run also validates up front (port range, numeric LLM_CTX, that the binary
exists) and, after launching, checks the process survived its first second — if it
died immediately (bad path, port race, bad arg, OOM at init) it prints the last log lines
instead of leaving you to discover an empty server later.
7. Running models
# the three input forms:
llm run gemma-12b 8080 # bare name -> ~/models/gemma-12b/*.gguf
llm run unsloth/gemma-4-12b-it-GGUF:Q8_0 8080 # repo:quant -> download from HF
llm run ~/models/gemma-12b/*Q8_0.gguf 8080 # explicit path
# full lifecycle:
llm run gemma-12b 8080 --alias gemma --temp 1.0 --top-p 0.95 --top-k 64
llm wait 8080 # wait until ready
llm test 8080 "what is a DGX Spark?" # functional check
llm speed 8080 # tok/s
llm ps # see it (and others) + memory
llm stop 8080
# multiple models at once — one per port:
llm run gemma-12b 8080
llm run qwen3-coder 8081
llm ps
# context size per model (small models -> big ctx; big models -> modest):
LLM_CTX=131072 llm run gemma-12b 8080 # 128K
LLM_CTX=0 llm run gemma-12b 8080 # model's full trained window
LLM_CTX=32768 llm run step-3.7 8082 # keep modest for a ~100GB model
Anything after the port is passed straight to llama-server (sampling, --alias,
--mmproj, -np, …).
7.1 A-to-Z worked examples (Hugging Face -> running)
Hugging Face model pages show a copy-paste command. Translate it to llm:
| you see on HF / Ollama | run it here as |
|---|---|
| llama-server -hf <repo>:<quant> | llm run <repo>:<quant> <port> |
| ollama run hf.co/<repo>:<quant> | llm run <repo>:<quant> <port> (drop the hf.co/) |
The -hf <repo>:<quant> on HF is exactly the colon form llm run accepts — just add a port.
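The translation rule is mechanical enough to express as a small helper. This is only an illustrative sketch: to_llm is a hypothetical function, not part of the llm script, and it recognises just the two copy-paste forms listed in the table.

```shell
#!/usr/bin/env bash
# to_llm "<copied command>" [port] -> prints the equivalent llm run line.
# Recognises only the llama-server -hf and ollama run hf.co/ forms; anything else is an error.
to_llm() {
  local cmd="$1" port="${2:-8080}" spec
  case "$cmd" in
    "llama-server -hf "*) spec="${cmd#llama-server -hf }" ;;   # strip the command prefix
    "ollama run hf.co/"*) spec="${cmd#ollama run hf.co/}" ;;   # strip the prefix and hf.co/
    *) echo "to_llm: unrecognised command" >&2; return 1 ;;
  esac
  echo "llm run $spec $port"
}

to_llm "llama-server -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M" 8080
```

The call at the end prints llm run ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M 8080; the repo:quant part passes through untouched in both forms.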
Example A — official ggml-org Gemma 26B (Q4_K_M), the recommended path:
# HF shows: llama-server -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M
hf download ggml-org/gemma-4-26B-A4B-it-GGUF --include "*Q4_K_M*" --local-dir ~/models/gemma-26b
llm run gemma-26b 8080 --alias gemma-26b --temp 1.0 --top-p 0.95 --top-k 64
llm wait 8080
llm test 8080 "Explain MoE models in two sentences."
llm speed 8080
# then point a client at http://<spark-ip>:8080/v1
Example B — community fine-tune (Mellum2 coder, thinking), the quick path:
# HF shows: llama-server -hf yuxinlu1/Mellum2-12B-A2.5B-Claude-4.6-4.8-Opus-Thinking-GGUF:Q8_0
llm run yuxinlu1/Mellum2-12B-A2.5B-Claude-4.6-4.8-Opus-Thinking-GGUF:Q8_0 8081 --alias mellum2
llm wait 8081 # the -hf download is silent; this blocks until it's actually ready
llm test 8081 "Write a Python function that reverses a linked list."
(A 2.5B-active MoE — tiny and fast. Thinking model, so keep generation generous.)
Example C — translating an Ollama command:
# Ollama: ollama run hf.co/yuxinlu1/gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUF:Q8_0
# drop "hf.co/", keep repo:quant, add a port:
llm run yuxinlu1/gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUF:Q8_0 8082 --alias gemma12-opus
For Gemma repos, prefer the hf download --include "*<quant>*" route — its glob skips
the mmproj-* vision file, avoiding the unsupported gemma4uv projector that the bare
-hf form may try to auto-load.
8. The model directory & downloading
Models live in ~/models (override with LLM_MODELS), one folder per model:
~/models/
  gemma-12b/   gemma-4-12b-it-Q8_0.gguf
  gemma-26b/   gemma-4-26B-A4B-it-Q8_0.gguf   mmproj-...gguf
  step-3.7/    Step-3.7-flash-IQ4_XS-00001-of-00003.gguf ...
llm ls lists them; bare-name llm run resolves into them.
Picking a quant (the :Q8_0 / Q4_K_M / IQ4_XS part). GGUF files are quantized to
save memory. Reading the names: the number is the nominal bits per weight (lower = smaller,
faster, slightly lower quality); K = modern “K-quant”; M/S = medium/small variant; IQ =
“i-quant”, a newer family usually built with an importance matrix (better quality per bit);
MXFP4 / NVFP4 = 4-bit microscaling formats that the 121a build accelerates; F16/BF16 =
unquantized 16-bit half precision (largest). Practical picks on the Spark: Q8_0 when it fits
and you want max quality, Q4_K_M or IQ4_XS to fit bigger models with little quality loss.
Rough size in GB ≈ (params in billions × bits) / 8; e.g. a 26B at Q8_0 ≈ 27 GB, at Q4_K_M ≈ 16 GB.
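The size rule of thumb can be checked with a one-line calculation. In this sketch gguf_size_gb is a hypothetical helper, and the bits-per-weight inputs are effective values: Q8_0 stores 8.5 bits per weight once the per-block scale is counted, while the 4.85 used for Q4_K_M is an approximate, assumed figure. Real files also carry metadata and some tensors kept at higher precision.

```shell
#!/usr/bin/env bash
# gguf_size_gb <params, billions> <effective bits per weight> -> approximate size in GB
gguf_size_gb() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", p * bits / 8 }'
}

gguf_size_gb 26 8.5    # Q8_0: 8-bit values plus one 16-bit scale per 32-weight block
gguf_size_gb 26 4.85   # Q4_K_M: ~4.85 bits per weight (assumed, approximate)
```

These print 27.6 and 15.8, in line with the ≈ 27 GB and ≈ 16 GB quoted for a 26B model.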
Two download syntaxes — don’t mix them:
| tool | syntax |
|---|---|
| llama.cpp / llm run | repo:quant with a colon -> llm run unsloth/...-GGUF:Q8_0 8080 |
| hf download (HF CLI) | repo + --include "*pattern*", or an exact repo filename.gguf (no colon) |
For new/large models, download in the foreground first (visible progress, clear error if the name is wrong, resumable), then serve the local file:
hf download unsloth/gemma-4-26B-A4B-it-GGUF --include "*Q8_0*" --local-dir ~/models/gemma-26b
llm run gemma-26b 8080
The direct repo:quant form downloads in the background, so progress does not show
in the log — the log stays quiet until the download finishes. Watch progress with
du -sh ~/.cache/huggingface/hub/models--<org>--<repo> (or look for *.incomplete blobs).
Where downloaded models actually live (this trips people up — there are two locations):
- llm run <repo>:<quant> / -hf caches into the Hugging Face cache: ~/.cache/huggingface/hub/models--<org>--<repo>/. The bytes are in blobs/, with readable filenames symlinked under snapshots/<hash>/. The model runs straight from there. Bare-name llm run does not see these (they’re in the cache, not ~/models).
- hf download … --local-dir ~/models/<name> writes real files exactly where you point it: the ~/models/<name> layout this guide uses, which is what bare-name llm run resolves against.
- Relocate the cache with export HF_HOME=/big/disk (or HF_HUB_CACHE) if your home partition is small; 27 GB+ files add up fast.
Practical rule: use hf download --local-dir ~/models/<name> when you want a tidy,
predictable path and bare-name runs (llm run <name>); the repo:quant shortcut is
convenient for one-offs, but the bytes land in ~/.cache/huggingface.
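To find where a repo:quant download landed without typing the models--<org>--<repo> path by hand, a small helper can build it. This is a sketch under the cache layout described above; hf_cache_dir is a hypothetical name, and the usage lines are commented out because they only make sense once a download exists.

```shell
#!/usr/bin/env bash
# hf_cache_dir <org>/<repo> -> directory the Hugging Face hub cache uses for that model.
# Honors HF_HUB_CACHE first, then HF_HOME, then the default ~/.cache/huggingface.
hf_cache_dir() {
  local hub="${HF_HUB_CACHE:-${HF_HOME:-$HOME/.cache/huggingface}/hub}"
  echo "$hub/models--${1//\//--}"   # every "/" in the repo id becomes "--"
}

# Usage while a repo:quant download is running (commented out: needs the cache to exist):
# du -sh "$(hf_cache_dir unsloth/gemma-4-26B-A4B-it-GGUF)"
# find "$(hf_cache_dir unsloth/gemma-4-26B-A4B-it-GGUF)" -name '*.incomplete'
```

Because the helper reads the same environment variables as the cache itself, it keeps pointing at the right place after the cache is relocated.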
9. Per-model sampling
llama.cpp’s defaults aren’t tuned per model — set these (append to llm run, or in your client):
| model | sampling |
|---|---|
| Gemma 4 | --temp 1.0 --top-p 0.95 --top-k 64 (high temp helps its reasoning) |
| Qwen3.x | --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 (check card: thinking vs not) |
| Nemotron / Step-3.7 | low temp / near-greedy — see model card + reasoning level |
Always grab the exact values from the model card.
Thinking / reasoning models (e.g. Gemma 4 with thinking = 1, Mellum2-Thinking,
Nemotron/Step-3.7 reasoning modes) put their chain-of-thought in a separate
reasoning_content field before the answer in .content. Two consequences: keep
max_tokens generous (a tight cap can cut off the answer after the thinking), and if a
reply looks empty, check reasoning_content too. --jinja (already on) is what makes the
model’s thinking template work.
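When checking replies from a thinking model by hand, it helps to print the answer and fall back to the reasoning text if the answer is empty. The sketch below assumes jq is installed and that the response has the .content / reasoning_content layout described above; show_reply is a hypothetical helper and the sample JSON is made up for illustration.

```shell
#!/usr/bin/env bash
# show_reply: read a chat-completions JSON response on stdin and print the answer;
# if .content is empty, print the reasoning text instead so the reply is not mistaken for blank.
show_reply() {
  jq -r '.choices[0].message
         | if (.content // "") != "" then .content
           else "[no answer yet; reasoning only] " + (.reasoning_content // "")
           end'
}

# Made-up sample where a tight max_tokens cut the reply off during the thinking phase:
echo '{"choices":[{"message":{"content":"","reasoning_content":"Let me think..."}}]}' | show_reply
```

Seeing the "reasoning only" marker is a hint to raise max_tokens rather than a sign that the server returned nothing.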
10. Wiring into clients
All servers are OpenAI-compatible at http://<spark-ip>:<port>/v1.
- Hermes: hermes model -> “Custom endpoint” -> http://<spark-ip>:8080/v1 -> blank key -> pick the model.
- Open WebUI: Admin -> Connections -> OpenAI API -> http://<spark-ip>:8080/v1 -> any key -> refresh. (Add as OpenAI API, never Ollama.)
One model per port = one connection. For tool-calling agents, --jinja (already on) is required.
11. Always-on model (systemd)
For a daily-driver model that survives reboots, create ~/.config/systemd/user/llama.service:
[Unit]
Description=llama.cpp server
After=network-online.target
[Service]
ExecStart=%h/llama.cpp/build/bin/llama-server --model %h/models/gemma-12b/gemma-4-12b-it-Q8_0.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 999 -fa on --no-mmap --cache-type-k q8_0 --cache-type-v q8_0 --batch-size 2048 --ubatch-size 2048 --threads 20 --ctx-size 65536 --jinja
Restart=on-failure
RestartSec=5
[Install]
WantedBy=default.target
systemctl --user daemon-reload
systemctl --user enable --now llama
sudo loginctl enable-linger "$USER" # start at boot without login
Use systemd for the always-on model and llm for experimenting on other ports.
12. Measuring tok/s
- Web UI (http://<ip>:<port>) shows tok/s under each response.
- llm speed <port> prints prefill and decode tok/s.
- Server log (llm log <port>) prints tokens per second after each request; the eval time (generation) line is decode speed.
- llama-bench for clean, repeatable numbers without the server: ~/llama.cpp/build/bin/llama-bench -m <model.gguf> -p 512 -n 128 -fa 1 -> pp = prefill, tg = decode.
Want more decode tok/s? The biggest lever is speculative decoding, which is now native upstream (landed in recent llama.cpp). Two ways:
- MTP / co-trained drafter (best): use a model’s multi-token-prediction head (the *-MTP.gguf files in the unsloth Qwen3.6 / Gemma repos) via --spec-type draft-mtp with --model-draft <mtp.gguf>. Co-trained drafters reach roughly 45–85% token acceptance, which can come close to doubling decode speed. (This is the same gemma-4-12b-it-Q8_0-MTP.gguf file that wouldn’t load as a standalone model: it’s a draft head, and this is what it’s for.)
- Plain draft model: pair a small same-family model via --model-draft <small.gguf>.
Run llm update first if your build predates MTP support, and check
~/llama.cpp/build/bin/llama-server --help for the exact draft flags (this area is moving
fast). It’s additive to everything above; since decode is bandwidth-bound on the Spark,
this helps more than any single flag. See NVIDIA’s Speculative Decoding playbook (§17).
13. HTTPS & security
- The LLAMA_OPENSSL=ON you built is for downloading models (-hf), already in use.
- The server API runs plain http:// (running without SSL in the log). On a trusted LAN that’s correct: every client speaks HTTP to :<port>/v1.
- For access control, add an API key: llm run gemma-12b 8080 --api-key SECRET.
- Only put the API behind HTTPS if you expose it beyond the LAN, via a reverse proxy (Caddy/nginx terminating TLS) plus an API key, not per-server certs.
14. What you can / can’t run on one Spark
| can’t run | why |
|---|---|
| > ~200-240B total (DeepSeek V4, Kimi K2.x, GLM-4.7/5.1 ~355B, Llama-405B) | won’t fit in 128 GB even at 4-bit |
| non-GGUF only (safetensors/FP8/AWQ/GPTQ/EXL2) | llama.cpp is GGUF-only -> use vLLM/SGLang/TRT-LLM |
| unsupported archs (NemotronH, Step-3.7) | need a fork (StepFun, etc.) |
| NVFP4-native GGUFs | historically needed a fork, but upstream NVFP4 (+ MTP) support has landed — llm update may now run them |
| certain vision projectors (Gemma gemma4uv) | run text-only until llama.cpp adds support |
| large dense models (70B+) | fit but decode at single-digit tok/s — bandwidth-bound |
Rule of thumb: GGUF + supported architecture + <= ~200B total + ideally MoE. Ceiling is the ~120B-class MoEs (GPT-OSS-120B, Qwen3.5-122B), with Qwen3-235B-A22B at the edge.
15. Gotchas & troubleshooting
| symptom | cause / fix |
|---|---|
| HTTPS is not supported … rebuild with -DLLAMA_OPENSSL=ON | binary built without SSL -> rebuild (sec 4) |
| no kernel image is available | built for wrong arch -> use 121/121a |
| couldn't bind HTTP server socket, port: N | port in use -> llm ps, ss -ltnp \| grep :N, use another port |
| log shows only nohup: ignoring input | -hf downloading silently (check HF cache size), OR you’re reading the wrong ~/llama-<port>.log |
| empty/wrong log on a new port | each port has its own log -> cat ~/llama-<thatport>.log |
| unknown model architecture: 'gemma4-assistant' | that’s the MTP draft head, not a standalone model; on recent llama.cpp (llm update) use it for speculative decoding (--spec-type draft-mtp, see §12), else just run the plain quant |
| failed to load CLIP model … mmproj | Gemma gemma4uv vision unsupported -> run text-only (single-file run avoids it) |
| * still in the path in an error | glob matched no file -> wrong dir/quant or not downloaded |
| nohup: unrecognized option '--temp' | stale llmrun function -> update it or use the llm script |
| Repo id must use alphanumeric … from hf | used repo:quant colon with hf download -> use --include instead |
| two copies of a model loaded | started the same model on two ports -> llm stop the duplicate |
| long / multi-line prompt -> shell syntax error near '(' | don’t paste it as a CLI arg (each newline runs as a command); pipe it instead: cat prompt.txt \| llm test <port>, or use the web UI |
| llm ps shows mem: instead of gpu: | normal on the GB10: nvidia-smi reports VRAM as [N/A] on unified memory, so llm ps falls back to system (unified) RAM |
| token loops / gibberish | KV below q8_0, or wrong sampling -> keep q8_0, set the model’s recommended sampling |
KV cache: keep at q8_0 or higher. Never use sliding-window attention (--swa-full) with
agents — it drops the system prompt. Don’t use -ot ...exps=CPU / --n-cpu-moe (those are
for VRAM-limited GPUs; the Spark wants full GPU offload).
16. Quick reference (cheatsheet)
Daily use — the llm command
llm run <name|repo:quant|file.gguf> [port] [flags]   start a server (default :8080)
llm ps                                               running servers (port pid model) + memory (unified on GB10)
llm ls                                               local models in ~/models
llm wait <port>                                      block until ready
llm test <port> "prompt"                             quick chat test
llm speed <port>                                     prefill / decode tok/s
llm log <port>                                       follow the log
llm stop <port | all>                                stop one / stop everything
llm update                                           git pull + rebuild llama.cpp
Model argument (how run resolves it)
~/models/x/*Q8_0.gguf -> that exact file (a path / glob)
unsloth/...-GGUF:Q8_0 -> download from Hugging Face (has a colon)
gemma-12b -> ~/models/gemma-12b/*.gguf (bare name)
Per-launch overrides
LLM_CTX=131072 llm run ...                                         bigger context (default 65536; 0 = full window)
LLM_MODELS=/data llm run ...                                       different model directory
LLM_BIN=~/stepfun.../llama-server llm run ...                      different binary (fork)
llm run ... 8080 --temp 1.0 --top-p 0.95 --top-k 64 --alias name   append any llama-server flag
Download (two syntaxes — don’t mix)
llm run <repo>:<quant> <port>                                          colon form (background dl)
hf download <repo> --include "*<quant>*" --local-dir ~/models/<name>   hf CLI (foreground, progress)
Translate HF / Ollama copy-paste: llama-server -hf R:Q and ollama run hf.co/R:Q -> llm run R:Q <port>
Build / update llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121a \
-DLLAMA_OPENSSL=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
# update: llm update clean rebuild: rm -rf ~/llama.cpp/build && llm update
Sampling (append to llm run or set in your client)
Gemma 4 --temp 1.0 --top-p 0.95 --top-k 64
Qwen3.x --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0
Clients — http://<spark-ip>:<port>/v1 (Hermes: Custom endpoint; Open WebUI: OpenAI API, not Ollama)
Quick fixes
port in use              llm ps ; ss -ltnp | grep :PORT ; use another port
empty log                wrong ~/llama-<port>.log, or -hf still downloading (silent)
unknown architecture     llm update (or the model needs a fork)
mmproj / gemma4uv fail   run text-only; hf download --include "*<quant>*" skips the mmproj
'*' in error path        glob matched nothing (wrong dir / quant / not downloaded)
long/multi-line prompt   pipe it: cat prompt.txt | llm test PORT (don't paste as an arg)
gibberish / loops        keep KV at q8_0; set the model's recommended sampling
17. Further reading (NVIDIA DGX Spark playbooks)
Official playbooks that pair with this setup:
- llama.cpp on Spark (build/serve basics): https://build.nvidia.com/spark/llama-cpp/instructions
- Overview + model matrix: https://build.nvidia.com/spark/llama-cpp/overview
- Troubleshooting: https://build.nvidia.com/spark/llama-cpp/troubleshooting
- Speculative Decoding (the biggest decode speedup): https://build.nvidia.com/spark/speculative-decoding
- Nemotron-3-Nano with llama.cpp: https://build.nvidia.com/spark/nemotron
- Run Hermes Agent with local models: https://build.nvidia.com/spark/hermes-agent
- DGX Spark playbooks repo / performance guide: https://github.com/NVIDIA/dgx-spark-playbooks
How this guide differs from NVIDIA’s example (and why). The official llama.cpp
instructions are deliberately minimal: they build with -DLLAMA_CURL=OFF (no HTTPS, so
they hf download separately) for arch 121, and run a bare server with only
--n-gpu-layers 99 --ctx-size 8192 --threads 8. This guide keeps all of that working but
adds, on purpose:
- build: LLAMA_OPENSSL=ON (so -hf downloads directly), the 121a native-FP4 target, GGML_CUDA_FA_ALL_QUANTS=ON, and Release.
- runtime: the Spark-tuned flags the minimal example omits, namely --no-mmap (effectively mandatory on unified memory), -fa on, q8_0 KV cache, --batch-size / --ubatch-size 2048, --threads 20, and --jinja (for tool-calling agents).
None of these change behavior incompatibly; they’re performance/quality/convenience upgrades the official example leaves out for brevity. Keep them.
Appendix A — the llm script (full source)
Save this as ~/.local/bin/llm and chmod +x it (see section 6). This is the complete, hardened script — copy it verbatim and the guide is self-contained.
#!/usr/bin/env bash
# llm — manage llama.cpp servers on the DGX Spark (GB10 / sm_121)
# install: cp llm ~/.local/bin/llm && chmod +x ~/.local/bin/llm
# (make sure ~/.local/bin is on your PATH)
#
# env overrides: LLM_CTX=<n> LLM_BIN=<path> LLM_ARCH=<121|121a> LLM_MODELS=<dir> LLM_WAIT_TIMEOUT=<s>
BIN="${LLM_BIN:-$HOME/llama.cpp/build/bin/llama-server}"
CTX="${LLM_CTX:-65536}"
MODELS="${LLM_MODELS:-$HOME/models}"
die() { echo "llm: $*" >&2; exit 1; }
cmd="${1:-help}"; shift 2>/dev/null || true
case "$cmd" in
run)
m="${1:-}"; [[ -z "$m" ]] && die "usage: llm run <name | repo:quant | file.gguf> [port] [extra flags]"
shift
p=8080; [[ "${1:-}" =~ ^[0-9]+$ ]] && { p="$1"; shift; }
[[ "$CTX" =~ ^[0-9]+$ ]] || die "LLM_CTX must be a number (got '$CTX')"
[[ "$p" =~ ^[0-9]+$ ]] || die "port must be a number (got '$p')"
p=$((10#$p)) # normalise (strip leading zeros, force base 10)
(( p >= 1 && p <= 65535 )) || die "port out of range: $p"
[[ -x "$BIN" ]] || die "llama-server not found/executable at '$BIN' — build it (see playbook) or set LLM_BIN"
# --- resolve the model argument (existing file wins, then repo:quant, then name lookup) ---
if [[ -f "$m" ]]; then
f="--model" # an existing file (path or expanded glob)
elif [[ "$m" == *:* ]]; then
f="-hf" # repo:quant -> download from HF
else
hit="" # resolve a short name under $MODELS
if [[ -d "$MODELS/$m" ]]; then
cands=()
for g in "$MODELS/$m"/*.gguf; do
[[ -e "$g" ]] || continue
[[ "$g" == *mmproj* || "$g" == *-MTP* ]] && continue # skip vision / draft files
[[ "$g" == *-[0-9][0-9][0-9][0-9][0-9]-of-[0-9][0-9][0-9][0-9][0-9].gguf && "$g" != *-00001-of-*.gguf ]] && continue # keep only the first shard
cands+=("$g")
done
if [[ ${#cands[@]} -gt 1 ]]; then
echo "llm: multiple models in $MODELS/$m — pass the exact file you want:" >&2
for c in "${cands[@]}"; do echo " $MODELS/$m/$(basename "$c")" >&2; done
exit 1
fi
[[ ${#cands[@]} -eq 1 ]] && hit="${cands[0]}"
elif [[ -f "$MODELS/$m" ]]; then
hit="$MODELS/$m"
else
for g in "$MODELS/$m"*.gguf; do [[ -e "$g" ]] && { hit="$g"; break; }; done
fi
if [[ -n "$hit" ]]; then
m="$hit"; f="--model"
elif [[ "$m" == */* ]]; then
f="-hf" # looks like an HF repo id, no local match
else
die "'$m' not found in $MODELS and not a repo:quant — try 'llm ls' or use repo:quant"
fi
fi
if ss -ltnH 2>/dev/null | awk '{print $4}' | grep -qE ":$p$"; then
die "port $p already in use — see 'llm ps' or pick another port"
fi
nohup "$BIN" $f "$m" \
--host 0.0.0.0 --port "$p" \
--n-gpu-layers 999 -fa on --no-mmap \
--cache-type-k q8_0 --cache-type-v q8_0 \
--batch-size 2048 --ubatch-size 2048 --threads 20 \
--ctx-size "$CTX" --jinja \
"$@" \
>"$HOME/llama-$p.log" 2>&1 &
pid=$!
echo "started $(basename "$m") on :$p (pid $pid, ctx=$CTX)"
# catch immediate failures (bad path, port race, arg error, OOM at init)
sleep 1
if ! kill -0 "$pid" 2>/dev/null; then
echo " ERROR: process exited immediately — last log lines:" >&2
tail -n 6 "$HOME/llama-$p.log" 2>/dev/null | sed 's/^/ /' >&2
exit 1
fi
echo " log: llm log $p | wait: llm wait $p | test: llm test $p | ui: http://localhost:$p"
if [[ "$f" == "-hf" ]]; then
echo " note: downloading from HF in the background — progress does NOT show in the log;"
echo " watch ~/.cache/huggingface, or pre-download with 'hf download' for a progress bar."
fi
;;
stop)
p="${1:-8080}"
if [[ "$p" == all ]]; then
pkill -f "llama-server" && echo "stopped all servers" || echo "no servers running"
exit 0
fi
[[ "$p" =~ ^[0-9]+$ ]] || die "stop: give a port number or 'all' (got '$p')"
# anchor the port so 'stop 80' can't match :8080 and 'stop 8080' can't match :18080
pat="llama-server.*--port $p"'( |$)'
pkill -f "$pat" || { echo "nothing running on :$p"; exit 0; }
for _ in {1..20}; do pgrep -f "$pat" >/dev/null || break; sleep 0.5; done
if pgrep -f "$pat" >/dev/null; then
pkill -9 -f "$pat"; echo "force-killed :$p (it ignored SIGTERM)"
else
echo "stopped :$p"
fi
;;
log)
p="${1:-8080}"
[[ -f "$HOME/llama-$p.log" ]] || die "no log at ~/llama-$p.log — has a server run on :$p?"
exec tail -f "$HOME/llama-$p.log"
;;
ps)
found=0
for pid in $(pgrep -f llama-server 2>/dev/null); do
args=$(tr '\0' ' ' < "/proc/$pid/cmdline" 2>/dev/null) || continue
port=$(grep -oP -- '--port \K\S+' <<<"$args" || true)
model=$(grep -oP -- '--model \K\S+' <<<"$args" || true); model="${model##*/}"
[[ -z "$model" ]] && model=$(grep -oP -- '-hf \K\S+' <<<"$args" || true)
printf " :%-6s pid %-8s %s\n" "${port:-?}" "$pid" "${model:-?}"
found=1
done
[[ "$found" -eq 0 ]] && echo " (no llama-servers running)"
# GB10 has unified memory; nvidia-smi often reports VRAM as [N/A], so fall back to system RAM
mem=""
if command -v nvidia-smi >/dev/null 2>&1; then
mem=$(nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits 2>/dev/null \
| awk -F', ' 'NR==1 && $1 ~ /^[0-9]+$/ {printf "gpu: %s / %s MiB used", $1, $2}')
fi
[[ -z "$mem" ]] && command -v free >/dev/null 2>&1 && \
mem=$(free -m | awk '/^Mem:/ {printf "mem: %s / %s MiB used (unified)", $3, $2}')
[[ -n "$mem" ]] && echo " $mem"
;;
wait)
p="${1:-8080}"; t=0; max="${LLM_WAIT_TIMEOUT:-600}"
[[ "$max" =~ ^[0-9]+$ ]] || max=600
command -v curl >/dev/null 2>&1 || die "curl not installed (sudo apt install -y curl)"
until curl -sf "http://localhost:$p/health" >/dev/null 2>&1; do
sleep 2; t=$((t+2))
[[ $t -ge $max ]] && die ":$p not ready after ${max}s — check 'llm log $p' (still downloading? crashed?)"
done
echo ":$p ready"
;;
test)
p=8080; [[ "${1:-}" =~ ^[0-9]+$ ]] && { p="$1"; shift; }
command -v jq >/dev/null 2>&1 || die "jq not installed (sudo apt install -y jq)"
command -v curl >/dev/null 2>&1 || die "curl not installed (sudo apt install -y curl)"
# prompt precedence: args > piped stdin > default
if [[ $# -gt 0 ]]; then prompt="$*"
elif [[ ! -t 0 ]]; then prompt="$(cat)"
else prompt="Say hello in one short sentence."; fi
payload=$(jq -nc --arg c "$prompt" '{messages:[{role:"user",content:$c}],max_tokens:256}')
resp=$(curl -s "http://localhost:$p/v1/chat/completions" -H "Content-Type: application/json" -d "$payload")
[[ -z "$resp" ]] && die "no response from :$p — running? ('llm ps') ready? ('llm wait $p')"
echo "$resp" | jq -r '.choices[0].message.content // .error.message // .' 2>/dev/null || echo "$resp"
;;
speed)
p=8080; [[ "${1:-}" =~ ^[0-9]+$ ]] && { p="$1"; shift; }
command -v jq >/dev/null 2>&1 || die "jq not installed (sudo apt install -y jq)"
command -v curl >/dev/null 2>&1 || die "curl not installed (sudo apt install -y curl)"
# prompt precedence: args > piped stdin > default
if [[ $# -gt 0 ]]; then prompt="$*"
elif [[ ! -t 0 ]]; then prompt="$(cat)"
else prompt="Write a detailed paragraph about GPUs."; fi
payload=$(jq -nc --arg c "$prompt" '{messages:[{role:"user",content:$c}],max_tokens:200}')
resp=$(curl -s "http://localhost:$p/v1/chat/completions" -H "Content-Type: application/json" -d "$payload")
[[ -z "$resp" ]] && die "no response from :$p — running? ('llm ps') ready? ('llm wait $p')"
echo "$resp" | jq -r 'if .timings then "prefill: \(.timings.prompt_per_second) tok/s | decode: \(.timings.predicted_per_second) tok/s" else (.error.message // "no timings in response") end' 2>/dev/null || echo "$resp"
;;
ls)
[[ -d "$MODELS" ]] || die "no models directory at $MODELS (set LLM_MODELS or create it)"
echo "models in $MODELS:"
n=0
for d in "$MODELS"/*/; do [[ -d "$d" ]] && { echo " $(basename "$d")/"; n=1; }; done
for g in "$MODELS"/*.gguf; do [[ -e "$g" ]] && { echo " $(basename "$g")"; n=1; }; done
[[ "$n" -eq 0 ]] && echo " (empty)"
;;
update)
arch="${LLM_ARCH:-121a}"
[[ -d "$HOME/llama.cpp/.git" ]] || die "no git repo at ~/llama.cpp — clone it first (see playbook)"
cd "$HOME/llama.cpp" || die "cannot cd to ~/llama.cpp"
git pull && \
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$arch" \
-DLLAMA_OPENSSL=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release && \
cmake --build build --config Release -j"$(nproc)" && \
echo "updated -> $(git -C "$HOME/llama.cpp" log -1 --oneline)"
;;
help|--help|-h|*)
cat <<'EOF'
llm — manage llama.cpp servers on the DGX Spark
llm run <name | repo:quant | file.gguf> [port] [extra llama-server flags]
llm stop <port | all>
llm ps running servers (port, pid, model) + memory
llm ls list local models in $LLM_MODELS
llm log <port> follow a server's log
llm wait <port> block until the server is ready (timeout LLM_WAIT_TIMEOUT, def 600s)
llm test <port> [prompt...] quick chat test (needs jq; prompt via args OR piped stdin)
llm speed <port> [prompt...] report prefill / decode tok/s (needs jq; args OR stdin)
llm update git pull + rebuild llama.cpp
the model argument resolves in this order:
is a file -> used as-is e.g. ~/models/gemma-12b/*Q8_0.gguf
has a ':' -> repo:quant, download from HF e.g. unsloth/gemma-4-12b-it-GGUF:Q8_0
bare name -> looked up in $LLM_MODELS e.g. gemma-12b -> ~/models/gemma-12b/*.gguf
(skips mmproj/MTP, prefers first shard; errors if >1 quant present)
has a '/' and no local match -> HF repo id
env overrides: LLM_CTX=<n> LLM_BIN=<path> LLM_ARCH=<121|121a> LLM_MODELS=<dir> LLM_WAIT_TIMEOUT=<s>
examples:
llm run gemma-12b 8080 # resolves ~/models/gemma-12b/*.gguf
llm run unsloth/gemma-4-12b-it-GGUF:Q8_0 8080 --alias gemma12b
LLM_CTX=131072 llm run ~/models/gemma-12b/*Q8_0.gguf 8081
llm ps
llm test 8080 "what is a DGX Spark?"
cat prompt.txt | llm test 8080 # long / multi-line prompt via stdin
EOF
;;
esac
Appendix B — shell-function alternative
If you prefer shell functions to the llm script in Appendix A, the same behavior can
live as llmrun / llmstop / llmlog / llmps / llmwait / llmtest / llmupdate
functions in ~/.llama-helpers.sh, sourced from ~/.bashrc with one line:
echo '[ -f ~/.llama-helpers.sh ] && . ~/.llama-helpers.sh' >> ~/.bashrc
Same flags and behavior. Pick one approach to avoid confusion — the llm script is
recommended (works in any shell, no sourcing, and what the rest of this guide assumes). A
stale copy of the function in ~/.bashrc was the cause of several earlier surprises, so if
you go the function route, keep it in the single sourced file above and re-source after edits.
