# DGX Spark + LlamaCPP Playbook

*June 17, 2026*
 — by Flaviu Vlaicu

> A complete, self-contained guide to running local LLMs on the NVIDIA DGX Spark (GB10 / sm_121) with llama.cpp — build with CUDA and native FP4, serve any GGUF model over an OpenAI-compatible API with a single `llm` command, and tune it all for the Spark's 128 GB unified memory.



# Complete Setup & Operations Guide

Everything needed to build, run, update, and operate local LLMs on an NVIDIA DGX
Spark (GB10 / sm_121) with llama.cpp and the `llm` helper command.

---

## 1. How the pieces fit

**The Spark (GB10).** Blackwell GPU at compute capability 12.1 (`sm_121`), 128 GB
unified LPDDR5x shared between CPU and GPU, ~273 GB/s memory bandwidth. Bandwidth is the
bottleneck for token generation, so **Mixture-of-Experts (MoE) models with few active
parameters run far faster than dense models of the same total size.** Prefer MoE.

**llama.cpp** is the inference engine. You compile it once for the GB10 into a binary
called `llama-server`, which loads a model and exposes an **OpenAI-compatible HTTP API**
on a port. Everything else (Hermes, Open WebUI, curl) just talks to that API.

**`llm`** is a small wrapper script around `llama-server`. It does *not* replace
llama.cpp — it launches it with the right Spark-tuned flags baked in (so you don't retype
a dozen options each time) and adds conveniences: start/stop, list running servers, tail
logs, quick test, measure tok/s, rebuild. It's the control panel; llama.cpp is the engine.

**GGUF** is the only model format llama.cpp runs. Models come from Hugging Face as `.gguf`
files, usually quantized (Q8_0, Q4_K_M, IQ4_XS, …).

The flow:

```
GGUF model  ->  llama-server (built for sm_121)  ->  OpenAI API on a port  ->  clients
                        ^                                                      (Hermes,
                  launched by `llm`                                            Open WebUI)
```

---

## 2. Pre-flight checks (before installing anything)

```bash
nvidia-smi --query-gpu=compute_cap --format=csv   # must be 12.1 (GB10 / sm_121)
nvcc --version                                     # must be CUDA 13.x
cmake --version                                    # 3.31+ recommended for the 121a build
git --version
```

- `compute_cap` must be **12.1** -> you build for the GB10's `sm_121` (arch `121`, or
  `121a` for native FP4 — see 4.1). Building for `120` (RTX 50-series) causes
  `no kernel image is available for execution on the device`.
- `nvcc` must be **CUDA 13.x**; older toolkits don't know `sm_121`.

Known-good stack (NVIDIA DGX OS release notes, mid-2026): **DGX OS 7.4.0, GPU driver
580.126.09, CUDA Toolkit 13.0.2, kernel 6.17**. Anything at or above this is fine; if your
`nvcc` predates CUDA 13, update the toolkit before building.

---

## 3. Install prerequisites (one time)

```bash
sudo apt update
sudo apt install -y git cmake build-essential libssl-dev libcurl4-openssl-dev jq
```

- `libssl-dev` (+ libcurl) -> HTTPS so `-hf` can download models from Hugging Face.
- `jq` -> used by `llm test` / `llm speed` to parse responses.

**What `hf` is.** `hf` is the **Hugging Face CLI** (the `huggingface_hub` command-line
tool, formerly `huggingface-cli`). It authenticates to Hugging Face and downloads model
repos/files to your local cache or a chosen folder. llama.cpp's `-hf` flag has its *own*
built-in downloader, so `hf` is only needed when you want to pre-download in the foreground
(visible progress, resumable) or fetch gated/private models.

Install it globally with `uv` (cleanest — nothing to activate):

```bash
uv tool install "huggingface_hub[cli]"
hf version          # verify
```

Or, the NVIDIA-playbook way, into a venv:

```bash
python3 -m venv ~/llama-cpp-venv && source ~/llama-cpp-venv/bin/activate
pip install -U "huggingface_hub[cli]"
hf version
```

For gated/private repos, authenticate once: `hf auth login` (paste a token from
huggingface.co/settings/tokens).

---

## 4. Build llama.cpp

```bash
cd ~
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121a \
  -DLLAMA_OPENSSL=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```

These **compile-time** flags are the only things baked into the binary:

| flag | what it does |
|---|---|
| `GGML_CUDA=ON` | CUDA backend (required) |
| `CMAKE_CUDA_ARCHITECTURES=121a` | target the GB10; `a` = native FP4 (see 4.1) |
| `LLAMA_OPENSSL=ON` | HTTPS for `-hf` downloads (needs `libssl-dev`) |
| `GGML_CUDA_FA_ALL_QUANTS=ON` | flash-attention kernels for all quant/KV combos, so `-fa on` works with `q8_0` KV |
| `CMAKE_BUILD_TYPE=Release` | optimized binary |

Verify and put on PATH:

```bash
~/llama.cpp/build/bin/llama-server --version
ln -sf ~/llama.cpp/build/bin/llama-server ~/.local/bin/llama-server
```

> Keep the source in `~/llama.cpp` (not `~/.cache`, which can be wiped).
>
> **Build flags vs runtime flags:** the cmake flags above bake in *capabilities*.
> Performance/behavior flags (`--no-mmap`, `-fa on`, `--batch-size`, `--threads`,
> `--ctx-size`, `--jinja`, sampling) are applied at **runtime** on the `llama-server`
> command line — they're built into the `llm` command, not the binary.

### 4.1 The `121` vs `121a` build (the "a")

Both target the same chip (compute capability 12.1). The difference is the instruction
set the compiler may emit:

- **`121`** — the standard, portable feature set. PTX can JIT forward onto future GPUs.
  Conservative, "always works."
- **`121a`** — `a` = *architecture-specific* features: the newest tensor-core MMA ops and
  **native FP4 / microscaling matmul (MXFP4, NVFP4)**. The code is locked to `sm_121`
  (won't run on other GPUs), but on a dedicated Spark that's irrelevant.

**`121a` is faster only on models that use the hardware FP4 path** — MXFP4- and
NVFP4-quantized models:

- GPT-OSS-120B / GPT-OSS-20B (natively MXFP4)
- any `*MXFP4*` GGUF
- NVFP4 GGUFs that load on stock

**It makes no difference for standard quants** (Q4_K, Q5_K, Q6_K, Q8_0, IQ4_XS) — they
don't touch FP4 instructions. The gain, where it exists, is mostly in **prefill**
(compute-bound); **decode** is bandwidth-bound on the Spark regardless.

**Recommendation: keep `121a`.** It's a strict superset with no downside on a dedicated
Spark, and gives free FP4 acceleration the day you run an MXFP4/NVFP4 model. A FP4-capable
build shows `BLACKWELL_NATIVE_FP4 = 1` in the server log. Use plain `121` only if you need
a portable binary. Verify the binary's arch:

```bash
cuobjdump ~/llama.cpp/build/bin/llama-server 2>/dev/null | grep -m1 'arch =' \
  || find ~/llama.cpp/build -name '*.so' -exec cuobjdump {} \; 2>/dev/null | grep -m1 'arch ='
# -> arch = sm_121a
```

If cmake rejects `121a`, your CMake is too old: `pip install --upgrade --break-system-packages cmake`,
or bypass with `-DCMAKE_CUDA_ARCHITECTURES=OFF -DCMAKE_CUDA_FLAGS="-gencode arch=compute_121a,code=sm_121a"`.

---

## 5. Updating llama.cpp

New architectures/features land constantly (a vision projector or MTP arch your build
rejects often starts working after a pull).

```bash
cd ~/llama.cpp
git pull
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121a \
  -DLLAMA_OPENSSL=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```

This is an **incremental** rebuild — only changed files recompile (a minute or two).
`llm update` does exactly this in one command.

- **Clean rebuild** (`rm -rf build` first) only when you change a build flag
  (e.g. `121`->`121a`) or a pull throws stale-cache errors.
- **Roll back** if master breaks: `git -C ~/llama.cpp checkout <tag-or-commit>` then rebuild;
  `git checkout master` to return.
- **Fork** (Step-3.7): `cd ~/stepfun-llama.cpp && git pull && cmake --build build --config Release -j"$(nproc)"`.

---

## 6. The `llm` command

The full script is in **Appendix A** (so this guide is self-contained). Save it as
`~/.local/bin/llm` and make it executable — no `.bashrc` editing, it just has to be on
your PATH:

```bash
mkdir -p ~/.local/bin
# paste the Appendix A script into the file, e.g.:  nano ~/.local/bin/llm
chmod +x ~/.local/bin/llm
echo "$PATH" | grep -q "$HOME/.local/bin" || { echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc; source ~/.bashrc; }
llm help
```

**How it pairs with llama.cpp:** `llm run` launches `~/llama.cpp/build/bin/llama-server`
with the Spark-tuned flags below already applied, redirects output to `~/llama-<port>.log`,
and backgrounds it. The other subcommands manage those processes.

Baked-in runtime flags (per launch): `--n-gpu-layers 999 -fa on --no-mmap
--cache-type-k q8_0 --cache-type-v q8_0 --batch-size 2048 --ubatch-size 2048 --threads 20
--ctx-size 65536 --jinja`.

| command | what it does |
|---|---|
| `llm run <model> [port] [flags]` | start a server (default port 8080) |
| `llm stop <port \| all>` | stop one server (waits for it to die), or all |
| `llm ps` | running servers (port, pid, model) + memory |
| `llm ls` | list local models in `~/models` |
| `llm log <port>` | follow a server's log |
| `llm wait <port>` | block until the server is ready |
| `llm test <port> [prompt]` | quick chat test |
| `llm speed <port> [prompt]` | report prefill / decode tok/s |
| `llm update` | git pull + rebuild llama.cpp |

**Model argument resolution** (the `<model>`), in order:

1. **is a file** -> used as-is (path or shell-expanded glob) — e.g. `~/models/gemma-12b/*Q8_0.gguf`
2. **has a `:`** -> `repo:quant`, downloaded from Hugging Face — e.g. `unsloth/gemma-4-12b-it-GGUF:Q8_0`
3. **bare name** -> looked up under `~/models` — e.g. `gemma-12b` -> `~/models/gemma-12b/*.gguf`
   (skips `mmproj`/`MTP` files, prefers the first shard; **errors and lists them if the folder has more than one quant**)
4. **has `/`, no local match** -> treated as an HF repo id

**Env overrides** (set per launch): `LLM_CTX=<n>` (context size), `LLM_BIN=<path>`
(different binary, e.g. the fork), `LLM_ARCH=<121|121a>` (used by `llm update`),
`LLM_MODELS=<dir>` (model directory, default `~/models`), `LLM_WAIT_TIMEOUT=<s>`
(how long `llm wait` waits before giving up, default 600).

`llm run` also **validates up front** (port range, numeric `LLM_CTX`, that the binary
exists) and, after launching, **checks the process survived its first second** — if it
died immediately (bad path, port race, bad arg, OOM at init) it prints the last log lines
instead of leaving you to discover an empty server later.

---

## 7. Running models

```bash
# the three input forms:
llm run gemma-12b 8080                              # bare name -> ~/models/gemma-12b/*.gguf
llm run unsloth/gemma-4-12b-it-GGUF:Q8_0 8080       # repo:quant -> download from HF
llm run ~/models/gemma-12b/*Q8_0.gguf 8080          # explicit path

# full lifecycle:
llm run gemma-12b 8080 --alias gemma --temp 1.0 --top-p 0.95 --top-k 64
llm wait 8080                                       # wait until ready
llm test 8080 "what is a DGX Spark?"                # functional check
llm speed 8080                                      # tok/s
llm ps                                              # see it (and others) + memory
llm stop 8080

# multiple models at once — one per port:
llm run gemma-12b 8080
llm run qwen3-coder 8081
llm ps

# context size per model (small models -> big ctx; big models -> modest):
LLM_CTX=131072 llm run gemma-12b 8080               # 128K
LLM_CTX=0      llm run gemma-12b 8080               # model's full trained window
LLM_CTX=32768  llm run step-3.7 8082                # keep modest for a ~100GB model
```

Anything after the port is passed straight to `llama-server` (sampling, `--alias`,
`--mmproj`, `-np`, …).

---

## 7.1 A-to-Z worked examples (Hugging Face -> running)

Hugging Face model pages show a copy-paste command. Translate it to `llm`:

| you see on HF / Ollama | run it here as |
|---|---|
| `llama-server -hf <repo>:<quant>` | `llm run <repo>:<quant> <port>` |
| `ollama run hf.co/<repo>:<quant>` | `llm run <repo>:<quant> <port>`  (drop the `hf.co/`) |

The `-hf <repo>:<quant>` on HF is exactly the colon form `llm run` accepts — just add a port.

**Example A — official ggml-org Gemma 26B (Q4_K_M), the recommended path:**

```bash
# HF shows:  llama-server -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M

hf download ggml-org/gemma-4-26B-A4B-it-GGUF --include "*Q4_K_M*" --local-dir ~/models/gemma-26b
llm run gemma-26b 8080 --alias gemma-26b --temp 1.0 --top-p 0.95 --top-k 64
llm wait 8080
llm test 8080 "Explain MoE models in two sentences."
llm speed 8080
# then point a client at  http://<spark-ip>:8080/v1
```

**Example B — community fine-tune (Mellum2 coder, thinking), the quick path:**

```bash
# HF shows:  llama-server -hf yuxinlu1/Mellum2-12B-A2.5B-Claude-4.6-4.8-Opus-Thinking-GGUF:Q8_0

llm run yuxinlu1/Mellum2-12B-A2.5B-Claude-4.6-4.8-Opus-Thinking-GGUF:Q8_0 8081 --alias mellum2
llm wait 8081          # the -hf download is silent; this blocks until it's actually ready
llm test 8081 "Write a Python function that reverses a linked list."
```
(A 2.5B-active MoE — tiny and fast. Thinking model, so keep generation generous.)

**Example C — translating an Ollama command:**

```bash
# Ollama:  ollama run hf.co/yuxinlu1/gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUF:Q8_0
# drop "hf.co/", keep repo:quant, add a port:
llm run yuxinlu1/gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUF:Q8_0 8082 --alias gemma12-opus
```

For **Gemma** repos, prefer the `hf download --include "*<quant>*"` route — its glob skips
the `mmproj-*` vision file, avoiding the unsupported `gemma4uv` projector that the bare
`-hf` form may try to auto-load.

---

## 8. The model directory & downloading

Models live in **`~/models`** (override with `LLM_MODELS`), one folder per model:

```
~/models/
  gemma-12b/    gemma-4-12b-it-Q8_0.gguf
  gemma-26b/    gemma-4-26B-A4B-it-Q8_0.gguf   mmproj-...gguf
  step-3.7/     Step-3.7-flash-IQ4_XS-00001-of-00003.gguf   ...
```

`llm ls` lists them; bare-name `llm run` resolves into them.

**Picking a quant (the `:Q8_0` / `Q4_K_M` / `IQ4_XS` part).** GGUF files are quantized to
save memory. Reading the names: the **number is bits per weight** (lower = smaller, faster,
slightly lower quality); **K** = modern "K-quant"; **M/S** = medium/small variant; **IQ** =
imatrix quant (better quality per bit); **MXFP4 / NVFP4** = 4-bit microscaling formats that
the `121a` build accelerates; **F16/BF16** = full half-precision (largest). Practical
picks on the Spark: **Q8_0** when it fits and you want max quality, **Q4_K_M** or **IQ4_XS**
to fit bigger models with little quality loss. Rough size ≈ (params × bits) / 8 — e.g. a
26B at Q8_0 ≈ 27 GB, at Q4_K_M ≈ 16 GB.

**Two download syntaxes — don't mix them:**

| tool | syntax |
|---|---|
| llama.cpp / `llm run` | `repo:quant` with a **colon** -> `llm run unsloth/...-GGUF:Q8_0 8080` |
| `hf download` (HF CLI) | `repo` + `--include "*pattern*"`, or an exact `repo filename.gguf` (no colon) |

**For new/large models, download in the foreground first** (visible progress, clear error
if the name is wrong, resumable), then serve the local file:

```bash
hf download unsloth/gemma-4-26B-A4B-it-GGUF --include "*Q8_0*" --local-dir ~/models/gemma-26b
llm run gemma-26b 8080
```

The direct `repo:quant` form downloads in the **background**, so progress does *not* show
in the log — the log stays quiet until the download finishes. Watch progress with
`du -sh ~/.cache/huggingface/hub/models--<org>--<repo>` (or look for `*.incomplete` blobs).

**Where downloaded models actually live** (this trips people up — there are two locations):

- **`llm run <repo>:<quant>` / `-hf`** caches into the **Hugging Face cache**:
  `~/.cache/huggingface/hub/models--<org>--<repo>/` — the bytes are in `blobs/`, with
  readable filenames symlinked under `snapshots/<hash>/`. The model runs straight from
  there. **Bare-name `llm run` does *not* see these** (they're in the cache, not `~/models`).
- **`hf download … --local-dir ~/models/<name>`** writes real files exactly where you point
  it — the `~/models/<name>` layout this guide uses, which is what bare-name `llm run`
  resolves against.
- Relocate the cache with `export HF_HOME=/big/disk` (or `HF_HUB_CACHE`) if your home
  partition is small — 27 GB+ files add up fast.

Practical rule: use `hf download --local-dir ~/models/<name>` when you want a tidy,
predictable path and bare-name runs (`llm run <name>`); the `repo:quant` shortcut is
convenient for one-offs, but the bytes land in `~/.cache/huggingface`.

---

## 9. Per-model sampling

llama.cpp's defaults aren't tuned per model — set these (append to `llm run`, or in your client):

| model | sampling |
|---|---|
| Gemma 4 | `--temp 1.0 --top-p 0.95 --top-k 64` (high temp helps its reasoning) |
| Qwen3.x | `--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0` (check card: thinking vs not) |
| Nemotron / Step-3.7 | low temp / near-greedy — see model card + reasoning level |

Always grab the exact values from the model card.

**Thinking / reasoning models** (e.g. Gemma 4 with `thinking = 1`, Mellum2-Thinking,
Nemotron/Step-3.7 reasoning modes) put their chain-of-thought in a separate
`reasoning_content` field before the answer in `.content`. Two consequences: keep
`max_tokens` generous (a tight cap can cut off the answer after the thinking), and if a
reply looks empty, check `reasoning_content` too. `--jinja` (already on) is what makes the
model's thinking template work.

---

## 10. Wiring into clients

All servers are OpenAI-compatible at `http://<spark-ip>:<port>/v1`.

- **Hermes:** `hermes model` -> "Custom endpoint" -> `http://<spark-ip>:8080/v1` -> blank key -> pick the model.
- **Open WebUI:** Admin -> Connections -> **OpenAI API** -> `http://<spark-ip>:8080/v1` -> any key -> refresh.
  (Add as OpenAI API, never Ollama.)

One model per port = one connection. For tool-calling agents, `--jinja` (already on) is required.

---

## 11. Always-on model (systemd)

For a daily-driver model that survives reboots. `~/.config/systemd/user/llama.service`:

```ini
[Unit]
Description=llama.cpp server
After=network-online.target

[Service]
ExecStart=%h/llama.cpp/build/bin/llama-server --model %h/models/gemma-12b/gemma-4-12b-it-Q8_0.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 999 -fa on --no-mmap --cache-type-k q8_0 --cache-type-v q8_0 --batch-size 2048 --ubatch-size 2048 --threads 20 --ctx-size 65536 --jinja
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target
```

```bash
systemctl --user daemon-reload
systemctl --user enable --now llama
sudo loginctl enable-linger "$USER"        # start at boot without login
```

Use systemd for the always-on model and `llm` for experimenting on other ports.

---

## 12. Measuring tok/s

- **Web UI** (`http://<ip>:<port>`) shows tok/s under each response.
- **`llm speed <port>`** prints prefill and decode tok/s.
- **Server log** (`llm log <port>`) prints `tokens per second` after each request — the
  `eval time` (generation) line is decode speed.
- **`llama-bench`** for clean, repeatable numbers without the server:
  `~/llama.cpp/build/bin/llama-bench -m <model.gguf> -p 512 -n 128 -fa 1` -> `pp`=prefill, `tg`=decode.

**Want more decode tok/s?** The biggest lever is **speculative decoding**, which is now
**native upstream** (landed in recent llama.cpp). Two ways:

- **MTP / co-trained drafter (best):** use a model's multi-token-prediction head — the
  `*-MTP.gguf` files in the unsloth Qwen3.6 / Gemma repos — via `--spec-type draft-mtp`
  with `--model-draft <mtp.gguf>`. Co-trained drafters reach roughly 45–85% token
  acceptance, which can come close to doubling decode speed. (This is the same
  `gemma-4-12b-it-Q8_0-MTP.gguf` file that wouldn't load *as a standalone model* — it's a
  draft head, and this is what it's for.)
- **Plain draft model:** pair a small same-family model via `--model-draft <small.gguf>`.

Run `llm update` first if your build predates MTP support, and check
`~/llama.cpp/build/bin/llama-server --help` for the exact draft flags (this area is moving
fast). It's additive to everything above; since decode is bandwidth-bound on the Spark,
this helps more than any single flag. See NVIDIA's Speculative Decoding playbook (§17).

---

## 13. HTTPS & security

- The `LLAMA_OPENSSL=ON` you built is for **downloading models** (`-hf`), already in use.
- The **server API** runs plain **http://** (`running without SSL` in the log). On a trusted
  LAN that's correct — every client speaks HTTP to `:<port>/v1`.
- For access control, add an API key: `llm run gemma-12b 8080 --api-key SECRET`.
- Only put the API behind **HTTPS** if you expose it beyond the LAN — via a reverse proxy
  (Caddy/nginx terminating TLS) plus an API key, not per-server certs.

---

## 14. What you can / can't run on one Spark

| can't run | why |
|---|---|
| > ~200-240B total (DeepSeek V4, Kimi K2.x, GLM-4.7/5.1 ~355B, Llama-405B) | won't fit in 128 GB even at 4-bit |
| non-GGUF only (safetensors/FP8/AWQ/GPTQ/EXL2) | llama.cpp is GGUF-only -> use vLLM/SGLang/TRT-LLM |
| unsupported archs (NemotronH, Step-3.7) | need a fork (StepFun, etc.) |
| NVFP4-native GGUFs | historically needed a fork, but upstream NVFP4 (+ MTP) support has landed — `llm update` may now run them |
| certain vision projectors (Gemma `gemma4uv`) | run text-only until llama.cpp adds support |
| large **dense** models (70B+) | fit but decode at single-digit tok/s — bandwidth-bound |

**Rule of thumb:** GGUF + supported architecture + <= ~200B total + ideally MoE.
Ceiling is the ~120B-class MoEs (GPT-OSS-120B, Qwen3.5-122B), with Qwen3-235B-A22B at the edge.

---

## 15. Gotchas & troubleshooting

| symptom | cause / fix |
|---|---|
| `HTTPS is not supported … rebuild with -DLLAMA_OPENSSL=ON` | binary built without SSL -> rebuild (sec 4) |
| `no kernel image is available` | built for wrong arch -> use `121`/`121a` |
| `couldn't bind HTTP server socket, port: N` | port in use -> `llm ps`, `ss -ltnp \| grep :N`, use another port |
| log shows only `nohup: ignoring input` | `-hf` downloading silently (check HF cache size), OR you're reading the wrong `~/llama-<port>.log` |
| empty/wrong log on a new port | each port has its own log -> `cat ~/llama-<thatport>.log` |
| `unknown model architecture: 'gemma4-assistant'` | that's the MTP draft head, not a standalone model — on recent llama.cpp (`llm update`) use it for speculative decoding (`--spec-type draft-mtp`, see §12), else just run the plain quant |
| `failed to load CLIP model … mmproj` | Gemma `gemma4uv` vision unsupported -> run text-only (single-file run avoids it) |
| `*` still in the path in an error | glob matched no file -> wrong dir/quant or not downloaded |
| `nohup: unrecognized option '--temp'` | stale `llmrun` function -> update it or use the `llm` script |
| `Repo id must use alphanumeric …` from `hf` | used `repo:quant` colon with `hf download` -> use `--include` instead |
| two copies of a model loaded | started the same model on two ports -> `llm stop` the duplicate |
| long / multi-line prompt -> shell `syntax error near '('` | don't paste it as a CLI arg (each newline runs as a command); pipe it instead: `cat prompt.txt \| llm test <port>`, or use the web UI |
| `llm ps` shows `mem:` instead of `gpu:` | normal on the GB10 — `nvidia-smi` reports VRAM as `[N/A]` on unified memory, so `llm ps` falls back to system (unified) RAM |
| token loops / gibberish | KV below `q8_0`, or wrong sampling -> keep `q8_0`, set the model's recommended sampling |

KV cache: keep at `q8_0` or higher. Never use sliding-window attention (`--swa-full`) with
agents — it drops the system prompt. Don't use `-ot ...exps=CPU` / `--n-cpu-moe` (those are
for VRAM-limited GPUs; the Spark wants full GPU offload).

---

## 16. Quick reference (cheatsheet)

**Daily use — the `llm` command**
```
llm run <name|repo:quant|file.gguf> [port] [flags]   start a server (default :8080)
llm ps                                               running servers (port pid model) + memory (unified on GB10)
llm ls                                               local models in ~/models
llm wait  <port>                                     block until ready
llm test  <port> "prompt"                            quick chat test
llm speed <port>                                     prefill / decode tok/s
llm log   <port>                                     follow the log
llm stop  <port | all>                               stop one / stop everything
llm update                                           git pull + rebuild llama.cpp
```

**Model argument (how `run` resolves it)**
```
~/models/x/*Q8_0.gguf     -> that exact file                  (a path / glob)
unsloth/...-GGUF:Q8_0     -> download from Hugging Face       (has a colon)
gemma-12b                 -> ~/models/gemma-12b/*.gguf        (bare name)
```

**Per-launch overrides**
```
LLM_CTX=131072 llm run ...                       bigger context (default 65536; 0 = full window)
LLM_MODELS=/data llm run ...                     different model directory
LLM_BIN=~/stepfun.../llama-server llm run ...    different binary (fork)
llm run ... 8080 --temp 1.0 --top-p 0.95 --top-k 64 --alias name   append any llama-server flag
```

**Download (two syntaxes — don't mix)**
```
llm run <repo>:<quant> <port>                                          colon form (background dl)
hf download <repo> --include "*<quant>*" --local-dir ~/models/<name>   hf CLI (foreground, progress)
```
Translate HF / Ollama copy-paste:  `llama-server -hf R:Q`  and  `ollama run hf.co/R:Q`  ->  `llm run R:Q <port>`

**Build / update llama.cpp**
```bash
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121a \
  -DLLAMA_OPENSSL=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
# update:  llm update          clean rebuild:  rm -rf ~/llama.cpp/build && llm update
```

**Sampling** (append to `llm run` or set in your client)
```
Gemma 4   --temp 1.0 --top-p 0.95 --top-k 64
Qwen3.x   --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0
```

**Clients** — `http://<spark-ip>:<port>/v1`  (Hermes: Custom endpoint; Open WebUI: OpenAI API, not Ollama)

**Quick fixes**
```
port in use            llm ps ; ss -ltnp | grep :PORT ; use another port
empty log              wrong ~/llama-<port>.log, or -hf still downloading (silent)
unknown architecture   llm update  (or the model needs a fork)
mmproj / gemma4uv fail run text-only; hf download --include "*<quant>*" skips the mmproj
'*' in error path      glob matched nothing (wrong dir / quant / not downloaded)
long/multi-line prompt pipe it: cat prompt.txt | llm test PORT  (don't paste as an arg)
gibberish / loops      keep KV at q8_0; set the model's recommended sampling
```

---

## 17. Further reading (NVIDIA DGX Spark playbooks)

Official playbooks that pair with this setup:

- llama.cpp on Spark (build/serve basics): https://build.nvidia.com/spark/llama-cpp/instructions
- Overview + model matrix: https://build.nvidia.com/spark/llama-cpp/overview
- Troubleshooting: https://build.nvidia.com/spark/llama-cpp/troubleshooting
- Speculative Decoding (the biggest decode speedup): https://build.nvidia.com/spark/speculative-decoding
- Nemotron-3-Nano with llama.cpp: https://build.nvidia.com/spark/nemotron
- Run Hermes Agent with local models: https://build.nvidia.com/spark/hermes-agent
- DGX Spark playbooks repo / performance guide: https://github.com/NVIDIA/dgx-spark-playbooks

**How this guide differs from NVIDIA's example (and why).** The official llama.cpp
instructions are deliberately minimal: they build with `-DLLAMA_CURL=OFF` (no HTTPS, so
they `hf download` separately) for arch `121`, and run a bare server with only
`--n-gpu-layers 99 --ctx-size 8192 --threads 8`. This guide keeps all of that working but
adds, on purpose:

- build: `LLAMA_OPENSSL=ON` (so `-hf` downloads directly), the `121a` native-FP4 target,
  `GGML_CUDA_FA_ALL_QUANTS=ON`, and `Release`.
- runtime: the Spark-tuned flags the minimal example omits — `--no-mmap` (effectively
  mandatory on unified memory), `-fa on`, `q8_0` KV cache, `--batch-size/ubatch-size 2048`,
  `--threads 20`, and `--jinja` (for tool-calling agents).

None of these change behavior incompatibly; they're performance/quality/convenience
upgrades the official example leaves out for brevity. Keep them.

---

## Appendix A — the `llm` script (full source)

Save this as `~/.local/bin/llm` and `chmod +x` it (see section 6). This is the complete, hardened script — copy it verbatim and the guide is self-contained.

```bash
#!/usr/bin/env bash
# llm — manage llama.cpp servers on the DGX Spark (GB10 / sm_121)
# install:  cp llm ~/.local/bin/llm && chmod +x ~/.local/bin/llm
# (make sure ~/.local/bin is on your PATH)
#
# env overrides:  LLM_CTX=<n>  LLM_BIN=<path>  LLM_ARCH=<121|121a>  LLM_MODELS=<dir>  LLM_WAIT_TIMEOUT=<s>

BIN="${LLM_BIN:-$HOME/llama.cpp/build/bin/llama-server}"
CTX="${LLM_CTX:-65536}"
MODELS="${LLM_MODELS:-$HOME/models}"

die() { echo "llm: $*" >&2; exit 1; }

cmd="${1:-help}"; shift 2>/dev/null || true

case "$cmd" in

  run)
    m="${1:-}"; [[ -z "$m" ]] && die "usage: llm run <name | repo:quant | file.gguf> [port] [extra flags]"
    shift
    p=8080; [[ "${1:-}" =~ ^[0-9]+$ ]] && { p="$1"; shift; }

    [[ "$CTX" =~ ^[0-9]+$ ]] || die "LLM_CTX must be a number (got '$CTX')"
    [[ "$p" =~ ^[0-9]+$ ]]   || die "port must be a number (got '$p')"
    p=$((10#$p))                                   # normalise (strip leading zeros, force base 10)
    (( p >= 1 && p <= 65535 )) || die "port out of range: $p"
    [[ -x "$BIN" ]] || die "llama-server not found/executable at '$BIN' — build it (see playbook) or set LLM_BIN"

    # --- resolve the model argument (existing file wins, then repo:quant, then name lookup) ---
    if [[ -f "$m" ]]; then
      f="--model"                                          # an existing file (path or expanded glob)
    elif [[ "$m" == *:* ]]; then
      f="-hf"                                              # repo:quant -> download from HF
    else
      hit=""                                               # resolve a short name under $MODELS
      if [[ -d "$MODELS/$m" ]]; then
        cands=()
        for g in "$MODELS/$m"/*.gguf; do
          [[ -e "$g" ]] || continue
          [[ "$g" == *mmproj* || "$g" == *-MTP* ]] && continue                                                                  # skip vision / draft files
          [[ "$g" == *-[0-9][0-9][0-9][0-9][0-9]-of-[0-9][0-9][0-9][0-9][0-9].gguf && "$g" != *-00001-of-*.gguf ]] && continue  # keep only the first shard
          cands+=("$g")
        done
        if [[ ${#cands[@]} -gt 1 ]]; then
          echo "llm: multiple models in $MODELS/$m — pass the exact file you want:" >&2
          for c in "${cands[@]}"; do echo "  $MODELS/$m/$(basename "$c")" >&2; done
          exit 1
        fi
        [[ ${#cands[@]} -eq 1 ]] && hit="${cands[0]}"
      elif [[ -f "$MODELS/$m" ]]; then
        hit="$MODELS/$m"
      else
        for g in "$MODELS/$m"*.gguf; do [[ -e "$g" ]] && { hit="$g"; break; }; done
      fi
      if [[ -n "$hit" ]]; then
        m="$hit"; f="--model"
      elif [[ "$m" == */* ]]; then
        f="-hf"                                            # looks like an HF repo id, no local match
      else
        die "'$m' not found in $MODELS and not a repo:quant — try 'llm ls' or use repo:quant"
      fi
    fi

    if ss -ltnH 2>/dev/null | awk '{print $4}' | grep -qE ":$p$"; then
      die "port $p already in use — see 'llm ps' or pick another port"
    fi

    nohup "$BIN" $f "$m" \
      --host 0.0.0.0 --port "$p" \
      --n-gpu-layers 999 -fa on --no-mmap \
      --cache-type-k q8_0 --cache-type-v q8_0 \
      --batch-size 2048 --ubatch-size 2048 --threads 20 \
      --ctx-size "$CTX" --jinja \
      "$@" \
      >"$HOME/llama-$p.log" 2>&1 &
    pid=$!
    echo "started $(basename "$m") on :$p (pid $pid, ctx=$CTX)"

    # catch immediate failures (bad path, port race, arg error, OOM at init)
    sleep 1
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "  ERROR: process exited immediately — last log lines:" >&2
      tail -n 6 "$HOME/llama-$p.log" 2>/dev/null | sed 's/^/    /' >&2
      exit 1
    fi
    echo "  log: llm log $p  |  wait: llm wait $p  |  test: llm test $p  |  ui: http://localhost:$p"
    if [[ "$f" == "-hf" ]]; then
      echo "  note: downloading from HF in the background — progress does NOT show in the log;"
      echo "        watch ~/.cache/huggingface, or pre-download with 'hf download' for a progress bar."
    fi
    ;;

  stop)
    p="${1:-8080}"
    if [[ "$p" == all ]]; then
      pkill -f "llama-server" && echo "stopped all servers" || echo "no servers running"
      exit 0
    fi
    [[ "$p" =~ ^[0-9]+$ ]] || die "stop: give a port number or 'all' (got '$p')"
    # anchor the port so 'stop 80' can't match :8080 and 'stop 8080' can't match :18080
    pat="llama-server.*--port $p"'( |$)'
    pkill -f "$pat" || { echo "nothing running on :$p"; exit 0; }
    for _ in {1..20}; do pgrep -f "$pat" >/dev/null || break; sleep 0.5; done
    if pgrep -f "$pat" >/dev/null; then
      pkill -9 -f "$pat"; echo "force-killed :$p (it ignored SIGTERM)"
    else
      echo "stopped :$p"
    fi
    ;;

  log)
    p="${1:-8080}"
    [[ -f "$HOME/llama-$p.log" ]] || die "no log at ~/llama-$p.log — has a server run on :$p?"
    exec tail -f "$HOME/llama-$p.log"
    ;;

  ps)
    found=0
    for pid in $(pgrep -f llama-server 2>/dev/null); do
      args=$(tr '\0' ' ' < "/proc/$pid/cmdline" 2>/dev/null) || continue
      port=$(grep -oP -- '--port \K\S+' <<<"$args" || true)
      model=$(grep -oP -- '--model \K\S+' <<<"$args" || true); model="${model##*/}"
      [[ -z "$model" ]] && model=$(grep -oP -- '-hf \K\S+' <<<"$args" || true)
      printf "  :%-6s pid %-8s %s\n" "${port:-?}" "$pid" "${model:-?}"
      found=1
    done
    [[ "$found" -eq 0 ]] && echo "  (no llama-servers running)"
    # GB10 has unified memory; nvidia-smi often reports VRAM as [N/A], so fall back to system RAM
    mem=""
    if command -v nvidia-smi >/dev/null 2>&1; then
      mem=$(nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits 2>/dev/null \
            | awk -F', ' 'NR==1 && $1 ~ /^[0-9]+$/ {printf "gpu: %s / %s MiB used", $1, $2}')
    fi
    [[ -z "$mem" ]] && command -v free >/dev/null 2>&1 && \
      mem=$(free -m | awk '/^Mem:/ {printf "mem: %s / %s MiB used (unified)", $3, $2}')
    [[ -n "$mem" ]] && echo "  $mem"
    ;;

  wait)
    p="${1:-8080}"; t=0; max="${LLM_WAIT_TIMEOUT:-600}"
    [[ "$max" =~ ^[0-9]+$ ]] || max=600
    command -v curl >/dev/null 2>&1 || die "curl not installed (sudo apt install -y curl)"
    until curl -sf "http://localhost:$p/health" >/dev/null 2>&1; do
      sleep 2; t=$((t+2))
      [[ $t -ge $max ]] && die ":$p not ready after ${max}s — check 'llm log $p' (still downloading? crashed?)"
    done
    echo ":$p ready"
    ;;

  test)
    p=8080; [[ "${1:-}" =~ ^[0-9]+$ ]] && { p="$1"; shift; }
    command -v jq   >/dev/null 2>&1 || die "jq not installed (sudo apt install -y jq)"
    command -v curl >/dev/null 2>&1 || die "curl not installed (sudo apt install -y curl)"
    # prompt precedence: args > piped stdin > default
    if   [[ $# -gt 0 ]]; then prompt="$*"
    elif [[ ! -t 0 ]];   then prompt="$(cat)"
    else                      prompt="Say hello in one short sentence."; fi
    payload=$(jq -nc --arg c "$prompt" '{messages:[{role:"user",content:$c}],max_tokens:256}')
    resp=$(curl -s "http://localhost:$p/v1/chat/completions" -H "Content-Type: application/json" -d "$payload")
    [[ -z "$resp" ]] && die "no response from :$p — running? ('llm ps')   ready? ('llm wait $p')"
    echo "$resp" | jq -r '.choices[0].message.content // .error.message // .' 2>/dev/null || echo "$resp"
    ;;

  speed)
    p=8080; [[ "${1:-}" =~ ^[0-9]+$ ]] && { p="$1"; shift; }
    command -v jq   >/dev/null 2>&1 || die "jq not installed (sudo apt install -y jq)"
    command -v curl >/dev/null 2>&1 || die "curl not installed (sudo apt install -y curl)"
    # prompt precedence: args > piped stdin > default
    if   [[ $# -gt 0 ]]; then prompt="$*"
    elif [[ ! -t 0 ]];   then prompt="$(cat)"
    else                      prompt="Write a detailed paragraph about GPUs."; fi
    payload=$(jq -nc --arg c "$prompt" '{messages:[{role:"user",content:$c}],max_tokens:200}')
    resp=$(curl -s "http://localhost:$p/v1/chat/completions" -H "Content-Type: application/json" -d "$payload")
    [[ -z "$resp" ]] && die "no response from :$p — running? ('llm ps')   ready? ('llm wait $p')"
    echo "$resp" | jq -r 'if .timings then "prefill: \(.timings.prompt_per_second) tok/s   |   decode: \(.timings.predicted_per_second) tok/s" else (.error.message // "no timings in response") end' 2>/dev/null || echo "$resp"
    ;;

  ls)
    [[ -d "$MODELS" ]] || die "no models directory at $MODELS (set LLM_MODELS or create it)"
    echo "models in $MODELS:"
    n=0
    for d in "$MODELS"/*/;     do [[ -d "$d" ]] && { echo "  $(basename "$d")/"; n=1; }; done
    for g in "$MODELS"/*.gguf; do [[ -e "$g" ]] && { echo "  $(basename "$g")"; n=1; }; done
    [[ "$n" -eq 0 ]] && echo "  (empty)"
    ;;

  update)
    arch="${LLM_ARCH:-121a}"
    [[ -d "$HOME/llama.cpp/.git" ]] || die "no git repo at ~/llama.cpp — clone it first (see playbook)"
    cd "$HOME/llama.cpp" || die "cannot cd to ~/llama.cpp"
    git pull && \
    cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$arch" \
      -DLLAMA_OPENSSL=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release && \
    cmake --build build --config Release -j"$(nproc)" && \
    echo "updated -> $(git -C "$HOME/llama.cpp" log -1 --oneline)"
    ;;

  help|--help|-h|*)
    cat <<'EOF'
llm — manage llama.cpp servers on the DGX Spark

  llm run <name | repo:quant | file.gguf> [port] [extra llama-server flags]
  llm stop <port | all>
  llm ps                       running servers (port, pid, model) + memory
  llm ls                       list local models in $LLM_MODELS
  llm log  <port>              follow a server's log
  llm wait <port>              block until the server is ready (timeout LLM_WAIT_TIMEOUT, def 600s)
  llm test <port> [prompt...]  quick chat test (needs jq; prompt via args OR piped stdin)
  llm speed <port> [prompt...] report prefill / decode tok/s (needs jq; args OR stdin)
  llm update                   git pull + rebuild llama.cpp

the model argument resolves in this order:
  is a file  -> used as-is                     e.g. ~/models/gemma-12b/*Q8_0.gguf
  has a ':'  -> repo:quant, download from HF   e.g. unsloth/gemma-4-12b-it-GGUF:Q8_0
  bare name  -> looked up in $LLM_MODELS        e.g. gemma-12b -> ~/models/gemma-12b/*.gguf
                (skips mmproj/MTP, prefers first shard; errors if >1 quant present)
  has a '/' and no local match -> HF repo id

env overrides:  LLM_CTX=<n>   LLM_BIN=<path>   LLM_ARCH=<121|121a>   LLM_MODELS=<dir>   LLM_WAIT_TIMEOUT=<s>

examples:
  llm run gemma-12b 8080                              # resolves ~/models/gemma-12b/*.gguf
  llm run unsloth/gemma-4-12b-it-GGUF:Q8_0 8080 --alias gemma12b
  LLM_CTX=131072 llm run ~/models/gemma-12b/*Q8_0.gguf 8081
  llm ps
  llm test 8080 "what is a DGX Spark?"
  cat prompt.txt | llm test 8080                     # long / multi-line prompt via stdin
EOF
    ;;
esac
```

---

## Appendix B — shell-function alternative

If you prefer shell functions to the `llm` script in Appendix A, the same behavior can
live as `llmrun` / `llmstop` / `llmlog` / `llmps` / `llmwait` / `llmtest` / `llmupdate`
functions in `~/.llama-helpers.sh`, sourced from `~/.bashrc` with one line:

```bash
echo '[ -f ~/.llama-helpers.sh ] && . ~/.llama-helpers.sh' >> ~/.bashrc
```

Same flags and behavior. **Pick one approach** to avoid confusion — the `llm` script is
recommended (works in any shell, no sourcing, and what the rest of this guide assumes). A
stale copy of the function in `~/.bashrc` was the cause of several earlier surprises, so if
you go the function route, keep it in the single sourced file above and re-source after edits.


---
*Source: [https://vlaicu.io/posts/dgx-llamacpp-playbook/](https://vlaicu.io/posts/dgx-llamacpp-playbook/)*
