Hermes Agent Operator's Manual

The Operator’s Manual for Hermes Agent: Building an AI assistant that can act, remember, and improve. Operator’s Manual · Edition 3.2 · Verified against official Nous Research documentation. About This Manual: This manual explains how to deploy and operate Hermes Agent as a persistent “operator” (an AI system that runs continuously, uses tools, remembers context across sessions, and improves over time) rather than as a single-session chatbot. It covers architecture, installation, the core mental model, day-to-day workflows, the operator loop, common failure modes, advanced configuration (including offline skill optimization with GEPA), and a distilled set of operational lessons. ...

May 24, 2026 · 47 min

DGX Spark + vLLM Playbook

A practical, end-to-end guide for serving LLMs with vLLM on a DGX Spark (GB10 Grace Blackwell), assembled from NVIDIA’s official playbook, the vLLM team’s deep-dive, model cards, and battle-tested community setups. ⚠️ Warning: Flag choices on Spark are model- and image-specific, not hardware-wide defaults. The recipes below are starting points that worked for their authors against a specific container tag. Validate against the exact image you run, and pin a known-good tag/digest for anything you depend on. Copying a flag from one model’s recipe to another can silently regress throughput or output quality. ...

June 20, 2026 · 65 min
TL;DR
  • Serving LLMs on a DGX Spark (GB10) is its own discipline: 128 GB of unified memory, an sm_121 GPU that breaks half the default FP4 kernels, and flags that help one model and wreck the next. This is the end-to-end playbook.
  • Right engine, right job: vLLM for concurrent serving, with the honest boundary on when llama.cpp, SGLang, or TensorRT-LLM beats it.
  • The hardware drives everything: decode is bandwidth-bound (small --max-num-seqs), NVFP4 MoE with a few-billion active params is the sweet spot, and on sm_121 FP4 MoE must run Marlin or you get garbage output (runs of !!!!!).
  • Per-model recipes that actually boot: Qwen3.6, Qwen3-Coder-Next, Qwen3.5-122B, Gemma-4 (31B/26B/coder), and Nemotron-3-Nano, each with the right container tag, quant flag, and the parser trio agents need.
  • Spark surprises you under load: single-stream looks modest, but aggregate throughput jumps from tens to hundreds of tok/s at concurrency; benchmark where you’ll actually run.
  • 2–3× single-stream for free: n-gram, MTP, and DFlash speculative decoding, easiest-first, with the sm_121 gotchas (Triton attention, BF16 KV, the images that make DFlash work).
  • Copy-paste ready: a measured prefill-vs-decode bench script (single-stream and aggregate), a full troubleshooting table, and the memory math behind “why is my RAM maxed?”

LLM Quantization

Quantization is the single most important technique for running large language models outside a datacenter. It is what turns a model that needs eight enterprise GPUs into one that runs on a gaming card, a laptop, or a Mac mini. But the moment you go to download a model, you are confronted with an intimidating wall of cryptic names — Q4_K_M, IQ3_XXS, UD-Q5_K_XL, GPTQ-Int4, AWQ, NF4, EXL3, NVFP4 — with little explanation of what they mean or which one you should pick. ...

June 18, 2026 · 27 min
TL;DR
  • Bits per weight is the master variable: it sets file size and is the main predictor of quality. Every method just spends a fixed bit budget well.
  • Quality has a knee around 4–5 bits — it collapses below and barely moves above. A good 4-bit quant is the default; 8-bit+ is usually wasted memory.
  • Which method you use (GGUF, GPTQ, AWQ, NF4, EXL3, FP8/FP4) is dictated by your runtime and hardware, not a universal best. Match the format to what your stack accelerates.
  • At a fixed memory budget, a bigger model quantized harder beats a smaller one quantized lightly. Go below 4-bit only with the methods built for it.
  • The KV cache is a separate knob, and for long contexts it can outweigh the weights. Quantize it too.
  • Perplexity hides the damage: quantization hits reasoning and code hardest. Judge a quant on tasks like your real workload, not one number.
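The size arithmetic behind the first and fifth bullets can be sketched in a few lines. This is a rough back-of-the-envelope estimate, not a measurement: the function names, the model shapes, and the bits-per-weight figures below are all hypothetical values chosen for illustration, and real files add overhead for metadata and quantization scales.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_size_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                     context_len: int, bits_per_value: float = 16.0) -> float:
    """Approximate KV cache size in GB for one sequence at full context."""
    # Factor of 2: one key vector and one value vector per token per layer.
    n_values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return n_values * bits_per_value / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical models: a larger one quantized harder, a smaller one lightly.
    print(f"70B @ 4.5 bpw: {weight_size_gb(70e9, 4.5):.1f} GB")
    print(f"34B @ 8.5 bpw: {weight_size_gb(34e9, 8.5):.1f} GB")

    # Hypothetical shape: 64 layers, 8 KV heads, head_dim 128, 128K context.
    print(f"KV cache, 16-bit: {kv_cache_size_gb(64, 8, 128, 131072):.1f} GB")
    print(f"KV cache,  8-bit: {kv_cache_size_gb(64, 8, 128, 131072, 8.0):.1f} GB")
```

Under these assumed shapes, the 16-bit KV cache at full context is in the same range as the 4.5-bit weights of the 70B model, which is the sense in which the cache can outweigh the weights at long context.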

DGX Spark + LlamaCPP Playbook

Complete Setup & Operations Guide. Everything needed to build, run, update, and operate local LLMs on an NVIDIA DGX Spark (GB10 / sm_121) with llama.cpp and the llm helper command. 1. How the pieces fit. The Spark (GB10): Blackwell GPU at compute capability 12.1 (sm_121), 128 GB unified LPDDR5x shared between CPU and GPU, ~273 GB/s memory bandwidth. Bandwidth is the bottleneck for token generation, so Mixture-of-Experts (MoE) models with few active parameters run far faster than dense models of the same total size. Prefer MoE. ...

June 17, 2026 · 28 min
TL;DR
  • Build llama.cpp for the GB10 (sm_121) with LLAMA_OPENSSL=ON and the 121a native-FP4 target.
  • Serve any GGUF model over an OpenAI-compatible API with one command: llm run <model> [port].
  • All the Spark tuning is baked in — --no-mmap, flash-attention, q8_0 KV cache, batch 2048, 20 threads.
  • 121a adds native FP4 (MXFP4/NVFP4) speedups; it’s neutral on standard quants like Q8_0 and Q4_K_M.
  • Prefer MoE models: the Spark is memory-bandwidth-bound, so low active-parameter models run fastest.
  • Manage everything with the llm helper: run, stop, ps, ls, wait, test, speed, log, update.
  • Wire Hermes or Open WebUI to http://<host>:<port>/v1; runnable = GGUF + supported arch + ≤ ~200B.
  • Includes the full llm script, a cheatsheet, and a troubleshooting table.
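The "prefer MoE" advice follows from a simple ceiling: if every generated token has to stream the active weights from memory once, decode speed cannot exceed bandwidth divided by the bytes read per token. The sketch below applies that to the ~273 GB/s figure quoted in the summary above. The function name and the two example models are hypothetical, and the estimate ignores KV cache reads, activations, and kernel overheads, so real throughput will be lower.

```python
def decode_tokens_per_s_upper_bound(active_params: float,
                                    bits_per_weight: float,
                                    bandwidth_gb_s: float = 273.0) -> float:
    """Rough ceiling on single-stream decode speed for a bandwidth-bound GPU.

    Assumes each token reads every active weight exactly once and nothing else.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Hypothetical dense model: all 70B parameters are active for every token.
    dense = decode_tokens_per_s_upper_bound(70e9, 4.5)
    # Hypothetical MoE model: only about 3B parameters are active per token.
    moe = decode_tokens_per_s_upper_bound(3e9, 4.5)
    print(f"dense 70B @ 4.5 bpw: at most ~{dense:.1f} tok/s")
    print(f"MoE 3B-active @ 4.5 bpw: at most ~{moe:.1f} tok/s")
```

The ratio between the two ceilings is just the ratio of active parameter counts, which is why a large MoE model with few active parameters can decode far faster than a dense model of the same total size.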

Claude Code Self Evolving

Most Claude Code setups are static. You write a CLAUDE.md, list your conventions, and hope Claude follows them. When it doesn’t, you correct it. Next session, it forgets. You correct it again. This guide builds something different: a system where every correction you make gets captured and logged, repeated corrections automatically become permanent rules, discovered patterns get verified before they’re trusted, and a periodic audit command decides what stays, what gets promoted, and what gets pruned. ...

April 1, 2026 · 33 min