AI | Flaviu Vlaicu 👾

Hermes Agent Operator's Manual

The Operator’s Manual for Hermes Agent: Building an AI assistant that can act, remember, and improve. Operator’s Manual · Edition 3.2 · Verified against official Nous Research documentation. About This Manual: This manual explains how to deploy and operate Hermes Agent as a persistent “operator” (an AI system that runs continuously, uses tools, remembers context across sessions, and improves over time) rather than as a single-session chatbot. It covers architecture, installation, the core mental model, day-to-day workflows, the operator loop, common failure modes, advanced configuration (including offline skill optimization with GEPA), and a distilled set of operational lessons. ...

DGX Spark + vLLM Playbook

A practical, end-to-end guide for serving LLMs with vLLM on a DGX Spark (GB10 Grace Blackwell), assembled from NVIDIA’s official playbook, the vLLM team’s deep-dive, model cards, and battle-tested community setups. ⚠️ Warning: Flag choices on Spark are model- and image-specific, not hardware-wide defaults. The recipes below are starting points that worked for their authors against a specific container tag. Validate against the exact image you run, and pin a known-good tag/digest for anything you depend on. Copying a flag from one model’s recipe to another can silently regress throughput or output quality. ...

LLM Quantization

Quantization is the single most important technique for running large language models outside a datacenter. It is what turns a model that needs eight enterprise GPUs into one that runs on a gaming card, a laptop, or a Mac mini. But the moment you go to download a model, you are confronted with an intimidating wall of cryptic names — Q4_K_M, IQ3_XXS, UD-Q5_K_XL, GPTQ-Int4, AWQ, NF4, EXL3, NVFP4 — with little explanation of what they mean or which one you should pick. ...

DGX Spark + LlamaCPP Playbook

Complete Setup & Operations Guide. Everything needed to build, run, update, and operate local LLMs on an NVIDIA DGX Spark (GB10 / sm_121) with llama.cpp and the llm helper command. 1. How the pieces fit. The Spark (GB10). Blackwell GPU at compute capability 12.1 (sm_121), 128 GB unified LPDDR5x shared between CPU and GPU, ~273 GB/s memory bandwidth. Bandwidth is the bottleneck for token generation, so Mixture-of-Experts (MoE) models with few active parameters run far faster than dense models of the same total size. Prefer MoE. ...
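
The bandwidth argument in the excerpt above can be sketched as back-of-the-envelope arithmetic. This is a rough illustration, not a benchmark: it assumes each generated token requires reading every active weight from memory exactly once, and it ignores KV-cache traffic, compute time, and kernel overhead, so real throughput will be lower. The ~273 GB/s figure comes from the excerpt; the two model sizes and the 4-bit (0.5 bytes per parameter) weight format are hypothetical values chosen only for illustration.

```python
def decode_tokens_per_second_ceiling(active_params_billion: float,
                                     bytes_per_param: float,
                                     bandwidth_gb_s: float = 273.0) -> float:
    """Rough upper bound on decode speed when memory bandwidth is the limit.

    Assumption: every active weight is read once per generated token.
    """
    # Billions of parameters * bytes per parameter = gigabytes read per token.
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Hypothetical dense model: 70B parameters, all active, 4-bit weights.
dense_ceiling = decode_tokens_per_second_ceiling(70, 0.5)

# Hypothetical MoE model: only 3B parameters active per token, 4-bit weights.
moe_ceiling = decode_tokens_per_second_ceiling(3, 0.5)

print(f"dense ceiling: {dense_ceiling:.1f} tok/s")
print(f"MoE ceiling:   {moe_ceiling:.1f} tok/s")
print(f"ratio:         {moe_ceiling / dense_ceiling:.1f}x")
```

Under these assumptions the ceiling scales inversely with the number of active parameters, which is why the excerpt recommends MoE models on bandwidth-limited hardware.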

Claude Code Self Evolving

Most Claude Code setups are static. You write a CLAUDE.md, list your conventions, and hope Claude follows them. When it doesn’t, you correct it. Next session, it forgets. You correct it again. This guide builds something different: a system where every correction you make gets captured and logged, repeated corrections automatically become permanent rules, discovered patterns get verified before they’re trusted, and a periodic audit command decides what stays, what gets promoted, and what gets pruned. ...