Vault Synthesizer Eval Suite

Tools, agents, and models used on this project
TASK	AGENT / TOOL	MODEL / COST
eval harness	pytest + custom case loader	local / $0
LLM judge	Claude Sonnet 4.6	~$0.04 per case
case authoring	hand-curated reference cases	local / $0

What is this?

A 10-case binary pass/fail eval suite for the local Qwen3-14B vault synthesizer agent. The cases were derived from open-coding 17 days of production logs (2026-04-24 → 2026-05-10), not from imagined failure modes. The suite catches the failure class that production monitoring missed: silent regressions where the agent reports success while producing zero output.
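A minimal sketch of the kind of code-based check that catches this failure class, assuming each run exposes a reported status and the text it produced. The RunResult type, its field names, and grade_nonempty_success are hypothetical names for illustration, not the suite's actual code.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical shape of one synthesizer run; field names are assumptions."""
    status: str  # what the agent reports, e.g. "ok"
    output: str  # the synthesized text the run actually produced


def grade_nonempty_success(result: RunResult) -> bool:
    """Binary grader: fail only when the run claims success but produced nothing."""
    claims_success = result.status.strip().lower() == "ok"
    has_output = bool(result.output.strip())
    # A run that honestly reports failure is not a silent regression,
    # so this particular check lets it through for other graders to judge.
    return has_output or not claims_success
```

Whitespace-only output counts as empty here, so a run that writes only a newline and reports "ok" still fails.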

Why this approach?

Pytest + YAML over Braintrust or Langfuse: at ten cases, platform infrastructure is pure overhead. Code-based and rubric graders run before any LLM-as-judge, per Hamel’s cost-economics rule. Binary pass/fail over Likert, per the Husain-Shankar canon: with only ten cases, Likert scales destroy inter-rater reliability. The case library is grounded in real-log error analysis, not synthetic generation: the failures the suite guards against actually happened.
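The grader ordering can be sketched as a short-circuit: cheap deterministic checks run first, and a judge is consulted only if all of them pass. This is an illustration under assumptions, not the suite's loader; grade_case and the Grader alias are hypothetical names, and the judge is any callable returning a bool.

```python
from typing import Callable, Optional

# A grader takes the agent's output and returns a binary verdict.
Grader = Callable[[str], bool]


def grade_case(
    output: str,
    code_checks: list[Grader],
    llm_judge: Optional[Grader] = None,
) -> bool:
    """Return pass/fail, calling the (costly) judge only when code checks pass."""
    # all() stops at the first failing check, so later checks are not run either.
    if not all(check(output) for check in code_checks):
        return False
    if llm_judge is None:
        return True
    return llm_judge(output)
```

With this ordering, a case that fails a code-based check never incurs a judge call.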

What would break?

Three failure modes. Synthesizer prompt drift: a structural change to the underlying synthesis prompt requires re-baselining the whole suite. Mock-input fixtures going stale as the vault evolves: they need a quarterly refresh. And the judge boundary: no active case uses an LLM judge at v1, so if a case is ever promoted to LLM-judge, the model ID has to be pinned explicitly in the case YAML and a --skip-llm-judge flag added for offline runs, or the suite stops being reproducible.
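A sketch of how the judge boundary could be enforced in code, assuming cases load into plain dicts. The keys "id", "grader", and "judge_model" and both function names are hypothetical; the real case YAML schema may differ.

```python
def validate_case(case: dict) -> None:
    """Raise if an LLM-judge case does not pin its model ID explicitly."""
    if case.get("grader") == "llm_judge" and not case.get("judge_model"):
        case_id = case.get("id", "?")
        raise ValueError(f"case {case_id!r}: llm_judge cases must pin judge_model")


def should_run(case: dict, skip_llm_judge: bool) -> bool:
    """Offline runs (skip_llm_judge=True) drop LLM-judge cases and keep the rest."""
    return not (skip_llm_judge and case.get("grader") == "llm_judge")
```

In pytest, a command-line flag such as --skip-llm-judge is typically registered through the pytest_addoption hook in conftest.py, and its value could then be passed to a filter like should_run.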

What did I learn?

That evals aren’t really about hallucinations. The failure modes I imagined (hallucinated phase numbers, relation-tag drift, temporal confusion) were the easy cases. The hard case was the one nobody drafts on purpose: the status field that says “ok” while the output is empty. Three layers of monitoring agreed everything was fine while the system underneath them rotted silently for nine days. Error analysis surfaces the failures imagination does not.
