LDR Grounding-Collapse Post-Mortem

Tools, agents, and models used on this project
TASK	AGENT / TOOL	MODEL / COST
forensic preservation	superseded LDR fixture (Qwen3-14B + SearXNG, Mac Mini)	$0 (local)
control experiment	same prompt re-run on Gemini Deep Research	research / $2.80
fix as policy	CLAUDE.md research-routing rule (compound to cloud, single-shape local)	portfolio time
regression eval	eval-case.yaml: fails on bad fixture, passes on grounded output	portfolio time

On May 5th one of my fleet agents produced a polished, scored comparison table of the MCP tooling ecosystem, with a ranking, a references section, and ten citations. About a third of it was invented: a tool called PureMCPClient ranked #3, a fabricated MCPCatalog (Central), Google’s ADK miscast as an MCP SDK at a Microsoft Azure URL, and the home of MCP declared to be github.com/microsoft/mcp (it’s github.com/modelcontextprotocol). It didn’t error or time out; it finished in 280 seconds of its 900-second budget and reported nothing wrong. This is the post-mortem: the failure preserved verbatim, the diagnosis against a control run of the same prompt that grounded correctly on Gemini Deep Research, the routing rule that now lives in the fleet’s CLAUDE.md, and an eval that fails on the bad fixture and passes on the grounded one.

What is this?

A standalone forensic write-up of a silent agent failure. It keeps the bad output as a specimen, annotates every fabrication, and shows the control experiment that proved the failure was about the router, not the question: the identical prompt grounded cleanly on Gemini Deep Research the next day, naming real maintainers and flagging its own fragile claims before I asked. The audience is anyone evaluating whether I can catch a confidently-wrong agent and harden the system so it doesn’t recur.

Why this approach?

Three options after the failure: swap in a bigger local model (rejected: doesn’t address grounding-under-width, and kills the $0 economics); route all research to the cloud (rejected: most research is single-shape and grounds fine locally; this pays $2.80 to dodge a failure that only hits compound prompts); or build a routing boundary keyed on prompt shape (chosen). Compound research (three or more sub-questions, due-diligence matrices, “evaluate N things on M dimensions”) goes to a grounded cloud researcher; single-shape research stays local at $0. The boundary is encoded as policy so the system can’t drift back to “try local first, it’s free”, the default that was the bug.
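The routing boundary lives in CLAUDE.md as written policy, not code, but its shape can be sketched in a few lines of Python. Everything below is illustrative: the function names, the regex patterns, and the use of question marks as a proxy for sub-questions are assumptions, not the actual rule. Only the three-or-more threshold and the "evaluate N things on M dimensions" shape come from the description above.

```python
import re

# Threshold taken from the prose: three or more sub-questions counts as compound.
COMPOUND_SUBQUESTION_THRESHOLD = 3

# Hypothetical surface patterns for matrix-style prompts
# ("evaluate N things on M dimensions") and due-diligence requests.
MATRIX_PATTERN = re.compile(
    r"\b(?:evaluate|compare|rank|score)\b.+\b(?:on|across|against)\b",
    re.IGNORECASE | re.DOTALL,
)
DUE_DILIGENCE_PATTERN = re.compile(r"\bdue[- ]diligence\b", re.IGNORECASE)


def count_sub_questions(prompt: str) -> int:
    """Crude proxy (an assumption): one sub-question per question mark."""
    return prompt.count("?")


def route(prompt: str) -> str:
    """Return "cloud" for compound prompts and "local" for single-shape ones."""
    if count_sub_questions(prompt) >= COMPOUND_SUBQUESTION_THRESHOLD:
        return "cloud"
    if MATRIX_PATTERN.search(prompt) or DUE_DILIGENCE_PATTERN.search(prompt):
        return "cloud"
    # Default stays local and free, but only after the compound checks have run.
    return "local"
```

Under these assumptions, a prompt like "Evaluate five MCP clients on maturity, licensing, and maintainer activity." routes to the cloud, while "Who maintains SearXNG? What license is it under?" stays local. The point of the ordering is that "local" is only reachable once the compound checks have been ruled out, so the cheap path is never the first thing tried.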

What would break?

The eval is the load-bearing safeguard, and it has two soft spots. The structural assertion (every numbered citation resolving to exactly one URL) needs a runner that parses [n] markers; the token-level assertions (no invented entities, no fabricated URLs, correct provenance anchor, no leaked self-talk) run today, but the structural check is documented ahead of its runner. And the eval encodes this specimen; a new fabrication shape (a plausible-but-wrong maintainer name, say) would pass it. The general defense is the routing rule; the eval is the regression net for the one failure I’ve actually seen.
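As a rough illustration of the two kinds of assertion described above, here is a minimal Python sketch. It assumes a report layout that the write-up does not specify: body text carrying [n] markers, a line reading "References", then one "[n] ... URL" entry per line. The function names and the forbidden-token list are hypothetical; the two tokens are taken from the fabrications named in this post-mortem, and none of this reflects the real eval-case.yaml schema.

```python
import re

CITATION_MARKER = re.compile(r"\[(\d+)\]")
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")

# Illustrative token-level list, drawn from the fabrications in the specimen.
FORBIDDEN_TOKENS = ("PureMCPClient", "github.com/microsoft/mcp")


def citation_problems(report: str) -> list[str]:
    """Structural check: every [n] cited in the body resolves to exactly one URL."""
    # Assumed layout: everything after a "References" line is the reference list.
    body, _, references = report.partition("\nReferences\n")
    cited = {int(n) for n in CITATION_MARKER.findall(body)}

    urls_by_number: dict[int, list[str]] = {}
    for line in references.splitlines():
        marker = CITATION_MARKER.match(line.strip())
        if marker:
            number = int(marker.group(1))
            urls_by_number.setdefault(number, []).extend(URL_PATTERN.findall(line))

    problems = []
    for number in sorted(cited):
        count = len(urls_by_number.get(number, []))
        if count != 1:
            problems.append(f"[{number}] resolves to {count} URLs, expected exactly 1")
    return problems


def forbidden_hits(report: str, forbidden=FORBIDDEN_TOKENS) -> list[str]:
    """Token-level check: list every known-fabricated token present in the report."""
    lowered = report.lower()
    return [token for token in forbidden if token.lower() in lowered]
```

A report passes this sketch when both functions return empty lists. The limitation named above still applies: the token list only knows the fabrications already seen, so a new invented name passes `forbidden_hits` untouched, and `citation_problems` checks that a URL exists, not that it is real.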

What did I learn?

The dangerous agent output isn’t the one that errors loudly; it’s the one that’s confidently, plausibly wrong and looks more rigorous than the truth. The grounded re-run’s real tell wasn’t that it got more cells right; it’s that it modeled its own uncertainty and flagged which claims to verify. The discipline that matters in agent products isn’t “use the best model”; it’s knowing what each tier can be trusted with, and building the boundary so the system can’t take the cheap path on work the cheap path can’t do. Catch the silent failure, name it, encode the fix as policy, write the test that proves it. That’s the loop.
