INFRA SHIPPED

Enterprise AP Invoice-Approval Agent Spec

─ METHODS ─

Tools, agents, and models used on this project
TASK AGENT / TOOL MODEL / COST
research grounding 5-angle deep research (AP benchmarks, fraud data, vendor certs, SOC 2/SR 11-7, OWASP LLM Top 10) research
spec design (escalation tree + trust boundary) 5-level decision tree, $5K-bounded blast radius portfolio time
evaluation suite 14 cases (happy/edge/adversarial/boundary/precision), runnable stub + runner, bite + xfail portfolio time
cost model reproducible calculator, live June-2026 per-token pricing, 3 scenarios portfolio time
build-vs-buy + governance 4-option scored memo + SOC 2 / SR 11-7 mapping portfolio time
4Q writeup EXPLANATION.md portfolio time

─ EXPLANATION ─

The single highest-leverage artifact in the Enterprise AI PM skill set: the spec a senior PM produces before a money-moving agent gets built. The scenario is deliberately ordinary: a 200-person SaaS company processing 5,000 supplier invoices a month, auto-approve the clean ~95%, escalate the risky ~5% to the right human in under 30 seconds. The repo is the full document set: a ~4,000-word PRD with a five-level escalation decision tree and a trust boundary that caps any autonomous action’s blast radius at $5,000; a 14-case eval suite that runs against a stub and is built to fail a naive approve-all agent; a reproducible cost model showing hybrid model-routing at ~$27/month against ~$29,150/month of manual labor; a build-vs-buy memo with a defended recommendation; and a governance mapping to SOC 2 (CC6.1 / CC7.2 / CC8.1) and SR 11-7 model-risk expectations. Grounded in a five-angle deep-research synthesis; the discipline behind it (route the cheap work to the cheap tier, bound what the agent does alone, test it with evals that fail when it’s wrong, log every decision) is the discipline I run on my own autonomous agent fleet.

─ WHAT THIS DOESN'T YET DO ─

  • The shipped stub detects adversarial content by keyword matching; the eval suite includes a precision case it deliberately fails. A real deployment needs a proper injection classifier with its own eval.
  • The thresholds (the $5K cap, dollar bands, match tolerances) are practitioner judgment calls; the phased shadow-mode rollout exists partly to calibrate them on real volume before money moves.