Enterprise AP Invoice-Approval Agent Spec

─ METHODS ─

Tools, agents, and models used on this project
TASK	AGENT / TOOL	MODEL / COST
research grounding	5-angle deep research (AP benchmarks, fraud data, vendor certs, SOC 2/SR 11-7, OWASP LLM Top 10)	research
spec design (escalation tree + trust boundary)	5-level decision tree, $5K-bounded blast radius	portfolio time
evaluation suite	14 cases (happy/edge/adversarial/boundary/precision), runnable stub + runner, bite + xfail	portfolio time
cost model	reproducible calculator, live June-2026 per-token pricing, 3 scenarios	portfolio time
build-vs-buy + governance	4-option scored memo + SOC 2 / SR 11-7 mapping	portfolio time
4Q writeup	EXPLANATION.md	portfolio time

TASK

AGENT / TOOL

MODEL / COST

research grounding

5-angle deep research (AP benchmarks, fraud data, vendor certs, SOC 2/SR 11-7, OWASP LLM Top 10)

research

spec design (escalation tree + trust boundary)

5-level decision tree, $5K-bounded blast radius

portfolio time

evaluation suite

14 cases (happy/edge/adversarial/boundary/precision), runnable stub + runner, bite + xfail

portfolio time

cost model

reproducible calculator, live June-2026 per-token pricing, 3 scenarios

portfolio time

build-vs-buy + governance

4-option scored memo + SOC 2 / SR 11-7 mapping

portfolio time

4Q writeup

EXPLANATION.md

portfolio time

─ EXPLANATION ─

The single highest-leverage artifact in the Enterprise AI PM skill set: the spec a senior PM produces before a money-moving agent gets built. The scenario is deliberately ordinary: a 200-person SaaS company processing 5,000 supplier invoices a month, auto-approve the clean ~95%, escalate the risky ~5% to the right human in under 30 seconds. The repo is the full document set: a ~4,000-word PRD with a five-level escalation decision tree and a trust boundary that caps any autonomous action’s blast radius at $5,000; a 14-case eval suite that runs against a stub and is built to fail a naive approve-all agent; a reproducible cost model showing hybrid model-routing at ~$27/month against ~$29,150/month of manual labor; a build-vs-buy memo with a defended recommendation; and a governance mapping to SOC 2 (CC6.1 / CC7.2 / CC8.1) and SR 11-7 model-risk expectations. Grounded in a five-angle deep-research synthesis; the discipline behind it (route the cheap work to the cheap tier, bound what the agent does alone, test it with evals that fail when it’s wrong, log every decision) is the discipline I run on my own autonomous agent fleet.

─ WHAT THIS DOESN'T YET DO ─

The shipped stub detects adversarial content by keyword matching; the eval suite includes a precision case it deliberately fails. A real deployment needs a proper injection classifier with its own eval.

The thresholds (the $5K cap, dollar bands, match tolerances) are practitioner judgment calls; the phased shadow-mode rollout exists partly to calibrate them on real volume before money moves.