Sean Winslow — Transactions

Discovery PRD — AI-Assisted Article Drafting Workflow

Tue, 02 Jun 2026 12:00:00 GMT

The skill this artifact exists to prove is cross-functional translation, which the 2026-05-18 DR-Max research flags as the single most-cited competency across Tier-1 AI PM job descriptions (90%). It's hard to claim and easy to show, so the PRD is built backward from the way these pitches actually die: an engineer explains RAG to a lawyer, the lawyer hears "the machine writes things and we hope they're true," and the meeting is over. The center of gravity is five distinctly-voiced stakeholders and the exact translation move made for each. Everything else (problem statement, six user stories, adoption-funnel metrics, a 90-day Klarna-citing rollout) exists to frame them. Stress-tested through the premium LLM Council; the convergent fix is folded in: keep the human translation, remove the deterministic overclaims (citation ≠ factuality, the SEO lead made genuinely SEO-specific, training-data scope stated honestly).

## What is this?

A discovery-phase PRD for an AI article-drafting and editorial-review workflow at a generic ~50-person content organization. The load-bearing section voices the editor, content strategist, SEO lead, legal counsel, and executive sponsor, and translates embeddings, RAG, hallucination rates, and eval metrics into each one's vocabulary. The audience is a hiring manager checking whether I can translate across a skeptical org, not whether I can recite the technical terms.

## Why this approach?

Three options: an abstract essay on stakeholder alignment (rejected: proves nothing), a generic PRD template (rejected: every PM has one), or a PRD whose weight sits on five personas a reader can tell apart blind, each paired with the specific concept I had to re-root in their language (chosen). The translations are the deliverable. The problem statement is framed as an outcome (cut first-draft cycle time from ~4 days to under 8 hours without sacrificing brand voice), never as "adopt AI."

## What would break?

Three failure modes. **Persona collapse**: if the five voices blur together, the artifact fails its one job, so each quote and translation is tuned to be unmistakable. **Vanity metrics**: success is an adoption funnel (adoption rate, fallback-to-human rate, Time-to-Trust), not output volume or CTR; a metric that rewards shipping more drafts would re-introduce the exact failure the rollout guards against. **Rollout amnesia**: a plan that expands on the calendar instead of on metrics becomes Klarna, which walked back its AI support when complex cases degraded CSAT. The standing rule is that the expansion gate is always a metric, never a date.

## What did I learn?

Translation isn't dumbing the concept down; it's re-rooting it in the listener's accountability. "Hallucination" becomes a ranking risk to the SEO lead, a liability with a chain of custody to the lawyer, and a weekly tripwire to the executive: same fact, three different promises. And half the skill is knowing what not to say, like never raising token economics with the editor. I ran this discovery informally for the better part of a decade across two non-AI orgs. Writing it with named accountability is the difference the title would have made.

MCP Security Audit

Mon, 01 Jun 2026 12:00:00 GMT

Three weeks after I published `@swins/intent-engineering-mcp`, I ran a security audit on my own server, and the finding wasn't the textbook MCP threat. The interesting failure was concrete: two of the three tools accepted an unconstrained `file_path` and read it straight off disk, so `audit_intent_spec({file_path: "/etc/passwd"})` (or a `.md` symlink pointing at `~/.ssh/id_rsa`) would hand file contents back to the model. v0.1.1 routes every disk read through one guard (realpath symlink resolution, a 1 MiB cap, an extension allowlist, optional root confinement) and logs every read. The threat model in `SECURITY.md` leads with that, and deliberately defers OAuth and sandboxing, with reasons.

## What is this?

A self-audit of a published MCP server, shipped as code (`v0.1.1`) plus a `SECURITY.md` threat model. It names the real attack surface (arbitrary local-file read via a confused-deputy tool call), applies a single hardening guard across both file-reading tools, adds zero-dependency regression tests that fail on the unpatched code, and documents which standard defenses don't apply to a stdio pure-function server. Audience: anyone asking whether I can secure the things I publish, not just publish them.

## Why this approach?

The roadmap checklist said "add input validation." But Zod was already validating every input at the boundary, so claiming I added it would be false. So the work was *tightening*, not adding: `.strict()` schemas plus a `loadFileSafely` guard for the one surface that was actually exposed. And I deferred OAuth 2.1/PKCE and sandboxed execution on purpose: this is a stdio, pure-function server with no network-auth surface and no exec path, so applying them would be cargo-culting. Scoping defenses to the surface is the judgment a security review is supposed to demonstrate; running every item on a generic checklist is the opposite of it.

## What would break?

The guard is realpath-based, so the symlink-escape vector is closed at the resolved path, not the supplied string. But root confinement is opt-in, so on a shared machine without `INTENT_ENGINEERING_ALLOWED_ROOT` set, any readable `.md`/`.yaml` is still in scope (by design: zero-config install over lockdown). The audit log is local and plaintext, so it's evidence for the operator, not a tamper-proof control. And the MCP trust boundary still holds: a genuinely malicious *client* is out of scope. These defenses are against content flowing *through* a trusted one.

## What did I learn?

Securing your own published artifact is a different muscle than building it. The credibility move was correcting the checklist against the real code, and correcting the research it came from: the source doc attributed the EchoLeak CVE to "the Anthropic MCP server," but CVE-2025-32711 is a Microsoft 365 Copilot vulnerability (Aim Labs). Catching a widely-repeated wrong attribution, and scoping defenses to the surface that's actually exposed, is the same instinct in both directions: read your own work, and your own sources, adversarially.
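
The guard described in this entry (realpath resolution, a 1 MiB cap, an extension allowlist, optional root confinement, a log line per read) can be sketched in Python. This is an illustration of the technique, not the server's actual TypeScript `loadFileSafely`: the function name `load_file_safely`, the `UnsafePathError` class, the logger name, and the exact allowlist (`.md`, `.yaml`, inferred from the text) are all assumptions.

```python
import logging
import os
from typing import Optional

MAX_BYTES = 1024 * 1024                 # 1 MiB cap, as named in the write-up
ALLOWED_EXTENSIONS = {".md", ".yaml"}   # assumed allowlist
log = logging.getLogger("file_guard")   # hypothetical logger name


class UnsafePathError(ValueError):
    """Raised when a requested path fails any guard check."""


def load_file_safely(file_path: str, allowed_root: Optional[str] = None) -> str:
    # Resolve symlinks first so every later check sees the real target.
    real = os.path.realpath(file_path)

    # The extension is checked on the resolved path, not the supplied string.
    if os.path.splitext(real)[1].lower() not in ALLOWED_EXTENSIONS:
        raise UnsafePathError(f"extension not allowed: {real}")

    # Optional root confinement: the resolved path must sit under the root.
    if allowed_root is not None:
        root = os.path.realpath(allowed_root)
        if os.path.commonpath([root, real]) != root:
            raise UnsafePathError(f"outside allowed root: {real}")

    if not os.path.isfile(real):
        raise UnsafePathError(f"not a regular file: {real}")
    if os.path.getsize(real) > MAX_BYTES:
        raise UnsafePathError(f"larger than {MAX_BYTES} bytes: {real}")

    # Every successful read leaves a local, plaintext trace for the operator.
    log.info("read %s (requested as %s)", real, file_path)
    with open(real, "r", encoding="utf-8") as handle:
        return handle.read()
```

A caller could pass `os.environ.get("INTENT_ENGINEERING_ALLOWED_ROOT")` as `allowed_root`, which keeps confinement opt-in: when the variable is unset the argument is `None` and only the extension, size, and symlink checks apply.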

Code-Brain System Card

Sun, 31 May 2026 12:00:00 GMT

The credibility move that made the Enterprise Data Readiness Matrix land was applying a framework to a system I actually operate instead of reciting it. This does the same for regulatory accountability: it cards the real Code-Brain fleet (~12 live agents plus a published MCP server and a control-plane Judge Layer) against SR-11-7 and the EU AI Act. The load-bearing section isn't the tiering; it's the applicability determination up front that rules most of the regulation *out* (Code-Brain is minimal-risk, not high-risk, so Annex IV and Articles 13/72 don't apply), then models the discipline voluntarily. Stress-tested through the premium LLM Council; the convergent fixes (correct EU AI Act scope, vendor-model risk that can't be architected away, inherent-vs-residual tiering) are folded in.

## What is this?

A governance accounting of my autonomous agent fleet, mapped to SR-11-7 (Fed model-risk management) and the EU AI Act. It tiers each live component by materiality, documents validation evidence and the human-override path, and names every place the system would not pass if it were regulated. The audience is a model-risk officer or regulated-SaaS hiring manager who wants to see whether I can scope a regulation, not just recite one.

## Why this approach?

Three options: write an abstract explainer (rejected: proves no judgment); claim conformance (rejected: the high-risk obligations don't legally apply, so "partial compliance" is a category error); or apply the frameworks to a system I operate, lead with a scope determination that rules most of the regulation out, then model the discipline and name the gaps (chosen). Correctly scoping a law you don't have to follow signals more than performing compliance with one you've misread.

## What would break?

Three failure modes. **Over-claiming**: any EU AI Act cell that reads "Partial / Substantially present" instead of "inapplicable; modeled voluntarily" has re-acquired the category error the scope section exists to prevent (the first draft had exactly this; the Council caught it). **Inventory drift**: the `status` column must match `agents-sdk/config.toml` enable flags, or the tiering lies the moment an agent is toggled. **The "no training data" erasure**: framing inherited vendor-model risk as "N/A by architecture" erases an SR-11-7 obligation rather than satisfying it.

## What did I learn?

The hardest part of regulatory fluency isn't knowing what a regulation requires; it's knowing when it doesn't apply, and leading with that. A four-model adversarial review turned the artifact inside out: the impressive move is ruling the regimes out correctly, then modeling them anyway. That's the instinct a fintech or regulated-SaaS PM needs in week one.
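
The inventory-drift failure mode lends itself to a mechanical check. The sketch below assumes the card's `status` column and the config's enable flags have already been read into two dictionaries keyed by agent name; the function name, the `"live"` status value, and the agent names in the usage example are hypothetical, and parsing `agents-sdk/config.toml` itself is left out.

```python
from typing import Dict, List


def inventory_drift(card_status: Dict[str, str], enabled: Dict[str, bool]) -> List[str]:
    """Return human-readable mismatches between the card and the config."""
    problems = []
    for agent in sorted(set(card_status) | set(enabled)):
        if agent not in card_status:
            problems.append(f"{agent}: in config, missing from card")
        elif agent not in enabled:
            problems.append(f"{agent}: on card, missing from config")
        elif (card_status[agent] == "live") != enabled[agent]:
            # The card claims one state while the enable flag says another.
            problems.append(
                f"{agent}: card says {card_status[agent]!r}, "
                f"config enabled={enabled[agent]}"
            )
    return problems


# Illustrative data only; these are not the real fleet's agents.
card = {"drafter": "live", "linter": "disabled", "critic": "live"}
flags = {"drafter": True, "linter": True, "scout": False}
for line in inventory_drift(card, flags):
    print(line)
```

An empty list means the card and the flags agree; anything else is a reason to regenerate the card before trusting the tiering.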

LDR Grounding-Collapse Post-Mortem

Sun, 31 May 2026 12:00:00 GMT

On May 5th one of my fleet agents produced a polished, scored comparison table of the MCP tooling ecosystem, with a ranking, a references section, and ten citations. About a third of it was invented: a tool called `PureMCPClient` ranked #3, a fabricated `MCPCatalog (Central)`, Google's ADK miscast as an MCP SDK at a Microsoft Azure URL, and the home of MCP declared to be `github.com/microsoft/mcp` (it's `modelcontextprotocol`). It didn't error or time out; it finished in 280 seconds of its 900-second budget and reported nothing wrong. This is the post-mortem: the failure preserved verbatim, the diagnosis against the same prompt re-run correctly on Gemini DR, the routing rule that now lives in the fleet's `CLAUDE.md`, and an eval that fails on the bad fixture and passes on the grounded one.

## What is this?

A standalone forensic write-up of a silent agent failure. It keeps the bad output as a specimen, annotates every fabrication, and shows the control experiment that proved the failure was about the *router*, not the *question*: the identical prompt grounded cleanly on Gemini Deep Research the next day, naming real maintainers and flagging its own fragile claims before I asked. The audience is anyone evaluating whether I can catch a confidently-wrong agent and harden the system so it doesn't recur.

## Why this approach?

Three options after the failure: swap in a bigger local model (rejected: doesn't address grounding-under-width, and kills the $0 economics); route all research to the cloud (rejected: most research is single-shape and grounds fine locally; this pays $2.80 to dodge a failure that only hits compound prompts); or build a routing boundary keyed on prompt shape (chosen). Compound research (three or more sub-questions, due-diligence matrices, "evaluate N things on M dimensions") goes to a grounded cloud researcher; single-shape research stays local at $0. The boundary is encoded as policy so the system can't drift back to "try local first, it's free", the default that *was* the bug.

## What would break?

The eval is the load-bearing safeguard, and it has two soft spots. The structural assertion (every numbered citation resolving to exactly one URL) needs a runner that parses `[n]` markers; the token-level assertions (no invented entities, no fabricated URLs, correct provenance anchor, no leaked self-talk) run today, but the structural check is documented ahead of its runner. And the eval encodes *this* specimen; a new fabrication shape (a plausible-but-wrong maintainer name, say) would pass it. The general defense is the routing rule; the eval is the regression net for the one failure I've actually seen.

## What did I learn?

The dangerous agent output isn't the one that errors loudly; it's the one that's confidently, plausibly wrong and *looks* more rigorous than the truth. The grounded re-run's real tell wasn't that it got more cells right; it's that it modeled its own uncertainty and flagged which claims to verify. The discipline that matters in agent products isn't "use the best model"; it's knowing what each tier can be trusted with, and building the boundary so the system can't take the cheap path on work the cheap path can't do. Catch the silent failure, name it, encode the fix as policy, write the test that proves it. That's the loop.
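
The structural assertion (every numbered citation resolving to exactly one URL) is the piece described as waiting on a runner. One possible shape for that runner is sketched below. It assumes the references section lists one entry per line starting with its `[n]` marker, which is a guess about the report layout; the function name and the sample text are hypothetical.

```python
import re
from typing import Dict, List

MARKER = re.compile(r"\[(\d+)\]")
URL = re.compile(r"https?://\S+")


def citation_problems(body: str, references: str) -> List[str]:
    """Check that every [n] used in the body resolves to exactly one URL."""
    urls_by_number: Dict[int, List[str]] = {}
    for line in references.splitlines():
        marker = MARKER.match(line.strip())
        if marker:
            # Collect every URL found on the line that defines this number.
            urls_by_number.setdefault(int(marker.group(1)), []).extend(
                URL.findall(line)
            )

    problems = []
    for number in sorted({int(n) for n in MARKER.findall(body)}):
        urls = urls_by_number.get(number, [])
        if len(urls) != 1:
            problems.append(f"[{number}] resolves to {len(urls)} URLs")
    return problems


# Illustrative fixture, not the preserved specimen.
body = "Tool A is maintained upstream [1]. Tool B ranks third [2]. See also [3]."
refs = "[1] https://example.com/tool-a\n[2] Tool B homepage (no link)\n"
print(citation_problems(body, refs))
```

A check like this only proves that each marker points somewhere, not that the target is real or says what the report claims, so it complements the token-level assertions rather than replacing them.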

Vault as Agent Infrastructure: 5-Test Scorecard

Sun, 31 May 2026 12:00:00 GMT

## What is this?

A five-test scoreboard for agent infrastructure. Nate Jones published five structural tests (persistent state, defined verbs, ownership, permissions, queryable audit history) and this scores four knowledge systems against them: Notion, default Obsidian, Linear, and my vault. The verdict is three passes and two honest losses, every cell backed by live telemetry from `vault/.vault-index.db` (632 typed edges, six SQL-enforced relations, 15,582 indexed chunks).

## Why this approach?

Nate's framing is the recruiter vocabulary for agent-infrastructure work right now, so scoring against it beats inventing my own rubric: it's citing a standard instead of grading my own homework. It also makes the next build, `vault-knowledge-mcp`, self-justifying: the MCP ships as "the wrapper around the only public PM vault that passes the agent-infrastructure tests." The two losses to Linear are deliberate. Linear's RBAC and record-ownership model genuinely beat POSIX file permissions, and naming that is what calibrates the whole artifact's credibility.

## What would break?

The numbers are a snapshot, so they go stale as the synthesizer runs nightly: they already moved from 478 edges to 632 in nine days. That's mitigated by making them regenerable on demand from the live schema rather than hand-maintained. The deeper risk is reading a scorecard I authored about my own system as objective truth; the Linear-wins-here callouts are the structural check against that.

## What did I learn?

Most "agent infrastructure" claims fail the persistent-state test on day one: ask whether the structured state survives a crash with no recovery procedure, and most setups turn out to be tool integrations with session-scoped memory. The test discriminates. And the most useful losses to name are the ones whose fix already exists: the closer for the permissions gap (the Judge Layer, a control-plane interceptor with an append-only decision ledger) was already built, which turned that "failure" from a someday into a rollout.

Enterprise Data Readiness Matrix

Fri, 29 May 2026 12:00:00 GMT

The defining barrier to production AI in 2026 isn't model intelligence; it's data readiness. This rubric is the diagnostic an AI platform PM runs on a customer's data layer *before* a launch date goes on a slide: five dimensions (canonical entity IDs, lineage/provenance, freshness, governance/eligibility, dedup/embedding hygiene), each scored Green/Yellow/Red, with a floor rule (the weakest dimension sets the deployment posture). It ships with a worked example: the matrix applied end-to-end to a fictional Fortune 500 publisher, scored, with a dated 90-day Red→Yellow→Green remediation plan. Grounded in a 4-panel deep-research read of mid-2026 enterprise AI PM hiring criteria; the Green-state examples come from the five readiness problems I had to solve on my own agent fleet's knowledge base before it would produce citable output.
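
The floor rule is simple enough to state as code. The sketch below assumes an ordering Red < Yellow < Green and takes the minimum across dimensions; the dimension keys follow the five named above, while the scores in the example are made up for illustration and are not the worked example's results.

```python
from enum import IntEnum
from typing import Dict


class Score(IntEnum):
    RED = 0
    YELLOW = 1
    GREEN = 2


def deployment_posture(scores: Dict[str, Score]) -> Score:
    """Floor rule: the weakest dimension sets the overall posture."""
    if not scores:
        raise ValueError("no dimensions scored")
    return min(scores.values())


# Illustrative scores only.
example = {
    "canonical_entity_ids": Score.GREEN,
    "lineage_provenance": Score.YELLOW,
    "freshness": Score.GREEN,
    "governance_eligibility": Score.RED,
    "dedup_embedding_hygiene": Score.YELLOW,
}
print(deployment_posture(example).name)  # prints RED
```

Taking the minimum rather than an average is the point of the rule: four Green dimensions cannot buy back one Red.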

Vault Knowledge MCP: Design Lock

Thu, 21 May 2026 12:00:00 GMT

## What is this?

A Model Context Protocol (MCP) server that exposes Sean's vault as a queryable agent surface. Design-locked May 21, 2026; ship-target ~June 4. The architecture argument lives at `/architecture/vault-scorecard/` (Phase 3c.2). This row tracks the *shipped artifact* (the MCP transport + tool surface), not the thesis.

## Why this approach?

The vault is the only PM vault that passes the agent-infrastructure tests. The MCP wraps it. A recruiter from Glean lands on `/architecture/vault-scorecard/` → reads the scoreboard → clicks through to the ledger row → sees the ship date. The two surfaces close the loop: architecture argues, ledger ships.

## What would break?

If the MCP HTTP transport spec revs between design-lock and ship, the design re-opens. If `concept_edges` hits the 47-edge production limit (likely), the read-layer paginates or the vault sharding strategy lands first.

## What did I learn?

Design-lock is the artifact, not the code. Locking the design on May 21 with a June 4 ship target gives 14 days of runway: enough for a real implementation cycle without scope drift. The pattern: lock at T-14, ship at T-0.

Knowledge Loop: Phase 6 (Producer + Consumer)

Wed, 20 May 2026 12:00:00 GMT

## What is this?

The closing loop of the vault's knowledge system. A `SessionEnd` hook flushes the day's session content to the vault; at 02:30 the synthesizer turns one item into a concept; at 03:30 two critics (Codex CLI and Anti-Gravity CLI, running in parallel) argue about whether it holds up; and a `SessionStart` hook re-injects the surviving expansions as `additionalContext` the next time I open a session. Producer at night, consumer in the morning. The vault that captures the thinking serves it back.

## Why this approach?

The earlier phases captured and structured knowledge but never closed the loop: concepts landed in the vault and stayed there until I went looking. Phase 6 makes the vault a participant in the next session instead of an archive I query. Splitting the critic across two independent CLIs (rather than re-prompting one model) is the deliberate move: real variance comes from different model families disagreeing, not from temperature on a single judge. The hooks are the seams: `SessionEnd` and `SessionStart` are the two places the loop touches the live workflow without me wiring anything by hand.

## What would break?

Three failure modes. The `SessionStart` injection caps at 15K tokens, so a concept-heavy day truncates and the morning context loses the tail of the night's work. The 03:30 critic depends on both CLIs staying authenticated: a ChatGPT Plus session or a Google personal OAuth token expiring silently takes the whole critique step down without an error anyone notices until the morning concept arrives un-critiqued. And the loop assumes the synthesizer produced something worth injecting; on an empty night it re-injects nothing, which is correct but indistinguishable from a silent failure without the eval suite watching.

## What did I learn?

A knowledge system that only captures is a graveyard with good lighting. The value isn't in storing the thinking; it's in the thinking showing up uninvited in tomorrow's context, already critiqued, so the next session starts a step ahead instead of from scratch. The loop is the artifact; the vault is just where it lives.
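
Two of the failure modes above (truncation at the 15K-token cap, and an empty night that looks like a silent failure) come down to the injection step not reporting what it did. A minimal sketch of a budgeted injection that returns an explicit status follows. The four-characters-per-token estimate is a crude assumption, and the names `build_injection` and `Injection` are hypothetical rather than part of any hook API.

```python
from typing import List, NamedTuple

TOKEN_BUDGET = 15_000  # the cap named in the write-up


def estimate_tokens(text: str) -> int:
    # Crude assumption: about four characters per token, rounded up.
    return (len(text) + 3) // 4


class Injection(NamedTuple):
    status: str    # "injected", "truncated", or "empty"
    context: str   # text to hand over as additionalContext
    dropped: int   # number of expansions that did not fit


def build_injection(expansions: List[str], budget: int = TOKEN_BUDGET) -> Injection:
    if not expansions:
        # An empty night is reported as such instead of looking like a failure.
        return Injection("empty", "", 0)
    kept: List[str] = []
    used = 0
    for text in expansions:
        cost = estimate_tokens(text)
        if used + cost > budget:
            break  # stop at the first expansion that would exceed the cap
        kept.append(text)
        used += cost
    dropped = len(expansions) - len(kept)
    status = "truncated" if dropped else "injected"
    return Injection(status, "\n\n".join(kept), dropped)
```

Logging the returned `status` and `dropped` count each morning would make a truncated or empty injection visible without waiting for the eval suite to notice.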

Phase D: Typed Reasoning Edges

Wed, 13 May 2026 12:00:00 GMT

## What is this?

The fourth phase of the knowledge loop. Instead of just storing notes, the system now stores typed edges between concepts. If Concept A contradicts Concept B, the database knows it directly. Six relation types: `supports`, `contradicts`, `evolved_into`, `supersedes`, `depends_on`, `related_to`.

## Why this approach?

Weekly lints were getting too expensive. Passing 100 pages of text to Claude and asking "are there contradictions?" cost $4 every Sunday. The redesign uses the LLM to structure the edges once during nightly synthesis, then the weekly lint runs `SELECT * FROM edges WHERE type='contradicts'`. Don't optimize prematurely, but do optimize when the pattern is proven.

## What would break?

If the LLM hallucinates an edge type that isn't in the enum, the SQLite insert throws an error and the nightly job fails. This happened twice before strict schema enforcement (`CHECK (type IN ('supports', 'contradicts', ...))`). The second failure mode is graph drift: an edge inserted in 2026-04 that references concepts which were later renamed becomes a dangling reference. The schema doesn't enforce referential integrity across concept renames.

## What did I learn?

LLMs are great at identifying relationships, but terrible at scanning large graphs repeatedly. Use the LLM to structure the data once, then use traditional code (SQL) to query it. The cost curve flipped from O(reads × LLM calls) to O(writes × LLM calls + reads × $0).
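
The enum enforcement, the SQL-only lint, and the dangling-edge risk can all be shown with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `edges` table name and the `type` column come from the query quoted above, but the `src`/`dst` columns, the `concepts` table, and the sample rows are hypothetical.

```python
import sqlite3

RELATIONS = ("supports", "contradicts", "evolved_into",
             "supersedes", "depends_on", "related_to")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concepts (name TEXT PRIMARY KEY)")

# The CHECK constraint turns the six relation types into a closed enum.
allowed = ", ".join(f"'{r}'" for r in RELATIONS)
conn.execute(
    "CREATE TABLE edges (src TEXT NOT NULL, dst TEXT NOT NULL, "
    f"type TEXT NOT NULL CHECK (type IN ({allowed})))"
)

conn.executemany("INSERT INTO concepts VALUES (?)", [("A",), ("B",)])
conn.execute("INSERT INTO edges VALUES ('A', 'B', 'contradicts')")
# 'OldName' stands in for a concept that was renamed after the edge was written.
conn.execute("INSERT INTO edges VALUES ('A', 'OldName', 'supports')")

# A relation outside the enum is rejected at write time.
try:
    conn.execute("INSERT INTO edges VALUES ('A', 'B', 'refutes')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# The weekly lint becomes a plain query with no LLM call.
contradictions = conn.execute(
    "SELECT src, dst FROM edges WHERE type = 'contradicts'"
).fetchall()

# Dangling edges: an endpoint that no longer exists in concepts.
dangling = conn.execute(
    "SELECT src, dst FROM edges "
    "WHERE src NOT IN (SELECT name FROM concepts) "
    "OR dst NOT IN (SELECT name FROM concepts)"
).fetchall()

print(rejected, contradictions, dangling)
```

The dangling-edge query is a detection step, not a fix: it reports drift after a rename, whereas a foreign key with `ON UPDATE CASCADE` would be one way to prevent it if concept names were kept in a table like the one assumed here.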

Intent Engineering MCP

Tue, 12 May 2026 12:00:00 GMT

## What is this?

A Model Context Protocol server published to npm + the MCP registry on May 12, 2026. Two MCP tools registered, discoverable via `tools/list` and invoked via `tools/call`; DNS-verified namespace. Installed by Claude Desktop with one line of config.

## Why this approach?

Three options were on the table: ship as a standalone CLI tool, bake into a larger MCP, or ship as a registered server. The registered-server path won because DNS-verified registry adoption was the right scope-cut signal: a recruiter could see the install count + registry presence at a glance. The case study at `/work/intent-engineering-mcp` documents what got cut to ship early.

## What would break?

Three named failure modes: (1) Claude Desktop's MCP client revs faster than this server; (2) DNS verification expiring; (3) npm registry vetting failing on first publish. Mitigations: semver discipline on the server, calendar reminder for renewal, manual vetting check before each publish.

## What did I learn?

It shipped 13 days early because the design was right. What got cut was right too: the interactive MCP-tool-call embed page is a future spec, not v1. The cut list is on the case-study page; the install count is on this ledger row's `` block (rendered via the case-study route, not duplicated here).

Substack Drafter (gate-b-drafts)

Wed, 06 May 2026 12:00:00 GMT

## What is this?

A weekly drafting agent that runs every Thursday, reads the `writing-voice-modes` SKILL.md verbatim, and produces a Substack draft in one of five calibrated voice modes (sean, sedaris, kerouac, thompson, or vonnegut), rotating across them week to week. The output never publishes itself: drafts land in `gate-b-drafts/` and wait for my first editorial pass. The agent fills the blank page; the human still owns the send button.

## Why this approach?

The hardest part of a writing habit isn't the writing; it's the cold start. A drafter that arrives every Thursday with a voice-calibrated first draft turns "what do I write about" into "is this draft worth finishing." Reading the SKILL.md verbatim instead of baking the voice rules into the prompt means the agent's voice stays in lockstep with the calibration I tune by hand: one source of truth, no drift. Routing local-first with a cloud fallback keeps the cost capped: the local model handles the easy modes, the cloud model picks up the ones it can't, and the run never exceeds the ten-cent ceiling.

## What would break?

Two failure modes. The drafter ships default-disabled. Without `INSTALL_SUBSTACK_DRAFTER=1` the launchd schedule never installs, so a fresh fleet stays silent until I opt in, which is correct but easy to forget. And voice fidelity degrades on the local-fallback path: Qwen3-14B handles sean and sedaris cleanly but loses the cadence on vonnegut and kerouac, the two modes whose whole point is rhythm. When the fallback fires on those modes, the draft reads competent and wrong.

## What did I learn?

A drafting agent is a gate, not a publisher. The value isn't in the prose it produces (most of it gets rewritten); it's in defeating the blank page on a schedule I don't have to maintain. The discipline the agent encodes is the same one it's drafting about: write it down, then decide what's worth keeping.

Vault Synthesizer Eval Suite

Thu, 23 Apr 2026 12:00:00 GMT

## What is this?

A 10-case binary pass/fail eval suite for the local Qwen3-14B vault synthesizer agent. The cases were derived from open-coding 17 days of production logs (2026-04-24 → 2026-05-10), not from imagined failure modes. The suite catches the failure class that production monitoring missed: silent regressions where the agent reports success while producing zero output.

## Why this approach?

Pytest + YAML over Braintrust or Langfuse: at ten cases, platform infrastructure is pure overhead. Code-based and rubric graders run before any LLM-as-judge, per Hamel's cost-economics rule. Binary pass/fail over Likert, per the Husain-Shankar canon: Likert scales destroy inter-rater reliability at this scale. The case library is grounded in real-log error analysis, not synthetic generation: the failures the suite guards against actually happened.

## What would break?

Three failure modes. Synthesizer prompt drift: a structural change to the underlying synthesis prompt requires re-baselining the whole suite. Mock-input fixtures going stale as the vault evolves: they need a quarterly refresh. And the judge boundary: no active case uses an LLM judge at v1, so if a case is ever promoted to LLM-judge, the model ID has to be pinned explicitly in the case YAML and a `--skip-llm-judge` flag added for offline runs, or the suite stops being reproducible.

## What did I learn?

That evals aren't really about hallucinations. The failure modes I imagined (hallucinated phase numbers, relation-tag drift, temporal confusion) were the easy cases. The hard case was the one nobody drafts on purpose: the status field that says "ok" while the output is empty. Three layers of monitoring agreed everything was fine while the system underneath them rotted silently for nine days. Error analysis surfaces the failures imagination does not.
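
The hard case (a status of "ok" with nothing produced) is cheap to grade with code, before any LLM judge is involved. The sketch below shows one binary grader in a pytest-compatible shape. The record layout (a `status` string and a `concepts` list) and the function names are assumptions for illustration, not the suite's actual case schema.

```python
from typing import Any, Dict


def grade_nonempty_success(run: Dict[str, Any]) -> bool:
    """Binary grader: pass only if the run reports ok AND produced output."""
    return run.get("status") == "ok" and bool(run.get("concepts"))


def test_ok_status_with_output_passes():
    assert grade_nonempty_success({"status": "ok", "concepts": ["typed edges"]})


def test_ok_status_with_empty_output_fails():
    # The silent regression: success is reported, nothing was written.
    assert not grade_nonempty_success({"status": "ok", "concepts": []})
```

Because the grader returns a plain boolean, it fits the binary pass/fail design directly, and it needs no model call, so it can run on every nightly log.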