BOSTON, MAY 31, 2026 8 MIN READ
ARCHITECTURE SHIPPED
Vault as Agent Infrastructure
Five structural tests. Four knowledge systems. Three passes, two losses to Linear, and one of the two closers is already on disk.
What is this?
A five-test scoreboard that scores knowledge systems on the axes that matter when an agent has to act on the vault, not just read it: Persistent State, Defined Verbs, Ownership, Permissions, and a Queryable Audit History. Notion, Obsidian (default), Linear, and Sean’s vault each get a row. Three passes, two failures, and the two failures are the architectural argument for what ships next. Every cell is backed by live telemetry from vault/.vault-index.db: 632 typed edges across ~110 nodes, six SQL-enforced relations, 15,582 indexed chunks, regenerable with scripts/generate_schema.py.
Why this approach?
Most “PM tools” discourse is about how nice the editing surface feels. That’s a content question. The agent question is structural: does the system have typed edges the agent can query without scanning, defined verbs it can call without a free-form LLM bridge, an audit history a reviewer can replay? The scorecard turns that structural question into a single visible result, and scoring against Nate Jones’s published tests means citing an external standard instead of grading my own homework. Comparison is the discipline: equal presentation across all four systems is the honesty signal. The reader reaches the conclusion from the symbols, not from a highlighted “Sean’s-vault” row.
What would break?
Three named failure modes: (1) the synthetic public vault fixture drifts from the production schema as concept_edges evolves, mitigated by the scripts/generate_schema.py step that re-emits the ER diagram and stats from the live SQLite file (the edge count already moved 478 → 632 in nine days, which is why the numbers are regenerated, not hand-typed); (2) the Notion/Obsidian/Linear rows go stale as those products ship features, mitigated by a quarterly rescore; (3) recruiters expecting a “Notion-killer” pitch bounce because the scoreboard is honest about the axes where Linear wins; that’s the intended cost. The page filters for the audience that values calibration over salesmanship.
What did I learn?
Equality of presentation is the credibility signal. The temptation was to highlight Sean’s-vault row, wash the passing cells in success green, or stack the columns in my order. Resisting that produced a scoreboard a Linear PM could read without rolling their eyes. The two honest-note callouts (Ownership and Permissions, both Linear-better) are the load-bearing part of earning that read. And the most useful loss to name turned out to be the one whose fix already exists: the closer for the permissions gap, the Judge Layer, was already built and tested, which reframed that “failure” from a someday into a rollout.
erDiagram
Concept ||--o{ ConceptEdge : "from_slug"
Concept ||--o{ ConceptEdge : "to_slug"
ConceptEdge {
int id PK
text from_slug FK
text to_slug FK
text relation "enum: supports | contradicts | evolved_into | supersedes | depends_on | related_to"
real confidence "0..1, NULL when classifier-undecided"
text valid_until "ISO 8601, NULL when current"
text classifier_version
text source_synth_run
text created_at "ISO 8601"
}
Concept {
text slug PK
text title
text body
}
Scoreboard
| STATE | VERBS | OWN | PERM | AUDIT | |
|---|---|---|---|---|---|
| Notion | |||||
| Obsidian (default) | |||||
| Linear | |||||
| Sean's vault |
exceeds · passes · partial · fails
─ METHODS ─
| TASK | AGENT / TOOL | MODEL / COST |
|---|---|---|
| persistent state | SQLite + git + JSONL | local / $0 |
| typed edges | Code Brain (concept_edges) | Sonnet 4.6 (HybridRouter) |
| er diagram gen | scripts/generate_schema.py | stdlib / $0 |
| fixture author | Sean + LLM seed | ~$0.10 |
— EXPLANATION —
What is this?
A five-test scoreboard that scores knowledge systems on the axes that matter when an agent has to act on the vault, not just read it: Persistent State, Defined Verbs, Ownership, Permissions, and a Queryable Audit History. Notion, Obsidian (default), Linear, and Sean’s vault each get a row. Three passes, two failures, and the two failures are the architectural argument for what ships next. Every cell is backed by live telemetry from vault/.vault-index.db: 632 typed edges across ~110 nodes, six SQL-enforced relations, 15,582 indexed chunks, regenerable with scripts/generate_schema.py.
Why this approach?
Most “PM tools” discourse is about how nice the editing surface feels. That’s a content question. The agent question is structural: does the system have typed edges the agent can query without scanning, defined verbs it can call without a free-form LLM bridge, an audit history a reviewer can replay? The scorecard turns that structural question into a single visible result, and scoring against Nate Jones’s published tests means citing an external standard instead of grading my own homework. Comparison is the discipline: equal presentation across all four systems is the honesty signal. The reader reaches the conclusion from the symbols, not from a highlighted “Sean’s-vault” row.
What would break?
Three named failure modes: (1) the synthetic public vault fixture drifts from the production schema as concept_edges evolves, mitigated by the scripts/generate_schema.py step that re-emits the ER diagram and stats from the live SQLite file (the edge count already moved 478 → 632 in nine days, which is why the numbers are regenerated, not hand-typed); (2) the Notion/Obsidian/Linear rows go stale as those products ship features, mitigated by a quarterly rescore; (3) recruiters expecting a “Notion-killer” pitch bounce because the scoreboard is honest about the axes where Linear wins; that’s the intended cost. The page filters for the audience that values calibration over salesmanship.
What did I learn?
Equality of presentation is the credibility signal. The temptation was to highlight Sean’s-vault row, wash the passing cells in success green, or stack the columns in my order. Resisting that produced a scoreboard a Linear PM could read without rolling their eyes. The two honest-note callouts (Ownership and Permissions, both Linear-better) are the load-bearing part of earning that read. And the most useful loss to name turned out to be the one whose fix already exists: the closer for the permissions gap, the Judge Layer, was already built and tested, which reframed that “failure” from a someday into a rollout.