Code quality at agent pace — a line of stations between every commit and production. Each station measures one number. Any station can stop the line.
Cursor, Claude Code, and Codex write the bulk of the code. We review, shape, gate, ship. The pace looks like a 50-person enterprise team — but the headcount is a startup, and quality assumptions built for 50 humans don't transfer to 4.
The toolset has to do the supervising. Every repo on a different testing config is the velocity penalty we cannot afford.
One pattern, applied across all 11 repos, that turns "is this any good?" from a vibe-check into a number we can read.
From this morning's rollout across 10 repos — none of these are tests:
- Missing `waitUntil: "commit"` in `page.waitForURL` — would have hung E2E for 30 s each
- `httpOnly: false` on NCEE staging — only surfaced on invalid-token

Reggie's NCEE testing deck still covers the test-failure class. This deck is the gauntlet that wraps it — the static-analysis, build-artifact, supply-chain, and observability stations every commit walks past on its way to ship.
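A minimal sketch of the kind of static check that catches the first class of bug — flag any `page.waitForURL(...)` call whose arguments never mention `waitUntil`. The function name and the regex approach are illustrative assumptions, not the rollout's actual implementation (which may use an AST):

```javascript
// Hypothetical sketch of a waitForURL preflight: scan a spec file's source
// and report every page.waitForURL call that doesn't pass waitUntil, so a
// missing "commit" fails the gate instead of hanging CI for 30 s.
function findBareWaitForURL(source) {
  const findings = [];
  const re = /page\.waitForURL\s*\(([^)]*)\)/g; // naive: assumes no ")" in args
  let m;
  while ((m = re.exec(source)) !== null) {
    if (!m[1].includes("waitUntil")) {
      findings.push({ index: m.index, call: m[0] });
    }
  }
  return findings; // non-empty → the station stops the line
}
```

In a real gate this would run over every spec file and exit non-zero on any finding.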
Every commit walks the gauntlet on its way to production. Each station measures one number. Any station can stop the line — and nothing on the line ever weakens without a signed reason.
"wallace 248833 → 248861 after react 19.2.5". Public reason or it doesn't merge.
No `|| true`. No `--no-verify`. No "warn-only" tier. Half-on stations are the pattern the next agent copies — either the station fails on regression, or it isn't a station.
One unfilled fleet-wide: L15 test authorship — the AI maintains tests under our standards, but nobody is actively growing the suite. Documented as the second open hire in every per-repo SKILL.
No `|| true`. No informational tier. No `--no-verify`. If a step is worth running, it's worth failing on. The same gates that catch quality regressions catch compliance regressions — secrets sweep, audit baseline, branch-protection `enforce_admins`. This is also our Vanta substrate.

`.wallace/tenant.json`:

```json
{
  "totalSize": 248833,
  "selectorCount": 4129,
  "specificity": {
    "max": [0, 4, 4, 0]
  },
  "rules": {
    "empty": { "total": 0 },
    "important": { "total": 0 }
  }
}
```
Today's measured value, written verbatim. Zero headroom.
```javascript
// scripts/wallace/check.mjs
const baseline = readJson(BASELINE);      // frozen numbers from .wallace/tenant.json
const measured = await analyzeCss(BUILT); // today's built CSS, re-measured

for (const [k, v] of entries(baseline)) {
  if (measured[k] > v) {
    fail(`${k}: ${measured[k]} > ${v}`);
  }
}
// Exits non-zero if any metric
// regressed. No informational tier.
```
Any increase fails the gate. The wheel only turns one direction.
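The loop above is the deck's minimal form. A self-contained version has to handle nested metrics like `specificity.max` — flatten both JSON trees to dotted leaf keys, then compare number by number. The `flatten` and `regressions` names are mine, and the shape of the measured object is assumed to mirror the baseline:

```javascript
// Sketch of the one-way ratchet: flatten nested baselines to dotted leaf
// keys (array leaves become index keys), then fail on any measured number
// that exceeds its frozen baseline. Equal is fine; higher stops the line.
function flatten(obj, prefix = "", out = {}) {
  for (const [k, v] of Object.entries(obj)) {
    const key = prefix ? `${prefix}.${k}` : k;
    if (v !== null && typeof v === "object") flatten(v, key, out);
    else out[key] = v;
  }
  return out;
}

function regressions(baseline, measured) {
  const base = flatten(baseline);
  const meas = flatten(measured);
  const failures = [];
  for (const [key, limit] of Object.entries(base)) {
    if (typeof limit === "number" && meas[key] > limit) {
      failures.push(`${key}: ${meas[key]} > ${limit}`);
    }
  }
  return failures; // non-empty → exit non-zero in the real gate
}
```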
```
$ git log -1 .wallace/tenant.json
chore(wallace): bump totalSize 248833 → 248861 (+28 bytes)

react 19.2.5 ships ~28 bytes of new createRoot scaffolding we can't drop.
Verified bundle diff in PR #4129.
```
Public reason or it doesn't merge. Loosening is ceremony.
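One way to mechanize the ceremony — a sketch, not our actual hook: a baseline can only rise if the commit message states both numbers and carries some prose beyond them. The function name and the "states both numbers + has a reason" heuristic are illustrative assumptions:

```javascript
// Hypothetical bump validator: tightening is always allowed; loosening
// passes only when the commit message names old → new and gives a public
// reason (here approximated as prose beyond a bare one-liner).
function isValidBump(commitMessage, oldValue, newValue) {
  if (newValue <= oldValue) return true; // the wheel turns freely downward
  const statesNumbers =
    commitMessage.includes(String(oldValue)) &&
    commitMessage.includes(String(newValue));
  const hasReason =
    commitMessage.trim().split("\n").length > 1 || commitMessage.length > 60;
  return statesNumbers && hasReason;
}
```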
This pattern is the whole gauntlet. Every station — built-CSS bytes (Wallace), JS bundle bytes, npm-audit count, fallow clone groups, vitest coverage floor, gate wall-time — is the same three-step template: freeze a number, gate on it, ceremony to raise it. aatm-brain has 9 stations running today.
Provide the agent with up-to-date docs, idiomatic usage, working examples for that vendor's product. Soft guidance — the agent reads them, picks them up, sometimes ignores them.
Output: better-formed code that uses the vendor correctly. Mode: guidance, advisory.
Enforce our bytes-per-build budget, our coverage floor, our npm-audit baseline, our commit-message format. Hard stops — if the threshold regresses, the build is red.
Output: code that complies with company-wide quality contracts. Mode: enforcement, gating.
Vendor MCPs help the agent write better code. The gauntlet decides whether that code ships. We use vendor MCPs in every repo. We also walk every repo through the same gauntlet.
| Gate | IDE · ms, on save | Pre-commit · ~3 s, on commit | Pre-push · ~70 s, on push | CI Test Gate · ~3 min, parallel | Cron · daily, persistent PR |
|---|---|---|---|---|---|
| L1 · ESLint + format | ● | ● | ● | ● | |
| L2 · TypeScript | ● | ● | ● | | |
| L7 · e2e preflight | ● | ● | | | |
| L6 · Fallow preflight (diff-scoped) | ● | ● | | | |
| Branch protection / commitlint / secrets | ● | ● | ● | | |
| L12 · npm audit | ● | ● | ● | | |
| A4 · Dependabot grouped weekly | ● | ● | ● | | |
| L13 · Coverage floor | ● | ● | | | |
| L8 · Wallace built CSS | if CSS | ● | ● | | |
| L5 · Fallow project graph | ● | ● | ● | | |
| L9 · Bundle byte budget | ● | ● | | | |
| CI Test Gate orchestration | | | | ● | |
| A1 · Wall-time budget wrapper | ● | ● | | | |
| L14 · Visual regression (PNG diff) | | | | ● | |
| v3.2 · PostHog telemetry gate | v3.2 | | | | |
| A3 · Daily report PR | | | | | ● |
The cheapest phase wins. Catch at IDE → free. Catch at CI → minutes. Catch in production → a deploy and an apology. Most gates fire at multiple phases on purpose: the same lint runs locally and in CI so a fast local loop never bypasses the slow remote one.
"Execute org-wide v3.1 rollout per ~/.claude/plans/put-together-a-plan-reactive-starfish.md. Authorship gate: skip repos not primarily authored by Keith. DO NOT STOP UNTIL ALL ITERATIONS COMPLETE. Every repo finished with our changes working and running on main."
No per-repo instructions. The agent read the global skill, the rollout plan, the per-repo INDEX, and the standards. It picked the iteration order, opened PRs, fixed CI failures, merged 10 PRs in ~6 hours of wall-clock. 9 deployed clean to production.
This is the proof of mechanism. Reggie's local SonarQube experiment, with no formal config, was already steering the agent — because Cursor reads the LSP continuously and surfaces SonarQube's warnings as in-editor lints. The agent treats those as gates and corrects.
v3.1 is the same idea, made repo-canonical instead of one-engineer's-laptop: standards surfaced as lints get respected; rules in a doc get ignored.
Every minute spent in the SKILL is a minute paid back across every agent that comes after it.
Adopt v3.1 standard evolution onto a repo that already had 11 of 14 layers wired. Add the new amendments — wall-time budget (A1), SHA-cached pre-push (A1b), JSX prose (L11), visual regression (L14, A2), daily report (A3) — and document the rest as honest gaps in the per-repo SKILL.
- `e2e:preflight` static check caught 5 real `page.waitForURL` calls missing `waitUntil: "commit"` — bugs that would have hung E2E for 30 s each in CI.
- `AATM_PREPUSH_FORCE=1` because the audit baseline holds 2 high CVEs out of scope for this PR. The gap is documented in the PR description and the per-repo SKILL. Honest gap, not skipped gate.
- 5 × 30-second hangs prevented = ~2.5 minutes per CI run × every PR forever. Static check < runtime hang, always.
| Layer | Before | After |
|---|---|---|
| L1 ESLint | ✓ | ✓ held |
| L2 tsc | ✓ | ✓ held |
| L5 Fallow | ✓ | ✓ held |
| L6 Fallow preflight | ✗ | ✓ gained |
| L7 e2e preflight | ✗ | ✓ gained · caught 5 |
| L8 Wallace CSS | ✓ | ✓ held |
| L9 Bundle budget | ✓ | ✓ held |
| L10 md prose (alex) | ✓ | ✓ held |
| L11 JSX prose | ✗ | ✓ gained |
| L12 npm audit | ✗ baseline | ✗ deferred · documented |
| L13 coverage floor | ✓ 45% | ✓ frozen |
| L14 visual regression | ✗ | ✓ gained |
| A1 budget wrapper | ✓ | ✓ held |
| A1b SHA-cached pre-push | ✓ | ✓ held |
| A3 daily report cron | ✗ | ✓ gained |
| L3/L4 tokens + contrast | ✗ | ✗ documented |
5 layers gained · 9 held at the freeze · 2 gaps explicitly deferred with a paper trail.
Add the new amendments to the most-mature repo in the convoy — wall-time budget, SHA-cached pre-push, visual regression, daily report. Refresh the per-repo SKILL to v3.1 wording. Land the iteration as the reference for the other 9 PRs to cite.
One E2E spec (client-feedback page loads, smoke - global agent) succeeded only on retry #3. Not deterministically broken — flaky. v3.1 made the flake visible by failing fast; v3.1 didn't fix it. v3.2 work: quarantine flake-prone tests or fix them. Don't normalize retries.

The stricter pre-commit + gauntlet pattern pushed Bobby off the blueprint repo for two days while we tuned which warnings get treated as hard stops vs. info. Stations that catch quality regressions only matter if the team can still work — calibration is its own line item.
Frozen rot is still rot — but it's visible rot, with a paper trail. Next PR can't make it worse. Burndown is a real workstream, not a someday-refactor.
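The audit baseline is the same ratchet applied to CVE counts. A sketch under assumptions (the function name and baseline shape are mine; the real gate would read counts from `npm audit --json`): the freeze blocks any new high/critical finding, and a shrinking count re-tightens the baseline so burndown progress can't be undone.

```javascript
// Hypothetical frozen npm-audit baseline: today's CVE counts per severity
// are the ceiling. Rising above the freeze fails; falling below it ratchets
// the baseline down, so the burndown workstream only moves one direction.
function auditGate(baseline, current) {
  const result = { failures: [], nextBaseline: { ...baseline } };
  for (const severity of Object.keys(baseline)) {
    const now = current[severity] ?? 0;
    if (now > baseline[severity]) {
      result.failures.push(`${severity}: ${now} > frozen ${baseline[severity]}`);
    } else if (now < baseline[severity]) {
      result.nextBaseline[severity] = now; // burndown progress locked in
    }
  }
  return result;
}
```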
| Layer | Before | After |
|---|---|---|
| L1 ESLint | ✓ | ✓ held |
| L2 tsc | ✓ | ✓ held |
| L3 theme tokens | ✓ | ✓ held |
| L4 WCAG contrast | ✓ | ✓ held |
| L5 Fallow | ✓ | ✓ baselined 565+990 |
| L8 Wallace CSS | ✓ | ✓ held |
| L9 Bundle budget | ✓ | ✓ held |
| L11 JSX prose | ✓ | ✓ caught "Looser" |
| L12 npm audit | ✓ 1 high | ✗ baseline · burndown |
| L13 coverage floor | ✓ 38% | ✓ frozen at floor |
| L14 visual regression | ✓ | ✓ held |
| A1 budget wrapper | ✗ | ✓ gained · 67.4 s/90 s |
| A1b SHA-cached pre-push | ✗ | ✓ gained |
| A3 daily report cron | ✗ | ✓ gained |
| CI Test Gate drill | flake-prone | flake-prone · v3.2 fix |
| Carpenter burndown | ~unknown | 1,555 · baselined |
3 layers gained · 11 held · 2 visible gaps with named workstreams.
Eight repos started with most layers unfilled. We didn't pretend to fill them. The iteration PR landed three things on every repo — and named the rest as gaps in the per-repo SKILL.
The signal isn't the layers we filled. It's the layers we named as gaps. A documented gap is a hire we can plan; a hidden gap is a fire we'll fight blind.
Production deploys went green for 9 of the 10. bfd-platform's E2E flake unmasked an old retry-tolerant setup we'd never properly seen — that's a v3.2 fix, not a v3.1 regression.
Daily-report cron 401'd at PR creation across all 10 repos. Cause: "Allow GitHub Actions to create and approve pull requests" was off org-wide. Repo-level Actions perms can't override it. ~30 min to find. Now step 0 of any cron-PR workflow.
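That step 0 can itself be a station. A sketch of a preflight over the workflow-permissions payload (GitHub's REST API exposes `can_approve_pull_request_reviews` and `default_workflow_permissions` at the org and repo level, to the best of my knowledge — verify the field names before relying on this):

```javascript
// Hypothetical cron-PR preflight: given the workflow-permissions object
// fetched from the GitHub API, report anything that would make a
// PR-creating scheduled workflow fail with a 401/403.
function cronPrPreflight(workflowPermissions) {
  const problems = [];
  if (!workflowPermissions.can_approve_pull_request_reviews) {
    problems.push(
      "'Allow GitHub Actions to create and approve pull requests' is off; " +
        "cron PR creation will fail"
    );
  }
  if (workflowPermissions.default_workflow_permissions !== "write") {
    problems.push("GITHUB_TOKEN defaults to read-only");
  }
  return problems; // non-empty → fix org settings before wiring the cron
}
```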
cfut_ vs cfat_: the org's ~/.config/bfd/cloudflare.env held a cfut_ wrangler-OAuth token, not a cfat_ API token. Pages-create + DNS need API tokens; OAuth fails silently mid-flow. A token-type check is now the first line of every CF-touching script.
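That first line looks something like this sketch — the prefixes are taken from the incident above, and the exact prefix strings should be treated as an assumption about our token conventions, not Cloudflare documentation:

```javascript
// Hypothetical token-type check: fail loudly, before any API call, if the
// env holds a wrangler OAuth token (cfut_) where an API token is required.
function assertApiToken(token) {
  if (typeof token !== "string" || token.length === 0) {
    throw new Error("no Cloudflare token found in env");
  }
  if (token.startsWith("cfut_")) {
    throw new Error(
      "cfut_ is a wrangler OAuth token; Pages-create + DNS need an API token"
    );
  }
  return token;
}
```

Failing at line one turns a silent mid-flow hang into an immediate, named error.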
Token system + contrast gate wired only on bfd-platform. Eight UI-bearing repos ship hand-stitched canvas — every refactor risks brand drift and a11y regression. Single biggest hire on the v3.2 board. One repo per quarter cadence.
Goes to any repo, says "we're missing the standards here," and either writes the gate or files the work order. Senior, not junior. Subject-matter expert, not hands. Owns Vanta + compliance touchpoints alongside quality.
Reggie + Keith both reached for "QA + compliance" in the practice dev. This deck names that role.
The repetitive testing work that used to need a junior is now well-bounded enough for the agent to do reliably. "AI is going to do a 10× better job of writing and maintaining those tests than an intern will." — Reggie
That's the whole pattern. Every station in the gauntlet is one instance of it. CSS bytes, bundle bytes, audit count, dead-code findings, coverage floor, gate wall-time, lint warnings — pick the one drifting in your repo this week and freeze it.
One SME role unlocked. Same standards across every repo. The toolset doing the supervising. That's how four people ship like fifty without hating ourselves.
~/.claude/skills/code-quality-setup/SKILL.md — symlinked into Claude Code, Codex, and Cursor. Every agent reads it before generating code in the new repo.
Drop the per-repo SKILL template at .cursor/skills/code-quality/SKILL.md. Fill in which layers are filled, pending, or N/A. The honest gap doc is the deliverable.
In priority order: L1+L2 → A1 budget wrapper → A3 daily report → L6/L7 preflights → L5 fallow → L13 coverage → L8/L9 bytes → L12 audit → L10/L11 prose → L3/L4 tokens+contrast → L14 visual regression.
Update ~/.claude/skills/code-quality-setup/per-repo/INDEX.md — name, cron hour, filled vs. pending layers. The index is the org-wide adoption roster.
Open the iter PR. Each layer commit gets its own message. Gaps land documented, not pretended-shipped. PR description names every filled layer + every documented gap.
10 PRs merged · 9 production deploys clean · 7 real issues caught mid-flight · gaps documented, not hidden.