Black Flag Design
Tech talk · 30 min
Black Flag standards

The BFD
Gauntlet.

Code quality at agent pace — a line of stations between every commit and production. Each station measures one number. Any station can stop the line.

Keith Pattison · Black Flag Design v3.1 · 2026-04-28
The BFD Gauntlet · code quality v3.1 · 01 / 15
Why we kept saying this out loud
From last week's practice dev — what we kept saying

We need a metric of quality —
not an experiment on quality.

"If we spin up a lot of these all with different testing configurations, we're gonna hate ourselves."
— Keith · practice dev · 2026-04-27
"Yeah, no, we need to normalize on something for sure."
— Reggie · same conversation
"If we have a standardized way of doing this, we can go to any code repo and say this is up to quality, or it's not — and start to actually have a metric of quality."
— Keith · same conversation

The shape of the problem

4 · Humans on the team
11 · Active product repos
10× · Lines/dev-day vs 2023

Cursor, Claude Code, and Codex write the bulk of the code. We review, shape, gate, ship. The pace looks like a 50-person enterprise team — but the headcount is a startup, and quality assumptions built for 50 humans don't transfer to 4.

The toolset has to do the supervising. Every repo on a different testing config is the velocity penalty we cannot afford.

This deck is the answer to that conversation.

One pattern, applied across all 11 repos, that turns "is this any good?" from a vibe-check into a number we can read.

The problem · scaling quality at agent pace · 02 / 15
Why now · what changed
Why now · the pyramid was right for craft pace

The testing pyramid was built for human pace.
Now code arrives at machine pace — the model breaks.

Pre-agent · craft pace

  • One developer wrote ~10 PRs a week.
  • The same humans wrote and reviewed code. Taste caught bad patterns.
  • The pyramid was a sketch on the wall. It described what a thoughtful test suite looked like — and humans self-enforced because the volume was small enough to.

Post-agent · machine pace

  • Cursor, Claude Code, and Codex write the bulk of every diff. ~50 PRs/dev/week is normal.
  • No human can read every line at machine pace. Taste doesn't scale.
  • The pyramid still describes a healthy test suite — but says nothing about whether this commit can ship.
"AI can ignore rules and it does on a regular basis. Lints are hard stops. Rules are guidelines."
— Reggie · practice dev · 2026-04-27

What the pyramid alone doesn't catch

From this morning's rollout across 10 repos — none of these are tests:

  • 5 missing waitUntil: "commit" in page.waitForURL — would have hung E2E for 30 s each
  • 14 high-severity npm advisories on bfd-front-door upstream
  • 565 fallow clone groups + 990 health findings on bfd-platform — structural drift
  • "Looser" flagged by alex as a homophone of a slur
  • 1,555 frozen work orders of CSS + structural decay
  • Cookies set httpOnly: false on NCEE staging — only surfaced on invalid-token
  • Unit tests · many · fast
  • Integration · some · medium
  • Functional · few · slow
  • E2E · few · slowest

The pyramid is right. It's just not enough.

Reggie's NCEE testing deck still covers the test-failure class. This deck is the gauntlet that wraps it — the static-analysis, build-artifact, supply-chain, and observability stations every commit walks past on its way to ship.

Why now · agent pace · the pyramid + the gauntlet · 03 / 15
Our approach
Black Flag standards · the gauntlet

The BFD Gauntlet.

Every commit walks the gauntlet on its way to production. Each station measures one number. Any station can stop the line — and nothing on the line ever weakens without a signed reason.

Commit (agent · human)
  • L1 · Lint + format · 0 warnings · ✓ pass
  • L2 · TypeScript · strict mode · ✓ pass
  • L5 · Fallow graph · ≤ 565 clones · ✓ pass
  • L7 · e2e preflight · no AST hangs · ✓ pass
  • L8 · CSS bytes · ≤ 248,833 b · ✓ pass
  • L9 · JS bundle · 524,109 b vs spec ≤ 512,884 b · ✕ FAIL · ⛔ Andon — line stopped
  • L11 · Prose (alex) · 0 flags · ✓ pass
  • L12 · npm audit · ≤ baseline · ✓ pass
  • L13 · Coverage · ≥ 38 % · ✓ pass
  • L14 · Visual diff · PNG diff = 0 · ✓ pass
  • A1 · Wall-time · ≤ 90 s local · ✓ pass
  • A3 · Drift cron · daily PR · ✓ pass
Production (main · deploy)
Tighten the line for free. If a metric improves — fewer bytes, fewer warnings, more coverage — edit the threshold to the new lower number. No PR ceremony, no discussion. Each station only ratchets tighter.
Loosening is a signed ceremony. Raising a threshold needs a commit message naming the new value and the reason. "wallace 248833 → 248861 after react 19.2.5". Public reason or it doesn't merge.
No bypass. Ever. No || true. No --no-verify. No "warn-only" tier. Half-on stations are the pattern the next agent copies — either the station fails on regression, or it isn't a station.
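The signed ceremony can itself be a station. A minimal sketch of the message check, assuming a commit-msg hook hands us the message text; the format follows the wallace example above, and the function name is ours, not part of the standard:

```javascript
// commit-msg ceremony sketch: a threshold loosen must name old → new and give a reason.
// The hook wiring (husky / commitlint) is assumed; only the message test is shown.
function validLoosenMessage(msg) {
  const [subject, ...body] = msg.split("\n");
  // subject must name the values, e.g. "248833 → 248861"
  const namesValues = /\d[\d,]*\s*(→|->)\s*\d[\d,]*/.test(subject);
  // body must carry the public reason, or the merge is refused
  const hasReason = body.join("\n").trim().length > 0;
  return namesValues && hasReason;
}
```

A diff that loosens any freeze file without a message in this shape never reaches main.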
The BFD Gauntlet · 12 stations · one stopped above · 04 / 15
The same set across every repo
Black Flag standards · 15 layers + 4 amendments

The layer set is identical across all 11 repos.
What differs is which layers are filled.

  • L1 · LEXICAL · ESLint + format · on save · pre-commit · style + rule drift
  • L2 · SEMANTIC · TypeScript · pre-commit · CI · contracts + types
  • L10 · PROSE · md prose (alex) · pre-commit · CI · ableist / homophone
  • L11 · PROSE · JSX prose (alex) · pre-push · CI · UI copy drift
  • L5 · STRUCTURE · Fallow project graph · pre-commit · CI · dead exports + clones
  • L6 · PRE-FLIGHT · Fallow preflight · pre-push · CI · scoped to diff
  • L7 · PRE-FLIGHT · e2e preflight · pre-push · CI · JS-AST scan for hangs
  • L14 · VISUAL · Visual regression · CI · Playwright PNG diff
  • L8 · BYTES · Wallace (built CSS) · pre-push · CI · stylesheet bytes
  • L9 · BYTES · Bundle budget · pre-push · CI · JS hash-stripped
  • L13 · COVERAGE · Coverage floor · CI · vitest threshold
  • A2 · BASELINE · Snapshot baseline · CI · --update-snapshots
  • L12 · SUPPLY · npm audit · pre-push · CI · cron · live + baselined
  • A4 · SUPPLY · Dependabot · cron · grouped weekly
  • POLICY · Branch protection · always · enforce_admins · commitlint · secrets sweep
  • CI · Test Gate orchestration · CI · parallel jobs · needs gate
  • A1 · BUDGET · Wall-time wrapper · all phases · 90 s local · 300 s CI
  • A1b · CACHE · SHA-cached pre-push · pre-push · skip if HEAD green
  • A3 · LOG · Daily report cron · cron · persistent PR · drift = the diff
  • L3/L4 · GAP · Tokens + WCAG contrast · unfilled in 8 of 10 · v3.2: largest open hire

One unfilled fleet-wide: L15 test authorship — the AI maintains tests under our standards, but nobody is actively growing the suite. Documented as the second open hire in every per-repo SKILL.

Layer set · 15 layers + 4 amendments + 2 documented gaps · 05 / 15
Why the gauntlet works where guidelines don't
The doctrine · in Reggie's words first
"AI can ignore rules and it does on a regular basis. But if you have these gates set up — if this lint rule isn't followed, you can't push the code to GitHub — it reacts to those more reliably than it does to rules. Rules are guidelines. Lints are hard stops."
— Reggie · practice dev · 2026-04-27

Each station ratchets — tighter is free, loosening is ceremony.

MEASURE · Capture the current value of every metric. Don't aspire. Whatever the number is today is the freeze point.
FREEZE · Commit that value as the threshold. Zero headroom. Any increase fails the gate. The build won't go green until the metric returns to the frozen value or below.
RATCHET DOWN · When a metric improves, edit the threshold to the new lower value. No ceremony, no discussion. The wheel only turns one direction.
RATCHET UP · Raising a threshold requires a commit message that explicitly names the new value AND the reason. "Bumped wallace 248833 → 248861 after react 19.2.5" is the format. Public reasoning forces the call to be deliberate.
NEVER · No || true. No informational tier. No --no-verify. If a step is worth running it's worth failing on. The same gates that catch quality regressions catch compliance regressions — secrets sweep, audit baseline, branch-protection enforce_admins. This is also our Vanta substrate.
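The five rules reduce to two pure functions. A sketch, with the metric name illustrative and the freeze-file IO elided; the real stations read the freeze JSON from disk and measure the build:

```javascript
// The ratchet as pure functions (freeze-file IO elided; metric names illustrative).

// Gate: list every metric that regressed past its frozen ceiling.
function check(frozen, measured) {
  return Object.entries(frozen)
    .filter(([metric, ceiling]) => measured[metric] > ceiling)
    .map(([metric, ceiling]) => `${metric}: ${measured[metric]} > ${ceiling}`);
}

// Ratchet down: tightening is free; fold any improvement into the freeze.
function ratchetDown(frozen, measured) {
  const next = { ...frozen };
  for (const [metric, ceiling] of Object.entries(frozen)) {
    if (measured[metric] < ceiling) next[metric] = measured[metric];
  }
  return next;
  // No ratchetUp() exists: loosening is a hand-edited commit with a public reason.
}
```

Every station wires these two moves to a different measurement; nothing else changes.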
The doctrine · gates not guidelines · 06 / 15
Inside one station
Inside a station · the one pattern every gate is an instance of

Every station is just a JSON file the build refuses to regress.

1 · Freeze the value
// .wallace/tenant.json
{
  "totalSize":     248833,
  "selectorCount": 4129,
  "specificity": {
    "max": [0, 4, 4, 0]
  },
  "rules": {
    "empty":     { "total": 0 },
    "important": { "total": 0 }
  }
}

Today's measured value, written verbatim. Zero headroom.

2 · Gate on it
// scripts/wallace/check.mjs
const baseline = readJson(BASELINE);
const measured = await analyzeCss(BUILT);

for (const [k, v] of Object.entries(baseline)) {
  // numeric leaves compare directly; nested
  // groups (specificity, rules) recurse, elided here
  if (typeof v === "number" && measured[k] > v) {
    fail(`${k}: ${measured[k]} > ${v}`);
  }
}
// Exits non-zero if any metric
// regressed. No informational tier.

Any increase fails the gate. The wheel only turns one direction.

3 · Raising costs ceremony
$ git log -1 .wallace/tenant.json

chore(wallace): bump totalSize
  248833 → 248861 (+28 bytes)

react 19.2.5 ships ~28 bytes
of new createRoot scaffolding
we can't drop. Verified bundle
diff in PR #4129.

Public reason or it doesn't merge. Loosening is ceremony.

This pattern is the whole gauntlet. Every station — built-CSS bytes (Wallace), JS bundle bytes, npm-audit count, fallow clone groups, vitest coverage floor, gate wall-time — is the same three-step template: freeze a number, gate on it, ceremony to raise it. aatm-brain has 9 stations running today.

Inside a station · freeze · gate · ceremony · 07 / 15
vs vendor MCPs
Eli's 2024 question — answered
"Are we contriving this in a way that's so 2024? Sentry, Convex, Clerk all have approved Claude and cursor plugins already…"
— Eli · same conversation

Vendor MCPs teach the AI how to use a library.
The BFD Gauntlet enforces our standards. Two different jobs.

VENDOR MCPs · LIBRARY KNOWLEDGE

Sentry · Convex · Clerk · etc.

Provide the agent with up-to-date docs, idiomatic usage, working examples for that vendor's product. Soft guidance — the agent reads them, picks them up, sometimes ignores them.

  • "How do I set up Clerk middleware?"
  • "What's the right way to query Convex from a server action?"
  • "What Sentry tags should I attach to this error?"

Output: better-formed code that uses the vendor correctly. Mode: guidance, advisory.

THE GAUNTLET · COMPANY STANDARDS

Black Flag standards · gates that fail builds.

Enforce our bytes-per-build budget, our coverage floor, our npm-audit baseline, our commit-message format. Hard stops — if the threshold regresses, the build is red.

  • "This PR adds 28 KB to the CSS bundle — fails Wallace."
  • "This change drops coverage from 45 % to 44.7 % — fails L13."
  • "This commit raises the wallace threshold without a public reason — fails commit-msg."

Output: code that complies with company-wide quality contracts. Mode: enforcement, gating.

Both, not either. They live at different layers.

Vendor MCPs help the agent write better code. The gauntlet decides whether that code ships. We use vendor MCPs in every repo. We also walk every repo through the same gauntlet.

What this isn't · vendor MCPs vs Black Flag gates · 08 / 15
When each gate fires
Phase timing · catch the same problem at the cheapest possible phase

Each gate runs at the earliest cheap phase.
Defense in depth: same gate, multiple phases.

Gate · IDE (ms · on save) · Pre-commit (~3 s · on commit) · Pre-push (~70 s · on push) · CI Test Gate (~3 min · parallel) · Cron (daily · persistent PR)
L1 · ESLint + format
L2 · TypeScript
L7 · e2e preflight
L6 · Fallow preflight (diff-scoped)
Branch protection / commitlint / secrets
L12 · npm audit
A4 · Dependabot grouped weekly
L13 · Coverage floor
L8 · Wallace built CSS (if CSS)
L5 · Fallow project graph
L9 · Bundle byte budget
CI Test Gate orchestration
A1 · Wall-time budget wrapper
L14 · Visual regression (PNG diff)
v3.2 · PostHog telemetry gate
A3 · Daily report PR

The cheapest phase wins. Catch at IDE → free. Catch at CI → minutes. Catch in production → a deploy and an apology. Most gates fire at multiple phases on purpose: the same lint runs locally and in CI so a fast local loop never bypasses the slow remote one.
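The A1b SHA-cached pre-push from the table above fits in a few lines. A sketch: the cache object and runGates are stand-ins for the real hook, which persists the green SHA to a file and shells out to the stations:

```javascript
// A1b sketch: skip the pre-push suite only when this exact HEAD already went green.
// cache and runGates are illustrative; the real hook persists the SHA under .git/.
function prePush({ head, cache, runGates }) {
  if (cache.green === head) {
    return "skipped";   // this exact commit already passed every station
  }
  runGates();           // throws on any red station; nothing gets recorded
  cache.green = head;   // remember the green SHA only after a clean pass
  return "ran";
}
```

The fast local loop stays fast without bypassing anything: a new commit always re-runs the full suite.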

Phase timing · same gate, multiple phases · 09 / 15
How it performed in real codebases
2026-04-28 · 10 ships · one morning · one prompt

The directive was thin because the standards are thick.
And SonarQube proved the model before we shipped.

What the agent ran on

"Execute org-wide v3.1 rollout per
~/.claude/plans/put-together-a-plan-
reactive-starfish.md. Authorship gate:
skip repos not primarily authored by Keith.
DO NOT STOP UNTIL ALL ITERATIONS COMPLETE.
Every repo finished with our changes
working and running on main."

No per-repo instructions. The agent read the global skill, the rollout plan, the per-repo INDEX, and the standards. It picked the iteration order, opened PRs, fixed CI failures, merged 10 PRs in ~6 hours of wall-clock. 9 deployed clean to production.

10/10 · PRs merged on main
9/10 · Production deploys clean
7 · Real issues caught mid-flight

What proved the model first: SonarQube

"I'm running SonarQube locally — barely set up, default rule set. The agents are seeing them as lints via the LSP and reacting to them, even though I haven't tuned anything."
— Reggie · practice dev · 2026-04-27

This is the proof of mechanism. Reggie's local SonarQube experiment, with no formal config, was already steering the agent — because Cursor reads the LSP continuously and surfaces SonarQube's warnings as in-editor lints. The agent treats those as gates and corrects.

v3.1 is the same idea, made repo-canonical instead of one-engineer's-laptop: standards surfaced as lints get respected; rules in a doc get ignored.

Thin prompt + thick standards ≫ thick prompt + thin standards.

Every minute spent in the SKILL is a minute paid back across every agent that comes after it.

The rollout · 10 PRs · the SonarQube proof · 10 / 15
Real codebase · #1 of 3 walkthroughs
aatm-brain · iter 2 · PR #158 · daily report 07:00 UTC

aatm-brain · most layers wired before;
now backed by amendments + a real catch.

What we asked the agent to do

Adopt v3.1 standard evolution onto a repo that already had 11 of 14 layers wired. Add the new amendments — wall-time budget (A1), SHA-cached pre-push (A1b), JSX prose (L11), visual regression (L14, A2), daily report (A3) — and document the rest as honest gaps in the per-repo SKILL.

What actually happened

  • The new e2e:preflight static check caught 5 real page.waitForURL calls missing waitUntil: "commit" — bugs that would have hung E2E for 30 s each in CI. Static analysis cheaper than runtime hang every time.
  • The pre-push gate had to be overridden with AATM_PREPUSH_FORCE=1 because the audit baseline holds 2 high CVEs out of scope for this PR. The gap is documented in the PR description and the per-repo SKILL. An honest gap, not a skipped gate.
  • Coverage at 45 % was frozen as the new floor — gate against regression. Growing the suite is a separate workstream.
  • Theme tokens + WCAG contrast deferred — biggest standing investment for v3.2.

The L7 preflight paid for itself in one PR.

5 × 30-second hangs prevented = ~2.5 minutes per CI run × every PR forever. Static check < runtime hang, always.
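The shape of that L7 check, sketched with a regex in place of the real JS-AST pass (so multi-line calls are out of scope here; the function name is ours):

```javascript
// e2e preflight sketch: flag page.waitForURL(...) calls with no waitUntil option.
// A regex stands in for the real AST scan, so multi-line calls would be missed.
function findWaitForUrlHangs(source) {
  const calls = source.match(/page\.waitForURL\([^)]*\)/g) ?? [];
  return calls.filter((call) => !call.includes("waitUntil"));
}
```

Run over the diff at pre-push, any hit fails the gate before CI ever spends a 30-second timeout on it.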

aatm-brain · before → after by layer
Layer · Before → After
L1 ESLint · ✓ held
L2 tsc · ✓ held
L5 Fallow · ✓ held
L6 Fallow preflight · ✓ gained
L7 e2e preflight · ✓ gained · caught 5
L8 Wallace CSS · ✓ held
L9 Bundle budget · ✓ held
L10 md prose (alex) · ✓ held
L11 JSX prose · ✓ gained
L12 npm audit · ✗ baseline → ✗ deferred · documented
L13 coverage floor · ✓ 45% → ✓ frozen
L14 visual regression · ✓ gained
A1 budget wrapper · ✓ held
A1b SHA-cached pre-push · ✓ held
A3 daily report cron · ✓ gained
L3/L4 tokens + contrast · ✗ documented

5 layers gained · 9 held at the freeze · 2 gaps explicitly deferred with a paper trail.

aatm-brain · iter 2 · PR #158 · 11 / 15
Real codebase · #2 of 3 walkthroughs
bfd-platform · iter 1 · PR #82 · the reference implementation

bfd-platform · we baselined 1,555 work orders
rather than pretend we'd cleaned them.

What we asked the agent to do

Add the new amendments to the most-mature repo in the convoy — wall-time budget, SHA-cached pre-push, visual regression, daily report. Refresh the per-repo SKILL to v3.1 wording. Land the iteration as the reference for the other 9 PRs to cite.

What actually happened

  • Baselined 565 fallow clone groups + 990 health findings — accumulated structural debt the new layer surfaced. Frozen as the v3.1 baseline so it can't get worse. Burndown is a separate workstream.
  • Test Gate flaked twice. CI Test Gate failed with 2 E2E suites timing out (client-feedback page loads, smoke - global agent); succeeded on retry #3. Not deterministically broken — flaky. v3.1 made the flake visible by failing fast. v3.1 didn't fix it. v3.2 work: quarantine flake-prone tests or fix them. Don't normalize retries.
  • Wall-time wrapper (A1) measured the full pipeline at 67.4 s under the 90 s ceiling. 22 s headroom.
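The A1 wrapper is conceptually tiny. A synchronous sketch; the real wrapper times the whole phase and reads its ceiling (90 s local, 300 s CI) from config:

```javascript
// A1 sketch: run a task under a wall-time ceiling; going over fails the gate.
// The ceiling values come from the deck; the task is any gate phase.
function withBudget(ceilingMs, task) {
  const start = Date.now();
  task();
  const elapsed = Date.now() - start;
  if (elapsed > ceilingMs) {
    throw new Error(`wall-time budget: ${elapsed} ms > ${ceilingMs} ms`);
  }
  return elapsed; // headroom is ceilingMs - elapsed, surfaced in the daily log
}
```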

The honest tradeoff

The stricter pre-commit + gauntlet pattern pushed Bobby off the blueprint repo for two days while we tuned which warnings get treated as hard stops vs. info. Stations that catch quality regressions only matter if the team can still work — calibration is its own line item.

The honest baseline beats the fake clean slate.

Frozen rot is still rot — but it's visible rot, with a paper trail. Next PR can't make it worse. Burndown is a real workstream, not a someday-refactor.

bfd-platform · before → after by layer
Layer · Before → After
L1 ESLint · ✓ held
L2 tsc · ✓ held
L3 theme tokens · ✓ held
L4 WCAG contrast · ✓ held
L5 Fallow · ✓ baselined 565+990
L8 Wallace CSS · ✓ held
L9 Bundle budget · ✓ held
L11 JSX prose · ✓ caught "Looser"
L12 npm audit · ✓ 1 high → ✗ baseline · burndown
L13 coverage floor · ✓ 38% → ✓ frozen at floor
L14 visual regression · ✓ held
A1 budget wrapper · ✓ gained · 67.4 s / 90 s
A1b SHA-cached pre-push · ✓ gained
A3 daily report cron · ✓ gained
CI Test Gate drill · flake-prone → flake-prone · v3.2 fix
Carpenter burndown · ~unknown → 1,555 · baselined

3 layers gained · 11 held · 2 visible gaps with named workstreams.

bfd-platform · iter 1 · PR #82 · with honest tradeoff · 12 / 15
Real codebase · #3 of 3 + day-savers
ncee + 7 others · min-viable adoption · plus what we wouldn't have predicted

ncee, front-door, mcp, playbook, style-guide, cli, widget, muster.
Min-viable. Honest. Documented.

What "min-viable" actually shipped

Eight repos started with most layers unfilled. We didn't pretend to fill them. The iteration PR landed three things on every repo — and named the rest as gaps in the per-repo SKILL.

  • Wall-time budget wrapper (A1) — every repo now has a measured ceiling. Drift visible the moment it appears.
  • Daily report cron (A3) — every repo has a persistent-PR log. The day the org-perm flips, every cron starts firing.
  • Per-repo SKILL with explicit gaps — ncee documents 8 unfilled layers; bfd-front-door documents 14+ upstream Astro CVEs; bfd-cli documents node:test → vitest as deferred.

The signal isn't the layers we filled. It's the layers we named as gaps. A documented gap is a hire we can plan; a hidden gap is a fire we'll fight blind.

9 of 10 repos deployed clean. The 10th flaked twice and passed on retry.

Production deploys went green for 9. bfd-platform's E2E flake unmasked an old retry-tolerant setup we'd never properly seen — that's a v3.2 fix, not a v3.1 regression.

Day-savers we'll write down for the next agent

Org-level GitHub setting
The PR-create permission flip

Daily-report cron 401'd at PR creation across all 10 repos. Cause: Allow GitHub Actions to create and approve pull requests was off org-wide. Repo-level Actions perms can't override it. ~30 min to find. Now step 0 of any cron-PR workflow.

Cloudflare token type confusion
cfut_ vs cfat_

Org's ~/.config/bfd/cloudflare.env held a cfut_ wrangler-OAuth token, not a cfat_ API token. Pages-create + DNS need API tokens; OAuth fails silently mid-flow. Token-type check is now the first line of every CF-touching script.
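That token-type first line can be as small as a prefix assert. The prefixes come from the incident above; the function and env-var names are illustrative:

```javascript
// Guard sketch: Pages-create + DNS need a cfat_ API token; a cfut_ OAuth token
// fails silently mid-flow, so reject it up front before any CF call is made.
function assertCfApiToken(token) {
  if (typeof token !== "string" || !token.startsWith("cfat_")) {
    const got = token ? `${String(token).slice(0, 5)}…` : "nothing";
    throw new Error(`CF_API_TOKEN must be a cfat_ API token (got ${got})`);
  }
  return token;
}
```

Failing loudly at line one beats debugging a half-created Pages project thirty minutes later.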

Largest standing investment · v3.2
L3/L4 · tokens + WCAG contrast on 8 of 10 repos

Token system + contrast gate wired only on bfd-platform. Eight UI-bearing repos ship hand-stitched canvas — every refactor risks brand drift and a11y regression. Single biggest hire on the v3.2 board. One repo per quarter cadence.

The convoy + day-savers · 2026-04-28 · 13 / 15
What's in it for you
Eli's question · answered
"At the turn of the summer, assuming we sign enough work — like, do we want a recent college grad to handle maintaining and building out a testing suite? Has AI changed the way we should be thinking about the next deal?"
— Eli · practice dev · 2026-04-27

Hire one technical-QA SME, not five juniors.
The role this opens up: write the standards. Let the agent maintain code under them.

THE HIRING ANSWER

One red-team / technical-QA SME.

Goes to any repo, says "we're missing the standards here," and either writes the gate or files the work order. Senior, not junior. Subject-matter expert, not hands. Owns Vanta + compliance touchpoints alongside quality.

Reggie + Keith both reached for "QA + compliance" in the practice dev. This deck names that role.

WHAT THE AGENT DOES INSTEAD

Maintains tests, fixes lint, writes coverage, tightens stations — under the SME's standards.

The repetitive testing work that used to need a junior is now well-bounded enough for the agent to do reliably. "AI is going to do a 10× better job of writing and maintaining those tests than an intern will." — Reggie

THE SINGLE TAKEAWAY

Pick one number your repo can measure. Freeze it. Make the build fail on regression.

That's the whole pattern. Every station in the gauntlet is one instance of it. CSS bytes, bundle bytes, audit count, dead-code findings, coverage floor, gate wall-time, lint warnings — pick the one drifting in your repo this week and freeze it.
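As a worked instance, on a hypothetical metric: freezing a repo's TODO count. countTodos and gate are illustrative names; any measurable number slots in the same way:

```javascript
// The whole takeaway on one hypothetical metric: TODO comments in the tree.
function countTodos(text) {
  return (text.match(/\bTODO\b/g) ?? []).length;  // step 1: measure today's value
}

function gate(frozenCeiling, measured) {
  if (measured > frozenCeiling) {                 // step 2: fail on any regression
    throw new Error(`regression: ${measured} TODOs > frozen ${frozenCeiling}`);
  }
  return measured;                                // step 3: when lower, edit the freeze down
}
```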

The team's headline: a metric of quality, not an experiment on quality.

One SME role unlocked. Same standards across every repo. The toolset doing the supervising. That's how four people ship like fifty without hating ourselves.

What's in it for you · the hiring + adoption answer · 14 / 15
From one station to the full gauntlet
Adoption playbook · five steps for every new repo

From one station to the full BFD Gauntlet.
Same five steps that ran the rollout.

STEP 1

Read the standards

~/.claude/skills/code-quality-setup/SKILL.md — symlinked into Claude Code, Codex, and Cursor. Every agent reads it before generating code in the new repo.

STEP 2

Per-repo SKILL

Drop the per-repo SKILL template at .cursor/skills/code-quality/SKILL.md. Fill in which layers are filled, pending, or N/A. The honest gap doc is the deliverable.

STEP 3

Hire the layers

In priority order: L1+L2 → A1 budget wrapper → A3 daily report → L6/L7 preflights → L5 fallow → L13 coverage → L8/L9 bytes → L12 audit → L10/L11 prose → L3/L4 tokens+contrast → L14 visual regression.

STEP 4

Add to the index

Update ~/.claude/skills/code-quality-setup/per-repo/INDEX.md — name, cron hour, filled vs. pending layers. The index is the org-wide adoption roster.

STEP 5

Ship

Open the iter PR. Each layer commit gets its own message. Gaps land documented, not pretended-shipped. PR description names every filled layer + every documented gap.

A metric of quality, not an experiment on quality.
One SME, four people, eleven repos, one gauntlet.

10 PRs merged · 9 production deploys clean · 7 real issues caught mid-flight · gaps documented, not hidden.

Thanks.
Questions?

The BFD Gauntlet · adoption + close · 15 / 15