The Official Methodology

The Agentic Coding Flywheel

A comprehensive guide to creating extraordinary software by orchestrating swarms of AI agents using exhaustive markdown plans, polished beads, and the Agent Flywheel stack. Based on the methodology of Jeffrey Emanuel.

1. The Complete Workflow

This is the end-to-end methodology for creating software with frontier AI models, exhaustive markdown planning, beads-based task management, and coordinated agent swarms. Every project follows the same arc, whether it is a small CLI tool or a complex web application. This guide is about moving the hardest thinking into representations that still fit into model context windows. That is the whole game.

Why the Flywheel Compounds Instead of Spinning in Place

The loop runs Intent → Planning → Translation → Execution → Hardening, and the same project gets faster and safer with each pass because every completed cycle upgrades the artifacts feeding the next one. This is the compounding return on planning.
It starts with you. You have an idea for a piece of software. Maybe a web app, maybe a CLI tool, maybe a complex system. Instead of opening an editor and starting to code, you do something that feels counterintuitive: you spend the vast majority of your time planning.

  1. You explain what you want to build to a frontier model like GPT Pro with Extended Reasoning. Your concept, your goals, the user workflows, why it matters. The model produces an initial markdown plan: a comprehensive design document for the entire system.
  2. You ask competing models to create their own plans. Claude Opus, Gemini with Deep Think, Grok Heavy. Each one independently designs the same project. They come up with surprisingly different approaches, each with unique strengths and blind spots.
  3. You synthesize the best ideas from all plans into one. GPT Pro analyzes the competing plans and produces a "best of all worlds" hybrid that blends the strongest ideas from every model into a single superior document.
  4. You iterate relentlessly. Round after round of refinement, each time in a fresh conversation, until the suggestions become incremental. Plans created this way routinely reach 3,000 to 6,000+ lines. They are not slop. They are the result of countless iterations and feedback from many frontier models.
  5. You convert the plan into beads. Beads are self-contained work units (like Jira or Linear tasks, but optimized for use by coding agents). Each bead carries its own context, reasoning, dependencies, and test obligations. A complex plan might produce 200-500 beads with a full dependency graph.
  6. You polish the beads obsessively. "Check your beads N times, implement once," where N is as many as you can stomach. Each polishing round finds things the previous round missed: duplicates, missing dependencies, incomplete context. You run this 4-6+ times until convergence.
  7. You launch a swarm of agents. Claude Code, Codex, and Gemini-CLI sessions running in parallel, all in the same codebase. They coordinate through Agent Mail, choose work intelligently using bv's graph-theory routing, and execute beads systematically.
  8. You tend the swarm, not the code. The human checks for stuck beads, rescues agents after context compaction, sends review prompts, and ensures flow quality. You are the clockwork deity. You designed the machine, set it running, and now you manage it.
  9. Agents review, test, and harden. Self-review with fresh eyes, cross-agent review, random code exploration, testing coverage, UI/UX polish. Rounds and rounds until reviews come back clean.

That is the whole movie. For the CASS Memory System, this process turned a 5,500-line markdown plan into 347 beads. Twenty-five agents produced 11,000 lines of working, tested code with 204 commits in about five hours. You can see the actual plan, the actual agent mail messages, and the actual beads for yourself.

  • 5,500-line plan: synthesized from 4 frontier models
  • 347 beads: with full dependency structure
  • 11k lines of code: working, tested, production-ready
  • ~5 hrs to ship: 25 agents, 204 commits

Once you have the beads in good shape based on a great markdown plan, I almost view the project as a foregone conclusion at that point. The rest is basically mindless "machine tending" of your swarm of 5-15 agents.

The frontier models and coding agent harnesses really are that good already. They just need this extra level of tooling, prompting, and workflows to reach their full potential. The rest of this guide zooms into each stage.

Glossary

| Term | Plain-English Meaning | Why It Matters |
| --- | --- | --- |
| Markdown plan | A huge design document where the whole project still fits in context | Where architecture, workflows, tradeoffs, and intent get worked out |
| Bead | A self-contained work unit in br with context, dependencies, and test obligations | What agents actually execute |
| Bead graph | The full dependency structure across all beads | What lets bv compute the right next work |
| Plan space | The reasoning mode where you are still shaping the whole system | The cheapest place to buy correctness |
| Bead space | The reasoning mode where you are shaping executable work packets | Where planning becomes swarm-ready |
| Code space | The implementation and verification layer inside the codebase | Where local execution happens |
| AGENTS.md | The operating manual every agent must reload after compaction | Keeps the swarm from forgetting how to behave |
| Skill | A reusable instruction bundle that teaches agents how to use a tool or execute a workflow | How methods become repeatable instead of staying as tacit lore |
| Compaction | Context compression inside a long-running agent session | Why re-reading AGENTS.md is mandatory |
| Fungible agents | Generalist agents that can replace one another | Makes crashes and amnesia survivable |
2. Why Planning Is 85% of the Work

You spend 85% of your time on planning. The first time you try it, it feels wrong. No code is being written. Every instinct tells you to just start building. That discomfort is the signal that you are doing it right.

The models are far smarter when reasoning about a plan that is very detailed and fleshed out but still small enough to fit easily within their context window. This is really the key insight behind my obsessive focus on planning and why I spend 85% of my time on that part.

The Context Horizon

Why reasoning in plan-space dominates reasoning in code-space as projects scale: as a project grows from prototype to production, the plan stays below the token-limit horizon and remains context-safe, while the codebase crosses it.

A markdown plan, even a massive 6,000-line one, is still vastly smaller than the codebase it describes. When models reason about a plan instead of raw implementation, they can hold the whole system in their context window at once. Once you start turning that plan into code, the system rapidly becomes too large to understand holistically. You are doing global reasoning while global reasoning is still possible.

I think you get a better result faster by creating one big comprehensive, detailed, granular plan. That's the only way to get these models to use their big brains to understand the entire system all at the same time.

Planning tokens are far fewer and cheaper than implementation tokens. A big, complex markdown plan is shorter than a few substantive code files, let alone a whole project. That means you can afford many more refinement rounds in planning than in implementation. Each planning round evaluates system-wide consequences, not just local code edits. Each improvement to the plan gets amortized across every downstream bead and code change. Planning is the cheapest place to buy correctness, coherence, and ambition.
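To make the asymmetry concrete, here is a back-of-envelope sketch. All numbers are illustrative assumptions (a rough tokens-per-line ratio and invented project sizes), not measurements from any real project:

```python
# Illustrative arithmetic only; the ~10 tokens-per-line ratio and
# the line counts are assumptions, not measured values.
TOKENS_PER_LINE = 10

plan_tokens = 6_000 * TOKENS_PER_LINE      # a large markdown plan
code_tokens = 100_000 * TOKENS_PER_LINE    # a modest production codebase

# How many full refinement passes over the plan cost roughly the
# same as a single full pass over the codebase?
passes = code_tokens // plan_tokens
print(passes)  # -> 16
```

Under these assumptions, every whole-codebase review buys you more than a dozen whole-plan reviews, which is exactly why refinement rounds belong in plan space.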

This workflow is what prevents the models from generating slop. I spend 85% of my time and energy in the planning phases.

Without front-loaded planning, agents are effectively improvising architecture from a narrow local window into the codebase. That is exactly when you get placeholder abstractions, missing workflow details, contradictory assumptions, and compatibility shims that nobody actually wanted. With a detailed plan and polished beads, the models are no longer inventing the system from scratch while coding. They are executing a constrained, coherent design.

The Human Part

The plan creation is the most free-form, creative, human part of the process. I usually just start writing in a messy, stream-of-thought way to convey the basic concept and then collaboratively work with the agent to flesh it out into an initial draft.

The human is not there to hand-author every line of the plan. The human is there to inject intent, judgment, taste, product sense, and strategic direction at the point where those qualities affect the entire downstream system. Once the plan is excellent, the rest becomes much more mechanical.

When prompting the model to create the initial markdown plan, you spend a lot of time explaining the goals and intent of the project and detailing the workflows: how you want the final software to work from the standpoint of the user's interactions. The more the model understands about what you're really trying to accomplish, the end goal, and why, the better a job it can do for you.

Debates belong in planning, not implementation. As many important disagreements as possible should happen before the swarm is burning expensive implementation tokens. Implementation can still surface surprises, but the posture of the workflow is to front-load decisions into plan space.

You Don't Need to Know Everything Upfront

The most common objection to spending 85% of your time on planning: "I don't really know all the requirements at the beginning, and I need the flexibility to change things later." This is not at all in tension with the methodology. Thorough planning does not mean transcribing requirements you already know. It means using frontier models to discover requirements you never would have found on your own, iteratively, while changes are still cheap.

When you paste a rough concept into GPT Pro and ask for a comprehensive plan, the model surfaces dozens of edge cases, architectural considerations, and workflow details you had not thought of. When you show that plan to three competing models, each one finds blind spots the others missed. When you run five rounds of refinement, each round uncovers issues invisible in the previous round. By the time you start implementation, you know far more about your own project than you would have discovered through months of coding and refactoring.

This extends even further when adding major features to existing projects. You can point an agent at an entirely separate open-source project, have it study that project's architecture, and ask it to reimagine the strongest ideas through the lens of your own project's unique capabilities. Requirements emerge from the research itself. The methodology does not demand omniscience up front; it demands a willingness to let the models do deep, iterative exploration before committing to implementation.

And that isn't at all in tension with my approach, as I hope to illustrate here.

Three Reasoning Spaces

The methodology separates work into three spaces, each with a different artifact and a different question it answers:

Cost Architecture: Where You Catch the Bug Determines the Rework Bill

Inject the same mistake at different layers, and the deeper it lands, the more downstream structure has already hardened around it. This is the Law of Rework Escalation:

  • Plan space: 1x
  • Bead space: 5x
  • Code space: 25x

Rework intuition:

  • Plan space: fixes are pure reasoning. Zero code churn.
  • Bead space: fixes rewrite orchestration. High coordination cost.
  • Code space: fixes pay the double tax: implementation plus cleanup.

Strategic takeaway: planning earns its keep because it is the cheapest layer for global reasoning.
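A toy expected-cost model makes the escalation tangible. The 1x/5x/25x multipliers come from the text above; the catch-rate distributions below are invented purely for illustration:

```python
# Toy rework model: the same defect costs more to fix the deeper
# it is caught. Multipliers are from the text; the catch-rate
# numbers are invented for illustration only.
COST = {"plan": 1, "bead": 5, "code": 25}

def expected_rework(catch_rates: dict[str, float]) -> float:
    """Expected rework cost per defect, given where defects get caught."""
    assert abs(sum(catch_rates.values()) - 1.0) < 1e-9
    return sum(COST[layer] * p for layer, p in catch_rates.items())

# Front-loaded planning catches most defects cheaply...
front_loaded = expected_rework({"plan": 0.7, "bead": 0.2, "code": 0.1})
# ...while a code-first posture pays the 25x rate most of the time.
code_first = expected_rework({"plan": 0.1, "bead": 0.1, "code": 0.8})
print(round(front_loaded, 1), round(code_first, 1))  # -> 4.2 20.6
```

Even with generous assumptions, shifting defect discovery upstream cuts the expected rework bill by several multiples.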

| Space | Primary Artifact | What You Decide There |
| --- | --- | --- |
| Plan space | Large markdown plan | Architecture, features, workflows, tradeoffs; the whole system still fits in context |
| Bead space | br issues + dependency graph | Task boundaries, execution order, embedded context; agents need explicit, local work units |
| Code space | Source files + tests | Implementation and verification; the plan has already constrained the high-level decisions |

Plan space is where you figure out what the system should be. Bead space is where you turn that into executable memory, a graph of self-contained work units detailed enough that agents don't have to keep consulting the full plan. Code space is where agents implement, review, and test locally. The key is knowing which space you're in: if you are still redesigning the product, stay in plan space. If you are mainly packaging the work for execution, move to bead space.

3. Creating & Refining the Markdown Plan

Before You Start: The Foundation Bundle

Before writing the plan itself, you need a coherent foundation. Think of it as assembling a foundation bundle: a tech stack decision, an initial architectural direction, a strong AGENTS.md file bootstrapped from a known-good template, up-to-date best-practices guides, and enough product and workflow explanation for the models to understand what "good" looks like.

Keep best practices guides in the project folder and reference them in AGENTS.md. These guides should be kept up to date; you can have Claude Code search the web and update them to the latest versions of your frameworks and libraries.

A strong bootstrap move is to start every new project by copying an AGENTS.md from an existing project that already has good general behavioral rules, safety notes, tool blurbs, and coordination guidance. Later, once the plan and beads are clearer, you ask agents to replace the project-specific content while preserving the general rules that carry across projects.

Critical
Weak foundations leak uncertainty into every later stage. If any of these are missing, the plan will silently absorb ambiguity that later shows up as bad beads, confused agents, and sloppy implementation.

Writing the Initial Plan

You don't even need to write the initial markdown plan yourself. You can write that with GPT Pro, just explaining what it is you want to make. Claude Opus in the web app is also good for this, but GPT Pro with Extended Reasoning remains the top choice for initial planning. No other model can touch Pro on the web when it's dealing with input that easily fits into its context window. It's truly unique. And since you get it on an all-you-can-eat basis with a Pro plan, take full advantage of that.

You usually also specify the tech stack. For a web app, it's generally TypeScript, Next.js 16, React 19, Tailwind, Supabase, with anything performance-critical in Rust compiled to WASM. For a CLI tool, usually Go or Rust. If the stack isn't obvious, do a deep research round with GPT Pro or Gemini and have them study all the relevant libraries and make a suggestion taking your goals into account.

What a First Plan Looks Like

A first serious markdown plan would not say "build a notes app." It would start spelling out the actual user-visible system:

  • Users upload Markdown files through a drag-and-drop UI.
  • The system parses frontmatter tags and stores upload failures for review.
  • Search must support keyword, tag, and date filtering with low perceived latency.
  • Admins need a dedicated screen showing ingestion failures, parse reasons, and retry actions.
  • Auth is internal-only; unauthorized users must never see document content or metadata.
  • We need e2e coverage for upload success, upload failure, search, filtering, and admin review.

That is still only the beginning. But it already shows the difference between ordinary brainstorming and Flywheel planning: the plan tries to make the whole product legible before any code exists.

Multi-Model Plans

For the best results, ask multiple frontier models to independently create plans for the same project. GPT Pro, Claude Opus, Gemini with Deep Think, Grok Heavy. Each comes up with pretty different plans. Different frontier models have different "tastes" and blind spots. Passing a plan through a gauntlet of different models is the cheapest way to buy architectural robustness.

In the CASS Memory System project, the competing plans are publicly visible. This pattern has been used across at least 10 sessions spanning 7+ projects.

Then show their competing plans to GPT Pro with this prompt:

Best-of-All-Worlds Synthesis
I asked 3 competing LLMs to do the exact same thing and they came up with pretty different plans which you can read below. I want you to REALLY carefully analyze their plans with an open mind and be intellectually honest about what they did that's better than your plan. Then I want you to come up with the best possible revisions to your plan (you should simply update your existing document for your original plan with the revisions) that artfully and skillfully blends the "best of all worlds" to create a true, ultimate, superior hybrid version of the plan that best achieves our stated goals and will work the best in real-world practice to solve the problems we are facing and our overarching goals while ensuring the extreme success of the enterprise as best as possible; you should provide me with a complete series of git-diff style changes to your original plan to turn it into the new, enhanced, much longer and detailed plan that integrates the best of all the plans with every good idea included (you don't need to mention which ideas came from which models in the final revised enhanced plan):
Context

GPT Pro web app with Extended Reasoning

Psychology

Forces the model to be intellectually honest about what competitors did better, then synthesize a hybrid that is stronger than any individual plan. The 'best of all worlds' phrasing appears in 10+ distinct sessions across 7+ projects in the session archive. Under the hood, the prompt's length and specificity are deliberate: by asking for git-diff style changes, complete integration of every good idea, and explicit updating of the existing plan, it prevents the model from writing a vague summary and forces structural engagement with the competing plans' actual content.

Take GPT Pro's output (the git-diff style revisions) and paste it into Claude Code or Codex to integrate the revisions in-place:

Integrate Synthesis Revisions
OK, now integrate these revisions to the markdown plan in-place; use ultrathink and be meticulous. At the end, you can tell me which changes you wholeheartedly agree with, which you somewhat agree with, and which you disagree with: [Pasted synthesis output]
Context

Claude Code or Codex

Psychology

Claude critically assesses each suggestion, providing a second layer of quality filtering. The 'wholeheartedly agree / somewhat agree / disagree' framing is load-bearing: it forces the agent to evaluate each revision on a gradient rather than accepting or rejecting the whole batch. You get signal about which changes are obviously good versus which are plausible but debatable, which lets you intervene on the edge cases rather than rubber-stamping everything.

Best-of-All-Worlds Synthesis in Practice

The point is not "many models" in the abstract; it is that complementary strengths (GPT Pro, Claude Opus, Gemini, Grok Heavy) plus fresh-round revision produce a hybrid plan that is harder to surprise later. The hybrid gains on every axis: architecture, workflow coverage, failure modes, novel ideas, and execution readiness. Fresh rounds sand down contradictions and expose edge cases, and muting any one proposal source weakens the result.

Iterative Refinement

Now paste the current plan into a fresh GPT Pro conversation with this prompt. The key word is fresh. Fresh conversations prevent the model from anchoring on its own prior output. Repeat 4-5 rounds:

Plan Refinement Prompt
Carefully review this entire plan for me and come up with your best revisions in terms of better architecture, new features, changed features, etc. to make it better, more robust/reliable, more performant, more compelling/useful, etc. For each proposed change, give me your detailed analysis and rationale/justification for why it would make the project better along with the git-diff style changes relative to the original markdown plan shown below: <PASTE YOUR EXISTING COMPLETE PLAN HERE>
Context

GPT Pro web app — fresh conversation each round

Psychology

This has never failed to improve a plan significantly. Each round finds architectural issues, missing features, and robustness improvements that the previous round missed. Under the hood, asking for 'rationale/justification' prevents the model from making arbitrary changes; it has to argue for each revision, which filters out changes that seem clever but do not actually improve the plan. The git-diff format forces precision rather than vague hand-waving about what should be different.

This has never failed to improve a plan significantly. The best part is that you can start a fresh conversation in ChatGPT and do it all again once Claude Code or Codex finishes integrating your last batch of suggested revisions. After four or five rounds of this, you tend to reach a steady-state where the suggestions become very incremental.

You can still get extra mileage by blending in smart ideas from Gemini with Deep Think enabled, or from Grok Heavy, or Opus in the web app, but you still want to use GPT Pro on the web as the final arbiter of what to take from which model and how to best integrate it.

Insight
The "Lie to Them" technique: Models tend to stop looking for problems after finding ~20-25 issues. If you tell them to find "all" problems, they stop early. The solution: lie to them and give them a huge number, and then they keep cranking until they have uncovered all of them. This works for plan revisions, bead-to-plan cross-references, and any comparison/audit task.
Overshoot Mismatch Hunt
Do this again, and actually be super super careful: can you please check over the plan again and compare it to all that feedback I gave you? I am positive that you missed or screwed up at least 80 elements of that complex feedback.
Context

After any review pass that feels too short or self-satisfied

Psychology

By claiming 80+ errors exist, the model keeps searching exhaustively rather than satisfying itself with a partial list.

Plans created this way routinely reach 3,000-6,000+ lines. They are not slop. They are the result of countless iterations and blending of ideas and feedback from many models. For the CASS GitHub Pages export feature, the plan went through multiple rounds over about 3 hours, growing to approximately 3,500 lines. You can also see a 6,000-line plan to get a feel for the scale.

It feels slow because no code is being written. But if you do it correctly and then start up enough agents in your swarm with Agent Mail, beads, and bv, the code will be written so ridiculously quickly that it more than makes up for this slow part. And what's more, the code will be really good.

When to stop refining and start converting to beads: Stay in plan refinement if whole-workflow questions are still moving around, major architecture debates are still open, or fresh models keep finding substantial missing features, constraints, or tradeoffs. Switch to beads when the plan mostly feels stable and the remaining improvements are about execution structure, testing obligations, sequencing, and embedded context rather than about what the system fundamentally is. If you are still redesigning the product, stay in plan space. If you are mainly packaging the work for execution, move to bead space.

4. Converting the Plan into Beads

Critical
Watch for the plan-bead gap. In the destructive_command_guard sessions, agents occasionally got stuck between plan revision and bead creation: the competing plan was revised but no beads were ever created. Always ensure the synthesis step concludes with an explicit transition to bead creation.

Then you're ready to turn the plan into beads. Think of these as epics, tasks, and subtasks with an associated dependency structure. The name comes from Steve Yegge's amazing project, which is like Jira or Linear, but optimized for use by coding agents. They are stored locally in .beads/ JSONL files that commit with your code.

There are two separate stages here. Planning is, and should be, prior to and orthogonal to beads. You should always have a super detailed markdown plan first. Then treat transforming that markdown plan into beads as a separate, distinct problem with its own challenges. Once you're in "bead space," you never look back at the markdown plan, which is exactly why it's so critical to transfer all the details over to the beads.

Plan to Beads Conversion
OK so please take ALL of that and elaborate on it more and then create a comprehensive and granular set of beads for all this with tasks, subtasks, and dependency structure overlaid, with detailed comments so that the whole thing is totally self-contained and self-documenting (including relevant background, reasoning/justification, considerations, etc.-- anything we'd want our "future self" to know about the goals and intentions and thought process and how it serves the over-arching goals of the project.) Use only the `br` tool to create and modify the beads and add the dependencies. Use ultrathink.
Context

Claude Code with Opus

Psychology

This prompt forces the agent to treat plan-to-beads as a translation problem rather than task extraction. The key sentence is the requirement that beads be so detailed you never need to reopen the markdown plan. That pushes rationale, test expectations, design intent, and sequencing into the bead graph itself. Under the hood, it blocks a common failure mode where the model collapses a rich plan into terse todo items. By explicitly asking for tasks, subtasks, dependency structure, comments, and future-self context, you tell the model that memory density matters more than brevity. Restricting to the br tool prevents the agent from drifting into pseudo-beads in markdown instead of editing the actual task graph.

For existing projects with a specific plan file, prefix it: "OK so now read ALL of PLAN_FILE_NAME.md; please take ALL of that and elaborate on it..." The rest of the prompt stays the same.

Critical
Never write pseudo-beads in markdown documents. Go directly from the markdown plan to actual real beads using the br tool, and from that point on just add and change actual beads. If the model starts describing beads in text form instead of creating them, stop it and redirect to br create.
A Plan Is Only Useful Once It Becomes Executable Memory

Compare what survives into a thin bead versus a context-rich bead; the gap is the source of most swarm confusion. Take one concept from a hypothetical plan, "File Upload Core." It seems simple, but it implies S3 bucket provisioning, chunking, and progress states. Translated into context-rich beads with low guesswork, it becomes:

  • br-101 Document Upload API (root bead; carries the concept). Outcome: validate uploads, chunk large files, parse them, and emit ingestion telemetry. Acceptance: reject malware, cap retries, persist parse status, and record timing for every stage. Tests: fixture matrix for malformed PDFs, retry exhaustion, and upload resume edge cases.
  • br-102 Search Index and Query UX (depends on br-101; carries the concept). Outcome: index parsed documents, support ranked retrieval, and explain match reasons in the UI. Admin hooks: expose index freshness, failed document counts, and reindex controls for operators. Tests: coverage for typo tolerance, empty result states, filter combinations, and stale index warnings.
  • br-103 Ingestion Failure Dashboard (depends on br-101). Outcome: surface failed ingestions, stalled jobs, and per-file parse diagnostics in one operational view. Recovery path: allow safe requeue actions with audit trails and blast-radius summaries before retry. Tests: e2e coverage for retry, audit logging, and incorrect operator input during incident handling.

Each bead carries the why, the what, the failure modes, and the verification plan needed to execute, and most architecture has already been decided upstream. The takeaway: fresh agents can execute without improvising architecture or silently dropping intent.

Beads as Executable Memory

The plan is still the best artifact for whole-system thought. But once a swarm is involved, what you need is not a beautiful essay. You need a task graph that carries enough local context for agents to act correctly without repeatedly loading the whole project back into memory. If the beads are weak, the swarm becomes improvisational. If the beads are rich, the swarm becomes almost mechanical.

It works better if most of the decision-making is made ahead of time during the planning phases and then embedded in the beads:
  • Self-contained: Beads must be so detailed that you never need to refer back to the original markdown plan. Every piece of context, reasoning, and intent should be embedded.
  • Rich content: Beads can and should contain long descriptions; they don't need to be short bullet-point entries. You can and often should embed snippets of markdown inside the beads; JSONL is just how they serialize.
  • Complete coverage: Everything from the markdown plan must be embedded into the beads. Lose nothing in the conversion.
  • Explicit dependencies: The dependency graph must be correct; this is what enables bv to compute the optimal execution order.
  • Include testing: Beads should include comprehensive unit tests and e2e test scripts with great, detailed logging.
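The mechanics of "unblocked work" can be sketched in a few lines. This is a conceptual illustration of the graph idea behind `br ready` and bv's routing, not the tools' actual implementation, and the bead IDs and statuses are invented:

```python
# Conceptual sketch of ready-work computation over a bead graph.
# Not br/bv's real code: a bead is "ready" when it is open and
# every one of its dependencies is closed.

beads = {  # id -> (status, set of dependency ids); illustrative data
    "br-101": ("closed", set()),
    "br-102": ("open", {"br-101"}),
    "br-103": ("open", {"br-101"}),
    "br-104": ("open", {"br-102", "br-103"}),
}

def ready(beads: dict) -> list[str]:
    """Return the IDs of open beads whose dependencies are all closed."""
    return sorted(
        bid for bid, (status, deps) in beads.items()
        if status == "open"
        and all(beads[d][0] == "closed" for d in deps)
    )

print(ready(beads))  # -> ['br-102', 'br-103']
```

This is why the dependency graph must be correct: a single missing edge silently promotes work into the ready set before its prerequisites exist.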
Note
Conceptually, the beads are more for the agents than for you. The models are the primary consumer of beads. You can always have agents interpret beads back into markdown if needed.

What Good Beads Look Like

To make this concrete, imagine a small internal web app called "Atlas Notes" for uploading and searching team notes. Instead of one vague task like "build Atlas Notes," the plan becomes many self-contained beads:

  • br-101 Upload and Parse Pipeline: Describes accepted file formats, frontmatter parsing expectations, where failures are logged, what happens on malformed input, and which unit and e2e tests prove the pipeline works.
  • br-102 Search Index and Query UX: Carries the search behavior, indexing rules, latency expectations, filter semantics, empty-state UX, and test coverage for keyword/tag/date combinations.
  • br-103 Ingestion Failure Dashboard: Includes the admin workflow, permission boundaries, retry logic, logging expectations, and the exact reasons this dashboard matters for operational trust.

The titles are not the important part. What matters is that each bead is rich enough that a fresh agent can open it and immediately understand what correct implementation looks like, why it matters, and how to verify it.

For the CASS Memory System (5,500-line plan), the conversion produced 347 beads with complete dependency structure. FrankenSQLite had hundreds of beads created via parallel subagents. For complex projects, expect 200-500 initial beads.

Once you convert the plan docs into beads, you shouldn't need to refer back to the docs much if you did a good job. The docs remain useful for people and for agents in various contexts, but you no longer need to swamp every agent with the full plan once it's been turned into beads.

Beads CLI Quick Reference

```bash
br create --title "..." --priority 2 --label backend   # Create issue
br list --status open --json                           # List open issues
br ready --json                                        # Show unblocked tasks
br show <id>                                           # View issue details
br update <id> --status in_progress                    # Claim task
br close <id> --reason "Completed"                     # Close task
br dep add <id> <other-id>                             # Add dependency
br comments add <id> "Found root cause..."             # Add comment
br sync --flush-only                                   # Export to JSONL (no git ops)
```

Priority uses numbers: P0=critical, P1=high, P2=medium, P3=low, P4=backlog. Types: task, bug, feature, epic, question, docs. br ready shows only unblocked work. Storage is a SQLite + JSONL hybrid; the JSONL files commit with your code.

5

Check Your Beads N Times, Implement Once

Before you burn up a lot of tokens with a big agent swarm on a new project, the old woodworking maxim of "Measure twice, cut once!" is worth revising as "Check your beads N times, implement once," where N is basically as many as you can stomach. This is the step most people underinvest in.

After the initial conversion finishes, run a round of this prompt. If Claude Code compacted at any point, be sure to tell it to re-read your AGENTS.md file first:

Bead Polishing Prompt
Reread AGENTS.md so it's still fresh in your mind. Check over each bead super carefully-- are you sure it makes sense? Is it optimal? Could we change anything to make the system work better for users? If so, revise the beads. It's a lot easier and faster to operate in "plan space" before we start implementing these things! DO NOT OVERSIMPLIFY THINGS! DO NOT LOSE ANY FEATURES OR FUNCTIONALITY! Also, make sure that as part of these beads, we include comprehensive unit tests and e2e test scripts with great, detailed logging so we can be sure that everything is working perfectly after implementation. Remember to ONLY use the `br` tool to create and modify the beads and to add the dependencies to beads. Use ultrathink.
Context

Claude Code with Opus — run 4-6+ times

Psychology

This prompt keeps the system from freezing beads too early. It tells the model to stay in plan space for as long as it is still finding meaningful improvements, which is exactly where reasoning is cheapest and most global. The warnings against oversimplifying and losing functionality are crucial because models otherwise tend to 'improve' artifacts by deleting complexity they do not fully understand. It combines local bead QA (via br) with graph QA (via bv), and forces tests into the bead definitions themselves so test work cannot be deferred into an afterthought.

Interactive Simulator

Convergence Detection

Drag the sliders or pick a preset to see how beads tighten into a stable state before implementation begins.

[Interactive simulator: four weighted convergence signals (dependencies stabilizing, 25%; content similarity rising, 30%; length delta approaching zero, 20%; semantic density plateau, 25%) combine into an overall convergence score that tells you when to stop polishing.]

From real sessions, polishing involves duplicate detection and merging, quality scoring on WHAT/WHY/HOW criteria, filling empty bead descriptions, correcting dependency links, and cross-referencing beads against the markdown plan to ensure nothing was lost. FrankenSQLite identified 9 exact duplicate pairs and closed them, choosing survivors based on "richer testing specs, better dependency chains, and higher priority."

Tell agents to go through each bead and explicitly check it against the markdown plan. Or vice versa: go through the markdown plan and cross-reference every single thing against the beads (both closed and open) to ensure complete coverage.

Convergence Detection: When to Stop

Bead polishing follows numerical optimization convergence patterns:

| Phase | Rounds | Character |
| --- | --- | --- |
| Major Fixes | 1-3 | Wild swings, fundamental changes |
| Architecture | 4-7 | Interface improvements, boundary refinements |
| Refinement | 8-12 | Edge cases, nuanced handling |
| Polishing | 13+ | Converging to steady state |

Three signals indicate convergence: agent responses getting shorter (output size shrinking), the rate of change decelerating (change velocity slowing), and successive rounds becoming more similar (content similarity increasing). When the weighted convergence score reaches 0.75+, you're ready to finalize. Above 0.90, you're hitting diminishing returns.
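A back-of-the-envelope version of that weighted score can be sketched in a few lines (illustrative only; the signal names, weights, and thresholds are assumptions, not any tool's actual formula):

```python
def convergence_score(signals: dict, weights: dict) -> float:
    """Weighted average of per-signal stability, each normalized to [0, 1]."""
    total = sum(weights.values())
    return sum(signals[name] * w for name, w in weights.items()) / total

weights = {"deps_stable": 0.25, "content_similarity": 0.30,
           "length_delta": 0.20, "semantic_density": 0.25}

# Measured after a polishing round (1.0 = fully stable).
signals = {"deps_stable": 0.90, "content_similarity": 0.85,
           "length_delta": 0.70, "semantic_density": 0.80}

score = convergence_score(signals, weights)   # 0.82
if score >= 0.90:
    verdict = "diminishing returns"
elif score >= 0.75:
    verdict = "ready to finalize"
else:
    verdict = "keep polishing"
```

With the sample readings, the score lands at 0.82: past the 0.75 finalize threshold but short of the 0.90 diminishing-returns line.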

Critical
Early termination red flags: If you see oscillation (alternating between two versions), reframe the problem. If you see expansion (output growing instead of shrinking), step back; the agent is adding complexity. If you see a plateau at low quality, kill the current approach and restart fresh.

Fresh Eyes Technique

If improvements start to flatline, start a brand new Claude Code session:

Fresh Eyes on Beads
First read ALL of the AGENTS.md file and README.md file super carefully and understand ALL of both! Then use your code investigation agent mode to fully understand the code, and technical architecture and purpose of the project. Use ultrathink.
Context

A brand new Claude Code session

Psychology

Fresh sessions don't carry the accumulated assumptions of the previous session. They see the beads with genuinely new eyes.

Then follow up with:

Fresh Review of Beads
We recently transformed a markdown plan file into a bunch of new beads. I want you to very carefully review and analyze these using `br` and `bv`. Check over each bead super carefully-- are you sure it makes sense? Is it optimal? Could we change anything to make the system work better for users? If so, revise the beads. It's a lot easier and faster to operate in "plan space" before we start implementing these things! Use ultrathink.
Context

Same fresh session, after it finishes reading

As a final step, have Codex with GPT (high reasoning effort) do one last round using the same polishing prompt. Different models catch different things.

Deduplication Check

After large bead creation batches, run a dedicated dedup pass:

Bead Deduplication
Reread AGENTS.md so it's still fresh in your mind. Check over ALL open beads. Make sure none of them are duplicative or excessively overlapping... try to intelligently and cleverly merge them into single canonical beads that best exemplify the strengths of each.
Context

Claude Code, after large bead creation batches

Adding Features to Existing Projects

The full planning pipeline (Phases 1-5) is for new projects built from scratch. For existing projects that need new features, the Idea-Wizard is a formalized 6-phase pipeline:

  1. Ground in reality. Read AGENTS.md and list all existing beads (br list --json). This prevents creating duplicates.
  2. Generate 30, winnow to 5. The agent brainstorms 30 ideas for improvements, then self-selects the best 5 with justification.
  3. Expand to 15. Prompt: "ok and your next best 10 and why." The agent produces ideas 6-15, checking each against existing beads for novelty.
  4. Human review. You review the 15 ideas and select which to pursue.
  5. Turn into beads. Selected ideas become beads with full descriptions, dependencies, and priority levels.
  6. Refine 4-5 times. The same polishing loop as above. Single-pass beads are never optimal.
Idea-Wizard: Generate Ideas
Come up with 30 ideas for improvements, enhancements, new features, or fixes for this project. Then winnow to your VERY best 5 and explain why each is valuable.
Context

Claude Code, for existing projects needing new features

Then follow up with: "ok and your next best 10 and why." Having agents brainstorm 30 ideas and then winnow to 5 produces much better results than asking for 5 directly, because the winnowing forces critical evaluation.

Not every change needs the full pipeline. For quick, bounded changes, use the built-in TODO system:

Ad-Hoc Execution
OK, please do ALL of that now. Keep a super detailed, granular, and complete TODO list of all items so you don't lose track of anything and remember to complete all the tasks and sub-tasks you identified or which you think of during the course of your work on these items!
Context

When the overhead of formal bead creation would slow you down more than it helps

Psychology

This prompt forces the agent to externalize its local execution plan into a durable checklist instead of juggling a sprawling ad-hoc task in conversational memory. In tools that support a built-in TODO system, that checklist survives compaction. Under the hood, it creates a temporary execution scaffold that is lighter than full bead creation and much safer than 'just remember everything.' This is the right mode when the work is too small to justify immediate bead formalization but too large to trust to ephemeral context alone. When NOT to use it: if the change is expanding, depends on other work, needs graph-aware sequencing, or should be part of the permanent project record. In those cases, stop and convert it into proper beads. If an ad-hoc change later proves important, retroactively create beads for the completed work to preserve continuity.

Major Features: Research and Reimagine

The Idea-Wizard handles bounded improvements, new feature ideas, and fixes. But sometimes you want to add an entirely new capability to an existing project, something ambitious enough that it deserves the same depth of planning as a greenfield project. For these cases there is a more powerful approach: study an external project that has already solved a related problem, then reimagine its strongest ideas through the lens of your own project's unique strengths.

As a concrete example: adding a robust messaging substrate to the Asupersync project. Rather than designing from scratch or doing a straightforward port, the approach was to study NATS (a mature, production-grade messaging system in Go), extract its strongest architectural ideas, and reimagine them using Asupersync's correct-by-design structured concurrency primitives to create something neither project could achieve alone.

Note
Start from a session that already has context about your project. The model should already understand your project's architecture, primitives, and unique value before you ask it to study an external system. This shared context is what makes the "reimagine" step produce genuinely novel combinations rather than shallow ports.

The process follows a specific prompt sequence. Each step builds on the previous, alternating between expansion (going deeper, inverting the analysis, pushing for architectural innovation) and hardening (repeated blunder hunts that stress-test the result):

Step 1: Investigate and Propose
I want you to clone https://github.com/<external-project> to tmp and then investigate it and look for useful ideas that we can take from that and reimagine in highly accretive ways on top of existing <your project> primitives that really leverage the special, innovative concepts and value-add from both projects to make something truly special and radically innovative. Write up a proposal document, PROPOSAL_TO_INTEGRATE_IDEAS_FROM_<EXTERNAL>_INTO_<YOUR_PROJECT>.md
Context

Codex or Claude Code, in a session with existing project context

Psychology

By asking the agent to clone and investigate the external project firsthand, you get specific, grounded proposals instead of vague suggestions based on training data alone. The 'reimagine in highly accretive ways' framing prevents a shallow porting exercise and pushes toward genuinely novel combinations.

The first draft is always too conservative. Push for depth and ambition:

Step 2: Iterative Deepening
OK, that's a decent start, but you barely scratched the surface here. You must go way deeper and think more profoundly and with more ambition and boldness and come up with things that are legitimately "radically innovative" and disruptive because they are so compelling, useful, accretive, etc.
Context

Same session, immediately after the first proposal

Psychology

Models produce conservative initial proposals to avoid being wrong. Explicit pressure to go deeper unlocks the genuinely creative architectural ideas that make the integration worthwhile rather than incremental.

Then invert the analysis. This technique surfaces opportunities that only exist because of your project's unique capabilities:

Step 3: Inversion Analysis
Now "invert" the analysis: what are things that we can do because we are starting with <your unique primitives/capabilities> that <the external project> simply could never do even if they wanted to because they are working from far less rich primitives that do not offer the sort of guarantees we have?
Context

Same session, after deepening

Psychology

Standard analysis asks 'what can we learn from them?' Inversion asks 'what can we do that they fundamentally cannot?' This surfaces the highest-value integration points: capabilities that are genuinely novel rather than just reimplementations of features the external project already has.

After each major expansion, run a blunder-hunt pass. The critical technique: repeat the exact same critique prompt 5 times in a row. Each pass finds things the previous pass missed, because the model is forced to look beyond the issues it already identified:

Step 4: Repeated Blunder Hunt (Run 5x)
Look over everything in the proposal for blunders, mistakes, misconceptions, logical flaws, errors of omission, oversights, sloppy thinking, etc.
Context

Run this exact prompt 5 times consecutively after each major proposal expansion

Psychology

Models tend to find 15-20 issues on the first pass and declare satisfaction. Running the exact same prompt again forces them past the issues they already found. By the fifth pass, you have caught subtle logical flaws and architectural inconsistencies that no single review pass would surface. This is the critique equivalent of the bead polishing convergence pattern from Section 5.

Continue pushing for specific architectural innovations. In the Asupersync example, this meant asking: "Can you think of a clever, radically innovative way to leverage our unique capabilities so that the messaging substrate doesn't require a separate external server, but each client can self-discover and collectively act as both client and server?" Each major architectural addition gets another round of 5x blunder hunts.

When the proposal has items flagged as needing follow-on design work, address them explicitly rather than leaving them vague:

Step 5: Close Design Gaps
OK so then add this stuff to the proposal, using the very smartest ideas from your alien skills to inform it and your best judgment based on the very latest and smartest academic research: - <specific design gap identified in blunder hunt> - <another specific gap>
Context

Same session, after blunder hunts surface specific open questions

Psychology

Blunder hunts often identify areas where the proposal is 'honest but incomplete' rather than wrong. This prompt converts those honest gaps into concrete design decisions, preventing them from becoming ambiguity that later infects the beads and implementation.

Before sending the proposal for multi-model feedback, make it self-contained. Other models do not have your session context, so the proposal must include everything they need to give useful critique:

Step 6: Self-Contained for Cross-Model Review
OK now we need to make the proposal self-contained so that we can show it to another model such as GPT Pro and have that model understand absolutely anything that might be relevant to understanding and being able to suggest useful revisions to the proposal or to find flaws in the plans. To that end, I need you to add comprehensive background sections about what <your project> is and how it works, what makes it special/compelling, etc. And then do the same in another background section all about <external project> and what it is and what makes it special/compelling, how it works, etc.
Context

After the proposal reaches a stable, ambitious state

Psychology

Cross-model review only works if every model can fully understand the proposal without access to your project. Adding comprehensive background sections prevents other models from making shallow suggestions based on incomplete understanding. This preparation step is what makes the multi-model feedback loop genuinely useful rather than superficial.

Follow this with another 5x blunder hunt, then de-slopify the document. Now you are ready for multi-model triangulation.

Send the self-contained proposal to GPT Pro, Claude Opus, Gemini with Deep Think, and Grok Heavy, all with the same prompt asking for improvements in git-diff format:

Step 7: Multi-Model Feedback
How can we improve this proposal to make it smarter and better-- to make the most radically innovative and accretive and useful and compelling additions and revisions you can possibly imagine. Give me your proposed changes in the form of git-diff style changes against the file below, which is named PROPOSAL_TO_INTEGRATE_IDEAS_FROM_<EXTERNAL>_INTO_<YOUR_PROJECT>.md: <the complete proposal document>
Context

GPT Pro, Claude Opus (web), Gemini Deep Think, and Grok Heavy -- all four, independently

Psychology

Each model has different architectural tastes and blind spots. Asking for git-diff format forces precision: the models cannot hand-wave about what should change, they have to show the exact text transformations. This makes the synthesis step tractable.

Feed the competing feedback from all models into GPT Pro using the "best of all worlds" synthesis prompt from Section 3. Apply the resulting diffs back to the proposal document in Codex or Claude Code, then de-slopify the final result.

| Step | Purpose | Repetitions |
| --- | --- | --- |
| Investigate and propose | Ground the proposal in real study of the external project | 1 |
| Iterative deepening | Push past conservative initial suggestions | 1-2 |
| Inversion analysis | Find opportunities unique to your project's capabilities | 1 |
| 5x blunder hunt | Ruthlessly verify after each expansion | 5x after each major step |
| Architectural innovation | Push for specific novel design choices | As needed |
| Close design gaps | Convert honest unknowns into concrete decisions | As needed |
| Self-contained background | Prepare for cross-model review | 1 |
| Multi-model feedback + synthesis | Triangulate from 4 frontier models | 1 |
| Apply diffs + de-slopify | Produce the final integrated proposal | 1 |

You can see this exact process applied to Asupersync's NATS integration: the initial proposal, the version after multi-model feedback, and the full GPT Pro synthesis conversation.

From here, the proposal feeds into the standard pipeline: convert to beads (Section 4), polish obsessively (Section 5), launch the swarm (Section 7). The research-driven approach adds significant front-end effort but produces proposals with a level of architectural depth and innovation that no amount of greenfield brainstorming can match, because you are standing on the shoulders of a real, battle-tested system while leveraging capabilities that system never had access to.

6

The Coordination Stack

Then you're ready to start implementing. The fastest way to do that is to start up a big swarm of agents that coordinate using three interlocking tools:

Coordination Logic

Beads, Agent Mail, and bv are a single machine

[Interactive diagram: the Coordination Triangle. Remove any vertex (Beads, Agent Mail, bv) and the system loses a capability it cannot replace.]
Capability Map

  • Agent Mail: the high-bandwidth negotiation layer.
  • br: the durable, localized issue state.
  • bv: the graph-theory compass for triage.

System Architecture

The trio is not three nice-to-have tools. It is one operating system split into memory, communication, and leverage analysis. Remove any side of the triangle and the swarm loses determinism.

Agent Mail + Beads + bv are what unlock the truly insane productivity gains.

Each tool is essential but insufficient alone. Agent Mail without beads leaves agents with no structured work to coordinate around. Beads without bv leaves agents randomly choosing tasks. bv without Agent Mail leaves agents unable to communicate. The system is distributed and decentralized, with each agent using bv to find the next optimal bead, marking it as in-progress, and communicating about it via Agent Mail.

Agent Mail: Why Naive Coordination Fails

Building your own agent coordination from scratch is full of footguns that Agent Mail was designed to sidestep:

  • No broadcast-to-all default. Agents are lazy and will only use broadcast mode, spamming every agent with mostly irrelevant information. It's like if your email system defaulted to reply-all every time. That burns precious context.
  • Good MCP ergonomics. It takes a huge amount of careful iteration to get the API surface right so agents use it reliably without wasting tokens.
  • No git worktrees. Worktrees demolish development velocity and create reconciliation debt when agents diverge. Working in one shared space surfaces conflicts immediately. All agents commit directly to main.
  • Advisory file reservations. Agents call dibs temporarily on files, but it's not rigidly enforced, and reservations expire. Agents can reclaim files that haven't been touched recently. Rigid locks held by dead agents block everyone else. Advisory reservations with TTL expiry degrade gracefully.
  • Semi-persistent identity. Agent Mail generates whimsical names like "ScarletCave" and "CoralBadger" — meaningful enough for coordination, disposable enough that losing one doesn't corrupt the system. No agent's identity is load-bearing.

Before editing files, agents reserve them via Agent Mail:

```text
file_reservation_paths(
    project_key="/data/projects/my-repo",
    agent_name="BlueLake",
    paths=["src/auth/*.rs"],
    ttl_seconds=3600,
    exclusive=true,
    reason="br-42: refactor auth"
)
```

Other agents see the reservation and work on different files. A rigid locking system would deadlock when an agent crashes while holding a lock. Advisory reservations with expiry degrade gracefully. The worst case is a brief window where two agents touch the same file, which the pre-commit guard catches anyway.

Agent Mail provides four high-level macros that wrap common multi-step patterns: macro_start_session (bootstrap: ensure project, register agent, fetch inbox), macro_prepare_thread (join existing thread with summary), macro_file_reservation_cycle (reserve, work, auto-release), and macro_contact_handshake (cross-agent contact setup).
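In the same call style as the reservation example, a session bootstrap via the first macro might look like this (hypothetical sketch; parameter names are assumptions):

```text
macro_start_session(
    project_key="/data/projects/my-repo",
    agent_name="BlueLake"
)
# -> ensures the project exists, registers the agent, and returns the unread inbox
```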

Interactive Visualization

Broadcast vs. Point-to-Point

Agent Mail uses targeted delivery and advisory locks to stay efficient: message noise stays O(1) per agent.

[Interactive diagram: four agents (UI, DB, API, Tests) exchange point-to-point messages such as "Need /api/users" and "Schema updated" instead of broadcasting to all.]

bv: The Graph-Theory Compass

The right answer to "what should I work on next?" comes from the dependency structure of the tasks, and it can be mechanically computed using basic graph theory. That's what bv does: it's a compass each agent can consult to see which direction will unlock the most work overall.

bv precomputes dependency metrics (PageRank, betweenness, HITS, eigenvector, critical path, cycle detection) so agents get deterministic, dependency-aware output. When multiple agents each independently query bv for priority, you get emergent coordination. Agents naturally spread across the optimal work frontier without needing a central coordinator.
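To make the intuition concrete, here is a toy power-iteration PageRank over a three-bead dependency graph (an illustrative sketch of the underlying idea, not bv's actual implementation; bead IDs reuse the Atlas Notes example):

```python
def pagerank(deps, damping=0.85, iters=100):
    """deps maps each bead to the beads it depends on; rank flows toward
    dependencies, so foundational beads everything depends on score highest."""
    nodes = list(deps)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        new = {node: (1 - damping) / n for node in nodes}
        for node, targets in deps.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new[t] += share
            else:  # dangling node: spread its rank uniformly
                for t in nodes:
                    new[t] += damping * rank[node] / n
        rank = new
    return rank

# br-102 and br-103 both depend on br-101, so br-101 is the foundation piece.
deps = {"br-101": [], "br-102": ["br-101"], "br-103": ["br-101"]}
ranks = pagerank(deps)
next_bead = max(ranks, key=ranks.get)   # "br-101"
```

When every agent runs the same deterministic computation against the same graph, they converge on the same frontier of optimal work without any central coordinator.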

| Pattern | Meaning | Action |
| --- | --- | --- |
| High PageRank + High Betweenness | Critical bottleneck | DROP EVERYTHING, fix this first |
| High PageRank + Low Betweenness | Foundation piece | Important but not currently blocking |
| Low PageRank + High Betweenness | Unexpected chokepoint | Investigate why this is a bridge |
| Low PageRank + Low Betweenness | Leaf work | Safe to parallelize freely |

PageRank finds what everything depends on. Betweenness finds bottlenecks. The math knows your priorities better than gut intuition.

```bash
bv --robot-triage                    # THE MEGA-COMMAND: full recommendations with scores
bv --robot-next                      # Minimal: just the single top pick + claim command
bv --robot-plan                      # Parallel execution tracks with unblocks lists
bv --robot-insights                  # Full graph metrics: PageRank, betweenness, HITS
bv --robot-priority                  # Priority recommendations with reasoning and confidence
bv --robot-diff --diff-since <ref>   # Changes since last check
```
Critical
Use ONLY --robot-* flags. Bare bv launches an interactive TUI that blocks your session.

bv was built in a single day at just under 7k lines of Go, and was later rewritten into an 80k-line tool with advanced features; the initial effort bore little relation to the eventual impact. The tool started out for humans but pivoted to being primarily for agents:

But the biggest improvement in terms of actual usefulness isn't for you humans at all! It's for your coding agents. They just need to run one simple command, bv --robot-triage, and they instantly get a massive wealth of insights into what to work on next.

Advanced filtering lets you scope analysis to labels, historical point-in-time views, pre-filtered recipes, or grouped output:

```bash
bv --robot-plan --label backend            # Scope to label's subgraph
bv --robot-insights --as-of HEAD~30        # Historical point-in-time
bv --recipe actionable --robot-plan        # Only unblocked items
bv --recipe high-impact --robot-triage     # Top PageRank scores
bv --robot-triage --robot-triage-by-track  # Group by parallel streams
bv --robot-triage --robot-triage-by-label  # Group by domain
```

Bead IDs as Threading Anchors

Bead IDs create a unified audit trail across all coordination layers: the bead ID goes in the Agent Mail thread_id, the subject prefix ([br-123]), the file reservation reason, and the commit message. This makes all coordination activity traceable back to a single task.
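A tiny helper shows how every coordination surface can be derived from one bead ID (an illustrative sketch of the convention described above, not an existing tool's API):

```python
def threading_anchors(bead_id, summary):
    """Derive every coordination identifier from a single bead ID so all
    activity for a task is traceable back to it."""
    return {
        "thread_id": bead_id,                        # Agent Mail thread
        "subject": f"[{bead_id}] {summary}",         # message subject prefix
        "reservation_reason": f"{bead_id}: {summary}",
        "commit_message": f"{bead_id}: {summary}",
    }

anchors = threading_anchors("br-123", "refactor auth")
```

Grep for the bead ID and you recover the mail thread, the file reservations, and the commits for that task in one pass.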

AGENTS.md: The Operating Manual

The AGENTS.md file is the single most critical piece of infrastructure for agent coordination. It tells every agent how to behave, what tools exist, what safety constraints matter, and what "doing a good job" means in this repo. Every tool should come with a prepared blurb designed for inclusion in AGENTS.md. Think of these blurbs as the modern equivalent of man pages.

Every AGENTS.md should include these core rules:

  1. Rule 0, The Override Prerogative: The human's instructions override everything.
  2. Rule 1, No File Deletion: Never delete files without explicit permission.
  3. No destructive git commands: git reset --hard, git clean -fd, rm -rf are absolutely forbidden.
  4. Branch policy: All work happens on main, never master.
  5. No script-based code changes: Always make code changes manually.
  6. No file proliferation: No mainV2.rs or main_improved.rs variants.
  7. Compiler checks after changes: Always verify no errors were introduced.
  8. Multi-agent awareness: Never stash, revert, or overwrite other agents' changes.

More content in AGENTS.md means more frequent compactions, but it saves time and avoids mistakes by giving agents all the context upfront. This tradeoff is worth making.
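A minimal skeleton of how these rules might be laid out in an actual AGENTS.md (a hypothetical sketch; adapt the wording and tool blurbs to your repo):

```markdown
# AGENTS.md

## Non-Negotiable Rules
0. Override prerogative: the human's instructions override everything below.
1. Never delete files without explicit permission.
2. Forbidden commands: `git reset --hard`, `git clean -fd`, `rm -rf`.
3. All work happens on `main`.
4. Make code changes manually; never via scripts.
5. Run compiler checks after every change and fix any new errors.
6. Never stash, revert, or overwrite other agents' changes.

## Tool Blurbs
- `br`: issue tracker. `br ready --json` lists unblocked work.
- `bv`: triage compass. Use ONLY `--robot-*` flags; bare `bv` opens a blocking TUI.
- Agent Mail: reserve files before editing; release reservations when done.
```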

If you don't have a good AGENTS.md file, none of this stuff is going to work well. You can see example AGENTS.md files for a complex NextJS webapp and a bash script project.

After compaction they become like drug-addled children and all bets are off. They need to be forced to read it again or they start acting insane.

"Reread AGENTS.md" is the single most common prompt prefix across the entire session archive. After every context compaction, agents must re-read it:

Post-Compaction Reset
Reread AGENTS.md so it's still fresh in your mind.
Context

Immediately after any context compaction (the single most commonly used prompt)

Psychology

Compaction wipes out the soft operational knowledge that keeps the swarm sane: how to behave, how to coordinate, what tools exist, what rules matter, what mistakes to avoid. This one-line prompt restores that control plane in one move. It rehydrates the agent's behavioral contract after context loss. Important enough to have been automated with the post_compact_reminder tool.

The pragmatic approach: do not fight compaction, just re-read AGENTS.md and roll with it until the agent starts doing dumb stuff, then start a new session. When beads are well-constructed, compaction matters less because each bead is self-contained. The agent can pick up any bead fresh without needing the full conversation history.

Single-Branch Git Model

I really think worktrees are a bad pattern and not worth the trouble. There are other, much better ways at getting multiple agents to work well together at the same time that don't come with the overhead and reconciliation burden of worktrees. I run like 10+ of them at once in a single project without git worktrees. Agent Mail is the solution.

All agents commit directly to main. This may surprise you if you're used to feature branches. But branch-per-agent creates merge hell with 10+ agents making frequent commits. Worktrees add filesystem complexity and path confusion. Agents lose context when switching branches. Logical conflicts survive textual merges: a function signature change on one branch and a new callsite on another merge cleanly but fail to compile. On a single branch, the second agent sees the signature change immediately and adapts.

Instead of branch isolation, three complementary mechanisms prevent conflicts: file reservations (agents reserve files via Agent Mail before editing; advisory, not rigid, with TTL expiry so dead agents cannot deadlock the system), a pre-commit guard (blocks commits to files reserved by another agent), and DCG (Destructive Command Guard, which mechanically blocks dangerous commands).
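The pre-commit guard's core check can be sketched in a few lines (illustrative only; the real guard's reservation storage and matching rules are assumptions here):

```python
from fnmatch import fnmatch

def blocked_paths(staged, reservations, me):
    """Return staged paths that fall under another agent's active reservation."""
    blocked = []
    for path in staged:
        for res in reservations:
            if res["agent"] != me and any(fnmatch(path, pat) for pat in res["paths"]):
                blocked.append(path)
                break
    return blocked

reservations = [{"agent": "BlueLake", "paths": ["src/auth/*.rs"]}]
staged = ["src/auth/mod.rs", "src/ui/app.rs"]
conflicts = blocked_paths(staged, reservations, me="CoralBadger")
# conflicts -> ["src/auth/mod.rs"]: the hook aborts the commit and names the owner
```

Because reservations are advisory with TTL expiry, the guard only blocks commits while a reservation is live; an expired or released reservation never deadlocks anyone.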

Note
DCG origin story: On December 17, 2025, an agent ran git checkout -- on uncommitted work. Files were recovered via git fsck --lost-found, but the incident proved that instructions do not prevent execution. Mechanical enforcement does. DCG was built the next day.

| Blocked Command | Safe Alternative | Why |
| --- | --- | --- |
| `git reset --hard` | `git stash` | Recoverable |
| `git checkout -- file` | `git stash push file` | Preserves changes |
| `git push --force` | `git push --force-with-lease` | Checks remote unchanged |
| `git clean -fd` | `git clean -fdn` (preview first) | Shows what would delete |
| `rm -rf /path` | `rm -ri /path` | Interactive confirmation |

The recommended git workflow: Pull latest, reserve files, edit and test, commit immediately, push, release reservation. Key principles: commit early and often (small commits reduce the conflict window), push after every commit (unpushed commits are invisible to other agents), reserve before editing, release when done.

You NEVER, under ANY CIRCUMSTANCE, stash, revert, overwrite, or otherwise disturb in ANY way the work of other agents. Just treat those changes identically to changes that you yourself made. Just fool yourself into thinking YOU made the changes and simply don't recall it for some reason.

Agent Fungibility

Every agent is a generalist. No role specialization. All agents read the same AGENTS.md and can pick up any bead. This is deliberately opposed to "specialist agent" architectures where one agent has a special role — specialist agents become single points of failure. When the specialist crashes or needs compaction, the whole system stalls. With 12 fungible agents, losing one makes almost no difference.

You also do not want "ringleaders": a coordinating boss agent whose crash takes down the whole system. Coordination must live in artifacts (beads, reservations, threads) and tools (bv, Agent Mail), not in any special agent.

Think of it like RaptorQ fountain codes: beads are "blobs" in a stream, any agent catches any bead in any order. There is no "rarest chunk" bottleneck, and the system is resilient to partial agent failures by design. Failure recovery is trivial: the bead remains marked in_progress, any other agent can resume it, and a replacement agent is just ntm add PROJECT --cc=1 plus the standard marching orders prompt.
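The fountain-code property (any agent, any bead, any order) works because claiming a bead is an atomic test-and-set. A sketch using `mkdir`, which is atomic on POSIX filesystems, as a stand-in for marking a bead in_progress via `br`:

```shell
# Illustrative claim protocol. The real flow marks the bead in_progress
# in beads and announces the claim over Agent Mail; mkdir's atomicity
# stands in for that here.
CLAIMS_DIR="${CLAIMS_DIR:-.bead_claims}"

claim_bead() {
  bead="$1"; agent="$2"
  mkdir -p "$CLAIMS_DIR"
  if mkdir "$CLAIMS_DIR/$bead" 2>/dev/null; then
    echo "$agent" > "$CLAIMS_DIR/$bead/owner"
    return 0   # this agent owns the bead now
  fi
  return 1     # another agent got there first; pick a different bead
}

release_bead() {
  bead="$1"
  rm -f "$CLAIMS_DIR/$bead/owner"
  rmdir "$CLAIMS_DIR/$bead"   # crash recovery: anyone may clear and resume
}
```

Because no claim is privileged, replacing a dead agent really is just spawning a new one; the claim it abandoned can be released and re-taken by anybody.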

When all the agents are fungible, one breaking is barely a problem. Agents become commodities that can be instantiated and destroyed at will; the only downside is some slowdown and some wasted tokens.

Fungible Agent Crash Recovery

Interactive exhibit: every agent is fungible. Kill any of them and the swarm self-heals without downtime or data loss.

Prompts Are Deliberately Generic

My projects all start with a ridiculously detailed and comprehensive markdown plan file which is then turned into a comprehensive set of beads (tasks), so the vagueness in the prompts is a feature, letting me reuse them for every project, while the agent gets the specifics they need from the plan and the beads.

This confuses people when they first see the prompt library. The prompts say things like "check over each bead super carefully" rather than "check over each bead in the authentication module for SQL injection risks." That generality is the point. The specificity lives in three places the agent already has access to:

  1. The beads themselves contain detailed descriptions, context, and rationale embedded during the plan-to-bead conversion.
  2. AGENTS.md contains project-specific rules, conventions, and tool documentation.
  3. The codebase contains the actual implementation context.

The prompts are the reusable scaffolding that directs the agent's attention. The beads and AGENTS.md supply the project-specific substance. This separation means you can use the exact same prompt library across every project without modification. The prompt "reread AGENTS.md so it's still fresh in your mind" followed by "use bv to find the most impactful bead to work on next" works identically whether you are building a CLI tool, a web app, or a protocol library, because the specifics come from the project's own artifacts, not from the prompt.

Security Comes Free with Good Planning

Security review is baked into the standard workflow at multiple levels rather than being a separate phase. The cross-agent review prompt explicitly calls out security problems. When models reason about an entire system's architecture at once (which is what the plan enables), they spot authentication gaps, data exposure risks, and trust boundary violations without being told to look. UBS catches security anti-patterns mechanically: unpinned dependencies, missing input validation, hardcoded secrets, supply chain vulnerabilities. Beads that include comprehensive e2e tests naturally cover authentication and authorization paths.

Security vulnerabilities are usually symptoms of incomplete reasoning about the system. If the plan is detailed enough to cover all user workflows, edge cases, and failure modes, security considerations emerge from that completeness rather than requiring a separate checklist. For projects with explicit security requirements (financial, healthcare), add dedicated security review beads.

7

Launching & Running the Swarm

You can create sessions using Claude Code, Codex, and Gemini-CLI in different panes in tmux, or use the ntm project (Named Tmux Manager) as the command center:

```bash
# Spawn a multi-agent session
ntm spawn myproject --cc=2 --cod=1 --gmi=1

# Send a prompt to ALL agents
ntm send myproject "Your marching orders prompt here"

# Send to specific agent type
ntm send myproject --cc "Focus on the API layer"

# Open the command palette (battle-tested prompts)
ntm palette
```

NTM is useful but not mandatory. A mux is a terminal multiplexer: a layer that lets you manage multiple shell sessions inside one higher-level session manager. In practice, that usually means some combination of tabs, panes, detached sessions, and reconnection to work that is still running on a local or remote machine. tmux is the classic Unix terminal multiplexer, powerful and battle-tested. NTM is built on top of tmux, which is why it is a natural fit for multi-agent work. But tmux is only one mux. WezTerm has its own built-in mux. Zellij is another. The method cares that you have a workable orchestration layer, not that you picked one specific multiplexer.

One common alternative is WezTerm, whose native scrollback and text selection are more convenient than tmux's. A workable setup:

  • Run agents in separate tabs using WezTerm and its built-in mux, often across remote machines
  • Trigger your most common prompts from a Stream Deck with the prompts preconfigured
  • Keep a large prompt file open in Zed and paste rarer prompts manually
  • In Claude Code, use the project-specific Ctrl-r prompt history search when you want to recall something you used recently

There is no single correct operator interface. NTM is one good cockpit. WezTerm tabs plus mux is another. FrankenTerm, which is built on WezTerm, is aimed more explicitly at this style of workflow but is not ready yet. The important thing is that you can launch agents, get prompts into them quickly, monitor them, and keep the coordination layer (AGENTS.md, Agent Mail, beads, bv) intact.


Give each agent these marching orders:

Swarm Marching Orders
First read ALL of the AGENTS.md file and README.md file super carefully and understand ALL of both! Then use your code investigation agent mode to fully understand the code, and technical architecture and purpose of the project. Then register with MCP Agent Mail and introduce yourself to the other agents. Be sure to check your agent mail and to promptly respond if needed to any messages; then proceed meticulously with your next assigned beads, working on the tasks systematically and meticulously and tracking your progress via beads and agent mail messages. Don't get stuck in "communication purgatory" where nothing is getting done; be proactive about starting tasks that need to be done, but inform your fellow agents via messages when you do so and mark beads appropriately. When you're not sure what to do next, use the bv tool mentioned in AGENTS.md to prioritize the best beads to work on next; pick the next one that you can usefully work on and get started. Make sure to acknowledge all communication requests from other agents and that you are aware of all active agents and their names. Use ultrathink.
Context

Every agent in the swarm gets this as their initial prompt

Psychology

This is the closest thing to a canonical swarm kickoff packet. It front-loads the shared operating context, forces the agent to establish social presence through Agent Mail, and then pivots away from passive waiting toward execution. The line about 'communication purgatory' matters because swarm failure often comes from over-coordination rather than under-coordination. Under the hood, the prompt establishes a control loop: load rules, understand the codebase, join the coordination layer, claim work, keep state synchronized, and use bv whenever local judgment is insufficient. The rch requirement is especially important in real swarms because it externalizes expensive builds and tests, preventing local CPU contention from degrading the entire multi-agent system. That one sentence is operational, not cosmetic. The prompts are deliberately generic; their vagueness is a feature, letting you reuse them for every project while the agent gets specifics from AGENTS.md and the beads.

The First 10 Minutes After Launch

Newcomers often understand each individual tool but do not have a clean picture of the first live operating loop. In practice, the first 10 minutes look like this:

  1. Your session manager creates the agent terminals (ntm spawn, WezTerm mux, or equivalent).
  2. You send the marching-orders prompt to each agent (staggered, not all at once).
  3. Each agent reads AGENTS.md and the repo docs, inspects the codebase, and joins Agent Mail.
  4. Each agent checks who else is active, acknowledges waiting messages, and learns the bead-thread naming conventions.
  5. Each agent uses bv --robot-triage and br ready --json to choose a bead.
  6. Before editing, the agent reserves the relevant file surface and announces the claim in the matching br-### thread.
  7. Only then does the agent start coding, reviewing, or testing.

That sequence turns a pile of terminals into a coordinated swarm. Skipping the join-up steps produces duplicate work, silent conflicts, and "communication purgatory." Skipping the routing steps means agents choose work randomly instead of unlocking the dependency graph intelligently.
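The routing in step 5 can be approximated with a one-level unblock count. Real bv walks the full dependency graph; this sketch, using invented file formats (one ready bead id per line, and "blocker blocked" dependency edges), only captures the core heuristic of preferring the bead that unblocks the most downstream work:

```shell
# Toy version of bv's routing: among ready beads, pick the one that
# directly blocks the most other beads. File formats are illustrative.
pick_bead() {
  ready_list="$1"   # file: one ready bead id per line
  deps="$2"         # file: "blocker_id blocked_id", one edge per line
  while read -r bead; do
    count=$(grep -c "^$bead " "$deps" || true)
    echo "$count $bead"
  done < "$ready_list" | sort -rn | head -1 | cut -d' ' -f2
}
```

Choosing by unblock count instead of arbitrarily is what turns four terminals into a frontier-expanding system rather than a random walk.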

Agent Composition & Model Recommendations

| Phase | Recommended Model | Why |
| --- | --- | --- |
| Initial plan creation | GPT Pro (web) | Extended reasoning, all-you-can-eat pricing |
| Plan synthesis | GPT Pro (web) | Best at being the 'final arbiter' |
| Plan refinement | GPT Pro + Opus (web) | Pro reviews, Claude integrates |
| Plan → Beads conversion | Claude Code (Opus) | Best coding agent for structured creation |
| Bead polishing | Claude Code (Opus) | Consistent, thorough |
| Implementation | Claude Code + Codex + Gemini | Diverse swarm |
| Code review | Claude Code + Gemini | Gemini good for review duty |
| Final verification | Codex (GPT) | Different model catches different things |

Per-agent efficiency declines as N grows, but if you have enough tasks in beads, the agents use Agent Mail, and you don't start them all at the exact same time, total throughput still increases with N. The practical limit is around 12 agents on a single project, sometimes higher; alternatively, run 5 agents per project across multiple projects simultaneously. Why the ratio --cc=2 --cod=1 --gmi=1? Two Claude sessions because they are great for architecture and complex reasoning; one Codex for fast iteration and testing with complementary strengths; one Gemini for a different perspective, especially good for docs and review duty.

| Open Beads | Claude (cc) | Codex (cod) | Gemini (gmi) |
| --- | --- | --- | --- |
| 400+ | 4 | 4 | 2 |
| 100-399 | 3 | 3 | 2 |
| <100 | 1 | 1 | 1 |
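Read as a function (flag syntax follows the earlier ntm examples), the scaling table above becomes:

```shell
# Map the count of open beads to a Claude/Codex/Gemini spawn mix,
# following the scaling table in the text.
mix_for_beads() {
  n="$1"
  if   [ "$n" -ge 400 ]; then echo "--cc=4 --cod=4 --gmi=2"
  elif [ "$n" -ge 100 ]; then echo "--cc=3 --cod=3 --gmi=2"
  else                        echo "--cc=1 --cod=1 --gmi=1"
  fi
}
```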

The Thundering Herd

When you start up like 5 of each kind of agent and have them all collaborate in the same shared workspace, you can hit the classic "thundering herd" problem. The fix: stagger agent starts by 30 seconds minimum, make sure agents mark beads as in-progress quickly, and wait 4 seconds after launch before sending the initial prompt. For Codex specifically: send Enter twice after pasting long prompts (Codex has an input buffer quirk that sometimes swallows the first submit).
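Those timing rules can be captured in a dry-run launch planner. The `ntm add` invocation mirrors the earlier examples; the planner only prints the schedule rather than executing it, so the stagger math is explicit:

```shell
# Print a staggered launch schedule: spawn, brief pause before the first
# prompt, then a full stagger gap before the next agent. Defaults follow
# the guidance in the text (30s stagger, ~4s prompt delay).
STAGGER="${STAGGER:-30}"
PROMPT_DELAY="${PROMPT_DELAY:-4}"

plan_staggered_launch() {
  project="$1"; shift
  t=0
  for agent_flag in "$@"; do
    echo "t+${t}s: ntm add $project $agent_flag"
    t=$((t + PROMPT_DELAY))
    echo "t+${t}s: send marching orders to the new agent"
    t=$((t + STAGGER))
  done
}
```

Each agent therefore wakes to a frontier already changed by its predecessors, which is exactly what prevents the herd.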

Interactive Exhibit

The same swarm can either stampede or flow

Advance phase by phase and compare the exact same four agents under two launch strategies. The lesson is timing, not talent.

Thundering Herd (collision-prone): all four agents wake together, re-read context together, and pile onto the same frontier of ready beads. Nothing is broken yet, but the timing guarantees a crowded first claim; the swarm is synchronized before any useful work has started.

Staggered Start (frontier-friendly): agents enter a few beats apart, so each arrival sees a different clean frontier. Only the first agent is awake while the rest are still off the critical path, and the swarm keeps optionality because it has not synchronized itself into contention.

Comparative takeaway: in the opening phase, the herd path has already accrued six units of idle burn to the staggered path's two, with zero completed work on either side. The difference is not smarter agents. It is whether the system lets them reach distinct frontiers at distinct times.

What the Human Actually Does

The human tends the swarm like an operator tending a machine that mostly runs on its own. These tasks are monitoring and maintenance. The hard cognitive work already happened during planning, which is why you can tend multiple project swarms at the same time.

On roughly a 10-30 minute cadence:

  1. Check bead progress. Use br list --status in_progress --json or bv --robot-triage. Are agents making steady progress? Are any beads stuck?
  2. Handle compactions. When you see an agent acting confused, send: "Reread AGENTS.md so it's still fresh in your mind." This is the single most common intervention. It takes 5 seconds.
  3. Run periodic reviews. Pick an agent and send the "fresh eyes" review prompt. This catches bugs before they compound.
  4. Manage rate limits. When an agent gets rate-limited, switch its account with caam activate claude backup-2 or start a new agent.
  5. Commit periodically. Every 1-2 hours, designate one agent for the organized commit prompt.
  6. Handle surprises. Create new beads for unanticipated issues, or if it's plan-level, update the plan and create new beads.
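Step 1's stuck-bead check can be mechanized. The input format here is invented for illustration; in practice you would derive it from `br list --status in_progress --json`:

```shell
# Flag in_progress beads that have been idle longer than a threshold.
# Input format (illustrative): "id status last_update_epoch", one per line.
stuck_beads() {
  beads="$1"; max_age="${2:-3600}"   # default: flag beads idle > 1 hour
  now=$(date +%s)
  while read -r id status updated; do
    if [ "$status" = "in_progress" ]; then
      age=$((now - updated))
      if [ "$age" -gt "$max_age" ]; then
        echo "$id idle ${age}s: check Agent Mail, consider reclaiming"
      fi
    fi
  done < "$beads"
}
```

A bead flagged this way maps straight to the diagnosis table below this section: check Agent Mail, reclaim, or split the blocker into a clearer bead.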
YOU are the bottleneck. Be the clockwork deity to your agent swarms: design a beautiful and intricate machine, set it running, and then move on to the next project. By the time you come back to the first one, you should have huge chunks of work already done and ready.
With the right tooling and workflows (Agent Mail + beads + bv), the work transforms into roughly 80% planning: AI-assisted design from an initial prompt, then turning the markdown plan into very detailed and granular beads. The rest is just making sure the swarm of agents stays busy and executes its bead tasks effectively.

Taken to its endpoint, this design supports full autonomy: one puppet master agent controlling ntm via robot mode, replacing the human for routine machine-tending. The methodology is building toward a future where the human designs the plan, polishes the beads, and then walks away entirely while agents execute, review, ship, and start the next cycle.

When the "foregone conclusion" breaks down: If you find yourself doing heavy cognitive work during implementation, that is a signal that planning or bead polishing was insufficient. The remedies are specific: vague beads means agents improvise and produce inconsistent implementations; missing dependencies means agents work on tasks whose prerequisites are not done; thin AGENTS.md means agents produce non-idiomatic code; no Agent Mail means agents step on each other's files. The fix is always the same: pause implementation, go back to bead space, and add the missing detail.

Critical
Watch for strategic drift. A swarm can look productive while heading in the wrong direction: agents generating lots of code and commits while the real goal still feels far away. A recent X post about FrankenEngine described these as "Come to Jesus" moments with the agents, used to make sure "we are not losing sight of the bigger picture" after days of methodically cranking through beads. If that happens, stop and ask: "Where are we on this project? Do we actually have the thing we are trying to build? If we intelligently implement all open beads, would we close that gap completely?" If the answer is no, add or revise beads, re-polish them, and resume with a corrected frontier. Busy agents are not the goal; a bead graph that actually converges on the project goal is the goal.

Diagnosing a Stuck Swarm

When a swarm goes bad, the failure is usually one of two things: a local coordination jam (agents stepping on each other or losing operational context) or a strategic drift problem (the swarm is busy but no longer closing the real gap to the goal).

| Symptom | Likely Cause | What to Do |
| --- | --- | --- |
| Multiple agents keep picking the same bead | Starts were not staggered; agents are not marking in_progress | Stagger starts, force explicit Agent Mail claim messages, check reservations |
| Agent goes in circles after compaction | It forgot the operating contract in AGENTS.md | Force "Reread AGENTS.md"; kill/restart the session if still erratic |
| A bead sits in_progress for too long | Agent crashed, silently blocked, or lost the plot | Check Agent Mail, reclaim the bead, split out the blocker into a clearer bead |
| Agents produce contradictory implementations | Not coordinating through Agent Mail and reservations | Audit reservation use, revise bead boundaries if overlapping |
| Lots of code and commits, but goal still feels far | Strategic drift; current beads do not close the remaining gap | Stop, run the reality check prompt, revise the bead graph |

Atlas Notes as a Live Swarm

For a small project like Atlas Notes, a first swarm might look like this: Claude agent A claims br-101 and implements upload + parse handling. Codex agent B claims br-102 and works on the search path plus tests. Claude agent C claims br-103 and builds the admin failure dashboard. Gemini agent D stays flexible: reviews recent work, checks docs, and fills in test or UX gaps where needed. All four share the same codebase, read the same AGENTS.md, coordinate via Agent Mail, and use bv whenever they are uncertain about what unlocks the most progress next. That is what makes the swarm feel like one system rather than four unrelated terminals.

Account Switching

When you hit rate limits, use CAAM (Coding Agent Account Manager) for sub-100ms account switching:

```bash
caam status                    # See current accounts and usage
caam activate claude backup-2  # Switch instantly
```
8

Review, Testing & Hardening

Code review in a multi-agent swarm follows a different rhythm than traditional code review. There is no pull request, no human reviewer, no approval gate. Instead, review is woven into the implementation cycle itself: agents review their own work after each bead, review each other's work periodically, and the human triggers broader review rounds at natural checkpoints.

I'm constantly having them review their own work and the work of other agents throughout the process. I have routines where I just do this deep review with them until they stop coming up with any problems. Then I have them also create tons of unit tests and e2e tests.

If you've done a good job creating your beads, the agents will be able to get a decent sized chunk of work done in that first pass. Then, before they start moving to the next bead, have them review all their work:

Fresh Eyes Review
Great, now I want you to carefully read over all of the new code you just wrote and other existing code you just modified with "fresh eyes" looking super carefully for any obvious bugs, errors, problems, issues, confusion, etc. Carefully fix anything you uncover. Use ultrathink.
Context

After each bead is implemented; run until no more bugs are found

Psychology

This prompt is short because it is not redirecting the agent into a new domain. It is forcing a mode switch from generative coding to adversarial reading. The phrase 'fresh eyes' pushes the model to reframe code it just wrote as something potentially wrong, confusing, or internally inconsistent. That reduces the pattern where an agent stops once code compiles and never performs the low-cost bug sweep that catches obvious issues. The most effective reviews use subagent delegation: dispatch a fresh subagent with no memory of the original implementation to review each changed file.

Keep running rounds until they stop finding bugs. Typically 1-2 rounds for simple beads, 2-3 for complex ones. If an agent keeps finding bugs after 3 rounds, the implementation approach may be fundamentally off; consider having a different agent take over.

Note
You do not need to send the review prompt manually every time. Include it as part of the agent's workflow expectations in AGENTS.md: "After completing each bead, do a self-review before moving to the next one." Well-trained agents (with a good skill) will do this automatically.

Each review should answer four questions:

  1. Is the implementation correct? Does it do what the bead description says it should?
  2. Are there edge cases? Empty inputs, concurrent access, error paths, boundary conditions.
  3. Are there similar issues elsewhere? If you find a bug, search for the same pattern in other files.
  4. Should the approach be different? Sometimes the implementation is correct but there is a simpler or more robust way.

When reviews come back clean, have them move on to the next bead:

Advance to Next Bead
Reread AGENTS.md so it's still fresh in your mind. Use ultrathink. Use bv with the robot flags (see AGENTS.md for info on this) to find the most impactful bead(s) to work on next and then start on it. Remember to mark the beads appropriately and communicate with your fellow agents. Pick the next bead you can actually do usefully now and start coding on it immediately; communicate what you're working on to your fellow agents and mark beads appropriately as you work. And respond to any agent mail messages you've received.
Context

After self-review comes back clean

Psychology

This transition prompt is the glue between beads. It combines re-reading AGENTS.md (for compaction safety), querying bv for priority, and communicating with the swarm. It ensures the agent uses graph-theory routing to choose the task that unblocks the most downstream work, rather than picking arbitrarily.

Testing: Free Labor

When all your beads are completed, make sure you have solid test coverage:

Test Coverage
Do we have full unit test coverage without using mocks/fake stuff? What about complete e2e integration test scripts with great, detailed logging? If not, then create a comprehensive and granular set of beads for all this with tasks, subtasks, and dependency structure overlaid with detailed comments. Use ultrathink.
Context

After initial implementation pass is complete

The tests become obsolete and need to be revised as the code changes, which slows down dev velocity. But if all the tests are written and maintained by agents, who cares? Add another couple agents to the swarm and let them deal with updating the tests and running them. It's free!

Larger projects produce massive test suites. BrennerBot has nearly 5,000 tests. Stuff tends to "just work" in that case. Use UBS (Ultimate Bug Scanner) as a quality gate before every commit: ubs <changed-files> catches errors beyond what linters and type checkers find, including security holes, supply chain vulnerabilities, and runtime stability issues.
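The UBS quality gate drops naturally into a pre-commit hook. The hook wiring below is illustrative, with the scanner command made pluggable (via a hypothetical SCANNER variable) so the shape can be exercised even without UBS installed:

```shell
# Pre-commit quality gate shape: scan the changed files, block the commit
# on findings. `ubs` is the real tool; SCANNER override is for illustration.
quality_gate() {
  scanner="${SCANNER:-ubs}"
  if [ "$#" -eq 0 ]; then return 0; fi   # nothing staged, nothing to scan
  if ! "$scanner" "$@"; then
    echo "commit blocked: fix scanner findings before committing" >&2
    return 1
  fi
  return 0
}
```

In a real hook you would invoke it as `quality_gate $(git diff --cached --name-only --diff-filter=ACM)`.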

After any substantive code changes, always verify with compiler checks:

```bash
# Rust
cargo check --all-targets
cargo clippy --all-targets -- -D warnings
cargo fmt --check

# Go
go build ./...
go vet ./...

# TypeScript
bun typecheck
bun lint
```

UI/UX Polish

For projects with a user interface, there is a dedicated polishing phase that happens after core functionality works but before shipping. This is separate from bug hunting because the problems you are looking for are not bugs; they are friction, ugliness, and missed opportunities to delight. When an agent implements an "authentication" bead, it focuses on making auth work correctly. Whether the login form has good visual hierarchy, whether the error messages are helpful, whether the flow feels smooth on mobile: these are orthogonal concerns requiring a different mode of attention. Trying to do both at once produces mediocre results on both.

The workflow has five steps: Step 1, run the general scrutiny prompt to generate a list of improvement suggestions (not code changes). Step 2, review the suggestions and pick which to pursue (human judgment step; the agent typically generates 15-30 suggestions, some excellent, some unnecessary). Step 3, turn selected suggestions into beads and implement through the normal swarm process. Step 4, run the platform-specific polish prompt. Step 5, repeat until improvements become marginal (typically 2-3 rounds).

UI/UX Scrutiny
Great, now I want you to super carefully scrutinize every aspect of the application workflow and implementation and look for things that just seem sub-optimal or even wrong/mistaken to you, things that could very obviously be improved from a user-friendliness and intuitiveness standpoint, places where our UI/UX could be improved and polished to be slicker, more visually appealing, and more premium feeling and just ultra high quality, like Stripe-level apps. Use ultrathink.
Context

After core functionality is working

Platform-Specific Polish
I still think there are strong opportunities to enhance the UI/UX look and feel and to make everything work better and be more intuitive, user-friendly, visually appealing, polished, slick, and world class in terms of following UI/UX best practices like those used by Stripe, don't you agree? And I want you to carefully consider desktop UI/UX and mobile UI/UX separately while doing this and hyper-optimize for both separately to play to the specifics of each modality. I'm looking for true world-class visual appeal, polish, slickness, etc. that makes people gasp at how stunning and perfect it is in every way. Use ultrathink.
Context

After the scrutiny pass

Psychology

The 'don't you agree?' phrasing is not politeness. It triggers the model to critically evaluate its own previous work rather than just validating it.

De-Slopification

After agents write documentation (README, user-facing text), run a de-slopify pass to remove telltale AI writing patterns. This must be done manually, not via regex. Read each line and revise systematically:

| Pattern | Problem |
| --- | --- |
| Emdash overuse | LLMs use emdashes constantly, even when semicolons, commas, or sentence splits work better |
| "It's not X, it's Y" | Formulaic contrast structure |
| "Here's why" / "Here's why it matters:" | Clickbait-style lead-in |
| "Let's dive in" | Forced enthusiasm |
| "At its core..." | Pseudo-profound opener |
| "It's worth noting..." | Unnecessary hedge |

Deep Cross-Agent Review

This phase is distinct from the per-bead self-reviews above. Self-reviews happen after each bead is completed and focus on the code that was just written. Deep review happens after all (or most) beads are done and casts a wider net across the entire codebase, looking for problems that only become visible when you see how all the pieces fit together.

Cross-agent review catches a fundamentally different class of bugs than self-review. When Agent A implements a function and Agent B calls it, Agent A's self-review will never catch the fact that Agent B is passing arguments in the wrong order, because Agent A does not know about Agent B's code. Cross-agent review surfaces these integration issues.

Every 30-60 minutes during active implementation, or after a natural milestone (e.g., all beads in an epic are done), trigger cross-agent review. Do not have all agents stop to review simultaneously; pick one or two agents that just finished a bead and send them the review prompt while the others keep implementing. This keeps the swarm productive while still catching inter-agent issues.

Keep doing rounds of these two prompts until they consistently come back clean with no changes made. These prompts serve different purposes and should be alternated. This is one of the more art-than-science parts of the methodology. The prompts overlap in literal meaning, but they reliably activate different search behaviors in the models:

Random Code Exploration
I want you to sort of randomly explore the code files in this project, choosing code files to deeply investigate and understand and trace their functionality and execution flows through the related code files which they import or which they are imported by. Once you understand the purpose of the code in the larger context of the workflows, I want you to do a super careful, methodical, and critical check with "fresh eyes" to find any obvious bugs, problems, errors, issues, silly mistakes, etc. and then systematically and meticulously and intelligently correct them. Be sure to comply with ALL rules in AGENTS.md and ensure that any code you write or revise conforms to the best practice guides referenced in the AGENTS.md file. Use ultrathink.
Context

Alternate with the cross-agent review below

Psychology

The prompt first asks the agent to build a mental model of purpose and flow, then asks for criticism. That ordering matters. A bug hunt without workflow understanding degrades into linting; a bug hunt after tracing execution flows catches logic errors, mismatched assumptions, and silent product-level breakage. The 'randomly explore' framing breaks the locality trap. Directed reviews focus on files that seem important, which are the files that got the most attention already. Bugs that survive to this phase live in utility modules, error handling paths, configuration parsing, and edge-case branches.

Cross-Agent Review
Ok can you now turn your attention to reviewing the code written by your fellow agents and checking for any issues, bugs, errors, problems, inefficiencies, security problems, reliability issues, etc. and carefully diagnose their underlying root causes using first-principle analysis and then fix or revise them if necessary? Don't restrict yourself to the latest commits, cast a wider net and go super deep! Use ultrathink.
Context

Alternate with the random exploration above

Psychology

This prompt forces the swarm to stop treating code ownership as sacred. A large share of real defects live at the boundaries between agents' changes or in assumptions nobody revisits because they were made by 'someone else.' The instruction not to restrict review to the latest commits prevents shallow PR-style skimming and pushes the agent to trace older surrounding code, dependency surfaces, and adjacent workflows where the real root cause may live. The first-principles wording nudges the reviewer away from symptom-fixing toward actual causal diagnosis.

The cross-agent prompt tends to induce a suspicious, adversarial stance aimed at boundary failures and root causes in code written by others. The random-exploration prompt tends to induce a curiosity-driven stance aimed at reconstructing workflows and finding latent bugs in code that nobody is actively staring at. In practice, alternating them produces better coverage than repeating either one alone.

How to run deep bug hunting: Send the random exploration prompt to 2-3 agents simultaneously — each will explore different parts of the codebase because the randomness ensures variety. After they report back, send the cross-agent review prompt. Alternate until agents consistently come back with "I reviewed X, Y, Z files and found no issues." When two consecutive rounds both come back clean, the codebase is in good shape. If agents keep finding bugs after 4+ rounds, go back to bead space and create specific fix beads. Always run ubs . on the full project first and fix everything it flags before letting agents hunt for subtler issues.
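The stopping rule (two consecutive clean rounds to finish, escalation after four or more dirty rounds) can be written down as a loop. `run_review_round` is a hypothetical stand-in for sending one review prompt and reading back whether the agent reported any findings:

```shell
# Alternate review rounds until two consecutive rounds come back clean.
# run_review_round is a stand-in: exit 0 means "no issues found this round".
review_until_clean() {
  clean_streak=0
  round=0
  while [ "$clean_streak" -lt 2 ]; do
    round=$((round + 1))
    if run_review_round "$round"; then
      clean_streak=$((clean_streak + 1))
    else
      clean_streak=0
      if [ "$round" -ge 4 ]; then
        echo "persistent bugs after $round rounds: go back to bead space"
        return 1
      fi
    fi
  done
  echo "clean after $round rounds"
}
```

The escalation branch matches the guidance above: persistent findings are a bead-space problem, not a review-effort problem.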

Organized Commits

Periodically have one agent handle git operations:

Organized Commits
Now, based on your knowledge of the project, commit all changed files now in a series of logically connected groupings with super detailed commit messages for each and then push. Take your time to do it right. Don't edit the code at all. Don't commit obviously ephemeral files. Use ultrathink.
Context

Every 1-2 hours during active development

Psychology

Designating one agent prevents merge conflicts and produces coherent commit messages. The 'don't edit the code' instruction is critical: without it, agents treat the commit step as an opportunity to 'fix one more thing,' which creates unbounded scope expansion and makes the commit unpredictable. The 'logically connected groupings' instruction produces a meaningful git history instead of one monolithic 'update everything' commit that is impossible to review or bisect later.

Swarm Diagnosis: Reality Check

When the swarm looks active but you suspect it is not closing the real gap to the goal:

High-Level Reality Check
Where are we on this project? Do we actually have the thing we are trying to build? If not, what is blocking us? If we intelligently implement all open and in-progress beads, would we close that gap completely? Why or why not?
Context

When the swarm feels busy but directionally off

Psychology

This prompt breaks the spell of local productivity. Instead of asking whether the current bead is going well, it asks whether the current frontier of work actually converges on the project outcome. If the agent concludes that finishing all open beads still would not get you there, the answer is not 'work harder.' The answer is to revise the bead graph and re-aim the swarm.

README Revision

README Reviser
OK, we have made tons of recent changes that aren't yet reflected in the README file. First, reread AGENTS.md so it's still fresh in your mind. Now, we need to revise the README for these changes (don't write about them as "changes" however, make it read like it was always like that, since we don't have any users yet!). Also, what else can we put in there to make the README longer and more detailed about what we built, why it's useful, how it works, the algorithms/design principles used, etc? This should be incremental NEW content, not replacement for what is there already.
Context

After significant implementation work

Catch-All Oversight

Quick Final Sanity Check
Great. Look over everything again for any obvious oversights or omissions or mistakes, conceptual errors, blunders, etc.
Context

After any significant change, as a quick final pass

The De-Slopify Prompt

De-Slopify Documentation
I want you to read through the complete text carefully and look for any telltale signs of "AI slop" style writing; one big tell is the use of emdash. You should try to replace this with a semicolon, a comma, or just recast the sentence accordingly so it sounds good while avoiding emdash. Also, you want to avoid certain telltale writing tropes, like sentences of the form "It's not [just] XYZ, it's ABC" or "Here's why" or "Here's why it matters:". Basically, anything that sounds like the kind of thing an LLM would write disproportionately more commonly than a human writer and which sounds inauthentic/cringe. And you can't do this sort of thing using regex or a script, you MUST manually read each line of the text and revise it manually in a systematic, methodical, diligent way.
Context

After agents write README or any user-facing documentation

Landing the Plane

When ending a work session, agents must complete every step. Work is NOT complete until git push succeeds. Unpushed work is stranded locally and invisible to every other agent.

  1. File issues for remaining work. Create beads for anything that needs follow-up.
  2. Run quality gates. Tests, linters, builds (if code changed).
  3. Update issue status. Close finished work, update in-progress items.
  4. Sync beads. br sync --flush-only to export to JSONL, then git add .beads/.
  5. Commit and push. git pull --rebase && git add <files> && git commit && git push.
  6. Verify. git status must show "up to date with origin."

For the Atlas Notes example, "done for now" would not mean "the upload page appears." It would mean: the upload, parse, search, and admin-review workflows all work end to end; the key beads are closed and remaining polish ideas exist as new beads; tests cover the critical user journeys and known failure paths; UBS and compiler/lint checks are clean; commits and pushes are complete; and the next session can restart from beads, AGENTS.md, and Agent Mail threads rather than from human memory.

A Flywheel session is only landable when a future swarm can pick it back up without the human re-explaining the project from scratch.

9

The Complete Toolchain

The Flywheel is supported by a stack of 11 purpose-built tools, all free and open-source:

| Tool | Command | Purpose |
| --- | --- | --- |
| NTM | ntm | Named Tmux Manager: agent cockpit for spawning, sending, and broadcasting |
| Agent Mail | am | Agent coordination: identities, inbox/outbox, file reservations |
| UBS | ubs | Ultimate Bug Scanner: 1000+ patterns, pre-commit guardrails |
| Beads | br | Issue tracking: dependency-aware, JSONL+SQLite hybrid |
| Beads Viewer | bv | Triage engine: PageRank, betweenness, HITS, robot mode |
| RCH | rch | Remote build offloading: keeps heavy CPU work off the swarm box |
| CASS | cass | Session search: unified agent history indexing |
| CASS Memory | cm | Procedural memory: episodic → working → procedural |
| CAAM | caam | Auth switching: sub-100ms account swap across providers |
| DCG | dcg | Safety guard: blocks destructive git/filesystem operations mechanically |
| SLB | slb | Two-person rule: optional guardrails for dangerous commands |

Not every tool is used the same way. br, bv, ubs, and rch are ordinary shell commands. Agent Mail is primarily experienced through MCP tools and macros. The installer (agent-flywheel.com) installs all of them with a single curl|bash command.

The Flywheel Interactions

The complete interaction flow from spawn to memory:

text
NTM spawns agents --> Agents read AGENTS.md
--> Agents register with Agent Mail
--> Agents query bv for task priority
--> Agents claim beads via br
--> Agents reserve files via Agent Mail
--> Agents implement and test
--> UBS scans for bugs
--> Agents commit and push
--> CASS indexes the session
--> CM distills procedural memory
--> Next cycle is better

The VPS Environment

| Aspect | Value |
| --- | --- |
| User | ubuntu |
| Shell | zsh (with oh-my-zsh + powerlevel10k) |
| Workspace | /data/projects |
| Sudo | Passwordless (vibe mode) |
| Tmux prefix | Ctrl-a |

Use acfs newproj to bootstrap a project with full tooling:

bash
acfs newproj myproject --interactive
# Creates:
# myproject/
# ├── .git/ # Git repository initialized
# ├── .beads/ # Local issue tracking (br)
# ├── .claude/ # Claude Code settings
# ├── AGENTS.md # Instructions for AI agents
# └── .gitignore # Standard ignores

The Incremental Onboarding Path

For beginners who find the full system overwhelming:

  1. Start with: Agent Mail + Beads (br) + Beads Viewer (bv); this core trio captures most of the value
  2. Then add: UBS for bug hunting
  3. Then add: DCG for destructive command protection
  4. Then add: CASS for session history
  5. Then add: CM (CASS Memory) for codifying lessons into procedural memory
Critical
Common beginner mistakes: Making a hasty plan all at once instead of the multi-model iterative process. Trying to convert the plan to beads in a single pass. Not doing at least 3 rounds of polishing. "Well, of course the project is going to suck and be a buggy mess if you do that."

Scale Observations from Real Projects

| Project | Beads | Plan Lines | Agents | Time to MVP |
| --- | --- | --- | --- | --- |
| CASS Memory System | 347+ | 5,500 | ~25 | ~5 hours |
| FrankenSQLite | Hundreds | Large spec | Many parallel | Multi-session |
| Frankensearch | 122+ (3 epics) | Multiple | | Multi-session |
| Apollobot | 26 | | | Single session (2-3 polish rounds) |

Patterns That Work

  • The "30 to 5 to 15" funnel: When generating ideas, having agents brainstorm 30 then winnow to 5 produces much better results than asking for 5 directly. The winnowing forces critical evaluation.
  • Parallel subagents for bulk bead operations: Creating dozens of beads is faster when dispatched to parallel subagents, each handling a subset.
  • Staggered agent starts: Starting agents 30-60 seconds apart avoids the thundering herd problem.
  • One agent for git operations: Designating one agent to handle all commits prevents merge conflicts and produces coherent commit messages.

Anti-Patterns to Avoid

  • Single-pass beads: First-draft beads are never optimal. Always do 4-5 polishing passes minimum.
  • Skipping plan-to-bead validation: Not cross-referencing beads against the plan leads to missing features discovered only during implementation.
  • Communication purgatory: Agents spending more time messaging each other than coding. Be proactive about starting work.
  • Holding reservations too long: File reservations with long TTLs block other agents unnecessarily. Reserve, edit, commit, release.
  • Not re-reading AGENTS.md after compaction: Context compaction loses nuances. The re-read is mandatory, not optional.

Supporting Infrastructure

| Component | Purpose |
| --- | --- |
| AGENTS.md | Per-project configuration teaching agents about tools and rules |
| Best practices guides | Referenced in AGENTS.md, kept current |
| Markdown plan files | Source-of-truth planning documents |
| acfs newproj | Bootstraps projects with full tooling (.git, .beads, .claude, AGENTS.md) |
| acfs doctor | Single command to verify entire installation |
| NTM command palette | Battle-tested prompt library accessible via ntm palette |
| Claude Code Skills | Each tool has a dedicated skill for automated workflows |

The Skills Ecosystem

The term "skill" confuses people at first, so define it plainly: a skill is a reusable operational instruction pack for an agent. In Claude Code terms, that usually means a SKILL.md file plus optional references, scripts, or templates that tell the agent how to use a tool, how to execute a methodology, what pitfalls to avoid, and what a good result looks like. A good skill is closer to executable know-how than to ordinary prose documentation.

A tool changes what the agent can do. A skill changes how well the agent knows how to do it. The same model with and without a good skill often behaves like two different agents.

Every Flywheel tool has a corresponding Claude Code skill that encodes best practices and automates common workflows. Many of these skills are bundled directly in the repos for the tools themselves and get installed automatically when the tool is installed, which means users often benefit from them without having to think about "skill management" explicitly. There is also a broader public skills collection at GitHub.

The prompt side has a similar split. jeffreysprompts.com has a generous free section and is open source at GitHub. It also has a paid Pro tier with additional prompts and a dedicated CLI called jfp for managing prompt collections. For a larger paid library of higher-end skills, see jeffreys-skills.md, a $20/month service with many of the strongest curated skills, new skills added continuously, and a dedicated CLI called jsm for managing them.

Both paid offerings are still under active development. That means occasional rough edges. Active work is underway to fix issues quickly, feedback is appreciated, and refunds are available for unhappy users.

Skills provide the prompts, procedures, anti-pattern guidance, and tool-specific workflows directly to agents, which reduces the amount of bespoke prompting a human needs to do by hand.

Vendor Lock-In: Avoid It

PSA: avoid vendor lock-in for agent coding primitives like task management (e.g., beads) and agent communication (e.g., MCP Agent Mail) so you can use all the agents together, which is more powerful anyway. Vendors want you in a walled garden, but it's 100% unnecessary.

Beads, Agent Mail, and bv are all CLI tools that work identically regardless of which agent invokes them. A Claude Code agent and a Codex agent and a Gemini agent can all call br ready --json and get the same task list. The practical test: could you swap out every Claude Code agent for Codex or Gemini without changing your AGENTS.md, beads, Agent Mail setup, or workflow? If yes, you're vendor-neutral.

Validation Gates

These gates turn the methodology into a contract. If a gate fails, drop back a phase instead of pushing forward optimistically.

| Gate | Must Be True Before Advancing |
| --- | --- |
| Foundation | Goals, workflows, stack, architecture direction, AGENTS.md, and best-practices guides exist and are coherent |
| Plan | Markdown plan covers workflows, architecture, sequencing, constraints, testing expectations, and major failure paths |
| Translation | Every material plan element maps to one or more beads, checked in both directions |
| Bead | Beads are self-contained, dependency-correct, rich in context, and explicit about test obligations |
| Launch | Agent Mail, file reservations, bead IDs, bv, AGENTS.md, and staggered startup are all ready |
| Ship | Reviews, tests, UBS, remaining-work beads, and feedback capture into reusable artifacts are complete |
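The "drop back a phase" contract can be sketched as an ordered check. This is a hypothetical helper (the function and its shape are mine, not part of the toolchain) that returns the first gate still failing, which is the phase to resume from:

```python
# Gates in the order they must be satisfied
GATES = ["Foundation", "Plan", "Translation", "Bead", "Launch", "Ship"]

def next_gate(passed):
    """Return the first gate not yet satisfied; work resumes at that phase."""
    for gate in GATES:
        if gate not in passed:
            return gate
    return "Done"

# If only Foundation and Plan have passed, everything after Translation is moot:
# next_gate({"Foundation", "Plan"}) -> "Translation"
```

The point of the sketch is that gates are sequential: a failing Translation gate means Launch readiness is irrelevant until you go back and fix the bead mapping.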

Vibe Mode Aliases

On the VPS, agents run with full permissions via short aliases:

bash
alias cc='NODE_OPTIONS="--max-old-space-size=32768" claude --dangerously-skip-permissions'
alias cod='codex --dangerously-bypass-approvals-and-sandbox'
alias gmi='gemini --yolo'

These are configured automatically by the installer. DCG provides the safety net that makes this viable.

Cost

~$500/month for Claude Max and GPT Pro subscriptions (at minimum), plus ~$50/month for a cloud server (OVH, Contabo). Multiple Max accounts may be needed for large swarms; CAAM enables instant switching when hitting rate limits. At scale, token usage for a single intensive session can reach ~20M input tokens, ~3.5M output tokens, ~2.6M reasoning tokens, and ~1.15 billion cached token reads. At full scale: 22 Claude Max accounts, 22 GPT Pro accounts, and 7 Gemini Ultra accounts.

10

The Flywheel Effect

If you simply use these tools, workflows, and prompts in the way just described, you can create really incredible software in just a couple of days, sometimes in a single day. I've done it a bunch of times now, and it really does work, as crazy as that may sound. My GitHub profile is the proof: it looks like the output of a team of 100+ developers.

It behaves like a flywheel rather than a checklist because each cycle makes the next one better:

  • Planning quality compounds because you keep reusing prompts, patterns, and reasoning structures that CASS proves actually worked.
  • Execution quality compounds because better beads make swarm behavior more deterministic and less dependent on human improvisation.
  • Tool quality compounds because agents use the tools, complain about them, and then help improve them.
  • Memory compounds because the results of one swarm, captured by CASS session search, become training data, rituals, and infrastructure for the next one.

How the Compounding Actually Works

Each session makes the next one better. Concretely:

  • Session N produces raw data: CASS automatically logs every agent session.
  • Between sessions, CM distills patterns: running cm reflect extracts procedural rules like "always run cargo check after modifying Cargo.toml," with confidence scores that decay without reinforcement and amplify with repetition.
  • Session N+1 starts with those patterns loaded: running cm context "Building an API" retrieves the relevant procedural memory.
  • Meanwhile, UBS patterns grow as new bug classes get added, and Agent Mail coordination norms get refined in AGENTS.md and skills.

The compounding is real but not automatic in the early stages. You have to actually run cm reflect, actually review CASS session data, actually update AGENTS.md with lessons learned. But even manually, spending 15 minutes between projects reviewing what worked and updating your AGENTS.md template produces outsized returns on every subsequent project.

Agent Feedback Forms

Apply the same feedback mechanisms you would use for humans (structured surveys, satisfaction ratings, net promoter scores) directly to agents evaluating tools. After an agent finishes using a tool in a real project, ask it to fill out a structured feedback survey. Then pipe that feedback directly into another agent working on the tool itself. The iteration cycle collapses from weeks to minutes.

Agent Tool Feedback
Based on your experience with [TOOL] today in this project, how would you rate [TOOL] across multiple dimensions, from 0 (worst) to 100 (best)? Was it helpful to you? Did it flag a lot of useful things that you would have missed otherwise? Did the issues it flagged have a good signal-to-noise ratio? What did it do well, and what was it bad at? Did you run into any errors or problems while using it? What changes to [TOOL] would make it work even better for you and be more useful in your development workflow? Would you recommend it to fellow coding agents? How strongly, and why or why not? The more specific you can be, and the more dimensions you can score [TOOL] on, the more helpful it will be for me as I improve it and incorporate your feedback to make [TOOL] even better for you in the future!
Context

After an agent finishes using a tool in a real project

Psychology

Many of the same concepts we use for people are directly applicable to agents. 'By robots, for robots.' This produces structured, actionable feedback. When used across multiple agents on different project types, you get a diverse sample of experiences. One caveat: as with humans who would tell you they wanted a 'faster horse' instead of a car, it is dangerous to trust agent feedback about potential new features before those features exist. The real test is always after implementation, in real-world usage.
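To make survey data actionable across many agents, the numeric scores can be pooled per dimension. A minimal sketch; the dimension names and the data shape are illustrative, not a feature of any Flywheel tool:

```python
def aggregate_feedback(surveys):
    """surveys: one dict per agent, mapping dimension name -> score (0-100)."""
    totals = {}
    for survey in surveys:
        for dimension, score in survey.items():
            totals.setdefault(dimension, []).append(score)
    # Average each dimension so no single agent's opinion dominates
    return {d: sum(scores) / len(scores) for d, scores in totals.items()}

# Two hypothetical agent reports on the same tool:
reports = [
    {"helpfulness": 90, "signal_to_noise": 70},
    {"helpfulness": 80, "signal_to_noise": 60},
]
# aggregate_feedback(reports) -> {"helpfulness": 85.0, "signal_to_noise": 65.0}
```

Feeding the averaged scores (rather than one agent's raw prose) to the agent improving the tool keeps the loop focused on systematic weaknesses instead of one-off complaints.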

CASS Memory: Three-Layer Architecture

CM (CASS Memory System) implements a three-layer memory architecture that turns raw session history into operational knowledge:

text
EPISODIC MEMORY (cass): Raw session logs from all agents
↓ cass search
WORKING MEMORY (Diary): Structured session summaries
↓ reflect + curate
PROCEDURAL MEMORY (Playbook): Distilled rules with confidence scores

Rules have a 90-day confidence half-life (decays without feedback) and a 4x harmful multiplier (one mistake counts 4x as much as one success). Rules mature through stages: candidate to established to proven.
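The decay and feedback rules imply a simple update model. This is a back-of-the-envelope sketch of the stated numbers (90-day half-life, 4x harmful weight); the step size and clamping are my assumptions, not CM's actual internals:

```python
def rule_confidence(base, age_days, helpful=0, harmful=0,
                    half_life_days=90.0, harmful_weight=4.0,
                    feedback_step=0.1):
    # Confidence halves every 90 days without reinforcement...
    decayed = base * 0.5 ** (age_days / half_life_days)
    # ...and one harmful mark costs 4x what one helpful mark earns.
    adjusted = decayed + feedback_step * (helpful - harmful_weight * harmful)
    return max(0.0, min(1.0, adjusted))

# A rule untouched for one half-life drops to half strength:
# rule_confidence(0.8, 90) -> 0.4
# A single harmful mark cancels four helpful ones:
# rule_confidence(0.8, 0, helpful=4, harmful=1) -> 0.8
```

The asymmetry is the point: a rule can only stay "proven" by being repeatedly reinforced and rarely wrong.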

bash
cm context "Building an API" --json # Get relevant memories for a task
cm recall "authentication patterns" # Search past sessions
cm reflect # Update procedural memory from recent sessions
cm mark b-8f3a2c --helpful # Reinforce a useful rule
cm mark b-xyz789 --harmful --reason "Caused regression" # Flag a bad rule

The cm context command is the single most important pre-task ritual. Running it at the start of a session gives agents knowledge distilled from every previous session that touched similar work.

Meta-Skill: Skill Refinement via CASS Mining

Skill Refinement Meta-Skill
Search CASS for all sessions where agents used the [SKILL] skill. Look for: patterns of confusion, repeated mistakes, steps agents skipped, workarounds they invented, and things they did that weren't in the skill but should be. Then rewrite the skill to fix every issue you found. Make the happy path obvious, add guardrails for the common mistakes, and incorporate the best workarounds as official steps. Test the rewritten skill against the failure cases from the session logs.
Context

Claude Code, targeting any skill with 10+ CASS sessions of usage data

Psychology

This is the meta-skill pattern in action. The skill-refiner skill itself can be refined using its own session data, which is the self-referential property that makes the whole system accelerate. After 3-4 cycles, the skill is dramatically more reliable than the original. Each cycle takes less human effort because the meta-skill itself has improved.

CASS Ritual Detection

The flywheel's learning loop depends on mining past sessions to find what actually works. CASS enables ritual detection: discovering prompts that are repeated so frequently they constitute validated methodology.

| Repetition Count | Status | Action |
| --- | --- | --- |
| count >= 10 | RITUAL | Validated methodology. Extract into a skill. |
| count 5-9 | Emerging pattern | Worth investigating further. |
| count < 5 | One-off | Not generalizable yet. |
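The thresholds translate directly into a tiny classifier; a sketch using the cutoffs from the table above:

```python
def classify_prompt(count):
    """count: how many sessions contain the same (truncated) prompt."""
    if count >= 10:
        return "RITUAL"            # validated methodology: extract into a skill
    if count >= 5:
        return "Emerging pattern"  # worth investigating further
    return "One-off"               # not generalizable yet

# classify_prompt(12) -> "RITUAL"
```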

The mining query (user prompts live at lines 1-3 of session entries; --fields minimal reduces output 5x):

bash
cass search "*" --workspace /data/projects/PROJECT --json --fields minimal --limit 500 \
| jq '[.hits[] | select(.line_number <= 3) | .title[0:80]]
| group_by(.) | map({prompt: .[0], count: length})
| sort_by(-.count) | map(select(.count >= 5)) | .[0:30]'

This is how the prompt library in this guide was originally discovered and validated. It was not invented top-down; it was mined bottom-up from hundreds of real sessions.
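For readers who prefer Python to jq, the same grouping logic looks roughly like this. The hit fields mirror the JSON shape used in the query above; treat the exact schema as an assumption:

```python
from collections import Counter

def mine_rituals(hits, min_count=5, top=30):
    # Keep only lines 1-3 of each session entry (the user prompts),
    # truncated to 80 chars so near-identical prompts collapse together.
    prompts = [h["title"][:80] for h in hits if h["line_number"] <= 3]
    ranked = Counter(prompts).most_common()  # (prompt, count), highest first
    return [{"prompt": p, "count": c} for p, c in ranked if c >= min_count][:top]

# Six repetitions of the same prompt qualify as an emerging pattern:
hits = [{"title": "run ubs and fix everything", "line_number": 1}] * 6 \
     + [{"title": "one-off question", "line_number": 2}]
# mine_rituals(hits) -> [{"prompt": "run ubs and fix everything", "count": 6}]
```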

Why It Works: Layered Context

The workflow works because it keeps different kinds of context in different layers. The markdown plan holds whole-system intent and reasoning. The beads hold executable task structure and embedded local context. AGENTS.md holds operating rules and tool knowledge that must survive compaction. The codebase holds the implementation itself, which is too large to be the primary planning medium. Each layer serves a different purpose, and the methodology is disciplined about keeping the right information in the right layer.

The Kernel: 9 Invariants

  1. Global reasoning belongs in plan space. Do the hardest architectural and product reasoning while the whole project still fits in context.
  2. The markdown plan must be comprehensive before coding starts. Skeleton-first coding throws away the main advantage of frontier models.
  3. Plan-to-beads is a distinct translation problem. A good plan does not automatically produce a good bead graph.
  4. Beads are the execution substrate. Once good enough, they should carry enough context that agents no longer need the full plan.
  5. Convergence matters more than first drafts. Plans and beads both improve through repeated polishing until changes become small and corrective.
  6. Swarm agents are fungible. Coordination must live in artifacts and tools, not in special agents or unstated knowledge.
  7. Coordination must survive crashes and compaction. AGENTS.md, Agent Mail, bead state, and robot modes exist to keep work moving when sessions die.
  8. Session history is part of the system. Repeated prompts, failures, and recoveries should be mined via CASS and folded back into tools, skills, and validators.
  9. Implementation is not the finish line. Review, testing, UBS, and feedback-to-infrastructure loops are part of the core method.

Time Investment

| Phase | Typical Duration | Notes |
| --- | --- | --- |
| Planning (multi-model synthesis + refinement) | 3+ hours for a complex feature | Feels slow because no code is written, but the downstream payoff is enormous |
| Plan to beads conversion | Significant; Claude needed 'coaxing and cajoling' for 347 beads | A 5,500-line plan with hundreds of beads takes sustained effort |
| Bead polishing | 2-3 passes per session, multiple sessions | Agents typically manage 2-3 passes before running out of context window |
| Implementation with swarm | Remarkably fast | CASS Memory System: 11k lines in ~5 hours with 25 agents |

CASS itself (a complex Rust program used by thousands of people) was made in around a week, but the human personally only spent a few hours on it. The rest of the time was spent by a swarm of agents implementing and polishing it and writing tests.

The Project Is a Foregone Conclusion

Once you have the beads in good shape based on a great markdown plan, I almost view the project as a foregone conclusion at that point. The rest is basically mindless "machine tending" of your swarm of 5-15 agents.

This claim sounds bold, but it follows logically from everything above. If the plan is thorough, the beads faithfully encode it with full context and correct dependencies, and the agents have a clear AGENTS.md, then implementation becomes a mechanical process of agents picking up beads, implementing them, reviewing, and moving on.

This is true when: the plan has genuinely converged (not merely become long), the beads are self-contained enough that fresh agents can execute them without guessing, the swarm has working coordination/review/testing loops, and the human is still tending when flow jams or reality diverges from the plan.

It stops being true when: architecture is still being invented during implementation, the bead graph is thin or missing dependencies, or the swarm cannot coordinate because AGENTS.md, Agent Mail, or bv usage is weak. If you find yourself doing heavy cognitive work during implementation, that is a signal that planning or bead polishing was insufficient. The remedy is to pause, go back to bead space, and add the missing detail.

V1 Is Not Everything

A common misconception is that you have to do everything in one shot. In this approach, that's true only for version 1. Once you have a functioning v1, adding new features follows the same process: create a super detailed markdown plan for the new feature, turn it into beads, and implement. The same process that creates the initial version also handles all subsequent iterations.

Tools Must Be Agent-First

Every new dev tool in the year of our lord 2025 should have a robot mode designed specifically for agents to use. And it should probably be designed by agents, too. And then you iterate based on the feedback of the agents actually using the tool in real-world scenarios.

Every tool ships with a prepared AGENTS.md blurb. The tool is not complete without documentation that agents can consume. But it goes further: the tools themselves should be designed by agents, for agents, with iterative feedback.

I make sure the agents enjoy using them and solicit their feedback to improve the tooling. If they don't like the tools, they won't use them without constant nagging.

Recursive Self-Improvement: The Meta-Skill Pattern

This is the most advanced concept in the flywheel, and the one that separates linear productivity gains from exponential ones. The core idea: your agent toolchain should improve itself using its own output as fuel.

Most developers treat skills and tools as static artifacts. You write a skill, agents use it, and if it works well enough, you move on. The recursive approach instead treats every agent session as training data for the next version of the skill, creating a tight feedback loop where the system gets measurably better each cycle without additional human effort.

Consider what happens without recursive improvement. You build a CLI tool, write a Claude Code skill for it, and deploy both. Agents use the tool, but they misinterpret certain flags, forget to pass required arguments, or use workarounds because the skill's instructions were ambiguous. Every agent that hits the same snag wastes the same tokens re-discovering the same workaround. Multiply that across dozens of agents and hundreds of sessions, and the waste is enormous. Now consider the alternative: after those sessions happen, you automatically mine them, discover the failure patterns, rewrite the skill to prevent them, and the next wave of agents never hits those snags at all. The skill becomes a living document shaped by real usage rather than a guess about how agents will behave.

Using skills to improve skills, skills to improve tool use, and then feeding the actual experience in the form of session logs (surfaced and searched by my cass tool and /cass skill) back into the design skill for improving the tool interface to make it more natural and intuitive and powerful for the agents. Then taking that revised tool and improving the skill for using that tool, then rinse and repeat.

How to Actually Do This (Step by Step)

  1. Build the baseline. Create a tool. Create a skill for it using sc (skill creator). The first version of the skill will be imperfect. Ship it anyway.
  2. Let agents use it in real work. Do not test in isolation. Deploy the tool and skill into actual project work where agents are implementing beads, running reviews, doing real tasks. CASS automatically logs every session.
  3. Mine the sessions. After 10+ sessions of real usage, search CASS for sessions where agents invoked the tool.
  4. Feed findings into a rewrite. Give the session analysis to a fresh agent along with the current skill file. Ask it to rewrite the skill to fix every issue.
  5. Repeat. The revised skill produces better sessions, which give you better data for the next revision. After 3-4 cycles, the skill is dramatically more reliable than the original.
bash
# Step 3: Mine sessions for patterns
cass search "tool_name" --workspace /data/projects/PROJECT --json --limit 100

What to look for in the results:

  • Clarifying questions: an agent asking "do you mean X or Y?" means the skill was ambiguous
  • Repeated mistakes across different agents: a systematic gap in the skill's instructions
  • Creative workarounds: agents inventing their own approach means the skill is missing a useful pattern
  • Outright failures: the skill directed agents to do something wrong or impossible
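A crude first pass over the mined session titles can flag these categories automatically. The regex patterns below are illustrative heuristics I chose for the sketch, not part of CASS:

```python
import re

# Hypothetical signal patterns for each failure category
SIGNALS = {
    "clarifying question": re.compile(r"do you mean|should I|which one", re.I),
    "workaround":          re.compile(r"workaround|instead I|fell back", re.I),
    "failure":             re.compile(r"\berror\b|failed|impossible", re.I),
}

def flag_titles(titles):
    """Tag each session title with every signal category it matches."""
    return [(label, title) for title in titles
            for label, pattern in SIGNALS.items() if pattern.search(title)]

flags = flag_titles([
    "do you mean the --json flag?",
    "build failed with missing arg",
])
# -> one "clarifying question" hit and one "failure" hit
```

Anything this pass flags is a candidate for the skill rewrite in step 4; anything it misses still needs the manual read.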

The key insight is that the rewriting step itself can be a skill. You can write a meta-skill whose entire purpose is: take a skill file + CASS session data as input, produce a better skill file as output. Then the meta-skill can also be refined using its own session data, which is the self-referential property that makes the whole system accelerate.

This is the way. Make a skill-refiner meta skill. CASS is unbelievably handy for this.
Documentation for this documentation refining tool (so meta -- I should run it on itself, like a snake eating its own tail!)

Each cycle takes less human effort than the previous one because the meta-skill itself has improved.

The Four Layers of Recursive Improvement

The recursive pattern operates at increasing levels of ambition. The mistake is trying to build all four layers at once. Start simple and let the need for the next layer emerge naturally.

Each layer amplifies the next. Start at Layer 1, not Layer 4.


Layer 1 requires zero infrastructure. You can do this today, right now, with any tool and two agent sessions.

  1. Layer 1: Feedback forms after tool use (start here, no infrastructure needed). After an agent finishes using a tool, ask it to fill out a structured feedback survey. Feed that to another agent working on the tool itself. This requires nothing beyond two agent sessions and produces immediate improvements.
  2. Layer 2: CASS-powered skill refinement (requires session logging). Instead of relying on one agent's opinion, mine session logs to find systematic patterns across many agents. An agent using a tool for the first time might blame itself for a confusing flag; when you see 15 agents all struggling with the same flag, you know the flag is the problem.
  3. Layer 3: Skills that generate work (the system proposes its own improvements). The idea-wizard skill examines a project and generates improvement ideas. The optimization skill finds performance bottlenecks. These skills create new beads, which agents implement, which improve the tools, which make the skills more effective. The human's role shifts from directing specific work to curating which generated ideas are worth pursuing.
  4. Layer 4: Skills bundled with tool installers (the skill improves before the user ever sees it). Every tool you ship includes a pre-optimized Claude Code skill baked into its installer. The skill was refined through multiple CASS cycles before shipping. When a new user installs the tool, their agents immediately benefit from all the refinement work done across every previous user's sessions.
I'm going to start having my tool installers always add a highly optimized skill. It makes a massive difference.
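The jump from Layer 1 to Layer 2 is mostly aggregation: one agent's feedback form is an opinion, but the same complaint across fifteen sessions is a diagnosis. A minimal sketch of that mining step, assuming feedback has already been collected as plain dicts (the `pain_points` field name is a hypothetical stand-in, not a CASS schema):

```python
from collections import Counter

def recurring_complaints(feedback_forms, threshold=3):
    """Tally pain points across many agents' feedback forms.

    A complaint one agent makes might be that agent's confusion;
    a complaint many agents make is a problem with the tool.
    """
    counts = Counter()
    for form in feedback_forms:
        # Deduplicate within a single form so one chatty agent
        # cannot push a pain point over the threshold alone.
        for pain_point in set(form.get("pain_points", [])):
            counts[pain_point] += 1
    return [(p, n) for p, n in counts.most_common() if n >= threshold]

forms = (
    [{"pain_points": ["--force flag is ambiguous"]}] * 15
    + [{"pain_points": ["slow startup"]}] * 2
)
print(recurring_complaints(forms))  # → [('--force flag is ambiguous', 15)]
```

The threshold is the whole point: it separates tool problems from agent quirks before any human reads a single form.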

Why the Acceleration Compounds

Most productivity techniques produce linear improvements: you get 10% better each cycle, and those gains do not stack. The recursive skill pattern compounds because each cycle improves the tools that perform the next cycle.

When you improve the extreme-optimization skill, every future optimization pass across every tool benefits. When you improve the idea-wizard skill, every future brainstorming session across every project benefits. When you improve the skill-refiner meta-skill, every future skill refinement benefits. The improvements multiply rather than add.
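The arithmetic behind "multiply rather than add" is easy to check. Ten cycles of a non-stacking 10% gain top out at 2x; ten cycles that each also improve the machinery performing the next cycle compound to roughly 2.6x, and the gap widens every cycle:

```python
cycles, gain = 10, 0.10

additive = 1 + gain * cycles          # each cycle improves the output once
compounding = (1 + gain) ** cycles    # each cycle also improves the improver

print(f"additive:    {additive:.2f}x")     # 2.00x
print(f"compounding: {compounding:.2f}x")  # ~2.59x
```

The two curves look similar for the first few cycles, which is why the pattern feels unremarkable early on; the divergence is what you are buying.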

The tools produced by the recursive loop are the tools that produce the next tools. This is why the Knuth analogy is apt: it is genuinely the same concept as a compiler that compiles itself, except applied to the entire agent-driven development workflow rather than just a compiler.

I'm going over each tool with my extreme optimization skill and applying insane algorithmic lore that Knuth himself probably forgot about already to make things as fast as the metal can go in memory-safe Rust.

The Hidden Knowledge Extraction

The recursive loop has a second, subtler benefit that matters even more at the frontier. Models have internalized vast amounts of academic CS literature: obscure algorithmic techniques, mathematical proofs, design patterns from papers that only a handful of people ever read. Most of this knowledge never surfaces because nobody asks for it with enough precision.

Skills are the mechanism for asking the right questions. Consider the difference:

  • Without a skill: "Optimize this function." The agent applies generic improvements like caching, loop unrolling, or reducing allocations. Useful but shallow.
  • With an extreme-optimization skill: The skill directs the agent to systematically consider cache-oblivious data structures, SIMD vectorization opportunities, branch-free arithmetic, van Emde Boas layout, fractional cascading, and carry-less multiplication, then benchmark before and after each change. The agent draws on deep knowledge it would not volunteer unprompted.
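Whatever exotic techniques the skill surfaces, the part worth copying first is the benchmark gate. A minimal harness, assuming only the standard library (`timeit`), that forces every candidate optimization to prove both equivalence and speed before it survives:

```python
import timeit

def before(xs):
    # Baseline: repeated string concatenation in a loop.
    out = ""
    for x in xs:
        out += str(x)
    return out

def after(xs):
    # Candidate optimization: a single join over a generator.
    return "".join(str(x) for x in xs)

xs = list(range(2_000))
# Gate 1: the optimization must not change behavior.
assert before(xs) == after(xs), "optimization changed behavior"

# Gate 2: measure, don't assume. min() over repeats reduces noise.
t_before = min(timeit.repeat(lambda: before(xs), number=50, repeat=5))
t_after = min(timeit.repeat(lambda: after(xs), number=50, repeat=5))
print(f"before: {t_before:.4f}s  after: {t_after:.4f}s  "
      f"ratio: {t_before / t_after:.2f}x")
```

The assert guards behavioral equivalence; the timing comparison decides whether the change survives. Dead ends get rejected by data rather than argued about.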

The skill acts as a key that unlocks specific rooms in the model's knowledge base. Without the key, the model defaults to common patterns. With it, the model reaches into the long tail of techniques that most human developers have never encountered.

The knowledge is just sitting there and the models have it. But you need to know how to coax it out of them.

This explains why the recursive loop accelerates rather than plateaus. Each cycle of skill refinement does not just fix bugs in the skill's instructions; it also sharpens the skill's ability to extract deeper knowledge from the model. The optimization skill gets better at asking for the right techniques, because CASS sessions reveal which techniques actually produced measurable gains and which were dead ends. The next cycle of optimization is better informed because the previous cycle's results are now part of the feedback corpus.

Stack enough cycles and the result is code that looks like it was written by someone who read every obscure CS paper ever published. In a functional sense, it was. The agent served as a lens focusing decades of dispersed academic knowledge onto a single practical target. The skill was the lens prescription.

The Operator Library

These recurring cognitive moves show up throughout real Flywheel sessions. They matter more than any single prompt because each one specifies when to apply the move, what failure looks like, and what output is expected. Each operator has a prompt module you can paste directly into an agent session.

Operator 1: Plan-First Expansion
[OPERATOR: plan-first-expansion] 1) Restate the goals, workflows, and constraints in concrete terms. 2) Expand the markdown plan until the main architectural and user-flow decisions are explicit. 3) Do not start coding until the plan covers testing, failure paths, and sequencing. Output (required): revised markdown plan with clarified workflows, architecture, and test obligations.
Context

When the project still fits in a plan but would explode in size once implemented, multiple architectural paths are plausible, or the desired user workflow is still fuzzy

Psychology

Prevents skeleton-first coding that locks in bad boundaries, and stops local code exploration from substituting for product reasoning.

Operator 2: Competing-Plan Triangulation
[OPERATOR: competing-plan-triangulation] 1) Collect independent plans from multiple strong frontier models. 2) Compare them for better ideas, missing concerns, and incompatible assumptions. 3) Integrate only the strongest elements into one revised plan. Output (required): a single merged markdown plan plus explicit notes on what changed and why.
Context

When the project is important enough that one model's biases are dangerous, or early drafts feel plausible but not obviously excellent

Psychology

Prevents picking the first decent plan and calling it done, or combining every idea indiscriminately instead of filtering for quality.

Operator 3: Overshoot Mismatch Hunt
[OPERATOR: overshoot-mismatch-hunt] 1) Tell the model it likely missed a large number of issues. 2) Make it compare the artifact against prior feedback or source material again. 3) Require another full pass rather than a small patch list. Output (required): an expanded list of missed elements, contradictions, and corrections.
Context

When review output looks too short or self-satisfied, or a large plan/bead graph still feels under-audited

Psychology

Prevents asking for 'all problems' and getting a shallow pass. Models stop after finding a 'reasonable' number; this forces exhaustive search.

Operator 4: Plan-to-Beads Transfer Audit
[OPERATOR: plan-to-beads-transfer-audit] 1) Walk every important part of the markdown plan and map it to actual beads. 2) Ensure rationale, constraints, and tests are embedded in the bead descriptions. 3) Identify anything in the plan that has no bead or any bead that has no clear plan backing. Output (required): coverage report plus bead edits that close the gaps.
Context

When a large plan is about to be turned into execution tasks, or agents are creating beads quickly and may drop rationale

Psychology

A beautiful plan does not automatically produce good beads. Prevents creating terse beads that depend on tacit knowledge from the markdown file.
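Part of this audit can be mechanized before handing the judgment calls to an agent. A sketch of the coverage half, assuming plan sections are `## ` headings and each bead is a dict with `id` and `description` fields; both are illustrative assumptions, not the real beads schema:

```python
import re

def plan_bead_coverage(plan_markdown, beads):
    """Map each plan section to the beads that mention it.

    Returns the sections no bead references: the gaps that would
    otherwise ride along as tacit knowledge in the markdown file.
    """
    sections = re.findall(r"^## +(.+)$", plan_markdown, flags=re.MULTILINE)
    uncovered = []
    for section in sections:
        hits = [b["id"] for b in beads
                if section.lower() in b["description"].lower()]
        if not hits:
            uncovered.append(section)
    return uncovered

plan = "## Auth Flow\n...\n## Rate Limiting\n...\n## Audit Log\n..."
beads = [
    {"id": "bd-101", "description": "Implement the Auth Flow per plan"},
    {"id": "bd-102", "description": "Rate limiting middleware with tests"},
]
print(plan_bead_coverage(plan, beads))  # → ['Audit Log']
```

Substring matching is deliberately crude; its job is to produce a candidate gap list that the agent then audits properly, not to replace the audit.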

Operator 5: Convergence Polish Loop
[OPERATOR: convergence-polish-loop] 1) Re-run a full critical review in a fresh session. 2) Integrate the fixes and compare the magnitude of changes to the previous round. 3) Stop only when the revisions are small, mostly corrective, and coverage checks keep passing. Output (required): revised artifact plus a judgment of whether it is still in major-change mode or near convergence.
Context

When a plan or bead graph has visible rough edges and the first polishing pass found real issues

Psychology

Prevents treating the first decent revision as final, or continuing endless polishing after returns have gone flat.

Operator 6: Fresh-Eyes Reset
[OPERATOR: fresh-eyes-reset] 1) Start a fresh session. 2) Reload AGENTS.md and the relevant project context. 3) Ask for a full review of the plan or beads as if seeing them for the first time. Output (required): a fresh critical pass unconstrained by the prior session's local minima.
Context

When the agent has done several long review rounds and suggestions are getting repetitive or shallow

Psychology

Prevents trusting a tired context window to keep finding subtle flaws, or mistaking context exhaustion for genuine convergence.

Operator 7: Fungible Swarm Launch
[OPERATOR: fungible-swarm-launch] 1) Confirm AGENTS.md, bead state, Agent Mail, and bv routing are all ready. 2) Start agents in a staggered way and make them claim work immediately. 3) Keep coordination in beads, reservations, and threads rather than in a special overseer agent. Output (required): an active swarm with claimed work, low collision risk, and no special-agent bottleneck.
Context

When beads are polished enough to execute and multiple agents are about to work in the same repository

Psychology

Prevents launching too early before beads are self-contained, or letting agent identity and role specialization become load-bearing.

Operator 8: Feedback-to-Infrastructure Closure
[OPERATOR: feedback-to-infrastructure-closure] 1) Mine CASS for repeated prompts, breakdowns, and fixes. 2) Distill the useful patterns into skills, AGENTS.md guidance, blurbs, or tool changes. 3) Update the reusable artifact so the next swarm starts from the improved baseline. Output (required): a revised reusable artifact plus a short note describing the lesson it now encodes.
Context

When the same confusion or recovery pattern appears repeatedly in CASS, agents complain about a tool, or a project finishes with clear lessons worth retaining

Psychology

Prevents treating lessons as anecdotes instead of durable system inputs. This is the operator that turns repeated behavior into ritual, ritual into skill, skill into infrastructure.

The Prompt Library

All prompts in this guide are preserved verbatim, quirks and typos included, from real sessions where they worked well. For a much larger public prompt collection, see jeffreysprompts.com, which has a generous free section and is open source on GitHub; there is also a paid Pro tier with additional prompts and a CLI called jfp for managing prompt collections. For a larger paid library of higher-end skills, see jeffreys-skills.md ($20/month, with a dedicated CLI called jsm). Both paid offerings are still under active development.

Common Problems from Real Deployments

  • Agent Mail CLI availability: Sometimes the binary is not at the expected path; agents fall back to REST API calls.
  • Context window exhaustion: Agents typically manage 2-3 polishing passes before needing a fresh session.
  • Duplicate beads at scale: Large bead sets (100+) develop duplicates; dedicated dedup passes are necessary.
  • Plan-bead gap: The synthesis step sometimes stalls between plan revision and bead creation; always prompt the transition explicitly.
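For the duplicate-bead problem specifically, fuzzy title matching catches most offenders before a dedicated dedup pass. A sketch using only `difflib` from the standard library (the bead shape is assumed for illustration, not the real schema):

```python
from difflib import SequenceMatcher
from itertools import combinations

def likely_duplicates(beads, threshold=0.85):
    """Flag bead pairs whose titles are suspiciously similar.

    Cheap pre-filter: hand the flagged pairs to an agent (or a
    human) for the actual merge-or-close decision.
    """
    pairs = []
    for a, b in combinations(beads, 2):
        ratio = SequenceMatcher(None, a["title"].lower(),
                                b["title"].lower()).ratio()
        if ratio >= threshold:
            pairs.append((a["id"], b["id"], round(ratio, 2)))
    return pairs

beads = [
    {"id": "bd-201", "title": "Add retry logic to Agent Mail client"},
    {"id": "bd-202", "title": "Add retry logic to the Agent Mail client"},
    {"id": "bd-203", "title": "Write onboarding docs"},
]
print(likely_duplicates(beads))
```

The pairwise comparison is quadratic, which is fine at the 100-plus bead scale where duplicates start appearing; the point is to make the dedup pass targeted rather than a full re-read.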

Getting Started

The complete system is free and 100% open-source. A beginner with a credit card and a laptop can visit the wizard, follow step-by-step instructions to rent a VPS, paste one curl|bash command, type onboard, and start building with AI agents immediately.

```bash
# 1. Rent a VPS (OVH or Contabo, ~$40-56/month, Ubuntu)
# 2. SSH in and run the one-liner
curl -fsSL https://agent-flywheel.com/install.sh | bash
# 3. Reconnect, then learn the workflow
onboard
# 4. Create your first project
acfs newproj my-first-project --interactive
# 5. Spawn agents and start building
ntm spawn my-first-project --cc=2 --cod=1 --gmi=1
```

You don't even need to know much at all about computers; you just need the desire to learn and some grit and determination. And about $500/month for the subscriptions, plus another $50 or so for the cloud server.

Once you get Claude Code up and running on the cloud server, you basically have an ultra-competent friend who can help you with any other problems you encounter. And Jeffrey will personally answer your questions if you reach out on X or on GitHub issues.

If you want to change the entire direction of your life, it has truly never been easier. If you think you might want to do it, I really recommend just immersing yourself.

Ready to Start?

Get the Flywheel Stack

One command installs all 11 tools, three AI coding agents, and the complete environment.
30 minutes to fully configured.