The Agentic Coding Flywheel
A comprehensive guide to creating extraordinary software by orchestrating swarms of AI agents using exhaustive markdown plans, polished beads, and the Agent Flywheel stack. Based on the methodology of Jeffrey Emanuel.
The Complete Workflow
This is the end-to-end methodology for creating software with frontier AI models, exhaustive markdown planning, beads-based task management, and coordinated agent swarms. Every project follows the same arc, whether it is a small CLI tool or a complex web application. This guide is about moving the hardest thinking into representations that still fit into model context windows. That is the whole game.
Why the flywheel compounds instead of spinning in place
Step through the loop. The same project gets faster and safer because every completed cycle upgrades the artifacts feeding the next one. This is the compounding return on planning.
(Interactive loop diagram. The first stage is Intent: the fuzzy goals that define the "why".)
Mental model
It starts with you. You have an idea for a piece of software. Maybe a web app, maybe a CLI tool, maybe a complex system. Instead of opening an editor and starting to code, you do something that feels counterintuitive: you spend the vast majority of your time planning.
1. You explain what you want to build to a frontier model like GPT Pro with Extended Reasoning. Your concept, your goals, the user workflows, why it matters. The model produces an initial markdown plan: a comprehensive design document for the entire system.
2. You ask competing models to create their own plans. Claude Opus, Gemini with Deep Think, Grok Heavy. Each one independently designs the same project. They come up with surprisingly different approaches, each with unique strengths and blind spots.
3. You synthesize the best ideas from all plans into one. GPT Pro analyzes the competing plans and produces a "best of all worlds" hybrid that blends the strongest ideas from every model into a single superior document.
4. You iterate relentlessly. Round after round of refinement, each time in a fresh conversation, until the suggestions become incremental. Plans created this way routinely reach 3,000 to 6,000+ lines. They are not slop. They are the result of countless iterations and feedback from many frontier models.
5. You convert the plan into beads. Beads are self-contained work units (like Jira or Linear tasks, but optimized for use by coding agents). Each bead carries its own context, reasoning, dependencies, and test obligations. A complex plan might produce 200-500 beads with a full dependency graph.
6. You polish the beads obsessively. "Check your beads N times, implement once," where N is as many as you can stomach. Each polishing round finds things the previous round missed: duplicates, missing dependencies, incomplete context. You run this 4-6+ times until convergence.
7. You launch a swarm of agents. Claude Code, Codex, and Gemini-CLI sessions running in parallel, all in the same codebase. They coordinate through Agent Mail, choose work intelligently using bv's graph-theory routing, and execute beads systematically.
8. You tend the swarm, not the code. The human checks for stuck beads, rescues agents after context compaction, sends review prompts, and ensures flow quality. You are the clockwork deity. You designed the machine, set it running, and now you manage it.
9. Agents review, test, and harden. Self-review with fresh eyes, cross-agent review, random code exploration, testing coverage, UI/UX polish. Rounds and rounds until reviews come back clean.
That is the whole movie. For the CASS Memory System, this process turned a 5,500-line markdown plan into 347 beads. Twenty-five agents produced 11,000 lines of working, tested code with 204 commits in about five hours. You can see the actual plan, the actual agent mail messages, and the actual beads for yourself.
The frontier models and coding agent harnesses really are that good already. They just need this extra level of tooling, prompting, and workflows to reach their full potential. The rest of this guide zooms into each stage.
Why Planning Is 85% of the Work
You spend 85% of your time on planning. The first time you try it, it feels wrong. No code is being written. Every instinct tells you to just start building. That discomfort is the signal that you are doing it right.
Interactive: The Context Horizon
Why reasoning in plan-space dominates reasoning in code-space as projects scale.
A markdown plan, even a massive 6,000-line one, is still vastly smaller than the codebase it describes. When models reason about a plan instead of raw implementation, they can hold the whole system in their context window at once. Once you start turning that plan into code, the system rapidly becomes too large to understand holistically. You are doing global reasoning while global reasoning is still possible.
Planning tokens are far fewer and cheaper than implementation tokens. A big, complex markdown plan is shorter than a few substantive code files, let alone a whole project. That means you can afford many more refinement rounds in planning than in implementation. Each planning round evaluates system-wide consequences, not just local code edits. Each improvement to the plan gets amortized across every downstream bead and code change. Planning is the cheapest place to buy correctness, coherence, and ambition.
Without front-loaded planning, agents are effectively improvising architecture from a narrow local window into the codebase. That is exactly when you get placeholder abstractions, missing workflow details, contradictory assumptions, and compatibility shims that nobody actually wanted. With a detailed plan and polished beads, the models are no longer inventing the system from scratch while coding. They are executing a constrained, coherent design.
The Human Part
The human is not there to hand-author every line of the plan. The human is there to inject intent, judgment, taste, product sense, and strategic direction at the point where those qualities affect the entire downstream system. Once the plan is excellent, the rest becomes much more mechanical.
When prompting the model to create the initial markdown plan, you spend a lot of time explaining the goals and intent of the project and detailing the workflows: how you want the final software to work from the standpoint of the user's interactions. The more the model understands about what you're really trying to accomplish, the end goal, and why it matters, the better job it can do for you.
Debates belong in planning, not implementation. As many important disagreements as possible should happen before the swarm is burning expensive implementation tokens. Implementation can still surface surprises, but the posture of the workflow is to front-load decisions into plan space.
You Don't Need to Know Everything Upfront
The most common objection to spending 85% of your time on planning: "I don't really know all the requirements at the beginning, and I need the flexibility to change things later." This is not at all in tension with the methodology. Thorough planning does not mean transcribing requirements you already know. It means using frontier models to discover requirements you never would have found on your own, iteratively, while changes are still cheap.
When you paste a rough concept into GPT Pro and ask for a comprehensive plan, the model surfaces dozens of edge cases, architectural considerations, and workflow details you had not thought of. When you show that plan to three competing models, each one finds blind spots the others missed. When you run five rounds of refinement, each round uncovers issues invisible in the previous round. By the time you start implementation, you know far more about your own project than you would have discovered through months of coding and refactoring.
This extends even further when adding major features to existing projects. You can point an agent at an entirely separate open-source project, have it study that project's architecture, and ask it to reimagine the strongest ideas through the lens of your own project's unique capabilities. Requirements emerge from the research itself. The methodology does not demand omniscience up front; it demands a willingness to let the models do deep, iterative exploration before committing to implementation.
Three Reasoning Spaces
The methodology separates work into three spaces, each with a different artifact and a different question it answers:
Where you catch the bug determines the rework bill
Inject the same mistake at different layers. The deeper it lands, the more downstream structure has already hardened around it. This is the Law of Rework Escalation.
- Plan space: fixes are pure reasoning, with zero code churn.
- Bead space: fixes rewrite orchestration, at high coordination cost.
- Code space: fixes pay the double tax of implementation plus cleanup.
Planning earns its keep because it is the cheapest layer for global reasoning.
Plan space is where you figure out what the system should be. Bead space is where you turn that into executable memory, a graph of self-contained work units detailed enough that agents don't have to keep consulting the full plan. Code space is where agents implement, review, and test locally. The key is knowing which space you're in: if you are still redesigning the product, stay in plan space. If you are mainly packaging the work for execution, move to bead space.
Creating & Refining the Markdown Plan
Before You Start: The Foundation Bundle
Before writing the plan itself, you need a coherent foundation. Think of it as assembling a foundation bundle: a tech stack decision, an initial architectural direction, a strong AGENTS.md file bootstrapped from a known-good template, up-to-date best-practices guides, and enough product and workflow explanation for the models to understand what "good" looks like.
Keep best practices guides in the project folder and reference them in AGENTS.md. These guides should be kept up to date; you can have Claude Code search the web and update them to the latest versions of your frameworks and libraries.
A strong bootstrap move is to start every new project by copying an AGENTS.md from an existing project that already has good general behavioral rules, safety notes, tool blurbs, and coordination guidance. Later, once the plan and beads are clearer, you ask agents to replace the project-specific content while preserving the general rules that carry across projects.
Writing the Initial Plan
You don't even need to write the initial markdown plan yourself. You can write that with GPT Pro, just explaining what it is you want to make. Claude Opus in the web app is also good for this, but GPT Pro with Extended Reasoning remains the top choice for initial planning. No other model can touch Pro on the web when it's dealing with input that easily fits into its context window. It's truly unique. And since you get it on an all-you-can-eat basis with a Pro plan, take full advantage of that.
You usually also specify the tech stack. For a web app, it's generally TypeScript, Next.js 16, React 19, Tailwind, Supabase, with anything performance-critical in Rust compiled to WASM. For a CLI tool, usually Go or Rust. If the stack isn't obvious, do a deep research round with GPT Pro or Gemini and have them study all the relevant libraries and make a suggestion taking your goals into account.
What a First Plan Looks Like
A first serious markdown plan would not say "build a notes app." It would start spelling out the actual user-visible system:
- Users upload Markdown files through a drag-and-drop UI.
- The system parses frontmatter tags and stores upload failures for review.
- Search must support keyword, tag, and date filtering with low perceived latency.
- Admins need a dedicated screen showing ingestion failures, parse reasons, and retry actions.
- Auth is internal-only; unauthorized users must never see document content or metadata.
- We need e2e coverage for upload success, upload failure, search, filtering, and admin review.
That is still only the beginning. But it already shows the difference between ordinary brainstorming and Flywheel planning: the plan tries to make the whole product legible before any code exists.
Multi-Model Plans
For the best results, ask multiple frontier models to independently create plans for the same project. GPT Pro, Claude Opus, Gemini with Deep Think, Grok Heavy. Each comes up with pretty different plans. Different frontier models have different "tastes" and blind spots. Passing a plan through a gauntlet of different models is the cheapest way to buy architectural robustness.
In the CASS Memory System project, the competing plans are publicly visible. This pattern has been used across at least 10 sessions spanning 7+ projects.
Then show their competing plans to GPT Pro with this prompt:
GPT Pro web app with Extended Reasoning
Forces the model to be intellectually honest about what competitors did better, then synthesize a hybrid that is stronger than any individual plan. The 'best of all worlds' phrasing appears in 10+ distinct sessions across 7+ projects in the session archive. Under the hood, the prompt's length and specificity are deliberate: by asking for git-diff style changes, complete integration of every good idea, and explicit updating of the existing plan, it prevents the model from writing a vague summary and forces structural engagement with the competing plans' actual content.
Take GPT Pro's output (the git-diff style revisions) and paste it into Claude Code or Codex to integrate the revisions in-place:
Claude Code or Codex
Claude critically assesses each suggestion, providing a second layer of quality filtering. The 'wholeheartedly agree / somewhat agree / disagree' framing is load-bearing: it forces the agent to evaluate each revision on a gradient rather than accepting or rejecting the whole batch. You get signal about which changes are obviously good versus which are plausible but debatable, which lets you intervene on the edge cases rather than rubber-stamping everything.
Best-of-all-worlds synthesis
The point is not "many models" in the abstract; it is that complementary strengths plus fresh-round revision produce a plan that is harder to surprise later.
Iterative Refinement
Now paste the current plan into a fresh GPT Pro conversation with this prompt. The key word is fresh. Fresh conversations prevent the model from anchoring on its own prior output. Repeat 4-5 rounds:
GPT Pro web app — fresh conversation each round
This has never failed to improve a plan significantly. Each round finds architectural issues, missing features, and robustness improvements that the previous round missed. Under the hood, asking for 'rationale/justification' prevents the model from making arbitrary changes; it has to argue for each revision, which filters out changes that seem clever but do not actually improve the plan. The git-diff format forces precision rather than vague hand-waving about what should be different.
The best part is that you can start a fresh conversation in ChatGPT and do it all again once Claude Code or Codex finishes integrating your last batch of suggested revisions. After four or five rounds of this, you tend to reach a steady state where the suggestions become very incremental.
You can still get extra mileage by blending in smart ideas from Gemini with Deep Think enabled, or from Grok Heavy, or Opus in the web app, but you still want to use GPT Pro on the web as the final arbiter of what to take from which model and how to best integrate it.
After any review pass that feels too short or self-satisfied
By claiming 80+ errors exist, the model keeps searching exhaustively rather than satisfying itself with a partial list.
Plans created this way routinely reach 3,000-6,000+ lines. They are not slop. They are the result of countless iterations and blending of ideas and feedback from many models. For the CASS GitHub Pages export feature, the plan went through multiple rounds over about 3 hours, growing to approximately 3,500 lines. You can also see a 6,000-line plan to get a feel for the scale.
It feels slow because no code is being written. But if you do it correctly and then start up enough agents in your swarm with Agent Mail, beads, and bv, the code will be written so ridiculously quickly that it more than makes up for this slow part. And what's more, the code will be really good.
When to stop refining and start converting to beads: Stay in plan refinement if whole-workflow questions are still moving around, major architecture debates are still open, or fresh models keep finding substantial missing features, constraints, or tradeoffs. Switch to beads when the plan mostly feels stable and the remaining improvements are about execution structure, testing obligations, sequencing, and embedded context rather than about what the system fundamentally is. If you are still redesigning the product, stay in plan space. If you are mainly packaging the work for execution, move to bead space.
Converting the Plan into Beads
Then you're ready to turn the plan into beads. Think of these as epics, tasks, and subtasks with an associated dependency structure. The name comes from Steve Yegge's amazing project, which is like Jira or Linear, but optimized for use by coding agents. They are stored locally in .beads/ JSONL files that commit with your code.
There are two separate stages here. Planning is, and should be, prior to and orthogonal to beads. You should always have a super detailed markdown plan first. Then treat transforming that markdown plan into beads as a separate, distinct problem with its own challenges. Once you're in "bead space" you never look back at the markdown plan, which is why it's so critical to transfer all the details over to the beads.
Claude Code with Opus
This prompt forces the agent to treat plan-to-beads as a translation problem rather than task extraction. The key sentence is the requirement that beads be so detailed you never need to reopen the markdown plan. That pushes rationale, test expectations, design intent, and sequencing into the bead graph itself. Under the hood, it blocks a common failure mode where the model collapses a rich plan into terse todo items. By explicitly asking for tasks, subtasks, dependency structure, comments, and future-self context, you tell the model that memory density matters more than brevity. Restricting to the br tool prevents the agent from drifting into pseudo-beads in markdown instead of editing the actual task graph.
For existing projects with a specific plan file, prefix it: "OK so now read ALL of PLAN_FILE_NAME.md; please take ALL of that and elaborate on it..." The rest of the prompt stays the same.
Have the agent create the actual beads with the br tool, and from that point on just add and change actual beads. If the model starts describing beads in text form instead of creating them, stop it and redirect to br create.
A plan is only useful once it becomes executable memory
Pick a concept from the plan, then compare what survives into a thin bead versus a context-rich bead. The gap is the source of most swarm confusion.
A thin bead seems simple, but quietly implies S3 bucket provisioning, chunking, and progress states that a fresh agent has to improvise. A context-rich bead carries the why, the what, the failure modes, and the verification plan needed to execute; most architecture has already been decided upstream, so fresh agents can work without improvising architecture or silently dropping intent.
Beads as Executable Memory
The plan is still the best artifact for whole-system thought. But once a swarm is involved, what you need is not a beautiful essay. You need a task graph that carries enough local context for agents to act correctly without repeatedly loading the whole project back into memory. If the beads are weak, the swarm becomes improvisational. If the beads are rich, the swarm becomes almost mechanical.
- Self-contained: Beads must be so detailed that you never need to refer back to the original markdown plan. Every piece of context, reasoning, and intent should be embedded.
- Rich content: Beads can and should contain long descriptions with embedded markdown. They don't need to be short bullet-point entries. You can embed snippets of markdown inside the beads, and good beads often do; JSONL is just how they serialize.
- Complete coverage: Everything from the markdown plan must be embedded into the beads. Lose nothing in the conversion.
- Explicit dependencies: The dependency graph must be correct; this is what enables bv to compute the optimal execution order.
- Include testing: Beads should include comprehensive unit tests and e2e test scripts with great, detailed logging.
What Good Beads Look Like
To make this concrete, imagine a small internal web app called "Atlas Notes" for uploading and searching team notes. Instead of one vague task like "build Atlas Notes," the plan becomes many self-contained beads:
- br-101 Upload and Parse Pipeline: Describes accepted file formats, frontmatter parsing expectations, where failures are logged, what happens on malformed input, and which unit and e2e tests prove the pipeline works.
- br-102 Search Index and Query UX: Carries the search behavior, indexing rules, latency expectations, filter semantics, empty-state UX, and test coverage for keyword/tag/date combinations.
- br-103 Ingestion Failure Dashboard: Includes the admin workflow, permission boundaries, retry logic, logging expectations, and the exact reasons this dashboard matters for operational trust.
The titles are not the important part. What matters is that each bead is rich enough that a fresh agent can open it and immediately understand what correct implementation looks like, why it matters, and how to verify it.
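To make "rich enough" tangible, here is a rough sketch of how br-101 might be created from the shell. This is illustrative, not taken from a real project; in particular, the --description flag is an assumption about your local br build (check br create --help), and only the title, priority, and label flags appear in the quick reference below.

```bash
# Hypothetical sketch: a context-rich bead for the Atlas Notes upload/parse pipeline.
# Assumption: this br build accepts a long markdown body via --description.
br create \
  --title "Upload and Parse Pipeline" \
  --priority 1 \
  --label backend \
  --description "$(cat <<'EOF'
## Why
Every downstream feature (search, admin dashboard) depends on clean ingestion of uploaded notes.

## What
- Accept Markdown files via the drag-and-drop upload UI
- Parse frontmatter tags; log and persist malformed input for review
- Surface parse failures to the admin dashboard bead (br-103)

## Verification
- Unit tests: frontmatter edge cases (missing, duplicate, malformed tags)
- e2e tests: upload success, upload failure, retry path, all with detailed logging
EOF
)"
```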
For the CASS Memory System (5,500-line plan), the conversion produced 347 beads with complete dependency structure. FrankenSQLite had hundreds of beads created via parallel subagents. For complex projects, expect 200-500 initial beads.
Beads CLI Quick Reference
```
br create --title "..." --priority 2 --label backend   # Create issue
br list --status open --json                           # List open issues
br ready --json                                        # Show unblocked tasks
br show <id>                                           # View issue details
br update <id> --status in_progress                    # Claim task
br close <id> --reason "Completed"                     # Close task
br dep add <id> <other-id>                             # Add dependency
br comments add <id> "Found root cause..."             # Add comment
br sync --flush-only                                   # Export to JSONL (no git ops)
```
Priority uses numbers: P0=critical, P1=high, P2=medium, P3=low, P4=backlog. Types: task, bug, feature, epic, question, docs. br ready shows only unblocked work. Storage is a SQLite + JSONL hybrid; the JSONL files commit with your code.
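Wiring beads together and finding unblocked work uses only commands from the reference above. The bead IDs continue the hypothetical Atlas Notes example; double-check your build's argument order for br dep add (the intent here is "br-103 depends on br-101"):

```bash
br dep add br-103 br-101               # admin dashboard depends on the upload pipeline
br ready --json                        # list beads whose dependencies are all closed
br update br-101 --status in_progress  # an agent claims the unblocked bead
br close br-101 --reason "Completed"   # ...and closes it when review comes back clean
br sync --flush-only                   # export the updated graph to JSONL for the repo
```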
Check Your Beads N Times, Implement Once
Before you burn up a lot of tokens with a big agent swarm on a new project, the old woodworking maxim of "Measure twice, cut once!" is worth revising as "Check your beads N times, implement once," where N is basically as many as you can stomach. This is the step most people underinvest in.
After the initial conversion finishes, do a round of this prompt. If Claude Code did a compaction at any point, be sure to tell it to re-read your AGENTS.md file first:
Claude Code with Opus — run 4-6+ times
This prompt keeps the system from freezing beads too early. It tells the model to stay in plan space for as long as it is still finding meaningful improvements, which is exactly where reasoning is cheapest and most global. The warnings against oversimplifying and losing functionality are crucial because models otherwise tend to 'improve' artifacts by deleting complexity they do not fully understand. It combines local bead QA (via br) with graph QA (via bv), and forces tests into the bead definitions themselves so test work cannot be deferred into an afterthought.
From real sessions, polishing involves duplicate detection and merging, quality scoring on WHAT/WHY/HOW criteria, filling empty bead descriptions, correcting dependency links, and cross-referencing beads against the markdown plan to ensure nothing was lost. FrankenSQLite identified 9 exact duplicate pairs and closed them, choosing survivors based on "richer testing specs, better dependency chains, and higher priority."
Tell agents to go through each bead and explicitly check it against the markdown plan. Or vice versa: go through the markdown plan and cross-reference every single thing against the beads (both closed and open) to ensure complete coverage.
Convergence Detection: When to Stop
Bead polishing follows numerical optimization convergence patterns:
Three signals indicate convergence: agent responses getting shorter (output size shrinking), the rate of change decelerating (change velocity slowing), and successive rounds becoming more similar (content similarity increasing). When the weighted convergence score reaches 0.75+, you're ready to finalize. Above 0.90, you're hitting diminishing returns.
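The guide does not specify exact weights, but the score can be sketched as a weighted blend of the three normalized signals; the weights below are illustrative assumptions, not part of the methodology:

```latex
% Illustrative only: each s_i is normalized to [0, 1] across polishing rounds,
% and the weights are assumptions, e.g. w = (0.4, 0.3, 0.3).
C = w_1\, s_{\text{output shrink}} + w_2\, s_{\text{change deceleration}} + w_3\, s_{\text{round similarity}},
\qquad w_1 + w_2 + w_3 = 1
% Finalize when C >= 0.75; diminishing returns above 0.90.
```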
Fresh Eyes Technique
If improvements start to flatline, start a brand new Claude Code session:
A brand new Claude Code session
Fresh sessions don't carry the accumulated assumptions of the previous session. They see the beads with genuinely new eyes.
Then follow up with:
Same fresh session, after it finishes reading
As a final step, have Codex with GPT (high reasoning effort) do one last round using the same polishing prompt. Different models catch different things.
Deduplication Check
After large bead creation batches, run a dedicated dedup pass:
Claude Code, after large bead creation batches
Adding Features to Existing Projects
The full planning pipeline (Phases 1-5) is for new projects built from scratch. For existing projects that need new features, the Idea-Wizard is a formalized 6-phase pipeline:
1. Ground in reality. Read AGENTS.md and list all existing beads (`br list --json`). This prevents creating duplicates.
2. Generate 30, winnow to 5. The agent brainstorms 30 ideas for improvements, then self-selects the best 5 with justification.
3. Expand to 15. Prompt: "ok and your next best 10 and why." The agent produces ideas 6-15, checking each against existing beads for novelty.
4. Human review. You review the 15 ideas and select which to pursue.
5. Turn into beads. Selected ideas become beads with full descriptions, dependencies, and priority levels.
6. Refine 4-5 times. The same polishing loop as above. Single-pass beads are never optimal.
Claude Code, for existing projects needing new features
Then: "ok and your next best 10 and why." The agent produces ideas 6-15, carefully checking each against existing beads for novelty. Having agents brainstorm 30 then winnow to 5 produces much better results than asking for 5 directly because the winnowing forces critical evaluation.
Not every change needs the full pipeline. For quick, bounded changes, use the built-in TODO system:
When the overhead of formal bead creation would slow you down more than it helps
This prompt forces the agent to externalize its local execution plan into a durable checklist instead of juggling a sprawling ad-hoc task in conversational memory. In tools that support a built-in TODO system, that checklist survives compaction. Under the hood, it creates a temporary execution scaffold that is lighter than full bead creation and much safer than 'just remember everything.' This is the right mode when the work is too small to justify immediate bead formalization but too large to trust to ephemeral context alone. When NOT to use it: if the change is expanding, depends on other work, needs graph-aware sequencing, or should be part of the permanent project record. In those cases, stop and convert it into proper beads. If an ad-hoc change later proves important, retroactively create beads for the completed work to preserve continuity.
Major Features: Research and Reimagine
The Idea-Wizard handles bounded improvements, new feature ideas, and fixes. But sometimes you want to add an entirely new capability to an existing project, something ambitious enough that it deserves the same depth of planning as a greenfield project, and where an external project has already solved a related problem worth studying. For these, there is a more powerful approach: study an external project that already solves a related problem, then reimagine its strongest ideas through the lens of your own project's unique strengths.
As a concrete example: adding a robust messaging substrate to the Asupersync project. Rather than designing from scratch or doing a straightforward port, the approach was to study NATS (a mature, production-grade messaging system in Go), extract its strongest architectural ideas, and reimagine them using Asupersync's correct-by-design structured concurrency primitives to create something neither project could achieve alone.
The process follows a specific prompt sequence. Each step builds on the previous, alternating between expansion (going deeper, inverting the analysis, pushing for architectural innovation) and hardening (repeated blunder hunts that stress-test the result):
Codex or Claude Code, in a session with existing project context
By asking the agent to clone and investigate the external project firsthand, you get specific, grounded proposals instead of vague suggestions based on training data alone. The 'reimagine in highly accretive ways' framing prevents a shallow porting exercise and pushes toward genuinely novel combinations.
The first draft is always too conservative. Push for depth and ambition:
Same session, immediately after the first proposal
Models produce conservative initial proposals to avoid being wrong. Explicit pressure to go deeper unlocks the genuinely creative architectural ideas that make the integration worthwhile rather than incremental.
Then invert the analysis. This technique surfaces opportunities that only exist because of your project's unique capabilities:
Same session, after deepening
Standard analysis asks 'what can we learn from them?' Inversion asks 'what can we do that they fundamentally cannot?' This surfaces the highest-value integration points: capabilities that are genuinely novel rather than just reimplementations of features the external project already has.
After each major expansion, run a blunder-hunt pass. The critical technique: repeat the exact same critique prompt 5 times in a row. Each pass finds things the previous pass missed, because the model is forced to look beyond the issues it already identified:
Run this exact prompt 5 times consecutively after each major proposal expansion
Models tend to find 15-20 issues on the first pass and declare satisfaction. Running the exact same prompt again forces them past the issues they already found. By the fifth pass, you have caught subtle logical flaws and architectural inconsistencies that no single review pass would surface. This is the critique equivalent of the bead polishing convergence pattern from Section 5.
Continue pushing for specific architectural innovations. In the Asupersync example, this meant asking: "Can you think of a clever, radically innovative way to leverage our unique capabilities so that the messaging substrate doesn't require a separate external server, but each client can self-discover and collectively act as both client and server?" Each major architectural addition gets another round of 5x blunder hunts.
When the proposal has items flagged as needing follow-on design work, address them explicitly rather than leaving them vague:
Same session, after blunder hunts surface specific open questions
Blunder hunts often identify areas where the proposal is 'honest but incomplete' rather than wrong. This prompt converts those honest gaps into concrete design decisions, preventing them from becoming ambiguity that later infects the beads and implementation.
Before sending the proposal for multi-model feedback, make it self-contained. Other models do not have your session context, so the proposal must include everything they need to give useful critique:
After the proposal reaches a stable, ambitious state
Cross-model review only works if every model can fully understand the proposal without access to your project. Adding comprehensive background sections prevents other models from making shallow suggestions based on incomplete understanding. This preparation step is what makes the multi-model feedback loop genuinely useful rather than superficial.
Follow this with another 5x blunder hunt, then de-slopify the document. Now you are ready for multi-model triangulation.
Send the self-contained proposal to GPT Pro, Claude Opus, Gemini with Deep Think, and Grok Heavy, all with the same prompt asking for improvements in git-diff format:
GPT Pro, Claude Opus (web), Gemini Deep Think, and Grok Heavy -- all four, independently
Each model has different architectural tastes and blind spots. Asking for git-diff format forces precision: the models cannot hand-wave about what should change, they have to show the exact text transformations. This makes the synthesis step tractable.
Feed the competing feedback from all models into GPT Pro using the "best of all worlds" synthesis prompt from Section 3. Apply the resulting diffs back to the proposal document in Codex or Claude Code, then de-slopify the final result.
You can see this exact process applied to Asupersync's NATS integration: the initial proposal, the version after multi-model feedback, and the full GPT Pro synthesis conversation.
From here, the proposal feeds into the standard pipeline: convert to beads (Section 4), polish obsessively (Section 5), launch the swarm (Section 7). The research-driven approach adds significant front-end effort but produces proposals with a level of architectural depth and innovation that no amount of greenfield brainstorming can match, because you are standing on the shoulders of a real, battle-tested system while leveraging capabilities that system never had access to.
The Coordination Stack
Then you're ready to start implementing. The fastest way to do that is to start up a big swarm of agents that coordinate using three interlocking tools:
Beads, Agent Mail, and bv are a single machine
Remove any piece and the system loses a capability it cannot replace. This is the Coordination Triangle:
- Beads: the durable, localized issue state.
- Agent Mail: the high-bandwidth negotiation layer.
- bv: the graph-theory compass for triage.
The trio is not three nice-to-have tools. It is one operating system split into memory, communication, and leverage analysis. Remove any side of the triangle and the swarm loses determinism.
Each tool is essential but insufficient alone. Agent Mail without beads leaves agents with no structured work to coordinate around. Beads without bv leaves agents randomly choosing tasks. bv without Agent Mail leaves agents unable to communicate. The system is distributed and decentralized, with each agent using bv to find the next optimal bead, marking it as in-progress, and communicating about it via Agent Mail.
Agent Mail: Why Naive Coordination Fails
Building your own agent coordination from scratch is full of footguns that Agent Mail was designed to sidestep:
- No broadcast-to-all default. Lazy agents will use whatever the default is; if that default is broadcast, they spam every agent with mostly irrelevant information. It's like an email system that defaults to reply-all every time. That burns precious context.
- Good MCP ergonomics. It takes a huge amount of careful iteration to get the API surface right so agents use it reliably without wasting tokens.
- No git worktrees. Worktrees demolish development velocity and create reconciliation debt when agents diverge. Working in one shared space surfaces conflicts immediately. All agents commit directly to `main`.
- Advisory file reservations. Agents call dibs temporarily on files, but it's not rigidly enforced, and reservations expire. Agents can reclaim files that haven't been touched recently. Rigid locks held by dead agents block everyone else; advisory reservations with TTL expiry degrade gracefully.
- Semi-persistent identity. Agent Mail generates whimsical names like "ScarletCave" and "CoralBadger" — meaningful enough for coordination, disposable enough that losing one doesn't corrupt the system. No agent's identity is load-bearing.
Before editing files, agents reserve them via Agent Mail:
```
file_reservation_paths(
    project_key="/data/projects/my-repo",
    agent_name="BlueLake",
    paths=["src/auth/*.rs"],
    ttl_seconds=3600,
    exclusive=true,
    reason="br-42: refactor auth"
)
```
Other agents see the reservation and work on different files. A rigid locking system would deadlock when an agent crashes while holding a lock. Advisory reservations with expiry degrade gracefully. The worst case is a brief window where two agents touch the same file, which the pre-commit guard catches anyway.
Agent Mail provides four high-level macros that wrap common multi-step patterns: macro_start_session (bootstrap: ensure project, register agent, fetch inbox), macro_prepare_thread (join existing thread with summary), macro_file_reservation_cycle (reserve, work, auto-release), and macro_contact_handshake (cross-agent contact setup).
Broadcast vs. Point-to-Point
Agent Mail uses targeted delivery and advisory locks to stay efficient: coordination noise stays O(1) per message instead of growing with the number of agents.
bv: The Graph-Theory Compass
bv precomputes dependency metrics (PageRank, betweenness, HITS, eigenvector, critical path, cycle detection) so agents get deterministic, dependency-aware output. When multiple agents each independently query bv for priority, you get emergent coordination. Agents naturally spread across the optimal work frontier without needing a central coordinator.
PageRank finds what everything depends on. Betweenness finds bottlenecks. The math knows your priorities better than gut intuition.
```
bv --robot-triage                    # THE MEGA-COMMAND: full recommendations with scores
bv --robot-next                      # Minimal: just the single top pick + claim command
bv --robot-plan                      # Parallel execution tracks with unblocks lists
bv --robot-insights                  # Full graph metrics: PageRank, betweenness, HITS
bv --robot-priority                  # Priority recommendations with reasoning and confidence
bv --robot-diff --diff-since <ref>   # Changes since last check
```
Always use the `--robot-*` flags; bare `bv` launches an interactive TUI that blocks your session.
bv was made in a single day and was just under 7k lines of Go. It was later rewritten to 80k lines with advanced features. This shows that effort does not correspond to impact. The tool started for humans but pivoted to being primarily for agents:
Advanced filtering lets you scope analysis to labels, historical point-in-time views, pre-filtered recipes, or grouped output:
```
bv --robot-plan --label backend             # Scope to label's subgraph
bv --robot-insights --as-of HEAD~30         # Historical point-in-time
bv --recipe actionable --robot-plan         # Only unblocked items
bv --recipe high-impact --robot-triage      # Top PageRank scores
bv --robot-triage --robot-triage-by-track   # Group by parallel streams
bv --robot-triage --robot-triage-by-label   # Group by domain
```
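The emergent coordination described above is, in effect, each agent running the same simple loop. The sketch below is illustrative shell only; agents actually do this through their own tool calls, and the jq field name is an assumption since the --robot-next output schema is not documented here:

```bash
# Hypothetical work-selection loop using only bv and br.
while true; do
  next_bead=$(bv --robot-next | jq -r '.id')    # assumed field: the single top pick
  if [ -z "$next_bead" ] || [ "$next_bead" = "null" ]; then
    break                                        # nothing ready: stop
  fi
  br update "$next_bead" --status in_progress    # claim it so other agents skip it
  # ... implement, test, self-review, communicate in the bead's Agent Mail thread ...
  br close "$next_bead" --reason "Completed"
  br sync --flush-only                           # export state to JSONL for the repo
done
```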
Bead IDs as Threading Anchors
Bead IDs create a unified audit trail across all coordination layers: the bead ID goes in the Agent Mail thread_id, the subject prefix ([br-123]), the file reservation reason, and the commit message. This makes all coordination activity traceable back to a single task.
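Concretely, one bead leaves the same fingerprint at every layer. The message fields below are assumptions by analogy with the reservation example earlier; the subject prefix follows the documented [br-123] convention:

```bash
# Agent Mail thread:        thread_id="br-123", subject="[br-123] Search index and query UX"
# File reservation reason:  reason="br-123: implement search index"
# Commit message:
git commit -m "[br-123] Add search index, filters, and query UX tests"
```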
AGENTS.md: The Operating Manual
The AGENTS.md file is the single most critical piece of infrastructure for agent coordination. It tells every agent how to behave, what tools exist, what safety constraints matter, and what "doing a good job" means in this repo. Every tool should come with a prepared blurb designed for inclusion in AGENTS.md. Think of these blurbs as the modern equivalent of man pages.
Every AGENTS.md should include these core rules:
1. Rule 0, The Override Prerogative: The human's instructions override everything.
2. Rule 1, No File Deletion: Never delete files without explicit permission.
3. No destructive git commands: `git reset --hard`, `git clean -fd`, and `rm -rf` are absolutely forbidden.
4. Branch policy: All work happens on `main`, never `master`.
5. No script-based code changes: Always make code changes manually.
6. No file proliferation: No `mainV2.rs` or `main_improved.rs` variants.
7. Compiler checks after changes: Always verify no errors were introduced.
8. Multi-agent awareness: Never stash, revert, or overwrite other agents' changes.
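As a minimal sketch only (real AGENTS.md files are far longer and project-specific; see the linked examples), a skeleton covering these core rules might look like this:

```markdown
# AGENTS.md: operating manual for every agent in this repo

## Rule 0: The Override Prerogative
The human's instructions override everything below.

## Hard rules
- Never delete files without explicit permission.
- No destructive git commands: `git reset --hard`, `git clean -fd`, `rm -rf` are forbidden.
- All work happens on `main`, never `master`. No feature branches, no worktrees.
- No script-based code changes; edit code manually.
- No file proliferation: no `mainV2.rs` or `main_improved.rs` variants.
- Run compiler checks after every change.
- Multi-agent awareness: never stash, revert, or overwrite other agents' changes.

## Tools
<!-- Prepared blurbs for br, bv, Agent Mail, ubs, rch go here -->

## Project context
<!-- Tech stack, best-practices guides, conventions, how to run tests -->
```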
More content in AGENTS.md means more frequent compactions, but it saves time and avoids mistakes by giving agents all the context upfront. This tradeoff is worth making.
If you don't have a good AGENTS.md file, none of this stuff is going to work well. You can see example AGENTS.md files for a complex NextJS webapp and a bash script project.
"Reread AGENTS.md" is the single most common prompt prefix across the entire session archive. After every context compaction, agents must re-read it:
Immediately after any context compaction (the single most commonly used prompt)
Compaction wipes out the soft operational knowledge that keeps the swarm sane: how to behave, how to coordinate, what tools exist, what rules matter, what mistakes to avoid. This one-line prompt restores that control plane in one move. It rehydrates the agent's behavioral contract after context loss. Important enough to have been automated with the post_compact_reminder tool.
The pragmatic approach: do not fight compaction, just re-read AGENTS.md and roll with it until the agent starts doing dumb stuff, then start a new session. When beads are well-constructed, compaction matters less because each bead is self-contained. The agent can pick up any bead fresh without needing the full conversation history.
Single-Branch Git Model
All agents commit directly to main. This may surprise you if you're used to feature branches. But branch-per-agent creates merge hell with 10+ agents making frequent commits. Worktrees add filesystem complexity and path confusion. Agents lose context when switching branches. Logical conflicts survive textual merges: a function signature change on one branch and a new callsite on another merge cleanly but fail to compile. On a single branch, the second agent sees the signature change immediately and adapts.
Instead of branch isolation, three complementary mechanisms prevent conflicts: file reservations (agents reserve files via Agent Mail before editing; advisory, not rigid, with TTL expiry so dead agents cannot deadlock the system), a pre-commit guard (blocks commits to files reserved by another agent), and DCG (Destructive Command Guard, which mechanically blocks dangerous commands).
An agent once ran `git checkout --` on uncommitted work. Files were recovered via `git fsck --lost-found`, but the incident proved that instructions do not prevent execution. Mechanical enforcement does. DCG was built the next day.
The recommended git workflow: pull latest, reserve files, edit and test, commit immediately, push, release reservation. Key principles: commit early and often (small commits reduce the conflict window), push after every commit (unpushed commits are invisible to other agents), reserve before editing, release when done.
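In shell form, the per-bead loop looks roughly like this. It is a sketch: the reservation and release steps happen through Agent Mail tool calls rather than shell commands, the bead ID is illustrative, and whether you pull with --rebase is a local choice:

```bash
git pull --rebase                                        # 1. See other agents' pushed work first
# 2. Reserve the files you are about to touch via Agent Mail (file_reservation_paths)
# 3. Edit and test; run the compiler checks for the affected language
git add src/auth/
git commit -m "[br-42] Refactor auth reservation flow"   # 4. Commit immediately; keep commits small
git push                                                 # 5. Push right away; unpushed commits are invisible to others
# 6. Release the reservation (or let its TTL expire)
```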
Agent Fungibility
Every agent is a generalist. No role specialization. All agents read the same AGENTS.md and can pick up any bead. This is deliberately opposed to "specialist agent" architectures where one agent has a special role — specialist agents become single points of failure. When the specialist crashes or needs compaction, the whole system stalls. With 12 fungible agents, losing one makes almost no difference.
You also do not want "ringleaders": a coordinating boss agent whose crash takes down the whole system. Coordination must live in artifacts (beads, reservations, threads) and tools (bv, Agent Mail), not in any special agent.
Think of it like RaptorQ fountain codes: beads are "blobs" in a stream, any agent catches any bead in any order. There is no "rarest chunk" bottleneck, and the system is resilient to partial agent failures by design. Failure recovery is trivial: the bead remains marked in_progress, any other agent can resume it, and a replacement agent is just ntm add PROJECT --cc=1 plus the standard marching orders prompt.
Fungible Agent Crash Recovery
Every agent is fungible: kill any one of them and the swarm self-heals without downtime or data loss.
Prompts Are Deliberately Generic
This confuses people when they first see the prompt library. The prompts say things like "check over each bead super carefully" rather than "check over each bead in the authentication module for SQL injection risks." That generality is the point. The specificity lives in three places the agent already has access to:
1. The beads themselves contain detailed descriptions, context, and rationale embedded during the plan-to-bead conversion.
2. AGENTS.md contains project-specific rules, conventions, and tool documentation.
3. The codebase contains the actual implementation context.
The prompts are the reusable scaffolding that directs the agent's attention. The beads and AGENTS.md supply the project-specific substance. This separation means you can use the exact same prompt library across every project without modification. The prompt "reread AGENTS.md so it's still fresh in your mind" followed by "use bv to find the most impactful bead to work on next" works identically whether you are building a CLI tool, a web app, or a protocol library, because the specifics come from the project's own artifacts, not from the prompt.
Security Comes Free with Good Planning
Security review is baked into the standard workflow at multiple levels rather than being a separate phase. The cross-agent review prompt explicitly calls out security problems. When models reason about an entire system's architecture at once (which is what the plan enables), they spot authentication gaps, data exposure risks, and trust boundary violations without being told to look. UBS catches security anti-patterns mechanically: unpinned dependencies, missing input validation, hardcoded secrets, supply chain vulnerabilities. Beads that include comprehensive e2e tests naturally cover authentication and authorization paths.
Security vulnerabilities are usually symptoms of incomplete reasoning about the system. If the plan is detailed enough to cover all user workflows, edge cases, and failure modes, security considerations emerge from that completeness rather than requiring a separate checklist. For projects with explicit security requirements (financial, healthcare), add dedicated security review beads.
Launching & Running the Swarm
You can create sessions using Claude Code, Codex, and Gemini-CLI in different panes in tmux, or use the ntm project (Named Tmux Manager) as the command center:
```
# Spawn a multi-agent session
ntm spawn myproject --cc=2 --cod=1 --gmi=1

# Send a prompt to ALL agents
ntm send myproject "Your marching orders prompt here"

# Send to specific agent type
ntm send myproject --cc "Focus on the API layer"

# Open the command palette (battle-tested prompts)
ntm palette
```
NTM is useful but not mandatory. A mux is a terminal multiplexer: a layer that lets you manage multiple shell sessions inside one higher-level session manager. In practice, that usually means some combination of tabs, panes, detached sessions, and reconnection to work that is still running on a local or remote machine. tmux is the classic Unix terminal multiplexer, powerful and battle-tested. NTM is built on top of tmux, which is why it is a natural fit for multi-agent work. But tmux is only one mux. WezTerm has its own built-in mux. Zellij is another. The method cares that you have a workable orchestration layer, not that you picked one specific multiplexer.
One common alternative is WezTerm because native scrollback and text selection are more convenient than in tmux. A workable setup:
- Run agents in separate tabs using WezTerm and its built-in mux, often across remote machines
- Trigger your most common prompts from a Stream Deck with the prompts preconfigured
- Keep a large prompt file open in Zed and paste rarer prompts manually
- In Claude Code, use the project-specific Ctrl-r prompt history search when you want to recall something you used recently
There is no single correct operator interface. NTM is one good cockpit. WezTerm tabs plus mux is another. FrankenTerm, which is built on WezTerm, is aimed more explicitly at this style of workflow but is not ready yet. The important thing is that you can launch agents, get prompts into them quickly, monitor them, and keep the coordination layer (AGENTS.md, Agent Mail, beads, bv) intact.
For concrete setup notes on these operator environments:
- WezTerm persistent remote sessions: its native mux supports persistent remote sessions that survive disconnects, sleep, or reboot while preserving native scrollback and text selection
- Ghostty terminfo for remote machines: Ghostty is a good terminal frontend in its own right, whether used directly or paired with another mux such as Zellij
- Host-aware color themes for Ghostty and WezTerm: different color schemes per host make it visually obvious which machine you are connected to
Give each agent these marching orders:
Every agent in the swarm gets this as their initial prompt
This is the closest thing to a canonical swarm kickoff packet. It front-loads the shared operating context, forces the agent to establish social presence through Agent Mail, and then pivots away from passive waiting toward execution. The line about 'communication purgatory' matters because swarm failure often comes from over-coordination rather than under-coordination. Under the hood, the prompt establishes a control loop: load rules, understand the codebase, join the coordination layer, claim work, keep state synchronized, and use bv whenever local judgment is insufficient. The rch requirement is especially important in real swarms because it externalizes expensive builds and tests, preventing local CPU contention from degrading the entire multi-agent system. That one sentence is operational, not cosmetic. The prompts are deliberately generic; their vagueness is a feature, letting you reuse them for every project while the agent gets specifics from AGENTS.md and the beads.
The First 10 Minutes After Launch
Newcomers often understand each individual tool but do not have a clean picture of the first live operating loop. In practice, the first 10 minutes look like this:
1. Your session manager creates the agent terminals (ntm spawn, WezTerm mux, or equivalent).
2. You send the marching-orders prompt to each agent (staggered, not all at once).
3. Each agent reads AGENTS.md and the repo docs, inspects the codebase, and joins Agent Mail.
4. Each agent checks who else is active, acknowledges waiting messages, and learns the bead-thread naming conventions.
5. Each agent uses `bv --robot-triage` and `br ready --json` to choose a bead.
6. Before editing, the agent reserves the relevant file surface and announces the claim in the matching br-### thread.
7. Only then does the agent start coding, reviewing, or testing.
That sequence turns a pile of terminals into a coordinated swarm. Skipping the join-up steps produces duplicate work, silent conflicts, and "communication purgatory." Skipping the routing steps means agents choose work randomly instead of unlocking the dependency graph intelligently.
Agent Composition & Model Recommendations
Per-agent efficiency definitely declines as N grows, but if you have enough tasks in beads, the agents have Agent Mail, and you don't start them all at the exact same time, total throughput still rises as N grows. The practical limit is around 12 agents on a single project, sometimes higher. Or run 5 agents per project across multiple projects simultaneously. Why the ratio --cc=2 --cod=1 --gmi=1? Two Claude sessions because they are great for architecture and complex reasoning; one Codex for fast iteration and testing with complementary strengths; one Gemini for a different perspective, especially good for docs and review duty.
| Open Beads | Claude (cc) | Codex (cod) | Gemini (gmi) |
|---|---|---|---|
| 400+ | 4 | 4 | 2 |
| 100-399 | 3 | 3 | 2 |
| <100 | 1 | 1 | 1 |
The Thundering Herd
When you start up like 5 of each kind of agent and have them all collaborate in the same shared workspace, you can hit the classic "thundering herd" problem. The fix: stagger agent starts by 30 seconds minimum, make sure agents mark beads as in-progress quickly, and wait 4 seconds after launch before sending the initial prompt. For Codex specifically: send Enter twice after pasting long prompts (Codex has an input buffer quirk that sometimes swallows the first submit).
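One hedged way to encode those timing rules with ntm (the --cod and --gmi send flags are assumed by analogy with the documented --cc form, and marching_orders.txt is a placeholder for your kickoff prompt):

```bash
ntm spawn myproject --cc=2 --cod=1 --gmi=1
sleep 4                                        # wait ~4 seconds after launch before the first prompt

ntm send myproject --cc  "$(cat marching_orders.txt)"
sleep 30                                       # stagger starts by at least 30 seconds
ntm send myproject --cod "$(cat marching_orders.txt)"   # for Codex, hit Enter twice after long pastes
sleep 30
ntm send myproject --gmi "$(cat marching_orders.txt)"
```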
The same swarm can either stampede or flow
Compare the same agents under two launch strategies; the lesson is timing, not talent. In the thundering-herd path, all agents wake together, re-read context together, and pile onto the same frontier: the swarm is synchronized before any useful work has started. In the staggered path, agents enter a few beats apart, so each arrival sees a different clean frontier while the later agents are still off the critical path. The difference is not smarter agents. It is whether the system lets them reach distinct frontiers at distinct times.
What the Human Actually Does
The human tends the swarm like an operator tending a machine that mostly runs on its own. These tasks are monitoring and maintenance. The hard cognitive work already happened during planning, which is why you can tend multiple project swarms at the same time.
On roughly a 10-30 minute cadence:
1. Check bead progress. Use `br list --status in_progress --json` or `bv --robot-triage`. Are agents making steady progress? Are any beads stuck?
2. Handle compactions. When you see an agent acting confused, send: "Reread AGENTS.md so it's still fresh in your mind." This is the single most common intervention. It takes 5 seconds.
3. Run periodic reviews. Pick an agent and send the "fresh eyes" review prompt. This catches bugs before they compound.
4. Manage rate limits. When an agent gets rate-limited, switch its account with `caam activate claude backup-2` or start a new agent.
5. Commit periodically. Every 1-2 hours, designate one agent for the organized commit prompt.
6. Handle surprises. Create new beads for unanticipated issues, or if it's plan-level, update the plan and create new beads.
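The CLI half of that cadence boils down to a handful of commands you end up repeating (the compaction, review, and commit steps are prompts from the library rather than shell commands):

```bash
# 1. Check bead progress and look for stuck work
br list --status in_progress --json
bv --robot-triage

# 4. Rotate a rate-limited account
caam status
caam activate claude backup-2

# Replace a dead or hopelessly confused agent, then paste the marching-orders prompt
ntm add myproject --cc=1
```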
Taken to its endpoint, this design supports full autonomy: one puppet master agent controlling ntm via robot mode, replacing the human for routine machine-tending. The methodology is building toward a future where the human designs the plan, polishes the beads, and then walks away entirely while agents execute, review, ship, and start the next cycle.
When the "foregone conclusion" breaks down: If you find yourself doing heavy cognitive work during implementation, that is a signal that planning or bead polishing was insufficient. The remedies are specific: vague beads means agents improvise and produce inconsistent implementations; missing dependencies means agents work on tasks whose prerequisites are not done; thin AGENTS.md means agents produce non-idiomatic code; no Agent Mail means agents step on each other's files. The fix is always the same: pause implementation, go back to bead space, and add the missing detail.
Diagnosing a Stuck Swarm
When a swarm goes bad, the failure is usually one of two things: a local coordination jam (agents stepping on each other or losing operational context) or a strategic drift problem (the swarm is busy but no longer closing the real gap to the goal).
Atlas Notes as a Live Swarm
For a small project like Atlas Notes, a first swarm might look like this: Claude agent A claims br-101 and implements upload + parse handling. Codex agent B claims br-102 and works on the search path plus tests. Claude agent C claims br-103 and builds the admin failure dashboard. Gemini agent D stays flexible: reviews recent work, checks docs, and fills in test or UX gaps where needed. All four share the same codebase, read the same AGENTS.md, coordinate via Agent Mail, and use bv whenever they are uncertain about what unlocks the most progress next. That is what makes the swarm feel like one system rather than four unrelated terminals.
Account Switching
When you hit rate limits, use CAAM (Coding Agent Account Manager) for sub-100ms account switching:
```
caam status                     # See current accounts and usage
caam activate claude backup-2   # Switch instantly
```
Review, Testing & Hardening
Code review in a multi-agent swarm follows a different rhythm than traditional code review. There is no pull request, no human reviewer, no approval gate. Instead, review is woven into the implementation cycle itself: agents review their own work after each bead, review each other's work periodically, and the human triggers broader review rounds at natural checkpoints.
If you've done a good job creating your beads, the agents will be able to get a decent sized chunk of work done in that first pass. Then, before they start moving to the next bead, have them review all their work:
After each bead is implemented; run until no more bugs are found
This prompt is short because it is not redirecting the agent into a new domain. It is forcing a mode switch from generative coding to adversarial reading. The phrase 'fresh eyes' pushes the model to reframe code it just wrote as something potentially wrong, confusing, or internally inconsistent. That reduces the pattern where an agent stops once code compiles and never performs the low-cost bug sweep that catches obvious issues. The most effective reviews use subagent delegation: dispatch a fresh subagent with no memory of the original implementation to review each changed file.
Keep running rounds until they stop finding bugs. Typically 1-2 rounds for simple beads, 2-3 for complex ones. If an agent keeps finding bugs after 3 rounds, the implementation approach may be fundamentally off; consider having a different agent take over.
Each review should answer four questions:
- 1Is the implementation correct? Does it do what the bead description says it should?
- 2Are there edge cases? Empty inputs, concurrent access, error paths, boundary conditions.
- 3Are there similar issues elsewhere? If you find a bug, search for the same pattern in other files.
- 4Should the approach be different? Sometimes the implementation is correct but there is a simpler or more robust way.
When reviews come back clean, have them move on to the next bead:
After self-review comes back clean
This transition prompt is the glue between beads. It combines re-reading AGENTS.md (for compaction safety), querying bv for priority, and communicating with the swarm. It ensures the agent uses graph-theory routing to choose the task that unblocks the most downstream work, rather than picking arbitrarily.
Testing: Free Labor
When all your beads are completed, make sure you have solid test coverage:
After initial implementation pass is complete
Larger projects produce massive test suites. BrennerBot has nearly 5,000 tests. Stuff tends to "just work" in that case. Use UBS (Ultimate Bug Scanner) as a quality gate before every commit: ubs <changed-files> catches errors beyond what linters and type checkers find, including security holes, supply chain vulnerabilities, and runtime stability issues.
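A sketch of how UBS can be wired in as a commit gate, assuming (as the `ubs <changed-files>` form above suggests) that it accepts file paths and exits non-zero when it finds problems:

```bash
#!/usr/bin/env bash
# Sketch of a pre-commit quality gate around UBS.
# Assumption: ubs accepts file paths and exits non-zero on findings.
set -euo pipefail

changed=$(git diff --cached --name-only --diff-filter=ACM)
if [ -n "$changed" ]; then
  ubs $changed || { echo "UBS found issues; fix them before committing." >&2; exit 1; }
fi
```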
After any substantive code changes, always verify with compiler checks:
```bash
# Rust
cargo check --all-targets
cargo clippy --all-targets -- -D warnings
cargo fmt --check

# Go
go build ./...
go vet ./...

# TypeScript
bun typecheck
bun lint
```
UI/UX Polish
For projects with a user interface, there is a dedicated polishing phase that happens after core functionality works but before shipping. This is separate from bug hunting because the problems you are looking for are not bugs; they are friction, ugliness, and missed opportunities to delight. When an agent implements an "authentication" bead, it focuses on making auth work correctly. Whether the login form has good visual hierarchy, whether the error messages are helpful, whether the flow feels smooth on mobile: these are orthogonal concerns requiring a different mode of attention. Trying to do both at once produces mediocre results on both.
The workflow has five steps:
- 1Run the general scrutiny prompt to generate a list of improvement suggestions (not code changes).
- 2Review the suggestions and pick which to pursue. This is the human judgment step; the agent typically generates 15-30 suggestions, some excellent, some unnecessary.
- 3Turn selected suggestions into beads and implement through the normal swarm process.
- 4Run the platform-specific polish prompt.
- 5Repeat until improvements become marginal (typically 2-3 rounds).
After core functionality is working
After the scrutiny pass
The 'don't you agree?' phrasing is not politeness. It triggers the model to critically evaluate its own previous work rather than just validating it.
De-Slopification
After agents write documentation (README, user-facing text), run a de-slopify pass to remove telltale AI writing patterns. This must be done manually, not via regex. Read each line and revise systematically:
Deep Cross-Agent Review
This phase is distinct from the per-bead self-reviews above. Self-reviews happen after each bead is completed and focus on the code that was just written. Deep review happens after all (or most) beads are done and casts a wider net across the entire codebase, looking for problems that only become visible when you see how all the pieces fit together.
Cross-agent review catches a fundamentally different class of bugs than self-review. When Agent A implements a function and Agent B calls it, Agent A's self-review will never catch the fact that Agent B is passing arguments in the wrong order, because Agent A does not know about Agent B's code. Cross-agent review surfaces these integration issues.
Every 30-60 minutes during active implementation, or after a natural milestone (e.g., all beads in an epic are done), trigger cross-agent review. Do not have all agents stop to review simultaneously; pick one or two agents that just finished a bead and send them the review prompt while the others keep implementing. This keeps the swarm productive while still catching inter-agent issues.
Keep doing rounds of these two prompts until they consistently come back clean with no changes made. These prompts serve different purposes and should be alternated. This is one of the more art-than-science parts of the methodology. The prompts overlap in literal meaning, but they reliably activate different search behaviors in the models:
Alternate with the cross-agent review below
The prompt first asks the agent to build a mental model of purpose and flow, then asks for criticism. That ordering matters. A bug hunt without workflow understanding degrades into linting; a bug hunt after tracing execution flows catches logic errors, mismatched assumptions, and silent product-level breakage. The 'randomly explore' framing breaks the locality trap. Directed reviews focus on files that seem important, which are the files that got the most attention already. Bugs that survive to this phase live in utility modules, error handling paths, configuration parsing, and edge-case branches.
Alternate with the random exploration above
This prompt forces the swarm to stop treating code ownership as sacred. A large share of real defects live at the boundaries between agents' changes or in assumptions nobody revisits because they were made by 'someone else.' The instruction not to restrict review to the latest commits prevents shallow PR-style skimming and pushes the agent to trace older surrounding code, dependency surfaces, and adjacent workflows where the real root cause may live. The first-principles wording nudges the reviewer away from symptom-fixing toward actual causal diagnosis.
The cross-agent prompt tends to induce a suspicious, adversarial stance aimed at boundary failures and root causes in code written by others. The random-exploration prompt tends to induce a curiosity-driven stance aimed at reconstructing workflows and finding latent bugs in code that nobody is actively staring at. In practice, alternating them produces better coverage than repeating either one alone.
How to run deep bug hunting: Send the random exploration prompt to 2-3 agents simultaneously — each will explore different parts of the codebase because the randomness ensures variety. After they report back, send the cross-agent review prompt. Alternate until agents consistently come back with "I reviewed X, Y, Z files and found no issues." When two consecutive rounds both come back clean, the codebase is in good shape. If agents keep finding bugs after 4+ rounds, go back to bead space and create specific fix beads. Always run ubs . on the full project first and fix everything it flags before letting agents hunt for subtler issues.
Organized Commits
Periodically have one agent handle git operations:
Every 1-2 hours during active development
Designating one agent prevents merge conflicts and produces coherent commit messages. The 'don't edit the code' instruction is critical: without it, agents treat the commit step as an opportunity to 'fix one more thing,' which creates unbounded scope expansion and makes the commit unpredictable. The 'logically connected groupings' instruction produces a meaningful git history instead of one monolithic 'update everything' commit that is impossible to review or bisect later.
Swarm Diagnosis: Reality Check
When the swarm looks active but you suspect it is not closing the real gap to the goal:
When the swarm feels busy but directionally off
This prompt breaks the spell of local productivity. Instead of asking whether the current bead is going well, it asks whether the current frontier of work actually converges on the project outcome. If the agent concludes that finishing all open beads still would not get you there, the answer is not 'work harder.' The answer is to revise the bead graph and re-aim the swarm.
README Revision
After significant implementation work
Catch-All Oversight
After any significant change, as a quick final pass
The De-Slopify Prompt
After agents write README or any user-facing documentation
Landing the Plane
When ending a work session, agents must complete every step. Work is NOT complete until git push succeeds. Unpushed work is stranded locally and invisible to every other agent.
- 1File issues for remaining work. Create beads for anything that needs follow-up.
- 2Run quality gates. Tests, linters, builds (if code changed).
- 3Update issue status. Close finished work, update in-progress items.
- 4Sync beads. `br sync --flush-only` to export to JSONL, then `git add .beads/`.
- 5Commit and push. `git pull --rebase && git add <files> && git commit && git push`.
- 6Verify. `git status` must show "up to date with origin." (A consolidated command sketch follows this list.)
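Pulled together, the checklist above reduces to roughly this command sequence (the staged files and commit message are placeholders from the checklist, not real values):

```bash
# "Land the plane" sequence, consolidated from the checklist above
br sync --flush-only                   # export beads to JSONL
git add .beads/

git pull --rebase
git add <files>                        # placeholder: the session's changed files
git commit -m "<session summary>"      # placeholder commit message
git push

git status                             # must report "up to date" with origin
```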
For the Atlas Notes example, "done for now" would not mean "the upload page appears." It would mean: the upload, parse, search, and admin-review workflows all work end to end; the key beads are closed and remaining polish ideas exist as new beads; tests cover the critical user journeys and known failure paths; UBS and compiler/lint checks are clean; commits and pushes are complete; and the next session can restart from beads, AGENTS.md, and Agent Mail threads rather than from human memory.
A Flywheel session is only landable when a future swarm can pick it back up without the human re-explaining the project from scratch.
The Complete Toolchain
The Flywheel is supported by a stack of 11 purpose-built tools, all free and open-source:
Not every tool is used the same way. br, bv, ubs, and rch are ordinary shell commands. Agent Mail is primarily experienced through MCP tools and macros. The installer (agent-flywheel.com) installs all of them with a single curl|bash command.
The Flywheel Interactions
The complete interaction flow from spawn to memory:
```
NTM spawns agents
  --> Agents read AGENTS.md
  --> Agents register with Agent Mail
  --> Agents query bv for task priority
  --> Agents claim beads via br
  --> Agents reserve files via Agent Mail
  --> Agents implement and test
  --> UBS scans for bugs
  --> Agents commit and push
  --> CASS indexes the session
  --> CM distills procedural memory
  --> Next cycle is better
```
The VPS Environment
Use acfs newproj to bootstrap a project with full tooling:
```bash
acfs newproj myproject --interactive
# Creates:
# myproject/
# ├── .git/         # Git repository initialized
# ├── .beads/       # Local issue tracking (br)
# ├── .claude/      # Claude Code settings
# ├── AGENTS.md     # Instructions for AI agents
# └── .gitignore    # Standard ignores
```
The Incremental Onboarding Path
For beginners who find the full system overwhelming:
- 1Start with: Agent Mail + Beads (br) + Beads Viewer (bv) — this core trio captures most of the value
- 2Then add: UBS for bug hunting
- 3Then add: DCG for destructive command protection
- 4Then add: CASS for session history
- 5Then add: CM (CASS Memory) for codifying lessons into procedural memory
Scale Observations from Real Projects
| Project | Beads | Plan Lines | Agents | Time to MVP |
|---|---|---|---|---|
| CASS Memory System | 347+ | 5,500 | ~25 | ~5 hours |
| FrankenSQLite | Hundreds | Large spec | Many parallel | Multi-session |
| Frankensearch | 122+ (3 epics) | — | Multiple | Multi-session |
| Apollobot | 26 | — | Single session | 2-3 polish rounds |
Patterns That Work
- The "30 to 5 to 15" funnel: When generating ideas, having agents brainstorm 30 then winnow to 5 produces much better results than asking for 5 directly. The winnowing forces critical evaluation.
- Parallel subagents for bulk bead operations: Creating dozens of beads is faster when dispatched to parallel subagents, each handling a subset.
- Staggered agent starts: Starting agents 30-60 seconds apart avoids the thundering herd problem.
- One agent for git operations: Designating one agent to handle all commits prevents merge conflicts and produces coherent commit messages.
Anti-Patterns to Avoid
- Single-pass beads: First-draft beads are never optimal. Always do 4-5 polishing passes minimum.
- Skipping plan-to-bead validation: Not cross-referencing beads against the plan leads to missing features discovered only during implementation.
- Communication purgatory: Agents spending more time messaging each other than coding. Be proactive about starting work.
- Holding reservations too long: File reservations with long TTLs block other agents unnecessarily. Reserve, edit, commit, release.
- Not re-reading AGENTS.md after compaction: Context compaction loses nuances. The re-read is mandatory, not optional.
Supporting Infrastructure
The Skills Ecosystem
The term "skill" confuses people at first, so define it plainly: a skill is a reusable operational instruction pack for an agent. In Claude Code terms, that usually means a SKILL.md file plus optional references, scripts, or templates that tell the agent how to use a tool, how to execute a methodology, what pitfalls to avoid, and what a good result looks like. A good skill is closer to executable know-how than to ordinary prose documentation.
A tool changes what the agent can do. A skill changes how well the agent knows how to do it. The same model with and without a good skill often behaves like two different agents.
Every Flywheel tool has a corresponding Claude Code skill that encodes best practices and automates common workflows. Many of these skills are bundled directly in the repos for the tools themselves and get installed automatically when the tool is installed, which means users often benefit from them without having to think about "skill management" explicitly. There is also a broader public skills collection at GitHub.
The prompt side has a similar split. jeffreysprompts.com has a generous free section and is open source at GitHub. It also has a paid Pro tier with additional prompts and a dedicated CLI called jfp for managing prompt collections. For a larger paid library of higher-end skills, see jeffreys-skills.md, a $20/month service with many of the strongest curated skills, new skills added continuously, and a dedicated CLI called jsm for managing them.
Both paid offerings are still under active development. That means occasional rough edges. Active work is underway to fix issues quickly, feedback is appreciated, and refunds are available for unhappy users.
Skills provide the prompts, procedures, anti-pattern guidance, and tool-specific workflows directly to agents, which reduces the amount of bespoke prompting a human needs to do by hand.
Vendor Lock-In: Avoid It
Beads, Agent Mail, and bv are all CLI tools that work identically regardless of which agent invokes them. A Claude Code agent and a Codex agent and a Gemini agent can all call br ready --json and get the same task list. The practical test: could you swap out every Claude Code agent for Codex or Gemini without changing your AGENTS.md, beads, Agent Mail setup, or workflow? If yes, you're vendor-neutral.
Validation Gates
These gates turn the methodology into a contract. If a gate fails, drop back a phase instead of pushing forward optimistically.
Vibe Mode Aliases
On the VPS, agents run with full permissions via short aliases:
```bash
alias cc='NODE_OPTIONS="--max-old-space-size=32768" claude --dangerously-skip-permissions'
alias cod='codex --dangerously-bypass-approvals-and-sandbox'
alias gmi='gemini --yolo'
```
These are configured automatically by the installer. DCG provides the safety net that makes this viable.
Cost
~$500/month for Claude Max and GPT Pro subscriptions (at minimum), plus ~$50/month for a cloud server (OVH, Contabo). Multiple Max accounts may be needed for large swarms; CAAM enables instant switching when hitting rate limits. At scale, token usage for a single intensive session can reach ~20M input tokens, ~3.5M output tokens, ~2.6M reasoning tokens, and ~1.15 billion cached token reads. At full scale: 22 Claude Max accounts, 22 GPT Pro accounts, and 7 Gemini Ultra accounts.
The Flywheel Effect
If you simply use these tools, workflows, and prompts in the way just described, you can create really incredible software in just a couple of days, sometimes in just one day. I've done it a bunch of times now and it really does work, as crazy as that may sound. You can see my GitHub profile for proof of this. It looks like the output from a team of 100+ developers.
It behaves like a flywheel rather than a checklist because each cycle makes the next one better:
- Planning quality compounds because you keep reusing prompts, patterns, and reasoning structures that CASS proves actually worked.
- Execution quality compounds because better beads make swarm behavior more deterministic and less dependent on human improvisation.
- Tool quality compounds because agents use the tools, complain about them, and then help improve them.
- Memory compounds because the results of one swarm, captured by CASS session search, become training data, rituals, and infrastructure for the next one.
How the Compounding Actually Works
Each session makes the next one better. Concretely: Session N produces raw data — CASS automatically logs every agent session. Between sessions, CM distills patterns — running cm reflect extracts procedural rules like "always run cargo check after modifying Cargo.toml" with confidence scores that decay without reinforcement and amplify with repetition. Session N+1 starts with those patterns loaded — running cm context "Building an API" retrieves relevant procedural memory. Simultaneously, UBS patterns grow as new bug classes get added. Agent Mail coordination norms get refined in AGENTS.md and skills.
The compounding is real but not automatic in the early stages. You have to actually run cm reflect, actually review CASS session data, actually update AGENTS.md with lessons learned. But even manually, spending 15 minutes between projects reviewing what worked and updating your AGENTS.md template produces outsized returns on every subsequent project.
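A sketch of that 15-minute ritual using only commands shown elsewhere in this guide; the search term and workspace path are placeholders:

```bash
# Between-session ritual: distill lessons, then fold them back into the artifacts
cm reflect                                                 # update procedural memory from recent sessions
cass search "AGENTS.md" --workspace /data/projects/PROJECT --limit 50   # skim recurring prompts and pain points
"$EDITOR" AGENTS.md                                        # fold the lessons into the template by hand
```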
Agent Feedback Forms
Apply the same feedback mechanisms you would use for humans (structured surveys, satisfaction ratings, net promoter scores) directly to agents evaluating tools. After an agent finishes using a tool in a real project, ask it to fill out a structured feedback survey. Then pipe that feedback directly into another agent working on the tool itself. The iteration cycle collapses from weeks to minutes.
After an agent finishes using a tool in a real project
Many of the same concepts we use for people are directly applicable to agents. 'By robots, for robots.' This produces structured, actionable feedback. When used across multiple agents on different project types, you get a diverse sample of experiences. One caveat: as with humans who would tell you they wanted a 'faster horse' instead of a car, it is dangerous to trust agent feedback about potential new features before those features exist. The real test is always after implementation, in real-world usage.
CASS Memory: Three-Layer Architecture
CM (CASS Memory System) implements a three-layer memory architecture that turns raw session history into operational knowledge:
```
EPISODIC MEMORY (cass): Raw session logs from all agents
        ↓ cass search
WORKING MEMORY (Diary): Structured session summaries
        ↓ reflect + curate
PROCEDURAL MEMORY (Playbook): Distilled rules with confidence scores
```
Rules have a 90-day confidence half-life (decays without feedback) and a 4x harmful multiplier (one mistake counts 4x as much as one success). Rules mature through stages: candidate to established to proven.
cm context "Building an API" --json # Get relevant memories for a taskcm recall "authentication patterns" # Search past sessionscm reflect # Update procedural memory from recent sessionscm mark b-8f3a2c --helpful # Reinforce a useful rulecm mark b-xyz789 --harmful --reason "Caused regression" # Flag a bad rule
The cm context command is the single most important pre-task ritual. Running it at the start of a session gives agents knowledge distilled from every previous session that touched similar work.
Meta-Skill: Skill Refinement via CASS Mining
Claude Code, targeting any skill with 10+ CASS sessions of usage data
This is the meta-skill pattern in action. The skill-refiner skill itself can be refined using its own session data, which is the self-referential property that makes the whole system accelerate. After 3-4 cycles, the skill is dramatically more reliable than the original. Each cycle takes less human effort because the meta-skill itself has improved.
CASS Ritual Detection
The flywheel's learning loop depends on mining past sessions to find what actually works. CASS enables ritual detection: discovering prompts that are repeated so frequently they constitute validated methodology.
The mining query (user prompts live at lines 1-3 of session entries; --fields minimal reduces output 5x):
cass search "*" --workspace /data/projects/PROJECT --json --fields minimal --limit 500 \| jq '[.hits[] | select(.line_number <= 3) | .title[0:80]]| group_by(.) | map({prompt: .[0], count: length})| sort_by(-.count) | map(select(.count >= 5)) | .[0:30]'
This is how the prompt library in this guide was originally discovered and validated. It was not invented top-down; it was mined bottom-up from hundreds of real sessions.
Why It Works: Layered Context
The workflow works because it keeps different kinds of context in different layers. The markdown plan holds whole-system intent and reasoning. The beads hold executable task structure and embedded local context. AGENTS.md holds operating rules and tool knowledge that must survive compaction. The codebase holds the implementation itself, which is too large to be the primary planning medium. Each layer serves a different purpose, and the methodology is disciplined about keeping the right information in the right layer.
The Kernel: 9 Invariants
- 1Global reasoning belongs in plan space. Do the hardest architectural and product reasoning while the whole project still fits in context.
- 2The markdown plan must be comprehensive before coding starts. Skeleton-first coding throws away the main advantage of frontier models.
- 3Plan-to-beads is a distinct translation problem. A good plan does not automatically produce a good bead graph.
- 4Beads are the execution substrate. Once good enough, they should carry enough context that agents no longer need the full plan.
- 5Convergence matters more than first drafts. Plans and beads both improve through repeated polishing until changes become small and corrective.
- 6Swarm agents are fungible. Coordination must live in artifacts and tools, not in special agents or unstated knowledge.
- 7Coordination must survive crashes and compaction. AGENTS.md, Agent Mail, bead state, and robot modes exist to keep work moving when sessions die.
- 8Session history is part of the system. Repeated prompts, failures, and recoveries should be mined via CASS and folded back into tools, skills, and validators.
- 9Implementation is not the finish line. Review, testing, UBS, and feedback-to-infrastructure loops are part of the core method.
Time Investment
CASS itself (a complex Rust program used by thousands of people) was made in around a week, but the human personally only spent a few hours on it. The rest of the time was spent by a swarm of agents implementing and polishing it and writing tests.
The Project Is a Foregone Conclusion
This claim sounds bold, but it follows logically from everything above. If the plan is thorough, the beads faithfully encode it with full context and correct dependencies, and the agents have a clear AGENTS.md, then implementation becomes a mechanical process of agents picking up beads, implementing them, reviewing, and moving on.
This is true when: the plan has genuinely converged (not merely become long), the beads are self-contained enough that fresh agents can execute them without guessing, the swarm has working coordination/review/testing loops, and the human is still tending when flow jams or reality diverges from the plan.
It stops being true when: architecture is still being invented during implementation, the bead graph is thin or missing dependencies, or the swarm cannot coordinate because AGENTS.md, Agent Mail, or bv usage is weak. If you find yourself doing heavy cognitive work during implementation, that is a signal that planning or bead polishing was insufficient. The remedy is to pause, go back to bead space, and add the missing detail.
V1 Is Not Everything
A common misconception is that you have to do everything in one shot. In this approach, that's true only for version 1. Once you have a functioning v1, adding new features follows the same process: create a super detailed markdown plan for the new feature, turn it into beads, and implement. The same process that creates the initial version also handles all subsequent iterations.
Tools Must Be Agent-First
Every tool ships with a prepared AGENTS.md blurb. The tool is not complete without documentation that agents can consume. But it goes further: the tools themselves should be designed by agents, for agents, with iterative feedback. If agents do not like the tools, they will not use them without constant nagging.
Recursive Self-Improvement: The Meta-Skill Pattern
This is the most advanced concept in the flywheel, and the one that separates linear productivity gains from exponential ones. The core idea: your agent toolchain should improve itself using its own output as fuel.
Most developers treat skills and tools as static artifacts. You write a skill, agents use it, and if it works well enough, you move on. The recursive approach instead treats every agent session as training data for the next version of the skill, creating a tight feedback loop where the system gets measurably better each cycle without additional human effort.
Consider what happens without recursive improvement. You build a CLI tool, write a Claude Code skill for it, and deploy both. Agents use the tool, but they misinterpret certain flags, forget to pass required arguments, or use workarounds because the skill's instructions were ambiguous. Every agent that hits the same snag wastes the same tokens re-discovering the same workaround. Multiply that across dozens of agents and hundreds of sessions, and the waste is enormous. Now consider the alternative: after those sessions happen, you automatically mine them, discover the failure patterns, rewrite the skill to prevent them, and the next wave of agents never hits those snags at all. The skill becomes a living document shaped by real usage rather than a guess about how agents will behave.
How to Actually Do This (Step by Step)
- 1Build the baseline. Create a tool. Create a skill for it using `sc` (skill creator). The first version of the skill will be imperfect. Ship it anyway.
- 2Let agents use it in real work. Do not test in isolation. Deploy the tool and skill into actual project work where agents are implementing beads, running reviews, doing real tasks. CASS automatically logs every session.
- 3Mine the sessions. After 10+ sessions of real usage, search CASS for sessions where agents invoked the tool.
- 4Feed findings into a rewrite. Give the session analysis to a fresh agent along with the current skill file. Ask it to rewrite the skill to fix every issue.
- 5Repeat. The revised skill produces better sessions, which give you better data for the next revision. After 3-4 cycles, the skill is dramatically more reliable than the original.
```bash
# Step 3: Mine sessions for patterns
cass search "tool_name" --workspace /data/projects/PROJECT --json --limit 100
```
What to look for in the results:
- Clarifying questions: agents asking "do you mean X or Y?" means the skill was ambiguous
- Repeated mistakes across different agents: a systematic gap in the skill's instructions
- Creative workarounds: agents inventing their own approach means the skill is missing a useful pattern
- Outright failures: the skill directed agents to do something wrong or impossible
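Step 4 can then be a simple handoff: capture the step-3 search output and give it, together with the current skill file, to a fresh agent (a sketch; the temp path is arbitrary):

```bash
# Step 4 (sketch): package the mined sessions for a fresh agent
cass search "tool_name" --workspace /data/projects/PROJECT --json --limit 100 > /tmp/tool_sessions.json
# In a new agent session, attach /tmp/tool_sessions.json and the current SKILL.md,
# and ask for a rewrite that addresses every recurring failure pattern found above.
```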
The key insight is that the rewriting step itself can be a skill. You can write a meta-skill whose entire purpose is: take a skill file + CASS session data as input, produce a better skill file as output. Then the meta-skill can also be refined using its own session data, which is the self-referential property that makes the whole system accelerate.
Each cycle takes less human effort than the previous one because the meta-skill itself has improved.
The Four Layers of Recursive Improvement
The recursive pattern operates at increasing levels of ambition. The mistake is trying to build all four layers at once. Start simple and let the need for the next layer emerge naturally.
Each layer amplifies the next. Start at Layer 1, not Layer 4.
Layer 1 requires zero infrastructure. You can do this today, right now, with any tool and two agent sessions.
- 1Layer 1: Feedback forms after tool use (start here, no infrastructure needed). After an agent finishes using a tool, ask it to fill out a structured feedback survey. Feed that to another agent working on the tool itself. This requires nothing beyond two agent sessions and produces immediate improvements.
- 2Layer 2: CASS-powered skill refinement (requires session logging). Instead of relying on one agent's opinion, mine session logs to find systematic patterns across many agents. An agent using a tool for the first time might blame itself for a confusing flag; when you see 15 agents all struggling with the same flag, you know the flag is the problem.
- 3Layer 3: Skills that generate work (the system proposes its own improvements). The idea-wizard skill examines a project and generates improvement ideas. The optimization skill finds performance bottlenecks. These skills create new beads, which agents implement, which improve the tools, which make the skills more effective. The human's role shifts from directing specific work to curating which generated ideas are worth pursuing.
- 4Layer 4: Skills bundled with tool installers (the skill improves before the user ever sees it). Every tool you ship includes a pre-optimized Claude Code skill baked into its installer. The skill was refined through multiple CASS cycles before shipping. When a new user installs the tool, their agents immediately benefit from all the refinement work done across every previous user's sessions.
Why the Acceleration Compounds
Most productivity techniques produce linear improvements: you get 10% better each cycle, and those gains do not stack. The recursive skill pattern compounds because each cycle improves the tools that perform the next cycle.
When you improve the extreme-optimization skill, every future optimization pass across every tool benefits. When you improve the idea-wizard skill, every future brainstorming session across every project benefits. When you improve the skill-refiner meta-skill, every future skill refinement benefits. The improvements multiply rather than add.
The tools produced by the recursive loop are the tools that produce the next tools. This is why the Knuth analogy is apt: it is genuinely the same concept as a compiler that compiles itself, except applied to the entire agent-driven development workflow rather than just a compiler.
The Hidden Knowledge Extraction
The recursive loop has a second, subtler benefit that matters even more at the frontier. Models have internalized vast amounts of academic CS literature: obscure algorithmic techniques, mathematical proofs, design patterns from papers that only a handful of people ever read. Most of this knowledge never surfaces because nobody asks for it with enough precision.
Skills are the mechanism for asking the right questions. Consider the difference:
- Without a skill: "Optimize this function." The agent applies generic improvements like caching, loop unrolling, or reducing allocations. Useful but shallow.
- With an extreme-optimization skill: The skill directs the agent to systematically consider cache-oblivious data structures, SIMD vectorization opportunities, branch-free arithmetic, van Emde Boas layout, fractional cascading, and carry-less multiplication, then benchmark before and after each change. The agent draws on deep knowledge it would not volunteer unprompted.
The skill acts as a key that unlocks specific rooms in the model's knowledge base. Without the key, the model defaults to common patterns. With it, the model reaches into the long tail of techniques that most human developers have never encountered.
This explains why the recursive loop accelerates rather than plateaus. Each cycle of skill refinement does not just fix bugs in the skill's instructions; it also sharpens the skill's ability to extract deeper knowledge from the model. The optimization skill gets better at asking for the right techniques, because CASS sessions reveal which techniques actually produced measurable gains and which were dead ends. The next cycle of optimization is better informed because the previous cycle's results are now part of the feedback corpus.
Stack enough cycles and the result is code that looks like it was written by someone who read every obscure CS paper ever published. In a functional sense, it was. The agent served as a lens focusing decades of dispersed academic knowledge onto a single practical target. The skill was the lens prescription.
The Operator Library
These recurring cognitive moves show up throughout real Flywheel sessions. They matter more than any single prompt because they say when to apply a move, what failure looks like, and what output is expected. Each operator has a prompt module you can paste directly into an agent session.
When the project still fits in a plan but would explode in size once implemented, multiple architectural paths are plausible, or the desired user workflow is still fuzzy
Prevents skeleton-first coding that locks in bad boundaries, and stops local code exploration from substituting for product reasoning.
When the project is important enough that one model's biases are dangerous, or early drafts feel plausible but not obviously excellent
Prevents picking the first decent plan and calling it done, or combining every idea indiscriminately instead of filtering for quality.
When review output looks too short or self-satisfied, or a large plan/bead graph still feels under-audited
Prevents asking for 'all problems' and getting a shallow pass. Models stop after finding a 'reasonable' number; this forces exhaustive search.
When a large plan is about to be turned into execution tasks, or agents are creating beads quickly and may drop rationale
A beautiful plan does not automatically produce good beads. Prevents creating terse beads that depend on tacit knowledge from the markdown file.
When a plan or bead graph has visible rough edges and the first polishing pass found real issues
Prevents treating the first decent revision as final, or continuing endless polishing after returns have gone flat.
When the agent has done several long review rounds and suggestions are getting repetitive or shallow
Prevents trusting a tired context window to keep finding subtle flaws, or mistaking context exhaustion for genuine convergence.
When beads are polished enough to execute and multiple agents are about to work in the same repository
Prevents launching too early before beads are self-contained, or letting agent identity and role specialization become load-bearing.
When the same confusion or recovery pattern appears repeatedly in CASS, agents complain about a tool, or a project finishes with clear lessons worth retaining
Prevents treating lessons as anecdotes instead of durable system inputs. This is the operator that turns repeated behavior into ritual, ritual into skill, skill into infrastructure.
The Prompt Library
All prompts in this guide are preserved verbatim from prompts that worked well in real sessions (quirks and typos included). For a much larger public prompt collection, see jeffreysprompts.com, which has a generous free section and is open source at GitHub. There is also a paid Pro tier with additional prompts and a CLI called jfp for managing prompt collections. For a larger paid library of higher-end skills, see jeffreys-skills.md ($20/month, with a dedicated CLI called jsm). Both paid offerings are still under active development.
Common Problems from Real Deployments
- Agent Mail CLI availability: Sometimes the binary is not at the expected path; agents fall back to REST API calls.
- Context window exhaustion: Agents typically manage 2-3 polishing passes before needing a fresh session.
- Duplicate beads at scale: Large bead sets (100+) develop duplicates; dedicated dedup passes are necessary.
- Plan-bead gap: The synthesis step sometimes stalls between plan revision and bead creation; always explicitly transition.
Getting Started
The complete system is free and 100% open-source. A beginner with a credit card and a laptop can visit the wizard, follow step-by-step instructions to rent a VPS, paste one curl|bash command, type onboard, and start building with AI agents immediately.
```bash
# 1. Rent a VPS (OVH or Contabo, ~$40-56/month, Ubuntu)
# 2. SSH in and run the one-liner
curl -fsSL https://agent-flywheel.com/install.sh | bash

# 3. Reconnect, then learn the workflow
onboard

# 4. Create your first project
acfs newproj my-first-project --interactive

# 5. Spawn agents and start building
ntm spawn my-first-project --cc=2 --cod=1 --gmi=1
```
You don't even need to know much at all about computers; you just need the desire to learn and some grit and determination. And about $500/month for the subscriptions, plus another $50 or so for the cloud server.
Once you get Claude Code up and running on the cloud server, you basically have an ultra competent friend who can help you with any other problems you encounter. And Jeffrey will personally answer your questions if you reach out on X or on GitHub issues.
If you want to change the entire direction of your life, it has truly never been easier. If you think you might want to do it, I really recommend just immersing yourself.
Get the Flywheel Stack
One command installs all 11 tools, three AI coding agents, and the complete environment.
30 minutes to fully configured.