r/SpecDrivenDevelopment

▲ 4 r/SpecDrivenDevelopment+1 crossposts

Sharing my Claude-driven development workflow, feedback appreciated

If you've done TDD with Claude Code you've probably hit this: you ask it to write tests and the implementation, and it quietly writes tests that just describe what the code already does. Everything goes green and you've learned nothing. The tests are worthless.

I got tired of babysitting that, so I built the whole thing into custom slash commands and PreToolUse hooks. The idea is simple: the session that writes the tests is literally not allowed to look at the implementation. A hook blocks Read, Grep, and Bash from touching the domain folder while I'm on the test branch. It can't peek even if it wanted to.
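The post offers to share the hook scripts but doesn't include them, so here is a minimal sketch of a PreToolUse hook along these lines. It assumes Claude Code's hook contract as I understand it: the event arrives as JSON on stdin with `tool_name` and `tool_input`, and exit code 2 blocks the call and shows stderr to Claude. The `src/domain` path, the `test/<slug>-red` branch check, and the `hook` argument are illustrative assumptions, and the Bash check is a substring match, not a sandbox.

```python
import json
import subprocess
import sys

# Assumption for illustration: domain code lives under src/domain.
DOMAIN_DIR = "src/domain"


def is_red_branch(branch: str) -> bool:
    """Match the post's test/<slug>-red branch naming."""
    return branch.startswith("test/") and branch.endswith("-red")


def path_hits_domain(target: str) -> bool:
    """True if a path is inside DOMAIN_DIR or is a relative parent of it."""
    if target in ("", "."):
        return True  # no path means the whole working tree, domain included
    trimmed = target.strip("/")
    inside = f"/{DOMAIN_DIR}/" in f"/{trimmed}/"
    parent = f"{DOMAIN_DIR}/".startswith(f"{trimmed}/")
    return inside or parent


def should_block(tool_name: str, tool_input: dict, branch: str) -> bool:
    if not is_red_branch(branch):
        return False
    if tool_name == "Read":
        return path_hits_domain(tool_input.get("file_path", ""))
    if tool_name == "Grep":
        return path_hits_domain(tool_input.get("path", ""))
    if tool_name == "Bash":
        # Crude substring match: `cd src && cat domain/x.py` would slip through.
        return DOMAIN_DIR in tool_input.get("command", "")
    return False


def main() -> int:
    event = json.load(sys.stdin)  # hook payload arrives as JSON on stdin
    branch = subprocess.run(
        ["git", "branch", "--show-current"], capture_output=True, text=True
    ).stdout.strip()
    if should_block(event.get("tool_name", ""), event.get("tool_input", {}), branch):
        print(f"Blocked: the test-writing session may not look at {DOMAIN_DIR}.", file=sys.stderr)
        return 2  # exit code 2 blocks the tool call and feeds stderr back to Claude
    return 0


if __name__ == "__main__" and sys.argv[1:] == ["hook"]:
    sys.exit(main())
```

To wire it up you would register the script under `PreToolUse` with a matcher such as `Read|Grep|Bash` and a command like `python3 block_domain.py hook` (hypothetical filename); check the current hooks documentation for the exact settings shape.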

Here's how a feature moves through it, one slash command per step:

- `/draft-spec` writes a frozen spec for one slice (one slice = one PR)
- `/write-tests` writes failing tests from that spec only, on a `test/<slug>-red` branch, hook-blocked from the code
- `/clear`, then `/implement-domain` picks up the red branch in a fresh session and builds until green. It's not allowed to edit the tests to make them pass.
- `/ship` runs a scope-aware pre-PR gate and opens the PR
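"Not allowed to edit the tests" is easier to trust when it is checked rather than requested. A minimal sketch, assuming the tests live under one directory: hash every test file when the red branch is cut, then list anything added, deleted, or changed before shipping. The function names are hypothetical; in a git repo, `git diff --name-only <red-branch> -- <test-dir>` answers the same question.

```python
import hashlib
from pathlib import Path


def snapshot(test_dir: Path) -> dict[str, str]:
    """Map every file under test_dir (relative path) to the SHA-256 of its bytes."""
    return {
        p.relative_to(test_dir).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(test_dir.rglob("*"))
        if p.is_file()
    }


def changed_tests(red: dict[str, str], now: dict[str, str]) -> list[str]:
    """Test files added, deleted, or modified since the red snapshot."""
    return sorted(name for name in red.keys() | now.keys() if red.get(name) != now.get(name))
```

An empty list means the implementation session reached green without touching the tests.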

The `/clear` between writing tests and writing code is the part that actually matters. Same-session test-and-code is exactly where the bias sneaks back in, so I force a hard context reset between them. `/build-feature` ties it all together and just tells me the next command to run based on the repo state, so I'm never guessing which step I'm on.
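A `/build-feature`-style router can be little more than an ordered series of checks on repo state. This sketch assumes four hypothetical boolean signals (spec exists, red branch exists, suite is green, PR is open); how the real command detects them isn't described in the post.

```python
from dataclasses import dataclass


@dataclass
class RepoState:
    has_spec: bool           # a frozen spec exists for the current slice
    red_branch_exists: bool  # test/<slug>-red has been created
    tests_passing: bool      # the suite is green on the working branch
    pr_open: bool            # a PR has been opened for the slice


def next_command(s: RepoState) -> str:
    """Return the next slash command, checking steps in workflow order."""
    if not s.has_spec:
        return "/draft-spec"
    if not s.red_branch_exists:
        return "/write-tests"
    if not s.tests_passing:
        return "/clear, then /implement-domain"
    if not s.pr_open:
        return "/ship"
    return "done: waiting on review"
```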

Been running real backend work through this for a while now and it's held up better than I expected.

What I'm genuinely curious about from this community:

- Is the "hook-block the test writer" thing clever or am I overengineering it?
- Anyone else leaning this hard on PreToolUse hooks to enforce process, not just permissions?
- Where does this fall apart as the project gets bigger?

Happy to share the hook scripts if people want them.


u/xmen-nowomen — 14 hours ago

▲ 6 r/SpecDrivenDevelopment

A specification language that tells you you're wrong

I do all my coding with agents now, and I'm not going back. But it took me a while to work out what I missed about it.

When you write code yourself, you get scolded a lot (by the compiler, by the test suite, or by someone reviewing your PR). It always felt annoying, but one good thing was that it told you quickly when you hadn't thought something through enough.

Spec-driven tooling generally doesn't do that. AI agents usually fill gaps with a guess, and the bugs take up residence in the implementation.

So a few months ago I started building Allium. It's a small spec language for writing down what the software is meant to do. An optional CLI runs checks against the spec and pushes back while it's still easy to change your mind. It's wrapped by AI skills, so you still don't write any code (not even the spec code).

A colleague added looping recently, so it can keep running the check-and-fix loop on its own until the code and the spec line up.

If this sounds like it might be useful to you, I'd love you to give it a go. Constructive feedback enormously appreciated!

Link: https://allium-lang.org

u/hendroid — 3 days ago

▲ 23 r/SpecDrivenDevelopment+1 crossposts

My Claude Code agents kept saying "done, all tests passing" on apps where the login button did nothing. So I made them prove it.

A few months back I released production-grade, a free plugin that turns Claude Code into a 14-agent pipeline: PM, architect, backend, frontend, QA, security, the whole crew. It got some love here, then I went quiet because I was using it myself and kept hitting the wall everyone hits with agents.

The agent finishes, prints a lovely summary, claims every test passes. You open the app and half the buttons are decorative. My pipeline was one pass: each agent does its job, hands the work down, done. Real engineering is not one pass. It's loops. Write, run, fail, fix, run again.

v5.5 is one idea applied everywhere. An agent is not done when it says it's done. It's done when a check it cannot argue with says so. I'm calling it the Loop Engine.

Concretely:

The pipeline generates a fast check script for your project (typecheck + lint, under 15 seconds), and a hook runs it after every single file edit. Claude breaks something, the error lands back in its face immediately. Not at the end. Every edit.
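A sketch of the fast post-edit check, assuming the check is a list of commands (for a TypeScript project, something like `npx tsc --noEmit` followed by `npx eslint .`) run under one shared 15-second budget. Stopping at the first failure keeps the feedback short. In a Claude Code PostToolUse hook, a failure would typically be printed to stderr with exit code 2 so the model sees it; that wiring is an assumption and is not shown.

```python
import subprocess
import time


def run_fast_check(commands: list[list[str]], budget_s: float = 15.0) -> tuple[bool, str]:
    """Run check commands in order within a total time budget.

    Returns (ok, output). The first failing or timed-out command stops the run,
    and its output is returned so it can be shown to the agent immediately.
    """
    deadline = time.monotonic() + budget_s
    for cmd in commands:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False, f"time budget of {budget_s}s exhausted before {cmd[0]}"
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=remaining)
        except subprocess.TimeoutExpired:
            return False, f"{' '.join(cmd)} exceeded the {budget_s}s budget"
        if result.returncode != 0:
            return False, result.stdout + result.stderr
    return True, ""
```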

Coding agents are banned from touching tests. QA writes failing tests first and owns the test folder. If a coder adds .skip or loosens an assertion to get green, that gets flagged as a critical finding. No more grading their own homework.

"12 test files written" stopped being an acceptable QA report. Suites have to actually run: executed, passing, failing.

Before the final gate, a separate agent boots your real app and drives it like a user. Every button clicked, every form submitted, every link followed. A button that renders but does nothing is a critical bug. This feature exists because I was tired of being my own QA department at 1am.

Loops stop on evidence, not vibes. Each one tracks a number (failing tests, open findings) and stops when it hits zero or stops improving, then shows you the trend, like 7 to 3 to 3, and asks. No retrying the same fix five times.
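The stopping rule can be written as a pure function of the tracked number. A minimal sketch, assuming "stops improving" means a single iteration without a decrease; the plugin's actual threshold isn't stated.

```python
def loop_decision(history: list[int]) -> str:
    """Decide whether a fix loop continues, based only on the tracked metric.

    history holds one measurement per iteration (e.g. failing tests), newest last.
    """
    if not history:
        return "continue"
    if history[-1] == 0:
        return "stop: converged"
    if len(history) >= 2 and history[-1] >= history[-2]:
        trend = " to ".join(str(n) for n in history)
        return f"stop: not improving ({trend}), ask the human"
    return "continue"
```

With the post's example trend, `loop_decision([7, 3, 3])` returns `"stop: not improving (7 to 3 to 3), ask the human"`.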

And when Claude hits something weird with no existing check, it can build its own loop, with one rule: first create a check it can run, like a failing repro script. No check, no loop.
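"No check, no loop" translates naturally into preconditions: refuse to start without a runnable check, and refuse if the check already passes, since a repro that doesn't fail isn't reproducing anything. The `check` and `fix` callables are hypothetical stand-ins for running a repro script and letting the agent attempt a fix.

```python
from typing import Callable, Optional


def run_custom_loop(
    check: Optional[Callable[[], int]],
    fix: Callable[[], None],
    max_fixes: int = 5,
) -> list[int]:
    """Run fix/check iterations against a runnable check; return the metric history."""
    if check is None:
        raise ValueError("no check, no loop: write a runnable failing repro first")
    history = [check()]
    if history[0] == 0:
        raise ValueError("the repro already passes, so it does not capture the problem")
    for _ in range(max_fixes):
        fix()
        history.append(check())
        if history[-1] == 0 or history[-1] >= history[-2]:
            break  # converged, or stopped improving
    return history
```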

Still free, MIT. Gates, receipts, and worktree isolation for parallel agents are unchanged.

Install:
/plugin marketplace add nagisanzenin/claude-code-plugins
/plugin install production-grade@nagisanzenin

Repo: github.com/nagisanzenin/claude-code-production-grade-plugin
Theory writeup lives in docs/LOOPS.md if you want the reasoning.

Honest caveat: this shipped this week. The edit-hook enforcement is real code, the rest is protocol the agents follow, and I want to see where they drift in the wild. If an agent cheats the rules or a loop converges badly on your project, I want the transcript.


u/No_Skill_8393 — 3 days ago

▲ 1 r/SpecDrivenDevelopment

SAP — Spec-Driven Architectural Pipeline (A Deep Rethink Based on superpowers · agent + skill + rule three-layer architecture)

> As models grow stronger, the real challenge isn't "can it be done" anymore; it's "can it be done reliably every time." SAP uses three layers of constraints to converge LLM randomness into reproducible engineering delivery, at an acceptable cost.
>
> Brainstorm-first · Spec-driven · Atomic execution · Five-layer verification
>
> GitHub: https://github.com/cocacocca/sap

superpowers Is a Giant. SAP Stands on Its Shoulders.

superpowers did something remarkable — it proved AI coding agents can follow structured workflows: brainstorm-first, worktree isolation, TDD red-green cycles, subagent-driven execution.

I've used superpowers extensively. I deeply respect its design philosophy.

But after deep use, I formed one core judgment:

> The stronger the model, the stronger the constraints must be.

This isn't a slogan. Let me explain how I arrived at this.

Core Hypothesis: Why "Stronger Models Need Stronger Constraints"

What Happens When Models Get Stronger

When models were weak, constraints had to be light, leaving the model room to maneuver. superpowers keeps skills under 200 lines for exactly this reason: with limited model capacity, heavy constraints would stifle it.

When my primary model moved to a standard 1M-token context window, the situation reversed:

First change: degrees of freedom explode. The model can process far more information simultaneously, meaning it can "improvise" in far more directions. Within a 1M window, the model can simultaneously consider 10 implementation approaches, 5 architecture styles, 3 naming conventions.

Second change: randomness amplifies. High freedom → different reasoning paths each time → different output. Same request, asked twice, yields two code styles. Asked ten times, ten variations.

Third change: non-reproducibility. High randomness → unpredictable output → cannot reproduce. You ask the model to run the same flow again, it produces entirely different results. During code review you notice "last time it wasn't written this way," but how it was written last time is already lost.

The causal chain:

Model gets stronger → Freedom increases → Randomness increases → Output becomes uncontrollable → Non-reproducible
     ↑
Strong model + weak constraints = different output every time, quality depends on luck

Why Constraints Lock Down Randomness

Constraints don't limit model capability — they limit model degrees of freedom.

Agent persona constraint: "You are backend craftsman, you don't write frontend" — eliminates the model's freedom to improvise toward frontend
Skill workflow constraint: "Five-layer construction: types → data → logic → interface → cross-cutting" — eliminates the model's freedom to choose arbitrary architectures
Rule project constraint: "Use MySQL + snake_case + soft delete" — eliminates the model's freedom to choose databases and naming

After three layers stack, the model's freedom is compressed into a narrow but deep channel:

No constraints:     model freedom ████████████████████ → extreme randomness
One layer:          model freedom ████████████ → moderate randomness
Two layers:         model freedom ████████ → low randomness
Three layers:       model freedom ████ → minimal randomness, approaching deterministic fit

Compressed freedom ≠ compressed capability. The model's reasoning power, code generation ability — these don't change. Only its "improvisation space" narrows to a channel with higher determinism. In this channel, it still thinks deeply, but the direction of thinking is locked onto "the correct track."

Validated in Practice

With three layers of constraints, whether I use a strong model (GLM 5.2) or a slightly weaker one, output stays on track — three-layer constraint stacking locks randomness within acceptable bounds, approaching deterministic fit.

What superpowers Does vs What SAP Changes

| Dimension | superpowers | SAP | Why |
|---|---|---|---|
| Architecture | Single layer: skill | Three layers: agent (persona) + skill (workflow) + rule (constraints) | Stronger models need sharper role separation |
| Code review | Generic reviewer | GAN discriminator: a different model cross-reviews | The same model self-defends; different models complement blind spots |
| Project rules | None (skill-embedded discipline) | Explicit rule layer | No constraints → style drift, knowledge doesn't accumulate |
| Skill size | <200 lines (small context constraint) | 500-800 lines (1M context allows) | Strong models handle complete checklists |
| Skill count | ~14 | Designed by workflow, no upper limit | Add what's missing, redesign what's unsatisfactory |

Cost & Efficiency: Why You Don't Need Top-Tier Models

First, Why "Best Below the Best" Can Match the Best

A clarification: GLM 5.2 and Kimi Code 2.7 are the best below the best — in their respective domains (code generation / frontend), they are themselves top-tier, and the gap with Claude / Codex is a gradient, not a cliff.

According to Artificial Analysis coding-index:

GLM 5.2 closely trails top-tier models in code generation accuracy
Kimi Code 2.7 excels in frontend/UI scenarios
DeepSeek-V4-Flash has unique advantages in reasoning chain depth

The gap between them and Claude / Codex is not a cliff — it's a shrinking gradient.

So the question becomes: when the gap is already small, what determines final output quality?

The answer: process discipline.

A feature from requirement to delivery passes through fixed phases: brainstorm → spec → design → decompose → implement → review → document. Each phase has clear inputs, outputs, and check criteria.

Top-tier models excel at "intuition" — they make correct choices under weak constraints. But intuition is unreliable (high randomness) and expensive.

GLM 5.2 / Kimi Code 2.7 / DeepSeek-V4-Flash — these "best below the best" — excel at "execution": their coding ability is already strong, they just need clear processes and checklists to produce high-quality output stably. And they're affordable.

What SAP's three-layer constraints do: replace model intuition with process discipline.

| Phase | Top-tier model relies on | SAP relies on (GLM 5.2 / Kimi / DeepSeek + three layers) |
|---|---|---|
| Brainstorm | Model's own reasoning power | brainstorming skill's structured frameworks (5W2H / fishbone / SCQA) guide reasoning |
| Specification | Model "knows" what to write | spec-writing skill's checklists pin down output item by item |
| Implementation | Model "intuits" correct architecture | backend-implementation skill's five-layer + gate self-checks |
| Code review | Model "spots" issues | Model heterogeneity GAN review: different models complement blind spots |

Conclusion: when a model's own capability is already strong enough, what determines output quality isn't "use a stronger model" — it's "give a strong enough model sufficient process discipline." Process discipline + model heterogeneity ≈ top-tier model intuition, at 1/4 the cost.

Test Data

Primary model combo: GLM 5.2 + Kimi Code 2.7 + DeepSeek-V4-Flash + LongCat-2.0 (Meituan). Tested on single medium-to-large feature development.

A complete /sap run (including brainstorming, discussion, self-review, doc collaboration — all phases) consumes approximately 3-5 million tokens (amortized), producing a complete planning package.

Monthly Cost (China Coding Plans)

| Model | Plan | Monthly Cost |
|---|---|---|
| GLM Max | ~4B tokens/month | ¥375.2 (~$52) |
| Kimi Code | ~1-2B tokens/month | ¥149 (~$21) |
| DeepSeek / LongCat | Pay-per-use | approx. ¥50 (~$7) |
| Total | | ~¥600/month (<$100) |

Without model heterogeneity (single GLM Max handles everything), monthly cost drops to ~$50.
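Adding up the table's rows shows where the total comes from: the itemized figures come to about ¥574 (~$80), which the post rounds to ~¥600 and <$100. The DeepSeek/LongCat line is the post's own pay-per-use estimate, not a plan price.

```python
# Monthly costs as listed in the table: (CNY, the post's own USD estimate).
monthly = {
    "GLM Max": (375.2, 52),
    "Kimi Code": (149.0, 21),
    "DeepSeek / LongCat (pay-per-use, approx.)": (50.0, 7),
}

total_cny = sum(cny for cny, _ in monthly.values())
total_usd = sum(usd for _, usd in monthly.values())
print(f"itemized total: CNY {total_cny:.1f}/month (~${total_usd})")
```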

Cost Gap vs Top-Tier Models

Claude Code and Codex do have coding plans (subscriptions), but their plans scale with usage — heavy development can easily hit $200-$500/month. A single medium-to-large feature running the full SAP workflow (3-5 million tokens) costs 3-5x more with top-tier models compared to the Chinese model combo.

About Proxy Multiplier Rates

API proxies make top-tier models more accessible at lower cost — this is a good thing. It lowers the barrier and benefits more developers.

But there's an issue worth paying attention to: multiplier rates aren't just price discounts — they often come with service differences. Models accessed through multiplier-rate proxies may have different response quality, stability, and concurrency limits compared to direct API access. This isn't about proxies being bad — it's about factoring the multiplier's potential impact into your comparison, especially when putting a proxy-discounted top-tier model next to a directly-connected Chinese model.

For model capability comparison, see Artificial Analysis coding-index rankings. Chinese models like GLM and Kimi are closing the gap with Claude and Codex on coding ability. When the base gap is already small, the service differences from proxy multiplier rates may further narrow or even reverse it.

So my logic is: within a limited budget, use direct, complete, cost-controllable Chinese model combos combined with three-layer constraints, to achieve what top-tier models need several times the budget to do. This isn't "settling for less" — it's "optimizing within constraints."

Why It Gets Cheaper Over Time

Chinese models keep upgrading — stronger capability, same or lower price. And China's AI infrastructure is accelerating: as compute backbones like Huawei Ascend 950 supernodes come online, inference costs will drop further. Subscription plans will evolve in two directions — either more quota or lower prices. Either way, the usable token budget per dollar will be more generous than Claude Code / Codex.

What does this mean? What a top-tier model does in 1 pass, a near-top model with SAP's three-layer constraints might take 2-3 passes to complete — but the cost difference is large enough that you can afford those 2-3 passes and still have budget left for more features. The key metric isn't single-pass efficiency — it's total output per unit budget.
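The "total output per unit budget" argument is simple arithmetic. The sketch uses illustrative numbers, not measurements: a top-tier pass costing 4x a near-top pass (inside the post's 3-5x range), against a near-top model that needs 3 passes per feature.

```python
def features_per_budget(budget: float, cost_per_pass: float, passes_per_feature: float) -> float:
    """Features delivered within a budget when each feature needs several passes."""
    return budget / (cost_per_pass * passes_per_feature)


# Illustrative numbers only, in arbitrary cost units.
BUDGET = 100.0
top = features_per_budget(BUDGET, cost_per_pass=4.0, passes_per_feature=1)
near = features_per_budget(BUDGET, cost_per_pass=1.0, passes_per_feature=3)
print(f"top-tier: {top:.1f} features, near-top: {near:.1f} features")
```

Here that gives 25 features against about 33. The near-top setup wins only while the number of passes it needs stays below the cost ratio; at a 3x gap and 3 passes, the two come out even.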

The three-layer framework stays constant, but the models executing within it keep getting stronger and cheaper.

This is SAP's long-term compound interest: framework locks the process, models keep upgrading, costs keep dropping, efficiency keeps rising.

Three-Layer Architecture: Why Three, Not Two or Four

One Layer (superpowers' Choice)

superpowers earning widespread adoption is itself proof that AI coding agents can follow structured workflows.

But as model capabilities kept improving, the relative strength of its constraints weakened. I observed three trends:

Role boundary blurring: A single agent handling both brainstorming and implementation had no clear switching point between thinking modes
Project conventions not persisting: Discipline embedded in skills couldn't distinguish "what this project uses" from "general best practices"
Process and constraints coupled: Skills mixed "how to do" with "what not to do" — changing constraints meant touching process, and vice versa

These aren't superpowers' design flaws — they're the natural consequence of models getting stronger while single-layer constraint strength stayed the same. This observation is exactly what drove me to rethink the architecture.

Two Layers (skill + rule)

Adding the rule layer solved the project convention problem — RULE_DB declares "use MySQL," agent follows. Process and constraints were decoupled.

But the role boundary issue remained: without an agent persona layer, the one executing spec-writing and the one executing code-audit were "the same character." Brainstorming's divergence and review's convergence are conflicting modes; without explicit identity switching, the model drifts between them with no clear boundary.

Three Layers Was Enough

Adding the agent persona layer gave each phase a clear role identity:

brainstorm-agent: divergent thinking, exploring possibilities, forbidden from writing code
spec-coordinator: convergent thinking, making fuzzy precise, forbidden from writing code
backend-craftsman: execution thinking, implementing per spec, forbidden from crossing boundaries
quality-evaluator: skeptical thinking, dedicated to finding problems, forbidden from fixing (only reports)

Each layer's responsibility:

┌─────────────────────────────────────────────────────┐
│  Agent (Persona)                                    │
│  "Who am I? What are my boundaries?"                │
│  → Identity, responsibilities, gates, iron laws     │
│  → Defines role, forbids boundary crossing          │
├─────────────────────────────────────────────────────┤
│  Skill (Workflow)                                   │
│  "How do I do this specific task?"                  │
│  → Step-by-step methodology, checklists, templates  │
│  → Defines process, reusable across projects        │
├─────────────────────────────────────────────────────┤
│  Rule (Constraint)                                  │
│  "What can I NOT do in this project?"               │
│  → Project conventions, tech stack, naming, style   │
│  → Defines constraints, project-specific            │
└─────────────────────────────────────────────────────┘

Why not four layers? I tried splitting "communication protocol" into a separate layer, but it's fundamentally part of the agent persona (each agent knows who to hand off to and how). Separating it added complexity without value. Three layers is sufficient and minimal.

Key Distinctions

Skills contain no persona — multiple agents share the same skill. backend-craftsman and frontend-craftsman both use TDD red-green cycles, but load different rules.

Skills contain no constraints — the same backend-implementation skill behaves differently under different RULE_DB.md (MySQL vs PostgreSQL).

Rules contain no process — RULE_API.md declares "RESTful + URL versioning + error code format," it doesn't teach you how to write APIs (that's a skill's job).

How Three Layers Collaborate

Agent loads skill to execute workflow, while following rule constraints

Example: backend-craftsman (agent persona)
  + backend-implementation (skill workflow: five-layer construction)
  + RULE_DB.md (rule constraint: MySQL + snake_case + soft delete)
  = implement backend features per project conventions
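The post says the plugin code isn't published, so this is only a guess at how the backend-craftsman example could be assembled: three layers concatenated into one instruction block, with the same skill reused under two different `RULE_DB.md` files. The `Layer` type, the section headings, and the layer texts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    kind: str  # "agent", "skill", or "rule"
    name: str
    text: str


def compose_context(agent: Layer, skill: Layer, rules: list[Layer]) -> str:
    """Stack persona, workflow, and project rules into one instruction block."""
    sections = [agent, skill, *rules]
    return "\n\n".join(f"## {layer.kind}: {layer.name}\n{layer.text}" for layer in sections)


craftsman = Layer("agent", "backend-craftsman", "You implement backend tasks per spec. Do not write frontend code.")
workflow = Layer("skill", "backend-implementation", "Build in order: types, data, logic, interface, cross-cutting. Test after each layer.")
mysql = Layer("rule", "RULE_DB.md", "Use MySQL, snake_case columns, soft delete.")
postgres = Layer("rule", "RULE_DB.md", "Use PostgreSQL, snake_case columns, soft delete.")

# Same persona and skill, different project rules.
project_a = compose_context(craftsman, workflow, [mysql])
project_b = compose_context(craftsman, workflow, [postgres])
```

Keeping the skill text free of project specifics is what lets the same workflow run unchanged in both projects.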

The effect of three-layer constraint stacking: regardless of using a strong or slightly weaker model, output stays within the three-layer framework, barely drifting — solving the problem of high randomness and non-reproducible generation in LLMs.

> Some might ask: aren't three layers too heavy? Won't they confuse the model?
>
> This framework has been repeatedly validated in real projects. If three layers of constraints actually confused the model and degraded output quality, I wouldn't publish it, let alone use it for daily development. The opposite happened: with three layers, output quality and reproducibility improved significantly. That's exactly why I'm sharing this idea: because it actually works.

Skill Definition: Why "Workflow" Not "Ability"

What a Skill Is Not

Many people understand skill as "ability" — "the model learned a skill." This is a misunderstanding.

In SAP, skill is not the model's capability. The model's code generation, reasoning, language understanding — these are built-in. Skill doesn't manage them and shouldn't.

What a Skill Is

Skill is complete documentation of a specific workflow.

It tells the agent: from start to finish, what to do at every step, what to check, what to produce. It doesn't teach the model "how to write code" (the model can write) — it teaches the model "what process to follow when writing code" (process is human engineering experience).

Example: backend-implementation skill doesn't teach the model "what TDD is" (the model knows) — it mandates that the five-step cycle (write failing test → verify failure → minimal implementation → verify pass → commit) must execute, and testing happens immediately after each construction layer.
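The five-step cycle works as a small state machine: each step must follow the previous one, and the two verify steps need evidence (a failing run, then a passing run) rather than the agent's word. A minimal sketch; the class and enum names are hypothetical, and in practice the test results would come from actually running the suite.

```python
from enum import Enum
from typing import Optional


class Step(Enum):
    WRITE_FAILING_TEST = 1
    VERIFY_FAILURE = 2
    MINIMAL_IMPLEMENTATION = 3
    VERIFY_PASS = 4
    COMMIT = 5


class TddCycle:
    """Enforces the five-step order and requires test evidence at the verify steps."""

    def __init__(self) -> None:
        self.done: list[Step] = []

    def advance(self, step: Step, tests_failing: Optional[bool] = None) -> None:
        expected = Step(len(self.done) % 5 + 1)  # a new cycle starts after COMMIT
        if step is not expected:
            raise RuntimeError(f"out of order: expected {expected.name}, got {step.name}")
        if step is Step.VERIFY_FAILURE and tests_failing is not True:
            raise RuntimeError("the new test must fail before any implementation")
        if step is Step.VERIFY_PASS and tests_failing is not False:
            raise RuntimeError("tests are not passing yet; keep implementing")
        self.done.append(step)
```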

What a Skill Contains

| Content | How specific |
|---|---|
| Step-by-step process | Phase 1 → Phase 2 → ... → Phase N, each with clear inputs/outputs |
| Checklists | Not "ensure quality," but "V1 Lint zero errors / V2 Typecheck zero type errors / V3 Build succeeds" |
| Output templates | Not "write a document," but a specific markdown structure (field names, format, examples) |
| Gate checks | Not "good enough, submit," but an item-by-item self-check table (binary pass/fail) |
| Anti-pattern tables | Not "be careful," but "when this error appears, do this" (specific fix) |

What a Skill Does NOT Contain

No persona identity (that's agent's job) — skill doesn't say "I am backend craftsman"
No project constraints (that's rule's job) — skill doesn't say "use MySQL"
No communication protocol (that's agent's job) — skill doesn't say "report to whom after completion"

Why Skills Can Be 500-800 Lines

superpowers keeps them under 200 — optimal for small context, where context is precious and must be concise.

But at 1M context, conciseness becomes a disadvantage:

| Constraint | Small Context | 1M Context |
|---|---|---|
| Skill size | <200 lines | 500-800 lines |
| Simultaneous load | 1-2 | 5-7 |
| Detail level | Summary + pointers | Full checklists + templates + examples |

SAP removes line limits. Each skill contains complete checklists, templates, examples — no "see references/" indirection. When an agent loads a skill, it gets a complete manual ready for immediate execution, not an index that "requires looking elsewhere."

How Skills Evolve

New scenario → design new skill
Unsatisfactory flow → redesign skill
Gap between two scenarios → add a skill

Skill count is a snapshot of workflow coverage, not a target value.

Core Design Philosophy: Leverage LLM Strengths + Weaknesses

1M context windows make specialization possible. LLM weaknesses make specialization necessary.

Leverage Strengths

| Strength | How SAP Uses It |
|---|---|
| Deep reasoning | Brainstorm agent explores multiple paths before committing |
| Context retention | 1M window holds multiple skills + rules simultaneously |
| Multi-perspective analysis | Six thinking hats, 5Why, fishbone (structured frameworks) |

Compensate Weaknesses

| Weakness | How SAP Compensates |
|---|---|
| Context overflow | Multi-agent isolation: each agent loads only what it needs |
| Self-defense in review | Model heterogeneity: reviewer uses a different model than the implementer |
| No project memory | Rule layer: explicit constraints persist across sessions |
| Inconsistent output | Structured communication protocol: machine-readable handoff |
| High randomness, non-reproducible | Three-layer constraint stacking: compresses freedom into a narrow but deep channel |

Communication Protocol Between Agents

Agents communicate through structured handoff messages. Not human chat — machine-readable protocol.

[PLANNING_COMPLETE] feature_id=user-auth-v2
[PLAN_PACKAGE] sap/user-auth-v2/
[CONTENTS] spec.md, tasks.md, checklist.md, dag.md
[GATE_RESULT] G1-G8 all pass
[NEXT] controller dispatch per DAG

[DISPATCH] task_id=T-001
[TO] sap:backend-craftsman
[REQ] goal/design_ref/criteria/packages

[COMPLETE] task_id=T-001
[TEST] 42 passed, 0 failed
[GATE] V1-V5 + S1-S3 pass

Why structured protocol: machine-parseable, audit trail, recoverable after context compaction.
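Because every line is `[TAG] value`, the handoff messages above can be parsed in a few lines, which is what makes them auditable and re-readable after compaction. A minimal sketch, assuming one tag per line and that the first tag names the message type; SAP's own parser, if it has one, isn't published.

```python
import re

TAG_LINE = re.compile(r"\[([A-Z_]+)\]\s*(.*)")


def parse_handoff(message: str) -> dict[str, str]:
    """Parse '[TAG] value' lines into an ordered dict; the first tag is the message type."""
    fields: dict[str, str] = {}
    for raw in message.strip().splitlines():
        match = TAG_LINE.fullmatch(raw.strip())
        if match is None:
            raise ValueError(f"not a protocol line: {raw!r}")
        tag, value = match.groups()
        fields[tag] = value
    return fields


def parse_test_counts(value: str) -> tuple[int, int]:
    """Read a '42 passed, 0 failed' value into (passed, failed)."""
    match = re.fullmatch(r"(\d+) passed, (\d+) failed", value)
    if match is None:
        raise ValueError(f"unexpected TEST value: {value!r}")
    return int(match.group(1)), int(match.group(2))
```

A controller could, for example, accept a `[COMPLETE]` message only when `parse_test_counts(msg["TEST"])` reports zero failures.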

Model Heterogeneity (GAN Discriminator)

Generator and Discriminator should not share the same model.

Same model writing and reviewing code → "understands intent, lets it pass" + shared blind spots + self-justification.

| Role | Model | Why |
|---|---|---|
| Brainstorm | DeepSeek-V4-Flash | Strong reasoning chain |
| Backend | GLM 5.2 | Code generation accuracy |
| Frontend | Kimi Code 2.7 | Frontend-specific patterns |
| Review | Different from implementer | GAN adversarial review |

Not "better models" — different bias patterns catching each other's blind spots.
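The heterogeneity rule reduces to one constraint when dispatching review: the reviewer's model must differ from the implementer's. A sketch using the role table above; treating the pool order as a preference list is an assumption.

```python
# Role-to-model assignments from the role table; review is deliberately absent.
IMPLEMENTERS = {
    "brainstorm": "DeepSeek-V4-Flash",
    "backend": "GLM 5.2",
    "frontend": "Kimi Code 2.7",
}


def pick_reviewer(role: str, pool: list[str]) -> str:
    """Return the first model in the pool that did not implement this role's work."""
    implementer = IMPLEMENTERS[role]
    for model in pool:
        if model != implementer:
            return model
    raise ValueError(f"no reviewer available that differs from {implementer}")
```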

Workflow Overview

User Request
  ↓
Main agent reads bootstrap (auto-injected at SessionStart)
  ↓
P0 Brainstorm → P0.5 Spec → P0.85 Design → P1 Decompose
  ↓ (planning package ready)
Main agent → controller mode
  ↓
P2 Dispatch craftsmen (model heterogeneity) → P3 Review QE (different model) → P4 Docs

Output Path

sap/{feature-id}/
├── brainstorm.md      # P0
├── spec.md            # P0.5
├── tasks.md           # P0.5
├── checklist.md       # P0.5
├── design.md          # P0.85 (complex only)
├── arch/              # P0.85 ADR
└── dag.md             # P1
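Before the main agent switches to controller mode, the planning package can be checked against this layout. A minimal sketch, assuming the required files are the four named in the `[CONTENTS]` line of the handoff example, plus `design.md` for complex features.

```python
from pathlib import Path

# Files named in the [CONTENTS] line of the PLANNING_COMPLETE message.
REQUIRED = ("spec.md", "tasks.md", "checklist.md", "dag.md")


def missing_artifacts(feature_dir: Path, complex_feature: bool = False) -> list[str]:
    """List planning files still missing from sap/{feature-id}/."""
    required = list(REQUIRED)
    if complex_feature:
        required.append("design.md")  # P0.85 output, only for complex features
    return [name for name in required if not (feature_dir / name).is_file()]
```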

Project Structure

sap/
├── .zcode-plugin/ .claude-plugin/ .codex-plugin/ .opencode/
├── .mcp.json               # MCP server config
├── /lsp                    # LSP integration
├── hooks/                  # SessionStart injection (multi-platform)
├── commands/               # /sap /brainstorm /audit /rules
├── agents/                 # agent personas
├── skills/                 # designed by workflow, no upper limit
└── rules/                  # project constraint templates (git-ignored)

About Model Selection & Open Source

Model Selection

The model combo mentioned in this article (GLM 5.2 / Kimi Code 2.7 / DeepSeek-V4-Flash / LongCat-2.0) is my personal choice after weighing capability against cost. It doesn't mean these are the only or optimal options. I've tried mainstream models on the market — including MiMo, MiniMax, and others — and the ones I selected are those that hold up under daily development in both capability and cost.

If you think "these models aren't good enough," that's a completely understandable perspective. Honestly, if someone were willing to sponsor me unlimited access to Claude Code Fable 5 and Codex GPT-5.5 xHigh, I'd happily use top-tier models to fully unleash what SAP can do :)

But the reality is: not everyone can afford top-tier models long-term. SAP's value is — within your affordable model budget, pulling output quality as high as possible.

About Open Source

This repository currently shares design philosophy only, not the plugin implementation code.

The reason is simple: I'm not sure if this idea is truly valuable yet. If people resonate with it — and stars indicate that — I'll open-source the full plugin code. If not enough people connect with the idea for now, I'll keep absorbing new insights and evolving — I'll share the code when I've figured it out.

No rush. Ideas need validation, not aggressive promotion.

Contact

If you have thoughts to discuss, feel free to reach out via email or on GitHub.

GitHub: https://github.com/cocacocca/sap

u/cocacocca — 3 days ago

▲ 12 r/SpecDrivenDevelopment

grill-with-docs versus spec driven

I intended to use OpenSpec for a project, but then I saw Matt Pocock's video about grill-with-docs, and the two appear to at least overlap, or maybe even try to solve the same problem. What do you think?

https://www.youtube.com/watch?v=6BB6exR8Zd8

u/IndependentFew2451 — 5 days ago

▲ 6 r/SpecDrivenDevelopment

I built a phase-driven workflow for AI-assisted development — looking for feedback

I’ve been experimenting with AI-assisted development on a few personal projects, and I kept running into the same issue: the larger the planning surface became, the more assumptions I had to make and keep in my head.

So I built Mano.

Mano is a fast feedback loop for AI-assisted development:

Define what is needed for the current phase
Build it
Review what was learned or missed
Adjust the backlog
Define the next phase and repeat

Specs, implementation rules, and UX guidance are optional inputs rather than mandatory artefacts for every phase.

The goal is not to remove planning. It is to keep the planning horizon small enough that assumptions can be tested before they spread across many stories or implementation tasks.

The human approves the phase scope and direction. The agent helps plan and execute but does not autonomously decide the roadmap.

I’ve tested Mano across a few personal projects, and it is now working well enough for me to make it public.

I’d especially value feedback on:

whether the phase-driven distinction is clear;
whether this solves a real problem or simply moves the planning effort elsewhere;
where you think the workflow would break on larger projects.

https://github.com/ceceppa/mano

u/ceceppa — 5 days ago

▲ 8 r/SpecDrivenDevelopment

Problems w/ OpenSpec or SpecKit?

I'm just getting into SDD. Are there common problems, pitfalls, or mistakes (whether made by the AI or by me) that I should look out for?


u/CrazyGeek7 — 6 days ago

▲ 22 r/SpecDrivenDevelopment+1 crossposts

Spec-Driven Development Multi-Model Adversarial Authoring and Glossary with OpenCode and OpenSpec

This is a follow-up to my earlier post about "Spec-Driven Development with OpenSpec and OpenCode": https://www.reddit.com/r/SpecDrivenDevelopment/s/jLn7MWYwcj. In this video I cover multi-model adversarial authoring of specifications: one sub-agent authors, another reviews, to reduce bias before human review. I also cover glossary skills, where terminology is defined once and reused everywhere to improve the consistency and quality of specifications. Thanks.


u/harikrishnan_83 — 7 days ago

▲ 2 r/SpecDrivenDevelopment+1 crossposts

I kept running into the same problem: my team's AI context lived in plans.md / claude.md / spec files, but there was no good way to co-edit them, and agents only ever saw old pasted snapshots. So I fixed it by building easymd, completely free

So I built easymd: real-time collaborative markdown editing, like Google Docs, except the document is an actual .md file and an AI agent can edit it live alongside you.

- Real-time multiplayer editor (live cursors, CRDT sync, no merge conflicts)

- The file on disk stays canonical — edits sync straight back, no drift

- MCP server built in: connect an agent in one line; it can read, create, and update the same docs you're editing. Its changes appear live in your editor, and your edits are instantly visible to it.

- Clean markdown also saves a ton of tokens vs feeding models HTML/DOCX/PDF

- One-line CLI to start: npx easymd open CLAUDE.md

https://www.easymd.tech/

The idea is one shared, always-current file that both humans and agents work from, instead of copy-pasting context around.

Would love feedback! Especially on the agent-as-collaborator workflow and what'd make it actually useful in your setup.

u/Historical-Willow679 — 8 days ago

▲ 1 r/SpecDrivenDevelopment

what "level" of AI-assisted coding are you actually at? (autocomplete → not touching the code)

saw this framework recently and it's been a useful mirror, curious where this sub lands.

the idea (Dan Shapiro's, modeled on self-driving levels): there are 6 levels, 0 to 5.

0: autocomplete, you write everything

1: you delegate tiny tasks, review all of it

2: AI writes across files, you read every line

3: you stop writing, you review the PRs it opens

4: you write a spec, walk away, check if tests pass (code = black box)

5: nobody writes or reviews code, specs in / software out

the spicy claim is that ~90% of devs are stuck oscillating between 2 and 3 and don't realize it. you climb a bit, get tired of reviewing endless diffs, drop back to "let me just write it myself." every level feels like the top.

what makes 3→4 hard imo isn't the tooling, it's trust. going from "i read the code" to "i trust a spec + external tests" is a mental jump most people (me included, some days) won't make.

genuinely curious, not rhetorical: what level are you at, and what's keeping you from the next one? and if anyone's living at 4-5 in a real codebase (not a demo), how's it actually going at 3am when prod breaks?

u/jokiruiz — 9 days ago

▲ 1 r/SpecDrivenDevelopment

Learning SDD and have some questions

I have a $20/month Claude Code subscription, and I've also looked at OpenCode. The last company I worked for was using GitHub Copilot and was moving to SDD... and I got laid off right before that.

So, I have been using Claude Code, over multiple prompts, on my Spring Boot Java app to create all my boilerplate code, which I found to be a great help. I know how I write Hibernate entities, repositories, business services, REST controllers, etc., and all their tests. Claude Code wrote the code the way I would have, so I know how the code works, and ONLY I checked the code into my GitHub.

So, I have been doing Agile/Scrum since 2007, so that is about 19 years. Every company I've seen use it has used it quite differently from the others. As I understand Agile, we coded features in two-week sprints and had the code actually deployed to UAT so the PM and client could see it and test it before it went to production. In the best case, we got it right and could move on. If the client or PM got something wrong or forgot something, we just went back and fixed it. No problem: it's a new feature in a new feature branch.

So, I have been watching a lot of YouTube videos and scouring the internet for the best take on Spec-Driven Development, and I have a lot of questions. First, my understanding is that the developer (or someone) creates a bunch of .md files for the "Spec". I haven't seen many examples of this, but let's say we want to create a RESTful API login controller. We create the .md file and put in all the specifications. Then I guess this creates a "Plan" that can be reviewed, and I'm guessing that Plan is written to a new .md file (Plan.md?). Up until this point, no code is written and no code is deployed? So no one can actually test the code yet, right?

OK, so the plan is fine, and then we go ahead and "Implement" the code? I'm guessing the AI writes this code and deploys it so we can look at it and test it, right? Does the developer review the code at this point, even though it is already deployed?

What if the PM or client forgets something, like they want the username to be an email? Or the password has to have a symbol? I imagine we add this to the Spec, which then updates the Plan, and then we have to re-implement? Does it rewrite ALL the code for this feature? If a human were doing this job, we would just add the missing steps; we wouldn't have to rewrite everything. And redoing ALL the code for an implementation... doesn't that use more tokens?

What if we have a dozen features and a dozen developers, each with a new feature? Do we use one giant Spec.md, or multiple Specs, one per feature, which would mean one Plan.md per feature, right? So we have each Plan where we want it. When we go to implement, that is, have the AI create the code, what if two Specs produce similar code? At what point does the developer go in and review this code?

I apologize for this giant message. I just got a new job where they use Claude Code, and I don't know how they'll use it; I think they are still figuring it out. I remember the training we had when learning Agile, and how painful it was. I think there will be a lot of growing pains for SDD as well. So, for anyone who has had more training on this, can you answer some of these questions or correct my misunderstandings of SDD? Thanks!

u/Huge_Road_9223 — 12 days ago

▲ 8 r/SpecDrivenDevelopment+4 crossposts

Need Help Choosing the Right AutoGen Teams Architecture

Hi everyone,

I'm currently working on a project where I need to migrate an existing multi-agent workflow to Microsoft AutoGen.

The current workflow is pretty simple:

- One node collects data from different sources.

- Multiple specialized nodes process that data in parallel (each has a different responsibility).

- A final validation node combines all the results and decides the final output based on some rules.
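The current workflow is a fan-out/fan-in pipeline. As a minimal sketch, assuming plain Python `asyncio` rather than AutoGen (every function name, source name, and the majority-vote validation rule below is a hypothetical placeholder), this is the behavior any Team choice would need to reproduce:

```python
import asyncio
from collections import Counter


# Hypothetical collector: in the real workflow this would pull from the actual sources.
async def collect(sources: list[str]) -> dict[str, str]:
    await asyncio.sleep(0)  # stand-in for real I/O
    return {src: f"data from {src}" for src in sources}


# Hypothetical specialist: each one has its own responsibility and returns a verdict.
async def specialist(name: str, data: dict[str, str]) -> tuple[str, str]:
    await asyncio.sleep(0)  # stand-in for an agent/LLM call
    # Placeholder rule: approve only if there is data and every source produced something.
    verdict = "approve" if data and all(data.values()) else "reject"
    return name, verdict


# Hypothetical validator: combines all results with a simple rule (majority vote here).
def validate(results: list[tuple[str, str]]) -> str:
    counts = Counter(verdict for _, verdict in results)
    return counts.most_common(1)[0][0]


async def run_pipeline(sources: list[str], specialists: list[str]) -> str:
    data = await collect(sources)  # 1. single collection step
    results = await asyncio.gather(  # 2. fan-out: specialists run concurrently
        *(specialist(name, data) for name in specialists)
    )
    return validate(list(results))  # 3. fan-in: one validation step


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["crm", "logs"], ["risk", "compliance", "pricing"])))
```

Pinning the pipeline down like this gives a fixed reference: whichever Team type is chosen can be run against the same inputs and checked for the same final output.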

I first started using GraphFlow because it felt very similar to my existing graph-based workflow. However, my client wants the implementation to use AutoGen Teams instead.

I've gone through the documentation, but I'm still confused about which Team type is the best fit:

- Selector Group Chat

- Swarm

- Round Robin

- Or something else?

My goal is to keep the workflow efficient, allow parallel processing, and maintain the same quality of results.

If you've built projects using AutoGen Teams, I'd love to hear:

- Which Team would you choose for this kind of workflow?

- Any tips or common mistakes to avoid?

Thanks in advance for your help!

u/Ninjapakoda — 12 days ago

▲ 21 r/SpecDrivenDevelopment+4 crossposts

Built an OpenSpec extension that makes AI agents better at spec-driven development

I've been using OpenSpec quite a bit and noticed that different AI coding agents still vary a lot in how they handle planning and implementation. They often jump straight into coding or leave requirements and design vague.

I ended up building OpenSpec Plus to add more structure to the workflow—better discovery, testable specs, design validation, dependency-aware task planning, and a stronger TDD-first approach. It's designed to fit into the existing OpenSpec workflow rather than replace it. It works with OpenCode, Claude Code, Cursor, Windsurf, Copilot, and others.
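"Dependency-aware task planning" generally means ordering tasks so that nothing starts before the tasks it depends on are finished. A minimal sketch using Python's standard `graphlib` (a generic illustration, not how OpenSpec Plus implements it; the task names are hypothetical):

```python
from graphlib import TopologicalSorter

# Hypothetical tasks for one change; each maps to the set of tasks it depends on.
tasks = {
    "finalize-spec": set(),
    "write-failing-tests": {"finalize-spec"},
    "update-docs": {"finalize-spec"},
    "implement-domain": {"write-failing-tests"},
    "wire-api": {"implement-domain"},
}


def plan_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches; tasks in the same batch do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable, readable order
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for step, batch in enumerate(plan_batches(tasks), start=1):
        print(step, batch)
```

Tasks in the same batch have no dependencies on one another, so they could in principle be handed to separate agents in parallel.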

I've been using it across a few projects already, but I'd really like feedback from others using OpenSpec or AI coding agents. I'm especially interested in what works well and what could be improved.

Repo: https://github.com/sudokar/openspec-plus

u/sudhakarms — 13 days ago

▲ 19 r/SpecDrivenDevelopment

Understanding OpenSpec & Spec-Driven Development

fadamakis.com

u/galher — 12 days ago

▲ 5 r/SpecDrivenDevelopment+1 crossposts

Which software architecture patterns are actually useful in real projects?

I'm currently building several software projects, mostly desktop apps and backend/SaaS-style systems, and I want to understand architecture beyond just writing code that works.

There are many architecture patterns and styles: layered architecture, MVC, hexagonal architecture, clean architecture, event-driven architecture, microservices, modular monoliths, and others.

For people who have worked with real systems:

Which architecture patterns do you use most often?
Which ones are actually useful in practice?
Which ones are overused or misunderstood?
What should a self-taught developer focus on first?

I'm not looking for a theoretical list. I want to understand what matters when building maintainable software in the real world.

u/Famous_DyaDya — 13 days ago
