Inside GSD 2.0: The Architecture Behind Reliable AI Agents
GSD Foundation published "How We Built The World's Most Powerful Coding Agent" this week — a technical preview of GSD 2.0, launching tomorrow. I analyzed it alongside Spec Kit and OpenSpec in a comparison piece, but that post covered the architecture at comparison depth. This one goes deeper.
I've shipped NovaMX, this portfolio site, and multiple client projects using GSD 1.x. I know the workflow, I know the pain points, and I know where the current version hits its limits. The GSD Foundation article describes an architecture that directly addresses the problems I've hit — and some innovations I didn't expect. Here's what GSD 2.0 brings to the table, why each design decision matters, and what I'm watching closely as a practitioner. Not the workflow — I covered that in my Spec-Driven Development guide. The internals.
The Core Insight: Deterministic Infrastructure
The GSD Foundation article opens with a claim that resonates with everything I've experienced using GSD 1.x: most of what makes AI coding agents unreliable isn't the model's code generation — it's everything around it.
State management. Context pollution. Lost continuity between sessions. Mechanical errors in git operations. Verification that checks process instead of outcomes. Summaries that lose information through compounding compression.
GSD 2.0's architectural response is a strict separation: if you could write an if-else that handles it correctly every time, it must be deterministic code — not LLM reasoning. Every token the model spends on mechanical operations is a token wasted and a failure mode introduced.
Two Tools Replace Everything
The entire deterministic layer is exposed through two tools:
gsd_manage — 18 actions covering state transitions, git operations, directory scaffolding, context assembly, and file formatting. The model never constructs a git command. Never parses markdown to figure out which task is next. Never creates frontmatter. It calls gsd_manage with an action name, gets a result, and moves on.
gsd_verify — 4 actions for static verification: file existence, export detection, import wiring validation, and stub detection. One tool call checks that auth.ts exists, has at least 30 lines, exports generateToken and verifyToken, and isn't a placeholder returning null.
One tool call replaces what would otherwise be 5-10 bash/read/edit calls that the model has to reason about. That's not just an efficiency gain — it's an entire category of failure modes eliminated. The model can't misformat a git commit message or forget to stage a file because it never touches git directly.
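To make the auth.ts example concrete, here is a minimal sketch of what one static check could look like. The article does not show gsd_verify's real signature, so `ArtifactSpec`, `checkArtifact`, and the regex-based export detection are all assumptions for illustration. The file content is passed in as a string so the sketch stays self-contained.

```typescript
// Hypothetical shape of one static check; not the real gsd_verify API.
interface ArtifactSpec {
  path: string;
  minLines: number;
  requiredExports: string[];
}

interface ArtifactReport {
  ok: boolean;
  problems: string[];
}

// `content` is the file's text, or null when the file does not exist.
function checkArtifact(spec: ArtifactSpec, content: string | null): ArtifactReport {
  if (content === null) {
    return { ok: false, problems: [`${spec.path}: file missing`] };
  }
  const problems: string[] = [];
  const lineCount = content.split("\n").length;
  if (lineCount < spec.minLines) {
    problems.push(`${spec.path}: ${lineCount} lines, expected at least ${spec.minLines}`);
  }
  for (const name of spec.requiredExports) {
    // Naive detection of `export function name`, `export const name`, and similar.
    const pattern = new RegExp(
      `export\\s+(?:async\\s+)?(?:function|const|let|class)\\s+${name}\\b`
    );
    if (!pattern.test(content)) {
      problems.push(`${spec.path}: missing export ${name}`);
    }
  }
  return { ok: problems.length === 0, problems };
}
```

The point of the design is that the model receives one structured report instead of reasoning over several separate file reads.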
What does the LLM actually do? The creative work:
- Decomposing scope into slices (architectural judgment)
- Writing must-haves (understanding what observable outcomes matter)
- Discussing gray areas with the user (interpreting intent)
- Scouting the codebase during research (judging relevance)
- Diagnosing verification failures (abductive reasoning)
- Actually writing the code
Judgment work. Everything else is TypeScript that either works correctly or throws a clear error.
How GSD 2.0 Kills Context Rot
This is GSD 2.0's most important innovation, and almost no one has covered how it actually works.
Context rot is the silent quality degradation that happens when models reason over polluted context windows. By task 3 or 4 in a sequence, the context is saturated with stale tool output from earlier tasks: file contents read several tasks back that have since been refactored, debugging traces from solved problems, hundreds of lines of irrelevant terminal output. The model can't tell what's current from what's stale, so it starts making decisions based on outdated information.
I've seen this kill sessions in GSD 1.x. The agent references variables that were renamed. Follows patterns from code that was restructured. Avoids approaches it tried earlier that failed for reasons that no longer apply. The reasoning quality drops as the signal-to-noise ratio in the context collapses. GSD 1.x mitigated this with fresh subagent contexts per plan, but 2.0 takes it further with a mechanism called anchor pruning.
Anchor Pruning: The Mechanism
Each task gets an invisible anchor message injected into the conversation. Before every LLM call, a context hook prunes the message history back to the current task's anchor.
What this means in practice: task 5 doesn't see the 40 tool calls from tasks 1-4. No stale file reads. No failed attempts. No intermediate debugging. It gets a clean context window containing its task plan, relevant upstream summaries, and nothing else.
Task 7 runs with the same context quality as task 1. Not approximately the same — the same. Fresh window, clean signal, zero accumulated noise.
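A minimal sketch of the pruning step, assuming a simple message array where the anchor is marked with a task id. The `ChatMessage` shape, the `anchorTaskId` field, and the choice to keep earlier system messages are my assumptions; the article does not describe the runtime's actual types.

```typescript
// Hypothetical message shape; the real runtime's types are not described in the article.
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  anchorTaskId?: string; // set only on the invisible anchor message
}

// Drop everything before the current task's anchor, except system messages.
function pruneToAnchor(history: ChatMessage[], taskId: string): ChatMessage[] {
  let anchorIndex = -1;
  for (let i = history.length - 1; i >= 0; i--) {
    if (history[i].anchorTaskId === taskId) {
      anchorIndex = i;
      break;
    }
  }
  if (anchorIndex === -1) {
    return history; // no anchor for this task yet: leave the history untouched
  }
  const systemMessages = history
    .slice(0, anchorIndex)
    .filter((m) => m.role === "system");
  return systemMessages.concat(history.slice(anchorIndex));
}
```

Run before every model call, a hook like this keeps the window bounded by the current task's own messages rather than by the length of the session.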
Zero Discovery Calls
Before a task starts, GSD 2.0 pre-assembles everything the agent needs:
- The task plan (goal, steps, must-haves)
- Compressed summaries from dependency slices
- Milestone-level context and locked decisions
- Continue-here data if resuming interrupted work
This is injected automatically. The agent never has to grep for project structure, read state files to figure out where it is, or search for what was built in prior slices. If it does, the context assembly is broken — that's a bug in GSD, not a workflow step.
The goal is zero discovery calls. Every token the agent spends on "where am I, what exists, what was decided" is a token not spent writing code.
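The pre-assembly step can be pictured as a pure function over the four inputs listed above. This is only a sketch: the field names and section headings are hypothetical, and the real system presumably reads these pieces from files on disk.

```typescript
// Hypothetical input shape; field names are illustrative, not GSD's.
interface TaskContextInputs {
  taskPlan: string;
  dependencySummaries: string[];
  milestoneContext: string;
  continueHere?: string; // present only when resuming interrupted work
}

// Build the injected context in a fixed order, skipping empty sections.
function assembleTaskContext(inputs: TaskContextInputs): string {
  const sections: string[] = [];
  sections.push("## Task plan\n" + inputs.taskPlan);
  if (inputs.dependencySummaries.length > 0) {
    sections.push("## Upstream summaries\n" + inputs.dependencySummaries.join("\n\n"));
  }
  sections.push("## Milestone context and locked decisions\n" + inputs.milestoneContext);
  if (inputs.continueHere !== undefined) {
    sections.push("## Continue here\n" + inputs.continueHere);
  }
  return sections.join("\n\n");
}
```

Because the output depends only on its inputs, the same task always starts from the same context, which is what makes a stray discovery call a detectable bug rather than normal behavior.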
The context budget is deliberate: the orchestrator stays at 10-15% context usage, each subagent gets a fresh 200,000-token window, and the main session targets 30-40% utilization. The aim is to keep every agent well short of a saturated window, where output quality holds up best.
Fractal Summaries: Memory That Scales
When a task completes, the agent writes a structured summary: what was built, key decisions made, files modified, patterns established, and what downstream work should know.
When a slice completes, task summaries compress into a slice summary. When enough slices are done, slice summaries compress into a milestone summary. Each level includes drill-down paths to the level below if more detail is needed.
When planning slice 6, you don't load 15 individual task summaries from slices 1-5. You load one milestone summary — maybe 200 lines — that contains the essentials: what was built, what's available, what patterns to follow, what decisions were locked.
The token budget for injected summary context is capped at ~2,500 tokens. If the dependency chain is too large, the oldest and least relevant summaries are dropped first. Milestone-level summaries take priority over slice-level, which take priority over task-level.
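Here is one way the cap and priority rules could be sketched. The article does not say how relevance is scored, so this version uses recency as a stand-in, and the `SummaryEntry` fields are hypothetical.

```typescript
type SummaryLevel = "milestone" | "slice" | "task";

// Hypothetical record; `tokens` is assumed to come from some tokenizer.
interface SummaryEntry {
  id: string;
  level: SummaryLevel;
  tokens: number;
  createdAt: number; // larger means newer
}

const LEVEL_RANK: Record<SummaryLevel, number> = { milestone: 0, slice: 1, task: 2 };

// Order by level priority, then newest first, and drop from the low-priority
// end until the total fits under the cap.
function selectSummaries(entries: SummaryEntry[], cap: number = 2500): SummaryEntry[] {
  const ordered = entries.slice().sort(
    (a, b) => LEVEL_RANK[a.level] - LEVEL_RANK[b.level] || b.createdAt - a.createdAt
  );
  let total = ordered.reduce((sum, entry) => sum + entry.tokens, 0);
  let keepCount = ordered.length;
  while (keepCount > 0 && total > cap) {
    keepCount--;
    total -= ordered[keepCount].tokens;
  }
  return ordered.slice(0, keepCount);
}
```

With a 1,500-token milestone summary, two 800-token slice summaries, and a 200-token task summary, this keeps the milestone summary and the newer slice summary (2,300 tokens) and drops the rest.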
One critical rule from the article that I want to highlight: never summarize summaries. Each summary level regenerates from the level below plus actual code state. A slice summary comes from task summaries, not from a compressed version of a prior slice summary. This prevents the compounding information loss you get when you keep compressing compressed text — a problem I've hit in GSD 1.x on longer projects where late-phase summaries would lose critical early decisions.
Boundary Maps: Contracts Before Code
This may be the most impactful planning feature, and defining contracts up front is the step I see developers skip most often.
When a milestone is planned, every slice declares what it produces and what it consumes from upstream slices. Not vaguely — concretely. Functions, types, interfaces, endpoints, with names:
S01 → S02
  Produces:
    types.ts → User, Session, AuthToken (interfaces)
    auth.ts → generateToken(), verifyToken(), refreshToken()
  Consumes: nothing (leaf node)

S02 → S03
  Produces:
    api/auth/login.ts → POST handler
    middleware.ts → authMiddleware()
  Consumes from S01:
    auth.ts → generateToken(), verifyToken()
This forces interface thinking before implementation. When slice 3 is being planned, it doesn't guess what slice 1 built — the boundary map says exactly what's available. The planning step verifies that the upstream slice actually produced what the map claims.
No more "slice 3 needs a function that slice 1 never exported." No more silent assumptions about what exists. The contracts are explicit and checked.
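The declaration half of that check is easy to picture in code. The sketch below assumes a boundary map stored as plain objects; `SliceBoundary` and `findBrokenContracts` are hypothetical names, and it only compares declarations against each other. Confirming that the upstream files really export those names would be a separate static check.

```typescript
// Hypothetical boundary-map shape: each slice lists produced names per file
// and the names it expects from upstream slices.
interface SliceBoundary {
  id: string;
  produces: { [file: string]: string[] };
  consumes: { from: string; file: string; names: string[] }[];
}

// Returns one message per consumed name that the upstream slice never declared.
function findBrokenContracts(slices: SliceBoundary[]): string[] {
  const errors: string[] = [];
  for (const slice of slices) {
    for (const need of slice.consumes) {
      const upstream = slices.filter((s) => s.id === need.from)[0];
      const available = upstream ? upstream.produces[need.file] || [] : [];
      for (const name of need.names) {
        if (available.indexOf(name) === -1) {
          errors.push(
            `${slice.id} needs ${name} from ${need.from}:${need.file}, which is not declared`
          );
        }
      }
    }
  }
  return errors;
}
```

Fed the S01/S02 map from the example, this returns an empty list. Add a slice that consumes a name S01 never declared and the mismatch surfaces at planning time instead of during integration.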
Building NovaMX with GSD 1.x — a CRM with auth, leads, properties, analytics, and AI-driven status tracking — I hit exactly this class of integration bugs. Phase 3 needed functions that phase 1 never exported. Silent assumptions about what existed led to hours of debugging. GSD 2.0's boundary maps are the architectural answer to a problem I've lived through.
The Discuss Phase: 10 Minutes That Save Hours
GSD 1.x already has a discuss phase via /gsd:discuss-phase, and it's one of the features I use most. GSD 2.0 evolves it significantly. The core problem with most AI coding agents: you say "build me auth" and they immediately start writing code. They make 30 decisions in the first 5 minutes — session storage vs JWT, email verification vs none, OAuth vs password-only — and you don't find out which choices they made until you're looking at the finished result.
GSD makes discussion a first-class phase. Before planning starts, the agent reads the scope, identifies the gray areas — places where multiple reasonable approaches exist and your preference actually matters — and interviews you about them.
The key behaviors that make this work:
- It follows energy. Whatever you emphasize, it digs into. If you spend time talking about error handling, it asks deeper questions about that.
- It challenges vagueness. "Make it simple" gets pushed back. Simple how? For the user? To implement? To extend later?
- It makes the abstract concrete. "Walk me through using this." "What does that look like on screen?" "What happens when this fails?"
- Scope guardrails prevent drift. If you suggest a feature that belongs in a different slice, it captures the idea as deferred and redirects.
The output is a context.md file — a structured record of every decision with your reasoning. This file gets injected into all downstream work: planning, execution, verification. When the agent is implementing task 4, it still has your discuss-phase decisions in context. It doesn't re-debate them. It doesn't silently make a different choice because it forgot what you said. The decisions are locked.
This is what makes hands-off execution possible. You front-load alignment in a 10-minute conversation, and every task inherits those decisions automatically.
Surviving Interruptions: Continue-Here
Context windows end. Sessions time out. Users hit Ctrl+C. Runtimes auto-compact. GSD 1.x has /gsd:resume-work and STATE.md for cross-session memory, but the continue-here mechanism in 2.0 goes deeper.
If a task is interrupted, the system writes a continue file capturing:
- What's already completed
- What remains to be done
- Decisions made during the task (so the next session doesn't re-debate them)
- The "vibe" — what was tricky, what to watch out for
- The exact first thing to do when resuming
A fresh session reads this file, loads the task plan, injects both into context, and picks up from exactly where it left off. The continue file is consumed on resume — it's ephemeral, not a permanent record.
This is hooked into the runtime's compaction event. If the runtime auto-compacts the conversation, the continue file is written automatically before compaction happens. No work is lost. I've had sessions compacted mid-task on complex NovaMX features in GSD 1.x and lost context that took 10 minutes to reconstruct. If 2.0's continue-here works as described, that problem disappears entirely.
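As a sketch of the data involved, the continue file maps naturally onto a small record with consume-on-read semantics. Everything here is illustrative: the field names are mine, and an in-memory store stands in for the file on disk and for whatever compaction hook the runtime actually exposes.

```typescript
// Hypothetical stand-in for the continue file; field names are illustrative.
interface ContinueFile {
  taskId: string;
  completed: string[];
  remaining: string[];
  decisions: string[];
  vibe: string;
  firstStep: string;
}

class ContinueStore {
  private files: { [taskId: string]: ContinueFile } = {};

  // Would be called from an interrupt or pre-compaction hook (assumed).
  write(file: ContinueFile): void {
    this.files[file.taskId] = file;
  }

  // Reading consumes the record, so it cannot be replayed by a later session.
  consume(taskId: string): ContinueFile | undefined {
    const file = this.files[taskId];
    delete this.files[taskId];
    return file;
  }
}
```

Deleting on read is what keeps the file ephemeral: a second resume finds nothing and falls back to the task plan alone.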
Verification and Automatic UAT
"All steps done" is not verification. GSD 2.0's verification is goal-backward: it checks actual outcomes.
The 4-Tier Verification Ladder
Every task picks the strongest tier it can reach:
- Static — files exist, exports present, imports wired, no stubs detected
- Command — tests pass, build succeeds, lint is clean
- Behavioral — browser flows work, API responses are correct
- Human — the user checks only when the agent genuinely can't verify itself
The stub detector scans for TODO comments, FIXME markers, return null, return {}, console.log placeholders, and hardcoded empty responses. An 8-line file that returns an empty object doesn't pass static verification. This catches the most common agent failure mode: files that exist but don't actually work.
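A rough sketch of pattern-based stub detection, built from the markers listed above. The regexes are my own approximations, not GSD's rules, and a real detector would need more context than this: plenty of legitimate code returns null.

```typescript
// Illustrative patterns based on the markers named in the article.
const STUB_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "TODO comment", pattern: /\/\/\s*TODO\b/ },
  { label: "FIXME marker", pattern: /\/\/\s*FIXME\b/ },
  { label: "return null", pattern: /\breturn\s+null\b/ },
  { label: "return {}", pattern: /\breturn\s*\{\s*\}/ },
  { label: "console.log placeholder", pattern: /\bconsole\.log\(/ },
];

// Returns the label of every stub pattern found in the source text.
function detectStubs(source: string): string[] {
  return STUB_PATTERNS
    .filter((entry) => entry.pattern.test(source))
    .map((entry) => entry.label);
}
```

An empty result means none of the patterns matched. It does not prove the file works, which is why the higher tiers exist.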
Auto-Generated UAT Scripts
Every time a slice completes, GSD 2.0 produces a User Acceptance Test script — a human-readable document telling you exactly how to verify what was built:
Test: Sign up flow
  Do:
    Open http://localhost:3000/signup
    Enter "[email protected]" in the Email field
    Enter "password123" in the Password field
    Click "Sign Up"
  Expected:
    Page redirects to http://localhost:3000/dashboard
    Header shows "Welcome, [email protected]"
    Refreshing the page keeps you logged in
Every step is copy-pasteable. Every expected result describes exactly what you should see — not "it should work" but the specific text, URL, and behavior. UATs are derived from the slice's demo sentence and must-haves, cross-referenced against what was actually built.
UATs are non-blocking. The agent writes the script and moves on. You test whenever convenient. At any point in a project, you have a UAT file for every completed slice — an automatic paper trail of what was built and how to prove it.
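Since the script format is so regular, rendering it from structured data is straightforward. This sketch assumes a hypothetical `UatCase` record; how GSD 2.0 actually derives the steps from the demo sentence and must-haves is model work that the code does not attempt to show.

```typescript
// Hypothetical structured form of one UAT case.
interface UatCase {
  title: string;
  steps: string[];
  expected: string[];
}

// Render a case in the Test / Do / Expected layout used above.
function renderUat(uat: UatCase): string {
  const lines: string[] = [`Test: ${uat.title}`, "Do:"];
  for (const step of uat.steps) {
    lines.push("  " + step);
  }
  lines.push("Expected:");
  for (const outcome of uat.expected) {
    lines.push("  " + outcome);
  }
  return lines.join("\n");
}
```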
Git Strategy: A Changelog You Can Bisect
Each slice gets its own git branch. Every task gets a checkpoint commit before it starts and a proper commit after verification passes. When the slice is done, the branch squash-merges to main as one clean commit:
feat(M001/S06): verification + summarization + UAT
feat(M001/S05): task execution + context pruning
feat(M001/S04): milestone and slice planning commands
feat(M001/S03): extension scaffold and command routing
feat(M001/S02): state machine + deterministic operations
feat(M001/S01): types + file I/O + git operations
One commit per slice. Individually revertable. The branch preserves per-task history for git bisect and git blame. Rollback is straightforward: bad task → reset to the checkpoint on the branch. Bad slice → revert the single squash commit on main.
The user never runs a git command. The agent handles all branching, committing, merging, and archiving through deterministic gsd_manage calls.
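This is the kind of work that fits the if-else rule well. The sketch below builds the squash commit subject in the format shown above and the git invocations for a squash merge. The function names and the branch name are hypothetical; `git checkout`, `git merge --squash`, and `git commit -m` are standard git commands, but I am assuming rather than quoting how gsd_manage sequences them.

```typescript
// Left-pad a number with zeros to a fixed width.
function zeroPad(value: number, width: number): string {
  let text = String(value);
  while (text.length < width) {
    text = "0" + text;
  }
  return text;
}

// Build a subject like "feat(M001/S06): verification + summarization + UAT".
function squashSubject(milestone: number, slice: number, title: string): string {
  return `feat(M${zeroPad(milestone, 3)}/S${zeroPad(slice, 2)}): ${title}`;
}

// Return the argument lists for a squash merge of a slice branch into main.
function squashMergePlan(branch: string, subject: string): string[][] {
  return [
    ["git", "checkout", "main"],
    ["git", "merge", "--squash", branch],
    ["git", "commit", "-m", subject],
  ];
}
```

Because the subject and the command sequence come from code, the model has no opportunity to misformat either one.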
What I'm Watching Closely
The architecture looks strong on paper, but real-world usage will tell the full story. Based on my GSD 1.x experience, here's what I'll be testing first:
Token consumption. GSD 1.x already uses significantly more tokens than unstructured coding. Fresh subagent contexts, parallel research agents, verification loops — it adds up. GSD 2.0 adds more layers (anchor pruning hooks, fractal summary generation, boundary map verification). I'll be tracking whether the efficiency gains from fewer wasted tokens offset the overhead of the new infrastructure.
Claude Code dependency. GSD 1.x expanded to OpenCode, Gemini CLI, and Codex, but the architecture was designed for Claude Code's agent and skill system. GSD 2.0's deterministic tools (gsd_manage, gsd_verify) could either deepen this dependency or abstract it away. If your team uses Cursor or Copilot, Spec Kit or OpenSpec remain better options.
The discuss phase stability. In GSD 1.x since v1.22.0, /gsd:discuss-phase sometimes auto-answers its own questions instead of waiting for user input. It's the most complained-about issue on GitHub. If 2.0 doesn't fix this, the enhanced discuss phase described in the article is undermined from day one.
Subagent context gap. In 1.x, subagents may not receive the project CLAUDE.md, making coding guidelines and security instructions invisible to executor agents. I'm hoping 2.0's context injection system solves this, but it wasn't mentioned in the article.
Boundary maps in practice. The concept is compelling. But will the produces/consumes declarations add too much upfront planning for smaller projects? GSD 1.x already walks the line between useful structure and excessive ceremony. I'll be testing whether boundary maps tip that balance.
The Bottom Line
GSD 2.0's architecture stands apart because it treats reliability as an engineering problem, not a prompting problem. Anchor pruning, fractal summaries, deterministic tools, boundary maps, and the continue-here mechanism are all infrastructure solutions to problems that better prompts can't fix.
Having used GSD 1.x across real projects, I can map every architectural decision in this article to a concrete pain point I've experienced. Context rot degrading long sessions. Lost decisions after compaction. Integration bugs from implicit contracts between phases. Summaries that compressed away critical details. GSD 2.0 addresses each of these with engineering, not instructions.
The promise: an agent that gets a fresh context for every task, never wastes tokens on mechanical operations, produces verifiable outcomes, survives interruptions, and maintains clean git history. All backed by markdown files on disk. No database. No external service. Just files and git.
I'll be migrating my workflow to 2.0 the moment it drops and writing about the real-world results. If you're new to GSD, start with my Spec-Driven Development guide for the methodology and my SDD tools comparison to understand the trade-offs against Spec Kit and OpenSpec. Check out my tools and stack for the full setup.
Planning to try GSD 2.0? Reach out — I'll be publishing follow-up results once I've shipped a real project with it.