A Week of Spec-Driven Development With GSD
Most posts about spec-driven development with GSD explain the architecture. This one does the opposite: a week of actually shipping with it on a production codebase, written like a diary. Real numbers. Real mistakes. Real phase plans.
The project is this portfolio site. Last week I had four pending items: a blog implementation, a contact form with rate limiting, a GitHub contribution heatmap, and a particle-based hero animation. I ran each through GSD's full workflow and tracked the time, token usage, and friction points. This is what actually happened.
If you want the methodology first, start with what is spec-driven development. For the comparison against Spec Kit and OpenSpec, I wrote that post separately.
The Setup
- Project: azanello.com (this site) — Next.js 16, React 19, Tailwind v4
- Agent: Claude Code with Sonnet 4.6 as the primary model
- GSD version: Latest as of 2026-04-12
- Goal: Four features shipped, each with its own phase, in a single working week
- Measurement: Wall-clock time, tokens consumed, number of commits, verification outcomes
I track this stuff because "it felt fast" is not useful data. If the tool claims orchestration value, the value should show up in the numbers.
Day 1 — Blog Infrastructure
Phase goal: Ship a working blog with listing, detail pages, SEO, and sitemap integration.
The day started at /gsd:discuss-phase. GSD's discuss mode asks adaptive questions about the phase before committing to a plan. It wanted to know:
- Markdown or MDX?
- Where do post files live?
- What metadata goes in frontmatter?
- Do we need tags, categories, or both?
- Sitemap integration?
- OG image generation per post?
I answered each one in chat. Ten minutes. The discuss phase wrote its output to .planning/001-blog/DISCUSS.md. I read it, caught one thing I'd misspoken, edited inline, moved on.
Then /gsd:plan-phase. The planner spawned a phase researcher to check the Next.js 16 App Router docs (via mcp__context7__*), then produced PLAN.md with 11 tasks organized into 3 slices:
- Content infrastructure: src/content/blog/, src/lib/blog.ts, gray-matter parsing
- Routes: src/app/blog/page.tsx, src/app/blog/[slug]/page.tsx
- SEO: sitemap entries, BlogPosting JSON-LD, OG image generation
The plan was 487 lines. Heavier than I'd have written by hand. Most of the weight was in the verification section — specific truths each slice had to satisfy before it could merge.
/gsd:execute-phase kicked off execution. Wave 1 ran all three slice-0 tasks in parallel. Each task ran in a fresh subagent with its own context. I watched in the Claude Code UI — three concurrent executions, each atomic-committing on success.
Total day 1:
- Elapsed: 2h 40m (50 min planning, 1h 50m execution)
- Commits: 11 (one per task)
- Tokens: ~420k
- Verification: failed once (missing datePublished), passed on rerun
The verification phase caught one issue automatically — the sitemap entry was present but the BlogPosting JSON-LD was missing a required datePublished field. GSD flagged it, I fixed it in a follow-up commit, verification re-ran clean.
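A guard for that missing field fits in a few lines. What follows is a minimal sketch, not the site's actual code: the `PostMeta` shape, the `buildBlogPostingJsonLd` helper, and the URL layout are all assumptions for illustration. The idea is to refuse to emit JSON-LD at all when the publication date is missing or unparseable, so the gap fails loudly instead of shipping silently.

```typescript
// Hypothetical post metadata; field names are assumptions, not the real frontmatter.
interface PostMeta {
  title: string;
  description: string;
  slug: string;
  date: string; // ISO 8601 date, e.g. "2026-04-06"
}

interface BlogPostingJsonLd {
  "@context": "https://schema.org";
  "@type": "BlogPosting";
  headline: string;
  description: string;
  datePublished: string;
  url: string;
}

// Builds the structured data, throwing if datePublished cannot be produced.
function buildBlogPostingJsonLd(post: PostMeta, baseUrl: string): BlogPostingJsonLd {
  const published = new Date(post.date);
  if (Number.isNaN(published.getTime())) {
    throw new Error(`Post "${post.slug}" has no valid date; datePublished would be missing`);
  }
  return {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    headline: post.title,
    description: post.description,
    datePublished: published.toISOString(),
    url: `${baseUrl}/blog/${post.slug}`,
  };
}
```

Throwing at build time is one design choice among several; logging a warning and skipping the script tag would also work, at the cost of making the omission easier to miss.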
Day 2 — Contact Form With Rate Limiting
Phase goal: Contact form with Resend email delivery, server-side validation, and IP-based rate limiting.
This feature is smaller. I ran /gsd:quick instead of the full workflow — GSD's quick mode skips the discuss phase and the plan-checker, but keeps atomic commits and verification. Good fit for features you've mentally scoped already.
/gsd:quick Contact form with Resend, Zod validation, and an in-memory IP rate limiter (5 req/min)
Quick mode produced a mini-plan (127 lines) and executed in one pass. Five tasks, five commits, about 90 minutes end-to-end. The rate limiter was the only piece that needed a second iteration — the first implementation stored the counter in module scope, which doesn't persist across Vercel serverless invocations. Verification caught it by running a curl loop and observing the counter reset.
// What I originally got
const counts = new Map<string, number>()
// What verification forced me to write
// (documented clearly as "best-effort per-instance only")
const counts = new Map<string, { count: number; resetAt: number }>()
// ...with acknowledgment in comments that serverless
// horizontal scaling invalidates this entirely.
The honest note in comments — "this is per-instance only" — was GSD's idea. Its verification agent insisted that if a limitation exists, it has to be visible in the code. That kind of nit is exactly what I want an automated check to enforce.
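For completeness, here is roughly what a fixed-window limiter built on that second `Map` shape could look like. Treat it as a sketch under assumptions: the `allowRequest` name, the constants, and the injectable `now` parameter are mine for illustration, and the per-instance caveat still applies in full.

```typescript
// Best-effort, per-instance only: every serverless instance holds its own Map,
// so horizontal scaling or a cold start resets these counters.
const WINDOW_MS = 60_000; // 1 minute window
const MAX_REQUESTS = 5; // 5 req/min, as in the quick-mode prompt

const counts = new Map<string, { count: number; resetAt: number }>();

// Returns true if the request from `ip` is allowed.
// `now` is injectable so the window logic can be tested without waiting.
function allowRequest(ip: string, now: number = Date.now()): boolean {
  const entry = counts.get(ip);

  // No entry yet, or the previous window has expired: start a new window.
  if (!entry || now >= entry.resetAt) {
    counts.set(ip, { count: 1, resetAt: now + WINDOW_MS });
    return true;
  }

  // Still inside the window and already at the limit.
  if (entry.count >= MAX_REQUESTS) {
    return false;
  }

  entry.count += 1;
  return true;
}
```

A shared store would be needed for a limit that holds across instances; this version only slows down a client that keeps hitting the same warm instance.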
Day 2 totals:
- Elapsed: 1h 35m
- Commits: 6 (5 tasks + 1 fix)
- Tokens: ~210k
- Verification: failed once, passed on second run
Day 3 — GitHub Contribution Heatmap
Phase goal: Pull GitHub contribution data via GraphQL, render a year-view heatmap on the GitHub card, cache for 1 hour via ISR.
This phase was instructive because it broke in an interesting way. The plan looked clean — four tasks, GraphQL query, server component for the fetch, client component for the heatmap, cache revalidation tag. Execution started cleanly. Task 1 (the GraphQL query in src/lib/github.ts) passed verification. Task 2 (the server component) passed. Task 3 (the heatmap component) failed.
The failure was subtle. The heatmap rendered correctly, but the verification agent ran it against a specific truth from the plan: "the week-of-year axis must align to ISO calendar weeks." It didn't. The agent had defaulted to US calendar weeks (Sunday-start) when ISO weeks are Monday-start. That's a real bug — non-US users would see weekdays shifted by one.
I wouldn't have caught this in review. It looked fine in the screenshot. The verification agent caught it because the spec explicitly said ISO. This is the kind of moment that sells me on goal-backward verification — checking outcomes against declared truths, not just "does it compile."
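The Sunday-versus-Monday difference comes down to two small calculations. The sketch below is my own illustration, not the code that shipped: `mondayFirstIndex` and `isoWeek` are hypothetical helper names, and everything is computed in UTC to keep the example deterministic.

```typescript
// Row index for a heatmap whose weeks start on Monday (ISO 8601):
// Monday = 0 ... Sunday = 6. Date#getUTCDay() returns Sunday = 0.
function mondayFirstIndex(date: Date): number {
  return (date.getUTCDay() + 6) % 7;
}

// ISO 8601 week number. A week belongs to the year that contains its Thursday,
// so dates in early January can fall in week 52 or 53 of the previous year.
function isoWeek(date: Date): { year: number; week: number } {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const isoDay = d.getUTCDay() || 7; // Monday = 1 ... Sunday = 7
  d.setUTCDate(d.getUTCDate() + 4 - isoDay); // shift to the Thursday of this ISO week
  const yearStart = Date.UTC(d.getUTCFullYear(), 0, 1);
  const dayOfYear = (d.getTime() - yearStart) / 86_400_000 + 1;
  return { year: d.getUTCFullYear(), week: Math.ceil(dayOfYear / 7) };
}

const sunday = new Date(Date.UTC(2021, 0, 3)); // Sunday, 3 January 2021
console.log(mondayFirstIndex(sunday)); // 6 (last row, not first)
console.log(isoWeek(sunday)); // { year: 2020, week: 53 }
```

With a Sunday-start grid the same date would land in row 0 of a new column, which is the off-by-one shift the verification truth was written to catch.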
Day 3 totals:
- Elapsed: 2h 10m (30m re-planning mid-execution)
- Commits: 7 (4 tasks + 3 fixes across task 3)
- Tokens: ~340k
- Verification: failed twice, passed on third run
Day 4 — Particle Hero Canvas
Phase goal: Canvas-based particle system for the hero background. No library, no WebGL. Just Canvas 2D with requestAnimationFrame.
I spent the morning writing the phase spec by hand before running /gsd:plan-phase. This feature had visual requirements that are hard to capture in elicitation — I knew what I wanted from looking at reference sites. Better to write it down than try to answer 15 elicitation questions.
The spec was 180 lines. Key truths:
- 60fps target on M1 MacBook Air
- Paused when prefers-reduced-motion: reduce is set
- Canvas reinitializes on window resize (debounced 100ms)
- Particles respect the navy + warm gold color palette
GSD planned six tasks. Execution was the smoothest of the week: each task produced a commit, and verification passed clean on every slice but one. The reduced-motion check was the only surprise. The verification agent ran a Playwright script that set prefers-reduced-motion: reduce and asserted that requestAnimationFrame was not being called. The assertion failed because I had forgotten the check. The task re-ran and passed.
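The reduced-motion guard itself is small. Here is a sketch of one way to structure it so the "no frames scheduled" property is easy to assert. The `AnimationEnv` interface and `startParticleLoop` function are hypothetical names for illustration, not the component that shipped. The browser APIs are injected rather than called directly so the logic can be exercised outside a browser.

```typescript
// Minimal environment the loop needs. In a browser this could be wired as:
//   prefersReducedMotion: () => window.matchMedia("(prefers-reduced-motion: reduce)").matches
//   requestFrame: (cb) => window.requestAnimationFrame(cb)
//   cancelFrame: (id) => window.cancelAnimationFrame(id)
interface AnimationEnv {
  prefersReducedMotion: () => boolean;
  requestFrame: (cb: (time: number) => void) => number;
  cancelFrame: (id: number) => void;
}

// Starts the animation loop unless reduced motion is requested.
// Returns a function that stops the loop.
function startParticleLoop(env: AnimationEnv, drawFrame: (time: number) => void): () => void {
  if (env.prefersReducedMotion()) {
    return () => {}; // never schedule a single frame
  }

  let frameId = 0;
  let stopped = false;

  const tick = (time: number): void => {
    if (stopped) return;
    drawFrame(time);
    frameId = env.requestFrame(tick);
  };

  frameId = env.requestFrame(tick);

  return () => {
    stopped = true;
    env.cancelFrame(frameId);
  };
}
```

This sketch only checks the preference at start; reacting to a preference change while the page is open would need a listener on the media query as well.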
Day 4 totals:
- Elapsed: 2h 25m
- Commits: 7
- Tokens: ~280k
- Verification: failed once (the reduced-motion case), passed on rerun
Day 5 — The Friction Day
Phase goal: Performance pass on the bento grid — Lighthouse score to 95+, LCP under 1.5s, CLS under 0.1.
Performance phases are where GSD's spec-driven approach gets awkward. The "spec" for a performance pass isn't a feature description — it's a set of measurement targets. GSD's discuss phase wanted to elicit scope details that didn't apply. I ended up manually writing the plan because the elicitation wasn't productive.
This is a known limitation. SDD fits feature work cleanly. It fits refactoring and performance work less cleanly — you're not building a thing, you're tightening a thing. The verification side is fine (measure Lighthouse, assert thresholds). The planning side drags.
My workaround: skip /gsd:discuss-phase, run /gsd:plan-phase with a pre-written spec, execute normally. It worked, but it felt like I was working around the tool.
Day 5 totals:
- Elapsed: 3h 5m (including an hour of rework the tool didn't help with)
- Commits: 9
- Tokens: ~450k
- Verification: passed
- Result: Lighthouse 96, LCP 1.3s, CLS 0.04
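The verification side of a performance phase really is just threshold assertions. Below is a sketch of that check using the targets from the phase goal. The `PerfMetrics` shape and `checkPerfTargets` name are assumptions for illustration; how the numbers get collected (Lighthouse CLI, CI, or otherwise) is left out on purpose.

```typescript
interface PerfMetrics {
  lighthouseScore: number; // 0-100
  lcpSeconds: number;
  cls: number;
}

// Targets from the phase goal: Lighthouse 95+, LCP under 1.5s, CLS under 0.1.
const TARGETS = { minLighthouse: 95, maxLcpSeconds: 1.5, maxCls: 0.1 };

// Returns a list of human-readable failures; an empty list means all targets are met.
function checkPerfTargets(m: PerfMetrics): string[] {
  const failures: string[] = [];
  if (m.lighthouseScore < TARGETS.minLighthouse) {
    failures.push(`Lighthouse ${m.lighthouseScore} is below ${TARGETS.minLighthouse}`);
  }
  if (m.lcpSeconds >= TARGETS.maxLcpSeconds) {
    failures.push(`LCP ${m.lcpSeconds}s is not under ${TARGETS.maxLcpSeconds}s`);
  }
  if (m.cls >= TARGETS.maxCls) {
    failures.push(`CLS ${m.cls} is not under ${TARGETS.maxCls}`);
  }
  return failures;
}

// The Day 5 result, checked against the targets.
const day5 = checkPerfTargets({ lighthouseScore: 96, lcpSeconds: 1.3, cls: 0.04 });
console.log(day5.length === 0 ? "all targets met" : day5.join("\n")); // all targets met
```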
Day 6 — Cleanup and Review
Friday I didn't plan a new phase. I ran /gsd:audit-milestone against the week's work — GSD's milestone audit checks that every phase in the milestone actually delivered what its spec promised, no silent drift between plan and shipped code.
The audit flagged one gap. The blog phase's plan called for a "Related Posts" section on each blog detail page. I'd shipped the blog without it because I forgot — the plan got buried under execution, and the verification truths I wrote didn't include a specific check for related posts.
This is a failure mode of SDD: if the verification truths don't match the full plan scope, things fall through the cracks. GSD's milestone audit exists specifically to catch this. I logged a follow-up task for the next milestone instead of shipping a half-baked related-posts section at 4pm on a Friday.
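The gap is easy to picture as a set difference between what the plan promised and what the verification truths actually check. The sketch below is purely illustrative: GSD's real plan and audit formats are not shown in this post, so `PlanItem`, `VerificationTruth`, and `findUncoveredItems` are hypothetical names and shapes.

```typescript
// Hypothetical shapes for illustration only.
interface PlanItem {
  id: string;
  description: string;
}

interface VerificationTruth {
  statement: string;
  covers: string[]; // ids of the plan items this truth checks
}

// Returns every plan item that no verification truth covers.
function findUncoveredItems(plan: PlanItem[], truths: VerificationTruth[]): PlanItem[] {
  const covered = new Set<string>();
  for (const truth of truths) {
    for (const id of truth.covers) {
      covered.add(id);
    }
  }
  return plan.filter((item) => !covered.has(item.id));
}

const blogPlan: PlanItem[] = [
  { id: "listing", description: "Blog listing page" },
  { id: "detail", description: "Blog detail page" },
  { id: "related", description: "Related Posts section on each detail page" },
];

const blogTruths: VerificationTruth[] = [
  { statement: "Listing page renders all posts", covers: ["listing"] },
  { statement: "Detail page renders a post by slug", covers: ["detail"] },
];

console.log(findUncoveredItems(blogPlan, blogTruths).map((item) => item.id)); // [ 'related' ]
```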
What The Numbers Say
Across four phases plus a performance pass, one week total:
| Phase | Elapsed | Commits | Tokens (approx) | Verification |
|---|---|---|---|---|
| Blog infrastructure | 2h 40m | 11 | 420k | 1 rerun |
| Contact form | 1h 35m | 6 | 210k | 1 rerun |
| GitHub heatmap | 2h 10m | 7 | 340k | 2 reruns |
| Particle canvas | 2h 25m | 7 | 280k | 1 rerun |
| Performance pass | 3h 5m | 9 | 450k | clean |
| Week totals | 11h 55m | 40 | ~1.7M | 5 reruns |
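The totals row is plain arithmetic over the five rows above it. A quick sketch to reproduce it, with the per-phase figures copied from the table (the `phases` array and field names are just my encoding of those rows):

```typescript
// Per-phase figures copied from the table; tokens are in thousands.
const phases = [
  { name: "Blog infrastructure", minutes: 2 * 60 + 40, commits: 11, tokensK: 420, reruns: 1 },
  { name: "Contact form", minutes: 1 * 60 + 35, commits: 6, tokensK: 210, reruns: 1 },
  { name: "GitHub heatmap", minutes: 2 * 60 + 10, commits: 7, tokensK: 340, reruns: 2 },
  { name: "Particle canvas", minutes: 2 * 60 + 25, commits: 7, tokensK: 280, reruns: 1 },
  { name: "Performance pass", minutes: 3 * 60 + 5, commits: 9, tokensK: 450, reruns: 0 },
];

const totals = phases.reduce(
  (acc, p) => ({
    minutes: acc.minutes + p.minutes,
    commits: acc.commits + p.commits,
    tokensK: acc.tokensK + p.tokensK,
    reruns: acc.reruns + p.reruns,
  }),
  { minutes: 0, commits: 0, tokensK: 0, reruns: 0 },
);

const elapsed = `${Math.floor(totals.minutes / 60)}h ${totals.minutes % 60}m`;
console.log(elapsed, totals.commits, totals.tokensK, totals.reruns); // 11h 55m 40 1700 5
```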
Observations:
- Tokens are real cost. At current Sonnet 4.6 pricing, 1.7M tokens is around $12–14 for the week. Reasonable for solo work. Would be meaningful on a team.
- Verification reruns are features, not bugs. Every rerun caught a real issue that would have shipped otherwise. The ISO-week bug, the missing reduced-motion check, the rate-limiter persistence — all three would have made it to production under an un-verified workflow.
- The performance phase showed GSD's limits. SDD is optimized for feature work. Refactoring and performance passes fit awkwardly. This is a tool-shape problem, not a user-error problem.
- 40 commits across five phases is a lot. That's the atomic-commit discipline. Each task is its own commit, which means reviewable history but a long git log. Squash-merge handles it at the phase boundary.
What Would Have Been Different Without GSD
Without GSD, I'd have shipped fewer features in the same week. Probably three instead of five. Two reasons:
- Context rot would have eaten me on the blog phase. Eleven tasks in one Claude Code session is exactly the range where context degrades. Fresh subagents per task made the blog ship cleanly.
- I'd have skipped verification entirely on half of these. When I ship solo, verification is the first thing I cut. GSD bakes it into the workflow such that skipping it is harder than running it.
What I'd have gained without GSD:
- An hour of planning time per phase back. GSD's discuss + plan steps cost ~40–60 minutes each. On smaller work that tax is unnecessary.
- Flexibility on agent choice. Not relevant to me (I'm Claude Code primary) but relevant to teams.
Net: worth it on this project. Would be worth it on every project I run where Claude Code is the primary agent and features are more than 3–4 tasks deep.
The Honest Summary
SDD with GSD isn't magic. It's a working agreement with yourself — I will describe before I build, and I will verify after I build, every time. The tool enforces the discipline so you don't have to. The discipline is what actually produces reliable output.
I shipped four features plus a performance pass in a working week, with verification catching three real bugs before they shipped and one gap at milestone audit. None of this is heroic. It's just boringly steady.
For the deeper architectural reasons behind GSD's design choices, see the architecture deep-dive. For whether to pick GSD over OpenSpec or Spec Kit, see the three-tool comparison. And the full development setup I run alongside GSD lives on the tools page.
Running GSD on a real project or considering it? Reach out — always trading notes on what shipping with SDD actually feels like.