How to Ship Code with AI: The gstack Way
Shipping code with AI using gstack means a structured Ship phase — sync, test, push, PR, deploy, verify — that prevents common post-deploy failures.
Shipping code with AI is where most teams lose the gains they made during development. The build was thorough, the review was rigorous — and then the deploy happens fast and loose, the verification is skipped, and the production incident shows up 20 minutes later. gstack's Ship phase exists to close that gap.
Here's how the gstack Ship workflow works, what each step does, and why it's structured the way it is.
The Ship Phase in Context#
Before diving into Ship, it's worth understanding where it fits in gstack's full workflow.
By the time code reaches Ship, it has passed through:
- Think: Problem reframed, design document produced
- Plan: Architecture locked, design approved
- Build: Implementation complete against locked plan
- Review: Staff engineer review, all findings auto-fixed
- Test QA: QA lead has tested the running application
- Test Bench: Performance validated
Ship is the last gate before code reaches users. The job is not to add new review — it's to execute a specific, invariant checklist that ensures the code that passed all the prior phases actually gets to production intact.
The Ship Checklist#
gstack's Ship role is a Release Engineer running a non-negotiable sequence:
1. Sync with Main#
```shell
git fetch origin
git checkout main
git pull origin main
git checkout feature-branch
git rebase main
```

Feature branches drift. Code that tested cleanly against a 3-day-old main might conflict with what landed since then. Sync before anything else.
If there are conflicts, resolve them. Don't merge a stale branch and hope for the best.
2. Run the Full Test Suite#
```shell
npm test -- --coverage
```

Not just the tests related to the feature. The full suite. Every test. If something else broke during the feature build — an integration test, a utility function, a shared component — you want to know now, not after deploy.
What counts as passing: All tests green. Coverage at or above the project threshold. No skipped tests that were passing before.
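One cheap guard for the "no skipped tests that were passing before" rule is to grep for skip markers before shipping. A minimal sketch, assuming a Jest-style suite — the patterns and directory layout are assumptions, so adjust them to your framework:

```shell
#!/bin/sh
# check_skipped DIR: fail if Jest-style skip markers appear under DIR.
# The marker patterns (it.skip, xit, xdescribe) assume a Jest-style
# test suite; adjust for your framework.
check_skipped() {
  if grep -rnE "(describe|it|test)\.skip\(|xit\(|xdescribe\(" "$1" 2>/dev/null; then
    echo "Skipped tests found -- review before shipping" >&2
    return 1
  fi
  echo "No skipped tests"
}
```

Run it as `check_skipped src` (or wherever your tests live) as part of the Ship checklist; a non-zero exit flags the branch for human review.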
3. Audit Coverage#
```shell
npm test -- --coverage --coverageThreshold='{"global":{"branches":80,"functions":80,"lines":80,"statements":80}}'
```

New code should have test coverage. If the feature added 500 lines and tests only cover 200 of them, that's a flag. The Ship role doesn't automatically block for low coverage — it reports the coverage delta and flags for human review if coverage drops significantly.
4. Check for Debug Code and TODOs#
```shell
grep -rn "TODO\|FIXME\|console\.log\|debugger\|breakpoint" src/ --include="*.ts" --include="*.tsx" --include="*.js"
```

TODOs that should have been resolved during Build, debug statements that made it through Review — these happen. Ship catches them. Any production-bound code with a TODO in it gets flagged. The Release Engineer decides whether to resolve it or document it as intentional.
5. Verify Environment Configs#
- Are all required environment variables documented?
- Are any new secrets needed in production?
- Are there any hardcoded development values?
- Does the deployment configuration match the requirements in the architecture doc?
Environment config failures are the most common cause of "it worked on my machine" deploys. Check explicitly.
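The explicit check can be partially scripted. A minimal sketch, assuming the repo keeps a `.env.example` file listing required variables (a common convention, not a gstack requirement); it treats a set-but-empty variable as missing:

```shell
#!/bin/sh
# check_env_vars FILE: verify every key in FILE (KEY=... per line,
# .env.example-style) is set and non-empty in the current environment.
# Comment lines and blank lines are skipped.
check_env_vars() {
  missing=""
  while IFS='=' read -r key _; do
    case "$key" in ''|'#'*) continue ;; esac
    [ -n "$(printenv "$key")" ] || missing="$missing $key"
  done < "$1"
  if [ -n "$missing" ]; then
    echo "Missing:$missing"
    return 1
  fi
  echo "OK"
}
```

Run `check_env_vars .env.example` in the target environment (or against the deploy platform's exported config) before triggering the deploy.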
6. Push Branch#
```shell
git push origin feature-branch
```

Straightforward, but the safety tools matter here. The careful tool in gstack's safety system watches for force-push attempts. A git push --force to main would be catastrophic. careful blocks it with an explicit confirmation prompt.
7. Open the PR#
The PR description should be complete:
- What problem does this solve? (link to the design doc)
- What changed? (architecture, data model, user-visible behavior)
- How was it tested? (QA scenarios run, performance results)
- How to verify in production? (specific smoke tests)
- Any follow-up items? (scope decisions made during Build)
A complete PR description turns future debugging into a 5-minute exercise instead of a 2-hour archeological dig.
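One way to make the description complete by construction is to generate it from a template that mirrors the checklist above. A sketch — the file name and section wording are illustrative, not an official gstack template:

```shell
#!/bin/sh
# Write a PR body template covering each checklist question.
# Section wording is illustrative; fill in the real answers before opening the PR.
cat > pr-body.md <<'EOF'
## What problem does this solve?
Link to the design doc.

## What changed?
Architecture, data model, user-visible behavior.

## How was it tested?
QA scenarios run, performance results.

## How to verify in production?
Specific smoke tests.

## Follow-up items
Scope decisions made during Build.
EOF
echo "PR body written to pr-body.md"
```

Then open the PR from it, e.g. `gh pr create --base main --title "Your title" --body-file pr-body.md`.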
The Deploy Phase#
Ship opens the PR. Deploy closes the loop.
1. Merge after CI passes#
```shell
# After CI green
gh pr merge --squash --delete-branch
```

Never merge while CI is failing. If CI is consistently flaky, fix the tests before merging. "CI is flaky, it'll probably be fine" is how production incidents start.
2. Trigger Deployment#
Deployment mechanism depends on your stack, but gstack's Deploy role:
- Confirms the deployment pipeline triggered correctly
- Waits for deployment to complete (doesn't assume it worked)
- Verifies the deployed version matches the expected commit
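The version check can be scripted. A sketch that assumes your health endpoint reports the build's commit SHA — the endpoint shape, URL, and jq path are assumptions to adapt to your service:

```shell
#!/bin/sh
# verify_deploy EXPECTED_SHA DEPLOYED_SHA: succeed only if the deployed
# build matches the commit that was merged.
verify_deploy() {
  if [ "$1" = "$2" ]; then
    echo "Deploy verified: $2"
  else
    echo "Version mismatch: expected $1, got $2" >&2
    return 1
  fi
}

# Wiring it up (assumed health endpoint exposing a .commit field):
# expected=$(git rev-parse --short HEAD)
# deployed=$(curl -fsS https://api.yourdomain.com/health | jq -r '.commit')
# verify_deploy "$expected" "$deployed"
```

A mismatch usually means the pipeline deployed a stale artifact or the trigger silently failed — exactly the cases "doesn't assume it worked" is meant to catch.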
3. Post-Deploy Health Check#
```shell
# Example: verify the deployment is running the expected version
curl https://api.yourdomain.com/health | jq '.version'

# Verify error rate hasn't spiked
# Check your monitoring dashboard

# Verify key database connections

# Run smoke tests
npm run e2e:smoke
```

The health check is non-negotiable. Don't close the deploy without confirming:
- Error rate is within normal range
- Response times haven't degraded
- Key flows work in production
The Monitor Phase#
gstack adds a Monitor phase after Deploy — an SRE role that watches production for 15–30 minutes post-deploy.
What to watch:
- Error rate: baseline ± 20%
- P99 response time: baseline ± 30%
- Database connection pool: below 80% utilization
- Memory: no unexpected growth trend
- CPU: no unexpected spike
Most deployment-induced incidents are visible within 20 minutes. If you catch them here, you can roll back before users notice. If you close the deploy loop and walk away, you find out via Slack at midnight.
Set a timer. Watch your dashboards. Only close the loop when monitoring looks clean.
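The threshold checks above all reduce to one band comparison. A minimal sketch with illustrative baseline numbers; in practice you'd wire the inputs to your monitoring API:

```shell
#!/bin/sh
# within_band CURRENT BASELINE TOLERANCE_PCT: succeed if CURRENT falls
# inside baseline +/- tolerance percent. awk handles the float math.
within_band() {
  awk -v cur="$1" -v base="$2" -v tol="$3" 'BEGIN {
    lo = base * (1 - tol / 100)
    hi = base * (1 + tol / 100)
    exit !(cur >= lo && cur <= hi)
  }'
}

# Illustrative values -- replace with readings from your dashboards:
within_band 0.55 0.5 20 && echo "error rate OK"  || echo "error rate drifted"
within_band 260  180 30 && echo "p99 OK"         || echo "p99 drifted"
```

Run it on a loop during the 15–30 minute watch window; any metric that leaves its band is a rollback conversation, not a Slack message at midnight.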
What Goes Wrong Without This Process#
The anti-pattern is common: AI builds good code, review passes, tests pass, and then the ship process is "commit, push to main, trigger deploy, check Slack for errors."
Here's what that process misses:
Stale branch sync — The feature was tested against a 3-day-old main. Something else landed since then that conflicts. The conflict isn't caught until production users hit it.
Skipping the full test suite — "The feature tests pass" is not "the test suite passes." A utility function change that breaks three other features goes to production.
No environment config audit — A new secret wasn't added to the production environment. The feature silently fails for all users.
No post-deploy verification — The deployment completed, the CI turned green, and the error rate spiked immediately because of a configuration mismatch that wasn't caught by tests. Nobody was watching.
None of these are exotic failure modes. They're the most common production incidents in codebases of any size.
Integration with AI Development Tools#
The Ship phase is fully automatable with AI. gstack's 18-role design means the Ship role can be instructed to run the full checklist and report results — flagging issues for human review rather than blocking automatically.
Typical Ship invocation:
```shell
# DenchClaw running gstack Ship role
denchclaw gstack ship --branch feature/login-refactor --target main
```

Output:

```
✅ Synced with main (no conflicts)
✅ Test suite: 847 tests, 0 failures
⚠️ Coverage: 76% (threshold: 80%) — 3 uncovered functions flagged
✅ No debug code or unresolved TODOs
✅ Environment config verified
⚠️ New env var required: FEATURE_LOGIN_V2=true
PR opened: https://github.com/org/repo/pull/1204
```
Human reviews the two warnings. Decides to add missing tests or explicitly accept the coverage gap. Adds the env var to production secrets. Approves.
This is the gstack approach to shipping — AI does the systematic work, humans make the judgment calls.
FAQ#
Q: How long does the Ship phase take? For a medium-complexity feature: 10–15 minutes for the automated checklist, 5–10 minutes of human review of flagged items. Significantly faster than manual review, and more thorough.
Q: Should Ship block if tests fail? Yes. A failing test suite means the code is not ready to ship. Either the test is wrong (and should be fixed) or the code is wrong (and should be fixed); neither case supports shipping.
Q: What if the CI is consistently flaky? Fix the flakiness before it affects ship reliability. Flaky CI creates pressure to "just merge through it" — and that pressure is where discipline breaks down. The Ship phase's value depends on CI being reliable.
Q: How does this work for solo developers? The same way. The checklist is the checklist. The difference between a team and a solo developer isn't the process — it's who reviews the flagged items. Solo developers are both the Release Engineer and the human reviewer.
Q: What about hotfixes that need to ship immediately? Hotfixes should still run the test suite and post-deploy verification. What they can skip: full QA and performance testing. What they cannot skip: tests, sync with main, and post-deploy monitoring. Speed is not a reason to skip the steps that prevent making the incident worse.
Ready to try DenchClaw? Install in one command: npx denchclaw. Full setup guide →
