Verification-First Development
“Give Claude a way to verify its output. Once you do that, Claude will iterate until the result is great.” — Boris Cherny, creator of Claude Code
This is Boris’s single most important principle. Everything else — parallel execution, automation, voice coding — depends on it working.
What Verification-First Means
Most developers use Claude in a request-response loop: ask Claude to write code, read what it produces, decide if it’s correct, give feedback. This works, but it makes you the verifier. You become the bottleneck.
Verification-first inverts this. You give Claude a machine-readable signal — a test suite, a build command, a linting script, a screenshot diff — and Claude uses that signal to assess its own output. When the signal is red, Claude fixes the code and tries again. When the signal is green, the task is done.
You are no longer the bottleneck. Claude iterates on its own until it passes.
    Without verification-first:
    You write task → Claude writes code → You read code → You say "not quite" → Claude rewrites → You read again → repeat

    With verification-first:
    You write task + verification command → Claude writes code → Claude runs tests → Tests fail → Claude reads failure → Claude fixes → Claude runs tests → Tests pass → Done

The difference compounds on long tasks. A feature that requires 8 rounds of human feedback becomes 8 rounds of automated feedback happening in seconds, not hours.
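The verification-first flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Claude Code is implemented: `run_check`, `iterate_until_green`, and the `attempt_fix` callback are hypothetical names, and `attempt_fix` stands in for Claude reading the failure output and editing the code.

```python
import subprocess
from typing import Callable


def run_check(command: list[str]) -> tuple[bool, str]:
    """Run a verification command and return (passed, combined output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def iterate_until_green(
    command: list[str],
    attempt_fix: Callable[[str], None],
    max_rounds: int = 8,
) -> bool:
    """Re-run the check, handing each failure to attempt_fix, until it passes.

    attempt_fix is a hypothetical stand-in for the agent's "read failure,
    edit code" step. Returns True as soon as the command exits with 0,
    and False if it is still failing after max_rounds fix attempts.
    """
    for _ in range(max_rounds):
        passed, output = run_check(command)
        if passed:
            return True
        attempt_fix(output)
    passed, _ = run_check(command)
    return passed
```

The only judgment in the loop is the exit code of the command, which is why a test suite or build command can stand in for a human reviewer.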
Why It Works
Claude is excellent at reading structured error output and mapping it back to source code. A failing test with a stack trace, a TypeScript error with a line number, a Playwright assertion with a screenshot diff — these are precise, unambiguous signals. Claude can act on them immediately without needing to ask you what went wrong.
Natural language feedback (“that’s not quite right”) forces Claude to guess what you mean. Structured failure output removes the guessing.
The deeper reason: Claude’s effective context window is finite. Spending tokens on human-mediated clarification is expensive. Spending tokens on reading a stack trace and fixing it is cheap and reliable.
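To make "structured" concrete, here is a small sketch of how a compiler error line maps back to a source location. It assumes the plain (non-pretty) `tsc` diagnostic format, `file(line,col): error TSxxxx: message`; the `Diagnostic` class and `parse_tsc_output` function are hypothetical names for illustration.

```python
import re
from dataclasses import dataclass

# Matches plain tsc diagnostics such as:
#   src/auth.ts(12,5): error TS2322: Type 'string' is not assignable to type 'number'.
TSC_ERROR = re.compile(
    r"^(?P<file>.+?)\((?P<line>\d+),(?P<col>\d+)\): "
    r"error (?P<code>TS\d+): (?P<message>.*)$"
)


@dataclass
class Diagnostic:
    file: str
    line: int
    column: int
    code: str
    message: str


def parse_tsc_output(output: str) -> list[Diagnostic]:
    """Extract file, position, code, and message from each error line."""
    diagnostics = []
    for raw in output.splitlines():
        match = TSC_ERROR.match(raw.strip())
        if match:
            diagnostics.append(
                Diagnostic(
                    file=match["file"],
                    line=int(match["line"]),
                    column=int(match["col"]),
                    code=match["code"],
                    message=match["message"],
                )
            )
    return diagnostics
```

Every field needed to act on the error is already in the line. A comment like “that’s not quite right” carries none of them.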
How to Set It Up
Step 1: Identify your verification command
Every project has at least one. Use what already exists.
    # Unit tests
    npm test
    pytest
    go test ./...

    # Build verification
    npm run build
    tsc --noEmit
    cargo build

    # Linting
    npm run lint
    eslint src/ --max-warnings 0

    # End-to-end
    npx playwright test

Step 2: Tell Claude to run it
Include the verification command in your initial prompt. Make it explicit.
    Implement the user authentication flow.
    After each change, run `npm test` and fix any failures before proceeding.
    Do not consider the task done until `npm test` passes with no errors.

Step 3: Make failures readable
If your test output is noisy, add a summary script. Claude should be able to read failure output and know exactly what to fix.
    #!/bin/bash
    set -e

    echo "=== Type check ==="
    npx tsc --noEmit

    echo "=== Unit tests ==="
    npm test -- --reporter=verbose

    echo "=== Lint ==="
    npm run lint

    echo "=== All checks passed ==="

Then instruct Claude: “Run ./scripts/verify.sh after each change.”
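If a check still prints hundreds of lines on failure, one option is to trim the output before it is shown. The sketch below is one hypothetical way to do that; `summarize_failure` is an illustrative name, and it assumes the most useful lines (the assertion message and stack trace) appear near the end of the output, which is common but not guaranteed.

```python
def summarize_failure(check_name: str, output: str, max_lines: int = 20) -> str:
    """Return a labeled summary holding only the last max_lines non-blank lines.

    Assumption: the failing assertion and stack trace are near the end of
    the output. Raise max_lines if your tool reports failures earlier.
    """
    lines = [line for line in output.splitlines() if line.strip()]
    tail = lines[-max_lines:]
    omitted = len(lines) - len(tail)

    header = f"=== {check_name}: FAILED ==="
    if omitted:
        header += f" ({omitted} earlier lines omitted)"
    return "\n".join([header, *tail])
```

The labeled header mirrors the `echo "=== ... ==="` markers in the shell script, so the failing stage is visible at a glance.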
Example 1: Web Feature with Playwright Screenshot Verification
Boris commonly uses screenshot comparison for UI work. Claude makes a change, takes a screenshot, compares against the expected state, and iterates until the visual matches.
    Task: Implement the pricing table component per the design spec in docs/design/pricing.png

    After each change:
    1. Assume the dev server (`npm run dev`) is already running in the background; do not start it again
    2. Run `npx playwright test tests/pricing.spec.ts --reporter=line`
    3. If the screenshot diff test fails, inspect the diff at test-results/pricing-diff.png
    4. Fix the CSS/markup to match the expected design
    5. Repeat until all Playwright assertions pass

The Playwright test itself:
    import { test, expect } from '@playwright/test';

    test('pricing table matches design', async ({ page }) => {
      await page.goto('/pricing');
      await expect(page).toHaveScreenshot('pricing-expected.png', {
        maxDiffPixelRatio: 0.02,
      });
    });

Claude reads the pixel diff output, identifies which elements are misaligned, corrects them, and re-runs. No human in the loop.
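To show what a `maxDiffPixelRatio` of 0.02 means in practice, here is a simplified sketch of a ratio-based comparison. It is an illustration only, not Playwright's implementation: Playwright's comparison also applies a per-pixel color tolerance, whereas this version counts any pixel that is not exactly equal. `diff_pixel_ratio` and `screenshot_matches` are hypothetical names, and images are modeled as flat lists of RGB tuples.

```python
from typing import Sequence

Pixel = tuple[int, int, int]  # (red, green, blue)


def diff_pixel_ratio(actual: Sequence[Pixel], expected: Sequence[Pixel]) -> float:
    """Return the fraction of pixels that differ between two same-size images."""
    if len(actual) != len(expected):
        raise ValueError("images must have the same dimensions")
    if not expected:
        return 0.0
    differing = sum(1 for a, e in zip(actual, expected) if a != e)
    return differing / len(expected)


def screenshot_matches(
    actual: Sequence[Pixel],
    expected: Sequence[Pixel],
    max_diff_pixel_ratio: float = 0.02,
) -> bool:
    """Pass when at most max_diff_pixel_ratio of the pixels differ."""
    return diff_pixel_ratio(actual, expected) <= max_diff_pixel_ratio
```

With a 2% budget, small anti-aliasing differences can pass while a misaligned column or wrong font size fails, which is the signal Claude needs to keep iterating.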
Example 2: API Endpoint with Integration Test Verification
    Task: Add a POST /api/users endpoint that creates a user and returns the created record.

    Requirements:
    - Validate email format
    - Hash password with bcrypt (cost factor 12)
    - Return 201 with {id, email, createdAt}
    - Return 422 with {error} for validation failures

    After each change, run `npm test -- --testPathPattern=users.integration`
    and fix any failures. Do not proceed until all tests pass.

The integration test file gives Claude precise signal:
    import request from 'supertest';
    // Assumed path: adjust to wherever your project exports its app.
    import { app } from '../src/app';

    describe('POST /api/users', () => {
      it('creates a user with valid data', async () => {
        const res = await request(app)
          .post('/api/users')
          .send({ email: 'test@example.com', password: 'secure123' });

        expect(res.status).toBe(201);
        expect(res.body).toMatchObject({
          id: expect.any(String),
          email: 'test@example.com',
          createdAt: expect.any(String),
        });
        // The password (hashed or not) must never be returned.
        expect(res.body.password).toBeUndefined();
      });

      it('returns 422 for invalid email', async () => {
        const res = await request(app)
          .post('/api/users')
          .send({ email: 'not-an-email', password: 'secure123' });

        expect(res.status).toBe(422);
        expect(res.body.error).toBeDefined();
      });
    });

Each failing assertion tells Claude exactly what is wrong. Claude does not need to ask.
Example 3: Data Pipeline with Output Validation Script
For data work, write a validation script that checks shape and invariants:
    import json
    import sys

    with open('output/processed.json') as f:
        data = json.load(f)

    errors = []

    if not isinstance(data, list):
        errors.append("Output must be a list")

    # Only run per-record checks when the top-level shape is valid.
    records = data if isinstance(data, list) else []

    for i, record in enumerate(records):
        if not isinstance(record, dict):
            errors.append(f"Record {i} must be an object")
            continue
        if 'id' not in record:
            errors.append(f"Record {i} missing 'id' field")
        if 'timestamp' not in record:
            errors.append(f"Record {i} missing 'timestamp' field")
        if record.get('value') is not None and not isinstance(record['value'], (int, float)):
            errors.append(f"Record {i} has non-numeric value: {record['value']}")

    if errors:
        print("VALIDATION FAILED:")
        for e in errors:
            print(f"  - {e}")
        sys.exit(1)

    print(f"VALIDATION PASSED: {len(records)} records OK")

Prompt to Claude:
    Rewrite the data pipeline in pipeline.py to match the new schema in docs/schema-v2.json.
    After each change, run `python scripts/validate_pipeline.py`.
    Fix any validation errors before proceeding.

The Anti-Pattern: Asking Claude to “Check Your Work”
The most common mistake is asking Claude to self-evaluate in natural language:
    # Weak: Claude is guessing, not verifying
    "Review your implementation and make sure it's correct."
    "Does your solution handle edge cases?"
    "Check your work."

This is not verification-first. Claude cannot reliably catch its own logical errors through introspection alone. It will often say “looks good” when the code is still wrong.
The verification must be executable. It must produce a pass/fail signal that is independent of Claude’s reasoning about the code.
    # Strong: machine signal, not opinion
    "Run npm test. Fix any failures."
    "Run tsc --noEmit. Fix any type errors."
    "Run ./scripts/validate.sh. Do not continue if it exits non-zero."

The Feedback Loop Diagram
The left path, where you review each attempt, bottlenecks on human review time. The right path, where Claude runs the verification command itself, bottlenecks only on Claude’s iteration speed, which is measured in seconds, not hours.
Connection to Automation
Verification-first is what makes /loop safe. When Claude runs on a schedule — checking for new PR comments, watching for test failures, monitoring an external API — it needs to know when it has done the right thing without asking you.
A loop that monitors and auto-fixes test failures only works if Claude can distinguish “fixed” from “still broken” without human judgment.
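The distinction between “fixed” and “still broken” can be reduced to comparing two exit codes. The following is a minimal sketch of that idea, not a description of how /loop works internally; `check_passes` and `classify_transition` are hypothetical names.

```python
import subprocess


def check_passes(command: list[str]) -> bool:
    """Return True when the verification command exits with status 0."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0


def classify_transition(previous_passed: bool, current_passed: bool) -> str:
    """Label what happened between two consecutive verification runs."""
    if current_passed:
        return "still passing" if previous_passed else "fixed"
    return "regressed" if previous_passed else "still broken"
```

A scheduled agent could call `check_passes` before and after an attempted fix and stop only on "fixed" or "still passing". No human judgment is involved in either outcome.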
See Automation Workflows and /advanced/loop for how to connect verification into persistent background agents.
About Boris Cherny: Boris created Claude Code at Anthropic. The verification-first principle emerged from his daily experience shipping production code with Claude — the pattern that consistently separated successful long-running tasks from ones that drifted off course.