Verification-First Development

“Give Claude a way to verify its output. Once you do that, Claude will iterate until the result is great.” — Boris Cherny, creator of Claude Code

This is Boris’s single most important principle. Everything else — parallel execution, automation, voice coding — depends on it working.

What Verification-First Means

Most developers use Claude in a request-response loop: ask Claude to write code, read what it produces, decide if it’s correct, give feedback. This works, but it makes you the verifier. You become the bottleneck.

Verification-first inverts this. You give Claude a machine-readable signal — a test suite, a build command, a linting script, a screenshot diff — and Claude uses that signal to assess its own output. When the signal is red, Claude fixes the code and tries again. When the signal is green, the task is done.

You are no longer the bottleneck. Claude iterates on its own until the checks pass.

Without verification-first:
  You write task → Claude writes code → You read code → You say "not quite"
  → Claude rewrites → You read again → repeat

With verification-first:
  You write task + verification command → Claude writes code → Claude runs tests
  → Tests fail → Claude reads failure → Claude fixes → Claude runs tests
  → Tests pass → Done

The difference compounds on long tasks. A feature that requires 8 rounds of human feedback becomes 8 rounds of automated feedback happening in seconds, not hours.
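The automated loop above can be sketched in a few lines of Python. This is a minimal illustration, not how Claude Code is implemented: `run_until_green` and the `fix` callback are hypothetical names, and `fix` stands in for whatever step edits the code in response to the failure output.

```python
import subprocess
from typing import Callable, Sequence


def run_until_green(
    verify_cmd: Sequence[str],
    fix: Callable[[str], None],
    max_rounds: int = 8,
) -> int:
    """Run verify_cmd until it exits 0 or max_rounds is reached.

    Returns the number of verification runs it took to go green.
    Raises RuntimeError if the command still fails on the last round.
    """
    for round_no in range(1, max_rounds + 1):
        result = subprocess.run(verify_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return round_no  # green: the task is done
        if round_no < max_rounds:
            # Red: pass the raw failure output to the fixer (hypothetical
            # hook standing in for the agent's edit step), then re-verify.
            fix(result.stdout + result.stderr)
    raise RuntimeError(f"still failing after {max_rounds} rounds")
```

The exit code of the verification command is the only thing that decides when the loop stops, which is the point of the pattern.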

Why It Works

Claude is excellent at reading structured error output and mapping it back to source code. A failing test with a stack trace, a TypeScript error with a line number, a Playwright assertion with a screenshot diff — these are precise, unambiguous signals. Claude can act on them immediately without needing to ask you what went wrong.

Natural language feedback (“that’s not quite right”) forces Claude to guess what you mean. Structured failure output removes the guessing.
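To make "structured" concrete, here is a small sketch that turns one line of compiler output into a file, position, and error code. It assumes the non-pretty `tsc` format, `path(line,col): error TSxxxx: message`; the `TscError` type and `parse_tsc_line` function are illustrative names, not part of any tool.

```python
import re
from typing import NamedTuple, Optional


class TscError(NamedTuple):
    path: str
    line: int
    column: int
    code: str
    message: str


# Assumed format (tsc with pretty output disabled):
#   src/auth.ts(12,5): error TS2322: Type 'string' is not assignable ...
_TSC_LINE = re.compile(
    r"^(?P<path>.+?)\((?P<line>\d+),(?P<column>\d+)\): "
    r"error (?P<code>TS\d+): (?P<message>.*)$"
)


def parse_tsc_line(text: str) -> Optional[TscError]:
    """Return the parsed error, or None if the line is not a tsc error."""
    match = _TSC_LINE.match(text.strip())
    if match is None:
        return None
    return TscError(
        path=match["path"],
        line=int(match["line"]),
        column=int(match["column"]),
        code=match["code"],
        message=match["message"],
    )
```

A line of compiler output yields an exact location to edit. A sentence like "that's not quite right" yields `None`: there is nothing to act on.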

The deeper reason: Claude’s effective context window is finite. Spending tokens on human-mediated clarification is expensive. Spending tokens on reading a stack trace and fixing it is cheap and reliable.

How to Set It Up

Step 1: Identify your verification command

Most projects already have at least one. Use what already exists.

# Unit tests
npm test
pytest
go test ./...

# Build verification
npm run build
tsc --noEmit
cargo build

# Linting
npm run lint
eslint src/ --max-warnings 0

# End-to-end
npx playwright test

Step 2: Tell Claude to run it

Include the verification command in your initial prompt. Make it explicit.

Implement the user authentication flow.
After each change, run `npm test` and fix any failures before proceeding.
Do not consider the task done until `npm test` passes with no errors.

Step 3: Make failures readable

If your test output is noisy, add a summary script. Claude should be able to read failure output and know exactly what to fix.

#!/bin/bash
# scripts/verify.sh: run every check in order and stop at the first failure.
set -e

echo "=== Type check ==="
npx tsc --noEmit

echo "=== Unit tests ==="
# --reporter=verbose is a Vitest-style flag; with Jest, use --verbose instead.
npm test -- --reporter=verbose

echo "=== Lint ==="
npm run lint

echo "=== All checks passed ==="

Then instruct Claude: “Run ./scripts/verify.sh after each change.”
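If the raw output is long, a wrapper can reduce it to one PASS/FAIL line per check plus the tail of the first failure. The following is one possible sketch, not a prescribed script: `run_checks` and `tail` are hypothetical helpers, and the 20-line limit is an arbitrary choice.

```python
import subprocess
from typing import Sequence, Tuple


def tail(text: str, max_lines: int) -> str:
    """Return the last max_lines lines of text (max_lines must be positive)."""
    return "\n".join(text.splitlines()[-max_lines:])


def run_checks(
    checks: Sequence[Tuple[str, Sequence[str]]],
    max_lines: int = 20,
) -> int:
    """Run each (name, command) pair in order and stop at the first failure.

    Prints one PASS/FAIL line per check. On failure, prints only the last
    max_lines lines of combined output. Returns a process exit code.
    """
    for name, cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"FAIL {name} (exit {result.returncode})")
            print(tail(result.stdout + result.stderr, max_lines))
            return 1
        print(f"PASS {name}")
    return 0
```

A script built on this would call `run_checks([("type check", ["npx", "tsc", "--noEmit"]), ("unit tests", ["npm", "test"]), ("lint", ["npm", "run", "lint"])])` and pass the result to `sys.exit`, so the exit code still carries the pass/fail signal.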

Example 1: Web Feature with Playwright Screenshot Verification

Boris commonly uses screenshot comparison for UI work. Claude makes a change, takes a screenshot, compares against the expected state, and iterates until the visual matches.

Task: Implement the pricing table component per the design spec in docs/design/pricing.png

After each change:
1. Assume the dev server (`npm run dev`) is already running in the background; do not start a second one
2. Run `npx playwright test tests/pricing.spec.ts --reporter=line`
3. If the screenshot diff test fails, inspect the diff image that Playwright writes under test-results/
4. Fix the CSS/markup to match the expected design
5. Repeat until all Playwright assertions pass

The Playwright test itself:

import { test, expect } from '@playwright/test';

test('pricing table matches design', async ({ page }) => {
  await page.goto('/pricing');
  await expect(page).toHaveScreenshot('pricing-expected.png', {
    maxDiffPixelRatio: 0.02,
  });
});

Claude reads the pixel diff output, identifies which elements are misaligned, corrects them, and re-runs. No human in the loop.
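As a rough picture of what a `maxDiffPixelRatio` of 0.02 permits, the sketch below computes the fraction of pixels that differ between two equally sized images. It is a simplification under stated assumptions: pixels are plain RGB tuples compared for exact equality, whereas Playwright's own comparison also applies a per-pixel color threshold. The function names are illustrative.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue)


def diff_pixel_ratio(actual: Sequence[Pixel], expected: Sequence[Pixel]) -> float:
    """Fraction of pixels that differ between two same-sized images."""
    if len(actual) != len(expected):
        raise ValueError("images must have the same number of pixels")
    if not expected:
        return 0.0
    differing = sum(1 for a, e in zip(actual, expected) if a != e)
    return differing / len(expected)


def screenshot_matches(
    actual: Sequence[Pixel],
    expected: Sequence[Pixel],
    max_diff_pixel_ratio: float = 0.02,
) -> bool:
    """Pass when at most max_diff_pixel_ratio of the pixels differ."""
    return diff_pixel_ratio(actual, expected) <= max_diff_pixel_ratio
```

With a 2% budget, a 100-pixel image passes with one or two changed pixels and fails with three.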

Example 2: API Endpoint with Integration Test Verification

Task: Add a POST /api/users endpoint that creates a user and returns the created record.

Requirements:
- Validate email format
- Hash password with bcrypt (cost factor 12)
- Return 201 with {id, email, createdAt}
- Return 422 with {error} for validation failures

After each change, run `npm test -- --testPathPattern=users.integration`
and fix any failures. Do not proceed until all tests pass.

The integration test file gives Claude precise signal:

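import request from 'supertest';
// Hypothetical path: adjust to wherever the app under test is exported.
import app from '../src/app';

describe('POST /api/users', () => {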
  it('creates a user with valid data', async () => {
    const res = await request(app)
      .post('/api/users')
      .send({ email: 'test@example.com', password: 'secure123' });

    expect(res.status).toBe(201);
    expect(res.body).toMatchObject({
      id: expect.any(String),
      email: 'test@example.com',
      createdAt: expect.any(String),
    });
    expect(res.body.password).toBeUndefined();
  });

  it('returns 422 for invalid email', async () => {
    const res = await request(app)
      .post('/api/users')
      .send({ email: 'not-an-email', password: 'secure123' });

    expect(res.status).toBe(422);
    expect(res.body.error).toBeDefined();
  });
});

Each failing assertion tells Claude exactly what is wrong. Claude does not need to ask.

Example 3: Data Pipeline with Output Validation Script

For data work, write a validation script that checks shape and invariants:

import json
import sys

with open('output/processed.json') as f:
    data = json.load(f)

errors = []

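if not isinstance(data, list):
    errors.append("Output must be a list")
    data = []  # skip the per-record checks; there is nothing valid to iterate

for i, record in enumerate(data):
    # Guard against non-object records so .get() below cannot raise.
    if not isinstance(record, dict):
        errors.append(f"Record {i} must be an object, got {type(record).__name__}")
        continue
    if 'id' not in record:
        errors.append(f"Record {i} missing 'id' field")
    if 'timestamp' not in record:
        errors.append(f"Record {i} missing 'timestamp' field")
    if record.get('value') is not None and not isinstance(record['value'], (int, float)):
        errors.append(f"Record {i} has non-numeric value: {record['value']}")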

if errors:
    print("VALIDATION FAILED:")
    for e in errors:
        print(f"  - {e}")
    sys.exit(1)

print(f"VALIDATION PASSED: {len(data)} records OK")

Prompt to Claude:

Rewrite the data pipeline in pipeline.py to match the new schema in docs/schema-v2.json.
After each change, run `python scripts/validate_pipeline.py`.
Fix any validation errors before proceeding.

The Anti-Pattern: Asking Claude to “Check Your Work”

The most common mistake is asking Claude to self-evaluate in natural language:

# Weak — Claude is guessing, not verifying
"Review your implementation and make sure it's correct."
"Does your solution handle edge cases?"
"Check your work."

This is not verification-first. Claude cannot reliably catch its own logical errors through introspection alone. It will often report that the code “looks good” when the code is still wrong.

The verification must be executable. It must produce a pass/fail signal that is independent of Claude’s reasoning about the code.

# Strong — machine signal, not opinion
"Run npm test. Fix any failures."
"Run tsc --noEmit. Fix any type errors."
"Run ./scripts/validate.sh. Do not continue if it exits non-zero."

The Feedback Loop Diagram

graph TD
  subgraph WITHOUT["Without Verification-First"]
    A1[Write Task] --> B1[Claude Writes Code]
    B1 --> C1[Human Reviews]
    C1 --> D1{Correct?}
    D1 -->|No| E1[Human Gives Feedback]
    E1 --> B1
    D1 -->|Yes| F1[Merge]
  end
  subgraph WITH["With Verification-First"]
    A2[Write Task + Verify Command] --> B2[Claude Writes Code]
    B2 --> C2[Claude Runs Tests]
    C2 --> D2{Pass?}
    D2 -->|Fail| E2[Claude Reads Errors]
    E2 --> F2[Claude Fixes Code]
    F2 --> C2
    D2 -->|Pass| G2[Done]
  end
  style A1 fill:#1e293b,color:#7dd3fc,stroke:#334155
  style B1 fill:#1e293b,color:#7dd3fc,stroke:#334155
  style C1 fill:#1e293b,color:#fcd34d,stroke:#334155
  style D1 fill:#1e293b,color:#fcd34d,stroke:#334155
  style E1 fill:#1e293b,color:#fcd34d,stroke:#334155
  style F1 fill:#1e293b,color:#86efac,stroke:#334155
  style A2 fill:#1e293b,color:#7dd3fc,stroke:#334155
  style B2 fill:#1e293b,color:#7dd3fc,stroke:#334155
  style C2 fill:#1e293b,color:#7dd3fc,stroke:#334155
  style D2 fill:#1e293b,color:#fcd34d,stroke:#334155
  style E2 fill:#1e293b,color:#7dd3fc,stroke:#334155
  style F2 fill:#1e293b,color:#7dd3fc,stroke:#334155
  style G2 fill:#1e293b,color:#86efac,stroke:#334155

The left path bottlenecks on human review time. The right path bottlenecks only on Claude’s iteration speed — which is seconds, not hours.

Connection to Automation

Verification-first is what makes /loop safe. When Claude runs on a schedule — checking for new PR comments, watching for test failures, monitoring an external API — it needs to know when it has done the right thing without asking you.

A loop that monitors and auto-fixes test failures only works if Claude can distinguish “fixed” from “still broken” without human judgment.
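One way to make “fixed” versus “still broken” mechanical is to compare the set of failing test names before and after a change. The sketch below assumes those names have already been extracted from the test runner's output; `LoopOutcome`, `compare_failures`, and `should_keep_change` are hypothetical names for illustration.

```python
from typing import FrozenSet, NamedTuple, Set


class LoopOutcome(NamedTuple):
    fixed: FrozenSet[str]         # failed before, pass now
    still_broken: FrozenSet[str]  # failed before, fail now
    newly_broken: FrozenSet[str]  # passed before, fail now


def compare_failures(before: Set[str], after: Set[str]) -> LoopOutcome:
    """Classify tests by comparing failing-test names before and after a change."""
    return LoopOutcome(
        fixed=frozenset(before - after),
        still_broken=frozenset(before & after),
        newly_broken=frozenset(after - before),
    )


def should_keep_change(outcome: LoopOutcome) -> bool:
    """Keep a change only if it fixed something and caused no regressions."""
    return bool(outcome.fixed) and not outcome.newly_broken
```

An unattended loop can apply a rule like this without asking anyone: a change that fixes one test but breaks another is rejected rather than kept.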

See Automation Workflows and /advanced/loop for how to connect verification into persistent background agents.

About Boris Cherny: Boris created Claude Code at Anthropic. The verification-first principle emerged from his daily experience shipping production code with Claude — the pattern that consistently separated successful long-running tasks from ones that drifted off course.