Context & Memory
LLMs have finite context windows. A typical agentic session burns through 100K tokens in 30–60 minutes. Without defense, the agent simply forgets — mid-task, silently, with no warning.
The Math
Context consumption in a real session:
| Source | Tokens (est.) |
|---|---|
| System prompt | ~8K |
| Conversation history | ~20K |
| 50 tool results × 2K each | ~100K |
| Thinking blocks | ~15K |
| Total | ~143K |
Assuming a 200K-token context window (about 180K effective once space is reserved for a compaction summary), a busy session hits the wall in under an hour. The system needs a layered defense, not a single fallback.
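The arithmetic in the table can be checked with a few lines. This is only a sketch: the figures are the estimates from the table, and the names `BUDGET`, `total_tokens`, and `headroom` are illustrative, not part of the system.

```python
# Token budget from the table above (estimates, in tokens).
BUDGET = {
    "system_prompt": 8_000,
    "conversation_history": 20_000,
    "tool_results": 50 * 2_000,  # 50 tool results at ~2K each
    "thinking_blocks": 15_000,
}

CONTEXT_WINDOW = 200_000  # nominal window quoted in the text


def total_tokens(budget: dict[str, int]) -> int:
    """Sum every source of context consumption."""
    return sum(budget.values())


def headroom(budget: dict[str, int], window: int = CONTEXT_WINDOW) -> int:
    """Tokens left before the window is full."""
    return window - total_tokens(budget)


if __name__ == "__main__":
    print(total_tokens(BUDGET))  # 143000
    print(headroom(BUDGET))      # 57000
```

Tool results account for roughly 70% of the total, which is why the cheaper defense layers target them first.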
Pre-Flight Token Estimation
Before every API call, the system estimates token count to decide if compaction is needed — preventing 413 errors before they happen.
Deep Dive: Token Estimation Across Providers
Token estimation runs for 3 API providers with different backends:
| Provider | Method | Fallback |
|---|---|---|
| Anthropic API | Native token counting endpoint | Character heuristic (~4 chars/token) |
| AWS Bedrock | Bedrock-specific token estimator | Character heuristic |
| Google Vertex AI | Vertex token count API | Character heuristic |
Key behaviors:
- Extended thinking tokens counted separately with configurable budget
- Beta fields (e.g., tool_search) stripped before counting to prevent overestimation
- Cache-aware: already-cached prompt tokens counted at reduced weight
- If estimation fails: falls back to character-based heuristic (~4 chars per token)
The estimation feeds directly into the auto-compact trigger: if estimated tokens > effective context window, Layer 3 fires before the API call — not after a 413 error.
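A minimal sketch of the pre-flight check, assuming the provider-specific counter is passed in as a plain callable. The function names and the `count_native` parameter are hypothetical; only the ~4 chars/token fallback and the "estimate > effective window" trigger come from the text.

```python
from typing import Callable, Optional

CHARS_PER_TOKEN = 4  # fallback heuristic from the text (~4 chars/token)


def heuristic_tokens(text: str) -> int:
    """Character-based estimate, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def estimate_tokens(
    text: str,
    count_native: Optional[Callable[[str], int]] = None,  # hypothetical provider counter
) -> int:
    """Try the provider's counter first; fall back to the heuristic on any failure."""
    if count_native is not None:
        try:
            return count_native(text)
        except Exception:
            pass  # estimation failed: use the heuristic instead
    return heuristic_tokens(text)


def needs_compaction(
    text: str,
    effective_window: int,
    count_native: Optional[Callable[[str], int]] = None,
) -> bool:
    """Pre-flight check: compact before the call if the estimate exceeds the window."""
    return estimate_tokens(text, count_native) > effective_window
```

The heuristic rounds up so that a borderline prompt triggers compaction rather than a rejected request.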
5-Layer Escalating Defense
Each layer activates only when cheaper layers are insufficient. Once a layer runs, it does not retry — it either succeeds or escalates.
Key principle: Each layer runs at most once per iteration. Failure → escalate. No retries within the same layer. Circuit breakers on every layer.
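The escalation rule can be sketched as a single pass over an ordered list of layers. The layers here are placeholder callables, not the system's actual five layers, and `run_defense` is a hypothetical name; the sketch only illustrates "run at most once, failure escalates, no retries".

```python
from typing import Callable, Sequence

# A layer takes the current token count and returns the count after it ran.
# It raises an exception if it fails. (Placeholder signature, assumed for this sketch.)
Layer = Callable[[int], int]


def run_defense(tokens: int, limit: int, layers: Sequence[Layer]) -> tuple[int, list[int]]:
    """Run layers cheapest-first; each runs at most once, and failure escalates."""
    ran: list[int] = []
    for index, layer in enumerate(layers, start=1):
        if tokens <= limit:
            break  # cheaper layers were sufficient
        ran.append(index)
        try:
            tokens = layer(tokens)
        except Exception:
            continue  # no retry: escalate to the next layer
    return tokens, ran
```

Because the loop never revisits an index, a failing layer cannot spin; the worst case is one attempt per layer per iteration.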
Layer 3 Detail: Auto-compact
The 20K token reserve for summary output is based on empirical measurement: p99.99 of observed compact summaries = 17,387 tokens. The system over-provisions slightly to guarantee the summary never gets truncated.
When auto-compact fires, memory extraction runs in parallel (non-blocking). The summary replaces everything before the compact boundary, and the original messages before that boundary are permanently discarded.
Layer 4 Detail: Reactive Compact
Triggered when the API returns HTTP 413 (prompt_too_long) mid-turn, meaning the request was sent and rejected as too large. The circuit breaker flag hasAttemptedReactiveCompact prevents infinite loops: if the first attempt fails, the session aborts gracefully rather than spiraling.
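A sketch of the circuit breaker, under the assumption that a 413 surfaces as an exception. `PromptTooLongError`, `SessionAborted`, `send`, and `compact` are hypothetical stand-ins; only the one-shot flag mirrors hasAttemptedReactiveCompact from the text.

```python
class PromptTooLongError(Exception):
    """Stand-in for an HTTP 413 (prompt_too_long) response."""


class SessionAborted(Exception):
    """Raised when reactive compaction has already been tried once."""


def call_with_reactive_compact(send, compact, messages):
    """Send a request; on 413, compact once and retry. A second 413 aborts."""
    has_attempted_reactive_compact = False  # circuit breaker flag
    while True:
        try:
            return send(messages)
        except PromptTooLongError:
            if has_attempted_reactive_compact:
                raise SessionAborted("still too long after reactive compact")
            has_attempted_reactive_compact = True
            messages = compact(messages)
```

The loop can execute at most twice: once for the original request and once after the single compaction attempt.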
Deep Dive: Effective Context Window Calculation
The system calculates how much context it can actually use:
    FUNCTION getEffectiveContextWindowSize(model):
        maxContext = getContextWindowForModel(model)  // e.g., 200,000

        // Reserve space for compact summary output
        // p99.99 observed compact output = 17,387 tokens
        // Over-provision to 20,000 to guarantee no truncation
        reserved = min(
            getMaxOutputTokensForModel(model),
            20_000  // MAX_OUTPUT_TOKENS_FOR_SUMMARY
        )

        RETURN maxContext - reserved
        // Example: 200,000 - 20,000 = 180,000 effective tokens

The 20,000 token reserve is an empirical value: based on production telemetry, 99.99% of compact summaries fit within 17,387 tokens. The system adds a ~15% safety margin. This means that out of a 200K context window, only 180K is available for actual conversation; the rest is reserved for the summary if compaction is needed.
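The same calculation as a runnable sketch. The two constants come from the text; the per-model output limits passed in by the caller are assumed inputs, since the source does not list them.

```python
MAX_OUTPUT_TOKENS_FOR_SUMMARY = 20_000   # reserve quoted in the text
OBSERVED_P9999_SUMMARY_TOKENS = 17_387   # p99.99 of observed compact summaries


def effective_context_window(max_context: int, max_output_tokens: int) -> int:
    """Context left for conversation after reserving room for a compact summary."""
    reserved = min(max_output_tokens, MAX_OUTPUT_TOKENS_FOR_SUMMARY)
    return max_context - reserved


def safety_margin() -> float:
    """Fractional over-provisioning of the reserve relative to the observed p99.99."""
    return MAX_OUTPUT_TOKENS_FOR_SUMMARY / OBSERVED_P9999_SUMMARY_TOKENS - 1


if __name__ == "__main__":
    # 32_000 is an assumed model output limit, used only for illustration.
    print(effective_context_window(200_000, 32_000))  # 180000
    print(round(safety_margin(), 3))                  # 0.15
```

Note the `min`: a model whose own output limit is below 20,000 tokens reserves only that smaller amount, so its effective window is slightly larger.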
Compact Boundary = Controlled Forgetting
A compact boundary is not random truncation. It is a system decision about what to forget and what to keep:
Before boundary (discarded): turn 1 → turn 2 → ... → turn 47
Boundary message (kept): "Summary: You were refactoring the auth module. Completed: token validation, session expiry. In progress: refresh token rotation. Key decisions: use httpOnly cookies, 15-min expiry."
After boundary (kept): turn 48 → turn 49 → ...

The summary captures decisions and state, not a transcript. The agent continues with context, not with history.
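The boundary operation can be illustrated with a small sketch. The `Message` shape and `apply_compact_boundary` are hypothetical; the summary text would come from the compaction step, which is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Message:
    turn: int
    content: str
    is_boundary: bool = False


def apply_compact_boundary(
    messages: list[Message], boundary_turn: int, summary: str
) -> list[Message]:
    """Drop turns up to and including boundary_turn; keep a summary plus later turns."""
    kept = [m for m in messages if m.turn > boundary_turn]
    boundary = Message(turn=boundary_turn, content=f"Summary: {summary}", is_boundary=True)
    return [boundary, *kept]


if __name__ == "__main__":
    history = [Message(turn=i, content=f"turn {i}") for i in range(1, 50)]
    compacted = apply_compact_boundary(history, 47, "refactoring the auth module")
    print(len(compacted))  # 3: the boundary message plus turns 48 and 49
```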
Deep Dive: Microcompact Algorithm
Microcompact is the cheapest defense that actually removes content (Layer 1 only truncates). It runs without any LLM call:
    FUNCTION microcompact(messages, compactBoundaries):
        latestCompactBoundary = last(compactBoundaries)
        removedCount = 0
        freedTokens = 0

        FOR EACH message IN messages:
            IF message.type == "tool_result":
                // A tool result is "stale" if it is older than the newest compact boundary
                IF message.turnIndex < latestCompactBoundary.turnIndex:
                    // Option 1: Replace with brief summary
                    IF message.resultSize > SUMMARY_THRESHOLD:
                        freedTokens += tokensFreedBy(message)  // hypothetical helper
                        message.content = "[Tool result truncated. Original: ${message.toolName} " +
                            "returned ${message.resultSize} chars. Run tool again if needed.]"
                    // Option 2: Remove entirely if the tool was a read-only query
                    ELSE IF message.toolName IN [FileRead, Grep, Glob, WebSearch]:
                        freedTokens += tokensFreedBy(message)  // hypothetical helper
                        REMOVE message from conversation
                        removedCount += 1

        // Create microcompact boundary (marks the cutoff point)
        IF removedCount > 0:
            INSERT microcompactBoundaryMessage(removedCount, freedTokens)

cachedMicrocompact variant (feature-gated): Caches previous microcompact decisions so repeated triggers don’t re-scan the entire conversation. Tracks cache_deleted_input_tokens from API response headers to measure effectiveness.
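A runnable approximation of the same pass, for experimentation. The `Msg` shape and the `SUMMARY_THRESHOLD` value are assumptions, savings are measured in characters rather than tokens, and insertion of the boundary marker is left out for brevity.

```python
from dataclasses import dataclass

SUMMARY_THRESHOLD = 2_000  # chars; hypothetical value, not from the source
READ_ONLY_TOOLS = {"FileRead", "Grep", "Glob", "WebSearch"}


@dataclass
class Msg:
    type: str
    turn_index: int
    content: str
    tool_name: str = ""


def microcompact(messages: list[Msg], latest_boundary_turn: int) -> tuple[list[Msg], int]:
    """Shrink or drop stale tool results; return the new list and chars freed."""
    out: list[Msg] = []
    freed = 0
    for m in messages:
        stale = m.type == "tool_result" and m.turn_index < latest_boundary_turn
        if stale and len(m.content) > SUMMARY_THRESHOLD:
            # Option 1: replace a large stale result with a short stub.
            stub = (
                f"[Tool result truncated. Original: {m.tool_name} returned "
                f"{len(m.content)} chars. Run tool again if needed.]"
            )
            freed += len(m.content) - len(stub)
            out.append(Msg(m.type, m.turn_index, stub, m.tool_name))
        elif stale and m.tool_name in READ_ONLY_TOOLS:
            # Option 2: drop small stale results from read-only tools entirely.
            freed += len(m.content)
        else:
            out.append(m)
    return out, freed
```

As in the pseudocode, small stale results from tools that are not read-only are left untouched, since re-running them may have side effects.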
Memory Lifecycle
Memory decay: older memories receive lower relevance scores over time. A memory about “user prefers 2-space indent” stays relevant forever. A memory about “file X had a bug on day 3” decays quickly.
Deep Dive: Memory Extraction and Session Memory Compact
extractMemories() runs at session end as a background task (non-blocking):
    FUNCTION extractMemories(sessionMessages):
        // Spawn a lightweight agent to extract key facts
        // This agent has read-only access and a focused prompt
        memories = await agent.extract({
            messages: sessionMessages,
            prompt: "Extract key facts, decisions, and learnings. Skip transient details.",
            maxMemories: 20  // Limit per session
        })

        FOR EACH memory IN memories:
            memory.createdAt = now()
            memory.baseRelevanceScore = 1.0  // Starts at full relevance
            saveToMemoryStore(memory)

sessionMemoryCompact (630 lines) runs in parallel with auto-compact (Layer 3):
    FUNCTION sessionMemoryCompact(messages, compactSummary):
        // Extract memories from the messages being compacted
        // These memories would otherwise be lost when the compact boundary
        // replaces the original messages
        emergencyMemories = extractKeyFacts(messages)

        // Merge with existing session memories (deduplicate)
        FOR EACH memory IN emergencyMemories:
            IF NOT isDuplicate(memory, existingMemories):
                saveToSessionMemory(memory)

Memory age decay:
    FUNCTION calculateRelevance(memory):
        // Stable facts are tagged decay: false and never lose relevance
        IF memory.decay == false:
            RETURN memory.baseRelevanceScore

        daysSinceCreated = (now() - memory.createdAt) / ONE_DAY

        // Exponential decay with 30-day half-life
        decayFactor = 0.5 ^ (daysSinceCreated / 30)

        RETURN memory.baseRelevanceScore * decayFactor
        // Day 0: score × 1.0
        // Day 30: score × 0.5
        // Day 60: score × 0.25
        // Day 90: score × 0.125

Memories about stable facts (“user prefers 2-space indent”) are tagged with decay: false and maintain full relevance indefinitely. Only transient memories decay.
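The decay rule is easy to verify numerically. This sketch assumes the score, creation time, and decay flag are passed as plain arguments; the function name and signature are illustrative.

```python
from datetime import datetime, timedelta

HALF_LIFE_DAYS = 30.0  # half-life stated in the text


def relevance(
    base_score: float, created_at: datetime, now: datetime, decays: bool = True
) -> float:
    """Exponential decay with a 30-day half-life; stable memories keep their base score."""
    if not decays:
        return base_score
    age_days = (now - created_at) / timedelta(days=1)
    return base_score * 0.5 ** (age_days / HALF_LIFE_DAYS)


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1)
    for days in (0, 30, 60, 90):
        print(days, relevance(1.0, t0, t0 + timedelta(days=days)))
```

With a half-life formulation the score never reaches zero, so removal depends on a separate threshold rather than on the decay function itself.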
Team memory: Agents in the same coordinator team share a memory namespace at .claude/team-memory/{teamName}/. If Agent A extracts a memory about a codebase pattern, Agent B in the same team can access it in its next session. This enables organizational knowledge accumulation across agent boundaries.
Background Tasks
The system runs 7 types of background tasks, all writing output to disk for resilience:
| Type | Description | Key Detail |
|---|---|---|
| local_bash | Shell command in background | Output to file, incremental reads via outputOffset |
| local_agent | Agent with own query loop | Isolated context, reports to coordinator |
| remote_agent | Agent on remote server | Cross-network, same mailbox protocol |
| in_process_teammate | Coordinator worker | Shared process, message passing |
| local_workflow | Scripted pipeline | Sequential steps, output chained |
| monitor_mcp | MCP server health check | Periodic ping, restart on failure |
| dream | Memory consolidation | Runs between sessions |
Disk-Based Output Pattern
All background tasks write to files. Readers use outputOffset for incremental reading — like tail -f but crash-safe:
Benefits:
- Survives crashes → file persists, state not lost
- Unlimited size → not constrained by memory
- Multiple readers → coordinator + UI read same file
- Incremental reads → no re-reading already-seen output

Stall detection: If no new output appears for 45 seconds, the system checks whether the command is waiting for interactive input, scanning for patterns like Y/n?, Continue?, Press any key. If found, it surfaces the prompt to the user rather than silently hanging.
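The offset-based reading pattern described above can be sketched in a few lines. `read_new_output` is a hypothetical helper; the only idea taken from the text is that each reader keeps its own outputOffset and asks for whatever was appended since.

```python
def read_new_output(path: str, output_offset: int) -> tuple[bytes, int]:
    """Return the bytes written since output_offset, plus the new offset."""
    with open(path, "rb") as f:
        f.seek(output_offset)
        chunk = f.read()
    return chunk, output_offset + len(chunk)
```

Each reader (coordinator, UI) stores its own offset, so several readers can follow the same file independently, and a reader that restarts after a crash resumes from its last saved offset instead of re-reading the file.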
The Dream Task
Between sessions, a dream agent reviews recent work and consolidates memories. It runs in 4 phases:
| Phase | Action |
|---|---|
| Orient | Load recent session summaries and existing memories |
| Gather | Extract key facts, decisions, and patterns from recent work |
| Consolidate | Merge redundant memories, resolve contradictions, update scores |
| Prune | Remove stale or low-relevance memories below threshold |
Idle time becomes improvement time. The next session starts with sharper context than the last one ended with.
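The Consolidate and Prune phases can be illustrated with a deliberately simple sketch. It assumes memories are dicts with `text` and `score` keys and treats "redundant" as "identical after case and whitespace normalization"; the real merge and contradiction-resolution logic is not described in the source and is not modeled here.

```python
def consolidate(memories: list[dict]) -> list[dict]:
    """Merge memories whose normalized text matches, keeping the highest-scored one."""
    best: dict[str, dict] = {}
    for m in memories:
        key = " ".join(m["text"].lower().split())
        if key not in best or m["score"] > best[key]["score"]:
            best[key] = m
    return list(best.values())


def prune(memories: list[dict], threshold: float) -> list[dict]:
    """Drop memories whose relevance score is below the threshold."""
    return [m for m in memories if m["score"] >= threshold]
```

The threshold value is left to the caller because the source does not state one.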
Deep Dive: Stall Detection Patterns
Background shell tasks (local_bash) are monitored for stalls — commands that stop producing output because they’re waiting for interactive input:
    FUNCTION checkForStall(task):
        // Check every 5 seconds (STALL_CHECK_INTERVAL_MS = 5000)
        IF (now() - task.lastOutputTime) > 45_000:  // STALL_THRESHOLD_MS
            // Read the last 500 bytes of output
            tail = readTail(task.outputFile, 500)

            // Check against interactive prompt patterns
            patterns = [
                /\[?[Yy](es)?\/[Nn](o)?\]?\s*$/,  // Y/n, Yes/No
                /\bContinue\?\s*$/i,              // Continue?
                /\bProceed\?\s*$/i,               // Proceed?
                /\bPress\s+(any\s+)?key/i,        // Press any key
                /\bPassword:\s*$/i,               // Password:
                /\bpassphrase.*:\s*$/i,           // SSH passphrase
                /\(y\/n\)\s*$/i,                  // (y/n), parentheses escaped to match literally
                />\s*$/,                          // Generic prompt >
                /\$\s*$/,                         // Shell prompt $
                /\boverwrite\b.*\?\s*$/i          // Overwrite?
            ]

            IF anyMatch(tail, patterns):
                notifyAgent("Task ${task.id} appears to be waiting for input: ${tail}")
                // Agent can: send input to stdin, kill task, or escalate to user

The 45-second threshold is calibrated: short enough to catch stalls quickly, long enough to avoid false positives during legitimate pauses (e.g., compilation, large downloads).
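A runnable subset of the check using Python's `re` module. The thresholds mirror the text; the function name is hypothetical, and the generic `>` and `$` patterns are omitted from this sketch because they match ordinary output too easily.

```python
import re

STALL_THRESHOLD_S = 45.0  # seconds without output before checking
TAIL_CHARS = 500          # how much trailing output to inspect

PROMPT_PATTERNS = [
    re.compile(r"\[?[Yy](es)?/[Nn](o)?\]?\s*$"),   # Y/n, Yes/No
    re.compile(r"\bContinue\?\s*$", re.I),         # Continue?
    re.compile(r"\bProceed\?\s*$", re.I),          # Proceed?
    re.compile(r"\bPress\s+(any\s+)?key", re.I),   # Press any key
    re.compile(r"\bPassword:\s*$", re.I),          # Password:
    re.compile(r"\bpassphrase.*:\s*$", re.I),      # SSH passphrase
]


def looks_stalled(tail: str, seconds_since_output: float) -> bool:
    """True if output has been quiet past the threshold and ends in a prompt."""
    if seconds_since_output <= STALL_THRESHOLD_S:
        return False
    tail = tail[-TAIL_CHARS:]
    return any(p.search(tail) for p in PROMPT_PATTERNS)
```

Most patterns are anchored with `$`, so a prompt-like string earlier in the output does not trigger a match; only text at the very end of the tail counts.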
Why This Matters to You
- Use /compact proactively → don’t wait for the emergency layer (Layer 4); trigger Layer 3 yourself at a natural breakpoint
- Why Claude “forgets” earlier instructions → a compact boundary replaced old messages with a summary; the original instruction is gone
- How memories persist across sessions → extracted at session end, scored, injected into system prompt at next session start
- What happens during long-running tasks → output written to disk with outputOffset incremental reads; survives crashes
- What the “dream” task does → background memory consolidation between sessions; you benefit without any action
See also: The Agent Loop — Multi-Agent System — Tool Orchestration