Every major AI lab has published guidance on how to build agents. Anthropic's "Building Effective Agents," Google's multi-agent patterns, Microsoft's orchestration architecture --- the knowledge is out there. But most of it is buried in long whitepapers, and the signal-to-noise ratio is not great.
Here are the five patterns that matter, distilled from all three sources: ordered from simplest to most complex, with clear decision criteria for when to use each one, and, critically, the anti-patterns that will waste your time if you choose wrong.
Key Takeaways
- Anthropic's core principle: "Start with the simplest solution. Don't build an agent when prompt chaining will do."
- Most production systems only need patterns 1-3. Patterns 4-5 are for genuinely complex problems.
- The #1 anti-pattern is reaching for multi-agent orchestration when a single well-prompted call would work.
- Every pattern has a clear "use this when..." decision point. If you can't articulate why you need the pattern, you probably don't.
Short Answer
Which pattern should I use? Start with the simplest one that works. If a single LLM call solves your problem, stop there. If you need multiple steps, use prompt chaining. If inputs need different handling, add routing. Only reach for orchestrator-workers or evaluator-optimizer when simpler approaches have demonstrably failed. Anthropic, Google, and Microsoft all agree on this hierarchy.
Pattern 1: Prompt Chaining
Complexity: Low | Source: Anthropic, Google, Microsoft
Break a complex task into a sequence of focused LLM calls. Each step has a single responsibility, processes the previous step's output, and can include a validation gate before proceeding.
How It Works
```
Input → Step 1 (Extract) → Gate → Step 2 (Analyze) → Gate → Step 3 (Generate) → Output
```
Each step is a separate LLM call with its own focused prompt. Gates between steps are simple checks: "Does the output contain the required fields? Is the confidence above threshold? Did the format validation pass?"
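Here is a minimal sketch of the idea in Python. It is not a framework: `call_llm` is a hypothetical stand-in for whatever model client you use, and the prompts and field names are illustrative.

```python
# Minimal prompt-chaining sketch. `call_llm` is a hypothetical stand-in
# for your actual model client; prompts and field names are illustrative.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model client")

def gate(output: str, required_fields: list[str]) -> bool:
    """Cheap deterministic check between steps: is the output valid JSON
    with the fields the next step expects? (Assumes bare JSON output.)"""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(field in data for field in required_fields)

def chain(document: str) -> str:
    claims = call_llm(f"Extract factual claims as JSON with a 'claims' list:\n{document}")
    if not gate(claims, ["claims"]):
        raise ValueError("step 1 failed its gate; stop before paying for two more calls")
    analysis = call_llm(f"Rate each claim's confidence as JSON with an 'analysis' list:\n{claims}")
    if not gate(analysis, ["analysis"]):
        raise ValueError("step 2 failed its gate")
    return call_llm(f"Write a short report from this analysis:\n{analysis}")
```

Note that the gates are plain deterministic code, not LLM calls: they are cheap, fast, and catch a bad step before you pay for the rest of the chain.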
Real-World Examples
- Anthropic's citation system: Extract claims from text → find relevant quotes for each claim → verify quotes are accurate. Three steps, each testable independently.
- RAG pipelines: Query rewriting → retrieval → reranking → generation → fact-check.
- Content moderation: Classify content → extract flagged elements → generate explanation → format report.
When to Use It
- Your task naturally decomposes into 2-5 sequential steps
- Each step is well-defined and independently testable
- You need high accuracy and want to catch errors between steps
- You can tolerate sequential latency (each step adds an LLM call)
When NOT to Use It
- Steps can't be cleanly separated (if steps blur together, it's one prompt)
- You need real-time responses (latency compounds)
- The next step depends on dynamic decisions about what to do (that's orchestration)
Decision test: "Can I describe each step on a sticky note?" If yes, chain. If the steps blur together or change based on input, you need a different pattern.
Pattern 2: Routing
Complexity: Low-Medium | Source: Anthropic, Microsoft (Selector pattern), Google
A classifier examines the input and dispatches it to the appropriate specialized handler. Each handler is optimized for its category with tailored prompts, tools, and sometimes different models.
How It Works
```
Input → Classifier → Route A (specialized handler)
                   → Route B (specialized handler)
                   → Route C (specialized handler)
                   → Default (general handler)
```
The classifier can be an LLM call, a fine-tuned model, or even keyword matching --- whatever reliably separates your input categories.
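A minimal sketch of the pattern, using the same hypothetical `call_llm` stand-in; the route labels and prompts are illustrative:

```python
# Minimal routing sketch. The classifier here is just another LLM call,
# but keyword matching or a fine-tuned model works the same way.
# `call_llm` is a hypothetical stand-in for your model client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model client")

ROUTES = {
    "billing":   lambda q: call_llm(f"You are a billing specialist. Help with:\n{q}"),
    "technical": lambda q: call_llm(f"You are a support engineer. Troubleshoot:\n{q}"),
    "sales":     lambda q: call_llm(f"You are a sales assistant. Respond to:\n{q}"),
}

def route(query: str) -> str:
    label = call_llm(
        "Classify this query as exactly one of: billing, technical, sales, other.\n"
        f"Query: {query}\nAnswer with the label only."
    ).strip().lower()
    # Always fall back to a general handler when the classifier is unsure
    # or returns something unexpected.
    handler = ROUTES.get(label, lambda q: call_llm(f"Answer helpfully:\n{q}"))
    return handler(query)
```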
Real-World Examples
- Customer support (Intercom, Zendesk AI): Classify intent (billing, technical, sales, cancellation) → route to specialized agent with domain-specific tools and context.
- GitHub Copilot: Routes between code completion, explanation, test generation, and documentation based on user intent.
- ChatGPT: Routes between browsing, code interpreter, DALL-E, and direct response based on query analysis.
When to Use It
- You have 3+ distinct input types requiring different handling
- Different categories need different tools, prompts, or models
- You want cost optimization (route simple queries to cheaper models)
When NOT to Use It
- Categories overlap significantly (the classifier becomes the bottleneck)
- You only have 1-2 categories (use conditional logic, not an LLM classifier)
- The routing decision itself is the hard problem
Start with 3-5 routes maximum. Over-routing (10+ categories with overlapping boundaries) makes the classifier unreliable and maintenance painful. Always include a default "general" route.
Pattern 3: Parallelization
Complexity: Medium | Source: Anthropic (sectioning + voting), Google, Microsoft
Multiple LLM calls run simultaneously, then results are aggregated. Anthropic identifies two distinct sub-patterns:
Sectioning --- Different Subtasks in Parallel
```
Input → Task A (tone analysis)     ↘
      → Task B (entity extraction) → Aggregate → Output
      → Task C (grammar check)     ↗
```
Different aspects of the same input processed simultaneously. Faster than sequential, but uses more tokens.
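A sketch of sectioning with Python's asyncio, assuming a hypothetical async `call_llm_async` wrapper around your client:

```python
# Sectioning sketch: independent subtasks fanned out concurrently.
# `call_llm_async` is a hypothetical async wrapper around your client.
import asyncio

async def call_llm_async(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model client")

async def analyze(text: str) -> dict:
    # gather() runs the three calls concurrently and preserves order.
    tone, entities, grammar = await asyncio.gather(
        call_llm_async(f"Describe the tone of:\n{text}"),
        call_llm_async(f"List the named entities in:\n{text}"),
        call_llm_async(f"List grammar issues in:\n{text}"),
    )
    # Aggregation is trivial here because the subtasks are independent.
    return {"tone": tone, "entities": entities, "grammar": grammar}

# usage: result = asyncio.run(analyze("some document"))
```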
Voting --- Same Task, Multiple Times
```
Input → Attempt 1 ↘
      → Attempt 2 → Judge/Compare → Best Output
      → Attempt 3 ↗
```
The same task run multiple times for consensus. Catches individual failures through redundancy.
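The voting variant looks almost the same in code: the fan-out runs one prompt N times and a judge call picks a winner. Again a sketch with a hypothetical helper:

```python
# Voting sketch: the same prompt runs N times, then a judge call picks
# the best attempt. `call_llm_async` is a hypothetical async wrapper.
import asyncio

async def call_llm_async(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model client")

async def generate_with_voting(task: str, n: int = 3) -> str:
    attempts = await asyncio.gather(*[call_llm_async(task) for _ in range(n)])
    numbered = "\n\n".join(f"[{i}] {a}" for i, a in enumerate(attempts))
    verdict = await call_llm_async(
        f"Here are {n} candidate answers:\n{numbered}\n"
        "Reply with only the number of the best one."
    )
    try:
        return attempts[int(verdict.strip().strip("[]"))]
    except (ValueError, IndexError):
        return attempts[0]  # fall back if the judge misbehaves
```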
Real-World Examples
- Cursor and AI code editors: Fan out edits to multiple files in parallel, then apply all changes. One file's edit doesn't depend on another's.
- Anthropic's content moderation: Safety checks run in parallel with the main response. Both must complete before output is delivered.
- Code generation systems: Generate 3-5 solutions in parallel, run tests on each, pick the one that passes.
When to Use It
- Subtasks are independent (no step depends on another's output)
- Latency matters more than cost (parallel is faster but multiplies LLM calls)
- You need higher reliability (voting catches individual failures)
When NOT to Use It
- Tasks have sequential dependencies (output of A feeds into B)
- Cost is the primary constraint
- Results need deep coherence with each other (aggregation across complex outputs is hard)
Pattern 4: Orchestrator-Workers
Complexity: High | Source: Anthropic, Google (hierarchical/supervisor), Microsoft (AutoGen), LangGraph
A central orchestrator LLM analyzes the task, dynamically breaks it into subtasks, delegates to specialized workers, and synthesizes results. Unlike chaining, the decomposition happens at runtime --- different inputs produce different subtask structures.
How It Works
```
Input → Orchestrator (plans subtasks dynamically)
          ├→ Worker A (specialized capability)
          ├→ Worker B (different capability)
          └→ Worker C (another capability)
        Orchestrator (synthesizes results) → Output
```
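In code, the difference from chaining is that the subtask list comes from a model call rather than from your program. A minimal sketch, with a hypothetical `call_llm` stand-in and illustrative worker names:

```python
# Orchestrator-workers sketch: the plan comes from a model call at
# runtime, so different inputs produce different subtask structures.
# `call_llm` and the worker names are hypothetical stand-ins.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model client")

WORKERS = {
    "research": lambda task: call_llm(f"Research and summarize: {task}"),
    "write":    lambda task: call_llm(f"Draft text for: {task}"),
    "review":   lambda task: call_llm(f"Critique this work: {task}"),
}

def orchestrate(goal: str) -> str:
    # The orchestrator decides the steps at runtime (assumes bare JSON output).
    plan = json.loads(call_llm(
        f"Break this goal into subtasks. Available workers: {list(WORKERS)}.\n"
        f"Goal: {goal}\n"
        'Reply as JSON: [{"worker": ..., "task": ...}, ...]'
    ))
    results = [
        WORKERS[step["worker"]](step["task"])
        for step in plan
        if step.get("worker") in WORKERS  # skip plans naming unknown workers
    ]
    return call_llm(f"Synthesize these results into one answer for '{goal}':\n{results}")
```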
Real-World Examples
- Devin (Cognition): An orchestrator plans the coding task, spawns workers for file editing, terminal commands, browser research, and code review. The number and type of workers changes per task.
- Microsoft AutoGen: Manager agent coordinates specialist agents (coder, reviewer, executor) with dynamic turn-taking.
- Replit Agent: Plans the app, delegates to code generation, file system operations, package management, and deployment workers.
When to Use It
- Tasks are complex and their decomposition can't be predetermined at design time
- Different inputs require different numbers and types of subtasks
- Sub-agents have genuinely different capabilities (different tools, different models, different context)
When NOT to Use It
- The task decomposition is always the same (use chaining)
- You have fewer than 3 distinct worker capabilities (one agent with tools is simpler)
- Debugging and observability are critical (orchestrator-worker systems are harder to trace)
Anthropic's key distinction: chaining is predetermined orchestration (you know the steps at design time), while orchestrator-workers is dynamic orchestration (the model decides the steps at runtime). If your steps don't change per input, chaining is almost always better.
Pattern 5: Evaluator-Optimizer
Complexity: High | Source: Anthropic, Academic (Reflexion), LangChain
A generator produces output, an evaluator critiques it against defined criteria, and the generator refines based on feedback. The loop continues until quality thresholds are met or max iterations are reached.
How It Works
```
Generator → Output → Evaluator → Feedback → Generator → Refined Output → Evaluator → ...
            (until good enough)
```
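A minimal sketch of the loop, with the guardrails this pattern demands (an iteration cap and a "good enough" threshold) built in; `call_llm` and the scoring prompt are hypothetical stand-ins:

```python
# Evaluator-optimizer sketch with guardrails: a hard iteration cap and a
# quality threshold. `call_llm` is a hypothetical stand-in for your
# model client (assumes bare JSON output from the evaluator).
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model client")

def refine(task: str, max_iters: int = 3, threshold: float = 0.8) -> str:
    draft = call_llm(task)
    for _ in range(max_iters):  # hard cap: never loop unbounded
        review = json.loads(call_llm(
            "Score this draft from 0 to 1 against the task and give feedback.\n"
            f"Task: {task}\nDraft: {draft}\n"
            'Reply as JSON: {"score": ..., "feedback": ...}'
        ))
        if review["score"] >= threshold:
            break  # good enough: stop paying for more iterations
        draft = call_llm(
            f"Task: {task}\nDraft: {draft}\nFeedback: {review['feedback']}\n"
            "Rewrite the draft to address the feedback."
        )
    return draft
```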
Real-World Examples
- Competitive programming systems (AlphaCode): Generate solution → run tests → analyze failures → regenerate. The test suite IS the evaluator.
- AI writing tools (Jasper, Writer): Generate draft → evaluate against brand voice and compliance criteria → refine → re-evaluate.
- Anthropic's Constitutional AI: The training methodology itself is an evaluator-optimizer pattern.
- Translation systems: Translate → back-translate and compare → refine differences.
When to Use It
- Quality criteria are well-defined and measurable
- First-pass output is good enough to refine (not completely wrong)
- You have a reliable evaluator (deterministic tests, a scoring rubric, or a strong judge model)
When NOT to Use It
- You can't define clear evaluation criteria (if you can't measure "good," you can't optimize toward it)
- First-pass quality is consistently terrible (fix the generator first)
- The evaluator is as unreliable as the generator (that's just noise evaluating noise)
Always set max iterations. Always have a "good enough" threshold. Always have a human escape hatch. AutoGPT (2023) showed what happens with unbounded autonomous loops: runaway costs, circular reasoning, and no convergence.
The Decision Flowchart
When choosing a pattern, start at the top and stop at the first match:
| Question | If Yes | If No |
|---|---|---|
| Can a single, well-prompted LLM call handle this? | Stop. Do that. | Continue. |
| Is the task decomposition always the same? | Prompt Chaining (sequential) or Parallelization (independent) | Continue. |
| Do inputs fall into distinct categories needing different handling? | Routing (+ chain or parallel per route) | Continue. |
| Does the task need dynamic decomposition at runtime? | Orchestrator-Workers | Continue. |
| Does output need iterative quality improvement? | Evaluator-Optimizer | Revisit your requirements. |
Most production systems only need the first three rows. If you find yourself at Orchestrator-Workers or Evaluator-Optimizer, double-check that simpler approaches genuinely failed first.
The Anti-Patterns
1. "Agent Everything" Syndrome
The most common mistake. A developer reads about autonomous agents and builds one for a task that a three-step chain would handle perfectly.
Anthropic says it directly: "The most successful implementations are not the most complex. They use simple, composable patterns."
2. Premature Multi-Agent
Designing a five-agent system when a single agent with five tools would work. Every agent boundary adds latency, cost, and information loss. One agent with 5 tools almost always beats 5 agents with 1 tool each.
3. Unbounded Loops
Evaluator-optimizer without max iterations. This is how you wake up to a $500 API bill and no useful output. Set a hard cap. Define "good enough." Include a circuit breaker.
4. Ignoring the Baseline
The strongest anti-pattern in 2026: building an elaborate multi-step architecture when better prompting of a single call would achieve 90%+ accuracy. Foundation models are dramatically more capable than they were in 2023. Many patterns that were necessary with GPT-3.5 are overkill with Claude Opus or GPT-4o.
Before building any pattern, try a single well-crafted prompt with examples. If it works, you're done.
FAQ
Which pattern does Claude Code use internally?
Claude Code primarily uses the orchestrator-workers pattern. A central agent plans tasks, delegates to subagents for exploration and implementation, and synthesizes results. But individual interactions within that system use simpler patterns --- tool calls are essentially prompt chaining, and parallel file edits use parallelization.
Can I combine patterns?
Yes, and most production systems do. Routing at the top level dispatches to different chains. Chains may include parallel steps. An orchestrator might use evaluator-optimizer for quality-critical subtasks. Patterns are composable building blocks.
What about ReAct (Reason + Act)?
ReAct is the foundational tool-use loop: the model reasons about what to do, takes an action (a tool call), observes the result, and repeats. It's the basis for how most agents interact with tools, but it's more of a mechanism than a design pattern. Any of the five patterns above can run ReAct at the individual step level whenever a step involves tools.
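For concreteness, here is a bare-bones sketch of that loop, with a toy tool registry and the same hypothetical `call_llm` stand-in used in the earlier sketches:

```python
# Bare-bones ReAct loop: reason, act (tool call), observe, repeat.
# `call_llm` and the toy tool registry are hypothetical stand-ins.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model client")

TOOLS = {"search": lambda q: f"(toy) search results for: {q}"}

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):  # always bound the loop
        step = json.loads(call_llm(
            f"{transcript}\n"
            'Think, then reply as JSON: {"thought": ..., '
            '"action": "search" or "finish", "input": ...}'
        ))
        if step["action"] == "finish":
            return step["input"]  # the model's final answer
        tool = TOOLS.get(step["action"])
        observation = tool(step["input"]) if tool else f"unknown tool: {step['action']}"
        transcript += f"\nThought: {step['thought']}\nObservation: {observation}"
    return "stopped after max_steps without a final answer"
```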
Is there a performance benchmark for these patterns?
Not a universal one. Performance depends entirely on the task, the model, the prompt quality, and the tool implementation. The best benchmark is your own evaluation: try the simplest pattern first, measure quality and latency, and escalate complexity only when the simpler approach falls short.
Key Takeaways
- Start simple, escalate only when needed --- Anthropic, Google, and Microsoft all agree: the simplest working pattern is the best pattern
- Chaining, routing, and parallelization cover most production use cases --- you probably don't need patterns 4-5
- The decision flowchart is your friend --- walk through it honestly before committing to an architecture
- Anti-patterns are more dangerous than missing patterns --- premature complexity causes more failures than insufficient capability
- Patterns are composable --- the best systems combine simple patterns rather than reaching for one complex one
Further reading: MCP Explained: The USB-C of AI Agents | Building MCP-Compatible Skills for AI Agents