You've decided a single agent isn't enough. The task is too complex, the tools are too diverse, or the workload benefits from specialization. Now comes the hard question: how should these agents talk to each other?
The three fundamental orchestration patterns --- sequential, parallel, and hierarchical --- each come with distinct trade-offs in latency, cost, debuggability, and flexibility. Choosing wrong doesn't just slow you down. It compounds into the kind of architectural debt that takes months to untangle.
Here's how each pattern works, when to use it, and what breaks when you choose wrong.
Key Takeaways
- Sequential is simplest but slowest. Parallel is fastest but hardest to aggregate. Hierarchical is most flexible but most expensive.
- Multi-agent systems use approximately 15x more tokens than ordinary chat interactions (Anthropic data).
- LangGraph benchmarks show that swarm/handoff marginally outperforms hierarchical supervision across all tested conditions.
- Microsoft Azure recommends trying a single agent with tools before reaching for multi-agent orchestration.
- The #1 anti-pattern is using multi-agent when a single agent would work. The #2 is adding agents without structured topology.
Short Answer
Which pattern should I use? If tasks have clear linear dependencies, use sequential. If tasks are independent, use parallel. If task decomposition varies per input and you need runtime flexibility, use hierarchical. If agents need to self-route based on specialization, use swarm/handoff. But first --- try a single agent with multiple tools. Most of the time, that's enough.
Multi-agent systems use ~15x more tokens than ordinary chat interactions. A single agent deploys in weeks; multi-agent takes months. Only escalate to multi-agent when a single agent demonstrably fails due to prompt complexity, tool overload, or security isolation requirements.
Pattern 1: Sequential (Pipeline)
Agents execute in a fixed, linear order. Each agent's output feeds the next. No agent decides who goes next --- the pipeline defines it at design time.
Input → Agent A → Agent B → Agent C → Output
Where It's Used
- Content pipelines: Research → Write → Edit → Publish
- RAG systems: Query rewrite → Retrieval → Rerank → Generate → Fact-check
- Legal document generation (Microsoft Azure): Template selection → Clause customization → Regulatory compliance → Risk assessment
- Financial advisory (Google Cloud): Data retrieval → Analysis → Recommendation → Execution
Implementation
In CrewAI, sequential is the default:
from crewai import Crew, Process

crew = Crew(agents=[researcher, writer, editor], tasks=tasks, process=Process.sequential)
In AutoGen, sequential uses initiate_chats() with a carryover mechanism that accumulates context across chats:
results = sender.initiate_chats([
    {"recipient": agent_1, "message": "Research this topic", "summary_method": "last_msg"},
    {"recipient": agent_2, "message": "Write based on research", "summary_method": "last_msg"},
])
Trade-offs
| Dimension | Rating |
|---|---|
| Latency | Slowest (sum of all agents) |
| Cost | Lowest (fewest total LLM calls) |
| Debuggability | Best (linear trace, step-by-step) |
| Flexibility | Lowest (fixed order, no adaptation) |
| Error handling | Poor (bad output at step 1 poisons everything) |
Add gates between pipeline stages --- simple validation checks that catch bad output before it propagates. "Does the output contain required fields? Is confidence above threshold?" These cost almost nothing and prevent cascading failures.
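A minimal sketch of such a gate, assuming each stage returns a dict; the field names and threshold are illustrative, not any framework's API:

REQUIRED_FIELDS = {"summary", "sources"}  # illustrative: whatever the next stage needs
MIN_CONFIDENCE = 0.7                      # illustrative threshold

def gate(stage_output: dict) -> dict:
    """Cheap validation between stages; no LLM call required."""
    missing = REQUIRED_FIELDS - stage_output.keys()
    if missing:
        raise ValueError(f"Stage output missing fields: {missing}")
    if stage_output.get("confidence", 0.0) < MIN_CONFIDENCE:
        raise ValueError("Confidence below threshold; stopping before it propagates")
    return stage_output

Run one between every pair of stages: the researcher's output passes a gate before the writer ever sees it.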
When to Use
- Clear linear dependencies (each step needs the previous output)
- Draft → review → polish workflows
- When auditability matters (you need to trace each transformation)
- When you can tolerate sequential latency
When NOT to Use
- Stages are independent (use parallel instead)
- You need backtracking (step 3 discovers step 1 was wrong)
- A single agent with good prompting could handle all stages
Pattern 2: Parallel (Fan-Out / Fan-In)
Multiple agents work simultaneously on independent tasks. A coordinator splits the work, specialists execute concurrently, and an aggregator merges results.
        ┌→ Agent A ─┐
Input → ├→ Agent B ─┼→ Aggregator → Output
        └→ Agent C ─┘
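Stripped of any framework, the shape is concurrent calls plus a merge. A minimal asyncio sketch, where run_agent stands in for a real agent invocation:

import asyncio

async def run_agent(name: str, task: str) -> str:
    # Stand-in for a real agent call (LLM request, tool use, etc.).
    await asyncio.sleep(0.1)
    return f"{name} review of: {task}"

async def fan_out_fan_in(task: str) -> list[str]:
    # Fan-out: all specialists start at once, so latency tracks the
    # slowest agent rather than the sum of all agents.
    results = await asyncio.gather(
        run_agent("security", task),
        run_agent("style", task),
        run_agent("performance", task),
        return_exceptions=True,  # one agent's failure doesn't block the rest
    )
    # Fan-in: drop failures, hand the survivors to the aggregator.
    return [r for r in results if not isinstance(r, BaseException)]

print(asyncio.run(fan_out_fan_in("review this PR")))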
Two Sub-Patterns
Sectioning: Different subtasks in parallel. Security review + style review + performance review of the same PR, simultaneously.
Voting: Same task, multiple times. Three agents generate solutions, a judge picks the best. Catches individual failures through redundancy.
Where It's Used
- Code review: Security, style, performance, and accessibility agents reviewing the same PR concurrently
- Stock analysis (Microsoft Azure): Fundamental, technical, sentiment, and ESG analysis agents evaluating the same ticker
- Anthropic Research: Lead agent spawns 3-5 subagents for concurrent exploration, cutting research time by up to 90% for complex queries
- Document analysis: Entity extraction + summarization + classification running simultaneously
Trade-offs
| Dimension | Rating |
|---|---|
| Latency | Best (time = slowest agent, not sum) |
| Cost | High (every branch makes its own LLM calls) |
| Debuggability | Medium (parallel traces, need correlation IDs) |
| Flexibility | Medium (fixed tasks, but adaptable aggregation) |
| Error handling | Good (one agent's failure doesn't block others) |
The Aggregation Problem
The hardest part of parallel orchestration isn't running agents concurrently --- it's merging their results. When three agents analyze the same stock and two say "buy" while one says "sell with high conviction," the aggregation logic determines the outcome.
Strategies:
- Majority voting for classification tasks (see the sketch after this list)
- Weighted merging for scored recommendations
- LLM synthesis when results need narrative reconciliation
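A minimal sketch of the voting strategy, assuming each parallel agent returns a plain label; ties get escalated rather than guessed:

from collections import Counter

def majority_vote(labels: list[str]) -> str:
    """Pick the most common label; refuse to guess on a tie."""
    (winner, count), *rest = Counter(labels).most_common()
    if rest and rest[0][1] == count:
        raise ValueError("Tie: escalate to a judge agent or weighted merge")
    return winner

# Two "buy" votes beat one "sell", whatever its conviction:
print(majority_vote(["buy", "buy", "sell"]))  # -> buy

Weighted merging swaps the Counter for confidence-weighted scores; LLM synthesis hands all results to a model with a reconciliation prompt. The shape is the same either way: N results in, one decision out.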
When to Use
- Tasks are genuinely independent (no agent needs another's output)
- Latency is the primary constraint
- Multiple perspectives improve quality (code review, research)
- You need redundancy for reliability (voting pattern)
When NOT to Use
- Tasks have sequential dependencies
- Cost is the primary constraint (parallel multiplies LLM calls)
- Aggregation logic is too complex or degrades quality
Pattern 3: Hierarchical (Supervisor)
A supervisor agent receives the task, dynamically decomposes it, delegates to specialists, reviews outputs, and synthesizes results. Unlike sequential, the decomposition happens at runtime --- different inputs produce different delegation patterns.
Input → Supervisor ─┬→ Worker A → Supervisor
                    ├→ Worker B → Supervisor → Output
                    └→ Worker C → Supervisor
Where It's Used
- Anthropic Research: Lead researcher plans strategy, spawns subagents for parallel exploration, synthesizes findings. Outperformed single-agent Claude Opus 4 by 90.2% on Anthropic's internal research eval.
- Complex coding tasks: Supervisor plans architecture, delegates file-specific changes to workers, reviews and integrates.
- Enterprise customer service: Tier-1 supervisor routes to specialized teams (billing, technical, sales).
Implementation
In LangGraph:
from langchain_openai import ChatOpenAI
from langgraph_supervisor import create_supervisor

model = ChatOpenAI(model="gpt-4o")
workflow = create_supervisor(
    agents=[research_agent, math_agent, writing_agent],
    model=model,
    prompt="Delegate research to research_agent, math to math_agent, writing to writing_agent"
)
app = workflow.compile()
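Invoking the compiled graph uses the standard LangGraph messages interface (the user message here is illustrative):

result = app.invoke({
    "messages": [{"role": "user", "content": "Research X, then compute the growth rate"}]
})
print(result["messages"][-1].content)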
Supports nested hierarchies --- compile a supervisor into an agent, add it to another supervisor:
research_team = create_supervisor([search_agent, analysis_agent], model=model).compile(name="research_team")
top_level = create_supervisor([research_team, writing_team, review_team], model=model)
In CrewAI:
crew = Crew(
    agents=specialists,
    tasks=tasks,
    process=Process.hierarchical,
    manager_llm="gpt-4o"
)
The manager dynamically allocates tasks based on agent capabilities. Tasks are not pre-assigned.
Trade-offs
| Dimension | Rating |
|---|---|
| Latency | Medium-High (supervisor adds overhead per delegation) |
| Cost | Highest (supervisor LLM calls + all worker calls) |
| Debuggability | Hardest (non-deterministic delegation decisions) |
| Flexibility | Highest (adapts decomposition per input) |
| Error handling | Good (supervisor can reassign failed tasks) |
The "Translation Tax"
LangGraph benchmarks revealed a critical finding: supervisors that paraphrase between sub-agents and users ("the translation tax") use significantly more tokens. Removing this overhead improved performance by nearly 50%. Let workers respond directly when possible.
When to Use
- Task decomposition varies per input (you can't predict subtasks at design time)
- You need quality gates (supervisor reviews before synthesis)
- Cross-functional tasks requiring diverse specializations
- Third-party agents where you can't modify behavior (supervisor is the most generic pattern)
When NOT to Use
- The decomposition is always the same (use sequential or parallel)
- Agents can coordinate directly without a manager (use swarm)
- Supervisor overhead isn't justified for the task complexity
Pattern 4: Swarm / Handoff
Agents dynamically transfer control to each other based on specialization. No central supervisor. An agent decides it's not the right one for the current task and hands off to a better-suited agent.
User → Agent A → (realizes this is a billing question) → Agent B → User
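For concreteness, a minimal sketch using LangGraph's langgraph-swarm package; the agent names, prompts, and model are illustrative. Each agent carries a handoff tool naming a peer it can transfer to, which is exactly why every agent must know about the others:

from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from langgraph_swarm import create_handoff_tool, create_swarm

model = ChatOpenAI(model="gpt-4o")

triage = create_react_agent(
    model,
    tools=[create_handoff_tool(agent_name="billing")],
    prompt="Triage the request. Hand billing questions to the billing agent.",
    name="triage",
)
billing = create_react_agent(
    model,
    tools=[create_handoff_tool(agent_name="triage")],
    prompt="Resolve billing questions. Hand anything else back to triage.",
    name="billing",
)

# No supervisor: control transfers peer-to-peer via handoff tools.
app = create_swarm([triage, billing], default_active_agent="triage").compile()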
The Benchmark Result
LangGraph's benchmarks found that swarm marginally outperformed supervisor across all conditions. The reason: lower token cost because agents respond directly to users without supervisor paraphrasing. However, swarm requires all agents to know about each other, making it unsuitable for third-party integrations.
When to Use
- Customer service routing (triage → specialist → resolution)
- Well-defined agent boundaries where routing logic is clear
- When you control all agents and can configure their handoff rules
The Decision Framework
| Question | If yes |
|---|---|
| Can a single agent with tools handle this? | Don't use multi-agent |
| Do tasks have clear linear dependencies? | Sequential |
| Are tasks independent with no shared dependencies? | Parallel |
| Does task decomposition change per input? | Hierarchical |
| Can agents self-route based on specialization? | Swarm/Handoff |
Microsoft Azure's hierarchy (use the simplest that works):
- Direct model call --- no agents
- Single agent with tools --- "often the right default for enterprise use cases"
- Multi-agent orchestration --- only when single agent fails
Anthropic's data: multi-agent systems use approximately 15x more tokens than ordinary chat interactions. If your single agent works 80% of the time, optimize the prompt before adding more agents.
The Anti-Patterns
The "Bag of Agents"
Adding agents without structured topology. Errors compound multiplicatively through uncoordinated agents --- not additively. Research documents this as the "17x error trap": five agents with individual 90% accuracy chain together to produce 59% overall accuracy (0.9^5 ≈ 0.59).
Over-Orchestrating
A major retailer spent 18 months building a "perfect" multi-agent system that was obsolete on launch. One documented case: $12,000/month in LLM tokens to do what three API calls and a decision tree could handle for $40.
Shared Memory Race Conditions
Multiple agents reading and writing shared state without versioning produces non-deterministic behavior impossible to reproduce. Run the same job twice, get different results. This is a production debugging nightmare.
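One common mitigation is optimistic versioning: every write names the version it read, and stale writes are rejected instead of silently clobbering a peer's update. A minimal sketch; this store is illustrative, not any framework's API:

import threading

class VersionedStore:
    """Shared agent state with compare-and-set semantics."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state: dict = {}
        self._version = 0

    def read(self) -> tuple[dict, int]:
        with self._lock:
            return dict(self._state), self._version

    def write(self, updates: dict, expected_version: int) -> int:
        with self._lock:
            if expected_version != self._version:
                # Another agent wrote first: re-read and retry,
                # don't silently overwrite.
                raise RuntimeError("Stale write; re-read and retry")
            self._state.update(updates)
            self._version += 1
            return self._version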
Beyond ~4 Agents, Accuracy Saturates
Without structured topology, accuracy gains plateau or fluctuate beyond approximately four agents. Adding more agents without an orchestrating structure is like adding engineers without a project manager.
FAQ
How do I debug multi-agent systems?
Structured logging with correlation IDs that trace a request through every agent. Log every tool call, every agent response, every delegation decision. Without this, you're debugging a black box. OpenTelemetry with agent-specific spans is the emerging standard.
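The correlation-ID half of that takes a few lines of Python stdlib; a sketch, with illustrative field names:

import logging
import uuid
from contextvars import ContextVar

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()  # stamp every record
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(correlation_id)s %(name)s %(message)s"))
handler.addFilter(CorrelationFilter())
logger = logging.getLogger("agents")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request(task: str) -> None:
    correlation_id.set(uuid.uuid4().hex[:8])  # one ID per request
    logger.info("delegating: %s", task)       # every agent logs under this ID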
Which framework should I start with?
CrewAI for simplicity (least code to get started, most opinionated). LangGraph for flexibility (graph-based, supports all patterns). AutoGen for research and group chat patterns. If unsure, try CrewAI first --- you can always migrate.
Can I combine patterns?
Yes, and most production systems do. Routing at the top level dispatches to different pipelines. Pipelines may include parallel steps. A supervisor might use evaluator-optimizer for quality-critical subtasks. Patterns are composable.
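Composition in miniature, reusing run_agent and fan_out_fan_in from the parallel sketch above (the stage names are illustrative):

async def review_pipeline(pr: str) -> str:
    summary = await run_agent("summarizer", pr)         # sequential stage
    reviews = await fan_out_fan_in(summary)             # parallel stage mid-pipeline
    return await run_agent("synthesizer", "\n".join(reviews))  # sequential merge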
What about the A2A protocol for multi-agent?
A2A is for agent-to-agent communication across organizational boundaries --- agents that don't share a framework or deployment. For agents within a single system, framework-native orchestration (LangGraph, AutoGen, CrewAI) is simpler. See our MCP vs A2A comparison for details.
Key Takeaways
- Start with a single agent --- escalate to multi-agent only when it demonstrably fails
- Sequential for pipelines, parallel for speed, hierarchical for flexibility --- each pattern has clear use cases
- Multi-agent costs ~15x more tokens than a chat interaction --- budget accordingly or don't do it
- Swarm beats supervisor on benchmarks --- but requires all agents to know about each other
- The anti-patterns are more dangerous than the wrong pattern --- unstructured agent topology and shared state race conditions cause the worst production failures
Further reading: The 5 AI Agent Design Patterns Every Architect Must Know | MCP vs A2A vs ACP: The Comparison Guide