The agent that delegated
On May 28, 2026, Anthropic shipped the Agent SDK alongside Claude Opus 4.8. The SDK lets a single orchestrator spawn parallel subagents, each with its own tools and structured output schema. The agents run concurrently, return typed results, and the orchestrator synthesizes them.

Introduction
On May 28, 2026, Anthropic shipped the Agent SDK alongside Claude Opus 4.8. The SDK lets a single orchestrator spawn parallel subagents, each with its own tools and structured output schema. The agents run concurrently, return typed results, and the orchestrator synthesizes them. This is not a research preview. It is a production API with concurrency caps, retry logic, and schema validation built in.
Most multi-agent tutorials stop at two LLMs passing messages back and forth. One generates, the other reviews. That is not orchestration. That is a relay. This tutorial builds a real multi-agent code review pipeline that fans out three specialist reviewers in parallel, collects structured findings, and then adversarially verifies each finding before returning confirmed results.
What the Agent SDK gives you
Three primitives matter for orchestration:
- agent(prompt, opts) spawns a subagent. With a schema option, the agent returns a validated JSON object. Without one, raw text. The schema is what makes ten agents composable without string parsing.
- parallel(thunks) runs multiple agent calls concurrently behind a barrier. All finish before results return.
- pipeline(items, stage1, stage2, ...) runs items through sequential stages with no barrier between them. Item A can be in stage 3 while item B is still in stage 1.
For this tutorial we will use the Anthropic Python SDK directly with asyncio, which gives you the same fan-out pattern without depending on the workflow runtime.
Setup
You need an Anthropic API key and the Python SDK:
pip install anthropic
export ANTHROPIC_API_KEY="sk-ant-..."Verify the installation:
import anthropic
client = anthropic.AsyncAnthropic()
print("SDK version:", anthropic.__version__)Defining the review schema
Every specialist agent returns findings against the same schema. This is the critical design decision. Without a shared schema, you are parsing free text from ten agents and hoping the formats converge.
FINDING_SCHEMA = {
"type": "object",
"properties": {
"findings": {
"type": "array",
"items": {
"type": "object",
"properties": {
"file": {"type": "string"},
"line": {"type": "integer"},
"severity": {
"type": "string",
"enum": ["high", "medium", "low"]
},
"title": {"type": "string"},
"description": {"type": "string"}
},
"required": [
"file", "line", "severity",
"title", "description"
]
}
}
},
"required": ["findings"]
}The line field is intentionally an integer, not a string. Models approximate line numbers from diffs, especially across renames, so expect some noise here. Design schemas for what the model can reliably determine, not what you wish it could.
Building the specialist reviewers
Each reviewer gets a narrow system prompt that restricts its focus to one dimension. The narrower the scope, the fewer hallucinated findings.
import anthropic
import asyncio
client = anthropic.AsyncAnthropic()
DIMENSIONS = [
{
"key": "bugs",
"system": (
"You are a code reviewer focused exclusively on "
"correctness bugs. Ignore style, performance, and "
"security. Return only genuine logic errors, off-by-one "
"mistakes, null dereferences, and race conditions."
),
},
{
"key": "perf",
"system": (
"You are a code reviewer focused exclusively on "
"performance. Ignore correctness and security. Return "
"only measurable inefficiencies: unnecessary allocations, "
"O(n^2) where O(n) exists, missing indexes, redundant I/O."
),
},
{
"key": "security",
"system": (
"You are a code reviewer focused exclusively on security "
"vulnerabilities. Ignore style and performance. Return "
"only exploitable issues: injection, auth bypass, secrets "
"in code, unsafe deserialization."
),
},
]
async def review_dimension(dimension, diff):
"""Run a single specialist reviewer on the diff."""
response = await client.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
system=dimension["system"],
messages=[
{"role": "user", "content": f"Review this diff:\n\n{diff}"}
],
tools=[
{
"name": "submit_findings",
"description": "Submit structured review findings",
"input_schema": FINDING_SCHEMA,
}
],
tool_choice={"type": "tool", "name": "submit_findings"},
)
for block in response.content:
if block.type == "tool_use":
return {"dimension": dimension["key"], **block.input}
return {"dimension": dimension["key"], "findings": []}
async def run_parallel_review(diff):
"""Fan out all specialist reviewers concurrently."""
tasks = [review_dimension(d, diff) for d in DIMENSIONS]
return await asyncio.gather(*tasks)tool_choice={"type": "tool", "name": "submit_findings"} forces the model to call the tool, which means the response is always structured JSON matching your schema. No parsing, no regex, no "the model decided to answer in prose instead."
Adversarial verification
Parallel agents hallucinate at roughly the same rate they find real issues. A security reviewer flags a SQL injection that is actually parameterized. A perf reviewer flags an allocation inside a loop that runs exactly once. The fix is a second pass: for each finding, spawn a skeptic agent prompted to refute it.
VERDICT_SCHEMA = {
"type": "object",
"properties": {
"is_real": {"type": "boolean"},
"reasoning": {"type": "string"},
},
"required": ["is_real", "reasoning"],
}
async def verify_finding(finding, diff):
"""Spawn a skeptic to try to refute a single finding."""
prompt = (
f"Finding: {finding['title']}\n"
f"Description: {finding['description']}\n"
f"Location: {finding['file']}:{finding['line']}\n\n"
f"Diff:\n{diff}"
)
response = await client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=(
"You are a skeptical code reviewer. Your only job is "
"to REFUTE the following finding. If you cannot refute "
"it with evidence from the code, mark it as real. "
"Default to refuted if uncertain."
),
messages=[{"role": "user", "content": prompt}],
tools=[
{
"name": "submit_verdict",
"description": "Submit verification verdict",
"input_schema": VERDICT_SCHEMA,
}
],
tool_choice={"type": "tool", "name": "submit_verdict"},
)
for block in response.content:
if block.type == "tool_use":
return {**finding, **block.input}
return {**finding, "is_real": False, "reasoning": "Verification failed"}
async def verify_all(results, diff):
"""Run adversarial verification on all findings."""
all_findings = [
{**f, "dimension": r["dimension"]}
for r in results
for f in r["findings"]
]
tasks = [verify_finding(f, diff) for f in all_findings]
verified = await asyncio.gather(*tasks)
return [f for f in verified if f["is_real"]]The verification stage doubles the agent count and roughly doubles the cost. In our testing on production diffs, it cuts false positives by 60-70%. (The remaining false positives are the ones the verifier also gets wrong, which tells you something about the current ceiling of LLM-as-judge.)
Putting it all together
async def main():
diff = open("changes.diff").read()
# Phase 1: parallel specialist review
results = await run_parallel_review(diff)
total = sum(len(r["findings"]) for r in results)
print(f"Phase 1: {total} findings across {len(DIMENSIONS)} dimensions")
for r in results:
print(f"\n [{r['dimension'].upper()}] {len(r['findings'])} findings")
for f in r["findings"]:
print(f" [{f['severity']}] {f['file']}:{f['line']} {f['title']}")
# Phase 2: adversarial verification
confirmed = await verify_all(results, diff)
print(f"\nPhase 2: {len(confirmed)} confirmed out of {total}")
for f in confirmed:
print(f" [{f['severity']}] {f['file']}:{f['line']} {f['title']}")
print(f" {f['reasoning']}")
asyncio.run(main())Three reviewers plus up to three verifiers per finding. On a 2,000-line diff with Sonnet 4.6, expect roughly $0.30-0.50 per run and 15-30 seconds wall-clock (the concurrency helps).
When to use this
This pattern earns its cost when: - You are gating a release or a merge to a protected branch and need high-confidence review - The diff spans multiple files across different concerns (API + DB + auth) and a single-pass review consistently misses cross-cutting issues - You want structured, machine-readable findings that feed into a CI pipeline or a dashboard, not prose comments
When this is overkill
Skip the multi-agent pattern when: - The diff is under 200 lines and a single Claude call with "review this for bugs, perf, and security" covers it fine - You need results in under 5 seconds (single call is faster than fan-out + verification) - Cost sensitivity is high and you are running this on every commit to every branch ($0.40 per run on a team pushing 50 commits/day is $400/month) - The review is for style or formatting, not correctness. LLMs are not linters. Use a linter.
What comes next
The same fan-out-and-verify pattern applies beyond code review. Document analysis (three agents reading for legal risk, financial exposure, and compliance gaps), test generation (one agent per module, skeptic verifies each test actually asserts something), migration planning (one agent per service boundary, synthesis agent deduplicates the dependency graph). The Agent SDK primitives are general. The specificity is in the prompts and schemas you hand them.
The hard part is never asyncio.gather. It is knowing what each specialist should see, what it should ignore, and when its output should be overruled.
You might also like
Keep reading from the journal.
January 7, 2026RAG
Taking your RAG pipelines to a next level ! LangGraphs
Already working with RAG pipelines and LangChain? This article introduces LangGraph, LangChain’s latest upgrade, and shows how it elevates RAG workflows with better structure, control, and scalability for modern generative AI applications.
May 18, 2026AI
Don't pre-build the AI layer
The retail industry has a name for the developer who builds the mall before the anchor tenant signs. Optimistic. The same developer five years later, when the mall is half empty and the anchor never showed, is called bankrupt.
May 5, 2026AI
Not everything needs an agent
In 1931, Rube Goldberg won a Pulitzer Prize for drawings of machines that accomplished simple tasks through spectacular chains of unnecessary steps. A self-wiping napkin apparatus involving a parrot, a lit candle, a swinging pendulum, and seven other components.