The agent that delegated

June 2, 2026 · 6 min read · 10 sections · By Ahmed Abdullah

Introduction

On May 28, 2026, Anthropic shipped the Agent SDK alongside Claude Opus 4.8. The SDK lets a single orchestrator spawn parallel subagents, each with its own tools and structured output schema. The agents run concurrently, return typed results, and the orchestrator synthesizes them. This is not a research preview. It is a production API with concurrency caps, retry logic, and schema validation built in.

Most multi-agent tutorials stop at two LLMs passing messages back and forth. One generates, the other reviews. That is not orchestration. That is a relay. This tutorial builds a real multi-agent code review pipeline that fans out three specialist reviewers in parallel, collects structured findings, and then adversarially verifies each finding before returning confirmed results.

What the Agent SDK gives you

Three primitives matter for orchestration:

agent(prompt, opts) spawns a subagent. With a schema option, the agent returns a validated JSON object. Without one, raw text. The schema is what makes ten agents composable without string parsing.
parallel(thunks) runs multiple agent calls concurrently behind a barrier. All finish before results return.
pipeline(items, stage1, stage2, ...) runs items through sequential stages with no barrier between them. Item A can be in stage 3 while item B is still in stage 1.
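The difference between a barrier and no barrier is easier to see in plain asyncio. The following is a minimal sketch, not the SDK's implementation: `parallel_sketch`, `pipeline_sketch`, and the two toy stages are hypothetical stand-ins written only to illustrate the scheduling behavior described above.

```python
import asyncio


async def parallel_sketch(thunks):
    """Barrier: start every thunk, return only once all of them have finished."""
    return await asyncio.gather(*(thunk() for thunk in thunks))


async def pipeline_sketch(items, *stages):
    """No barrier: each item walks through the stages on its own schedule."""
    async def run_one(item):
        for stage in stages:
            item = await stage(item)
        return item

    return await asyncio.gather(*(run_one(item) for item in items))


async def demo():
    log = []  # records the order in which stages complete

    async def stage1(x):
        await asyncio.sleep(0.01 * x)  # larger items take longer (toy delay)
        log.append(("stage1", x))
        return x + 1

    async def stage2(x):
        log.append(("stage2", x))
        return x * 10

    results = await pipeline_sketch([1, 5], stage1, stage2)
    return results, log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, the fast item clears stage 2 before the slow item has left stage 1, which is exactly what a barrier between stages would prevent.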

For this tutorial we will use the Anthropic Python SDK directly with asyncio, which gives you the same fan-out pattern without depending on the workflow runtime.

Setup

You need an Anthropic API key and the Python SDK:

code

pip install anthropic
export ANTHROPIC_API_KEY="sk-ant-..."

Verify the installation:

python

import anthropic
client = anthropic.AsyncAnthropic()
print("SDK version:", anthropic.__version__)

Defining the review schema

Every specialist agent returns findings against the same schema. This is the critical design decision. Without a shared schema, you are parsing free text from ten agents and hoping the formats converge.

code

FINDING_SCHEMA = {
    "type": "object",
    "properties": {
        "findings": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "file": {"type": "string"},
                    "line": {"type": "integer"},
                    "severity": {
                        "type": "string",
                        "enum": ["high", "medium", "low"],
                    },
                    "title": {"type": "string"},
                    "description": {"type": "string"},
                },
                "required": [
                    "file", "line", "severity",
                    "title", "description",
                ],
            },
        },
    },
    "required": ["findings"],
}

The line field is intentionally an integer, not a string. Models approximate line numbers from diffs, especially across renames, so expect some noise here. Design schemas for what the model can reliably determine, not what you wish it could.
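Because some noise is expected, it can help to check each finding defensively before passing it downstream. The sketch below uses only the standard library; `finding_errors` and `split_findings` are hypothetical helper names, and the checks mirror the fields of the schema above rather than implementing full JSON Schema validation (a library such as jsonschema would be the more complete option).

```python
ALLOWED_SEVERITIES = {"high", "medium", "low"}

# Field name -> expected Python type, mirroring the finding schema.
REQUIRED_FIELDS = {
    "file": str,
    "line": int,
    "severity": str,
    "title": str,
    "description": str,
}


def finding_errors(finding):
    """Return a list of problems with one finding dict (empty list means valid)."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in finding:
            errors.append(f"missing field: {name}")
            continue
        value = finding[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{name} should be {expected.__name__}")
    severity = finding.get("severity")
    if isinstance(severity, str) and severity not in ALLOWED_SEVERITIES:
        errors.append(f"unknown severity: {severity}")
    return errors


def split_findings(payload):
    """Partition payload["findings"] into (valid, rejected) lists."""
    valid, rejected = [], []
    for finding in payload.get("findings", []):
        (rejected if finding_errors(finding) else valid).append(finding)
    return valid, rejected
```

Rejected findings are worth logging rather than discarding silently, since a spike in rejections usually points at a prompt or schema problem.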

Building the specialist reviewers

Each reviewer gets a narrow system prompt that restricts its focus to one dimension. The narrower the scope, the fewer hallucinated findings.

python

import anthropic
import asyncio
client = anthropic.AsyncAnthropic()
DIMENSIONS = [
    {
        "key": "bugs",
        "system": (
            "You are a code reviewer focused exclusively on "
            "correctness bugs. Ignore style, performance, and "
            "security. Return only genuine logic errors, off-by-one "
            "mistakes, null dereferences, and race conditions."
        ),
    },
    {
        "key": "perf",
        "system": (
            "You are a code reviewer focused exclusively on "
            "performance. Ignore correctness and security. Return "
            "only measurable inefficiencies: unnecessary allocations, "
            "O(n^2) where O(n) exists, missing indexes, redundant I/O."
        ),
    },
    {
        "key": "security",
        "system": (
            "You are a code reviewer focused exclusively on security "
            "vulnerabilities. Ignore style and performance. Return "
            "only exploitable issues: injection, auth bypass, secrets "
            "in code, unsafe deserialization."
        ),
    },
]

async def review_dimension(dimension, diff):
    """Run a single specialist reviewer on the diff."""
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        system=dimension["system"],
        messages=[
            {"role": "user", "content": f"Review this diff:\n\n{diff}"}
        ],
        tools=[
            {
                "name": "submit_findings",
                "description": "Submit structured review findings",
                "input_schema": FINDING_SCHEMA,
            }
        ],
        tool_choice={"type": "tool", "name": "submit_findings"},
    )
    # Forced tool use should yield a tool_use block; return its input.
    for block in response.content:
        if block.type == "tool_use":
            return {"dimension": dimension["key"], **block.input}
    # Fallback: no tool_use block came back, so report no findings.
    return {"dimension": dimension["key"], "findings": []}


async def run_parallel_review(diff):
    """Fan out all specialist reviewers concurrently."""
    tasks = [review_dimension(d, diff) for d in DIMENSIONS]
    return await asyncio.gather(*tasks)

tool_choice={"type": "tool", "name": "submit_findings"} forces the model to call the tool, so the response arrives as a tool_use block whose input is a JSON object shaped by your schema. No parsing, no regex, no "the model decided to answer in prose instead."
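One case a forced tool call does not cover is truncation: if the reply stops because it hit max_tokens, the tool input may be incomplete. The helper below is a sketch of a stricter extraction step. `extract_tool_input` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins that only mimic the attributes read here (`stop_reason`, `content`, `type`, `name`, `input`), so the example runs without an API call.

```python
from types import SimpleNamespace


def extract_tool_input(response, tool_name):
    """Return the input dict of the first matching tool_use block, else None."""
    if response.stop_reason == "max_tokens":
        # Output was cut off, so the tool input may be incomplete; treat as failure.
        return None
    for block in response.content:
        if block.type == "tool_use" and block.name == tool_name:
            return block.input
    return None


# Stand-in for a Messages API response (illustration only).
fake_response = SimpleNamespace(
    stop_reason="tool_use",
    content=[
        SimpleNamespace(type="tool_use", name="submit_findings",
                        input={"findings": []}),
    ],
)

if __name__ == "__main__":
    print(extract_tool_input(fake_response, "submit_findings"))
```

Returning None instead of an empty findings list lets the caller tell "the reviewer found nothing" apart from "the call did not complete", and retry the latter with a larger max_tokens.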

Adversarial verification

Parallel agents hallucinate at roughly the same rate they find real issues. A security reviewer flags a SQL injection that is actually parameterized. A perf reviewer flags an allocation inside a loop that runs exactly once. The fix is a second pass: for each finding, spawn a skeptic agent prompted to refute it.

python

VERDICT_SCHEMA = {
    "type": "object",
    "properties": {
        "is_real": {"type": "boolean"},
        "reasoning": {"type": "string"},
    },
    "required": ["is_real", "reasoning"],
}

async def verify_finding(finding, diff):
    """Spawn a skeptic to try to refute a single finding."""
    prompt = (
        f"Finding: {finding['title']}\n"
        f"Description: {finding['description']}\n"
        f"Location: {finding['file']}:{finding['line']}\n\n"
        f"Diff:\n{diff}"
    )
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=(
            "You are a skeptical code reviewer. Your only job is "
            "to REFUTE the following finding. If you cannot refute "
            "it with evidence from the code, mark it as real. "
            "Default to refuted if uncertain."
        ),
        messages=[{"role": "user", "content": prompt}],
        tools=[
            {
                "name": "submit_verdict",
                "description": "Submit verification verdict",
                "input_schema": VERDICT_SCHEMA,
            }
        ],
        tool_choice={"type": "tool", "name": "submit_verdict"},
    )
    # Merge the skeptic's verdict into the original finding.
    for block in response.content:
        if block.type == "tool_use":
            return {**finding, **block.input}
    # Fallback: no verdict came back, so treat the finding as unconfirmed.
    return {**finding, "is_real": False, "reasoning": "Verification failed"}


async def verify_all(results, diff):
    """Run adversarial verification on all findings."""
    # Flatten per-dimension results, tagging each finding with its dimension.
    all_findings = [
        {**f, "dimension": r["dimension"]}
        for r in results
        for f in r["findings"]
    ]
    tasks = [verify_finding(f, diff) for f in all_findings]
    verified = await asyncio.gather(*tasks)
    return [f for f in verified if f["is_real"]]

The verification stage adds one skeptic call per finding, which roughly doubles the agent count and the cost. In our testing on production diffs, it cuts false positives by 60-70%. (The remaining false positives are the ones the verifier also gets wrong, which tells you something about the current ceiling of LLM-as-judge.)
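Since verification launches one call per finding, a noisy diff can start dozens of requests at once. One common way to bound that is an asyncio.Semaphore around each call. The sketch below is an assumption about how you might cap the fan-out, not part of the pipeline above; `gather_limited` and `fake_verify` are hypothetical names, and the demo replaces the API call with a short sleep.

```python
import asyncio


async def gather_limited(coro_factories, limit):
    """Run zero-argument coroutine factories concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def run(factory):
        async with semaphore:
            return await factory()

    # gather preserves input order in its result list.
    return await asyncio.gather(*(run(f) for f in coro_factories))


async def demo():
    active = 0
    peak = 0

    async def fake_verify(i):
        # Stand-in for one skeptic call; tracks how many run at once.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return i * 2

    factories = [lambda i=i: fake_verify(i) for i in range(10)]
    results = await gather_limited(factories, limit=3)
    return results, peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

To apply it to the verification stage, the task list would become a list of factories such as `lambda f=f: verify_finding(f, diff)`, so that no coroutine is created until a slot is free.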

Putting it all together

python

async def main():
    with open("changes.diff") as fh:
        diff = fh.read()

    # Phase 1: parallel specialist review
    results = await run_parallel_review(diff)
    total = sum(len(r["findings"]) for r in results)
    print(f"Phase 1: {total} findings across {len(DIMENSIONS)} dimensions")
    for r in results:
        print(f"\n [{r['dimension'].upper()}] {len(r['findings'])} findings")
        for f in r["findings"]:
            print(f" [{f['severity']}] {f['file']}:{f['line']} {f['title']}")

    # Phase 2: adversarial verification
    confirmed = await verify_all(results, diff)
    print(f"\nPhase 2: {len(confirmed)} confirmed out of {total}")
    for f in confirmed:
        print(f" [{f['severity']}] {f['file']}:{f['line']} {f['title']}")
        print(f" {f['reasoning']}")


asyncio.run(main())

Three reviewers plus one skeptic per finding. On a 2,000-line diff with Sonnet 4.6, expect roughly $0.30-0.50 per run and 15-30 seconds wall-clock (the concurrency helps).

When to use this

This pattern earns its cost when:
- You are gating a release or a merge to a protected branch and need high-confidence review
- The diff spans multiple files across different concerns (API + DB + auth) and a single-pass review consistently misses cross-cutting issues
- You want structured, machine-readable findings that feed into a CI pipeline or a dashboard, not prose comments

When this is overkill

Skip the multi-agent pattern when:
- The diff is under 200 lines and a single Claude call with "review this for bugs, perf, and security" covers it fine
- You need results in under 5 seconds (single call is faster than fan-out + verification)
- Cost sensitivity is high and you are running this on every commit to every branch ($0.40 per run on a team pushing 50 commits/day is $20/day, or about $400/month over 20 working days)
- The review is for style or formatting, not correctness. LLMs are not linters. Use a linter.
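The cost figure in the list above is simple multiplication, and it is worth redoing with your own numbers. A small sketch, where `monthly_cost` is a hypothetical helper and the default of 20 working days per month is an assumption chosen because it reproduces the $400 figure:

```python
def monthly_cost(cost_per_run, runs_per_day, working_days=20):
    """Estimated monthly spend in dollars when the pipeline runs on every commit."""
    return cost_per_run * runs_per_day * working_days


if __name__ == "__main__":
    # $0.40 per run, 50 commits per day, 20 working days (assumed).
    print(f"${monthly_cost(0.40, 50):.2f} per month")
```

Running the pipeline only on merges to protected branches lowers `runs_per_day`, which is usually the cheapest lever to pull.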

What comes next

The same fan-out-and-verify pattern applies beyond code review. Document analysis (three agents reading for legal risk, financial exposure, and compliance gaps), test generation (one agent per module, skeptic verifies each test actually asserts something), migration planning (one agent per service boundary, synthesis agent deduplicates the dependency graph). The Agent SDK primitives are general. The specificity is in the prompts and schemas you hand them.

The hard part is never asyncio.gather. It is knowing what each specialist should see, what it should ignore, and when its output should be overruled.
