
Tool Selection at Scale: MCP and Dynamic Routing

Why tool selection collapses past 30 tools, and how MCP, lazy loading, and retrieval keep accuracy high across thousands of tools without context bloat.

Jatin Bansal@blog:~/ai-engineering$ open tool-selection-at-scale

A coding agent ships with three tools: read_file, edit_file, run_tests. It works. Six months later it has 47: every team in the org added their own MCP server, the deploy bot got integrated, somebody wired up the data warehouse. The agent now fails on simpler questions than the three-tool version handled without trouble. It picks grep_repository when the user wanted search_docs, calls deploy_staging when asked to “check the deploy,” and burns 18k tokens of tool definitions on every turn before the user message starts. The model didn’t regress. The tool surface regressed. Past roughly 30 tools, an agent’s tool-selection accuracy drops faster than you’d expect, the per-call token cost climbs linearly with catalog size, and the failure modes shift from “model picked the wrong arguments” to “model couldn’t even tell which tool was relevant.” This is the wall every production agent hits, and it’s the wall this article is about.

Opening bridge

Yesterday’s piece on multi-agent orchestration closed with the observation that the supervisor pattern’s tool list overflows somewhere around ten specialists — and that the natural fix is a router-style tool selection layer, deferred for its own piece. This is that piece, and it generalizes the problem. The same dynamic that breaks a supervisor’s specialist-routing decision also breaks a single agent’s tool selection: the model treats each candidate as a near-uniform prior, and the cost of disambiguating climbs with N. The fixes — dynamic tool routing, lazy schema loading, retrieval over the tool catalog itself — are useful at every level. The tool-use article flagged the 30-tool wall and pointed forward; today we work through it.

What “tool selection at scale” actually means

The phrase covers three distinct problems that get conflated in casual conversation, and the right fix depends on which one you have.

Selection accuracy. Given a user request and N candidate tools, can the model pick the right one? Empirically, this is fine up to ~10 tools, soft-degrades from 10 to 30, and falls off a cliff past ~30. Anthropic’s published numbers for their Tool Search Tool show Opus 4 selection accuracy moving from 49% to 74% when tool search is enabled on a large catalog, and Opus 4.5 moving from 79.5% to 88.1% — the cliff is real and the lift is large.
Context cost. Every tool’s name, description, and JSON schema is serialized into the system prompt on every call. A five-server MCP setup (GitHub, Slack, Sentry, Grafana, Splunk) consumes about 55k tokens in tool definitions before the conversation starts. Adding Atlassian’s MCP server alone is roughly another 17k. A 200k-token Claude model with 100k of “preamble” is a 100k-token model, which is a worse model.
Namespace conflicts. Across the public MCP server ecosystem, Microsoft Research catalogued 775 tools with name collisions. search appears in 32 servers; get_user in 11; execute_query in 11. Without namespacing, two servers in the same agent collide deterministically, and the disambiguation has to happen somewhere — either in the client, the protocol, or the model.

Confusing these is the most common mistake. Lazy schema loading solves context cost and helps selection accuracy by reducing the candidate pool, but does nothing for namespace conflicts. Embedding-based retrieval over tool descriptions solves accuracy and context cost, but only if your embedding model is reliable on tool-language (it usually is — these descriptions are short, semantically dense, and well-formed). Tool prefixing solves namespaces but doesn’t shrink the catalog. The right production setup uses all three.
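Before choosing a fix it helps to know whether you have the namespace problem at all. A minimal sketch of a collision check, assuming you can list the bare tool names each server exposes; the `servers` mapping and the `find_name_collisions` helper are hypothetical and exist only for illustration:

```python
from collections import defaultdict


def find_name_collisions(servers: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each tool name exposed by more than one server to those servers."""
    owners: dict[str, list[str]] = defaultdict(list)
    for server, tool_names in servers.items():
        for name in set(tool_names):  # ignore duplicates inside one server
            owners[name].append(server)
    return {name: sorted(srvs) for name, srvs in owners.items() if len(srvs) > 1}


# Hypothetical catalogs, for illustration only.
servers = {
    "github": ["search", "get_user", "create_pr"],
    "docs": ["search", "get_page"],
    "crm": ["get_user", "execute_query"],
}
collisions = find_name_collisions(servers)
```

An empty result means prefixing is optional for now; anything else is a name the client has to disambiguate before the model sees it.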

Why selection degrades past ~30 tools

Three mechanisms compound. None of them is a model bug; they’re inherent to how tool selection works.

The schema lives in the prompt and bloats the working context. Tool definitions are not metadata that lives outside the call — they are JSON serialized into the system prompt, prepended to every user turn. The model attends to them with the same lossy mechanism it uses for the rest of the prompt, and the lost-in-the-middle failure modes from the context-engineering article apply directly. A tool described in row 4 of a 100-row schema list is easier for the model to retrieve than the same tool in row 47.
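To see how quickly definitions add up, a rough estimator is enough. The sketch below serializes the tool list and divides by four characters per token. That ratio is a crude assumption, not a real tokenizer, so treat the output as an order-of-magnitude figure and use your provider's token counter for real budgets. The helper name is hypothetical.

```python
import json

CHARS_PER_TOKEN = 4  # crude heuristic, NOT a real tokenizer


def estimate_definition_tokens(tools: list[dict]) -> int:
    """Approximate the prompt tokens consumed by serialized tool definitions."""
    serialized = json.dumps(tools, separators=(",", ":"))
    return len(serialized) // CHARS_PER_TOKEN


example_tool = {
    "name": "slack_post_message",
    "description": "Post a message to a Slack channel.",
    "input_schema": {
        "type": "object",
        "properties": {
            "channel": {"type": "string"},
            "text": {"type": "string"},
        },
        "required": ["channel", "text"],
    },
}

one_tool = estimate_definition_tokens([example_tool])
fifty_tools = estimate_definition_tokens([example_tool] * 50)
```

The estimate grows linearly with the number of definitions, which is the cost every turn pays when the whole catalog is loaded up front.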

Descriptions interfere with each other. When you have two tools whose descriptions both contain “search for documents,” the model can no longer route on description alone — it has to read both schemas, compare argument names, and pick. As the catalog grows, semantic clusters appear: 20+ variants of web_search across MCP servers per the Microsoft data. Each cluster forces a disambiguation the model often gets wrong.

The prior shifts toward “any tool is plausible.” With three tools, the model’s prior is sharp: each candidate has a ~33% baseline likelihood of being relevant before considering the user request. With 50 tools, the baseline collapses, and small lexical hints (a keyword that appears in two descriptions) become disproportionately influential. This is why “the wrong tool that sounded right” becomes the dominant failure mode at scale — the signal-to-noise ratio of any single description drops as N grows.

These compound. By N=50 you have schema bloat and description interference and a flat prior, and the model’s selection accuracy on the long tail of tools collapses even when its accuracy on the top 5 stays fine. Anthropic’s published guidance — tool selection accuracy degrades significantly once you exceed 30–50 tools — is the right ballpark.

The distributed-systems parallel

Tool selection at scale is service discovery for a non-deterministic client, and the failure modes are the same ones every distributed system has hit.

The tool list is a service registry. A monolith with three local function calls is a tool list with three tools. A microservices estate with hundreds of services is a tool list with hundreds of MCP-exposed tools. Just as the monolith→microservices move forced explicit service discovery (Consul, etcd, DNS-based discovery, sidecar service meshes), the tools→MCP move forces explicit tool discovery. Loading every service descriptor into every client at startup didn’t scale; loading every tool schema into every prompt doesn’t either.

Lazy loading is the standard answer. gRPC’s reflection service, Kubernetes’ Discovery API, and Consul’s catalog API all let a client ask what’s available at runtime rather than hardcoding. Anthropic’s defer_loading: true flag and the Tool Search Tool are exactly this pattern: tools are registered in the catalog but not loaded into the model’s context until the model issues a search-then-load call. The discovery call is a network round-trip in microservices and a synthetic tool call in agents; the semantics are isomorphic.

Namespaces and prefixing are the same fight. DNS had it (com.example.svc resolves uniquely), gRPC had it (package.Service.Method), Kubernetes had it (namespace/resource-name). MCP is going through it now: the official registry uses reverse-domain namespaces for server names (io.github.user/server-name), and Claude Code prefixes tool names with the server identifier at the client layer to disambiguate. The history rhymes: every system that hosted N independently-maintained components had to invent namespacing once N got large enough.

The non-determinism is the twist. Unlike a microservices client, the LLM may pick the wrong tool from a perfectly-formed registry. Every defensive layer the tool-use article covered — schema validation, idempotency keys, circuit breakers — survives the move to a registry. The registry just changes which tools are visible at a given moment; it does not make the model’s selection deterministic. Service discovery in this world is lossy retrieval, not lookup, and that’s the difference that matters.

The Model Context Protocol, briefly

MCP is the agent-to-tool protocol that Anthropic shipped in late 2024 and that had been adopted by 28% of Fortune 500 companies in their AI stacks by early 2026, along with explicit support from OpenAI, Microsoft, AWS, and Google. What it actually defines is small enough to fit on one page:

A server exposes capabilities (tools, resources, prompts) over JSON-RPC 2.0 transported over stdio, Streamable HTTP, or SSE.
A client connects to one or more servers and presents their tools to the agent.
Discovery is explicit. The client calls tools/list to fetch tool definitions; the server returns a list of Tool objects with name, description, and inputSchema. The client decides what to expose to the model.
Invocation is RPC. tools/call with {name, arguments} returns a CallToolResult. The agent’s tool-use loop wraps the call exactly as in the tool-use article.

What MCP does not do, and what surprises people: it doesn’t pick tools for you, it doesn’t reduce token cost on its own, and it doesn’t fix namespace conflicts. MCP is the registry mechanism; what you do with the registry is the tool-routing layer this article is about. An MCP-using agent with 200 tools loaded into context has the same selection problem as a hand-rolled agent with 200 tools — and the same fixes apply.
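The discovery and invocation calls described above are plain JSON-RPC 2.0 envelopes. A minimal sketch of both requests, plus the one rename needed before an MCP Tool object can go into the Anthropic-style tools array used elsewhere in this article (`inputSchema` becomes `input_schema`). The helper names, the `search` tool, and its arguments are hypothetical; transport and response handling are left out.

```python
from itertools import count
from typing import Optional

_request_ids = count(1)


def jsonrpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope with a fresh integer id."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


def mcp_tool_to_anthropic(tool: dict) -> dict:
    """Rename MCP's inputSchema key to input_schema; keep name and description."""
    return {
        "name": tool["name"],
        "description": tool.get("description", ""),
        "input_schema": tool["inputSchema"],
    }


list_request = jsonrpc_request("tools/list")
call_request = jsonrpc_request(
    "tools/call",
    {"name": "search", "arguments": {"query": "deploy runbook"}},
)

# Hypothetical Tool object of the shape a tools/list result carries.
mcp_tool = {
    "name": "search",
    "description": "Search the docs.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}
anthropic_tool = mcp_tool_to_anthropic(mcp_tool)
```

Nothing here selects a tool. The client still decides which of the converted definitions the model gets to see.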

The protocol’s in-progress 2026 roadmap is moving the conversation in the right direction: a session-management spec with hierarchical routing, enterprise auth, and formal namespacing. None of it eliminates the need for client-side tool selection logic; it just makes that logic cleaner to write.

Pattern 1: deferred loading and the Tool Search Tool

The lowest-friction fix on Anthropic’s API today is the Tool Search Tool, released November 2025. Two pieces:

defer_loading: true on each tool definition. Deferred tools are registered but not loaded into the system prompt. They’re invisible to the model until discovered.
tool_search_tool_regex_20251119 or tool_search_tool_bm25_20251119 as a server-side tool. The model issues a query (a regex pattern or a natural-language string) and the API returns 3–5 most relevant tool_reference blocks. The references auto-expand into full tool definitions for the next turn.

The flow is exactly the discovery-then-invoke pattern from service meshes. The model calls tool_search with "github.*pr", the API returns references to github_list_prs, github_get_pr, github_create_pr_comment, and only those three definitions get spliced into the conversation for the next turn. Total context cost stays bounded; the catalog can be 10,000 tools without inflating any single call.

A worked example with a deferred catalog and the regex tool:

python

import anthropic

client = anthropic.Anthropic()

# Imagine 50 tools from various MCP servers; show 4 representative ones.
# Real catalogs come from MCP `tools/list`; rename each tool's `inputSchema`
# key to `input_schema` on the way in and leave the rest unchanged.
deferred_catalog = [
    {
        "name": "github_create_pr_comment",
        "description": "Post a review comment on a GitHub pull request.",
        "input_schema": {
            "type": "object",
            "properties": {
                "repo": {"type": "string"},
                "pr": {"type": "integer"},
                "body": {"type": "string"},
            },
            "required": ["repo", "pr", "body"],
        },
        "defer_loading": True,
    },
    {
        "name": "slack_post_message",
        "description": "Post a message to a Slack channel.",
        "input_schema": {
            "type": "object",
            "properties": {
                "channel": {"type": "string"},
                "text": {"type": "string"},
            },
            "required": ["channel", "text"],
        },
        "defer_loading": True,
    },
    {
        "name": "sentry_resolve_issue",
        "description": "Mark a Sentry issue as resolved.",
        "input_schema": {
            "type": "object",
            "properties": {"issue_id": {"type": "string"}},
            "required": ["issue_id"],
        },
        "defer_loading": True,
    },
    # ... 47 more deferred tools
    {
        # Cheap, frequently-used tool — keep it in the always-loaded prefix.
        "name": "get_current_time",
        "description": "Get the current UTC time. Use whenever the user asks 'now', 'today'.",
        "input_schema": {"type": "object", "properties": {}},
        # NO defer_loading: this tool stays in the prompt always.
    },
]

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    tools=[
        # The search tool MUST be non-deferred; it's the discovery entry point.
        {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"},
        *deferred_catalog,
    ],
    system=(
        "You have access to tools across GitHub, Slack, Sentry, Grafana, and Splunk. "
        "Use the tool_search_tool_regex to discover the right tool before calling it. "
        "Pattern examples: 'github.*pr', 'slack', '(?i)sentry.*resolve'."
    ),
    messages=[{
        "role": "user",
        "content": "Resolve Sentry issue ABC-123 and post a note to #incident-12.",
    }],
)

Three things the worked example makes visible. First, the system prompt is doing real work: it tells the model the catalog’s shape and gives it regex hints. Without this, the model wastes turns guessing search patterns. Second, the search tool itself must be non-deferred, and more generally at least one tool must be. The API rejects requests where every tool is deferred (it would have nothing to render in the prompt). Third, the always-loaded tools should be your hot path. Anthropic recommends keeping the 3–5 most frequently used tools non-deferred, the same cache-locality logic as keeping working-set data in L1 cache.

The search itself happens as a server_tool_use block (the search is executed by Anthropic’s backend, not your runtime) and returns tool_reference blocks that the API auto-expands. The model then issues a normal tool_use against the discovered tool. Your runtime executes that, returns a tool_result as usual, and the loop continues. The deferred-loading mechanism is invisible to your handler code; only the request construction (the tools array and the system prompt) changes.
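On the handler side, this means the loop only has to act on tool_use blocks and can skip everything the backend already handled. A minimal sketch over plain dicts, assuming the response content has been converted to dictionaries. The sample turn is hand-written for illustration, not captured API output, and `run_client_tools` is a hypothetical helper.

```python
import json


def run_client_tools(content_blocks: list[dict], handlers: dict) -> list[dict]:
    """Execute only client-side tool_use blocks and build tool_result blocks."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            # Text, server_tool_use, and search-result blocks need no local work.
            continue
        handler = handlers.get(block["name"])
        if handler is None:
            results.append({
                "type": "tool_result",
                "tool_use_id": block["id"],
                "content": f"Unknown tool: {block['name']}",
                "is_error": True,
            })
            continue
        output = handler(**block["input"])
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": json.dumps(output),
        })
    return results


# Hand-written stand-in for one assistant turn (assumed shape, illustrative ids).
sample_content = [
    {"type": "server_tool_use", "id": "srvtoolu_demo",
     "name": "tool_search_tool_regex", "input": {}},
    {"type": "tool_use", "id": "toolu_demo",
     "name": "sentry_resolve_issue", "input": {"issue_id": "ABC-123"}},
]
handlers = {"sentry_resolve_issue": lambda issue_id: {"ok": True, "issue_id": issue_id}}
tool_results = run_client_tools(sample_content, handlers)
```

The returned list goes back to the model as the content of the next user message, exactly as it would with a flat tool list.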

Performance notes that matter for production: deferred tool schemas live outside the system-prompt prefix, so they don’t invalidate the prompt cache when you add or remove tools from the catalog. The deferred catalog can rotate without breaking cache hits on the prefix — a big win for evolving tool surfaces. And the strict-mode grammar is built from the full toolset including deferred entries, so strictness still applies to discovered tools without needing a separate compile pass.

Pattern 2: embedding-based tool retrieval

What the Tool Search Tool does server-side, you can do client-side with embeddings — and you should, when the catalog is yours, you want full control over the retrieval logic, or you’re not on Anthropic. The pattern: embed every tool’s name + description at registration time; at each turn, embed the user’s most recent message (or the current task summary), cosine-similarity over the tool catalog, expose the top-k to the model.

This is RAG, applied to the tool registry instead of a document corpus. Everything from the embeddings article and vector-DB article transfers: chunk the catalog at the tool granularity, normalize the embeddings, store them in pgvector or an in-memory NumPy array, take the top-k by cosine similarity, and hand back the survivors. The catalogs are small: even 10k tools at 1536 float32 dimensions come to 10,000 × 1536 × 4 bytes, or about 61 MB.

typescript

import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";
import { generateText, tool, stepCountIs, embed } from "ai";
import { z } from "zod";

// --- the catalog ---
type ToolDef = { name: string; description: string; schema: z.ZodTypeAny; execute: (args: any) => Promise<any> };

const allTools: ToolDef[] = [
  {
    name: "githubCreatePrComment",
    description: "Post a review comment on a GitHub pull request.",
    schema: z.object({ repo: z.string(), pr: z.number(), body: z.string() }),
    execute: async ({ repo, pr, body }) => ({ ok: true, repo, pr }),
  },
  {
    name: "slackPostMessage",
    description: "Post a message to a Slack channel.",
    schema: z.object({ channel: z.string(), text: z.string() }),
    execute: async ({ channel, text }) => ({ ok: true, ts: Date.now() }),
  },
  {
    name: "sentryResolveIssue",
    description: "Mark a Sentry issue as resolved.",
    schema: z.object({ issueId: z.string() }),
    execute: async ({ issueId }) => ({ ok: true, issueId }),
  },
  // ... imagine 47 more, drawn from MCP servers, internal RPC, etc.
];

// --- pre-embed once at startup ---
const toolVectors: { tool: ToolDef; vec: number[] }[] = await Promise.all(
  allTools.map(async (t) => ({
    tool: t,
    // Embed name+description together — both are signal.
    vec: (await embed({
      model: openai.embedding("text-embedding-3-small"),
      value: `${t.name}\n${t.description}`,
    })).embedding,
  })),
);

function cosine(a: number[], b: number[]) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function selectTools(query: string, k = 5): Promise<ToolDef[]> {
  const qv = (await embed({
    model: openai.embedding("text-embedding-3-small"),
    value: query,
  })).embedding;
  return toolVectors
    .map(({ tool, vec }) => ({ tool, score: cosine(qv, vec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ tool }) => tool);
}

// --- per-turn: retrieve k tools, then run the loop with just those ---
export async function ask(userQuery: string) {
  const selected = await selectTools(userQuery, 5);
  const toolBag = Object.fromEntries(
    selected.map((t) => [
      t.name,
      tool({ description: t.description, inputSchema: t.schema as any, execute: t.execute }),
    ]),
  );
  const { text, usage } = await generateText({
    model: anthropic("claude-opus-4-7"),
    tools: toolBag,
    stopWhen: stepCountIs(8),
    prompt: userQuery,
  });
  return { text, tools_loaded: selected.map((t) => t.name), tokens: usage.totalTokens };
}

Notice the asymmetry from Pattern 1. Here, the retrieval happens in your code, not as a synthetic tool the model calls. The model never sees the deferred tools at all — only the top-k. This trades flexibility (the model can’t recover if your retriever misses) for predictability (the loop is shorter; the model doesn’t burn a turn on search). Use this pattern when:

The user’s intent is clear from the first message and a single retrieval pass is enough.
You’re outside the Anthropic API and need a portable solution.
You want to log every retrieval decision for offline eval — the retriever is a pure function, easy to test.

Avoid it when the agent runs many turns and the relevant tool set shifts mid-conversation. There, you either re-retrieve on every turn (cheap if the embeddings are cached but adds latency) or fall back to model-driven search like Pattern 1.

A nuance worth flagging: the embedding model matters less than you’d think for tool retrieval, but reranking helps disproportionately. Tool descriptions are short (50–500 tokens), use a constrained vocabulary, and tend to be semantically dense — they’re a near-ideal retrieval target. A text-embedding-3-small or Voyage-3-lite is usually sufficient. Where you can win is at the top: dropping a cross-encoder reranker on the top-20 to pick the final top-5 reliably improves accuracy by a few absolute points. The cost (one extra round-trip to a small reranker) is trivial compared to the cost of a bad tool selection.
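The top-20-then-top-5 cascade is short to write once the two scorers exist. In the sketch below both scorers are toy word-overlap functions standing in for an embedding similarity and a cross-encoder. They are assumptions made so the example runs without a model, and the tool descriptions are invented.

```python
from typing import Callable

Scorer = Callable[[str, dict], float]


def retrieve_then_rerank(query: str, tools: list[dict], cheap_score: Scorer,
                         rerank_score: Scorer, wide_k: int = 20,
                         final_k: int = 5) -> list[dict]:
    """Take a wide candidate pool with the cheap scorer, then rerank it."""
    candidates = sorted(tools, key=lambda t: cheap_score(query, t), reverse=True)[:wide_k]
    return sorted(candidates, key=lambda t: rerank_score(query, t), reverse=True)[:final_k]


def _words(text: str) -> set[str]:
    return set(text.lower().replace(".", "").split())


def description_overlap(query: str, tool: dict) -> float:
    """Toy stand-in for embedding similarity over descriptions."""
    return len(_words(query) & _words(tool["description"]))


def name_overlap(query: str, tool: dict) -> float:
    """Toy stand-in for a cross-encoder reranker; scores the tool name."""
    return len(_words(query) & set(tool["name"].split("_")))


tools = [
    {"name": "deploy_marketing_site", "description": "Deploy the marketing site to the staging CDN."},
    {"name": "deploy_api_staging", "description": "Roll out the backend API on staging."},
    {"name": "deploy_docs", "description": "Deploy the docs site."},
    {"name": "slack_post_message", "description": "Post a message to a Slack channel."},
]
query = "deploy the api to staging"

cheap_only = retrieve_then_rerank(query, tools, description_overlap, description_overlap, 3, 1)
reranked = retrieve_then_rerank(query, tools, description_overlap, name_overlap, 3, 1)
```

In this toy data the cheap scorer alone puts the marketing deploy first, and the second pass recovers the intended tool because it was still inside the wider pool. That is the whole argument for a generous first-stage k.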

For a worked Anthropic-side reference implementation, see the embeddings-based tool search cookbook — it implements Pattern 1’s tool_reference-block flow with embeddings on the search-tool side, which is the cleanest hybrid.

Pattern 3: namespaces, prefixing, and tool consolidation

Discovery and retrieval shrink the visible catalog. Namespacing fixes the names in the catalog so that disambiguation is possible even before retrieval. These compose; you need both.

The emerging MCP convention is reverse-domain namespacing on server names — io.github.user/server-name, com.company/tool-name — and prefixing tool names with the server identity at the client layer. Claude Code, for example, exposes mcp__github__create_pr rather than the bare create_pr from the GitHub MCP server. This is the same fix Java packages applied to class names in the 1990s and that gRPC applied to service descriptors a generation later.

A defensible client-side namespacing layer:

python

def namespace_tools(server_name: str, mcp_tools: list[dict]) -> list[dict]:
    """Prefix every tool from one MCP server with a deterministic namespace.

    Reverse-domain server names ('com.example.github') flatten to 'com_example_github'
    in the tool name (most providers reject '.' and '/' in tool names).
    """
    prefix = server_name.replace(".", "_").replace("/", "__")
    return [
        {**t, "name": f"{prefix}__{t['name']}"}
        for t in mcp_tools
    ]

# Resolve the conflict between two servers both exposing `search`:
github_tools = namespace_tools("com.github.search", [{"name": "search", "description": "..."}])
# -> [{"name": "com_github_search__search", "description": "..."}]
docs_tools = namespace_tools("com.example.docs", [{"name": "search", "description": "..."}])
# -> [{"name": "com_example_docs__search", "description": "..."}]

Two consequences. First, the model sees the namespace and can use it for routing. A user asking “search the docs for X” lands on com_example_docs__search more reliably than on com_github_search__search because the namespace token is a high-signal feature. Second, traces become parseable. Splitting on the last __ gives you (server, tool) for every call, much as splitting on . gives you (package, class, method) in a JVM stack trace. Split on the last occurrence because a server name containing / also flattens to __ in the code above.
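A sketch of the trace-side parse, assuming bare tool names never contain a double underscore themselves. It splits on the last `__` so that prefixes which already contain one (a flattened `/`, or a Claude Code style `mcp__github__...` name) stay intact. `split_namespaced` is a hypothetical helper.

```python
def split_namespaced(name: str) -> tuple[str, str]:
    """Split 'server__tool' on the LAST double underscore into (server, tool)."""
    server, separator, tool = name.rpartition("__")
    if not separator or not server or not tool:
        raise ValueError(f"not a namespaced tool name: {name!r}")
    return server, tool


docs_call = split_namespaced("com_example_docs__search")
claude_code_call = split_namespaced("mcp__github__create_pr")
```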

Beyond namespacing, tool consolidation is the other leverage point. Anthropic’s writing-tools-for-agents guidance recommends merging fine-grained tools into action-parameterized ones: instead of list_open_prs, list_closed_prs, list_draft_prs, expose list_prs(state: 'open' | 'closed' | 'draft'). The model’s selection task becomes “pick the verb” instead of “pick from three near-duplicates.” This is the JIT context engineering move applied to tool surface: keep the catalog narrow at the entry point, let arguments carry the variability. A catalog of 80 tools consolidates to 30 in most real codebases; that alone can push you back below the 30-tool wall.
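The consolidated tool from that example might look like the following. This is a sketch: the description text, the `repo` parameter, and the three stand-in backend functions are assumptions, since the original names only the tool and its `state` values.

```python
# One action-parameterized definition replaces three near-duplicate tools.
LIST_PRS_TOOL = {
    "name": "list_prs",
    "description": "List pull requests in a repository, filtered by state.",
    "input_schema": {
        "type": "object",
        "properties": {
            "repo": {"type": "string"},
            "state": {"type": "string", "enum": ["open", "closed", "draft"]},
        },
        "required": ["repo", "state"],
    },
}


# Hypothetical stand-ins for the fine-grained implementations being merged.
def _list_open_prs(repo: str) -> list[str]:
    return [f"{repo}#1"]


def _list_closed_prs(repo: str) -> list[str]:
    return [f"{repo}#2"]


def _list_draft_prs(repo: str) -> list[str]:
    return [f"{repo}#3"]


_BY_STATE = {"open": _list_open_prs, "closed": _list_closed_prs, "draft": _list_draft_prs}


def list_prs(repo: str, state: str) -> list[str]:
    """Dispatch on the state argument instead of on the tool name."""
    if state not in _BY_STATE:
        raise ValueError(f"unknown state: {state!r}")
    return _BY_STATE[state](repo)


draft_prs = list_prs("acme/api", "draft")
```

The backends do not have to change. Only the surface the model chooses from gets narrower.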

When to reach for which pattern

A working decision tree:

<10 tools, all used regularly. Don’t bother. Flat tool list, no routing layer. The overhead of search tooling exceeds the win.
10–30 tools, mixed usage. Tool consolidation + namespacing. You probably don’t need dynamic loading yet; you do need to make sure the same noun doesn’t appear twice.
30–100 tools, distinct domains. Anthropic’s deferred loading + Tool Search Tool (Pattern 1) if you’re on Anthropic. Embedding retrieval (Pattern 2) if you’re not. Pick the 3–5 hottest tools as always-loaded; defer the rest.
100+ tools, cross-vendor. All three patterns simultaneously. Namespace at ingestion (Pattern 3), retrieve to a candidate pool of 20–50 (Pattern 2), use the Tool Search Tool inside the model loop to pick the final 3–5 (Pattern 1). The retrieval pipeline mirrors a retrieval/rerank cascade, which is exactly the right framing.
MCP gateway in front of multiple servers. The gateway layer (MCP Gateway & Registry, Stacklok MCP Optimizer, lazy-tool) implements Patterns 2 and 3 between the agent and the underlying servers. The agent sees one unified, namespaced, retrieval-filtered catalog; the gateway is the ops surface. Reach for this when you have >5 MCP servers in production and the per-server cost (auth, version pinning, audit logs) starts to dominate.

The pattern hierarchy isn’t a ladder — it’s a stack. Adding a higher pattern doesn’t replace the lower ones, it amplifies them. A well-namespaced 200-tool catalog with retrieval is much better than a flat 200-tool catalog with retrieval, because the namespace token is a feature the retriever picks up.
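One way to make the tree and the stacking rule concrete is a small function that returns the patterns to layer for a given catalog. The boundary handling (which side 10, 30, and 100 fall on), the cumulative inclusion of consolidation and namespacing in the larger tiers, and the label strings are all interpretive assumptions rather than rules stated above; the gateway case is left out.

```python
def recommend_patterns(tool_count: int, on_anthropic: bool) -> list[str]:
    """Return the patterns to stack, lowest layer first."""
    if tool_count < 10:
        return []  # flat tool list, no routing layer
    patterns = ["consolidate", "namespace"]
    if tool_count <= 30:
        return patterns
    if tool_count <= 100:
        patterns.append("tool_search" if on_anthropic else "embedding_retrieval")
        return patterns
    # Very large catalogs: retrieve a candidate pool, then let the model search it.
    patterns.append("embedding_retrieval")
    if on_anthropic:
        patterns.append("tool_search")
    return patterns


small = recommend_patterns(5, on_anthropic=True)
medium = recommend_patterns(20, on_anthropic=False)
large = recommend_patterns(60, on_anthropic=True)
huge = recommend_patterns(250, on_anthropic=True)
```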

Trade-offs, failure modes, gotchas

Retrieval misses are silent. The single most dangerous failure mode in Patterns 1 and 2 is the retriever returning tools that look relevant but aren’t, while excluding the right one. The user asks “deploy the API to staging” and the retriever returns deploy_marketing_site, deploy_docs, deploy_storybook because of lexical overlap — and the actual deploy_api_staging tool never even enters the model’s consideration set. Mitigations: log every retrieval, evaluate tool-recall@k on a golden set, and prefer a larger k (10–20) with cross-encoder reranking over a smaller k (3–5) with pure embedding similarity. The reranking and RAG evaluation playbooks transfer directly.
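Tool-recall@k on a golden set is a few lines once the retriever is a function. A minimal sketch in which the golden pairs and the canned rankings are invented for illustration; in practice `retrieve` would be your real retriever.

```python
from typing import Callable


def tool_recall_at_k(golden: list[tuple[str, str]],
                     retrieve: Callable[[str, int], list[str]],
                     k: int) -> float:
    """Fraction of queries whose expected tool appears in the top-k results."""
    hits = sum(1 for query, expected in golden if expected in retrieve(query, k))
    return hits / len(golden)


# Hypothetical golden set: (user query, tool that should have been offered).
golden = [
    ("deploy the API to staging", "deploy_api_staging"),
    ("resolve the sentry issue", "sentry_resolve_issue"),
]

# Canned rankings standing in for a real retriever's output.
_RANKINGS = {
    "deploy the API to staging": [
        "deploy_marketing_site", "deploy_docs", "deploy_storybook", "deploy_api_staging",
    ],
    "resolve the sentry issue": [
        "sentry_resolve_issue", "sentry_list_issues", "slack_post_message",
    ],
}


def canned_retrieve(query: str, k: int) -> list[str]:
    return _RANKINGS[query][:k]


recall_at_3 = tool_recall_at_k(golden, canned_retrieve, 3)
recall_at_4 = tool_recall_at_k(golden, canned_retrieve, 4)
```

Here the deploy query is a miss at k=3 and a hit at k=4, which is the silent failure described above showing up as a number you can track.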

The “search-then-call” loop costs an extra turn. Pattern 1’s flow is search → discover → call, which is one more LLM turn than a flat tool list. For latency-sensitive chat, this is a real cost; for batch agents and research workflows, it’s noise. Anthropic’s published guidance is to use deferred loading only when the catalog is large enough that the token savings exceed the extra round-trip, which is roughly the 10-tool / 10k-token threshold.

Embedding drift breaks Pattern 2 silently. If you upgrade your embedding model (or fine-tune it), pre-computed tool vectors become stale relative to query vectors. The tool retrieval will silently degrade — no error, just worse selections. Re-embed the catalog on any model change, and pin the embedding model version in your build. The embedding drift problem is the same one that bites RAG corpora; tool registries are just a smaller, easier-to-rebuild instance.
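One way to make drift loud instead of silent is to store a fingerprint next to the vectors and compare it at startup. The sketch below hashes the pinned model identifier together with the text that was embedded. The function names and the storage of the fingerprint are assumptions; persist it wherever you persist the vectors.

```python
import hashlib
import json

EMBEDDING_MODEL = "text-embedding-3-small"  # pinned; change it deliberately


def catalog_fingerprint(model_id: str, tools: list[dict]) -> str:
    """Hash the embedding model id plus every embedded name and description."""
    payload = json.dumps(
        {
            "model": model_id,
            "tools": sorted([t["name"], t["description"]] for t in tools),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_reembedding(stored_fingerprint: str, model_id: str, tools: list[dict]) -> bool:
    """True when the stored vectors no longer match the model or the catalog."""
    return stored_fingerprint != catalog_fingerprint(model_id, tools)


tools = [
    {"name": "sentry_resolve_issue", "description": "Mark a Sentry issue as resolved."},
    {"name": "slack_post_message", "description": "Post a message to a Slack channel."},
]
stored = catalog_fingerprint(EMBEDDING_MODEL, tools)
```

Because descriptions are part of the hash, an edited description triggers a rebuild as well, which is usually what you want.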

Strict-mode grammar compilation costs at load. Deferred-loading tools still participate in the strict-mode FSM compiled by the provider. The grammar is built once from the full catalog; on Anthropic, defer_loading and strict mode compose without recompilation per turn, but the initial compile is still O(catalog size). For 10k-tool catalogs, expect a one-time cold-start hit. Warm-cache builds are free.

Namespace prefixes can leak through to user-visible logs. If you expose tool names in user-facing traces (“I called com_github_search__search to find your issue”), the namespace clutter erodes UX. Render names in two layers: the canonical, prefixed name for routing and traces, and a clean display name for the user. The same separation gRPC clients use between the wire-format method name and the display label.
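A sketch of the two-layer rendering: keep the canonical name for routing and traces, and derive the user-facing label separately. The curated label table and the fallback rule (drop the prefix, replace underscores with spaces) are assumptions for illustration.

```python
# Hypothetical hand-curated labels; unlisted tools fall back to a derived label.
DISPLAY_NAMES = {"com_github_search__search": "GitHub search"}


def display_name(canonical: str) -> str:
    """Return a clean label for user-facing output; never use it for routing."""
    if canonical in DISPLAY_NAMES:
        return DISPLAY_NAMES[canonical]
    bare_tool = canonical.rpartition("__")[2]  # whole name when there is no prefix
    return bare_tool.replace("_", " ")


curated = display_name("com_github_search__search")
derived = display_name("com_example_docs__search")
```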

The Tool Search Tool isn’t compatible with everything. Anthropic’s docs note that the Tool Search Tool is incompatible with tool-use examples, and the Bedrock InvokeModel/Converse split changes which API surface to use. Read the constraints before committing; in particular, if you rely on few-shot tool examples for accuracy, Pattern 2 may be a better fit than Pattern 1 on Anthropic.

Tool-search rate limits exist. The tool_search_requests field in usage.server_tool_use is metered, and too_many_requests is a documented error code. A loop that misuses search (re-querying repeatedly because the prior result didn’t satisfy the model) can saturate the budget. Cap tool_search invocations per turn in your loop driver — the same way you cap any other tool call.
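Since the search runs on the backend, the loop driver can only count searches after the fact and stop iterating once the budget is spent. A minimal sketch that counts server_tool_use blocks per user turn; the block shape is assumed from the description earlier in this article, and `SearchBudget` is a hypothetical helper.

```python
def count_search_calls(content_blocks: list[dict], search_tool_names: set[str]) -> int:
    """Count backend-executed search calls in one response's content blocks."""
    return sum(
        1
        for block in content_blocks
        if block.get("type") == "server_tool_use" and block.get("name") in search_tool_names
    )


class SearchBudget:
    """Tracks search calls across the responses that make up one user turn."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def record(self, content_blocks: list[dict], search_tool_names: set[str]) -> bool:
        """Add this response's searches; return False once the budget is exceeded."""
        self.used += count_search_calls(content_blocks, search_tool_names)
        return self.used <= self.limit


names = {"tool_search_tool_regex"}
search_block = {"type": "server_tool_use", "name": "tool_search_tool_regex", "input": {}}
budget = SearchBudget(limit=2)
first_ok = budget.record([search_block], names)
second_ok = budget.record([search_block, {"type": "text", "text": "..."}], names)
third_ok = budget.record([search_block], names)
```

When `record` returns False, the driver should end the turn or answer without further tool discovery rather than sending another request.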

Cross-tenant tool leakage is real in shared MCP gateways. If two tenants share an MCP gateway and one tenant’s tools accidentally leak into another tenant’s catalog (because of a router misconfiguration), an agent in tenant A can invoke a tool that mutates tenant B’s data. Treat the MCP layer like an authorization boundary — the same way you’d treat an internal service mesh — and put per-tenant filtering at the gateway, not at the agent.
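Gateway-side filtering can be sketched as a deny-by-default allowlist applied twice: once when building the catalog a tenant sees, and again when a call arrives. The grants table and function names are hypothetical, and a real gateway would load grants from its auth system.

```python
def visible_tools(tenant_id: str, catalog: list[dict], grants: dict[str, set[str]]) -> list[dict]:
    """Return only the tools this tenant was granted; unknown tenants get nothing."""
    allowed = grants.get(tenant_id, set())
    return [tool for tool in catalog if tool["name"] in allowed]


def authorize_call(tenant_id: str, tool_name: str, grants: dict[str, set[str]]) -> None:
    """Re-check at invocation time; never trust that listing was filtered."""
    if tool_name not in grants.get(tenant_id, set()):
        raise PermissionError(f"tenant {tenant_id!r} may not call {tool_name!r}")


catalog = [
    {"name": "tenant_a__update_record"},
    {"name": "tenant_b__update_record"},
]
grants = {"tenant_a": {"tenant_a__update_record"}}

tenant_a_view = visible_tools("tenant_a", catalog, grants)
stranger_view = visible_tools("tenant_c", catalog, grants)
```

The second check matters because a misconfigured router can still forward a call for a tool the agent was never shown.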

Selection accuracy isn’t a single number. Anthropic’s quoted improvements (49% → 74% for Opus 4) are on internal benchmarks across mixed tool sets. Your accuracy on your catalog depends on description quality, namespace hygiene, and the distribution of user queries. Build a tool-selection eval — give it 100 representative user queries, ground-truth the correct tool, measure top-1 and top-3 accuracy across patterns — before deciding which to deploy. The same eval-first discipline from the RAG-evaluation article applies; the test cases are just shorter.