
An AI agent can only do what its tools let it do.
Which tools you give it, and how those tools get loaded, is most of the design work. Everyone's selling a flavor of this. Stripe and PayPal ship vendor-specific toolkits. OpenAI and Anthropic bake tools into their SDKs. Vercel publishes a tool-definition interface and lets the community fill it in. Different shapes, same job.
A toolkit isn't a list of functions. It's the action space your model picks from. It's the descriptions the model rereads every turn. It's whatever auth pattern the publisher decided on. And it's a quiet bet on the framework you'll end up married to.
The install takes ten minutes. Everything after that is the design choice.
TL;DR
- A toolkit defines the action space: what the agent can read, write, search, and trigger.
- Production teams choose from three patterns: vendor toolkits, agent SDKs with hosted tools, and tool-definition libraries or direct APIs.
- Tool definitions, schemas, and auth helpers live inside the toolkit. The reasoning loop, framework, and governance layer wrap around it.
- MCP moves toolkits from compile-time imports to runtime discovery.
- More tools is not better. Tight, well-described toolkits beat broad ones in production. Every time.
What Is An Agent Toolkit?
A packaged set of tools, their schemas, descriptions, and auth helpers, loaded at runtime. That's it.
Most are scoped to one system (Stripe's covers Stripe APIs) or one framework (OpenAI's plugs into the Agents SDK). The job is to hand the model a clean menu of callable actions, with descriptions specific enough that it picks the right one, and the credentials and request mechanics already handled before the model gets involved.
A few adjacent forms do the same job under different names. Agent SDKs ship hosted tools next to the runtime. Tool-definition libraries publish a shared interface and let others extend it. Direct API surfaces expose server-side tools without bundling them as an importable thing. The design concerns are the same across all of them: action space, prompt surface, authentication, governance. How those interact is what AI agent integration design is actually about.
How Production Agents Get Their Tools
Three patterns. Pick one.
Vendor toolkits are packaged bundles. Install, plug into a host framework, call the tools. Auth and request mechanics live inside the package, which is why purpose-built AI connectors of this shape are the fastest zero-to-working path for a single vendor.
Stripe and PayPal have near-identical playbooks: vendor publishes, multi-framework, auth tucked away.
Agent SDKs with hosted tools are a different animal. The agent runtime and the tool runtime ship as one package, both running on the vendor's infrastructure.
OpenAI also ships AgentKit, which sits on top of the Agents SDK with Agent Builder, ChatKit, Connector Registry, and a beefier Evals.
The third pattern is the lowest-level one. Tool-definition libraries and direct APIs don't ship a packaged toolkit. They expose a way to define or call tools, and everything else (vendor toolkits, agent SDKs, your custom code) builds on top.
What Goes Into a Tool Definition
Three pieces per tool: a name, a structured description (natural-language text plus an input schema), and a function. Name, description, and schema go into the prompt. The function is what actually runs when the model picks the tool.
The schema is usually JSON Schema or a typed equivalent. It declares inputs, types, what's required, any constraints (enums, ranges, patterns). A refund tool's schema names a charge_id string, an optional amount integer, and an optional reason enum. That's how the model decides whether it has enough information to call the tool, and how it formats arguments when it does.
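A minimal sketch of that refund schema, assuming a JSON Schema style definition and a small hand-rolled validator. The tool name `create_refund`, the enum values, and the `validate_args` helper are hypothetical and chosen for illustration; they are not Stripe's actual toolkit definition.

```python
# Hypothetical definition for illustration; not Stripe's actual toolkit schema.
REFUND_TOOL = {
    "name": "create_refund",
    "description": "Refund a charge, fully or partially. Requires a charge ID.",
    "parameters": {
        "type": "object",
        "properties": {
            "charge_id": {"type": "string"},
            "amount": {"type": "integer", "minimum": 1},
            "reason": {
                "type": "string",
                "enum": ["duplicate", "fraudulent", "requested_by_customer"],
            },
        },
        "required": ["charge_id"],
    },
}

_PY_TYPES = {"string": str, "integer": int}


def validate_args(schema: dict, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    problems = [
        f"missing required field: {name}"
        for name in schema.get("required", [])
        if name not in args
    ]
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unknown field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(value, bool) or not isinstance(value, _PY_TYPES[spec["type"]]):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
        elif "minimum" in spec and value < spec["minimum"]:
            problems.append(f"{name} must be >= {spec['minimum']}")
    return problems
```

In practice a framework or a full JSON Schema validator does this check. The point of the sketch is that the same schema serves two readers: the model, which uses it to format arguments, and the runtime, which uses it to reject malformed calls before they reach the API.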
Auth helpers handle credentials so the agent doesn't have to. Publishers wire in API key handling, OAuth flows, service-account access. The model never sees credentials. Every request brokers through the toolkit's auth layer. Model Context Protocol (MCP) standardizes the wire format for the tool exchange itself: how a client lists the available tools and calls them.
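One way to picture the brokering, as a sketch under stated assumptions: `CredentialStore`, `with_credentials`, and `fetch_invoice` are hypothetical names, and the placeholder key stands in for whatever a real secrets manager would return.

```python
from typing import Callable


class CredentialStore:
    """Stand-in for a secrets manager; holds keys the model never sees."""

    def __init__(self, secrets: dict[str, str]):
        self._secrets = secrets

    def get(self, service: str) -> str:
        return self._secrets[service]


def with_credentials(store: CredentialStore, service: str, fn: Callable) -> Callable:
    """Wrap fn so the credential is injected at call time, outside the prompt."""

    def tool(**model_args):
        if "api_key" in model_args:
            raise ValueError("credentials must not come from the model")
        return fn(api_key=store.get(service), **model_args)

    return tool


def fetch_invoice(api_key: str, invoice_id: str) -> dict:
    # Placeholder for a real HTTP request; the key is used but never returned.
    return {"invoice_id": invoice_id, "authenticated": bool(api_key)}


store = CredentialStore({"billing": "sk_test_placeholder"})
get_invoice = with_credentials(store, "billing", fetch_invoice)
result = get_invoice(invoice_id="in_001")
```

The model only ever supplies `invoice_id`. The key enters the call inside the wrapper and never appears in the tool result that goes back into context.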
The Airbyte Agent MCP is a concrete example of the pattern at scale: one OAuth handshake gets the client into the MCP server, then each connected service (Salesforce, HubSpot, Stripe, Slack, Gong) runs its own credential flow in the browser. Credentials live server-side. The agent chat never touches them. Worth understanding before you commit to how you'll define tools at scale.
What a Toolkit Does and Doesn't Cover
Two things. Which tools the agent has, and how they get called.
Not the reasoning loop that picks among them. Not the framework that orchestrates calls. Not the governance layer that enforces permissions. Not the data infrastructure the tools query into. Those all sit around the toolkit, and conflating them with it is the most common toolkit-design mistake we see.
How Do Popular Frameworks Define and Organize Agent Tools?
Pretty much every framework models a tool the same way: a callable function, or an interface to one external system.
Tool Descriptions Drive Tool Selection
This one always surprises teams the first time they measure it.
LlamaIndex says it flat out in its docs: models "rely strongly" on a tool's name and description when selecting it, and "spending time tuning these parameters can result in large changes in how the Large Language Model (LLM) calls these tools." It's a context engineering problem in the end. Where the description sits in the prompt and how precisely it's written controls how reliably the model uses the tool. Vague descriptions get vague selection.
Function Calling and Structured Outputs
Most toolkits lean on the model's tool-calling interface to turn natural language into a tool invocation. Function calling, which OpenAI shipped first and the other providers now match, lets the model emit a JSON object naming the tool and its arguments. Framework dispatches the call, runs the function, hands the result back. Done. The relationship between MCP and traditional APIs is worth a closer look here, because MCP turns that per-framework convention into a protocol.
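The dispatch step can be sketched in a few lines. This assumes a simplified call shape of `{"name": ..., "arguments": {...}}`; real provider payloads differ in detail (some deliver the arguments as a JSON string, for example). `search_accounts`, `TOOLS`, and `dispatch` are hypothetical names.

```python
import json


def search_accounts(query: str) -> dict:
    # Canned data for the sketch; a real tool would call an external system.
    return {"query": query, "matches": ["Acme Corp"]}


TOOLS = {"search_accounts": search_accounts}


def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted tool call, run it, return a JSON result string."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    try:
        return json.dumps({"result": fn(**call["arguments"])})
    except TypeError as exc:
        # Raised when the model passes missing or unexpected arguments.
        return json.dumps({"error": f"bad arguments: {exc}"})


model_output = '{"name": "search_accounts", "arguments": {"query": "acme"}}'
reply = dispatch(model_output)
```

Errors go back to the model as data instead of crashing the loop, which gives it a chance to correct the call on the next turn.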
Structured outputs are stricter. The model's response is constrained at decode time to match a JSON Schema. Goodbye parsing failures. Anthropic's tool_use blocks, OpenAI's response_format with json_schema, and Google's responseSchema are three takes on the same idea, though how strictly each one enforces the schema varies by provider and mode.
Whatever interface the toolkit targets, the toolkit inherits its constraints. Something written for OpenAI function calling lifts to Anthropic's Messages API with minor schema tweaks. Something written against a specific framework's tool decorator usually does not port without an adapter in the middle.
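The "minor schema tweaks" look roughly like this. The sketch assumes the publicly documented shapes at the time of writing: a Chat Completions style entry with `type: "function"` and a nested `function` object on the OpenAI side, and a flat `name` / `description` / `input_schema` object on the Anthropic side. Check current provider docs before relying on it; `openai_tool_to_anthropic` is a hypothetical helper.

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """Convert a Chat Completions style tool entry to the Messages API shape."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Same JSON Schema body; only the wrapping key changes.
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }


openai_tool = {
    "type": "function",
    "function": {
        "name": "create_refund",
        "description": "Refund a charge, fully or partially.",
        "parameters": {
            "type": "object",
            "properties": {"charge_id": {"type": "string"}},
            "required": ["charge_id"],
        },
    },
}

anthropic_tool = openai_tool_to_anthropic(openai_tool)
```

The JSON Schema body carries over untouched, which is why this kind of port is cheap. A tool defined through a framework decorator has no such neutral representation to convert from.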
Framework Patterns Affect Reasoning Quality
Each framework picks its own tool abstraction. That choice decides what kinds of tools fit naturally and how easily a toolkit moves between frameworks.
How Does MCP Change What a Toolkit Can Do?
MCP pushes tool discovery to runtime. Your agent's capability is no longer a property of the imports in your codebase. It's a property of whichever servers are connected at the moment.
Knowing what MCP servers are and how they register tools is the prerequisite for anything else in this section.
Runtime Discovery Changes Capability Management
Tool registration is no longer welded to agent code. Capability is discovered at runtime, not compiled at build. The protocol standardizes how agents find and call tools, and leaves reasoning orchestration, plan execution, and multi-agent coordination to other layers. The A2A vs MCP comparison is the right place to figure out which protocol fits where. Short version: MCP for tools to agents, A2A for agents to agents.
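A sketch of what runtime discovery means on the wire. MCP uses JSON-RPC 2.0 with `tools/list` and `tools/call` methods; everything else here is an assumption for illustration: `fake_mcp_server` is an in-process stand-in for a real server behind stdio or HTTP, `search_contacts` is a made-up tool, and the initialization handshake is omitted.

```python
def fake_mcp_server(request: dict) -> dict:
    """In-process stand-in for an MCP server, for illustration only."""
    if request["method"] == "tools/list":
        result = {"tools": [{
            "name": "search_contacts",
            "description": "Search CRM contacts by email.",
            "inputSchema": {
                "type": "object",
                "properties": {"email": {"type": "string"}},
                "required": ["email"],
            },
        }]}
    elif request["method"] == "tools/call":
        email = request["params"]["arguments"]["email"]
        result = {
            "content": [{"type": "text", "text": f"1 contact found for {email}"}],
            "isError": False,
        }
    else:
        return {"jsonrpc": "2.0", "id": request["id"],
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}


def discover_tools(send) -> dict:
    """Ask the server what it offers; nothing is imported at build time."""
    response = send({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
    return {tool["name"]: tool for tool in response["result"]["tools"]}


def call_tool(send, name: str, arguments: dict) -> str:
    response = send({"jsonrpc": "2.0", "id": 2, "method": "tools/call",
                     "params": {"name": name, "arguments": arguments}})
    return response["result"]["content"][0]["text"]


available = discover_tools(fake_mcp_server)
answer = call_tool(fake_mcp_server, "search_contacts", {"email": "a@example.com"})
```

Swap the server and `available` changes without touching the agent code. That is the whole shift: the tool list is a response, not an import.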
Authentication and Data Quality Sit Outside the Protocol
MCP defines the wire format. It does not mandate authentication, and it says nothing about data freshness or entity resolution. That's left to whoever runs the server, and the gap is real.
Knostic security researchers mapped 1,862 publicly exposed MCP servers. Of the 119 they verified by hand, all 119 lacked authentication. Zero out of 119. An active GitHub debate is still chewing on whether OAuth belongs in the MCP Client to Server handshake for every deployment, or only some.
These gaps belong to a broader AI agent security surface that toolkit designers have to plan around explicitly. There's also a quieter problem: tool definitions burn context-window tokens at registration time, before any reasoning starts, because many MCP clients load every definition upfront. Big toolkit, small remaining context budget.
Why Does Adding More Tools Not Always Mean a More Capable Agent?
Because more tools is often worse. Counterintuitive, but the data is consistent.
Tool Definitions Are Part of the Prompt Surface
Inkeep, drawing on its production deployments, calls out "load every tool just in case" as the number-one context-pollution mistake. Specialized agents with focused responsibilities and clean handoffs win on reliability.
Cursor's A/B testing put a number on it. Loading tool descriptions just-in-time instead of upfront cut total agent tokens by 46.9% on runs that actually called MCP tools. The savings scaled with the number of servers in play. Context window limits are a hard ceiling, and tool schemas eat into the budget before the agent has read the user's first message.
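To make the budget effect concrete, here is a toy comparison of upfront versus just-in-time loading. It does not reproduce Cursor's measurement. The catalog, the tool names, and the four-characters-per-token estimate are all assumptions; a real count needs the model's own tokenizer.

```python
import json


def estimate_tokens(obj) -> int:
    """Very rough heuristic: about four characters per token. Not a tokenizer."""
    return len(json.dumps(obj)) // 4


def make_tool(name: str, description: str) -> dict:
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }


# Hypothetical catalog: tool groups keyed by the system they talk to.
CATALOG = {
    "billing": [make_tool(f"billing_{v}", f"{v.title()} billing records.")
                for v in ("search", "read", "refund")],
    "crm": [make_tool(f"crm_{v}", f"{v.title()} CRM records.")
            for v in ("search", "read", "update")],
    "chat": [make_tool(f"chat_{v}", f"{v.title()} chat messages.")
             for v in ("search", "post")],
}


def upfront(catalog: dict) -> list:
    """Load every definition before the first turn."""
    return [tool for tools in catalog.values() for tool in tools]


def just_in_time(catalog: dict, needed_groups: list) -> list:
    """Load only the groups the current task needs."""
    return [tool for group in needed_groups for tool in catalog[group]]


all_tools = upfront(CATALOG)
billing_only = just_in_time(CATALOG, ["billing"])
saved = estimate_tokens(all_tools) - estimate_tokens(billing_only)
```

Deciding which groups a task needs is the hard part, and it is left out here. The sketch only shows why the answer matters: every definition you skip is budget the agent gets back.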
Notion's team runs 100+ tools across their agent systems and treats context management as a permanent quality constraint, not a one-time fix.
Authentication and Credential Overhead Grows with Toolkit Breadth
Every tool brings its own auth. Broader toolkit, broader operational surface.
A Stripe call needs an API key. A Salesforce call needs OAuth with refresh tokens. A Slack call needs bot tokens with scope-specific permissions. A custom database call needs a service account. Single-vendor surfaces (Stripe's toolkit, OpenAI's hosted tools) handle this once and hide it from the agent code. Multi-vendor toolkits multiply the credential types, the refresh schedules, and the permission scopes you have to keep correct.
Token expiration. Scope drift. Rate-limit budgets. All real operational work. It's one of the under-discussed reasons single-vendor toolkits run smoother at scale than multi-vendor stacks.
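A sketch of the bookkeeping that breadth creates, assuming a simple in-memory record per service. The `Credential` dataclass, the five-minute refresh margin, and the scope names are illustrative assumptions, not any vendor's actual model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Credential:
    service: str
    kind: str  # e.g. "api_key", "oauth", "bot_token", "service_account"
    expires_at: Optional[datetime] = None  # None means no scheduled expiry
    scopes: tuple = ()


def needs_refresh(cred: Credential, now: datetime,
                  margin: timedelta = timedelta(minutes=5)) -> bool:
    """True when the credential expires within the safety margin."""
    return cred.expires_at is not None and cred.expires_at - now <= margin


def missing_scopes(cred: Credential, required: set) -> set:
    """Scopes a tool needs that this credential does not grant."""
    return required - set(cred.scopes)


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)  # fixed for the example
stripe = Credential("stripe", "api_key")
salesforce = Credential("salesforce", "oauth",
                        expires_at=now + timedelta(minutes=3),
                        scopes=("api", "refresh_token"))
slack = Credential("slack", "bot_token", scopes=("chat:write",))

due = [c.service for c in (stripe, salesforce, slack) if needs_refresh(c, now)]
gap = missing_scopes(slack, {"chat:write", "channels:read"})
```

Three services already means three credential kinds and two different failure checks. Each new vendor adds a row, and usually a new rule.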
Should You Use a Published Agent Toolkit or Build Your Own?
Use the published one. Most of the time.
The ecosystem already covers the obvious SaaS APIs, the hosted built-ins, and a handful of lightweight framework interfaces. For most teams, a published toolkit saves weeks of integration work and inherits the publisher's maintenance commitment as APIs change. Stripe ships toolkit updates the same day the API moves. If you wrote your own Stripe integration, that maintenance is on you. Forever.
Three situations make custom worth the trouble:
- Internal systems with no public API. Nobody could have published a toolkit because there's no surface to wrap.
- Workflows that compose several APIs into one semantic action. "Close the ticket and refund the customer" as a single callable tool, not two tools and a glue layer in the model's prompt.
- Agents on strict latency or cost budgets that need pre-aggregated views. A published toolkit's per-call structure adds round trips you can't afford.
There's a middle path that ends up being the right answer more often than either extreme: wrap a published toolkit with a thin custom layer. The published part keeps up with the vendor. The custom layer adds composed actions, business rules, and result post-processing. Best of both, less maintenance than either pure approach.
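The thin-layer approach might look like the sketch below. `PublishedToolkit` is a stand-in for whatever vendor packages you actually install, and `close_and_refund`, the refund ceiling, and the returned IDs are hypothetical.

```python
class PublishedToolkit:
    """Stand-in for vendor-maintained tools; real ones would make API calls."""

    def __init__(self):
        self.calls = []

    def create_refund(self, charge_id: str, amount: int) -> dict:
        self.calls.append(("create_refund", charge_id, amount))
        return {"id": f"re_{charge_id}", "amount": amount}

    def close_ticket(self, ticket_id: str) -> dict:
        self.calls.append(("close_ticket", ticket_id))
        return {"ticket_id": ticket_id, "status": "closed"}


def close_and_refund(toolkit: PublishedToolkit, ticket_id: str, charge_id: str,
                     amount: int, max_refund: int = 10_000) -> dict:
    """One semantic action for the model, with a business rule in front."""
    if amount > max_refund:
        return {"ok": False, "error": "refund exceeds limit; escalate to a human"}
    # Refund first, so a failed refund never leaves a closed ticket behind.
    refund = toolkit.create_refund(charge_id, amount)
    ticket = toolkit.close_ticket(ticket_id)
    # Post-process: hand the model only the fields it needs.
    return {"ok": True, "refund_id": refund["id"], "ticket_status": ticket["status"]}


toolkit = PublishedToolkit()
outcome = close_and_refund(toolkit, "T-42", "ch_123", 2_500)
blocked = close_and_refund(toolkit, "T-43", "ch_456", 50_000)
```

The model sees one tool with one description. The vendor calls underneath stay on the published package, so API changes land there instead of in your code.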
What Is the Best Way to Design an Agent Toolkit for Production?
Four decisions. None of them about which framework you picked.
How broad is the action space? How good are the descriptions? How does auth get handled? How does governance get enforced at runtime?
Tight beats broad. Specific beats generic. Just-in-time beats upfront where the framework supports it. Enterprise AI governance belongs in a runtime layer that wraps the toolkit, not as a property of the toolkit itself.
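A minimal sketch of governance as a wrapping layer, assuming a static role-to-tool allowlist. The roles, tool names, and `governed` helper are hypothetical; a production policy engine would evaluate richer rules, but the shape is the same: a deterministic check that runs before any tool does.

```python
class PolicyError(PermissionError):
    """Raised when a role tries to call a tool it is not allowed to use."""


# Hypothetical policy: which roles may call which tools.
POLICY = {
    "support_agent": {"search_tickets", "close_ticket"},
    "billing_agent": {"search_tickets", "create_refund"},
}

# Hypothetical toolkit: plain callables keyed by tool name.
TOOLKIT = {
    "search_tickets": lambda query: [f"ticket matching {query}"],
    "close_ticket": lambda ticket_id: {"ticket_id": ticket_id, "status": "closed"},
    "create_refund": lambda charge_id: {"charge_id": charge_id, "refunded": True},
}


def governed(tools: dict, policy: dict, role: str):
    """Return a call function that enforces the policy before dispatching."""
    allowed = policy.get(role, set())

    def call(name: str, **args):
        if name not in allowed:
            raise PolicyError(f"{role} may not call {name}")
        return tools[name](**args)

    return call


support = governed(TOOLKIT, POLICY, "support_agent")
closed = support("close_ticket", ticket_id="T-42")
```

The toolkit itself knows nothing about roles. Change the policy and every agent picks it up on the next call, with no toolkit config to edit.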
The other half of this conversation, the part most teams skip, is what the toolkit queries against. Tool definitions pollute context, sure. So does the data the tools fetch when discovery happens against raw APIs at runtime. Exploratory calls multiply, token counts climb, and the model burns budget figuring out what to ask before it can actually answer. Discovery against an indexed context layer collapses the same work to one or two targeted calls.
Airbyte Agents is built on that split. Context Store is a live, searchable index of your business data, populated by Airbyte's 600+ replication connectors. Another 50+ agent connectors handle live fetch and write, with managed auth on each one. Teams access it through the web app UI, Agent MCP, or Agent SDK. The SDK plugs in beneath whatever framework you already run, including LangChain, OpenAI Agents SDK, and Claude Agent SDK.
Get a demo to see how Airbyte gives your agents a unified, permission-aware context layer.
Frequently Asked Questions
What is the difference between agent tools and an agent toolkit?
A tool is a single callable function. One action. Search Salesforce, read a Stripe invoice. A toolkit groups many of those into one importable package, so the agent registers them as a batch instead of you wiring each one up by hand.
How many tools should an agent have?
Fewer than you think. Specialized agents with a tight tool set beat one agent with a buffet, almost without exception. The right number depends on how good the descriptions are and how much context-window budget you've got left after the system prompt.
Does MCP replace an agent toolkit?
No. MCP standardizes how agents discover and call tools. Toolkit architecture still has to solve tool filtering and auth. The protocol moves a problem; it doesn't erase it.
Why do production agents fail early in execution?
Most failures show up on the first turn. Tool selection errors, malformed arguments, auth failures, return formats the model didn't expect. The first turn assembles the context the rest of the conversation depends on, so a failure there cascades into everything downstream. AI agent observability tooling exists to catch these before they spread.
Why does governance matter in an agent toolkit?
Governance enforces what the model is allowed to do at runtime through deterministic policy evaluation. In multi-agent setups, this ties directly into AI agent orchestration design. Policies belong at the orchestration layer, not buried in individual toolkit configs where they'll drift.
