
Photo: Anton Savinov on Unsplash. Function calling is the API by which the model picks which tool to grab — and your code does the actual work.
If an LLM only emits text, it is a writing assistant. If an LLM can decide to call a function in your code, it is the brain of an agent. Function calling — sometimes called tool use — is the primitive that crosses that gap, and it is the mechanism every production AI agent in 2026 is built on.
If you have heard about Model Context Protocol (MCP) and wondered what it is standardizing, the answer is: it is standardizing the function-calling pattern across providers and across the boundaries between systems. Understanding function calling first makes MCP make sense second.
The contract
Function calling is a small contract between your code and the LLM, expressed in JSON. Your code declares what functions exist; the model decides when to call them and with what arguments; your code runs the function and feeds the result back; the model uses the result to keep going.
A tool definition looks roughly the same at every provider:
{
  "name": "get_pod_status",
  "description": "Return the current status of a Kubernetes pod. Use when the user asks about the health of a specific pod.",
  "input_schema": {
    "type": "object",
    "properties": {
      "namespace": { "type": "string" },
      "pod_name": { "type": "string" }
    },
    "required": ["namespace", "pod_name"]
  }
}
Three fields, all load-bearing. The name is what the model uses to refer to the tool in its response. The description is what the model reads to decide when to call it — this is the field most teams underweight; a bad description is the single largest source of “the model never picks the right tool” bugs. The input_schema is a JSON Schema fragment that constrains what the model is allowed to put in the arguments.
You ship a list of these tool definitions with every request. The model sees them, decides whether to answer in text or to invoke one (or several), and returns one of two things:
- A normal text response, in which case your application is done.
- A tool-use block — structured output naming the tool and the arguments. Your application runs the function, captures the return value, and sends a follow-up request that includes the original conversation, the model’s tool-use block, and a tool-result block carrying the function’s output. The model then continues — possibly with another tool call, possibly with a final text answer.
That request → tool-use → execute → tool-result → response loop is the entirety of function calling. Everything else (agents, MCP servers, multi-step reasoning chains) is a variation on the same loop.
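In code, the loop is a screenful. A minimal sketch using the Anthropic Python SDK (other providers differ in field names, not in structure); the model ID is illustrative and get_pod_status is a stub standing in for your real function:

import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TOOLS = [{
    "name": "get_pod_status",
    "description": "Return the current status of a Kubernetes pod. "
                   "Use when the user asks about the health of a specific pod.",
    "input_schema": {
        "type": "object",
        "properties": {
            "namespace": {"type": "string"},
            "pod_name": {"type": "string"},
        },
        "required": ["namespace", "pod_name"],
    },
}]

def get_pod_status(namespace: str, pod_name: str) -> dict:
    return {"phase": "Running", "restarts": 0}  # stub: call your real backend here

messages = [{"role": "user", "content": "Is pod payments-api-7c9f in prod healthy?"}]

while True:  # uncapped here; "Where it breaks" below adds the iteration cap
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model ID
        max_tokens=1024,
        tools=TOOLS,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # a plain text answer: the loop is done

    # Echo the model's turn back, then attach one tool_result per tool_use block.
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": [
        {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": json.dumps(get_pod_status(**block.input)),
        }
        for block in response.content if block.type == "tool_use"
    ]})

print(next(block.text for block in response.content if block.type == "text"))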
Why JSON-schema-as-contract is the whole point
The reason this works at all is that the model is constrained to emit valid JSON matching the schema you supplied. Modern providers enforce this at decoding time (OpenAI's strict structured-outputs mode is the explicit version), so the model cannot return malformed arguments. Required fields are required; types are types; enums are enums.
That guarantee is the load-bearing one. Without it, you would be back in the world of regex-parsing free-form text to extract structured arguments — the exact failure mode the early “tool-use via prompt engineering” patterns suffered from in 2023. With it, your application can treat the tool-use block exactly like any other strongly-typed input: validate at the boundary, run the function, return the result.
The corollary: your schema is your security boundary. A tool definition that accepts "path": {"type": "string"} and passes the value to os.system(f"cat {path}") is a remote-code-execution vulnerability waiting to happen. Constrain enums where you can, validate against an allow-list before execution, and treat every value the model gives you as untrusted input — because it is. The model has been guided by an attacker’s instructions before, and it will be again.
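Concretely: tighten the schema, then re-check at the execution boundary anyway. A sketch with a hypothetical read_runbook tool, where the enum closes the input space and the jsonschema library re-validates what the provider already promised:

from jsonschema import validate

READ_RUNBOOK_SCHEMA = {
    "type": "object",
    "properties": {
        # An enum is a far smaller attack surface than a free-form string.
        "section": {"type": "string", "enum": ["deploys", "rollbacks", "oncall"]},
    },
    "required": ["section"],
    "additionalProperties": False,
}

ALLOWED_SECTIONS = {"deploys", "rollbacks", "oncall"}  # allow-list owned by your code

def run_read_runbook(args: dict) -> str:
    validate(instance=args, schema=READ_RUNBOOK_SCHEMA)  # defense in depth
    if args["section"] not in ALLOWED_SECTIONS:          # belt and suspenders
        raise PermissionError(f"section not allowed: {args['section']!r}")
    # The path is now drawn from a closed set, never from raw model output.
    with open(f"runbooks/{args['section']}.md") as f:
        return f.read()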
Parallel and forced tool calls
Two features of the modern function-calling APIs cover most of the production patterns:
Parallel tool use. A single model response can include multiple tool-use blocks. The model decides “to answer this, I need to call get_pod_status for these three pods at once,” emits all three tool-use blocks in a single turn, and your code dispatches them in parallel. The savings are real — a sequential chain of three tool calls is three round-trips with the model; a parallel chain is one. Most providers default to parallel-on; on Anthropic, disable_parallel_tool_use=true is the opt-out, and the Claude 4 models are noticeably more aggressive about emitting parallel calls than Claude 3.7 was. If you find your agent stubbornly serializing calls that obviously could fan out, the model version is the first thing to check.
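Dispatching the fan-out is ordinary concurrent code. A sketch reusing the response shape from the loop above; the registry mapping (tool name to function) is this sketch's assumption, not part of any SDK:

import json
from concurrent.futures import ThreadPoolExecutor

def dispatch_parallel(response, registry: dict) -> list[dict]:
    # Run every tool_use block from one model turn concurrently and
    # return the tool_result blocks for the single follow-up request.
    calls = [b for b in response.content if b.type == "tool_use"]
    with ThreadPoolExecutor(max_workers=max(len(calls), 1)) as pool:
        futures = [pool.submit(registry[c.name], **c.input) for c in calls]
        return [
            {
                "type": "tool_result",
                "tool_use_id": call.id,
                "content": json.dumps(future.result()),
            }
            for call, future in zip(calls, futures)
        ]

# e.g. dispatch_parallel(response, {"get_pod_status": get_pod_status})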
Forced tool use. Sometimes you want to guarantee the model calls a tool, not just suggest it. The tool_choice parameter lets you force one — "tool_choice": {"type": "tool", "name": "get_pod_status"} requires the model to call exactly that tool on its next turn. Useful for the first turn of an agent when you want to ensure the agent retrieves context before reasoning, or for any UX where “answer in text” is not an acceptable next step.
The dual, "tool_choice": {"type": "none"}, forces the model not to call any tool, even when its instinct would be to. Useful when the application has decided to take over and only wants the model to summarize.
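For reference, the tool_choice shapes in Anthropic's request format; OpenAI spells the same ideas slightly differently:

# Default: the model decides whether to call a tool at all.
tool_choice = {"type": "auto"}

# Force one specific tool on the next turn.
tool_choice = {"type": "tool", "name": "get_pod_status"}

# Force some tool call, but let the model pick which tool.
tool_choice = {"type": "any"}

# Forbid tool calls entirely: text only, e.g. a summarization turn.
tool_choice = {"type": "none"}

# client.messages.create(..., tools=TOOLS, tool_choice=tool_choice)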
Where it breaks
The failure modes cluster around the four boundaries the loop crosses.
- The model picks the wrong tool, or no tool when one was needed. Almost always a description problem. The model reads tool descriptions like a triage list; vague or overlapping descriptions produce wrong picks. Rewrite descriptions to be operationally distinctive: what task this tool serves, with one or two example invocations.
- The model invents an argument that violates the schema. Modern providers prevent this at decoding time, but older clients and self-hosted runtimes do not. Always validate the arguments against the schema in your application code, not just at the provider. Defense in depth.
- A tool succeeds but returns a payload too large for the next call’s context window. A list_pods tool that returns 4,000 pods will blow the model’s window. Wrap every tool with a result-size cap; truncate or paginate at the boundary.
- A tool fails and the model loops. The model gets a tool-result with {"error": "timeout"}, decides to retry, gets the same error, retries again, indefinitely. Cap the number of tool-call iterations per turn (a typical bound is 10) and, on hitting the cap, return a structured failure to the user instead of letting the model dig itself out. Both caps are sketched below.
What function calling is not
- Not a guarantee that the function ran. The model emits the intent to call the function; your code decides whether to execute it. Inserting a policy check between intent and execution is the entire point of the MCP gateway pattern — the model proposes, the gateway decides.
- Not the same as code execution. Some providers ship a sandboxed code interpreter — Anthropic’s code execution tool, OpenAI’s Code Interpreter. Those are one specific tool the provider has implemented for you. Generic function calling is the broader mechanism that lets you wire in tools the provider has never seen.
- Not protocol-level. Function calling is provider-specific. Each provider’s request shape, tool-result shape, and edge cases differ. MCP is the layer above function calling that makes a tool defined once usable across any client and any provider — which is why most teams building serious agent systems land on MCP rather than provider-native tools.
- Not free. Tool definitions count against the input token budget on every call. A 50-tool agent is paying to re-read 50 tool definitions on every turn. Prompt caching helps; pruning the tool list to the per-task minimum helps more.
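Pruning can be as blunt as a per-task registry; the task labels here are hypothetical:

# Advertise only the tools the current task can plausibly need.
TOOLSETS = {
    "diagnose": {"get_pod_status", "list_pods", "fetch_runbook_section"},
    "billing": {"look_up_order", "get_invoice"},
}

def tools_for(task: str, all_tools: list[dict]) -> list[dict]:
    wanted = TOOLSETS.get(task, set())
    return [t for t in all_tools if t["name"] in wanted]

# A 50-tool agent that advertises 3 relevant definitions pays for 3, not 50.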
Where to start
Pick one read-only operation in your system — “look up an order by ID,” “get the current value of a metric,” “fetch a runbook section.” Define it as a single tool with a tight JSON schema and a clear description. Wire your LLM call to advertise that one tool. Ask the model a question that should provoke the tool call.
Watch the request and the response. The model’s tool-use block will tell you whether the description was clear enough, whether the schema was right, and whether the argument it produced is what you expected. Run it through a half-dozen ambiguous queries; refine the description until the model picks the tool when it should and skips it when it should not.
That single read-only tool is the seed of every agent you will ever build. Adding more tools is mechanical. Adding write tools — anything that changes state — is where the discipline of bounded autonomy and the MCP gateway pattern start to matter, and that is the next concept worth picking up.