LLM Tool Calling
Understanding tool calling and human oversight in AI workflows
Functions and tools are a key part of Agentic Workflows. They enable LLMs to interact meaningfully with the outside world and automate broad swaths of impactful work. Correct, accurate function calling is essential for AI agents that do meaningful things: booking appointments, interacting with customers, managing billing information, writing and executing code, and more.
Louis Dupont’s excellent Transforming Software Interactions with Tool Calling and LLMs provides a great overview of tool calling and its benefits.
In the AI context, a function is just a regular Python or JavaScript function that you’ve written.
Information about this function (name, description, parameters, and types) is passed to the AI model as a “tool”. The AI model can then indicate that it would like to call this function as part of its reasoning process.
From the OpenAI docs
Under the hood, functions are injected into the system message in a syntax the model has been trained on
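For concreteness, here is a minimal sketch of how a function and its tool schema are passed to the model using the OpenAI Python SDK. The check_weather_in_city function is the illustrative example used on this page, not part of any real weather API.

```python
import json
from openai import OpenAI

client = OpenAI()

# A plain Python function we want the model to be able to call.
def check_weather_in_city(name: str) -> dict:
    # Illustrative stub -- a real implementation would call a weather service.
    return {"temperature_fahrenheit": "72", "weather": "sunny"}

# Metadata about the function, passed to the model as a "tool".
tools = [
    {
        "type": "function",
        "function": {
            "name": "check_weather_in_city",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "description": "City name"},
                },
                "required": ["name"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
    tools=tools,
)
# Instead of plain text, the model may respond with a tool call.
print(response.choices[0].message.tool_calls)
```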
When the LLM selects a tool call, it is the responsibility of the client code or framework to actually call the function and send the result back to the LLM.
An example tool-calling workflow might look like:
System: You are a helpful assistant. You have access to the function:
check_weather_in_city(name: str) -> dict
User: What’s the weather in San Francisco?
Assistant:
__tool_call(check_weather_in_city, {"name": "San Francisco"})
At this point, your code calls the function, appends the result to the conversation, and sends the whole conversation back to the LLM:
Tool:
{"temperature_farenheit": "72", "weather": "sunny"}
The LLM can then produce a response based on the results of the tool call:
Assistant: “The weather in San Francisco is 72 degrees and sunny!”
User: “Thanks!”
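In code, that round trip might look like the following sketch, which continues the illustrative example above (the client, tools, and check_weather_in_city names are assumptions from that snippet, not a prescribed API):

```python
message = response.choices[0].message

if message.tool_calls:
    tool_call = message.tool_calls[0]

    # 1. The LLM selected a tool call -- our code actually runs the function.
    args = json.loads(tool_call.function.arguments)
    result = check_weather_in_city(**args)

    # 2. Append the assistant's tool call and the tool result to the conversation.
    messages = [
        {"role": "user", "content": "What's the weather in San Francisco?"},
        message,
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        },
    ]

    # 3. Send the whole conversation back so the LLM can produce a final answer.
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
    )
    print(final.choices[0].message.content)
    # e.g. "The weather in San Francisco is 72 degrees and sunny!"
```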
Learn more:
- Function Calling - OpenAI Docs
- Louis Dupont’s excellent Transforming Software Interactions with Tool Calling and LLMs
- Leverage OpenAI Tool Calling: Building a Reliable AI Agent from Scratch
- How Does Tool Calling Work in Langchain
- Tool Calling in Crew AI
- The Berkeley Function Calling Leaderboard
The Challenge
The most useful functions we can give to an LLM are also the most risky.
We can all imagine the value of an AI Database Administrator that constantly tunes and refactors our SQL database, but most teams wouldn’t give an LLM access to run arbitrary SQL statements against a production database (heck, we mostly don’t even let humans do that).
Even with state-of-the-art agentic reasoning and prompt routing, LLMs are not sufficiently reliable to be given access to high-stakes functions without human oversight.
Function Stakes
To better define what “high stakes” means, here are some examples:
Low Stakes
- Read Access to Public Data (e.g. search Wikipedia, access public APIs and datasets)
- Communicate with agent author (e.g. an engineer might empower an agent to send them a private Slack message with updates on progress)
Medium Stakes
- Read Access to Private Data (e.g. read emails, access calendars, query a CRM)
- Communicate with strict rules (e.g. sending based on a specific sequence of hard-coded email templates)
High Stakes
- Communicate on my Behalf or on Behalf of my Company (e.g. send emails, post to Slack, publish social/blog content)
- Write Access to Private Data (e.g. update CRM records, modify feature toggles, update billing information)
The Solution
The high-stakes functions are the most valuable and promise the most impact in automating human workflows, but they are also the ones where “90% accuracy” is not acceptable. Reliability is further undermined by today’s LLMs’ tendency to hallucinate or produce low-quality text that is clearly AI-generated.
HumanLayer provides a set of tools to deterministically guarantee human oversight of high stakes function calls. Even if the LLM makes a mistake or hallucinates, HumanLayer is baked into the tool/function itself, guaranteeing a human in the loop.
See require_approval and human_as_tool for implementation details.
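As a sketch of the shape of that API, a high-stakes function can be wrapped so that any call the LLM makes to it blocks until a human approves it. The decorator usage below is based on HumanLayer’s Python SDK; treat the exact signature and the update_billing_info function as illustrative.

```python
from humanlayer import HumanLayer

hl = HumanLayer()

# Even if the LLM decides to call this function, the call waits for a human
# to approve or reject it -- oversight is enforced in the tool itself,
# not in the prompt.
@hl.require_approval()
def update_billing_info(customer_id: str, new_plan: str) -> str:
    # Illustrative stub for a high-stakes write operation.
    return f"Moved customer {customer_id} to plan {new_plan}"
```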