Functions and tools are a key part of Agentic Workflows. They enable LLMs to interact meaningfully with the outside world and automate a broad range of impactful work. Correct and accurate function calling is essential for AI agents that do meaningful things like book appointments, interact with customers, manage billing information, write and execute code, and more.

Dupont’s excellent Transforming Software Interactions with Tool Calling and LLMs provides a great overview of tool calling and its benefits.

In the AI context, a function is just a generic Python/JavaScript function that you’ve written.

Information about this function, such as its name, description, parameters, and types, is passed to the AI model as a “tool”. The model can then indicate that it would like to call this function as part of its reasoning process.
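
To make this concrete, here is a minimal sketch of how a plain Python function might be described as a tool. The JSON-schema-style structure below follows the shape used by OpenAI-compatible chat APIs; the check_weather_in_city function matches the example later in this section, and the field values are illustrative.

```python
def check_weather_in_city(name: str) -> dict:
    """Stubbed weather lookup, used for illustration only."""
    return {"temperature_fahrenheit": "72", "weather": "sunny"}

# The tool description passed to the model alongside the conversation.
# Its schema mirrors the function signature so the model knows which
# arguments it is allowed to supply.
weather_tool = {
    "type": "function",
    "function": {
        "name": "check_weather_in_city",
        "description": "Get the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "The city name"},
            },
            "required": ["name"],
        },
    },
}
```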

From https://louis-dupont.medium.com/transforming-software-interactions-with-tool-calling-and-llms-dc39185247e9

From the OpenAI docs: “Under the hood, functions are injected into the system message in a syntax the model has been trained on.”

When a tool call is selected by the LLM, it is the responsibility of the client code or framework to actually call the function and send the result back to the LLM.

An example tool-calling workflow might look like this:

System: You are a helpful assistant. You have access to the function: check_weather_in_city(name: str) -> dict

User: What’s the weather in San Francisco?

Assistant: __tool_call(check_weather_in_city, {"name": "San Francisco"})

At this point, your code would call the function, append the result to the conversation, and send the whole thing back to the LLM:

Tool: {"temperature_farenheit": "72", "weather": "sunny"}

The LLM can then produce a response based on the results of the tool call:

Assistant: “The weather in San Francisco is 72 degrees and sunny!”

User: “Thanks!”
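
In code, that round trip might look roughly like the sketch below, which uses the OpenAI Python SDK’s chat completions interface and a stubbed check_weather_in_city function; other providers and frameworks follow the same basic pattern.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def check_weather_in_city(name: str) -> dict:
    """Stubbed weather lookup, used for illustration only."""
    return {"temperature_fahrenheit": "72", "weather": "sunny"}

tools = [{
    "type": "function",
    "function": {
        "name": "check_weather_in_city",
        "description": "Get the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in San Francisco?"},
]

# First round trip: the model may answer with a tool call instead of text.
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
message = response.choices[0].message

if message.tool_calls:
    # It is the client code's job to execute the function and report back.
    messages.append(message)
    for tool_call in message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        result = check_weather_in_city(**args)
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })

    # Second round trip: the model now responds using the tool result.
    response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)

print(response.choices[0].message.content)
```

Only after the tool result is sent back does the model produce the final natural-language answer shown above.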


The Challenge

The most useful functions we can give to an LLM are also the most risky.

We can all imagine the value of an AI Database Administrator that constantly tunes and refactors our SQL database, but most teams wouldn’t give an LLM access to run arbitrary SQL statements against a production database (heck, we mostly don’t even let humans do that).

Even with state-of-the-art agentic reasoning and prompt routing, LLMs are not sufficiently reliable to be given access to high-stakes functions without human oversight.

Function Stakes

To better define what is meant by “high stakes”, here are some examples:

Low Stakes

  • Read access to public data (e.g. search Wikipedia, access public APIs and datasets)
  • Communicate with the agent’s author (e.g. an engineer might empower an agent to send them a private Slack message with updates on progress)

Medium Stakes

  • Read access to private data (e.g. read emails, access calendars, query a CRM)
  • Communicate under strict rules (e.g. sending based on a specific sequence of hard-coded email templates)

High Stakes

  • Communicate on my behalf or on behalf of my company (e.g. send emails, post to Slack, publish social/blog content)
  • Write access to private data (e.g. update CRM records, modify feature toggles, update billing information)

The Solution

The high-stakes functions are the ones that are the most valuable and promise the most impact in automating away human workflows. But they are also the ones where “90% accuracy” is not acceptable. Reliability is further undermined by today’s LLMs’ tendency to hallucinate or produce low-quality text that is clearly AI generated.

HumanLayer provides a set of tools to deterministically guarantee human oversight of high-stakes function calls. Even if the LLM makes a mistake or hallucinates, HumanLayer is baked into the tool/function itself, guaranteeing a human in the loop.
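
For example, here is a minimal sketch, assuming HumanLayer’s Python SDK and its decorator-based require_approval pattern; the send_email function, its parameters, and its body are hypothetical.

```python
from humanlayer import HumanLayer

hl = HumanLayer()  # assumes HumanLayer credentials are configured in the environment

# Even if the LLM decides to call this tool, execution blocks until a human
# approves or rejects the call through their configured contact channel.
@hl.require_approval()
def send_email(to: str, subject: str, body: str) -> str:
    """Send an email on behalf of the company (hypothetical high-stakes tool)."""
    ...  # the real email integration would go here
    return f"email sent to {to}"
```

Because the approval step lives inside the function itself, it applies no matter which framework or prompt routed the call there.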

See require_approval and human_as_tool for implementation details.