The ability of large language models (LLMs), such as those offered by OpenAI or Anthropic, to call functions is revolutionary. It allows LLMs to operate not just as standalone chatbots but in conjunction with various tools, which act as the hands directed by the brain. This integration vastly enhances the capabilities of LLMs, turning them into dynamic, versatile systems that can execute complex tasks, manipulate data, and interact with external environments.
However, leveraging these capabilities through major online platforms often comes with a steep price tag. Each request sent out can feel like another cha-ching on the cash register, which might be sustainable for large-scale enterprises but is a significant barrier for individual developers and small teams.
Given the cost implications of cloud-based solutions, setting up local environments capable of function calling becomes a tempting alternative. Local deployment not only cuts down ongoing operational costs but also reduces latency and increases data privacy. This post explores the practical steps and considerations involved in harnessing the power of local LLMs for function calling.
The sections below walk through that setup step by step.
Exploring Compatible Models
Not all models support function calling out of the box. A good starting point is to explore models available on platforms like Hugging Face, which hosts a variety of function-calling models, such as Llama-3-based or Phi-3-based variants. These can be integrated into local setups, providing a viable alternative to cloud-based services. I chose meetkai/functionary-7b-v2-GGUF from MeetKai on Hugging Face.
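If you prefer to browse candidates programmatically rather than through the website, the huggingface_hub client (pip install huggingface_hub) can list matching repositories. A minimal sketch, where the search query is just an illustrative guess:

from huggingface_hub import HfApi

api = HfApi()
# List GGUF repositories in the functionary family; adjust the query for other model families
for model in api.list_models(search="functionary gguf", limit=10):
    print(model.id)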
Server Options and Setup
For deployment, the options include TGI (Text Generation Inference), vLLM, and Llama.cpp. Each has its merits, but compatibility with your operating system and hardware is crucial. For instance, Ubuntu 24.04 currently lacks support for the Nvidia Container Toolkit, rendering Docker-based solutions impractical.
After struggling with dependencies while installing TGI and vLLM locally, Llama.cpp emerged as the most compatible option. Despite some initial hurdles, it integrated smoothly with Ubuntu 24.04, making it a worthy choice for anyone facing similar issues. TGI and vLLM would likely work fine on other operating systems. Here are the steps.
Install llama-cpp-python
Create and activate a Python virtual environment:
python3 -m venv .venv
source .venv/bin/activate
Install llama-cpp-python
pip install llama-cpp-python
# for the server version, also install the following
pip install 'llama-cpp-python[server]'
If you have an Nvidia GPU and have installed CUDA, you can also use the GPU to accelerate model inference. For more information, see llama-cpp-python on GitHub.
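Assuming llama-cpp-python was built with CUDA support, layers can be offloaded to the GPU. A minimal sketch using the Python API (the standalone server exposes an equivalent --n_gpu_layers flag); the model path refers to the directory set up in the next section:

from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU (requires a CUDA-enabled build)
llm = Llama(
    model_path="/data/models/function-calling/functionary-7b-v2.q8_0.gguf",
    n_gpu_layers=-1,
)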
Download model and tokenizer
You can use the Hugging Face CLI to download the entire repository, or download just the model and tokenizer files you need directly from the Hugging Face website (the meetkai/functionary-7b-v2-GGUF repository).
For the test below, I downloaded the following files:
- functionary-7b-v2.q8_0.gguf
- added_tokens.json
- special_tokens_map.json
- tokenizer.json
- tokenizer.model
- tokenizer_config.json
Place these in a directory such as /data/models/function-calling. You may need to adjust permissions if any issues arise later.
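If you prefer to script the download rather than use the website or CLI, huggingface_hub's hf_hub_download can fetch individual files. A minimal sketch for the model weights (the tokenizer files follow the same pattern):

from huggingface_hub import hf_hub_download

# Download the quantized weights into the local model directory
hf_hub_download(
    repo_id="meetkai/functionary-7b-v2-GGUF",
    filename="functionary-7b-v2.q8_0.gguf",
    local_dir="/data/models/function-calling",
)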
Start Server and Test
Once everything is installed, the next step is to set up the server and test function calling in code:
- Set up the server
python -m llama_cpp.server --model /data/models/function-calling/functionary-7b-v2.q8_0.gguf --chat_format functionary-v2 --hf_pretrained_model_name_or_path /data/models/function-calling/
- Install code dependencies
pip install openai
pip install instructor
- Run the following code
import openai
import json
import instructor

client = openai.OpenAI(
    api_key="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",  # can be anything
    base_url="http://localhost:8000/v1",
)
client = instructor.patch(client=client)

def get_current_weather(location, unit="fahrenheit"):
    """Get the current weather in a given location"""
    if "tokyo" in location.lower():
        return json.dumps({"location": "Tokyo", "temperature": "10", "unit": "celsius"})
    elif "san francisco" in location.lower():
        return json.dumps({"location": "San Francisco", "temperature": "72", "unit": "fahrenheit"})
    elif "paris" in location.lower():
        return json.dumps({"location": "Paris", "temperature": "22", "unit": "celsius"})
    else:
        return json.dumps({"location": location, "temperature": "unknown"})

def run_conversation():
    messages = [{"role": "user", "content": "What's the weather like in San Francisco, Tokyo, and Paris?"}]
    # follow this page for OpenAI function calling instructions:
    # https://platform.openai.com/docs/guides/function-calling
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather in a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                    },
                    "required": ["location"],
                },
            },
        }
    ]
    response = client.chat.completions.create(
        model="functionary",  # the model name doesn't really matter in this case
        messages=messages,
        tools=tools,
        tool_choice="auto",  # auto is the default, but we'll be explicit
    )
    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls
    # Step 2: check if the model wanted to call a function
    if tool_calls:
        # Step 3: call the function
        # Note: the JSON response may not always be valid; be sure to handle errors
        available_functions = {
            "get_current_weather": get_current_weather,
        }  # only one function in this example, but you can have multiple
        # Step 4: send the info for each function call and function response to the model
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_to_call = available_functions[function_name]  # look up the function by name
            function_args = json.loads(tool_call.function.arguments)
            # call the function to get the result
            function_response = function_to_call(
                location=function_args.get("location"),
                unit=function_args.get("unit"),
            )
            messages.append(
                {
                    "tool_call_id": tool_call.id,
                    "role": "function",
                    "name": function_name,
                    "content": function_response,
                }
            )  # extend conversation with function response
        # Function call responses: prefix the name with "functions." before sending back
        for message in messages:
            if message["role"] == "function" and "name" in message:
                message["name"] = f"functions.{message['name']}"
        second_response = client.chat.completions.create(
            model="functionary",
            messages=messages,
        )  # get a new response from the model where it can see the function response
        return second_response
print(run_conversation())
- Result
The raw result includes extensive metadata from the chat:
ChatCompletion(
    id='chatcmpl-595f86be-1d88-4a7f-ae4d-eeed86f0a4b1',
    choices=[
        Choice(
            finish_reason='stop',
            index=0,
            logprobs=None,
            message=ChatCompletionMessage(
                content="The current weather in San Francisco is 72°F. In Tokyo, the temperature is 10°C. And in Paris, it's 22°C.",
                role='assistant',
                function_call=None,
                tool_calls=None
            )
        )
    ],
    created=1717843192,
    model='functionary',
    object='chat.completion',
    system_fingerprint=None,
    usage=CompletionUsage(
        completion_tokens=39,
        prompt_tokens=176,
        total_tokens=215
    )
)
And the bit we are interested in is:
The current weather in San Francisco is 72°F. In Tokyo, the temperature is 10°C. And in Paris, it's 22°C.
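One last note: the code above patches the client with instructor but never exercises what the patch actually adds, namely validated, structured outputs via Pydantic models. A minimal sketch against the same local server; the WeatherReport model and prompt are purely illustrative, and how reliably a given local model fills the schema will vary:

from pydantic import BaseModel
import instructor
import openai

class WeatherReport(BaseModel):
    location: str
    temperature: str
    unit: str

client = instructor.patch(client=openai.OpenAI(
    api_key="sk-anything",  # the local server does not check the key
    base_url="http://localhost:8000/v1",
))

# Ask the model to return a WeatherReport object instead of free-form text
report = client.chat.completions.create(
    model="functionary",
    messages=[{"role": "user", "content": "It is 22°C in Paris. Summarize this as a weather report."}],
    response_model=WeatherReport,
)
print(report)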
Summary and Reflections
Setting up a local LLM with function calling capabilities can be daunting but rewarding. It offers an escape from the recurring costs associated with cloud services, while also providing enhanced control over your computational environment and data privacy. Despite potential initial setbacks with installation and configuration, the end result is a robust, cost-effective solution that expands the capabilities of LLMs beyond simple text generation to more interactive and dynamic applications.
Final Thoughts
Embracing local LLMs with function calling is not just about saving costs — it’s about gaining independence and flexibility in how you deploy advanced AI capabilities. Whether you’re a developer, researcher, or tech enthusiast, the journey towards local deployment is one filled with learning opportunities and a significant step towards democratizing advanced AI technologies.