tool-call: add support to llama-cli using new --tools arg #11556

Open · wants to merge 41 commits into master
Conversation

@bandoti (Collaborator) commented Jan 31, 2025

This PR adds support for tool-calls using a --tools switch to llama-cli.

It is currently ⚠Experimental!⚠

This required slight modifications to common_chat_apply_template in order to support passing a new common_params_tools type which encapsulates the tools JSON array and tool_choice.
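Conceptually it is just a small bundle carried through the template call. A rough sketch only; the member names below are illustrative and not necessarily the exact ones in the diff:

#include <string>

// Illustrative sketch: a bundle carrying the tool-call related CLI options
// through common_chat_apply_template. Member names are approximate.
struct common_params_tools {
    std::string tools;                // raw JSON array from --tools
    std::string tool_choice = "auto"; // --tool-choice: "auto", "required", or "none"
    bool        parallel    = false;  // --tool-parallel
};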

This doesn't work yet; it still needs the brains added. I'm trying to figure out how this all works in the server code, so if anyone has tips please feel free to chime in! 😅

Tasks:

Integrating toolcall support with llama-cli

  • Add a --tools option to pass in a JSON tools array
  • Add a --tool-choice option which defaults to "auto" (see this ref)
  • Add a --tool-parallel switch for parallel tool-calls.
  • Copy remaining logic from oaicompat_completion_params_parse in utils.hpp into common_chat_apply_template (common.cpp).
  • Some other grammar changes in the main.cpp algorithm?

Implement toolcall handlers for Model Context Protocol (MCP).

  • Add C++ types for base MCP messages (a rough sketch follows this list).
  • Add C++ types and procedures for Lifecycle phase of MCP protocol.
  • Implement Stdio transport.
  • Implement HTTP SSE transport using cURL.
  • Add base types in the common library for abstracting out tool-call handlers. This should include types/functions for translating between the underlying tool-call implementation (OpenAI style) and other formats (MCP in this case). After the template gets applied in common_chat_apply_template via a call to common_chat_params_init, the resulting prompt member of common_chat_params will contain the JSON-formatted tool-calls. These should be translated and dispatched to the registered handlers (if one was specified).
  • Other refactoring to support receiving input from the handlers while simultaneously allowing the user's input/interjection between request/response in the handlers.
  • Add C++ types for MCP utility messages to ping, cancel, and receive progress updates for long-running tool-calls.
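As a rough sketch of the base message type (MCP is JSON-RPC 2.0 under the hood): this is illustrative only and assumes nlohmann::json, which the project already uses; the actual types in this PR may be structured differently.

#include <nlohmann/json.hpp>
#include <optional>
#include <string>

// Sketch of a base MCP (JSON-RPC 2.0) message. Requests carry id + method,
// notifications carry only method, responses carry id + result or error.
struct mcp_message {
    std::string                   jsonrpc = "2.0";
    std::optional<nlohmann::json> id;     // absent for notifications
    std::optional<std::string>    method; // present for requests/notifications
    std::optional<nlohmann::json> params;
    std::optional<nlohmann::json> result; // present for responses
    std::optional<nlohmann::json> error;

    nlohmann::json to_json() const {
        nlohmann::json j{{"jsonrpc", jsonrpc}};
        if (id)     j["id"]     = *id;
        if (method) j["method"] = *method;
        if (params) j["params"] = *params;
        if (result) j["result"] = *result;
        if (error)  j["error"]  = *error;
        return j;
    }
};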

bandoti requested a review from ngxson as a code owner February 4, 2025 19:06
github-actions bot added the testing (Everything test related) and server labels Feb 4, 2025
@bandoti (Collaborator, Author) commented Feb 4, 2025

@ochafik I am working on adding the tool calls to llama-cli, and at this point I have wired initial support into common_chat_apply_template (from what I can tell) for passing in the templates and the tools array/tool_choice.

However, I need some advice on how to handle the remaining fields of common_chat_params as returned by common_chat_params_init. My basic understanding is that each time the template gets applied, the resulting parameters need to be relayed back to the sampling parameters so they can be hooked into the main token-processing routine. Is this correct? If so, do I simply need to tokenize/push the grammar triggers like server.cpp does? At the moment common_chat_apply_template returns a string, but I can change that by adding an out parameter or something.
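To make the question concrete, this is roughly the wiring I have in mind after the template is applied, mirroring what I think server.cpp does. The struct and member names below are my assumptions (the header may be chat.hpp on some branches), so please correct me if they're off:

#include "chat.h"   // common_chat_params (assumed location)
#include "common.h" // common_params_sampling (assumed location)

// Assumed wiring (names approximate): after common_chat_params_init produces
// the grammar and lazy-grammar triggers, copy them into the sampling params so
// the main token-processing loop in main.cpp picks them up, like server.cpp.
static void apply_chat_params_to_sampling(const common_chat_params & cparams,
                                          common_params_sampling   & sparams) {
    sparams.grammar      = cparams.grammar;
    sparams.grammar_lazy = cparams.grammar_lazy;
    for (const auto & trigger : cparams.grammar_triggers) {
        sparams.grammar_triggers.push_back(trigger);
    }
}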

Thank you for your work on the core of this feature; I am excited to get it working on llama-cli! 😊

@ochafik (Collaborator) commented Feb 5, 2025

Hey @bandoti, sorry for the delay, some quick background questions first:

  • What use case do you have in mind for this? Is it to treat the CLI as a single-shot server?
  • How would you display the output of the tool calls to make it usable (in OpenAI format?). Could you add an example output to the PR description?

Have you considered going directly one step further and having the CLI call tools? @brucepro is looking into doing tool calls w/ MCP servers from the server's Web UI (ref); maybe you could join forces / do the same in C++ w/ cURL.

@bandoti (Collaborator, Author) commented Feb 5, 2025

@ochafik I got this working in llama-cli now. Here's the command I ran, followed by the output:

 ./build/bin/llama-cli.exe -c 2048 -ngl 8 -cnv --jinja -m 'C:/Users/mtmcp/Downloads/Llama-3.2-3B-Instruct-Q6_K.gguf' --tools '[
    {
      "type":"function",
      "function":{
        "name":"get_current_weather",
        "description":"Get the current weather in a given location",
        "parameters":{
          "type":"object",
          "properties":{
            "location":{
              "type":"string",
              "description":"The city and state, e.g. San Francisco, CA"
            }
          },
          "required":["location"]
        }
      }
    }
  ]'

system

Environment: ipython
Cutting Knowledge Date: December 2023
Today Date: 05 Feb 2025

You have access to the following functions. To call a function, please respond with JSON for a function call.Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.Do not use variables.

{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                }
            },
            "required": [
                "location"
            ]
        }
    }
}

You are a helpful assistant


> What is the weather like in Mumbai?
{"name": "get_current_weather", "parameters": {"location": "Mumbai"}}

>
llama_perf_sampler_print:    sampling time =       1.41 ms /    36 runs   (    0.04 ms per token, 25477.71 tokens per second)
llama_perf_context_print:        load time =    1731.11 ms
llama_perf_context_print: prompt eval time =   17904.77 ms /   204 tokens (   87.77 ms per token,    11.39 tokens per second)
llama_perf_context_print:        eval time =    1457.84 ms /    18 runs   (   80.99 ms per token,    12.35 tokens per second)
llama_perf_context_print:       total time =   29930.62 ms /   222 tokens
Interrupted by user

@bandoti (Collaborator, Author) commented Feb 5, 2025

Hey @bandoti , sorry for the delay, some quick background questions first:

* What use case you have in mind for this, is it to treat the cli as a single shot server?

* How would you display the output of the tool calls to make it useable (in openai format?). Could you add an example output to the PR description?

Have you considered going directly one step further and have the CLI call tools? @brucepro is looking into doing tool call w/ MCP servers from the server's Web UI (ref), maybe you could join forces / do the same in C++ w/ CURL).

@ochafik Good timing, we responded at the exact same time, haha. No worries on the delay. Here are some general objectives:

  1. Testability. Having llama-cli able to process these function calls lends itself to some really useful automated tests using tools like expect & co., which can quickly validate the function-call behavior.
  2. I have been working on an ongoing effort to wrap llama-cli in a Tcl scripting environment, and the general idea is that these function calls could be an extremely interesting way to create automation.

In both of these cases, the output can be processed and simply scanned for a valid JSON result. If it's valid, honor the function calls; otherwise, just print to the console.
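As a sketch of that scanning step (hypothetical helper; assumes the nlohmann::json library already vendored in the repo):

#include <nlohmann/json.hpp>
#include <string>

// Hypothetical check: does a line of model output look like a tool call of the
// form {"name": ..., "parameters": {...}}? If not, it is treated as plain text.
static bool looks_like_tool_call(const std::string & line) {
    auto j = nlohmann::json::parse(line, /* cb */ nullptr, /* allow_exceptions */ false);
    return j.is_object() && j.contains("name") && j.contains("parameters");
}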

@bandoti (Collaborator, Author) commented Feb 5, 2025

I will track the MCP protocol work; it sounds interesting! I still think there's a lot of need for local-only tools, however, and I want to ensure these features are workable/testable without standing up endpoints and such. 😊

When you mention adding this capability in cURL, how do you mean? Setting up llama-cli as an MCP client?

EDIT: After reading more on MCP, I see the potential flow, where the AI runs and communicates with the resource services. I'd imagine building that on top of the changes here would work well. A series of services can simply be passed into llama-cli, and it could dispatch to them when it needs something (at least, that's how I'm understanding it).

@brucepro (Contributor) commented Feb 5, 2025

I will track the MCP protocol work it sounds interesting! I still think there's a lot of need for local-only tools however, and want to ensure these features are workable/testable without standing up endpoints and such. 😊

When you mention adding this capability in cURL, how do you mean? Setting up llama-cli as a MCP client?

For MCP, I am adding the SSE client support into the webui. This link was the best example I found: https://github.com/apify/tester-mcp-client/blob/main/src/mcpClient.ts
Then you can run one of the proxies that allow you to use MCP servers directly. This one seemed promising: https://github.com/punkpeye/mcp-proxy/ although I think writing a Python solution to handle the SSE API calls and just using the Python SDK directly (https://github.com/modelcontextprotocol) is where I will end up. So in the end the WebUI will be able to add any SSE server with a config of:

{
  "mcpServers": {
    "fetch": {
      "name": "Fetch",
      "type": "sse",
      "serverUrl": "http://localhost:8765/sse"
    }
  }
}

Still in progress. Once I hit debug mode, I will update my repo and start testing.

ochafik self-requested a review February 5, 2025 17:09
@bandoti (Collaborator, Author) commented Feb 5, 2025

@brucepro thanks for the info on this. It seems to me that, in general, a protocol like this is the way to go for the local AI in llama-cli to invoke actions as well. I'll take a closer look and see what it'll take to add it.

@bandoti (Collaborator, Author) commented Feb 5, 2025

@ochafik As I understand it, to get this working I need to add a "translation" layer between the model's OpenAI function-call request/response and MCP, correct? This shouldn't be too difficult with cURL and the json library.
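Roughly, I'm picturing something along these lines for the request direction (illustrative sketch only; assumes nlohmann::json and the "parameters" shape the model produced in the example above):

#include <nlohmann/json.hpp>

// Sketch: turn a model tool call such as
//   {"name": "get_current_weather", "parameters": {"location": "Mumbai"}}
// into an MCP tools/call request (MCP messages are JSON-RPC 2.0).
static nlohmann::json tool_call_to_mcp_request(const nlohmann::json & call, int id) {
    return {
        {"jsonrpc", "2.0"},
        {"id",      id},
        {"method",  "tools/call"},
        {"params",  {
            {"name",      call.at("name")},
            {"arguments", call.value("parameters", nlohmann::json::object())}
        }}
    };
}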

I really like the discovery aspect of the MCP protocol; it will make managing a collection of functionality much easier.

So I will start working on it, as I think this is an important part of the function-call API. We can revisit the other aspects of MCP, like prompts and the like; those are very powerful as well, but that's a fair amount of work, so it will have to be done gradually.

@brucepro (Contributor) commented

@ochafik As I understand it the requirement to get this working I need to add a "translation" layer between the models OpenAI function call request/response and MCP, correct? This shouldn't be too difficult with cURL and the json library.

I really like the discovery aspect of the MCP protocol—will make managing a collection of functionality much easier.

So I will start working on it as I think this is an important part of the function call API. We can revisit the other aspects of MCP like prompts and the like—those are very powerful as well, albeit that's a fair amount of work so will have to be done gradually.

Did you make any progress on the CLI MCP? I have a super basic React app that seems to work with llama.cpp here: https://github.com/brucepro/llamacppMCPClientDemo. I tested with Llama 3.3 70B but not much else. I will be adding prompts and resources next, and debugging. Once it is cleaned up, I will work on migrating it to the WebUI.

@bandoti (Collaborator, Author) commented Feb 11, 2025

@brucepro I'm currently working on adding the types for the MCP protocol and the initialization handshake. I have all the types defined; I'm just going to add unit tests on them today.

I'm working in a different branch, but I'll hopefully merge that piece in today.

I added a checklist in the PR description above to track these changes. 😊

@bandoti (Collaborator, Author) commented Feb 14, 2025

@brucepro Quick update: I have most of the pieces in place, just working on the SSE transport. I am hoping to finish it (well, make it marginally workable) this weekend. I am leaving the stdio transport unfinished for this PR, but it can be followed up on later, as the HTTP endpoint has a bit more utility.

The SSE changes will require setting up a background thread listening to the SSE endpoint while allowing tool-calls to be sent to a separate endpoint (arbitrarily set by the SSE endpoint event). There are some concurrency-related hiccups which may cause issues, given that the MCP server can push an update to the tools list at any time. But other than that, I don't foresee many problems.
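For anyone curious, the shape of it is roughly the following. This is a heavily simplified sketch: the real transport uses cURL for the stream, handles errors, and dispatches every incoming event, not just the first.

#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

// Simplified sketch of the SSE transport flow: a background thread reads the
// SSE stream (via cURL in the real code) and, once the server sends the
// "endpoint" event, the waiting thread is released and tool-call messages can
// be POSTed to that endpoint.
struct sse_transport_sketch {
    std::thread             listener;
    std::mutex              mtx;
    std::condition_variable cv;
    std::string             message_endpoint; // filled in by the "endpoint" event

    void start(const std::string & sse_url) {
        listener = std::thread([this, sse_url] {
            // ... cURL loop reading the SSE stream from sse_url goes here; when
            // an "endpoint" event arrives, record its URL and notify the waiter ...
            {
                std::lock_guard<std::mutex> lock(mtx);
                message_endpoint = "http://localhost:8765/message"; // placeholder only
            }
            cv.notify_one();
            // ... then keep processing "message" events (tool results, tool-list
            // change notifications) until the connection is closed ...
        });
    }

    std::string wait_for_endpoint() {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [this] { return !message_endpoint.empty(); });
        return message_endpoint;
    }

    ~sse_transport_sketch() {
        if (listener.joinable()) {
            listener.join();
        }
    }
};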

Looking forward to being able to test this thing! Stay tuned...

@brucepro (Contributor) commented

@brucepro Quick update: I have most of the pieces in place, just working on the SSE transport. I am hoping to finish it—well, make it marginally-workable—this weekend. I am leaving the stdio transport unfinished for this PR, but it can be followed up on later, as the HTTP endpoint has a bit more utility.

The SSE changes will require setting up a background thread listening to SSE endpoint while allowing tool-calls to be sent to a separate endpoint (arbitrarily set by the SSE endpoint event). There are some concurrency-related hiccups which may cause issues given that the MCP server can push an update to the tools list at any time. But, other than that, I don't foresee many other problems.

Looking forward to being able to test this thing! Stay tuned...

Awesome. Looking forward to testing it out.

ochafik mentioned this pull request Feb 15, 2025
@bandoti (Collaborator, Author) commented Feb 15, 2025

@ochafik I took a quick look at your cleanup branch and see the switch to common_chat_templates_apply. I suppose, per the run.cpp example, the intention is that each application will call this directly? The changes I put in place currently forward the new toolcall::handler type through common_chat_format_single to invoke the tool-calls, but it seems like a cleaner separation between applying the template and invoking the tool-call is desired.
Here's the high-level order:

  1. Apply chat template
  2. Update grammar/vocab/sampler
  3. Invoke toolcall via toolcall::handler (if supplied)
  4. Tokenize the tool-call response (if supplied)

I would like to unify the means of invoking a tool-call (from the client-side) so the logic may be shared. Would you be okay with updating common_chat_format_single to return common_chat_params instead of a string? This would ensure that step (2) above will be much cleaner to implement.

@ggerganov Please see the above.

@ngxson (Collaborator) commented Feb 15, 2025

common_chat_format_single is only used by llama-cli and llama-run (kind of a one-time usage), so it's fine to update it (just make sure that it runs correctly with non-tool templates too).

In the far future, it would be better to get rid of this function and track the KV cache at the token level instead.

@bandoti (Collaborator, Author) commented Feb 15, 2025

@ngxson Okay, sounds good. I will add an out-param instead of modifying the return type, because it's not returning the full prompt but a delta of it. The main thing is removing the toolcall::handler as a parameter in these functions!
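Something along these lines is what I mean (existing parameters elided; this only illustrates the new out-param, so the declaration below is a sketch rather than the final signature):

#include <string>

struct common_chat_params; // from the common library (chat.h / chat.hpp)

// Proposed shape (sketch): keep returning the formatted delta, and additionally
// fill in the chat params so the caller can forward grammar/triggers to sampling.
std::string common_chat_format_single(/* existing parameters, */
                                      common_chat_params & out_chat_params);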

commit 98c4a8d
Author: Mason M <[email protected]>
Date:   Wed Feb 19 11:05:18 2025 -0400

    Refactor MCP transport callback mechanism

commit 2b3d1f6
Author: Mason M <[email protected]>
Date:   Tue Feb 18 18:01:54 2025 -0400

    add message_to_json function

commit 3c7ae27
Author: Mason M <[email protected]>
Date:   Tue Feb 18 15:32:26 2025 -0400

    Implement send routine

commit 9cec1e0
Author: Mason M <[email protected]>
Date:   Tue Feb 18 10:12:17 2025 -0400

    Fix include paths

commit b5642f0
Author: Mason M <[email protected]>
Date:   Tue Feb 18 09:56:52 2025 -0400

    Use log API

commit 7a83b2b
Author: Mason M <[email protected]>
Date:   Mon Feb 17 19:32:48 2025 -0400

    Fix build errors

commit cc7fd66
Author: Mason M <[email protected]>
Date:   Mon Feb 17 19:03:43 2025 -0400

    Use condition variable to wait for endpoint event

commit 73ccdd1
Author: Mason M <[email protected]>
Date:   Mon Feb 17 17:18:09 2025 -0400

    Process SSE data asynchronously

commit e9c37a3
Author: Mason M <[email protected]>
Date:   Mon Feb 17 14:01:56 2025 -0400

    Add keep-alive header to sse handler

commit 57f84e6
Author: Mason M <[email protected]>
Date:   Mon Feb 17 13:37:59 2025 -0400

    Add methods for handling endpoint/message events

commit 5c160f6
Author: Mason M <[email protected]>
Date:   Mon Feb 17 13:25:18 2025 -0400

    Process sse values

commit f51b493
Author: Mason M <[email protected]>
Date:   Mon Feb 17 13:07:38 2025 -0400

    Clean up sse_read algorithm

commit 29d6875
Author: Mason M <[email protected]>
Date:   Sun Feb 16 19:04:39 2025 -0400

    WIP: implementing SSE protocol
@bandoti (Collaborator, Author) commented Feb 19, 2025

@ochafik I was able to successfully merge in your changes. Thanks for getting those through; it's very helpful!

The SSE transport is finished (in its initial incarnation) but hasn't been tested yet. I have all the parts in place, and now the main task is to add some routines to convert the MCP tool-call messages into OAI JSON format, which goes through the new common_chat_tools_parse_oaicompat functions. There's a small amount of synchronization that needs to be added to toolcall::mcp_impl to make the asynchronous mcp_transport routines block.
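To illustrate the direction of that conversion (a sketch only, not the code in this branch): it assumes an MCP tools/list result shaped per the spec, and produces an OAI-style tools array like the one passed to --tools above, which could then be fed through the common_chat_tools_parse_oaicompat path mentioned here.

#include <nlohmann/json.hpp>

// Sketch: convert the "tools" array from an MCP tools/list result into an
// OpenAI-style tools array. Field names follow the MCP spec (name,
// description, inputSchema); error handling is omitted.
static nlohmann::json mcp_tools_to_oai(const nlohmann::json & mcp_tools) {
    nlohmann::json oai_tools = nlohmann::json::array();
    for (const auto & tool : mcp_tools) {
        oai_tools.push_back({
            {"type", "function"},
            {"function", {
                {"name",        tool.at("name")},
                {"description", tool.value("description", "")},
                {"parameters",  tool.value("inputSchema", nlohmann::json::object())}
            }}
        });
    }
    return oai_tools;
}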

I should have these finished before the end of the week, and will do some initial testing to fix bugs in the MCP connection. After that, it should be ready for general testing. I don't plan on changing the general architecture much (pending review feedback, of course).

github-actions bot added the build (Compilation issues) label Feb 20, 2025