As I understand it, right now the HTTP API doesn't provide an endpoint for tokenizing a string.

I was interested in this because I'd like to use tokens with logit biases, but I also want my application to interact with the model purely via the HTTP API (I'd like to deploy the HTTP API on a more powerful machine while I work).

I considered simply tokenizing the string on my local machine rather than the machine with the model: it looks like the `Llama` class has a `tokenize()` method, but constructing a `Llama` instance requires a model path. Is it possible to tokenize without a model? Do all Llama models use the same tokenizer (sorry for what might be a silly question, still learning)?
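For reference, here is roughly what I mean by tokenizing locally (a sketch; the model path is a placeholder):

```python
# Sketch of tokenizing locally with llama-cpp-python instead of the HTTP API.
# The model path is a placeholder; a model file is still needed to construct
# the Llama object, which is exactly the requirement I'd like to avoid.
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model.bin")

# Llama.tokenize() takes UTF-8 bytes and returns a list of token ids.
tokens = llm.tokenize(b"Hello, world!")
print(len(tokens), tokens)

# The resulting ids could then be used as keys in an OpenAI-style
# logit_bias map, which is my actual use case.
logit_bias = {token: 5.0 for token in tokens}
```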
As I understand it, the HTTP API provided is intended to be OpenAI-compatible. I'm not sure whether the OpenAI API supports tokenizing, but would the team be opposed to making the API a superset of the OpenAI API? I could look into adding a tokenize endpoint if that was considered acceptable and valuable. Perhaps non-OpenAI routes could go under a different root path.

Thanks so much for an awesome project. It's been really great to learn and play with while staying more or less on the cutting edge of the llama.cpp features.

Apologies if I've overlooked any related PRs/Issues/Discussions on this!
Replies: 1 comment
Hi dwillie, I'm interested in the same feature. I'm working on a document-augmented chatbot that integrates the REST API server of llama-cpp-python. When I retrieve documents from my vector store, I want to fill my context with as many documents as possible, and for that I need to know how many tokens a given string takes. I implemented a prototype for that:
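In essence it is one extra route on top of the existing FastAPI app. A minimal sketch (the route path, the pydantic models, and the way the `Llama` instance is obtained are illustrative, not the exact prototype):

```python
# Minimal sketch of a tokenize endpoint on top of the llama-cpp-python
# FastAPI server. Route path, request/response models, and model loading
# are illustrative assumptions.
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()
llama = Llama(model_path="./models/7B/ggml-model.bin")  # placeholder path

class TokenizeRequest(BaseModel):
    text: str

class TokenizeResponse(BaseModel):
    tokens: list[int]
    count: int

@app.post("/extras/tokenize", response_model=TokenizeResponse)
def tokenize(request: TokenizeRequest) -> TokenizeResponse:
    # Llama.tokenize() expects UTF-8 bytes and returns a list of token ids.
    tokens = llama.tokenize(request.text.encode("utf-8"))
    return TokenizeResponse(tokens=tokens, count=len(tokens))
```

The client then does a single POST per candidate document and uses `count` to decide how many documents still fit into the context window.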
In this way, the API stays compatible with OpenAI; it is just a superset. Any suggestions to make the endpoint more sophisticated?