Replies: 1 comment
-
I have exactly the same problem with local models, with all Llama models and via different approaches: llama.cpp, Ollama, transformers, LangChain. All are much worse than using the Hugging Face API to run the model online. Please let me know if you found any solution :)
-
Hi all, I hope this is the right location to post this!
I am prompting Llama 3 8B Instruct Q5_K_M - running on llama.cpp - via the /completion API endpoint. My confusion is that the results are fundamentally different (worse) than when I use the Llama 3 online demo. The difference does not just come down to natural variation in responses; the outputs are completely different, often syntactically invalid, despite the same parameter settings. Yes, the online demo is the base Llama 3 model while I use the fine-tuned Instruct version (I could not get the base model running because I do not know the correct chat template to apply, and the tokenizer from Hugging Face does not include one). Still, in my opinion that alone should not cause such a difference.
Here is an example prompt that I use with Llama 3 8B Instruct. It tasks the LLM with performing code segmentation by inserting special characters into the code:
The response from Llama 3 8B Instruct on llama.cpp:
(That's it, it cut off here. The token limit is set to 1400.)
Sometimes the responses differ and the LLM briefly explains something vaguely related to segmenting code, but nothing at all similar to the expected (and well within its capabilities) output.
In comparison, when I put the same prompt (without the special <|eot|> tokens, etc.) into the online demo, I get this response (which is exactly the expected result):
Quite obviously there is something wrong with my llama.cpp prompt. To me it looks like the result of an incorrect chat template; however, I believe the prompt template is correct and uses the correct syntax. It matches the syntax described by Meta themselves: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3#meta-llama-3
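One check I can think of (not sure it is conclusive) is asking the server's /tokenize endpoint whether the special tokens in my prompt come back as single token IDs or get split into many plain-text pieces. I don't know whether /tokenize handles special tokens exactly the same way /completion does, so treat this only as a sketch:

```python
import requests

# Sanity-check sketch: if the template is handled correctly, each special token
# should map to a single token ID rather than being split into literal text.
# Assumes the same llama.cpp server exposes its /tokenize endpoint.
LLAMA_SERVER = "http://localhost:8080"

for text in ["<|begin_of_text|>", "<|start_header_id|>", "<|eot_id|>"]:
    tokens = requests.post(
        f"{LLAMA_SERVER}/tokenize",
        json={"content": text},
    ).json()["tokens"]
    # One ID -> recognized as a special token; many IDs -> tokenized as plain text.
    print(text, "->", tokens)
```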
I observed similar behavior with other models (Mistral, Llama 2) before, and also with prompts that do not include code, though I haven't been able to reproduce those examples recently.
Any clues on how to continue debugging here? Thanks!