We have a chatbot that uses the PAL method to compose calls to custom functions, answering user questions in combination with user data in our system.
The user data is sleep and exercise data uploaded through smart wearable devices. The data is rich (50+ data fields) and large in volume (each user generates multiple records every day).
The Python code generated by the LLM determines the time range of the data to query, the fields to retrieve, the arrangement of function calls, and other details.
Prompt template:
As a sleep and sport AI, you focus on sleep, health, and exercise. You provide Python code to answer sleep-related and sport-related questions with personal data.
...
## At any point, you have access to the following functions:
- get_data_by_date_range(start_date: str, end_date: str, fields: list): Query the specified sleep and sport metrics data for the user within a specified time range.
- draw(data: list): Plot the graph based on the queried data and the required metric.
- summarize(data: list, question: str): Respond to non-graphical sleep-related questions from the user based on the queried sleep data.
- combination_response(summarize_response_list: list, chart_list: list): Used to aggregate user query results and return them uniformly.
...
## Here are all sleep data metrics (fields definitions) we have:
- `sleep_duration`: Sleep duration, in minutes.
- `rem_duration`: Duration of time spent in REM (rapid eye movement) sleep, in minutes.
- `resting_heart_rate`: Resting heart rate.
... (The other 50+ field descriptions are omitted here)
## Here are some examples of how to use the functions:
Human: Show a chart displaying my sleep duration and heart rate this week, and give me some suggestions.
AI:
```python
start_date = "2023-04-17"
end_date = "2023-04-23"
fields = ["sleep_duration","resting_heart_rate"]
sleep_data = get_data_by_date_range(start_date, end_date, fields)
summarize_resp = summarize(sleep_data, "Please analyze my sleep data and give me some suggestions.")
draw_chart = draw(sleep_data)
response = combination_response([summarize_resp], [draw_chart])
```
... (The other 4 examples are omitted here)
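For extra context, here is a minimal sketch of the host-side loop implied by this setup: the generated program is executed against the four functions listed in the prompt. The function bodies below are placeholders of my own, not our production implementations.
```python
# Minimal sketch (placeholder implementations, not production code) of how the
# LLM-generated program can be executed against the whitelisted functions.

def get_data_by_date_range(start_date: str, end_date: str, fields: list) -> list:
    # Placeholder: in production this queries the wearable-data store.
    return [{"date": start_date, **{f: 0 for f in fields}}]

def summarize(data: list, question: str) -> str:
    # Placeholder: in production this calls the LLM again with the queried data.
    return f"Summary for: {question}"

def draw(data: list) -> dict:
    # Placeholder: in production this renders a chart specification.
    return {"type": "chart", "rows": len(data)}

def combination_response(summarize_response_list: list, chart_list: list) -> dict:
    return {"text": summarize_response_list, "charts": chart_list}

def run_generated_code(code: str):
    # Execute the model's program with only the whitelisted functions in scope.
    scope = {
        "get_data_by_date_range": get_data_by_date_range,
        "summarize": summarize,
        "draw": draw,
        "combination_response": combination_response,
    }
    exec(code, scope)  # NOTE: a real deployment needs proper sandboxing
    return scope.get("response")
```
`run_generated_code(llm_output)` returns whatever the generated program assigned to `response`.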
Thank you very much for your patience in reading this far. I wrote a lot of background information in order to describe the problem, which resulted in a very long text.
Question
PAL is an amazing method, and we already use it in production. We want to replace the OpenAI LLM with an open-source LLM and have encountered some problems:
Recently, many popular open-source LLMs have been released. Have you conducted any supplementary tests? Is there a recommended open-source LLM?
Accuracy on our test cases:

| LLM | Accuracy |
| --- | --- |
| gpt-3.5-turbo | 96% |
| PaLM2 (text-bison@001) | 72.88% |
| WizardCoder-15B | 45% |
| Vicuna-13B | 29% |
Findings from our analysis:
a. Compared with gpt-3.5-turbo, PaLM2, WizardCoder, and Vicuna all show a decline in date-reasoning performance. Is there any way to improve date reasoning? (A sketch of one idea follows this list.)
b. The generalization ability of WizardCoder-15B and Vicuna-13B is insufficient: much of the generated code basically copies the few-shot examples instead of being written for the actual question. Is this caused by insufficient model parameters?
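On point (a), one mitigation we are considering (an assumption on my part, not something taken from this repo) is to precompute calendar anchors on the host side and inject them into the prompt, so the model only has to copy dates rather than do calendar arithmetic:
```python
# Sketch: precompute date anchors and prepend them to the prompt, so the model
# does not have to reason about "this week" / "last week" itself.
from datetime import date, timedelta

def date_context(today: date | None = None) -> str:
    today = today or date.today()
    week_start = today - timedelta(days=today.weekday())   # Monday of the current week
    week_end = week_start + timedelta(days=6)
    last_week_start = week_start - timedelta(days=7)
    last_week_end = week_start - timedelta(days=1)
    return (
        f"Today is {today.isoformat()} ({today.strftime('%A')}).\n"
        f"This week: {week_start.isoformat()} to {week_end.isoformat()}.\n"
        f"Last week: {last_week_start.isoformat()} to {last_week_end.isoformat()}.\n"
    )

# prompt = date_context() + PROMPT_TEMPLATE + user_question
```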
Our prompt is very long; it contains many field descriptions, function descriptions, and few-shot examples. Can we use fine-tuning to reduce the number of input tokens? The few-shot examples can clearly be omitted, but can the function descriptions and field definitions also be omitted?
Do you have any suggestions for building the training and testing datasets for fine-tuning?
Thanks again, everyone. Please feel free to pick any of these questions and help me answer them.
Hi @liuhu ,
Thank you for your interest in our work and for your kind words!
We haven't conducted many experiments with other open-source models.
I agree that new open source models come out every day, claiming to surpass ChatGPT, but they are eventually found not to be as general and as adaptive as ChatGPT.
I don't have a clear solution to that other than trying a few others (maybe Falcon?).
Regarding fine-tuning: yes, I think that if you have the resources and the data, fine-tuning on your examples can help reduce the prompt size to little more than the example-specific inputs.
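As a rough illustration (the format and field names below are placeholders, not a prescribed schema), a fine-tuning record could drop the function and field descriptions and keep only the example-specific input plus the target program, reusing the example from your prompt:
```python
# Rough sketch of a fine-tuning record: short, example-specific input plus the
# target program. The field names ("prompt"/"completion") are placeholders;
# adapt them to whatever format your training framework expects.
import json

record = {
    "prompt": (
        "Today is 2023-04-23.\n"
        "Question: Show a chart of my sleep duration and heart rate this week, "
        "and give me some suggestions.\n"
    ),
    "completion": (
        'start_date = "2023-04-17"\n'
        'end_date = "2023-04-23"\n'
        'fields = ["sleep_duration", "resting_heart_rate"]\n'
        'sleep_data = get_data_by_date_range(start_date, end_date, fields)\n'
        'summarize_resp = summarize(sleep_data, "Please analyze my sleep data and give me some suggestions.")\n'
        'draw_chart = draw(sleep_data)\n'
        'response = combination_response([summarize_resp], [draw_chart])\n'
    ),
}
print(json.dumps(record))  # one JSON line per training example (JSONL)
```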
We are also using programming to solve user problems through PAL, and we are facing issues with long prompts and inference. We would like to know if fine-tuning is effective in addressing these issues or if there are other solutions that we can consider.