AgentTuning 7b evaluated on HH: results do not match the paper #39
Comments
Your output suggests a mismatch in the evaluation setup you used. Please ensure that you are running the evaluation code from …
Yes, when I use the latest version, where do I send the trajectory information?
But I can get 0.84 with GPT-4.
Here are my trajectories in HH for a thorough review.
As mentioned in https://github.com/THUDM/AgentTuning#held-in-tasks, please use the AgentBench.old directory for agent task evaluation.
But it is still far below the latest AgentBench results, which is unexpected. Please make sure the uploaded model is correct.
How many epochs did you train for?
The models are trained for 2k steps with batch size 64 and sequence length 4096, using packing.
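"Packing" here means concatenating short tokenized examples into fixed-length rows so the 4096-token context is fully used. The sketch below is purely illustrative of the technique (it is not the authors' training code, and real implementations also track attention/loss masks across example boundaries):

```python
# Illustrative greedy sequence packing: concatenate tokenized examples into
# rows of at most max_len tokens. Assumes each example fits within max_len.
def pack_sequences(examples, max_len=4096):
    packed, current = [], []
    for tokens in examples:
        if len(current) + len(tokens) > max_len:
            packed.append(current)  # current row is full, start a new one
            current = []
        current.extend(tokens)
    if current:
        packed.append(current)
    return packed

# Toy example with max_len=5: three examples pack into two rows.
rows = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=5)
```

A production packer would also emit per-row boundary masks so that attention and loss computation do not leak across concatenated examples.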
I used FastChat to fine-tune Llama 2, but the results were not ideal. Can FastChat fine-tuning reproduce the results in the paper? Although the batch size I set is small (2), the improvement on task completion after fine-tuning is not significant. Do you have any suggestions?
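One common cause of weak fine-tuning results is an effective batch size far below the paper's 64. If hardware only fits a per-device batch of 2, gradient accumulation can recover the target. A small sketch of the arithmetic (the function name is illustrative, not a FastChat API):

```python
# Sketch: how many gradient-accumulation steps recover a target effective
# batch size from a small per-device batch. Hypothetical helper, not FastChat code.
def accumulation_steps(target_batch, per_device_batch, num_devices=1):
    effective_per_step = per_device_batch * num_devices
    if target_batch % effective_per_step != 0:
        raise ValueError("target batch must be divisible by per-step batch")
    return target_batch // effective_per_step

# With batch size 2 on a single GPU, 32 accumulation steps give an
# effective batch of 64, matching the reported training setup.
steps = accumulation_steps(target_batch=64, per_device_batch=2, num_devices=1)
```

In Hugging Face-style trainers this corresponds to setting `gradient_accumulation_steps` alongside the per-device batch size.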
In addition, one of the AgentInstruct records is invalid; its final "gpt" turn has an empty "value": Model S Model 3 Model X Model Y Email Address Zip Code }, { "from": "gpt", "loss": true, "value": "" } ], "id": "mind2web_60" }
Since I achieved poor results after fine-tuning with FastChat, I intend to further improve its capabilities by increasing the dataset size. |
Is alfworld's prompt "alfworld_multiturn_new.json" better than "alfworld_multiturn_react.json"? |
I tried https://huggingface.co/THUDM/agentlm-7b, but it scores far below 84% on alfworld-std. Is it the wrong model?