[Bug] The success condition for alfworld might be incorrect #19
Comments
You are right. We didn't notice this because we used a max of 30 steps per trajectory for the paper results.
Thanks so much for the quick response.
Btw, when I apply your fix commit (change 50 -> 49), the performance drops drastically. Did you observe this?
Which model are you using? Can you provide numbers for the drop in performance?
I used gpt-3.5-turbo, the same as in the default config file, and the success rate is 0.13 at trial 0 (lower than the reported performance of roughly 0.6 shown in Fig. 3 of the paper). I am still waiting for the remaining trials.
@noahshinn024 The results with 12 trials are below, and the success rate is much lower than the reported results. ***** Start Trial #11 ***** Environment #0 Trial #11: SUCCESS SUCCESS: 98
A month has passed. Have there been any further findings regarding this reproduction result? Have you successfully reproduced it?
Hi @stevenyangyj, thanks for these findings! Did you look through the log files to check whether the errors can be explained by incorrect action choice or incorrect action specification? I am asking because we used gpt-3.5 (text-davinci-003, not gpt-3.5-turbo) for the Alfworld runs. My guess is that the chat model may perform worse due to formatting errors, as the complete action space is not defined at each time step. Let me know if this aligns with your findings!
According to this line and the other line, the function alfworld_run will return True when the environment reaches the maximum allowed number of steps, regardless of whether the goal has been achieved, which results in a spuriously high success rate.
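The failure mode described above can be illustrated with a minimal sketch. This is not the repository's code: `ToyEnv`, `run_buggy`, and `run_fixed` are hypothetical names, and the toy environment simply assumes that its `done` flag is raised both when the goal is reached and when its own step limit is hit. Under that assumption, a runner that treats `done` as success reports a win whenever its step budget reaches the environment's limit.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ToyEnv:
    """Hypothetical stand-in for an environment with a built-in step limit."""
    max_steps: int
    goal_step: Optional[int] = None  # step at which the goal is reached; None = never
    steps: int = 0

    def step(self, action: str) -> Tuple[bool, bool]:
        """Advance one step and return (done, won)."""
        self.steps += 1
        won = self.goal_step is not None and self.steps >= self.goal_step
        # Assumption: the episode also ends when the step limit is exhausted.
        done = won or self.steps >= self.max_steps
        return done, won


def run_buggy(env: ToyEnv, step_budget: int) -> bool:
    """Treats `done` as success, so a timeout is counted as a win."""
    for _ in range(step_budget):
        done, _won = env.step("look")
        if done:
            return True
    return False


def run_fixed(env: ToyEnv, step_budget: int) -> bool:
    """Reports success only when the environment says the goal was reached."""
    for _ in range(step_budget):
        done, won = env.step("look")
        if done:
            return won
    return False


if __name__ == "__main__":
    # The goal is never reached, yet the buggy runner reports success at step 50.
    print(run_buggy(ToyEnv(max_steps=50), step_budget=50))
    print(run_fixed(ToyEnv(max_steps=50), step_budget=50))
    # A 30-step budget never hits the environment limit, so the bug stays hidden.
    print(run_buggy(ToyEnv(max_steps=50), step_budget=30))
```

In this sketch, lowering the runner's budget below the environment limit (for example 50 -> 49, or the 30 steps mentioned in the thread) avoids the spurious success, while checking the `won` signal directly removes the dependence on the two limits being different.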