Reproducing Alfworld Results #35
Comments
Hi @ai-nikolai, what model are you using?
Thanks. The model used:
@noahshinn would it also be possible to upload the actual game logs for alfworld as well?
The model To aid this, I would advise you to display the action space to the model to eliminate parsing errors. I can add a side implementation for this if it would be helpful for you. Also, I will dig to see if I can find the original log files from the
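One possible reading of the suggestion above, as a rough sketch rather than the side implementation mentioned in the comment: list the admissible actions in the prompt, then snap the model's free-form reply to the closest listed action. It assumes the environment provides a list of lowercase admissible command strings at each step (ALFWorld can return these as `admissible_commands` in the step info). The helper names `build_prompt` and `snap_to_admissible` and the 0.6 similarity cutoff are hypothetical and not part of this repository.

```python
import difflib


def build_prompt(observation, admissible_commands):
    """Append the current admissible actions to the observation text."""
    actions = "\n".join(f"- {cmd}" for cmd in admissible_commands)
    return f"{observation}\nAdmissible actions:\n{actions}\n> "


def snap_to_admissible(model_output, admissible_commands, cutoff=0.6):
    """Map free-form model output to the closest admissible command.

    Returns None when nothing is similar enough, so the caller can
    re-prompt instead of sending an unparseable action to the env.
    Assumes the admissible commands are lowercase strings.
    """
    cleaned = model_output.strip().lower()
    if cleaned in admissible_commands:
        return cleaned
    matches = difflib.get_close_matches(
        cleaned, admissible_commands, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


if __name__ == "__main__":
    commands = ["go to desk 1", "take mug 1 from desk 1"]
    print(build_prompt("You are in the middle of a room.", commands))
    print(snap_to_admissible("take the mug 1 from desk 1", commands))
```

Returning `None` for unmatched output keeps parsing failures visible in the logs instead of silently turning them into a wrong action.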
Thank you @noahshinn. Please let us know if there was any luck finding the original logs using
I had the same issue with gpt-3.5-turbo. The success rate seems much lower. My first-trial success rate on a subset of tasks is only around 17%, which is consistent with what the AgentBench paper reports. If you could provide the original logs, that would be really helpful.
Hi all, a couple of comments to follow up on this:
Concrete Actions / Questions:
@noahshinn - any updates on the above?
Hi @ai-nikolai,
@CSUN1997 @noahshinn @dong-river @ysymyth - It seems there are a couple of issues, which are summarised in the StateAct paper (https://arxiv.org/abs/2410.02810). Specifically, the issues are:
@ai-nikolai Hi, I also find it very hard to reproduce the reported results of ReAct and Reflexion. I’ve read the StateAct paper, and I’d like to clarify the results reported in Table 2 (Success Rate (SR) on the 135 test-set examples from Alfworld). Are these results based on a single trial or on multiple trials?
@Xyuan13 this is a single trial.
Hi,
Thanks for the great work. Unfortunately, we are unable to reproduce your results for ReAct / Reflexion on Alfworld.
E.g. Env0 & Env1 are successful for you; however, we always get failures on our end. (Other Envs are successful, though, so it does work sometimes.)
@noahshinn