Reproducing Alfworld Results #35

ai-nikolai · 2024-01-15T15:20:05Z

Hi,

Thanks for the great work. Unfortunately, we are unable to reproduce your results for ReAct / Reflexion on Alfworld.

E.g. Env0 & Env1 are successful for you, however, we always get failures on our end. (Other Envs are successful though, so it does work sometimes).

@noahshinn

noahshinn · 2024-01-15T21:58:10Z

Hi @ai-nikolai , what model are you using?

ai-nikolai · 2024-01-16T09:36:29Z

Thanks. The model used: gpt-3.5-turbo @noahshinn

ai-nikolai · 2024-01-16T17:47:13Z

@noahshinn would it also be possible to upload the actual game logs for alfworld?

noahshinn · 2024-01-16T18:00:40Z

The model gpt-3.5-turbo is not the same model used during the paper's time (Feb 2023). We used text-davinci-002. I'd expect that the mistakes you see result from the inferred action not matching any of the actions in the action space. We followed ReAct's implementation for AlfWorld results to stay consistent with their work.

To aid this, I would advise you to display the action space to the model to eliminate parsing errors. I can add a side implementation for this if it would be helpful for you. Also, I will dig to see if I can find the original log files from the text-davinci-002 runs.
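As a rough illustration of the suggestion above, here is a minimal sketch of showing the action space in the prompt and checking the model's reply against it. The helper names (`format_action_space`, `is_admissible`) and the prompt wording are hypothetical and are not taken from the Reflexion or ReAct code; it assumes the environment can supply a list of currently valid commands.

```python
from typing import List


def format_action_space(observation: str, admissible_commands: List[str]) -> str:
    """Build a prompt fragment that lists the valid actions (hypothetical wording)."""
    listed = "\n".join(f"- {cmd}" for cmd in admissible_commands)
    return (
        f"{observation}\n"
        "Choose exactly one of the following actions:\n"
        f"{listed}\n"
        "> "
    )


def is_admissible(raw_action: str, admissible_commands: List[str]) -> bool:
    """Return True if the model output matches a valid action after light cleanup."""
    cleaned = raw_action.strip().lower()
    return cleaned in {cmd.lower() for cmd in admissible_commands}


# Illustrative data only, not taken from a real game log.
commands = ["go to desk 1", "take mug 1 from desk 1"]
prompt = format_action_space("You are in the middle of a room.", commands)
```

Counting how often `is_admissible` returns False would give a quick estimate of how many failures come from parsing rather than from planning.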

ai-nikolai · 2024-01-31T18:14:30Z

Thank you @noahshinn.

Please let us know if you had any luck finding the original logs from the text-davinci-002 runs. This would be a really big help. Thank you.

dong-river · 2024-02-25T13:36:31Z

I had the same issue with gpt-3.5-turbo. The success rate seems much lower. The first-trial success rate for me on a subset of tasks is only around 17%, which is consistent with the figure reported in the AgentBench paper. So it would be really helpful if you could provide the original logs.

ai-nikolai · 2024-03-08T15:10:10Z

Hi all,

A couple of comments to follow-up on this:

First, the results you report are very hard to reproduce. The model you used, text-davinci-002, is deprecated, and the two alternatives, davinci-002 and gpt-3.5-turbo, both have an accuracy of 0.3 on a subset, while your reported results are 0.7. Could you provide the traces, or tell us how we could reproduce your results?
Secondly, please see the attached screenshot from AgentBench. The relevant column is HH, where only GPT-4 achieves results comparable to your ReAct results, while text-davinci-002 (the model your code shows) only achieves 16%, which is in line with our reproducibility experiments.
Finally, the original ReAct paper implemented the success condition using info["won"]==True, while you use done==True. This is referenced in the original alfworld repository as issue alfworld/alfworld#51, "Success Condition(s): done[0] is not equal to info["won"][0]".
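To make the last point concrete, here is a small sketch of how the two success conditions can disagree. It assumes, as the linked alfworld issue suggests, that done can be True when an episode ends without the goal being reached (for example at a step limit). The function name and the example values are hypothetical.

```python
from typing import Any, Dict


def episode_success(done: bool, info: Dict[str, Any], use_won: bool = True) -> bool:
    """Score one episode under either success convention.

    use_won=True  -> count success only if info["won"] is truthy
    use_won=False -> count success whenever the episode has ended (done)
    """
    if use_won:
        return bool(info.get("won", False))
    return bool(done)


# Hypothetical episode that ended without solving the task.
done, info = True, {"won": False}

strict = episode_success(done, info, use_won=True)    # judged on info["won"]
lenient = episode_success(done, info, use_won=False)  # judged on done
```

Under this assumption the two conventions score the same episode differently, so reported success rates are only comparable when the same condition is used.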

Concrete Actions / Questions:

Please clarify how to obtain the results you report. (Were they obtained with the weaker models, were stronger models used, or do you have traces?)
Please clarify whether we misunderstand your results: are they actually 70+%, or closer to 30%?

@noahshinn @ysymyth @becklabs

ai-nikolai · 2024-03-20T15:24:01Z

@noahshinn - any updates on the above?

CSUN1997 · 2024-05-30T01:57:20Z

Hi @ai-nikolai,
I am also trying to reproduce the results. The performance was bad in the beginning. After adding these lines to parse the action, the performance went back to normal:

ai-nikolai · 2024-11-20T16:21:56Z

@CSUN1997 @noahshinn @dong-river @ysymyth - It seems there are a couple of issues, which are summarised in this paper StateAct (https://arxiv.org/abs/2410.02810).

Specifically the issues are:

First, different GPT models have very different performance (older models often perform much better).
Secondly, what @CSUN1997 mentions is also described in StateAct as "Correction": the GPT models often produce put <object> in <place>, whereas the correct alfworld syntax is put <object> in/on <place>.
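A minimal sketch of such a correction step is shown below. It is not the code from StateAct or from @CSUN1997's comment; the regular expression and the function name `correct_put_action` are assumptions made for illustration.

```python
import re

# Matches "put <object> in <place>" or "put <object> on <place>".
# An action that already uses "in/on" does not match and is left alone.
_PUT_PATTERN = re.compile(r"^put (?P<obj>.+?) (?:in|on) (?P<place>.+)$")


def correct_put_action(action: str) -> str:
    """Rewrite a model-generated put action into the in/on form."""
    match = _PUT_PATTERN.match(action.strip())
    if match is None:
        return action  # not a put action, or already in the in/on form
    return f"put {match.group('obj')} in/on {match.group('place')}"


fixed = correct_put_action("put mug 1 in cabinet 2")
unchanged = correct_put_action("put mug 1 in/on cabinet 2")
other = correct_put_action("go to cabinet 2")
```

Applying this to every model action before passing it to the environment leaves all other actions untouched.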

Xyuan13 · 2024-12-10T09:35:47Z

@ai-nikolai Hi, I also find it very hard to reproduce the reported results of ReAct and Reflexion. I've read the StateAct paper, and I'd like to clarify the reported results in Table 2 (Success Rate (SR) on the 135 test-set examples from Alfworld). Are these results based on a single trial, or multiple trials?

ai-nikolai · 2025-01-21T15:29:25Z

@Xyuan13 this is a single trial.

ai-nikolai mentioned this issue Nov 20, 2024

[Reproducing Results] on Alfworld ysymyth/ReAct#28

Closed
