label leaks may happen? #27
Comments
Hi @LongLiveSocialism, thanks for the note. Reflexion is a method to amplify binary rewards to natural language feedback that can be used to improve generative performance. The reward model can take many forms - as evidenced by our programming and decision-making tasks. Can you unpack your comment about "reflexion [being] some kind of in-context few-shot sft/rl"? The second half of your note seems to reference details that would be relevant if Reflexion were viewed as a supervised training process for the purpose of deployment to unseen samples, which was not the intent of the paper. The purpose is to do smart sampling conditioned on sparse feedback from the environment. I'd be happy to discuss this idea further though.
In my opinion, @LongLiveSocialism is not focusing on the programming and decision-making tasks. The concern is that the HotpotQA experiment may not be entirely reasonable. Because the evaluation used real labels, he/she believes that in this experiment Reflexion may be acting as an in-context few-shot sft/rl. The ReAct process clearly has no access to ground-truth labels, so the better performance of Reflexion is easy to explain (the main reason for the improvement may not be the Reflexion architecture). It is fine that the purpose of Reflexion is to do smart sampling conditioned on sparse feedback from the environment. However, first, it is almost impossible to obtain real labels in the vast majority of actual scenarios, and second, it may confuse readers about whether the improvement comes from the real-label supervision or from your feedback mechanism.
I agree that label leakage is a concern. Although calculating rewards from the ground truth doesn't directly expose the correct answers to the model, it can influence the model's decision-making process and steer it toward the correct answers. For instance, consider a binary classification problem where the model needs to output Yes or No, and suppose we run two rounds of iteration.
Using this approach, we can obtain a model with nearly 100% accuracy after two iterations. This odd performance boost is caused by label leakage rather than by the RL process. Of course, HotpotQA is not a simple binary classification task, and the model will not necessarily converge to the correct answer after several iterations. However, the ground-truth labels do have a substantial supervisory effect on the model. In reality, most tasks don't have ground truth available for model iterations, which limits the applicability of this method. In my opinion, a more reasonable approach would be to have the model itself (or a powerful backend like GPT-4) score the results as rewards rather than calculating them directly from the ground truth. That would avoid the label leakage issue and make the method applicable to real-world scenarios that have no ground-truth labels.
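The binary example above can be sketched as a small simulation. This is only an illustration under stated assumptions, not code from the Reflexion repository: the "model" is a random guesser, the function name `simulate` and its parameters are hypothetical, and the only feedback is a binary reward computed from the ground-truth label.

```python
import random


def simulate(num_questions=1000, rounds=2, seed=0):
    """Return cumulative accuracy after each round for a Yes/No guesser.

    Assumption: the guesser never repeats an answer that was already
    marked wrong for the same question (process of elimination).
    """
    rng = random.Random(seed)
    labels = [rng.choice(["Yes", "No"]) for _ in range(num_questions)]
    rejected = [set() for _ in range(num_questions)]  # answers marked wrong
    solved = [False] * num_questions
    accuracy_per_round = []

    for _ in range(rounds):
        for i, label in enumerate(labels):
            if solved[i]:
                continue  # already answered correctly in an earlier round
            candidates = [a for a in ("Yes", "No") if a not in rejected[i]]
            answer = rng.choice(candidates)
            reward = int(answer == label)  # reward comes from the ground truth
            if reward:
                solved[i] = True
            else:
                rejected[i].add(answer)  # the binary reward leaks the label
        accuracy_per_round.append(sum(solved) / num_questions)
    return accuracy_per_round


if __name__ == "__main__":
    print(simulate())
```

With two answer options, a single failed attempt fully determines the label, so the second round is always correct even though the guesser has no knowledge of the task. With a larger answer space, as in HotpotQA, the same feedback removes only one candidate per round.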
Sorry for the late response, but I should refer you to our ablation study shown in Figure 4 of our paper. In that study, we evaluated baseline sampling (blindly sampling N times) vs. episodic memory sampling (sampling conditioned on the previous samples and binary labels) and, finally, Reflexion sampling. We found that episodic memory sampling improved accuracy (which could be explained by the process of elimination suggested by @lazyupdate), but did not produce performance improvements as high as the Reflexion sampling strategy. Episodic memory sampling contains labels and previous answers but does not lead to the best performance. This rules out "label leakage" as the sole contributor to the success of Reflexion on HotpotQA. Let me know if there are further questions.
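To make the difference between the three sampling strategies concrete, here is a rough sketch of what each one could condition on. It is an assumption-laden illustration rather than the paper's implementation: `build_context`, the strategy names, and the prompt layout are all hypothetical.

```python
def build_context(strategy, question, trials):
    """Build the text a sampler would condition on for the next attempt.

    trials is a list of (answer, is_correct, reflection) tuples from
    earlier attempts on the same question.
    """
    if strategy not in ("baseline", "episodic_memory", "reflexion"):
        raise ValueError(f"unknown strategy: {strategy}")

    lines = [f"Question: {question}"]
    if strategy == "baseline":
        # Blind resampling: no information from earlier attempts.
        return "\n".join(lines)

    for n, (answer, is_correct, reflection) in enumerate(trials, start=1):
        verdict = "correct" if is_correct else "incorrect"
        # Both memory-based strategies see past answers and binary labels.
        lines.append(f"Trial {n}: answered {answer!r} ({verdict})")
        if strategy == "reflexion":
            # Only Reflexion adds the natural-language self-reflection.
            lines.append(f"Reflection {n}: {reflection}")
    return "\n".join(lines)


if __name__ == "__main__":
    past = [("Paris", False, "I confused the birthplace with the capital.")]
    for name in ("baseline", "episodic_memory", "reflexion"):
        print(f"--- {name} ---")
        print(build_context(name, "Where was the author born?", past))
```

In this framing, episodic memory and Reflexion receive exactly the same label information, so any gap between them cannot be attributed to the labels alone.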
Hi Noah, I'm reproducing your work. Generally, I view reflexion as some kind of in-context few-shot sft/rl, which requires a supervised signal (either from the environment or from labels). However, in your code, the evaluation on HotpotQA seems to use the validation-set labels directly as this supervised signal, which means label leakage happens. I'm pretty confused here.
Did you run experiments on whether reflection on training samples generalizes to validation samples? Or have I misunderstood your idea?