About gradient accumulation implementation #44

Open
xvyaward opened this issue Jan 28, 2025 · 0 comments

Hello,
Thank you for your fascinating and impressive research and code.

I am currently using MeZO for a task with relatively long input prompts on limited hardware (a single GPU with little VRAM). According to the paper, a sufficiently large minibatch size B (e.g., 64) appears to be crucial for reducing the variance of the estimated gradient and keeping training stable.

Because I cannot use a large per-device batch size in my setup, I would like to implement gradient accumulation in MeZO. Could you point me to the parts of the code that would need to be modified to achieve this?
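To make the question concrete, here is a minimal sketch of what accumulation could look like for a zeroth-order step. It is not taken from the MeZO codebase. It assumes the usual two-point estimate, in which the same perturbation z (regenerated from a seed) is applied as +eps and -eps, and the update uses only the two scalar losses. Under that assumption there are no per-parameter gradients to accumulate: it should be enough to average each loss over several micro-batches while the perturbation is held fixed. All names here (`perturb`, `mean_loss_over_micro_batches`, `mezo_step_accumulated`, `mse_loss`) are hypothetical, and the model is a toy list of floats in plain Python rather than a PyTorch module.

```python
import random


def perturb(params, seed, scale):
    """Add scale * z to params in place; z ~ N(0, 1) is regenerated from seed."""
    rng = random.Random(seed)
    for i in range(len(params)):
        params[i] += scale * rng.gauss(0.0, 1.0)


def mean_loss_over_micro_batches(params, micro_batches, loss_fn):
    """Per-example mean loss over all micro-batches, evaluated one at a time."""
    total, count = 0.0, 0
    for batch in micro_batches:
        # loss_fn returns a batch mean, so weight it by the batch size.
        total += loss_fn(params, batch) * len(batch)
        count += len(batch)
    return total / count


def mezo_step_accumulated(params, micro_batches, loss_fn, lr, eps, seed):
    """One two-point zeroth-order step with the loss accumulated over micro-batches."""
    perturb(params, seed, +eps)            # theta + eps * z
    loss_plus = mean_loss_over_micro_batches(params, micro_batches, loss_fn)
    perturb(params, seed, -2.0 * eps)      # theta - eps * z
    loss_minus = mean_loss_over_micro_batches(params, micro_batches, loss_fn)
    perturb(params, seed, +eps)            # back to theta
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    perturb(params, seed, -lr * projected_grad)  # theta -= lr * g * z
    return projected_grad


def mse_loss(params, batch):
    """Toy stand-in for a model loss: mean squared error of a linear model."""
    errors = [(sum(w * x for w, x in zip(params, xs)) - y) ** 2 for xs, y in batch]
    return sum(errors) / len(errors)
```

In this toy setting, one batch of four examples and two micro-batches of two examples give the same update up to floating-point error, as long as the seed is shared. The important detail is that every micro-batch is evaluated under the same perturbation before the sign is flipped. Drawing a new z per micro-batch would be a different estimator (an average over several perturbations), not a larger effective batch.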

Thank you in advance!
