Hello,
Thank you for your fascinating and impressive research and code.
I am currently using MeZO for a task with relatively long input prompts, while working with limited resources (a single GPU with limited VRAM). According to the paper, using a sufficiently large minibatch size B (e.g., 64) seems crucial for reducing the variance of the estimated gradient and ensuring training stability.
Given that I am unable to use a large per_device_batch size in my setup, I would like to implement gradient accumulation in MeZO. Could you possibly provide guidance on which parts of the code need to be modified to achieve this?
Thank you in advance!
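In case it helps frame the question: since MeZO's SPSA-style estimator only needs two forward-pass losses per step, gradient accumulation amounts to reusing the *same* random perturbation z across all micro-batches, averaging the perturbed losses, and only then forming the projected gradient. Below is a minimal plain-Python sketch of that idea on a toy loss; it is not MeZO's actual code, and the function name `mezo_step_with_accumulation` is hypothetical.

```python
import random

def mezo_step_with_accumulation(theta, micro_batches, loss_fn,
                                eps=1e-3, lr=1e-2, seed=0):
    """One MeZO-style (SPSA) step with gradient accumulation (toy sketch).

    Key point: the SAME perturbation z is shared by every micro-batch.
    Averaging the two perturbed losses over micro-batches before forming
    the projected gradient reproduces the large-batch estimate without
    ever materializing the full batch on the GPU.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in theta]  # shared perturbation direction

    theta_plus = [t + eps * zi for t, zi in zip(theta, z)]
    theta_minus = [t - eps * zi for t, zi in zip(theta, z)]

    # Accumulate the two perturbed losses across micro-batches.
    n = len(micro_batches)
    loss_plus = loss_minus = 0.0
    for batch in micro_batches:
        loss_plus += loss_fn(theta_plus, batch) / n
        loss_minus += loss_fn(theta_minus, batch) / n

    # Scalar projected gradient, then an SGD update along z.
    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    return [t - lr * projected_grad * zi for t, zi in zip(theta, z)]
```

With equally sized micro-batches and a mean-reduced loss, one accumulated step is numerically identical to a single step over the concatenated batch, because the loss average distributes over micro-batches; only the loss scalars need to be kept across forward passes, not activations or gradients.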