Hello,
Thank you for your fascinating and impressive research and code.
I am currently using MeZO for a task with relatively long input prompts, while working with limited resources (a single GPU with limited VRAM). According to the paper, using a sufficiently large minibatch size B (e.g., 64) seems crucial for reducing the variance of the estimated gradient and ensuring training stability.
Given that I am unable to use a large per_device_batch size in my setup, I would like to implement gradient accumulation in MeZO. Could you possibly provide guidance on which parts of the code need to be modified to achieve this?
Thank you in advance!
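In case it helps frame the question: since MeZO's SPSA-style estimator only needs two forward-pass losses per step, gradient accumulation amounts to reusing the *same* random perturbation z across all micro-batches, averaging the perturbed losses, and only then forming the projected gradient. Below is a minimal plain-Python sketch of that idea on a toy loss; it is not MeZO's actual code, and the function name `mezo_step_with_accumulation` is hypothetical.

```python
import random

def mezo_step_with_accumulation(theta, micro_batches, loss_fn,
                                eps=1e-3, lr=1e-2, seed=0):
    """One MeZO-style (SPSA) step with gradient accumulation (toy sketch).

    Key point: the SAME perturbation z is shared by every micro-batch.
    Averaging the two perturbed losses over micro-batches before forming
    the projected gradient reproduces the large-batch estimate without
    ever materializing the full batch on the GPU.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in theta]  # shared perturbation direction

    theta_plus = [t + eps * zi for t, zi in zip(theta, z)]
    theta_minus = [t - eps * zi for t, zi in zip(theta, z)]

    # Accumulate the two perturbed losses across micro-batches.
    n = len(micro_batches)
    loss_plus = loss_minus = 0.0
    for batch in micro_batches:
        loss_plus += loss_fn(theta_plus, batch) / n
        loss_minus += loss_fn(theta_minus, batch) / n

    # Scalar projected gradient, then an SGD update along z.
    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    return [t - lr * projected_grad * zi for t, zi in zip(theta, z)]
```

With equally sized micro-batches and a mean-reduced loss, one accumulated step is numerically identical to a single step over the concatenated batch, because the loss average distributes over micro-batches; only the loss scalars need to be kept across forward passes, not activations or gradients.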