
lr_scheduler affects the actual learning rate #9

Open
tsaizehua opened this issue Feb 28, 2019 · 3 comments

@tsaizehua

# Applies bounds on actual learning rate
# lr_scheduler cannot affect final_lr, this is a workaround to apply lr decay
final_lr = group['final_lr'] * group['lr'] / base_lr

However, lr_scheduler may change param_group['lr'] during training, so final_lr, lower_bound, and upper_bound will also be affected.
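
To make the dependence concrete, here is a small self-contained sketch (plain Python) of how the bounds scale with the current learning rate. The bound formulas follow the released adabound.py; the gamma, step, and lr values are illustrative only:

    # Sketch: how AdaBound's clipping bounds move when a scheduler decays group['lr'].
    def bounds(current_lr, base_lr=0.008, final_lr=0.1, gamma=1e-3, step=1000):
        scaled_final_lr = final_lr * current_lr / base_lr   # same formula as the excerpt above
        lower = scaled_final_lr * (1 - 1 / (gamma * step + 1))
        upper = scaled_final_lr * (1 + 1 / (gamma * step))
        return lower, upper

    print(bounds(0.008))  # bounds at the initial lr
    print(bounds(0.004))  # after ReduceLROnPlateau halves the lr, both bounds halve as well

As the sketch shows, once the scheduler changes param_group['lr'], the bounds scale with it.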

Should I avoid using lr_scheduler and let AdaBound adapt the parameters as it transforms from Adam to SGD?

Thank you very much!

@Luolc (Owner) commented Feb 28, 2019

It's a feature.
The bounds should be affected by the lr scheduler as well.
You may refer to the CIFAR-10 demo for an example of using AdaBound together with an lr scheduler.

Update:
Some intuitive reasoning for this design: lr decay is an independent operation that we may apply with any optimizer, and a specific optimizer shouldn't override the behavior of lr_scheduler.
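
For reference, a minimal sketch of using AdaBound together with an lr scheduler, in the spirit of the CIFAR-10 demo (the placeholder model, scheduler choice, and milestone values are illustrative, not the demo's exact settings):

    import torch
    import adabound  # pip install adabound

    model = torch.nn.Linear(10, 2)  # placeholder model
    optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)
    # StepLR's gamma is the decay factor here, unrelated to AdaBound's gamma.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=150, gamma=0.1)

    for epoch in range(300):
        # ... run the training batches, calling optimizer.step() ...
        scheduler.step()  # decays group['lr'], which rescales AdaBound's bounds as described above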

@tsaizehua (Author)

Thank you for your reply.

I have tried comparing Adam with AdaBound on an LSTM language model. I found that AdaBound indeed makes the learning process more stable; however, the learning speed and convergence rate are slower than Adam's. The initial parameters are shown in the screenshots. Should I try a higher lr (currently 0.008) next, or do you have a better suggestion? Thank you.

P.S. I also used lr_scheduler and call scheduler.step(valid_loss) every 1/5 epoch:

import torch as t  # alias assumed; the original snippet uses `t` for torch

# Halve the lr whenever the validation loss stops improving for `patience` checks.
scheduler = t.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                   mode='min',
                                                   factor=0.5,
                                                   patience=3,
                                                   verbose=True)
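
A minimal sketch of the stepping pattern described above, reusing the scheduler defined just before (the loss values are dummies standing in for the validation loss computed every 1/5 epoch):

    # One call per validation check (five per epoch in this setup); dummy losses for illustration.
    for valid_loss in [4.2, 4.0, 3.9, 3.9, 3.9]:
        scheduler.step(valid_loss)  # lowers the lr once the metric stops improving for `patience` checks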

[Screenshots: AdaBound and Adam training curves]

@Luolc (Owner) commented Mar 1, 2019

> however, the learning speed and convergence rate are slower than Adam's.

Indeed, there's no guarantee that AdaBound will be faster than Adam, and we never claimed that.
The only thing I can say for sure is that, theoretically, AdaBound should be robustly faster than SGD with similar settings.

Regarding your specific situation, you might try a slower transformation from Adam to SGD, i.e. a lower gamma value.
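
For example, a sketch of passing a lower gamma when constructing the optimizer (the package's default gamma is 1e-3; the 1e-4 value and the placeholder model are illustrative):

    import torch
    import adabound

    model = torch.nn.Linear(10, 2)  # placeholder model
    # A smaller gamma slows how quickly the bounds tighten toward final_lr,
    # keeping AdaBound in its Adam-like regime for longer.
    optimizer = adabound.AdaBound(model.parameters(), lr=0.008, final_lr=0.1, gamma=1e-4)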

Personally, I regard this optimizer as more like SGD than Adam. Its strength is that it can quickly achieve a relatively small loss, and at the final stage we can fine-tune it much like plain SGD. As we know, tuning SGD is not that easy, so we shouldn't expect a perfect result to come that easily either.

Lastly, we did find that SGD is worse than Adam on some NLP tasks, and we are still investigating why. In such cases, I am afraid AdaBound may not outperform Adam. :(
