
lr_scheduler affects the actual learning rate #9

Open
tsaizehua opened this issue Feb 28, 2019 · 3 comments

@tsaizehua

# Applies bounds on actual learning rate
# lr_scheduler cannot affect final_lr, this is a workaround to apply lr decay
final_lr = group['final_lr'] * group['lr'] / base_lr

However, lr_scheduler may change param_group['lr'] during training, so final_lr, lower_bound, and upper_bound will also be affected.
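
To make the dependence concrete, here is a small self-contained sketch (plain Python) of how the bounds scale with the current learning rate. The bound formulas follow the released adabound.py; the gamma, step, and lr values are illustrative only:

    # Sketch: how AdaBound's clipping bounds move when a scheduler decays group['lr'].
    def bounds(current_lr, base_lr=0.008, final_lr=0.1, gamma=1e-3, step=1000):
        scaled_final_lr = final_lr * current_lr / base_lr   # same formula as the excerpt above
        lower = scaled_final_lr * (1 - 1 / (gamma * step + 1))
        upper = scaled_final_lr * (1 + 1 / (gamma * step))
        return lower, upper

    print(bounds(0.008))  # bounds at the initial lr
    print(bounds(0.004))  # after ReduceLROnPlateau halves the lr, both bounds halve as well

As the sketch shows, once the scheduler changes param_group['lr'], the bounds scale with it.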

Should I avoid using lr_scheduler and let AdaBound adapt the parameters as it transforms from Adam to SGD?

Thank you very much!

@Luolc (Owner) commented Feb 28, 2019

It's a feature.
The bounds should be affected by the lr scheduler as well.
You may refer to the CIFAR-10 demo for an example of using AdaBound together with an lr scheduler.

Update:
Some intuitive reasoning for this design: lr decay is an independent operation that we may apply with any optimizer, and a specific optimizer shouldn't override the behavior of lr_scheduler.
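
For reference, a minimal sketch of using AdaBound together with an lr scheduler, in the spirit of the CIFAR-10 demo (the placeholder model, scheduler choice, and milestone values are illustrative, not the demo's exact settings):

    import torch
    import adabound  # pip install adabound

    model = torch.nn.Linear(10, 2)  # placeholder model
    optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)
    # StepLR's gamma is the decay factor here, unrelated to AdaBound's gamma.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=150, gamma=0.1)

    for epoch in range(300):
        # ... run the training batches, calling optimizer.step() ...
        scheduler.step()  # decays group['lr'], which rescales AdaBound's bounds as described above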

@tsaizehua (Author)

Thank you for your reply.

I have tried comparing Adam with AdaBound on an LSTM language model. I found that AdaBound indeed makes the learning process more stable; however, the learning speed and convergence rate are slower than Adam's. The initial parameters are shown in the screenshots. Should I try a higher lr (currently 0.008) next, or do you have a better suggestion? Thank you.

P.S. I also used lr_scheduler and call scheduler.step(valid_loss) every 1/5 epoch:

import torch as t  # alias assumed; the original snippet uses `t` for torch

# Halve the lr whenever the validation loss stops improving for `patience` checks.
scheduler = t.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                   mode='min',
                                                   factor=0.5,
                                                   patience=3,
                                                   verbose=True)
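
A minimal sketch of the stepping pattern described above, reusing the scheduler defined just before (the loss values are dummies standing in for the validation loss computed every 1/5 epoch):

    # One call per validation check (five per epoch in this setup); dummy losses for illustration.
    for valid_loss in [4.2, 4.0, 3.9, 3.9, 3.9]:
        scheduler.step(valid_loss)  # lowers the lr once the metric stops improving for `patience` checks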

[Screenshots: AdaBound and Adam training curves]

@Luolc (Owner) commented Mar 1, 2019

> however, the learning speed and convergence rate are slower than Adam's.

Indeed, there's no guarantee that AdaBound will be faster than Adam, and we never claimed that.
The only thing I can say for sure is that, theoretically, AdaBound should be robustly faster than SGD with similar settings.

Regarding your specific situation, you might try a slower transformation from Adam to SGD, i.e. a lower gamma value.
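
For example, a sketch of passing a lower gamma when constructing the optimizer (the package's default gamma is 1e-3; the 1e-4 value and the placeholder model are illustrative):

    import torch
    import adabound

    model = torch.nn.Linear(10, 2)  # placeholder model
    # A smaller gamma slows how quickly the bounds tighten toward final_lr,
    # keeping AdaBound in its Adam-like regime for longer.
    optimizer = adabound.AdaBound(model.parameters(), lr=0.008, final_lr=0.1, gamma=1e-4)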

Personally, I regard this optimizer as more like SGD than Adam. Its strength is that it can quickly achieve a relatively small loss, and at the final stage we can fine-tune it much like plain SGD. As we know, tuning SGD is not that easy, so we shouldn't expect a perfect result to come that easily either.

Lastly, we did find that SGD is worse than Adam on some NLP tasks, and we are still investigating why. In such cases, I am afraid AdaBound may not outperform Adam. :(
