Be careful when using adaptive gradient methods #17

Open
stevenyangyj opened this issue May 16, 2019 · 3 comments

@stevenyangyj

[Figure: camp.png — log-scale loss curves over 1000 epochs comparing AdaBound, Adam, and SGD]

I tested three methods on a very simple problem and got the result shown above.

The code is below:

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import adabound

class Net(nn.Module):

    def __init__(self, dim):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(dim, 2*dim)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(2*dim, dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

DIM = 30
epochs = 1000
xini = torch.ones(1, DIM) * 100   # input: a vector of 100s
opti = torch.zeros(1, DIM) * 100  # target: all zeros

lr = 0.01
net = Net(DIM)
objfun = nn.MSELoss()

loss_adab = []
loss_adam = []
loss_sgd = []
for epoch in range(epochs):
    if epoch % 100 == 0:
        lr /= 10

    # re-create the optimizer each epoch so the decayed lr takes effect
    optimizer = adabound.AdaBound(net.parameters(), lr)
    out = net(xini)
    los = objfun(out, opti)
    loss_adab.append(los.detach().numpy())

    optimizer.zero_grad()
    los.backward()
    optimizer.step()

lr = 0.01
net = Net(DIM)
objfun = nn.MSELoss()

for epoch in range(epochs):
    if epoch % 100 == 0:
        lr /= 10

    optimizer = torch.optim.Adam(net.parameters(), lr)
    out = net(xini)
    los = objfun(out, opti)
    loss_adam.append(los.detach().numpy())

    optimizer.zero_grad()
    los.backward()
    optimizer.step()

lr = 0.001
net = Net(DIM)
objfun = nn.MSELoss()

for epoch in range(epochs):
    if epoch % 100 == 0:
        lr /= 10

    optimizer = torch.optim.SGD(net.parameters(), lr, momentum=0.9)
    out = net(xini)
    los = objfun(out, opti)
    loss_sgd.append(los.detach().numpy())

    optimizer.zero_grad()
    los.backward()
    optimizer.step()

plt.figure()
plt.plot(loss_adab, label='adabound')
plt.plot(loss_adam, label='adam')
plt.plot(loss_sgd, label='SGD')
plt.yscale('log')
plt.xlabel('epochs')
plt.ylabel('Log(loss)')
plt.legend()
plt.savefig('camp.png', dpi=600)
plt.show()

@LeanderK

This is not a sensible issue. Of course you can create problems that adaptive optimizers are not good at; there's no free lunch in this miserable world! This repo is for AdaBound, not a general discussion of adaptive optimizers.

@stevenyangyj
Author

Hi, LeanderK. Thanks for your comment. You may have misunderstood my purpose. There is indeed no free lunch in this world, so there is also no free lunch between exploration and exploitation for optimizers. Sometimes adaptive methods do bring good convergence speed in the early stage but yield worse optimization results in the final stage of training. I did not mean to impugn AdaBound or ANY adaptive method; I just gave a suggestion: if you are going to train a NN, first try SGD with fine-tuned hyper-parameters in order to save your EXPENSIVE GPU time.
References:
"On the Convergence of Adam and Beyond"
"The Marginal Value of Adaptive Gradient Methods in Machine Learning"

@ConvMech

If you read Luo's paper, you will find that the above two papers have already been cited. It's not a secret.
