Be careful when using adaptive gradient methods #17

Open
stevenyangyj opened this issue May 16, 2019 · 3 comments

@stevenyangyj

[Figure: camp.png — log-scale loss curves over 1000 epochs comparing AdaBound, Adam, and SGD]

I tested three methods on a very simple problem and got the result shown above.

The code is below:

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import adabound

class Net(nn.Module):

    def __init__(self, dim):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(dim, 2*dim)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(2*dim, dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

DIM = 30
epochs = 1000
xini = torch.ones(1, DIM) * 100   # input: a vector of 100s
opti = torch.zeros(1, DIM) * 100  # target: all zeros

lr = 0.01
net = Net(DIM)
objfun = nn.MSELoss()

loss_adab = []
loss_adam = []
loss_sgd = []
for epoch in range(epochs):
    if epoch % 100 == 0:
        lr /= 10

    # re-create the optimizer each epoch so the decayed lr takes effect
    optimizer = adabound.AdaBound(net.parameters(), lr)
    out = net(xini)
    los = objfun(out, opti)
    loss_adab.append(los.detach().numpy())

    optimizer.zero_grad()
    los.backward()
    optimizer.step()

lr = 0.01
net = Net(DIM)
objfun = nn.MSELoss()

for epoch in range(epochs):
    if epoch % 100 == 0:
        lr /= 10

    optimizer = torch.optim.Adam(net.parameters(), lr)
    out = net(xini)
    los = objfun(out, opti)
    loss_adam.append(los.detach().numpy())

    optimizer.zero_grad()
    los.backward()
    optimizer.step()

lr = 0.001
net = Net(DIM)
objfun = nn.MSELoss()

for epoch in range(epochs):
    if epoch % 100 == 0:
        lr /= 10

    optimizer = torch.optim.SGD(net.parameters(), lr, momentum=0.9)
    out = net(xini)
    los = objfun(out, opti)
    loss_sgd.append(los.detach().numpy())

    optimizer.zero_grad()
    los.backward()
    optimizer.step()

plt.figure()
plt.plot(loss_adab, label='adabound')
plt.plot(loss_adam, label='adam')
plt.plot(loss_sgd, label='SGD')
plt.yscale('log')
plt.xlabel('epochs')
plt.ylabel('Log(loss)')
plt.legend()
plt.savefig('camp.png', dpi=600)
plt.show()

@LeanderK

This is not a sensible issue. Of course you can create problems that adaptive optimizers are not good at; there's no free lunch in this miserable world! This repo is for AdaBound, not a general discussion of adaptive optimizers.

@stevenyangyj
Author

Hi, LeanderK. Thanks for your comment. You may have misunderstood my purpose. There is indeed no free lunch in this world, so there is also no free lunch between exploration and exploitation for optimizers. Sometimes adaptive methods do bring good convergence speed in the early stage but yield worse optimization results in the final stage of training. I did not mean to impugn AdaBound or ANY adaptive method; I just gave a suggestion: if you are going to train a NN, first try SGD with fine-tuned hyper-parameters in order to save your EXPENSIVE GPU time.
References:
"On the Convergence of Adam and Beyond"
"The Marginal Value of Adaptive Gradient Methods in Machine Learning"

@ConvMech

If you read Luo's paper, you will find that the above two papers have already been cited. It's not a secret.
