The provided new optimizer is sensitive to tiny batch sizes #11
Comments
That's very interesting. We didn't pay attention to the impact of batch size before. Could you please provide more details of the experiments, such as the hyperparameters of each optimizer, the scale of the dataset, etc.?
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import TensorDataset, DataLoader
from adabound import AdaBound

# Generate M data points roughly forming a line (noise added)
M = 50
theta_true = torch.Tensor([[0.5], [2]])
X = 10 * torch.rand(M, 2) - 5
X[:, 1] = 1.0                      # second column acts as the bias term
y = torch.mm(X, theta_true) + 0.3 * torch.randn(M, 1)

# The bodies of mse/model/cost_func/fit did not survive the paste; the
# versions below are reconstructions matching how they are used further down.
def mse(t1, t2):
    return ((t1 - t2) ** 2).mean()

def model(x):
    # manual linear model; shadowed by the nn.Linear model defined below
    return torch.mm(x, theta)

def cost_func(theta, X, y):
    return mse(torch.mm(X, theta), y)

# Define the training setup
batch_size = 1                     # tiny batch size that triggers the issue
num_epochs = 100
loss_fn = F.mse_loss
train_ds = TensorDataset(X, y)
train_dl = DataLoader(train_ds, batch_size, shuffle=True)
model = nn.Linear(2, 1, bias=False)

# Define a utility function to train the model; it returns the parameter
# trajectory (1 x 2 x num_epochs) and the full-batch loss per epoch.
def fit(num_epochs, loss_fn, opt):
    theta_t = torch.zeros(1, 2, num_epochs)
    losses = torch.zeros(num_epochs)
    for epoch in range(num_epochs):
        for xb, yb in train_dl:
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
            opt.zero_grad()
        theta_t[0, :, epoch] = model.weight.detach().view(-1)
        losses[epoch] = loss_fn(model(X), y).item()
    return theta_t, losses

# NOTE: the same `model` instance is reused for every run, as in the paste;
# the SGD+momentum run was missing there and is reconstructed from the plots.
ADAM_t, ADAM = fit(num_epochs, loss_fn, torch.optim.Adam(model.parameters(), lr=1e-2))
SGD_t, SGD = fit(num_epochs, loss_fn, torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0))
SGDM_t, SGDM = fit(num_epochs, loss_fn, torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9))
ADAB_t, ADAB = fit(num_epochs, loss_fn, AdaBound(model.parameters(), lr=1e-2, final_lr=0.1))

# Evaluate the cost on a parameter grid (loop body was truncated in the paste)
theta_0_vals = np.linspace(-2, 4, 100)
theta_1_vals = np.linspace(0, 4, 100)
theta = torch.Tensor(len(theta_0_vals), 2)
J = np.zeros((len(theta_1_vals), len(theta_0_vals)))
for i, theta_0 in enumerate(theta_0_vals):
    for j, theta_1 in enumerate(theta_1_vals):
        J[j, i] = cost_func(torch.Tensor([[theta_0], [theta_1]]), X, y).item()

# Parameter-space trajectories over the cost contours
xc, yc = np.meshgrid(theta_0_vals, theta_1_vals)
contours = plt.contour(xc, yc, J, 20)
plot_vals = list(range(0, num_epochs))
plt.plot(ADAM_t[0, 0, plot_vals], ADAM_t[0, 1, plot_vals], '-.', lw=2, label='Adam')
plt.plot(SGD_t[0, 0, plot_vals], SGD_t[0, 1, plot_vals], '-.', lw=2, label='Sgd')
plt.plot(SGDM_t[0, 0, plot_vals], SGDM_t[0, 1, plot_vals], '-.', lw=2, label='Sgd+momentum')
plt.plot(ADAB_t[0, 0, plot_vals], ADAB_t[0, 1, plot_vals], '-.', lw=2, label='AdaBound')
plt.scatter(theta_true[0].numpy(), theta_true[1].numpy(), marker='*', color='red', lw=2, label='global')
plt.legend(loc='lower left')

# Loss curves: full range (top) and zoomed-in view of the last epochs (bottom)
plt.figure()
plt.subplot(211)
plt.plot(range(ADAB.shape[0]), ADAB, '-.', lw=2, label='AdaBound')
plt.plot(range(ADAB.shape[0]), ADAM, '-.', lw=2, label='Adam')
plt.plot(range(ADAB.shape[0]), SGD, '-.', lw=2, label='Sgd')
plt.plot(range(ADAB.shape[0]), SGDM, '-.', lw=2, label='Sgd+momentum')
plt.subplots_adjust(top=2.92, bottom=0.12, left=0.15, right=2.95, hspace=0.2, wspace=0.35)
plt.legend(loc='upper right')
plt.subplot(212)
plt.plot(range(ADAB.shape[0]), ADAB, '-.', lw=2, label='AdaBound')
plt.plot(range(ADAB.shape[0]), ADAM, '-.', lw=2, label='Adam')
plt.plot(range(ADAB.shape[0]), SGD, '-.', lw=2, label='Sgd')
plt.plot(range(ADAB.shape[0]), SGDM, '-.', lw=2, label='Sgd+momentum')
plt.subplots_adjust(top=2.92, bottom=0.12, left=0.15, right=2.95, hspace=0.2, wspace=0.35)
plt.xlim((80, 100))
plt.ylim((0, 0.3))
plt.legend(loc='upper right')
plt.show()
```
When I used AdaBound to train a ShuffleNet V2 model with a tiny batch size (5-10), I ran into the same problem; the optimizer might not converge. By the way: when I used `adabound.AdaBound([{'params': part of the model's params, 'lr': 0, ...}])` to prevent some parameters from being updated during training, I got an error saying that lr = 0 cannot be used. However, I can use `torch.optim.Adam([{'params': part of the model's params, 'lr': 0, ...}])` for the same purpose. Is this a bug?
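For reference, here is a minimal sketch of the parameter-group setup described above; the toy model and the split into a "frozen" (lr = 0) group and a trainable group are made up for illustration. Per the report above, Adam accepts the lr = 0 group while AdaBound rejects it:

```python
import torch
import torch.nn as nn
import adabound

# Toy model; the choice of which parameters to freeze is illustrative only.
model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
frozen = list(model[0].parameters())      # parameters we do not want updated
trainable = list(model[2].parameters())

param_groups = [
    {'params': frozen, 'lr': 0.0},        # lr = 0 to keep these weights fixed
    {'params': trainable, 'lr': 1e-2},
]

# Works with Adam, as reported above:
opt_adam = torch.optim.Adam(param_groups)

# Reportedly raises an error about lr = 0 with AdaBound:
opt_adab = adabound.AdaBound(param_groups, final_lr=0.1)
```

An alternative way to keep a subset of parameters fixed, without relying on a zero learning rate, is to set `requires_grad = False` on those parameters before constructing the optimizer.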
Hi, thank you for the help.
The provided new optimizer is sensitive to tiny batch sizes (<4). I am testing on a very simple linear regression, while the other optimizers currently perform well.
Path:
Loss curve:
![2](https://user-images.githubusercontent.com/10858450/53678907-4d026980-3cbd-11e9-97bb-e889923b8c89.png)
Zoomed Loss curve:
![3](https://user-images.githubusercontent.com/10858450/53678910-4ffd5a00-3cbd-11e9-88f3-15fc1ed8b182.png)
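To make the reported batch-size dependence concrete, below is a minimal sketch that sweeps the batch size for AdaBound on the same toy data; the `final_loss` helper is hypothetical, and the hyperparameters follow the script posted in the comments above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
from adabound import AdaBound

# Toy linear-regression data, as in the script posted in the comments.
M = 50
theta_true = torch.Tensor([[0.5], [2]])
X = 10 * torch.rand(M, 2) - 5
X[:, 1] = 1.0
y = torch.mm(X, theta_true) + 0.3 * torch.randn(M, 1)

def final_loss(batch_size, num_epochs=100, lr=1e-2, final_lr=0.1):
    """Train a fresh linear model with AdaBound and return the final MSE."""
    model = nn.Linear(2, 1, bias=False)
    opt = AdaBound(model.parameters(), lr=lr, final_lr=final_lr)
    dl = DataLoader(TensorDataset(X, y), batch_size, shuffle=True)
    for _ in range(num_epochs):
        for xb, yb in dl:
            loss = F.mse_loss(model(xb), yb)
            loss.backward()
            opt.step()
            opt.zero_grad()
    return F.mse_loss(model(X), y).item()

# Compare tiny batch sizes (<4) against slightly larger ones.
for bs in (1, 2, 4, 8, 16):
    print(f"batch_size={bs:2d}  final MSE={final_loss(bs):.4f}")
```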