
Checkpoints available? #4

Open
pablogranolabar opened this issue May 1, 2021 · 10 comments

Comments

@pablogranolabar

Hi, thanks for making your work available.

Are there any checkpoints available for DialogBERT to experiment with, or does it have to be trained from scratch using main.py?

Can you share the hardware configuration, including GPU or CPU memory, that you used to train -Medium and -Large? I am getting OOM on a K80 even when attempting to train -Medium.

And how long did it take to train -Medium and -Large?

Thanks in advance!

@guxd
Owner

guxd commented May 3, 2021

We apologize that there is no checkpoint available. You have to train it from scratch using main.py.

The -Medium and -Large models were found to overfit on the DailyDialog and MultiWOZ datasets and produced suboptimal performance, so we only used the 'tiny' configuration for these two datasets.
We used an Nvidia P40 GPU with 24 GB of memory to train all models.
It took 3 days to train the base-sized model on the Weibo dataset and around 5 hours for the tiny model on the DailyDialog dataset.
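For reference, a from-scratch training run might be launched along the following lines. This is only a sketch: the flag names are taken from the argparse Namespace that main.py prints later in this thread, and the specific values are illustrative rather than recommended settings.

python main.py --model DialogBERT --model_size tiny --dataset dailydial --data_path ./data/dailydial --per_gpu_train_batch_size 24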

@pablogranolabar
Author

Hi again. I've been training DialogBERT-Large for about 12 days and had two days left on the full training run when I accidentally shut down the machine. I backed up the entire directory so as not to overwrite any checkpoints, but when I resumed training it started over at epoch 0 of 169. Does the training code support resuming after an interruption, or do I have to start all over again?

Also, I noticed that main.py has a dataset option, and you had mentioned overfitting on the larger models. Is there an optimal set of execution parameters you would recommend for training DialogBERT-Large from scratch?

@guxd
Owner

guxd commented May 12, 2021

You have to start all over again in this situation.
If you want to speed up training, you can reduce the frequency of testing/validation, or start validation/testing only after a certain number of epochs.

There is no single optimal set of parameters. If you train on your own dataset, you need to fine-tune the hyperparameters yourself.
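Regarding reducing the validation frequency: as a concrete but hypothetical example, assuming the validating_steps and logging_steps fields that appear in the Namespace logged further down this thread are exposed as command-line flags, validation could be made less frequent with something like:

python main.py --model_size tiny --dataset dailydial --validating_steps 200 --logging_steps 1000

The defaults in that Namespace are validating_steps=20 and logging_steps=200, so larger values mean validation and logging run less often.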

@pablogranolabar
Author

So what happens after pretraining? What options would be used to load the saved model checkpoints with this configuration?

Thanks in advance :)

@guxd
Owner

guxd commented May 12, 2021

I updated the code to include the test script.
Run
python main.py --do_test --reload_from XXXX
where XXXX specifies the iteration number of your optimal checkpoint.
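As a usage example: if checkpoints are written every save_steps iterations (save_steps=5000 appears in the Namespace logged further down), then loading the checkpoint saved at, say, iteration 15000 would look like the following; the iteration number here is purely illustrative.

python main.py --do_test --reload_from 15000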

@pablogranolabar
Author

Thank you!

@pablogranolabar
Author

So the base-sized model is the medium one? Is there a difference between base and medium? Base took 3 days on a P40; do you have any thoughts on how much total training time the large model will require on a single V100? I've run the training under nohup on an AWS instance, and for some reason tqdm only shows the current epoch in nohup.out rather than the total progress, so I'm trying to estimate how long training the large model will take with the stock configuration in main.py.

Thanks in advance as always :)

@guxd
Owner

guxd commented May 19, 2021

The base-sized model is just the same size as bert-base.

We did not try to train a large-sized model, so we cannot give you advice; you will have to train the large model on your dataset yourself.
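For context, here is a minimal sketch of the standard BERT sizes referred to above, using Hugging Face transformers. DialogBERT's actual encoder configurations may differ, so treat this purely as a reference for the parameter scales involved.

# Standard BERT configurations for reference only; DialogBERT's own configs may differ.
from transformers import BertConfig

bert_base = BertConfig(hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)    # ~110M parameters
bert_large = BertConfig(hidden_size=1024, num_hidden_layers=24, num_attention_heads=16, intermediate_size=4096)  # ~340M parameters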

@pablogranolabar
Author

Ok, I stopped training Large and attempted to restart the training process, but I'm getting the following error:

$ python3 main.py --model_size=large --per_gpu_train_batch_size=24 --do_test --reload_from=640
number of gpus: 0
05/23/2021 00:19:52 - WARNING - __main__ -  Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
05/23/2021 00:20:05 - INFO - __main__ -  Training/evaluation parameters Namespace(adam_epsilon=1e-08, data_path='./data/dailydial', dataset='dailydial', device=device(type='cpu'), do_test=True, fp16=False, fp16_opt_level='O1', grad_accum_steps=2, language='english', learning_rate=5e-05, local_rank=-1, logging_steps=200, max_grad_norm=1.0, max_steps=200000, model='DialogBERT', model_size='large', n_epochs=1.0, n_gpu=0, per_gpu_eval_batch_size=1, per_gpu_train_batch_size=24, reload_from=640, save_steps=5000, save_total_limit=100, seed=42, server_ip='', server_port='', validating_steps=20, warmup_steps=5000, weight_decay=0.01)
Traceback (most recent call last):
  File "main.py", line 115, in <module>
    main()
  File "main.py", line 110, in main
    results = solver.evaluate(args)
  File "/home/ubuntu/DialogBERT/solvers.py", line 79, in evaluate
    self.load(args)
  File "/home/ubuntu/DialogBERT/solvers.py", line 53, in load
    assert args.reload_from<=0, "please specify the checkpoint iteration in args.reload_from" 
AssertionError: please specify the checkpoint iteration in args.reload_from

Where do I get the checkpoint iteration number from? I've tried the most recent batch numbers as well as validation numbers (640, from valid_results640.txt), but those don't seem to be the checkpoint iteration number.

Thanks in advance!

@guxd
Owner

guxd commented May 23, 2021

You should specify --reload_from 640 when you run python main.py.
