
Checkpoints available? #4

Open
pablogranolabar opened this issue May 1, 2021 · 10 comments

Comments

@pablogranolabar

Hi, thanks for making your work available.

Are there any checkpoints available for DialogBERT to experiment with, or does it have to be trained from scratch using main.py?

Can you share the hardware configuration, including GPU or CPU memory, that you used to train -Medium and -Large? I am getting OOM on a K80 even when attempting to train -Medium.

And how long did it take to train -Medium and -Large?

Thanks in advance!

@guxd
Owner

guxd commented May 3, 2021

We apologize that there is no checkpoint available. You have to train it from scratch using main.py.

The -Medium and -Large models were found to overfit on the DailyDialog and MultiWOZ datasets and produced suboptimal performance, so we only used the 'tiny' configuration for these two datasets.
We used an Nvidia P40 GPU with 24 GB of memory to train all models.
It took 3 days to train the base-sized model on the Weibo dataset and around 5 hours for the tiny model on the DailyDialog dataset.
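For reference, a from-scratch training run might be launched along the following lines. This is only a sketch: the flag names are taken from the argparse Namespace that main.py prints later in this thread, and the specific values are illustrative rather than recommended settings.

python main.py --model DialogBERT --model_size tiny --dataset dailydial --data_path ./data/dailydial --per_gpu_train_batch_size 24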

@pablogranolabar
Author

Hi again. I've been training DialogBERT-Large for about 12 days and had two days left on the full training run when I accidentally shut down the machine. I backed up the entire directory so as not to overwrite any checkpoints, but when I resumed training it started over at epoch 0 of 169. Does the training code support resuming after an interruption, or do I have to start all over again?

Also, I noticed that main.py has a dataset option, and you had mentioned overfitting on the larger models. Is there an optimal set of execution parameters you would recommend for training DialogBERT-Large from scratch?

@guxd
Owner

guxd commented May 12, 2021

You have to start all over again in this situation.
If you want to speed up training, you can reduce the frequency of testing/validation, or start validation/testing only after a certain number of epochs.

There is no single optimal set of parameters. If you train on your own dataset, you need to fine-tune the hyperparameters yourself.
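Regarding reducing the validation frequency: as a concrete but hypothetical example, assuming the validating_steps and logging_steps fields that appear in the Namespace logged further down this thread are exposed as command-line flags, validation could be made less frequent with something like:

python main.py --model_size tiny --dataset dailydial --validating_steps 200 --logging_steps 1000

The defaults in that Namespace are validating_steps=20 and logging_steps=200, so larger values mean validation and logging run less often.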

@pablogranolabar
Author

So what happens after pretraining? What options would be used to load the saved model checkpoints with this configuration?

Thanks in advance :)

@guxd
Owner

guxd commented May 12, 2021

I updated the code to include the test script.
Run
python main.py --do_test --reload_from XXXX
where XXXX specifies the iteration number of your optimal checkpoint.
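As a usage example: if checkpoints are written every save_steps iterations (save_steps=5000 appears in the Namespace logged further down), then loading the checkpoint saved at, say, iteration 15000 would look like the following; the iteration number here is purely illustrative.

python main.py --do_test --reload_from 15000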

@pablogranolabar
Author

Thank you!

@pablogranolabar
Author

So the base-sized model is the medium one? Is there a difference between base and medium? Base took 3 days on a P40; do you have any thoughts on how much total training time the large model will require on a single V100? I've run the training under nohup on an AWS instance, and for some reason tqdm only shows the current epoch in nohup.out rather than the total progress, so I'm trying to estimate how long training the large model will take with the stock configuration in main.py.

Thanks in advance as always :)

@guxd
Owner

guxd commented May 19, 2021

The base-sized model is just the same size as bert-base.

We did not try to train a large-sized model, so we cannot give you advice; you will have to train the large model on your dataset yourself.
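For context, here is a minimal sketch of the standard BERT sizes referred to above, using Hugging Face transformers. DialogBERT's actual encoder configurations may differ, so treat this purely as a reference for the parameter scales involved.

# Standard BERT configurations for reference only; DialogBERT's own configs may differ.
from transformers import BertConfig

bert_base = BertConfig(hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)    # ~110M parameters
bert_large = BertConfig(hidden_size=1024, num_hidden_layers=24, num_attention_heads=16, intermediate_size=4096)  # ~340M parameters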

@pablogranolabar
Author

Ok, I stopped training Large and attempted to restart the training process, but I'm getting the following error:

$ python3 main.py --model_size=large --per_gpu_train_batch_size=24 --do_test --reload_from=640
number of gpus: 0
05/23/2021 00:19:52 - WARNING - __main__ -  Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
05/23/2021 00:20:05 - INFO - __main__ -  Training/evaluation parameters Namespace(adam_epsilon=1e-08, data_path='./data/dailydial', dataset='dailydial', device=device(type='cpu'), do_test=True, fp16=False, fp16_opt_level='O1', grad_accum_steps=2, language='english', learning_rate=5e-05, local_rank=-1, logging_steps=200, max_grad_norm=1.0, max_steps=200000, model='DialogBERT', model_size='large', n_epochs=1.0, n_gpu=0, per_gpu_eval_batch_size=1, per_gpu_train_batch_size=24, reload_from=640, save_steps=5000, save_total_limit=100, seed=42, server_ip='', server_port='', validating_steps=20, warmup_steps=5000, weight_decay=0.01)
Traceback (most recent call last):
  File "main.py", line 115, in <module>
    main()
  File "main.py", line 110, in main
    results = solver.evaluate(args)
  File "/home/ubuntu/DialogBERT/solvers.py", line 79, in evaluate
    self.load(args)
  File "/home/ubuntu/DialogBERT/solvers.py", line 53, in load
    assert args.reload_from<=0, "please specify the checkpoint iteration in args.reload_from" 
AssertionError: please specify the checkpoint iteration in args.reload_from

Where do I get the checkpoint iteration number from? I've tried the most recent batch numbers as well as validation numbers (640, from valid_results640.txt), but those don't seem to be the checkpoint iteration number.

Thanks in advance!

@guxd
Owner

guxd commented May 23, 2021

You should specify --reload_from 640 when you run python main.py.
