Checkpoints available? #4
We apologize that there is no checkpoint available; you have to train the model from scratch using main.py. The -Medium and -Large configurations were found to overfit on the DailyDialog and MultiWOZ datasets and produced suboptimal performance, so we only used the 'tiny' configuration for these two datasets.
Hi again. I've been training DialogBERT-Large for about 12 days now and had two days left on the full training run when I accidentally shut down the machine. I backed up the entire directory so as not to overwrite any checkpoints, but when I resumed training it started over at epoch 0 of 169. Does the training code support resuming after an interruption, or do I have to start all over again? Also, I noticed that main.py has a dataset option, and you mentioned overfitting on the larger models. Is there an optimal set of execution parameters that you would recommend for training DialogBERT-Large from scratch?
You have to start it all over again in this situation. There is no single recommended set of parameters: if you train on your own dataset, you need to tune the hyperparameters yourself.
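As an aside for anyone hitting the same interruption problem: resuming is only possible if both the model and optimizer state are written out periodically during training. Below is a minimal, generic PyTorch sketch of such checkpointing; the directory layout and function names are assumptions, not part of the DialogBERT code.

```python
import os
import torch

def save_checkpoint(model, optimizer, epoch, ckpt_dir):
    # Persist model and optimizer state so an interrupted run can be resumed.
    os.makedirs(ckpt_dir, exist_ok=True)
    torch.save(
        {"epoch": epoch,
         "model_state": model.state_dict(),
         "optimizer_state": optimizer.state_dict()},
        os.path.join(ckpt_dir, f"checkpoint_{epoch}.pt"),
    )

def load_checkpoint(model, optimizer, ckpt_path, device="cuda"):
    # Restore both states and return the epoch to resume from.
    state = torch.load(ckpt_path, map_location=device)
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"] + 1
```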
So what happens after pretraining? What options would be used to load the saved model checkpoints with this configuration? Thanks in advance :)
I updated the code to include the test script.
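For later readers, here is a rough sketch of what loading a saved checkpoint for evaluation can look like in plain PyTorch. The checkpoint layout, the pre-built `model`, and the `test_loader` are assumptions, not the repository's actual test script.

```python
import torch

def evaluate(model, ckpt_path, test_loader, device="cuda"):
    # Load saved weights into an already-constructed model; the checkpoint is
    # assumed to be either a bare state_dict or a dict wrapping one.
    state = torch.load(ckpt_path, map_location=device)
    if isinstance(state, dict) and "model_state" in state:
        state = state["model_state"]
    model.load_state_dict(state)
    model.to(device)
    model.eval()

    total_loss, n_batches = 0.0, 0
    with torch.no_grad():  # no gradients needed at test time
        for batch in test_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss  # assumes a HuggingFace-style output with .loss
            total_loss += loss.item()
            n_batches += 1
    return total_loss / max(n_batches, 1)
```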
Thank you!
So the base-sized model is the medium one? Is there a difference between base and medium? Base took about 3 days on a P40; do you have any thoughts on how long the large model will take to train on a single V100? I've nohup'ed the training process on an AWS instance, and for some reason tqdm only shows the current epoch in nohup.out rather than the total progress, so I'm trying to estimate how much time training the large model will take with the stock configuration in main.py. Thanks in advance as always :)
The base-sized model is just the same size as the -Medium one. We did not try to train a large-sized model, so we cannot give you advice.
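On the tqdm/nohup observation above: tqdm redraws its bar with carriage returns, which does not render well when output is captured in nohup.out. A generic workaround, not specific to this repository, is to throttle the redraw interval and force a plain ASCII bar:

```python
import sys
import time
from tqdm import tqdm

# Generic example: redraw at most once a minute so nohup.out stays readable,
# and write the bar to stdout using ASCII characters only.
for epoch in tqdm(range(169), desc="epochs", file=sys.stdout,
                  mininterval=60, ascii=True):
    time.sleep(0.01)  # placeholder for the real training epoch
```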
Ok, I stopped training Large and attempted to restart the training process, but I'm getting the following error:
Where do I get the checkpoint iteration number from? I've tried the most recent batch numbers as well as the validation-result numbers (e.g., 640 from valid_results640.txt), but those don't seem to be the checkpoint iteration number. Thanks in advance!
You should specify the iteration number of the checkpoint you want to reload.
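As a hedged aside: the iteration number is typically embedded in the checkpoint file names, so listing the output directory is usually enough to recover it. The directory path and file-name pattern below are assumptions; check where main.py actually writes its checkpoints.

```python
import glob
import os
import re

def latest_checkpoint_iter(ckpt_dir):
    # Pull the numeric suffix out of each checkpoint file name; the largest
    # number is the most recent saved iteration.
    iters = []
    for path in glob.glob(os.path.join(ckpt_dir, "*")):
        m = re.search(r"(\d+)", os.path.basename(path))
        if m:
            iters.append(int(m.group(1)))
    return max(iters) if iters else None

print(latest_checkpoint_iter("./output"))  # hypothetical output directory
```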
Hi, thanks for making your work available.
Are there any checkpoints available for DialogBERT to experiment with, or does it have to be trained from scratch using main.py?
Can you share the hardware configuration, including GPU/CPU memory, that you used to train -Medium and -Large? I am getting OOM errors on a K80 even when attempting to train -Medium.
And how long did it take to train -Medium and -Large?
Thanks in advance!
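On the K80 out-of-memory issue mentioned above: one generic mitigation is gradient accumulation, which splits each batch into smaller micro-batches while keeping the same effective batch size. Below is a rough, repository-agnostic sketch, assuming a HuggingFace-style model whose forward pass returns an object with a `.loss` attribute.

```python
def train_with_grad_accum(model, optimizer, train_loader, device="cuda", accum_steps=4):
    # Accumulate gradients over `accum_steps` micro-batches to cut peak GPU
    # memory while keeping the same effective batch size.
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(train_loader):
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss / accum_steps  # scale so gradients average correctly
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```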