Error loading the datasets #2

Open
Dimiftb opened this issue Jul 8, 2021 · 5 comments

@Dimiftb

Dimiftb commented Jul 8, 2021

Hi,

Thank you very much for your paper and your models. I'm attempting to replicate the experimental results from your paper on conll2003 and en-ontonotes. I'm currently hitting an error on both datasets that I'm not sure how to solve. The output of running python train.py is below:

2021-07-08 14:43:47.895031: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "BARTNER/train.py", line 131, in <module>
    data_bundle, tokenizer, mapping2id = get_data()
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/core/utils.py", line 357, in wrapper
    results = func(*args, **kwargs)
  File "BARTNER/train.py", line 123, in get_data
    data_bundle = pipe.process_from_file(paths, demo=demo)
  File "/content/BARTNER/data/pipe.py", line 206, in process_from_file
    data_bundle = Conll2003NERLoader(demo=demo).load(paths)
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/io/loader/loader.py", line 69, in load
    datasets = {name: self._load(path) for name, path in paths.items()}
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/io/loader/loader.py", line 69, in <dictcomp>
    datasets = {name: self._load(path) for name, path in paths.items()}
  File "/content/BARTNER/data/pipe.py", line 271, in _load
    target = iob2(ins['target'])
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/io/pipe/utils.py", line 30, in iob2
    raise TypeError("The encoding schema is not a valid IOB type.")
TypeError: The encoding schema is not a valid IOB type.

I'm running on colab.

As for conll2003, I've simply extracted the original files for English and have put them in a folder data/conll2003 as per your instructions.

As for ontonotes, to generate BIO tags I've followed this repo: https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO and put the files in data/en-ontonotes/english/ as per instructions.

The folder currently contains onto.development.ner, onto.train.ner, and onto.test.ner, as you can see in the image below:
image

Could you please advise what I am doing wrong? Thanks.

@yhcc
Owner

yhcc commented Jul 9, 2021

You should make sure the first and second columns of your data are tokens and labels, respectively. Based on the sample at https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO/blob/master/onto.test.ner.sample , that repo's output puts the label in the last column.
Alternatively, you can change the following code

super().__init__(headers=headers, indexes=[0, 1])

to super().__init__(headers=headers, indexes=[0, -1]) if you would rather not change your data file. This means the loader will treat the last column as the label column.
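
To illustrate why the column choice matters, here is a minimal, self-contained sketch. It does not use fastNLP; `read_columns`, `looks_like_iob`, and the three-column sample lines are hypothetical and written only for this example. It shows that picking column 1 on a file whose label sits in the last column yields tags that are not IOB-style, while `indexes=(0, -1)` picks up the labels.

```python
from typing import Iterable, List, Sequence, Tuple


def read_columns(lines: Iterable[str],
                 indexes: Sequence[int] = (0, -1)) -> List[List[Tuple[str, str]]]:
    """Group whitespace-separated lines into sentences of (token, label) pairs.

    indexes[0] selects the token column and indexes[1] the label column.
    Blank lines and -DOCSTART- lines are treated as sentence boundaries.
    """
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[indexes[0]], parts[indexes[1]]))
    if current:
        sentences.append(current)
    return sentences


def looks_like_iob(labels: Iterable[str]) -> bool:
    """Rough check (an assumption, not fastNLP's own logic): every label is O, B-x, or I-x."""
    return all(label == "O" or label[:2] in ("B-", "I-") for label in labels)


# Hypothetical sample: token, POS tag, NER label (label in the LAST column).
sample = [
    "EU NNP B-ORG",
    "rejects VBZ O",
    "German JJ B-MISC",
    "call NN O",
    "",
    "Peter NNP B-PER",
    "Blackburn NNP I-PER",
]

second_column = [label for sent in read_columns(sample, (0, 1)) for _, label in sent]
last_column = [label for sent in read_columns(sample, (0, -1)) for _, label in sent]

print(looks_like_iob(second_column))  # POS tags are not IOB labels
print(looks_like_iob(last_column))    # NER labels are
```

Running a check like this on a few lines of each data file is a quick way to see which column index the loader should be given.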

@Dimiftb
Author

Dimiftb commented Jul 9, 2021

Hi @yhcc,

Thank you very much for your reply. This easily fixed the issue. I managed to train the model; however, I was wondering how I can display metrics (F1, recall, precision) on the test set?

This is the current output that I have once execution has finished:
image

@yhcc
Owner

yhcc commented Jul 9, 2021

Following previous work, we merge the dev and train sets and use the result as the train set. Therefore, for the conll2003 dataset, the dev metric is the final test metric.
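
A minimal sketch of what such a merge could look like, using plain Python lists rather than the repository's data structures. The function name `merge_train_dev` and the toy splits are hypothetical, and the step that places the test split in the dev slot is an assumption inferred from the statement that the dev metric is the final test metric; it is not taken from the repository's code.

```python
from typing import Dict, List


def merge_train_dev(splits: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return new splits with dev folded into train.

    Assumption: the test split is then evaluated in the dev slot, so the
    reported "dev" metric is computed on test data.
    """
    merged = dict(splits)
    merged["train"] = list(splits["train"]) + list(splits["dev"])
    merged["dev"] = list(splits["test"])
    return merged


# Toy splits standing in for lists of sentences.
splits = {
    "train": ["train sentence 1", "train sentence 2"],
    "dev": ["dev sentence 1"],
    "test": ["test sentence 1"],
}

merged = merge_train_dev(splits)
print(len(merged["train"]))           # train now also contains the dev sentence
print(merged["dev"] == splits["test"])
```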

@Dimiftb
Author

Dimiftb commented Jul 9, 2021

Hi @yhcc,

Thanks for your reply. How can I go about merging the train and the dev sets? Is there functionality for it already? Also, how do I get the metric to display?

Thank you very much for helping me thus far.

@yhcc
Owner

yhcc commented Jul 10, 2021

The merging happens in

if dataset_name == 'conll2003':

The metric is displayed only after training for several epochs (15 epochs for conll2003). We set it this way because, in our experiments, the best performance only occurs after this epoch, so to save evaluation time the code only evaluates after this epoch. You can change this behavior by changing

eval_start_epoch = 15

to 1
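
The effect of this setting can be sketched as follows. This is an illustration only: `epochs_to_evaluate` is a hypothetical helper, and whether the repository compares with `>=` or `>` is an assumption here. It shows that with a start epoch of 15, a short run produces no evaluation output at all, while a start epoch of 1 evaluates after every epoch.

```python
from typing import List


def epochs_to_evaluate(n_epochs: int, eval_start_epoch: int) -> List[int]:
    """Return the 1-based epochs after which evaluation would run.

    Assumption: evaluation runs when epoch >= eval_start_epoch; the real
    code may use a strict comparison instead.
    """
    return [epoch for epoch in range(1, n_epochs + 1) if epoch >= eval_start_epoch]


print(epochs_to_evaluate(n_epochs=30, eval_start_epoch=15))  # epochs 15 through 30
print(epochs_to_evaluate(n_epochs=10, eval_start_epoch=15))  # empty: nothing is evaluated
print(epochs_to_evaluate(n_epochs=3, eval_start_epoch=1))    # every epoch
```

Evaluating from epoch 1 costs extra time per epoch but makes the metrics visible from the start of training.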
