Error loading the datasets #2

Dimiftb · 2021-07-08T14:51:34Z

Hi,

Thank you very much for your paper and your models. I'm attempting to replicate the experimental results in your paper on conll2003 and en-ontonotes. I'm currently facing an error with both datasets that I'm not sure how to solve. The output of running python train.py is below.

2021-07-08 14:43:47.895031: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "BARTNER/train.py", line 131, in <module>
    data_bundle, tokenizer, mapping2id = get_data()
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/core/utils.py", line 357, in wrapper
    results = func(*args, **kwargs)
  File "BARTNER/train.py", line 123, in get_data
    data_bundle = pipe.process_from_file(paths, demo=demo)
  File "/content/BARTNER/data/pipe.py", line 206, in process_from_file
    data_bundle = Conll2003NERLoader(demo=demo).load(paths)
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/io/loader/loader.py", line 69, in load
    datasets = {name: self._load(path) for name, path in paths.items()}
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/io/loader/loader.py", line 69, in <dictcomp>
    datasets = {name: self._load(path) for name, path in paths.items()}
  File "/content/BARTNER/data/pipe.py", line 271, in _load
    target = iob2(ins['target'])
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/io/pipe/utils.py", line 30, in iob2
    raise TypeError("The encoding schema is not a valid IOB type.")
TypeError: The encoding schema is not a valid IOB type.

I'm running on colab.

As for conll2003, I've simply extracted the original files for English and have put them in a folder data/conll2003 as per your instructions.

As for ontonotes, to generate BIO tags I've followed this repo: https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO and put the files in data/en-ontonotes/english/ as per instructions.

Currently the folder contains onto.development.ner, onto.train.ner, and onto.test.ner.

Could you please advise what I am doing wrong? Thanks.

yhcc · 2021-07-09T03:45:15Z

Make sure that the first and second columns of your data are the tokens and the labels, respectively. Judging by the sample at https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO/blob/master/onto.test.ner.sample , the output of that repo puts the label in the last column.
Alternatively, you can change the following code

BARTNER/data/pipe.py

Line 249 in 5d562fd

super().__init__(headers=headers, indexes=[0, 1])

to super().__init__(headers=headers, indexes=[0, -1]) if you would rather not change your data files. This means the loader will treat the last column as the label column.
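
To illustrate the effect of the `indexes` argument, here is a minimal sketch in plain Python. It does not call fastNLP: `pick_columns` and `looks_like_iob` are hypothetical helpers, the sample rows are made up to resemble a multi-column file whose label sits in the last column, and the tag check is only a simplified stand-in for whatever validation fastNLP's `iob2` performs.

```python
def pick_columns(line, indexes):
    """Split a whitespace-separated row and keep the fields at `indexes`."""
    fields = line.split()
    return [fields[i] for i in indexes]


def looks_like_iob(tags):
    """Simplified check: every tag is 'O' or starts with 'B-' or 'I-'."""
    return all(t == "O" or t[:2] in ("B-", "I-") for t in tags)


# Hypothetical rows: token, POS tag, parse bit, NER label.
rows = ["Paris NNP (NP*) B-GPE", "is VBZ (VP* O"]

# indexes=[0, 1] reads the second column as the label; [0, -1] reads the last one.
second_col = [pick_columns(r, [0, 1])[1] for r in rows]
last_col = [pick_columns(r, [0, -1])[1] for r in rows]

print(looks_like_iob(second_col))  # False: POS tags are not IOB labels
print(looks_like_iob(last_col))    # True
```

With a file laid out like this, reading the second column as the label would plausibly produce the "not a valid IOB type" error from the traceback, while reading the last column would not.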

Dimiftb · 2021-07-09T10:30:08Z

Hi @yhcc,

Thank you very much for your reply. This easily fixed the issue. I managed to train the model; however, I was wondering how I can display metrics (F1, recall, precision) on the test set.

This is the current output that I have once execution has finished:

yhcc · 2021-07-09T13:57:09Z

Following previous papers, we merge the dev and train sets and use the result as the training set. Therefore, for the conll2003 dataset, the reported dev metric is the final test metric.
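
As a rough illustration of this setup, the sketch below uses plain Python lists in place of the repository's datasets. `merge_dev_into_train` is a hypothetical helper, not a function from BARTNER or fastNLP, and it assumes that the held-out test split is then used as the evaluation data during training, which is what the reply above implies.

```python
def merge_dev_into_train(splits):
    """Fold the dev split into train and evaluate on the test split."""
    return {
        "train": splits["train"] + splits["dev"],
        # Assumption: the evaluation set shown as "dev" during training
        # is the original test split.
        "eval": splits["test"],
    }


# Toy splits; each string stands in for one annotated sentence.
splits = {"train": ["s1", "s2", "s3"], "dev": ["s4"], "test": ["s5", "s6"]}
merged = merge_dev_into_train(splits)

print(len(merged["train"]), len(merged["eval"]))  # 4 2
```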

Dimiftb · 2021-07-09T15:54:00Z

Hi @yhcc,

Thanks for your reply. How can I go about merging the train and the dev sets? Is there functionality for it already? Also how do I get the metric to display?

Thank you very much for helping me thus far

yhcc · 2021-07-10T00:54:43Z

The merging happens in

BARTNER/train.py

Line 220 in a42c3bb

if dataset_name == 'conll2003':

The metric is displayed only after you have trained for several epochs (15 epochs for conll2003). We set it this way because, in our experiments, the best performance only occurs after this epoch, so to save evaluation time the code only evaluates after this epoch. You can change this behavior by changing

BARTNER/train.py

Line 49 in a42c3bb

eval_start_epoch = 15

to 1
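
The effect of this setting can be sketched as a simple gate on the epoch number. This is an illustration only: `epochs_with_evaluation` is a hypothetical helper, and whether the real training loop starts evaluating exactly at `eval_start_epoch` or one epoch later is an assumption here.

```python
def epochs_with_evaluation(n_epochs, eval_start_epoch):
    """List the 1-based epochs at which evaluation would run."""
    return [epoch for epoch in range(1, n_epochs + 1) if epoch >= eval_start_epoch]


print(epochs_with_evaluation(17, 15))  # [15, 16, 17]
print(epochs_with_evaluation(3, 1))    # [1, 2, 3]
```

With the default value, no metric appears during the early epochs; lowering it to 1 makes every epoch report a metric, at the cost of extra evaluation time.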
